Etcetera! CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook


1 Etcetera! CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

2 Agenda Advanced HDFS Features Apache Kafka Apache Cassandra Redis (but more this time) Cluster Planning

3 ADVANCED HDFS FEATURES

4 Highly Available NameNode Highly Available NameNode feature eliminates SPOF Requires two NameNodes and some extra configuration Active/Passive or Active/Active Clients only contact the active NameNode DataNodes report in and heartbeat with both NameNodes Active NameNode writes metadata to a quorum of JournalNodes Standby NameNode reads the JournalNodes to stay in sync There is no CheckPointNode (SecondaryNameNode) The passive NameNode performs checkpoint operations

5 HA NameNode Failover There are two failover scenarios. Graceful: performed by an administrator for maintenance. Automated: the active NameNode fails. A failed NameNode must be fenced, which eliminates the 'split brain syndrome'. Two fencing methods are available. sshfence: connects over SSH and kills the NameNode daemon. shell: runs a script that disables access to the NameNode, shuts down the network switch port, or sends power off to the failed NameNode. There is no 'default' fencing method.

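6 [Diagram: automatic failover. Each NameNode (NN) is paired with a ZKFC, and both NameNodes share state through NFS or QJM. The ZKFC of the active NN releases its lock in ZooKeeper; the ZKFC of the standby NN creates the lock and its NN becomes active ("I'm the Boss"). The DataNodes sit below both NameNodes.]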
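
The lock hand-off in the failover diagram can be sketched in a few lines of Python. This is only a toy model: `LockService` and `FailoverController` are hypothetical stand-ins for ZooKeeper and the ZKFC process, and fencing is not modeled. In a real deployment the lock is an ephemeral ZooKeeper node that disappears when the holder's session ends.

```python
class LockService:
    """Hypothetical stand-in for ZooKeeper: at most one lock holder at a time."""

    def __init__(self):
        self.holder = None

    def try_acquire(self, node):
        if self.holder is None:
            self.holder = node
            return True
        return False

    def release(self, node):
        if self.holder == node:
            self.holder = None


class FailoverController:
    """Hypothetical stand-in for a ZKFC process paired with one NameNode."""

    def __init__(self, name, lock):
        self.name = name
        self.lock = lock
        self.state = "standby"

    def campaign(self):
        # Whoever holds the lock is active; everyone else stays standby.
        if self.lock.try_acquire(self.name):
            self.state = "active"
        return self.state

    def fail(self):
        # When the active node goes away, its lock is released.
        self.lock.release(self.name)
        self.state = "failed"


lock = LockService()
nn1 = FailoverController("nn1", lock)
nn2 = FailoverController("nn2", lock)

nn1.campaign()  # nn1 takes the lock and becomes active
nn2.campaign()  # the lock is held, so nn2 stays standby
nn1.fail()      # nn1 goes down and the lock is released
nn2.campaign()  # nn2 creates the lock and becomes active
```

Because only one controller can hold the lock, at most one NameNode believes it is active at any moment, which is the property fencing then enforces against a misbehaving former active node.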

7 HDFS Federation Useful for: Isolation/multi-tenancy Horizontal scalability of HDFS namespace Performance Allows for multiple independent NameNodes using the same collection of DataNodes DataNodes store blocks from all NameNode pools

8 Federated NameNodes File-system namespace scalable beyond heap size NameNode performance no longer a bottleneck NameNode failure/degradation is isolated Only data managed by the failed NameNode is unavailable Each NameNode can be made Highly Available

9 Hadoop Security Hadoop's original design target: web crawling and indexing Not designed for processing of confidential data Small number of trusted users Access to cluster controlled by providing user accounts Little / no control over what a user could do once logged in HDFS permissions were added in the Hadoop 0.16 release Similar to basic UNIX file permissions HDFS permissions can be disabled via dfs.permissions Basically for protection against user-induced accidents Did not protect from attacks Authentication is accomplished on the client side Easily subverted via a simple configuration parameter

10 Kerberos Kerberos support introduced in the Hadoop release Developed at MIT / freely available Not a Hadoop-specific feature Not included in Hadoop releases Works on the basis of 'tickets' Allow communicating nodes to securely identify each other across unsecure networks Primarily a client/server model implementing mutual authentication The user and the server verify each other's identity

11 How Kerberos Works Client forwards the username to the KDC
A. KDC sends Client/TGS Session Key, encrypted with user's password
B. KDC issues a TGT, encrypted with TGS's key
C. Client sends B and service ID to TGS
D. Authenticator encrypted with A
E. TGS issues CTS ticket, encrypted with SS key
F. TGS issues CSS, encrypted with A
G. New authenticator encrypted with F
H. Timestamp found in G, plus 1
KDC: Key Distribution Center; TGS: Ticket Granting Service; TGT: Ticket Granting Ticket; CTS: Client-to-Server Ticket; CSS: Client Server Session Key

12 Kerberos Services Authentication Server Authenticates client Gives client enough information to authenticate with Service Server Service Server Authenticates client Authenticates itself to client Provides services to client

13 Kerberos Limitations Single point of failure Must use multiple servers Implement fallback authentication mechanisms Strict time requirements 'Tickets' are time stamped Clocks on all hosts must be carefully synchronized All authentication is controlled by the KDC Compromise of this infrastructure will allow attackers to impersonate any user Each network service requiring a different host name must have its own set of Kerberos keys Complicates virtual hosting of clusters

14 APACHE KAFKA

15 Overview Kafka is publish-subscribe messaging rethought as a distributed commit log Fast Scalable Durable Distributed

16 Kafka adoption and use cases LinkedIn: activity streams, operational metrics, data bus; 400 nodes, 18k topics, 220B msg/day (peak 3.2M msg/s), May 2014 Netflix: real-time monitoring and event processing Twitter: as part of their Storm real-time data pipelines Spotify: log delivery (from 4h down to 10s), Hadoop Loggly: log collection and processing Mozilla: telemetry data Airbnb, Cisco, Gnip, InfoChimps, Ooyala, Square, Uber, and others

17 How fast is Kafka? Up to 2 million writes/sec on 3 cheap machines Using 3 producers on 3 different machines, 3x async replication Only 1 producer/machine because NIC already saturated Sustained throughput as stored data grows Slightly different test config than 2M writes/sec above.

18 Why is Kafka so fast? Fast writes: While Kafka persists all data to disk, essentially all writes go to the page cache of the OS, i.e. RAM. Fast reads: Very efficient to transfer data from page cache to a network socket. Linux: sendfile() system call. Combination of the two = fast Kafka! Example (Operations): On a Kafka cluster where the consumers are mostly caught up you will see no read activity on the disks, as they will be serving data entirely from cache.

19 A first look The who is who: Producers write data to brokers. Consumers read data from brokers. All this is distributed. The data: Data is stored in topics. Topics are split into partitions, which are replicated.

20 A first look (diagram)

21 Topics Topic: feed name to which messages are published. Example: zerg.hydra. Producers (A1, A2, ..., An) always append to the tail of the topic (think: append to a file). Kafka prunes the head based on age, max size, or key. [Diagram: a Kafka topic on the broker(s), ordered from older to newer messages.]

22 Topics Consumers (consumer groups C1 and C2) use an offset pointer to track/control their read progress (and decide the pace of consumption). Producers (A1, A2, ..., An) always append to the tail (think: append to a file). [Diagram: a topic on the broker(s), ordered from older to newer messages, with each consumer group reading at its own offset.]

23 Partitions A topic consists of partitions. Partition: ordered + immutable sequence of messages that is continually appended to

24 Partitions #partitions of a topic is configurable. #partitions determines max consumer (group) parallelism, cf. parallelism of Storm's KafkaSpout via builder.setSpout(,,N). Consumer group A, with 2 consumers, reads from a 4-partition topic. Consumer group B, with 4 consumers, reads from the same topic.
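
To see why the partition count caps consumer parallelism, here is a minimal Python sketch that spreads partitions over the consumers of one group. The round-robin policy and the consumer names are assumptions made for illustration; Kafka's real assignment strategies differ in detail, but the ceiling is the same: each partition goes to exactly one consumer in a group.

```python
def assign_partitions(num_partitions, consumers):
    """Round-robin the partitions of one topic over the consumers of one group."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Group A: 2 consumers on a 4-partition topic, 2 partitions each.
group_a = assign_partitions(4, ["a1", "a2"])

# Group B: 4 consumers on the same topic, 1 partition each.
group_b = assign_partitions(4, ["b1", "b2", "b3", "b4"])

# A hypothetical group with 6 consumers: two of them receive nothing.
group_c = assign_partitions(4, ["c1", "c2", "c3", "c4", "c5", "c6"])
idle = [consumer for consumer, parts in group_c.items() if not parts]
```

Adding consumers beyond the number of partitions only adds idle members, so a topic's partition count should be chosen with the desired parallelism in mind.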

25 Partition offsets Offset: messages in the partitions are each assigned a unique (per partition) and sequential id called the offset. Consumers track their pointers via (offset, partition, topic) tuples.
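
The offset idea can be illustrated with a small in-memory model in Python. `Partition` and `Consumer` are hypothetical classes written for this sketch, not Kafka client APIs. The point is that the partition only appends, while every consumer keeps and controls its own position.

```python
class Partition:
    """Append-only sequence of messages; the list index is the offset."""

    def __init__(self):
        self.messages = []

    def append(self, message):
        self.messages.append(message)
        return len(self.messages) - 1  # offset assigned to this message

    def read(self, offset, max_messages):
        return self.messages[offset:offset + max_messages]


class Consumer:
    """Tracks its own read position in one partition."""

    def __init__(self, partition):
        self.partition = partition
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self.partition.read(self.offset, max_messages)
        self.offset += len(batch)  # advance only past what was actually read
        return batch


log = Partition()
offsets = [log.append(m) for m in ("m0", "m1", "m2")]

fast = Consumer(log)
slow = Consumer(log)
fast_batch = fast.poll()                # reads everything available
slow_batch = slow.poll(max_messages=1)  # reads at its own pace

fast.offset = 1                         # rewinding is just moving the pointer
replayed = fast.poll()
```

Since reading never modifies the partition, a consumer can replay old messages simply by resetting its offset, as the last two lines show.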

26 Replicas of a partition Replicas: backups of a partition. They exist solely to prevent data loss. Replicas are never read from, never written to. They do NOT help to increase producer or consumer parallelism! Kafka tolerates (numreplicas - 1) dead brokers before losing data. LinkedIn: numreplicas == 2, so 1 broker can die.

27 APACHE CASSANDRA

28 In a couple dozen words... Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tunably consistent, column-oriented database with a lot of adjectives

29 Overview Originally created by Facebook and open sourced in 2008 Based on Google Bigtable & Amazon Dynamo Massively Scalable Easy to use No relation to Hadoop Specifically, data is not stored on HDFS

30 Distributed and Decentralized Distributed Can run on multiple machines Decentralized No single point of failure No master or slave issues by using a peer-to-peer architecture (gossip protocol, specifically) Can run across geographic datacenters

31 Elastic Scalability Scales horizontally Adding nodes linearly increases performance Decreasing and increasing node counts happens seamlessly

32 Highly Available and Fault Tolerant Multiple networked computers in a cluster Facility for recognizing node failures Failing requests are forwarded to another part of the system

33 Tunable Consistency Choice between strong and eventual consistency Adjustable for read and write operations separately Conflicts are resolved during reads
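
One common way to reason about the strong versus eventual choice in Dynamo-style stores such as Cassandra is the overlap rule: a read is guaranteed to see the latest acknowledged write when the number of replicas read plus the number of replicas written exceeds the replication factor. The Python helpers below are an illustrative sketch of that rule; the function names are made up for this example and are not Cassandra APIs.

```python
def quorum(replication_factor):
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def is_strongly_consistent(replication_factor, write_replicas, read_replicas):
    """True when every read set must overlap every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor


n = 3
q = quorum(n)                                       # majority of 3 replicas
quorum_both = is_strongly_consistent(n, q, q)       # QUORUM writes, QUORUM reads
one_both = is_strongly_consistent(n, 1, 1)          # ONE writes, ONE reads
write_all_read_one = is_strongly_consistent(n, n, 1)
```

Lower levels on both sides trade consistency for latency and availability, which is why reads and writes are tuned separately.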

34 Column-Oriented Stored in sparse multidimensional hash tables A row can have multiple columns, and not necessarily the same number of columns as other rows Each row has a unique key used for partitioning

35 Query with CQL Familiar SQL-like syntax that maps to Cassandra's storage engine and simplifies data modeling
    CREATE TABLE songs (
        id uuid PRIMARY KEY,
        title text,
        album text,
        artist text,
        data blob,
        tags set<text>
    );
    INSERT INTO songs (id, title, artist, album, tags)
        VALUES ('a3e648f...', 'La Grange', 'ZZ Top', 'Tres Hombres', {'cool', 'hot'});
    SELECT * FROM songs WHERE id = 'a3e648f...';

36 When should I use this? Key features to complement a Hadoop system: Geographical distribution Large deployments of structured data

37 REDIS

38 Introduction ANSI C open-source advanced key-value store Commonly referred to as a data structure server, since keys can contain strings, hashes, lists, sets, and sorted sets Operations are atomic and there are a bunch of them All data is stored in-memory, and can be persisted using snapshots or transaction logs Trivial master-slave replication

39 Clients Redis itself is ANSI C, but the protocol is open and developers have created support in many languages: C, C#, C++, Clojure, Common Lisp, D, Dart, Emacs Lisp, Erlang, Fancy, GNU Prolog, Go, Haskell, haXe, Java, Lua, Node.js, Objective-C, Perl, PHP, Pure Data, Python, Ruby, Rust, Scala, Scheme, Smalltalk, Tcl

40 Data Types Redis keys can be anything from a string to a byte array of a JPEG Keys have associated data types, and we should talk about them Strings Lists Hashes Sets Sorted Sets HyperLogLogs

41 Strings! The simplest type Supports a number of operations, including sets, gets, and incremental operations for values
    > SET mkey "my binary safe value"
    OK
    > GET mkey
    "my binary safe value"

42 Lists! Linked lists, actually, i.e. O(1) for inserts into the head or tail of the list Accessing an element by index... O(N)
    > RPUSH messages "Hello how are you?"
    (integer) 1
    > RPUSH messages "Fine thanks. I'm having fun with Redis"
    (integer) 2
    > RPUSH messages "I should look into this NOSQL thing ASAP"
    (integer) 3
    > LRANGE messages 0 2
    1) "Hello how are you?"
    2) "Fine thanks. I'm having fun with Redis"
    3) "I should look into this NOSQL thing ASAP"

43 Hashes! Maps between string fields and string values
    > HMSET user:1000 username antirez password P1pp0 age 34
    OK
    > HGETALL user:1000
    1) "username"
    2) "antirez"
    3) "password"
    4) "P1pp0"
    5) "age"
    6) "34"
    > HSET user:1000 password 12345
    (integer) 0
    > HGETALL user:1000
    1) "username"
    2) "antirez"
    3) "password"
    4) "12345"
    5) "age"
    6) "34"

44 Sets! Unordered collection of strings Supports adds, gets, is-member checks, intersections, unions, sorting...
    > SADD myset 1
    (integer) 1
    > SADD myset 2
    (integer) 1
    > SADD myset 3
    (integer) 1
    > SMEMBERS myset
    1) "1"
    2) "2"
    3) "3"
    > SADD myotherset 2
    (integer) 1
    > SINTER myset myotherset
    1) "2"
    > SUNION myset myotherset
    1) "1"
    2) "2"
    3) "3"

45 Sorted Sets! Similar to sets, but each element has an associated score and elements can be returned in order Elements are placed in sorted position on insert via an O(log(n)) operation, so returning them in order is easy
    > ZADD hackers 1940 "Alan Kay"
    (integer) 1
    > ZADD hackers 1953 "Richard Stallman"
    (integer) 1
    > ZADD hackers 1965 "Yukihiro Matsumoto"
    (integer) 1
    > ZADD hackers 1916 "Claude Shannon"
    (integer) 1
    > ZADD hackers 1969 "Linus Torvalds"
    (integer) 1
    > ZADD hackers 1912 "Alan Turing"
    (integer) 1
    > ZRANGE hackers 0 -1
    1) "Alan Turing"
    2) "Claude Shannon"
    3) "Alan Kay"
    4) "Richard Stallman"
    5) "Yukihiro Matsumoto"
    6) "Linus Torvalds"
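
The "sorted on insert" behavior can be mimicked in Python with the standard `bisect` module. This is a simplified sketch, not how Redis implements sorted sets: the `SortedSet` class is hypothetical, and although `bisect` finds the position in O(log n), inserting into a Python list is O(n).

```python
import bisect


class SortedSet:
    """Toy sorted set: members are unique and kept ordered by (score, member)."""

    def __init__(self):
        self._scores = {}    # member -> score
        self._entries = []   # sorted list of (score, member) tuples

    def zadd(self, score, member):
        is_new = member not in self._scores
        if not is_new:
            # Remove the old entry so the member appears only once.
            old = (self._scores[member], member)
            self._entries.pop(bisect.bisect_left(self._entries, old))
        self._scores[member] = score
        bisect.insort(self._entries, (score, member))
        return 1 if is_new else 0

    def zrange(self, start, stop):
        # Like ZRANGE, both ends are inclusive and negative indexes count from the end.
        size = len(self._entries)
        if start < 0:
            start = max(size + start, 0)
        if stop < 0:
            stop = size + stop
        return [member for _, member in self._entries[start:stop + 1]]


hackers = SortedSet()
for year, name in [(1940, "Alan Kay"), (1953, "Richard Stallman"),
                   (1965, "Yukihiro Matsumoto"), (1916, "Claude Shannon"),
                   (1969, "Linus Torvalds"), (1912, "Alan Turing")]:
    hackers.zadd(year, name)

everyone = hackers.zrange(0, -1)           # ordered by birth year
again = hackers.zadd(1912, "Alan Turing")  # already present, nothing is added
first_two = hackers.zrange(0, 1)
```

Reading a range is then a plain slice, which is the "returning them is easy" part of the slide.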

46 HyperLogLogs! Probabilistic data structure to estimate the cardinality of a set Very useful when you have a set with high cardinality (talking millions) PFADD returns 1 if the estimated cardinality changed, 0 otherwise
    > PFADD hll a b c d e f g
    (integer) 1
    > PFCOUNT hll
    (integer) 7
    > PFADD hll a
    (integer) 0
    > PFADD hll h
    (integer) 1
    > PFCOUNT hll
    (integer) 8
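
For intuition about how a cardinality estimate can come from a small, fixed amount of memory, here is a compact Python sketch of the textbook HyperLogLog estimator. It is an assumption-laden illustration (SHA-1 as the hash, 1024 registers, the standard small-range correction) and not Redis's implementation, which uses its own hash function and a more refined encoding.

```python
import hashlib
import math


class HyperLogLog:
    """Textbook HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash value
        index = h >> (64 - self.p)                   # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of the first 1 bit
        changed = rank > self.registers[index]
        if changed:
            self.registers[index] = rank
        return 1 if changed else 0

    def count(self):
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            raw = self.m * math.log(self.m / zeros)
        return round(raw)


hll = HyperLogLog()
for item in "abcdefg":
    hll.add(item)
small = hll.count()     # an estimate close to 7
repeat = hll.add("a")   # a repeated item never changes any register

big = HyperLogLog()
for i in range(10000):
    big.add(i)
large = big.count()     # an estimate close to 10000 (about 3% typical error)
```

The memory used is fixed by the number of registers, no matter how many items are added, which is what makes the structure attractive for sets with millions of members.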

47 Features Transactions Pub/Sub Lua Scripting Key Expiration Redis Clustering

48 Transactions Guarantees no client requests are served in the middle of a transaction Either all commands or none are processed, so they are atomic MULTI begins a transaction, and EXEC commits it Redis will queue commands and process them upon EXEC All commands in the queue are processed, even if one fails
    > MULTI
    OK
    > INCR foo
    QUEUED
    > INCR bar
    QUEUED
    > EXEC
    1) (integer) 1
    2) (integer) 1
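
The queue-then-execute behavior of MULTI/EXEC can be modeled with a few lines of Python. `MiniRedis` is a hypothetical in-memory toy that supports only INCR. It is meant to show two points from the slide: commands are queued rather than run, and a failing command does not stop the rest of the queue.

```python
class MiniRedis:
    """Toy model of MULTI/EXEC command queuing (INCR only)."""

    def __init__(self):
        self.data = {}
        self.queue = None  # None means no transaction is open

    def multi(self):
        self.queue = []
        return "OK"

    def incr(self, key):
        if self.queue is not None:
            self.queue.append(key)  # inside MULTI: queue instead of running
            return "QUEUED"
        return self._incr(key)

    def _incr(self, key):
        value = int(self.data.get(key, 0)) + 1  # ValueError if not an integer
        self.data[key] = value
        return value

    def execute(self):
        queued, self.queue = self.queue, None
        results = []
        for key in queued:
            try:
                results.append(self._incr(key))
            except ValueError as err:
                results.append(err)  # record the error and keep going
        return results


r = MiniRedis()
r.multi()
first = r.incr("foo")        # queued, not executed yet
second = r.incr("bar")
committed = r.execute()      # both counters go from 0 to 1

r.data["name"] = "bob"       # a value that cannot be incremented
r.multi()
r.incr("name")
r.incr("foo")
mixed = r.execute()          # the first command fails, the second still runs
```

Note that there is no rollback in this model: the successful increment of `foo` stays applied even though the command before it failed.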

49 Pub/Sub Messaging paradigm where publishers send messages to subscribers (if any) via channels Subscribers express interest in channels, and receive messages from publishers (if any) Clients can subscribe to channels, and messages from publishers will be pushed to them by Redis Can do pattern-based subscriptions to channels
    SUBSCRIBE test
    PUBLISH test Hello
    PSUBSCRIBE news.*

50 Lua Scripting You can run Lua scripts to manipulate Redis
    > eval "return redis.call('set','foo','bar')" 0
    OK

51 Expire Keys after time Set a timeout on a key, having Redis automatically delete it after the set time Use case: Maintain session information for a user for the last 60 seconds to recommend related products MULTI RPUSH pagewviews.user:<userid> EXPIRE pagewviews.user:<userid> 60 EXEC
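
The session use case relies on the timeout being reset on every page view, so the key disappears only after 60 seconds of inactivity. The Python sketch below models that with an injectable clock so the behavior can be followed without waiting. `ExpiringStore`, the key name, and the page paths are hypothetical and chosen for illustration only.

```python
class ExpiringStore:
    """Toy list store with per-key timeouts, checked lazily on access."""

    def __init__(self, clock):
        self.clock = clock     # callable returning the current time in seconds
        self.lists = {}
        self.deadlines = {}

    def _purge(self, key):
        deadline = self.deadlines.get(key)
        if deadline is not None and self.clock() >= deadline:
            self.lists.pop(key, None)
            del self.deadlines[key]

    def rpush(self, key, value):
        self._purge(key)
        self.lists.setdefault(key, []).append(value)
        return len(self.lists[key])

    def expire(self, key, seconds):
        self._purge(key)
        if key not in self.lists:
            return 0
        self.deadlines[key] = self.clock() + seconds  # replaces any earlier timeout
        return 1

    def items(self, key):
        self._purge(key)
        return list(self.lists.get(key, []))


now = [0]                                # fake clock, advanced by hand
store = ExpiringStore(lambda: now[0])
key = "pageviews.user:42"                # hypothetical user id

store.rpush(key, "/home")
store.expire(key, 60)                    # expires at t=60
now[0] = 30
store.rpush(key, "/cart")
store.expire(key, 60)                    # timeout pushed out to t=90
now[0] = 80
still_there = store.items(key)           # both pages are still remembered
now[0] = 95
gone = store.items(key)                  # the key has expired
```

Wrapping the push and the expire in MULTI/EXEC, as on the slide, ensures a key is never left behind without a timeout.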

52 Redis Cluster Redis Cluster is not production ready, but can be used to partition your data across multiple Redis instances A few abstractions exist today to partition among multiple instances, but they are not out-of-the-box with a Redis download

53 Use Cases Session Cache Ranking lists Auto Complete Used by Twitter / GitHub / Pinterest / Snapchat / Craigslist / StackOverflow / Flickr

54 CLUSTER PLANNING

55 Workload Considerations Balanced workloads Jobs are distributed across various job types CPU bound Disk I/O bound Network I/O bound Compute intensive workloads - Data Analytics CPU bound workloads require: Large numbers of CPUs Large amounts of memory to store in-process data I/O intensive workloads - Sorting I/O bound workloads require: Larger number of spindles (disks) per node Not sure? Go with the balanced workload configuration

56 Hardware Topology Hadoop uses a master / slave topology Master Nodes include: NameNode - maintains system metadata Backup NN - performs checkpoint operations and hosts the standby ResourceManager - manages task assignment Slave Nodes include: DataNode - stores HDFS files / manages read and write requests Preferably co-located with a NodeManager (TaskTracker in Hadoop 1) NodeManager - performs map / reduce tasks

57 Sizing The Cluster Remember... Scaling is a relatively simple task Start with a moderate sized cluster Grow the cluster as requirements dictate Develop a scaling strategy As simple as scaling is, adding new nodes takes time and resources Don't want to be adding new nodes each week Amount of data typically defines initial cluster size, together with the rate at which the volume of data increases Drivers for determining when to grow your cluster: Storage requirements Processing requirements Memory requirements

58 Storage Reqs Drive Cluster Growth Data volume increases at a rate of 1TB / week 3TB of storage are required to store the data alone (remember block replication) Consider additional overhead - typically 30% Remember files that are stored on a node's local disk If DataNodes incorporate four 1TB drives, 1 new node per week is required 2 years of data - roughly 100TB - will require 100 new nodes
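
The arithmetic behind this slide can be written out as a short Python calculation. The function names are made up for this sketch, and applying the 30% overhead on top of the replicated volume is an assumption about how the slide's numbers combine: 1TB x 3 replicas x 1.3 is about 3.9TB per week, which fits on one node with four 1TB drives.

```python
import math


def raw_storage_tb(ingest_tb, replication=3, overhead=0.30):
    """Disk needed for new data once replication and overhead are included."""
    return ingest_tb * replication * (1 + overhead)


def nodes_needed(raw_tb, drives_per_node=4, drive_tb=1):
    """Whole DataNodes required to hold the given raw volume."""
    return math.ceil(raw_tb / (drives_per_node * drive_tb))


weekly_raw = raw_storage_tb(1)                  # about 3.9 TB of disk per week
nodes_per_week = nodes_needed(weekly_raw)       # one 4 x 1TB node per week
two_year_nodes = nodes_needed(weekly_raw * 104)  # 104 weeks of growth
```

Two years is 104 weeks, so the result lands slightly above the slide's rounded figure of 100 nodes. Larger drives or more drives per node change the answer directly, which is why the drive configuration is part of the scaling strategy.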

59 Things Break Things are going to break This assumption is a core premise of Hadoop If a disk fails, the infrastructure must accommodate If a DataNode fails, the NameNode must manage this If a task fails, the ApplicationMaster must manage this failure Master nodes are typically a SPOF unless using a Highly Available configuration NameNode goes down, HDFS is inaccessible Use NameNode HA ResourceManager goes down, can't run any jobs Use RM HA (in development)

60 Cluster Nodes Cluster nodes should be commodity hardware Buy more nodes... Not more expensive nodes Workload patterns and cluster size drive CPU choice Small cluster - 50 nodes or less Quad core / medium clock speed is usually sufficient Large cluster Dual 8-core CPUs with a medium clock speed is sufficient Compute intensive workloads might require higher clock speeds General guideline is to buy more hardware instead of faster hardware Lots of memory - 48GB / 64GB / 128GB / 256GB Each map / reduce task consumes 1GB to 3GB of memory OS / Daemons consume memory as well

61 Cluster Storage 4 to 12 drives of 1TB / 2TB capacity - up to 24TB / node 3TB drives work Network performance penalty if a node fails 7200 rpm SATA drives are sufficient Slightly above average MTBF is advantageous JBOD configuration RAID is slow RAID is not required due to block replication More, smaller disks are preferred over fewer, larger disks Increased parallelism for DataNodes Slaves should never use virtual memory

62 Master Nodes Still commodity hardware, but... better Redundant everything Power supplies Dual Ethernet cards 16 to 24 CPU cores on NameNodes NameNodes and their clients are very chatty and need more cores to handle messaging traffic Medium clock speeds should be sufficient

63 Master Nodes HDFS namespace is limited to the amount of memory on the NameNode RAID and NFS storage on NameNode Typically RAID5 with hot spare Second remote directory such as NFS Quorum Journal Manager for HA

64 Network Considerations Hadoop is bandwidth intensive This can be a significant bottleneck Use dedicated switches 10Gb Ethernet is pretty good for large clusters

65 Which Operating System? Choose an OS that you are comfortable and familiar with Consider your admin resources / experience RedHat Enterprise Linux Includes support contract CentOS No support but the price is right Many other possibilities SuSE Enterprise Linux Ubuntu Fedora

66 Which Java Virtual Machine? Oracle Java is the only supported JVM Runs on OpenJDK, but use at your own risk Hadoop 1.0 requires Java JDK 1.6 or higher Hadoop 2.x requires Java JDK 1.7



More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Non-Stop Hadoop Paul Scott-Murphy VP Field Techincal Service, APJ. Cloudera World Japan November 2014

Non-Stop Hadoop Paul Scott-Murphy VP Field Techincal Service, APJ. Cloudera World Japan November 2014 Non-Stop Hadoop Paul Scott-Murphy VP Field Techincal Service, APJ Cloudera World Japan November 2014 WANdisco Background WANdisco: Wide Area Network Distributed Computing Enterprise ready, high availability

More information

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance

More information

Hadoop Technology HADOOP CLUSTER

Hadoop Technology HADOOP CLUSTER RESEARCH ARTICLE OPEN ACCESS Hadoop Technology Ankita M.Lahariya 4 th year, Department of Computer Science and Engineering, College of Engineering and Technology,Akola. [email protected] ABSTRACT

More information

HADOOP MOCK TEST HADOOP MOCK TEST II

HADOOP MOCK TEST HADOOP MOCK TEST II http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at

More information

Highly available, scalable and secure data with Cassandra and DataStax Enterprise. GOTO Berlin 27 th February 2014

Highly available, scalable and secure data with Cassandra and DataStax Enterprise. GOTO Berlin 27 th February 2014 Highly available, scalable and secure data with Cassandra and DataStax Enterprise GOTO Berlin 27 th February 2014 About Us Steve van den Berg Johnny Miller Solutions Architect Regional Director Western

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

Comparing SQL and NOSQL databases

Comparing SQL and NOSQL databases COSC 6397 Big Data Analytics Data Formats (II) HBase Edgar Gabriel Spring 2015 Comparing SQL and NOSQL databases Types Development History Data Storage Model SQL One type (SQL database) with minor variations

More information

Apache Hadoop FileSystem and its Usage in Facebook

Apache Hadoop FileSystem and its Usage in Facebook Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System [email protected] Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information

A Brief Outline on Bigdata Hadoop

A Brief Outline on Bigdata Hadoop A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray VMware

Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray VMware Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray ware 2 Agenda The Hadoop Journey Why Virtualize Hadoop? Elasticity and Scalability Performance Tests Storage Reference

More information

MapReduce with Apache Hadoop Analysing Big Data

MapReduce with Apache Hadoop Analysing Big Data MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside [email protected] About Journey Dynamics Founded in 2006 to develop software technology to address the issues

More information

Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation Facebook: Cassandra Smruti R. Sarangi Department of Computer Science Indian Institute of Technology New Delhi, India Smruti R. Sarangi Leader Election 1/24 Outline 1 2 3 Smruti R. Sarangi Leader Election

More information

Cloud Based Application Architectures using Smart Computing

Cloud Based Application Architectures using Smart Computing Cloud Based Application Architectures using Smart Computing How to Use this Guide Joyent Smart Technology represents a sophisticated evolution in cloud computing infrastructure. Most cloud computing products

More information

Mark Bennett. Search and the Virtual Machine

Mark Bennett. Search and the Virtual Machine Mark Bennett Search and the Virtual Machine Agenda Intro / Business Drivers What to do with Search + Virtual What Makes Search Fast (or Slow!) Virtual Platforms Test Results Trends / Wrap Up / Q & A Business

More information

Scalable Architecture on Amazon AWS Cloud

Scalable Architecture on Amazon AWS Cloud Scalable Architecture on Amazon AWS Cloud Kalpak Shah Founder & CEO, Clogeny Technologies [email protected] 1 * http://www.rightscale.com/products/cloud-computing-uses/scalable-website.php 2 Architect

More information

!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets

!#$%&' ( )%#*'+,'-#.//0( !#$%&'()*$+()',!-+.'/', 4(5,67,!-+!89,:*$;'0+$.<.,&0$'09,&)/=+,!()<>'0, 3, Processing LARGE data sets !"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.

More information

On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform

On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform Page 1 of 16 Table of Contents Table of Contents... 2 Introduction... 3 NoSQL Databases... 3 CumuLogic NoSQL Database Service...

More information

Distributed Filesystems

Distributed Filesystems Distributed Filesystems Amir H. Payberah Swedish Institute of Computer Science [email protected] April 8, 2014 Amir H. Payberah (SICS) Distributed Filesystems April 8, 2014 1 / 32 What is Filesystem? Controls

More information

CS 145: NoSQL Activity Stanford University, Fall 2015 A Quick Introdution to Redis

CS 145: NoSQL Activity Stanford University, Fall 2015 A Quick Introdution to Redis CS 145: NoSQL Activity Stanford University, Fall 2015 A Quick Introdution to Redis For this assignment, compile your answers on a separate pdf to submit and verify that they work using Redis. Installing

More information

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Important Notice 2010-2015 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, Impala, and

More information

Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems

Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems Processing of massive data: MapReduce 2. Hadoop 1 MapReduce Implementations Google were the first that applied MapReduce for big data analysis Their idea was introduced in their seminal paper MapReduce:

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

Google File System. Web and scalability

Google File System. Web and scalability Google File System Web and scalability The web: - How big is the Web right now? No one knows. - Number of pages that are crawled: o 100,000 pages in 1994 o 8 million pages in 2005 - Crawlable pages might

More information

Using Kafka to Optimize Data Movement and System Integration. Alex Holmes @

Using Kafka to Optimize Data Movement and System Integration. Alex Holmes @ Using Kafka to Optimize Data Movement and System Integration Alex Holmes @ https://www.flickr.com/photos/tom_bennett/7095600611 THIS SUCKS E T (circa 2560 B.C.E.) L a few years later... 2,014 C.E. i need

More information

Evaluation of NoSQL databases for large-scale decentralized microblogging

Evaluation of NoSQL databases for large-scale decentralized microblogging Evaluation of NoSQL databases for large-scale decentralized microblogging Cassandra & Couchbase Alexandre Fonseca, Anh Thu Vu, Peter Grman Decentralized Systems - 2nd semester 2012/2013 Universitat Politècnica

More information

Realtime Apache Hadoop at Facebook. Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens

Realtime Apache Hadoop at Facebook. Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens Realtime Apache Hadoop at Facebook Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens Agenda 1 Why Apache Hadoop and HBase? 2 Quick Introduction to Apache HBase 3 Applications of HBase at

More information

Big Data with Component Based Software

Big Data with Component Based Software Big Data with Component Based Software Who am I Erik who? Erik Forsberg Linköping University, 1998-2003. Computer Science programme + lot's of time at Lysator ACS At Opera Software

More information

High Availability Solutions for the MariaDB and MySQL Database

High Availability Solutions for the MariaDB and MySQL Database High Availability Solutions for the MariaDB and MySQL Database 1 Introduction This paper introduces recommendations and some of the solutions used to create an availability or high availability environment

More information

<Insert Picture Here> Oracle NoSQL Database A Distributed Key-Value Store

<Insert Picture Here> Oracle NoSQL Database A Distributed Key-Value Store Oracle NoSQL Database A Distributed Key-Value Store Charles Lamb, Consulting MTS The following is intended to outline our general product direction. It is intended for information

More information

Apache HBase. Crazy dances on the elephant back

Apache HBase. Crazy dances on the elephant back Apache HBase Crazy dances on the elephant back Roman Nikitchenko, 16.10.2014 YARN 2 FIRST EVER DATA OS 10.000 nodes computer Recent technology changes are focused on higher scale. Better resource usage

More information

Big Data Analytics(Hadoop) Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Big Data Analytics(Hadoop) Prepared By : Manoj Kumar Joshi & Vikas Sawhney Big Data Analytics(Hadoop) Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Understanding Big Data and Big Data Analytics Getting familiar with Hadoop Technology Hadoop release and upgrades

More information

ZooKeeper. Table of contents

ZooKeeper. Table of contents by Table of contents 1 ZooKeeper: A Distributed Coordination Service for Distributed Applications... 2 1.1 Design Goals...2 1.2 Data model and the hierarchical namespace...3 1.3 Nodes and ephemeral nodes...

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

Real-time Analytics at Facebook: Data Freeway and Puma. Zheng Shao 12/2/2011

Real-time Analytics at Facebook: Data Freeway and Puma. Zheng Shao 12/2/2011 Real-time Analytics at Facebook: Data Freeway and Puma Zheng Shao 12/2/2011 Agenda 1 Analytics and Real-time 2 Data Freeway 3 Puma 4 Future Works Analytics and Real-time what and why Facebook Insights

More information

Snapshots in Hadoop Distributed File System

Snapshots in Hadoop Distributed File System Snapshots in Hadoop Distributed File System Sameer Agarwal UC Berkeley Dhruba Borthakur Facebook Inc. Ion Stoica UC Berkeley Abstract The ability to take snapshots is an essential functionality of any

More information

Understanding Neo4j Scalability

Understanding Neo4j Scalability Understanding Neo4j Scalability David Montag January 2013 Understanding Neo4j Scalability Scalability means different things to different people. Common traits associated include: 1. Redundancy in the

More information

SAN Conceptual and Design Basics

SAN Conceptual and Design Basics TECHNICAL NOTE VMware Infrastructure 3 SAN Conceptual and Design Basics VMware ESX Server can be used in conjunction with a SAN (storage area network), a specialized high speed network that connects computer

More information

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7 Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7 Yan Fisher Senior Principal Product Marketing Manager, Red Hat Rohit Bakhshi Product Manager,

More information

Parallels Cloud Storage

Parallels Cloud Storage Parallels Cloud Storage White Paper Best Practices for Configuring a Parallels Cloud Storage Cluster www.parallels.com Table of Contents Introduction... 3 How Parallels Cloud Storage Works... 3 Deploying

More information

HBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367

HBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367 HBase A Comprehensive Introduction James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367 Overview Overview: History Began as project by Powerset to process massive

More information

Accelerating and Simplifying Apache

Accelerating and Simplifying Apache Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly

More information

Big Fast Data Hadoop acceleration with Flash. June 2013

Big Fast Data Hadoop acceleration with Flash. June 2013 Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional

More information

Bigdata High Availability (HA) Architecture

Bigdata High Availability (HA) Architecture Bigdata High Availability (HA) Architecture Introduction This whitepaper describes an HA architecture based on a shared nothing design. Each node uses commodity hardware and has its own local resources

More information

Data Pipeline with Kafka

Data Pipeline with Kafka Data Pipeline with Kafka Peerapat Asoktummarungsri AGODA Senior Software Engineer Agoda.com Contributor Thai Java User Group (THJUG.com) Contributor Agile66 AGENDA Big Data & Data Pipeline Kafka Introduction

More information

Deploying and Optimizing SQL Server for Virtual Machines

Deploying and Optimizing SQL Server for Virtual Machines Deploying and Optimizing SQL Server for Virtual Machines Deploying and Optimizing SQL Server for Virtual Machines Much has been written over the years regarding best practices for deploying Microsoft SQL

More information

High Throughput Computing on P2P Networks. Carlos Pérez Miguel [email protected]

High Throughput Computing on P2P Networks. Carlos Pérez Miguel carlos.perezm@ehu.es High Throughput Computing on P2P Networks Carlos Pérez Miguel [email protected] Overview High Throughput Computing Motivation All things distributed: Peer-to-peer Non structured overlays Structured

More information

HDFS Architecture Guide

HDFS Architecture Guide by Dhruba Borthakur Table of contents 1 Introduction... 3 2 Assumptions and Goals... 3 2.1 Hardware Failure... 3 2.2 Streaming Data Access...3 2.3 Large Data Sets... 3 2.4 Simple Coherency Model...3 2.5

More information

Benchmarking Couchbase Server for Interactive Applications. By Alexey Diomin and Kirill Grigorchuk

Benchmarking Couchbase Server for Interactive Applications. By Alexey Diomin and Kirill Grigorchuk Benchmarking Couchbase Server for Interactive Applications By Alexey Diomin and Kirill Grigorchuk Contents 1. Introduction... 3 2. A brief overview of Cassandra, MongoDB, and Couchbase... 3 3. Key criteria

More information