Etcetera! CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook
2 Agenda Advanced HDFS Features Apache Kafka Apache Cassandra Redis (but more this time) Cluster Planning
3 ADVANCED HDFS FEATURES
4 Highly Available NameNode Highly Available NameNode feature eliminates SPOF Requires two NameNodes and some extra configuration Active/Passive or Active/Active Clients only contact the active NameNode DataNodes report in and heartbeat with both NameNodes Active NameNode writes metadata to a quorum of JournalNodes Standby NameNode reads the JournalNodes to stay in sync There is no CheckPointNode (SecondaryNameNode) The passive NameNode performs checkpoint operations
5 HA NameNode Failover There are two failover scenarios. Graceful: performed by an administrator for maintenance. Automated: the active NameNode fails. A failed NameNode must be fenced, which eliminates the 'split brain syndrome'. Two fencing methods are available: sshfence, which kills the NameNode's daemon over SSH, and shell, which runs a script that can, for example, disable access to the NameNode, shut down the network switch port, or power off the failed NameNode. There is no 'default' fencing method.
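6 [Figure: automatic NameNode failover. Each NameNode (NN) runs a ZKFC. The active NN's ZKFC holds a lock in ZooKeeper; when that lock is released, the standby's ZKFC creates the lock and its NN becomes active ("I'm the Boss"). NameNode state is shared through NFS or QJM, and the DataNodes report to both NameNodes.]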
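The lock hand-off in the failover diagram can be sketched as a toy simulation. This is only an illustration under simplifying assumptions: `LockService` is a hypothetical stand-in for ZooKeeper, `tick` stands in for one pass of a failover controller, and fencing is not modeled at all.

```python
class LockService:
    """Toy stand-in for ZooKeeper: at most one holder of the lock at a time."""

    def __init__(self):
        self.holder = None

    def try_acquire(self, name):
        # Take the lock if it is free; report whether `name` holds it now.
        if self.holder is None:
            self.holder = name
        return self.holder == name

    def release(self, name):
        if self.holder == name:
            self.holder = None


class NameNode:
    """Hypothetical NameNode plus its failover controller, merged for brevity."""

    def __init__(self, name, lock):
        self.name = name
        self.lock = lock
        self.state = "standby"

    def tick(self, healthy=True):
        # One controller pass: hold the lock only while the node is healthy.
        if not healthy:
            self.lock.release(self.name)
            self.state = "failed"
        elif self.lock.try_acquire(self.name):
            self.state = "active"
        else:
            self.state = "standby"
        return self.state


lock = LockService()
nn1, nn2 = NameNode("nn1", lock), NameNode("nn2", lock)
states = [nn1.tick(), nn2.tick()]                      # nn1 wins the lock first
after_failure = [nn1.tick(healthy=False), nn2.tick()]  # nn2 takes over
```

A real deployment also has to fence the failed node before the standby takes over, which this sketch leaves out.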
7 HDFS Federation Useful for: Isolation/multi-tenancy Horizontal scalability of HDFS namespace Performance Allows for multiple independent NameNodes using the same collection of DataNodes DataNodes store blocks from all NameNode pools
8 Federated NameNodes File-system namespace scalable beyond heap size NameNode performance no longer a bottleneck NameNode failure/degradation is isolated Only data managed by the failed NameNode is unavailable Each NameNode can be made Highly Available
9 Hadoop Security Hadoop's original design targeted web crawling and indexing. Not designed for processing of confidential data. Small number of trusted users. Access to the cluster was controlled by providing user accounts, with little or no control over what a user could do once logged in. HDFS permissions were added in the Hadoop 0.16 release. Similar to basic UNIX file permissions. HDFS permissions can be disabled via dfs.permissions. Basically for protection against user-induced accidents; did not protect from attacks. Authentication is accomplished on the client side and is easily subverted via a simple configuration parameter.
10 Kerberos Kerberos support introduced in the Hadoop release Developed at MIT / freely available Not a Hadoop-specific feature Not included in Hadoop releases Works on the basis of 'tickets' Allow communicating nodes to securely identify each other across unsecure networks Primarily a client/server model implementing mutual authentication The user and the server verify each other's identity
11 How Kerberos Works
Client forwards the username to KDC
A. KDC sends Client/TGS Session Key, encrypted with user's password
B. KDC issues a TGT, encrypted with TGS's key
C. Sends B and service ID to TGS
D. Authenticator encrypted with A
E. TGS issues CTS ticket, encrypted with SS key
F. TGS issues CSS, encrypted with A
G. New authenticator encrypted with F
H. Timestamp found in G, plus 1
KDC: Key Distribution Center. TGS: Ticket Granting Service. TGT: Ticket Granting Ticket. CTS: Client-to-Server Ticket. CSS: Client Server Session Key.
12 Kerberos Services Authentication Server Authenticates client Gives client enough information to authenticate with Service Server Service Server Authenticates client Authenticates itself to client Provides services to client
13 Kerberos Limitations Single point of failure: must use multiple servers and implement fallback authentication mechanisms. Strict time requirements: 'tickets' are time stamped, so clocks on all hosts must be carefully synchronized. All authentication is controlled by the KDC; compromise of this infrastructure will allow attackers to impersonate any user. Each network service requiring a different host name must have its own set of Kerberos keys, which complicates virtual hosting of clusters.
14 APACHE KAFKA
15 Overview Kafka is publish-subscribe messaging rethought as a distributed commit log: fast, scalable, durable, distributed.
16 Kafka adoption and use cases LinkedIn: activity streams, operational metrics, data bus; 400 nodes, 18k topics, 220B msg/day (peak 3.2M msg/s), May 2014. Netflix: real-time monitoring and event processing. Twitter: as part of their Storm real-time data pipelines. Spotify: log delivery (from 4h down to 10s), Hadoop. Loggly: log collection and processing. Mozilla: telemetry data. Airbnb, Cisco, Gnip, InfoChimps, Ooyala, Square, Uber, and others.
17 How fast is Kafka? Up to 2 million writes/sec on 3 cheap machines, using 3 producers on 3 different machines and 3x async replication. Only 1 producer/machine because the NIC was already saturated. Sustained throughput as stored data grows (slightly different test config than the 2M writes/sec above).
18 Why is Kafka so fast? Fast writes: while Kafka persists all data to disk, essentially all writes go to the page cache of the OS, i.e. RAM. Fast reads: very efficient to transfer data from page cache to a network socket (Linux: sendfile() system call). Combination of the two = fast Kafka! Example (operations): on a Kafka cluster where the consumers are mostly caught up, you will see no read activity on the disks, as they will be serving data entirely from cache.
19 A first look The who is who: producers write data to brokers; consumers read data from brokers. All this is distributed. The data: data is stored in topics. Topics are split into partitions, which are replicated.
20 A first look [Figure: architecture diagram; no text recoverable]
21 Topics Topic: feed name to which messages are published. Example: zerg.hydra. Producers (A1, A2, ... An) always append to the tail of the topic on the broker(s) (think: append to a file). Kafka prunes the head (older messages) based on age, max size, or key.
22 Topics Consumers (consumer groups C1, C2) use an offset pointer to track/control their read progress (and decide the pace of consumption). Producers always append to the tail (think: append to a file).
23 Partitions A topic consists of partitions. Partition: ordered + immutable sequence of messages that is continually appended to.
24 Partitions The number of partitions of a topic is configurable. The number of partitions determines max consumer (group) parallelism; cf. parallelism of Storm's KafkaSpout via builder.setSpout(,,N). Consumer group A, with 2 consumers, reads from a 4-partition topic. Consumer group B, with 4 consumers, reads from the same topic.
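The link between partition count and consumer parallelism can be sketched with a small helper. The round-robin `assign` function below is purely illustrative and is not Kafka's actual assignment strategy; the consumer names are made up.

```python
def assign(partitions, consumers):
    """Spread partitions over one group's consumers, round-robin.

    Illustrative only: each partition goes to exactly one consumer in the
    group, so consumers beyond the partition count receive nothing.
    """
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = [0, 1, 2, 3]
group_a = assign(partitions, ["a1", "a2"])              # 2 partitions each
group_b = assign(partitions, ["b1", "b2", "b3", "b4"])  # 1 partition each
group_c = assign(partitions, ["c1", "c2", "c3", "c4", "c5"])  # c5 stays idle
```

With four partitions, a fifth consumer in the same group has nothing to read, which is why the partition count caps a group's parallelism.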
25 Partition offsets Offset: messages in the partitions are each assigned a unique (per partition) and sequential id called the offset. Consumers track their pointers via (offset, partition, topic) tuples.
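A minimal in-memory sketch of a partition as an append-only log with per-consumer offsets follows. The `Partition` and `Consumer` classes are hypothetical teaching aids, not part of any Kafka client library.

```python
class Partition:
    """Append-only sequence of messages; the list index serves as the offset."""

    def __init__(self):
        self._log = []

    def append(self, message):
        # Producers only ever append to the tail; return the new offset.
        self._log.append(message)
        return len(self._log) - 1

    def read(self, offset, max_messages=10):
        return self._log[offset:offset + max_messages]


class Consumer:
    """Tracks its own read position, so each consumer sets its own pace."""

    def __init__(self, partition):
        self.partition = partition
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self.partition.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch


partition = Partition()
offsets = [partition.append(m) for m in ["m0", "m1", "m2"]]

fast, slow = Consumer(partition), Consumer(partition)
fast_batch = fast.poll()                # reads everything available
slow_batch = slow.poll(max_messages=1)  # reads one message and stops
```

Because reading does not remove anything from the log, the slow consumer can catch up later from its own offset.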
26 Replicas of a partition Replicas: backups of a partition. They exist solely to prevent data loss. Replicas are never read from, never written to. They do NOT help to increase producer or consumer parallelism! Kafka tolerates (numreplicas - 1) dead brokers before losing data. LinkedIn: numreplicas == 2, so 1 broker can die.
27 APACHE CASSANDRA
28 In a couple dozen words... Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tunably consistent, column-oriented database with a lot of adjectives
29 Overview Originally created by Facebook and open sourced in 2008. Based on Google Bigtable & Amazon Dynamo. Massively scalable. Easy to use. No relation to Hadoop; specifically, data is not stored on HDFS.
30 Distributed and Decentralized Distributed Can run on multiple machines Decentralized No single point of failure No master or slave issues by using a peer-to-peer architecture (gossip protocol, specifically) Can run across geographic datacenters
31 Elastic Scalability Scales horizontally. Adding nodes linearly increases performance. Decreasing and increasing node counts happen seamlessly.
32 Highly Available and Fault Tolerant Multiple networked computers in a cluster. Facility for recognizing node failures. Failing over requests by forwarding them to another part of the system.
33 Tunable Consistency Choice between strong and eventual consistency. Adjustable for read and write operations separately. Conflicts are solved during reads.
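One common way to reason about the strong-versus-eventual choice is the overlap rule from Dynamo-style systems: a read is guaranteed to touch at least one replica holding the latest acknowledged write when the read and write replica counts together exceed the replication factor. The helper names below are hypothetical, and the sketch ignores failures and repair.

```python
def quorum(replication_factor):
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor, write_acks, read_replicas):
    """True when every read set must overlap every write set (R + W > N)."""
    return write_acks + read_replicas > replication_factor


rf = 3
quorum_both = reads_see_latest_write(rf, quorum(rf), quorum(rf))  # 2 + 2 > 3
one_and_one = reads_see_latest_write(rf, 1, 1)                    # 1 + 1 <= 3
write_all_read_one = reads_see_latest_write(rf, rf, 1)            # 3 + 1 > 3
```

Writing and reading at a single replica is the fast, eventually consistent end of the dial; quorum reads and writes trade latency for overlap.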
34 Column-Oriented Stored in sparse multidimensional hash tables. A row can have multiple columns, and not necessarily the same number of columns as other rows. Each row has a unique key used for partitioning.
35 Query with CQL Familiar SQL-like syntax that maps to Cassandra's storage engine and simplifies data modeling.
CREATE TABLE songs (
  id uuid PRIMARY KEY,
  title text,
  album text,
  artist text,
  data blob,
  tags set<text>
);
INSERT INTO songs (id, title, artist, album, tags)
VALUES (a3e648f..., 'La Grange', 'ZZ Top', 'Tres Hombres', {'cool', 'hot'});
SELECT * FROM songs WHERE id = a3e648f...;
36 When should I use this? Key features to complement a Hadoop system: geographical distribution; large deployments of structured data.
37 REDIS
38 Introduction ANSI C open-source advanced key-value store Commonly referred to as a data structure server, since keys can contain strings, hashes, lists, sets, and sorted sets Operations are atomic and there are a bunch of them All data is stored in-memory, and can be persisted using snapshots or transaction logs Trivial master-slave replication
39 Clients Redis itself is ANSI C, but the protocol is open source and developers have created support in many languages: C, C#, C++, Clojure, Common Lisp, D, Dart, Emacs Lisp, Erlang, Fancy, GNU Prolog, Go, Haskell, haxe, Java, Lua, Node.js, Objective-C, Perl, PHP, Pure Data, Python, Ruby, Rust, Scala, Scheme, Smalltalk, Tcl
40 Data Types Redis keys can be anything from a string to a byte array of a JPEG Keys have associated data types, and we should talk about them Strings Lists Hashes Sets Sorted Sets HyperLogLogs
41 Strings! The simplest type. Supports a number of operations, including sets, gets, and incremental operations for values.
> SET mkey "my binary safe value"
OK
> GET mkey
"my binary safe value"
42 Lists! Linked lists, actually, i.e. O(1) for inserts into the head or tail of the list. Accessing an element by index is O(N).
> RPUSH messages "Hello how are you?"
(integer) 1
> RPUSH messages "Fine thanks. I'm having fun with Redis"
(integer) 2
> RPUSH messages "I should look into this NOSQL thing ASAP"
(integer) 3
> LRANGE messages 0 2
1) "Hello how are you?"
2) "Fine thanks. I'm having fun with Redis"
3) "I should look into this NOSQL thing ASAP"
43 Hashes! Maps between string fields and string values.
> HMSET user:1000 username antirez password P1pp0 age 34
OK
> HGETALL user:1000
1) "username"
2) "antirez"
3) "password"
4) "P1pp0"
5) "age"
6) "34"
> HSET user:1000 password 12345
(integer) 0
> HGETALL user:1000
1) "username"
2) "antirez"
3) "password"
4) "12345"
5) "age"
6) "34"
44 Sets! Unordered collection of strings. Supports adds, gets, is-member checks, intersections, unions, sorting...
> SADD myset 1
(integer) 1
> SADD myset 2
(integer) 1
> SADD myset 3
(integer) 1
> SMEMBERS myset
1) "1"
2) "2"
3) "3"
> SADD myotherset 2
(integer) 1
> SINTER myset myotherset
1) "2"
> SUNION myset myotherset
1) "1"
2) "2"
3) "3"
45 Sorted Sets! Similar to sets, but each element has an associated score and items can be returned in order. Elements are kept sorted on insertion, an O(log(n)) operation, so returning them in order is easy.
> ZADD hackers 1940 "Alan Kay"
(integer) 1
> ZADD hackers 1953 "Richard Stallman"
(integer) 1
> ZADD hackers 1965 "Yukihiro Matsumoto"
(integer) 1
> ZADD hackers 1916 "Claude Shannon"
(integer) 1
> ZADD hackers 1969 "Linus Torvalds"
(integer) 1
> ZADD hackers 1912 "Alan Turing"
(integer) 1
> ZRANGE hackers 0 -1
1) "Alan Turing"
2) "Claude Shannon"
3) "Alan Kay"
4) "Richard Stallman"
5) "Yukihiro Matsumoto"
6) "Linus Torvalds"
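The sort-on-insert idea can be sketched in a few lines of Python. This `SortedSet` class is a hypothetical model, not how Redis implements sorted sets: it uses a sorted Python list with `bisect`, so finding the position is O(log n) but the list insert itself is O(n).

```python
import bisect


class SortedSet:
    """Toy sorted set: members are unique, ordering is by (score, member)."""

    def __init__(self):
        self._scores = {}   # member -> score
        self._ordered = []  # (score, member) pairs, always kept sorted

    def zadd(self, score, member):
        """Insert or re-score a member; return 1 if it was new, else 0."""
        is_new = member not in self._scores
        if not is_new:
            self._ordered.remove((self._scores[member], member))
        self._scores[member] = score
        bisect.insort(self._ordered, (score, member))
        return 1 if is_new else 0

    def zrange(self, start, stop):
        """Members by ascending score; inclusive bounds, negatives from the end."""
        n = len(self._ordered)
        if start < 0:
            start = max(n + start, 0)
        if stop < 0:
            stop = n + stop
        return [member for _, member in self._ordered[start:stop + 1]]


hackers = SortedSet()
for year, name in [(1940, "Alan Kay"), (1953, "Richard Stallman"),
                   (1965, "Yukihiro Matsumoto"), (1916, "Claude Shannon"),
                   (1969, "Linus Torvalds"), (1912, "Alan Turing")]:
    hackers.zadd(year, name)

ordered = hackers.zrange(0, -1)
```

Since the work happens at insert time, a range read is just a slice of the already ordered pairs.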
46 HyperLogLogs! Probabilistic data structure to estimate the cardinality of a set. Very useful when you have a set with high cardinality (talking millions). PFADD returns 1 if the estimated cardinality changed, 0 otherwise.
> PFADD hll a b c d e f g
(integer) 1
> PFCOUNT hll
(integer) 7
> PFADD hll a
(integer) 0
> PFADD hll h
(integer) 1
> PFCOUNT hll
(integer) 8
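To give a feel for how such an estimator works, here is a compact textbook-style HyperLogLog in Python. It is a sketch under assumptions, not Redis's implementation: the register count (2^12), the SHA-1 based hash, and the small-range correction are choices made for this example, and the `alpha` constant used is only valid for 128 or more registers.

```python
import hashlib
import math


class HyperLogLog:
    """Textbook HyperLogLog with 2**p registers (assumes p >= 7)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def _hash64(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        """Record an item; return 1 if a register changed, else 0."""
        h = self._hash64(item)
        index = h >> (64 - self.p)             # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining 64 - p bits
        # Position of the leftmost 1-bit in `rest`, counting from 1.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank
            return 1
        return 0

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)


hll = HyperLogLog()
for i in range(50000):
    hll.add(f"user-{i}")
approx = hll.count()  # close to 50000, within a few percent
```

The structure stores only 4096 small integers regardless of how many items are added, which is the point of using it for very large sets.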
47 Features Transactions Pub/Sub Lua Scripting Key Expiration Redis Clustering
48 Transactions Guarantees no client requests are served in the middle of a transaction. Either all commands or none are processed, so they are atomic. MULTI begins a transaction, and EXEC commits it. Redis will queue commands and process them upon EXEC. All commands in the queue are processed, even if one fails.
> MULTI
OK
> INCR foo
QUEUED
> INCR bar
QUEUED
> EXEC
1) (integer) 1
2) (integer) 1
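The queue-then-execute behaviour, including the point that later commands still run after one fails, can be modeled with a small in-memory class. `MiniRedis` is a hypothetical stand-in that supports only INCR; it is not a Redis client.

```python
class MiniRedis:
    """Toy model of MULTI / EXEC command queuing for a single client."""

    def __init__(self):
        self.data = {}
        self._queue = None  # None means "not inside a transaction"

    def multi(self):
        self._queue = []
        return "OK"

    def incr(self, key):
        if self._queue is not None:
            self._queue.append(key)  # defer until exec()
            return "QUEUED"
        return self._incr(key)

    def _incr(self, key):
        value = int(self.data.get(key, 0)) + 1  # ValueError if not an integer
        self.data[key] = value
        return value

    def exec(self):
        queued, self._queue = self._queue, None
        results = []
        for key in queued:
            try:
                results.append(self._incr(key))
            except ValueError as err:
                results.append(err)  # record the failure and keep going
        return results


r = MiniRedis()
replies = [r.multi(), r.incr("foo"), r.incr("bar")]
first = r.exec()

r.data["name"] = "not-a-number"
r.multi()
r.incr("name")  # will fail at exec time
r.incr("foo")   # still runs afterwards
second = r.exec()
```

Note that there is no rollback in this model: the failed command leaves an error in its result slot and the rest of the queue is applied.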
49 Pub/Sub Messaging paradigm where publishers send messages to subscribers (if any) via channels. Subscribers express interest in channels, and receive messages from publishers (if any). Clients can subscribe to channels, and messages from publishers will be pushed to them by Redis:
SUBSCRIBE test
PUBLISH test Hello
Can do pattern-based subscriptions to channels:
PSUBSCRIBE news.*
50 Lua Scripting You can run Lua scripts to manipulate Redis.
> eval "return redis.call('set','foo','bar')" 0
OK
51 Expire Keys after time Set a timeout on a key, having Redis automatically delete it after the set time. Use case: maintain session information for a user for the last 60 seconds to recommend related products.
MULTI
RPUSH pageviews.user:<userid>
EXPIRE pageviews.user:<userid> 60
EXEC
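The session use case can be sketched with a toy store that takes an injectable clock, so the 60-second window can be exercised without waiting. `ExpiringStore`, the user id 42, and the page paths are all hypothetical; the sketch only checks expiry lazily when a key is touched.

```python
class ExpiringStore:
    """Toy key store with per-key timeouts, checked lazily on access."""

    def __init__(self, clock):
        self._clock = clock  # callable returning the current time in seconds
        self._data = {}
        self._deadline = {}

    def _evict_if_expired(self, key):
        deadline = self._deadline.get(key)
        if deadline is not None and self._clock() >= deadline:
            self._data.pop(key, None)
            self._deadline.pop(key, None)

    def rpush(self, key, value):
        self._evict_if_expired(key)
        self._data.setdefault(key, []).append(value)
        return len(self._data[key])

    def expire(self, key, seconds):
        """(Re)start the timeout; return 1 if the key exists, else 0."""
        self._evict_if_expired(key)
        if key not in self._data:
            return 0
        self._deadline[key] = self._clock() + seconds
        return 1

    def items(self, key):
        self._evict_if_expired(key)
        return list(self._data.get(key, []))


now = [0]  # fake clock, advanced by hand
store = ExpiringStore(lambda: now[0])
key = "pageviews.user:42"

store.rpush(key, "/products/1")
store.expire(key, 60)
now[0] = 30
store.rpush(key, "/products/2")
store.expire(key, 60)       # each page view pushes the deadline to now + 60

now[0] = 89
still_there = store.items(key)
now[0] = 90
gone = store.items(key)
```

Re-issuing the timeout on every page view gives a sliding window: the session disappears only after 60 quiet seconds.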
52 Redis Cluster Redis Cluster is not production ready, but can be used to partition your data across multiple Redis instances. A few abstractions exist today to partition among multiple instances, but they are not out-of-the-box with a Redis download.
53 Use Cases Session cache. Ranking lists. Auto complete. Twitter/Github/Pinterest/Snapchat/Craigslist/StackOverflow/Flickr
54 CLUSTER PLANNING
55 Workload Considerations Balanced workloads: jobs are distributed across various job types (CPU bound, disk I/O bound, network I/O bound). Compute-intensive workloads (data analytics): CPU-bound workloads require large numbers of CPUs and large amounts of memory to store in-process data. I/O-intensive workloads (sorting): I/O-bound workloads require a larger number of spindles (disks) per node. Not sure? Go with a balanced workload configuration.
56 Hardware Topology Hadoop uses a master / slave topology. Master nodes include: NameNode - maintains system metadata; Backup NN - performs checkpoint operations and hosts the standby; ResourceManager - manages task assignment. Slave nodes include: DataNode - stores HDFS files / manages read and write requests, preferably co-located with the TaskTracker; NodeManager - performs map / reduce tasks.
57 Sizing The Cluster Remember: scaling is a relatively simple task. Start with a moderate-sized cluster and grow the cluster as requirements dictate. Develop a scaling strategy: as simple as scaling is, adding new nodes takes time and resources, and you don't want to be adding new nodes each week. The amount of data typically defines the initial cluster size; the rate at which the volume of data increases drives growth. Drivers for determining when to grow your cluster: storage requirements, processing requirements, memory requirements.
58 Storage Reqs Drive Cluster Growth Data volume increases at a rate of 1TB / week. 3TB of storage are required to store the data alone (remember block replication). Consider additional overhead, typically 30%. Remember files that are stored on a node's local disk. If DataNodes incorporate four 1TB drives, 1 new node per week is required. 2 years of data, roughly 100TB, will require 100 new nodes.
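The arithmetic behind this estimate can be written out as a short calculation. The function names are made up for this sketch; the inputs (1TB per week, 3x replication, 30% overhead, four 1TB drives per node) are the figures from the slide.

```python
import math


def raw_storage_tb(ingest_tb, replication=3, overhead=0.30):
    """Disk needed for `ingest_tb` of new data, with replication and overhead."""
    return ingest_tb * replication * (1 + overhead)


def nodes_needed(raw_tb, drives_per_node=4, drive_tb=1.0):
    """Whole DataNodes needed to hold `raw_tb` of raw storage."""
    return math.ceil(raw_tb / (drives_per_node * drive_tb))


per_week_tb = raw_storage_tb(1.0)                 # about 3.9 TB of disk per week
nodes_per_week = nodes_needed(per_week_tb)        # fits on one 4 TB node
weeks = 2 * 52
nodes_two_years = nodes_needed(raw_storage_tb(1.0 * weeks))  # about 100 nodes
```

Two years at 1TB per week is 104TB of ingested data, which under these assumptions comes to 102 nodes, in line with the slide's "roughly 100".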
59 Things Break Things are going to break This assumption is a core premise of Hadoop If a disk fails, the infrastructure must accommodate If a DataNode fails, the NameNode must manage this If a task fails, the ApplicationMaster must manage this failure Master nodes are typically a SPOF unless using a Highly Available configuration NameNode goes down, HDFS is inaccessible Use NameNode HA ResourceManager goes down, can't run any jobs Use RM HA (in development)
60 Cluster Nodes Cluster nodes should be commodity hardware Buy more nodes... Not more expensive nodes Workload patterns and cluster size drive CPU choice Small cluster - 50 nodes or less Quad core / medium clock speed is usually sufficient Large cluster Dual 8-core CPUs with a medium clock speed is sufficient Compute intensive workloads might require higher clock speeds General guideline is to buy more hardware instead of faster hardware Lots of memory - 48GB / 64GB / 128GB / 256GB Each map / reduce task consumes 1GB to 3GB of memory OS / Daemons consume memory as well
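The memory guideline translates into a rough bound on concurrent tasks per node. In this sketch the 8GB reserved for the OS and daemons is an assumed figure, not one from the slides, and `task_slots` is a hypothetical helper.

```python
def task_slots(node_memory_gb, task_memory_gb, reserved_gb=8):
    """Concurrent map/reduce tasks that fit in memory on one node.

    reserved_gb is an assumption standing in for OS and daemon usage.
    """
    usable = max(node_memory_gb - reserved_gb, 0)
    return int(usable // task_memory_gb)


light_tasks = task_slots(64, 1)  # 1 GB per task on a 64 GB node
heavy_tasks = task_slots(64, 3)  # 3 GB per task on the same node
```

The spread between 1GB and 3GB tasks is about a factor of three in slots, which is why the expected task profile matters when choosing how much memory to buy.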
61 Cluster Storage 4 to 12 drives of 1TB / 2TB capacity, up to 24TB / node. 3TB drives work, but there is a network performance penalty if a node fails. 7200 rpm SATA drives are sufficient; slightly above average MTBF is advantageous. JBOD configuration: RAID is slow, and RAID is not required due to block replication. More smaller disks are preferred over fewer larger disks for increased DataNode parallelism. Slaves should never use virtual memory.
62 Master Nodes Still commodity hardware, but... better Redundant everything Power supplies Dual Ethernet cards 16 to 24 CPU cores on NameNodes NameNodes and their clients are very chatty and need more cores to handle messaging traffic Medium clock speeds should be sufficient
63 Master Nodes HDFS namespace is limited to the amount of memory on the NameNode RAID and NFS storage on NameNode Typically RAID5 with hot spare Second remote directory such as NFS Quorum Journal Manager for HA
64 Network Considerations Hadoop is bandwidth intensive This can be a significant bottleneck Use dedicated switches 10Gb Ethernet is pretty good for large clusters
65 Which Operating System? Choose an OS that you are comfortable and familiar with. Consider your admin resources / experience. RedHat Enterprise Linux: includes support contract. CentOS: no support, but the price is right. Many other possibilities: SuSE Enterprise Linux, Ubuntu, Fedora.
66 Which Java Virtual Machine? Oracle Java is the only supported JVM Runs on OpenJDK, but use at your own risk Hadoop 1.0 requires Java JDK 1.6 or higher Hadoop 2.x requires Java JDK 1.7
67 References Give it a test drive! b11-final12.pdf basic-training-verisign
Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray VMware
Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray ware 2 Agenda The Hadoop Journey Why Virtualize Hadoop? Elasticity and Scalability Performance Tests Storage Reference
MapReduce with Apache Hadoop Analysing Big Data
MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside [email protected] About Journey Dynamics Founded in 2006 to develop software technology to address the issues
Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation
Facebook: Cassandra Smruti R. Sarangi Department of Computer Science Indian Institute of Technology New Delhi, India Smruti R. Sarangi Leader Election 1/24 Outline 1 2 3 Smruti R. Sarangi Leader Election
Cloud Based Application Architectures using Smart Computing
Cloud Based Application Architectures using Smart Computing How to Use this Guide Joyent Smart Technology represents a sophisticated evolution in cloud computing infrastructure. Most cloud computing products
Mark Bennett. Search and the Virtual Machine
Mark Bennett Search and the Virtual Machine Agenda Intro / Business Drivers What to do with Search + Virtual What Makes Search Fast (or Slow!) Virtual Platforms Test Results Trends / Wrap Up / Q & A Business
Scalable Architecture on Amazon AWS Cloud
Scalable Architecture on Amazon AWS Cloud Kalpak Shah Founder & CEO, Clogeny Technologies [email protected] 1 * http://www.rightscale.com/products/cloud-computing-uses/scalable-website.php 2 Architect
!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets
!"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.
On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform
On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform Page 1 of 16 Table of Contents Table of Contents... 2 Introduction... 3 NoSQL Databases... 3 CumuLogic NoSQL Database Service...
Distributed Filesystems
Distributed Filesystems Amir H. Payberah Swedish Institute of Computer Science [email protected] April 8, 2014 Amir H. Payberah (SICS) Distributed Filesystems April 8, 2014 1 / 32 What is Filesystem? Controls
CS 145: NoSQL Activity Stanford University, Fall 2015 A Quick Introdution to Redis
CS 145: NoSQL Activity Stanford University, Fall 2015 A Quick Introdution to Redis For this assignment, compile your answers on a separate pdf to submit and verify that they work using Redis. Installing
Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments
Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Important Notice 2010-2015 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, Impala, and
Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems
Processing of massive data: MapReduce 2. Hadoop 1 MapReduce Implementations Google were the first that applied MapReduce for big data analysis Their idea was introduced in their seminal paper MapReduce:
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
Google File System. Web and scalability
Google File System Web and scalability The web: - How big is the Web right now? No one knows. - Number of pages that are crawled: o 100,000 pages in 1994 o 8 million pages in 2005 - Crawlable pages might
Using Kafka to Optimize Data Movement and System Integration. Alex Holmes @
Using Kafka to Optimize Data Movement and System Integration Alex Holmes @ https://www.flickr.com/photos/tom_bennett/7095600611 THIS SUCKS E T (circa 2560 B.C.E.) L a few years later... 2,014 C.E. i need
Evaluation of NoSQL databases for large-scale decentralized microblogging
Evaluation of NoSQL databases for large-scale decentralized microblogging Cassandra & Couchbase Alexandre Fonseca, Anh Thu Vu, Peter Grman Decentralized Systems - 2nd semester 2012/2013 Universitat Politècnica
Realtime Apache Hadoop at Facebook. Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens
Realtime Apache Hadoop at Facebook Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens Agenda 1 Why Apache Hadoop and HBase? 2 Quick Introduction to Apache HBase 3 Applications of HBase at
Big Data with Component Based Software
Big Data with Component Based Software Who am I Erik who? Erik Forsberg Linköping University, 1998-2003. Computer Science programme + lot's of time at Lysator ACS At Opera Software
High Availability Solutions for the MariaDB and MySQL Database
High Availability Solutions for the MariaDB and MySQL Database 1 Introduction This paper introduces recommendations and some of the solutions used to create an availability or high availability environment
<Insert Picture Here> Oracle NoSQL Database A Distributed Key-Value Store
Oracle NoSQL Database A Distributed Key-Value Store Charles Lamb, Consulting MTS The following is intended to outline our general product direction. It is intended for information
Apache HBase. Crazy dances on the elephant back
Apache HBase Crazy dances on the elephant back Roman Nikitchenko, 16.10.2014 YARN 2 FIRST EVER DATA OS 10.000 nodes computer Recent technology changes are focused on higher scale. Better resource usage
Big Data Analytics(Hadoop) Prepared By : Manoj Kumar Joshi & Vikas Sawhney
Big Data Analytics(Hadoop) Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Understanding Big Data and Big Data Analytics Getting familiar with Hadoop Technology Hadoop release and upgrades
ZooKeeper. Table of contents
by Table of contents 1 ZooKeeper: A Distributed Coordination Service for Distributed Applications... 2 1.1 Design Goals...2 1.2 Data model and the hierarchical namespace...3 1.3 Nodes and ephemeral nodes...
Benchmarking Hadoop & HBase on Violin
Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages
Real-time Analytics at Facebook: Data Freeway and Puma. Zheng Shao 12/2/2011
Real-time Analytics at Facebook: Data Freeway and Puma Zheng Shao 12/2/2011 Agenda 1 Analytics and Real-time 2 Data Freeway 3 Puma 4 Future Works Analytics and Real-time what and why Facebook Insights
Snapshots in Hadoop Distributed File System
Snapshots in Hadoop Distributed File System Sameer Agarwal UC Berkeley Dhruba Borthakur Facebook Inc. Ion Stoica UC Berkeley Abstract The ability to take snapshots is an essential functionality of any
Understanding Neo4j Scalability
Understanding Neo4j Scalability David Montag January 2013 Understanding Neo4j Scalability Scalability means different things to different people. Common traits associated include: 1. Redundancy in the
SAN Conceptual and Design Basics
TECHNICAL NOTE VMware Infrastructure 3 SAN Conceptual and Design Basics VMware ESX Server can be used in conjunction with a SAN (storage area network), a specialized high speed network that connects computer
Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7
Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7 Yan Fisher Senior Principal Product Marketing Manager, Red Hat Rohit Bakhshi Product Manager,
Parallels Cloud Storage
Parallels Cloud Storage White Paper Best Practices for Configuring a Parallels Cloud Storage Cluster www.parallels.com Table of Contents Introduction... 3 How Parallels Cloud Storage Works... 3 Deploying
HBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367
HBase A Comprehensive Introduction James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367 Overview Overview: History Began as project by Powerset to process massive
Accelerating and Simplifying Apache
Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly
Big Fast Data Hadoop acceleration with Flash. June 2013
Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional
Bigdata High Availability (HA) Architecture
Bigdata High Availability (HA) Architecture Introduction This whitepaper describes an HA architecture based on a shared nothing design. Each node uses commodity hardware and has its own local resources
Data Pipeline with Kafka
Data Pipeline with Kafka Peerapat Asoktummarungsri AGODA Senior Software Engineer Agoda.com Contributor Thai Java User Group (THJUG.com) Contributor Agile66 AGENDA Big Data & Data Pipeline Kafka Introduction
Deploying and Optimizing SQL Server for Virtual Machines
Deploying and Optimizing SQL Server for Virtual Machines Deploying and Optimizing SQL Server for Virtual Machines Much has been written over the years regarding best practices for deploying Microsoft SQL
High Throughput Computing on P2P Networks. Carlos Pérez Miguel [email protected]
High Throughput Computing on P2P Networks Carlos Pérez Miguel [email protected] Overview High Throughput Computing Motivation All things distributed: Peer-to-peer Non structured overlays Structured
HDFS Architecture Guide
by Dhruba Borthakur Table of contents 1 Introduction... 3 2 Assumptions and Goals... 3 2.1 Hardware Failure... 3 2.2 Streaming Data Access...3 2.3 Large Data Sets... 3 2.4 Simple Coherency Model...3 2.5
Benchmarking Couchbase Server for Interactive Applications. By Alexey Diomin and Kirill Grigorchuk
Benchmarking Couchbase Server for Interactive Applications By Alexey Diomin and Kirill Grigorchuk Contents 1. Introduction... 3 2. A brief overview of Cassandra, MongoDB, and Couchbase... 3 3. Key criteria
