Four Orders of Magnitude: Running Large Scale Accumulo Clusters. Aaron Cordova Accumulo Summit, June 2014
2 Scale, Security, Schema
3 Scale
4 to scale 1 - (vt) to change the size of something
5 let's scale the cluster up to twice the original size
6 to scale 2 - (vi) to function properly at a large scale
7 Accumulo scales
8 What is Large Scale?
9 Notebook Computer 16 GB DRAM 512 GB Flash Storage 2.3 GHz quad-core i7 CPU
10 Modern Server 100s of GB DRAM 10s of TB on disk 10s of cores
11 Large Scale [chart: laptop, single server, and 10-, 100-, 1,000-, and 10,000-node clusters plotted against capacity in RAM and on disk, spanning roughly 10 GB up to 100 PB]
12 Data Composition [chart: original raw data alongside derivative data (QFDs, indexes), growing month over month from January through April]
13 Accumulo Scales From GB to PB, Accumulo keeps two things low: Administrative effort Scan latency
14 Scan Latency
15 Administrative Overhead [chart: failed machines vs. admin intervention]
16 Accumulo Scales From GB to PB, three things grow linearly: Total storage size Ingest rate Concurrent scans
17 Ingest Benchmark Millions of entries per second
18 AWB Benchmark
19 1000 machines
20 100 M entries written per second
21 408 terabytes
22 7.56 trillion total entries
23 Graph Benchmark
24 1200 machines
25 4.4 trillion vertices
26 70.4 trillion edges
27 149 M edges traversed per second
28 1 petabyte
29 Graph Analysis [chart: graph sizes in billions of edges for Twitter, Yahoo!, Facebook, and Accumulo]
30 Accumulo is designed after Google's BigTable
31 BigTable powers hundreds of applications at Google
32 BigTable serves 2+ exabytes
33 600 M queries per second organization wide
34 From 10 to 10,000
35 Starting with ten machines: 10^1
36 One rack
37 1 TB RAM
38 TB Disk
39 Hardware failures rare
40 Test Application Designs
41 Designing Applications for Scale
42 Keys to Scaling 1. Live writes go to all servers 2. User requests are satisfied by few scans 3. Turning updates into inserts
43 Keys to Scaling Writes on all servers Few Scans
44 Hash / UUID Keys: logical keys (usera:name=Bob, usera:age=43, usera:account=$30, userb:name=Annie, userb:age=32, userb:account=$25, userc:name=Joe, userc:age=59) are each stored under a hashed row ID (af362de4=Bob, b23dc4be=Annie, b98de2ff=Joe, c48e2ade=$30, c7e43fb2=$25, d938ff3d=32, e2e4dac4=59, e98f2eab=43). Uniform writes
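The hash/UUID scheme on this slide can be sketched in a few lines of Python (a toy model for illustration, not the Accumulo client API; the choice of MD5 and the 8-character truncation are assumptions):

```python
import hashlib

def hashed_row_id(key: str) -> str:
    """Derive a uniform row ID by hashing the natural key."""
    return hashlib.md5(key.encode()).hexdigest()[:8]

# Each attribute is written under its own hashed row ID, so a batch of
# writes for one user lands on many different tablet servers.
user_entries = {"usera:name": "Bob", "usera:age": "43", "usera:account": "$30"}
rows = {hashed_row_id(k): v for k, v in user_entries.items()}

# The three entries for usera now sort to three unrelated row IDs.
assert len(rows) == 3
```

Because the hash is uniform, live writes spread evenly over the sorted key space — which is exactly what makes reads for one user scatter, as the next slide shows.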
45 Monitor [screenshot: Accumulo monitor listing participating tablet servers for MyTable, with hosted tablets and ingest rate per server]
46 Hash / UUID Keys: get(usera) must fetch its name, age, and account from rows af362de4, c48e2ade, and e98f2eab — 3 x 1-entry scans on 3 servers
47 Keys to Scaling Writes on all servers Few Scans Hash / UUID Keys
48 Group for Locality: all of a user's attributes share one hashed row ID — af362de4: name=Annie, age=32, account=$25; c48e2ade: name=Joe, age=59; e2e4dac4: name=Bob, age=43, account=$30. Still fairly uniform writes
49 Group for Locality: get(usera) → af362de4: name=Annie, age=32, account=$25 — 1 x 3-entry scan on 1 server
50 Keys to Scaling Writes on all servers Few Scans Grouped Keys
51 Temporal Keys [table: user entries (usera…userd) re-keyed under timestamp row IDs]
52 Temporal Keys [table: newer entries keep arriving under higher timestamps]
53 Temporal Keys [table: every new write lands at the end of the sorted key range] Always write to one server
54 No write parallelism
55 Temporal Keys: get(start to end) — fetching contiguous time ranges uses few scans
56 Keys to Scaling Writes on all servers Few Scans Temporal Keys
57 Binned Temporal Keys [table: row IDs gain a bin prefix — 0_, 1_, 2_ — ahead of the timestamp] Uniform Writes
58 Binned Temporal Keys [table: new entries spread across the 0_, 1_, and 2_ bins] Uniform Writes
59 Binned Temporal Keys [table: each bin holds its own time-sorted run of entries] Uniform Writes
60 Binned Temporal Keys: get(start to end) [table: the same time range is read from each bin] One scan per bin
61 Keys to Scaling Writes on all servers Few Scans Binned Temporal Keys
62 Keys to Scaling Key design is critical Group data under common row IDs to reduce scans Prepend bins to row IDs to increase write parallelism
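The binning trick summarized above can be sketched as follows (a toy model; the bin count and row-ID format are assumptions, not an Accumulo convention):

```python
import hashlib

N_BINS = 3

def binned_row_id(record_id: str, timestamp: str) -> str:
    """Prepend a stable bin to the timestamp: writes arriving 'now'
    spread across N_BINS tablet ranges instead of hammering one server."""
    bin_no = int(hashlib.md5(record_id.encode()).hexdigest(), 16) % N_BINS
    return f"{bin_no}_{timestamp}"

def ranges_for_time_span(start: str, end: str):
    """A time-range query fans out into one range scan per bin."""
    return [(f"{b}_{start}", f"{b}_{end}") for b in range(N_BINS)]

rid = binned_row_id("usera", "20140601120000")
assert rid.endswith("_20140601120000")
assert len(ranges_for_time_span("20140601", "20140602")) == N_BINS
```

The trade-off is explicit: write parallelism scales with the bin count, but every time-range query now costs one scan per bin.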
63 Splits: Pre-split or organic splits. Going from dev to production, you can ingest a representative sample, obtain split points, and use them to pre-split the larger system. Hundreds or thousands of tablets per server is OK. Want at least one tablet per server.
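One simple way to derive split points from a representative sample is to take evenly spaced keys from the sorted sample (a sketch of the idea only — the function name and the even-spacing heuristic are assumptions):

```python
def split_points(sample_keys, n_tablets):
    """Pick evenly spaced keys from a sorted sample to pre-split a table
    into n_tablets roughly equal ranges."""
    keys = sorted(sample_keys)
    step = len(keys) / n_tablets
    return [keys[int(i * step)] for i in range(1, n_tablets)]

# A sample of 1000 zero-padded keys; 4 tablets need 3 split points,
# which land roughly at the quartiles of the sample.
sample = [f"{i:04d}" for i in range(1000)]
splits = split_points(sample, 4)
assert splits == ["0250", "0500", "0750"]
```

The resulting split points can then be fed to the larger table before production ingest begins.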
64 Effect of Compression Similar sorted keys compress well May need more data than you think to auto-split
65 Inserts are fast 10s of thousands per second per machine
66 Updates *can* be fast
67 Update Types Overwrite Combine Complex
68 Update - Overwrite: Performance same as insert. Ignore (don't read) the existing value. Accumulo's VersioningIterator does the overwrite
69 Update - Overwrite RowID Col Value af362de4 name Annie userb:age -> 34 af362de4 age 32 af362de4 account $25 c48e2ade name Joe c48e2ade age 59 e2e4dac4 name Bob e2e4dac4 age 43 e2e4dac4 account $30
70 Update - Overwrite RowID Col Value af362de4 name Annie userb:age -> 34 af362de4 age 34 af362de4 account $25 c48e2ade name Joe c48e2ade age 59 e2e4dac4 name Bob e2e4dac4 age 43 e2e4dac4 account $30
71 Update - Combine: Things like X = X + 1. Normally one would have to read the old value to do this, but Accumulo Iterators allow multiple inserts to be combined at scan time or at compaction time. Performance is the same as inserts
72 Update - Combine RowID Col Value af362de4 name Annie userb:account -> +10 af362de4 age 34 af362de4 account $25 c48e2ade name Joe c48e2ade age 59 e2e4dac4 name Bob e2e4dac4 age 43 e2e4dac4 account $30
73 Update - Combine RowID Col Value af362de4 name Annie userb:account -> +10 af362de4 age 34 af362de4 account $25 af362de4 account $10 c48e2ade name Joe c48e2ade age 59 e2e4dac4 name Bob e2e4dac4 age 43 e2e4dac4 account $30
74 Update - Combine RowID Col Value af362de4 name Annie af362de4 age 34 af362de4 account $25 af362de4 account $10 getaccount(userb) $35 c48e2ade name Joe c48e2ade age 59 e2e4dac4 name Bob e2e4dac4 age 43 e2e4dac4 account $30
75 Update - Combine RowID Col Value af362de4 name Annie af362de4 age 34 af362de4 account $35 After compaction c48e2ade name Joe c48e2ade age 59 e2e4dac4 name Bob e2e4dac4 age 43 e2e4dac4 account $30
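The combine-at-read-time idea walked through in these slides can be modeled in plain Python (a toy stand-in for Accumulo's combining iterators, e.g. a summing combiner — not real iterator code; cents are used to keep the arithmetic exact):

```python
# Updates are written as plain inserts, with no read-modify-write;
# the reader (or a compaction) merges the partial values.
inserts = []

def add_to_account(row, delta_cents):
    """An 'update' is just another insert — same cost as any write."""
    inserts.append((row, "account", delta_cents))

def scan_account(row):
    """Combine all partial values for the cell at scan time."""
    return sum(v for r, c, v in inserts if r == row and c == "account")

add_to_account("af362de4", 2500)   # initial balance, $25.00
add_to_account("af362de4", 1000)   # +$10.00, written as a second insert
assert scan_account("af362de4") == 3500  # readers see $35.00
```

A compaction would apply the same combining function and rewrite the two partial entries as one $35 entry, exactly as the slide shows.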
76 Update - Complex: Some updates require looking at more data than Iterators have access to, such as multiple rows. These require reading the data out in order to write the new value. Performance will be much slower
77 Update - Complex userc:account = getbalance(usera) + getbalance(userb) RowID Col Value af362de4 name Annie af362de4 age 34 af362de4 account $35 c48e2ade name Joe = 65 c48e2ade age 59 c48e2ade account $40 e2e4dac4 name Bob e2e4dac4 age 43 e2e4dac4 account $30
78 Update - Complex userc:account = getbalance(usera) + getbalance(userb) RowID Col Value af362de4 name Annie af362de4 age 34 af362de4 account $35 c48e2ade name Joe = 65 c48e2ade age 59 c48e2ade account $65 e2e4dac4 name Bob e2e4dac4 age 43 e2e4dac4 account $30
79 Planning a Larger-Scale Cluster
80 Storage vs Ingest [chart: ingest rate (millions of entries per second) vs. storage (terabytes) for 1x1TB and 12x3TB disk configurations]
81 Model for Ingest Rates: N - number of machines; S - single-server throughput (entries/second); A - aggregate cluster throughput (entries/second). A = 0.85^(log2 N) * N * S. Expect an 85% increase in write rate when doubling the size of the cluster
82 Estimating Machines Required: N - number of machines; S - single-server throughput (entries/second); A - target aggregate throughput (entries/second). N = 2^(log(A/S) / log(1.7)). Expect an 85% increase in write rate when doubling the size of the cluster
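The model on the two preceding slides and its inverse are easy to sanity-check in code (a direct transcription of the formulas; the 0.85 and 1.7 constants come from the slides):

```python
import math

def aggregate_rate(n, s):
    """A = 0.85^(log2 N) * N * S: each doubling of the cluster
    multiplies throughput by 2 * 0.85 = 1.7 rather than the ideal 2."""
    return (0.85 ** math.log2(n)) * n * s

def machines_needed(a, s):
    """Invert the model: N = 2^(log(A/S) / log 1.7)."""
    return 2 ** (math.log(a / s) / math.log(1.7))

# Doubling from 100 to 200 machines yields a 1.7x rate increase,
# and the inverse recovers the original machine count.
r1 = aggregate_rate(100, 100_000)
r2 = aggregate_rate(200, 100_000)
assert abs(r2 / r1 - 1.7) < 1e-9
assert abs(machines_needed(r1, 100_000) - 100) < 1e-6
```

In other words, the 0.85 factor is a per-doubling efficiency: growth stays linear-ish but each doubling delivers 70% more throughput, not 100%.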
83 Predicted Cluster Sizes [chart: number of machines vs. predicted millions of entries per second]
84 100 Machines: 10^2
85 Multiple racks
86 10 TB RAM
87 100 TB - 1PB Disk
88 Some hardware failures in the first week (burn-in)
89 Expect 3 failed HDs in the first 3 months
90 Another 4 within the first year research.google.com/en/us/archive/disk_failures.pdf
91 Can process the 1000 Genomes data set 260 TB
92 Can store and index the Common Crawl Corpus (commoncrawl.org): 2.8 billion web pages, 541 TB
93 One year of Twitter: 182 billion tweets, 483 TB
94 Deploying an Application Users Clients Tablet Servers
95 May not see the effect of writing to disk for a while
96 1000 machines: 10^3
97 Multiple rows of racks
98 100 TB RAM
99 1-10 PB Disk
100 Hardware failure is a regular occurrence
101 Hard drive failure about every 5 days (average). Will be skewed towards the beginning of the year
102 Can traverse the brain graph 70 trillion edges, 1 PB
103 Facebook Graph 1s of PB xldb2012_wed_1105_dhrubaborthakur.pdf
104 Netflix Video Master Copies 3.14 PB
105 World of Warcraft Backend Storage 1.3 PB wows-back-end-10-data-centers cores/
106 Webpages, live on the Internet 14.3 Trillion total-number-of-websites-size-of.html
107 Things like the difference between two compression algorithms start to make a big difference
108 Use range compactions to effect changes on portions of a table
109 Lay off ZooKeeper
110 Watch Garbage Collector and NameNode ops
111 Garbage Collection > 5 minutes?
112 Start thinking about NameNode Federation
113 Accumulo 1.6
114 Multiple NameNodes [diagram: one Accumulo instance over multiple HDFS clusters, each with its own NameNode and DataNodes]
115 Multiple NameNodes [diagram: multiple NameNodes sharing one set of DataNodes] (Federation; requires Hadoop 2.0)
116 More NameNodes = higher risk of one going down. Can use HA NameNodes in conjunction with Federation
117 10,000 machines: 10^4
118 You, my friend, are here to kick a** and chew bubble gum
119 1 PB RAM
120 PB Disk
121 1 hardware failure every hour on average
122 Entire Internet Archive 15 PB internet-archive-wayback-machine-brewster-kahle
123 A year s worth of data from the Large Hadron Collider 15 PB
124 0.1% of all Internet traffic in PB total-number-of-websites-size-of.html
125 Facebook Messaging Data 10s of PB xldb2012_wed_1105_dhrubaborthakur.pdf
126 Facebook Photos 240 billion High 10s of PB xldb2012_wed_1105_dhrubaborthakur.pdf
127 Must use multiple NameNodes
128 Can tune back heartbeats, periodicity of central processes in general
129 Can combine multiple PB data sets
130 Up to 10 quadrillion entries in a single table
131 While maintaining sub-second lookup times
132 Only with Accumulo 1.6
133 Dealing with data over time
134 Data Over Time - Patterns Initial Load Increasing Velocity Focus on Recency Historical Summaries
135 Initial Load Get a pile of old data into Accumulo fast Latency not important (data is old) Throughput critical
136 Bulk Load RFiles
137 Bulk Loading MapReduce RFiles Accumulo
138 Increasing velocity
139 If your data isn t big today, wait a little while
140 Accumulo scales up dynamically, online. No downtime
141 This is "to scale" in the first sense: changing size
142 Scaling Up Clients Accumulo HDFS 3 physical servers Each running a Tablet Server process and a Data Node process
143 Scaling Up Clients Accumulo HDFS Start 3 new Tablet Server procs 3 new Data node processes
144 Scaling Up Clients Accumulo HDFS master immediately assigns tablets
145 Scaling Up Clients Accumulo HDFS Clients immediately begin querying new Tablet Servers
146 Scaling Up Clients Accumulo HDFS new Tablet Servers read data from old Data nodes
147 Scaling Up Clients Accumulo HDFS new Tablet Servers write data to new Data Nodes
148 Never really seen anyone do this
149 Except myself
150 20 machines in Amazon EC2
151 to 400 machines
152 all during the same MapReduce job reading data out of Accumulo, summarizing, and writing back
153 Scaled back down to 20 machines when done
154 Just killed Tablet Servers
155 Decommissioned Data Nodes for safe data consolidation to remaining 20 nodes
156 Other ways to go from 10^x to 10^(x+1)
157 Accumulo Table Export
158 followed by HDFS DistCp to the new cluster
159 Maybe new replication feature
160 Newer Data is Read more Often
161 Accumulo keeps newly written data in memory
162 Block Cache can keep recently queried data in memory
163 Combining Iterators make maintaining summaries of large amounts of raw events easy
164 Reduces storage burden
165 Historical Summaries [chart: unique entities stored vs. raw events processed, April through July]
166 Age-off iterator can automatically remove data over a certain age
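The age-off behavior can be modeled as a simple filter applied when data is read or compacted (a toy model, not the real AgeOffFilter API; the 90-day cutoff is an arbitrary example):

```python
import time

MAX_AGE_SECONDS = 90 * 24 * 3600  # e.g. age off anything older than 90 days

def age_off(entries, now=None):
    """Drop entries past the cutoff as they are read or compacted;
    no explicit deletes are ever issued."""
    now = now if now is not None else time.time()
    return [(k, v, ts) for (k, v, ts) in entries if now - ts <= MAX_AGE_SECONDS]

now = 1_000_000_000
entries = [("a", "new", now - 3600),                # 1 hour old: kept
           ("b", "old", now - 200 * 24 * 3600)]    # 200 days old: dropped
assert [k for k, _, _ in age_off(entries, now=now)] == ["a"]
```

Because the filter runs inside scans and compactions, old data disappears from results immediately and from disk as compactions rewrite files.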
167 IBM estimates 2.5 exabytes of data is created every day what-is-big-data.html
168 90% of available data created in last 2 years what-is-big-data.html
169 25 new 10k node Accumulo clusters per day
170 Accumulo is doing its part to get in front of the big data trend
171 Questions?
172 @aaroncordova
A Performance Evaluation of Open Source Graph Databases. Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader
A Performance Evaluation of Open Source Graph Databases Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader Overview Motivation Options Evaluation Results Lessons Learned Moving Forward
MapReduce and Hadoop Distributed File System
MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) [email protected] http://www.cse.buffalo.edu/faculty/bina Partially
Extending Hadoop beyond MapReduce
Extending Hadoop beyond MapReduce Mahadev Konar Co-Founder @mahadevkonar (@hortonworks) Page 1 Bio Apache Hadoop since 2006 - committer and PMC member Developed and supported Map Reduce @Yahoo! - Core
MinCopysets: Derandomizing Replication In Cloud Storage
MinCopysets: Derandomizing Replication In Cloud Storage Asaf Cidon, Ryan Stutsman, Stephen Rumble, Sachin Katti, John Ousterhout and Mendel Rosenblum Stanford University [email protected], {stutsman,rumble,skatti,ouster,mendel}@cs.stanford.edu
EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics
BIG DATA WITH HADOOP EXPERIMENTATION HARRISON CARRANZA Marist College APARICIO CARRANZA NYC College of Technology CUNY ECC Conference 2016 Poughkeepsie, NY, June 12-14, 2016 Marist College AGENDA Contents
MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
MapReduce with Apache Hadoop Analysing Big Data
MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside [email protected] About Journey Dynamics Founded in 2006 to develop software technology to address the issues
NextGen Infrastructure for Big DATA Analytics.
NextGen Infrastructure for Big DATA Analytics. So What is Big Data? Data that exceeds the processing capacity of conven4onal database systems. The data is too big, moves too fast, or doesn t fit the structures
NoSQL Data Base Basics
NoSQL Data Base Basics Course Notes in Transparency Format Cloud Computing MIRI (CLC-MIRI) UPC Master in Innovation & Research in Informatics Spring- 2013 Jordi Torres, UPC - BSC www.jorditorres.eu HDFS
GraySort and MinuteSort at Yahoo on Hadoop 0.23
GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters
Snapshots in Hadoop Distributed File System
Snapshots in Hadoop Distributed File System Sameer Agarwal UC Berkeley Dhruba Borthakur Facebook Inc. Ion Stoica UC Berkeley Abstract The ability to take snapshots is an essential functionality of any
Testing 3Vs (Volume, Variety and Velocity) of Big Data
Testing 3Vs (Volume, Variety and Velocity) of Big Data 1 A lot happens in the Digital World in 60 seconds 2 What is Big Data Big Data refers to data sets whose size is beyond the ability of commonly used
Hadoop Scalability at Facebook. Dmytro Molkov ([email protected]) YaC, Moscow, September 19, 2011
Hadoop Scalability at Facebook Dmytro Molkov ([email protected]) YaC, Moscow, September 19, 2011 How Facebook uses Hadoop Hadoop Scalability Hadoop High Availability HDFS Raid How Facebook uses Hadoop Usages
Comparing Scalable NOSQL Databases
Comparing Scalable NOSQL Databases Functionalities and Measurements Dory Thibault UCL Contact : [email protected] Sponsor : Euranova Website : nosqlbenchmarking.com February 15, 2011 Clarications
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
CDH AND BUSINESS CONTINUITY:
WHITE PAPER CDH AND BUSINESS CONTINUITY: An overview of the availability, data protection and disaster recovery features in Hadoop Abstract Using the sophisticated built-in capabilities of CDH for tunable
Big Data Primer. 1 Why Big Data? Alex Sverdlov [email protected]
Big Data Primer Alex Sverdlov [email protected] 1 Why Big Data? Data has value. This immediately leads to: more data has more value, naturally causing datasets to grow rather large, even at small companies.
Big Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect
on AWS Services Overview Bernie Nallamotu Principle Solutions Architect \ So what is it? When your data sets become so large that you have to start innovating around how to collect, store, organize, analyze
Data Centric Computing Revisited
Piyush Chaudhary Technical Computing Solutions Data Centric Computing Revisited SPXXL/SCICOMP Summer 2013 Bottom line: It is a time of Powerful Information Data volume is on the rise Dimensions of data
