Four Orders of Magnitude: Running Large Scale Accumulo Clusters. Aaron Cordova Accumulo Summit, June 2014

1 Four Orders of Magnitude: Running Large Scale Accumulo Clusters Aaron Cordova Accumulo Summit, June 2014

2 Scale, Security, Schema

3 Scale

4 to scale 1 - (vt) to change the size of something

5 let's scale the cluster up to twice the original size

6 to scale 2 - (vi) to function properly at a large scale

7 Accumulo scales

8 What is Large Scale?

9 Notebook Computer 16 GB DRAM 512 GB Flash Storage 2.3 GHz quad-core i7 CPU

10 Modern Server 100s of GB DRAM 10s of TB on disk 10s of cores

11 Large Scale [chart: laptop, server, and clusters of 10, 100, 1000, and 10,000 nodes against data sizes from 10 GB to 100 PB, in RAM and on disk]

12 Data Composition [chart: original raw data, derivative QFDs, and indexes, by month from January to April]

13 Accumulo Scales From GB to PB, Accumulo keeps two things low: Administrative effort Scan latency

14 Scan Latency

15 Administrative Overhead Failed Machines Admin Intervention

16 Accumulo Scales From GB to PB three things grow linearly: Total storage size Ingest Rate Concurrent scans

17 Ingest Benchmark Millions of entries per second

18 AWB Benchmark

19 1000 machines

20 100 M entries written per second

21 408 terabytes

22 7.56 trillion total entries

23 Graph Benchmark

24 1200 machines

25 4.4 trillion vertices

26 70.4 trillion edges

27 149 M edges traversed per second

28 1 petabyte

29 Graph Analysis [chart: billions of edges for Twitter, Yahoo!, Facebook, and Accumulo]

30 Accumulo is designed after Google's BigTable

31 BigTable powers hundreds of applications at Google

32 BigTable serves 2+ exabytes

33 600 M queries per second organization wide

34 From 10 to 10,000

35 Starting with ten machines: 10^1

36 One rack

37 1 TB RAM

38 TB Disk

39 Hardware failures rare

40 Test Application Designs

41 Designing Applications for Scale

42 Keys to Scaling: 1. Live writes go to all servers. 2. User requests are satisfied by few scans. 3. Updates are turned into inserts.

43 Keys to Scaling Writes on all servers Few Scans

44 Hash / UUID Keys
Key            RowID     Value
usera:name     af362de4  Bob
userb:name     b23dc4be  Annie
userc:name     b98de2ff  Joe
usera:account  c48e2ade  $30
userb:account  c7e43fb2  $25
userb:age      d938ff3d  32
userc:age      e2e4dac4  59
usera:age      e98f2eab  43
Uniform writes

45 Monitor: Participating Tablet Servers for MyTable [screenshot: servers r1n and r2n with hosted tablets and ingest rate per server]

46 Hash / UUID Keys: get(usera)
RowID     Value
af362de4  Bob
b23dc4be  Annie
b98de2ff  Joe
c48e2ade  $30
c7e43fb2  $25
d938ff3d  32
e2e4dac4  59
e98f2eab  43
3 x 1-entry scans on 3 servers

47 Keys to Scaling Writes on all servers Few Scans Hash / UUID Keys

48 Group for Locality
Key            RowID     Col      Value
userb:name     af362de4  name     Annie
userb:age      af362de4  age      32
userb:account  af362de4  account  $25
userc:name     c48e2ade  name     Joe
userc:age      c48e2ade  age      59
usera:name     e2e4dac4  name     Bob
usera:age      e2e4dac4  age      43
usera:account  e2e4dac4  account  $30
Still fairly uniform writes

49 Group for Locality: get(usera)
RowID     Col      Value
af362de4  name     Annie
af362de4  age      32
af362de4  account  $25
c48e2ade  name     Joe
c48e2ade  age      59
e2e4dac4  name     Bob
e2e4dac4  age      43
e2e4dac4  account  $30
1 x 3-entry scan on 1 server

50 Keys to Scaling Writes on all servers Few Scans Grouped Keys
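
The trade-off between per-entry hashed keys and grouped keys can be sketched in a few lines of Python. This is only an illustration, not Accumulo code: the split points, the `h` digest helper, and the generated user names are all assumptions made for the example. It counts how many separate row lookups a get(user) needs under each layout and checks that grouped rows still land on every tablet.

```python
import hashlib
from bisect import bisect_right

# Hypothetical split points dividing the hex row-ID space into 4 tablets.
SPLITS = ["4", "8", "c"]

def tablet_for(row_id: str) -> int:
    """Index of the tablet whose key range contains row_id."""
    return bisect_right(SPLITS, row_id)

def h(text: str) -> str:
    """Short hex digest standing in for the 8-character row IDs on the slides."""
    return hashlib.md5(text.encode()).hexdigest()[:8]

def per_entry_rows(user: str, fields: list) -> set:
    """Hash / UUID layout: every (user, field) entry gets its own hashed row."""
    return {h(f"{user}:{field}") for field in fields}

def grouped_rows(user: str) -> set:
    """Grouped layout: one hashed row per user; fields become columns."""
    return {h(user)}

fields = ["name", "age", "account"]

# Lookups: one scan per distinct row that has to be fetched.
scans_per_entry = len(per_entry_rows("usera", fields))
scans_grouped = len(grouped_rows("usera"))

# Writes: grouped rows for many users still spread over all tablets.
users = [f"user{i}" for i in range(1000)]
tablets_hit = {tablet_for(row) for u in users for row in grouped_rows(u)}
```

Grouping keeps the hash prefix, so writes stay spread out, while each user's fields become adjacent entries that a single scan can return.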

51 Temporal Keys Key Value RowID Col Value usera:name Bob 44 usera:age userb:name Annie userb:age 32 userc:name Fred userc:age 29 userd:name Joe userd:age 59

52 Temporal Keys Key Value RowID Col Value usera:name Bob 44 usera:age userb:name Annie 23 userb:age userc:name Fred userc:age 29 userd:name Joe userd:age 59

53 Temporal Keys Key Value RowID Col Value usera:name Bob 44 usera:age userb:name Annie 23 userb:age userc:name Fred 31 userc:age userd:name Joe 25 userd:age Always write to one server

54 No write parallelism

55 Temporal Keys RowID Col Value get( to ) Fetching ranges uses few scans

56 Keys to Scaling Writes on all servers Few Scans Temporal Keys

57 Binned Temporal Keys Key Value RowID Col Value usera:name Bob 44 0_ usera:age userb:name Annie 23 userb:age 32 1_ userc:name Fred userc:age 29 userd:name Joe 2_ userd:age 59 Uniform Writes

58 Binned Temporal Keys Key Value RowID Col Value usera:name Bob 44 usera:age _ _ userb:name Annie 23 userb:age userc:name Fred 31 1_ _ userc:age userd:name Joe 2_ userd:age 59 2_ Uniform Writes

59 Binned Temporal Keys Key Value RowID Col Value usera:name Bob 44 usera:age userb:name Annie 23 userb:age userc:name Fred 31 userc:age userd:name Joe 25 userd:age _ _ _ _ _ _ _ _ Uniform Writes

60 Binned Temporal Keys get( to ) RowID Col Value 0_ _ _ _ _ _ _ _ One scan per bin

61 Keys to Scaling Writes on all servers Few Scans Binned Temporal Keys

62 Keys to Scaling: Key design is critical. Group data under common row IDs to reduce scans. Prepend bins to row IDs to increase write parallelism.
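
A minimal sketch of the binned temporal key idea follows, in plain Python rather than against the Accumulo API. The bin count of 3 mirrors the `0_`, `1_`, `2_` prefixes on the slides; the `YYYYMMDD` timestamp format, the CRC-based bin choice, and the function names are assumptions made for illustration.

```python
import zlib

NUM_BINS = 3  # assumption: matches the 0_, 1_, 2_ prefixes shown on the slides

def binned_row_id(timestamp: str, entity: str) -> str:
    """Prefix a sortable timestamp with a bin derived from the entity name."""
    bin_id = zlib.crc32(entity.encode()) % NUM_BINS
    return f"{bin_id}_{timestamp}"

def scan_ranges(start: str, end: str) -> list:
    """One inclusive (start_row, end_row) range per bin covers a time window."""
    return [(f"{b}_{start}", f"{b}_{end}") for b in range(NUM_BINS)]

# Hypothetical events: (timestamp, entity that produced the event).
events = [("20140601", "usera"), ("20140615", "userb"), ("20140702", "userc")]
table = sorted(binned_row_id(ts, who) for ts, who in events)

# Fetch everything written in June: one range scan per bin.
ranges = scan_ranges("20140601", "20140630")
hits = [row for row in table for lo, hi in ranges if lo <= row <= hi]
```

Writes with nearby timestamps now land in different parts of the sorted key space, at the cost of one scan per bin for each time-range query.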

63 Splits: Pre-split or organic splits. Going from dev to production, you can ingest a representative sample, obtain split points, and use them to pre-split a larger system. Hundreds or thousands of tablets per server is OK. Want at least one tablet per server.
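
One way to picture deriving split points from a representative sample is to take evenly spaced quantiles of the sorted sample row IDs. The sketch below does exactly that in plain Python; it is a stand-in for reading the split points off a table that has already split organically, and the function name and sample data are hypothetical.

```python
def split_points(sample_rows: list, num_tablets: int) -> list:
    """Pick num_tablets - 1 evenly spaced row IDs from a sorted sample."""
    rows = sorted(set(sample_rows))
    if num_tablets < 2 or len(rows) < num_tablets:
        return []  # not enough distinct rows to split meaningfully
    step = len(rows) / num_tablets
    return [rows[int(i * step)] for i in range(1, num_tablets)]

# Hypothetical sample of 100 zero-padded row IDs, split for 4 tablets.
sample = [f"{i:04d}" for i in range(100)]
points = split_points(sample, 4)
```

The resulting row IDs would then be applied to the empty production table before ingest starts, so that every server hosts at least one tablet from the first write.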

64 Effect of Compression: Similar sorted keys compress well. May need more data than you think to auto-split.

65 Inserts are fast 10s of thousands per second per machine

66 Updates *can* be

67 Update Types Overwrite Combine Complex

68 Update - Overwrite: Performance same as insert. Ignore (don't read) existing value. Accumulo's Versioning Iterator does the overwrite.

69 Update - Overwrite: userb:age -> 34
RowID     Col      Value
af362de4  name     Annie
af362de4  age      32
af362de4  account  $25
c48e2ade  name     Joe
c48e2ade  age      59
e2e4dac4  name     Bob
e2e4dac4  age      43
e2e4dac4  account  $30

70 Update - Overwrite: userb:age -> 34
RowID     Col      Value
af362de4  name     Annie
af362de4  age      34
af362de4  account  $25
c48e2ade  name     Joe
c48e2ade  age      59
e2e4dac4  name     Bob
e2e4dac4  age      43
e2e4dac4  account  $30

71 Update - Combine: Things like X = X + 1. Normally one would have to read the old value to do this, but Accumulo Iterators allow multiple inserts to be combined at scan time or at compaction. Performance is the same as inserts.

72 Update - Combine: userb:account -> +10
RowID     Col      Value
af362de4  name     Annie
af362de4  age      34
af362de4  account  $25
c48e2ade  name     Joe
c48e2ade  age      59
e2e4dac4  name     Bob
e2e4dac4  age      43
e2e4dac4  account  $30

73 Update - Combine: userb:account -> +10
RowID     Col      Value
af362de4  name     Annie
af362de4  age      34
af362de4  account  $25
af362de4  account  $10
c48e2ade  name     Joe
c48e2ade  age      59
e2e4dac4  name     Bob
e2e4dac4  age      43
e2e4dac4  account  $30

74 Update - Combine: getaccount(userb) returns $35
RowID     Col      Value
af362de4  name     Annie
af362de4  age      34
af362de4  account  $25
af362de4  account  $10
c48e2ade  name     Joe
c48e2ade  age      59
e2e4dac4  name     Bob
e2e4dac4  age      43
e2e4dac4  account  $30

75 Update - Combine: After compaction
RowID     Col      Value
af362de4  name     Annie
af362de4  age      34
af362de4  account  $35
c48e2ade  name     Joe
c48e2ade  age      59
e2e4dac4  name     Bob
e2e4dac4  age      43
e2e4dac4  account  $30
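
The overwrite and combine behaviours can be mimicked with a small in-memory model. The sketch below is plain Python, not Accumulo's iterator API: the `scan` function, the integer timestamps, and the `summed_cols` parameter are assumptions for illustration. Both kinds of update are written as plain inserts; the read path either keeps the newest version or sums all versions, depending on the column.

```python
def scan(entries, summed_cols=frozenset()):
    """Resolve multiple versions of each (row, col) at read time.

    Columns in summed_cols are added together (combine); every other
    column keeps only its newest version (overwrite).
    """
    result = {}
    newest = {}
    for row, col, ts, value in entries:
        key = (row, col)
        if col in summed_cols:
            result[key] = result.get(key, 0) + value
        elif key not in newest or ts > newest[key]:
            newest[key] = ts
            result[key] = value
    return result

# Entries are (row, column, timestamp, value); nothing is read before writing.
entries = [
    ("af362de4", "age", 1, 32),
    ("af362de4", "age", 2, 34),      # overwrite: insert the new value
    ("af362de4", "account", 1, 25),
    ("af362de4", "account", 2, 10),  # combine: insert only the +10 delta
]
view = scan(entries, summed_cols={"account"})
```

Here `view` holds age 34 and account 35 for row af362de4, matching the values on the slides; a compaction would persist the same resolved values and discard the older versions.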

76 Update - Complex: Some updates require looking at more data than Iterators have access to, such as multiple rows. These require reading the data out in order to write the new value. Performance will be much slower.

77 Update - Complex: userc:account = getbalance(usera) + getbalance(userb) = 65
RowID     Col      Value
af362de4  name     Annie
af362de4  age      34
af362de4  account  $35
c48e2ade  name     Joe
c48e2ade  age      59
c48e2ade  account  $40
e2e4dac4  name     Bob
e2e4dac4  age      43
e2e4dac4  account  $30

78 Update - Complex: userc:account = getbalance(usera) + getbalance(userb) = 65
RowID     Col      Value
af362de4  name     Annie
af362de4  age      34
af362de4  account  $35
c48e2ade  name     Joe
c48e2ade  age      59
c48e2ade  account  $65
e2e4dac4  name     Bob
e2e4dac4  age      43
e2e4dac4  account  $30

79 Planning a Larger-Scale Cluster

80 Storage vs Ingest [chart: ingest rate in millions of entries per second against storage in terabytes, for 1x1TB and 12x3TB disk configurations]

81 Model for Ingest Rates. N: number of machines. S: single-server throughput (entries / second). A: aggregate cluster throughput (entries / second). $A = 0.85^{\log_2 N} \cdot N \cdot S$. Expect about 85% of the ideal gain in write rate when doubling the size of the cluster (2 x 0.85 = 1.7x).

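82 Estimating Machines Required. N: number of machines. S: single-server throughput (entries / second). A: target aggregate throughput (entries / second). $N = 2^{\log_2(A/S) / (1 + \log_2 0.85)}$, obtained by solving the ingest-rate model $A = 0.85^{\log_2 N} \cdot N \cdot S$ for N. Expect about 85% of the ideal gain in write rate when doubling the size of the cluster (2 x 0.85 = 1.7x).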
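
The ingest model and its inverse are easy to evaluate numerically. The sketch below transcribes the two formulas into Python; the function names and the example inputs (10,000 entries per second per server, a target of 1 million entries per second) are hypothetical, chosen only to be in line with the "10s of thousands per second per machine" figure given earlier.

```python
import math

EFFICIENCY = 0.85  # per-doubling factor taken from the ingest-rate model

def aggregate_rate(n: int, s: float) -> float:
    """A = 0.85^(log2 N) * N * S, in entries per second."""
    return EFFICIENCY ** math.log2(n) * n * s

def machines_required(a: float, s: float) -> int:
    """Solve the model for N and round up to whole machines."""
    exponent = math.log2(a / s) / (1 + math.log2(EFFICIENCY))
    return math.ceil(2 ** exponent)

# Hypothetical inputs: 10,000 entries/s per server, 1 M entries/s target.
n = machines_required(1_000_000, 10_000)
achieved = aggregate_rate(n, 10_000)
```

Under this model a cluster needs noticeably more than A / S machines, because every doubling gives up 15% of the ideal gain.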

83 Predicted Cluster Sizes [chart: number of machines vs. millions of entries per second]

84 100 Machines: 10^2

85 Multiple racks

86 10 TB RAM

87 100 TB - 1PB Disk

88 Some hardware failures in the first week (burn in)

89 Expect 3 failed HDs in first 3 mo

90 Another 4 within the first year research.google.com/en/us/archive/disk_failures.pdf

91 Can process the 1000 Genomes data set 260 TB

92 Can store and index the Common Crawl Corpus (commoncrawl.org): 2.8 billion web pages, 541 TB

93 One year of Twitter 182 trillion tweets 483 TB / /d564001ds1.htm

94 Deploying an Application Users Clients Tablet Servers

95 May not see the effect of writing to disk for a while

96 1000 machines: 10^3

97 Multiple rows of racks

98 100 TB RAM

99 1-10 PB Disk

100 Hardware failure is a regular occurrence

101 Hard drive failure about every 5 days (average). Will be skewed towards the beginning of the year

102 Can traverse the brain graph 70 trillion edges, 1 PB

103 Facebook Graph 1s of PB xldb2012_wed_1105_dhrubaborthakur.pdf

104 Netflix Video Master Copies 3.14 PB

105 World of Warcraft Backend Storage 1.3 PB wows-back-end-10-data-centers cores/

106 Webpages, live on the Internet 14.3 Trillion total-number-of-websites-size-of.html

107 Things like the difference between two compression algorithms start to make a big difference

108 Use range compactions to effect changes on portions of a table

109 Lay off Zookeeper

110 Watch Garbage Collector and Namenode ops

111 Garbage Collection > 5 minutes?

112 Start thinking about NameNode Federation

113 Accumulo 1.6

114 Multiple NameNodes Accumulo Namenode Namenode DataNodes DataNodes Multiple HDFS Clusters

115 Multiple NameNodes Accumulo Namenode Namenode DataNodes Multiple NameNodes, shared DataNodes (Federation. Requires Hadoop 2.0)

116 More Namenodes = higher risk of one going down. Can use HA Namenodes in conjunction w/ Federation

117 10,000 machines: 10^4

118 You, my friend, are here to kick a** and chew bubble gum

119 1 PB RAM

120 PB Disk

121 1 hardware failure every hour on average

122 Entire Internet Archive 15 PB internet-archive-wayback-machine-brewster-kahle

123 A year's worth of data from the Large Hadron Collider 15 PB

124 0.1% of all Internet traffic in PB total-number-of-websites-size-of.html

125 Facebook Messaging Data 10s of PB xldb2012_wed_1105_dhrubaborthakur.pdf

126 Facebook Photos 240 billion High 10s of PB xldb2012_wed_1105_dhrubaborthakur.pdf

127 Must use multiple NameNodes

128 Can tune back heartbeats, periodicity of central processes in general

129 Can combine multiple PB data sets

130 Up to 10 quadrillion entries in a single table

131 While maintaining sub-second lookup times

132 Only with Accumulo 1.6

133 Dealing with data over time

134 Data Over Time - Patterns Initial Load Increasing Velocity Focus on Recency Historical Summaries

135 Initial Load: Get a pile of old data into Accumulo fast. Latency not important (data is old). Throughput critical.

136 Bulk Load RFiles

137 Bulk Loading MapReduce RFiles Accumulo

138 Increasing velocity

139 If your data isn't big today, wait a little while

140 Accumulo scales up dynamically, online. No downtime

141 The first scale, can change size

142 Scaling Up Clients Accumulo HDFS 3 physical servers Each running a Tablet Server process and a Data Node process

143 Scaling Up Clients Accumulo HDFS Start 3 new Tablet Server procs 3 new Data node processes

144 Scaling Up Clients Accumulo HDFS master immediately assigns tablets

145 Clients Scaling Up Clients immediately begin querying new Tablet Servers Accumulo HDFS

146 Scaling Up Clients Accumulo HDFS new Tablet Servers read data from old Data nodes

147 Scaling Up Clients Accumulo HDFS new Tablet Servers write data to new Data Nodes

148 Never really seen anyone do this

149 Except myself

150 20 machines in Amazon EC2

151 to 400 machines

152 all during the same MapReduce job reading data out of Accumulo, summarizing, and writing back

153 Scaled back down to 20 machines when done

154 Just killed Tablet Servers

155 Decommissioned Data Nodes for safe data consolidation to remaining 20 nodes

156 Other ways to go from 10^x to 10^(x+1)

157 Accumulo Table Export

158 followed by HDFS DistCP to new cluster

159 Maybe new replication feature

160 Newer Data is Read more Often

161 Accumulo keeps newly written data in memory

162 Block Cache can keep recently queried data in memory

163 Combining Iterators make maintaining summaries of large amounts of raw events easy

164 Reduces storage burden

165 Historical Summaries [chart: unique entities stored vs. raw events processed, April through July]

166 Age-off iterator can automatically remove data over a certain age
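
The pairing of long-lived summaries with aged-off raw data can be illustrated with a toy model. This is plain Python, not Accumulo's iterator API: the function names, the 30-day retention period, and the sample events are all assumptions made for the example.

```python
from collections import Counter

DAY_MS = 24 * 60 * 60 * 1000

def summarize(events):
    """Count events per entity, as a summing combiner would maintain."""
    return Counter(entity for entity, ts in events)

def age_off(events, now_ms, ttl_ms):
    """Drop raw events older than ttl_ms, as an age-off filter would."""
    return [(entity, ts) for entity, ts in events if now_ms - ts <= ttl_ms]

# Hypothetical raw events: (entity, timestamp in milliseconds).
now = 100 * DAY_MS
events = [("usera", 10 * DAY_MS), ("usera", 95 * DAY_MS), ("userb", 99 * DAY_MS)]

summary = summarize(events)                 # small, kept indefinitely
recent = age_off(events, now, 30 * DAY_MS)  # raw data kept for 30 days only
```

The summary still reflects the 90-day-old event after the raw entry itself has been dropped, which is how summaries reduce the storage burden without losing the historical totals.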

167 IBM estimates 2.5 exabytes of data is created every day what-is-big-data.html

168 90% of available data created in last 2 years what-is-big-data.html

169 25 new 10k node Accumulo clusters per day

170 Accumulo is doing its part to get in front of the big data trend

171 Questions?

172 @aaroncordova

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

Application Development. A Paradigm Shift

Application Development. A Paradigm Shift Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information

MapReduce, Hadoop and Amazon AWS

MapReduce, Hadoop and Amazon AWS MapReduce, Hadoop and Amazon AWS Yasser Ganjisaffar http://www.ics.uci.edu/~yganjisa February 2011 What is Hadoop? A software framework that supports data-intensive distributed applications. It enables

More information

Hadoop: Embracing future hardware

Hadoop: Embracing future hardware Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

Large-Scale Data Processing

Large-Scale Data Processing Large-Scale Data Processing Eiko Yoneki [email protected] http://www.cl.cam.ac.uk/~ey204 Systems Research Group University of Cambridge Computer Laboratory 2010s: Big Data Why Big Data now? Increase

More information

Lecture 10: HBase! Claudia Hauff (Web Information Systems)! [email protected]

Lecture 10: HBase! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 10: HBase!! Claudia Hauff (Web Information Systems)! [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the

More information

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013 Petabyte Scale Data at Facebook Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013 Agenda 1 Types of Data 2 Data Model and API for Facebook Graph Data 3 SLTP (Semi-OLTP) and Analytics

More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12 Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected]

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected] Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb

More information

Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

More information

THE HADOOP DISTRIBUTED FILE SYSTEM

THE HADOOP DISTRIBUTED FILE SYSTEM THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,

More information

How To Scale Out Of A Nosql Database

How To Scale Out Of A Nosql Database Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 [email protected] www.scch.at Michael Zwick DI

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected]

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected] Hadoop, Why? Need to process huge datasets on large clusters of computers

More information

From Internet Data Centers to Data Centers in the Cloud

From Internet Data Centers to Data Centers in the Cloud From Internet Data Centers to Data Centers in the Cloud This case study is a short extract from a keynote address given to the Doctoral Symposium at Middleware 2009 by Lucy Cherkasova of HP Research Labs

More information

Benchmarking Cassandra on Violin

Benchmarking Cassandra on Violin Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

More information

Hadoop and its Usage at Facebook. Dhruba Borthakur [email protected], June 22 rd, 2009

Hadoop and its Usage at Facebook. Dhruba Borthakur dhruba@apache.org, June 22 rd, 2009 Hadoop and its Usage at Facebook Dhruba Borthakur [email protected], June 22 rd, 2009 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed on Hadoop Distributed File System Facebook

More information

Graph Database Proof of Concept Report

Graph Database Proof of Concept Report Objectivity, Inc. Graph Database Proof of Concept Report Managing The Internet of Things Table of Contents Executive Summary 3 Background 3 Proof of Concept 4 Dataset 4 Process 4 Query Catalog 4 Environment

More information

CS54100: Database Systems

CS54100: Database Systems CS54100: Database Systems Cloud Databases: The Next Post- Relational World 18 April 2012 Prof. Chris Clifton Beyond RDBMS The Relational Model is too limiting! Simple data model doesn t capture semantics

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created

More information

Hadoop Hardware @Twitter: Size does matter. @joep and @eecraft Hadoop Summit 2013

Hadoop Hardware @Twitter: Size does matter. @joep and @eecraft Hadoop Summit 2013 Hadoop Hardware : Size does matter. @joep and @eecraft Hadoop Summit 2013 v2.3 About us Joep Rottinghuis Software Engineer @ Twitter Engineering Manager Hadoop/HBase team @ Twitter Follow me @joep Jay

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture

More information

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

marlabs driving digital agility WHITEPAPER Big Data and Hadoop marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil

More information

Bigtable is a proven design Underpins 100+ Google services:

Bigtable is a proven design Underpins 100+ Google services: Mastering Massive Data Volumes with Hypertable Doug Judd Talk Outline Overview Architecture Performance Evaluation Case Studies Hypertable Overview Massively Scalable Database Modeled after Google s Bigtable

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

Apache HBase. Crazy dances on the elephant back

Apache HBase. Crazy dances on the elephant back Apache HBase Crazy dances on the elephant back Roman Nikitchenko, 16.10.2014 YARN 2 FIRST EVER DATA OS 10.000 nodes computer Recent technology changes are focused on higher scale. Better resource usage

More information

This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.

This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing. Big Data Processing 2013-2014 Q2 April 7, 2014 (Resit) Lecturer: Claudia Hauff Time Limit: 180 Minutes Name: Answer the questions in the spaces provided on this exam. If you run out of room for an answer,

More information

Extreme Computing. Big Data. Stratis Viglas. School of Informatics University of Edinburgh [email protected]. Stratis Viglas Extreme Computing 1

Extreme Computing. Big Data. Stratis Viglas. School of Informatics University of Edinburgh sviglas@inf.ed.ac.uk. Stratis Viglas Extreme Computing 1 Extreme Computing Big Data Stratis Viglas School of Informatics University of Edinburgh [email protected] Stratis Viglas Extreme Computing 1 Petabyte Age Big Data Challenges Stratis Viglas Extreme Computing

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica

More information

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP White Paper Big Data and Hadoop Abhishek S, Java COE www.marlabs.com Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP Table of contents Abstract.. 1 Introduction. 2 What is Big

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

An NSA Big Graph experiment. Paul Burkhardt, Chris Waring. May 20, 2013

An NSA Big Graph experiment. Paul Burkhardt, Chris Waring. May 20, 2013 U.S. National Security Agency Research Directorate - R6 Technical Report NSA-RD-2013-056002v1 May 20, 2013 Graphs are everywhere! A graph is a collection of binary relationships, i.e. networks of pairwise

More information

Big Data Challenges in Bioinformatics

Big Data Challenges in Bioinformatics Big Data Challenges in Bioinformatics BARCELONA SUPERCOMPUTING CENTER COMPUTER SCIENCE DEPARTMENT Autonomic Systems and ebusiness Pla?orms Jordi Torres [email protected] Talk outline! We talk about Petabyte?

More information

Hadoop Parallel Data Processing

Hadoop Parallel Data Processing MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for

More information

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Scalable Cloud Computing Solutions for Next Generation Sequencing Data Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of

More information

Cloud Computing Where ISR Data Will Go for Exploitation

Cloud Computing Where ISR Data Will Go for Exploitation Cloud Computing Where ISR Data Will Go for Exploitation 22 September 2009 Albert Reuther, Jeremy Kepner, Peter Michaleas, William Smith This work is sponsored by the Department of the Air Force under Air

More information

Design and Evolution of the Apache Hadoop File System(HDFS)

Design and Evolution of the Apache Hadoop File System(HDFS) Design and Evolution of the Apache Hadoop File System(HDFS) Dhruba Borthakur Engineer@Facebook Committer@Apache HDFS SDC, Sept 19 2011 Outline Introduction Yet another file-system, why? Goals of Hadoop

More information

Cost-Effective Business Intelligence with Red Hat and Open Source

Cost-Effective Business Intelligence with Red Hat and Open Source Cost-Effective Business Intelligence with Red Hat and Open Source Sherman Wood Director, Business Intelligence, Jaspersoft September 3, 2009 1 Agenda Introductions Quick survey What is BI?: reporting,

More information

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! [email protected]

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information
