Probabilistic Deduplication for Cluster-Based Storage Systems
1 Probabilistic Deduplication for Cluster-Based Storage Systems Davide Frey, Anne-Marie Kermarrec, Konstantinos Kloudas INRIA Rennes, France
2 Motivation Volume of data stored increases exponentially. Provided services are highly dependent on data. 2
3 Motivation Traditional solutions combine data in tarballs and store them on tape. Pros: cost efficient. Cons: low throughput.
                     Tape        Disk
    Acquisition cost $407,000    $1,620,000
    Operational cost $205,000    $573,000
    Total cost       $612,000    $2,193,000
    * Source:
4 Deduplication Store data only once and replace duplicates with references. 4
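The principle on the Deduplication slides can be sketched as a toy content-addressed store; the class and names below are illustrative, not PRODUCK's API:

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: every unique chunk is kept exactly once."""
    def __init__(self):
        self.chunks = {}   # chunk hash -> chunk bytes (stored once)
        self.files = {}    # file name  -> list of chunk hashes (references)

    def put(self, name, chunks):
        refs = []
        for chunk in chunks:
            h = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(h, chunk)   # a duplicate chunk only adds a reference
            refs.append(h)
        self.files[name] = refs

    def get(self, name):
        return b"".join(self.chunks[h] for h in self.files[name])

store = DedupStore()
store.put("file1", [b"AAAA", b"BBBB"])
store.put("file2", [b"BBBB", b"CCCC"])   # b"BBBB" is already stored: no new copy
```

Storing file2 adds only one new chunk, so the store holds three unique chunks for four chunk references.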
9 Challenges Single-node deduplication systems: compact indexing structures, efficient duplicate detection. Cluster-based solutions: the single-machine tradeoffs, plus deduplication vs load balancing. We focus on cluster-based deduplication systems.
11 Example: Deduplication vs Load Balancing A client wants to store a file in a cluster with Storage Nodes A, B, C, D.
12 Example: Deduplication vs Load Balancing The client sends the file to the Coordinator.
13 Example: Deduplication vs Load Balancing The Coordinator computes the overlap between the contents of the file and those of each Storage Node: A 10%, B 30%, C 60%, D 0%.
14 Example: Deduplication vs Load Balancing To maximize DEDUPLICATION, the new file should go to node C, which holds the largest overlap.
15 Example: Deduplication vs Load Balancing To achieve LOAD BALANCING, the new file should go to node D.
16 Goal: Scalable Cluster Deduplication. Good data deduplication: maximize it; ideally, match the deduplication of a single-node system. Load balancing: minimize the imbalance; ideally, the load ratio between nodes equals 1. Scalability: minimize memory usage at the Coordinator. Good throughput: minimize CPU/memory usage at the Coordinator.
19 PRODUCK architecture Client: splits the file in chunks of data; stores and retrieves data. Coordinator: assigns chunks to nodes; keeps the system load balanced. Storage Nodes: store the chunks; provide directory services.
20 Client: chunking Chunks: the basic deduplication unit, produced with content-based chunking techniques. Super-chunks: groups of consecutive chunks, the basic routing and storage unit.
21 Client: chunking Split the file in chunks 21
22 Client: chunking Organize the chunks in super-chunks 22
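Content-based chunking as described above can be sketched as follows; the rolling hash, mask, and size limits are illustrative choices, not the exact chunker used by PRODUCK:

```python
def chunk(data, mask=0x3F, min_size=32, max_size=4096):
    """Toy content-based chunking: cut where a rolling hash of recent bytes
    hits a fixed bit pattern, so boundaries depend on content rather than
    on offsets. Parameters and the hash are illustrative."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) ^ b) & 0xFFFFFFFF      # hash of roughly the last 32 bytes
        size = i - start + 1
        if (size >= min_size and (h & mask) == mask) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])          # trailing residue becomes the last chunk
    return chunks

parts = chunk(bytes(range(256)) * 8)
```

Because boundaries are chosen from the content itself, inserting a byte near the start of a file shifts offsets but tends to leave most later chunk boundaries, and hence most chunks, unchanged.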
24 PRODUCK architecture Client: splits the file in chunks of data; stores and retrieves data. Coordinator: assigns chunks to nodes; keeps the system load balanced. Storage Nodes: store the chunks; provide directory services.
25 Coordinator: goals Estimate the overlap between a super-chunk and the chunks of a given node. Maximize deduplication. Equally distribute storage load among nodes. Guarantee a load balanced system. 25
26 Coordinator: our contributions Novel chunk overlap estimation. Based on probabilistic counting PCSA [Flajolet et al. 1985, Michel et al. 2006]. Never used before in storage systems. Novel load balancing mechanism. Operating at chunk-level granularity. Improving co-localization of duplicate chunks. 26
27 Coordinator: Overlap Estimation Main observation: we do not need the exact matches, only an estimate of the size of the overlap. PCSA permits: compact set descriptors, accurate intersection estimation, and computational efficiency.
28 Coordinator: Overlap Estimation Each chunk of the original set (Chunk 1 ... Chunk 5) is passed through hash(); rho(y) = min{ k : bit(y, k) = 1 }, the position of the lowest set bit of the hash, selects which bit of a BITMAP to set. INTUITION: P(bitmap[0] = 1) = 1/2, P(bitmap[1] = 1) = 1/4, P(bitmap[2] = 1) = 1/8.
32 Coordinator: Overlap Estimation Union cardinality estimation: the BitwiseOR of BITMAP(A) and BITMAP(B) yields BITMAP(A V B), from which |A V B| is estimated. Intersection cardinality estimation: by inclusion-exclusion on the estimates, |A ^ B| = |A| + |B| - |A V B|.
33 Coordinator: Overlap Estimation PCSA set cardinality estimation. Set intersection estimation. Selection of best storage node. 33
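The three steps above can be sketched with a minimal PCSA implementation; the number of bitmaps, the hash, and the helper names are our own illustrative choices, following [Flajolet et al. 1985] for the cardinality estimate and inclusion-exclusion on the OR-ed bitmaps for the intersection:

```python
import hashlib

PHI = 0.77351  # PCSA correction factor from [Flajolet et al. 1985]

class PCSA:
    """Minimal PCSA sketch (parameters are illustrative, not PRODUCK's)."""
    def __init__(self, m=64, bits=32):
        self.m, self.bits = m, bits
        self.maps = [0] * m                # m bitmaps of `bits` bits each

    def add(self, item):
        h = int.from_bytes(hashlib.sha256(item).digest()[:8], "big")
        j, rest = h % self.m, h // self.m  # bucket index, remaining hash bits
        r = 0                              # rho: position of the lowest set bit
        while r < self.bits - 1 and not (rest >> r) & 1:
            r += 1
        self.maps[j] |= 1 << r

    def union(self, other):
        u = PCSA(self.m, self.bits)
        u.maps = [a | b for a, b in zip(self.maps, other.maps)]
        return u

    def cardinality(self):
        total_r = 0
        for bm in self.maps:
            r = 0                          # position of the lowest unset bit
            while (bm >> r) & 1:
                r += 1
            total_r += r
        return (self.m / PHI) * 2 ** (total_r / self.m)

def intersection(a, b):
    # |A ^ B| = |A| + |B| - |A V B|  (inclusion-exclusion on the estimates)
    return a.cardinality() + b.cardinality() - a.union(b).cardinality()
```

With m = 64 bitmaps the sketch occupies a few hundred bytes per node, which is what lets the Coordinator keep one descriptor per Storage Node in memory.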
34 Coordinator: our contributions Novel chunk overlap estimation. Based on probabilistic counting PCSA [Flajolet et al. 1985, Michel et al. 2006]. Never used before in storage systems. Novel load balancing mechanism. Operating at chunk-level granularity. Improving co-localization of duplicate chunks. 34
35 Load Balancing Existing solution: choose Storage Nodes that do not exceed the average load by a percentage threshold. Problem: too aggressive, especially when little data is stored in the system.
37 Load Balancing: our solution Bucket-based storage quota management. Measure storage space in fixed-size buckets. Coordinator grants buckets to nodes one by one. No node can exceed the least loaded by more than a maximum allowed bucket difference. 37
38 Load Balancing: our solution Bucket-based storage quota management in action (slides 38-47): when a Storage Node fills its current bucket, it asks the Coordinator "Can I get a new Bucket?". While the node stays within the allowed bucket difference of the least loaded node, the answer is "Yes, you can."; once the limit is reached, the answer is "NO you cannot!", and the Coordinator instead searches for the node with the second biggest overlap.
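The bucket mechanism of the preceding slides can be sketched as follows; the class and function names are ours, and the routing loop assumes nodes are tried in decreasing order of estimated overlap:

```python
class BucketQuota:
    """Coordinator-side bucket quota: a node may take a new bucket only if
    it does not exceed the least loaded node by more than `max_diff`
    buckets (a sketch of the slides' mechanism, not PRODUCK's exact code)."""
    def __init__(self, nodes, max_diff=2):
        self.buckets = {n: 0 for n in nodes}
        self.max_diff = max_diff

    def request_bucket(self, node):
        least = min(self.buckets.values())
        if self.buckets[node] - least >= self.max_diff:
            return False               # "NO you cannot!"
        self.buckets[node] += 1        # "Yes, you can."
        return True

def route(quota, nodes_by_overlap):
    """Pick the highest-overlap node that is still allowed a new bucket."""
    for node in nodes_by_overlap:      # sorted by decreasing estimated overlap
        if quota.request_bucket(node):
            return node
    return None
```

Measuring load in whole buckets rather than percentages keeps the rule meaningful even when very little data has been stored yet, which is exactly the case where the percentage-threshold approach is too aggressive.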
48 Contribution Summary Novel chunk overlap estimation. Based on probabilistic counting PCSA [Flajolet et al. 1985, Michel et al. 2006]. Never used before in storage systems. Novel load balancing mechanism. Operating at chunk-level granularity. Improving co-localization of duplicate chunks. 48
49 Evaluation: Datasets and Competitors 2 real-world workloads: Wikipedia HTML pages and OS images, covering text as well as binary data and thus many common cases. Table 1 of the paper gives their sizes (GB) and Deduplication Factors, where the Deduplication Factor is the ratio of the original size of each dataset to its size after being deduplicated with our chunking mechanism. 2 competitors, the state-of-the-art cluster-based deduplication systems of [Dong et al. 2011]: BloomFilter, a stateful strategy, and MinHash, a stateless one used in a commercial product.
50 Evaluation: Competitors MinHash: Use the minimum hash from a super-chunk as its fingerprint. Assign super-chunks to bins using the mod(# bins) operator. Initially assign bins to nodes randomly and reassign bins to nodes when unbalanced. 50
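A minimal sketch of this stateless MinHash routing, with an illustrative hash and bin count (not the exact parameters of [Dong et al. 2011]):

```python
import hashlib

def h(chunk):
    """Illustrative 64-bit chunk hash."""
    return int.from_bytes(hashlib.sha256(chunk).digest()[:8], "big")

def route_superchunk(chunks, num_bins):
    """Stateless MinHash routing: the super-chunk's fingerprint is its
    minimum chunk hash, and its bin is fingerprint mod num_bins."""
    return min(h(c) for c in chunks) % num_bins

num_bins = 128
b1 = route_superchunk([b"x", b"y", b"z"], num_bins)
```

Similar super-chunks tend to share their minimum-hash chunk and therefore land in the same bin, which co-locates their duplicates; bins, not super-chunks, are then reassigned between nodes when the load becomes unbalanced.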
51 Evaluation: Competitors BloomFilter: The Coordinator keeps a Bloom filter for each of the Storage Nodes. If a node deviates more than 5% from the average load, it is considered overloaded.
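A minimal per-node Bloom filter of the kind such a Coordinator would keep; filter size and hash count are illustrative:

```python
import hashlib

class BloomFilter:
    """Per-node chunk-membership filter: the Coordinator scores each node by
    how many chunks of a super-chunk its filter claims to hold (sizes and
    hash count here are illustrative, not those of [Dong et al. 2011])."""
    def __init__(self, bits=1 << 16, hashes=4):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, item):
        d = hashlib.sha256(item).digest()
        for i in range(self.hashes):           # derive k positions from one digest
            yield int.from_bytes(d[4 * i:4 * i + 4], "big") % self.bits

    def add(self, item):
        for p in self._positions(item):
            self.array[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.array[p // 8] >> (p % 8) & 1 for p in self._positions(item))

def matches(bf, chunks):
    """Coordinator-side overlap score for one node's filter."""
    return sum(c in bf for c in chunks)
```

Note the memory contrast with PCSA: a Bloom filter must grow with the number of chunks stored on a node, while a PCSA descriptor stays a fixed few hundred bytes.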
52 Evaluation: Metrics Deduplication. Load balancing. Overall. Throughput. ED and TD are normalized to the performance of a single-node system to ease comparison.
53 Evaluation: Effective Deduplication (figures for Wikipedia and Images) 32 nodes: Wikipedia 7%, Images 16%. 64 nodes: Wikipedia 16%, Images 21%.
54 Evaluation: Throughput (figures for Wikipedia and Images) 32 nodes: Wikipedia 11x, Images 13x. 64 nodes: Wikipedia 16x, Images 21x. Memory: 64 KB for Produck, vs 9.6 bits/chunk, or 168 GB for 140 TB/node.
56 Evaluation: Load Balancing (figures for Wikipedia and Images)
More informationA Generic API for Load Balancing in Structued P2P Systems
A Generic API for Load Balancing in Structued P2P Systems Maeva Antoine, Laurent Pellegrino, Fabrice Huet and Françoise Baude University of Nice Sophia-Antipolis (France), CNRS, I3S, UMR 7271 Motivation
More informationHadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
More informationProtecting enterprise servers with StoreOnce and CommVault Simpana
Technical white paper Protecting enterprise servers with StoreOnce and CommVault Simpana HP StoreOnce Backup systems Table of contents Introduction 2 Technology overview 2 HP StoreOnce Backup systems key
More informationA SCALABLE DEDUPLICATION AND GARBAGE COLLECTION ENGINE FOR INCREMENTAL BACKUP
A SCALABLE DEDUPLICATION AND GARBAGE COLLECTION ENGINE FOR INCREMENTAL BACKUP Dilip N Simha (Stony Brook University, NY & ITRI, Taiwan) Maohua Lu (IBM Almaden Research Labs, CA) Tzi-cker Chiueh (Stony
More informationLow-Cost Data Deduplication for Virtual Machine Backup in Cloud Storage
Low-Cost Data Deduplication for Virtual Machine Backup in Cloud Storage Wei Zhang, Tao Yang, Gautham Narayanasamy, and Hong Tang University of California at Santa Barbara, Alibaba Inc. Abstract In a virtualized
More informationHyLARD: A Hybrid Locality-Aware Request Distribution Policy in Cluster-based Web Servers
TANET2007 臺 灣 網 際 網 路 研 討 會 論 文 集 二 HyLARD: A Hybrid Locality-Aware Request Distribution Policy in Cluster-based Web Servers Shang-Yi Zhuang, Mei-Ling Chiang Department of Information Management National
More informationSTRATEGIC PLANNING ASSUMPTION(S)
STRATEGIC PLANNING ASSUMPTION(S) By 2016, one-third of organizations will change backup vendors due to frustration over cost, complexity and/or capability. By 2014, 80% of the industry will choose disk-based
More informationInline Deduplication
Inline Deduplication binarywarriors5@gmail.com 1.1 Inline Vs Post-process Deduplication In target based deduplication, the deduplication engine can either process data for duplicates in real time (i.e.
More informationLoad Balancing in Stream Processing Engines. Anis Nasir Coauthors: Gianmarco, Nicolas, David, Marco
Engines Anis Nasir Coauthors: Gianmarco, Nicolas, David, Marco Stream Processing Engines Online Machine Learning Real Time Query Processing ConCnuous ComputaCon Distributed RPC 2 Stream Processing Engines
More informationBig Fast Data Hadoop acceleration with Flash. June 2013
Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional
More informationDon t be duped by dedupe - Modern Data Deduplication with Arcserve UDP
Don t be duped by dedupe - Modern Data Deduplication with Arcserve UDP by Christophe Bertrand, VP of Product Marketing Too much data, not enough time, not enough storage space, and not enough budget, sound
More informationTurnkey Deduplication Solution for the Enterprise
Symantec NetBackup 5000 Appliance Turnkey Deduplication Solution for the Enterprise Mayur Dewaikar Sr. Product Manager, Information Management Group White Paper: A Deduplication Appliance Solution for
More informationCloud Computing. Chapter 4 Infrastructure as a Service (IaaS)
Cloud Computing Chapter 4 Infrastructure as a Service (IaaS) Learning Objectives Define and describe IaaS and identify IaaS solution providers. Define and describe colocation. Define and describe system
More information