Hadoop s Rise in Life Sciences
|
|
- Toby Malone
- 8 years ago
- Views:
Transcription
1 Exploring EMC Isilon scale-out storage solutions Hadoop s Rise in Life Sciences By John Russell, Contributing Editor, Bio IT World Produced by Cambridge Healthtech Media Group
2 By now the Big Data challenge is familiar to the entire life sciences community. Modern high-throughput experimental technologies generate vast data sets that can only be tackled with high performance computing (HPC). Genomics, of course, is the leading example. At the end of 2011, global annual sequencing capacity was estimated at 13 quadrillion bases and growing rapidly 1. It s worth noting a single base pair typically represents about 100 bytes of data (raw, analyzed, and interpreted). The need to manage and analyze these massive data sets, not just in life sciences but throughout all of science and industry, has spurred many new approaches to HPC infrastructure and led to many important IT advances, particularly in distributed computing. While there isn t a single right answer, one approach the Hadoop storage and compute framework is emerging as a compelling contender for use in life sciences to cope with the deluge of data. Created in 2004 by Doug Cutting (who famously named it after his son s stuffed elephant) and elevated to a top-level Apache Foundation project in 2008, Hadoop is intended to run large-scale distributed data analysis on commodity clusters. Cutting was initially inspired by a paper 2 from Google Labs describing Google s BigTable infrastructure and MapReduce application layers. (For a detailed perspective see Ronald Taylor s, An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. 3 ) The Hadoop Distributed File System (HDFS) and compute framework (MapReduce) enable Hadoop to break extremely large data sets into chunks, to distribute/ store (Map) those chunks to nodes in a cluster, and to gather (Reduce) results following computation. Broadly, Hadoop uses a file system (Hadoop Distributed File System (HDFS) and framework software (MapReduce) to break extremely large data sets into chunks, to distribute/store (Map) those chunks to nodes in a cluster, and to gather (Reduce) results following computation. Hadoop s distinguishing feature is it automatically stores the chunks of data on the same nodes on which they will be processed. This strategy of co-locating of data and processing power (proximity computing) significantly accelerates performance and in April 2008 a Hadoop program, running on 910-node cluster, broke a world record, sorting a terabyte of data in less than 3.5 minutes. 4 1 DNA Sequencing Caught in Deluge of Data, New York Times, Nov. 30, 2011, 2 OSDI 04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004, google.com/archive/mapreduce.html 3 An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics, pmc/articles/pmc / 4 Hadoop wins Terabyte sort benchmark, Apr 2008, Apr. 2009, last accessed Dec 2011 Hadoop s Rise in Life Sciences 2
3 Part of the improved performance stems from MapReduce s key:value programming model which speeds up and scales up parallelized job execution better than many alternatives such as the GridEngine architecture for High Performance Computing (HPC). (One of the earliest use-cases of the Sun GridEngine 5 HPC was the DNA sequence comparison BLAST search.) The MapReduce layer is a batch query processor with dynamic data schema and linear scaling for unstructured or semistructured data. Its data is not normalized (decomposition of data into smaller structured relationships). Therefore higher level interpreted programming languages like Ruby and Python and a compiled language like C++ provide easier access to MapReduce to represent the program as MapReduce jobs. It turns out that Hadoop a fault-tolerant, share-nothing architecture in which tasks must have no dependence on each other is an excellent choice for many life sciences applications. Standard Hadoop interfaces are available via Java, C, FUSE and WebDAV. The Hadoop R (statistical language) interface, RHIPE, is also popular in the life sciences community. It turns out that Hadoop a fault-tolerant, share-nothing architecture in which tasks must have no dependence on each other is an excellent choice for many life sciences applications. This is largely because so much of life sciences data is semi- or unstructured filebased data and ideally suited for embarrassingly parallel computation. Moreover, the use of commodity hardware (e.g. Linux cluster) keeps cost down, and little or no hardware modification is required 6. Not surprisingly life sciences organizations were among Hadoop s earliest adopters. The first large-scale MapReduce project was initiated by the Broad Institute (in 2008) and resulted in the comprehensive Genome Analysis Tool Kit (GATK) 7. The Hadoop CrossBow project from Johns Hopkins University came soon after 8. 5 Altschul SF, et al, Basic local alignment search tool. J Mol Biol 215 (3): , October An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics, pmc/articles/pmc / 7 McKenna A, et al, The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data, Genome Research, 20: , July Hadoop s Rise in Life Sciences 3
4 Here are a few current Hadoop-based bioinformatics applications 9 : Crossbow. Whole genome resequencing analysis; SNP genotyping from short reads. Contrail. De novo assembly from short sequencing reads. Myrna. Ultrafast short read alignment and differential gene expression from large RNA-seq data sets. PeakRanger. Cloud-enabled peak caller for ChIP-seq data. Quake. Quality-aware detection and sequencing error correction tool. BlastReduce. High-performance short read mapping. CloudBLAST. Hadoop implementation of NCBI s Blast. MrsRF. Algorithm for analyzing large evolutionary trees. (For a more detailed example of Hadoop in operation see sidebar, Genomics Example: Calling SNPs with Crossbow.) Genomics Example: Calling SNPs with CrossBow Next Generation Sequencers (NGS) like Illumina Hiseq can produce data in the order of 200 billion base pairs (200 Gbp) in a single one-week run for a 60x human genome coverage, which means that each base was present on an average of 60 reads. The larger the coverage, the more statistically significant is the result. Sequence reads are much shorter than traditional Sanger sequencing. This data requires specialized software algorithms called short read aligners. CrossBow is a combination of several algorithms that provide SNP calling and short read alignment, which are common tasks in NGS. Figure 1 alongside explains the steps necessary to process genome data to look for SNPs. The Map-Sort-Reduce process is ideally suited for a Hadoop framework. The cluster as shown is a traditional N-node Hadoop cluster. All of the Hadoop features like HDFS, program management and fault tolerance are available. The Map step is the short read alignment algorithm, called BoWTie (named after the Burrows Wheeler Transform, BWT). Multiple instances of BoWTie are run in parallel in Hadoop. The input tuples (an ordered list of elements) are the sequence reads and the output tuples are the alignments of the short reads. The Sort step apportions the alignments according to a primary key (the genome partition) and sorts based on a secondary key (which is the offset for that partition). The data here are the sorted alignments. The Reduce step calls SNPs for each reference genome partition. Many parallel instances of the algorithm SOAPsnp (Short Oligonucleotide Analysis Package for SNP) run in the cluster. Input tuples are sorted alignments for a partition and the output tuples are SNP calls. Results are stored via HDFS, and then archived in SOAPsnp format. 9 Got Hadoop?, Sept. 2011, Genome Technology, Hadoop s Rise in Life Sciences 4
5 After several years of steady development in academic environments, Hadoop is now poised for rapid commercialization and broader uptake in biopharma and healthcare. Early adoption has been strongest among next generation sequencing (NGS) centers where NGS workflows can generate 2 TeraBytes (TB) of data per run per week per sequencer that s not including the raw images. For these organizations, the need for scale-out storage that integrates with HPC is a line item requirement. EMC Isilon, long a leader in scale-out NAS storage solutions, understands these challenges and has provided the scale-out storage for nearly all the workflows for all the DNA sequencer instrument manufacturers in the market today at more than 150 customers. Since 2008, the EMC Isilon OneFS storage platform has an overall installed base of more than 65 PetaBytes (PB). Recently, EMC introduced the industry s first scale-out NAS system with native Hadoop support (via HDFS). Hadoop meets all the tenets of Jim Gray s Laws of Data Engineering which have not changed in 15 years. Sanjay Joshi CTO, Life Sciences, EMC Isilon Storage Division The EMC Isilon OneFS file system now provides for connectivity to the Hadoop Distributed File System (HDFS) just like any other shared file system protocol: NFS, CIFS or SMB 10. This allows for the data co-location of the storage with its compute nodes using the standard higher-level Java application programming interface (API) to build MapReduce jobs. EMC has gone one step further by combining its OneFS-based NAS solution with EMC Greenplum HD, a powerful analytics platform, to create a Hadoop appliance. Together, the two offerings relieve users of the burden of cobbling together various open source Hadoop components, which sometimes proves problematic. Hadoop meets all the tenets of Jim Gray s Laws of Data Engineering 11 which have not changed in 15 years, says Sanjay Joshi, CTO, Life Sciences, EMC Isilon Storage Division. Those tenets include: scientific computing is very data intensive, with no real limits; the solution is a scale-out architecture with distributed data access; and bring computation to the data, rather than data to the computations. 10 Hadoop on EMC Isilon Scale Out NAS: EMC White Paper, Part Number h From Jim Gray, Scalable Computing, presentation at Nortel: Microsoft Research, April 1999 Hadoop s Rise in Life Sciences 5
6 Isilon built the industry s first Scale Out storage architecture. Now with its native and enterprise-ready HDFS protocol via OneFS and GreenPlum HD, EMC brings simplicity to Big Data in Science. says Joshi. EMC Isilon OneFS combines the three layers of traditional storage architectures the file system, volume manager, and RAID into one unified software layer, creating a single intelligent distributed file system that runs on one storage cluster. Important advantages of OneFS for Hadoop are: Scalable: Linear scale with increasing capacity from 18TB to 16PB in a single filesystem and a single global namespace. Scale out as needs grow, independent of the compute layer. Predictable: Dynamic content balancing is performed as nodes are added, upgraded or capacity changes. No added management time is required since this process is simple. Available: OneFS protects your data from power loss, node or disk failures, loss of quorum and storage rebuild by distributing data, metadata and parity across all nodes. It also eliminates the single point of failure of a Hadoop Name Node. Therefore OneFS is self healing. Efficient: Compared to the average 50% efficiency of traditional RAID systems, OneFS provides over 80% efficiency, independent of CPU compute or cache. This efficiency is achieved by tier ing the process into three types as shown in the figure alongside and by the pools within these node types. This efficiency extends to the reduction from a 3x copy that Hadoop requires to the >80% efficient 1x storage via EMC Isilon s HDFS protocol. Enterprise-ready. Administration of the storage clusters is via an intuitive Web based UI. Connectivity to your process is through standard file protocols: CIFS, SMB, NFS, FTP/ HTTP, iscsi and HDFS. Standardized authentication and access control is available at scale: AD, LDAP and NIS. Storage tiers without fears based on performance reside in one global namespace, connected via a dedicated backend network. Hadoop s Rise in Life Sciences 6
7 CONCLUSION What began as an internal project at Google in 2004 has now matured into a scalable framework for two computing paradigms that are particularly suited for the life sciences: parallelization and distribution. Indeed, the post-processing streaming data patterns for text strings, clustering and sorting the core process patterns in the life sciences are ideal workflows for Hadoop. Case-in-point: The CrossBow example cited earlier aligned Illumina NGS reads for SNP calling over a 35x coverage of the human genome in under 3 hours using a 40-node Hadoop cluster; an order of magnitude better than traditional HPC technology for parallel processes. The EMC Isilon OneFS distributed file system handles the Hadoop distributed file system, HDFS, just like any other shared file system, and provides a shield for the single point of failure in Hadoop: the name node. The Hybrid Cloud model (source data mirror) with Hadoop as a Service (HaaS) is the current state-of-the-art. For more information visit EMC Isilon at Summary of Hadoop Attributes: Overview Write Once Read Many times (WORM) Co-locates data with compute, uses higher level architecture with Java API HDFS is a distributed file system that runs on large clusters Advantages Uses MapReduce framework a batch query processor, scales linearly EMC Isilon OneFS implements HDFS and eliminates the single point of failure, the name node Standard programming language development: Java, Ruby, Python, C++ create MapReduce jobs. FUSE and WebDAV interfaces provide architectural flexibility Challenges HDFS block size is 128 MB (can be increased), therefore large numbers of small files (<8KB) reduce its performance: use Hadoop Archive (HAR) Data coherency and latency remain issues for large scale implementations Not suited for low-latency, in process use-cases like real-time, spectral or video analysis Data transfer between Genome sequencing data sources to the Hadoop clusters in the Cloud remains an issue, the current business model is mirroring the data between source and Cloud and then utilizing Hadoop as a Service model on the mirrored data. Hadoop s Rise in Life Sciences 7
HADOOP IN THE LIFE SCIENCES:
White Paper HADOOP IN THE LIFE SCIENCES: An Introduction Abstract This introductory white paper reviews the Apache Hadoop TM technology, its components MapReduce and Hadoop Distributed File System (HDFS)
More informationCHALLENGES IN NEXT-GENERATION SEQUENCING
CHALLENGES IN NEXT-GENERATION SEQUENCING BASIC TENETS OF DATA AND HPC Gray s Laws of data engineering 1 : Scientific computing is very dataintensive, with no real limits. The solution is scale-out architecture
More informationHADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics
HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics ESSENTIALS EMC ISILON Use the industry's first and only scale-out NAS solution with native Hadoop
More informationEMC s Enterprise Hadoop Solution. By Julie Lockner, Senior Analyst, and Terri McClure, Senior Analyst
White Paper EMC s Enterprise Hadoop Solution Isilon Scale-out NAS and Greenplum HD By Julie Lockner, Senior Analyst, and Terri McClure, Senior Analyst February 2012 This ESG White Paper was commissioned
More informationThe BIG Data Era has. your storage! Bratislava, Slovakia, 21st March 2013
The BIG Data Era has arrived Re-invent your storage! Bratislava, Slovakia, 21st March 2013 Luka Topic Regional Manager East Europe EMC Isilon Storage Division luka.topic@emc.com 1 What is Big Data? 2 EXABYTES
More informationEMC IRODS RESOURCE DRIVERS
EMC IRODS RESOURCE DRIVERS PATRICK COMBES: PRINCIPAL SOLUTION ARCHITECT, LIFE SCIENCES 1 QUICK AGENDA Intro to Isilon (~2 hours) Isilon resource driver Intro to ECS (~1.5 hours) ECS Resource driver Possibilities
More informationIntro to Map/Reduce a.k.a. Hadoop
Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by
More informationEMC ISILON OneFS OPERATING SYSTEM Powering scale-out storage for the new world of Big Data in the enterprise
EMC ISILON OneFS OPERATING SYSTEM Powering scale-out storage for the new world of Big Data in the enterprise ESSENTIALS Easy-to-use, single volume, single file system architecture Highly scalable with
More informationTHE EMC ISILON STORY. Big Data In The Enterprise. Copyright 2012 EMC Corporation. All rights reserved.
THE EMC ISILON STORY Big Data In The Enterprise 2012 1 Big Data In The Enterprise Isilon Overview Isilon Technology Summary 2 What is Big Data? 3 The Big Data Challenge File Shares 90 and Archives 80 Bioinformatics
More informationProtecting Big Data Data Protection Solutions for the Business Data Lake
White Paper Protecting Big Data Data Protection Solutions for the Business Data Lake Abstract Big Data use cases are maturing and customers are using Big Data to improve top and bottom line revenues. With
More informationWHITE PAPER. www.fusionstorm.com. Get Ready for Big Data:
WHitE PaPER: Easing the Way to the cloud: 1 WHITE PAPER Get Ready for Big Data: How Scale-Out NaS Delivers the Scalability, Performance, Resilience and manageability that Big Data Environments Demand 2
More informationHadoop. Bioinformatics Big Data
Hadoop Bioinformatics Big Data Paolo D Onorio De Meo Mattia D Antonio p.donoriodemeo@cineca.it m.dantonio@cineca.it Big Data Too much information! Big Data Explosive data growth proliferation of data capture
More informationObject Storage: Out of the Shadows and into the Spotlight
Technology Insight Paper Object Storage: Out of the Shadows and into the Spotlight By John Webster December 12, 2012 Enabling you to make the best technology decisions Object Storage: Out of the Shadows
More informationChapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
More informationBIG DATA-AS-A-SERVICE
White Paper BIG DATA-AS-A-SERVICE What Big Data is about What service providers can do with Big Data What EMC can do to help EMC Solutions Group Abstract This white paper looks at what service providers
More informationHadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
More informationHow To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI
More informationEXPLORATION TECHNOLOGY REQUIRES A RADICAL CHANGE IN DATA ANALYSIS
EXPLORATION TECHNOLOGY REQUIRES A RADICAL CHANGE IN DATA ANALYSIS EMC Isilon solutions for oil and gas EMC PERSPECTIVE TABLE OF CONTENTS INTRODUCTION: THE HUNT FOR MORE RESOURCES... 3 KEEPING PACE WITH
More informationScalable Cloud Computing Solutions for Next Generation Sequencing Data
Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of
More informationLarge scale processing using Hadoop. Ján Vaňo
Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine
More informationHadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
More informationA REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information
More informationCSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing
More informationHadoopizer : a cloud environment for bioinformatics data analysis
Hadoopizer : a cloud environment for bioinformatics data analysis Anthony Bretaudeau (1), Olivier Sallou (2), Olivier Collin (3) (1) anthony.bretaudeau@irisa.fr, INRIA/Irisa, Campus de Beaulieu, 35042,
More informationHadoop. http://hadoop.apache.org/ Sunday, November 25, 12
Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using
More informationBig Data Challenges in Bioinformatics
Big Data Challenges in Bioinformatics BARCELONA SUPERCOMPUTING CENTER COMPUTER SCIENCE DEPARTMENT Autonomic Systems and ebusiness Pla?orms Jordi Torres Jordi.Torres@bsc.es Talk outline! We talk about Petabyte?
More informationApache Hadoop FileSystem and its Usage in Facebook
Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs
More informationData-Intensive Computing with Map-Reduce and Hadoop
Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion
More informationHadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction
More informationScala Storage Scale-Out Clustered Storage White Paper
White Paper Scala Storage Scale-Out Clustered Storage White Paper Chapter 1 Introduction... 3 Capacity - Explosive Growth of Unstructured Data... 3 Performance - Cluster Computing... 3 Chapter 2 Current
More informationEMC ISILON X-SERIES. Specifications. EMC Isilon X200. EMC Isilon X210. EMC Isilon X410 ARCHITECTURE
EMC ISILON X-SERIES EMC Isilon X200 EMC Isilon X210 The EMC Isilon X-Series, powered by the OneFS operating system, uses a highly versatile yet simple scale-out storage architecture to speed access to
More informationMapReduce with Apache Hadoop Analysing Big Data
MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside gavin.heavyside@journeydynamics.com About Journey Dynamics Founded in 2006 to develop software technology to address the issues
More informationStorage made simple. Essentials. Expand it... Simply
EMC ISILON SCALE-OUT STORAGE PRODUCT FAMILY Storage made simple Essentials Simple storage management, designed for ease of use Massive scalability with easy, grow-as-you-go flexibility World s fastest
More informationAn Alternative Storage Solution for MapReduce. Eric Lomascolo Director, Solutions Marketing
An Alternative Storage Solution for MapReduce Eric Lomascolo Director, Solutions Marketing MapReduce Breaks the Problem Down Data Analysis Distributes processing work (Map) across compute nodes and accumulates
More informationHadoopTM Analytics DDN
DDN Solution Brief Accelerate> HadoopTM Analytics with the SFA Big Data Platform Organizations that need to extract value from all data can leverage the award winning SFA platform to really accelerate
More informationChapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
More informationHadoop Big Data for Processing Data and Performing Workload
Hadoop Big Data for Processing Data and Performing Workload Girish T B 1, Shadik Mohammed Ghouse 2, Dr. B. R. Prasad Babu 3 1 M Tech Student, 2 Assosiate professor, 3 Professor & Head (PG), of Computer
More informationWhite Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP
White Paper Big Data and Hadoop Abhishek S, Java COE www.marlabs.com Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP Table of contents Abstract.. 1 Introduction. 2 What is Big
More informationIntegrated Grid Solutions. and Greenplum
EMC Perspective Integrated Grid Solutions from SAS, EMC Isilon and Greenplum Introduction Intensifying competitive pressure and vast growth in the capabilities of analytic computing platforms are driving
More informationINTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.
More informationTHE BRIDGE FROM PACS TO VNA: SCALE-OUT STORAGE
White Paper THE BRIDGE FROM PACS TO VNA: SCALE-OUT STORAGE Authored by Michael Gray of Gray Consulting Abstract Moving to a VNA (vendor-neutral archive) for image archival, retrieval, and management requires
More informationLecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
More informationHow To Manage A Single Volume Of Data On A Single Disk (Isilon)
1 ISILON SCALE-OUT NAS OVERVIEW AND FUTURE DIRECTIONS PHIL BULLINGER, SVP, EMC ISILON 2 ROADMAP INFORMATION DISCLAIMER EMC makes no representation and undertakes no obligations with regard to product planning
More informationBig Data With Hadoop
With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
More informationData Storage. Vendor Neutral Data Archiving. May 2015 Sue Montagna. Imagination at work. GE Proprietary Information
Data Storage Vendor Neutral Data Archiving May 2015 Sue Montagna Imagination at work GE Proprietary Information Vendor Neutral Archiving Storing data in a standard format with a standard interface, such
More informationmarlabs driving digital agility WHITEPAPER Big Data and Hadoop
marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil
More informationWhite. Paper. EMC Isilon: A Scalable Storage Platform for Big Data. April 2014
White Paper EMC Isilon: A Scalable Storage Platform for Big Data By Nik Rouda, Senior Analyst and Terri McClure, Senior Analyst April 2014 This ESG White Paper was commissioned by EMC Isilon and is distributed
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More informationBig + Fast + Safe + Simple = Lowest Technical Risk
Big + Fast + Safe + Simple = Lowest Technical Risk The Synergy of Greenplum and Isilon Architecture in HP Environments Steffen Thuemmel (Isilon) Andreas Scherbaum (Greenplum) 1 Our problem 2 What is Big
More informationAccelerating and Simplifying Apache
Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly
More informationData management challenges in todays Healthcare and Life Sciences ecosystems
Data management challenges in todays Healthcare and Life Sciences ecosystems Jose L. Alvarez Principal Engineer, WW Director Life Sciences jose.alvarez@seagate.com Evolution of Data Sets in Healthcare
More informationEMC ISILON SCALE-OUT STORAGE PRODUCT FAMILY
SCALE-OUT STORAGE PRODUCT FAMILY Storage made simple ESSENTIALS Simple storage designed for ease of use Massive scalability with easy, grow-as-you-go flexibility World s fastest-performing NAS Unmatched
More informationStorage for Science. Methods for Managing Large and Rapidly Growing Data Stores in Life Science Research Environments. An Isilon Systems Whitepaper
Storage for Science Methods for Managing Large and Rapidly Growing Data Stores in Life Science Research Environments An Isilon Systems Whitepaper August 2008 Prepared by: Table of Contents Introduction
More informationDeveloping Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University
More informationCONFIGURATION GUIDELINES: EMC STORAGE FOR PHYSICAL SECURITY
White Paper CONFIGURATION GUIDELINES: EMC STORAGE FOR PHYSICAL SECURITY DVTel Latitude NVMS performance using EMC Isilon storage arrays Correct sizing for storage in a DVTel Latitude physical security
More informationCloud Computing at Google. Architecture
Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale
More informationImplementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage Division
Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage Division Outline HDFS Overview OneFS Overview HDFS protocol on OneFS HDFS protocol server implementation
More informationKey Messages of Enterprise Cluster NAS Huawei OceanStor N8500
Messages of Enterprise Cluster NAS Huawei OceanStor Messages of Enterprise Cluster NAS 1. High performance and high reliability, addressing bid data challenges High performance: In the SPEC benchmark test,
More informationEMC ISILON SCALE-OUT STORAGE PRODUCT FAMILY
SCALE-OUT STORAGE PRODUCT FAMILY Unstructured data storage made simple ESSENTIALS Simple storage management designed for ease of use Massive scalability of capacity and performance Unmatched efficiency
More informationOpen source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
More informationBig Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum
Big Data Analytics with EMC Greenplum and Hadoop Big Data Analytics with EMC Greenplum and Hadoop Ofir Manor Pre Sales Technical Architect EMC Greenplum 1 Big Data and the Data Warehouse Potential All
More informationData Mining in the Swamp
WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all
More informationBIG DATA TRENDS AND TECHNOLOGIES
BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using onhand database management tools.
More informationHADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW
HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW 757 Maleta Lane, Suite 201 Castle Rock, CO 80108 Brett Weninger, Managing Director brett.weninger@adurant.com Dave Smelker, Managing Principal dave.smelker@adurant.com
More informationEMC ISILON ONEFS OPERATING SYSTEM
EMC ISILON ONEFS OPERATING SYSTEM Powering scale-out storage for the Big Data and Object workloads of today and tomorrow ESSENTIALS Easy-to-use, single volume, single file system architecture Highly scalable
More informationDATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
More informationEMC SOLUTION FOR SPLUNK
EMC SOLUTION FOR SPLUNK Splunk validation using all-flash EMC XtremIO and EMC Isilon scale-out NAS ABSTRACT This white paper provides details on the validation of functionality and performance of Splunk
More informationOutline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
More informationHadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
More informationHadoop and its Usage at Facebook. Dhruba Borthakur dhruba@apache.org, June 22 rd, 2009
Hadoop and its Usage at Facebook Dhruba Borthakur dhruba@apache.org, June 22 rd, 2009 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed on Hadoop Distributed File System Facebook
More informationPutting Genomes in the Cloud with WOS TM. ddn.com. DDN Whitepaper. Making data sharing faster, easier and more scalable
DDN Whitepaper Putting Genomes in the Cloud with WOS TM Making data sharing faster, easier and more scalable Table of Contents Cloud Computing 3 Build vs. Rent 4 Why WOS Fits the Cloud 4 Storing Sequences
More informationBig Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
More informationHadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture
More informationCloud-Based Big Data Analytics in Bioinformatics
Cloud-Based Big Data Analytics in Bioinformatics Presented By Cephas Mawere Harare Institute of Technology, Zimbabwe 1 Introduction 2 Big Data Analytics Big Data are a collection of data sets so large
More informationMapReduce and Hadoop Distributed File System V I J A Y R A O
MapReduce and Hadoop Distributed File System 1 V I J A Y R A O The Context: Big-data Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009) Google collects 270PB data in a month (2007), 20000PB
More informationBig Data Storage Options for Hadoop Sam Fineberg, HP Storage
Sam Fineberg, HP Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations
More informationA Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
More informationUnderstanding Enterprise NAS
Anjan Dave, Principal Storage Engineer LSI Corporation Author: Anjan Dave, Principal Storage Engineer, LSI Corporation SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA
More informationIntroduction to Hadoop
Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction
More informationDiagram 1: Islands of storage across a digital broadcast workflow
XOR MEDIA CLOUD AQUA Big Data and Traditional Storage The era of big data imposes new challenges on the storage technology industry. As companies accumulate massive amounts of data from video, sound, database,
More informationGeneric Log Analyzer Using Hadoop Mapreduce Framework
Generic Log Analyzer Using Hadoop Mapreduce Framework Milind Bhandare 1, Prof. Kuntal Barua 2, Vikas Nagare 3, Dynaneshwar Ekhande 4, Rahul Pawar 5 1 M.Tech(Appeare), 2 Asst. Prof., LNCT, Indore 3 ME,
More informationDistributed File Systems
Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.
More informationBig Application Execution on Cloud using Hadoop Distributed File System
Big Application Execution on Cloud using Hadoop Distributed File System Ashkan Vates*, Upendra, Muwafaq Rahi Ali RPIIT Campus, Bastara Karnal, Haryana, India ---------------------------------------------------------------------***---------------------------------------------------------------------
More informationIBM Smart Business Storage Cloud
GTS Systems Services IBM Smart Business Storage Cloud Reduce costs and improve performance with a scalable storage virtualization solution SoNAS Gerardo Kató Cloud Computing Solutions 2010 IBM Corporation
More informationParallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel
Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:
More informationStorage Solutions for Bioinformatics
Storage Solutions for Bioinformatics Li Yan Director of FlexLab, Bioinformatics core technology laboratory liyan3@genomics.cn http://www.genomics.cn/flexlab/index.html Science and Technology Division,
More informationDesigning a Cloud Storage System
Designing a Cloud Storage System End to End Cloud Storage When designing a cloud storage system, there is value in decoupling the system s archival capacity (its ability to persistently store large volumes
More informationDDN updates object storage platform as it aims to break out of HPC niche
DDN updates object storage platform as it aims to break out of HPC niche Analyst: Simon Robinson 18 Oct, 2013 DataDirect Networks has refreshed its Web Object Scaler (WOS), the company's platform for efficiently
More informationWOS. High Performance Object Storage
Datasheet WOS High Performance Object Storage The Big Data explosion brings both challenges and opportunities to businesses across all industry verticals. Providers of online services are building infrastructures
More informationCSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
More informationHadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
More informationBIG DATA TECHNOLOGY. Hadoop Ecosystem
BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big
More informationI/O Considerations in Big Data Analytics
Library of Congress I/O Considerations in Big Data Analytics 26 September 2011 Marshall Presser Federal Field CTO EMC, Data Computing Division 1 Paradigms in Big Data Structured (relational) data Very
More informationEMC SOLUTION FOR AGILE AND ROBUST ANALYTICS ON HADOOP DATA LAKE WITH PIVOTAL HDB
EMC SOLUTION FOR AGILE AND ROBUST ANALYTICS ON HADOOP DATA LAKE WITH PIVOTAL HDB ABSTRACT As companies increasingly adopt data lakes as a platform for storing data from a variety of sources, the need for
More informationCloud-based Analytics and Map Reduce
1 Cloud-based Analytics and Map Reduce Datasets Many technologies converging around Big Data theme Cloud Computing, NoSQL, Graph Analytics Biology is becoming increasingly data intensive Sequencing, imaging,
More informationNextGen Infrastructure for Big DATA Analytics.
NextGen Infrastructure for Big DATA Analytics. So What is Big Data? Data that exceeds the processing capacity of conven4onal database systems. The data is too big, moves too fast, or doesn t fit the structures
More informationVirtualizing Apache Hadoop. June, 2012
June, 2012 Table of Contents EXECUTIVE SUMMARY... 3 INTRODUCTION... 3 VIRTUALIZING APACHE HADOOP... 4 INTRODUCTION TO VSPHERE TM... 4 USE CASES AND ADVANTAGES OF VIRTUALIZING HADOOP... 4 MYTHS ABOUT RUNNING
More informationHadoop Distributed Filesystem. Spring 2015, X. Zhang Fordham Univ.
Hadoop Distributed Filesystem Spring 2015, X. Zhang Fordham Univ. MapReduce Programming Model Split Shuffle Input: a set of [key,value] pairs intermediate [key,value] pairs [k1,v11,v12, ] [k2,v21,v22,
More informationNoSQL Data Base Basics
NoSQL Data Base Basics Course Notes in Transparency Format Cloud Computing MIRI (CLC-MIRI) UPC Master in Innovation & Research in Informatics Spring- 2013 Jordi Torres, UPC - BSC www.jorditorres.eu HDFS
More informationNoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
More information