Twister4Azure: Data Analytics in the Cloud
|
|
- Patrick McDaniel
- 2 years ago
- Views:
Transcription
1 Twister4Azure: Data Analytics in the Cloud Thilina Gunarathne, Xiaoming Gao and Judy Qiu, Indiana University Genome-scale data provided by next generation sequencing (NGS) has made it possible to identify new species and infer the evolutionary relationships between organisms. These techniques have been applied in medicine, such as to screen for recurrent mutations in cancer for use as biomarkers, to classify diseases and suggest treatment. However, developing a complete understanding of a genomic dataset relies heavily on a set of data analytics tools that are tractable to analyze potentially billions of read sequences, where the challenges are both unprecedented at scale and in complexity. Researchers of Informatics and Computing, Biology and Medicine at Indiana University have worked together on a NIH project to investigate emerging scalable computing systems interoperable of Cloud and HPC that meet the common needs across a series of life sciences problems [1]. This effort is exemplified by our bioinformatics pipeline of data storage, analysis, and visualization [2]. At its core, parallel programming paradigms such as Mapreduce and MPI provide powerful large-scale data processing capabilities. Our research is to clarify which applications are best suited for Clouds; which require HPC and which can use both effectively. The long-term goal is enabling cost effective and readily available analysis tool repository that removes the barrier of research in broader community -- making data analytics in the Cloud a reality. Integrating Scientific Challenge: A Typical Bioinformatics Pipeline The study of microbial genomes is complicated by the fact that only small number of species can be isolated successfully and the current way forward is metagenomic studies of culture-independent, collective sets of genomes in their natural environments. This requires identification of as many as millions of genes and thousands of species from individual samples. New sequencing technology can provide the required data samples with a throughput of 1 trillion base pairs per day and this rate will increase. A typical observation and data pipeline is shown in Fig. 1 with sequencers producing DNA samples that are assembled and subject to further analysis including BLAST-like comparison with existing datasets as well as clustering, dimension reduction and visualization to identify new gene families. The initial parts of the pipeline fit the Mapreduce (e.g. Hadoop) or many-task Cloud (e.g. Azure) model but the latter stages involve parallel linear algebra for the data mining (e.g. MPI or Twister). It is highly desirable to simplify the construction of distributed sequence analysis pipelines with a unified programming model, which motivated us to design and implement Azure4Twister. Twister and Twister4Azure interpolate between MPI and Mapreduce and, suitably configured, can mimic their characteristics, and, more interestingly, can be positioned as a programming model that has the performance of MPI and the fault tolerance and dynamic flexibility of the original Mapreduce. Figure 1 A Pipeline for Metagenomics Data Analysis
2 Technologies & Applications Our applications can be classified into three main categories based on their execution pattern, namely pleasingly parallel computations, Mapreduce computations and iterative Mapreduce computations. Twister4Azure [3] distributed decentralized iterative Mapreduce runtime for Windows Azure Cloud, which is the successor to MRRoles4Azure [4] Mapreduce framework and the Classic Cloud pleasingly parallel framework [5], was used as the distributed cloud data processing framework for our scientific computations. Twister4Azure extends the familiar, easy-to-use Mapreduce programming model with iterative extensions, enabling a wide array of large scale iterative as well as non-iterative data analysis and scientific applications to utilize Azure platform easily and efficiently in a fault-tolerant manner, supporting all three categories of applications. Figure 2 MRRoles4Azure Architecture Figure 1 Twister4Azure programming model Twister4Azure utilize the eventually-consistent, high-latency Azure cloud services effectively to deliver performance comparable to (non-iterative) and outperforming (for iterative computing) traditional Mapreduce runtimes. Twister4Azure supports multi-level caching of data across iterations as well as among workers running on the same compute instance and utilizes a novel hybrid task scheduling mechanism to perform cache aware scheduling with minimal overhead. Twister4Azure also supports data broadcasting, collective communication primitives as well as the invoking of multiple MapRedue applications inside an iteration. Pleasingly parallel Computations We performed Cap3 sequence assembly (Fig. 4), BLAST+ sequence search and dimension reduction interpolation computations on Azure using this framework. The performance and scalability are comparable to traditional Mapreduce run times [3]. For Cap3, we assembled up to 4096 FASTA files (each containing 458 reads) in less than one hour using 128 Azure small instances with a cost of around 16$. With BLAST+, the execution of queries using 16 Azure large instances was less than one hour with a cost of around 12$. Mapreduce Type Computations We performed SmithWatermann-GOTOH (SWG) pairwise sequence alignment computations on Azure [3][4] with performance and scalability comparable to the traditional Mapreduce frameworks running on traditional clusters (Fig. 5). We were able to perform up to 123 million sequence alignments using 192 Azure small instances with a cost of around 25$, which was less than the cost it took to run the same computation using Amazon ElasticMapReduce.
3 Iterative Mapreduce Type Computations The third and most important category of computation is the iterative Mapreduce type applications. These include majority of data mining, machine learning, dimension reduction, clustering and many more applications. We performed KMeans Clustering (Fig. 6) and Multi-Dimensional Scaling (MDS) (Fig. 7) scientific iterative Mapreduce computations on Azure cloud. MDS consists of two Mapreduce computations (BCCalc and StressCalc) per iteration and contains parallel linear algebra computations as its core. Figure 4 Cap3 Sequence Assembly on 128 instances/cores Figure 5 SWG Pairwise Distance Calculation Performance Figure 6 KMeans Clustering Performance Figure 7 MDS Performance Figure 8 Average Weekday vs. Weekend Transmission Speeds Figure 9 Average Upload vs. Download Speed Blob storage data transmission performance We conducted performance tests on the data transmission speed of the Windows Azure Blob service. The tests were done on a virtual machine hosted by the Windows Azure cloud, so the data transmission measured was intra-cloud traffic. Aspects tested include: data transmission speed for objects of different sizes; download vs. upload speed; weekday performance vs. weekend performance, etc. We illustrate our analysis in Fig. 8 and Fig. 9 and summarize three conclusions below. 1. Both upload and download performance fluctuate, but download speed is generally faster than upload, suggesting Azure Blob may be a better option for storing input than saving output. Namely, Azure Blob is not suitable for those scientific applications that generate a huge amount of output. 2. Download speed is generally faster for files of hundreds of MB than for those of tens of MB. So packing scientific input data into larger files may help improve the data transmission efficiency. 3. The top download speed could be high (60+MB/s), but real-time speed fluctuates a lot. Keeping the computation jobs close to where the Azure data blobs will be important for scheduling scientific jobs in Azure.
4 Some Lessons Learned The overall major challenge for this research is building a system capable of handling the incredible increases in dataset sizes while solving the technical challenges of portability with scaling performance and fault tolerance using an attractive powerful programming model. Further, these challenges must be met for both computation and storage. Cloud enables persistent storage like Azure Blob. Mapreduce leverages the possibility of collocating data and compute and provides more flexibility to bring data to compute, bring data close to compute or bring compute to data. In concert with virtualization technology, data center model like Azure Cloud is well suited for hosting data analysis for bioinformatics applications as services on demand. Below we give specific comments on Azure. Azure Programming Model Issues 1. Intermittent performance inconsistency of Azure instances when executing long running memory intensive applications. Fig. 10 shows a sample histogram of computational tasks in a MDS iterative computation. Adjacent blue & red areas represent a single iteration. In ideal conditions, the time taken for tasks in an iteration should be nearly identical across the iteration except for the first iteration. However, after several iterations we started to notice that some tasks randomly take much longer to finish. This result in slowing down of the whole iteration and cause degradation of the computation efficiency as we can see in Fig. 7. We are still investigating the cause for this anomaly and we suspect it to be a behavior of Azure instances after stressing the instance memory for a certain time BCCalc 20 StressCalc Task Execution Time (s) Map Task ID Figure 10 MDS Task Execution Time Histogram for 20 iterations 2. Deployment issues Changing the number of roles in a running deployment results in a non-responding deployment. This happened to us several times in the past and lately we avoid this by stopping the deployment before changing the number of roles. Deployment unreliability. Some roles never come to running state or take a very long time to start. This became a major issue for us when running the performance benchmarks. For an example, typically we had to start 132 or more roles to get 128 working roles. 3. API issues More user friendly error messages for the service requests. Most of the error messages we get from Azure services are generic invalid request messages, which do not provide information about what really went wrong in the request. This leads to lot of wasted time with trial and error type of development. Better coherent documentation
5 4. Feature requests Visibility time limit change support for queue messages. This would allow us to support much richer fault tolerance patterns. Amazon SQS supports this through ChangeMessageVisiblity operation. Server-side Count or Sum operation for table entries. It s very inefficient to download all the entries in a table to perform a simple sum or a count operation for a single table field. Acknowledgements This work was made possible using the computing usage grant provided by Microsoft Azure Cloud. The work reported in this document is partially funded by NIH Grant number RC2HG We would also like to appreciate SALSA group for their support. References [1] Judy Qiu, Jaliya Ekanayake, Thilina Gunarathne, Jong Youl Choi, Seung-Hee Bae, Hui Li, Bingjing Zhang, Yang Ryan, Saliya Ekanayake, Tak-Lon Wu, Adam Hughes, Geoffrey Fox Hybrid Cloud and Cluster Computing Paradigms for Life Science Applications, Journal of BMC Bioinformatics. Open Conferences System BOSC Proceedings (BOSC 2010), VOL.11(Suppl 12): p.s3, August 18, [2] Judy Qiu, Jaliya Ekanayake, Thilina Gunarathne, Jong Youl Choi, Seung-Hee Bae, Yang Ruan, Saliya Ekanayake, Stephen Wu, Geoffrey Fox, Mina Rho, Haixu Tang, Data Intensive Computing for Bioinformatics, a book chapter of Data Intensive Distributed Computing, ISBN13: , IGI Publishers, [3] T. Gunarathne, B. Zhang, T.-L. Wu, and J. Qiu, "Portable Parallel Programming on Cloud and HPC: Scientific Applications of Twister4Azure," presented at the Portable Parallel Programming on Cloud and HPC: Scientific Applications of Twister4Azure, Melbourne, Australia, [4] T. Gunarathne, W. Tak-Lon, J. Qiu, and G. Fox, "MapReduce in the Clouds for Science," in Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on, 2010, pp [5] T. Gunarathne, T.-L. Wu, J. Y. Choi, S.-H. Bae, and J. Qiu, "Cloud computing paradigms for pleasingly parallel biomedical applications," Concurrency and Computation: Practice and Experience, 2011.
1. Introduction 2. A Cloud Defined
1. Introduction The importance of simulation is well established with large programs, especially in Europe, USA, Japan and China supporting it in a variety of academic and government initiatives. The requirements
Hybrid cloud and cluster computing paradigms for life science applications
Hybrid cloud and cluster computing paradigms for life science applications Judy Qiu 1,2, Jaliya Ekanayake 1,2 *, Thilina Gunarathne 1,2 *, Jong Youl Choi1 1,2 *, Seung-Hee Bae 1,2 *, Hui Li 1,2 *, Bingjing
Scalable Parallel Computing on Clouds Using Twister4Azure Iterative MapReduce
Scalable Parallel Computing on Clouds Using Twister4Azure Iterative MapReduce Thilina Gunarathne, Bingjing Zhang, Tak-Lon Wu, Judy Qiu School of Informatics and Computing Indiana University, Bloomington.
Hybrid cloud and cluster computing paradigms for life science applications
PROCEEDINGS Open Access Hybrid cloud and cluster computing paradigms for life science applications Judy Qiu 1,2*, Jaliya Ekanayake 1,2, Thilina Gunarathne 1,2, Jong Youl Choi 1,2, Seung-Hee Bae 1,2, Hui
HPC ABDS: The Case for an Integrating Apache Big Data Stack
HPC ABDS: The Case for an Integrating Apache Big Data Stack with HPC 1st JTC 1 SGBD Meeting SDSC San Diego March 19 2014 Judy Qiu Shantenu Jha (Rutgers) Geoffrey Fox gcf@indiana.edu http://www.infomall.org
Using Cloud Technologies for Bioinformatics Applications
Using Cloud Technologies for Bioinformatics Applications MTAGS Workshop SC09 Portland Oregon November 16 2009 Judy Qiu xqiu@indiana.edu http://salsaweb/salsa Community Grids Laboratory Pervasive Technology
MapReduce in the Clouds for Science
MapReduce in the Clouds for Science Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox School of Informatics and Computing / Pervasive Technology Institute Indiana University, Bloomington {tgunarat,
Big Application Execution on Cloud using Hadoop Distributed File System
Big Application Execution on Cloud using Hadoop Distributed File System Ashkan Vates*, Upendra, Muwafaq Rahi Ali RPIIT Campus, Bastara Karnal, Haryana, India ---------------------------------------------------------------------***---------------------------------------------------------------------
Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications
Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input
MapReduce and Data Intensive Applications
MapReduce and Data Intensive Applications Tak-Lon (Stephen) Wu Computer Science, School of Informatics and Computing Indiana University, Bloomington, IN taklwu@indiana.edu Abstract Distributed and parallel
A Review on Load Balancing Algorithms in Cloud
A Review on Load Balancing Algorithms in Cloud Hareesh M J Dept. of CSE, RSET, Kochi hareeshmjoseph@ gmail.com John P Martin Dept. of CSE, RSET, Kochi johnpm12@gmail.com Yedhu Sastri Dept. of IT, RSET,
Scientific Computing Supported by Clouds, Grids and HPC(Exascale) Systems
Scientific Computing Supported by Clouds, Grids and HPC(Exascale) Systems June 24 2012 HPC 2012 Cetraro, Italy Geoffrey Fox gcf@indiana.edu Informatics, Computing and Physics Indiana University Bloomington
Portable Data Mining on Azure and HPC Platform
Portable Data Mining on Azure and HPC Platform Judy Qiu, Thilina Gunarathne, Bingjing Zhang, Xiaoming Gao, Fei Teng SALSA HPC Group http://salsahpc.indiana.edu School of Informatics and Computing Indiana
Map-Reduce Algorithm for Mining Outliers in the Large Data Sets using Twister Programming Model
Map-Reduce Algorithm for Mining Outliers in the Large Data Sets using Twister Programming Model Subramanyam. RBV, Sonam. Gupta Abstract An important problem that appears often when analyzing data involves
CloudClustering: Toward an iterative data processing pattern on the cloud
CloudClustering: Toward an iterative data processing pattern on the cloud Ankur Dave University of California, Berkeley Berkeley, California, USA ankurd@eecs.berkeley.edu Wei Lu, Jared Jackson, Roger Barga
NIH Commons Overview, Framework & Pilots - Version 1. The NIH Commons
The NIH Commons Summary The Commons is a shared virtual space where scientists can work with the digital objects of biomedical research, i.e. it is a system that will allow investigators to find, manage,
Hadoopizer : a cloud environment for bioinformatics data analysis
Hadoopizer : a cloud environment for bioinformatics data analysis Anthony Bretaudeau (1), Olivier Sallou (2), Olivier Collin (3) (1) anthony.bretaudeau@irisa.fr, INRIA/Irisa, Campus de Beaulieu, 35042,
Introduction to Windows Azure Cloud Computing Futures Group, Microsoft Research Roger Barga, Jared Jackson,Nelson Araujo, Dennis Gannon, Wei Lu, and
Introduction to Windows Azure Cloud Computing Futures Group, Microsoft Research Roger Barga, Jared Jackson,Nelson Araujo, Dennis Gannon, Wei Lu, and Jaliya Ekanayake Range in size from edge facilities
Introduction to. Thilina Gunarathne Salsa Group, Indiana University. With contributions from Saliya Ekanayake.
Introduction to Amazon Web Services Thilina Gunarathne Salsa Group, Indiana University. With contributions from Saliya Ekanayake. Introduction Fourth Paradigm Data intensive scientific discovery DNA Sequencing
MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
FutureGrid Education: Using Case Studies to Develop A Curriculum for Communicating Parallel and Distributed Computing Concepts
FutureGrid Education: Using Case Studies to Develop A Curriculum for Communicating Parallel and Distributed Computing Concepts Jerome E. Mitchell, Judy Qiu, Massimo Canonio, Shantenu Jha, Linda Hayden,
Cloud-Based Big Data Analytics in Bioinformatics
Cloud-Based Big Data Analytics in Bioinformatics Presented By Cephas Mawere Harare Institute of Technology, Zimbabwe 1 Introduction 2 Big Data Analytics Big Data are a collection of data sets so large
Richmond, VA. Richmond, VA. 2 Department of Microbiology and Immunology, Virginia Commonwealth University,
Massive Multi-Omics Microbiome Database (M 3 DB): A Scalable Data Warehouse and Analytics Platform for Microbiome Datasets Shaun W. Norris 1 (norrissw@vcu.edu) Steven P. Bradley 2 (bradleysp@vcu.edu) Hardik
Big Data With Hadoop
With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
Data Semantics Aware Cloud for High Performance Analytics
Data Semantics Aware Cloud for High Performance Analytics Microsoft Future Cloud Workshop 2011 June 2nd 2011, Prof. Jun Wang, Computer Architecture and Storage System Laboratory (CASS) Acknowledgement
Building a Scalable Big Data Infrastructure for Dynamic Workflows
Building a Scalable Big Data Infrastructure for Dynamic Workflows INTRODUCTION Organizations of all types and sizes are looking to big data to help them make faster, more intelligent decisions. Many efforts
BUILDING A SCALABLE BIG DATA INFRASTRUCTURE FOR DYNAMIC WORKFLOWS
BUILDING A SCALABLE BIG DATA INFRASTRUCTURE FOR DYNAMIC WORKFLOWS ESSENTIALS Executive Summary Big Data is placing new demands on IT infrastructures. The challenge is how to meet growing performance demands
Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis. Pelle Jakovits
Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis Pelle Jakovits Outline Problem statement State of the art Approach Solutions and contributions Current work Conclusions
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University
Concept and Project Objectives
3.1 Publishable summary Concept and Project Objectives Proactive and dynamic QoS management, network intrusion detection and early detection of network congestion problems among other applications in the
Development of Bio-Cloud Service for Genomic Analysis Based on Virtual
Development of Bio-Cloud Service for Genomic Analysis Based on Virtual Infrastructure 1 Jung-Ho Um, 2 Sang Bae Park, 3 Hoon Choi, 4 Hanmin Jung 1, First Author Korea Institute of Science and Technology
Hadoop vs Apache Spark
Innovate, Integrate, Transform Hadoop vs Apache Spark www.altencalsoftlabs.com Introduction Any sufficiently advanced technology is indistinguishable from magic. said Arthur C. Clark. Big data technologies
Contents. Preface Acknowledgements. Chapter 1 Introduction 1.1
Preface xi Acknowledgements xv Chapter 1 Introduction 1.1 1.1 Cloud Computing at a Glance 1.1 1.1.1 The Vision of Cloud Computing 1.2 1.1.2 Defining a Cloud 1.4 1.1.3 A Closer Look 1.6 1.1.4 Cloud Computing
III Big Data Technologies
III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution
Mobile Storage and Search Engine of Information Oriented to Food Cloud
Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 ISSN: 2042-4868; e-issn: 2042-4876 Maxwell Scientific Organization, 2013 Submitted: May 29, 2013 Accepted: July 04, 2013 Published:
Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies
Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies Somesh S Chavadi 1, Dr. Asha T 2 1 PG Student, 2 Professor, Department of Computer Science and Engineering,
UPS battery remote monitoring system in cloud computing
, pp.11-15 http://dx.doi.org/10.14257/astl.2014.53.03 UPS battery remote monitoring system in cloud computing Shiwei Li, Haiying Wang, Qi Fan School of Automation, Harbin University of Science and Technology
Survey on Load Rebalancing for Distributed File System in Cloud
Survey on Load Rebalancing for Distributed File System in Cloud Prof. Pranalini S. Ketkar Ankita Bhimrao Patkure IT Department, DCOER, PG Scholar, Computer Department DCOER, Pune University Pune university
Chapter 4 Cloud Computing Applications and Paradigms. Cloud Computing: Theory and Practice. 1
Chapter 4 Cloud Computing Applications and Paradigms Chapter 4 1 Contents Challenges for cloud computing. Existing cloud applications and new opportunities. Architectural styles for cloud applications.
Resource Scalability for Efficient Parallel Processing in Cloud
Resource Scalability for Efficient Parallel Processing in Cloud ABSTRACT Govinda.K #1, Abirami.M #2, Divya Mercy Silva.J #3 #1 SCSE, VIT University #2 SITE, VIT University #3 SITE, VIT University In the
CloudCmp:Comparing Cloud Providers. Raja Abhinay Moparthi
CloudCmp:Comparing Cloud Providers Raja Abhinay Moparthi 1 Outline Motivation Cloud Computing Service Models Charging schemes Cloud Common Services Goal CloudCom Working Challenges Designing Benchmark
Assignment # 1 (Cloud Computing Security)
Assignment # 1 (Cloud Computing Security) Group Members: Abdullah Abid Zeeshan Qaiser M. Umar Hayat Table of Contents Windows Azure Introduction... 4 Windows Azure Services... 4 1. Compute... 4 a) Virtual
Distinguishing Parallel and Distributed Computing Performance
Distinguishing Parallel and Distributed Computing Performance CCDSC 2016 http://www.netlib.org/utk/people/jackdongarra/ccdsc-2016/ La maison des contes Lyon France Geoffrey Fox, Saliya Ekanayake, Supun
Performance Model for Parallel Matrix Multiplication with Dryad: Dataflow Graph Runtime
Performance Model for Parallel Matrix Multiplication with Dryad: Dataflow Graph Runtime Hui Li, Geoffrey Fox, Judy Qiu School of Informatics and Computing, Pervasive Technology Institute Indiana University
Apache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000
Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000 Alexandra Carpen-Amarie Diana Moise Bogdan Nicolae KerData Team, INRIA Outline
Cloud Computing for Research Roger Barga Cloud Computing Futures, Microsoft Research
Cloud Computing for Research Roger Barga Cloud Computing Futures, Microsoft Research Trends: Data on an Exponential Scale Scientific data doubles every year Combination of inexpensive sensors + exponentially
White Paper November 2015. Technical Comparison of Perspectium Replicator vs Traditional Enterprise Service Buses
White Paper November 2015 Technical Comparison of Perspectium Replicator vs Traditional Enterprise Service Buses Our Evolutionary Approach to Integration With the proliferation of SaaS adoption, a gap
Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia
Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop
Classification On The Clouds Using MapReduce
Classification On The Clouds Using MapReduce Simão Martins Instituto Superior Técnico Lisbon, Portugal simao.martins@tecnico.ulisboa.pt Cláudia Antunes Instituto Superior Técnico Lisbon, Portugal claudia.antunes@tecnico.ulisboa.pt
Big Data on Microsoft Platform
Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4
Sector vs. Hadoop. A Brief Comparison Between the Two Systems
Sector vs. Hadoop A Brief Comparison Between the Two Systems Background Sector is a relatively new system that is broadly comparable to Hadoop, and people want to know what are the differences. Is Sector
CPET 581 Cloud Computing: Technologies and Enterprise IT Strategies
CPET 581 Cloud Computing: Technologies and Enterprise IT Strategies Lecture 8 Cloud Programming & Software Environments Part 1 of 2 Spring 2013 A Specialty Course for Purdue University s M.S. in Technology
OpenCB a next generation big data analytics and visualisation platform for the Omics revolution
OpenCB a next generation big data analytics and visualisation platform for the Omics revolution Development at the University of Cambridge - Closing the Omics / Moore s law gap with Dell & Intel Ignacio
BlobSeer: Towards efficient data storage management on large-scale, distributed systems
: Towards efficient data storage management on large-scale, distributed systems Bogdan Nicolae University of Rennes 1, France KerData Team, INRIA Rennes Bretagne-Atlantique PhD Advisors: Gabriel Antoniu
The public-cloud-computing market has
View from the Cloud Editor: George Pallis gpallis@cs.ucy.ac.cy Comparing Public- Cloud Providers Ang Li and Xiaowei Yang Duke University Srikanth Kandula and Ming Zhang Microsoft Research As cloud computing
Hadoop & Spark Using Amazon EMR
Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?
A Brief Introduction to Apache Tez
A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value
CHAPTER 7 SUMMARY AND CONCLUSION
179 CHAPTER 7 SUMMARY AND CONCLUSION This chapter summarizes our research achievements and conclude this thesis with discussions and interesting avenues for future exploration. The thesis describes a novel
Hadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
Spark: Cluster Computing with Working Sets
Spark: Cluster Computing with Working Sets Outline Why? Mesos Resilient Distributed Dataset Spark & Scala Examples Uses Why? MapReduce deficiencies: Standard Dataflows are Acyclic Prevents Iterative Jobs
BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic
BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop
Hadoop in the Hybrid Cloud
Presented by Hortonworks and Microsoft Introduction An increasing number of enterprises are either currently using or are planning to use cloud deployment models to expand their IT infrastructure. Big
Accelerating and Simplifying Apache
Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly
Leveraging Map Reduce With Hadoop for Weather Data Analytics
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 3, Ver. II (May Jun. 2015), PP 06-12 www.iosrjournals.org Leveraging Map Reduce With Hadoop for Weather
Data Consistency on Private Cloud Storage System
Volume, Issue, May-June 202 ISS 2278-6856 Data Consistency on Private Cloud Storage System Yin yein Aye University of Computer Studies,Yangon yinnyeinaye.ptn@email.com Abstract: Cloud computing paradigm
Task Scheduling in Hadoop
Task Scheduling in Hadoop Sagar Mamdapure Munira Ginwala Neha Papat SAE,Kondhwa SAE,Kondhwa SAE,Kondhwa Abstract Hadoop is widely used for storing large datasets and processing them efficiently under distributed
Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY
Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person
Tableau Server 7.0 scalability
Tableau Server 7.0 scalability February 2012 p2 Executive summary In January 2012, we performed scalability tests on Tableau Server to help our customers plan for large deployments. We tested three different
Big Data Processing with Google s MapReduce. Alexandru Costan
1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:
Windows HPC Server 2008 R2 Service Pack 3 (V3 SP3)
Windows HPC Server 2008 R2 Service Pack 3 (V3 SP3) Greg Burgess, Principal Development Manager Windows Azure High Performance Computing Microsoft Corporation HPC Server Components Job Scheduler Distributed
Scientific and Technical Applications as a Service in the Cloud
Scientific and Technical Applications as a Service in the Cloud University of Bern, 28.11.2011 adapted version Wibke Sudholt CloudBroker GmbH Technoparkstrasse 1, CH-8005 Zurich, Switzerland Phone: +41
Study on Architecture and Implementation of Port Logistics Information Service Platform Based on Cloud Computing 1
, pp. 331-342 http://dx.doi.org/10.14257/ijfgcn.2015.8.2.27 Study on Architecture and Implementation of Port Logistics Information Service Platform Based on Cloud Computing 1 Changming Li, Jie Shen and
HyMR: a Hybrid MapReduce Workflow System
HyMR: a Hybrid MapReduce Workflow System Yang Ruan, Zhenhua Guo, Yuduo Zhou, Judy Qiu, Geoffrey Fox School of Informatics and Computing Indiana University Bloomington {yangruan, zhguo, yuduo, xqiu, gcf}@indiana.edu
Scala Storage Scale-Out Clustered Storage White Paper
White Paper Scala Storage Scale-Out Clustered Storage White Paper Chapter 1 Introduction... 3 Capacity - Explosive Growth of Unstructured Data... 3 Performance - Cluster Computing... 3 Chapter 2 Current
A PERFORMANCE ANALYSIS of HADOOP CLUSTERS in OPENSTACK CLOUD and in REAL SYSTEM
A PERFORMANCE ANALYSIS of HADOOP CLUSTERS in OPENSTACK CLOUD and in REAL SYSTEM Ramesh Maharjan and Manoj Shakya Department of Computer Science and Engineering Dhulikhel, Kavre, Nepal lazymesh@gmail.com,
Performance Analysis of Lucene Index on HBase Environment
Performance Analysis of Lucene Index on HBase Environment Anand Hegde & Prerna Shraff aghegde@indiana.edu & pshraff@indiana.edu School of Informatics and Computing Indiana University, Bloomington B-649
Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture
Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture He Huang, Shanshan Li, Xiaodong Yi, Feng Zhang, Xiangke Liao and Pan Dong School of Computer Science National
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large quantities of data
Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com
Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Mohammad Farhan Husain, Pankil Doshi, Latifur Khan, and Bhavani Thuraisingham University of Texas at Dallas, Dallas TX 75080, USA Abstract.
The Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics Platform Amir H. Payberah Swedish Institute of Computer Science amir@sics.se June 4, 2014 Amir H. Payberah (SICS) Stratosphere June 4, 2014 1 / 44 Big Data small data
Energy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
The Kentik Data Engine
The Kentik Data Engine A CLUSTERED, MULTI-TENANT APPROACH FOR MANAGING THE WORLD S LARGEST NETWORKS When Kentik set out to build a network traffic monitoring solution that would be scalable, high-precision,
ASPERA HIGH-SPEED TRANSFER SOFTWARE. Moving the world s data at maximum speed
ASPERA HIGH-SPEED TRANSFER SOFTWARE Moving the world s data at maximum speed PRESENTERS AND AGENDA PRESENTER John Heaton Aspera Director of Sales Engineering john@asperasoft.com AGENDA How Cloud is used
SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, 2012. Abstract. Haruna Cofer*, PhD
White Paper SGI High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems Haruna Cofer*, PhD January, 2012 Abstract The SGI High Throughput Computing (HTC) Wrapper
Overview of Cloud Computing Platforms
Overview of Cloud Computing Platforms July 28, 2010 Big Data for Science Workshop Judy Qiu xqiu@indiana.edu http://salsahpc.indiana.edu Pervasive Technology Institute School of Informatics and Computing
Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392. Research Article. E-commerce recommendation system on cloud computing
Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 E-commerce recommendation system on cloud computing
Cloud Technologies for Bioinformatics Applications
Cloud Technologies for Bioinformatics Applications Xiaohong Qiu, Jaliya Ekanayake,, Scott Beason, Thilina Gunarathne,, Geoffrey Fox, Pervasive Technology Institute, School of Informatics and Computing,
Data-intensive HPC: opportunities and challenges. Patrick Valduriez
Data-intensive HPC: opportunities and challenges Patrick Valduriez Big Data Landscape Multi-$billion market! Big data = Hadoop = MapReduce? No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard,
Big Data Mining Services and Knowledge Discovery Applications on Clouds
Big Data Mining Services and Knowledge Discovery Applications on Clouds Domenico Talia DIMES, Università della Calabria & DtoK Lab Italy talia@dimes.unical.it Data Availability or Data Deluge? Some decades
HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW
HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW 757 Maleta Lane, Suite 201 Castle Rock, CO 80108 Brett Weninger, Managing Director brett.weninger@adurant.com Dave Smelker, Managing Principal dave.smelker@adurant.com
Six Strategies for Building High Performance SOA Applications
Six Strategies for Building High Performance SOA Applications Uwe Breitenbücher, Oliver Kopp, Frank Leymann, Michael Reiter, Dieter Roller, and Tobias Unger University of Stuttgart, Institute of Architecture
OVERVIEW OF MICROSOFT AZURE
Hybrid Cloud Solution to Increase Business Value CloudLink is a hybrid cloud solution that interacts with existing onpremises ERP systems. With the hybrid approach, we can leverage the on-premises software
Brian Connolly Systems Engineer, LabKey Software brian@labkey.com. LabKey Server in the Cloud
Brian Connolly Systems Engineer, LabKey Software brian@labkey.com LabKey Server in the Cloud 1 Agenda What is the Cloud? Why would I want to use the cloud? What will it cost? Using LabKey in the cloud
Large-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
Overview of Cloud Computing (ENCS 691K Chapter 1)
Overview of Cloud Computing (ENCS 691K Chapter 1) Roch Glitho, PhD Associate Professor and Canada Research Chair My URL - http://users.encs.concordia.ca/~glitho/ Overview of Cloud Computing Towards a definition
Data-Intensive Computing with Map-Reduce and Hadoop
Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion
Hadoop. Bioinformatics Big Data
Hadoop Bioinformatics Big Data Paolo D Onorio De Meo Mattia D Antonio p.donoriodemeo@cineca.it m.dantonio@cineca.it Big Data Too much information! Big Data Explosive data growth proliferation of data capture