High-Performance, Massively Scalable Distributed Systems using the MapReduce Software Framework: The SHARD Triple-Store
Kurt Rohloff, BBN Technologies, Cambridge, MA, USA
Richard E. Schantz, BBN Technologies, Cambridge, MA, USA

ABSTRACT
In this paper we discuss the use of the MapReduce software framework to address the challenge of constructing high-performance, massively scalable distributed systems. We discuss several design considerations associated with constructing complex distributed systems using the MapReduce software framework, including the difficulty of scalably building indexes. We focus on Hadoop, the most popular MapReduce implementation. Our discussion and analysis are motivated by our construction of SHARD, a massively scalable, high-performance and robust triple-store technology built on top of Hadoop. We provide a general approach to constructing an information system on the MapReduce software framework that responds to data queries, and we present experimental results generated with an early version of SHARD. We close with a discussion of hypothetical MapReduce alternatives that could be used to construct more scalable distributed computing systems.

Categories and Subject Descriptors
D.2.2 [Software Engineering]: Design methodologies.

General Terms
Design, Algorithms, Software Engineering, Performance, Experimentation.

Keywords
Distributed computing, MapReduce, Programming, Systems, Semantic Web, Graph Data, SPARQL, Performance Evaluation.

1. INTRODUCTION
Lately there have been a number of advances in software frameworks, such as MapReduce [4], that make it much easier to address the challenges inherent in constructing highly parallel, high-performance and highly scalable distributed computing systems. Although MapReduce has been very successful as a distributed computing substrate for relatively simple, highly parallel applications like search and data storage, few more complex systems, such as scalable data management systems, have been built using this or similar frameworks. We speculate that this is primarily because these frameworks (and MapReduce in particular) are too low-level [5].

In this paper we focus on the design aspects inherent in using the MapReduce software framework to construct highly parallel, high-performance and scalable information management systems. We also suggest alternative software frameworks for the easier design and development of highly scalable information management systems. Our insight and discussion in these areas are motivated by our experience designing, constructing and evaluating our initial implementation of the SHARD (Scalable, High-Performance, Robust and Distributed) triple-store. A triple-store is an information storage and retrieval environment for graph data, traditionally represented in RDF formats [17]. (RDF is a standard data format for representing triples, which are edges in data graphs.) SHARD persists graph data as RDF triples and responds to queries over this data in the SPARQL query language. We use the Hadoop implementation of MapReduce to construct SHARD. We discuss lessons learned and insights for future revisions of our design and implementation. To support these claims, we present our initial experimental results evaluating SHARD with the standard LUBM benchmark for triple-stores [6]. The problem context driving our information system design is the need for web-scale information systems.
For example, one of the singular advancements over the past several years in the Semantic Web domain has been the explosion of graph data available in semantic formats [10]. Unfortunately, Semantic Web data processing technologies, which rely on graph data information systems, are designed for deployment on a single (or a small number of) machine(s). This is fine when data is small, but current methodologies for designing high-level information systems for graph data are limited by data processing and analysis bottlenecks with graphs on the order of a billion edges [10][18]. These scalability constraints are the greatest barriers to achieving the fundamental web-scale Semantic Web vision and have hindered the broader adoption of Semantic Web technologies. Other scalable approaches to triple-store design based on key-value and column stores (such as Cassandra [3] and Project Voldemort [16], among many others) are feasible. However, these other technologies do not provide the native data processing capabilities supported by MapReduce implementations like Hadoop that enable more efficient query processing.

We discuss work-in-progress to address these scalability limitations in the Semantic Web by designing and developing SHARD using the MapReduce software framework. In particular, we describe initial results from deploying an early version of SHARD into an Amazon EC2 cloud [1] and running the standard LUBM triple-store benchmark. We find that SHARD already performs better than current industry-standard triple-stores for datasets on the order of a billion triples.

The remainder of this paper is organized as follows. In Section 2 we provide an introduction to MapReduce, Hadoop, and their relevant properties for the design of information management systems. In Section 3 we provide a brief overview of relevant SHARD design goals and graph data processing concepts. In Section 4 we describe the design of information management systems such as SHARD using the MapReduce software framework. In Section 5 we describe our experimental results from the deployment of an early version of SHARD into an Amazon EC2 cloud. In Section 6 we discuss design insights we gained from experimentation. We conclude in Section 7 with a discussion of ongoing and alternative designs for high-performance, massively scalable information systems.
2. MAPREDUCE AND HADOOP
MapReduce is a software framework for processing and generating large data sets [4]. Users specify a map function that splits data into key/value pairs and a reduce function that merges all key/value pairs based on the key. Many real-world low-level tasks are expressible in this model, including word counting and the PageRank algorithm. The MapReduce software framework is easily parallelizable for execution on large clusters of commodity machines. This enables the construction of high-performance, highly scalable applications.

One of the more popular MapReduce implementations is Hadoop [8]. Hadoop takes care of the details of managing data on compute nodes through the Hadoop Distributed File System (HDFS), scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows for the design and implementation of high-level functionality using the MapReduce framework to construct high-performance and highly scalable applications. A key aspect of the MapReduce software framework, as expressed in the Hadoop implementation, is the use of a special, centralized compute node called the NameNode. The NameNode directs the placement of data onto compute nodes through HDFS, assigns compute jobs to the various nodes, tracks failures and manages the shuffling of data after the Map step completes.

There are several benefits as well as drawbacks to using MapReduce to design high-performance information systems, irrespective of how those information systems are designed. The benefits include that MapReduce implementations such as Hadoop are generally easy to set up and debug, and applications are easy to write efficiently in several programming languages. The drawbacks of the Hadoop implementation of the MapReduce framework include that only Java programs can be used natively for more complex applications, it is difficult to run Java code on compute nodes that need runtime customization, the NameNode creates a bottleneck for HDFS access, and NameNode failures can be catastrophic.
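As a concrete illustration of the map/reduce split described above, the word-counting task mentioned earlier can be written as a minimal Hadoop job using the standard org.apache.hadoop.mapreduce API. This is generic illustrative boilerplate, not SHARD code:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: split each input line into words, emitting (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: the framework groups all values by key, so summing yields the count.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Everything beyond these two functions — data placement, scheduling, shuffling, failure handling — is managed by Hadoop and the NameNode as described above.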
3. DESIGN GOALS
Our primary information system design motivations are the ability to persist and rapidly query very large data graphs. To align with Semantic Web data standards, we consider graphs represented as subject-predicate-object triples [2][7]. A small example graph containing 7 triples can be seen in Figure 1: Kurt lives in Cambridge, Kurt owns an object car0, car0 is a car, car0 was made by Ford, car0 was made in Detroit, Detroit is a city and Cambridge is a city.

Figure 1: A Small Graph of Triple Data.

We use SPARQL [19] as a representative query language - it is the standard Semantic Web query language. SPARQL semantics are general purpose and similar to the more well-known SQL. An example SPARQL query for the above graph data is the following:

    SELECT ?person WHERE {
      ?person :owns ?car .
      ?car :a :car .
      ?car :madein :Detroit .
    }

The above SPARQL query has three clauses and asks for all matches to the variable ?person such that ?person owns an entity represented by the variable ?car which is a car and was made in Detroit. Note that the above query can be represented as a directed graph, as seen in Figure 2.

Figure 2: A Directed Graph Representation of a Query.

Processing a SPARQL query in the context of a data graph such as the one above consists of identifying which variables in the query clauses can be bound to nodes in the data graph such that the query clauses align with the data triples. This alignment process for query processing is fairly general across many data representations and query languages. An example of this alignment for our example query and data can be seen in Figure 3.

Figure 3: An Alignment of SPARQL Query Variables with Triple Data.

Our functional design goals for the SHARD triple-store are to:
1. Serve as a persistent store for triple data in RDF format.
2. Serve as a SPARQL endpoint to process SPARQL queries.
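The alignment of a single query clause with a single data triple can be sketched concretely. The following hypothetical Java helper (our own illustration, not part of SHARD) treats terms beginning with '?' as variables and extends a set of variable bindings only when the clause and triple are consistent:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: aligns one triple pattern with one data triple.
public final class ClauseAlignment {

  // Returns extended bindings if pattern and triple align, or null otherwise.
  // pattern and triple are {subject, predicate, object}; terms beginning
  // with '?' in the pattern are variables.
  public static Map<String, String> align(String[] pattern, String[] triple,
                                          Map<String, String> bindings) {
    Map<String, String> extended = new HashMap<>(bindings);
    for (int i = 0; i < 3; i++) {
      String term = pattern[i];
      if (term.startsWith("?")) {                 // variable term
        String bound = extended.get(term);
        if (bound == null) {
          extended.put(term, triple[i]);          // bind a fresh variable
        } else if (!bound.equals(triple[i])) {
          return null;                            // conflicting binding
        }
      } else if (!term.equals(triple[i])) {
        return null;                              // literal mismatch
      }
    }
    return extended;
  }

  public static void main(String[] args) {
    String[] clause = {"?person", ":owns", "?car"};
    String[] triple = {":Kurt", ":owns", ":car0"};
    // Prints the extended bindings, e.g. {?car=:car0, ?person=:Kurt}
    System.out.println(align(clause, triple, new HashMap<>()));
  }
}
```

A full query match is then a set of bindings under which every clause aligns with some data triple, which is exactly what the iterative algorithm in Section 4 computes at scale.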
There have been a number of other design approaches for triple-stores with similar if not the same functional design goals [18]. Several of these triple-stores have achieved very good performance on single compute-node systems by using designs based around memory-mapped index information [11]. However, disk and memory limitations have driven the need for distributed computing approaches to triple-stores [12][14]. There have been a number of recent attempts to develop designs of triple-stores using distributed computing frameworks [20].

4. SYSTEM DESIGN
4.1 Data Persistence
In a distributed computing approach to information management system design, it is generally infeasible to pass large-volume input data directly to and from the user. This data passing would involve the coordinated transfer of data onto and off of the compute nodes whenever data needs to be processed, and the large scale of the data makes this approach impractical due to data churn. Consequently, large input (data, queries) and output (query results) data sets need to be stored directly on the compute nodes. In order to best leverage the MapReduce software framework and its Hadoop implementation to construct an information management system, we made this design decision with the understanding that the input data and output results are generally very large and not feasible to output directly to the user.

Data storage directly on compute nodes is done natively in the Hadoop implementation of MapReduce by placing data in the HDFS distributed file system. We persist SHARD data in flat files in HDFS such that each line of the triple-store text file represents all triples associated with a different subject. Consider the following exemplar line saved in SHARD from the LUBM domain, which represents three triples associated with the subject entity Pub1:

    Pub1 :author Prof0 :name "Pub1" a :Publication

This line represents that the entity Pub1 has an author entity Prof0, Pub1 has the name "Pub1", and Pub1 is a publication.

Although this approach to persisting triple data as flat text files is rudimentary compared to other information management approaches, we found that it offers a number of important benefits for several general application domains. For one, this approach, particularly in the HDFS implementation, brings a level of automated robustness by replicating data and MapReduce operations across multiple nodes. The data is also stored in a simple, easy-to-read format that lends itself to easier, user-focused drill-down diagnostics of query results returned by the triple-store. Most importantly, although this approach to storing triples is inefficient for query processing that requires the inspection of only a small number of triples, it is efficient in the context of Hadoop for scanning over large sets of triples to respond to queries that generate a large number of results, as Hadoop natively scans over input data during the Map stage of its MapReduce operations.
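To make the on-disk layout concrete, here is a minimal sketch of expanding one such line back into individual triples, assuming a whitespace-delimited subject followed by alternating predicate/object tokens (this helper is our own illustration, not the SHARD implementation, and would need a real tokenizer for literals containing spaces):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative expansion of one SHARD-style flat-file line into triples,
// assuming the format: <subject> (<predicate> <object>)*
public final class TripleLine {

  public static List<String[]> expand(String line) {
    String[] tokens = line.trim().split("\\s+");
    List<String[]> triples = new ArrayList<>();
    String subject = tokens[0];
    // Remaining tokens alternate predicate, object.
    for (int i = 1; i + 1 < tokens.length; i += 2) {
      triples.add(new String[] {subject, tokens[i], tokens[i + 1]});
    }
    return triples;
  }

  public static void main(String[] args) {
    for (String[] t : expand("Pub1 :author Prof0 :name \"Pub1\" a :Publication")) {
      System.out.println(t[0] + " " + t[1] + " " + t[2]);
    }
    // Pub1 :author Prof0
    // Pub1 :name "Pub1"
    // Pub1 a :Publication
  }
}
```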
4.2 Query Processing
MapReduce provides only simple data manipulation techniques: splitting data into key-value pairs and accumulating all values with the same keys. To provide more advanced query processing that can leverage highly scalable implementations of the MapReduce software framework, an information system needs to iterate over the clauses in a query, incrementally attempting to bind query variables to literals in the triple data while satisfying all of the query constraints.

Each iteration consists of a MapReduce operation for a single query clause. This iteration is non-trivial because the results of previous clauses must be continually joined with the results of more recent clauses over the iterations of the MapReduce steps. We describe here how we designed these iterations, with a focus on joining intermediate results as an approach to constructing more complex systems. We use our SHARD context of graph data and SPARQL queries to concretely discuss a design for this iterative query processing using the MapReduce software framework. A schematic overview of this iterative query binding process for the graph data and SPARQL context can be seen in Figure 4. This schematic consists of multiple MapReduce operations.

Figure 4: A Schematic Overview of the Iterative Algorithm to Process SPARQL Queries with Triple Data.

The first MapReduce step maps the triple data to a list of variable bindings which satisfy the first clause of the query. The key of the Map step is the list of variable bindings. The Reduce step removes duplicate results and saves them to disk with the variable bindings as the key.

The intermediate query binding steps continue to iteratively bind variables to literals as new variables are introduced by processing successive query clauses, and/or to filter out previous bindings which cannot fit the new clauses. The intermediate steps perform a MapReduce operation over both the triple data and the previously bound variables which were saved to disk. The ith intermediate Map step identifies all variables in the triple data which satisfy the ith clause and saves this result with the key being any variables in the ith clause which appeared in previous clauses. The value of this Map step is the bindings of other variables not previously seen in the query clauses, if any. This iteration of the Map step also rekeys the previous variable bindings saved to disk under the variables of the ith clause that appeared in previous clauses; the value of this key-value pair is the list of variable bindings which occurred in previous clauses but not in the ith clause. The ith Reduce step runs a join operation over the intermediate results from the Map step by iterating over all pairs of results from the previous clauses and the new clause with the same key assignment. This iteration of map-reduce-join continues until all clauses are processed and variables are assigned which satisfy the query clauses.
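The per-clause join logic can be illustrated on a single machine. The following sketch (our own, self-contained illustration, not the distributed Hadoop job) groups previous bindings and new clause matches by the values of their shared variables — the role the Map-step key plays above — and merges compatible pairs, as the ith Reduce step does:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Single-machine illustration of one iteration's reduce-side join.
public final class BindingJoin {

  static List<Map<String, String>> join(List<Map<String, String>> previous,
                                        List<Map<String, String>> clauseMatches,
                                        List<String> sharedVars) {
    // "Map phase": key each previous binding by its shared-variable values.
    Map<String, List<Map<String, String>>> byKey = new HashMap<>();
    for (Map<String, String> b : previous) {
      byKey.computeIfAbsent(key(b, sharedVars), k -> new ArrayList<>()).add(b);
    }
    // "Reduce phase": merge each new clause match with every previous
    // binding that carries the same shared-variable values.
    List<Map<String, String>> joined = new ArrayList<>();
    for (Map<String, String> m : clauseMatches) {
      for (Map<String, String> b : byKey.getOrDefault(key(m, sharedVars), List.of())) {
        Map<String, String> merged = new HashMap<>(b);
        merged.putAll(m);
        joined.add(merged);
      }
    }
    return joined;
  }

  private static String key(Map<String, String> binding, List<String> sharedVars) {
    StringBuilder sb = new StringBuilder();
    for (String v : sharedVars) sb.append(binding.get(v)).append('|');
    return sb.toString();
  }

  public static void main(String[] args) {
    // Previous clause (?person :owns ?car) produced one binding;
    // the new clause (?car :madein :Detroit) matched one value of ?car.
    List<Map<String, String>> prev = List.of(Map.of("?person", ":Kurt", "?car", ":car0"));
    List<Map<String, String>> next = List.of(Map.of("?car", ":car0"));
    System.out.println(join(prev, next, List.of("?car")));
    // e.g. [{?person=:Kurt, ?car=:car0}]
  }
}
```

In SHARD the two sides of this join arrive as key-value pairs from the Map step, so the grouping happens in Hadoop's shuffle rather than in an in-memory hash map.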
SHARD is designed to save intermediate results of the query processing to speed up the processing of similar later queries. The storage of intermediate results is a byproduct of the Hadoop MapReduce implementation.

The final MapReduce step consists of filtering bound variable assignments to satisfy the SELECT clause of the SPARQL query. In particular, the Map step filters each of the bindings, and the Reduce step removes duplicates, where the key value for both Map and Reduce is the set of bound variables in the SELECT clause.

5. EXPERIMENTATION
To test the performance of our general design of a scalable information management system based on the MapReduce framework, we developed an early version of SHARD using the Cloudera distribution of Hadoop and deployed it onto an Amazon EC2 cloud environment of 20 XL compute nodes [1] running RedHat Linux. The version of SHARD we deployed for evaluation supports basic SPARQL query functionality (without support for prefixes, optional clauses or results ordering) over full RDF data. This unimplemented functionality is generally associated with the pre- or post-processing of queries, and we don't expect that adding it will substantially detract from the performance exhibited by the current implementation of SHARD. Although possible to implement, the deployed version of SHARD does not perform any of the query manipulation and reordering normally done for increased performance by SPARQL endpoints in mature triple-stores. Also, the deployed version of SHARD does not yet take advantage of any possible query caching made possible by our design choices.

5.1 LUBM Benchmark
We used the LUBM benchmark to evaluate the performance of SHARD. The LUBM benchmark creates artificial data about the publishing, coursework and advising activities of students and faculty in departments in universities. The LUBM code natively generates OWL ontology files [15]. OWL ontology files represent relationships between properties, but because our early version of SHARD takes N3 (an RDF serialization format) data as input, we provided functionality to convert the generated LUBM data for many universities into N3 format and automatically store this generated data in the SHARD HDFS backend using Hadoop. We used code from the LUBM benchmark to generate triple data for 6000 universities, which is approximately 800 million triples, to parallel the performance evaluations made in a previous triple-store comparison study [18]. After loading the triple data into the SHARD triple-store, we evaluated the performance of SHARD in responding to queries 1, 9 and 14 of LUBM, as was done in the previous triple-store study. Query 1 is very simple: it asks for the students that take a particular course and returns a very small set of responses. Query 9 is a relatively more complicated query with a triangular pattern of relationships: it asks for all teachers, students and courses such that the teacher is the adviser of the student, who takes a course taught by the teacher. Query 14 is relatively simple, as it asks for all undergraduate students (but the response is very large).

5.2 Performance
SHARD achieved the following query response times for 6000 universities (approx. 800 million triples) using the LUBM benchmark when deployed on an Amazon AWS cloud with 20 compute nodes:

Query 1: 404 sec. (approx. 0.1 hr.)
Query 9: 740 sec. (approx. 0.2 hr.)
Query 14: 118 sec. (approx. 0.03 hr.)
We generally found that SHARD performance increased with the number of compute nodes, but we found this performance increase to be sub-linear. This sub-linear increase was most likely due to the communication overhead of the MapReduce steps.

For comparison, in the triple-store study [18] the industrial single-machine DAMLDB triple-store (released as the open-source project Parliament), coupled with the Sesame and Jena Semantic Web frameworks to aid query processing, achieved the following performance on the same queries. Sesame+DAMLDB took:

Query 1: approx. 0.1 hr.
Query 9: approx. 1 hr.
Query 14: approx. 1 hr.

For Jena+DAMLDB we have no data on performance over 550 million triples due to the difficulty of loading triples into this triple-store, but based on observed trends this triple-store would likely have taken the following:

Query 1: approx. hr.
Query 9: approx. 1 hr.
Query 14: approx. 5 hr.

Note that the only query where SHARD performed noticeably worse than DAMLDB was Query 1. Query 1 returns a very small subset of literals bound to variables. Although MapReduce is traditionally used to build indices, its implementations (e.g., Hadoop) provide little native support for indexed access to data stored in HDFS files. Conversely, DAMLDB has special indexing optimizations for simple queries like Query 1 that are not yet implemented in SHARD. We discuss how this aspect of MapReduce may be improved upon below. Apart from this one exception, SHARD performed better than other known technologies due to the highly parallel implementations of the MapReduce framework that we leverage in our design of SHARD. Also, due to the inherent scalability of the Hadoop and HDFS approach in the SHARD design, the SHARD triple-store could potentially be used for extremely large datasets (more than billions of triples) without requiring any specialized hardware, as is required for other monolithic triple-stores.

6. DESIGN INSIGHTS
The performance of SHARD relative to previous triple-store implementations demonstrates the viability of our information system design approach based on MapReduce. Our design enables the efficient search through data to find matches that satisfy queries. This design is easily distributed across many compute nodes for highly parallel and highly scalable operation. It is also low-cost, as it can run on commodity hardware.

There are a number of areas for improvement in an alternative to the MapReduce software framework and its implementations for the easier design of information systems in general and triple-stores in particular. Most notably, MapReduce, and consequently our design, is biased towards operations over large datasets rather than searches for individual key-value pairs.
This could be improved upon with native indexing capabilities, possibly supported during pre-processing operations. These pre-processing operations could also be used to reason over the data, so that SHARD could correctly respond to queries that require reasoning. A more advanced modification to support information system design would be an alternative software framework that provides better data linking. Instead of having to store lists of data in flat files in an HDFS-like construct, a software framework could provide a native linked-data construct that pairs data elements with pointers to related data. Such a linked-data framework would provide faster localized query processing without requiring exhaustive search of the data set on every query request.

7. ONGOING WORK
Development work is ongoing with our information system design based on the MapReduce software framework. Based on our experience with the initial SHARD deployment, we have several short- and long-term activities to further improve performance and applicability from both a design and a software framework perspective.

First among the improvements is a more effective method to index data. This will most likely need to be supported by an alternative to the MapReduce framework that supports native indexing instead of basic Map operations over all data elements. Additional performance improvement of our design in a targeted production environment could be provided by using cached partial results, both locally for high-performance parallel operations and globally via a NameNode-like entity that tracks local caching of partial results. This will require additional capability in a software framework to track partial results that were previously cached, and possibly to track which cached results could be thrown out to save disk space in the cloud (if this is a deployment concern).

There is a general need for improved tools for high-level, highly parallel computations in clusters. The MapReduce tools we explore operate at a low level. It is difficult to coordinate the storage and use of intermediate results in parallel without frequent, temporally expensive disk access. An enhanced software framework might provide this by caching data in memory and persisting data to disk as a backup for failure recovery.

8. ACKNOWLEDGMENTS
The authors would like to thank Gail Mitchell, Doug Reid and Prakash Manghwani from BBN, Philip Zeyliger from Cloudera and Hanspeter Pfister from Harvard University for their assistance.

9. REFERENCES
[1] Amazon. (2010) Amazon EC2 Instance Types.
[2] Berners-Lee, T., Hendler, J. and Lassila, O. (May 17, 2001). "The Semantic Web". Scientific American Magazine.
[3] Cassandra. (2010)
[4] Dean, J. and Ghemawat, S. MapReduce: Simplified data processing on large clusters. In Proceedings of the USENIX Symposium on Operating Systems Design & Implementation (OSDI).
[5] DeWitt, D. and Stonebraker, M. MapReduce: A major step backwards. databasecolumn.com.
[6] Guo, Y., Pan, Z. and Heflin, J. LUBM: A benchmark for OWL knowledge base systems. Journal of Web Semantics 3(2) (2005).
[7] Grigoris, A. and van Harmelen, F. A Semantic Web Primer, 2nd Edition. The MIT Press.
[8] Hadoop. (2010). Apache Hadoop.
[9] Hendler, J. Web 3.0: The Dawn of Semantic Search. In IEEE Computer, Jan.
[10] Kiryakov, A., Tashev, Z., Ognyanoff, D., Velkov, R., Momtchev, V., Balev, B. and Peikov, I. "Validation goals and metrics for the LarKC platform."
LarKC Report.
[11] Kolas, D., Emmons, I. and Dean, M. Efficient Linked-List RDF Indexing in Parliament. In Proceedings of the Scalable Semantic Web (SSWS) Workshop of ISWC '09.
[12] Li, P., Zeng, Y., Kotoulas, S., Urbani, J. and Zhong, N. "The Quest for Parallel Reasoning on the Semantic Web." In Proceedings of the 2009 International Conference on Active Media Technology, LNCS.
[13] LinkingOpenData. (2010) Retrieved from ects/linkingopendata
[14] Mika, P. and Tummarello, G. Web Semantics in the Clouds. IEEE Intelligent Systems 23, 5 (Sep. 2008).
[15] OWL. (2010) Web Ontology Language (OWL).
[16] Project Voldemort. (2010)
[17] RDF. (2010) Resource Description Framework (RDF).
[18] Rohloff, K., Dean, M., Emmons, I., Ryder, D. and Sumner, J. An Evaluation of Triple-Store Technologies for Large Data Stores. 3rd International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS '07), Vilamoura, Portugal, Nov 27, 2007.
[19] SPARQL. (2010) SPARQL Query Language for RDF.
[20] Urbani, J., Kotoulas, S., Oren, E. and van Harmelen, F. "Scalable Distributed Reasoning using MapReduce." In Proceedings of ISWC '09, 2009.