HIPI: A Hadoop Image Processing Interface for Image-based MapReduce Tasks


Chris Sweeney, Liu Liu, Sean Arietta, Jason Lawrence
University of Virginia

Figure 1: A typical MapReduce pipeline using our Hadoop Image Processing Interface, with n images, i map nodes, and j reduce nodes.

Abstract

The amount of image data being uploaded to the internet is rapidly increasing, with Facebook users uploading over 2.5 billion new photos every month [Facebook 2010]; however, applications that make use of this data are severely lacking. Current computer vision applications use a small number of input images because of the difficulty of acquiring computational resources and storage options for large amounts of data [Guo 2005; White et al. 2010]. As such, development of vision applications that use a large set of images has been limited [Ghemawat and Gobioff 2003]. The Hadoop MapReduce platform provides a system for large-scale and computationally intensive distributed processing [Dean and Ghemawat 2008], though use of Hadoop's system is severely limited by the technical complexities of developing useful applications [Ghemawat and Gobioff 2003; White et al. 2010]. To address this, we propose an open-source Hadoop Image Processing Interface (HIPI) that aims to create an interface for computer vision with MapReduce technology. HIPI abstracts the highly technical details of Hadoop's system and is flexible enough to implement many techniques in the current computer vision literature. This paper describes the HIPI framework and two example applications that have been implemented with it. The goal of HIPI is to create a tool that makes development of large-scale image processing and vision projects extremely accessible, in the hope that it will empower researchers and students to create such applications with ease.

Keywords: MapReduce, computer vision, image processing

1 Introduction

Many image processing and computer vision algorithms are applicable to large-scale data tasks. It is often desirable to run these algorithms on large data sets (e.g. larger than 1 TB) that exceed the computational power of one computer [Guo 2005]. These tasks are typically performed on a distributed system by dividing the task across one or more of the following: algorithm parameters, images, or pixels [White et al. 2010]. Performing tasks across a particular parameter is embarrassingly parallel and can often be perfectly parallelized; face detection and landmark classification are examples of such algorithms [Li and Crandall 2009; Liu et al. 2009]. The ability to parallelize such tasks allows for scalable, efficient execution of resource-intensive applications, and the MapReduce framework provides a platform for such applications.

Basic vision applications that utilize Hadoop's MapReduce framework face a staggering learning curve and overwhelming complexity [White et al. 2010]. The overhead required to implement such applications severely cripples the progress of researchers [White et al. 2010; Li and Crandall 2009]. HIPI removes the highly technical details of Hadoop's system and provides users with the familiar feel of an image library combined with access to the advanced resources of a distributed system [Dean and Ghemawat 2008; Apache 2010]. Our platform is focused on giving users unprecedented access to image-based data structures through a pipeline that maps intuitively onto the MapReduce system, allowing easy and flexible use for vision applications.
Because of the similar goals of our frameworks, we take particular inspiration from the Frankencamera project [Adams et al. 2010] as a model for designing an open API that provides access to computing resources. We have designed HIPI with the aim of providing a platform specific enough to constitute a relevant framework for all image processing and computer vision applications, yet flexible enough to withstand continual changes and improvements within Hadoop's MapReduce system. We expect HIPI to promote vision research for these reasons. HIPI is largely a software design project, driven by the overarching goals of abstracting Hadoop's functionality into an image-centric system and providing an extendible system that gives researchers a tool to effectively use Hadoop's MapReduce system for image processing and computer vision.

We believe that this ease of use and user control will streamline the process of creating large-scale vision experiments and applications. As a result, HIPI serves as an excellent tool for researchers in computer vision because it makes the development of large-scale computer vision applications more accessible than ever. To our knowledge, we are the first group to provide an open interface for image processing and computer vision applications on Hadoop's MapReduce platform [White et al. 2010].

In the following section, we describe previous work in this area; in particular, we discuss the motivation for creating an image-based framework that enables large-scale vision applications. Next, we give an overview of the HIPI library, including the cull, map, and reduce stages, and describe our approach to distributing tasks across the MapReduce pipeline. Finally, we demonstrate the capabilities of HIPI with two example vision applications.

2 Prior Work

With the proliferation of online photo storage and social media on websites such as Facebook, Flickr, and Picasa, the amount of image data available is larger than ever before and growing more rapidly every day [Facebook 2010]. This alone provides an incredible database of images that can scale up to billions of images. Powerful statistical and probabilistic models can be built from such a large sample source. For instance, a database of all the textures found in a large collection of images could be built and used by researchers or artists. Such information can be incredibly helpful for understanding relationships in the world; one can imagine applications such as object detection revealing the relationships between certain objects (e.g. that bumblebees are often in pictures of flowers). If a picture is worth a thousand words, we could write an encyclopedia with the billions of images available to us on the internet.

These images are further enriched by the fact that users supply tags (of objects, faces, etc.), comments, titles, and descriptions of this data for us. This information gives us an unprecedented amount of context for images. Problems such as OCR that remain largely unsolved can make bigger strides with this available context guiding them. Stone et al. describe in detail how social networking sites can leverage facial tagging features to significantly enhance facial recognition. This idea can be applied to a wider range of image features that allow us to examine and analyze images in a revolutionary way.

3 The HIPI Framework

HIPI was created to empower researchers and present them with a capable tool that makes research involving image processing and vision extremely easy to perform. Knowing that HIPI would be used by researchers and as an educational tool, we designed it with the following goals in mind:

1. Provide an open, extendible library for image processing and computer vision applications in a MapReduce framework.
2. Store images efficiently for use in MapReduce applications.
3. Allow for simple filtering of a set of images.
4. Present users with an intuitive interface for image-based operations and hide the details of the MapReduce framework.
5. Set up applications so that they are highly parallelized and balanced, so that users do not have to worry about such details.

3.1 Data Storage

Hadoop uses a distributed file system (HDFS) to store files on various machines throughout the cluster.
Hadoop allows files to be accessed without knowledge of where they are stored in the cluster, so users can reference files the same way they would on a local machine and Hadoop will present the file accordingly. When performing MapReduce jobs, Hadoop attempts to run map and reduce tasks on the machines where the data being processed is located, so that data does not have to be copied between machines [Apache 2010]. As such, MapReduce tasks run more efficiently when the input is one large file as opposed to many small files (that is, files considerably smaller than the HDFS block size of the machine where the file resides). Large files are significantly more likely to be stored on one machine, whereas many small files will likely be spread out among many different machines, which requires significant overhead to copy all the data to the machines where the map tasks run [Ghemawat and Gobioff 2003]. This overhead can slow the runtime by ten to one hundred times [White 2010]. Simply put, the MapReduce framework operates more efficiently when the data being processed is local to the machines performing the processing. It is these reasons that motivate the need for vision research that takes advantage of large sets of images.

MapReduce provides an extremely powerful framework that works well for data-intensive applications in which the processing model is similar or identical across the data. This is often the case with image-based operations, where we perform similar operations throughout an input set, making MapReduce ideal for image-based applications. However, many researchers find it impractical to collect a meaningful set of images relevant to their studies [Guo 2005], and many do not have efficient ways to store and access such a set of images. As a result, little research has been performed on extremely large image sets.

Figure 2: A depiction of the relationship between the index and data files in a HipiImageBundle.

With this in mind, we created a data type, the HipiImageBundle, that stores many images in one large file so that MapReduce jobs can be performed more efficiently. A HipiImageBundle consists of two files: a data file containing concatenated images and an index file containing information about the offsets of images in the data file, as shown in Figure 2. This setup allows us to easily access images across the entire bundle without having to read in every image.

We observed several benefits of the HipiImageBundle in tests against Hadoop's SequenceFile and Hadoop Archive (HAR) formats. As White et al. note, HARs are only useful for archiving files (as backups) and may actually perform slower than reading in files the standard way. SequenceFiles perform better than standard approaches for small files, but must be read serially and take a very long time to generate. HipiImageBundles have similar speeds to SequenceFiles, do not have to be read serially, and can be generated with a MapReduce program [White 2010; Conner 2009]. Additionally, HipiImageBundles are more customizable and are mutable, unlike SequenceFiles and HAR files. For instance, we have implemented the ability to read only the header of an image file in a HipiImageBundle, which would be considerably more difficult with other file types. Further features of HipiImageBundles are highlighted in the following section.

3.2 Image-based MapReduce

Standard Hadoop MapReduce programs handle input and output of data very effectively, but struggle to represent images in a format that is useful for researchers. Current methods involve significant overhead to obtain a standard image representation. For instance, to distribute a set of images to a set of map nodes, a user would have to pass the images as strings and then decode each image in every map task before being able to access pixel information. This is not only inefficient but inconvenient. Such chores create headaches for users and make code cluttered and its intent difficult to interpret; as a result, shared code is harder to read and harder to debug. Our library focuses on bringing familiar image-based data types directly to the user for easy use in MapReduce applications.

Using the HipiImageBundle as input, we have created an input specification that distributes the images in a bundle across all map nodes, attempting to maximize locality between the mapper machines and the machines where the images reside. Typically, a user would have to create InputFormat and RecordReader classes that specify how the MapReduce job will distribute the input and what information gets sent to each machine. This task is nontrivial and often becomes a large source of headaches for users. We include InputFormat and RecordReader implementations that take care of this for the user. Our specification works on HipiImageBundles containing various image types and sizes with varying amounts of header and EXIF information. We handle all of these different image permutations behind the scenes to bring images straight to the user. No work needs to be done by the user, and images are brought directly to the map tasks in a highly parallelized fashion.

During the distribution of inputs, but before the map tasks start, we introduce a culling stage to the MapReduce pipeline. The culling stage allows images to be filtered based on image properties. The user specifies a culling class that describes how the images will be filtered (e.g. pictures smaller than 10 megapixels, or pictures with GIS location header data). Only images that pass the culling stage are distributed to the map tasks, preventing unnecessary copying of data. This process is often very efficient because culling is typically based on image header information, so the entire image does not need to be read. A sketch of what such a culling class might look like appears below.
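The following is a minimal sketch of such a culling class, filtering on header data alone. All of the names here (the ImageHeader stand-in and the accept method) are illustrative assumptions rather than HIPI's actual API:

// A minimal, hypothetical HIPI-style culling filter. The ImageHeader
// stand-in and the accept method are illustrative, not HIPI's real API.
public class LocationMegapixelCuller {

    /** Stand-in for the header metadata a bundle can expose per image. */
    public static class ImageHeader {
        public int width;
        public int height;
        public boolean hasLocationData; // e.g., parsed from EXIF tags
    }

    /**
     * Decide from header data alone whether an image should be passed
     * on to the map tasks; the pixel data never has to be decoded.
     */
    public boolean accept(ImageHeader header) {
        long pixels = (long) header.width * header.height;
        // Keep only images smaller than 10 megapixels that carry
        // location metadata, mirroring the example filters in the text.
        return pixels < 10_000_000L && header.hasLocationData;
    }
}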
Additionally, images are distributed to the map tasks as decoded float images so that users immediately have access to pixel values for image processing and vision operations. Images are always stored as standard image types (e.g. JPEG, PNG, etc.) for efficient storage, but HIPI takes care of encoding and decoding to present the user with usable images within the MapReduce pipeline. As a result, programs such as calculating the mean value of all pixels in a set of images can be written in merely a few lines (see the sketch at the end of this section). We also provide operations such as cropping for image patch extraction. Because it is often desirable to access image header and EXIF information without needing the pixel data, we have separated this metadata from the pixel data. This is particularly useful for the culling stage and for applications such as im2gis that need access to metadata. Presenting users with intuitive interfaces for accessing data relevant to image processing and vision applications allows for more efficient creation of MapReduce applications.

Figure 3: The user only needs to specify a HipiImageBundle as input; HIPI takes care of parallelizing the task and sending images to the mappers behind the scenes.
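To make the brevity claim concrete, here is a rough sketch of a mean-pixel-value job written against the plain Hadoop MapReduce API. HIPI would hand each map task an already-decoded image; to keep the sketch self-contained we decode raw image bytes with javax.imageio instead, so the input types and key choices here are our own assumptions, not HIPI's interface:

import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Arrays;
import javax.imageio.ImageIO;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MeanPixelValue {

    // Map: decode one image and emit the mean of its pixel intensities
    // under a single shared key, so one reducer sees all the values.
    public static class MeanMapper
            extends Mapper<IntWritable, BytesWritable, IntWritable, DoubleWritable> {
        private static final IntWritable SHARED_KEY = new IntWritable(0);

        @Override
        protected void map(IntWritable key, BytesWritable value, Context context)
                throws IOException, InterruptedException {
            byte[] data = Arrays.copyOf(value.getBytes(), value.getLength());
            BufferedImage img = ImageIO.read(new ByteArrayInputStream(data));
            if (img == null) return; // skip bytes that do not decode to an image
            long sum = 0;
            for (int y = 0; y < img.getHeight(); y++)
                for (int x = 0; x < img.getWidth(); x++)
                    sum += img.getRGB(x, y) & 0xFF; // blue channel as a simple intensity proxy
            double mean = sum / (double) ((long) img.getWidth() * img.getHeight());
            context.write(SHARED_KEY, new DoubleWritable(mean));
        }
    }

    // Reduce: average the per-image means. For simplicity this is the mean
    // of per-image means; weighting each image by its pixel count would
    // give the exact mean over all pixels.
    public static class MeanReducer
            extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {
        @Override
        protected void reduce(IntWritable key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double total = 0;
            long n = 0;
            for (DoubleWritable v : values) { total += v.get(); n++; }
            context.write(key, new DoubleWritable(total / n));
        }
    }
}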

4 Examples

We now describe two non-trivial applications implemented as MapReduce jobs with the HIPI framework. These applications are indicative of the kinds of large-scale image operations that HIPI enables users to perform easily; they are difficult and inefficient to implement on existing platforms, yet simple to implement with the HIPI API.

4.1 Principal Components of Natural Images

As an homage to Hancock et al., we computed the first 15 principal components of natural images. However, instead of randomly sampling one patch from each of 15 images, we sampled over 1,000 images with 100 patches in each; the size of our input set was 10,000 times larger than that of the original experiment. Additionally, we did not limit our input to natural images as the original experiment did (though we could have done so in the cull stage). As such, our results differ but hold similar characteristics. We parallelize the computation of the covariance matrix for the patches according to the following formula, where x_i is a sample patch and \bar{x} is the sample patch mean:

COV = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T    (1)

Equation 1 suits HIPI perfectly because the summation is embarrassingly parallel: assuming we already have the mean, each term can be computed independently, and thus in parallel. We first run a MapReduce job to compute the sample mean, then use it as \bar{x} in the covariance calculation. Next, we run a MapReduce job that computes (x_i - \bar{x})(x_i - \bar{x})^T for all 100 patches from each image; we call this partial sum the partial covariance. Because HIPI allocates one image per map task, it is simple to randomly sample 100 patches from an image and perform this calculation. Each map task then emits the summation of the partial covariances sampled from its image to the reducer, where all partial covariances are summed to produce the covariance for the entire sample set.
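The per-image work in this job reduces to an outer-product accumulation. The sketch below shows that computation in plain Java, assuming the patches have already been extracted as flattened vectors and the mean has been computed by the earlier job; the class and method names are ours, not HIPI's:

/** Per-image work for the covariance job of Section 4.1 (a sketch). */
public class PartialCovariance {

    /**
     * Accumulate sum_i (x_i - mean)(x_i - mean)^T over one image's
     * patches, where each patch is a flattened pixel vector. The
     * reducer sums these partial covariances over all images and
     * divides by n - 1 to obtain Equation 1.
     */
    public static double[][] partialCovariance(double[][] patches, double[] mean) {
        int d = mean.length;
        double[][] acc = new double[d][d];
        for (double[] patch : patches) {
            double[] diff = new double[d];
            for (int k = 0; k < d; k++) {
                diff[k] = patch[k] - mean[k];
            }
            // Rank-1 update: add the outer product diff * diff^T.
            for (int r = 0; r < d; r++) {
                for (int c = 0; c < d; c++) {
                    acc[r][c] += diff[r] * diff[c];
                }
            }
        }
        return acc;
    }
}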
After determining the covariance for 100,000 randomly sampled patches, we used Matlab to find the first 15 principal components. As expected, the results do not correlate perfectly with the original experiment because our inputs differ substantially, and our display of positive and negative values may also differ slightly. However, certain principal components are the same (1, 7, 12), merely switched (2 and 3, 4 and 5), or show some resemblance to the original experiment (15). Performing a principal component analysis on a massive, unrestricted data set gives us unparalleled knowledge about images. For tasks such as these, HIPI excels.

Figure 4: The first 15 principal components of patches from randomly sampled natural images, as observed by Hancock et al., ordered left to right and top to bottom.

Figure 5: The first 15 principal components of 100,000 randomly sampled image patches, as calculated with HIPI, ordered left to right and top to bottom.

4.2 Downloading Millions of Images

Step 1: Specify a list of images to collect. We assume that there exists a well-formed list containing the URLs of images to download, stored in a text file with exactly one image URL per line. This list can be generated by hand, from MySQL, or from a search query (e.g. Google, Flickr, etc.). In addition to the list of images, the user inputs the number of nodes on which to run the task. According to this input, we divide the image set across the specified number of nodes for maximum efficiency and parallelism when downloading the images. Each map node generates a HipiImageBundle containing all of the images it downloaded, and the reducer then merges all the bundles together to form one large HipiImageBundle.

Figure 6: A demonstration of the parallelization in the Downloader application. The task of downloading the list of URLs is split amongst n map tasks. Each mapper creates a HipiImageBundle, and the bundles are merged into one large bundle in the reduce phase.

Step 2: Split the URLs into groups and send each group to a mapper. Using the input list of image URLs and the number of nodes, we distribute the task of downloading equally across the specified number of map nodes, allowing maximum parallelization of the downloading process. Image URLs are distributed to the various nodes equally, and each map task begins downloading the images at the URLs it is responsible for, as Figure 6 shows.

Step 3: Download images from the internet. We establish a connection to each URL retrieved from the list and download the image using Java's URLConnection class. Once connected, we check the file type to make sure it is a valid image and get an InputStream to the connection. From this, we can use the InputStream to add the image to a HipiImageBundle.

Step 4: Store the images in a HipiImageBundle. Once the InputStream is received from the URLConnection, we can add the image to a HipiImageBundle simply by passing the InputStream to the addImage method; a sketch of this per-URL logic appears at the end of this section. Each map task generates a HipiImageBundle, and the reduce phase merges all of the bundles together into one large bundle. By storing images this way, you are able to take advantage of the HIPI framework for MapReduce tasks that you want to perform on the image set at a later point. For example, to check the results of the Downloader program, we ran a very simple MapReduce program (7 lines) that took the bundle and wrote the images out to individual JPEG files on HDFS effortlessly.
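The following is a minimal sketch of that per-URL logic. The URLConnection usage follows the steps above; the bundle type and its addImage method are named after the text, but their exact signatures are assumptions, so a stand-in interface is declared here:

import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class DownloaderSketch {

    /** Hypothetical stand-in for the image bundle's write interface. */
    public interface ImageBundle {
        void addImage(InputStream stream) throws Exception;
    }

    /** Step 3 and Step 4 for a single URL: connect, validate, store. */
    public static void download(String imageUrl, ImageBundle bundle) {
        try {
            URLConnection conn = new URL(imageUrl).openConnection();
            // Check the file type to make sure it is a valid image.
            String type = conn.getContentType();
            if (type == null || !type.startsWith("image/")) {
                return;
            }
            // Get an InputStream to the connection and pass it to the
            // bundle, which adds the image to its data and index files.
            try (InputStream in = conn.getInputStream()) {
                bundle.addImage(in);
            }
        } catch (Exception e) {
            // Skip unreachable or malformed URLs; a real job might
            // count or log these instead.
        }
    }
}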

5 Conclusion

This paper has described our library for image processing and vision applications on a MapReduce framework: the Hadoop Image Processing Interface (HIPI). The library was carefully designed to hide the complex details of Hadoop's powerful MapReduce framework and bring to the forefront what users care about most: images. Our system was created with the intent of operating on large sets of images. We provide a format for storing images for efficient access within the MapReduce pipeline, along with simple methods for creating such files. By providing a culling stage before the mapping stage, we give the user a simple way to filter image sets and control the types of images used in their MapReduce tasks. Finally, we provide image encoders and decoders that run behind the scenes and present the user with the image types that are most useful for image processing and vision applications. Through these features, our interface brings a new level of simplicity to creating large-scale vision applications, with the aim of empowering researchers and teachers with a tool for efficiently creating image-centric MapReduce applications. This paper has described two example applications built with HIPI that demonstrate the power it offers users. We hope that by bringing the resources and power of MapReduce to the vision community, we will enhance its ability to create new vision projects and push the field of computer vision forward.

6 Acknowledgements

We would like to give particular thanks to PhD candidate Sean Arietta for his guidance and mentoring throughout this project. His leadership and vision have been excellent models and points of learning for us throughout this process. Additionally, we must give great thanks to Assistant Professor Jason Lawrence for his support throughout the past several years and for welcoming us into UVa's Graphics Group as bright-eyed undergraduates.

References

ADAMS, A., JACOBS, D., DOLSON, J., TICO, M., PULLI, K., TALVALA, E., AJDIN, B., VAQUERO, D., LENSCH, H., AND HOROWITZ, M. 2010. The Frankencamera: an experimental platform for computational photography. ACM SIGGRAPH 2010 Papers.

APACHE, 2010. Hadoop MapReduce framework.

CONNER, J. 2009. Customizing input file formats for image processing in Hadoop. Arizona State University. Online at: asu.edu/node/97.

DEAN, J., AND GHEMAWAT, S. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1.

FACEBOOK, 2010. Facebook image storage.

GHEMAWAT, S., GOBIOFF, H., ET AL. 2003. The Google File System. ACM SIGOPS Operating Systems Review (Jan).

GUO, G., ET AL. 2005. Learning from examples in the small sample case: face expression recognition. Systems (Jan).

HANCOCK, P., BADDELEY, R., AND SMITH, L. 1992. The principal components of natural images. Network: Computation in Neural Systems 3, 1.

LI, Y., CRANDALL, D., ET AL. 2009. Landmark classification in large-scale image collections. Computer Vision (Jan).

LIU, K., LI, S., TANG, L., WANG, L., ET AL. 2009. Fast face tracking using parallel particle filter algorithm. Multimedia and Expo (Jan).

STONE, Z., ZICKLER, T., ET AL. 2010. Toward large-scale face recognition using social network context. Proceedings of the IEEE (Jan).

WHITE, B., YEH, T., LIN, J., AND DAVIS, L. 2010. Web-scale computer vision using MapReduce for multimedia data mining. Proceedings of the Tenth International Workshop on Multimedia Data Mining.

WHITE, 2010. The small files problem.
