Hadoop MapReduce for Remote Sensing Image Analysis

Mohamed H. Almeer


Computer Science & Engineering Department, Qatar University, Doha, Qatar

Abstract: Image processing algorithms related to remote sensing have been tested and utilized on the Hadoop MapReduce parallel platform, using an experimental 112-core high-performance cloud computing system situated in the Environmental Studies Center at Qatar University. Although there has been considerable research into using the Hadoop platform for image processing rather than for its original purpose of text processing, it had not previously been demonstrated that Hadoop can successfully handle high volumes of image files. Hence, we investigate the use of Hadoop for image processing with eight practical image processing algorithms. We extend the file approach in Hadoop to treat a whole TIFF image file as a single unit by expanding the input format Hadoop uses, and we then apply the same approach to other image formats such as JPEG, BMP, and GIF. Experiments show that the method is scalable and efficient in processing multiple large images of the kind used mostly in remote sensing applications, and that the difference between single-PC and Hadoop run times is clearly noticeable.

Keywords: Cloud, Hadoop, HPC, Image processing, MapReduce.

I. INTRODUCTION

The remote sensing community has recognized the challenge of processing large and complex satellite datasets to derive customized products, and several efforts have been made in the past few years towards incorporating high-performance computing models. This study analyzes recent advancements in distributed computing technologies, as embodied in the MapReduce programming model, and extends them to the processing of remote sensing images collected over Qatar. We make two contributions: alleviating the scarcity of parallel algorithms for processing large numbers of images with the parallel Hadoop MapReduce framework, and implementing such algorithms for Qatari remote sensing images. Our research aims to find an efficient programming method for customized processing within the Hadoop MapReduce framework and to determine how it can be implemented. Performance tests for processing large archives of Landsat images were performed with Hadoop. The following eight image analysis algorithms were used: Sobel filtering, image resizing, image format conversion, auto contrasting, image sharpening, picture embedding, text embedding, and image quality inspection. The findings demonstrate that MapReduce has potential for processing large-scale remotely sensed images and for solving more complex geospatial problems.

Current approaches to image processing depend on handling a small number of images sequentially. Such workloads can almost always fit on a single computer equipped with a relatively small memory. Still, more disk space is needed to store the large-scale image repositories that typically result from satellite-collected data. The large-scale TIFF images of the Qatar Peninsula taken from satellites are used in this image processing approach. Other image formats, such as BMP, GIF, and JPEG, can also be handled by the Hadoop Java programming model developed here; that is, the model functions on those formats as well as on the original TIFF. The developed approach can also convert formats other than TIFF to JPEG.
II. RELATED WORKS

Kocakulak and Temizel [3] used Hadoop and MapReduce to perform ballistic image analysis, which requires a large database of images to be compared with an unknown image. They used a correlation method to compare the library of images with the unknown image. The process imposed a high computational demand, but its processing time was reduced dramatically when 14 computational nodes were utilized. Li et al. [4] used a parallel clustering algorithm with Hadoop and MapReduce, intended to reduce the time taken by clustering algorithms applied to large numbers of satellite images. The process starts by assigning each pixel to its nearest cluster and then recalculates all the cluster centers from the pixels in each cluster set. Another clustering of remote sensing images has been reported by Lv et al. [2], using a parallel K-means approach: objects with similar spectral values were clustered together without any prior knowledge.

The parallel environment was assisted by the Hadoop MapReduce approach, since the execution of this algorithm is both time- and memory-consuming. Golpayegani and Halem [5] used a Hadoop MapReduce framework to operate on a large number of Aqua satellite images collected by the AIRS instruments; a gridding of satellite data is performed using this parallel approach. In all those cases, as the volume of image data grew, the parallel algorithm conducted by MapReduce exhibited superiority over a single-machine implementation, and with higher-performance hardware this superiority was even better reflected. Research on image processing on top of the Hadoop environment is a relatively new field, particularly for satellite images, and the number of successful approaches so far has been small. That is why we decided to take eight techniques from the ordinary discipline of image processing and apply them to remote sensing data on Qatar, by means of a Java program implemented in a Hadoop MapReduce environment within a cluster of computing machines.

III. HADOOP-HDFS-MAPREDUCE

Hadoop is an Apache software project that includes challenging subprojects such as the MapReduce implementation and the Hadoop distributed file system (HDFS), which is similar to the main Google file system implementation. In our study, as in others, the MapReduce programming model works closely with this distributed file system. HDFS is characterized as a highly fault-tolerant distributed file system that can store a large number of very large files on cluster nodes. MapReduce is built on top of the HDFS file system but is independent of it. HDFS is an open source implementation of the Google file system (GFS). Although it appears as an ordinary file system, its storage is actually distributed among data nodes in different clusters. HDFS follows a master-slave architecture: the master node (name node) provides data service and access permission to the slave nodes, while the slaves (data nodes) serve as the storage for HDFS. Large files are distributed and further divided among multiple data nodes. The map jobs located on the nodes operate on their local copies of the data. The name node stores only the metadata and the log information, while data transfer to and from HDFS is done through the Hadoop API.

MapReduce is a parallel programming model for processing large amounts of data on cluster computers with unreliable and weak communication links. MapReduce is based on the scale-out principle, which involves clustering a large number of desktop computers. The main point of MapReduce is to move computation to the data nodes, rather than bring data to the computation nodes, and thus fully exploit data locality. The code that divides work, exerts control, and merges output in MapReduce is hidden entirely from the application user inside the framework. In fact, most parallel applications can be implemented in MapReduce as long as synchronized and shared global states are not required. MapReduce performs the computation in two stages: the map stage and then the reduce stage. The data are split into sets of key-value pairs, whose instances are processed in parallel by the map stage, with the degree of parallelism matching the number of nodes dedicated as slaves.
This process generates intermediate key-value pairs that are temporary and can later be directed to reduce stages. Within map stages and reduce stages, the processing is conducted in parallel; the map and reduce stages themselves occur sequentially, with the reduce stage starting when the map stage finishes.

IV. NON-HADOOP APPROACH

Current image processing typically proceeds in an ordinary sequential way: a program loads image after image, processing each one alone before writing the newly processed image to a storage device. Generally, very ordinary tools are used, such as those found in Photoshop; in addition, many ordinary C and Java programs can be downloaded from the Internet or easily developed to perform such image processing tasks. Most of these tools run on a single computer with a Windows operating system. Although batch processing exists in these single-processor programs, their limited capabilities cause problems at scale. Therefore, a new parallel approach is needed to work effectively on massed image data.

V. HADOOP APPROACH TO IMAGE PROCESSING

In order to process a large number of images effectively, we use the Hadoop HDFS to store a large amount of remote sensing image data, and we use MapReduce to process these in parallel.

The advantages of this approach are three abilities: 1) to store and access images in parallel on a very large scale, 2) to perform image filtering and other processing effectively, and 3) to customize MapReduce to support image formats such as TIFF. Each file is visited pixel by pixel but accessed whole, as one record. The main attraction of this project is the distributed processing of large satellite images, using a MapReduce model to solve the problem in parallel. In past decades, the remote sensing data acquired by satellites have greatly assisted a wide range of applications studying the surface of the Earth. Moreover, these data are available in the form of 2D images, which allows them to be manipulated easily using computers. Remote sensing image processing consists of applying various algorithms to extract information from images and analyze complex scenes. In this study, we focus on spatial transformations such as contrast enhancement, brightness correction, and edge detection, which operate on the image data space. A common property of such transformations is that new pixel values are generated by mathematical operations on the current pixel and its surrounding ones. The spatial transformations used here are characterized as local transforms operating within a small neighbourhood of a pixel.

VI. SOFTWARE IMPLEMENTATION

In this project, it is not our intention to use both mappers and reducers simultaneously. We simply use each map job to read an image file, process it, and finally write the processed image file back into an HDFS storage area. The name of the image file is considered the Key and its byte content is considered the Value. However, the output pair directed to the reducer job will not be used, because reducers are not used at all: what we need can be accomplished in the map phase only. This scheme applies to all eight image processing algorithms used in this research, except the algorithm that detects image quality. There, the Key and the Value for the input parameters of the Map job remain the same, but the output Key is used exclusively: it is assigned the name of the corrupted image file. Reducers in this exceptional case collect all the file names that have undergone defect detection and group them into a single output file stored in the HDFS output storage area. In order to customize the MapReduce framework, which is essentially designed for text processing, some changes in Java classes are inevitable for image processing needs. The first thing to consider when using Hadoop for image processing algorithms is how to customize its data types, classes, and functions to read, write, and process binary image files as sequential Java programs on a PC would. We can do that by adapting Hadoop to carry images in the BytesWritable class instead of the Text class, and by stopping the splitting of input files into chunks of data.

A. WholeFileInput Class and Splitting

In Hadoop, we use a WholeFileInputFormat class that allows us to deal with an image as a whole file, without splitting it. This is a critical modification, since the contents of a file must be processed by one Map instance only; maps should therefore have access to the full contents of a file. This can be achieved by changing the RecordReader class so that it delivers the file contents as the value of the record.
This change can be seen in the WholeFileInputFormat class, which defines a format in which the keys are not used (represented by NullWritable) and the values are the file contents (represented by BytesWritable instances). The alterations define two methods. First, isSplitable() is overridden to return false, ensuring that input files are never split. Second, getRecordReader() returns a custom implementation of the usual RecordReader class. While the InputFormat is responsible for creating the input splits and dividing them into records, the InputSplit is a Java interface representing the data to be processed by one Mapper. The RecordReader reads the input splits and divides each file into records; the record size equals the size of the split specified by the InputSplit interface, which also specifies the key and value for every record. Splitting of image files is not acceptable, since each split would contain only a fragment of the image's pixels. Splitting the binary file of a whole image would cause a loss of image features, leading to image noise or a poor-quality representation. Therefore, we enforce that files are processed in parallel, each as a single record. Controlling splitting is important so that a single Mapper processes a single image file; this control is implemented by overriding the isSplitable() method to return false, thus ensuring that each file is handled by one Mapper only and represented by one record only.
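
A minimal sketch of these two overrides, written against the old org.apache.hadoop.mapred API that the paper's other listings assume; it follows the well-known whole-file input pattern, and the class and helper names here are illustrative rather than the authors' exact code.

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.*;

// Input format that refuses to split files and hands each whole file
// to a single Mapper as one (NullWritable, BytesWritable) record.
public class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;  // never split: one image file = one record = one map
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }
}

class WholeFileRecordReader implements RecordReader<NullWritable, BytesWritable> {
    private final FileSplit split;
    private final JobConf conf;
    private boolean processed = false;

    WholeFileRecordReader(FileSplit split, JobConf conf) {
        this.split = split;
        this.conf = conf;
    }

    public boolean next(NullWritable key, BytesWritable value) throws IOException {
        if (processed) return false;  // each file yields exactly one record
        byte[] contents = new byte[(int) split.getLength()];
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length); // whole image as the value
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }

    public NullWritable createKey() { return NullWritable.get(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() { return processed ? split.getLength() : 0; }
    public void close() { }
    public float getProgress() { return processed ? 1.0f : 0.0f; }
}
```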

Now each mapper is guaranteed to read a single image file, process it, and write the newly processed image back into the HDFS.

B. Mappers

The mapper containing our image processing algorithms is applied to every input image, represented by a key-value pair, to generate an intermediate image and store it back into the HDFS. The key of the Mapper is NullWritable (null) and the value is of BytesWritable type, which ensures that images are read one record at a time into every mapper. All the processing algorithms are carried out in the Mapper, and no reducer is required. A typical Map function for one of the image filtering algorithms is listed in Table I.

TABLE I
MAP FUNCTION

Signature: public void map(NullWritable key, BytesWritable value, OutputCollector<Text, Text> output, Reporter reporter); emitted output: (Null, Null).

1. Set the input path and the output path.
2. Set the input format to the WholeFileInputFormat class and the output format to the NullOutputFormat class.
3. The image is read from the value passed to the Mapper; the Mapper reads the image bytes into a ByteArrayInputStream.
4. Read the source BufferedImage using ImageIO.read.
5. Create an FSDataOutputStream instance variable to write images directly to the FileSystem.
6. Create the destination BufferedImage with height and width equal to those of the source.
7. Create a Kernel instance, specify the matrix size as 3 × 3, and set an array of floats holding the kernel values.
8. Create a BufferedImageOp instance and pass the kernel to the ConvolveOp constructor.
9. Call the filter method of BufferedImageOp to apply the filter to the image.
10. Encode the destination image in JPEG format using the JPEGEncoder.
11. Store the image in the HDFS.

C. Reducers

The algorithm that tests image quality, newly developed in our experimentation, has a different MapReduce implementation. The Mapper task in this implementation is to obtain the names of processed images and pass them through a key-value pair to the Reducers. The key-value pair generated by each Mapper is Null together with the name of any file in which a degree of content corruption is detected and evaluated. The reducer collects the file names and joins them into a single output file. The following other classes were altered:

RecordReader: another class customized for image file access. It is responsible for loading the data from its source and converting it into (key, value) pairs suitable for Mapper processing. By altering it to read a whole input image as one record passed to the Mappers, we ensure the image content is not split as text files usually are.

BufferedImage: introduced here as a way to contain an image file and make its pixels accessible. It is a subclass of Image with an accessible buffer of image data, comprising a ColorModel and a Raster of image data. In order to process an image, a BufferedImage must be used to read it.

JPEGImageEncoder: used extensively to encode buffers of image data as JPEG data streams, the preferred output format for storing images. Users of this interface provide image data in a Raster or a BufferedImage, set the necessary parameters in the JPEGEncodeParams object, and open the OutputStream that is the destination of the encoded JPEG stream. The JPEGImageEncoder interface can encode image data as abbreviated JPEG data streams written to the OutputStream provided to the encoder.
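
The steps of Table I translate into Java roughly as follows. This is a sketch under stated assumptions: it keeps the old mapred API used above, substitutes javax.imageio.ImageIO for the deprecated com.sun JPEGImageEncoder named in the text (TIFF decoding additionally requires an ImageIO plug-in on older JREs), and the output directory property and nanoTime-based file naming are inventions for illustration only.

```java
import java.awt.image.BufferedImage;
import java.awt.image.BufferedImageOp;
import java.awt.image.ConvolveOp;
import java.awt.image.Kernel;
import java.io.ByteArrayInputStream;
import java.io.IOException;

import javax.imageio.ImageIO;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.NullOutputFormat;

public class SharpenMapper extends MapReduceBase
        implements Mapper<NullWritable, BytesWritable, Text, Text> {

    // Illustrative 3 x 3 sharpening kernel: boosts the center pixel and
    // subtracts the four direct neighbours (coefficients sum to 1).
    private static final float[] SHARPEN = {
         0f, -1f,  0f,
        -1f,  5f, -1f,
         0f, -1f,  0f
    };

    private JobConf conf;

    @Override
    public void configure(JobConf job) { this.conf = job; }

    public void map(NullWritable key, BytesWritable value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Steps 3-4: decode the whole-file bytes into a BufferedImage.
        BufferedImage src = ImageIO.read(
                new ByteArrayInputStream(value.getBytes(), 0, value.getLength()));

        // Steps 6-9: convolve the source image with the sharpening kernel.
        BufferedImageOp op = new ConvolveOp(
                new Kernel(3, 3, SHARPEN), ConvolveOp.EDGE_NO_OP, null);
        BufferedImage dst = op.filter(src, null);

        // Steps 5, 10-11: encode as JPEG and write straight back into HDFS.
        Path out = new Path(conf.get("sharpen.output.dir", "/output"),
                            "sharpened-" + System.nanoTime() + ".jpg");
        FileSystem fs = out.getFileSystem(conf);
        FSDataOutputStream os = fs.create(out);
        try {
            ImageIO.write(dst, "jpg", os);
        } finally {
            os.close();
        }
        // Nothing is collected: the job runs with zero reducers.
    }

    // Minimal driver (steps 1-2): whole-file input, no framework output.
    public static void main(String[] args) throws IOException {
        JobConf job = new JobConf(SharpenMapper.class);
        job.setJobName("image-sharpen");
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        job.set("sharpen.output.dir", args[1]);
        job.setInputFormat(WholeFileInputFormat.class);
        job.setOutputFormat(NullOutputFormat.class);
        job.setMapperClass(SharpenMapper.class);
        job.setNumReduceTasks(0);
        JobClient.runJob(job);
    }
}
```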
VII. IMAGE PROCESSING ALGORITHMS USED IN REMOTE SENSING

A. Color Sharpening

Sharpening is one of the most frequently used transformations applied to an image for enhancement, and we use ordinary sharpening methods here to bring out image details that were not apparent before. Image sharpening enhances both the intensity and the edges of an image to improve its perceived quality.

The sharpening code is implemented using convolution, an operation that computes each destination pixel by multiplying the source pixel and its neighbors by a convolution kernel. The kernel is a linear operator that describes how a specified pixel and its surrounding pixels affect the value computed in the destination image of a filtering operation. Specifically, the kernel used here is a 3 × 3 matrix of floating-point numbers. When the convolution is performed, this 3 × 3 matrix is used as a sliding mask over the pixels of the source image. To compute the result of the convolution for a pixel located at coordinates (x, y) in the source image, the center of the kernel is positioned at those coordinates, and the kernel values are multiplied with their corresponding color values in the source image.

B. Sobel Filter

The Sobel edge filter is used to detect edges and is based on applying horizontal and vertical filters in sequence. Both filters are applied to the image and their results combined to form the final output. Edges characterize boundaries and are therefore fundamental in image processing; edges in images are areas with strong intensity contrasts. Detecting the edges in an image significantly reduces the amount of data and filters out useless information while preserving the important structural properties of the image. The Sobel operator applies a 2-D spatial gradient measurement to an image; it is used to find the approximate gradient magnitude at each point of an input grayscale image:

$$G_x = \frac{\partial f}{\partial x}, \qquad G_y = \frac{\partial f}{\partial y} \tag{1}$$

$$G = \sqrt{G_x^2 + G_y^2} \tag{2}$$

The discrete form of the Sobel operator is a pair of 3 × 3 convolution kernels, one estimating the gradient in the x-direction (columns) and the other estimating the gradient in the y-direction (rows):

$$G_x: \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix} \qquad G_y: \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$$

The Gx kernel responds to intensity changes in the horizontal direction, while the Gy kernel responds to changes in the vertical direction. The overall magnitude G combines the two outputs, detects edges in both directions, and becomes the brightness value of the output image. In detail, the kernel mask is slid over an area of the input image, changes the value of the center pixel, and is then shifted one pixel to the right; it continues to the right until it reaches the end of the row and then starts at the beginning of the next row. The center of the mask is placed over the pixel to be manipulated, and the (i, j) offsets address the neighbours so that the multiplications can be performed. After taking the magnitude of both kernel responses, the resulting output detects edges in both directions.
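
As a concrete reference for Eqs. (1)-(2), the following is a generic single-pass implementation on an 8-bit grayscale BufferedImage. It is a sketch, not the authors' code; it leaves the one-pixel border at zero rather than padding, and clamps the magnitude to the 8-bit range.

```java
import java.awt.image.BufferedImage;

public final class SobelFilter {

    // Standard 3 x 3 Sobel kernels: GX responds to horizontal intensity
    // changes (vertical edges), GY to vertical changes (horizontal edges).
    private static final int[][] GX = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
    private static final int[][] GY = {{ 1, 2, 1}, { 0, 0, 0}, {-1, -2, -1}};

    public static BufferedImage apply(BufferedImage src) {
        int w = src.getWidth(), h = src.getHeight();
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);

        // Slide the 3 x 3 masks over the interior pixels.
        for (int y = 1; y < h - 1; y++) {
            for (int x = 1; x < w - 1; x++) {
                int gx = 0, gy = 0;
                for (int j = -1; j <= 1; j++) {
                    for (int i = -1; i <= 1; i++) {
                        int v = src.getRaster().getSample(x + i, y + j, 0);
                        gx += GX[j + 1][i + 1] * v;
                        gy += GY[j + 1][i + 1] * v;
                    }
                }
                // Gradient magnitude, Eq. (2), clamped to [0, 255].
                int g = (int) Math.min(255, Math.round(Math.sqrt(gx * gx + gy * gy)));
                dst.getRaster().setSample(x, y, 0, g);
            }
        }
        return dst;
    }
}
```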
C. Auto Contrast

Automatic contrast adjustment (auto-contrast) is a point operation that modifies pixel intensities with respect to the range of values in an image. This is done by mapping the current darkest and brightest pixels to the lowest and highest available intensity values, respectively, and linearly distributing the intermediate values. Let a_low and a_high be the lowest and highest pixel values found in the current image, whose full intensity range is [a_min, a_max]. We first map the smallest pixel value a_low to zero, then increase the contrast by the factor (a_max - a_min)/(a_high - a_low), and finally shift to the target range by adding a_min. The resulting mapping is

$$f(a) = \left(a - a_{\mathrm{low}}\right)\,\frac{a_{\mathrm{max}} - a_{\mathrm{min}}}{a_{\mathrm{high}} - a_{\mathrm{low}}} + a_{\mathrm{min}} \tag{3}$$

provided that a_high differs from a_low; for an 8-bit image, a_min = 0 and a_max = 255. The aim of this operation is to brighten the darkest places in satellite images by a calculated factor. For example, in images captured from the satellite, the sea appears as a very dark color and is not clearly visible; this process greatly helps to reveal such details by increasing their brightness.
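
A compact sketch of the point operation of Eq. (3) for the 8-bit case (a_min = 0, a_max = 255) on a grayscale BufferedImage; the class and method names are illustrative.

```java
import java.awt.image.BufferedImage;

public final class AutoContrast {

    // Linearly stretches intensities so the darkest pixel maps to a_min = 0
    // and the brightest to a_max = 255, per Eq. (3).
    public static BufferedImage stretch(BufferedImage src) {
        int w = src.getWidth(), h = src.getHeight();
        int aLow = 255, aHigh = 0;

        // Pass 1: find the darkest (aLow) and brightest (aHigh) values.
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int v = src.getRaster().getSample(x, y, 0);
                if (v < aLow)  aLow = v;
                if (v > aHigh) aHigh = v;
            }
        }
        if (aHigh == aLow) return src;  // flat image: nothing to stretch

        // Pass 2: apply f(a) = (a - aLow) * (255 - 0)/(aHigh - aLow) + 0.
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        double scale = 255.0 / (aHigh - aLow);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int v = src.getRaster().getSample(x, y, 0);
                dst.getRaster().setSample(x, y, 0,
                        (int) Math.round((v - aLow) * scale));
            }
        }
        return dst;
    }
}
```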

VIII. EXPERIMENTATION

A. Environment Setup

Qatar University has recently received a dedicated cloud infrastructure called the QU Cloud, running a 14-blade IBM BladeCenter (cluster) with 112 physical cores and 7 TB of storage. The cloud has advanced management software (IBM Cloud 1.4) with a Tivoli provisioning manager that allows cloud resources to be provisioned through software. Our experiments were performed on this cluster equipped with Hadoop. The project was provisioned with one NameNode and eight DataNodes. The NameNode was configured with two 2.5-GHz CPUs, 2 GB of RAM, and 130 GB of storage space; each DataNode was configured identically. All the computing nodes were connected by a gigabit switch. Red Hat Linux 5.3, Hadoop, and Java 1.6.0_6 were installed on both the NameNode and the DataNodes. Table II shows the master-slave hardware configuration, Table III the single-machine configuration, and Tables IV and V the cluster hardware and software configurations, respectively.

TABLE II
MASTER-SLAVE CONFIGURATION

Name     Number   Details
Master   1        2 × 2.5-GHz CPU node, 2 GB RAM, 130 GB disk space
Slaves   8        2 × 2.5-GHz CPU nodes, 2 GB RAM, 130 GB disk space

TABLE III
SINGLE MACHINE CONFIGURATION

Details: Intel Core 2 Duo CPU, 2.27 GHz, 4 GB RAM, 500 GB total storage; operating system Windows Vista.

TABLE IV
CLUSTER HARDWARE CONFIGURATION

Name        Details
Name node   2 × 2.5-GHz CPU node, 2 GB RAM
Data nodes  2 × 2.5-GHz CPU nodes, 2 GB RAM
Network     1-Gbit switch connecting all nodes
Storage     130 GB per node

TABLE V
SOFTWARE CONFIGURATION FOR CLUSTER COMPUTERS

Name                    Details
Hadoop                  Installed on each node of the cluster
Xen Red Hat Linux 5.2   Pre-configured with Java and the Hadoop SDKs
IBM Java SDK 6          For programming the image processing

B. Data Source

The data source was a real color satellite image of Qatar supplied by the Environmental Studies Center at Qatar University. The images were subsequently resized so that the Java virtual machine, with its limited heap size, could open them successfully. The collection of remote sensing images taken from Landsat satellites consists of 200 image files in TIF format. The spatial resolution was 30.0 meters per pixel, and in our experiments only a small part of the images was used. The developed Java program reads the JPEG, TIFF, GIF, and BMP standard image formats. Although the JPEG and GIF formats are the most commonly used, TIF is the standard universal format for high-quality images and is used for archiving important ones. Most satellite images are in TIF format because TIFF images are stored at full quality; hence TIFF files are very large to upload, resize, and process. When the processing finishes, each image is encoded in JPEG format by calling the JPEGEncoder method to perform the required compression, and is then saved as a JPEG file. The JPEG format is used when the highest image quality is not necessary (as when demonstrating the processing here); JPEG compression reduces images to a size suitable for processing and non-archival purposes, as needed in this study.

IX. EXPERIMENT RESULT AND ANALYSIS

In this section, we test the performance of some parallel image filtering algorithms applied to remote sensing images available in the Environmental Studies Center. We mainly consider the trend in time consumption as the volume of data increases, and we try to show the difference in run time between a single-PC implementation and the parallel Hadoop implementation. Tables II-V list the different configurations of the cluster and single-PC computing platforms used to obtain the experimental results. We repeated each experiment three times, and the average reading was recorded. We implemented the image sharpening, Sobel filtering, and auto-contrasting in Java; the programs ran well when the input images were not too large. Figures 1-3 show run time against file size for the two configurations, namely the single machine and the cloud cluster. Three different image processing algorithms were used for experimentation, and different timings were recorded because each algorithm differed in its numerical processing overhead. These image processing algorithms were chosen for the Hadoop project for their frequent use in remote sensing and for their diverse computational loads. We paid more attention to processing time than to the number of images processed, and we considered elapsed time the sole performance measure. We compared the times taken by the single PC and by the cluster to observe the speedup factors, and finally analyzed the results.

Figure 1. Auto-contrast chart

Figure 2. Sobel filtering chart

Figure 3. Sharpening chart

Figure 4. Sample image output for auto-contrast, Sobel filtering, and sharpening (Khor Al Udaid Beach).

X. CONCLUSION AND FUTURE WORK

In this paper, we presented a case study of parallel processing of remote sensing images in TIF format using the Hadoop MapReduce framework. The experimental results show that typical image processing algorithms can be effectively parallelized, with acceptable run times, when applied to remote sensing images. A large number of images cannot be processed efficiently in the customary sequential manner. Although originally designed for text processing, Hadoop MapReduce installed on a parallel cluster proved suitable for processing TIF images in large quantities. For organizations that own massive image collections and need to process them in various ways, our implementation proved efficient and successful. In particular, the auto-contrast algorithm reached a sixfold speedup on the cluster relative to a single PC, the sharpening algorithm an eightfold reduction in run time, and the Sobel filtering algorithm a sevenfold reduction. We have observed that this parallel Hadoop implementation is better suited to large data volumes than to computationally intensive applications. In the future, we may focus on different image sources with algorithms of a more computationally intensive nature.

REFERENCES

[1] G. Fox, X. Qiu, S. Beason, J. Choi, J. Ekanayake, T. Gunarathne, M. Rho, H. Tang, N. Devadasan, and G. Liu, "Biomedical case studies in data intensive computing," in Cloud Computing (LNCS vol. 5931), M. G. Jaatun, G. Zhao, and C. Rong, Eds. Berlin: Springer-Verlag, 2009.

[2] Z. Lv, Y. Hu, H. Zhong, J. Wu, B. Li, and H. Zhao, "Parallel K-means clustering of remote sensing images based on MapReduce," in Proc. 2010 Int. Conf. Web Information Systems and Mining (WISM '10).

[3] H. Kocakulak and T. T. Temizel, "A Hadoop solution for ballistic image analysis and recognition," in Proc. 2011 Int. Conf. High Performance Computing and Simulation (HPCS), Istanbul.

[4] B. Li, H. Zhao, and Z. H. Lv, "Parallel ISODATA clustering of remote sensing images based on MapReduce," in Proc. 2010 Int. Conf. Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), Huangshan.

[5] N. Golpayegani and M. Halem, "Cloud computing for satellite data processing on high end compute clusters," in Proc. 2009 IEEE Int. Conf. Cloud Computing (CLOUD '09), Bangalore.

[6] T. Dalman, T. Döernemann, E. Juhnke, M. Weitzel, M. Smith, W. Wiechert, K. Nöh, and B. Freisleben, "Metabolic flux analysis in the cloud," in Proc. IEEE 6th Int. Conf. e-Science, Brisbane, 2010.

[7] Hadoop.

[8] MapReduce.

[9] HDFS.

[10] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107-113, 2008.


More information

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,

More information

METHOD OF A MULTIMEDIA TRANSCODING FOR MULTIPLE MAPREDUCE JOBS IN CLOUD COMPUTING ENVIRONMENT

METHOD OF A MULTIMEDIA TRANSCODING FOR MULTIPLE MAPREDUCE JOBS IN CLOUD COMPUTING ENVIRONMENT METHOD OF A MULTIMEDIA TRANSCODING FOR MULTIPLE MAPREDUCE JOBS IN CLOUD COMPUTING ENVIRONMENT 1 SEUNGHO HAN, 2 MYOUNGJIN KIM, 3 YUN CUI, 4 SEUNGHYUN SEO, 5 SEUNGBUM SEO, 6 HANKU LEE 1,2,3,4,5 Department

More information

GeoImaging Accelerator Pansharp Test Results

GeoImaging Accelerator Pansharp Test Results GeoImaging Accelerator Pansharp Test Results Executive Summary After demonstrating the exceptional performance improvement in the orthorectification module (approximately fourteen-fold see GXL Ortho Performance

More information

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Abstract Big Data is revolutionizing 21st-century with increasingly huge amounts of data to store and be

More information

Introduction to DISC and Hadoop

Introduction to DISC and Hadoop Introduction to DISC and Hadoop Alice E. Fischer April 24, 2009 Alice E. Fischer DISC... 1/20 1 2 History Hadoop provides a three-layer paradigm Alice E. Fischer DISC... 2/20 Parallel Computing Past and

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Keywords: Big Data, HDFS, Map Reduce, Hadoop Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

More information

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable

More information

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture He Huang, Shanshan Li, Xiaodong Yi, Feng Zhang, Xiangke Liao and Pan Dong School of Computer Science National

More information

Hadoop Distributed Filesystem. Spring 2015, X. Zhang Fordham Univ.

Hadoop Distributed Filesystem. Spring 2015, X. Zhang Fordham Univ. Hadoop Distributed Filesystem Spring 2015, X. Zhang Fordham Univ. MapReduce Programming Model Split Shuffle Input: a set of [key,value] pairs intermediate [key,value] pairs [k1,v11,v12, ] [k2,v21,v22,

More information

Data-intensive computing systems

Data-intensive computing systems Data-intensive computing systems Hadoop Universtity of Verona Computer Science Department Damiano Carra Acknowledgements! Credits Part of the course material is based on slides provided by the following

More information

International Journal of Innovative Research in Advanced Engineering (IJIRAE) ISSN: 2349-2163 Volume 1 Issue 5 (June 2014)

International Journal of Innovative Research in Advanced Engineering (IJIRAE) ISSN: 2349-2163 Volume 1 Issue 5 (June 2014) Enhancement of Map Function Image Processing System Using DHRF Algorithm on Big Data in the Private Cloud Tool Mehraj Ali. U, John Sanjeev Kumar. A Department of Computer Application, Thiagarajar College

More information

Scalable Services for Digital Preservation

Scalable Services for Digital Preservation Scalable Services for Digital Preservation A Perspective on Cloud Computing Rainer Schmidt, Christian Sadilek, and Ross King Digital Preservation (DP) Providing long-term access to growing collections

More information

Large data computing using Clustering algorithms based on Hadoop

Large data computing using Clustering algorithms based on Hadoop Large data computing using Clustering algorithms based on Hadoop Samrudhi Tabhane, Prof. R.A.Fadnavis Dept. Of Information Technology,YCCE, email id: sstabhane@gmail.com Abstract The Hadoop Distributed

More information

BSPCloud: A Hybrid Programming Library for Cloud Computing *

BSPCloud: A Hybrid Programming Library for Cloud Computing * BSPCloud: A Hybrid Programming Library for Cloud Computing * Xiaodong Liu, Weiqin Tong and Yan Hou Department of Computer Engineering and Science Shanghai University, Shanghai, China liuxiaodongxht@qq.com,

More information

How To Balance In Cloud Computing

How To Balance In Cloud Computing A Review on Load Balancing Algorithms in Cloud Hareesh M J Dept. of CSE, RSET, Kochi hareeshmjoseph@ gmail.com John P Martin Dept. of CSE, RSET, Kochi johnpm12@gmail.com Yedhu Sastri Dept. of IT, RSET,

More information