Spatial Data and Hadoop Utilization

Georgios Economides, Georgios Piskas, Sokratis Siozos-Drosos
Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece
georgido@csd.auth.gr, gpiskasv@csd.auth.gr, siozosdr@csd.auth.gr

Keywords: Data Parallelism, Hadoop, MapReduce, Spatial Data

1. Introduction

Spatial data are mainly used for scientific research purposes in order to model real world problems. They can accumulate highly detailed information, including multi-variable data sets, which results in enormous storage and processing demands. Google's MapReduce [1] and its implementation Hadoop [2] make it possible to process large scale data sets efficiently by exploiting parallelization.

Before presenting current and future research, we briefly introduce the fundamentals of the MapReduce framework and Hadoop in section 2. We describe the step-by-step execution process of this programming model, as well as its core modules. Additionally, we analyze the distributed file system of Hadoop (HDFS) [4] and its specialised nodes. Section 2 also explains the two spatial data types, vector and raster [3], and their common usage.

In section 3 we explain how we can take advantage of Hadoop's parallelism to process spatial data efficiently. In addition, we present several research fields where Hadoop has proven to be a beneficial solution, able to manage large scale data, also known as Big Data.

Finally, in section 4 we point out the current scientific trends that are likely to be the focus of future research projects regarding Hadoop and spatial data.

2. Scientific Background

In subsection 2.1 we introduce the MapReduce framework. In subsection 2.2 we introduce Hadoop, an open source implementation of MapReduce. In subsection 2.3 we introduce spatial data and their usage.

2.1 MapReduce

MapReduce [1] is a distributed, data-centric processing framework developed by Google. It manages large scale data efficiently by spreading the work over a large number of machines (Nodes), collectively referred to as a cluster. It is used for both scientific research and commercial purposes. Typical examples of usage are webpage crawling, document processing and other computation-intensive tasks.

MapReduce conceals its implementation from the programmer by requiring the coding of only two functions, Map and Reduce. Secondary functions, such as the input reader, the partitioner and the combiner, do not require implementation, but are vital to the framework's functionality. The components of MapReduce (Figure 1) are described below.

The concept of the framework is based on the divide and conquer philosophy and consists of five tasks:

- Iteration over the input.
- Computation of key/value pairs from each piece of input.
- Grouping of all intermediate values by key.
- Iteration over the resulting groups.
- Reduction of each group.
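
The five tasks listed above can be sketched on a single machine. The following Python example is only an illustration and not Hadoop code: the names map_fn, reduce_fn and run_mapreduce, as well as the word-count task and the two input lines, are assumptions made for this sketch. A real cluster would distribute steps 1 to 5 over many nodes.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: collapse all values that share a key into a single result."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Run the five tasks sequentially (hypothetical single-machine driver)."""
    groups = defaultdict(list)
    for key, value in records:                         # 1. iterate over the input
        for out_key, out_value in mapper(key, value):  # 2. compute key/value pairs
            groups[out_key].append(out_value)          # 3. group values by key
    results = {}
    for out_key in sorted(groups):                     # 4. iterate over the groups
        word, total = reducer(out_key, groups[out_key])  # 5. reduce each group
        results[word] = total
    return results


# Hypothetical input: (line number, line text) pairs.
lines = ["spatial data on Hadoop", "Hadoop processes spatial data"]
word_counts = run_mapreduce(enumerate(lines), map_fn, reduce_fn)
print(word_counts)
```

Only map_fn and reduce_fn are specific to the problem; the driver stays the same for any other task, which mirrors the statement that the programmer only codes Map and Reduce.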

Figure 1: The MapReduce framework.

Input Reader

Prior to the Map function, the Input Reader divides the input data, which is also referred to as the problem, into sub-problems of appropriate size. The partitioning outputs key/value pairs that are passed to the mappers for processing.

Mapper & Partitioner

A Mapper receives the pairs produced by the Input Reader, processes each of them and outputs intermediate key/value pairs. Note that the types of the input and output pairs may differ. The intermediate pairs are then grouped by their key and passed to the reducers. The allocation of an output pair to a specific reducer is determined by the partitioner, based on the output key and the number of reducers.

Reducer

A Reducer collects the results of the sub-problems, grouped by their intermediate key. These results are then combined in a manner predefined by the programmer, forming the output, which is the answer to the original problem.

Combiner

Between mapping and reducing there is an optional task, the Combiner. This function provides a partial aggregation of the key/value pairs produced by the mappers. This partial combination of intermediate pairs
aims to reduce the network overhead by transmitting a smaller number of pairs. The combiner's implementation is very similar to the reducer's.

2.2 Hadoop

Apache Hadoop [2] is an open source implementation of Google's MapReduce framework, which was described in the previous subsection. Hadoop supports the execution of applications on large clusters of commodity machines, where an application is divided into many fragments of work that are executed in parallel. In addition, it provides its own distributed file system (HDFS) [4], which was derived from the Google File System (GFS). HDFS can achieve very high aggregate throughput and computational power across the cluster by replicating data on Datanodes, coordinated by the Namenode. Hadoop is written in the Java programming language and is a large scale project maintained by Yahoo! and Apache.

HDFS

HDFS is a common example of a master/slave architecture. The Namenode [4] is considered to be the master and the Datanodes [4] the slaves. Datanodes store data blocks, while the Datanodes themselves are grouped into Racks. The architecture of HDFS is shown in Figure 2.

The default replication factor of HDFS is 3. This means that a data block that is initially saved in a single Datanode on a random Rack is then replicated on two other Datanodes. One of them, by default, is on the same Rack as the initial Datanode, while the second one is on a different Rack. The default HDFS replication policy requires that no Datanode contains more than one replica of any block and that no Rack contains more than two replicas of the same block, provided there are sufficient Racks in the cluster.

Figure 2: HDFS architecture. [3]

The greatest benefits of HDFS are [3][4]:

- Low cost per byte.
- Data redundancy and replication.
- Large storage capacity.
- Balanced storage utilization.
- High fault-tolerance.
- High throughput rate.
- Scalability.
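
The two constraints of the default replication policy described in this subsection can be expressed as a small check. The Python sketch below is not part of the HDFS API: the function name satisfies_default_policy, the rack and Datanode identifiers, and the reading of "sufficient Racks" as "at least half as many Racks as replicas" are all assumptions made for illustration.

```python
from collections import Counter


def satisfies_default_policy(replicas, total_racks):
    """Check one block's replica placement against the two stated constraints.

    replicas: list of (rack_id, datanode_id) tuples, one per replica.
    total_racks: number of racks in the (hypothetical) cluster.
    """
    # Constraint 1: no Datanode holds more than one replica of the block.
    datanodes = [datanode for _, datanode in replicas]
    if len(set(datanodes)) != len(datanodes):
        return False

    # Constraint 2: no rack holds more than two replicas of the block.
    # Assumption: the rule only applies when the cluster has enough racks
    # to make it satisfiable, i.e. two replicas per rack can cover them all.
    enough_racks = total_racks * 2 >= len(replicas)
    per_rack = Counter(rack for rack, _ in replicas)
    if enough_racks and max(per_rack.values()) > 2:
        return False
    return True


# Placement from the text: two replicas on one rack, the third on another.
default_placement = [("rack1", "dn1"), ("rack1", "dn2"), ("rack2", "dn5")]
print(satisfies_default_policy(default_placement, total_racks=2))
```

For example, placing all three replicas on one Rack is rejected in a two-Rack cluster, but accepted in a single-Rack cluster, where the Rack constraint cannot be met.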

Namenode

The role of the Namenode [4] is the coordination of the whole cluster. There is only one Namenode per cluster, which is responsible for the storage and update operations of the namespace as well as for job tracking. The namespace is a tree-like hierarchy of the files and directories of the filesystem, which also maintains the mapping of each data block to the Datanodes that store it. Since every action on the cluster uses this tree, it is kept in the Namenode's main memory for faster queries. Because there is a single Namenode per cluster, techniques such as metadata images, checkpoints and backup nodes exist to guarantee fault tolerance and recovery.

To sum up, the Namenode is responsible for load balancing, job tracking and namespace handling. The latter consists of operations such as physical address and block mapping, write, read, open, close and rename.

Datanode

Datanodes [4] are the cluster's storage space and processing units. Every Datanode in HDFS has a unique and permanent storage identifier (ID), which is assigned to it by the Namenode the moment it joins the cluster. The Namenode uses this ID to recognize the Datanodes and to communicate and exchange information with them. Once a new Datanode is connected to the cluster, the Namenode registers a unique storage ID to it, and the newly added node is ready for use as soon as the handshake between the two nodes is complete. The handshake is nothing more than an ID verification process.

A Datanode regularly sends heartbeats to the Namenode in order to confirm that it is operational. The interval between heartbeats is 3 seconds by default. If the Namenode has received no heartbeat for 10 minutes, the Datanode is deemed out of service. In that case, the data stored in the unavailable Datanode are replicated to other Datanodes in order to sustain the data redundancy and fault tolerance properties.

Apart from notifying the Namenode about the availability of a Datanode, heartbeats carry additional information concerning work load, storage capacity and utilization, as well as current data traffic. This information, combined with the heartbeats received from all other Datanodes, allows the Namenode to make load balancing decisions and to schedule efficiently.

2.3 Spatial Data

Spatial data [3][6] are multidimensional data sets through which we can represent real world features. Using predefined spatial models, we can represent natural or constructed entities, or even phenomena, in a manner that computer systems can interpret. Data is saved in either Vector or Raster form. The difference between the two is visualised in Figure 3. For instance, spatial data can be used by a Geographic Information System (GIS) to store coordinates, to hold sensor data, or to represent biomedical, satellite, aerial and any other type of images [5].

As far as storage is concerned, customised spatial databases [6][7] have been designed for cost-effective, scalable and efficient real-time spatial query processing. By exploiting the benefits of MapReduce, these databases provide advanced multidimensional indexing techniques and deal effectively with data that are modified at a high frequency.

Figure 3: Difference between Vector and Raster data.

Common real world examples of spatial data usage:

- Proximity assessment.
- Entity identification and estimation of likeness.
- Geometric computation.
- Digital representation of elevation data.
- Topological matching and pattern analysis.
- Multidimensional data representation.

Raster Data

The Raster data [3] type is used to represent both discrete and continuous entities. This is achieved with the use of a regular grid, where the value in each grid cell corresponds to the characteristics of a spatial property at that specific location. Continuous representation can be achieved using multiple grids grouped in a stack topology. The difference between Raster and Vector data can be clearly seen in Figure 3.
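
As a rough illustration of the regular grid described above, the following Python sketch stores two grids over the same area as a stack and looks up the cell values at a given location. Everything in it is hypothetical: the layer names, the values, the 10-unit cell size, the origin at the top-left corner and the convention that the row index grows with y are assumptions chosen for the example.

```python
# Two hypothetical 3x4 grids covering the same area, one value per grid cell.
elevation_m = [
    [120, 125, 131, 140],
    [118, 122, 128, 135],
    [115, 119, 124, 130],
]
land_cover = [
    ["forest", "forest", "rock", "rock"],
    ["forest", "grass", "grass", "rock"],
    ["water", "grass", "grass", "grass"],
]
# The stack groups several grids that share the same cell layout.
raster_stack = {"elevation_m": elevation_m, "land_cover": land_cover}

CELL_SIZE = 10.0               # length of a cell side (assumed)
ORIGIN_X, ORIGIN_Y = 0.0, 0.0  # coordinates of the top-left grid corner (assumed)


def cell_at(x, y):
    """Translate a location into the (row, column) of the cell containing it."""
    col = int((x - ORIGIN_X) // CELL_SIZE)
    row = int((y - ORIGIN_Y) // CELL_SIZE)
    return row, col


def sample(stack, x, y):
    """Read the value of every grid in the stack at one location."""
    row, col = cell_at(x, y)
    return {name: grid[row][col] for name, grid in stack.items()}


values = sample(raster_stack, 25.0, 12.0)
print(values)
```

The lookup costs the same for every location, since a regular grid needs no search, only an index computation.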

Vector Data

Vector data [3] is the other spatial data type. It is represented by points, lines, polygons and regions, which offer an unlimited level of detail compared to Raster data. For example, we can endlessly enlarge a vector image without any pixelation effect. Polygons are one of the most widely used spatial data types. They can effectively represent two-dimensional spatial features in geological, medical and other scientific data.

3. Current Research

Spatial Database Design

Although current database technology is able to handle the massive storage requirements of spatial data, the efficiency of the queries related to them is considered to be a very challenging research problem. A typical database needs an index in order to achieve fast query processing. An index for single-dimensional data can be easily constructed using one of the several variations of B-Trees [7]. For multidimensional data, such as spatial data, advanced techniques for projecting higher dimensions to lower ones are needed. There exist mathematical models, such as the Hilbert and Z-order space-filling curves [6][7], that can map multidimensional data to a single dimension while preserving locality properties. This projection makes the data compatible with the previously mentioned B-Trees.

Another, approximate, solution to spatial indexing is the R-Tree data structure [3][6][7]. R-Tree indices group nearby entities and represent them using their minimum bounding rectangle. R-Trees can be efficiently constructed using a three-phase approximate algorithm that harnesses Hadoop's parallelization feature.

Phase 1: Input is partitioned according to data proximity and size properties.
Phase 2: Hadoop processes the newly formed partitions, producing lower level R-Trees.
Phase 3: The previous trees are combined and form the complete R-Tree index of the data.

Figure 4: R-tree construction phases. [6]
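
To make the two indexing ideas of this subsection more concrete, the Python sketch below computes a Z-order (Morton) code by interleaving coordinate bits, sorts a few points by that code, and then describes each group of nearby points by its minimum bounding rectangle. It is a simplified illustration, not the algorithm of [6] or [7]: the function names, the sample points and the fixed group size of 4 are assumptions.

```python
def interleave_bits(x, y, bits=16):
    """Z-order (Morton) code of a point with non-negative integer coordinates."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return code


def minimum_bounding_rectangle(points):
    """Smallest axis-aligned rectangle (min_x, min_y, max_x, max_y) around points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical input: two small clusters of points, given in arbitrary order.
points = [(5, 1), (0, 0), (4, 0), (1, 1), (1, 0), (5, 0), (0, 1), (4, 1)]

# Sorting by the one-dimensional code places points of the same cluster together.
ordered = sorted(points, key=lambda p: interleave_bits(*p))

# Cut the sorted sequence into groups of 4 (assumed size) and bound each group.
GROUP_SIZE = 4
groups = [ordered[i:i + GROUP_SIZE] for i in range(0, len(ordered), GROUP_SIZE)]
mbrs = [minimum_bounding_rectangle(group) for group in groups]
print(mbrs)
```

Because the one-dimensional code is an ordinary integer, it could serve as the key of a B-Tree, while the rectangles correspond to what the leaf level of an R-Tree stores.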

Figure 4 illustrates each of these phases.

Apart from database indexing, numerous Spatial DataBase Management Systems (SDBMS) are being developed, such as MIGIS [5]. MIGIS is a Hadoop-based framework that handles complicated spatial queries effectively and with high performance. It extends the Hadoop framework with two components, YSmart [5] and RESQUE [5]. The architecture of MIGIS is shown in Figure 5. The user provides a typical SQL query, which is interpreted by YSmart. The role of YSmart is to convert the query into a MapReduce-compatible one by replacing the default operators with spatial ones, and to optimize the ordering of the execution tree. RESQUE is an execution engine which parses spatial execution trees, extracts the
corresponding data from HDFS and finally executes the query by taking advantage of Hadoop's parallelization.

Figure 5: MIGIS architecture. [5]

Geospatial & Spatiotemporal Analysis

A substantial amount of research effort is focused on the analysis of physical entities or phenomena that are sampled via sensors, GPS satellites and other geospatial information systems [3]. These sampling tools produce massive data sets, also known as Big Data, the processing of which is deemed a challenging task. By exploiting the parallelization advantage of Hadoop over older sequential processing tools such as PASSaGE [8], we can overcome this difficulty and handle such data sets efficiently.

The MapReduce framework is also used to solve another spatial data problem related to mapping and topology, where the purpose of the study is to improve the accuracy of automatic road-to-map alignment by combining satellite and vector data. Roads usually consist of continuous patterns along with other related parameters, such as the colour difference between ground and road. Smart parallelizable algorithms use this information to limit human interaction and consequently reduce the error rates.

In addition, since a variety of physical phenomena, such as motion, are not static, we need an extra variable in order to represent them accurately. This variable is time, in the sense of sampling frequency. This type of augmented spatial data, also known as Spatiotemporal data, requires massive storage space and processing capacity, both of which grow with the sampling frequency. At this point, we should recall the scalability feature of Hadoop [2]. Adding a dimension automatically scales the problem up, but HDFS combined with parallelization once more proves to be an effective way to handle it.

Biomedical Analysis

The Hadoop framework is also used in other important research fields, such as medicine and biomedical science, along with their applications.
The most frequently produced spatial data are in the form of images, the resolution of which can exceed 100K x 100K pixels [5]. In addition, medical datasets consist of numerous images, causing the storage requirements of a study to scale up to tens of Terabytes or even Petabytes. Moreover, these datasets include additional variables in order to express scientific information. Common query types that are submitted to such datasets include the following [5]:

a. Multiway spatial join query.
b. Nearest neighbour query.
c. Global density pattern query.

Query a (Figure 6) includes cross-matching, comparison and consolidation of algorithm results. For example, a spatial join query is used for pattern matching and entity identification in a medical image.

Query b (Figure 6) includes the computation of the nearest blood vessel for each cell and of the distance between them.

Query c (Figure 6) illustrates the identification of tumor subtypes using density values and their characteristics through regional colocation patterns.
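
Query b can be illustrated with a brute-force sketch in Python. It does not show how MIGIS [5] evaluates the query: the function nearest_vessel, the identifiers and the coordinates are hypothetical, and blood vessels are reduced to single points, whereas real image data would describe them as shapes. The sketch requires Python 3.8 or later for math.dist.

```python
import math


def nearest_vessel(cell, vessels):
    """Return (vessel_id, distance) of the vessel closest to one cell."""
    best_id, best_position = min(
        vessels.items(), key=lambda item: math.dist(cell, item[1])
    )
    return best_id, math.dist(cell, best_position)


# Hypothetical centre coordinates of cells and blood vessels in one image.
cells = {"c1": (1.0, 1.0), "c2": (8.0, 2.0), "c3": (4.0, 7.0)}
vessels = {"v1": (0.0, 0.0), "v2": (9.0, 2.0), "v3": (4.0, 4.0)}

# Every cell is handled independently of the others, so each lookup could be
# the body of a separate map call working on its own share of the cells.
nearest = {cell_id: nearest_vessel(pos, vessels) for cell_id, pos in cells.items()}
print(nearest)
```

The comparison of every cell with every vessel is what makes such queries expensive on large images, which is the motivation for spatial indexing and for parallel execution.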

Figure 6: Medical queries. [5]

The previously described MIGIS [5] framework has been developed especially to process these complex and expensive queries, since traditional database systems are impractical for such a workload. Apart from static sample analysis, Hadoop can be used for simulation purposes, such as molecular dynamics simulation [9], which is considered to be a computationally intensive application due to the spatial decomposition of atoms.

4. Future Research

Due to the enormous need for efficient processing, research on query optimization [5] is a priority. Spatial databases are an indispensable part of every data intensive research project, thus we need a more intelligent optimizer which can pick the best plan according to query type and data topology.

Current technology has reached a point where storage space is not an issue. However, there is still room for improvement regarding intra-cluster connectivity. It is crucial to reduce the network overhead between nodes, between racks and generally within the cluster, while boosting the internal network bandwidth for faster data transmission.

Apart from a CPU, nowadays most computing units include a Graphics Processing Unit (GPU) [10]. The most modern high end GPUs consist of multiple processors and redundant built-in memory, so that they can cope with demanding computation tasks. Due to their high performance, it would be wise to integrate them into a parallel processing framework such as Hadoop. An important drawback of a GPU is that it does not feature a programmer-friendly Application Programming Interface (API), which makes it difficult to exploit. A great deal of research activity is focused on the development of a hybrid MapReduce framework that harnesses both CPU and GPU resources. Mars [10] is an experimental framework that implements scheduling, load balancing and synchronization between the CPU and GPU.
Mars can achieve up to 16 times greater performance than typical setups [10].

Another field of interest is the evaluation of statistics and parameters by a Decision Support System (DSS) [11]. A decision made by this kind of system is an aggregate assessment of the problem's variables. Since a decision has to be extracted as fast as possible, a parallelization framework such as Hadoop, combined with machine learning, can greatly speed up the process. Recommendation systems can also benefit, as they are closely related to decision support systems.

5. Conclusion

In this paper, we outlined the importance of spatial data, since they are used in almost every research field that needs to depict multidimensional data. We illustrated how they can be efficiently processed in a parallel manner. Furthermore, due to ever-growing data sets, the need arises for technological advancements in information management. Parallel computing is clearly more beneficial than sequential methods for such workloads. As a result, parallelization through MapReduce and its implementation, Hadoop, will be the solution to a variety of difficult computational problems.

References

[1] Ralf Lämmel, 2008, Google's MapReduce programming model Revisited, Science of Computer Programming 2008, 70, pp
[2] Apache Hadoop.
[3] Abhishek Sagar, Umesh Bellur, 2011, Distributed Computation on Spatial Data on Hadoop Cluster, Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Mumbai
[4] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, 2010, The Hadoop Distributed File System, IEEE 2010, /10.
[5] Ablimit Aji, Fusheng Wang, 2012, High Performance Spatial Query Processing for Large Scale Scientific Data, SIGMOD 12 PhD Symposium, pp
[6] Ariel Cary, Zhengguo Sun, Vagelis Hristidis, Naphtali Rishe, 2009, Experiences on Processing Spatial Data with MapReduce, SSDBM 2009, pp
[7] Liu Yi, Jing Ning, Chen Luo, Chen Huizhong, 2011, Parallel Bulk-Loading of Spatial Data with MapReduce: An R-tree Case, Wuhan University Journal of Natural Sciences, pp
[8] Michael S. Rosenberg, Corey Devin Anderson, 2011, PASSaGE: Pattern Analysis, Spatial Statistics and Geographic Exegesis. Version 2, Methods in Ecology and Evolution 2011, 2, pp
[9] Chen He, 2011, Molecular Dynamics Simulation Based on Hadoop MapReduce, DigitalCommons@University of Nebraska - Lincoln.
[10] Bingsheng He, Naga K. Govindaraju, Wenbin Fang, Tuyong Wang, Qiong Luo, 2011, Mars: A MapReduce Framework on Graphics Processors, PACT 2008, ACM /08/10.
[11] Samadi Alinia, M. R. Delavar, 2008, Applications of Spatial Data Infrastructure in Disaster Management, Dept. of Surveying and Geomatics Eng., College of Eng., University of Tehran, Tehran, Iran.


More information

Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel

Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel Big Data and Analytics: Getting Started with ArcGIS Mike Park Erik Hoel Agenda Overview of big data Distributed computation User experience Data management Big data What is it? Big Data is a loosely defined

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Hadoop Parallel Data Processing

Hadoop Parallel Data Processing MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

A B S T R A C T. Index Terms : Apache s Hadoop, Map/Reduce, HDFS, Hashing Algorithm. I. INTRODUCTION

A B S T R A C T. Index Terms : Apache s Hadoop, Map/Reduce, HDFS, Hashing Algorithm. I. INTRODUCTION Speed- Up Extension To Hadoop System- A Survey Of HDFS Data Placement Sayali Ashok Shivarkar, Prof.Deepali Gatade Computer Network, Sinhgad College of Engineering, Pune, India 1sayalishivarkar20@gmail.com

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,

More information

Hadoop: Embracing future hardware

Hadoop: Embracing future hardware Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop

More information

Sector vs. Hadoop. A Brief Comparison Between the Two Systems

Sector vs. Hadoop. A Brief Comparison Between the Two Systems Sector vs. Hadoop A Brief Comparison Between the Two Systems Background Sector is a relatively new system that is broadly comparable to Hadoop, and people want to know what are the differences. Is Sector

More information

MapReduce algorithms for GIS Polygonal Overlay Processing

MapReduce algorithms for GIS Polygonal Overlay Processing MapReduce algorithms for GIS Polygonal Overlay Processing Satish Puri, Dinesh Agarwal, Xi He, and Sushil K. Prasad Department of Computer Science Georgia State University Atlanta - 30303, USA Email: spuri2,

More information

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015 7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan aidhog@gmail.com Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE

More information

Hadoop Big Data for Processing Data and Performing Workload

Hadoop Big Data for Processing Data and Performing Workload Hadoop Big Data for Processing Data and Performing Workload Girish T B 1, Shadik Mohammed Ghouse 2, Dr. B. R. Prasad Babu 3 1 M Tech Student, 2 Assosiate professor, 3 Professor & Head (PG), of Computer

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

An Hadoop-based Platform for Massive Medical Data Storage

An Hadoop-based Platform for Massive Medical Data Storage 5 10 15 An Hadoop-based Platform for Massive Medical Data Storage WANG Heng * (School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876) Abstract:

More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

Efficient Data Replication Scheme based on Hadoop Distributed File System

Efficient Data Replication Scheme based on Hadoop Distributed File System , pp. 177-186 http://dx.doi.org/10.14257/ijseia.2015.9.12.16 Efficient Data Replication Scheme based on Hadoop Distributed File System Jungha Lee 1, Jaehwa Chung 2 and Daewon Lee 3* 1 Division of Supercomputing,

More information

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image

More information

Performance Analysis of Book Recommendation System on Hadoop Platform

Performance Analysis of Book Recommendation System on Hadoop Platform Performance Analysis of Book Recommendation System on Hadoop Platform Sugandha Bhatia #1, Surbhi Sehgal #2, Seema Sharma #3 Department of Computer Science & Engineering, Amity School of Engineering & Technology,

More information

Design of Electric Energy Acquisition System on Hadoop

Design of Electric Energy Acquisition System on Hadoop , pp.47-54 http://dx.doi.org/10.14257/ijgdc.2015.8.5.04 Design of Electric Energy Acquisition System on Hadoop Yi Wu 1 and Jianjun Zhou 2 1 School of Information Science and Technology, Heilongjiang University

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

Mobile Storage and Search Engine of Information Oriented to Food Cloud

Mobile Storage and Search Engine of Information Oriented to Food Cloud Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 ISSN: 2042-4868; e-issn: 2042-4876 Maxwell Scientific Organization, 2013 Submitted: May 29, 2013 Accepted: July 04, 2013 Published:

More information

Distributed File Systems

Distributed File Systems Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

marlabs driving digital agility WHITEPAPER Big Data and Hadoop marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil

More information

Reduction of Data at Namenode in HDFS using harballing Technique

Reduction of Data at Namenode in HDFS using harballing Technique Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu vgkorat@gmail.com swamy.uncis@gmail.com Abstract HDFS stands for the Hadoop Distributed File System.

More information

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,

More information

Storage and Retrieval of Data for Smart City using Hadoop

Storage and Retrieval of Data for Smart City using Hadoop Storage and Retrieval of Data for Smart City using Hadoop Ravi Gehlot Department of Computer Science Poornima Institute of Engineering and Technology Jaipur, India Abstract Smart cities are equipped with

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)

More information

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance

More information

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

Snapshots in Hadoop Distributed File System

Snapshots in Hadoop Distributed File System Snapshots in Hadoop Distributed File System Sameer Agarwal UC Berkeley Dhruba Borthakur Facebook Inc. Ion Stoica UC Berkeley Abstract The ability to take snapshots is an essential functionality of any

More information

http://www.paper.edu.cn

http://www.paper.edu.cn 5 10 15 20 25 30 35 A platform for massive railway information data storage # SHAN Xu 1, WANG Genying 1, LIU Lin 2** (1. Key Laboratory of Communication and Information Systems, Beijing Municipal Commission

More information

Big Application Execution on Cloud using Hadoop Distributed File System

Big Application Execution on Cloud using Hadoop Distributed File System Big Application Execution on Cloud using Hadoop Distributed File System Ashkan Vates*, Upendra, Muwafaq Rahi Ali RPIIT Campus, Bastara Karnal, Haryana, India ---------------------------------------------------------------------***---------------------------------------------------------------------

More information

The Hadoop Framework

The Hadoop Framework The Hadoop Framework Nils Braden University of Applied Sciences Gießen-Friedberg Wiesenstraße 14 35390 Gießen nils.braden@mni.fh-giessen.de Abstract. The Hadoop Framework offers an approach to large-scale

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

HDFS Space Consolidation

HDFS Space Consolidation HDFS Space Consolidation Aastha Mehta*,1,2, Deepti Banka*,1,2, Kartheek Muthyala*,1,2, Priya Sehgal 1, Ajay Bakre 1 *Student Authors 1 Advanced Technology Group, NetApp Inc., Bangalore, India 2 Birla Institute

More information

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,

More information

Big Data and Hadoop. Sreedhar C, Dr. D. Kavitha, K. Asha Rani

Big Data and Hadoop. Sreedhar C, Dr. D. Kavitha, K. Asha Rani Big Data and Hadoop Sreedhar C, Dr. D. Kavitha, K. Asha Rani Abstract Big data has become a buzzword in the recent years. Big data is used to describe a massive volume of both structured and unstructured

More information

Distributed Consistency Method and Two-Phase Locking in Cloud Storage over Multiple Data Centers

Distributed Consistency Method and Two-Phase Locking in Cloud Storage over Multiple Data Centers BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 15, No 6 Special Issue on Logistics, Informatics and Service Science Sofia 2015 Print ISSN: 1311-9702; Online ISSN: 1314-4081

More information

Apache Hadoop FileSystem and its Usage in Facebook

Apache Hadoop FileSystem and its Usage in Facebook Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs

More information

low-level storage structures e.g. partitions underpinning the warehouse logical table structures

low-level storage structures e.g. partitions underpinning the warehouse logical table structures DATA WAREHOUSE PHYSICAL DESIGN The physical design of a data warehouse specifies the: low-level storage structures e.g. partitions underpinning the warehouse logical table structures low-level structures

More information

Internals of Hadoop Application Framework and Distributed File System

Internals of Hadoop Application Framework and Distributed File System International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop

More information

Log Mining Based on Hadoop s Map and Reduce Technique

Log Mining Based on Hadoop s Map and Reduce Technique Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com

More information

From GWS to MapReduce: Google s Cloud Technology in the Early Days

From GWS to MapReduce: Google s Cloud Technology in the Early Days Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON BIG DATA MANAGEMENT AND ITS SECURITY PRUTHVIKA S. KADU 1, DR. H. R.

More information

Generic Log Analyzer Using Hadoop Mapreduce Framework

Generic Log Analyzer Using Hadoop Mapreduce Framework Generic Log Analyzer Using Hadoop Mapreduce Framework Milind Bhandare 1, Prof. Kuntal Barua 2, Vikas Nagare 3, Dynaneshwar Ekhande 4, Rahul Pawar 5 1 M.Tech(Appeare), 2 Asst. Prof., LNCT, Indore 3 ME,

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:

More information

Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis

Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis , 22-24 October, 2014, San Francisco, USA Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis Teng Zhao, Kai Qian, Dan Lo, Minzhe Guo, Prabir Bhattacharya, Wei Chen, and Ying

More information

The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform

The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, Ko-Chin Chang, and Wei-Tsong Lee Abstract Virtualization platform solutions

More information

Contents. 1. Introduction

Contents. 1. Introduction Summary Cloud computing has become one of the key words in the IT industry. The cloud represents the internet or an infrastructure for the communication between all components, providing and receiving

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture

More information

Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

More information

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP White Paper Big Data and Hadoop Abhishek S, Java COE www.marlabs.com Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP Table of contents Abstract.. 1 Introduction. 2 What is Big

More information

Mr. Apichon Witayangkurn apichon@iis.u-tokyo.ac.jp Department of Civil Engineering The University of Tokyo

Mr. Apichon Witayangkurn apichon@iis.u-tokyo.ac.jp Department of Civil Engineering The University of Tokyo Sensor Network Messaging Service Hive/Hadoop Mr. Apichon Witayangkurn apichon@iis.u-tokyo.ac.jp Department of Civil Engineering The University of Tokyo Contents 1 Introduction 2 What & Why Sensor Network

More information