Size: px
Start display at page:

Download "http://www.paper.edu.cn"

Transcription

1 A platform for massive railway information data storage # SHAN Xu 1, WANG Genying 1, LIU Lin 2** (1. Key Laboratory of Communication and Information Systems, Beijing Municipal Commission of Education, Beijing Jiaotong University, Beijing ; 2. China Information Technology Security Evaluation Center, Beijing ) Abstract: With the development of national large-scale railway construction, massive railway information data emerge rapidly, and then how to store and manage these data effectively becomes very significant. This paper puts forward a method based on distributed computing technology to store and manage massive railway information data, builds massive railway information data storage platform by using the Linux cluster technology. This system consists of three levels including data access layer, data management layer, application interface layer, enjoying safety and reliability, low operation cost, fast processing speed, easy expansibility characteristics, which shall satisfy the massive railway information data storage requirement. Keywords: Massive Railway Information Data Storage; Hadoop Distributed Technology; Cluster System 0 Introduction The Medium and Long-Term Railway Network Plan has established the goal to finish railway construction in 2020, when the high standard and large-scale railway construction will appear in full swing and will generate huge amounts of railway information data. These data, being vast and complex, diverse, heterogeneous, dynamic, relate to various aspects such as railway geographic information, railway construction, railway operation and maintenance, and railway dispatching. However, the current situation is lack of the unified collection and storage criterion or standard, leading to the data island phenomenon. How to store and manage massive railway information data and how to make more efficient use of these data, become one of the key even the bottleneck projects in railway department, and that s what this paper is about. Traditional methods to deal with massive data mostly use distributed high performance computing and grid computing technology[1], which consume expensive computing resources, need tedious programming to realize effective segmentation of massive data and reasonable distribution of computing tasks. Fortunately, the new development of Hadoop distributed technology can solve these problems better[2]. Basing on Linux cluster technology and using the Hadoop distributed technology, this platform effectively processes the massive amounts of railway information data and stores them in the distributed database, which designs and implements an easily extended and effective massive data storage management system. In the section 2 of this article, the author gives an introduction to the massive data storage platform architecture. The section 3 focuses on the key technologies of system implementation needed. The section 4 illustrates the implementation and performance characteristics of the system. Finally, the section 5 presents the conclusions. Foundations: National Natural Science Foundation of China Grant (NO , NO ), Beijing Natural Science Foundation Grant (NO ), Research Fund for the Doctoral Program of Higher Education of China Grant (NO ), Beijing Science and Technology Program Grant (NO.Z ). Brief author introduction:shan Xu, (1989-), Male, Computer Network. Correspondance author: WANG Genying, (1958-), Male, Associate Professor,Computer Network

2 40 1 Platform Architecture 1.1 Platform overall framework 45 According to the actual needs, we employ the MVC three-tier framework for system design. The platform is divided into data resource layer, business logic layer and presentation layer. Each layer has a clear division of responsibilities, high cohesion and low coupling, making the structure more clear and easier to maintain. The platform overall Framework is shown in figure 1. Fig. 1 Platform Overall Framework Data resource layer, composed of several storage nodes, storing and managing massive railway information data, is the foundation of the whole platform. Business logic layer, provides the parallel loading storage of the massive railway information data and the management and support services to ensure the normal operation of the system, is the core of the whole platform. Presentation layer, provide users with user-friendly interface, convenient for users to query the railway information data and extend system, which meets users directly. 1.2 Network topology of the platform 60 The platform adopts the distributed, hierarchical structure in the design, which stores the resources in multiple nodes in different parts of the cluster server. The main service node schedules and manages the distributed server cluster nodes in an unified way. With the data volume increasing and the complex application requirements changing, this platform can be easily extended, and the existing relational database can also be integrated into the platform[3]. And through the de-isomerization process, the platform and the existing relational database can jointly provide storage services for users, which serves the users transparently massive railway information data by storage and management functions

3 65 Fig. 2 Network Topology of the Platform 1.3 Overall functional design of the platform 70 According to the system function, the system can be divided into three layers, including data access layer, data management layer and application interface layer, as is shown in figure 3. Fig. 3 Overall Functional Design of the Platform Data Access Layer, with the support of the upper data management, connects the storage nodes distributed in different parts through the local area network (LAN) and the Internet to form a distributed cluster system, which provides shielding for different sources of various databases, and provides database access by service function, only to provide a convenient management and deployment chance. Data Management Layer, after the huge amounts of data parallel processing, stores the processed data in the distributed database of this system, while it provides management support service to guarantee the system run normally. It is formed of six function modules, including system management module, log management module, parallel load module, load balancing module, storage module, parallel query module, backup recovery module. System management module is used for software management, including running state monitoring, remote system deployment and autonomous running and maintenance, etc. Log management module is used for software operation log management, including the system trajectories, key events and state records, - 3 -

4 etc.. Load balancing serves load balancing and fault tolerance of management of storage node. Parallel loading storage module provides parallel loading and storage for huge amounts of data. Parallel query module provides parallel query, user-defined function such as transaction processing. Backup recovery module provides system stored data backup management, backup storage, backup restore, etc[3]. Application Interface Layer, according to the actual business types, provides different application service interface, and provides various application service on the basis of user authentication to meet the needs of different users. 2 The key Technologies of the Platform Development According to the front introduction of Hadoop and functional design of the platform, the most important part is the data management layer. When to manage it, parallel loading storage module of the data become the core of the entire platform. The Hadoop distributed technology provides models and methods of data storage and data processing for the platform. We use the Hadoop distributed file system to store a huge number of source data, and use MapReduce distributed computing model to deal with these data, then use HBase distributed database to store the processed data, in order to realize the storage and management of massive railway information data. Hadoop storage architecture is shown in the figure 4[4]. 105 Fig. 4 Hadoop Storage System Architecture Diagram 2.1 The Hadoop distributed file system HDFS is the storage basis of distributed computing, which has high fault tolerance and can be deployed on cheap hardware equipment, to store massive data[4]. HDFS uses a structural model of master-slave (Master / Slave). A HDFS cluster consists of a NameNode and several DataNodes. NameNode plays as the primary server, which manages the file system namespace and the client operation access to the file. DataNode manages stored data. From internal point, the file is divided into a plurality of data blocks, and the plurality of data blocks are stored in a set of DataNode. The NameNode execute file system namespace operations, such as opening, closing, renaming file or directory. It is also responsible for mapping data block to the specific DataNode. DataNode is responsible for handling the file system client's file read and write requests, and carries on the data block create, delete, and copy job under the unified dispatching of the NameNode. The HDFS architecture is shown in figure

5 120 Fig. 5 HDFS Architecture Diagram HDFS is of high throughput of data reading and writing, provides a basis to massive storage for railway information data, stored as an unprocessed set of source data in the Hadoop distributed file system. In our platform, we use HDFS to store these large amounts of source data The MapReduce distributed computing model MapReduce is a summary of task decomposition and results. Map is to break the task down into multiple tasks, and Reduce is to sum the results of the breakdown multitasking together to get the final result. Calculation process can be concluded as the Map (in_key in_value) - > list (inter_key inter_value) and Reduce (inter_key, list (inter_value)) - > list (out_value). In this platform, we will firstly read vast railway information data from the HDFS and divide them into M pieces to operate in parallel Map, and secondly form the state intermediate pair < k, value >, and then operate in Group on k value, just form new < k list (value) > tuple, and then break the tuples into R segments to operate in parallel Reduce, and finally, the processed results shall be stored in the distributed database. Calculation model is shown as below Fig. 6 MapReduce Computing Model The implementation of the MapReduce computing model in this platform is composed of JobTracker running on the primary node singly and TaskTracker running on nodes of each cluster[5]. The primary node is responsible for all the tasks scheduling, which constitutes the whole work, and these tasks are distributed in different nodes. The master node monitors their execution, and reruns the failing tasks; sub-node is responsible for the tasks assigned by the master node. When a Job is submitted, the JobTracker receives operation and configuration information, it will distribute the configuration information to the sub-nodes, and schedule the task and monitor TaskTracker s execution. Our platform use this way to achieve the massive railway information - 5 -

6 data processing HBase distributed database HBase is of high-reliability, high performance, oriented column, scalable distributed storage system[6]. The data line includes three basic types, Row Key, Timestamp and Column Family. Each line includes a sortable line keywords that uniquely identifies the data row in a table. An optional time stamp, each data operation has an associated timestamp. One or more column clusters, each column clusters can consist of any number of columns, they can have data or not. Vast amounts of railway information data after the MapReduce computation can use the value of k as a line keyword for distributed storage, implements massive data storage and management functions. Railway information data storage sample as shown in table 1. Line keywords represent railway geographic, construction, operation and maintenance, and dispatching information. Timestamp means the time it cost to operate the data. As is shown in the table, at time t3 Jinan to Qingdao direction at 73km happens Roadbed damages, and in the moment of t6 shows the dispatching information that the G88 train starts from Shanghai to Beijing at 15:00. Tab. 1 Railway information data storage sample Row Key Timestamp Column Family Location Value Geographic t1 Construction t3 Jinan Jinan-Qingdao 73km Roadbed damage t2 Maintenance t4 Beijing Dispatching t5 Shanghai Shanghai-Beijing G18 15:00 3 The Platform Test and Result Analysis Platform performance test When testing the system, the data files are divided into different order of magnitude to get rule numeration, and time consuming of the single machine and of Hadoop cluster shall get comparative analysis. Test results is shown as below Fig. 7 Clustering Performance Test Results We can see from the diagram that when the system deals with 1GB data, the elapsed time the cluster takes is about 4 time of single machine, which is because the distributed architecture of cluster costs some time when the system initialization and intermediate files generated and passed. When the data quantity is small, the Hadoop cluster cannot play out the advantages of distributed computing. As the amount of input file data, Hadoop cluster advantages of distributed parallel computing plays out gradually. When the amount of entering data increases to 15GB from 5GB, single machine processing time increases significantly, while processing time of cluster system increases in a tiny amplitude. When data volumes get close to 20GB, cluster system takes about a - 6 -

7 quarter time of single one. Data test shows that with the amount of data increasing, the cluster saves more time than single machine, which embodies the advantage of the Hadoop cluster on the large amount of data processing speed. 3.2 The result analysis Through performance tests, this platform not only can efficiently store and manage massive data, but also has the following features: 1High safety and reliability. System will save file in different server in the form of multiple copies to ensure the security and integrity of data. 2Data processing speed is fast. System makes documents distribution to different local compute nodes to process the data, which reduces data transfer amount and improves the speed of data processing through the MapReduce model. 3Operation cost is low. Using distributed computing architecture, server performance requirements are lower, just to reduce the cost. 4God extensibility. System adopts parallel expansion method, which can extend cluster scale and storage capacity at any time according to need. 4 Conclusion This paper bases on the Linux cluster technology, uses Hadoop distributed technology, employs the HDFS distributed file system, Map/Reduce distributed computing model and HBase distributed database technology to deal with huge amounts of data, and designs and develops the vast railway information data storage platform based on Hadoop. By doing a lot of ordinary test on cheap computers, it meets the requirements of railway information efficient storage and management of massive data. This platform has the characteristics of high safety and reliability, fast data processing speed, low running cost, good scalability, which will provide certain reference value to railway department for data storage. Acknowledgements This work has been supported by the National Natural Science Foundation of China under Grant , , the Beijing Natural Science Foundation under Grant , the Research Fund for the Doctoral Program of Higher Education of China under Grant , the Beijing Science and Technology Program under Grant Z References [1] Dean J, Ghemawat S. MapReduce:Simplified Data Processing on Large Clusters[J]. Communications of the ACM, 2008, 51(1): [2] Apache Hadoop.Welcome to Apache Hadoop[OL].[2012]. [3] S. Papadimitriou and J. Sun. DisCo: distributed co-clustering with Map-Reduce[J]. IEEE ICDM, 2008, 08(1): [4] H.C.Yang, A.Dasdan, R.L.Hsiao,and D.S.Parker. Mapreduce-merge: simplified rela-tional data processing on large clusters[j]. SCMD, 2007, 07(01): [5] Li Ying-an. Research on Parallelization of Clustering Algorithm Based on MapReduce[D]. Guang Zhou: Zhongshan University, [6] DUO Xue-song, ZHANG Jing and GAO Qiang. A Massive Data Management System based on the Hadoop[J]. Microcomputer Information, 2010, 26(05-I):

8 一 种 海 量 铁 路 信 息 数 据 的 存 储 方 法 单 旭 1, 王 根 英 1, 刘 林 (1. 北 京 交 通 大 学 通 信 与 信 息 系 统 北 京 市 重 点 实 验 室, 北 京 ; 2. 中 国 信 息 安 全 测 评 中 心, 北 京 ) 摘 要 : 随 着 国 家 大 规 模 铁 路 建 设 的 展 开, 海 量 铁 路 信 息 数 据 飞 速 涌 现 出 来, 如 何 有 效 的 存 储 和 管 理 这 些 数 据 显 得 极 为 重 要 本 文 提 出 了 一 种 基 于 分 布 式 计 算 技 术 进 行 存 储 和 管 理 海 量 铁 路 信 息 数 据 的 方 法, 采 用 Linux 集 群 技 术, 构 建 了 海 量 铁 路 信 息 数 据 存 储 平 台. 本 系 统 由 数 据 访 问 层 数 据 管 理 层 应 用 接 口 层 三 个 层 次 组 成, 具 有 安 全 可 靠 运 行 成 本 低 处 理 速 度 快 易 扩 展 性 等 特 点, 能 够 满 足 海 量 铁 路 信 息 数 据 的 存 储 要 求. 关 键 词 : 海 量 数 据 存 储 ; 铁 路 信 息 数 据 ; 分 布 式 计 算 ; 集 群 系 统 中 图 分 类 号 :[U2-9] 2-8 -

Applied research on data mining platform for weather forecast based on cloud storage

Applied research on data mining platform for weather forecast based on cloud storage Applied research on data mining platform for weather forecast based on cloud storage Haiyan Song¹, Leixiao Li 2* and Yuhong Fan 3* 1 Department of Software Engineering t, Inner Mongolia Electronic Information

More information

UPS battery remote monitoring system in cloud computing

UPS battery remote monitoring system in cloud computing , pp.11-15 http://dx.doi.org/10.14257/astl.2014.53.03 UPS battery remote monitoring system in cloud computing Shiwei Li, Haiying Wang, Qi Fan School of Automation, Harbin University of Science and Technology

More information

Design of Electric Energy Acquisition System on Hadoop

Design of Electric Energy Acquisition System on Hadoop , pp.47-54 http://dx.doi.org/10.14257/ijgdc.2015.8.5.04 Design of Electric Energy Acquisition System on Hadoop Yi Wu 1 and Jianjun Zhou 2 1 School of Information Science and Technology, Heilongjiang University

More information

Big Data Storage Architecture Design in Cloud Computing

Big Data Storage Architecture Design in Cloud Computing Big Data Storage Architecture Design in Cloud Computing Xuebin Chen 1, Shi Wang 1( ), Yanyan Dong 1, and Xu Wang 2 1 College of Science, North China University of Science and Technology, Tangshan, Hebei,

More information

Fault Tolerance in Hadoop for Work Migration

Fault Tolerance in Hadoop for Work Migration 1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous

More information

Query and Analysis of Data on Electric Consumption Based on Hadoop

Query and Analysis of Data on Electric Consumption Based on Hadoop , pp.153-160 http://dx.doi.org/10.14257/ijdta.2016.9.2.17 Query and Analysis of Data on Electric Consumption Based on Hadoop Jianjun 1 Zhou and Yi Wu 2 1 Information Science and Technology in Heilongjiang

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

Research of Railway Wagon Flow Forecast System Based on Hadoop-Hazelcast

Research of Railway Wagon Flow Forecast System Based on Hadoop-Hazelcast International Conference on Civil, Transportation and Environment (ICCTE 2016) Research of Railway Wagon Flow Forecast System Based on Hadoop-Hazelcast Xiaodong Zhang1, a, Baotian Dong1, b, Weijia Zhang2,

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

An Hadoop-based Platform for Massive Medical Data Storage

An Hadoop-based Platform for Massive Medical Data Storage 5 10 15 An Hadoop-based Platform for Massive Medical Data Storage WANG Heng * (School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876) Abstract:

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Mobile Storage and Search Engine of Information Oriented to Food Cloud

Mobile Storage and Search Engine of Information Oriented to Food Cloud Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 ISSN: 2042-4868; e-issn: 2042-4876 Maxwell Scientific Organization, 2013 Submitted: May 29, 2013 Accepted: July 04, 2013 Published:

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

Apache Hadoop new way for the company to store and analyze big data

Apache Hadoop new way for the company to store and analyze big data Apache Hadoop new way for the company to store and analyze big data Reyna Ulaque Software Engineer Agenda What is Big Data? What is Hadoop? Who uses Hadoop? Hadoop Architecture Hadoop Distributed File

More information

On Cloud Computing Technology in the Construction of Digital Campus

On Cloud Computing Technology in the Construction of Digital Campus 2012 International Conference on Innovation and Information Management (ICIIM 2012) IPCSIT vol. 36 (2012) (2012) IACSIT Press, Singapore On Cloud Computing Technology in the Construction of Digital Campus

More information

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)

More information

Telecom Data processing and analysis based on Hadoop

Telecom Data processing and analysis based on Hadoop COMPUTER MODELLING & NEW TECHNOLOGIES 214 18(12B) 658-664 Abstract Telecom Data processing and analysis based on Hadoop Guofan Lu, Qingnian Zhang *, Zhao Chen Wuhan University of Technology, Wuhan 4363,China

More information

Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392. Research Article. E-commerce recommendation system on cloud computing

Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392. Research Article. E-commerce recommendation system on cloud computing Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 E-commerce recommendation system on cloud computing

More information

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

DATA SECURITY MODEL FOR CLOUD COMPUTING

DATA SECURITY MODEL FOR CLOUD COMPUTING DATA SECURITY MODEL FOR CLOUD COMPUTING POOJA DHAWAN Assistant Professor, Deptt of Computer Application and Science Hindu Girls College, Jagadhri 135 001 poojadhawan786@gmail.com ABSTRACT Cloud Computing

More information

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES 1 MYOUNGJIN KIM, 2 CUI YUN, 3 SEUNGHO HAN, 4 HANKU LEE 1,2,3,4 Department of Internet & Multimedia Engineering,

More information

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop

More information

!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets

!#$%&' ( )%#*'+,'-#.//0( !#$%&'()*$+()',!-+.'/', 4(5,67,!-+!89,:*$;'0+$.<.,&0$'09,&)/=+,!()<>'0, 3, Processing LARGE data sets !"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

Storage and Data processing for Wireless Sensor Network integrated with Cloud Computing

Storage and Data processing for Wireless Sensor Network integrated with Cloud Computing International Journal of Research In Science & Engineering e-issn: 2394-8299 Special Issue: Techno-Xtreme 16 p-issn: 2394-8280 Storage and Data processing for Wireless Sensor Network integrated with Cloud

More information

HadoopRDF : A Scalable RDF Data Analysis System

HadoopRDF : A Scalable RDF Data Analysis System HadoopRDF : A Scalable RDF Data Analysis System Yuan Tian 1, Jinhang DU 1, Haofen Wang 1, Yuan Ni 2, and Yong Yu 1 1 Shanghai Jiao Tong University, Shanghai, China {tian,dujh,whfcarter}@apex.sjtu.edu.cn

More information

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information

Layers Construct Design for Data Mining Platform Based on Cloud Computing

Layers Construct Design for Data Mining Platform Based on Cloud Computing TELKOMNIKA Indonesian Journal of Electrical Engineering Vol. 12, No. 3, March 2014, pp. 2021 2027 DOI: http://dx.doi.org/10.11591/telkomnika.v12.i3.3864 2021 Layers Construct Design for Data Mining Platform

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

The WAMS Power Data Processing based on Hadoop

The WAMS Power Data Processing based on Hadoop Proceedings of 2012 4th International Conference on Machine Learning and Computing IPCSIT vol. 25 (2012) (2012) IACSIT Press, Singapore The WAMS Power Data Processing based on Hadoop Zhaoyang Qu 1, Shilin

More information

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Keywords: Big Data, HDFS, Map Reduce, Hadoop Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

More information

Analysis of Information Management and Scheduling Technology in Hadoop

Analysis of Information Management and Scheduling Technology in Hadoop Analysis of Information Management and Scheduling Technology in Hadoop Ma Weihua, Zhang Hong, Li Qianmu, Xia Bin School of Computer Science and Technology Nanjing University of Science and Engineering

More information

Hadoop Scheduler w i t h Deadline Constraint

Hadoop Scheduler w i t h Deadline Constraint Hadoop Scheduler w i t h Deadline Constraint Geetha J 1, N UdayBhaskar 2, P ChennaReddy 3,Neha Sniha 4 1,4 Department of Computer Science and Engineering, M S Ramaiah Institute of Technology, Bangalore,

More information

Performance and Energy Efficiency of. Hadoop deployment models

Performance and Energy Efficiency of. Hadoop deployment models Performance and Energy Efficiency of Hadoop deployment models Contents Review: What is MapReduce Review: What is Hadoop Hadoop Deployment Models Metrics Experiment Results Summary MapReduce Introduced

More information

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social

More information

Scalable Multiple NameNodes Hadoop Cloud Storage System

Scalable Multiple NameNodes Hadoop Cloud Storage System Vol.8, No.1 (2015), pp.105-110 http://dx.doi.org/10.14257/ijdta.2015.8.1.12 Scalable Multiple NameNodes Hadoop Cloud Storage System Kun Bi 1 and Dezhi Han 1,2 1 College of Information Engineering, Shanghai

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Final Project Proposal. CSCI.6500 Distributed Computing over the Internet

Final Project Proposal. CSCI.6500 Distributed Computing over the Internet Final Project Proposal CSCI.6500 Distributed Computing over the Internet Qingling Wang 660795696 1. Purpose Implement an application layer on Hybrid Grid Cloud Infrastructure to automatically or at least

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A COMPREHENSIVE VIEW OF HADOOP ER. AMRINDER KAUR Assistant Professor, Department

More information

The Key Technology Research of Virtual Laboratory based On Cloud Computing Ling Zhang

The Key Technology Research of Virtual Laboratory based On Cloud Computing Ling Zhang International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2015) The Key Technology Research of Virtual Laboratory based On Cloud Computing Ling Zhang Nanjing Communications

More information

HDFS Federation. Sanjay Radia Founder and Architect @ Hortonworks. Page 1

HDFS Federation. Sanjay Radia Founder and Architect @ Hortonworks. Page 1 HDFS Federation Sanjay Radia Founder and Architect @ Hortonworks Page 1 About Me Apache Hadoop Committer and Member of Hadoop PMC Architect of core-hadoop @ Yahoo - Focusing on HDFS, MapReduce scheduler,

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2 Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data

More information

A Cloud Test Bed for China Railway Enterprise Data Center

A Cloud Test Bed for China Railway Enterprise Data Center A Cloud Test Bed for China Railway Enterprise Data Center BACKGROUND China Railway consists of eighteen regional bureaus, geographically distributed across China, with each regional bureau having their

More information

A Multilevel Secure MapReduce Framework for Cross-Domain Information Sharing in the Cloud

A Multilevel Secure MapReduce Framework for Cross-Domain Information Sharing in the Cloud A Multilevel Secure MapReduce Framework for Cross-Domain Information Sharing in the Cloud Thuy D. Nguyen, Cynthia E. Irvine, Jean Khosalim Department of Computer Science Ground System Architectures Workshop

More information

Cloud Storage Solution for WSN Based on Internet Innovation Union

Cloud Storage Solution for WSN Based on Internet Innovation Union Cloud Storage Solution for WSN Based on Internet Innovation Union Tongrang Fan 1, Xuan Zhang 1, Feng Gao 1 1 School of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang,

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

More information

Efficient Data Replication Scheme based on Hadoop Distributed File System

Efficient Data Replication Scheme based on Hadoop Distributed File System , pp. 177-186 http://dx.doi.org/10.14257/ijseia.2015.9.12.16 Efficient Data Replication Scheme based on Hadoop Distributed File System Jungha Lee 1, Jaehwa Chung 2 and Daewon Lee 3* 1 Division of Supercomputing,

More information

Open Access Research on Database Massive Data Processing and Mining Method based on Hadoop Cloud Platform

Open Access Research on Database Massive Data Processing and Mining Method based on Hadoop Cloud Platform Send Orders for Reprints to reprints@benthamscience.ae The Open Automation and Control Systems Journal, 2014, 6, 1463-1467 1463 Open Access Research on Database Massive Data Processing and Mining Method

More information

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

marlabs driving digital agility WHITEPAPER Big Data and Hadoop marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance

More information

Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis

Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis , 22-24 October, 2014, San Francisco, USA Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis Teng Zhao, Kai Qian, Dan Lo, Minzhe Guo, Prabir Bhattacharya, Wei Chen, and Ying

More information

Big Data. White Paper. Big Data Executive Overview WP-BD-10312014-01. Jafar Shunnar & Dan Raver. Page 1 Last Updated 11-10-2014

Big Data. White Paper. Big Data Executive Overview WP-BD-10312014-01. Jafar Shunnar & Dan Raver. Page 1 Last Updated 11-10-2014 White Paper Big Data Executive Overview WP-BD-10312014-01 By Jafar Shunnar & Dan Raver Page 1 Last Updated 11-10-2014 Table of Contents Section 01 Big Data Facts Page 3-4 Section 02 What is Big Data? Page

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

The Regional Medical Business Process Optimization Based on Cloud Computing Medical Resources Sharing Environment

The Regional Medical Business Process Optimization Based on Cloud Computing Medical Resources Sharing Environment BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 13, Special Issue Sofia 2013 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.2478/cait-2013-0034 The Regional Medical

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

Sector vs. Hadoop. A Brief Comparison Between the Two Systems

Sector vs. Hadoop. A Brief Comparison Between the Two Systems Sector vs. Hadoop A Brief Comparison Between the Two Systems Background Sector is a relatively new system that is broadly comparable to Hadoop, and people want to know what are the differences. Is Sector

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

Development and Application Study of Marine Data Managing and Sharing Platform

Development and Application Study of Marine Data Managing and Sharing Platform Development and Application Study of Marine Data Managing and Sharing Platform Abstract North China Sea Marine Technical Support Center, State Oceanic Administration, China 266033 Corresponding author

More information

International Journal of Advance Research in Computer Science and Management Studies

International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

CDH AND BUSINESS CONTINUITY:

CDH AND BUSINESS CONTINUITY: WHITE PAPER CDH AND BUSINESS CONTINUITY: An overview of the availability, data protection and disaster recovery features in Hadoop Abstract Using the sophisticated built-in capabilities of CDH for tunable

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.

More information

Memory Database Application in the Processing of Huge Amounts of Data Daqiang Xiao 1, Qi Qian 2, Jianhua Yang 3, Guang Chen 4

Memory Database Application in the Processing of Huge Amounts of Data Daqiang Xiao 1, Qi Qian 2, Jianhua Yang 3, Guang Chen 4 5th International Conference on Advanced Materials and Computer Science (ICAMCS 2016) Memory Database Application in the Processing of Huge Amounts of Data Daqiang Xiao 1, Qi Qian 2, Jianhua Yang 3, Guang

More information

Cloud Computing based on the Hadoop Platform

Cloud Computing based on the Hadoop Platform Cloud Computing based on the Hadoop Platform Harshita Pandey 1 UG, Department of Information Technology RKGITW, Ghaziabad ABSTRACT In the recent years,cloud computing has come forth as the new IT paradigm.

More information

Storage and Retrieval of Data for Smart City using Hadoop

Storage and Retrieval of Data for Smart City using Hadoop Storage and Retrieval of Data for Smart City using Hadoop Ravi Gehlot Department of Computer Science Poornima Institute of Engineering and Technology Jaipur, India Abstract Smart cities are equipped with

More information

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,

More information

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2 Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

More information

Survey on Load Rebalancing for Distributed File System in Cloud

Survey on Load Rebalancing for Distributed File System in Cloud Survey on Load Rebalancing for Distributed File System in Cloud Prof. Pranalini S. Ketkar Ankita Bhimrao Patkure IT Department, DCOER, PG Scholar, Computer Department DCOER, Pune University Pune university

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

A CLOUD-BASED FRAMEWORK FOR ONLINE MANAGEMENT OF MASSIVE BIMS USING HADOOP AND WEBGL

A CLOUD-BASED FRAMEWORK FOR ONLINE MANAGEMENT OF MASSIVE BIMS USING HADOOP AND WEBGL A CLOUD-BASED FRAMEWORK FOR ONLINE MANAGEMENT OF MASSIVE BIMS USING HADOOP AND WEBGL *Hung-Ming Chen, Chuan-Chien Hou, and Tsung-Hsi Lin Department of Construction Engineering National Taiwan University

More information

http://www.wordle.net/

http://www.wordle.net/ Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely

More information

International Journal of Innovative Research in Computer and Communication Engineering

International Journal of Innovative Research in Computer and Communication Engineering FP Tree Algorithm and Approaches in Big Data T.Rathika 1, J.Senthil Murugan 2 Assistant Professor, Department of CSE, SRM University, Ramapuram Campus, Chennai, Tamil Nadu,India 1 Assistant Professor,

More information

Exploration on Security System Structure of Smart Campus Based on Cloud Computing. Wei Zhou

Exploration on Security System Structure of Smart Campus Based on Cloud Computing. Wei Zhou 3rd International Conference on Science and Social Research (ICSSR 2014) Exploration on Security System Structure of Smart Campus Based on Cloud Computing Wei Zhou Information Center, Shanghai University

More information

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Prepared By : Manoj Kumar Joshi & Vikas Sawhney Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks

More information

Energy-Saving Cloud Computing Platform Based On Micro-Embedded System

Energy-Saving Cloud Computing Platform Based On Micro-Embedded System Energy-Saving Cloud Computing Platform Based On Micro-Embedded System Wen-Hsu HSIEH *, San-Peng KAO **, Kuang-Hung TAN **, Jiann-Liang CHEN ** * Department of Computer and Communication, De Lin Institute

More information

Non-Stop for Apache HBase: Active-active region server clusters TECHNICAL BRIEF

Non-Stop for Apache HBase: Active-active region server clusters TECHNICAL BRIEF Non-Stop for Apache HBase: -active region server clusters TECHNICAL BRIEF Technical Brief: -active region server clusters -active region server clusters HBase is a non-relational database that provides

More information

MapReduce Job Processing

MapReduce Job Processing April 17, 2012 Background: Hadoop Distributed File System (HDFS) Hadoop requires a Distributed File System (DFS), we utilize the Hadoop Distributed File System (HDFS). Background: Hadoop Distributed File

More information

BPM Architecture Design Based on Cloud Computing

BPM Architecture Design Based on Cloud Computing Intelligent Information Management, 2010, 2, 329-333 doi:10.4236/iim.2010.25039 Published Online May 2010 (http://www.scirp.org/journal/iim) BPM Architecture Design Based on Cloud Computing Abstract Zhenyu

More information

Distributed Consistency Method and Two-Phase Locking in Cloud Storage over Multiple Data Centers

Distributed Consistency Method and Two-Phase Locking in Cloud Storage over Multiple Data Centers BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 15, No 6 Special Issue on Logistics, Informatics and Service Science Sofia 2015 Print ISSN: 1311-9702; Online ISSN: 1314-4081

More information

Apache HBase. Crazy dances on the elephant back

Apache HBase. Crazy dances on the elephant back Apache HBase Crazy dances on the elephant back Roman Nikitchenko, 16.10.2014 YARN 2 FIRST EVER DATA OS 10.000 nodes computer Recent technology changes are focused on higher scale. Better resource usage

More information

A very short Intro to Hadoop

A very short Intro to Hadoop 4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,

More information

Snapshots in Hadoop Distributed File System

Snapshots in Hadoop Distributed File System Snapshots in Hadoop Distributed File System Sameer Agarwal UC Berkeley Dhruba Borthakur Facebook Inc. Ion Stoica UC Berkeley Abstract The ability to take snapshots is an essential functionality of any

More information

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP White Paper Big Data and Hadoop Abhishek S, Java COE www.marlabs.com Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP Table of contents Abstract.. 1 Introduction. 2 What is Big

More information

Log Mining Based on Hadoop s Map and Reduce Technique

Log Mining Based on Hadoop s Map and Reduce Technique Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com

More information

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi Getting Started with Hadoop Raanan Dagan Paul Tibaldi What is Apache Hadoop? Hadoop is a platform for data storage and processing that is Scalable Fault tolerant Open source CORE HADOOP COMPONENTS Hadoop

More information

Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud

Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud Aditya Jadhav, Mahesh Kukreja E-mail: aditya.jadhav27@gmail.com & mr_mahesh_in@yahoo.co.in Abstract : In the information industry,

More information

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large quantities of data

More information

Hadoop: Embracing future hardware

Hadoop: Embracing future hardware Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop

More information

Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks

Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks Praveenkumar Kondikoppa, Chui-Hui Chiu, Cheng Cui, Lin Xue and Seung-Jong Park Department of Computer Science,

More information

Resource Scheduling Technology

Resource Scheduling Technology A Cloud Computing Platform for FY 4 Based on Resource Scheduling Technology Xiangang Zhao, Manyun Lin, Lan Wei, Lizi Xie, Zhanyun Zhang, Peng Guo Nov 19 21, 2014, Shanghai, China Outline 1. IT scale of

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

Distributed File Systems

Distributed File Systems Distributed File Systems Mauro Fruet University of Trento - Italy 2011/12/19 Mauro Fruet (UniTN) Distributed File Systems 2011/12/19 1 / 39 Outline 1 Distributed File Systems 2 The Google File System (GFS)

More information

Distributed computing: index building and use

Distributed computing: index building and use Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information