A platform for massive railway information data storage

SHAN Xu 1, WANG Genying 1, LIU Lin 2
(1. Key Laboratory of Communication and Information Systems, Beijing Municipal Commission of Education, Beijing Jiaotong University, Beijing; 2. China Information Technology Security Evaluation Center, Beijing)

Abstract: With the development of large-scale national railway construction, massive amounts of railway information data are emerging rapidly, and storing and managing these data effectively has become very important. This paper proposes a method based on distributed computing technology for storing and managing massive railway information data, and builds a massive railway information data storage platform using Linux cluster technology. The system consists of three layers: the data access layer, the data management layer, and the application interface layer. It offers safety and reliability, low operating cost, fast processing speed, and easy extensibility, and satisfies the storage requirements of massive railway information data.

Keywords: Massive Railway Information Data Storage; Hadoop Distributed Technology; Cluster System

0 Introduction

The Medium and Long-Term Railway Network Plan sets the goal of completing railway construction by 2020. By then, high-standard, large-scale railway construction will be in full swing and will generate huge amounts of railway information data. These data are vast, complex, diverse, heterogeneous, and dynamic, and relate to aspects such as railway geographic information, railway construction, railway operation and maintenance, and railway dispatching. However, there is currently no unified criterion or standard for collecting and storing them, which leads to the "data island" phenomenon. How to store and manage massive railway information data, and how to use these data more efficiently, has become one of the key problems, even a bottleneck, for the railway department, and it is the subject of this paper.
Traditional methods for dealing with massive data mostly use distributed high-performance computing and grid computing technology[1], which consume expensive computing resources and require tedious programming to segment the data effectively and distribute computing tasks reasonably. The more recent Hadoop distributed technology handles these problems better[2]. Based on Linux cluster technology and Hadoop, this platform processes massive amounts of railway information data and stores them in a distributed database, giving an easily extended and effective massive data storage management system.

Section 1 of this article introduces the architecture of the massive data storage platform. Section 2 focuses on the key technologies needed to implement the system. Section 3 describes the implementation and performance characteristics of the system. Finally, Section 4 presents the conclusions.

Foundations: National Natural Science Foundation of China Grant (NO , NO ), Beijing Natural Science Foundation Grant (NO ), Research Fund for the Doctoral Program of Higher Education of China Grant (NO ), Beijing Science and Technology Program Grant (NO.Z ).
Brief author introduction: SHAN Xu (1989-), Male, Computer Network.
Correspondence author: WANG Genying (1958-), Male, Associate Professor, Computer Network. gywang@bjtu.edu.cn
1 Platform Architecture

1.1 Platform overall framework

According to actual needs, we employ the MVC three-tier framework for system design. The platform is divided into a data resource layer, a business logic layer, and a presentation layer. Each layer has a clear division of responsibilities, with high cohesion and low coupling, which makes the structure clearer and easier to maintain. The overall framework of the platform is shown in Figure 1.

Fig. 1 Platform Overall Framework

The data resource layer, composed of several storage nodes that store and manage the massive railway information data, is the foundation of the whole platform. The business logic layer provides parallel loading and storage of the massive railway information data, together with the management and support services that keep the system operating normally; it is the core of the whole platform. The presentation layer faces users directly, providing a user-friendly interface through which they can query railway information data and extend the system.

1.2 Network topology of the platform

The platform adopts a distributed, hierarchical structure that stores resources on multiple nodes in different parts of the server cluster. The main service node schedules and manages the distributed server cluster nodes in a unified way. As the data volume grows and application requirements change, the platform can be easily extended, and an existing relational database can also be integrated into it[3]. After the heterogeneity between them is removed, the platform and the existing relational database can jointly provide storage services, giving users transparent storage and management of massive railway information data.
Fig. 2 Network Topology of the Platform

1.3 Overall functional design of the platform

By function, the system can be divided into three layers: the data access layer, the data management layer, and the application interface layer, as shown in Figure 3.

Fig. 3 Overall Functional Design of the Platform

The Data Access Layer, supported by the data management layer above it, connects storage nodes distributed in different places through the local area network (LAN) and the Internet to form a distributed cluster system. It hides the differences between databases from various sources and exposes database access as a service, which simplifies management and deployment.

The Data Management Layer processes the huge amounts of data in parallel and stores the processed data in the distributed database of the system, while providing the management support services that keep the system running normally. It consists of six function modules: the system management module, log management module, parallel loading storage module, load balancing module, parallel query module, and backup recovery module. The system management module is used for software management, including running state monitoring, remote system deployment, and autonomous operation and maintenance. The log management module manages software operation logs, including system traces, key events and state records,
etc. The load balancing module handles load balancing and fault tolerance across the storage nodes. The parallel loading storage module provides parallel loading and storage for huge amounts of data. The parallel query module provides parallel queries and user-defined functions such as transaction processing. The backup recovery module manages backups of the data stored in the system, including backup storage and backup restore[3].

The Application Interface Layer provides different application service interfaces according to the actual business types, and provides the various application services on the basis of user authentication to meet the needs of different users.

2 The Key Technologies of the Platform Development

Given the preceding introduction of Hadoop and the functional design of the platform, the most important part is the data management layer, and within it the parallel loading storage module is the core of the entire platform. The Hadoop distributed technology provides the platform with models and methods for data storage and data processing. We use the Hadoop distributed file system (HDFS) to store the large volume of source data, use the MapReduce distributed computing model to process these data, and then use the HBase distributed database to store the processed data, in order to realize the storage and management of massive railway information data. The Hadoop storage architecture is shown in Figure 4[4].

Fig. 4 Hadoop Storage System Architecture Diagram

2.1 The Hadoop distributed file system

HDFS is the storage basis of distributed computing. It has high fault tolerance and can be deployed on cheap hardware to store massive data[4]. HDFS uses a master-slave (Master/Slave) structural model. An HDFS cluster consists of one NameNode and several DataNodes. The NameNode acts as the primary server, managing the file system namespace and client access to files. The DataNodes manage the stored data.
Internally, a file is divided into multiple data blocks, and these blocks are stored on a set of DataNodes. The NameNode executes file system namespace operations, such as opening, closing, and renaming files or directories. It is also responsible for mapping each data block to specific DataNodes. The DataNodes handle read and write requests from file system clients, and create, delete, and replicate data blocks under the unified scheduling of the NameNode. The HDFS architecture is shown in Figure 5.
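The division of labour described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not the real HDFS implementation or its API: the class `ToyCluster`, the tiny block size, the replica count, and the round-robin placement rule are all hypothetical choices made for illustration.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class ToyCluster:
    """Toy model: NameNode metadata plus per-DataNode block storage (hypothetical)."""

    def __init__(self, datanode_names, block_size=4, replication=2):
        if replication > len(datanode_names):
            raise ValueError("replication cannot exceed the number of DataNodes")
        self.block_size = block_size
        self.replication = replication
        self.names = list(datanode_names)
        # NameNode side: namespace and block-to-DataNode mapping (metadata only).
        self.namespace = {}   # path -> ordered list of block ids
        self.block_map = {}   # block id -> names of DataNodes holding a replica
        # DataNode side: the block contents themselves.
        self.datanodes = {name: {} for name in self.names}
        self.next_block_id = 0

    def write(self, path, data):
        """Split the file into blocks and place each block on several DataNodes."""
        block_ids = []
        for chunk in split_into_blocks(data, self.block_size):
            block_id = self.next_block_id
            self.next_block_id += 1
            # Assumed placement rule: round-robin over consecutive DataNodes.
            start = block_id % len(self.names)
            targets = [self.names[(start + r) % len(self.names)]
                       for r in range(self.replication)]
            for name in targets:
                self.datanodes[name][block_id] = chunk
            self.block_map[block_id] = targets
            block_ids.append(block_id)
        self.namespace[path] = block_ids

    def read(self, path, failed=()):
        """Reassemble a file, skipping DataNodes listed in `failed`."""
        parts = []
        for block_id in self.namespace[path]:
            live = [n for n in self.block_map[block_id] if n not in failed]
            if not live:
                raise IOError("no live replica for block %d" % block_id)
            parts.append(self.datanodes[live[0]][block_id])
        return b"".join(parts)
```

For example, writing an 18-byte record with a 4-byte block size produces five blocks, and the file can still be read back when one of three DataNodes is marked as failed, because every block has a second replica on a different node.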
Fig. 5 HDFS Architecture Diagram

HDFS offers high throughput for reading and writing data, which provides a basis for the massive storage of railway information data. In our platform, these large amounts of data are stored in HDFS as an unprocessed set of source data.

2.2 The MapReduce distributed computing model

MapReduce abstracts the decomposition of a task and the aggregation of its results. Map breaks the task down into multiple sub-tasks, and Reduce combines the results of those sub-tasks to obtain the final result. The calculation process can be summarized as:

Map(in_key, in_value) -> list(inter_key, inter_value)
Reduce(inter_key, list(inter_value)) -> list(out_value)

In this platform, we first read the vast railway information data from HDFS and divide them into M pieces, which are processed by Map operations in parallel to form intermediate <k, value> pairs. These pairs are then grouped by the value of k to form new <k, list(value)> tuples, and the tuples are divided into R segments that are processed by Reduce operations in parallel. Finally, the processed results are stored in the distributed database. The calculation model is shown below.

Fig. 6 MapReduce Computing Model

The implementation of the MapReduce computing model in this platform is composed of a single JobTracker running on the master node and a TaskTracker running on each node of the cluster[5]. The master node is responsible for scheduling all the tasks that make up a job, and these tasks are distributed across different nodes. The master node monitors their execution and reruns failed tasks; each sub-node is responsible for executing the tasks assigned to it by the master node. When a job is submitted, the JobTracker receives the job and its configuration information, distributes the configuration information to the sub-nodes, schedules the tasks, and monitors the execution of the TaskTrackers. Our platform uses this mechanism to carry out massive railway information
data processing.

2.3 HBase distributed database

HBase is a highly reliable, high-performance, column-oriented, scalable distributed storage system[6]. A data row has three basic parts: the Row Key, the Timestamp, and the Column Family. Each row has a sortable row key that uniquely identifies it within a table. Each data operation has an associated timestamp, which is optional to specify. A row has one or more column families; each column family can consist of any number of columns, which may or may not hold data. After the MapReduce computation, the railway information data can be stored in a distributed manner using the value of k as the row key, which implements the massive data storage and management functions. A sample of stored railway information data is shown in Table 1. The row keys represent railway geographic, construction, operation and maintenance, and dispatching information. The timestamp records the time at which the data operation took place. As shown in the table, at time t3 roadbed damage occurred at 73 km in the Jinan to Qingdao direction, and at time t6 the dispatching information shows that train G88 departs from Shanghai for Beijing at 15:00.

Tab. 1 Railway information data storage sample

Row Key        Timestamp   Column Family: Location   Column Family: Value
Geographic     t1
Construction   t3          Jinan                     Jinan-Qingdao 73km Roadbed damage
               t2
Maintenance    t4          Beijing
Dispatching    t5          Shanghai                  Shanghai-Beijing G18 15:00

3 The Platform Test and Result Analysis

3.1 Platform performance test

To test the system, data files of different orders of magnitude were processed so that the trend could be observed, and the time consumed by a single machine was compared with that of the Hadoop cluster. The test results are shown below.

Fig.
7 Clustering Performance Test Results

The diagram shows that when the system processes 1 GB of data, the elapsed time of the cluster is about 4 times that of the single machine. This is because the distributed architecture of the cluster spends time on system initialization and on generating and transferring intermediate files. When the data quantity is small, the Hadoop cluster cannot bring the advantages of distributed computing into play. As the amount of input data increases, the advantages of distributed parallel computing on the Hadoop cluster gradually show. When the amount of input data increases from 5 GB to 15 GB, the processing time of the single machine increases significantly, while the processing time of the cluster system increases only slightly. When the data volume approaches 20 GB, the cluster system takes about a
quarter of the time of the single machine. The test data show that as the amount of data increases, the cluster saves more and more time compared with the single machine, which reflects the advantage of the Hadoop cluster in processing speed for large amounts of data.

3.2 The result analysis

The performance tests show that this platform not only stores and manages massive data efficiently, but also has the following features:

(1) High safety and reliability. The system saves files on different servers as multiple copies to ensure the security and integrity of the data.
(2) Fast data processing. Through the MapReduce model, the system distributes files to different compute nodes so that the data are processed locally, which reduces the amount of data transferred and improves processing speed.
(3) Low operating cost. With a distributed computing architecture, the performance requirements for each server are lower, which reduces cost.
(4) Good extensibility. The system adopts a parallel expansion method, so the cluster scale and storage capacity can be extended at any time as needed.

4 Conclusion

Based on Linux cluster technology, this paper uses Hadoop distributed technology, employing the HDFS distributed file system, the Map/Reduce distributed computing model, and the HBase distributed database to deal with huge amounts of data, and designs and develops a massive railway information data storage platform based on Hadoop. Extensive tests on inexpensive ordinary computers show that it meets the requirements for efficient storage and management of massive railway information data. The platform is characterized by high safety and reliability, fast data processing, low running cost, and good scalability, and can serve as a reference for data storage in the railway department.
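The processing chain summarized in the conclusion (Map, grouping by key, Reduce, then storage under a row key) can be sketched in plain Python. This is a minimal single-process illustration, not the platform's code and not the Hadoop or HBase API: the `|`-separated input format, the function names, the de-duplicating Reduce step, and the integer timestamps are all assumptions made for the example. The two sample records reuse values from Table 1.

```python
from collections import defaultdict


def map_record(line):
    """Map: parse one raw line and emit (k, value) pairs keyed by category."""
    category, location, detail = line.split("|")
    yield category, (location, detail)


def group_by_key(pairs):
    """Group: turn <k, value> pairs into <k, list(value)> tuples."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_records(key, values):
    """Reduce: here, simply drop duplicate values for one key (assumed logic)."""
    return sorted(set(values))


def run_job(lines):
    """Run Map over every line, group by key, then Reduce each group."""
    pairs = [pair for line in lines for pair in map_record(line)]
    grouped = group_by_key(pairs)
    return {key: reduce_records(key, values) for key, values in grouped.items()}


def store(table, results, first_timestamp=1):
    """Store results as row key -> timestamp -> {column: value}.

    The nested dict only imitates the row key / timestamp / column layout
    described in Section 2.3; the timestamps are hypothetical counters.
    """
    timestamp = first_timestamp
    for row_key in sorted(results):
        for location, detail in results[row_key]:
            table.setdefault(row_key, {})[timestamp] = {
                "Location": location,
                "Value": detail,
            }
            timestamp += 1
    return table


SAMPLE_LINES = [
    "Construction|Jinan|Jinan-Qingdao 73km Roadbed damage",
    "Dispatching|Shanghai|Shanghai-Beijing G18 15:00",
    "Construction|Jinan|Jinan-Qingdao 73km Roadbed damage",  # duplicate on purpose
]
```

Calling `store({}, run_job(SAMPLE_LINES))` yields one row per category, with the duplicated construction record collapsed by the Reduce step. In a real cluster the Map and Reduce calls would run on different nodes in parallel; the sketch keeps them sequential so that the data flow stays visible.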
Acknowledgements

This work has been supported by the National Natural Science Foundation of China under Grant , , the Beijing Natural Science Foundation under Grant , the Research Fund for the Doctoral Program of Higher Education of China under Grant , the Beijing Science and Technology Program under Grant Z

References

[1] Dean J, Ghemawat S. MapReduce: Simplified Data Processing on Large Clusters[J]. Communications of the ACM, 2008, 51(1):
[2] Apache Hadoop. Welcome to Apache Hadoop[OL]. [2012].
[3] S. Papadimitriou and J. Sun. DisCo: distributed co-clustering with Map-Reduce[J]. IEEE ICDM, 2008, 08(1):
[4] H.C. Yang, A. Dasdan, R.L. Hsiao, and D.S. Parker. Map-reduce-merge: simplified relational data processing on large clusters[J]. SCMD, 2007, 07(01):
[5] Li Ying-an. Research on Parallelization of Clustering Algorithm Based on MapReduce[D]. Guangzhou: Zhongshan University,
[6] DUO Xue-song, ZHANG Jing and GAO Qiang. A Massive Data Management System based on the Hadoop[J]. Microcomputer Information, 2010, 26(05-I):
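A storage method for massive railway information data

SHAN Xu 1, WANG Genying 1, LIU Lin 2
(1. Beijing Key Laboratory of Communication and Information Systems, Beijing Jiaotong University, Beijing; 2. China Information Technology Security Evaluation Center, Beijing)

Abstract: With the rollout of large-scale national railway construction, massive railway information data are emerging rapidly, and storing and managing these data effectively has become extremely important. This paper proposes a method for storing and managing massive railway information data based on distributed computing technology, and uses Linux cluster technology to build a massive railway information data storage platform. The system consists of three layers: the data access layer, the data management layer, and the application interface layer. It is safe and reliable, has low operating cost and fast processing speed, and is easy to extend, and it can meet the storage requirements of massive railway information data.

Keywords: massive data storage; railway information data; distributed computing; cluster system

CLC number: [U2-9]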
More informationCloud Computing based on the Hadoop Platform
Cloud Computing based on the Hadoop Platform Harshita Pandey 1 UG, Department of Information Technology RKGITW, Ghaziabad ABSTRACT In the recent years,cloud computing has come forth as the new IT paradigm.
More informationDevelopment and Application Study of Marine Data Managing and Sharing Platform
Development and Application Study of Marine Data Managing and Sharing Platform Abstract North China Sea Marine Technical Support Center, State Oceanic Administration, China 266033 Corresponding author
More informationApache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
More informationA Multilevel Secure MapReduce Framework for Cross-Domain Information Sharing in the Cloud
A Multilevel Secure MapReduce Framework for Cross-Domain Information Sharing in the Cloud Thuy D. Nguyen, Cynthia E. Irvine, Jean Khosalim Department of Computer Science Ground System Architectures Workshop
More informationA Cloud Test Bed for China Railway Enterprise Data Center
A Cloud Test Bed for China Railway Enterprise Data Center BACKGROUND China Railway consists of eighteen regional bureaus, geographically distributed across China, with each regional bureau having their
More informationR.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5
Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,
More informationLog Mining Based on Hadoop s Map and Reduce Technique
Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com
More informationBig Data. White Paper. Big Data Executive Overview WP-BD-10312014-01. Jafar Shunnar & Dan Raver. Page 1 Last Updated 11-10-2014
White Paper Big Data Executive Overview WP-BD-10312014-01 By Jafar Shunnar & Dan Raver Page 1 Last Updated 11-10-2014 Table of Contents Section 01 Big Data Facts Page 3-4 Section 02 What is Big Data? Page
More informationInternational Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
More informationMapReduce Job Processing
April 17, 2012 Background: Hadoop Distributed File System (HDFS) Hadoop requires a Distributed File System (DFS), we utilize the Hadoop Distributed File System (HDFS). Background: Hadoop Distributed File
More informationSURVEY ON SCIENTIFIC DATA MANAGEMENT USING HADOOP MAPREDUCE IN THE KEPLER SCIENTIFIC WORKFLOW SYSTEM
SURVEY ON SCIENTIFIC DATA MANAGEMENT USING HADOOP MAPREDUCE IN THE KEPLER SCIENTIFIC WORKFLOW SYSTEM 1 KONG XIANGSHENG 1 Department of Computer & Information, Xinxiang University, Xinxiang, China E-mail:
More informationParallel Data Mining and Assurance Service Model Using Hadoop in Cloud
Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud Aditya Jadhav, Mahesh Kukreja E-mail: aditya.jadhav27@gmail.com & mr_mahesh_in@yahoo.co.in Abstract : In the information industry,
More informationINTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.
More informationInternational Journal of Innovative Research in Computer and Communication Engineering
FP Tree Algorithm and Approaches in Big Data T.Rathika 1, J.Senthil Murugan 2 Assistant Professor, Department of CSE, SRM University, Ramapuram Campus, Chennai, Tamil Nadu,India 1 Assistant Professor,
More informationA very short Intro to Hadoop
4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,
More informationhttp://www.wordle.net/
Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely
More informationExploration on Security System Structure of Smart Campus Based on Cloud Computing. Wei Zhou
3rd International Conference on Science and Social Research (ICSSR 2014) Exploration on Security System Structure of Smart Campus Based on Cloud Computing Wei Zhou Information Center, Shanghai University
More informationEnergy-Saving Cloud Computing Platform Based On Micro-Embedded System
Energy-Saving Cloud Computing Platform Based On Micro-Embedded System Wen-Hsu HSIEH *, San-Peng KAO **, Kuang-Hung TAN **, Jiann-Liang CHEN ** * Department of Computer and Communication, De Lin Institute
More informationStorage and Retrieval of Data for Smart City using Hadoop
Storage and Retrieval of Data for Smart City using Hadoop Ravi Gehlot Department of Computer Science Poornima Institute of Engineering and Technology Jaipur, India Abstract Smart cities are equipped with
More informationSurvey on Load Rebalancing for Distributed File System in Cloud
Survey on Load Rebalancing for Distributed File System in Cloud Prof. Pranalini S. Ketkar Ankita Bhimrao Patkure IT Department, DCOER, PG Scholar, Computer Department DCOER, Pune University Pune university
More informationWhite Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP
White Paper Big Data and Hadoop Abhishek S, Java COE www.marlabs.com Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP Table of contents Abstract.. 1 Introduction. 2 What is Big
More informationDistributed File Systems
Distributed File Systems Mauro Fruet University of Trento - Italy 2011/12/19 Mauro Fruet (UniTN) Distributed File Systems 2011/12/19 1 / 39 Outline 1 Distributed File Systems 2 The Google File System (GFS)
More informationA CLOUD-BASED FRAMEWORK FOR ONLINE MANAGEMENT OF MASSIVE BIMS USING HADOOP AND WEBGL
A CLOUD-BASED FRAMEWORK FOR ONLINE MANAGEMENT OF MASSIVE BIMS USING HADOOP AND WEBGL *Hung-Ming Chen, Chuan-Chien Hou, and Tsung-Hsi Lin Department of Construction Engineering National Taiwan University
More informationProcessing of Hadoop using Highly Available NameNode
Processing of Hadoop using Highly Available NameNode 1 Akash Deshpande, 2 Shrikant Badwaik, 3 Sailee Nalawade, 4 Anjali Bote, 5 Prof. S. P. Kosbatwar Department of computer Engineering Smt. Kashibai Navale
More informationSector vs. Hadoop. A Brief Comparison Between the Two Systems
Sector vs. Hadoop A Brief Comparison Between the Two Systems Background Sector is a relatively new system that is broadly comparable to Hadoop, and people want to know what are the differences. Is Sector
More informationChapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
More informationPLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
More informationThe Performance Characteristics of MapReduce Applications on Scalable Clusters
The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have
More informationNon-Stop for Apache HBase: Active-active region server clusters TECHNICAL BRIEF
Non-Stop for Apache HBase: -active region server clusters TECHNICAL BRIEF Technical Brief: -active region server clusters -active region server clusters HBase is a non-relational database that provides
More informationSurvey on Scheduling Algorithm in MapReduce Framework
Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More informationHadoop Distributed File System. Jordan Prosch, Matt Kipps
Hadoop Distributed File System Jordan Prosch, Matt Kipps Outline - Background - Architecture - Comments & Suggestions Background What is HDFS? Part of Apache Hadoop - distributed storage What is Hadoop?
More informationPrepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
More informationAssociate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2
Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue
More informationResearch on Reliability of Hadoop Distributed File System
, pp.315-326 http://dx.doi.org/10.14257/ijmue.2015.10.11.30 Research on Reliability of Hadoop Distributed File System Daming Hu, Deyun Chen*, Shuhui Lou and Shujun Pei College of Computer Science and Technology,
More informationA Database Hadoop Hybrid Approach of Big Data
A Database Hadoop Hybrid Approach of Big Data Rupali Y. Behare #1, Prof. S.S.Dandge #2 M.E. (Student), Department of CSE, Department, PRMIT&R, Badnera, SGB Amravati University, India 1. Assistant Professor,
More informationBig Data Storage Options for Hadoop Sam Fineberg, HP Storage
Sam Fineberg, HP Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations
More informationHadoop: Embracing future hardware
Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop
More informationDistributed Consistency Method and Two-Phase Locking in Cloud Storage over Multiple Data Centers
BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 15, No 6 Special Issue on Logistics, Informatics and Service Science Sofia 2015 Print ISSN: 1311-9702; Online ISSN: 1314-4081
More informationHadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN
Hadoop MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Understanding Hadoop Understanding Hadoop What's Hadoop about? Apache Hadoop project (started 2008) downloadable open-source software library (current
More informationDetection of Distributed Denial of Service Attack with Hadoop on Live Network
Detection of Distributed Denial of Service Attack with Hadoop on Live Network Suchita Korad 1, Shubhada Kadam 2, Prajakta Deore 3, Madhuri Jadhav 4, Prof.Rahul Patil 5 Students, Dept. of Computer, PCCOE,
More informationThe Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform
The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, Ko-Chin Chang, and Wei-Tsong Lee Abstract Virtualization platform solutions
More information