A Hadoop-based Platform for Massive Medical Data Storage
WANG Heng* (School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing)

Abstract: With the fast and sound development of information technology in healthcare, medical data are emerging rapidly and in large volumes. However, existing platforms for storing patients' data cannot meet the needs of this ever-increasing volume of medical data. It is therefore important to develop an effective storage platform to manage and store these massive medical data. Cloud computing offers low cost, high scalability, availability, and fault tolerance, which provides a good solution to some of the problems faced in storing and analyzing patients' medical data. Based on distributed computing technology, this paper proposes a novel approach to mass medical data storage and management: a massive medical data storage platform based on Hadoop, built with Linux cluster technology. Extensive experimental results demonstrate the platform's efficiency, reliability, high scalability, and fault tolerance.

Key words: mass data storage; medical data; Hadoop; HBase; distributed computing

Introduction
We are paying more and more attention to disease prevention, inspection, and diagnosis. As a result, more relevant data [1] are collected for disease tracking and other purposes. Previous medical information platforms were mainly based on the Hospital Information System [2]. Such a platform collects all kinds of patient information, and the data are managed by hospitals. Similar systems are EMR (Electronic Medical Records) [3] and PHR (Personal Health Record) [4], which store patients' medical data. With the development of the Internet and social networks, these electronic medical records can be shared with doctors in other hospitals and healthcare organizations. However, the health care data held in EMR and PHR systems are growing larger and larger.
These data are generated at distinct hospitals or organizations and may be heterogeneously distributed across different places, with different data formats. Currently, for example, many health information systems and platforms store and manage these distributed data in an RDBMS [5]. However, with the geometric growth of the platform's structured and unstructured data, traditional relational database solutions have limitations in storing and managing such large and heterogeneous medical data. How to store and manage a large amount of medical data therefore emerges as an important issue.

Traditional large-scale data processing mostly relies on distributed high-performance computing and grid computing. It requires expensive computing resources and tedious programming for large-scale data segmentation and the rational allocation of computing tasks. The Hadoop framework, in contrast, provides a solution to the problem of massive data storage [6,7]. Hadoop is an open-source distributed computing framework. It can build highly reliable and scalable parallel and distributed systems by running applications on a cluster composed of a large number of commodity hardware devices. The Hadoop Distributed File System (HDFS) [8], the MapReduce programming model, and the HBase distributed database [9] are its three core technologies.

Brief author introduction: WANG Heng (1989-), male, master's student; research area: wireless e-health. Email: @163.com

The high performance of Hadoop-based systems motivates us to design a novel data storage system for processing massive medical data in a cost-effective way. Meanwhile, our experimental results show the high scalability and robust fault tolerance of the Hadoop-based system.

The rest of this paper is organized as follows. Section 1 describes the architecture of the proposed massive medical data storage platform. Section 2 presents the functions of each system component and describes how the platform is deployed and implemented. Section 3 evaluates the system's performance and verifies the efficiency of the proposed platform. Finally, Section 4 concludes this paper.

1 Framework of the massive medical data storage platform
Medical data are massive, which imposes many challenges for effective data storage and processing. We propose a platform to accommodate massive medical data that can be large-scale, geographically distributed, and heterogeneous. Fig. 1 shows the framework of our proposed platform. In this framework, the user is located at the top level. Below the user, the platform consists of three parts: the storage layer, the management layer, and the application layer.

Fig. 1 The architecture of the massive medical data storage platform

At the top of the framework, a user could be a medical staff member, a patient, a system manager, or any other person using the system. A medical staff member accesses patients' physiological data, analyzes patient information, and gives diagnoses through the platform.
Patients can retrieve their own records from the platform for various purposes, such as tracking and preventing disease. The data manager administrates the data, for example by specifying data access privileges, and can share the data under certain policies. Many users may access the data simultaneously. Thus, the data system should be application-agnostic, so that different applications can access the system simultaneously through the web portal.
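Such simultaneous multi-user access can be illustrated with a small, self-contained sketch. The in-memory `records` store, the record shape, and the thread pool below are hypothetical stand-ins for the platform's real HBase-backed data layer; only the concurrency pattern is the point:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory store standing in for the platform's HBase-backed
# data layer; the record shape is invented for illustration.
records = {f"patient-{i}": {"heartbeat": 60 + i % 40} for i in range(1000)}

def query(patient_id):
    # One simulated read request against the shared store.
    return records[patient_id]

def run_concurrent_queries(ids, concurrency=100):
    # Serve all requests through a pool of worker threads, mimicking
    # many users hitting the web portal at once (100-concurrency).
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(query, ids))

results = run_concurrent_queries([f"patient-{i}" for i in range(1000)])
```

In the real platform the reads would go through the web portal to HBase; the thread pool here only models the access pattern, not the storage.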
1.1 Storage Layer
As shown in Fig. 1, a storage device stores the medical information. The storage device runs on clusters of commodity hardware and continues working even if a node fails, which is key to keeping the service available at all times. The data access interface provides services with access to the data sources; for example, it can provide different database access services. Medical data such as electronic medical records and medical images require terabytes or petabytes of storage. SQL-style databases are limited in providing such capacity and thus unable to handle massive storage needs. This motivated us to adopt horizontally scalable, distributed non-relational data stores, known as NoSQL databases. With high concurrent read and write performance, mass storage support, easy expansion, and low cost, NoSQL suits the current requirements [10]. This layer uses the non-relational database HBase to store data. HBase runs on top of a distributed file system and is designed for data storage and analysis.

1.2 Management Layer
The proposed architecture implements a master-slave design, with a coordination manager mediating between master and slaves. The master has two main components: Query and Concurrency. Query is an important component in the system: it provides the application layer with massive parallel data queries, and it holds the metadata of the file system and the data locations in the database required for each query. Concurrency is mainly used for scheduling, distributing, and managing the tasks on the slaves by coordinating with the query manager. Each slave also has two parts: Storage and Jobs. Storage is responsible for processing and maintaining the data stored in the distributed file system. Jobs is responsible for concurrency management on the slave side, including instantiating and monitoring individual tasks.
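The Concurrency component's scheduling role can be sketched minimally. The round-robin policy and all names below are assumptions for illustration; in the actual platform, scheduling is delegated to Hadoop's JobTracker rather than hand-written:

```python
from collections import deque

def schedule(tasks, slaves):
    """Assign each queued task to a slave, round-robin style.

    A simplified, hypothetical model of the master handing tasks to idle
    slaves; real scheduling is done by Hadoop's JobTracker.
    """
    queue = deque(tasks)
    assignment = {s: [] for s in slaves}
    while queue:
        for s in slaves:          # hand one task to each slave in turn
            if not queue:
                break
            assignment[s].append(queue.popleft())
    return assignment

plan = schedule(["t1", "t2", "t3", "t4", "t5"], ["slave-a", "slave-b"])
```

Adding a slave to the list spreads the same task queue over more workers, which is the behavior the platform relies on for linear scaling.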
Coordination is responsible for handling and managing the requests and responses in the case of multi-master-slave communication.

1.3 Application Layer
This layer provides users with authorized access to the platform. Users can access services through the network. All of the layer's components run in a Web framework. The layer has four components: User Access, API Interface, Authorization, and Services.

2 Development of the massive medical data storage platform
The main parts of the medical data storage platform are the Storage Layer and the Management Layer, which provide services and data support for the entire platform. These two layers are realized with Hadoop distributed technology, whose three core technologies are HDFS, MapReduce, and HBase. MapReduce is a high-performance computing model of Hadoop: an application is divided into many small fragments of work, each of which can be executed or re-executed on any node in the cluster. HDFS stores data on the compute nodes and provides high aggregate bandwidth across the cluster. On top of HDFS, Hadoop offers an open-source, distributed, column-oriented database named HBase. HBase can be used to build massive storage clusters on low-cost commodity servers [11]. The data analysis framework based on Hadoop is shown in Fig. 2.
Fig. 2 The data analysis framework based on Hadoop

2.1 Hadoop Distributed File System (HDFS)
HDFS is Hadoop's distributed file system, running on commodity hardware. Compared with other distributed file systems, HDFS has high fault tolerance. It improves the throughput of data access and is suitable for large-scale data storage. In HDFS, data are organized into files and directories. Files are divided into uniformly sized blocks that are distributed across the cluster nodes. HDFS has a master/slave architecture. As shown in Fig. 3, an HDFS cluster consists of a NameNode running on the master server and a number of DataNodes running on the slave servers. The NameNode maintains the file namespace (metadata, directory structure, the list of files, the list of blocks for each file, the location of each block, file attributes, authorization, and authentication). DataNodes create, delete, or replicate the actual data blocks based on instructions received from the NameNode, and they periodically report back to the NameNode with the complete list of blocks that they are storing. Only one NameNode runs in the cluster, which simplifies the system architecture.

Fig. 3 The architecture of HDFS

Both the NameNode and the DataNodes are designed to work on commodity hardware running a Linux operating system. However, a common problem with clusters of commodity hardware is the higher probability of hardware failure and thereby data loss. The easiest way to solve this problem is a replication mechanism: the system stores redundant copies so that when a failure occurs, another copy is always available, improving reliability and making the system
fault-tolerant. By default, HDFS stores three copies of each data block. The usual replication strategy in a data center is to place two replicas on machines in the same rack and one replica on a machine in another rack. This strategy limits the data traffic between racks. Meanwhile, to minimize data access latency, HDFS tries to read data from the nearest replica.

2.2 MapReduce
MapReduce is a simple software framework that can process massive data in parallel and with fault tolerance. MapReduce applications can run on large clusters consisting of thousands of machines. The basic idea of the MapReduce framework is divide and conquer: partition a large problem into smaller sub-problems that are independent, so they can be tackled in parallel by different slaves. The MapReduce framework divides the work into two phases, the map phase and the reduce phase, separated by a data transfer between nodes in the cluster. First, the input data set of a job is split into several separate fixed-size pieces called input splits, and Hadoop creates one map task for each split. The map output is a set of records in the form of key-value pairs. The records for any given key, possibly spread across many nodes, are aggregated at the node running the reducer for that key. The reduce stage produces another set of key-value pairs as the final output, based on the user-defined reduce function. The calculation model is shown in Fig. 4.

Fig. 4 The calculation model of MapReduce

Similar to HDFS, the MapReduce framework also uses a master/slave architecture: a JobTracker (master) and a number of TaskTrackers (slaves). The JobTracker is responsible for scheduling and managing the TaskTrackers; for example, map and reduce tasks are assigned to idle TaskTrackers. The JobTracker is also responsible for monitoring the operation of tasks.
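The map, shuffle, and reduce phases described above can be simulated in a few lines of plain Python. This is a toy model of the data flow, not Hadoop code, and the heartbeat-averaging job is a hypothetical example:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, map_fn):
    # Apply the user-defined map function to every input record,
    # producing intermediate key-value pairs.
    return [pair for rec in records for pair in map_fn(rec)]

def shuffle(pairs):
    # Group intermediate values by key, as the framework does
    # between the map and reduce phases.
    pairs = sorted(pairs, key=itemgetter(0))
    return {k: [v for _, v in grp] for k, grp in groupby(pairs, key=itemgetter(0))}

def reduce_phase(grouped, reduce_fn):
    # Apply the user-defined reduce function to each key's value list.
    return {k: reduce_fn(k, vs) for k, vs in grouped.items()}

# Hypothetical job: average heartbeat per patient across readings.
readings = [("p1", 72), ("p2", 80), ("p1", 76), ("p2", 84)]
mapped = map_phase(readings, lambda r: [r])        # identity map
grouped = shuffle(mapped)                          # p1 and p2 grouped
averages = reduce_phase(grouped, lambda k, vs: sum(vs) / len(vs))
```

In real Hadoop the three stages run on different nodes with the shuffle moving data over the network; here they are ordinary function calls, but the key-value contract between the stages is the same.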
TaskTrackers run the tasks and send progress reports to the JobTracker, which keeps a record of the overall progress of each job.

2.3 HBase
HBase is a distributed, column-oriented NoSQL database. It is a highly reliable, high-performance, scalable distributed storage system. An important difference from relational databases is that there is no fixed schema: record A might have 10 attributes while record B has only 7, even though they are in the same table. Data in HBase are organized in
labeled tables. Tables are made of rows and columns. Every column in HBase belongs to a particular column family. Rows in an HBase table are sorted by row key, a byte array that serves as the table's primary key; all table accesses go through the row key. Each data operation also has an associated timestamp, assigned automatically by HBase when the data is inserted. Tab. 1 illustrates the data structure with a simple example. The structure consists of two main column families: Basic Information and Vital Signs. Each record, as defined by HBase, has a unique key and a timestamp marking the creation or update time. The Basic Information family consists of columns such as name, gender, date of birth, and address. The Vital Signs family also consists of many columns, such as blood pressure and heartbeat. Because of the flexibility of the HBase table, we can insert more column families, especially for unstructured data such as graphs and medical images.

Tab. 1 A Simple Example of an HBase Table
Row Key    | Timestamp | Basic Information (Name, Gender, Date of Birth) | Vital Signs (Blood Pressure, Heartbeat)
Patient ID | T8, T5    | Tom, Male, ...                                  | ...

HBase also adopts a master/slave architecture, consisting of three parts: the HMaster, the HRegionServers, and the clients. The HMaster is responsible for assigning regions to slaves and for recovering from HRegionServer failures. An HRegionServer is responsible for serving client read and write requests. The client is responsible for looking up the address information held by the HRegionServers.

3 Experiments and performance evaluation
In order to validate the performance of the proposed massive medical data storage platform based on Hadoop, we implemented the platform. This section describes the hardware components of the platform and designs several testing scenarios as case studies. Fig. 5 shows the deployment model of the massive medical data storage platform.
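Before turning to deployment, the HBase data model just described (schemaless rows, column families, timestamped cell versions, as in Tab. 1) can be made concrete with a toy in-memory analogue. The class and its methods are illustrative Python, not the HBase client API:

```python
import time

class MiniHBaseTable:
    """Toy model of the HBase data model from Tab. 1: rows are addressed
    by row key, cells live under family:qualifier, and every write carries
    a timestamp so reads return the newest version. Not HBase itself."""

    def __init__(self):
        # (row_key, "family:qualifier") -> list of (timestamp, value)
        self._cells = {}

    def put(self, row_key, family, qualifier, value, ts=None):
        # Append a new timestamped version rather than overwriting.
        ts = ts if ts is not None else time.time()
        self._cells.setdefault((row_key, f"{family}:{qualifier}"), []).append((ts, value))

    def get(self, row_key, family, qualifier):
        # Return the newest version of the cell, or None if absent.
        versions = self._cells.get((row_key, f"{family}:{qualifier}"), [])
        return max(versions)[1] if versions else None

table = MiniHBaseTable()
table.put("patient-001", "basic", "name", "Tom", ts=5)
table.put("patient-001", "vitals", "blood_pressure", "120/80", ts=5)
table.put("patient-001", "vitals", "blood_pressure", "118/79", ts=8)  # update
```

Rows may carry different sets of qualifiers, which is the schemaless property noted above; in real HBase, row keys additionally sort lexicographically, so choosing a key that clusters one patient's records together is part of schema design.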
Fig. 5 Deployment model of the massive medical data storage platform

This architecture assigns one node as the master for the management of the Hadoop cluster. On the slave side, the system can add as many slaves as needed, because Hadoop scales linearly. The master runs the NameNode, the HBase Master, and the JobTracker. As for the data nodes, i.e., the
slaves, each runs an HRegionServer and a TaskTracker on the DataNode. The data nodes handle data storage management and job execution efficiently. Thrift was used as the client API for its simplicity and noticeable benefits in communicating with HBase. After successfully establishing a connection with the HBase server, the Restlet Framework is used as the web platform. The master machine's specifications are an Intel(R) Core 2 Quad 2.66 GHz CPU, 8 GB of memory, and a 320 GB hard drive. The slaves have the same specifications except for 160 GB of disk storage. The tests measure the system's query performance and scalability through two experiments.

3.1 Experiments on Data Query
We run the data query operation at a 100-concurrency rate under different data storage levels: 1 thousand, 1 million, 100 million, and 1 billion records. The data query performance is shown in Tab. 2.

Tab. 2 Comparison of Data Query Performance between Hadoop and RDBMS
Data Query    | Time of Hadoop (s): Minimum, Average, Maximum, 90% | Time of RDBMS (s)
Table of 1K   |                                                    |
Table of 1M   |                                                    |
Table of 100M |                                                    |
Table of 1B   |                                                    |

From the table we can see that our Hadoop framework outperforms the RDBMS on data queries at all levels of data storage.

3.2 Experiments on Throughput
The number of requests per second is measured as the number of data nodes increases. The experiments are conducted with 1000 input requests, a 100-concurrency rate, and a 100 Mbps transfer rate. Each time, 5 data nodes (200,000 records) are added to the cluster. As shown in Fig. 6, although the data volume progressively grows, the requests per second remain at approximately 200. The massive medical data storage platform thus maintains its performance level, which confirms the system's scalability.

Fig. 6 Requests per second versus number of machines

4 Conclusion
In this paper, we design and develop a Hadoop-based platform for storing and processing
massive medical data in an effective way. Integrating the Hadoop distributed file system, the MapReduce parallel-computing model, and the HBase database, we implemented the proposed approach, and substantial tests were conducted on an experimental system that includes a large number of commodity hardware devices. Performance testing on the experimental platform shows that the system has good performance, scalability, and fault tolerance. Meanwhile, the proposed model offers a flexible and portable platform for application development and user access. Finally, the promising output across the conducted tests is a good indicator of the usability of the proposed system.

References
[1] Wei Liu, Dedao Gu. Research on construction of smart medical system based on the social security card. EMEIT International Conference on Electronic and Mechanical Engineering and Information Technology, Harbin, P.R. China, ~4700.
[2] Boqiang Liu, Xiaomei Li, Zhongguo Liu, et al. Design and implementation of information exchange between HIS and PACS based on HL7 standard. ITAB International Conference on Information Technology and Application in Biomedicine, Shenzhen, P.R. China, ~555.
[3] Yuwen Shuilim, Yang Xiaoping, Li Huiling. Research on the EMR Storage Model. IFCSTA'09 International Forum on Computer Science-Technology and Application, Chongqing, P.R. China, ~226.
[4] Faro A., Giordano D., Lavasidis I., Spampinato C. A web 2.0 telemedicine system integrating TV-centric services and Personal Health Records. ITAB IEEE International Conference on Information Technology and Applications in Biomedicine, Corfu, Greece, ~4.
[5] Miquel M., Tchounikine A. Software components integration in medical data warehouses: a proposal. CBMS Proceedings of the 15th IEEE Symposium on Computer-Based Medical Systems, Maribor, Slovenia, ~364.
[6] Hongyong Yu, Deshuai Wang.
Research and Implementation of Massive Health Care Data Management and Analysis Based on Hadoop. ICCIS International Conference on Computational and Information Sciences, Chongqing, P.R. China, ~517.
[7] Dalia Sobhy, Yasser El-Sonbaty, Mohamad Abou Elnasr. MedCloud: Healthcare Cloud Computing System. ICITST International Conference for Internet Technology And Secured Transactions, London, ~166.
[8] The Apache Software Foundation. Apache Hadoop HDFS homepage [OL]. [2012].
[9] Lars George. HBase: The Definitive Guide, 1st edition. O'Reilly Media.
[10] J. Han, E. Haihong, G. Le, J. Du. Survey on NoSQL database. ICPCA International Conference on Pervasive Computing and Application, Port Elizabeth, South Africa, ~366.
[11] The Apache Software Foundation. Apache Hadoop HBase homepage [OL]. [2013].
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
More informationCSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing
More informationHadoop: Embracing future hardware
Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More informationWhite Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP
White Paper Big Data and Hadoop Abhishek S, Java COE www.marlabs.com Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP Table of contents Abstract.. 1 Introduction. 2 What is Big
More informationThe Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform
The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, Ko-Chin Chang, and Wei-Tsong Lee Abstract Virtualization platform solutions
More informationGraySort and MinuteSort at Yahoo on Hadoop 0.23
GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters
More informationBig Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
More informationSurvey on Scheduling Algorithm in MapReduce Framework
Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India
More informationTelecom Data processing and analysis based on Hadoop
COMPUTER MODELLING & NEW TECHNOLOGIES 214 18(12B) 658-664 Abstract Telecom Data processing and analysis based on Hadoop Guofan Lu, Qingnian Zhang *, Zhao Chen Wuhan University of Technology, Wuhan 4363,China
More informationR.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5
Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,
More informationPLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
More informationKeywords: Big Data, HDFS, Map Reduce, Hadoop
Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning
More informationDistributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms
Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes
More informationA Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
More informationOpen source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
More informationWelcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop
More informationHadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science
A Seminar report On Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science SUBMITTED TO: www.studymafia.org SUBMITTED BY: www.studymafia.org
More informationJeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
More informationHow To Handle Big Data With A Data Scientist
III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution
More informationOracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>
s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline
More informationAn Approach to Implement Map Reduce with NoSQL Databases
www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 13635-13639 An Approach to Implement Map Reduce with NoSQL Databases Ashutosh
More informationA STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
More informationCan the Elephants Handle the NoSQL Onslaught?
Can the Elephants Handle the NoSQL Onslaught? Avrilia Floratou, Nikhil Teletia David J. DeWitt, Jignesh M. Patel, Donghui Zhang University of Wisconsin-Madison Microsoft Jim Gray Systems Lab Presented
More informationReduction of Data at Namenode in HDFS using harballing Technique
Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu vgkorat@gmail.com swamy.uncis@gmail.com Abstract HDFS stands for the Hadoop Distributed File System.
More informationApache HBase. Crazy dances on the elephant back
Apache HBase Crazy dances on the elephant back Roman Nikitchenko, 16.10.2014 YARN 2 FIRST EVER DATA OS 10.000 nodes computer Recent technology changes are focused on higher scale. Better resource usage
More informationHadoop Distributed File System. Jordan Prosch, Matt Kipps
Hadoop Distributed File System Jordan Prosch, Matt Kipps Outline - Background - Architecture - Comments & Suggestions Background What is HDFS? Part of Apache Hadoop - distributed storage What is Hadoop?
More informationBBM467 Data Intensive ApplicaAons
Hace7epe Üniversitesi Bilgisayar Mühendisliği Bölümü BBM467 Data Intensive ApplicaAons Dr. Fuat Akal akal@hace7epe.edu.tr Problem How do you scale up applicaaons? Run jobs processing 100 s of terabytes
More informationApache Hadoop FileSystem and its Usage in Facebook
Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs
More informationHadoop. http://hadoop.apache.org/ Sunday, November 25, 12
Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using
More informationTHE HADOOP DISTRIBUTED FILE SYSTEM
THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,
More informationApache Hadoop new way for the company to store and analyze big data
Apache Hadoop new way for the company to store and analyze big data Reyna Ulaque Software Engineer Agenda What is Big Data? What is Hadoop? Who uses Hadoop? Hadoop Architecture Hadoop Distributed File
More informationHadoop Scheduler w i t h Deadline Constraint
Hadoop Scheduler w i t h Deadline Constraint Geetha J 1, N UdayBhaskar 2, P ChennaReddy 3,Neha Sniha 4 1,4 Department of Computer Science and Engineering, M S Ramaiah Institute of Technology, Bangalore,
More informationArchitectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
More informationBigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic
BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop
More informationVolume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies
Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image
More informationBenchmarking Hadoop & HBase on Violin
Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages
More informationCS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
More informationCS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
More informationCDH AND BUSINESS CONTINUITY:
WHITE PAPER CDH AND BUSINESS CONTINUITY: An overview of the availability, data protection and disaster recovery features in Hadoop Abstract Using the sophisticated built-in capabilities of CDH for tunable
More informationDeploying Hadoop with Manager
Deploying Hadoop with Manager SUSE Big Data Made Easier Peter Linnell / Sales Engineer plinnell@suse.com Alejandro Bonilla / Sales Engineer abonilla@suse.com 2 Hadoop Core Components 3 Typical Hadoop Distribution
More informationGeoGrid Project and Experiences with Hadoop
GeoGrid Project and Experiences with Hadoop Gong Zhang and Ling Liu Distributed Data Intensive Systems Lab (DiSL) Center for Experimental Computer Systems Research (CERCS) Georgia Institute of Technology
More informationHadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
More informationIntroduction to Hadoop
Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction
More informationThe Performance Characteristics of MapReduce Applications on Scalable Clusters
The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have
More informationDesign and Evolution of the Apache Hadoop File System(HDFS)
Design and Evolution of the Apache Hadoop File System(HDFS) Dhruba Borthakur Engineer@Facebook Committer@Apache HDFS SDC, Sept 19 2011 Outline Introduction Yet another file-system, why? Goals of Hadoop
More informationHadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction
More informationNot Relational Models For The Management of Large Amount of Astronomical Data. Bruno Martino (IASI/CNR), Memmo Federici (IAPS/INAF)
Not Relational Models For The Management of Large Amount of Astronomical Data Bruno Martino (IASI/CNR), Memmo Federici (IAPS/INAF) What is a DBMS A Data Base Management System is a software infrastructure
More informationEfficient Analysis of Big Data Using Map Reduce Framework
Efficient Analysis of Big Data Using Map Reduce Framework Dr. Siddaraju 1, Sowmya C L 2, Rashmi K 3, Rahul M 4 1 Professor & Head of Department of Computer Science & Engineering, 2,3,4 Assistant Professor,
More informationDATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
More informationBENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next
More informationCloud Storage Solution for WSN Based on Internet Innovation Union
Cloud Storage Solution for WSN Based on Internet Innovation Union Tongrang Fan 1, Xuan Zhang 1, Feng Gao 1 1 School of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang,
More informationNetwork Traffic Analysis using HADOOP Architecture. Shan Zeng HEPiX, Beijing 17 Oct 2012
Network Traffic Analysis using HADOOP Architecture Shan Zeng HEPiX, Beijing 17 Oct 2012 Outline Introduction to Hadoop Traffic Information Capture Traffic Information Resolution Traffic Information Storage
More informationhttp://www.wordle.net/
Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely
More informationSuresh Lakavath csir urdip Pune, India lsureshit@gmail.com.
A Big Data Hadoop Architecture for Online Analysis. Suresh Lakavath csir urdip Pune, India lsureshit@gmail.com. Ramlal Naik L Acme Tele Power LTD Haryana, India ramlalnaik@gmail.com. Abstract Big Data
More informationLARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT Samira Daneshyar 1 and Majid Razmjoo 2 1,2 School of Computer Science, Centre of Software Technology and Management (SOFTEM),
More information