A Hadoop-based Platform for Massive Medical Data Storage


A Hadoop-based Platform for Massive Medical Data Storage

WANG Heng* (School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876)

Abstract: With the rapid and healthy development of information technology in healthcare, medical data are emerging in large volumes. However, existing platforms for storing patients' data cannot meet the needs of this ever-increasing volume of medical data. It is therefore important to develop an effective storage platform to manage and store these massive medical data. Cloud computing offers low cost, high scalability, availability and fault tolerance, which addresses several of the problems faced in storing and analyzing patients' medical data. Based on distributed computing technology, this paper proposes a novel approach for massive medical data storage and management: a Hadoop-based storage platform built with Linux cluster technology. Extensive experimental results demonstrate the platform's efficiency, reliability, high scalability and fault tolerance.

Key words: mass data storage; medical data; Hadoop; HBase; distributed computing

Introduction

We are giving more and more attention to disease prevention, inspection and diagnosis. As a result, more relevant data [1] are collected for disease tracking and other purposes. Previous medical information platforms were mainly based on the Hospital Information System [2]. Such platforms collect all kinds of patient information, and the data are managed by hospitals. Similar systems are the EMR (Electronic Medical Record) [3] and the PHR (Personal Health Record) [4], which store patients' medical data. With the development of the Internet and social networks, these electronic medical records can be shared with doctors in other hospitals or healthcare organizations. However, the health care data held in EMR or PHR systems are growing ever larger.
These data are generated at different hospitals or organizations, and may be heterogeneously distributed across places and data formats. Currently, for example, many health information systems and platforms store and manage these distributed data with an RDBMS [5]. However, with the geometric growth of both structured and unstructured data, traditional relational database solutions have limitations in storing and managing such large and heterogeneous medical data. How to store and manage a large amount of medical data therefore emerges as an important issue.

Traditional large-scale data processing mostly relies on distributed high-performance computing and grid computing, which requires expensive computing resources and tedious programming for large-scale data partitioning and the allocation of computing tasks. The Hadoop framework, in contrast, provides a solution to the problem of massive data storage [6,7]. Hadoop is an open-source distributed computing framework. It can build highly reliable and scalable parallel distributed systems by running applications on a cluster composed of a large number of commodity hardware devices. The Hadoop Distributed File System (HDFS) [8], the MapReduce programming model and the HBase distributed database [9] are its three core technologies. The high performance of Hadoop-based systems motivates us to design a novel data storage system for processing massive medical data in a cost-effective way. Our experimental results also show the high scalability and robust fault tolerance of the Hadoop-based system.

Brief author introduction: WANG Heng (1989-), male, Master's candidate, wireless eHealth. @163.com

The remainder of this paper is organized as follows. Section 1 describes the architecture of the proposed massive medical data storage platform. Section 2 presents the functions of each system component and describes how the platform is deployed and implemented. Section 3 evaluates the system's performance and verifies the efficiency of the proposed platform. Finally, Section 4 concludes this paper.

1 Framework of the massive medical data storage platform

Medical data are massive, which imposes many challenges for effective data storage and processing. We propose a platform to accommodate massive medical data that can be large scale, geographically distributed and heterogeneous. Fig. 1 shows the framework of our proposed platform. In this framework, the user is located at the top level. Below the user, the platform consists of three parts: the storage layer, the management layer and the application layer.

Fig. 1 The architecture of the massive medical data storage platform (User Access, Authorization, API Interface and Services in the application layer; Query and Concurrency on the master, Storage and Jobs on the slaves, and Coordination between them in the management layer; Storage Device, Data Access Interface and Storage Monitor in the storage layer)

At the top of the framework, a user can be a medical staff member, a patient, a system manager, or another person using the system. Medical staff access patients' physiological data, analyze patients' information, and give diagnoses through the platform.
Patients can retrieve their own records from the platform for various purposes, such as tracking and preventing disease. The data manager administrates the data, for example by specifying data access privileges, and can share the data according to certain policies. There may be many users accessing the data simultaneously, so the system should be application-agnostic: different applications can access it concurrently through the web portal.
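As a purely hypothetical illustration of such role-dependent access rules (the role names and permission sets below are our assumptions, not the platform's actual policy), a minimal policy check might look like:

```python
# Illustrative role-to-permission mapping; the roles and actions here are
# assumptions for the sketch, not the platform's real authorization policy.
PERMISSIONS = {
    "medical_staff": {"read_records", "write_diagnosis"},
    "patient":       {"read_own_records"},
    "data_manager":  {"read_records", "set_privileges", "share_data"},
}

def can_access(role, action):
    """Return True if the given role is allowed to perform the action."""
    return action in PERMISSIONS.get(role, set())

print(can_access("patient", "read_own_records"))  # -> True
print(can_access("patient", "write_diagnosis"))   # -> False
```

In a deployed system the same check would sit behind the Authorization component of the application layer, consulted on every portal request.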

1.1 Storage Layer

As shown in Fig. 1, the storage layer contains a storage device for storing medical information. The storage device runs on clusters of commodity hardware and continues working even if a node fails, which is key to keeping the service available at all times. The data access interface provides data source access to services; for example, it can provide different database access services. Medical data such as electronic medical records and medical images require terabytes or petabytes of storage. SQL-like databases cannot provide such capacity and are thus unable to handle massive storage needs. This motivated us to adopt horizontally scalable, distributed non-relational data stores, known as NoSQL databases. With high concurrent read and write throughput, support for mass storage, easy expansion and low cost, NoSQL is well suited to the current requirements [10]. This layer uses the non-relational database HBase to store data. HBase runs on top of a distributed file system and is designed for data storage and analysis.

1.2 Management Layer

The proposed architecture follows a master-slave design, with a coordination manager mediating between master and slaves. The master has two main components: Query and Concurrency. Query is an important component of the system: it provides the application layer with massive parallel data queries, and it holds the metadata of the file system and the data locations in the database required for each query. Concurrency schedules, distributes and manages the tasks on the slaves by coordinating with the query manager. Each slave also has two parts: Storage and Jobs. Storage processes and maintains the data stored in the distributed file system. Jobs handles concurrency management on the slave, including instantiating and monitoring individual tasks.
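The schema-flexible, column-family data model that HBase provides to the storage layer can be sketched in a few lines. This is an in-memory illustration of the model only, not HBase's actual implementation, and the class and method names are ours:

```python
from collections import defaultdict
import itertools

class ColumnFamilyStore:
    """Toy HBase-style store: rows keyed by row key, cells addressed by
    (column family, qualifier), with automatically assigned timestamps."""

    def __init__(self):
        # row key -> (family, qualifier) -> list of (timestamp, value), newest first
        self.rows = defaultdict(lambda: defaultdict(list))
        self._clock = itertools.count(1)  # stand-in for wall-clock timestamps

    def put(self, row_key, family, qualifier, value):
        ts = next(self._clock)  # HBase assigns the timestamp automatically
        self.rows[row_key][(family, qualifier)].insert(0, (ts, value))
        return ts

    def get(self, row_key, family, qualifier):
        cells = self.rows[row_key].get((family, qualifier))
        return cells[0][1] if cells else None  # newest version wins

store = ColumnFamilyStore()
# Two patients with different column sets in the same table -- no fixed schema.
store.put("patient-001", "basic", "name", "Tom")
store.put("patient-001", "vitals", "blood_pressure", "120/80")
store.put("patient-002", "basic", "name", "Ann")
store.put("patient-001", "vitals", "blood_pressure", "130/85")  # newer version
print(store.get("patient-001", "vitals", "blood_pressure"))  # -> 130/85
```

Note how the second patient simply omits the vitals columns: in a relational table this would be a NULL-filled row, while here the cells never exist at all.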
Coordination handles and manages the requests and responses in multi master-slave communication.

1.3 Application Layer

This layer provides access to users together with user authorization. Users can access services over the network. All available components of this layer run in a Web framework. The layer has four components: User Access, API Interface, Authorization and Services.

2 Development of the massive medical data storage platform

The main parts of the medical data storage platform are the storage layer and the management layer, which provide services and data support for the entire platform. The main technology used to realize these two layers is Hadoop's distributed technology, whose three core components are HDFS, MapReduce and HBase. MapReduce is a high-performance computing model of Hadoop: an application is divided into many small fragments of work, each of which can be executed or re-executed on any node in the cluster. HDFS stores data on the compute nodes and provides high aggregate bandwidth across the cluster. On top of HDFS, Hadoop offers an open-source, distributed, column-oriented database named HBase, which can be used to build massive storage clusters from low-cost servers [11]. The data analysis framework based on Hadoop is shown in Fig. 2.

Fig. 2 The data analysis framework based on Hadoop (original data is split into data blocks, extracted into HDFS and HBase, and processed by MapReduce to produce results)

2.1 Hadoop Distributed File System (HDFS)

HDFS is Hadoop's distributed file system, running on commodity hardware. Compared with other distributed file systems, HDFS has high fault tolerance; it improves the throughput of data access and is suitable for large-scale data storage. In HDFS, data is organized into files and directories. Files are divided into uniformly sized blocks that are distributed across cluster nodes. HDFS has a master/slave architecture. As shown in Fig. 3, an HDFS cluster consists of a NameNode running on the master server and a number of DataNodes running on the slave servers. The NameNode maintains the file namespace (metadata, directory structure, the list of files, the list of blocks for each file, the location of each block, file attributes, authorization and authentication). DataNodes create, delete or replicate the actual data blocks according to instructions received from the NameNode, and they periodically report back to the NameNode with the complete list of blocks they are storing. Only one NameNode runs in the cluster, which simplifies the system architecture.

Fig. 3 The architecture of HDFS (clients issue metadata operations to the NameNode and read and write blocks directly on the DataNodes, which replicate blocks across racks)

Both the NameNode and the DataNodes are designed to run on commodity hardware with a Linux operating system. A common problem with clusters of commodity hardware, however, is a higher probability of hardware failure and hence of data loss. The simplest remedy is a replication mechanism: the system stores redundant copies so that when a failure occurs there is always another copy available, improving reliability and making the system fault-tolerant. By default, HDFS stores three copies of each data block. The usual replication strategy for a data center is to place two replicas on machines in the same rack and one replica on a machine in another rack. This strategy limits the data traffic between racks. Meanwhile, to minimize data access latency, HDFS tries to read data from the nearest replica.

2.2 MapReduce

MapReduce is a simple software framework that can process massive data in parallel and with fault tolerance. MapReduce applications can run on large clusters consisting of thousands of machines. The basic idea of the MapReduce framework is divide and conquer: partition a large problem into smaller, independent sub-problems that can be tackled in parallel by different slaves. The framework divides the work into two phases, the map phase and the reduce phase, separated by a data transfer between nodes in the cluster. First, the input data set of a job is split into several separate fixed-size pieces called input splits, and Hadoop creates one map task for each split. Map output is a set of records in the form of key-value pairs. The records for any given key, possibly spread across many nodes, are aggregated at the node running the reducer for that key. The reduce stage produces another set of key-value pairs as final output, based on a user-defined reduce function. The calculation model is shown in Fig. 4.

Fig. 4 The calculation model of MapReduce (intermediate pairs such as k1:v1, k2:v1, ..., k5:v1 are grouped by key into k1:v1,v2,v3, k2:v1,v2,v3, ..., k5:v1 before the reduce phase produces the output)

Similar to HDFS, the MapReduce framework also uses a master/slave architecture: a JobTracker (master) and a number of TaskTrackers (slaves). The JobTracker schedules and manages the TaskTrackers; for example, map and reduce tasks are assigned to idle TaskTrackers. The JobTracker is also responsible for monitoring the execution of tasks.
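The map, group-by-key and reduce phases illustrated in Fig. 4 can be sketched as a single-process analogue. This is an illustration of the data flow only, not Hadoop's distributed implementation; the helper names and the toy diagnosis-counting job are ours:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: each input record yields intermediate (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))
    # Shuffle phase: values for the same key are grouped together, as if
    # aggregated at the node running the reducer for that key.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)
    # Reduce phase: a user-defined function turns each group into final output.
    return {key: reduce_fn(key, values) for key, values in grouped.items()}

# Toy job: count diagnosis codes across patient records.
records = [("p1", ["I10", "E11"]), ("p2", ["I10"]), ("p3", ["E11", "I10"])]
counts = run_mapreduce(
    records,
    map_fn=lambda rec: [(code, 1) for code in rec[1]],
    reduce_fn=lambda key, values: sum(values),
)
print(counts)  # -> {'I10': 3, 'E11': 2}
```

In Hadoop, the map calls run as map tasks on the input splits, the grouping happens during the shuffle between nodes, and each reduce call runs as a reduce task on one key's aggregated values.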
TaskTrackers run the tasks and send progress reports to the JobTracker, which keeps a record of the overall progress of each job.

2.3 HBase

HBase is a distributed, column-oriented NoSQL database: a highly reliable, high-performance, scalable distributed storage system. An important difference from relational databases is that there is no fixed schema: a record A might have 10 attributes while a record B in the same table has only 7. Data in HBase are organized in labeled tables. Tables are made of rows and columns, and every column in HBase belongs to a particular column family. Rows in an HBase table are sorted by the row key, a byte array that serves as the table's primary key; all table accesses go through the row key. Each data operation also carries a timestamp, assigned automatically by HBase on insertion. Tab. 1 illustrates the data structure with a simple example. The structure has two main column families: Basic Information and Vital Signs. Each record, as defined by HBase, has a unique key and a timestamp giving the creation or update time. The Basic Information family consists of three or more columns: name, gender, date of birth, address and so on. The Vital Signs family likewise consists of many columns: blood pressure, heartbeat and so on. Because of the flexibility of HBase tables, we can insert more column families, especially for unstructured data such as graphs and medical images.

Tab. 1 A simple example of an HBase table (a row keyed by Patient ID, with timestamps such as T5 and T8, Basic Information columns for name, gender and date of birth, and Vital Signs columns for blood pressure and heart rate; e.g. name "Tom" and gender "Male" at one timestamp)

HBase also adopts a master/slave architecture, consisting of three parts: the HMaster, the HRegionServers and the client. The HMaster assigns regions to slaves and recovers from HRegionServer failures. An HRegionServer manages client read and write requests. The client looks up the address information of the responsible HRegionServer.

3 Experiments and performance evaluation

To validate the performance of the proposed Hadoop-based massive medical data storage platform, we implemented the platform. This section describes the hardware components of the platform and designs several testing scenarios as case studies. Fig. 5 shows the deployment model of the massive medical data storage platform.
Fig. 5 Deployment model of the massive medical data storage platform (a test client and application server connect over a 100 Mbps LAN to a master running the NameNode, HBase Master and JobTracker, and to data nodes each running a DataNode, HRegionServer and TaskTracker)

This architecture assigns one node as the master for managing the Hadoop cluster. On the slave side, the system can add as many slaves as needed, because Hadoop scales linearly. The master hosts the NameNode, the HBase Master and the JobTracker. Each data node, i.e. slave, runs an HRegionServer alongside the DataNode and the TaskTracker; the data nodes handle data storage management and job execution. Thrift was used as the client API for its simplicity and its noticeable benefits in communicating with HBase. After a connection with the HBase server is established, the Restlet framework serves as the web platform. The machines' specifications are an Intel(R) Core 2 Quad 2.66 GHz CPU, 8 GB of memory and a 320 GB hard drive; the slaves have the same specifications except for a 160 GB disk. The tests measure the system's query performance and scalability in two experiments.

3.1 Experiments on Data Query

We run the data query operation at a concurrency rate of 100 under different data storage levels: 1 thousand, 1 million, 100 million and 1 billion records. The performance of the data query is shown in Tab. 2.

Tab. 2 Comparison of data query performance between Hadoop and RDBMS (minimum, average, maximum and 90th-percentile query times in seconds, for tables of 1K, 1M, 100M and 1B records)

From the table we can see that our Hadoop framework outperforms the RDBMS on data query at all levels of data storage.

3.2 Experiments on Throughput

The number of requests per second is measured as the number of data nodes increases. The experiments are conducted with 1000 input requests, a concurrency rate of 100 and a 100 Mbps transfer rate. Each time, 5 data nodes (200,000 records) are added to the cluster. As shown in Fig. 6, although the data volume grows progressively, the throughput stays at approximately 200 requests per second. The massive medical data storage platform thus maintains its performance level, which confirms the system's scalability.

Fig. 6 Requests per second versus number of machines

4 Conclusion

In this paper, we design and develop a Hadoop-based platform for storing and processing massive medical data in an effective way. Integrating the Hadoop distributed file system, the MapReduce parallel-computing model and the HBase database, we implemented the proposed approach, and substantial tests were conducted on an experimental system comprising a large number of commodity hardware devices. Performance testing on the experimental platform shows that the system has good performance, scalability and fault tolerance. The proposed model also offers a flexible and portable platform for application development and user access. Finally, the promising results across the conducted tests are a good indicator of the usability of the proposed system.

References

[1] Wei Liu and Dedao Gu. Research on construction of smart medical system based on the social security card [A]. EMEIT: International Conference on Electronic and Mechanical Engineering and Information Technology [C]. Harbin, P.R. China.
[2] Boqiang Liu, Xiaomei Li, Zhongguo Liu, et al. Design and implementation of information exchange between HIS and PACS based on HL7 standard [A]. ITAB: International Conference on Information Technology and Application in Biomedicine [C]. Shenzhen, P.R. China.
[3] Yuwen Shuilim, Yang Xiaoping, Li Huiling. Research on the EMR storage model [A]. IFCSTA'09: International Forum on Computer Science-Technology and Application [C]. Chongqing, P.R. China.
[4] Faro, A., Giordano, D., Lavasidis, I., Spampinato, C. A web 2.0 telemedicine system integrating TV-centric services and Personal Health Records [A]. ITAB: IEEE International Conference on Information Technology and Applications in Biomedicine [C]. Corfu, Greece.
[5] Miquel, M., Tchounikine, A. Software components integration in medical data warehouses: a proposal [A]. CBMS: Proceedings of the 15th IEEE Symposium on Computer-Based Medical Systems [C]. Maribor, Slovenia.
[6] Hongyong Yu, Deshuai Wang. Research and implementation of massive health care data management and analysis based on Hadoop [A]. ICCIS: International Conference on Computational and Information Sciences [C]. Chongqing, P.R. China.
[7] Dalia Sobhy, Yasser El-Sonbaty, Mohamad Abou Elnasr. MedCloud: healthcare cloud computing system [A]. ICITST: International Conference for Internet Technology and Secured Transactions [C]. London.
[8] The Apache Software Foundation. Apache Hadoop HDFS homepage [OL]. [2012].
[9] Lars George. HBase: The Definitive Guide, 1st edition [M]. O'Reilly Media.
[10] J. Han, E. Haihong, G. Le, and J. Du. Survey on NoSQL database [A]. ICPCA: International Conference on Pervasive Computing and Applications [C]. Port Elizabeth, South Africa.
[11] The Apache Software Foundation. Apache Hadoop HBase homepage [OL]. [2013].


Big Application Execution on Cloud using Hadoop Distributed File System Big Application Execution on Cloud using Hadoop Distributed File System Ashkan Vates*, Upendra, Muwafaq Rahi Ali RPIIT Campus, Bastara Karnal, Haryana, India ---------------------------------------------------------------------***---------------------------------------------------------------------

More information

Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

More information

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Prepared By : Manoj Kumar Joshi & Vikas Sawhney Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks

More information

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information

Processing of Hadoop using Highly Available NameNode

Processing of Hadoop using Highly Available NameNode Processing of Hadoop using Highly Available NameNode 1 Akash Deshpande, 2 Shrikant Badwaik, 3 Sailee Nalawade, 4 Anjali Bote, 5 Prof. S. P. Kosbatwar Department of computer Engineering Smt. Kashibai Navale

More information

Big Data Analytics: Hadoop-Map Reduce & NoSQL Databases

Big Data Analytics: Hadoop-Map Reduce & NoSQL Databases Big Data Analytics: Hadoop-Map Reduce & NoSQL Databases Abinav Pothuganti Computer Science and Engineering, CBIT,Hyderabad, Telangana, India Abstract Today, we are surrounded by data like oxygen. The exponential

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

Criteria to Compare Cloud Computing with Current Database Technology

Criteria to Compare Cloud Computing with Current Database Technology Criteria to Compare Cloud Computing with Current Database Technology Jean-Daniel Cryans, Alain April, and Alain Abran École de Technologie Supérieure, 1100 rue Notre-Dame Ouest Montréal, Québec, Canada

More information

Research Article Hadoop-Based Distributed Sensor Node Management System

Research Article Hadoop-Based Distributed Sensor Node Management System Distributed Networks, Article ID 61868, 7 pages http://dx.doi.org/1.1155/214/61868 Research Article Hadoop-Based Distributed Node Management System In-Yong Jung, Ki-Hyun Kim, Byong-John Han, and Chang-Sung

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Hadoop: Embracing future hardware

Hadoop: Embracing future hardware Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP White Paper Big Data and Hadoop Abhishek S, Java COE www.marlabs.com Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP Table of contents Abstract.. 1 Introduction. 2 What is Big

More information

The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform

The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, Ko-Chin Chang, and Wei-Tsong Lee Abstract Virtualization platform solutions

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Survey on Scheduling Algorithm in MapReduce Framework

Survey on Scheduling Algorithm in MapReduce Framework Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India

More information

Telecom Data processing and analysis based on Hadoop

Telecom Data processing and analysis based on Hadoop COMPUTER MODELLING & NEW TECHNOLOGIES 214 18(12B) 658-664 Abstract Telecom Data processing and analysis based on Hadoop Guofan Lu, Qingnian Zhang *, Zhao Chen Wuhan University of Technology, Wuhan 4363,China

More information

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Keywords: Big Data, HDFS, Map Reduce, Hadoop Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

A Brief Outline on Bigdata Hadoop

A Brief Outline on Bigdata Hadoop A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop

More information

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science A Seminar report On Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science SUBMITTED TO: www.studymafia.org SUBMITTED BY: www.studymafia.org

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

How To Handle Big Data With A Data Scientist

How To Handle Big Data With A Data Scientist III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution

More information

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here> s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline

More information

An Approach to Implement Map Reduce with NoSQL Databases

An Approach to Implement Map Reduce with NoSQL Databases www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 13635-13639 An Approach to Implement Map Reduce with NoSQL Databases Ashutosh

More information

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)

More information

Can the Elephants Handle the NoSQL Onslaught?

Can the Elephants Handle the NoSQL Onslaught? Can the Elephants Handle the NoSQL Onslaught? Avrilia Floratou, Nikhil Teletia David J. DeWitt, Jignesh M. Patel, Donghui Zhang University of Wisconsin-Madison Microsoft Jim Gray Systems Lab Presented

More information

Reduction of Data at Namenode in HDFS using harballing Technique

Reduction of Data at Namenode in HDFS using harballing Technique Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu vgkorat@gmail.com swamy.uncis@gmail.com Abstract HDFS stands for the Hadoop Distributed File System.

More information

Apache HBase. Crazy dances on the elephant back

Apache HBase. Crazy dances on the elephant back Apache HBase Crazy dances on the elephant back Roman Nikitchenko, 16.10.2014 YARN 2 FIRST EVER DATA OS 10.000 nodes computer Recent technology changes are focused on higher scale. Better resource usage

More information

Hadoop Distributed File System. Jordan Prosch, Matt Kipps

Hadoop Distributed File System. Jordan Prosch, Matt Kipps Hadoop Distributed File System Jordan Prosch, Matt Kipps Outline - Background - Architecture - Comments & Suggestions Background What is HDFS? Part of Apache Hadoop - distributed storage What is Hadoop?

More information

BBM467 Data Intensive ApplicaAons

BBM467 Data Intensive ApplicaAons Hace7epe Üniversitesi Bilgisayar Mühendisliği Bölümü BBM467 Data Intensive ApplicaAons Dr. Fuat Akal akal@hace7epe.edu.tr Problem How do you scale up applicaaons? Run jobs processing 100 s of terabytes

More information

Apache Hadoop FileSystem and its Usage in Facebook

Apache Hadoop FileSystem and its Usage in Facebook Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs

More information

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12 Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using

More information

THE HADOOP DISTRIBUTED FILE SYSTEM

THE HADOOP DISTRIBUTED FILE SYSTEM THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,

More information

Apache Hadoop new way for the company to store and analyze big data

Apache Hadoop new way for the company to store and analyze big data Apache Hadoop new way for the company to store and analyze big data Reyna Ulaque Software Engineer Agenda What is Big Data? What is Hadoop? Who uses Hadoop? Hadoop Architecture Hadoop Distributed File

More information

Hadoop Scheduler w i t h Deadline Constraint

Hadoop Scheduler w i t h Deadline Constraint Hadoop Scheduler w i t h Deadline Constraint Geetha J 1, N UdayBhaskar 2, P ChennaReddy 3,Neha Sniha 4 1,4 Department of Computer Science and Engineering, M S Ramaiah Institute of Technology, Bangalore,

More information

Architectures for Big Data Analytics A database perspective

Architectures for Big Data Analytics A database perspective Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

CDH AND BUSINESS CONTINUITY:

CDH AND BUSINESS CONTINUITY: WHITE PAPER CDH AND BUSINESS CONTINUITY: An overview of the availability, data protection and disaster recovery features in Hadoop Abstract Using the sophisticated built-in capabilities of CDH for tunable

More information

Deploying Hadoop with Manager

Deploying Hadoop with Manager Deploying Hadoop with Manager SUSE Big Data Made Easier Peter Linnell / Sales Engineer plinnell@suse.com Alejandro Bonilla / Sales Engineer abonilla@suse.com 2 Hadoop Core Components 3 Typical Hadoop Distribution

More information

GeoGrid Project and Experiences with Hadoop

GeoGrid Project and Experiences with Hadoop GeoGrid Project and Experiences with Hadoop Gong Zhang and Ling Liu Distributed Data Intensive Systems Lab (DiSL) Center for Experimental Computer Systems Research (CERCS) Georgia Institute of Technology

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

Design and Evolution of the Apache Hadoop File System(HDFS)

Design and Evolution of the Apache Hadoop File System(HDFS) Design and Evolution of the Apache Hadoop File System(HDFS) Dhruba Borthakur Engineer@Facebook Committer@Apache HDFS SDC, Sept 19 2011 Outline Introduction Yet another file-system, why? Goals of Hadoop

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction

More information

Not Relational Models For The Management of Large Amount of Astronomical Data. Bruno Martino (IASI/CNR), Memmo Federici (IAPS/INAF)

Not Relational Models For The Management of Large Amount of Astronomical Data. Bruno Martino (IASI/CNR), Memmo Federici (IAPS/INAF) Not Relational Models For The Management of Large Amount of Astronomical Data Bruno Martino (IASI/CNR), Memmo Federici (IAPS/INAF) What is a DBMS A Data Base Management System is a software infrastructure

More information

Efficient Analysis of Big Data Using Map Reduce Framework

Efficient Analysis of Big Data Using Map Reduce Framework Efficient Analysis of Big Data Using Map Reduce Framework Dr. Siddaraju 1, Sowmya C L 2, Rashmi K 3, Rahul M 4 1 Professor & Head of Department of Computer Science & Engineering, 2,3,4 Assistant Professor,

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

Cloud Storage Solution for WSN Based on Internet Innovation Union

Cloud Storage Solution for WSN Based on Internet Innovation Union Cloud Storage Solution for WSN Based on Internet Innovation Union Tongrang Fan 1, Xuan Zhang 1, Feng Gao 1 1 School of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang,

More information

Network Traffic Analysis using HADOOP Architecture. Shan Zeng HEPiX, Beijing 17 Oct 2012

Network Traffic Analysis using HADOOP Architecture. Shan Zeng HEPiX, Beijing 17 Oct 2012 Network Traffic Analysis using HADOOP Architecture Shan Zeng HEPiX, Beijing 17 Oct 2012 Outline Introduction to Hadoop Traffic Information Capture Traffic Information Resolution Traffic Information Storage

More information

http://www.wordle.net/

http://www.wordle.net/ Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely

More information

Suresh Lakavath csir urdip Pune, India lsureshit@gmail.com.

Suresh Lakavath csir urdip Pune, India lsureshit@gmail.com. A Big Data Hadoop Architecture for Online Analysis. Suresh Lakavath csir urdip Pune, India lsureshit@gmail.com. Ramlal Naik L Acme Tele Power LTD Haryana, India ramlalnaik@gmail.com. Abstract Big Data

More information

LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT

LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT Samira Daneshyar 1 and Majid Razmjoo 2 1,2 School of Computer Science, Centre of Software Technology and Management (SOFTEM),

More information