A token-based authentication security scheme for Hadoop distributed file system using elliptic curve cryptography
J Comput Virol Hack Tech (2015) 11: DOI /s

ORIGINAL PAPER

A token-based authentication security scheme for Hadoop distributed file system using elliptic curve cryptography

Yoon-Su Jeong · Yong-Tae Kim

Received: 2 September 2014 / Accepted: 29 December 2014 / Published online: 27 February 2015
The Author(s). This article is published with open access at Springerlink.com

Abstract In recent years, a number of platforms for building Big Data applications, both open-source and proprietary, have been proposed. One of the most popular platforms is Apache Hadoop, an open-source software framework for Big Data processing used by leading companies like Yahoo and Facebook. Historically, earlier versions of Hadoop did not prioritize security, so Hadoop has continued to make security modifications. In particular, the Hadoop Distributed File System (HDFS) upon which Hadoop modules are built did not provide robust security for user authentication. This paper proposes a token-based authentication scheme that protects sensitive data stored in HDFS against replay and impersonation attacks. The proposed scheme allows HDFS clients to be authenticated by the datanode via the block access token. Unlike most HDFS authentication protocols, which adopt public key exchange approaches, the proposed scheme uses a hash chain of keys. The proposed scheme achieves performance (communication power, computing power and area efficiency) as good as that of existing HDFS systems.

Special Issue: Convergence Security Systems. This work was supported by the Security Engineering Research Center, granted by the Korea Ministry of Knowledge Economy.

Y.-S. Jeong: Division of Information and Communication Convergence Engineering, Mokwon University, 88 Doanbuk-ro, Seo-gu, Daejeon, Korea. [email protected]
Y.-T. Kim (B): Division of Multimedia Engineering, Hannam University, 133 Ojeong-dong, Daedeok-gu, Daejeon, Korea. [email protected]

1 Introduction

With the growth of social networks and smart devices, the use of Big Data has increased dramatically over the past few years. Big Data requires new technologies that efficiently collect and analyze enormous volumes of real-world data over the network [1-3]. In particular, open-source platforms for scalable and distributed processing of data are being actively studied in Cloud Computing, in which dynamically scalable and often virtualized IT resources are provided as a service over the Internet [4]. Apache's Hadoop is an open-source software framework for Big Data processing used by many of the world's largest online media companies, including Yahoo, Facebook and Twitter. Hadoop allows for the distributed processing of large data sets across clusters of computers [5,6]. HDFS, Hadoop's distributed file system, is designed to scale up from single servers to thousands of machines, each offering local computation and storage. MapReduce, another Hadoop component, is a scalable, parallel processing framework that works in tandem with HDFS. Hadoop enables users to store and process huge amounts of data at very low cost, but it lacks security measures to perform sufficient authentication and authorization of both users and services [7,8]. Traditional data processing and management systems such as Data Warehouses and Relational Database Management Systems are designed to process structured data, so they are not effective in handling the large-scale, unstructured data included in Big Data [9,10]. Initially, secure deployment and use of Hadoop were not a concern [11-13]. The data in Hadoop was not sensitive and access to the cluster could be sufficiently limited. As Hadoop matured, more data and a greater variety of data, including sensitive enterprise and personal information, were moved to HDFS. However, HDFS's
lax authentication allows any user to impersonate any other user or cluster service. For additional authentication security, Hadoop provides Kerberos, which uses symmetric key operations [4,6,14]. If all of the components of the Hadoop system (i.e., namenodes, datanodes, and Hadoop clients) are authenticated using Kerberos, the Kerberos key distribution center (KDC) might become a bottleneck. To reduce Kerberos traffic (and thereby put less load on the Kerberos infrastructure), Hadoop relies on delegation tokens. Delegation token approaches use symmetric encryption, and the shared keys may be distributed to hundreds or even thousands of hosts depending upon the token type [15,16]. This leaves Hadoop communication vulnerable to eavesdropping and modification, thus making replay and impersonation attacks more likely. For example, an attacker can gain access to a data block by eavesdropping on the block access token [17]. Hadoop security controls require the namenode and the datanode to share a private key to use the block access token. If the shared key is disclosed to an attacker, the data on all datanodes is vulnerable [11,18]. This paper proposes a token-based authentication scheme that protects sensitive HDFS data against replay and impersonation attacks. The proposed scheme utilizes a hash chain of authentication keys, rather than the public key-based authentication key exchanges commonly found in existing HDFS systems. The proposed scheme allows clients to be authenticated to the datanode via the block access token. In terms of performance, the proposed scheme achieves communication power, computing power and area efficiency similar to those of existing HDFS systems. The remainder of this paper is organized as follows. Section 2 describes Hadoop's two primary components: HDFS and MapReduce. Section 3 presents the proposed authentication scheme for HDFS. In Sect. 4, the proposed scheme is evaluated in terms of security and performance.
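To illustrate why disclosure of the shared key is so damaging, the conventional block access token can be sketched as a MAC over the token fields under a cluster-wide secret. This is a minimal sketch only: the field layout, function names and use of SHA-256 are illustrative assumptions, not Hadoop's actual token implementation.

```python
import hashlib
import hmac

def make_block_token(shared_key: bytes, block_id: str, user: str, expiry: int) -> str:
    # The namenode MACs the token fields with a key it shares with every
    # datanode; anyone holding that key can mint tokens (illustrative layout).
    msg = f"{block_id}|{user}|{expiry}".encode()
    return hmac.new(shared_key, msg, hashlib.sha256).hexdigest()

def datanode_verifies(shared_key: bytes, block_id: str, user: str,
                      expiry: int, token: str) -> bool:
    expected = make_block_token(shared_key, block_id, user, expiry)
    return hmac.compare_digest(expected, token)

key = b"cluster-wide secret"
token = make_block_token(key, "blk_1073741825", "alice", 1700000000)
assert datanode_verifies(key, "blk_1073741825", "alice", 1700000000, token)

# The weakness described above: any party that learns the shared key can
# forge a valid token for any user and any block.
forged = make_block_token(key, "blk_1073741825", "mallory", 1700000000)
assert datanode_verifies(key, "blk_1073741825", "mallory", 1700000000, forged)
```

The datanode has no way to tell the forged token from a legitimate one, since both are valid MACs under the same shared key.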
Finally, conclusions are given in Sect. 5.

2 Related work

Hadoop is an open-source platform that supports the processing of Big Data in a distributed computing environment [1,2]. Hadoop is made up of stand-alone modules such as a distributed file system called HDFS, a library for parallel processing of large distributed datasets called MapReduce, and so on. HDFS and MapReduce were inspired by technologies created inside Google: the Google File System (GFS) and Google MapReduce, respectively [5,7].

2.1 HDFS

HDFS is a distributed, scalable file system that stores large files across multiple machines. Files are stored on HDFS in blocks which are typically 64 MB in size. That is, an HDFS file is chopped up into 64 MB chunks, and data that is smaller than the block size is stored as it is, without occupying the full block space. Figure 1 shows a typical architecture of an HDFS cluster, consisting of a namenode, a secondary namenode and multiple datanodes [6,14]. HDFS stores its metadata (i.e., the information about file names, directories, permissions, etc.) on the namenode. HDFS is based on a master-slave architecture, so a single master node (namenode) maintains all the files in the file system. The namenode manages the HDFS namespace and regulates access to the files by clients. It distributes data blocks to datanodes and stores this mapping information. In addition, the namenode keeps track of which blocks need to be replicated and initiates replication whenever necessary. Multiple copies of each file provide data protection and computational performance. The datanodes are responsible for storing application data and serve read/write requests from clients. The datanodes also perform block creation, deletion and replication upon instruction from the namenode. Datanodes periodically send Heartbeats and Blockreports to the namenode. Receipt of a Heartbeat implies that the datanode is functioning properly.
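The block layout just described can be sketched as follows; the helper function is illustrative, not part of HDFS, and 64 MB is the default block size cited in the text.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # default HDFS block size cited above

def split_into_blocks(file_size: int) -> list[int]:
    # A file is chopped into full 64 MB chunks; the final chunk stores only
    # the remainder, so data smaller than a block does not occupy a full block.
    full, rest = divmod(file_size, BLOCK_SIZE)
    return [BLOCK_SIZE] * full + ([rest] if rest else [])

MB = 1024 * 1024
print([b // MB for b in split_into_blocks(150 * MB)])  # [64, 64, 22]
```

A 150 MB file thus occupies two full blocks plus a 22 MB tail block, and those three blocks may land on different datanodes.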
The namenode marks datanodes without recent Heartbeats as dead and does not forward any new I/O requests to them. A Blockreport contains a list of all blocks on a datanode. Blockreports provide the namenode with an up-to-date view of where blocks are located on the cluster. The namenode constructs and maintains the latest metadata from Blockreports. Figure 2 depicts the overall operation of HDFS. As shown in the figure, all servers are fully connected and communicate with each other using TCP-based protocols.

2.2 MapReduce

MapReduce is a framework for processing large data sets in parallel across a Hadoop cluster. Like HDFS, it is based
on a master-slave model. The master is a special node which coordinates activity between several worker nodes. The master receives the input data that is to be processed. The input data is split into smaller chunks, and all these chunks are distributed to and processed in parallel on multiple worker nodes. This is called the Map phase. The workers send their results back to the master node, which aggregates these results to produce the sum total. This is called the Reduce phase. MapReduce operates on key-value pairs. That is, all input and output in MapReduce is in key-value pairs [14]. Conceptually, a MapReduce job takes a set of input key-value pairs and produces a set of output key-value pairs by passing the data through Map and Reduce functions. The Map tasks produce an intermediate set of key-value pairs that the Reduce tasks use as input [15,17]. In the MapReduce model designed for batch-oriented computations over large data sets, each Map job runs to completion before it can be consumed by a Reduce job. MapReduce Online, a modified MapReduce architecture, allows data to be pipelined between Map and Reduce jobs. This enables Reduce jobs to have access to early returns from Map jobs as they are being produced. That is, the Reduce job can start earlier, which reduces MapReduce completion times and improves the utilization of worker nodes. This architecture is effective in continuous service environments where MapReduce jobs can run continuously, accepting new data as it arrives and analyzing it immediately [16,18].

3 Elliptic curve-based authentication token generation

This section presents the proposed HDFS authentication scheme that creates delegation tokens using elliptic curve cryptography (ECC).

3.1 Overview

The proposed authentication scheme strengthens the protection of authentication information exchanged between namenode and datanode. The proposed scheme is designed to operate in the master-slave architecture presented in Fig. 3. That is, it requires an operating environment where a single master node (namenode) coordinates activity between several slave nodes (datanodes). The proposed scheme allows HDFS clients to be authenticated by the datanode via the block access token. Unlike most HDFS authentication protocols, which adopt public key exchange approaches, the proposed scheme uses a hash chain of keys. The proposed scheme achieves performance (communication power, computing power and area efficiency) as good as that of existing HDFS systems. Figure 3 shows an overview of the proposed authentication scheme. The proposed scheme exploits delegation tokens to add another layer of protection to the existing HDFS authentication.

3.2 Assumptions

The proposed authentication scheme makes the following assumptions:

- Clock synchronization is provided. That is, all HDFS nodes (client, namenode, and datanode) refer to a common notion of time.
- All nodes share a symmetric key encryption function that encrypts plaintext M with private key K, and a hash function H().
- For secure communication between namenode and datanode, Secure Shell (SSH) is used.

3.3 Terms and definitions

Table 1 presents the terms used in the proposed scheme.

Table 1 Notation

  Symbol    Description
  E         GF(p) on the elliptic curve
  N         The largest number
  P         Large prime
  Q_x       The public key of x
  d_S       The selected private key in [2, n-2]
  t_x       Expiration time of the authentication of x
  I_x       Temporary identifier of x
  Q_xy.z    Mutually agreed key between x and y
  e_z       Selected value at h(Q_xy.z, x, t_x, I_x)
  TK_x      Token information of x
  R_x       Point of the elliptic curve for x
  r_x       x-coordinate value of the elliptic curve point
  (r,s)     Certificate pair for the message

3.4 Token generation

In the proposed scheme, tokens are generated based on ECC and used to enhance the security of authentication messages communicated between namenode and datanode. This offline token generation process consists of the following four steps.

Step 1: The namenode chooses a random number, as represented in Eq. (1). The namenode then generates a private/public key pair using Eq. (2).

  d_S ∈ [2, n-2]    (1)
  Q_S = d_S · P    (2)

Step 2: The namenode sends the public key Q_S generated using Eq. (2) to the datanode. Upon receiving Q_S, the datanode performs Eqs. (3) through (6) in order to create a digital signature. This digital signature is sent to the namenode.

  Choose K_S ∈ [2, n-2]    (3)
  R_S = K_S · P    (4)
  r_S = R_S.x    (5)
  TK_S = K_S^(-1) · (H(R_S.x, I_S, t_S) + d_CH · r_S)    (6)

In Eq. (6), the temporary datanode identifier I_S and the credential expiry time t_S are included to facilitate managing datanodes. R_S.x, a pre-shared key between namenode and datanode, is replaced with r_S. The pair of r_S and token TK_S serves as the credentials, denoted (r_S, TK_S).

Step 3: In Eq. (7), the datanode sends its public key Q_CH to the namenode along with I_S, (r_S, TK_S) and t_S.
  Q_CH, I_S, (r_S, TK_S), t_S    (7)

Step 4: Among the credentials sent from the datanode to the namenode, Q_CH, I_S and t_S are processed by the hash function, as denoted in Eq. (8). The resulting value e_S is stored along with the other authentication information sent by the datanode, as represented in Eq. (9).

  e_S = H(Q_CH, I_S, t_S)    (8)
  Store_S = Q_S, Q_CH, I_S, t_S, e_S, (r_S, TK_S)    (9)

3.5 Authentication in client and datanode communication

The namenode stores the private key K_i that is used in the hash chain and sets the seed value c_j,0 for generating and verifying a delegation token. Seed value c_j,0 is generated by applying the hash function to random number c_i,j n times. c_i,j denotes a random number of client i and group j. The clients are divided into several groups, each of which has at least n clients. The authentication between client and datanode is performed in the following six steps.

Step 1: The client sends the block access token (r_S, TK_S), timestamp t_S and seed value c_j,0 to the datanode, as represented in Eq. (10).

  Transfer E_K_i,j((r_S, TK_S), t_S, c_j,0), K_i, (K_i ⊕ K_j)    (10)

Step 2: The datanode decrypts the message from the client using the private key K_i,j that is shared by client and datanode. This is represented in Eq. (11).

  D_K_i,j((r_S, TK_S), t_S, c_j,0)    (11)

Step 3: The datanode obtains the seed value c_j,0 after the decryption represented in Eq. (11). The datanode decrypts (K_i ⊕ K_j) using seed value c_j,0 and key K_i, thereby obtaining key K_j. This is denoted in Eq. (12).

  D_K_i,j(K_i ⊕ K_j)    (12)

Step 4: The datanode verifies timestamp t_S and authenticates the block access token (r_S, TK_S). If the client is a
legitimate user, the datanode encrypts the requested data using private key K_i,j and sends it to the client.

  Check (r_S, TK_S), t_S    (13)
  Transfer E_K_i,j(Data)    (14)

Step 5: The datanode decrypts the ciphertext presented in Eq. (15) using private key K_i,j. If the decrypted data is valid, the datanode regenerates E_K_i,j((r_S, TK_S), t_S, c_j,0), K_i, (K_i ⊕ K_j) and sends it to the client.

  Regenerate E_K_i,j((r_S, TK_S), t_S, c_j,0), K_i, (K_i ⊕ K_j)    (15)

Step 6: The client decrypts the data received from the datanode using private key K_i,j.

4 Evaluation

4.1 Security evaluation

Table 2 shows the security evaluation of the attack types (replay attack, disguise attack, data encryption) for HDFS, HDFS with a public key system, the protocol in [11], and the proposed scheme. Replay and impersonation attack scenarios that can occur in the authentication of HDFS clients and data were devised, and the resilience of the proposed scheme against those attacks was assessed. An attacker posing as a legitimate client sends the intercepted message E_g_j,n(D_K_i(K_i ⊕ K_j)) and E_K_i(c_j,n) to the namenode. The attacker then receives E_g_j,n(D_K_i(K_i ⊕ K_j)) and E_K_i(c_j,n) from the namenode. However, the attacker does not know K_i and K_j, so the attacker cannot decrypt the message sent by the namenode. Therefore, the replay attack fails. An attacker posing as a legitimate client sends the intercepted message E_K_i,j((r_S, TK_S), t_S, c_j,0), K_i and (K_i ⊕ K_j) to the datanode. The datanode verifies timestamp t_S and realizes that it is not a valid request. Note that the attacker is unaware of private key K_i,j, so the attacker cannot modify the timestamp. Given a malicious node impersonating a datanode, this node cannot know the request message sent by the client because it is unaware of hash chain values c_j,n and g_j,n, the information kept private by each client.
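The hash chain that produces the seed value c_j,0 = H^n(c_i,j) of Sect. 3.5 can be sketched as follows. This is a minimal sketch: the function names are illustrative, and SHA-256 stands in for the paper's unspecified hash function H().

```python
import hashlib

def H(x: bytes) -> bytes:
    # SHA-256 stands in for the paper's unspecified hash function H()
    return hashlib.sha256(x).digest()

def seed_value(c_ij: bytes, n: int) -> bytes:
    # c_{j,0} = H^n(c_{i,j}): hash the client's random number n times
    v = c_ij
    for _ in range(n):
        v = H(v)
    return v

def accept_chain_value(candidate: bytes, seed: bytes, max_steps: int) -> bool:
    # A revealed chain value H^{n-k}(c_{i,j}) is accepted if hashing it at
    # most max_steps more times reproduces the stored seed.
    v = candidate
    for _ in range(max_steps):
        v = H(v)
        if v == seed:
            return True
    return False

seed = seed_value(b"client random", 10)
later = seed_value(b"client random", 7)     # H^7(c), revealed later
assert accept_chain_value(later, seed, 10)  # three more hashes reach the seed
assert not accept_chain_value(b"forged", seed, 10)
```

Because H is one-way, a party that observes c_j,0 cannot derive the later chain values, which is what keeps values such as c_j,n private to the client.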
Even if this node is able to produce the block access token, it cannot create E_K_i,j((r_S, TK_S), t_S, c_j,0), K_i and (K_i ⊕ K_j) to be sent to the client. Therefore, the proposed scheme eliminates the threat of datanode impersonation attacks.

Table 2 Security evaluation

  Security analysis   HDFS   HDFS with public key system   [11]   Proposed scheme
  Replay attack
  Disguise attack
  Data encryption

4.2 Performance evaluation

In conventional HDFS, the datanode requires only one key vk to issue a delegation token, and it keeps the session key until the delegation token is issued. Compared to conventional HDFS, HDFS with public key-based authentication requires m additional keys at the namenode and m additional keys at the datanode. At the client side, a public key that is shared by the client and the namenode and n additional keys are required. In the authentication protocol suggested in [11], the namenode does not need to store any key for the client because it receives E_mk(K_NC) sent by the client. Similarly, the datanode does not store any key for the client, as it receives E_vk(K_CD) from the client. In the proposed scheme, the client and datanode store K_i and (K_i ⊕ K_j) in tables in order to load the block access token in memory. The stored information is revoked when the token expires. The memory required by the proposed scheme is similar to that required by conventional HDFS and the protocol in [11]. Compared to HDFS with public key-based authentication, the proposed scheme requires much less memory. HDFS with public key-based authentication requires m additional key exchanges between namenode and client and n·m additional key exchanges between datanode and client, in addition to the 6-round key exchange of conventional HDFS systems. In the proposed scheme and the protocol in [11], all key exchanges are performed during the conventional 6-round authentication process, so additional key exchanges are not needed.
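The elliptic-curve token generation of Sect. 3.4 (Eqs. (1)-(6)) can be sketched as an ECDSA-style signature. This is a sketch under stated assumptions, not the paper's exact construction: the curve (secp256k1), the use of SHA-256 for H(), and the inclusion of r_S in the hash input are illustrative choices, since the paper does not fix them.

```python
import hashlib
import random

# secp256k1 domain parameters (standard, widely published)
p = 2**256 - 2**32 - 977
n = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)
INF = None  # point at infinity

def ec_add(P1, P2):
    if P1 is INF: return P2
    if P2 is INF: return P1
    (x1, y1), (x2, y2) = P1, P2
    if x1 == x2 and (y1 + y2) % p == 0:
        return INF
    if P1 == P2:
        lam = 3 * x1 * x1 * pow(2 * y1, -1, p) % p  # curve has a = 0
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

def ec_mul(k, Pt):
    # double-and-add scalar multiplication
    R, Q = INF, Pt
    while k:
        if k & 1:
            R = ec_add(R, Q)
        Q = ec_add(Q, Q)
        k >>= 1
    return R

def Hnum(*parts):
    # SHA-256 stands in for the paper's unspecified hash function H()
    data = "|".join(str(x) for x in parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % n

def keygen(rng):
    d = rng.randrange(2, n - 1)   # Eq. (1): d in [2, n-2]
    return d, ec_mul(d, G)        # Eq. (2): Q = d * P

def sign_token(d, I_s, t_s, rng):
    # Eqs. (3)-(6): choose K_S, compute R_S = K_S * P, r_S = R_S.x, and
    # TK_S = K_S^(-1) * (H(r_S, I_S, t_S) + d * r_S) mod n
    K = rng.randrange(2, n - 1)
    r = ec_mul(K, G)[0] % n
    TK = pow(K, -1, n) * (Hnum(r, I_s, t_s) + d * r) % n
    return r, TK

def verify_token(Q, I_s, t_s, r, TK):
    # ECDSA-style check: recompute R and compare x-coordinates
    w = pow(TK, -1, n)
    e = Hnum(r, I_s, t_s)
    X = ec_add(ec_mul(e * w % n, G), ec_mul(r * w % n, Q))
    return X is not INF and X[0] % n == r

rng = random.Random(42)
d_CH, Q_CH = keygen(rng)                          # datanode key pair (d_CH, Q_CH)
r_S, TK_S = sign_token(d_CH, "dn-01", 1700, rng)  # credentials (r_S, TK_S)
assert verify_token(Q_CH, "dn-01", 1700, r_S, TK_S)
assert not verify_token(Q_CH, "dn-02", 1700, r_S, TK_S)  # altered identifier fails
```

The namenode can thus check (r_S, TK_S) against the datanode's public key alone, so no signing key needs to be shared across the cluster, in contrast to the shared-key block token discussed in Sect. 1.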
Figure 4 shows the bandwidth and storage time according to the ECC key length. As shown in the figure, bandwidth and storage time increase as the ECC key length grows. The total time needed for key establishment also increases, since the decryption time of the data is added to the message transmission time.

5 Conclusion

As the popularity of cloud-based Hadoop grows, efforts to improve the state of Hadoop security increase. This paper proposes a token-based authentication scheme that aims to provide more secure protection of sensitive HDFS data without overburdening authentication operations. The proposed scheme adds another layer of protection to the existing symmetric key HDFS authentication. The ECC technology employed in the proposed scheme makes the generated authentication keys anonymous, thereby protecting them against breaches or accidental exposures. The proposed scheme allows Hadoop clients to be authenticated by the datanode via the block access token while thwarting replay and impersonation attacks. In addition, the proposed scheme makes use of a hash chain of authentication keys, gaining a performance advantage over public key-based HDFS authentication protocols. In the future, ways to protect the privacy of enterprise and personal data stored in HDFS will be studied.

Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.

References

1. Apache Hadoop:
2. Barakat, O.L., Hashim, S.J., Raja Abdullah, R.S.A.B., Ramli, A.R., Hashim, F., Samsudin, K., Rahman, M.A.: Malware analysis performance enhancement using cloud computing. J. Comput. Virol. Hacking Tech. 10(1), 1-10 (2014)
3. Cao, Y., Miao, Q., Liu, J., Gao, L.: Abstracting minimal security-relevant behaviors for malware analysis. J. Comput. Virol. Hacking Tech. 9(4), (2013)
4. Hong, J.K.: The security policy for Big Data of US government. J. Digit. Converg. 11(10), (2013)
5. Ghemawat, S., Gobioff, H., Leung, S.: The Google file system. In: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, vol. 37, issue no. 5, pp. (2003)
6. Apache Hadoop MapReduce Tutorial: docs/r1.0.4/mapred_tutorial.html
7. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: Proceedings of the Sixth Symposium on Operating System Design and Implementation (OSDI 04), pp. (2004)
8. Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The Hadoop distributed file system. In: Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pp. (2010)
9. Vatamanu, C., Gavilut, D., Benchea, R.M.: Building a practical and reliable classifier for malware detection. J. Comput. Virol. Hacking Tech. 9(4), (2013)
10. Cimpoesu, M., Gavilut, D., Popescu, A.: The proactivity of perceptron derived algorithms in malware detection. J. Comput. Virol. Hacking Tech. 8(4), (2012)
11. Hong, J.K.: Kerberos authentication deployment policy of US in Big Data environment. J. Digit. Converg. 11(11), (2013)
12. Jeong, Y.S., Han, K.H.: Service management scheme using security identification information adopt to Big Data environment. J. Digit. Converg. 11(12), (2013)
13. Lee, S.H., Lee, D.W.: Current status of Big Data utilization. J. Digit. Converg. 11(2), (2013)
14. White, T.: Hadoop: The Definitive Guide, 2nd edn, pp. O'Reilly Media, Sebastopol (2009)
15. Condie, T., Conway, N., Alvaro, P., Hellerstein, J.M., Elmeleegy, K., Sears, R.: MapReduce Online. In: Proceedings of NSDI 10 (2010)
16. Lee, M.Y.: Technology of distributed stream computing. Electron. Telecommun. Trends 26(1) (2011)
17. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), (2008)
18. Jeong, S.W., Kim, K.S., Jeong, I.R.: Secure authentication protocol in Hadoop distributed file system based on hash chain. JKIISC 23(5), (2013)
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform
The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, Ko-Chin Chang, and Wei-Tsong Lee Abstract Virtualization platform solutions
Distributed File Systems
Distributed File Systems Alemnew Sheferaw Asrese University of Trento - Italy December 12, 2012 Acknowledgement: Mauro Fruet Alemnew S. Asrese (UniTN) Distributed File Systems 2012/12/12 1 / 55 Outline
Hadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
HDFS Users Guide. Table of contents
Table of contents 1 Purpose...2 2 Overview...2 3 Prerequisites...3 4 Web Interface...3 5 Shell Commands... 3 5.1 DFSAdmin Command...4 6 Secondary NameNode...4 7 Checkpoint Node...5 8 Backup Node...6 9
Detection of Distributed Denial of Service Attack with Hadoop on Live Network
Detection of Distributed Denial of Service Attack with Hadoop on Live Network Suchita Korad 1, Shubhada Kadam 2, Prajakta Deore 3, Madhuri Jadhav 4, Prof.Rahul Patil 5 Students, Dept. of Computer, PCCOE,
Hadoop Security Design Just Add Kerberos? Really?
isec Partners, Inc. Hadoop Security Design Just Add Kerberos? Really? isec Partners, Inc. is an information security firm that specializes in application, network, host, and product security. For more
Data-Intensive Computing with Map-Reduce and Hadoop
Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan [email protected] Abstract Every day, we create 2.5 quintillion
Highly Available Hadoop Name Node Architecture-Using Replicas of Name Node with Time Synchronization among Replicas
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 16, Issue 3, Ver. II (May-Jun. 2014), PP 58-62 Highly Available Hadoop Name Node Architecture-Using Replicas
Introduction to Hadoop
Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction
Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
Policy-based Pre-Processing in Hadoop
Policy-based Pre-Processing in Hadoop Yi Cheng, Christian Schaefer Ericsson Research Stockholm, Sweden [email protected], [email protected] Abstract While big data analytics provides
Distributed Framework for Data Mining As a Service on Private Cloud
RESEARCH ARTICLE OPEN ACCESS Distributed Framework for Data Mining As a Service on Private Cloud Shraddha Masih *, Sanjay Tanwani** *Research Scholar & Associate Professor, School of Computer Science &
HDFS Space Consolidation
HDFS Space Consolidation Aastha Mehta*,1,2, Deepti Banka*,1,2, Kartheek Muthyala*,1,2, Priya Sehgal 1, Ajay Bakre 1 *Student Authors 1 Advanced Technology Group, NetApp Inc., Bangalore, India 2 Birla Institute
Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2
Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data
Chapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
Survey on Load Rebalancing for Distributed File System in Cloud
Survey on Load Rebalancing for Distributed File System in Cloud Prof. Pranalini S. Ketkar Ankita Bhimrao Patkure IT Department, DCOER, PG Scholar, Computer Department DCOER, Pune University Pune university
Scalable Multiple NameNodes Hadoop Cloud Storage System
Vol.8, No.1 (2015), pp.105-110 http://dx.doi.org/10.14257/ijdta.2015.8.1.12 Scalable Multiple NameNodes Hadoop Cloud Storage System Kun Bi 1 and Dezhi Han 1,2 1 College of Information Engineering, Shanghai
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
The Hadoop Distributed File System
The Hadoop Distributed File System Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com Presenter: Alex Hu HDFS
DATA SECURITY MODEL FOR CLOUD COMPUTING
DATA SECURITY MODEL FOR CLOUD COMPUTING POOJA DHAWAN Assistant Professor, Deptt of Computer Application and Science Hindu Girls College, Jagadhri 135 001 [email protected] ABSTRACT Cloud Computing
Suresh Lakavath csir urdip Pune, India [email protected].
A Big Data Hadoop Architecture for Online Analysis. Suresh Lakavath csir urdip Pune, India [email protected]. Ramlal Naik L Acme Tele Power LTD Haryana, India [email protected]. Abstract Big Data
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
Hadoop Distributed File System. Jordan Prosch, Matt Kipps
Hadoop Distributed File System Jordan Prosch, Matt Kipps Outline - Background - Architecture - Comments & Suggestions Background What is HDFS? Part of Apache Hadoop - distributed storage What is Hadoop?
Hadoop: Embracing future hardware
Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop
http://www.wordle.net/
Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely
http://www.paper.edu.cn
5 10 15 20 25 30 35 A platform for massive railway information data storage # SHAN Xu 1, WANG Genying 1, LIU Lin 2** (1. Key Laboratory of Communication and Information Systems, Beijing Municipal Commission
Large-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
Mobile Cloud Computing for Data-Intensive Applications
Mobile Cloud Computing for Data-Intensive Applications Senior Thesis Final Report Vincent Teo, [email protected] Advisor: Professor Priya Narasimhan, [email protected] Abstract The computational and storage
Research Article Hadoop-Based Distributed Sensor Node Management System
Distributed Networks, Article ID 61868, 7 pages http://dx.doi.org/1.1155/214/61868 Research Article Hadoop-Based Distributed Node Management System In-Yong Jung, Ki-Hyun Kim, Byong-John Han, and Chang-Sung
Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis
, 22-24 October, 2014, San Francisco, USA Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis Teng Zhao, Kai Qian, Dan Lo, Minzhe Guo, Prabir Bhattacharya, Wei Chen, and Ying
Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay
Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability
Single Sign-On Secure Authentication Password Mechanism
Single Sign-On Secure Authentication Password Mechanism Deepali M. Devkate, N.D.Kale ME Student, Department of CE, PVPIT, Bavdhan, SavitribaiPhule University Pune, Maharashtra,India. Assistant Professor,
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
Parallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai [email protected] MapReduce is a parallel programming model
SECURITY IMPLEMENTATION IN HADOOP. By Narsimha Chary(200607008) Siddalinga K M(200950034) Rahman(200950032)
SECURITY IMPLEMENTATION IN HADOOP By Narsimha Chary(200607008) Siddalinga K M(200950034) Rahman(200950032) AGENDA What is security? Security in Distributed File Systems? Current level of security in Hadoop!
Dynamic Bigdata and Security with Kerberos
Dynamic Bigdata and Security with Kerberos Sachin Choudhary 1, Sandesh Manohar 2, Sunil Salunkhe 3 1 Department of Computer Engineering, MGMCET, Navi Mumbai 2 Master of Computer Applications, IMCOST, Thane
Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] June 3 rd, 2008
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] June 3 rd, 2008 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed
Big Data and Hadoop with components like Flume, Pig, Hive and Jaql
Abstract- Today data is increasing in volume, variety and velocity. To manage this data, we have to use databases with massively parallel software running on tens, hundreds, or more than thousands of servers.
A REVIEW: Distributed File System
Journal of Computer Networks and Communications Security VOL. 3, NO. 5, MAY 2015, 229 234 Available online at: www.ijcncs.org EISSN 23089830 (Online) / ISSN 24100595 (Print) A REVIEW: System Shiva Asadianfam
Big Data Analysis and Its Scheduling Policy Hadoop
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 1, Ver. IV (Jan Feb. 2015), PP 36-40 www.iosrjournals.org Big Data Analysis and Its Scheduling Policy
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
Log Mining Based on Hadoop s Map and Reduce Technique
Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, [email protected] Amruta Deshpande Department of Computer Science, [email protected]
Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected]
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected] Hadoop, Why? Need to process huge datasets on large clusters of computers
An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov
An Industrial Perspective on the Hadoop Ecosystem Eldar Khalilov Pavel Valov agenda 03.12.2015 2 agenda Introduction 03.12.2015 2 agenda Introduction Research goals 03.12.2015 2 agenda Introduction Research
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 15 Big Data Management V (Big-data Analytics / Map-Reduce) Chapter 16 and 19: Abideboul et. Al. Demetris
Development of Real-time Big Data Analysis System and a Case Study on the Application of Information in a Medical Institution
, pp. 93-102 http://dx.doi.org/10.14257/ijseia.2015.9.7.10 Development of Real-time Big Data Analysis System and a Case Study on the Application of Information in a Medical Institution Mi-Jin Kim and Yun-Sik
Big Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A COMPREHENSIVE VIEW OF HADOOP ER. AMRINDER KAUR Assistant Professor, Department
Introduction to Hadoop
1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools
The Performance Characteristics of MapReduce Applications on Scalable Clusters
The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 [email protected] ABSTRACT Many cluster owners and operators have
BIG DATA TECHNOLOGY. Hadoop Ecosystem
BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big
CS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
CS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
Prepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
