Verification and Validation of MapReduce Program model for Parallel K-Means algorithm on Hadoop Cluster
|
|
|
- Cassandra Watts
- 10 years ago
- Views:
Transcription
1 Verification and Validation of MapReduce Program model for Parallel K-Means algorithm on Hadoop Cluster Amresh Kumar Department of Computer Science & Engineering, Christ University Faculty of Engineering Bangalore, Karnataka, India Kiran M. Department of Computer Science & Engineering, Christ University Faculty of Engineering Bangalore, Karnataka, India Ravi Prakash G. Department of Computer Science & Engineering, Christ University Faculty of Engineering Bangalore, Karnataka, India Saikat Mukherjee Department of Computer Science & Engineering, Christ University Faculty of Engineering Bangalore, Karnataka, India ABSTRACT With the development of information technology, a large volume of data is growing and getting stored electronically. Thus, the data volumes processing by many applications will routinely cross the petabyte threshold range, in that case it would increase the computational requirements. Efficient processing algorithms and implementation techniques are the key in meeting the scalability and performance requirements in such scientific data analyses. So for the same here, it has been analyzed with the various MapReduce Programs and a parallel clustering algorithm (PKMeans) on Hadoop cluster, using the Concept of MapReduce. Here, in this experiment we have verified and validated various MapReduce applications like wordcount, grep, terasort and parallel K-Means Clustering Algorithm. It has been found that as the number of nodes increases the execution time decreases, but also some of the interesting cases has been found during the experiment and recorded the various performance change and drawn different performance graphs. This experiment is basically a research study of above MapReduce applications and also to verify and validate the MapReduce Program model for Parallel K- Means algorithm on Hadoop Cluster having four nodes. Keywords Machine learning, Hadoop, MapReduce, k-means, wordcount, grep, terasort. 1. INTRODUCTION Machine Learning is the study of how to build the systems that adapt and improve with experience. Machine Learning focus on designing of the algorithms that can recognize patterns & take decisions. Broadly speaking the two main subfields of machine learning are supervised learning and unsupervised learning. In supervised learning the focus is on accurate prediction, whereas in unsupervised learning the aim is to find compact descriptions of the data. Hadoop [1] was created by Doug Cutting; he is person behind the Apache Lucene creation, Apache Luence is the text search library which is being widely used. Hadoop has origin in Apache Nutch, Apache Nutch is an open source search engine and it is a web search engine, which is a part of the Lucene project. Apache Hadoop [1] is a software framework that supports data-intensive distributed applications. It empowers the applications to work with thousands of computational autonomous and independent computers and petabytes of data. Hadoop is the derivative of Google's MapReduce and Google File System (GFS) [2]. These include reliability achieved by replication, scales well to thousands of nodes, can handle petabytes of data, automatic handling of node failures, and is designed to run well on heterogeneous commodity class hardwares. However, Hadoop is still a fairly new project and limited example code and documentation is available for non-trivial applications. HDFS [3], the Hadoop Distributed File System, is a file system which is designed to hold very large amounts of data (terabytes or even petabytes), and provide high-throughput access to the information. Files are stored in a redundant fashion across multiple machine s to ensure their durability to failure and high availability to parallel applications. HDFS presents a single view of multiple physical disks or file systems. MapReduce [4] is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. The name derives from 48
2 the application of map ( ) and reduce ( ) functions. In its simplest form MapReduce is a two-step process: Map step and Reduce step. In the Map step a master node divides a problem into a number of independent chunks of problems that are allocated to map tasks. Each map task performs its own assigned part of the problem and outputs results as keyvalue pairs. In the reduce step the node takes the outputs of the maps, where a particular reducer will receive only map outputs with a particular key and will process those. As the result of reduce step, a collection of values is produced. This report includes/presents the study and series of experiments of various MapReduce Algorithms and Parallel K-Means Clustering algorithm on Hadoop cluster, thereby achieving a collection of sample codes and documentations. The data set used here varies up to 1GB and also there is some of the interesting case for different algorithm implementation using MapReduce. This Experiment covers Literature review (Research Clarification & Descriptive Study I) it describes the machine learning, MapReduce and its design, MapReduce strategy, Hadoop, HDFS, and different clustering algorithms with brief about K-Means & canopy clustering algorithms, cloudera and lists of problems. Methodology (Prescriptive Study), it tells about Hadoop architecture, MR Programming model and Parallel K-Means algorithm. Work done and result (Descriptive Study II) with the brief of experimental setup and experimental results. Finally; conclusion, future work and references. Fig 1: Research Plan: Basic Means, Stages, Main Outcomes 2. LITERATURE REVIEW (RESEARCH CLARIFICATION & DESCRIPTIVE STUDY I) 2.1 Machine Learning Machine learning [9] is a branch of artificial intelligence, and it is a scientific approach which care of the design and expansion of an algorithms which take input as empirical data, such as data from the sensors or some databases, and yield patterns or predictions, that thought to be features of the original mechanism which generated the data. A major emphasis of the machine learning exploration is the design of algorithms that recognize complex patterns and make intelligent decisions based on input. Machine learning is the science of making computers that can take decisions by its own. In the past days, machine learning has given us automated cars, auto speech recognition, effective web searching, and an immensely improved understanding of the human genome (the complete set of human genetic information, stored as DNA sequences within the 23chromosome pairs of the cell nucleus and in a small DNA molecule within the mitochondrion). Machine learning is so pervasive today that it has been undoubtedly use it many times a day without knowing about it. Many researchers also think it is the best way to make progress towards human-level Artificial Intelligence (AI). In this way, it will be learned about the most effective machine learning techniques, and practice by implementing them and getting them to work for our self. There are two major categories of types of learning: Supervised learning and Un-supervised learning. In Supervised learning [9] [15], it is known (sometimes only approximately) that the values of the m samples in the training set T. It has been assumed that if a hypothesis h that closely agrees with f is been found for the members of T. Then this hypothesis will be a good guess for f especially if T is large. For example, in case of the classification problem, the learner estimates a function mapping a vector into classes by looking at the input-output examples of the function. Whereas, In Un-supervised learning [9] [15], there is a training set of vectors without function values for them. The problem in this case, usually, is to divide the training set into subsets, t1... tr, in some appropriate way. Unsupervised learning methods have application in taxonomic problems in which it is desired to invent ways to classify data into meaningful categories. Approaches to unsupervised learning include Clustering (e.g., k-means, mixture models, hierarchical clustering). It has been also describe methods that are intermediate between supervised and unsupervised learning i.e. Semisupervised learning. Semi-supervised learning [15] is a class of machine learning techniques that make use of both labeled and unlabeled data for training - typically a small amount of labeled data with a large amount of the unlabeled data. Semi-supervised learning lies between unsupervised learning and supervise learning. 2.2 MapReduce MapReduce is a programming model for expressing distributed computations on massive amounts of data, and also an execution framework for large-scale data processing on clusters of commodity servers. In other word it can tell that MapReduce represents to a framework that runs on a computational cluster to extract the Knowledge from a large datasets. The name MapReduce is derived from two functions map ( ) and reduce ( ) functions. The Map ( ) function usually applies to all the members of the dataset and then returns a list of results. And Reduce ( ) function collates and resolves the results from one or more mapping operations executed in parallel. MapReduce Model splits the input dataset into independent chunks called as subsets, which are processed by map ( ) and reduce ( ). Generally, compute nodes & storage nodes are the same. That is the entire computation involving map ( ) and reduce ( ) functions will be happening on DataNodes and result of computation is going to be stored locally. In the MapReduce Model, programs written in the functional style are automatically parallelized and executed on the large cluster of commodity hardware. The run-time system takes care of the details of broken input data, and schedules the program's execution across the number of machines; it handles the machine failures, and manages inter-machine communication. In this way without any 49
3 experience with parallel and distributed systems this allows programmers, to easily utilize the resources of a large distributed system MapReduce Design In MapReduce, records are treated in isolation by tasks called as Mappers. The output from the Mappers is then brought together into a second set of tasks called as Reducers; here results from many different mappers are being merged together. Problems suitable for processing with MapReduce must usually be easily split into independent subtasks that can be processed in parallel. The map and reduce functions are both specified in terms of data is structured in key-value pairs. The power of MapReduce is from the execution of many map tasks which run in parallel on a data set and gives output of the processed data in form of intermediate keyvalue pairs. Each reduce task receives and processes data for one particular key at a time and outputs the data which processes as key-value pairs. Apache v2 license. It enables many applications to work with thousands and thousands of computational independent computers and petabytes of the data. Hadoop was derived from Google's MapReduce and Google File System (GFS). Hadoop is a top-level Apache project which is built and used by a global communal of contributors, it has been written in the Java programming language. Hadoop includes several subprojects: [1] [3] Table 1. Hadoop Project Components (Sub-Projects) HDFS MapReduce HBase Pig Hive ZooKeeper Chukwa Avro Distribute File System; One of the Subject of this Experiment Computational Framework for distributed environment Column-oriented table service Dataflow Language and it is a Parallel execution framework Data warehouse infrastructure It is for distributed coordination service Collecting management data: System Data serialization system Fig 2: MapReduce key-value pair s generation MapReduce Strategy (Execution) The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M chunks. The input chunks can be processed in parallel by different machines. Reduce requests are being distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash (key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Figure give below shows the overall flow of a MapReduce operation. As soon as the MapReduce function is called by the user program, the following sequence starts 2.4 HDFS The Hadoop Distributed File System (HDFS) is designed to store large data sets with high reliability, and to stream those data sets with high bandwidth. In a large cluster, more that thousands of servers both host and Client are directly attached to storage and execute user application tasks. Hadoop provides a distributed file system and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm.. Fig 3: MapReduce Execution overview 2.3 Hadoop Apache Hadoop [1] is an open source it is built on Java framework and it is built for implementing the reliable and scalable computational networks which supports data intensive distributed applications, and it is licensed under Fig 4: The architecture of HDFS HDFS is the file system constituent the Hadoop. While the interface to HDFS is being formed after the UNIX file system, the truthfulness to standards was sacrificed in favor of enhanced performance for the applications at hand. HDFS stores the file system, metadata and the application data independently. Alike to distributed file systems, like PVFS, Lustre and GFS. HDFS stores metadata on a dedicated server, known as NameNode. Application data are stored on other servers known as DataNodes. All servers are connected and communicate with each other using TCPbased protocols [1]. 50
4 2.4.1 HDFS Client Using the HDFS client [17], user applications access the file system. Similar to most conventional file systems, HDFS is having provisions of reading, writing and deleting files, and provides operations for creating and deleting directories HDFS Name Node (Master) It manages the file system name space,keeps track of job execution, manages the cluster, replicates data blocks and keeps them evenly distributed, Manages lists of files, list of blocks in each file, list of blocks per node, file attributes and other meta-data and also it Tracks HDFS file creation and deletion operations in an activity log. Depending on system load, the NameNode and JobTracker daemons may run on separate computers. JobTracker dispatches jobs and assigns splits (splits) to mappers or reducers as each stage completes Data Nodes (Slaves) It stores blocks of data in their local file system, stores meta-data for each block, serves data and meta-data to the job they execute, sends periodic status reports to the Name Node, and sends data blocks to other nodes required by the Name Node. Data nodes execute the DataNode and TaskTracker daemons. TaskTracker executes tasks sent by the JobTracker and reports status. 2.5 Clustering Algorithms Data clustering [5] is the partitioning of a data set or sets of data into similar subsets; this can be accomplished by using some of the clustering algorithms K-Means Algorithm K-MEANS [5] [6] [11] is the simplest algorithm used for clustering also it an unsupervised clustering algorithm. The K-Means algorithm is used to partitions the data set into k clusters using the cluster mean value so that in the resulting clusters is having high intra cluster similarity and low inter cluster similarity. K-Means clustering algorithm is iterative in nature. The K-means clustering algorithm is known to be efficient in clustering large data sets. This clustering algorithm was originally developed by MacQueen, and is one of the simplest and the best known unsupervised learning algorithms that solve the well-known clustering problem. The K-Means algorithm targets to partition a set of given objects into k clusters based on their features, where k is a user-defined constant. The core idea is to define k centroids, one centroid for each cluster. The centroid for a cluster is calculated and formed in such a way that it is closely related (in terms of similarity function; similarity can be measured by using different methods such as cosine similarity, Euclidean distance, Extended Jaccard) to all objects in that cluster Canopy Clustering Algorithm Canopy Clustering [5] [6] an unsupervised pre-clustering algorithm related to the K-means algorithm, which can process huge data sets efficiently, but the resulting "clusters" are merely a rough pre-partitioning of the data set to then analyze the partitions with existing slower methods such as k-means clustering. The basic steps of the canopy clustering are described below. Given two threshold distance T1 and T2; T1>T2 and a collection of points. Now, to determine the Canopy Centers: there is iteration through the set of points, if the point is at distance decide the canopy membership - for each point in the input set if the point is at a distance < T1 from any of points in the list of canopy centers (generated in step) then point is member of the corresponding cluster. Fig 5: Canopy clustering description The Map Reduce implementation of K-Means Algorithm with Canopy Clustering has the following steps. Fig 6: Map Reduce Steps 2.6 Cloudera Cloudera [7] [8] Inc. is a software company that provides Apache Hadoop-based software, support and services called CDH. CDH has version of Apache Hadoop patches and updates. 2.7 List of Problems During the literature review and observation some problems like scalability issues, high computational costs, reliability, compatibility problems (with the Softwares & Hardwares) has been listed out. 3. METHODOLOGY (PRESCRIPTIVE STUDY) 3.1 Hadoop Architecture The architecture of a complete Hadoop cluster is in the form of the Master-Slave architecture. Here, in the Hadoop architecture, Master is NameNode and Slaves are DataNodes. 51
5 The HDFS NameNode runs the NameNode daemon. The job submission node runs the JobTracker, which is the single point of contact for a client wishing to execute a MapReduce job. The JobTracker monitors the progress of running MapReduce jobs and is responsible for coordinating the execution of the mappers and reducers. parallel parts and serial parts of the algorithms. Now, it has been seen that how the computations can be formalized as map and reduce operations in detail. So, there got the details of PK-Means Algorithm [11] [16] [17] [18] using MapReduce with the flow diagram shown below. Typically, all these services runs on two distinct machines, although in a small cluster they are often co-located. The bulk of a Hadoop cluster consists of slave nodes (only three of which are shown in the figure) that run both a TaskTracker, which is responsible running the user code, and a DataNode daemon, for serving HDFS data. Fig 9: Flow diagram of PKMeans Algorithm using MapReduce Fig 7: Hadoop Cluster Architecture 3.2 MR Programming Model MAP-REDUCE programming model [17] is defined by Dean et al. MAP-REDUCE computing model comprises of two functions, Map ( ) and Reduce ( ) functions. The Map and Reduce functions are both defined with data structure of (key1; value1) pairs. Map function is applied to each item in the input dataset according to the format of the (key1; value1) pairs; each call produces a list (key2; value2). 4. WORK DONE AND RESULT (DESCRIPTIVE STUDY II) 4.1 Experimental Setup The experiments were carried out on the Hadoop cluster. The Hadoop infrastructure consists of one cluster having four nodes, distributed in one single lab. For the series of experiments, I have used the nodes in the Hadoop cluster, with Intel Core 2 Duo CPU@ 2.53 GHZ, 2 CPUs and 2GB of RAM for each node. With a measured bandwidth for endto-end TCP sockets of 100 MB/s, Operating System: CentOS 6.2 (Final) and jdk 1.6.0_33 (SUN JAVA). 4.2 Experimental Result and its Analysis In this Experiment, performance has been shown with respect to execution time and number of nodes. Here, there are different cases as given in table below. Table 2. Various cases for verification and analysis. Graphical Analysis of Various Cases using Dataset Sr. No. Data Size No. of Nodes Case 1 Increasing Constant Case 2 Constant Increasing Case 3 Increasing Increasing Fig 8: Process of MAP and REDUCE is illustrated All the pairs which is having the same key in the output lists is kept to reduce function which generates one (value3) or an empty return. The results of all calls from a list, list (value3). This process of MAP and REDUCE is illustrated in figure below Experiment 01: Sequential Algorithm vs. MapReduce Algorithm 3.3 PKMeans Algorithm (Parallel K- means Algorithm) [11] Here is the presentation of a design for Parallel K-Means based on MapReduce. First, it has been already seen with the overview of the k-means algorithm and then analyzed 52
6 Fig 10: Speed of Sequential Algorithm vs. MapReduce Algorithm Experiment 02: Wordcount Case 1: When Data size is increasing and Number of nodes is Constant Fig 14: Results of grep MapReduce Application Fig 11: Results of wordcount MapReduce Application Case 2: When Data size is Constant and Number of nodes Case 2: When Data size is Constant and Number of nodes Fig 15: Results of grep MapReduce Application Fig 12: Results of wordcount MapReduce Application Case 3: When Data size is increasing and Number of nodes Case 3: When Data size is increasing and Number of nodes Fig 13: Results of wordcount MapReduce Application In case of Wordcount,it has been observed with an interesting case (Case 3) and also analyzed and verified that the execution time is first decreasing and then increasing, as data size as well as number of nodes (from 1 to 4) because the load on nodes and the communication between nodes are also increasing Experiment 02: Grep Case 1: When Data size is increasing and Number of nodes is Constant Fig 16: Results of grep MapReduce Application In case of Grep experiment, it has been observed with an interesting case (case 3). It has been analyzed and verified that the execution time in case of Job1 (256MB of Data Size) is showing first decreasing and then increasing pattern. This is happening because the load (Data set Size) is increasing on nodes and as well as communication between nodes is also increasing (From 1 to 4) Experiment 03: Terasort Case 1: When Data size is increasing and Number of nodes is Constant 53
7 4.2.5 Experiment 04: PKMeans Clustering Algorithm Case 1: When Data size is increasing and Number of nodes is Constant Fig 17: Results of terasort MapReduce Application Case 2: When Data size is Constant and Number of nodes Fig 20: Results of PKMeans MapReduce Application Case 2: When Data size is Constant and Number of nodes Fig 18: Results of terasort MapReduce Application Case 3: When Data size is increasing and Number of nodes Fig 21: Results of PKMeans MapReduce Application Case 3: When Data size is increasing and Number of nodes Fig 19: Results of terasort MapReduce Application In case of Terasort experiment, it has been observed with an interesting case (Case 2). It has been analyzed and verified that the execution time is first decreases and then increases, when Data size is kept constant and increasing number of nodes from 1 to 4. This is happening because the load (Data set size) is constant but the communication between nodes is increasing since number of Node is increasing from 1 to 4. Fig 22: Results of PKMeans MapReduce Application In case of PKMeans experiment, it has been observed with an interesting case (Case 2). It has been analyzed and verified that the execution time is first decreases and then increases. The reason behind this is that the load (Data set size) is constant but the communication between nodes is increasing, since number of Node is increasing from 1 to 4. 54
8 5. CONCLUSION AND FUTURE WORK In this experiment, the performance of MapReduce application has been shown with respect to execution time and number of nodes. Also, verified and validated that in MapReduce Program model as the number of nodes increases the execution time decreases. In this way, it has been shown that PKMeans performs well and efficiently and the results totally depend on the size of Hadoop cluster. The performance of above application has been shown with respect to execution time, dataset size and number of nodes. The Experiment involves the Hadoop cluster, which contains total of four nodes i.e. one master (NameNode) and three slaves (DataNodes). The future research includes developing mechanisms for Hadoop quality of service on the different size of the datasets using MapReduce. Performance evaluation of MapReduce with different version of software and hardware configurations. Security Issues with respect to Hadoop and MapReduce. The performance of PK-Means Algorithm can be enhanced by introducing different preprocessing steps such as Canopy Clustering algorithm. MapReduce concept can be look forward, for the algorithms such as different set of optimization algorithms, different set of Kernel algorithms. 6. REFERENCES [1] Apache Hadoop. [2] Sanjay Ghemawat, Howard Gobioff, and Shun- TakLeung The Google File System, Google,Sosp 03, October 19 22, 2003, Bolton Landing,New York, USA. Copyright 2003 ACM [3] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, The Hadoop Distributed File System. Yahoo! Sunnyvale, California USA, IEEE, 2010 [4] Jeffrey Dean and Sanjay Ghemawat MapReduce: Simplified Data Processing On Large Clusters 2009, [5],Google, Inc., Usenix Association OSDI 04: 6thSymposium on Operating Systems Design andimplementation. [6] [7] evelopments [8] [9] on+guide. [10] [11] Mahesh Maurya and Sunita Mahajan. Performance analysis of MapReduce Programs on Hadoop cluster, World Congress on Information and Communication Technologies [12] Weizhong Zhao, Huifang Ma, and Qing He, Parallel K-Means Clustering Based on MapReduce, [13] [14] [15] 23 [16] David Barber, Bayesian Reasoning and Machine Learning, Cambridge 2011; New York: Cambridge University Press. [17] Ron Bekkerman, Mikhail Bilenko, John Langford, Scalable Machine Learning Cambridge University Press, [18] Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce, April 2010, University of Maryland, College Park. [19] Tom White, Hadoop: The Definitive Guide, 2009 Published by O Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA IJCA TM : 55
Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique
Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,[email protected]
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
Chapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,
Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! [email protected]
Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
Processing of Hadoop using Highly Available NameNode
Processing of Hadoop using Highly Available NameNode 1 Akash Deshpande, 2 Shrikant Badwaik, 3 Sailee Nalawade, 4 Anjali Bote, 5 Prof. S. P. Kosbatwar Department of computer Engineering Smt. Kashibai Navale
Data-Intensive Computing with Map-Reduce and Hadoop
Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan [email protected] Abstract Every day, we create 2.5 quintillion
Distributed Framework for Data Mining As a Service on Private Cloud
RESEARCH ARTICLE OPEN ACCESS Distributed Framework for Data Mining As a Service on Private Cloud Shraddha Masih *, Sanjay Tanwani** *Research Scholar & Associate Professor, School of Computer Science &
THE HADOOP DISTRIBUTED FILE SYSTEM
THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,
Prepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
Hadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
Fault Tolerance in Hadoop for Work Migration
1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous
Jeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
The Performance Characteristics of MapReduce Applications on Scalable Clusters
The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 [email protected] ABSTRACT Many cluster owners and operators have
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 15 Big Data Management V (Big-data Analytics / Map-Reduce) Chapter 16 and 19: Abideboul et. Al. Demetris
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics
Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)
Hadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5
Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 [email protected], 2 [email protected],
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
http://www.wordle.net/
Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely
Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment
Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,
Large-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
Big Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
Highly Available Hadoop Name Node Architecture-Using Replicas of Name Node with Time Synchronization among Replicas
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 16, Issue 3, Ver. II (May-Jun. 2014), PP 58-62 Highly Available Hadoop Name Node Architecture-Using Replicas
International Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
Introduction to Hadoop
Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction
Introduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
NoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea
What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea Overview Riding Google App Engine Taming Hadoop Summary Riding
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
Parallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai [email protected] MapReduce is a parallel programming model
METHOD OF A MULTIMEDIA TRANSCODING FOR MULTIPLE MAPREDUCE JOBS IN CLOUD COMPUTING ENVIRONMENT
METHOD OF A MULTIMEDIA TRANSCODING FOR MULTIPLE MAPREDUCE JOBS IN CLOUD COMPUTING ENVIRONMENT 1 SEUNGHO HAN, 2 MYOUNGJIN KIM, 3 YUN CUI, 4 SEUNGHYUN SEO, 5 SEUNGBUM SEO, 6 HANKU LEE 1,2,3,4,5 Department
Application Development. A Paradigm Shift
Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the
Hadoop Scheduler w i t h Deadline Constraint
Hadoop Scheduler w i t h Deadline Constraint Geetha J 1, N UdayBhaskar 2, P ChennaReddy 3,Neha Sniha 4 1,4 Department of Computer Science and Engineering, M S Ramaiah Institute of Technology, Bangalore,
Hadoop Big Data for Processing Data and Performing Workload
Hadoop Big Data for Processing Data and Performing Workload Girish T B 1, Shadik Mohammed Ghouse 2, Dr. B. R. Prasad Babu 3 1 M Tech Student, 2 Assosiate professor, 3 Professor & Head (PG), of Computer
Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
Journal of science e ISSN 2277-3290 Print ISSN 2277-3282 Information Technology www.journalofscience.net STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) S. Chandra
Open source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
Mobile Cloud Computing for Data-Intensive Applications
Mobile Cloud Computing for Data-Intensive Applications Senior Thesis Final Report Vincent Teo, [email protected] Advisor: Professor Priya Narasimhan, [email protected] Abstract The computational and storage
The Hadoop Framework
The Hadoop Framework Nils Braden University of Applied Sciences Gießen-Friedberg Wiesenstraße 14 35390 Gießen [email protected] Abstract. The Hadoop Framework offers an approach to large-scale
Accelerating and Simplifying Apache
Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly
Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela
Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance
Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets
!"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A COMPREHENSIVE VIEW OF HADOOP ER. AMRINDER KAUR Assistant Professor, Department
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 14 Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HFDS) Chapter 14-15: Abideboul
MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud
Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud Aditya Jadhav, Mahesh Kukreja E-mail: [email protected] & [email protected] Abstract : In the information industry,
Introduction to Hadoop
1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools
Performance Analysis of Book Recommendation System on Hadoop Platform
Performance Analysis of Book Recommendation System on Hadoop Platform Sugandha Bhatia #1, Surbhi Sehgal #2, Seema Sharma #3 Department of Computer Science & Engineering, Amity School of Engineering & Technology,
Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc [email protected]
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc [email protected] What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data
GraySort and MinuteSort at Yahoo on Hadoop 0.23
GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters
Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12
Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using
CS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
CS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
Big Data and Hadoop with components like Flume, Pig, Hive and Jaql
Abstract- Today data is increasing in volume, variety and velocity. To manage this data, we have to use databases with massively parallel software running on tens, hundreds, or more than thousands of servers.
Big Data Analysis and Its Scheduling Policy Hadoop
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 1, Ver. IV (Jan Feb. 2015), PP 36-40 www.iosrjournals.org Big Data Analysis and Its Scheduling Policy
Optimization and analysis of large scale data sorting algorithm based on Hadoop
Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,
MapReduce and Hadoop Distributed File System
MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) [email protected] http://www.cse.buffalo.edu/faculty/bina Partially
Keywords: Big Data, HDFS, Map Reduce, Hadoop
Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning
Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science
A Seminar report On Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science SUBMITTED TO: www.studymafia.org SUBMITTED BY: www.studymafia.org
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.
A Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
Large scale processing using Hadoop. Ján Vaňo
Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop
Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics
Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics Dharmendra Agawane 1, Rohit Pawar 2, Pavankumar Purohit 3, Gangadhar Agre 4 Guide: Prof. P B Jawade 2
Big Data with Rough Set Using Map- Reduce
Big Data with Rough Set Using Map- Reduce Mr.G.Lenin 1, Mr. A. Raj Ganesh 2, Mr. S. Vanarasan 3 Assistant Professor, Department of CSE, Podhigai College of Engineering & Technology, Tirupattur, Tamilnadu,
http://www.paper.edu.cn
5 10 15 20 25 30 35 A platform for massive railway information data storage # SHAN Xu 1, WANG Genying 1, LIU Lin 2** (1. Key Laboratory of Communication and Information Systems, Beijing Municipal Commission
Using Hadoop for Webscale Computing. Ajay Anand Yahoo! [email protected] Usenix 2008
Using Hadoop for Webscale Computing Ajay Anand Yahoo! [email protected] Agenda The Problem Solution Approach / Introduction to Hadoop HDFS File System Map Reduce Programming Pig Hadoop implementation
MapReduce with Apache Hadoop Analysing Big Data
MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside [email protected] About Journey Dynamics Founded in 2006 to develop software technology to address the issues
How to properly misuse Hadoop. Marcel Huntemann NERSC tutorial session 2/12/13
How to properly misuse Hadoop Marcel Huntemann NERSC tutorial session 2/12/13 History Created by Doug Cutting (also creator of Apache Lucene). 2002 Origin in Apache Nutch (open source web search engine).
Hadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
Advances in Natural and Applied Sciences
AENSI Journals Advances in Natural and Applied Sciences ISSN:1995-0772 EISSN: 1998-1090 Journal home page: www.aensiweb.com/anas Clustering Algorithm Based On Hadoop for Big Data 1 Jayalatchumy D. and
Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
The Hadoop Distributed File System
The Hadoop Distributed File System Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com Presenter: Alex Hu HDFS
Intro to Map/Reduce a.k.a. Hadoop
Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by
Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] June 3 rd, 2008
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] June 3 rd, 2008 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed
Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques
Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,
Reduction of Data at Namenode in HDFS using harballing Technique
Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu [email protected] [email protected] Abstract HDFS stands for the Hadoop Distributed File System.
Big Application Execution on Cloud using Hadoop Distributed File System
Big Application Execution on Cloud using Hadoop Distributed File System Ashkan Vates*, Upendra, Muwafaq Rahi Ali RPIIT Campus, Bastara Karnal, Haryana, India ---------------------------------------------------------------------***---------------------------------------------------------------------
Hadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
APACHE HADOOP JERRIN JOSEPH CSU ID#2578741
APACHE HADOOP JERRIN JOSEPH CSU ID#2578741 CONTENTS Hadoop Hadoop Distributed File System (HDFS) Hadoop MapReduce Introduction Architecture Operations Conclusion References ABSTRACT Hadoop is an efficient
marlabs driving digital agility WHITEPAPER Big Data and Hadoop
marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil
Mining of Web Server Logs in a Distributed Cluster Using Big Data Technologies
Mining of Web Server Logs in a Distributed Cluster Using Big Data Technologies Savitha K Dept. of Computer Science, Research Scholar PSGR Krishnammal College for Women Coimbatore, India. Vijaya MS Dept.
How To Use Hadoop
Hadoop in Action Justin Quan March 15, 2011 Poll What s to come Overview of Hadoop for the uninitiated How does Hadoop work? How do I use Hadoop? How do I get started? Final Thoughts Key Take Aways Hadoop
Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay
Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability
A Multilevel Secure MapReduce Framework for Cross-Domain Information Sharing in the Cloud
A Multilevel Secure MapReduce Framework for Cross-Domain Information Sharing in the Cloud Thuy D. Nguyen, Cynthia E. Irvine, Jean Khosalim Department of Computer Science Ground System Architectures Workshop
Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected]
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected] Hadoop, Why? Need to process huge datasets on large clusters of computers
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data
