Architecture of Next Generation Apache Hadoop MapReduce Framework
Authors: Arun C. Murthy, Chris Douglas, Mahadev Konar, Owen O'Malley, Sanjay Radia, Sharad Agarwal, Vinod K V

BACKGROUND

The Apache Hadoop Map-Reduce framework is clearly showing its age. In particular, the Map-Reduce JobTracker needs a drastic overhaul to address several technical deficiencies: its memory consumption, its threading model, and its scalability, reliability and performance given observed trends in cluster sizes and workloads. Periodically, we have done running repairs. However, lately these have come at an ever-growing cost, as evinced by the worryingly regular site-up issues we have seen in the past year. The architectural deficiencies, and the corrective measures, are both old and well understood - even as far back as late 2007.

REQUIREMENTS

As we consider ways to improve the Hadoop Map-Reduce framework, it is important to keep in mind the high-level requirements. The most pressing requirements for the next generation of the Map-Reduce framework from the customers of Hadoop, as we peer ahead into 2011 and beyond, are:

- Reliability
- Availability
- Scalability - Clusters of nodes and 200,000 cores
- Backward Compatibility - Ensure customers' Map-Reduce applications can run unchanged in the next version of the framework. Also implies forward compatibility.
- Evolution - Ability for customers to control upgrades to the grid software stack.
- Predictable Latency - A major customer concern.
- Cluster utilization

The second tier of requirements is:

- Support for alternate programming paradigms to Map-Reduce
- Support for limited, short-lived services

Given the above requirements, it is clear that we need a major re-think of the infrastructure used for data processing on the Grid. There is consensus around the fact that the current architecture of the Map-Reduce framework is incapable of meeting our stated goals and that a Two Level Scheduler is the future.

NEXTGEN MAPREDUCE (MRv2/YARN)

The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.

The ResourceManager and the per-node slave, the NodeManager (NM), form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
The ResourceManager has two main components:

- Scheduler (S)
- ApplicationsManager (ASM)

The Scheduler is responsible for allocating resources to the various running applications, subject to familiar constraints of capacities, queues etc. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of status for the application. Also, it offers no guarantees on restarting failed tasks, whether due to application failure or hardware failures. The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so based on the abstract notion of a Resource Container, which incorporates elements such as memory, cpu, disk, network etc.

The Scheduler has a pluggable policy plug-in, which is responsible for partitioning the cluster resources among the various queues, applications etc. The current Map-Reduce schedulers such as the CapacityScheduler and the FairScheduler would be some examples of the plug-in (a minimal interface sketch follows at the end of this section).

The ApplicationsManager is responsible for accepting job-submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.

The NodeManager is the per-machine framework agent that is responsible for launching the applications' containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the Scheduler.

The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring progress.

[Figure: YARN Architecture - ResourceManager (ApplicationsManager, Scheduler), NodeManagers hosting MR Master and MPI Master containers, User clients]
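The pluggable policy plug-in mentioned above can be pictured as a small interface that the ResourceManager calls into. The following is a minimal sketch under assumed type and method names; the actual plug-in API is not specified in this document.

    import java.util.List;

    // Minimal sketch of a pluggable scheduling policy. All names here are
    // hypothetical; the document only states that the policy partitions cluster
    // resources among queues and applications.
    interface SchedulingPolicy {

        // Describes one outstanding ask from an application:
        // <priority, (host | rack | *), memory, #containers>.
        final class Ask {
            final int priority;
            final String location;      // hostname, rack, or "*" for any
            final int memoryMB;         // capability of each container
            final int numContainers;
            Ask(int priority, String location, int memoryMB, int numContainers) {
                this.priority = priority;
                this.location = location;
                this.memoryMB = memoryMB;
                this.numContainers = numContainers;
            }
        }

        // Given a node with free memory and the outstanding asks of all queues,
        // return the asks the policy chooses to satisfy on that node.
        List<Ask> assign(String nodeHost, int freeMemoryMB, List<Ask> outstandingAsks);
    }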
YARN v1.0

This section describes the plan of record (POR) for the first version of YARN.

Requirements for v1

Here are the requirements for YARN-1.0:

- Reliability
- Availability
- Scalability - Clusters of 6000 nodes and 100,000 cores
- Backward Compatibility - Ensure customers' Map-Reduce applications can run. Also implies forward compatibility.
- Evolution - Ability for customers to control upgrades to the grid software stack.
- Cluster utilization

ResourceManager

The ResourceManager is the central authority that arbitrates resources among the various competing applications in the cluster. The ResourceManager has two main components:

- Scheduler
- ApplicationsManager

The Scheduler is responsible for allocating resources to the various running applications, subject to familiar constraints of capacities, queues etc. The Scheduler is purely a scheduler in the sense that it performs no monitoring or tracking of status for the application. Also, it offers no guarantees on restarting failed tasks, whether due to application failure or hardware failures. The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so based on the abstract notion of a Resource Container, which incorporates elements such as memory, cpu, disk, network etc.

The ApplicationsManager is responsible for accepting job-submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.

Resource Model

The Scheduler models only memory in YARN v1.0. Every node in the system is considered to be composed of multiple containers of a minimum size of memory (say 512 MB or 1 GB). The ApplicationMaster can request any container as a multiple of the minimum memory size.

Eventually we want to move to a more generic resource model; however, for YARN v1 we propose a rather straightforward model: the resource model is completely based on memory (RAM), and every node is made up of discrete chunks of memory. Unlike Hadoop Map-Reduce, the cluster is not artificially segregated into map and reduce slots. Every chunk of memory is fungible, and this has huge benefits for cluster utilization - currently a well-known problem with Hadoop Map-Reduce is that jobs are bottlenecked on reduce slots, and the lack of fungible resources is a severe limiting factor.
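To make the memory-only model concrete, the sketch below shows the arithmetic implied by the text: a node is carved into chunks of the minimum container size, and every request is rounded up to a multiple of that minimum. The 512 MB minimum used here is just one of the example values mentioned above.

    // Sketch of the YARN v1 memory-only resource model described above.
    public class MemoryResourceModel {

        static final int MIN_CONTAINER_MB = 512; // example minimum chunk size from the text

        // A node is simply a number of fungible memory chunks.
        static int chunksOnNode(int nodeMemoryMB) {
            return nodeMemoryMB / MIN_CONTAINER_MB;
        }

        // Any request is normalized up to the next multiple of the minimum chunk.
        static int normalizeRequestMB(int requestedMB) {
            int chunks = (requestedMB + MIN_CONTAINER_MB - 1) / MIN_CONTAINER_MB;
            return chunks * MIN_CONTAINER_MB;
        }

        public static void main(String[] args) {
            // A 48 GB node offers 96 chunks of 512 MB; there are no separate map/reduce slots.
            System.out.println("chunks on 48GB node = " + chunksOnNode(48 * 1024));
            // A 1.2 GB ask is rounded up to 1536 MB (three 512 MB chunks).
            System.out.println("normalized 1200MB ask = " + normalizeRequestMB(1200) + "MB");
        }
    }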
In YARN, an application (via the ApplicationMaster) can ask for containers spanning any number of memory chunks, and it can also ask for a varied number of container types. The ApplicationMaster usually asks for specific hosts/racks with specified container capabilities.

Resource Negotiation

The ApplicationMaster can request containers with appropriate resource requirements, inclusive of specific machines. It can also request multiple containers on each machine. All resource requests are subject to capacity constraints for the application, its queue etc.

The ApplicationMaster is responsible for computing the resource requirements of the application, e.g. input-splits for MR applications, and translating them into the protocol understood by the Scheduler. The protocol understood by the Scheduler is <priority, (host, rack, *), memory, #containers>. In the Map-Reduce case, the Map-Reduce ApplicationMaster takes the input-splits and presents to the ResourceManager Scheduler an inverted table keyed on the hosts, with limits on the total containers it needs in its lifetime, subject to change. A typical resource request from the per-application ApplicationMaster (a code sketch of this translation follows at the end of this section):

  Priority   Location             Resource Requirement (memory)   Number of Containers
  1          h1001.company.com    1 GB                            5
  1          h1010.company.com    1 GB                            3
  1          h2031.company.com    1 GB                            6
  1          rack11               1 GB                            8
  1          rack45               1 GB                            6
  1          *                    1 GB                            14
  2          *                    2 GB                            3

The Scheduler will try to match appropriate machines for the application; it can also provide resources on the same rack or a different rack if the specific machine is unavailable. The ApplicationMaster might, at times, due to the vagaries of cluster load, receive resources that are not the most appropriate; it can then reject them by returning them to the Scheduler without being charged for them.

A very important improvement over Hadoop Map-Reduce is that the cluster resources are no longer split into map slots and reduce slots. This has a huge impact on cluster utilization since applications are no longer bottlenecked on slots of either type.

Note: A rogue ApplicationMaster could, in any scenario, request more containers than necessary - the Scheduler uses application-limits, user-limits, queue-limits etc. to protect the cluster from abuse.

Pros and Cons

The main advantage of the proposed model is that it is extremely tight in terms of the amount of state necessary, per application, on the ResourceManager for scheduling and the amount of information passed around between the ApplicationMaster and ResourceManager. This is crucial for scaling the ResourceManager. The amount of information, per application, in this model is always O(cluster size), whereas in the current Hadoop Map-Reduce JobTracker it is O(number of tasks), which could run into hundreds of thousands of tasks.
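The following sketch illustrates, under assumed type names, how an ApplicationMaster might invert host-local input-splits into the <priority, location, memory, #containers> table shown above; it is not the actual MR ApplicationMaster code, and rack-level aggregation is omitted for brevity.

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch (hypothetical types): invert a list of input-splits, each with its
    // preferred hosts, into per-location container counts as in the table above.
    public class SplitToAskTranslator {

        static final class Split {
            final List<String> hosts;   // hosts holding replicas of this split
            Split(List<String> hosts) { this.hosts = hosts; }
        }

        static final class Ask {
            final int priority;
            final String location;      // host, rack, or "*"
            final int memoryMB;
            final int numContainers;
            Ask(int priority, String location, int memoryMB, int numContainers) {
                this.priority = priority; this.location = location;
                this.memoryMB = memoryMB; this.numContainers = numContainers;
            }
            public String toString() {
                return priority + "  " + location + "  " + memoryMB + "MB  x" + numContainers;
            }
        }

        // Build the inverted table: one ask per host, plus a catch-all "*" ask capped
        // at the total number of splits (the lifetime limit mentioned in the text).
        static List<Ask> translate(List<Split> splits, int priority, int memoryMB) {
            Map<String, Integer> perHost = new LinkedHashMap<>();
            for (Split s : splits) {
                for (String h : s.hosts) {
                    perHost.merge(h, 1, Integer::sum);
                }
            }
            List<Ask> asks = new ArrayList<>();
            for (Map.Entry<String, Integer> e : perHost.entrySet()) {
                asks.add(new Ask(priority, e.getKey(), memoryMB, e.getValue()));
            }
            asks.add(new Ask(priority, "*", memoryMB, splits.size()));
            return asks;
        }

        public static void main(String[] args) {
            List<Split> splits = List.of(
                    new Split(List.of("h1001.company.com", "h1010.company.com")),
                    new Split(List.of("h1001.company.com", "h2031.company.com")));
            translate(splits, 1, 1024).forEach(System.out::println);
        }
    }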
The main issue with the proposed resource model is that there is a loss of information in the translation from splits to hosts/racks as described above. This translation is one-way and irrevocable, and the ResourceManager has no concept of the relationships between the resource asks; for example, there is no need for the ask on h46 if h43 and h32 were already allocated to the application.

To get around this disadvantage and to future-proof the Scheduler, we propose a simple extension: the idea is to add a complementary Scheduler component to the ApplicationMaster, which performs the translation without the help or knowledge of the rest of the ApplicationMaster, which continues to think in terms of tasks and splits. This Scheduler component in the ApplicationMaster is paired with the Scheduler, and we could, conceivably, remove this component in the future without affecting any of the implementations of the ApplicationMaster.

Scheduling

The Scheduler aggregates resource requests from all the running applications to build a global plan to allocate resources. The Scheduler then allocates resources based on application-specific constraints such as appropriate machines and global constraints such as the capacities of the application, queue, user etc.

The Scheduler uses the familiar notions of Capacity and Capacity Guarantees as the policy for arbitrating resources among competing applications. The scheduling algorithm is straightforward:

- Pick the most under-served queue in the system
- Pick the highest priority job in that queue
- Serve the job's resource asks

Scheduler API

There is a single API between the YARN Scheduler and the ApplicationMaster:

  Response allocate(List<ResourceRequest> ask, List<Container> release)

The AM asks for specific resources via a list of ResourceRequests (ask) and releases unnecessary Containers which were allocated by the Scheduler. The Response contains a list of newly allocated Containers, the statuses of application-specific Containers that completed since the previous interaction between the AM and the RM, and an indicator to the application of the available headroom of cluster resources. The AM can use the Container statuses to glean information about completed containers and react to failures etc. The headroom can be used by the AM to tune its future requests; e.g. the MR AM can use this information to schedule maps and reduces appropriately so as to avoid deadlocks such as using up all of its headroom for reduces. An illustrative sketch of an ApplicationMaster allocation loop built on this API appears below, after the Resource Monitoring section.

Resource Monitoring

The Scheduler receives periodic information about the resource usage of allocated resources from the NodeManagers. The Scheduler also makes the status of completed Containers available to the appropriate ApplicationMaster.
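A minimal sketch of how an ApplicationMaster might drive the single allocate API described above; the types and the RMClient interface are stand-ins for illustration and are not the actual YARN protocol classes.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of an AM heartbeat loop against the single allocate() API described above.
    // All type names (RMClient, ResourceRequest, Container, ContainerStatus, Response)
    // are hypothetical.
    public class AllocateLoopSketch {

        interface RMClient {
            Response allocate(List<ResourceRequest> ask, List<Container> release);
        }
        static final class ResourceRequest { /* <priority, location, memory, #containers> */ }
        static final class Container { }
        static final class ContainerStatus { }
        static final class Response {
            List<Container> newlyAllocated = new ArrayList<>();
            List<ContainerStatus> completedStatuses = new ArrayList<>();
            int headroomMB;
        }

        void run(RMClient rm) throws InterruptedException {
            boolean done = false;
            while (!done) {
                List<ResourceRequest> ask = computeOutstandingAsks();     // pending tasks -> asks
                List<Container> release = containersToReturn();           // unwanted allocations
                Response response = rm.allocate(ask, release);

                launchTasksOn(response.newlyAllocated);                   // work with NodeManagers
                reactToCompletions(response.completedStatuses);           // reschedule failed tasks
                tuneRequests(response.headroomMB);                        // e.g. hold back reduces

                done = allTasksComplete();
                Thread.sleep(1000);                                       // heartbeat interval
            }
        }

        // Application-specific pieces, left abstract in this sketch.
        List<ResourceRequest> computeOutstandingAsks() { return new ArrayList<>(); }
        List<Container> containersToReturn() { return new ArrayList<>(); }
        void launchTasksOn(List<Container> containers) { }
        void reactToCompletions(List<ContainerStatus> statuses) { }
        void tuneRequests(int headroomMB) { }
        boolean allTasksComplete() { return true; }
    }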
Application Submission

The workflow of application-submission is straightforward:

- The user (usually from the gateway) submits a job to the ApplicationsManager.
  o The user client first gets a new ApplicationID.
  o It packages the application definition and uploads it to HDFS in ${user}/.staging/${application_id}.
  o It submits the application to the ApplicationsManager.
- The ApplicationsManager accepts the application-submission.
- The ApplicationsManager negotiates the first container necessary for the ApplicationMaster from the Scheduler and launches the ApplicationMaster.
- The ApplicationsManager is also responsible for providing details of the running ApplicationMaster to the client for monitoring application progress etc.

Lifecycle of the ApplicationMaster

The ApplicationsManager is responsible for managing the lifecycle of the ApplicationMaster of all applications in the system. As described in the section on Application Submission, the ApplicationsManager is responsible for starting the ApplicationMaster. The ApplicationsManager then monitors the ApplicationMaster, which sends periodic heartbeats, and is responsible for ensuring its aliveness, restarting the ApplicationMaster on failure etc.

ApplicationsManager - Components

As described in the previous sections, the primary responsibility of the ApplicationsManager is to manage the lifecycle of the ApplicationMaster. In order to effect this, the ApplicationsManager has the following components:

- SchedulerNegotiator - Component responsible for negotiating the container for the AM with the Scheduler.
- AMContainerManager - Component responsible for starting and stopping the container of the AM by talking to the appropriate NodeManager.
- AMMonitor - Component responsible for managing the aliveness of the AM and for restarting the AM if necessary.

Availability

The ResourceManager stores its state in ZooKeeper to ensure that it is highly available. It can be quickly restarted by relying on the saved state in ZooKeeper. Please refer to the YARN availability section for more details.

NodeManager

The per-machine NodeManager is responsible for launching containers for applications once the Scheduler allocates them to the application. The NodeManager is also responsible for ensuring that the allocated containers do not exceed their allocated resource slice on the machine; a minimal sketch of this enforcement check appears at the end of this section. The NodeManager is also responsible for setting up the environment of the container for the tasks. This includes binaries, jars etc.

The NodeManager also provides a simple service for managing local storage on the node. Applications can continue to use the local storage even when they do not have an active allocation on the node. For example, Map-Reduce applications use this service to store the temporary map-outputs and shuffle them to the reduce tasks.
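As a concrete illustration of the resource-enforcement responsibility just described, the sketch below checks measured container memory usage against the allocated slice and flags offenders; the class and method names are assumptions, not NodeManager internals.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Sketch of the NodeManager's enforcement duty: compare measured usage of each
    // running container against its allocated memory slice and report the ones to kill.
    // Hypothetical names; real NodeManager monitoring (e.g. process-tree accounting)
    // is more involved.
    public class ContainerLimitCheck {

        static List<String> containersOverLimit(Map<String, Integer> allocatedMB,
                                                 Map<String, Integer> measuredUsageMB) {
            List<String> overLimit = new ArrayList<>();
            for (Map.Entry<String, Integer> e : measuredUsageMB.entrySet()) {
                Integer limit = allocatedMB.get(e.getKey());
                if (limit != null && e.getValue() > limit) {
                    overLimit.add(e.getKey());   // candidate for termination by the NM
                }
            }
            return overLimit;
        }

        public static void main(String[] args) {
            Map<String, Integer> allocated = Map.of("container_1", 1024, "container_2", 2048);
            Map<String, Integer> usage = Map.of("container_1", 1500, "container_2", 1900);
            System.out.println("over limit: " + containersOverLimit(allocated, usage));
        }
    }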
ApplicationMaster

The ApplicationMaster is the per-application master that is responsible for negotiating resources from the Scheduler, running the component tasks in appropriate containers in conjunction with the NodeManagers, and monitoring the tasks. It reacts to container failures by requesting alternate resources from the Scheduler.

The ApplicationMaster is responsible for computing the resource requirements of the application, e.g. input-splits for MR applications, and translating them into the protocol understood by the Scheduler.

The ApplicationMaster is also responsible for recovering the application on its own failure. The ApplicationsManager only restarts the AM on failure; it is the responsibility of the AM to recover the application from saved persistent state. Of course, the AM can just run the application from the very beginning.

Running Map-Reduce Applications in YARN

This section describes the various components used to run Hadoop Map-Reduce jobs via YARN.

Map-Reduce ApplicationMaster

The Map-Reduce ApplicationMaster is the Map-Reduce specific kind of ApplicationMaster, which is responsible for managing MR jobs. The primary responsibility of the MR AM is to acquire appropriate resources from the ResourceManager, schedule tasks on allocated containers, monitor the running tasks and manage the lifecycle of the MR job to completion. Another responsibility of the MR AM is to save the state of the job to persistent storage so that the MR job can be recovered if the AM itself fails.

The proposal is to use an event-based Finite State Machine (FSM) in the MR AM to manage the lifecycle and recovery of the MR tasks and job(s). This allows a natural expression of the state of the MR job as an FSM and allows for maintainability and observability. The following diagrams indicate the FSM for the MR job, task and task-attempt:
[Figure: State Transition Diagram for MR Job]
The lifecycle of a Map-Reduce job running in YARN is as follows:

- The Hadoop MR JobClient submits the job to the YARN ResourceManager (ApplicationsManager) rather than to the Hadoop Map-Reduce JobTracker.
- The YARN ASM negotiates the container for the MR AM with the Scheduler and then launches the MR AM for the job.
- The MR AM starts up and registers with the ASM.
- The Hadoop Map-Reduce JobClient polls the ASM to obtain information about the MR AM and then talks directly to the AM for status, counters etc.
- The MR AM computes input-splits and constructs resource requests for all maps to the YARN Scheduler.
- The MR AM runs the requisite job setup APIs of the Hadoop MR OutputCommitter for the job.
- The MR AM submits the resource requests for the map/reduce tasks to the YARN Scheduler, gets containers from the RM and schedules appropriate tasks on the obtained containers by working with the NodeManager for each container.
- The MR AM monitors the individual tasks to completion and requests alternate resources if any of the tasks fail or stop responding. The MR AM also runs the appropriate task cleanup code of the Hadoop MR OutputCommitter for completed tasks.
- Once all the map and reduce tasks are complete, the MR AM runs the requisite job commit or abort APIs of the Hadoop MR OutputCommitter for the job.
- The MR AM then exits since the job is complete.

The Map-Reduce ApplicationMaster has the following components:

- Event Dispatcher - The central component that acts as the coordinator and generates events for the other components below. (A minimal sketch of such an event-driven state machine follows after this list.)
- ContainerAllocator - The component responsible for translating tasks' resource requirements into resource requests that can be understood by the YARN Scheduler, and for negotiating resources with the RM.
- ClientService - The component responsible for responding to the Hadoop Map-Reduce JobClient with job status, counters, progress etc.
- TaskListener - Receives heartbeats from map/reduce tasks.
- TaskUmbilical - The component responsible for receiving heartbeats and status updates from the map and reduce tasks.
- ContainerLauncher - The component responsible for launching containers by working with the appropriate NodeManagers.
- JobHistoryEventHandler - Writes job history events to HDFS.
- Job - The component responsible for maintaining the state of the job and its component tasks.
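To illustrate the event-based FSM approach proposed above, here is a deliberately tiny sketch of a job state machine driven by events; the states and events are a simplified subset and the class names are hypothetical.

    import java.util.HashMap;
    import java.util.Map;

    // Tiny sketch of an event-driven finite state machine for an MR job's lifecycle.
    // The states/events are a simplified subset of what the MR AM would track.
    public class JobStateMachineSketch {

        enum JobState { NEW, RUNNING, SUCCEEDED, FAILED, KILLED }
        enum JobEvent { INIT, TASKS_COMPLETED, TASK_FAILED_FATALLY, KILL }

        private JobState state = JobState.NEW;
        private final Map<JobState, Map<JobEvent, JobState>> transitions = new HashMap<>();

        JobStateMachineSketch() {
            addTransition(JobState.NEW, JobEvent.INIT, JobState.RUNNING);
            addTransition(JobState.RUNNING, JobEvent.TASKS_COMPLETED, JobState.SUCCEEDED);
            addTransition(JobState.RUNNING, JobEvent.TASK_FAILED_FATALLY, JobState.FAILED);
            addTransition(JobState.NEW, JobEvent.KILL, JobState.KILLED);
            addTransition(JobState.RUNNING, JobEvent.KILL, JobState.KILLED);
        }

        private void addTransition(JobState from, JobEvent on, JobState to) {
            transitions.computeIfAbsent(from, s -> new HashMap<>()).put(on, to);
        }

        // The event dispatcher would call this for every event routed to the Job component.
        void handle(JobEvent event) {
            JobState next = transitions.getOrDefault(state, Map.of()).get(event);
            if (next == null) {
                throw new IllegalStateException("Invalid event " + event + " in state " + state);
            }
            state = next;
        }

        public static void main(String[] args) {
            JobStateMachineSketch job = new JobStateMachineSketch();
            job.handle(JobEvent.INIT);
            job.handle(JobEvent.TASKS_COMPLETED);
            System.out.println("final state: " + job.state);   // SUCCEEDED
        }
    }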
Availability

The ApplicationMaster stores its state, typically in HDFS, to ensure that it is highly available. Please refer to the YARN availability section for more details.
YARN Availability

This section describes the design for effecting high availability for YARN.

Failure Scenarios

This section enumerates the various abnormal and failure scenarios for entities in YARN such as the ResourceManager, ApplicationMasters, Containers etc., and also highlights the entity responsible for handling each situation.

Scenarios and handling entity:

- Container (task) live, but wedged - The AM times out the Container and kills it.
- Container (task) dead - The NM notices that the Container has exited and notifies the RM (Scheduler); the AM picks up the status of the Container during its next interaction with the YARN Scheduler (see the section on the Scheduler API).
- NM live, but wedged - The RM notices that the NM has timed out and informs all AMs that had live Containers on that node during their next interaction.
- NM dead - The RM notices that the NM has timed out and informs all AMs that had live Containers on that node during their next interaction.
- AM live, but wedged - The ApplicationsManager times out the AM, frees its Container with the RM and restarts the AM by negotiating another Container.
- AM dead - The ApplicationsManager times out the AM, frees its Container with the RM and restarts the AM by negotiating another Container.
- RM dead - Please see the section on RM restart.

Availability for Map-Reduce Applications and the ApplicationMaster

This section describes failover for the MR ApplicationMaster and how the MR job is recovered. As described in the section on the ApplicationsManager (ASM) in the YARN ResourceManager, the ASM is responsible for monitoring and providing high availability for the MR ApplicationMaster - or any ApplicationMaster for that matter. The MR ApplicationMaster is responsible for recovering the specific MR job by itself; the ASM only restarts the MR AM.

We have multiple options for recovering the MR job when the MR AM is restarted:

1. Restart the job from scratch.
2. Restart only the incomplete map and reduce tasks.
3. Point the running map and reduce tasks to the restarted MR AM and run to completion.

Restarting the MR job from scratch is too expensive to do just for the failure of a single container, i.e. the one on which the AM is running, and is thus an infeasible option. Restarting only incomplete map and reduce tasks is attractive since it ensures we do not re-run the completed tasks; however, it still hurts reduces since they might have (nearly) completed the shuffle and are penalized by being restarted. Changing the running map and reduce tasks to point to the new AM is the most attractive option from the perspective of the job itself, but technically it is a much more challenging task.

Given the above tradeoffs, we are proposing to go with option 2 for YARN-v1.0. It is a simpler implementation, but provides significant benefits. It is very likely that we will implement option 3 in a future version.
The proposal to implement option 2 is to have the MR AM write out a transaction log to HDFS which tracks all of the completed tasks. On restart we can replay the task-completion events from the transaction log and use the state machine in the MR AM to track the completed tasks. It is then fairly straightforward to get the MR AM to run only the remaining tasks.
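A minimal sketch of the replay idea, under assumed names: completed-task IDs are appended to a log as they finish, and on AM restart the log is replayed to rebuild the set of tasks that no longer need to run. A real MR AM would append to an HDFS file via the FileSystem API; plain in-memory streams are used here to keep the sketch self-contained, and the record format is hypothetical.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.io.Reader;
    import java.io.StringReader;
    import java.io.StringWriter;
    import java.io.Writer;
    import java.util.HashSet;
    import java.util.Set;

    // Sketch of option 2: log task completions, replay them on AM restart so only
    // the remaining tasks are re-run.
    public class TaskCompletionLog {

        // Append one completion record per finished task.
        static void logCompletion(Writer log, String taskId) throws IOException {
            log.write("TASK_FINISHED " + taskId + "\n");
            log.flush();
        }

        // Replay the log after a restart and rebuild the set of completed tasks.
        static Set<String> replay(Reader log) throws IOException {
            Set<String> completed = new HashSet<>();
            BufferedReader reader = new BufferedReader(log);
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith("TASK_FINISHED ")) {
                    completed.add(line.substring("TASK_FINISHED ".length()));
                }
            }
            return completed;
        }

        public static void main(String[] args) throws IOException {
            StringWriter buffer = new StringWriter();
            PrintWriter log = new PrintWriter(buffer);
            logCompletion(log, "map_000001");
            logCompletion(log, "map_000002");
            // Simulated AM restart: replay the log; any task not in the set is re-run.
            Set<String> done = replay(new StringReader(buffer.toString()));
            System.out.println("already completed: " + done);
        }
    }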
Availability for the YARN ResourceManager

The primary goal here is to be able to quickly restart the ResourceManager and all the running/pending applications in the face of catastrophic failures. Availability for YARN thus has two facets:

- Availability for the ResourceManager.
- Availability for the individual applications and their ApplicationMasters.

The design for high availability for YARN is conceptually very simple: the ResourceManager is responsible for recovering its own state, such as running AMs, allocated resources and NMs. The AM itself is responsible for recovering the state of its own application. See the section on Availability for the MR ApplicationMaster for a detailed discussion on recovering MR jobs.

This section describes the design for ensuring that the YARN ResourceManager is highly available. The YARN ResourceManager has the following state that needs to be available in order to provide high availability:

- Queue definitions - These are already present in persistent storage such as a file-system or LDAP.
- Definitions of running and pending applications.
- Resource allocations to the various applications and Queues.
- Containers already running and allocated to applications.
- Resource allocations on individual NodeManagers.

The proposal is to use ZooKeeper to store the state of the YARN ResourceManager that isn't already present in persistent storage, i.e. the state of AMs and the resource allocations to individual applications. The resource allocations to the various NMs and Queues can be quickly rebuilt by scanning the allocations to applications. ZooKeeper allows for quick failure recovery for the RM, and failover via master election.

A simple scheme for this is to create ZooKeeper ephemeral nodes for each of the NodeManagers and one node per allocated container with the following minimal information:

- ApplicationID
- Container capability

An illustrative example:

  + app_1
      + <container_for_am>
      + <container_1, host_0, 2G>
      + <container_11, host_10, 1G>
      + <container_34, host_9, 2G>
  + app_2
      + <container_for_am>
      + <container_5, host_87, 1G>
      + <container_67, host_14, 4G>
      + <container_87, host_9, 2G>

With this information, the ResourceManager, on a reboot, can quickly rebuild its state via ZooKeeper. ZooKeeper is the source of truth for all assignments.

Here is the flow of container assignment:

- The ResourceManager gets a request for a container from an ApplicationMaster.
- The ResourceManager creates a znode in ZooKeeper under /app$i with the containerid, node id and resource capability information.
- The ResourceManager responds to the ApplicationMaster's request with the resource capability.

On release of the containers, these are the steps followed:

- The ApplicationMaster sends a stopcontainer() request to the NodeManager to stop the container; on a failure it retries.
- The ApplicationMaster then sends a deallocatecontainer(containerid) request to the ResourceManager.
- The ResourceManager then removes the container znode from ZooKeeper.

To be able to keep all the container allocations in sync between the ApplicationMaster, ResourceManager and NodeManager, we need a few APIs:

- The ApplicationMaster uses getallocatedcontainers(appid) to keep itself in sync with the ResourceManager on the set of containers allocated to it.
- The NodeManager keeps itself in sync via heartbeats to the ResourceManager, which responds with the containers that the NodeManager should clean up and remove.

On restart the ResourceManager goes through all the application definitions in HDFS/ZK, i.e. /systemdir, and assumes all the RUNNING ApplicationMasters are alive. The ApplicationsManager is responsible for restarting any of the failed AMs that do not update the ApplicationsManager, as described previously.

The ResourceManager on startup publishes its host and port in ZooKeeper, under ${yarn-root}/yarn/rm:

  + yarn
      + <rm>

This ZooKeeper node is an ephemeral node that gets deleted automatically if the ResourceManager dies. A backup ResourceManager can take over in that case via a notification from ZooKeeper. Since the ResourceManager state is in ZooKeeper, the backup ResourceManager can start up. All the NodeManagers watch ${yarn-root}/yarn/rm to get notified in case the znode gets deleted or updated. The ApplicationMaster also watches for notifications on ${yarn-root}/yarn/rm. On deletion of the znode, NodeManagers and ApplicationMasters get notified of the change; the change in ResourceManager is thus communicated to the YARN components via ZooKeeper. A sketch of this discovery scheme using the ZooKeeper client API follows below.
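The discovery scheme above maps naturally onto the standard ZooKeeper client API. The sketch below shows a ResourceManager publishing its address as an ephemeral znode and a NodeManager setting a watch on it; the znode path follows the ${yarn-root}/yarn/rm layout above (with yarn-root assumed to be the ZooKeeper root here), and the connection string and host:port values are placeholders.

    import java.nio.charset.StandardCharsets;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // Sketch of RM discovery/failover via ZooKeeper as described above.
    // "localhost:2181" and the /yarn/rm path are placeholders for this sketch.
    public class RmDiscoverySketch {

        static final String RM_PATH = "/yarn/rm";

        // Called by the active ResourceManager on startup: publish host:port as an
        // ephemeral znode that disappears automatically if the RM's session dies.
        static void publishRm(ZooKeeper zk, String hostPort)
                throws KeeperException, InterruptedException {
            try {
                zk.create("/yarn", new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } catch (KeeperException.NodeExistsException ignored) {
                // parent already exists
            }
            zk.create(RM_PATH, hostPort.getBytes(StandardCharsets.UTF_8),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        }

        // Called by NodeManagers and ApplicationMasters: watch the znode so they are
        // notified when the RM dies (node deleted) and a backup RM takes over.
        static void watchRm(ZooKeeper zk) throws KeeperException, InterruptedException {
            Watcher watcher = (WatchedEvent event) -> {
                if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                    System.out.println("RM znode deleted: wait for a backup RM to take over");
                }
            };
            zk.exists(RM_PATH, watcher);   // sets a one-time watch on the znode
        }

        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
            publishRm(zk, "rm-host.company.com:8030");
            watchRm(zk);
            Thread.sleep(5000);            // keep the session alive briefly for the demo
            zk.close();
        }
    }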
YARN Security

This section describes security aspects of YARN. A hard requirement is for YARN to be at least as secure as the current Hadoop Map-Reduce framework. This implies several aspects:

- Kerberos-based strong authentication across the board.
- YARN daemons such as the RM and NM should run as secure, non-root Unix users.
- Individual applications run as the actual user who submitted the application, and they have secure access to data on HDFS etc.
- Support for super-users for frameworks such as Oozie.
- Comparable authorization for queues, admin tools etc.

The proposal is to use the existing security mechanisms and components from Hadoop to effect security for YARN. The only additional requirement for YARN is that NodeManagers can securely authenticate incoming requests to allocate and start containers, by ensuring that the Containers were actually allocated by the ResourceManager. The proposal is to use a shared secret key between the RM and the NM. The RM uses the key to sign an authorized ContainerToken containing the ContainerID, ApplicationID and allocated resource capability. The NM can use the shared key to verify the ContainerToken and hence the request. (A minimal signing/verification sketch follows at the end of this section.)

Security for ZooKeeper

ZooKeeper has pluggable security that can use either YCA or shared-secret authentication. It has ACLs for authorization. The RM node in ZooKeeper used for discovering the RM, and the application container assignments mentioned in the section on YARN RM Availability, are writable by the ResourceManager and only readable by others. YCA is used as the authentication system. Note that since ZooKeeper doesn't yet support Kerberos, the individual ApplicationMasters and tasks are not allowed to write to ZooKeeper; all writes are done by the RM itself.
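The shared-secret scheme described above can be illustrated with a standard HMAC, which is one plausible realization; the document does not prescribe the exact MAC algorithm or token wire format, so both are assumptions here.

    import java.nio.charset.StandardCharsets;
    import java.security.GeneralSecurityException;
    import java.security.MessageDigest;
    import java.util.Base64;
    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;

    // Sketch of ContainerToken signing/verification with a secret shared between the
    // RM and NMs. The token payload format and the choice of HmacSHA256 are
    // illustrative assumptions, not the specified wire format.
    public class ContainerTokenSketch {

        static byte[] sign(byte[] sharedSecret, String containerId, String applicationId,
                           int memoryMB) throws GeneralSecurityException {
            String payload = containerId + "|" + applicationId + "|" + memoryMB;
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(sharedSecret, "HmacSHA256"));
            return mac.doFinal(payload.getBytes(StandardCharsets.UTF_8));
        }

        // The NM recomputes the signature with the shared secret and compares in
        // constant time; a mismatch means the container was not allocated by the RM.
        static boolean verify(byte[] sharedSecret, String containerId, String applicationId,
                              int memoryMB, byte[] signature) throws GeneralSecurityException {
            byte[] expected = sign(sharedSecret, containerId, applicationId, memoryMB);
            return MessageDigest.isEqual(expected, signature);
        }

        public static void main(String[] args) throws GeneralSecurityException {
            byte[] secret = "rm-nm-shared-secret".getBytes(StandardCharsets.UTF_8);
            byte[] token = sign(secret, "container_42", "app_7", 1024);
            System.out.println("token: " + Base64.getEncoder().encodeToString(token));
            System.out.println("valid: " + verify(secret, "container_42", "app_7", 1024, token));
        }
    }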