The Art of Scheduling for Big Data Science


Chapter 6
Florin Pop and Valentin Cristea

Contents
6.1 Introduction
6.2 Requirements for Scheduling in Big Data Platforms
6.3 Scheduling Models and Algorithms
6.4 Data Transfer Scheduling
6.5 Scheduling Policies
6.6 Optimization Techniques for Scheduling
6.7 Case Study on Hadoop and Big Data Applications
6.8 Conclusions
References

Abstract

Many applications generate big data: social networking and social influence programs, cloud applications, public websites, scientific experiments and simulations, data warehouses, monitoring platforms, and e-government services. Data grow rapidly, since applications produce continuously increasing volumes of both unstructured and structured data. The impact on data processing, transfer, and storage is that approaches and solutions must be reevaluated to better answer user needs. In this context, scheduling models and algorithms have an important role. A large variety of solutions for specific applications and platforms exist, so a thorough and systematic analysis of the scheduling models, methods, and algorithms used in big data processing and storage environments is of high importance. This chapter presents the best of the existing solutions and creates an overview of current and near-future trends. It highlights, from a research perspective, the performance and limitations of existing solutions and offers scientists from academia and designers from industry an overview of the current situation in the area of scheduling and resource management related to big data processing.

6.1 Introduction

The rapid growth of data volume requires the processing of petabytes of data per day. Cisco estimates that mobile data traffic alone will reach exabytes of data per month. The produced data is subject to different kinds of processing, from real-time processing for context-aware applications to data mining analysis for the extraction of valuable information. The multi-V (volume, velocity, variety, veracity, and value) model is frequently used to characterize big data processing needs. Volume defines the amount of data, velocity means the rate of data production and processing, variety refers to data types, veracity describes how far data can be trusted as a function of its source, and value refers to the importance of data relative to a particular context [1].

Scheduling plays an important role in big data optimization, especially in reducing the time for processing. The main goal of scheduling in big data platforms is to plan the processing and completion of as many tasks as possible by handling and moving data in an efficient way with a minimum number of migrations. Various mechanisms are used for resource allocation in cloud, high-performance computing (HPC), grid, and peer-to-peer systems, which have different architectural characteristics. For example, in HPC, the cluster used for data processing is homogeneous and can handle many tasks in parallel by applying predefined rules. On the other hand, cloud systems are heterogeneous and widely distributed; task management and execution are aware of communication rules and offer the possibility to create particular rules for the scheduling mechanism. The scheduling methods currently used in big data processing frameworks are as follows: first in, first out (FIFO); fair scheduling; capacity scheduling; Longest Approximate Time to End (LATE) scheduling; deadline constraint scheduling; and adaptive scheduling [2,3]. Finding the best method for a particular processing request remains a significant challenge.

We can see big data processing as a big batch process that runs on an HPC cluster by splitting a job into smaller tasks and distributing the work to the cluster nodes. New types of applications, like social networking, graph analytics, and complex business work flows, require data transfer and data storage. The processing models must be aware of data locality when deciding whether to move data to the computing nodes or to create new computing nodes near the data locations. Workload optimization strategies are the key to guaranteeing profit to resource providers by using resources at their maximum capacity. For applications that are both computation and data intensive, the processing models combine different techniques like in-memory, CPU, and/or graphics processing unit (GPU) big data processing. Moreover, big data platforms face the problem of environment heterogeneity due to the variety of distributed system types, like cluster, grid, cloud, and peer-to-peer, which offer support for advanced processing.
At the confluence of big data with widely distributed platforms, scheduling solutions combine methods designed for efficient problem solving and parallel data transfers (which hide transfer latency) with techniques for failure management in highly heterogeneous computing systems. In addition, handling heterogeneous data sets becomes a challenge for interoperability among various software systems. This chapter highlights the specific requirements of scheduling in big data platforms, scheduling models and algorithms, data transfer scheduling procedures, policies used in different computing models, and optimization techniques. The chapter concludes with a case study on Hadoop and big data, and a description of the new trend of integrating NewSQL databases (modern relational engines designed to scale out like NoSQL stores while keeping SQL semantics) with distributed file systems and computing environments.

6.2 Requirements for Scheduling in Big Data Platforms

The requirements of traditional scheduling models came from applications, databases, and storage resources, which have grown exponentially over the years. As a result, the cost and complexity of adapting traditional scheduling models to big data platforms have increased, prompting changes in the way data are stored, analyzed, and accessed. The traditional model is being expanded with new building blocks: new information processing frameworks built to meet big data's requirements. The integration of applications and services in big data platforms requires a cluster resource management layer as a middleware component, but also particular scheduling and execution engines specific to different models: batch tasks, data flow, NewSQL tasks, and so forth (Figure 6.1). So, we have scheduling methods and algorithms in both the cluster and execution engine layers.

Figure 6.1 Integration of applications and services in big data platforms: applications and services that generate and process big data run on execution engines (Tez, G-Hadoop, Torque) supporting batch tasks, data flow (Pig), NewSQL tasks (Hive), real-time streams, and other models, on top of cluster resource management (YARN) and heterogeneous systems with reliable storage (HDFS).

The general requirements for scheduling in big data platforms define the functional and nonfunctional specifications for a scheduling service. They are as follows:

Scalability and elasticity: A scheduling algorithm must take into consideration the peta-scale data volumes and the hundreds of thousands of processors that can be involved in processing tasks. The scheduler must be aware of execution environment changes and be able to adapt to workload changes by provisioning or deprovisioning resources.

General purpose: A scheduling approach should make few assumptions about, and place few restrictions on, the types of applications that can be executed. Interactive jobs, distributed and parallel applications, as well as noninteractive batch jobs should all be supported with high performance. For example, a noninteractive batch job requiring high throughput may prefer time-sharing scheduling; similarly, a real-time job requiring a short response time prefers space-sharing scheduling.

Dynamicity: The scheduling algorithm should exploit the full extent of available resources and may change its behavior to cope, for example, with many computing tasks. The scheduler needs to continuously adapt to resource availability changes, paying special attention to cloud systems and HPC clusters (data centers) as reliable solutions for big data [4].

Transparency: The host(s) on which the execution is performed should not affect task behavior and results. From the user perspective, there should be no difference between local and remote execution, and the user should not be aware of system changes or data movements for big data processing.

Fairness: Resources are shared among users in a fair way, to guarantee that each user obtains resources on demand. In a pay-per-use model in the cloud, a cluster of resources can be allocated dynamically or can be reserved in advance.

Time efficiency: The scheduler should improve the performance of scheduled jobs as much as possible, using different heuristics and state estimation suitable for specific task models. Multitasking systems can process multiple data sets for multiple users at the same time by mapping the tasks to resources in a way that optimizes their use.

Cost (budget) efficiency: The scheduler should lower the total cost of execution by minimizing the total number of resources used while respecting the total money budget. This requires efficient resource usage, which can be achieved by optimizing the execution of mixed tasks using a high-performance queuing system and by reducing the computation and communication overhead.

Load balancing: Load balancing is used as a scheduling method to share the load among all available resources. This is a challenging requirement when some resources do not match task properties. There are classical approaches like round-robin scheduling (a minimal sketch follows this list), but new approaches that cope with large-scale and heterogeneous systems have also been proposed: least connection, slow start time, or agent-based adaptive balancing.

Support of data variety and different processing models: Achieved by handling multiple concurrent input streams, structured and unstructured content, multimedia content, and advanced analytics. Classifying tasks into small or large, high or low priority, and periodic or sporadic points to a specific scheduling technique.

Integration with shared distributed middleware: The scheduler must consider various systems and middleware frameworks, like sensor integration following the Internet of Things paradigm, or mobile cloud solutions that use offloading techniques to save energy. The integration considers data access and consumption and supports various sets of workloads produced by services and applications.
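To make the round-robin policy above concrete, here is a minimal, illustrative sketch (not from the chapter); the generic resource type and the node names in the usage comment are assumptions of the example.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Minimal round-robin load balancer: work is spread evenly across the
// available resources in circular order, regardless of their current load.
class RoundRobinBalancer<R> {
    private final List<R> resources;   // e.g., cluster node identifiers
    private final AtomicLong counter = new AtomicLong();

    RoundRobinBalancer(List<R> resources) {
        if (resources.isEmpty()) throw new IllegalArgumentException("no resources");
        this.resources = List.copyOf(resources);
    }

    // Pick the next resource in circular order; safe for concurrent callers.
    R next() {
        return resources.get(
            (int) Math.floorMod(counter.getAndIncrement(), (long) resources.size()));
    }
}

// Usage: new RoundRobinBalancer<>(List.of("node1", "node2", "node3")) yields
// node1, node2, node3, node1, ... on successive next() calls. Heterogeneous
// clusters usually need load-aware variants such as least connection.
```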

The scheduling models and algorithms are designed and implemented by enterprise tools with integration support for applications, databases, and big data processing environments. The workload management tools must manage the input and output data at every data acquisition point, and any data set must be handled by specific solutions. The tools used for resource management must meet the following requirements:

Capacity awareness: Estimate the percentage of resource allocation for a workload and understand the volume and the velocity of the data that are produced and processed.

Real-time, latency-free delivery and error-free analytics: Support service-level agreements, with continuous business process flow and with an integrated, labor-intensive, and fault-tolerant big data environment.

API integration: Integrate with different operating systems through an application programming interface (API), support various execution virtual machines (VMs), and offer wide visibility (end-to-end access through standard protocols like HTTP, FTP, and remote login, and also through a single and simple management console) [5].

6.3 Scheduling Models and Algorithms

A scheduling model consists of a scheduling policy, an algorithm, a programming model, and a performance analysis model. The design of a scheduler that follows a model should specify the architecture, the communication model between the entities involved in scheduling, the process type (static or dynamic), the objective function, and the state estimation [6]. It is important that all applications are completed as quickly as possible, that all applications receive the necessary proportion of the resources, and that those with a close deadline have priority over other applications that could be finished later. The scheduling model can also be seen from another point of view, namely, the price: it is important for cloud service providers to obtain a higher profit, but also for user applications to be executed so that the cost remains under the available budget [7].

Several approaches to the scheduling problem have been considered over time. These approaches consider different scenarios, which take into account the application types, the execution platform, the types of algorithms used, and the various constraints that might be imposed. The existing schedulers suitable for large environments, and also for big data platforms, are as follows:

FIFO (oldest job first): Jobs are ordered according to their arrival time. The order can be affected by job priority.

Fair scheduling: Each job gets an equal share of the available resources.

Capacity scheduling: Provides a minimum capacity guarantee and shares excess capacity; it also considers the priorities of jobs in a queue.

Adaptive scheduling: Balances between resource utilization and job parallelism in the cluster and adapts to specific dynamic changes of the context.

Data locality-aware scheduling: Minimizes data movement.

Provisioning-aware scheduling: Provisions VMs from larger physical clusters; virtual resource migration is considered in this model.

On-demand scheduling: Uses a batch scheduler as a resource manager for node allocation based on the needs of the virtual cluster.

The use of these models takes into account different specific situations: for example, FIFO is recommended when the number of tasks is smaller than the cluster size, while fair scheduling is the best choice when the cluster size is smaller than the number of tasks. Capacity scheduling is used for multiple tasks and for priorities specified as response times.

In Ref. [8], a solution for scheduling bags of tasks is presented. Users receive guidance regarding the plausible response time and are able to choose the way the application is executed: with more money and faster, or with less money but slower. An important ingredient in this method is the phase of profiling the tasks in the actual bag. The basic scheduling is realized with a bounded knapsack algorithm. Ref. [9] presents the idea of scheduling based on scaling up and down the number of machines in a cloud system. This solution allows users to choose among several types of VM instances while scheduling each instance's start-up and shutdown to respect the deadlines and ensure a reduced cost.

A scheduling solution based on genetic algorithms is described in Ref. [10]. The scheduling is done on grid systems. Grids are different from cloud systems, but the principle used by the authors in assigning tasks to resources is the same. The scheduling solution works with applications that can be modeled as directed acyclic graphs (DAGs). The idea is to minimize the duration of the application execution while the budget is respected. This approach takes into account the system's heterogeneity. Ref. [11] presents a scheduling model for instance-intensive work flows in the cloud, which takes into consideration both budget and deadline constraints. The level of user interaction is very high, the user being able to dynamically change the cost and deadline requirements and provide input to the scheduler during the work flow execution. Interventions can be made at every schedule round. This is an interesting model because the user can choose to pay more or less depending on the scenario. The main characteristic is that the user has more decision power over the work flow execution. In addition, the cloud estimates the time and cost during the work flow execution to provide hints to users and dynamically reschedule the workload.

The Apache Hadoop framework is a software library that allows the processing of large data sets distributed across clusters of computers using a simple programming model [3]. The framework facilitates the execution of MapReduce applications. Usually, a cluster on which the Hadoop system is installed has two masters and several slave components (Figure 6.2) [12]. One of the masters is the JobTracker, which deals with the processing jobs coming from users and sends them to the scheduler in use at that moment. The other master is the NameNode, which manages the file system namespace and the user access control. The other machines act as slaves. A TaskTracker represents a worker machine in Hadoop, while a DataNode handles the operations of the Hadoop Distributed File System (HDFS), which deals with data replication on all the slaves in the system. This is the way input data get to the map and reduce tasks. Every time an operation occurs on one of the slaves, the results of the operation are immediately propagated into the system [13].

Figure 6.2 Hadoop general architecture: a master running the JobTracker (computation layer, MapReduce) and the NameNode (data layer, HDFS), and slaves each running a TaskTracker and a DataNode.

The Hadoop framework provides the capacity and fair scheduling algorithms. The Capacity Scheduler has a number of queues. Each of these queues is assigned a part of the system resources and has specific numbers of map and reduce slots, which are set through the configuration files. The queues receive user requests and order them by the associated priorities. There is also a per-user limitation for each queue, which prevents a user from seizing the resources of a whole queue. The Fair Scheduler has pools in which job requests are placed for selection. Each user is assigned to a pool. Also, each pool is assigned a set of shares and uses them to manage the resources allocated to jobs, so that each user receives an equal share, no matter the number of jobs he/she submits. If the system is not loaded, the remaining shares are distributed to the existing jobs. The Fair Scheduler was proposed in Ref. [14], where the authors demonstrated its special qualities regarding reduced response time and high throughput.

There are several extensions to the scheduling models for Hadoop. In Ref. [15], a new scheduling algorithm is presented, LATE, which is highly robust to heterogeneity without using specific information about nodes. The solution solves the problems posed by heterogeneity in virtualized data centers and ensures good performance and flexibility for speculative tasks. In Ref. [16], a scheduler is presented that meets deadlines. This scheduler has a preliminary phase for estimating the possibility of achieving the deadline claimed by the user, as a function of several parameters: the runtimes of map and reduce tasks, the input data size, the data distribution, and so forth. Jobs are scheduled only if their deadlines can be met. In comparison with the schedulers mentioned in this section, the genetic scheduler proposed in Ref. [7] approaches the deadline constraints but also takes into account the environment heterogeneity. In addition, it uses speculative techniques in order to increase the scheduler's power. The genetic scheduler has an estimation phase, in which the data processing speed for each application is measured. The scheduler ensures that, once an application's execution has started, that application will end successfully under normal conditions. The Hadoop On Demand (HOD) [3] virtual cluster uses the Torque resource manager for node allocation and automatically prepares the configuration files. Then it initializes the system based on the nodes within the virtual cluster. HOD can be used in a relatively independent way.

To support multiuser situations [14,17], the Hadoop framework incorporates several components that are suitable for big data processing (Figure 6.3), since they offer high scalability over large volumes of data and support access to widely distributed data. Here is a very short description of these components (a minimal MapReduce example is given after Figure 6.3):

Hadoop Common consists of common utilities that support the other Hadoop modules and any new extension.

HDFS provides high-throughput access to application data.

Hadoop YARN (Yet Another Resource Negotiator) is a framework for job scheduling and cluster resource management that can be extended across multiple platforms.

Hadoop MapReduce is a YARN-based system for parallel processing of large data sets.

Facebook's solution, Corona [18], extends and improves the Hadoop model, offering better scalability and cluster utilization, lower latency for small jobs, the ability to upgrade without disruption, and scheduling based on actual task resource requirements rather than on a count of map and reduce tasks. Corona was designed to answer the most important Facebook challenges: unique scalability (the largest cluster has more than 100 PB of data) and processing needs (crunching more than 60,000 Hive queries a day). The data warehouse inside Facebook grew by 2,500 times between 2008 and 2012, and it is expected to keep growing at a similar pace.

Figure 6.3 Hadoop processing and big data platforms: a cluster manager and job trackers in the scheduling and execution engine layer (Hadoop, Corona), task trackers beneath them, and data blocks in the data layer (big data platforms).
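For concreteness, below is the canonical word-count job expressed against Hadoop's org.apache.hadoop.mapreduce API. This is a minimal sketch: the driver class that configures input/output paths and submits the job to the cluster (where the schedulers discussed above decide its placement) is omitted.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every token of the input split assigned
// to this map task by the scheduler.
class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: sum the partial counts produced for each word.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum));
    }
}
```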

Mesos [19] uses a model of resource isolation and sharing across distributed applications, using Hadoop, Message Passing Interface (MPI), Spark, and Aurora in a dynamic way (Figure 6.4). A ZooKeeper quorum is used for master replication to ensure fault tolerance. By integrating multiple slave executors, Mesos offers support for multiresource scheduling (memory and CPU aware), using isolation between tasks based on Linux Containers. The expected scalability goes beyond 10,000 nodes.

Figure 6.4 Mesos architecture: framework schedulers (Hadoop, MPI) talk to a Mesos master, replicated through a ZooKeeper quorum with standby masters, while Mesos slaves run Hadoop and MPI executors that execute the tasks.

YARN [20] splits the two major functionalities of the Hadoop JobTracker into two separate components: resource management and job scheduling/monitoring (application management). The ResourceManager is integrated in the data-computation framework and coordinates the resources for all the jobs alive in a big data platform. The ResourceManager has a pure scheduler, that is, it does not monitor or track application status and does not offer guarantees about restarting tasks that fail due to either application or hardware failures. It offers only the matching of application jobs to resources.

New processing models based on bio-inspired techniques are used for fault-tolerant and self-adaptable handling of data-intensive and computation-intensive applications. These evolutionary techniques approach learning based on history, with the main aim of finding near-optimal solutions for problems with multiconstraint and multicriteria optimizations. For example, adaptive scheduling is needed for dynamic heterogeneous systems, where the scheduling strategy can change according to the available resources and their capacity.

6.4 Data Transfer Scheduling

In many cases, depending on the application architecture, data must be transported to the place where tasks will be executed [21]. Consequently, scheduling schemes should consider not only the task execution time but also the data transfer time when looking for a more convenient mapping of tasks. Only a handful of current research efforts consider the simultaneous optimization of computation and data transfer scheduling. The general data scheduling techniques are Least Frequently Used (LFU), Least Recently Used (LRU), and economical models (a minimal LRU sketch follows). Handling multiple file transfer requests is done in parallel with maximizing the bandwidth, for which the file transfer rates between two end points are calculated while considering the heterogeneity of server resources.
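A minimal illustration of the LRU policy named above, using the access-order mode of java.util.LinkedHashMap; the capacity parameter and the suggested use for transfer metadata are assumptions of the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU replacement policy: the least recently accessed entry is
// evicted once the configured capacity is exceeded.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true);    // true = iterate in access order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;  // evict the least recently used entry
    }
}

// Usage: an LruCache<String, byte[]> of capacity N keeps the N most recently
// accessed data blocks; an LFU policy would instead track access counts.
```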

The big data input/output (I/O) scheduler in Ref. [22] is a solution for applications that compete for I/O resources in a shared MapReduce-type big data system. The solution, named Interposed Big-data I/O Scheduler (IBIS), has as its main aims to differentiate the I/Os among competing applications on separate data nodes and to perform scheduling according to the applications' bandwidth demands. IBIS acts as a metascheduler and efficiently coordinates the distributed I/O schedulers across data nodes in order to allocate the global storage.

In the context of big data transfers, a few big questions need to be answered in order to have an efficient cloud environment; more specifically, when and where to migrate. An efficient data migration method focusing on the minimum global time is presented in Ref. [23]. The method, however, does not try to minimize the duration of individual migrations. In Ref. [24], two migration models are described: offline and online. The offline scheduling model has as its main target the minimization of the maximum bandwidth usage on all links for all time slots of a planning period. In the online scheduling model, the scheduler has to make fast decisions, and the migrations are revealed to the migration scheduler in an a priori undefined sequence. Jung et al. [25] treat data mining parallelization by considering the data transfer delay between two computing nodes. The delay is estimated by using an auto-regressive moving average filter. In Ref. [26], the impact of two resource reservation methods is tested: reservation in source machines and reservation in target machines. Experimental results proved that resource reservation in target machines is needed in order to avoid migration failure. The performance overheads of live migration are affected by memory size, CPU, and workload types.

The model proposed in Ref. [27] uses a greedy scheduling algorithm for data transfers across different cloud data centers (a simplified sketch follows). The algorithm takes the transfer requests in first-come, first-served order and sets a time interval in which each can be sent. This interval is reserved on all the connections the packet has to go through (in this case, there is a maximum of three hops to the destination, because of the full mesh infrastructure). This is done until there are no more transfers to schedule, taking into account the previously reserved time frames for each individual connection. The connections are treated individually to avoid bottlenecks. For instance, the connections between individual clouds need to transfer more messages than the connections inside a cloud. This way, even if the connection from a physical machine to a router is unused, the connection between the routers can be oversaturated. There is no point in scheduling the migration until the transfers that are currently running between the routers end, even if the connection to the router is unused.
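The following is a simplified sketch of such a greedy reservation scheme, written for this overview rather than taken from Ref. [27]: it handles requests strictly in arrival order and appends each reservation at the end of every link's timeline, whereas the original reserves per-connection time intervals and can exploit gaps. Link identifiers and time units are illustrative.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified first-come, first-served transfer scheduler: each transfer
// reserves the same time interval on every link of its route (a few hops).
class GreedyTransferScheduler {
    // link identifier -> time at which the link becomes free again
    private final Map<String, Long> linkFreeAt = new HashMap<>();

    // Returns the start time granted to a transfer of the given duration.
    long schedule(List<String> route, long duration) {
        long start = 0;
        for (String link : route)    // earliest time at which ALL links are free
            start = Math.max(start, linkFreeAt.getOrDefault(link, 0L));
        for (String link : route)    // reserve [start, start + duration) on each link
            linkFreeAt.put(link, start + duration);
        return start;
    }
}

// Usage: requests are submitted strictly in arrival order, e.g.
//   long t = scheduler.schedule(List.of("vmA-routerA", "routerA-routerB"), 120);
```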
6.5 Scheduling Policies

Scheduling policies are used to determine the relative ordering of requests. Large distributed systems with different administrative domains will most likely have different resource utilization policies. A policy can take into consideration the priorities, the deadlines, the budgets, and also the dynamic behavior [28]. For big data platforms, dynamic scheduling with soft deadlines and hard budget constraints on hybrid clouds is an open issue.

A general-purpose resource management approach in a cluster used for big data processing should make some assumptions about the policies that are incorporated in service-level agreements. For example, interactive tasks, distributed and parallel applications, as well as noninteractive batch tasks should all be supported with high performance. This property is a straightforward one but, to some extent, difficult to achieve. Because tasks have different attributes, their requirements to the scheduler may contradict each other in a shared environment. For example, a real-time task requiring a short response time prefers space-sharing scheduling, while a noninteractive batch task requiring high throughput may prefer time-sharing scheduling [6,29]. To be general purpose, a tradeoff may have to be made; the scheduling method can focus on parallel tasks while providing an acceptable performance to other kinds of tasks.

YARN has a pluggable solution for dynamic policy loading, considering two steps for the resource allocation process (Figure 6.5; a configuration sketch follows). Resource allocation is done by YARN, and task scheduling is done by the application, which permits the YARN platform to be a generic one while still allowing flexibility in scheduling strategies. The specific policies in YARN are oriented toward resource splitting according to the schedules provided by applications. In this way, YARN's scheduler determines how much to allocate, and in which cluster, based on resource availability and on the configured sharing policy.

Figure 6.5 Resource allocation process: applications submit jobs to a resource manager whose allocator and scheduler (locality scheduling, gang scheduling, DAG scheduling, custom scheduling) grant resource shares to jobs (e.g., Job 1, 20%; Job 2, 50%).
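As an illustration of this pluggability, the sketch below selects the ResourceManager's scheduler class and routes a job to a queue using standard Hadoop 2.x property names; the queue name "research" is hypothetical, and in practice the scheduler class is set in yarn-site.xml on the ResourceManager rather than in job code.

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: choosing the pluggable YARN scheduling policy and routing a job
// to a queue. Set programmatically here only for illustration.
public class SchedulerConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Cluster-wide policy: FairScheduler (or the CapacityScheduler class).
        conf.set("yarn.resourcemanager.scheduler.class",
                 "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");
        // Per-job policy input: submit this job to a (hypothetical) queue.
        conf.set("mapreduce.job.queuename", "research");
        System.out.println(conf.get("yarn.resourcemanager.scheduler.class"));
    }
}
```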

6.6 Optimization Techniques for Scheduling

Scheduling in big data platforms is the main building block for making data centers more available to applications and user communities. Examples of optimization are the multiobjective and multiconstrained scheduling of many tasks in Hadoop [30] and the optimization of short jobs with Hadoop [31]. The optimization strategies for scheduling are specific to each model and to each type of application. The most frequently used techniques for scheduling optimization are as follows (a toy evolutionary example is sketched after the list):

Linear programming allows the scheduler to find the suitable resources, or cluster of resources, based on defined constraints.

Dynamic partitioning splits complex applications into clusters of tasks and schedules each cluster with a specific scheduling algorithm.

Combinatorial optimization aims to find an optimal allocation solution for a finite set of resources. This is a time-consuming technique, and it is not recommended for real-time processing.

Evolutionary optimization aims to find an optimal configuration for a specific system within specific constraints and consists of bio-inspired techniques like genetic and immune algorithms, ant and bee computing, particle swarm optimization, and so forth. These methods usually find a near-optimal solution for the scheduling problem.

Stochastic optimization uses random variables that appear in the formulation of the optimization problem itself and is used especially for applications that have a predictable statistical behavior, that is, a normal distribution (for periodic tasks), a Poisson distribution (for sporadic tasks), and so forth.

Task-oriented optimization considers the task's properties, arrival time (slack optimization), and frequency (for periodic tasks).

Resource-oriented optimization considers completion time constraints while making the decision to maximize resource utilization.
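To make the evolutionary approach tangible, here is a toy genetic algorithm, written for this overview rather than taken from Refs. [7] or [10], that maps tasks onto machines while minimizing the makespan; real schedulers add deadline, budget, and locality constraints to the fitness function. All parameters (population size, mutation rate, task costs) are illustrative.

```java
import java.util.Arrays;
import java.util.Random;

// Toy genetic algorithm for mapping tasks onto machines. A chromosome
// assigns each task (by index) to a machine; fitness is the makespan,
// i.e., the load of the most loaded machine (lower is better).
public class GaScheduler {
    static final Random RND = new Random(42);

    static long makespan(int[] assign, long[] taskCost, int machines) {
        long[] load = new long[machines];
        for (int t = 0; t < assign.length; t++) load[assign[t]] += taskCost[t];
        return Arrays.stream(load).max().orElse(0);
    }

    static int[] tournament(int[][] pop, long[] cost, int machines) {
        int[] a = pop[RND.nextInt(pop.length)], b = pop[RND.nextInt(pop.length)];
        return makespan(a, cost, machines) <= makespan(b, cost, machines) ? a : b;
    }

    static int[] evolve(long[] taskCost, int machines, int popSize, int generations) {
        int tasks = taskCost.length;
        int[][] pop = new int[popSize][tasks];
        for (int[] ind : pop)                       // random initial population
            for (int t = 0; t < tasks; t++) ind[t] = RND.nextInt(machines);

        for (int g = 0; g < generations; g++) {
            int[][] next = new int[popSize][];
            for (int i = 0; i < popSize; i++) {
                int[] a = tournament(pop, taskCost, machines);
                int[] b = tournament(pop, taskCost, machines);
                int[] child = a.clone();
                int cut = RND.nextInt(tasks);       // one-point crossover
                System.arraycopy(b, cut, child, cut, tasks - cut);
                if (RND.nextDouble() < 0.2)         // mutation: move one task
                    child[RND.nextInt(tasks)] = RND.nextInt(machines);
                next[i] = child;
            }
            pop = next;
        }
        int[] best = pop[0];
        for (int[] ind : pop)
            if (makespan(ind, taskCost, machines) < makespan(best, taskCost, machines))
                best = ind;
        return best;
    }

    public static void main(String[] args) {
        long[] cost = {8, 3, 7, 2, 5, 9, 1, 4};    // hypothetical task runtimes
        int[] best = evolve(cost, 3, 40, 200);
        System.out.println(Arrays.toString(best)
            + " makespan=" + makespan(best, cost, 3));
    }
}
```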

6.7 Case Study on Hadoop and Big Data Applications

On top of Apache Hadoop, or of a Hadoop distribution, a big data suite can be used to schedule the applications that generate big data. The Hadoop framework is suitable for specific high-value and high-volume workloads [32]. With optimized resource and data management, achieved by putting the right big data workloads in the right systems, a solution is offered by the integration of YARN with specific schedulers: G-Hadoop [33], MPI Scheduler, Torque, and so forth. The cost-effectiveness, scalability, and streamlined architectures of Hadoop have been developed by IBM, Oracle, and Hortonworks Data Platform (HDP), with the main scope of constructing a Hadoop operating system. Hadoop can be used in public/private clouds through related projects integrated in the Hadoop framework, trying to offer the best answer to the following question: what type of data/tasks should move to a public cloud in order to achieve a cost-aware cloud scheduler [34]?

Different platforms that use and integrate Hadoop in application development and implementation consider different aspects. Modeling is the first one. Even in Apache Hadoop, which offers the infrastructure for Hadoop clusters, it is still very complicated to build a specific MapReduce application by writing complex code. Pig Latin and the Hive query language (HQL), which generate MapReduce code, are languages optimized for application development. The integration of different scheduling algorithms and policies is also supported. The application can be modeled in a graphical way, and all required code is generated. Then, the tooling of big data services within a dedicated development environment (Eclipse) uses various dedicated plugins, and all the code of a MapReduce application is generated automatically. Last, but not least, the execution of big data jobs has to be scheduled and monitored. Instead of writing jobs or other code for scheduling, the big data suite offers the possibility to define and manage execution plans in an easy way.

In a big data platform, Hadoop needs to integrate data from all kinds of technologies and products. Besides files and SQL databases, Hadoop needs to integrate the NoSQL databases used in applications like the social media platforms Twitter and Facebook, the messages from middleware or data from business-to-business (B2B) products such as Salesforce or SAP, the multimedia streams, and so forth. A big data suite integrated with Hadoop offers connectors from all these different interfaces to Hadoop and back.

The main Hadoop-related projects, developed under the Apache license and supporting big data application execution, include the following [35] (an HBase read/write sketch is given after the list):

HBase (https://hbase.apache.org) is a scalable, distributed database that supports structured data storage for large tables. HBase offers random, real-time read/write access to big data platforms.

Cassandra (http://cassandra.apache.org) is a multimaster database with no single point of failure, offering scalability and high availability without compromising performance.

Pig (https://pig.apache.org) is a high-level data-flow language and execution framework for parallel computation. It also offers a compiler that produces sequences of MapReduce programs, for which large-scale parallel implementations already exist in Hadoop.

Hive (https://hive.apache.org) is a data warehouse infrastructure that provides data summarization and ad hoc querying. It facilitates querying and managing large data sets residing in distributed storage. It provides the SQL-like language called HiveQL.

ZooKeeper (https://zookeeper.apache.org) is a high-performance coordination service for distributed applications. It is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Ambari (https://ambari.apache.org) is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop.

Tez (http://tez.incubator.apache.org) is a generalized data-flow managing framework, built on Hadoop YARN, which provides an engine to execute an arbitrary directed acyclic graph (DAG) of tasks (an application work flow) to process data for both batch and interactive use cases. Tez is being adopted by other projects like Hive and Pig.

Spark (http://spark.apache.org) is a fast and general compute engine for Hadoop data in large-scale environments. It combines SQL, streaming, and complex analytics and supports application implementation in Java, Scala, or Python. Spark also has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.
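As a small illustration of HBase's random, real-time read/write access, the sketch below writes and reads one cell through the standard HBase client API; the table name ("events"), column family ("d"), and qualifier ("payload") are hypothetical and must already exist in the target cluster.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Write one cell and read it back; connection settings come from the
// hbase-site.xml found on the classpath.
public class HBaseExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("events"))) {

            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"),
                          Bytes.toBytes("hello"));
            table.put(put);                          // real-time write

            Result r = table.get(new Get(Bytes.toBytes("row-1")));
            System.out.println(Bytes.toString(
                r.getValue(Bytes.toBytes("d"), Bytes.toBytes("payload"))));
        }
    }
}
```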

6.8 Conclusions

In the past years, scheduling models and tools have undergone fundamental rearchitecting to fit as well as possible with large many-task computing environments. Existing solutions were redesigned by splitting the tools into multiple components and adapting each component to its new role and place. Workloads and work flow sequences are scheduled at the application side, and then a resource manager allocates a pool of resources for the execution phase. Scheduling has thus become important to users and providers at the same time, being the key to any optimal processing in big data science.

References

1. M. D. Assuncao, R. N. Calheiros, S. Bianchi, M. A. Netto, and R. Buyya. Big data computing and clouds: Challenges, solutions, and future directions. arXiv preprint.
2. A. Rasooli and D. G. Down. COSHH: A classification and optimization based scheduler for heterogeneous Hadoop systems. Future Generation Computer Systems, vol. 36, pp. 1-15.
3. M. Tim Jones. Scheduling in Hadoop: An introduction to the pluggable scheduler framework. IBM developerWorks, Technical Report. Available at https://www.ibm.com/developerworks/opensource/library/os-hadoop-scheduling.
4. L. Zhang, C. Wu, Z. Li, C. Guo, M. Chen, and F. C. M. Lau. Moving big data to the cloud: An online cost-minimizing approach. IEEE Journal on Selected Areas in Communications, vol. 31, no. 12.
5. Big Data Solutions on Cisco UCS Common Platform Architecture (CPA). Cisco Report.
6. V. Cristea, C. Dobre, C. Stratan, F. Pop, and A. Costan. Large-Scale Distributed Computing and Applications: Models and Trends. IGI Global, Hershey, PA.
7. D. Pletea, F. Pop, and V. Cristea. Speculative genetic scheduling method for Hadoop environments. In International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC).
8. A.-M. Oprescu, T. Kielmann, and H. Leahu. Budget estimation and control for bag-of-tasks scheduling in clouds. Parallel Processing Letters, vol. 21, no. 2.
9. M. Mao, J. Li, and M. Humphrey. Cloud auto-scaling with deadline and budget constraints. In Proceedings of the 11th ACM/IEEE International Conference on Grid Computing (Grid 2010), Brussels, Belgium.
10. J. Yu and R. Buyya. Scheduling scientific workflow applications with deadline and budget constraints using genetic algorithms. Scientific Programming Journal, vol. 14, nos. 3-4.
11. K. Liu, H. Jin, J. Chen, X. Liu, D. Yuan, and Y. Yang. A compromised-time-cost scheduling algorithm in SwinDeW-C for instance-intensive cost-constrained workflows on a cloud computing platform. International Journal of High Performance Computing Applications, vol. 24, no. 4.
12. T. White. Hadoop: The Definitive Guide. O'Reilly Media, Inc.
13. X. Hua, H. Wu, Z. Li, and S. Ren. Enhancing throughput of the Hadoop Distributed File System for interaction-intensive tasks. Journal of Parallel and Distributed Computing. Available online April 3.

14. M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Job scheduling for multi-user MapReduce clusters. Technical Report, EECS Department, University of California, Berkeley, CA.
15. M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica. Improving MapReduce performance in heterogeneous environments. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI 08). USENIX Association, Berkeley, CA.
16. K. Kc and K. Anyanwu. Scheduling Hadoop jobs to meet deadlines. In Proceedings of the 2010 IEEE Second International Conference on Cloud Computing Technology and Science (CLOUDCOM 10). IEEE Computer Society, Washington, DC.
17. Y. Tao, Q. Zhang, L. Shi, and P. Chen. Job scheduling optimization for multi-user MapReduce clusters. In Parallel Architectures, Algorithms and Programming (PAAP), 2011 Fourth International Symposium on, pp. 213-217, December 9-11.
18. Corona. Under the hood: Scheduling MapReduce jobs more efficiently with Corona. Available at https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/.
19. B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: A platform for fine-grained resource sharing in the data center. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI 11). USENIX Association, Berkeley, CA.
20. Hortonworks. Hadoop YARN: A next-generation framework for Hadoop data processing.
21. J. Celaya and U. Arronategui. A task routing approach to large-scale scheduling. Future Generation Computer Systems, vol. 29, no. 5.
22. Y. Xu, A. Suarez, and M. Zhao. IBIS: Interposed big-data I/O scheduler. In Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing (HPDC 13). ACM, New York.
23. J. Hall, J. Hartline, A. R. Karlin, J. Saia, and J. Wilkes. On algorithms for efficient data migration. In Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 01). Society for Industrial and Applied Mathematics, Philadelphia, PA.
24. A. Stage and T. Setzer. Network-aware migration control and scheduling of differentiated virtual machine workloads. In Proceedings of the 2009 ICSE Workshop on Software Engineering Challenges of Cloud Computing (CLOUD 09). IEEE Computer Society, Washington, DC, pp. 9-14.
25. G. Jung, N. Gnanasambandam, and T. Mukherjee. Synchronous parallel processing of big-data analytics services to optimize performance in federated clouds. In Proceedings of the 2012 IEEE Fifth International Conference on Cloud Computing (CLOUD 12). IEEE Computer Society, Washington, DC.
26. K. Ye, X. Jiang, D. Huang, J. Chen, and B. Wang. Live migration of multiple virtual machines with resource reservation in cloud computing environments. In Proceedings of the 2011 IEEE 4th International Conference on Cloud Computing (CLOUD 11). IEEE Computer Society, Washington, DC.
27. M.-C. Nita, C. Chilipirea, C. Dobre, and F. Pop. A SLA-based method for big-data transfers with multi-criteria optimization constraints for IaaS. In RoEduNet International Conference (RoEduNet), pp. 1-6, January 17-19.
28. R. Van den Bossche, K. Vanmechelen, and J. Broeckhove. Online cost-efficient scheduling of deadline-constrained workloads on hybrid clouds. Future Generation Computer Systems, vol. 29, no. 4.
29. H. Karatza. Scheduling in distributed systems. In Performance Tools and Applications to Networked Systems. Springer, Berlin, Heidelberg.

30. F. Zhang, J. Cao, K. Li, S. U. Khan, and K. Hwang. Multi-objective scheduling of many tasks in cloud platforms. Future Generation Computer Systems. Available online September 18.
31. K. Elmeleegy. Piranha: Optimizing short jobs in Hadoop. Proceedings of the VLDB Endowment, vol. 6, no. 11.
32. R. Ramakrishnan and CISL Team Members. Scale-out beyond MapReduce. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 13), I. S. Dhillon, Y. Koren, R. Ghani, T. E. Senator, P. Bradley, R. Parekh, J. He, R. L. Grossman, and R. Uthurusamy (Eds.). ACM, New York, p. 1.
33. L. Wang, J. Tao, R. Ranjan, H. Marten, A. Streit, J. Chen, and D. Chen. G-Hadoop: MapReduce across distributed data centers for data-intensive computing. Future Generation Computer Systems, vol. 29, no. 3.
34. B. C. Tak, B. Urgaonkar, and A. Sivasubramaniam. To move or not to move: The economics of cloud computing. In Proceedings of the 3rd USENIX Conference on Hot Topics in Cloud Computing (HotCloud 11). USENIX Association, Berkeley, CA.
35. D. Loshin. Chapter 7: Big data tools and techniques. In Big Data Analytics. Morgan Kaufmann, Boston.


More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

Task Scheduling in Hadoop

Task Scheduling in Hadoop Task Scheduling in Hadoop Sagar Mamdapure Munira Ginwala Neha Papat SAE,Kondhwa SAE,Kondhwa SAE,Kondhwa Abstract Hadoop is widely used for storing large datasets and processing them efficiently under distributed

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

BIG DATA TECHNOLOGY. Hadoop Ecosystem

BIG DATA TECHNOLOGY. Hadoop Ecosystem BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big

More information

International Journal of Computer & Organization Trends Volume21 Number1 June 2015 A Study on Load Balancing in Cloud Computing

International Journal of Computer & Organization Trends Volume21 Number1 June 2015 A Study on Load Balancing in Cloud Computing A Study on Load Balancing in Cloud Computing * Parveen Kumar * Er.Mandeep Kaur Guru kashi University,Talwandi Sabo Guru kashi University,Talwandi Sabo Abstract: Load Balancing is a computer networking

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

SCORE BASED DEADLINE CONSTRAINED WORKFLOW SCHEDULING ALGORITHM FOR CLOUD SYSTEMS

SCORE BASED DEADLINE CONSTRAINED WORKFLOW SCHEDULING ALGORITHM FOR CLOUD SYSTEMS SCORE BASED DEADLINE CONSTRAINED WORKFLOW SCHEDULING ALGORITHM FOR CLOUD SYSTEMS Ranjit Singh and Sarbjeet Singh Computer Science and Engineering, Panjab University, Chandigarh, India ABSTRACT Cloud Computing

More information

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input

More information

MapReduce with Apache Hadoop Analysing Big Data

MapReduce with Apache Hadoop Analysing Big Data MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside gavin.heavyside@journeydynamics.com About Journey Dynamics Founded in 2006 to develop software technology to address the issues

More information

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)

More information

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science A Seminar report On Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science SUBMITTED TO: www.studymafia.org SUBMITTED BY: www.studymafia.org

More information

Design of Electric Energy Acquisition System on Hadoop

Design of Electric Energy Acquisition System on Hadoop , pp.47-54 http://dx.doi.org/10.14257/ijgdc.2015.8.5.04 Design of Electric Energy Acquisition System on Hadoop Yi Wu 1 and Jianjun Zhou 2 1 School of Information Science and Technology, Heilongjiang University

More information

Hadoop Scheduler w i t h Deadline Constraint

Hadoop Scheduler w i t h Deadline Constraint Hadoop Scheduler w i t h Deadline Constraint Geetha J 1, N UdayBhaskar 2, P ChennaReddy 3,Neha Sniha 4 1,4 Department of Computer Science and Engineering, M S Ramaiah Institute of Technology, Bangalore,

More information

Deploying Hadoop with Manager

Deploying Hadoop with Manager Deploying Hadoop with Manager SUSE Big Data Made Easier Peter Linnell / Sales Engineer plinnell@suse.com Alejandro Bonilla / Sales Engineer abonilla@suse.com 2 Hadoop Core Components 3 Typical Hadoop Distribution

More information

Moving From Hadoop to Spark

Moving From Hadoop to Spark + Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com sujee@elephantscale.com Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee

More information

Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks

Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks Praveenkumar Kondikoppa, Chui-Hui Chiu, Cheng Cui, Lin Xue and Seung-Jong Park Department of Computer Science,

More information

Hadoop: Embracing future hardware

Hadoop: Embracing future hardware Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop

More information

Big Data Processing using Hadoop. Shadi Ibrahim Inria, Rennes - Bretagne Atlantique Research Center

Big Data Processing using Hadoop. Shadi Ibrahim Inria, Rennes - Bretagne Atlantique Research Center Big Data Processing using Hadoop Shadi Ibrahim Inria, Rennes - Bretagne Atlantique Research Center Apache Hadoop Hadoop INRIA S.IBRAHIM 2 2 Hadoop Hadoop is a top- level Apache project» Open source implementation

More information

CSE-E5430 Scalable Cloud Computing Lecture 11

CSE-E5430 Scalable Cloud Computing Lecture 11 CSE-E5430 Scalable Cloud Computing Lecture 11 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 30.11-2015 1/24 Distributed Coordination Systems Consensus

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

A STUDY OF ADOPTING BIG DATA TO CLOUD COMPUTING

A STUDY OF ADOPTING BIG DATA TO CLOUD COMPUTING A STUDY OF ADOPTING BIG DATA TO CLOUD COMPUTING ASMAA IBRAHIM Technology Innovation and Entrepreneurship Center, Egypt aelrehim@itida.gov.eg MOHAMED EL NAWAWY Technology Innovation and Entrepreneurship

More information

Heterogeneous Workload Consolidation for Efficient Management of Data Centers in Cloud Computing

Heterogeneous Workload Consolidation for Efficient Management of Data Centers in Cloud Computing Heterogeneous Workload Consolidation for Efficient Management of Data Centers in Cloud Computing Deep Mann ME (Software Engineering) Computer Science and Engineering Department Thapar University Patiala-147004

More information

Evaluating Task Scheduling in Hadoop-based Cloud Systems

Evaluating Task Scheduling in Hadoop-based Cloud Systems 2013 IEEE International Conference on Big Data Evaluating Task Scheduling in Hadoop-based Cloud Systems Shengyuan Liu, Jungang Xu College of Computer and Control Engineering University of Chinese Academy

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Qsoft Inc www.qsoft-inc.com

Qsoft Inc www.qsoft-inc.com Big Data & Hadoop Qsoft Inc www.qsoft-inc.com Course Topics 1 2 3 4 5 6 Week 1: Introduction to Big Data, Hadoop Architecture and HDFS Week 2: Setting up Hadoop Cluster Week 3: MapReduce Part 1 Week 4:

More information

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform

More information

Transforming the Telecoms Business using Big Data and Analytics

Transforming the Telecoms Business using Big Data and Analytics Transforming the Telecoms Business using Big Data and Analytics Event: ICT Forum for HR Professionals Venue: Meikles Hotel, Harare, Zimbabwe Date: 19 th 21 st August 2015 AFRALTI 1 Objectives Describe

More information

Data processing goes big

Data processing goes big Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,

More information

Computing at Scale: Resource Scheduling Architectural Evolution and Introduction to Fuxi System

Computing at Scale: Resource Scheduling Architectural Evolution and Introduction to Fuxi System Computing at Scale: Resource Scheduling Architectural Evolution and Introduction to Fuxi System Renyu Yang( 杨 任 宇 ) Supervised by Prof. Jie Xu Ph.D. student@ Beihang University Research Intern @ Alibaba

More information

Virtualizing Apache Hadoop. June, 2012

Virtualizing Apache Hadoop. June, 2012 June, 2012 Table of Contents EXECUTIVE SUMMARY... 3 INTRODUCTION... 3 VIRTUALIZING APACHE HADOOP... 4 INTRODUCTION TO VSPHERE TM... 4 USE CASES AND ADVANTAGES OF VIRTUALIZING HADOOP... 4 MYTHS ABOUT RUNNING

More information

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Prepared By : Manoj Kumar Joshi & Vikas Sawhney Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks

More information

Addressing Open Source Big Data, Hadoop, and MapReduce limitations

Addressing Open Source Big Data, Hadoop, and MapReduce limitations Addressing Open Source Big Data, Hadoop, and MapReduce limitations 1 Agenda What is Big Data / Hadoop? Limitations of the existing hadoop distributions Going enterprise with Hadoop 2 How Big are Data?

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A COMPREHENSIVE VIEW OF HADOOP ER. AMRINDER KAUR Assistant Professor, Department

More information

YARN, the Apache Hadoop Platform for Streaming, Realtime and Batch Processing

YARN, the Apache Hadoop Platform for Streaming, Realtime and Batch Processing YARN, the Apache Hadoop Platform for Streaming, Realtime and Batch Processing Eric Charles [http://echarles.net] @echarles Datalayer [http://datalayer.io] @datalayerio FOSDEM 02 Feb 2014 NoSQL DevRoom

More information

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE Mr. Santhosh S 1, Mr. Hemanth Kumar G 2 1 PG Scholor, 2 Asst. Professor, Dept. Of Computer Science & Engg, NMAMIT, (India) ABSTRACT

More information

HDP Hadoop From concept to deployment.

HDP Hadoop From concept to deployment. HDP Hadoop From concept to deployment. Ankur Gupta Senior Solutions Engineer Rackspace: Page 41 27 th Jan 2015 Where are you in your Hadoop Journey? A. Researching our options B. Currently evaluating some

More information

Scheduling Data Intensive Workloads through Virtualization on MapReduce based Clouds

Scheduling Data Intensive Workloads through Virtualization on MapReduce based Clouds ABSTRACT Scheduling Data Intensive Workloads through Virtualization on MapReduce based Clouds 1 B.Thirumala Rao, 2 L.S.S.Reddy Department of Computer Science and Engineering, Lakireddy Bali Reddy College

More information

Do You Feel the Lag of Your Hadoop?

Do You Feel the Lag of Your Hadoop? Do You Feel the Lag of Your Hadoop? Yuxuan Jiang, Zhe Huang, and Danny H.K. Tsang Department of Electronic and Computer Engineering The Hong Kong University of Science and Technology, Hong Kong Email:

More information

International Journal of Innovative Research in Information Security (IJIRIS) ISSN: 2349-7017(O) Volume 1 Issue 3 (September 2014)

International Journal of Innovative Research in Information Security (IJIRIS) ISSN: 2349-7017(O) Volume 1 Issue 3 (September 2014) SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE N.Alamelu Menaka * Department of Computer Applications Dr.Jabasheela Department of Computer Applications Abstract-We are in the age of big data which

More information

Figure 1. The cloud scales: Amazon EC2 growth [2].

Figure 1. The cloud scales: Amazon EC2 growth [2]. - Chung-Cheng Li and Kuochen Wang Department of Computer Science National Chiao Tung University Hsinchu, Taiwan 300 shinji10343@hotmail.com, kwang@cs.nctu.edu.tw Abstract One of the most important issues

More information

Internals of Hadoop Application Framework and Distributed File System

Internals of Hadoop Application Framework and Distributed File System International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools

More information

Applied research on data mining platform for weather forecast based on cloud storage

Applied research on data mining platform for weather forecast based on cloud storage Applied research on data mining platform for weather forecast based on cloud storage Haiyan Song¹, Leixiao Li 2* and Yuhong Fan 3* 1 Department of Software Engineering t, Inner Mongolia Electronic Information

More information

Heterogeneity-Aware Resource Allocation and Scheduling in the Cloud

Heterogeneity-Aware Resource Allocation and Scheduling in the Cloud Heterogeneity-Aware Resource Allocation and Scheduling in the Cloud Gunho Lee, Byung-Gon Chun, Randy H. Katz University of California, Berkeley, Yahoo! Research Abstract Data analytics are key applications

More information

Reverse Auction-based Resource Allocation Policy for Service Broker in Hybrid Cloud Environment

Reverse Auction-based Resource Allocation Policy for Service Broker in Hybrid Cloud Environment Reverse Auction-based Resource Allocation Policy for Service Broker in Hybrid Cloud Environment Sunghwan Moon, Jaekwon Kim, Taeyoung Kim, Jongsik Lee Department of Computer and Information Engineering,

More information

From Wikipedia, the free encyclopedia

From Wikipedia, the free encyclopedia Page 1 sur 5 Hadoop From Wikipedia, the free encyclopedia Apache Hadoop is a free Java software framework that supports data intensive distributed applications. [1] It enables applications to work with

More information

Oracle s Big Data solutions. Roger Wullschleger.

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here> s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Alternatives to HIVE SQL in Hadoop File Structure

Alternatives to HIVE SQL in Hadoop File Structure Alternatives to HIVE SQL in Hadoop File Structure Ms. Arpana Chaturvedi, Ms. Poonam Verma ABSTRACT Trends face ups and lows.in the present scenario the social networking sites have been in the vogue. The

More information

Big Data Storage Architecture Design in Cloud Computing

Big Data Storage Architecture Design in Cloud Computing Big Data Storage Architecture Design in Cloud Computing Xuebin Chen 1, Shi Wang 1( ), Yanyan Dong 1, and Xu Wang 2 1 College of Science, North China University of Science and Technology, Tangshan, Hebei,

More information

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov An Industrial Perspective on the Hadoop Ecosystem Eldar Khalilov Pavel Valov agenda 03.12.2015 2 agenda Introduction 03.12.2015 2 agenda Introduction Research goals 03.12.2015 2 agenda Introduction Research

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

Copyright 2012, Oracle and/or its affiliates. All rights reserved.

Copyright 2012, Oracle and/or its affiliates. All rights reserved. 1 Oracle Big Data Appliance Releases 2.5 and 3.0 Ralf Lange Global ISV & OEM Sales Agenda Quick Overview on BDA and its Positioning Product Details and Updates Security and Encryption New Hadoop Versions

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Discovery 2015: Cloud Computing Workshop June 20-24, 2011 Berkeley, CA Introduction to Cloud Computing Keith R. Jackson Lawrence Berkeley National Lab What is it? NIST Definition Cloud computing is a model

More information

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Keywords: Big Data, HDFS, Map Reduce, Hadoop Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

More information