BEYOND MAP REDUCE: THE NEXT GENERATION OF BIG DATA ANALYTICS


WHITE PAPER

Co-Authors: Brian Hellig, ET International, Inc.; Stephen Turner, ET International, Inc.; Rich Collier, ET International, Inc.; Long Zheng, University of Delaware

Abstract

MapReduce does what it was intended to do: store and process large datasets in batches, inexpensively. However, many organizations that implemented Hadoop have experienced unexpected challenges, in part because they want to do more than MapReduce was designed to do. In response, an ecosystem emerged that continually introduces additional tools to help overcome some of those challenges. Unfortunately, this has only made implementations more complex and daunting, giving rise to the need for the simpler toolset offered by HAMR. ET International (ETI) is introducing a way to keep the best aspects of MapReduce and its popular implementation, Apache Hadoop, while reducing the number of add-on tools needed to make it relevant for commercial application. ETI's novel multi-source analytics product HAMR runs both batch and real-time streaming. It complements the current paradigm and accommodates the next generation of systems that will begin to render Hadoop MapReduce as we know it obsolete. Given that Moore's Law has an 18-month cycle time, it is urgent that information systems professionals budget and plan now for future generations. [6]

I. Introduction

The name MapReduce originally referred to proprietary Google technology, but it has come to generically mean a programming model used to store and process large-scale datasets on commodity hardware clusters. One popular open-source implementation of MapReduce is Apache Hadoop.

The big idea behind MapReduce revolved around processing and analyzing big data. Although it was designed to hide the complexities of scheduling, parallelization, failure handling, and computation distribution across a cluster of nodes [7], Google developers pioneered the MapReduce model to handle really big data. [1] Dean and Ghemawat opened their breakthrough 2004 paper by stating that MapReduce is "a programming model and associated implementation for processing and generating large data sets." [1]

The MapReduce concept became extremely popular, in part because organizations could deploy it on commonly available computing components for parallel processing. This allowed terabytes of data to be analyzed on vast networks of inexpensive computers. This commodity-based approach offered tremendous value and solved major problems for Google, whose search engine behemoth crawls, indexes, inspects, and ranks the entire web. [3] MapReduce allowed Google to aggregate information from a variety of sources, transform the data into different formats, assess application logs, and export the data for analysis. [4]

By 2007, Doug Cutting asserted that Yahoo! "regularly uses Hadoop for research tasks to improve its products and services such as ranking functions, ad targeting, etc. There are also a few cases where data generated by Hadoop is directly used by products." He was proud to say, "Unlike Google, Yahoo! has decided to develop Hadoop in an open and non-proprietary environment and the software is free for anyone to use and modify." [2] This open-source approach led to today's Hadoop 2.0 ecosystem and its variety of complementary technologies (such as Storm, HBase, Hive, Pig, Mahout, and ZooKeeper). Although these technologies fulfill different needs, they add complexity [5] because their capabilities go beyond what MapReduce was designed to do. Specifically, users want to move past the limitations of MapReduce's current synchronous, batch processing of datasets.

As a solution to batch-processing limitations and the challenges of future-generation computing systems, ETI proposes the next generation of MapReduce: its novel product, HAMR, which uses asynchronous Flowlets and Key/Value Stores (patents pending) to process multiple sources of data in real-time streaming or batch modes.

II. Batch Processing Datasets

MapReduce is a synchronous approach to data processing; it is therefore synonymous with batch processing. Mapping generates a set of intermediate key/value pairs; reduce merges the intermediate values associated with an intermediate key. [4] A key is a particular identifier for a unit or type of datum. The value can point to the whereabouts of that data, or it can actually be the identified data. [10] Dean and Ghemawat noted that their design was an abstraction that "allowed us to express the simple computations we were trying to perform but hides the messy details of parallelization." They built their model around the sequential map and reduce building blocks found in functional languages such as Lisp. [4]
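To make the key/value mechanics concrete, the following is a minimal, single-process sketch of the MapReduce pattern in Python. It is a toy illustration of the model itself, not Hadoop's actual API: map emits intermediate (key, value) pairs, the pairs are grouped by key, and reduce merges the values for each key.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(log_line):
    # Emit an intermediate (key, value) pair per word, e.g. for word counting.
    for word in log_line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Merge all intermediate values associated with one intermediate key.
    return (key, sum(values))

def map_reduce(records):
    # Map phase: produce intermediate key/value pairs from every record.
    intermediate = [pair for record in records for pair in map_fn(record)]
    # Shuffle phase: group pairs by key (Hadoop does this across the cluster).
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: merge the values for each key.
    return [reduce_fn(key, (v for _, v in pairs))
            for key, pairs in groupby(intermediate, key=itemgetter(0))]

print(map_reduce(["error timeout", "error disk full", "disk ok"]))
# [('disk', 2), ('error', 2), ('full', 1), ('ok', 1), ('timeout', 1)]
```

In Hadoop, the map, shuffle, and reduce phases run as separate, globally synchronized stages across the cluster; that bulk synchronization is exactly what makes the model a batch model.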

MapReduce methodically batch processes enormous sets of data. In practice, large e-commerce sites must wait to see the results of log-file analysis because MapReduce takes many hours or even days to process datasets measured in terabytes. (Large sites can have 100,000 visits per day, with each visit generating 100 log files.) E-commerce sites that do not require rapid response could make adjustments daily or weekly if the analysis indicates a problem. For example, by analyzing shopping cart abandonment, they could identify the root cause and improve the customer experience. This is not something that needs to be done on the fly; multi-hour or overnight processing is fine.

However, other situations require rapid response. For example, a fraudulent credit card transaction in a retail store should be identified and stopped in no more than a few minutes, but typical batch processing can only detect the fraud hours after the merchandise has left the building. The Hadoop Distributed File System (HDFS) contributes to the delay, because administrators must physically move the data from each system of record to HDFS. Even if moved hourly in smaller batches, this time-consuming process is neither appropriate for situations that require real-time streaming analytics (such as fraud) nor optimal for a range of processing algorithms.

Figure 1: Hadoop 2.0 Ecosystem. The ecosystem includes Analytics (top layer: Batch/MapReduce, Interactive, Online, Streaming, Graph, In-Memory, HPC MPI, and Other), Resource Management (middle layer: YARN), and Storage (bottom layer: HDFS2, redundant and reliable).

MapReduce is one of many tools on the Analytics layer (top). All of these tools make Hadoop 2.0 complex to implement. YARN (middle) is responsible for tracking the resources in a cluster and scheduling applications (i.e., MapReduce jobs). The Hadoop Distributed File System (HDFS2, on the bottom layer) is a major challenge within the ecosystem because all data must be moved into it.

ETI is introducing a new way to run both batch (when delays are acceptable) and streaming (when time is of the essence) analytics. ETI's novel multi-source analytics product HAMR is optimized to run in either batch or streaming mode and to reduce the number of add-on software tools found in Hadoop.

III. Next Generation of Map Reduce

Figure 2: Extension of the Ecosystem with the HAMR Multi-Source Analytics Data Flow Engine. By introducing HAMR, a range of datasets can be accessed through one intuitive interface, including HDFS2 as well as the Cloud, HBASE, SQL, Twitter, Lustre, streaming data, and more. This is a significant breakthrough that enables real-time streaming analytics.

Although the terms real-time and streaming are often used interchangeably, they have different meanings. Real-time means that once an event happens, the results are returned within a predictable timeframe. In her article, Mary Ludloff used the example of a car's antilock brakes as a real-time computing system, in which the brakes must be released within a predictable timeframe. [8] In contrast, streaming relates more to user perception. That is, users with sufficient network bandwidth expect continuous output without noticeable delays. [9]
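The operational difference matters for scenarios like the fraud example above: a streaming pipeline evaluates each transaction as it arrives instead of waiting for a batch window, and a real-time system additionally bounds how long each decision may take. A minimal Python sketch follows; the `is_fraudulent` rule and the threshold are hypothetical placeholders for illustration, not part of any HAMR or Hadoop API.

```python
import time

FRAUD_AMOUNT_THRESHOLD = 5000.0  # hypothetical rule, for illustration only

def is_fraudulent(txn):
    # Placeholder rule: flag unusually large transactions.
    return txn["amount"] > FRAUD_AMOUNT_THRESHOLD

def process_stream(transactions, deadline_ms=50):
    # Streaming: handle each event on arrival; real-time additionally
    # requires each decision within a predictable deadline.
    for txn in transactions:
        start = time.monotonic()
        if is_fraudulent(txn):
            print(f"block transaction {txn['id']}")
        elapsed_ms = (time.monotonic() - start) * 1000
        assert elapsed_ms < deadline_ms, "missed real-time deadline"

process_stream([{"id": 1, "amount": 42.0}, {"id": 2, "amount": 9100.0}])
```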

The option to run real-time streaming analytics is in increasing demand, but the network, hardware, and software systems must be designed to support this requirement. Database systems typically use structured query languages (SQL) or resource description framework query languages (e.g., SPARQL) as their primary means of interfacing with the outside world. Although it is possible to express diverse queries in these languages, many operations, especially graph-related queries, require very complex syntactic constructs or repeated queries. Furthermore, both database systems and more direct software approaches share a drawback: they perform acceptably only for particular problem classes. For example, neither database systems nor MapReduce easily perform a search within a graph.

As the industry merges heterogeneous data sources, it becomes necessary to interact with ad hoc relationships among data. Data scientists need ways to traverse these data structures in an efficient, parallel way. These data structures are too large to live on a single shared-memory system; they must be distributed across many systems. They can live in memory for fast access, or reside on local disks suitable for larger volumes of data but with less efficient access.

Some distributed graph systems have been presented in the literature and on the market, such as Pregel and its derivatives (Giraph, GoldenOrb, JPregel, Phoebus, Sedge, Angrapa, etc.), and federated/sharded shared-nothing databases (BigData, Neo4j, etc.). However, there has been very little work on distributed graph systems that allow both more flexibility than Pregel (which stores only temporary data in the vertex) and complex relationships among data stored on different compute nodes. Specifically, federated/sharded systems force internode joins up to the application layer, which limits the complexity of the joins and, therefore, the interrelatedness of the data.
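To illustrate the vertex-centric model that Pregel and its derivatives popularized, here is a minimal single-process Python sketch of breadth-first search organized into supersteps. It is a conceptual illustration only, not code from any of the systems named above.

```python
def bfs_supersteps(adjacency, source):
    """Pregel-style breadth-first search: in each superstep, newly
    reached vertices send messages to their neighbors."""
    distance = {v: None for v in adjacency}
    distance[source] = 0
    frontier = {source}                # vertices activated by messages
    step = 0
    while frontier:
        step += 1
        messages = set()
        for v in frontier:             # each active vertex messages its neighbors
            for neighbor in adjacency[v]:
                if distance[neighbor] is None:
                    messages.add(neighbor)
        for v in messages:
            distance[v] = step
        frontier = messages            # barrier: the next superstep begins
    return distance

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(bfs_supersteps(graph, "a"))      # {'a': 0, 'b': 1, 'c': 1, 'd': 2}
```

In a real distributed system, each vertex lives on some node and the "messages" cross the network; the global barrier between supersteps is what limits Pregel-style flexibility, which is the gap the text identifies.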

ETI proposes, and is currently beta testing, a scalable system that will allow more flexible graph computations such as clustering, pattern searching, and predictive analytics on distributed heterogeneous systems without forcing shared-nothing policies.

Such a fundamental transformation requires three prerequisite software features:

- A runtime system that manages:
  o Concurrency
  o Synchronization
  o Locality
  o Heterogeneous computing resource types, such as different central processing units (CPUs), graphics processing units (GPUs), and field-programmable gate arrays (FPGAs)
  o Distributed hierarchies of memory
  o Fault tolerance
- A simple-to-use, domain-specific interface for interacting with this runtime system. The interface must provide methods to load, modify, traverse, and query a parallel distributed data structure, all without forcing the programmer to understand, or be concerned with, the concurrency, synchronization, or locality details of those operations (a sketch of such an interface follows this list).
- Language constructs that simplify and extend programmatic access through keywords and compiler analysis when the application programming interface is either clumsy or insufficient.
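As a concrete, if simplified, illustration of what such a domain-specific interface could look like, here is a hypothetical Python sketch. The class and method names are invented for illustration and do not reflect ETI's actual API.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Iterable

class DistributedStructure(ABC):
    """Hypothetical interface to a parallel distributed data structure.
    Concurrency, synchronization, and locality are the runtime's job;
    none of them leak into the caller's code."""

    @abstractmethod
    def load(self, source_uri: str) -> None:
        """Ingest data from a source (file system, database, stream)."""

    @abstractmethod
    def modify(self, key: Any, value: Any) -> None:
        """Insert or update one element; the runtime routes it to its owner node."""

    @abstractmethod
    def traverse(self, visit: Callable[[Any, Any], None]) -> None:
        """Apply a visitor to every element, in parallel across nodes."""

    @abstractmethod
    def query(self, predicate: Callable[[Any, Any], bool]) -> Iterable[Any]:
        """Return the keys whose (key, value) pairs satisfy the predicate."""
```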

IV. Preparing for Future Generations

Present-day big data software will be difficult to adapt to future-generation computing systems with many-core sockets and GPU accelerators, and to their related, increasingly difficult programming, efficiency, heterogeneity, and scalability challenges. These challenges call for careful coordination among the very large number of executing CPU cores as they access highly non-uniform and widely distributed memories, which may incur large latencies for more remote accesses. Long-latency operations can seldom be predicted and scheduled statically to any great effect, partly because they may be highly data-dependent or complex, and partly because of unpredictable interfaces between hardware components. Making matters more difficult, the reliability of large-scale computing systems will likely be far lower than that of present-day systems at smaller scale. Applications on future-generation systems will be required to handle, and recover from, what today would be considered major hardware faults in order to run to completion.

In any system with non-uniform or remote memories, maximizing the collocation of data and computation minimizes the time and power overhead required for remote communications. Within a shared-memory domain, this is typically done with hardware caches. However, maintaining coherence among these caches requires an amount of communication proportional to the square of the number of caches. Thus, as the number of cores increases exponentially, the difficulty of efficiently maintaining cache coherency increases far more quickly. Hardware vendors have several ways to address this problem, including lower-bus-traffic coherence protocols (which are still fundamentally O(n²) for n caches), eliminating caches altogether, or moving the burden of coherence management into software. If software must handle some coherency issues and the runtime can address this, then the runtime might as well handle all coherency maintenance.

Data movement costs time and power, but HAMR has several means of reducing those costs. HAMR takes full advantage of locality-aware file systems, such as HDFS, by performing computation on the compute node where the data resides. In-memory datasets are partitioned among the compute nodes, where further processing can be performed. These partitions remain in memory until they exceed the available memory capacity, at which time they are intelligently spilled to disk. Programmers will see little to no impact on programmability, because locality and memory management tasks are handled by the HAMR runtime.
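To make the spill-to-disk behavior described above concrete, here is a toy, single-node Python sketch of a key/value partition that keeps entries in memory until a capacity limit is reached and then spills the excess to disk. It illustrates the general technique only and is not HAMR's implementation.

```python
import os
import pickle
import tempfile

class SpillingPartition:
    """Toy key/value partition: in memory up to `capacity` entries,
    after which new entries are spilled to disk files."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.memory = {}
        self.spill_dir = tempfile.mkdtemp(prefix="spill-")

    def _spill_path(self, key):
        # Toy naming scheme; a real store would handle hash collisions.
        return os.path.join(self.spill_dir, f"{hash(key)}.bin")

    def put(self, key, value):
        if key in self.memory or len(self.memory) < self.capacity:
            self.memory[key] = value          # fast path: stay in memory
        else:
            with open(self._spill_path(key), "wb") as f:
                pickle.dump((key, value), f)  # slow path: spill to disk

    def get(self, key):
        if key in self.memory:
            return self.memory[key]
        with open(self._spill_path(key), "rb") as f:
            stored_key, value = pickle.load(f)
            return value

p = SpillingPartition(capacity=2)
for i in range(4):
    p.put(f"k{i}", i)
print([p.get(f"k{i}") for i in range(4)])     # [0, 1, 2, 3]
```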

Computing systems grow ever larger and more interconnected, and the boundaries between systems place stricter limitations on the type of work that any one of them can perform. Traditional parallel applications must explicitly partition their data and work so that they can distribute them effectively to all available hardware resources. This is commonly seen in the practice of assigning work to nodes in a commodity cluster based on each node's assigned ordinal rank. However, as workloads become larger and less regular, it becomes increasingly difficult to create an effective static mapping from the work and data to the compute nodes. Virtualizing the way work is initiated and data is accessed can overcome this difficulty by allowing processing and data to be decoupled from their initial locations and to migrate as runtime needs dictate. By building a suitable execution environment and allowing running tasks to easily query their relative and absolute location within a parallel system, application software can also direct its own work and data movement rather than leaving it entirely under the control of the runtime system.

V. Asynchronous Flowlets and Key/Value Stores

ETI's HAMR model uses a dataflow-based computing system. Although users still write map and reduce functions, HAMR treats these MapReduce tasks with a fine-grain, dataflow-style parallelism that is quite different from the coarse-grain, bulk-synchronous parallelism found in Hadoop. In contrast to the existing MapReduce model, in which mapping is followed by reduction, the HAMR approach allows users to implement multiple phases as needed. This new patent-pending workflow executes on a distributed computing system using a data-parallel execution model. Each of the phases is a type of Flowlet, and each Flowlet is a Resource, a Key/Value Store, or a Transform.

ETI has coined the term Flowlet (patent pending) to describe a dataflow actor in a workflow that performs a computation on an input dataset and produces output datasets. Flowlets can be flow controlled, for example to stop producing data when downstream actors are busy computing on other data and have no space to store the incoming data. The system can process the flow-control event from the downstream consumer actor, forcing the producer actor to stop producing output. This is what makes the HAMR model scalable and well suited for future hardware upgrades in large-scale systems.
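Flow control of this kind is a general dataflow technique, and a minimal sketch can show the mechanism with nothing more than Python's standard library: bounded queues between pipeline stages automatically stall an upstream producer when a downstream consumer falls behind. The stage names here are invented for illustration; this is not HAMR's Flowlet API.

```python
import queue
import threading

SENTINEL = None  # marks end of stream

def producer(out_q):
    for i in range(10):
        out_q.put(i)          # blocks when the queue is full: backpressure
    out_q.put(SENTINEL)

def transform(in_q, out_q):
    while (item := in_q.get()) is not SENTINEL:
        out_q.put(item * item)
    out_q.put(SENTINEL)

def consumer(in_q):
    while (item := in_q.get()) is not SENTINEL:
        print("result:", item)

# Bounded queues between stages provide flow control: a full queue
# stalls the upstream stage until the downstream stage frees space.
q1, q2 = queue.Queue(maxsize=2), queue.Queue(maxsize=2)
stages = [threading.Thread(target=producer, args=(q1,)),
          threading.Thread(target=transform, args=(q1, q2)),
          threading.Thread(target=consumer, args=(q2,))]
for t in stages: t.start()
for t in stages: t.join()
```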

Data may be accessed from a variety of sources. These Resources may include social media feeds, sensor data, biometric readings, real-time market quotes, or internal operational data. The Resource is responsible for converting the raw format into the structured key/value format suitable for data-parallel processing.

Key/Value Stores (patent pending) represent a reliable data structure that takes advantage of a paradigm already familiar to developers using the MapReduce model. These data structures exist across multiple nodes but are treated as a single data structure. A Key/Value Store manages its own memory, spilling to disk only when necessary. Read and write access to the data structure is based on key/value pairs, where the key represents an index and the value represents the object to store or retrieve.

These data structures fall into three categories:

- Distributed. The data structure is partitioned among the nodes, and each node owns a unique subset of the key space.
- Replicated. The data structure is mirrored across all nodes. Each node can read the entire key space, but write access is partitioned to a subset of the key space.
- Shared. The data structure is readable and writable by all nodes, but the physical memory for storing the values is partitioned among the nodes.

Within these three categories, several data structures can be implemented, including hash tables, distributed arrays, and trees. Key/Value Stores and Flowlets can be connected in a directed graph.

Transform is a common term in computer science, representing the conversion of one data format into another. If Key/Value Stores represent the data structures in a HAMR workflow, then Transforms represent the algorithms. Developers implement their application logic with a Transform, possibly interacting with Key/Value Stores. In the traditional MapReduce model, the relationship between the map and reduce phases is fixed. That is what makes it synchronous: all outputs of the mappers are pulled by the corresponding reducers.

Figure 3: Flowlets consist of Resources, Transforms, and Key/Value Stores, connected in a directed graph that runs from raw input, through intermediate Flowlets (with optional iteration), to final output.

With the HAMR approach, however, users have the flexibility to define the relationships among multiple Flowlets, which can dynamically control the flow of data between the phases.
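As a toy illustration of the Distributed category described above, the following Python sketch shows hash-based ownership of the key space across nodes. Hash partitioning is a common way to assign owners and is offered here as an assumption, not a description of HAMR's internals.

```python
class DistributedKVStore:
    """Toy model: the key space is partitioned so that each node
    owns a unique subset, determined by hashing the key."""

    def __init__(self, num_nodes):
        # One plain dict stands in for each node's local memory.
        self.nodes = [{} for _ in range(num_nodes)]

    def owner(self, key):
        # Hash partitioning: every key maps to exactly one owner node.
        return hash(key) % len(self.nodes)

    def put(self, key, value):
        self.nodes[self.owner(key)][key] = value

    def get(self, key):
        return self.nodes[self.owner(key)].get(key)

store = DistributedKVStore(num_nodes=4)
store.put("movie:42", {"rating": 4.5})
print(store.owner("movie:42"), store.get("movie:42"))
```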

The HAMR approach adds multiple operators to the application developer's toolbox. These operators control iteration, streaming, real-time guarantees, and more. Operators are applied to a portion of the workflow graph. HAMR's execution model enables these Flowlets to be executed in memory in many cases, and it is not restricted by the amount of cache memory on the system. Instead, with flow control and fine-grain parallelism, Resources, Transforms, and Key/Value Stores interact so that computing resources can be adjusted for each Flowlet, leading to higher compute utilization and proper load balancing. By implementing the HAMR approach within their Hadoop cluster, information technology departments will have one solution to manage, lessening the burden of continually creating and updating separate datasets for batch and streaming analytics.

VI. Benchmarking Performance

HAMR's viability is best measured in performance benchmarks. University of Delaware graduate students and ETI performed these experiments, comparing HAMR to MapReduce. The teams ran four algorithms: K-Means, Word Count, Graph Search, and Classification.

Cluster Information
- Number of compute nodes: 16
- CPU count: 2
- CPU type: Intel Xeon E5-2620
- CPU clock: 2 GHz
- Memory: 32 GB
- Network type (a): 1 GbE
- Network type (b): 4x FDR InfiniBand
- Local disk type: SATA III
- Number of local disks: 5

Hadoop/HDFS Information
- Version: IDH 2.2
- Configured capacity: TB
- Live datanodes: 15

Sample Data Set
- Source of data: Movie Ratings Database, PUMA (Purdue MapReduce Benchmarks Suite), cgi?article=1438&context=ecetr
- Size: 300 GB

K-Means Benchmark (HAMR vs. MapReduce). In K-Means, each cluster prototype is the center of a Voronoi cell, and other observations are clustered within that cell. Cluster analysis is commonly used at every stage of marketing, for segmenting the population and targeting each segment with different offers. In a comparison of performance on a K-Means clustering algorithm, the improvement was nearly 7x in favor of HAMR.

Word Count Benchmark (HAMR vs. MapReduce). Word count is a fairly common algorithm for determining the number of words in a file or web page. This method is being used more frequently in processing electronic medical records. MapReduce performs well here, but HAMR still delivers a 1.0x to 1.5x improvement.
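For reference, one K-Means iteration maps naturally onto the map/reduce pattern, which is part of why it is a standard benchmark for both engines. The following is a minimal, single-process Python sketch of the algorithm in map/reduce style; it is illustrative only and is not the benchmarked implementation.

```python
import random

def kmeans(points, k, iterations=10):
    """One-dimensional K-Means expressed in map/reduce style."""
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # Map: emit (nearest-centroid-index, point) pairs.
        assignments = {}
        for p in points:
            idx = min(range(k), key=lambda i: abs(p - centroids[i]))
            assignments.setdefault(idx, []).append(p)
        # Reduce: merge each centroid's points into a new centroid (the mean).
        centroids = [sum(pts) / len(pts) for pts in
                     (assignments.get(i, [centroids[i]]) for i in range(k))]
    return centroids

data = [1.0, 1.2, 0.8, 9.7, 10.1, 10.3]
print(sorted(kmeans(data, k=2)))  # roughly [1.0, 10.0]
```

Each iteration requires a full map/shuffle/reduce pass; under Hadoop that means a new batch job per iteration, which is precisely the overhead an asynchronous dataflow engine can avoid.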

Benchmark Notes

All benchmarks were executed on the same cluster, with the same number of compute nodes and the same input data for both the Hadoop and the HAMR implementations. Sockets Direct Protocol (SDP) was not enabled on this system because a suitable driver could not be installed: one OFED release supports SDP but lacks native support for the installed Mellanox HCA, causing a performance degradation; another OFED release supports both SDP and the HCA but could not be built in the RHEL 6.4 environment; and the stock RHEL 6.4 kernel modules currently installed do not support SDP.

The movie ratings dataset is classified based on ratings, using anonymized movie-rating data of the form <movie_id: list{rater_id, rating}>. Random starting values are chosen for the cluster centroids.

Input format: {movie_id: userid1_rating1, userid2_rating2, ...}

Output format: K-Means produces two types of outputs:
(a) <centroid_num><{movie_id: userid1_rating1, userid2_rating2, ...}> (the list of all movies associated with a particular centroid)
(b) <centroid_num><{similarity_value}{centroid_movie_id}{num_members}{userid1_rating1, userid2_rating2, ...}> (the new centroid)

Graph Search Benchmark (HAMR vs. MapReduce). Graph search algorithms check the values of all nodes in a graph. This is a complex algorithm and demonstrates the power of parallel processing. HAMR outperformed MapReduce by nearly 30x. Graph search can be used in logistics, for example to find multiple routes on a map.

Classification Benchmark (HAMR vs. MapReduce). Machine learning commonly involves a classification algorithm. A user can develop a hypothesis by selecting from classified units that conform to the theory. One type of classifier is Naïve Bayes, which is commonly used in financial services and actuarial science to assess probability. HAMR outperforms MapReduce by more than 5x.

VII. Conclusion

Algorithms are the basis for mining insights from unstructured big data and for transforming data to conform with other systems. HAMR's performance improvements are promising and set the stage for the next generation of big data analytics.

Two meta-forces are applying tectonic pressure to Hadoop and the 2.0 ecosystem: users and systems.

Users. Many enterprise users are looking to stream their data processing directly from storage systems of record, which has not been practical with Hadoop alone. Storm (an additional tool within the Hadoop 2.0 ecosystem) allows streaming from multiple sources but excludes HDFS by default, an impractical way to maintain continuity after investing millions of dollars in Hadoop. That is why enterprise users are exploring alternatives.

Systems. MapReduce (synchronous, batch-only) software will be difficult to adapt to future-generation computing systems with many-core sockets and GPU accelerators, and to their related, increasingly difficult programming, efficiency, heterogeneity, and scalability challenges.

These conditions have created a climate that bodes well for the adoption of HAMR. This software not only allows multiple sources of data to be processed in real-time streaming mode, but also does not exclude HDFS. HAMR is currently in beta testing, and ETI invites colleagues to test it in real-world conditions prior to its launch in Q.

VIII. References

[1] J. Dean and S. Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." Comm. ACM, 51(1), January 2008.
[2] S. Delap. "Yahoo's Doug Cutting on MapReduce and the Future of Hadoop." [Online], September 21. Accessed February 27, 2014.
[3] Google. "Google Basics." [Online]. Accessed February 27, 2014.
[4] Google Developers. "MapReduce for App Engine." [Online], last modified February 26. Accessed February 27, 2014.
[5] Guruzon.com. "History of MapReduce." [Online], last modified June 1. Accessed February 27, 2014. /6/introduction/map-reduce/history-of-map-reduce.
[6] Intel. "Moore's Law and Intel Innovation." [Online]. Accessed February 27, 2014.
[7] G. Koplin. "MapReduce: Simplified Data Processing on Large Clusters" (master's course report, University of Wisconsin, Madison, 2006). Accessed February 27, 2014. wisc.edu/~dusseau/classes/cs739s06/writeups/mapreduce.pdf.
[8] M. Ludloff. "How Real is Real-Time and What the Heck is Streaming Analytics?" [Online], February 22, 2011. Accessed February 27, 2014. builders.com/2011/02/22/what-is-real-time-what-is-streaming-analytics/.
[9] M. Rouse. "Data Streaming." [Online], September. Accessed February 27, 2014.
[10] M. Rouse. "Key-value pair (KVP)." [Online], last modified August. Accessed February 27, 2014. techtarget.com/definition/key-value-pair.
