MapReduce everywhere. Carsten Hufe & Michael Hausenblas
1 MapReduce everywhere Carsten Hufe & Michael Hausenblas
2 About Carsten Hufe Big Data Consultant at comsysto Vodafone Telefonica / o2 Payback Hadoop Ecosystem, Distributed systems Committer of JumboDB
3 About Michael Hausenblas Chief Data Engineer at MapR, responsible for EMEA Background in large-scale data integration Using Hadoop and NoSQL since 2008 Apache Drill contributor Big Data advocate (lambda-architecture.net, sparkstack.org)
4 Outline Hadoop & MapReduce introduction Experiences from 'SmartSteps' JumboDB Some examples for MapReduce Future and vision
5 Big Data processing Conventional data processing (RDBMS-based) is a special case of Big Data processing (think: Newton's mechanics vs. relativity and quantum mechanics)
6 General observations Analytics becoming a critical component in business environments Base decisions on (a lot of) data Principle: keep all data around benefit from all data Human generated (think: Excel sheet, CRM system, etc.) Machine generated (think: mobile phone, etc.) Pioneered at Google and Amazon
7 First Principles Scaling out (horizontal) over scaling up (vertical) Commodity hardware Open Source software (Apache, etc.) Open, community-defined interfaces Schema on read Data locality
8 Schema on read
Schema on write: established (experience exists); strong typing (validations etc. on DB level); forces a fixed schema up-front; forces one correct view of the world; raw data is dismissed; less agile.
Schema on read: flexible interpretation of the data at load time (agility); raw data stays around; allows for unstructured, semi-structured and structured data; (typically) weak typing; schema handling at app level.
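A minimal sketch of the schema-on-read idea in Python (the event fields and consumers are hypothetical): the raw records stay untouched on disk, and each reader applies its own schema at read time instead of validating a fixed schema at write time.

```python
import json

# Raw events stay around untouched; each consumer interprets them
# with its own schema when reading. Field names are hypothetical.
raw_events = [
    '{"user": "alice", "url": "/home", "ts": 1400000000}',
    '{"user": "bob", "url": "/cart", "ts": 1400000060, "referrer": "/home"}',
]

def read_with_schema(lines, fields):
    """Interpret raw records against a caller-chosen schema at read time.
    Missing fields become None instead of failing write-time validation."""
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# Two consumers, two schemas, one raw dataset:
clicks = list(read_with_schema(raw_events, ["user", "url"]))
referrals = list(read_with_schema(raw_events, ["user", "referrer"]))
```

The agility claimed on the slide is visible here: adding the `referrer` field later required no migration of the existing raw data.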
9 Data locality move processing (code) to the data rather than the other way round why?
10 Hard disk trends
~1990: capacity 2.1 GB, price $157/GB, transfer rate 16 MB/s, about 2 min to read the whole disk
~2000: capacity 200 GB, price $1/GB, transfer rate 56 MB/s, about 58 min to read the whole disk
~2010: capacity 3,000 GB, price $0.05/GB, transfer rate 210 MB/s, about 4 h to read the whole disk
11 RDBMS vs Hadoop
Schema: RDBMS/MPP on write; Hadoop on read (on write also possible)
Workload: RDBMS interactive; Hadoop batch by default, but interactive solutions emerging
Interface: RDBMS SQL; Hadoop core is MapReduce, but SQL-on-Hadoop solutions emerging
Volume: RDBMS GB++; Hadoop PB++
Variety: RDBMS ETL to tabular; Hadoop no restrictions
Velocity: RDBMS limited; stock Hadoop also limited, but can be realised with frameworks like Kafka, Storm, etc.
Agility: RDBMS DBA/schema + ETL is the main bottleneck; Hadoop very quick roll-outs and results
$$$/TB: RDBMS >$20,000; Hadoop <$1,000
12 "Simple algorithms and lots of data trump complex models" Halevy, Norvig, and Pereira (Google), "The Unreasonable Effectiveness of Data", IEEE Intelligent Systems, 2009. So combining data delivers better, more accurate results. But how can I integrate all that data from my legacy applications? How do I keep all that data safe? How can I perform at that level of scale?
13 Distributed Storage Model: Google File System (GFS), designed to run on massive clusters of cheap machines, tolerates hardware failure; paper published in 2003. Distributed Compute Model: MapReduce, sends compute to the data on GFS rather than vice versa, vastly simplifies distributed programming; paper published in 2004. Both run on commodity hardware; costs scale linearly.
14 Distributed File System (HDFS) Map Reduce Runs on commodity hardware
15 Hadoop 101 Apache Hadoop is an open source software project that provides a major step toward meeting the big data challenge With Hadoop you can have thousands of disks on hundreds of machines with near linear scaling Uses commodity hardware, no need to purchase expensive or specialized hardware Handles Big Data, Petabytes and more
16 Hadoop History
17 Architecture MapReduce: Parallel computing Move the computation to the data Storage: Keeping track of data and metadata Data is sharded across the cluster Cluster management tools Applications and tools
18 Architecture
19 Nature of MapReduce-able problems: complex data, multiple data sources, lots of it. Nature of analysis: batch processing, parallel execution, data in a distributed file system with computation close to the data. Analysis applications: text mining, risk assessment, pattern recognition, sentiment analysis, collaborative filtering, prediction models.
20 Hadoop Distributed Filesystem
21 Hadoop Cluster Data Failures are expected and managed gracefully
22 HDFS NameNode Architecture
Data is conceptually record-oriented in the Hadoop programming framework. HDFS splits large data files into chunks (default size is 64 MB). Chunks are spread over multiple nodes in the cluster and are also replicated across the cluster for fault tolerance. Shared-nothing architecture. Chunks form a single namespace and are accessible universally. Moving computation to the data allows the Hadoop framework to achieve high data locality and avoid straining network bandwidth. Although files are split into 64 MB or 128 MB blocks, a file smaller than the block size does not occupy a full block on disk. Blocks are stored as standard files on the DataNodes, in a set of directories specified in Hadoop's configuration files. Without the metadata on the NameNode, there is no way to access the files in the HDFS cluster. (Diagram: a primary NameNode A and a standby NameNode B in front of a set of DataNodes.)
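The chunking and replica placement described above can be sketched in a few lines. This is a toy model under stated assumptions: real HDFS placement is rack-aware and guarantees replicas land on distinct nodes, and the DataNode names here are hypothetical.

```python
from itertools import cycle

BLOCK_SIZE = 64 * 1024 * 1024  # the 64 MB default from the slide

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs. The last block may be smaller:
    a file smaller than the block size only occupies what it needs."""
    blocks, offset = [], 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def place_replicas(blocks, datanodes, replication=3):
    """Toy round-robin replica placement across DataNodes
    (real HDFS is rack-aware and avoids duplicate nodes)."""
    nodes = cycle(datanodes)
    return {i: [next(nodes) for _ in range(replication)]
            for i in range(len(blocks))}

blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file
layout = place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"])
```

A 200 MB file yields three full 64 MB blocks plus one 8 MB tail block, each tracked with three replicas, which is exactly the NameNode's metadata role on the slide.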
23 The MapReduce Paradigm
24 (Diagram: data flows from a SAN server and other data sources into the Hadoop cluster, where a MapReduce program produces the result.)
25 MapReduce To use Hadoop, a query is expressed as MapReduce jobs MapReduce is a batch process MapReduce accesses an entire dataset, in parallel, in order to reduce seeks In conventional programs, seek time is generally rate-limiting MapReduce is a streaming process that is not limited by seeks MapReduce tasks are pure functions, meaning they are stateless Pure functions have no side effects and thus can be run in any order Pure functions can even be run multiple times if necessary MapReduce jobs are divided into different phases Map tasks Shuffle phase Reduce tasks
26 Inside MapReduce
Input: "The time has come," the Walrus said, / "To talk of many things: / Of shoes and ships and sealing-wax
Map: (the, 1), (time, 1), (has, 1), (come, 1), ...
Shuffle and sort: come [3, 2, 1]; has [1, 5, 2]; the [1, 2, 1]; time [10, 1, 3]
Reduce: (come, 6), (has, 8), (the, 4), (time, 14)
Pipeline: Input, Map, Shuffle and sort, Reduce, Output
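The word-count flow on this slide can be simulated in a few lines of Python; the three phase functions below are illustrative stand-ins for what the framework does on a cluster, not Hadoop's actual API.

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit one (word, 1) pair per word, independently per record
    cleaned = line.lower().replace(",", "").replace(":", "").replace('"', "")
    for word in cleaned.split():
        yield word, 1

def shuffle(pairs):
    # Shuffle and sort: group all emitted values by key,
    # as the framework would between map and reduce tasks
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducer: fold each key's value list into the final count
    return {key: sum(values) for key, values in grouped.items()}

lines = ['"The time has come," the Walrus said,',
         '"To talk of many things:',
         'Of shoes and ships and sealing-wax']
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
```

Because the mapper and reducer are pure functions, as the previous slide notes, they can run in any order, in parallel, or more than once, and the result is unchanged.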
27 MapReduce Key Phases Map phase Input files have been automatically broken into pieces Data is read on each node using large I/O operations for efficiency Mappers run locally to the data in this step, avoiding the need for most network traffic Each input record is transformed by the mapper independently, so they can all take place at the same time If your cluster isn't big enough to run them all at the same time, it can run them in multiple waves Output of the mapper is a key and a value The MapReduce framework takes care of handling the output and sending it to the right place
28 MapReduce Key Phases Shuffle Moves intermediate results to the reducers and collates Provides all communication between computing elements Rearranges data and involves network traffic Reduce Combines mapper outputs Computes final results Output is done using large writes Output of final reducers is stored to disk
29 What Happens in the Cluster? Disk I/O Highest during map phase when program is reading input data Another peak at end of MapReduce job when final output is written to disk by the reducers Network Shuffle rearranges data and involves large amounts of network traffic Memory Peak memory loads are typically during reduce phase Framework is merging map outputs, reducer is processing merged results Mapper may also have a memory usage peak
30 Disk I/O Network Memory t Input Map Shuffle and sort Reduce Output
31 The Hadoop ecosystem
32 The Hadoop ecosystem
33 Hive Background Started at Facebook Data was collected by nightly cron jobs into Oracle DB ETL via hand-coded python Grew from 10s of GBs (2006) to 1 TB/day new data (2007), now 10x that Source: cc-licensed slide by Cloudera
34 Hive Data Model Tables Typed columns (int, float, string, boolean) Also, list: map (for JSON-like data) Partitions For example, range-partition tables by date Buckets Hash partitions within ranges (useful for sampling, join optimization) Source: cc-licensed slide by Cloudera
35 Hive Example Hive looks similar to an SQL database. Relational join on two tables: a table of word counts from the Shakespeare collection and a table of word counts from the Bible. SELECT s.word, s.freq, k.freq FROM shakespeare s JOIN bible k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1 ORDER BY s.freq DESC LIMIT 10; Top words returned: the, I, and, to, of, a, you, my, in, is
36 Pig Latin Pig provides a higher-level language, Pig Latin, that: Increases productivity. In one test, 10 lines of Pig Latin replaced 200 lines of Java, and what took 4 hours to write in Java took 15 minutes in Pig Latin. Opens the system to non-Java programmers. Provides common operations like join, group, filter, sort. User Defined Functions are first-class citizens.
37 Pig Latin Script Example
Visits = load '/data/visits' as (user, url, time);
Visits = foreach Visits generate user, Canonicalize(url), time;
Pages = load '/data/pages' as (url, pagerank);
VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
UserPageranks = foreach UserVisits generate user, AVG(VP.pagerank) as avgpr;
GoodUsers = filter UserPageranks by avgpr > 0.5;
store GoodUsers into '/data/good_users';
Pig slides adapted from Olston et al.
38 Other ways to write MapReduce jobs: Cascading*, Scalding (tuples, Scala), Cascalog (Clojure/Java), Crunch* (functional, Java), and M/R frameworks for scripting languages such as Python, Ruby, etc. *) For details see David Whiting's excellent talk "Scalding the Crunchy Pig for Cascading into the Hive"
39 Hadoop 2.0 / YARN In a cluster there are resources (CPUs, RAM, disks) that need to be managed. In Hadoop 2.0, YARN replaces the MapReduce layer with a more general-purpose scheduler, allowing other types of workloads (e.g., graph processing, MPI) to run alongside MapReduce jobs.
40 MapReduce everywhere? Hadoop MongoDB R Studio Java On-Demand Aggregation
41 Smart Steps Prototype Analyze and visualize mobile data Footfalls Catchment Segmentation by socio-demographic characteristics
42 Smart Steps
43 Smart Steps - challenges Provide a data pipeline Handle huge amounts of data that can be queried on demand Limited hardware resources Provide near 'real-time' performance
44 Smart Steps 1st iteration Web-Application (Java, Spring MVC) MongoDB 2.2 as storage MapReduce with MongoDB and JavaScript MongoDB Sharded
45 Sample MongoDB Document { "cellid": "12345", "date": " ", "hour": 0, "visitors": 15000, "age": { "to10": 1111, "to20": 2222, "to30": 3333 }, "gender": { "male": 4444, "female": 5555 } }
46 Sample MongoDB MapReduce: result for all cells for a month
var mapFunction = function() {
    var yearAndMonth = getYearAndMonth(this.date); // e.g. yearAndMonth =
    emit(yearAndMonth, this.visitors);
};
var reduceFunction = function(yearAndMonth, visitors) {
    return Array.sum(visitors);
};
db.footfalls.mapReduce(mapFunction, reduceFunction, { out: "map_reduce_result" })
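For readers who want to see what the map and reduce above compute, here is the same per-month visitor sum as plain Python over hypothetical footfall documents (the field names mirror the sample document shown earlier; the dates and values are made up for illustration).

```python
from collections import defaultdict

# Hypothetical footfall documents mirroring the slide's schema
footfalls = [
    {"cellid": "12345", "date": "2013-01-05", "visitors": 15000},
    {"cellid": "12346", "date": "2013-01-06", "visitors": 5000},
    {"cellid": "12345", "date": "2013-02-01", "visitors": 7000},
]

def get_year_and_month(date):
    # Stand-in for the getYearAndMonth helper in the MongoDB job
    return date[:7]  # e.g. "2013-01"

# map: emit (yearAndMonth, visitors); reduce: sum values per key
totals = defaultdict(int)
for doc in footfalls:
    totals[get_year_and_month(doc["date"])] += doc["visitors"]
```

The MongoDB job distributes exactly this grouping and summation across shards, which is where the single-threaded JavaScript engine became the bottleneck described on the next slide.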
47 MongoDB MapReduce Result { "results": [ { "_id": " ", "value": }, { "_id": " ", "value": }, { "_id": " ", "value": } ] }
48 Result MapReduce in MongoDB was single-threaded per server instance (version 2.2) The JavaScript engine was slow Slow import Indexes must fit into memory Response times too long
49 Smart Steps 2nd iteration Web-Application (Java, Spring MVC) MongoDB as storage MapReduce with Hadoop MongoDB Sharded
50 Result Slow import Indexes must fit in memory Response times too long Not blocked due to single-thread issues
51 Smart Steps 3rd iteration Web-Application (Java, Spring MVC) MemCached as storage MapReduce with Hadoop Multiple Memcached instances
52 Result Very fast import Entire data must fit into memory Very good response times Very expensive many instances required Data is not persistent
53 GAME CHANGED
54 Smart Steps Last iteration Budget reduced, not enough hardware available How to provide the same amount of data with the new budget? Reducing data will cause loss of user acceptance
55 Smart Steps Last iteration Web-Application (Java, Spring MVC) JumboDB as storage MapReduce with Hadoop One server instance for application and storage!
56 Result Very fast import Low memory footprint (less than 5% of index information) Very good response times Very cheap Provides data workflow and versioning
57 Final architecture RAW events Calculate business aspects Hadoop ecosystem Aggregated data s* pect s a l a hnic Precalculated database c te te a l u c l Ca * sort, index, compress Binary copy Read from Smart Steps Reporting Application
58 Benchmark comparison: JumboDB, MemCached, MongoDB
Setup: JumboDB 1 server, capacity 2 TB EBS (with compression ~10 TB); MongoDB 1 server, capacity 2 TB EBS; MemCached 4 servers, capacity 4x 70 GB RAM.
Import of 70 GB data: JumboDB 7 min 30 s (throughput ~156 MB/s); MongoDB 20 h 18 min (throughput ~0.95 MB/s, 2075 datasets/s); MemCached 6 min 20 s (throughput ~184 MB/s).
Querying with 40 criteria: JumboDB 1220 ms (with data transfer to client); MongoDB aborted after 20 minutes; MemCached 2336 ms (with data transfer to client).
Conclusion: JumboDB fast import, fast querying, only one machine; MongoDB slow import, querying not possible; MemCached fast import, fast querying, but 4 machines.
59 Cost comparison: storing 10 TB data
JumboDB: 1 server with 70 GB RAM; capacity 2 TB EBS (with compression >10 TB); RAID EBS volume with IOPS; about $2,000. Conclusion: cheapest option.
MongoDB: min. 7 servers with 70 GB RAM; capacity 6x 2 TB EBS + 1 MongoS; RAID EBS with IOPS (at the time it was not possible to use MongoDB with more than 500 GB per volume); about $14,000 plus EBS and IOPS. Conclusion: relatively cheap.
MemCached: 143 servers with 70 GB RAM; capacity 143x 70 GB RAM; no EBS volumes. Conclusion: expensive, but no extra EBS costs.
Only server costs are compared, because EBS volume costs are hard to calculate.
60 Reasons for the good performance The database is calculated in a distributed environment Data is immutable No reorganisation of data is required during read operations Data is pre-organised for the main use cases (e.g. sorted by geographic region, so data can be read sequentially) Data is compressed, which uses storage more effectively and speeds up read operations
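The last two points, pre-sorting plus compression, can be sketched as follows. This is an illustration under stated assumptions (short keys stand in for geographic cell ids; this is not JumboDB's actual on-disk format): a range query becomes one sequential decompressing scan, with no index lookups and no reorganisation of the immutable data.

```python
import gzip
import io

# Records are sorted by key once at import time, then compressed.
records = sorted([("u281z", 120), ("u282b", 45), ("d567", 900), ("u281x", 10)])
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    for key, value in records:
        f.write(f"{key}\t{value}\n".encode())

def range_scan(blob, lo, hi):
    """Sequentially decompress and keep keys in [lo, hi): the read is
    one streaming pass over immutable, compressed, pre-sorted data."""
    out = []
    with gzip.GzipFile(fileobj=io.BytesIO(blob)) as f:
        for line in f:
            key, value = line.decode().rstrip("\n").split("\t")
            if lo <= key < hi:
                out.append((key, int(value)))
    return out

hits = range_scan(buf.getvalue(), "u281", "u282")
```

Compression shrinks the bytes that must come off disk, and sorting turns a geographic query into a contiguous run of records, which is the sequential-read effect the slide credits for the speed.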
61 Github: Wiki:
62
63 MapReduce example: how to sum on cell base
64 What is a Geohash? Converts coordinates (latitude/longitude) into a single hash value Invented by Gustavo Niemeyer
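A compact sketch of the standard geohash algorithm, assuming the usual scheme: alternately bisect the longitude and latitude ranges to produce bits, then base32-encode each group of five bits. The bit-precision integer variant mentioned later in the talk simply stops before the base32 step.

```python
# Standard geohash base32 alphabet (no a, i, l, o)
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, length=5):
    """Encode a lat/lon pair as a geohash string of `length` characters.
    Even bits refine longitude, odd bits latitude."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    use_lon = True
    while len(bits) < length * 5:
        rng = lon_range if use_lon else lat_range
        val = lon if use_lon else lat
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid  # keep the upper half
        else:
            bits.append(0)
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
    chars = []
    for i in range(0, len(bits), 5):
        n = 0
        for b in bits[i:i + 5]:
            n = n * 2 + b  # pack five bits into one base32 digit
        chars.append(BASE32[n])
    return "".join(chars)
```

Nearby points share hash prefixes, which is why emitting a truncated geohash as the map key (as in the MongoDB example below) groups visits into grid cells.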
65 How does it work?
66 How does it work?
67 How does it work?
68 Example London, Piccadilly Circus bit precision Integer value: Geohash String: u281z
69 MongoDB Geohash Example
var mapFunction = function() {
    var geohash = getGeohash24BitPrecision(this.latitude, this.longitude);
    emit(geohash, 1); // emit a count of 1 per visit
};
var reduceFunction = function(geohash, counts) {
    // reduce may run repeatedly on partial results, so it must sum
    // the values rather than count the length of the value array
    return Array.sum(counts);
};
db.visits.mapReduce(mapFunction, reduceFunction, { out: "users_per_grid_cell" })
70 MongoDB MapReduce Result { "results": [ { "_id": "u281z", "value": }, { "_id": "u282b", "value": }, { "_id": "d567", "value": } ] }
71 Future MapReduce: Spark Real-time (from Kafka to Storm) Lambda Architecture SQL on Hadoop: Impala, Apache Drill, Presto
72 Thank you for your attention!
73 Smart Steps Workflow Version 1: Here is my first delivery with 'January' data for 'Collection 1' Version 2: Made some optimizations, data should be better Version 3: There was a mistake in the latest delivery. I corrected it! Data Scientist I have new 'February' data and added a new collection 'Collection 2'. Please extend the 'January' data with it. One month later... jumbodb Version 4: New 'February' data for 'Collection 1' and 'Collection 2' Version 5: Made some optimizations to 'February' data. Version 6: Data is much cooler! Data Scientist DAMN! The latest delivery was faulty. I am not able to fix it quickly! Please roll back to 'Version 5'. Reporting application
74 Smart Steps
Data Warehousing and Analytics Infrastructure at Facebook Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com Overview Challenges in a Fast Growing & Dynamic Environment Data Flow Architecture,
More informationMaximizing Hadoop Performance and Storage Capacity with AltraHD TM
Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created
More informationRelational Processing on MapReduce
Relational Processing on MapReduce Jerome Simeon IBM Watson Research Content obtained from many sources, notably: Jimmy Lin course on MapReduce. Our Plan Today 1. Recap: Key relational DBMS notes Key Hadoop
More informationLecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl
Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind
More informationA very short Intro to Hadoop
4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,
More informationMap Reduce / Hadoop / HDFS
Chapter 3: Map Reduce / Hadoop / HDFS 97 Overview Outline Distributed File Systems (re-visited) Motivation Programming Model Example Applications Big Data in Apache Hadoop HDFS in Hadoop YARN 98 Overview
More informationIntro to Map/Reduce a.k.a. Hadoop
Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by
More informationReal Time Big Data Processing
Real Time Big Data Processing Cloud Expo 2014 Ian Meyers Amazon Web Services Global Infrastructure Deployment & Administration App Services Analytics Compute Storage Database Networking AWS Global Infrastructure
More informationWorkshop on Hadoop with Big Data
Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly
More informationScaling Out With Apache Spark. DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf
Scaling Out With Apache Spark DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf Your hosts Mathijs Kattenberg Technical consultant Jeroen Schot Technical consultant
More information16.1 MAPREDUCE. For personal use only, not for distribution. 333
For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several
More informationPrepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
More informationTHE HADOOP DISTRIBUTED FILE SYSTEM
THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,
More informationParquet. Columnar storage for the people
Parquet Columnar storage for the people Julien Le Dem @J_ Processing tools lead, analytics infrastructure at Twitter Nong Li nong@cloudera.com Software engineer, Cloudera Impala Outline Context from various
More informationParallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model
More informationOracle Big Data SQL Technical Update
Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical
More informationAlternatives to HIVE SQL in Hadoop File Structure
Alternatives to HIVE SQL in Hadoop File Structure Ms. Arpana Chaturvedi, Ms. Poonam Verma ABSTRACT Trends face ups and lows.in the present scenario the social networking sites have been in the vogue. The
More informationArchitectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase
Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform
More informationBringing Big Data Modelling into the Hands of Domain Experts
Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1 Data is the sword of the
More informationPlay with Big Data on the Shoulders of Open Source
OW2 Open Source Corporate Network Meeting Play with Big Data on the Shoulders of Open Source Liu Jie Technology Center of Software Engineering Institute of Software, Chinese Academy of Sciences 2012-10-19
More informationHow to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning
How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume
More informationA Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
More informationHiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group
HiBench Introduction Carson Wang (carson.wang@intel.com) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is
More informationInternational Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
More informationSuresh Lakavath csir urdip Pune, India lsureshit@gmail.com.
A Big Data Hadoop Architecture for Online Analysis. Suresh Lakavath csir urdip Pune, India lsureshit@gmail.com. Ramlal Naik L Acme Tele Power LTD Haryana, India ramlalnaik@gmail.com. Abstract Big Data
More informationDistributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms
Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes
More informationWelcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop
More informationNoSQL for SQL Professionals William McKnight
NoSQL for SQL Professionals William McKnight Session Code BD03 About your Speaker, William McKnight President, McKnight Consulting Group Frequent keynote speaker and trainer internationally Consulted to
More informationBig Data Analytics - Accelerated. stream-horizon.com
Big Data Analytics - Accelerated stream-horizon.com StreamHorizon & Big Data Integrates into your Data Processing Pipeline Seamlessly integrates at any point of your your data processing pipeline Implements
More informationPetabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013
Petabyte Scale Data at Facebook Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013 Agenda 1 Types of Data 2 Data Model and API for Facebook Graph Data 3 SLTP (Semi-OLTP) and Analytics
More informationHadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop, Why? Need to process huge datasets on large clusters of computers
More informationCSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
More informationBig Data Workshop. dattamsha.com
Big Data Workshop About Praveen Has more than15 years of experience working on various technologies. Is a Cloudera Certified Developer for Apache Hadoop CDH4 (CCD-410) with 95% score and got through the
More informationCS54100: Database Systems
CS54100: Database Systems Cloud Databases: The Next Post- Relational World 18 April 2012 Prof. Chris Clifton Beyond RDBMS The Relational Model is too limiting! Simple data model doesn t capture semantics
More informationCan the Elephants Handle the NoSQL Onslaught?
Can the Elephants Handle the NoSQL Onslaught? Avrilia Floratou, Nikhil Teletia David J. DeWitt, Jignesh M. Patel, Donghui Zhang University of Wisconsin-Madison Microsoft Jim Gray Systems Lab Presented
More informationBig Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel
Big Data and Analytics: Getting Started with ArcGIS Mike Park Erik Hoel Agenda Overview of big data Distributed computation User experience Data management Big data What is it? Big Data is a loosely defined
More information