Data Management Using MapReduce
|
|
- Valentine Harper
- 7 years ago
- Views:
Transcription
1 Data Management Using MapReduce M. Tamer Özsu University of Waterloo CS742-Distributed & Parallel DBMS M. Tamer Özsu 1 / 24
2 Basics For data analysis of very large data sets Highly dynamic, irregular, schemaless, etc. SQL too heavy Embarrassingly parallel problems New, simple parallel programming model Data structured as (key, value) pairs E.g. (doc-id, content), (word, count), etc. Functional programming style with two functions to be given: Map(k1,v1) list(k2,v2) Reduce(k2, list (v2)) list(v3) Implemented on a distributed file system (e.g., Google File System) on very large clusters CS742-Distributed & Parallel DBMS M. Tamer Özsu 2 / 24
3 Map Function User-defined function Processes input key/value pairs Produces a set of intermediate key/value pairs Map function I/O Input: read a chunk from distributed file system (DFS) Output: Write to intermediate file on local disk MapReduce library Executes function Groups together all intermediate values with the same key (i.e., generates a set of lists) Passes these lists to reduce functions Effect of function Processes and partitions input data Builds a distributed (trnasparent to user) Similar to group by operaton in SQL CS742-Distributed & Parallel DBMS M. Tamer Özsu 3 / 24
4 Reduce Function User-defined function Accepts one intermediate key and a set of values for that key (i.e., a list) Merges these values together to form a (possibly) smaller set Typically, zero or one output value is generated per invocation Reduce function I/O Input: read from intermediate files using remote reads on local files of corresponding per nodes Output: Each reducees writes its output as a file back to DFS Effect of function Similar to aggregation operaton in SQL CS742-Distributed & Parallel DBMS M. Tamer Özsu 4 / 24
5 MapReduce Processing Input data set. Map Map Map (k 1, v) (k 2, v) (k 2, v) (k 2, v) (k 1, v) Group by k Group by k (k 1, (v, v, v)) Reduce (k 1, (v, v, v, v)) Reduce Output data set Map (k 1, v) (k 2, v) CS742-Distributed & Parallel DBMS M. Tamer Özsu 5 / 24
6 Example 1 Assume you are reading the monthly average temperatures for each of the 12 months of a year for a bunch of cities, i.e., each input is the name of the city (key) and the average monthly temperature. Compute the average annual temperature for each city. Map: Input: City, Month, MonthAvgTemp 1. Create key/value pairs Output: City, MonthAvgTemp Reduce: Input: City, MonthAvgTemp 1. Sort to get City, list(monthavgtemp) (i.e., it combines in a list the monthly average temperatures for a given city) 2. Compute average over list(monthavgtemp) Output: City, AnnualAvgTemp CS742-Distributed & Parallel DBMS M. Tamer Özsu 6 / 24
7 Example 2 Consider EMP (ENAME, TITLE, CITY) Query: Map: SELECT CITY, COUNT( ) FROM EMP WHERE ENAME LIKE \%Smith GROUP BY CITY I n p u t : ( TID, emp ), Output : ( CITY, 1 ) i f emp.ename l i k e \%Smith return ( CITY, 1 ) Reduce: I n p u t : ( CITY, l i s t ( 1 ) ), Output : ( CITY,SUM( l i s t ( 1 ) ) ) return ( CITY,SUM( 1 ) ) CS742-Distributed & Parallel DBMS M. Tamer Özsu 7 / 24
8 MapReduce Architecture Master Worker Map Process Input Module Map Module Combine Module Partition Module Worker Map Process Input Module Map Module Combine Module Partition Module Worker Map Process Input Module Map Module Combine Module Partition Module Scheduler Worker Reduce Process Group Module Reduce Module Output Module Worker Reduce Process Group Module Reduce Module Output Module CS742-Distributed & Parallel DBMS M. Tamer Özsu 8 / 24
9 MapReduce: Simplified Data Processing on Large Clusters Execution Flow with Architecture (1) fork User Program (1) fork (1) fork (2) assign Master (2) assign reduce worker split 0 split 1 split 2 (3) read (4) local write worker (5) remote (5) read worker (6) write output file 0 split 3 split 4 worker output file 1 worker Input files Map phasr Intermediate files (on local disks) Reduce phase Output files Fig. 1. Execution overview. From: J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Comm. ACM, CS742-Distributed & Parallel DBMS M. Tamer Özsu 9 / 24
10 Characteristics Flexibility User can write any and reduce function code No need to know how to parallelize Scalability Elastic scalability Automatic load balancing Efficiency Simple: parallel scan No database loading Fault tolerance Worker failure Master pings workers periodically; assumes failure if no response Tasks (both and reduce) on failed workers scheduled on a different worker node Master failure Checkpoints of master state Recovery after failure progress can halt Replication on distributed data store CS742-Distributed & Parallel DBMS M. Tamer Özsu 10 / 24
11 Hadoop Most popular MapReduce implementation developed by Yahoo! Two components Processing engine HDFS: Hadoop Distributed Storage System others possible Can be deployed on the same machine or on different machines Processes Job tracker: hosted on the master node and implements the schedule Task tracker: hosted on the worker nodes and accepts tasks from job tracker and executes them HDFS Name node: stores how data are partitioned, monitors the status of data nodes, and data dictionary Data node: Stores and manages data chunks assigned to it Worker 1 Name Node Worker n MapReduce Task Tracker Job Tracker Task Tracker HDFS Data Node Name Node Data Node CS742-Distributed & Parallel DBMS M. Tamer Özsu 11 / 24
12 Hadoop UDF Functions Phase Name Function InputFormat::getSplit Partition the input data into different splits. Each split is processed by a per Map and may consist of several chunks. RecordReader::next Define how a split is divided into items. Each item is a key/value pair and used as the input for the function. Mapper:: Users can customize the function to process the input data. The input data are transformed into some intermediate key/value pairs. WritableComparable::compareTo The comparison function for the key/- value pairs. Job::setCombinerClass Specify how the key/value pair are aggregated locally. Shuffle Job::setPartitionerClass Specify how the intermediate key/value pairs are shuffled to different reducers. Job::setGroupingComparatorClass Specify how the key/value pairs are Reduce grouped in the reduce phase. Reducer::reduce Users write their own reduce functions to perform the corresponding jobs. CS742-Distributed & Parallel DBMS M. Tamer Özsu 12 / 24
13 MapReduce Implementations Name Language File System Index Master Server Multiple Job Support Hadoop Java HDFS No Name Node and Job Yes Tracker Disco Python and Erlang Distributed Index Disco Server Skynet Ruby MySQL or No Any node in the cluster No Unix File System FileMap Shell Unix File No Any node in the cluster No and Perl Scripts System Twister Java Unix File No One master node in broker Yes System network Cascading Java HDFS No Name Node and Job Yes Tracker No No CS742-Distributed & Parallel DBMS M. Tamer Özsu 13 / 24
14 MapReduce Languages Hive Pig Language Declarative SQL-like Dataflow Data model Nested Nested UDF Supported Supported Data partition Supported Not supported Interface Command line, web, Command line JDBC/ODBC server Query optimization Rule based Rule based Metastore Supported Not supported CS742-Distributed & Parallel DBMS M. Tamer Özsu 14 / 24
15 MapReduce Implementations of Database Operators Select and Project can be easily implemented in the function Aggregation is not difficult (see next slide) Join requires more work MapReduce join implementations Repartition join θ-join Equi-join Semi-join Broadcast join Map-only join Similarity join Partition join Multiple MapReduce jobs Multi-way join Replicated join CS742-Distributed & Parallel DBMS M. Tamer Özsu 15 / 24
16 Aggregation Extracting aggregation attribute (Aid) Grouping by Aid Applying the aggregation function for the tuples with the same Aid Mapper 1 R 1 R1 2 R2 Mapper 2 R 3 R3 4 R4 Aid Value 1 R1 2 R2 Aid Value 1 R3 2 R4 Partitioning by Aid (Round Robin) Reducer 1 Aid 1 Reducer 2 Aid 2 Value R1 R3 Value R2 R4 reduce reduce Result 1, f(r1, R3) Result 2, f(r2, R4) Map Phase Reduce Phase CS742-Distributed & Parallel DBMS M. Tamer Özsu 16 / 24
17 θ-join Baseline implementation of R(A, B) S(B, C) 1. Partition R and assign each partition to pers 2. Each per takes a, b tuples and converts them to a list of key/value pairs of the form (b, a, R ) 3. Each reducer pulls the pairs with the same key 4. Each reducer joins tuples of R with tuples of S CS742-Distributed & Parallel DBMS M. Tamer Özsu 17 / 24
18 θ-join If θ equals = (i.e., equijoin) Repartition join Semijoin-based join Map-only join If θ equals Randomly assigning bucket ID (Bid) Grouping by Bid Aggregate tuples based on origins Local θ-join (θ is ) Mapper 1 R 1 R1 2 R2 3 R3 4 R4 Mapper 1 S 1 S1 2 S2 3 S3 4 S4 Bid Tuple 1 R,1,R1 4 R,2,R2 3 R,3,R3 2 R,2,R4 Bid Tuple 4 S,1,S1 2 S,2,S2 3 S,3,S3 1 S,2,S4 Partitioning by Bid (Round Robin) Reducer 1 Bid Tuple R,1,R1 1 S,4,S4 R,3,R3 3 S,3,S3 Reducer 2 reduce Origin Tuple R 1,R1 R 4,R4 Origin Tuple S 3,S3 S 4,S4 Result R1 S3 R1 S4 R4 S3 Map Phase Reduce Phase CS742-Distributed & Parallel DBMS M. Tamer Özsu 18 / 24
19 Repartition Join Tagging origins Grouping by keys Local join Mapper 1 R 1 R1 2 R2 3 R3 4 R4 Mapper 1 S 1 S1 2 S2 3 S3 4 S4 1 R,R1 2 R,R2 3 R,R3 4 R,R4 1 S,S1 2 S,S2 3 S,S3 4 S,S4 Partitioning by key (Round Robin) Reducer 1 Key 1 3 Reducer 2 Key 2 4 Tuple R,R1 S,S1 R,R3 S,S3 Tuple R,R2 S,S2 R,R4 S,S4 reduce reduce Result R1 S1 R3 S3 Result R2 S2 R4 S4 Map Phase Reduce Phase CS742-Distributed & Parallel DBMS M. Tamer Özsu 19 / 24
20 Semijoin-based Join Extracting join keys Broadcasting keys of R to all the splits of S and join S with keys of R Broadcasting the results of the previous job (S) to all the splits of R, and locally joining R with S R 1 R1 3 R2 1 R3 4 R4 MapReduce Key Mapper 1 Key S 1 S1 2 S2 Mapper 2 Key S 3 S3 4 S4 1 S1 3 S3 4 S4 Mapper 1 S 1 S1 3 S3 4 S4 R 1 R1 2 R2 Mapper 2 Result R1 S1 R2 S3 Job 1 Full MapReduce job Job 2 Map-only job Job 3 Map-only job CS742-Distributed & Parallel DBMS M. Tamer Özsu 20 / 24
21 Map-only Join Broadcast join: If inner relation outer relation no shuffling Map phase similar to third job of semijoin-based join Each per loads the full inner table to build an in-memory hash; scan the outer relation Trojan join: Relations are already co-partitioned on the join key all tuples of both relations are co-located on the same node Co-partitioning implemented by one job Scheduler loads co-partitioned data chunks in the same per to perform a local join Co-partitioned Split Co-group Co-group H R Data R H S Data S H R Data R H S Data S Footer CS742-Distributed & Parallel DBMS M. Tamer Özsu 21 / 24
22 Multiway Join Multiple MapReduce jobs R S T (R S) T Each join implemented in one MapReduce job Join ordering problem Replicated join Single MapReduce job CS742-Distributed & Parallel DBMS M. Tamer Özsu 22 / 24
23 Replicated Join Generating keys and tagging origins Joining the three tables locally Mapper 1 R Rid Value 1 C1 2 C2 Mapper 2 T Rid Sid Value 1 1 L1 1 2 L2 2 2 L3 Mapper 3 S Sid Value 1 O1 2 O2 1, null C,C1 2, null C,C2 1,1 L,L1 1,2 L,L2 2,2 L,L3 Key null,1 null,2 Value O,O1 O,O2 Partitioning by key (Round Robin) Shuffling tuples to multiple reducers if necessary Reducer 1 1,null C, C1 1,1 L, L1 null,1 O, O1 Reducer 2 1,null C, C1 1,2 L, L2 null,2 O, O2 Reducer 3 Key 2,null null,1 Reducer 4 Value C, C2 O, O1 2,null C, C2 2,2 L, L3 null,2 O, O2 reduce reduce reduce reduce Result C1,L1,O1 Result C1,L2,O2 Result Result C2,L2,O2 Map Phase Reduce Phase CS742-Distributed & Parallel DBMS M. Tamer Özsu 23 / 24
24 DBMS on MapReduce HadoopDB Llama Cheetah Language SQL-like Simple interface SQL Storage Row store Column store Hybrid store Data compression No Yes Yes Indexing Query optimization parti- Horizontally tioned Data partition Local index in each database instance Rule based optimization plus local optimization by PostgreSQL Vertically partitioned Horizontally partitioned at chunk level No index Local index for each data chunk Column-based optimization, Multi-query optimiza- late tion, materialized materialization and views processing multiway join in one job CS742-Distributed & Parallel DBMS M. Tamer Özsu 24 / 24
Jeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
More informationA Distributed Data Management Using MapReduce
A Distributed Data Management Using MapReduce FENG LI, National University of Singapore BENG CHIN OOI, National University of Singapore M. TAMER ÖZSU, University of Waterloo SAI WU, Zhejiang University
More informationHadoop Job Oriented Training Agenda
1 Hadoop Job Oriented Training Agenda Kapil CK hdpguru@gmail.com Module 1 M o d u l e 1 Understanding Hadoop This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform. 1.1 Module
More informationMapReduce Jeffrey Dean and Sanjay Ghemawat. Background context
MapReduce Jeffrey Dean and Sanjay Ghemawat Background context BIG DATA!! o Large-scale services generate huge volumes of data: logs, crawls, user databases, web site content, etc. o Very useful to be able
More informationParallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel
Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:
More informationMap Reduce & Hadoop Recommended Text:
Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately
More informationBig Data Processing with Google s MapReduce. Alexandru Costan
1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:
More informationHadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
More informationMapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example
MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design
More informationA programming model in Cloud: MapReduce
A programming model in Cloud: MapReduce Programming model and implementation developed by Google for processing large data sets Users specify a map function to generate a set of intermediate key/value
More informationProgramming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview
Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce
More informationBig Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park
Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable
More informationSystems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13
Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Hadoop Ecosystem Overview of this Lecture Module Background Google MapReduce The Hadoop Ecosystem Core components: Hadoop
More informationLecture Data Warehouse Systems
Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores
More informationCSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
More informationAdvanced Data Management Technologies
ADMT 2015/16 Unit 15 J. Gamper 1/53 Advanced Data Management Technologies Unit 15 MapReduce J. Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Acknowledgements: Much of the information
More informationPLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
More informationBig Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
More informationIntroduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
More informationChapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
More informationImplement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
More informationHow To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI
More informationBig Data With Hadoop
With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationIntegrating Big Data into the Computing Curricula
Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big
More informationDistributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
More informationCSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing
More informationBig Data Course Highlights
Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More information16.1 MAPREDUCE. For personal use only, not for distribution. 333
For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several
More informationChapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
More informationBIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig
BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig Contents Acknowledgements... 1 Introduction to Hive and Pig... 2 Setup... 2 Exercise 1 Load Avro data into HDFS... 2 Exercise 2 Define an
More informationINTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.
More informationData-Intensive Computing with Map-Reduce and Hadoop
Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion
More informationLarge scale processing using Hadoop. Ján Vaňo
Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine
More informationHadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
More informationBig Data Storage, Management and challenges. Ahmed Ali-Eldin
Big Data Storage, Management and challenges Ahmed Ali-Eldin (Ambitious) Plan What is Big Data? And Why talk about Big Data? How to store Big Data? BigTables (Google) Dynamo (Amazon) How to process Big
More informationClick Stream Data Analysis Using Hadoop
Governors State University OPUS Open Portal to University Scholarship Capstone Projects Spring 2015 Click Stream Data Analysis Using Hadoop Krishna Chand Reddy Gaddam Governors State University Sivakrishna
More informationNoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
More informationBig Data Technology Map-Reduce Motivation: Indexing in Search Engines
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process
More informationHow To Use Hadoop
Hadoop in Action Justin Quan March 15, 2011 Poll What s to come Overview of Hadoop for the uninitiated How does Hadoop work? How do I use Hadoop? How do I get started? Final Thoughts Key Take Aways Hadoop
More informationData Management in the Cloud -
Data Management in the Cloud - current issues and research directions Patrick Valduriez Esther Pacitti DNAC Congress, Paris, nov. 2010 http://www.med-hoc-net-2010.org SOPHIA ANTIPOLIS - MÉDITERRANÉE Is
More informationScaling Out With Apache Spark. DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf
Scaling Out With Apache Spark DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf Your hosts Mathijs Kattenberg Technical consultant Jeroen Schot Technical consultant
More informationHadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
More informationBig Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com
Big Data Primer Alex Sverdlov alex@theparticle.com 1 Why Big Data? Data has value. This immediately leads to: more data has more value, naturally causing datasets to grow rather large, even at small companies.
More informationCitusDB Architecture for Real-Time Big Data
CitusDB Architecture for Real-Time Big Data CitusDB Highlights Empowers real-time Big Data using PostgreSQL Scales out PostgreSQL to support up to hundreds of terabytes of data Fast parallel processing
More informationBig Data and Scripting map/reduce in Hadoop
Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb
More informationAmerican International Journal of Research in Science, Technology, Engineering & Mathematics
American International Journal of Research in Science, Technology, Engineering & Mathematics Available online at http://www.iasir.net ISSN (Print): 2328-3491, ISSN (Online): 2328-3580, ISSN (CD-ROM): 2328-3629
More informationAssociate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2
Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue
More informationSQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford
SQL VS. NO-SQL Adapted Slides from Dr. Jennifer Widom from Stanford 55 Traditional Databases SQL = Traditional relational DBMS Hugely popular among data analysts Widely adopted for transaction systems
More informationFacebook s Petabyte Scale Data Warehouse using Hive and Hadoop
Facebook s Petabyte Scale Data Warehouse using Hive and Hadoop Why Another Data Warehousing System? Data, data and more data 200GB per day in March 2008 12+TB(compressed) raw data per day today Trends
More informationImplementation and Analysis of Join Algorithms to handle skew for the Hadoop Map/Reduce Framework
Implementation and Analysis of Join Algorithms to handle skew for the Hadoop Map/Reduce Framework Fariha Atta MSc Informatics School of Informatics University of Edinburgh 2010 T Abstract he Map/Reduce
More informationHadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
More informationMapReduce (in the cloud)
MapReduce (in the cloud) How to painlessly process terabytes of data by Irina Gordei MapReduce Presentation Outline What is MapReduce? Example How it works MapReduce in the cloud Conclusion Demo Motivation:
More informationA bit about Hadoop. Luca Pireddu. March 9, 2012. CRS4Distributed Computing Group. luca.pireddu@crs4.it (CRS4) Luca Pireddu March 9, 2012 1 / 18
A bit about Hadoop Luca Pireddu CRS4Distributed Computing Group March 9, 2012 luca.pireddu@crs4.it (CRS4) Luca Pireddu March 9, 2012 1 / 18 Often seen problems Often seen problems Low parallelism I/O is
More informationThe Hadoop Framework
The Hadoop Framework Nils Braden University of Applied Sciences Gießen-Friedberg Wiesenstraße 14 35390 Gießen nils.braden@mni.fh-giessen.de Abstract. The Hadoop Framework offers an approach to large-scale
More informationBig Data and Scripting Systems build on top of Hadoop
Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform Pig is the name of the system Pig Latin is the provided programming language Pig Latin is
More informationExploring the Efficiency of Big Data Processing with Hadoop MapReduce
Exploring the Efficiency of Big Data Processing with Hadoop MapReduce Brian Ye, Anders Ye School of Computer Science and Communication (CSC), Royal Institute of Technology KTH, Stockholm, Sweden Abstract.
More informationBig Data Analytics: Hadoop-Map Reduce & NoSQL Databases
Big Data Analytics: Hadoop-Map Reduce & NoSQL Databases Abinav Pothuganti Computer Science and Engineering, CBIT,Hyderabad, Telangana, India Abstract Today, we are surrounded by data like oxygen. The exponential
More informationInternals of Hadoop Application Framework and Distributed File System
International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop
More informationIntroduction to Big Data! with Apache Spark" UC#BERKELEY#
Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"
More informationWorkshop on Hadoop with Big Data
Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly
More informationSession: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW
Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software 22 nd October 2013 10:00 Sesión B - DB2 LUW 1 Agenda Big Data The Technical Challenges Architecture of Hadoop
More informationBig Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing
Big Data Rethink Algos and Architecture Scott Marsh Manager R&D Personal Lines Auto Pricing Agenda History Map Reduce Algorithms History Google talks about their solutions to their problems Map Reduce:
More informationBIG DATA IN SCIENCE & EDUCATION
BIG DATA IN SCIENCE & EDUCATION SURFsara Data & Computing Infrastructure Event, 12 March 2014 Djoerd Hiemstra http://www.cs.utwente.nl/~hiemstra WHY BIG DATA? 2 Source: Jimmy Lin & http://en.wikipedia.org/wiki/mount_everest
More informationBig Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13
Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Astrid Rheinländer Wissensmanagement in der Bioinformatik What is Big Data? collection of data sets so large and complex
More informationLarge-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
More informationParallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model
More informationIntroduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University 2014-05-12
Introduction to NoSQL Databases and MapReduce Tore Risch Information Technology Uppsala University 2014-05-12 What is a NoSQL Database? 1. A key/value store Basic index manager, no complete query language
More informationITG Software Engineering
Introduction to Apache Hadoop Course ID: Page 1 Last Updated 12/15/2014 Introduction to Apache Hadoop Course Overview: This 5 day course introduces the student to the Hadoop architecture, file system,
More informationTutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data
More informationArchitectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
More informationProcessing Joins over Big Data in MapReduce
Processing Joins over Big Data in MapReduce Christos Doulkeridis Department of Digital Systems School of Information and Communication Technologies University of Piraeus http://www.ds.unipi.gr/cdoulk/
More informationOptimization and analysis of large scale data sorting algorithm based on Hadoop
Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,
More informationHadoop: The Definitive Guide
FOURTH EDITION Hadoop: The Definitive Guide Tom White Beijing Cambridge Famham Koln Sebastopol Tokyo O'REILLY Table of Contents Foreword Preface xvii xix Part I. Hadoop Fundamentals 1. Meet Hadoop 3 Data!
More informationBig Data on Microsoft Platform
Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4
More informationSolving Big Data Problem using Hadoop File System (HDFS)
International Journal of Applied Information Systems (IJAIS) ISSN : 2249-0868 Foundation of Computer Science FCS, New York, USA International Conference and Workshop on Communication, Computing and Virtualization
More informationDepartment of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 15 Big Data Management V (Big-data Analytics / Map-Reduce) Chapter 16 and 19: Abideboul et. Al. Demetris
More informationMapReduce Job Processing
April 17, 2012 Background: Hadoop Distributed File System (HDFS) Hadoop requires a Distributed File System (DFS), we utilize the Hadoop Distributed File System (HDFS). Background: Hadoop Distributed File
More informationConstructing a Data Lake: Hadoop and Oracle Database United!
Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.
More informationIntroduction to Hadoop
Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction
More informationBig Data and Hadoop with Components like Flume, Pig, Hive and Jaql
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 7, July 2014, pg.759
More informationMapReduce for Data Warehouses
MapReduce for Data Warehouses Data Warehouses: Hadoop and Relational Databases In an enterprise setting, a data warehouse serves as a vast repository of data, holding everything from sales transactions
More informationHadoop. Scalable Distributed Computing. Claire Jaja, Julian Chan October 8, 2013
Hadoop Scalable Distributed Computing Claire Jaja, Julian Chan October 8, 2013 What is Hadoop? A general-purpose storage and data-analysis platform Open source Apache software, implemented in Java Enables
More informationCloud Computing. Chapter 8. 8.1 Hadoop
Chapter 8 Cloud Computing In cloud computing, the idea is that a large corporation that has many computers could sell time on them, for example to make profitable use of excess capacity. The typical customer
More informationMapReduce with Apache Hadoop Analysing Big Data
MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside gavin.heavyside@journeydynamics.com About Journey Dynamics Founded in 2006 to develop software technology to address the issues
More informationBig Data Technology CS 236620, Technion, Spring 2013
Big Data Technology CS 236620, Technion, Spring 2013 Structured Databases atop Map-Reduce Edward Bortnikov & Ronny Lempel Yahoo! Labs, Haifa Roadmap Previous class MR Implementation This class Query Languages
More informationData Management in the Cloud
Data Management in the Cloud Ryan Stern stern@cs.colostate.edu : Advanced Topics in Distributed Systems Department of Computer Science Colorado State University Outline Today Microsoft Cloud SQL Server
More informationMapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
More informationMapReduce. from the paper. MapReduce: Simplified Data Processing on Large Clusters (2004)
MapReduce from the paper MapReduce: Simplified Data Processing on Large Clusters (2004) What it is MapReduce is a programming model and an associated implementation for processing and generating large
More informationLarge Scale Text Analysis Using the Map/Reduce
Large Scale Text Analysis Using the Map/Reduce Hierarchy David Buttler This work is performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract
More informationBig Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016
Big Data Approaches Making Sense of Big Data Ian Crosland Jan 2016 Accelerate Big Data ROI Even firms that are investing in Big Data are still struggling to get the most from it. Make Big Data Accessible
More informationEXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics
BIG DATA WITH HADOOP EXPERIMENTATION HARRISON CARRANZA Marist College APARICIO CARRANZA NYC College of Technology CUNY ECC Conference 2016 Poughkeepsie, NY, June 12-14, 2016 Marist College AGENDA Contents
More informationMatchmaking in the Cloud: Amazon EC2 and Apache Hadoop at eharmony
Matchmaking in the Cloud: Amazon EC2 and Apache Hadoop at eharmony Speaker logo centered below image Steve Kuo, Software Architect Joshua Tuberville, Software Architect Goal > Leverage EC2 and Hadoop to
More informationWhat We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea
What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea Overview Riding Google App Engine Taming Hadoop Summary Riding
More informationComplete Java Classes Hadoop Syllabus Contact No: 8888022204
1) Introduction to BigData & Hadoop What is Big Data? Why all industries are talking about Big Data? What are the issues in Big Data? Storage What are the challenges for storing big data? Processing What
More informationHadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010
Hadoop s Entry into the Traditional Analytical DBMS Market Daniel Abadi Yale University August 3 rd, 2010 Data, Data, Everywhere Data explosion Web 2.0 more user data More devices that sense data More
More informationMAPREDUCE Programming Model
CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce
More informationHadoop Introduction. Olivier Renault Solution Engineer - Hortonworks
Hadoop Introduction Olivier Renault Solution Engineer - Hortonworks Hortonworks A Brief History of Apache Hadoop Apache Project Established Yahoo! begins to Operate at scale Hortonworks Data Platform 2013
More information