Big Data Technology Map-Reduce Motivation: Indexing in Search Engines
- Ethel Nicholson
- 10 years ago
Transcription
Edward Bortnikov & Ronny Lempel, Yahoo Labs, Haifa
23 March

Indexing in Search Engines
- Information Retrieval's two main stages:
  - Indexing process: involves pre-processing a collection of documents and storing a representation of it in an index
  - Retrieval/runtime process: upon a user issuing a query, involves accessing the index to find documents relevant to the query
- To process queries, search engines need quick access to all documents containing a set of search terms
  - Billions of documents sparsely containing millions of terms
- The Inverted Index: a mapping from terms to the documents containing them
  - With additional location-specific details
Inverted Index Example
Doc #1: The good the bad and the ugly
Doc #2: As good as it gets, and more
Doc #3: Is it ugly and bad? It is, and more!

Lexicon (Term, DF) with postings given as (docid; offsets):
Term   DF   Postings
The    1    (1; 1,3,6)
Good   2    (1; 2) (2; 2)
Bad    2    (1; 4) (3; 5)
And    3    (1; 5) (2; 6) (3; 4,8)
Ugly   2    (1; 7) (3; 3)
As     1    (2; 1,3)
It     2    (2; 4) (3; 2,6)
Gets   1    (2; 5)
More   2    (2; 7) (3; 9)
Is     1    (3; 1,7)
Occurrences sorted by increasing docid & location

Inverted Index Structure
- An Inverted Index consists of 2 elements:
  - The lexicon (AKA dictionary)
  - The inverted file (AKA postings file)
- The inverted file is a set of postings lists, a list per term. The list consists of posting elements.
  - The list of term t holds the locations (documents + offsets) where t appears
  - Encoded in compressed form
  - Many variations and degrees of freedom
- The lexicon is the set of all indexing units (terms) in a given collection
  - The entry of a term typically holds its frequency, and points to the corresponding postings list
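The lexicon and inverted file described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the slides' implementation: the function name `build_index`, the regular-expression tokenizer, and the lower-casing of terms are assumptions made for the example, and the postings are left uncompressed.

```python
import re
from collections import defaultdict

# The three documents from the example; doc ids and offsets are 1-based.
DOCS = {
    1: "The good the bad and the ugly",
    2: "As good as it gets, and more",
    3: "Is it ugly and bad? It is, and more!",
}

def build_index(docs):
    """Return (lexicon, postings).

    postings maps term -> list of (docid, [offsets]), sorted by docid.
    lexicon maps term -> DF value (length of the term's postings list).
    """
    occurrences = defaultdict(lambda: defaultdict(list))
    for docid in sorted(docs):
        # Assumed tokenizer: lower-case alphabetic words only.
        tokens = re.findall(r"[a-z]+", docs[docid].lower())
        for offset, term in enumerate(tokens, start=1):
            occurrences[term][docid].append(offset)
    postings = {
        term: sorted(by_doc.items()) for term, by_doc in occurrences.items()
    }
    lexicon = {term: len(plist) for term, plist in postings.items()}
    return lexicon, postings

lexicon, postings = build_index(DOCS)
```

Looking up `postings["and"]` gives the same list as the "And" row of the example, with one posting element per document.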
(Traditional) Indexing: Assumptions
- Computational assumptions:
  - Sequential scans of RAM are faster than accessing random main-memory addresses
  - RAM access is much faster than disk I/O
  - Sequential I/O is much faster than random-access I/O
  - I/O reads/writes data from/to disk in block-size units
- Scale assumptions:
  - Both the input (token stream) and the output (inverted file) are too large to fit in main memory (RAM)
- Mission: efficiently transform a stream of tokens into an inverted index

Indexing First Approximation
Doc #1: The good the bad and the ugly
Doc #2: As good as it gets, and more
Doc #3: Is it ugly and bad? It is, and more!

Tokenizer output (term, doc, offset):
The 1 1, Good 1 2, The 1 3, Bad 1 4, And 1 5, The 1 6, Ugly 1 7, As 2 1, Good 2 2, As 2 3, It 2 4, Gets 2 5, And 2 6, More 2 7, Is 3 1, It 3 2, Ugly 3 3, ...

After a stable lexicographic sort:
And 1 5, And 2 6, And 3 4, And 3 8, As 2 1, As 2 3, Bad 1 4, Bad 3 5, Gets 2 5, Good 1 2, Good 2 2, Is 3 1, Is 3 7, It 2 4, It 3 2, It 3 6, More ...
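The tokenize-then-sort step above can be reproduced with a short Python sketch. It assumes a simple regular-expression tokenizer and lower-cased terms (both choices are illustrative, not taken from the slides), and relies on the fact that Python's `sorted` is stable, so sorting on the term alone preserves the document and offset order within each term.

```python
import re

DOCS = {
    1: "The good the bad and the ugly",
    2: "As good as it gets, and more",
    3: "Is it ugly and bad? It is, and more!",
}

def tokenize(docs):
    """Yield (term, docid, offset) triples in document order, 1-based."""
    for docid in sorted(docs):
        words = re.findall(r"[a-z]+", docs[docid].lower())
        for offset, term in enumerate(words, start=1):
            yield (term, docid, offset)

tokens = list(tokenize(DOCS))

# Stable sort on the term only: ties keep their original (docid, offset) order.
sorted_tokens = sorted(tokens, key=lambda triple: triple[0])
```

Because the token stream is already ordered by document and offset, the stable sort on the term yields the same sequence as a full sort on (term, docid, offset).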
Indexing First Approximation (cont)
Sorted token stream:
And 1 5, And 2 6, And 3 4, And 3 8, As 2 1, As 2 3, Bad 1 4, Bad 3 5, Gets 2 5, Good 1 2, Good 2 2, Is 3 1, Is 3 7, It 2 4, It 3 2, It 3 6, More 2 7

Group by term:
Term   DF   Inverted File
And    3    (1;5)-(2;6)-(3;4,8)
As     1    (2;1,3)
Bad    2    (1;4)-(3;5)
Gets   1    (2;5)
Good   2    (1;2)-(2;2)
Is     1    (3;1,7)
It     2    (2;4)-(3;2,6)
More   2    (2;7)-(3;9)
The    1    (1;1,3,6)
Ugly   2    (1;7)-(3;3)
- Note that the lexicon is created in parallel to the inverted file
- At query time, lookup in the lexicon is logarithmic in its size

First Problem: Cannot Work in RAM
- Due to scale of data: token stream does not fit in RAM
- Solution: work in runs
  - Allocate a RAM buffer, fill it with as many tokens as possible
  - Sort buffer, write to disk
- Once all runs (say, k) have been written to disk, perform a k-way merge to build inverted file and lexicon
  - Merge key: term, then document, then offset
  - Merge reads full blocks from each run into a RAM buffer
[Figure: runs 1 to k, each ordered by increasing terms and then by increasing documents/offsets, feed a k-way merge that produces the lexicon and inverted file]
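The run-based approach above can be sketched as follows. This is a simplified illustration under stated assumptions: runs are kept as in-memory lists rather than written to disk, `buffer_size` counts tokens rather than bytes, and the helper names `make_runs` and `merge_runs` are made up for the example. The standard-library `heapq.merge` performs the k-way merge lazily over already-sorted inputs.

```python
import heapq
from itertools import groupby, islice

def make_runs(token_stream, buffer_size):
    """Cut the stream into buffer-sized chunks and sort each chunk (a run)."""
    stream = iter(token_stream)
    runs = []
    while True:
        buffer = list(islice(stream, buffer_size))
        if not buffer:
            break
        buffer.sort()        # sort key: (term, docid, offset)
        runs.append(buffer)  # a real indexer would write the run to disk here
    return runs

def merge_runs(runs):
    """k-way merge of the sorted runs, grouped into one postings list per term."""
    merged = heapq.merge(*runs)
    index = {}
    for term, group in groupby(merged, key=lambda triple: triple[0]):
        index[term] = [(docid, offset) for _, docid, offset in group]
    return index

# Tokens of Doc #1 plus the start of Doc #2, as (term, docid, offset).
stream = [
    ("the", 1, 1), ("good", 1, 2), ("the", 1, 3), ("bad", 1, 4),
    ("and", 1, 5), ("the", 1, 6), ("ugly", 1, 7), ("as", 2, 1), ("good", 2, 2),
]
runs = make_runs(stream, buffer_size=4)
index = merge_runs(runs)
```

With a buffer of four tokens the nine-token stream produces three runs, and the merge emits terms in increasing order with documents and offsets increasing inside each term.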
Second Problem: Handling Variable-Length Sort Keys
- The sorting of each run, and the k-way merge of all runs, require handling of keys whose format is (word, docnum, offset)
  - The first component is variable-length
  - And requires costly string comparisons
- Solution: work with term identifiers
  - E.g., hash each string into an integer (complexity is linear in the length of the string)
  - Keys become fixed-length
  - Sorting can be done in linear complexity through radix-sort
[Figure: each token in the stream is replaced by its hash, e.g. The becomes h("the"), Good becomes h("good"), Bad becomes h("bad"), And becomes h("and")]
- What about hash collisions?

Issues with Hashing Terms
- Any hash function of strings into integers introduces the probability of collisions
  - What will this cause at query time?
- Can decrease the probability of hash collision by increasing the number of bits in the hash function
  - "Birthday Paradox": need ~twice as many bits as needed for simply counting the number of distinct terms (i.e. 2 log |vocabulary|)
- However, widening the hash function slows down sorting/merging
- Solution: assign terms consecutive (ordinal) numbers, through the maintenance of a lexicon in real time
  - Previously unseen terms get added to the lexicon with the next available ordinal number and an initial count of 1
  - The lexicon needs to be maintained globally, i.e. the same lexicon is used throughout all runs (why?)
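The ordinal-numbering solution above might look like the following sketch. The class name `Lexicon`, its `term_id` method, and the use of an occurrence counter per term are illustrative assumptions; the point is only that each previously unseen term receives the next free integer, so sort keys become fixed-length integer triples.

```python
class Lexicon:
    """Assigns consecutive integer ids to terms in order of first appearance."""

    def __init__(self):
        self.term_to_id = {}
        self.counts = []  # counts[term_id] = occurrences seen so far

    def term_id(self, term):
        tid = self.term_to_id.get(term)
        if tid is None:
            tid = len(self.counts)       # next available ordinal number
            self.term_to_id[term] = tid
            self.counts.append(1)        # initial count of 1
        else:
            self.counts[tid] += 1
        return tid

lexicon = Lexicon()
stream = [("the", 1, 1), ("good", 1, 2), ("the", 1, 3), ("bad", 1, 4)]

# Fixed-length keys: (term id, docnum, offset), all integers.
keys = [(lexicon.term_id(term), doc, offset) for term, doc, offset in stream]
```

Unlike hashing, two distinct terms can never share an id here, at the cost of keeping the mapping available while every run is processed.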
Third Problem: Scale of Data
- Order of magnitude: more than documents of average length over 10^4 bytes: petabyte scale (10^15)
  - Requires lots of storage
  - Requires lots of I/O bandwidth
  - Requires lots of sorting
- Index cannot be built on a single machine
- Solution: the computation must be distributed
  - However, writing distributed business logic is difficult!

Segmented Inverted Indices
- So far we assumed the index cannot be built on a single machine
  - In reality, it also cannot be stored on a single machine
- Consequently, indexes of large-scale search engines are distributed across multiple machines
  - Addresses mainly data scale; usage scale (query throughput) is mostly addressed by replication
- Two basic architectures:
  - Local index organization: index partitioned by documents. Each machine inverts a disjoint set of documents
  - Global index organization: index partitioned by terms. Each machine holds postings lists for a disjoint set of terms
- Query processing becomes a distributed task, where the choice of the partitioning scheme affects the query processing algorithm
Segmented Inverted Indices
Doc 1: A B C
Doc 2: A B D
Doc 3: A C D
Doc 4: B C D

Global index organization: index partitioned by terms
Segment 1   A: 1,2,3   B: 1,2,4
Segment 2   C: 1,3,4   D: 2,3,4

Local index organization: index partitioned by documents
Segment 1   A: 1,2   B: 1,2   C: 1     D: 2
Segment 2   A: 3     B: 4     C: 3,4   D: 3,4
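The two partitioning schemes above can be reproduced on the four-document example with a small Python sketch. The function names and the routing rules (terms A and B to segment 1, documents 1 and 2 to segment 1) are chosen here only to match the example; they are not a prescribed assignment policy.

```python
DOCS = {1: "A B C", 2: "A B D", 3: "A C D", 4: "B C D"}

def invert(docs):
    """Map each term to the increasing list of doc ids containing it."""
    index = {}
    for docid in sorted(docs):
        for term in docs[docid].split():
            index.setdefault(term, []).append(docid)
    return index

def global_partition(docs, term_to_segment):
    """Partition by terms: each segment holds complete postings lists."""
    segments = {}
    for term, plist in invert(docs).items():
        segments.setdefault(term_to_segment(term), {})[term] = plist
    return segments

def local_partition(docs, doc_to_segment):
    """Partition by documents: each segment inverts its own documents."""
    groups = {}
    for docid, text in docs.items():
        groups.setdefault(doc_to_segment(docid), {})[docid] = text
    return {segment: invert(subset) for segment, subset in groups.items()}

global_segments = global_partition(DOCS, lambda t: 1 if t in {"A", "B"} else 2)
local_segments = local_partition(DOCS, lambda d: 1 if d <= 2 else 2)
```

In the global scheme a term lives in exactly one segment, while in the local scheme every segment may hold a (shorter) postings list for every term.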
Local Index - Runtime
- The user issues a top-n query to QI and receives the n best results
- QI sends all partitions S_1, ..., S_m a top-k query (the same query sent by the user)
- Each partition returns its top-k results
- QI merges the km results and returns the top-n to the user
- Query latency depends on the latency of the slowest partition
- Partition latency depends on the number of its documents that match the query, and on the overall size of its index

Abstracting the Distributed Indexing Problem
- We care about our business logic:
  1. We want to process lots of data, specifically tuples [token streams]
  2. We want to group them by some key [token]
  3. We want to sort within each group [by doc-id and position]
  4. We want to process each group somehow [encode posting list and output]
- We want to utilize many machines in parallel, without having to worry about:
  - Data partitioning
  - Inter-machine communication
  - RAM limitations, e.g. dealing with out-of-core (external) sorting
  - Fault tolerance of machines, disks, network, ...
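The local-index query flow described earlier in this section (each partition returns its top-k, QI merges the km candidates and keeps the top-n) can be sketched as below. The scores, document names, and function names are hypothetical, and the example sets k equal to n, which is one simple way to guarantee that no globally top-n document is missed.

```python
import heapq

def partition_top_k(scored_docs, k):
    """One partition's answer: its k highest-scoring (doc, score) pairs."""
    return heapq.nlargest(k, scored_docs, key=lambda pair: pair[1])

def integrate(partitions, n, k):
    """QI: collect up to k results from each of the m partitions, keep top n."""
    candidates = []
    for scored_docs in partitions:
        candidates.extend(partition_top_k(scored_docs, k))
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])

# Hypothetical per-partition match scores for one query.
partitions = [
    [("d1", 0.90), ("d2", 0.40), ("d3", 0.70)],
    [("d4", 0.80), ("d5", 0.95)],
]
top3 = integrate(partitions, n=3, k=3)
```

In a real deployment the per-partition calls run in parallel on different machines, which is why the slowest partition determines the overall latency.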
Scalable Indexing: Logical Steps
[Figure: a virtual huge token stream is split by data partitioning into several token streams; processing defines groups of {t_i, d, o}* tuples for terms t_1 to t_12; the groups are grouped and sorted; processing then encodes each group and writes the output]

Map-Reduce
- Overloaded term: refers to (1) a programming paradigm and (2) a realizing system for distributed computation
- The combination of the system and paradigm was first introduced by Google in a paper in 2004
  - Actually built and utilized a few years before
- Hadoop, the open-source implementation of Map-Reduce, was initiated in 2005
- Today, Hadoop is used by dozens of Big Data companies; Google and Microsoft are known to use their own proprietary platforms
- Distributed indexing was a main use-case for both
Map-Reduce Programming Paradigm
- A flow for processing key-value pairs that consists of two computational functions (per round):
  1. Mapper: transforms input key-value pairs to output key-value pairs, thereby defining how to next group the data (by keys)
  2. Reducer: consumes streams of (potentially sorted) values associated with the same key and performs some computation/aggregation on them
     - Ultimately emits an output of the form key-value as well

Map-Reduce the System
- Runs on large clusters of shared-nothing machines
- Operates over a Distributed File System (DFS)
- Spawns multiple mapper tasks to run in parallel on multiple machines over multiple partitions of the input
- Shuffles the outputs of the mappers around, grouping by output keys and further sorting if required
- Spawns multiple reducer tasks to run in parallel on multiple machines, and routes one or more groups to each reducer
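To make the mapper, shuffle, and reducer roles above concrete, here is a single-process Python simulation. It is only a sketch of the paradigm: `run_map_reduce` and the two example functions are names invented for this illustration, everything runs sequentially in memory, and none of the distribution or fault-tolerance machinery of a real system is modeled.

```python
from collections import defaultdict

def run_map_reduce(records, mapper, reducer, sort_values=True):
    """Apply mapper to each input pair, group by output key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:                   # map phase
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)    # shuffle: group by output key
    output = []
    for out_key in sorted(groups):               # one reducer call per group
        values = sorted(groups[out_key]) if sort_values else groups[out_key]
        output.extend(reducer(out_key, values))
    return output

def index_mapper(docid, text):
    """Emit (term, (docid, offset)) for every token of the document."""
    for offset, term in enumerate(text.split(), start=1):
        yield term, (docid, offset)

def index_reducer(term, occurrences):
    """Emit the term with its sorted occurrence list."""
    yield term, list(occurrences)

DOCS = [(1, "the good the bad"), (2, "as good as it gets")]
index = dict(run_map_reduce(DOCS, index_mapper, index_reducer))
```

The mapper decides the grouping simply by choosing the output key; swapping in a different mapper and reducer reuses the same driver unchanged.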
Map-Reduce the System (cont.)
- The system takes care of:
  - Physical data partitioning and replication (DFS)
  - Distributed processing of mappers and reducers
  - Inter-machine communication for shuffling & sorting data by keys
  - Task management that overcomes hardware and software failures
- (more on the system's internals in the next lecture)

Distributed Indexing via Map-Reduce
- Input: [URL, {token array}]*
- We will produce a locally partitioned distributed index by three map-reduce jobs
- Description is at a high level, hides many details
- One of several possible implementations
Distributed Indexing, First Step
- Input: [URL, {token array}]*
- Goal: create index partitions by routing documents uniformly at random (good for load balancing)
  - Number documents densely [1..N] per partition
- Mapper: [URL, {token array}] → [hash(url), URL, token, offset]*
- Group key: hash(url), i.e. partition#
- Sort within group: URL (primary), offset (secondary)
- Reducer: [hash(url), URL, token, offset] → [partition#, token, doc#, offset]
  - Increment doc# whenever URL changes

Distributed Indexing, Second Step
- Input: [partition#, token, doc#, offset]
- Goal: create inverted lists per token per partition
- Mapper: identity, i.e. [partition#, token, doc#, offset] → [partition#, token, doc#, offset]*
- Group key: {partition#, token}
  - Note: many more, and much smaller, groups than in the first step
- Sort within group: doc# (primary), offset (secondary)
- Reducer: [partition#, token, doc#, offset] → [partition#, token, encoded inverted list]
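The first two jobs above can be walked through with a single-process sketch. Several things here are assumptions for illustration only: `zlib.crc32` stands in for hash(url), `NUM_PARTITIONS` is an arbitrary value, the URLs are made up, the grouping and sorting that the framework would perform are done with plain dictionaries and `sorted`, and the "encoded inverted list" is left as an unencoded sorted list of (doc#, offset) pairs.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 2  # hypothetical number of index partitions

def crc_partition(url):
    """Stand-in for hash(url): deterministic partition number for a URL."""
    return zlib.crc32(url.encode("utf-8")) % NUM_PARTITIONS

def step1(docs, partition_of=crc_partition):
    """[URL, tokens]* -> [partition#, token, doc#, offset]*"""
    groups = defaultdict(list)
    for url, tokens in docs:                       # mapper
        for offset, token in enumerate(tokens, start=1):
            groups[partition_of(url)].append((url, offset, token))
    rows = []
    for part, group in groups.items():             # reducer, one group per partition
        doc_num, prev_url = 0, None
        for url, offset, token in sorted(group):   # sort: URL, then offset
            if url != prev_url:                    # new URL: next dense doc#
                doc_num, prev_url = doc_num + 1, url
            rows.append((part, token, doc_num, offset))
    return rows

def step2(rows):
    """[partition#, token, doc#, offset]* -> {(partition#, token): list}"""
    groups = defaultdict(list)
    for part, token, doc_num, offset in rows:      # identity mapper
        groups[(part, token)].append((doc_num, offset))
    # Reducer: sort by doc#, then offset; real code would also encode the list.
    return {key: sorted(values) for key, values in groups.items()}

docs = [
    ("http://a.example/1", ["the", "good"]),
    ("http://b.example/2", ["the", "bad"]),
]
rows = step1(docs, partition_of=lambda url: 0)  # force one partition for clarity
lists = step2(rows)
```

Forcing every URL into partition 0 makes the doc# assignment easy to follow; leaving `partition_of` at its default spreads the documents across the hypothetical partitions, each with its own dense numbering.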
Distributed Indexing, Third Step
- Input: [partition#, token, encoded inverted list]*
- Goal: create index per partition
- Mapper: identity, i.e. [partition#, token, encoded inverted list] → [partition#, token, encoded inverted list]*
- Group key: {partition}
- Sort within group: N/A or by token, depending on implementation
- Reducer: [partition#, token, encoded inverted list] → [partition#, inverted index]
- Q: why didn't we group by partition (pushing token to sort key) to finish the indexing task in the second step?
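The third job above only regroups the per-token lists by partition. A minimal sketch follows, assuming the "sort by token" variant and representing each partition's inverted index as a plain dictionary; the function name `step3` and the sample rows (with unencoded lists standing in for encoded ones) are hypothetical.

```python
from collections import defaultdict

def step3(encoded_lists):
    """[partition#, token, list]* -> {partition#: {token: list}}"""
    groups = defaultdict(list)
    for part, token, encoded in encoded_lists:   # identity mapper, key = partition#
        groups[part].append((token, encoded))
    # Reducer: assemble each partition's index with tokens in sorted order.
    return {part: dict(sorted(rows)) for part, rows in groups.items()}

# Hypothetical output of the second step.
encoded_lists = [
    (0, "the", [(1, 1)]),
    (1, "bad", [(1, 2)]),
    (0, "and", [(2, 3)]),
]
indexes = step3(encoded_lists)
```

Each reducer call here sees every list of one partition, so the number of groups equals the number of partitions.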
NoSQL Data Base Basics
NoSQL Data Base Basics Course Notes in Transparency Format Cloud Computing MIRI (CLC-MIRI) UPC Master in Innovation & Research in Informatics Spring- 2013 Jordi Torres, UPC - BSC www.jorditorres.eu HDFS
Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University 2014-05-12
Introduction to NoSQL Databases and MapReduce Tore Risch Information Technology Uppsala University 2014-05-12 What is a NoSQL Database? 1. A key/value store Basic index manager, no complete query language
EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics
BIG DATA WITH HADOOP EXPERIMENTATION HARRISON CARRANZA Marist College APARICIO CARRANZA NYC College of Technology CUNY ECC Conference 2016 Poughkeepsie, NY, June 12-14, 2016 Marist College AGENDA Contents
Rethinking SIMD Vectorization for In-Memory Databases
SIGMOD 215, Melbourne, Victoria, Australia Rethinking SIMD Vectorization for In-Memory Databases Orestis Polychroniou Columbia University Arun Raghavan Oracle Labs Kenneth A. Ross Columbia University Latest
Big Fast Data Hadoop acceleration with Flash. June 2013
Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional
Parallel Computing. Benson Muite. [email protected] http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage
Parallel Computing Benson Muite [email protected] http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework
Integrating Big Data into the Computing Curricula
Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big
Parquet. Columnar storage for the people
Parquet Columnar storage for the people Julien Le Dem @J_ Processing tools lead, analytics infrastructure at Twitter Nong Li [email protected] Software engineer, Cloudera Impala Outline Context from various
CS 378 Big Data Programming
CS 378 Big Data Programming Lecture 2 Map- Reduce CS 378 - Fall 2015 Big Data Programming 1 MapReduce Large data sets are not new What characterizes a problem suitable for MR? Most or all of the data is
