Cloud Programming: From Doom and Gloom to BOOM and Bloom
|
|
- Jocelin Carr
- 2 years ago
- Views:
Transcription
1 Cloud Programming: From Doom and Gloom to BOOM and Bloom Neil Conway UC Berkeley Joint work with Peter Alvaro, Ras Bodik, Tyson Condie, Joseph M. Hellerstein, David Maier (PSU), William R. Marczak, and Russell Sears (Yahoo! Research) Datalog 2.0 Workshop
2 Cloud CompuQng: The Next Great CompuQng PlaSorm
3 The Problem WriQng reliable, scalable distributed souware remains extremely difficult.
4 Doom and Gloom!...when we start talking about parallelism and ease of use of truly parallel computers, we re talking about a problem that s as hard as any that computer science has faced... I would be panicked if I were in industry. - - John Hennessey, Stanford
5 A Ray of Light We understand data- parallel compuqng MapReduce, parallel DBs, etc. Can we take a hard problem and transform it into an easy one?
6 Everything is Data Distributed compuqng is all about state System state Session state Protocol state User and security- related state... and of course, the actual data CompuQng = CreaQng, updaqng, and communicaqng that state
7 Datalog to the Rescue! 1. Data- centric programming Explicit, uniform state representaqon: relaqons 2. High- level declaraqve queries Datalog + asynchrony + state update
8 Outline 1. The BOOM Project Cloud CompuQng stack built w/ distributed logic BOOM AnalyQcs: MapReduce and DFS in Overlog 2. Dedalus: Datalog in Time (and Space) 3. (Toward) The Bloom Language Distributed Logic for Joe the Programmer
9 The BOOM Project Berkeley Orders Of Magnitude OOM more scale, OOM less code Can we build Google in 10k LOC? Build Real Systems in distributed logic Begin w/ an exisqng variant of Datalog ( Overlog ) Inform the design of a new language for distributed compuqng (Bloom)
10 BOOM AnalyQcs Typical Big Data stack: MapReduce (Hadoop) + distributed file system (HDFS) We tried two approaches: HDFS: clean- slate rewrite Hadoop: replace job scheduling logic w/ Overlog Goals 1. Replicate exisqng funcqonality 2. Add Hard Stuff (and make it look easy!)
11 Overlog: Distributed Datalog Originally designed for rouqng protocols and overlay networks (Loo et al., SIGMOD 06) RouQng = recursive query over distributed DB Datalog w/ aggregaqon, negaqon, funcqons DistribuQon = horizontal parqqoning of tables Data placement induces communicaqon
12 Yet Another TransiQve Closure link(x, Y, C); path(x, Y, C) :- link(x, Y, C); path(x, Z, C1 + C2) :- link(x, Y, C1), path(y, Z, C2); mincost(x, Z, min<c>) :- path(x, Z, C);
13 Overlog Example Y, C); Y, C) :- Y, C); Z, C1 + C2) :- Y, C1), Z, C2); Z, min<c>) :- Z, C);
14 Overlog Timestep Model
15 Hadoop Distributed File System Based on the Google File System (SOSP 03) Large files, sequenqal workloads, append- only Used by Yahoo!, Facebook, etc. Chunks 3x replicated at data nodes for fault tolerance Data Protocol Client Data Node Metadata Protocol Heartbeat Protocol Data Node Master Node Control Protocol Data Node
16 BOOM- FS Hybrid system Client Complex logic: Overlog Metadata Protocol Overlog Performance- criqcal (but simple!): Java Data Protocol Heartbeat Protocol Master Node Control Protocol Clean separaqon between policy and mechanism Data Node Data Node Data Node
17 BOOM- FS Example: State Represent file system metadata with relaqons. Name Descrip9on A<ributes file Files fileid, parentid, name, isdir fqpath Fully- qualified path names fileid, path fchunk Chunks per file chunkid, fileid datanode DataNode heartbeats nodeaddr, lasthbtime hb_chunk Chunk heartbeats nodeaddr, chunkid, length
18 BOOM- FS Example: Query Represent file system metadata with relaqons. // Base case: root directory has null parent fqpath(fileid, Path) :- file(fileid, FParentId, FName, IsDir), IsDir = true, FParentId = null, Path = "/"; fqpath(fileid, Path) :- file(fileid, FParentId, FName, _), fqpath(fparentid, ParentPath), // Do not add extra slash if parent is root dir PathSep = (ParentPath = "/"? "" : "/"), Path = ParentPath + PathSep + FName;
19 BOOM- FS Example: Query Distributed protocols: join between event stream and local database // "ls" for extant path => return liscng for path RequestId, true, DirLisQng) :- RequestId, Source, "Ls", Path), FileId, Path), FileId, DirLisQng); // "ls" for nonexistent path => error RequestId, false, null) :- RequestId, Source, "Ls", Path), noqn _, Path);
20 Comparison with Hadoop CompeQQve performance (~20%) Lines of Java Lines of Overlog HDFS ~21,700 0 BOOM- FS 1, New Features: 1. Hot Standby for FS master nodes using Paxos 2. ParQQoned FS master nodes for scalability ~1 day! 3. Monitoring, tracing, and invariant checking
21 Lessons from BOOM AnalyQcs Overall, Overlog was a good fit for the task Concise programs for real features, easy evoluqon Data- centric design: language- independent ReplicaQon, parqqoning, monitoring all involve data management Node- local invariants are cross- cuwng queries SpecificaQon is enforcement Policy vs. mechanism Datalog vs. Java
22 Challenges from BOOM AnalyQcs Poor perf, crypqc syntax, lixle/no tool support Easy to fix! Many bugs related to updaqng state Ambiguous semanqcs (in Overlog) We avoided distributed queries The global database is a lie! Hand- coding protocols vs. staqng distributed invariants
23 Outline 1. The BOOM Project Cloud CompuQng stack built w/ distributed logic BOOM AnalyQcs: MapReduce and DFS in Overlog 2. Dedalus: Datalog in Time (and Space) 3. (Toward) The Bloom Language Distributed Logic for Joe the Programmer
24 Dedalus Dedalus: a theoreqcal foundaqon for Bloom In Overlog, the Hard Stuff happens between Qme steps State update Asynchronous messaging Can we talk about the Hard Stuff with logic?
25 State Update Updates in Overlog: ugly, outside of logic Difficult to express common paxerns Queues, sequencing Order doesn t maxer... except when it does! Val + 1) :- Val), _);
26 Asynchronous Messaging notaqon describes space Logical interpretaqon unclear: B) :- A); Upon reflecqon, "me is more fundamental Model failure with arbitrary delay Discard illusion of global DB
27 Dedalus: Datalog in Time (1) Deduc9ve rule: (Pure Datalog) p(a, B) :- q(a, B); (2) Induc9ve rule: (Constraint across next Qmestep) p(a, :- q(a, B); (3) Async rule: (Constraint across arbitrary Qmesteps) p(a, q(a, B);
28 Dedalus: Datalog in Time (1) Deduc9ve rule: (Pure Datalog) All terms in body have same Qme p(a, B, S) :- q(a, B, T), T = S; (2) Induc9ve rule: (Constraint across next Qmestep) p(a, B, S) :- q(a, B, T), successor(t, S); (3) Async rule: (Constraint across arbitrary Qmesteps) p(a, B, S) :- q(a, B, T), Qme(S), choose((a, B, T), (S));
29 State Update in Dedalus p (A, :- p (A, B), noqn p_neg(a, B); p (1, p (1, p_neg(1, Time p(1, 2) p(1, 3) p_neg(1, 2)
30 Counters in Dedalus counter(a, Val + :- counter(a, Val), event(a, _); counter(a, :- counter(a, Val), noqn event(a, _);
31 Asynchrony in Dedalus Unreliable Broadcast: sbcast(#target, Sender, :- new_message(#sender, Message), members(#sender, Target); More saqsfactory logical interpretaqon Can build Lamport clocks, reliable broadcast, etc. What about space? Space is the unit of atomic deducqon w/o parqal failure
32 Asynchrony in Dedalus Unreliable Broadcast: Project sender s local Qme sbcast(#target, Sender, N, :- new_message(#sender, members(#sender, Target); More saqsfactory logical interpretaqon Can build Lamport clocks, reliable broadcast, etc. What about space? Space is the unit of atomic deducqon w/o parqal failure
33 Dedalus Summary Logical, model- theoreqc semanqcs for two key features of distributed systems 1. Mutable state 2. Asynchronous communicaqon All facts are transient Persistence and state update are explicit IniQal correctness checks 1. Temporal straqfiability ( modular straqficaqon in Qme ) 2. Temporal safety ( eventual quiescence )
34 DirecQons: Bloom 1. Bloom: Logic for Joe the Programmer Expose sets, map/reduce, and callbacks? TranslaQon to Dedalus 2. VerificaQon of Dedalus programs 3. Network- oriented opqmizaqon 4. Finding the right abstracqons for Distributed CompuQng Hand- coding protocols vs. staqng distributed invariants 5. Parallelism and monotonicity?
35 QuesQons? Thank you! hxp://declaraqvity.net IniQal PublicaQons: BOOM AnalyCcs: EuroSys 10, Alvaro et al. Paxos in Overlog: NetDB 09, Alvaro et al. Dedalus: UCB TR # , Alvaro et al.
36 Temporal StraQfiability ReducQon to Datalog not syntacqcally straqfiable: :- p(x), noqn p_neg(x); p_neg(x) :- event(x, _), p(x);
Dedalus: Datalog in Time and Space
Dedalus: Datalog in Time and Space Peter Alvaro, Ras Bodík, Neil Conway, Joe Hellerstein, David Maier (PSU), Bill Marczak, and Rusty Sears (Yahoo! Research) 2010 Berkeley OSQ Retreat Dedalus Dedalus is
BOOM: Data-Centric Programming For the Data Center. Peter Alvaro and Neil Conway UC Berkeley DB Group
BOOM: Data-Centric Programming For the Data Center Peter Alvaro and Neil Conway UC Berkeley DB Group 1 HDFS Architecture Client Metadata Protocol Data Protocol Master Node Heartbeat Protocol Control Protocol
BOOM Analytics: Exploring Data-Centric, Declarative Programming for the Cloud
BOOM Analytics: Exploring Data-Centric, Declarative Programming for the Cloud Peter Alvaro UC Berkeley palvaro@eecs.berkeley.edu Khaled Elmeleegy Yahoo! Research khaled@yahoo-inc.com Tyson Condie UC Berkeley
BOOM Analytics: Exploring Data-Centric, Declarative Programming for the Cloud
BOOM Analytics: Exploring Data-Centric, Declarative Programming for the Cloud Peter Alvaro UC Berkeley palvaro@eecs.berkeley.edu Khaled Elmeleegy Yahoo! Research khaled@yahoo-inc.com Tyson Condie UC Berkeley
BOOM: Data-Centric Programming in the Datacenter
BOOM: Data-Centric Programming in the Datacenter Peter Alvaro Tyson Condie Neil Conway Khaled Elmeleegy Joseph M. Hellerstein Russell C Sears Electrical Engineering and Computer Sciences University of
Apache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN
Hadoop MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Understanding Hadoop Understanding Hadoop What's Hadoop about? Apache Hadoop project (started 2008) downloadable open-source software library (current
Hadoop@LaTech ATLAS Tier 3
Cerberus Hadoop Hadoop@LaTech ATLAS Tier 3 David Palma DOSAR Louisiana Tech University January 23, 2013 Cerberus Hadoop Outline 1 Introduction Cerberus Hadoop 2 Features Issues Conclusions 3 Cerberus Hadoop
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
Snapshots in Hadoop Distributed File System
Snapshots in Hadoop Distributed File System Sameer Agarwal UC Berkeley Dhruba Borthakur Facebook Inc. Ion Stoica UC Berkeley Abstract The ability to take snapshots is an essential functionality of any
MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms
Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes
A Logic-Based Approach to Cloud Computing
A Logic-Based Approach to Cloud Computing Jan Van den Bussche (Hasselt University, Belgium) joint work with Tom Ameloot and Frank Neven 1 Origins of Cloud Computing Large websites (Amazon, Google, Facebook)
Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data
Design and Evolution of the Apache Hadoop File System(HDFS)
Design and Evolution of the Apache Hadoop File System(HDFS) Dhruba Borthakur Engineer@Facebook Committer@Apache HDFS SDC, Sept 19 2011 Outline Introduction Yet another file-system, why? Goals of Hadoop
Hadoop Parallel Data Processing
MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for
Prepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
Hadoop Distributed File System. Jordan Prosch, Matt Kipps
Hadoop Distributed File System Jordan Prosch, Matt Kipps Outline - Background - Architecture - Comments & Suggestions Background What is HDFS? Part of Apache Hadoop - distributed storage What is Hadoop?
Language Support for Loosely Consistent Distributed Programming
Language Support for Loosely Consistent Distributed Programming Neil Conway Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2014-153 http://www.eecs.berkeley.edu/pubs/techrpts/2014/eecs-2014-153.html
Hadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela
Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance
In Memory Accelerator for MongoDB
In Memory Accelerator for MongoDB Yakov Zhdanov, Director R&D GridGain Systems GridGain: In Memory Computing Leader 5 years in production 100s of customers & users Starts every 10 secs worldwide Over 15,000,000
Distributed File Systems
Distributed File Systems Mauro Fruet University of Trento - Italy 2011/12/19 Mauro Fruet (UniTN) Distributed File Systems 2011/12/19 1 / 39 Outline 1 Distributed File Systems 2 The Google File System (GFS)
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop
Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics
Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
Hadoop Distributed File System (HDFS) Overview
2012 coreservlets.com and Dima May Hadoop Distributed File System (HDFS) Overview Originals of slides and source code for examples: http://www.coreservlets.com/hadoop-tutorial/ Also see the customized
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A COMPREHENSIVE VIEW OF HADOOP ER. AMRINDER KAUR Assistant Professor, Department
BookKeeper. Flavio Junqueira Yahoo! Research, Barcelona. Hadoop in China 2011
BookKeeper Flavio Junqueira Yahoo! Research, Barcelona Hadoop in China 2011 What s BookKeeper? Shared storage for writing fast sequences of byte arrays Data is replicated Writes are striped Many processes
Storage Architectures for Big Data in the Cloud
Storage Architectures for Big Data in the Cloud Sam Fineberg HP Storage CT Office/ May 2013 Overview Introduction What is big data? Big Data I/O Hadoop/HDFS SAN Distributed FS Cloud Summary Research Areas
HDFS Under the Hood. Sanjay Radia. Sradia@yahoo-inc.com Grid Computing, Hadoop Yahoo Inc.
HDFS Under the Hood Sanjay Radia Sradia@yahoo-inc.com Grid Computing, Hadoop Yahoo Inc. 1 Outline Overview of Hadoop, an open source project Design of HDFS On going work 2 Hadoop Hadoop provides a framework
MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015
7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan aidhog@gmail.com Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE
Benchmark Study on Distributed XML Filtering Using Hadoop Distribution Environment. Sanjay Kulhari, Jian Wen UC Riverside
Benchmark Study on Distributed XML Filtering Using Hadoop Distribution Environment Sanjay Kulhari, Jian Wen UC Riverside Team Sanjay Kulhari M.S. student, CS U C Riverside Jian Wen Ph.D. student, CS U
Introduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
MapReduce Online. Tyson Condie, Neil Conway, Peter Alvaro, Joseph Hellerstein, Khaled Elmeleegy, Russell Sears. Neeraj Ganapathy
MapReduce Online Tyson Condie, Neil Conway, Peter Alvaro, Joseph Hellerstein, Khaled Elmeleegy, Russell Sears Neeraj Ganapathy Outline Hadoop Architecture Pipelined MapReduce Online Aggregation Continuous
Large scale processing using Hadoop. Ján Vaňo
Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine
Hadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
Agenda. Some Examples from Yahoo! Hadoop. Some Examples from Yahoo! Crawling. Cloud (data) management Ahmed Ali-Eldin. First part: Second part:
Cloud (data) management Ahmed Ali-Eldin First part: ZooKeeper (Yahoo!) Agenda A highly available, scalable, distributed, configuration, consensus, group membership, leader election, naming, and coordination
CS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
CS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
Sector vs. Hadoop. A Brief Comparison Between the Two Systems
Sector vs. Hadoop A Brief Comparison Between the Two Systems Background Sector is a relatively new system that is broadly comparable to Hadoop, and people want to know what are the differences. Is Sector
Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed
Distributed Filesystems
Distributed Filesystems Amir H. Payberah Swedish Institute of Computer Science amir@sics.se April 8, 2014 Amir H. Payberah (SICS) Distributed Filesystems April 8, 2014 1 / 32 What is Filesystem? Controls
Amr El Abbadi. Computer Science, UC Santa Barbara amr@cs.ucsb.edu
Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu Collaborators: Divy Agrawal, Sudipto Das, Aaron Elmore, Hatem Mahmoud, Faisal Nawab, and Stacy Patterson. Client Site Client Site Client
Survey on Scheduling Algorithm in MapReduce Framework
Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India
Log Mining Based on Hadoop s Map and Reduce Technique
Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction
Distributed Storage Systems
Distributed Storage Systems John Leach john@brightbox.com twitter @johnleach Brightbox Cloud http://brightbox.com Our requirements Bright box has multiple zones (data centres) Should tolerate a zone failure
Hadoop in Action. Justin Quan March 15, 2011
Hadoop in Action Justin Quan March 15, 2011 Poll What s to come Overview of Hadoop for the uninitiated How does Hadoop work? How do I use Hadoop? How do I get started? Final Thoughts Key Take Aways Hadoop
DRAFT. 1 Proposed System. 1.1 Abstract
Doctoral Thesis Proposal A language-level approach to distributed computation with weak synchronization and its application to cloud and edge computing environments. Christopher Meiklejohn Université catholique
Introduction to HDFS. Prasanth Kothuri, CERN
Prasanth Kothuri, CERN 2 What s HDFS HDFS is a distributed file system that is fault tolerant, scalable and extremely easy to expand. HDFS is the primary distributed storage for Hadoop applications. Hadoop
Can the Elephants Handle the NoSQL Onslaught?
Can the Elephants Handle the NoSQL Onslaught? Avrilia Floratou, Nikhil Teletia David J. DeWitt, Jignesh M. Patel, Donghui Zhang University of Wisconsin-Madison Microsoft Jim Gray Systems Lab Presented
The Inside Scoop on Hadoop
The Inside Scoop on Hadoop Orion Gebremedhin National Solutions Director BI & Big Data, Neudesic LLC. VTSP Microsoft Corp. Orion.Gebremedhin@Neudesic.COM B-orgebr@Microsoft.com @OrionGM The Inside Scoop
HDFS. Hadoop Distributed File System
HDFS Kevin Swingler Hadoop Distributed File System File system designed to store VERY large files Streaming data access Running across clusters of commodity hardware Resilient to node failure 1 Large files
Deploying Hadoop with Manager
Deploying Hadoop with Manager SUSE Big Data Made Easier Peter Linnell / Sales Engineer plinnell@suse.com Alejandro Bonilla / Sales Engineer abonilla@suse.com 2 Hadoop Core Components 3 Typical Hadoop Distribution
HDFS Federation. Sanjay Radia Founder and Architect @ Hortonworks. Page 1
HDFS Federation Sanjay Radia Founder and Architect @ Hortonworks Page 1 About Me Apache Hadoop Committer and Member of Hadoop PMC Architect of core-hadoop @ Yahoo - Focusing on HDFS, MapReduce scheduler,
THE HADOOP DISTRIBUTED FILE SYSTEM
THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,
Apache Hadoop FileSystem and its Usage in Facebook
Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data
Fault Tolerance in Hadoop for Work Migration
1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous
Programming Without a Call Stack: Event-driven Architectures
Gregor Hohpe Google Programming Without a Call Stack: -driven Architectures www.eaipatterns.com Who's Gregor? Distributed systems, enterprise integration, service-oriented architectures MQ, MSMQ, JMS,
Introduction to Cloud Computing
Introduction to Cloud Computing Cloud Computing I (intro) 15 319, spring 2010 2 nd Lecture, Jan 14 th Majd F. Sakr Lecture Motivation General overview on cloud computing What is cloud computing Services
BlobSeer: Towards efficient data storage management on large-scale, distributed systems
: Towards efficient data storage management on large-scale, distributed systems Bogdan Nicolae University of Rennes 1, France KerData Team, INRIA Rennes Bretagne-Atlantique PhD Advisors: Gabriel Antoniu
Big Data With Hadoop
With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE
IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE Mr. Santhosh S 1, Mr. Hemanth Kumar G 2 1 PG Scholor, 2 Asst. Professor, Dept. Of Computer Science & Engg, NMAMIT, (India) ABSTRACT
HDFS Users Guide. Table of contents
Table of contents 1 Purpose...2 2 Overview...2 3 Prerequisites...3 4 Web Interface...3 5 Shell Commands... 3 5.1 DFSAdmin Command...4 6 Secondary NameNode...4 7 Checkpoint Node...5 8 Backup Node...6 9
NoSQL Data Base Basics
NoSQL Data Base Basics Course Notes in Transparency Format Cloud Computing MIRI (CLC-MIRI) UPC Master in Innovation & Research in Informatics Spring- 2013 Jordi Torres, UPC - BSC www.jorditorres.eu HDFS
Enhancing Map-Reduce Job Execution On Geo-Distributed Data Across Datacenters
Enhancing Map-Reduce Job Execution On Geo-Distributed Data Across Datacenters P. JEBILLA 1, P. JAYASHREE 2 1,2 Department of Computer Science and Engineering, Anna University- MIT Campus, Chennai, India
Distributed Metadata Management Scheme in HDFS
International Journal of Scientific and Research Publications, Volume 3, Issue 5, May 2013 1 Distributed Metadata Management Scheme in HDFS Mrudula Varade *, Vimla Jethani ** * Department of Computer Engineering,
Hypertable Architecture Overview
WHITE PAPER - MARCH 2012 Hypertable Architecture Overview Hypertable is an open source, scalable NoSQL database modeled after Bigtable, Google s proprietary scalable database. It is written in C++ for
Windows Azure Storage Scaling Cloud Storage Andrew Edwards Microsoft
Windows Azure Storage Scaling Cloud Storage Andrew Edwards Microsoft Agenda: Windows Azure Storage Overview Architecture Key Design Points 2 Overview Windows Azure Storage Cloud Storage - Anywhere and
Chukwa: a scalable log collector Ari Rabkin and Randy Katz UC Berkeley USENIX LISA 2010
UC Berkeley Chukwa: a scalable log collector Ari Rabkin and Randy Katz UC Berkeley USENIX LISA 2010 With thanks to Eric Yang, Jerome Boulon, Bill Graham, Corbin Hoenes, and all the other Chukwa developers,
Big Data Technology Core Hadoop: HDFS-YARN Internals
Big Data Technology Core Hadoop: HDFS-YARN Internals Eshcar Hillel Yahoo! Ronny Lempel Outbrain *Based on slides by Edward Bortnikov & Ronny Lempel Roadmap Previous class Map-Reduce Motivation This class
Hadoop Distributed File System. Dhruba Borthakur June, 2007
Hadoop Distributed File System Dhruba Borthakur June, 2007 Goals of HDFS Very Large Distributed File System 10K nodes, 100 million files, 10 PB Assumes Commodity Hardware Files are replicated to handle
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
Realtime Apache Hadoop at Facebook. Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens
Realtime Apache Hadoop at Facebook Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens Agenda 1 Why Apache Hadoop and HBase? 2 Quick Introduction to Apache HBase 3 Applications of HBase at
5 HDFS - Hadoop Distributed System
5 HDFS - Hadoop Distributed System 5.1 Definition and Remarks HDFS is a file system designed for storing very large files with streaming data access patterns running on clusters of commoditive hardware.
Apache Hadoop FileSystem Internals
Apache Hadoop FileSystem Internals Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Storage Developer Conference, San Jose September 22, 2010 http://www.facebook.com/hadoopfs
Data Management in the Cloud
Data Management in the Cloud Ryan Stern stern@cs.colostate.edu : Advanced Topics in Distributed Systems Department of Computer Science Colorado State University Outline Today Microsoft Cloud SQL Server
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing
Using Data Mining and Machine Learning in Retail
Using Data Mining and Machine Learning in Retail Omeid Seide Senior Manager, Big Data Solutions Sears Holdings Bharat Prasad Big Data Solution Architect Sears Holdings Over a Century of Innovation A Fortune
The Recovery System for Hadoop Cluster
The Recovery System for Hadoop Cluster Prof. Priya Deshpande Dept. of Information Technology MIT College of engineering Pune, India priyardeshpande@gmail.com Darshan Bora Dept. of Information Technology
Hadoop 101. Lars George. NoSQL- Ma4ers, Cologne April 26, 2013
Hadoop 101 Lars George NoSQL- Ma4ers, Cologne April 26, 2013 1 What s Ahead? Overview of Apache Hadoop (and related tools) What it is Why it s relevant How it works No prior experience needed Feel free
Performance Evaluation for BlobSeer and Hadoop using Machine Learning Algorithms
Performance Evaluation for BlobSeer and Hadoop using Machine Learning Algorithms Elena Burceanu, Irina Presa Automatic Control and Computers Faculty Politehnica University of Bucharest Emails: {elena.burceanu,
What Is Datacenter (Warehouse) Computing. Distributed and Parallel Technology. Datacenter Computing Application Programming
Distributed and Parallel Technology Datacenter and Warehouse Computing Hans-Wolfgang Loidl School of Mathematical and Computer Sciences Heriot-Watt University, Edinburgh 0 Based on earlier versions by
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
Introduction to HDFS. Prasanth Kothuri, CERN
Prasanth Kothuri, CERN 2 What s HDFS HDFS is a distributed file system that is fault tolerant, scalable and extremely easy to expand. HDFS is the primary distributed storage for Hadoop applications. HDFS
NoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase
Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform
An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics
An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,
Special Relativity and the Problem of Database Scalability
Special Relativity and the Problem of Database Scalability James Starkey NimbusDB, Inc. The problem, some jargon, some physics, a little theory, and then NimbusDB. Problem: Database systems scale badly
Do Relational Databases Belong in the Cloud? Michael Stiefel www.reliablesoftware.com development@reliablesoftware.com
Do Relational Databases Belong in the Cloud? Michael Stiefel www.reliablesoftware.com development@reliablesoftware.com How do you model data in the cloud? Relational Model A query operation on a relation
Hadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
Yuji Shirasaki (JVO NAOJ)
Yuji Shirasaki (JVO NAOJ) A big table : 20 billions of photometric data from various survey SDSS, TWOMASS, USNO-b1.0,GSC2.3,Rosat, UKIDSS, SDS(Subaru Deep Survey), VVDS (VLT), GDDS (Gemini), RXTE, GOODS,
Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.
Hadoop Source Alessandro Rezzani, Big Data - Architettura, tecnologie e metodi per l utilizzo di grandi basi di dati, Apogeo Education, ottobre 2013 wikipedia Hadoop Apache Hadoop is an open-source software
Google File System. Web and scalability
Google File System Web and scalability The web: - How big is the Web right now? No one knows. - Number of pages that are crawled: o 100,000 pages in 1994 o 8 million pages in 2005 - Crawlable pages might
Curriculum Vitae. Peter Alvaro. November Computer Science Department PHONE: +1 (415) University of California, Santa Cruz
Curriculum Vitae Peter Alvaro November 2016 Computer Science Department PHONE: +1 (415) 813 9364 University of California, Santa Cruz EMAIL: palvaro@ucsc.edu 1156 High Street, MS SOE3 http://people.ucsc.edu/~palvaro
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 14 Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HFDS) Chapter 14-15: Abideboul
Non-Stop Hadoop Paul Scott-Murphy VP Field Techincal Service, APJ. Cloudera World Japan November 2014
Non-Stop Hadoop Paul Scott-Murphy VP Field Techincal Service, APJ Cloudera World Japan November 2014 WANdisco Background WANdisco: Wide Area Network Distributed Computing Enterprise ready, high availability