Hadoop Open Platform-as-a-Service (Hops)
- Toby Summers
- 8 years ago
Transcription
1 Hadoop Open Platform-as-a-Service (Hops)
Academics: Jim Dowling, Seif Haridi
PostDocs: Gautier Berthou (SICS)
PhDs: Salman Niazi, Mahmoud Ismail, Kamal Hakimzadeh, Ali Gholami
R/Engineers: Stig Viaene (SICS), Steffen Grohschmeidt
MSc Students: Theofilos Kakantousis, Nikolaos Stangios, Sri Srijeyanthan, Vangelos Savvidis, Seçkin Savaşçı.
2 Why is Big Data Important? In a wide array of academic fields, the ability to effectively process data is superseding other, more classical modes of research. "More data trumps better algorithms"*
* The Unreasonable Effectiveness of Data [Halevy, Norvig et al. 09]
3 Background: Hadoop Filesystem and MapReduce
4-13 HDFS: Hadoop Filesystem
[Figure, built up over ten slides: a Name node and two groups of Data nodes. Labels, in order of appearance: write /crawler/bot/jd.io/1, Rebalance, Heartbeats, Under-replicated blocks]
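The "Heartbeats" and "Under-replicated blocks" labels on the slides above can be illustrated with a small sketch. This is not HDFS code: the replication target of 3, the 30-second timeout, and every function and variable name are assumptions chosen for illustration.

```python
# Illustrative sketch only; the names, timeout and replication target are assumptions.
REPLICATION_TARGET = 3
HEARTBEAT_TIMEOUT = 30.0  # seconds without a heartbeat before a data node counts as dead


def live_nodes(last_heartbeat, now):
    """Return the set of data nodes whose last heartbeat is recent enough."""
    return {node for node, seen in last_heartbeat.items()
            if now - seen <= HEARTBEAT_TIMEOUT}


def under_replicated(block_locations, last_heartbeat, now):
    """Map each under-replicated block to its number of missing replicas."""
    alive = live_nodes(last_heartbeat, now)
    missing = {}
    for block, nodes in block_locations.items():
        live_replicas = len(set(nodes) & alive)
        if live_replicas < REPLICATION_TARGET:
            missing[block] = REPLICATION_TARGET - live_replicas
    return missing
```

A name node following this pattern would schedule one new copy per missing replica on some live data node.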
14-19 Big Data Processing with No Data Locality
[Figure, built up over six slides: Job( /genomes/jim.bam ) is submitted to a Workflow Manager; the Job runs on a Compute Grid Node]
This doesn't scale. Bandwidth is the bottleneck.
20-24 MapReduce Data Locality
[Figure, built up over five slides: Job( /genomes/jim.bam ) is submitted to the Job Tracker; six Task Trackers, each drawn with a DN, run the Job; R = result file(s)]
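The contrast between the two sets of slides above (moving data to the computation versus moving the computation to the data) can be sketched as a toy scheduler. It is not the Hadoop Job Tracker; `assign_tasks` and its arguments are hypothetical names used only for this illustration.

```python
def assign_tasks(block_locations, free_slots):
    """Assign one task per block, preferring a node that already stores the block.

    block_locations: block id -> list of nodes holding a replica
    free_slots: node -> number of free task slots (mutated as slots are used)
    Returns block id -> (node, is_local).
    """
    assignment = {}
    for block, replicas in block_locations.items():
        local = [n for n in replicas if free_slots.get(n, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # Fall back to any node with a free slot; the block must then cross the network.
            candidates = [n for n, slots in free_slots.items() if slots > 0]
            if not candidates:
                raise RuntimeError("no free task slots")
            node, is_local = candidates[0], False
        free_slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment
```

Every task marked `is_local` reads its input from local disk; only the fallback tasks consume network bandwidth, which is the bottleneck named on the earlier slide.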
25 The NameNode
26 HDFS: Hadoop Filesystem [Figure: Name node and Data nodes]
27 HDFS NameNode
Stores mappings:
- path_component -> inode
- inode -> {block1, block2, block3, ...}
- block -> {replica1, replica2, replica3}
External API to HDFS clients; internal API to DataNodes.
Monitors DataNodes for failures and corrupted data.
Manages leases, quotas, and (re-)replication.
Must do all this in a single JVM: Spotify has a 90 GB heap storing references to 300m files.
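The three mappings listed on the slide above can be pictured as nested in-memory structures. The sketch below is a minimal illustration under that reading, not the NameNode's actual data model; `Inode`, `TinyNamespace` and their methods are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Inode:
    name: str
    children: dict = field(default_factory=dict)  # path_component -> Inode
    blocks: list = field(default_factory=list)    # inode -> [block ids]


class TinyNamespace:
    def __init__(self):
        self.root = Inode("")
        self.replicas = {}  # block id -> set of data nodes holding a replica

    def resolve(self, path):
        """Walk the path one component at a time, starting at the root."""
        node = self.root
        for part in [p for p in path.split("/") if p]:
            node = node.children[part]
        return node

    def create(self, path, blocks):
        """Create a file (and any missing parent directories) with the given blocks."""
        node = self.root
        for part in [p for p in path.split("/") if p]:
            node = node.children.setdefault(part, Inode(part))
        node.blocks = list(blocks)
        for block in blocks:
            self.replicas.setdefault(block, set())
        return node

    def locate(self, path):
        """Return block id -> sorted replica locations for a file."""
        return {b: sorted(self.replicas[b]) for b in self.resolve(path).blocks}
```

Holding all of this on the heap of one process is what ties namespace size to the memory of a single JVM.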
28-35 High Availability for the NameNode (HDFS 2.x)
[Figure, built up over eight slides]
- ZK ZK ZK: agreement on the Active Master.
- JN JN JN: shared NN log stored in a quorum of journal nodes.
- NN Active / NN Standby: master-slave replication of NN state.
- Checkpt NN: faster recovery, cut journal log.
- DN DN DN DN
DOESN'T SCALE OUT!
36-37 The Evolution of the NameNode
HDFS (2006)
- In-memory metadata
HDFS 0.07 (2006)
- WAL (EditLog)
- FSImage
HDFS 0.21 (2009)
- Weaken Global Lock
HDFS 2.0 (2011)
- Eventually Consistent Replication: HA-NameNode
They reinvented the Database for the NameNode!
38-42 Databases had these features long ago
Oracle v6 (1988)
- Redo and Undo Logs
- Rollback Segments
Oracle V7.1 (1994)
- Symmetric Replication
and have continued to evolve...
Oracle 9i RAC (2001)
- Shared State Replication
43-57 The end of the One-size-fits-All Database
- Columnar Databases: Vertica, Hana
- NewSQL Databases: MySQL Cluster, VoltDB, Memstore, AtlasDB, FoundationDB
- Graph Databases: Neo4J
- RDBMSes: MySQL, Postgres, DB2, Oracle, SQLServer
- In-Memory Stores: Memcached, Redis
- Key-Value Stores: Dynamo, Cassandra, MongoDB, Riak
- Petabyte Databases: BigQuery (Google), RedShift (Amazon), Impala (Cloudera)
Stonebraker et al, One Size Fits All: An Idea Whose Time Has Come and Gone
58 MySQL Cluster (NDB): Shared-Nothing DB
- Distributed, in-memory
- 2-Phase Commit: replicate the DB, not the log!
- Real-time: low TransactionInactive timeouts
- Commodity hardware
- Scales out: millions of transactions/sec, TB-sized datasets (48 nodes)
- Split-brain solved with the Arbitrator pattern
- SQL and native blocking/non-blocking APIs
[Figure labels: SQL API, NDB API; 30+ million update transactions/second on a 30-node cluster]
59 HopsFS
60 HopsFS
- Customizable and scalable metadata
- High throughput for read and write operations
- NameNode failover time 5 seconds (vs ~1 minute for HDFS)
61 Request Handling (Apache HDFS vs HopsFS) [Figure panels: Apache HDFS NameNode Request Handling; HopsFS NameNode Request Handling]
62 Fine-Grained Locking, Transactional Updates
NDB gives us only the READ_COMMITTED isolation level, which is not strong enough. We implemented serializability for FS operations using implicit locking in the DAG and row-level locking in NDB.
[Hakimzadeh, Peiro, Dowling, Scaling HDFS with a Strongly Consistent Relational Model for Metadata, DAIS 2014]
63-67 Preventing Deadlocks and Starvation
[Figure, built up over five slides: read, mv and block_report operations all target /user/jdowling/dna.bam]
Solution: all request threads for inode operations traverse the FS hierarchy in the same order, acquiring locks in the same order. Block-level operations have to follow the same order.
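The lock-ordering rule stated above can be sketched with ordinary in-process locks. This is a simplification and an assumption about how the idea might look in miniature: HopsFS takes row-level locks in NDB, whereas here every inode on the path gets an exclusive `threading.Lock`, and `lock_order` and `with_inode_locks` are hypothetical helpers.

```python
import threading
from collections import defaultdict

_locks = defaultdict(threading.Lock)  # hypothetical lock table: one lock per inode path
_table_guard = threading.Lock()       # protects the lock table itself


def lock_order(*paths):
    """Return every inode on the given paths in one global order (parents before children)."""
    inodes = set()
    for path in paths:
        parts = [p for p in path.split("/") if p]
        for depth in range(len(parts) + 1):
            inodes.add("/" + "/".join(parts[:depth]))
    # Lexicographic order is total, and a parent path always sorts before its children.
    return sorted(inodes)


def with_inode_locks(paths, action):
    """Run action() while holding the locks for all inodes on the given paths."""
    order = lock_order(*paths)
    with _table_guard:
        locks = [_locks[p] for p in order]
    for lock in locks:
        lock.acquire()
    try:
        return action()
    finally:
        for lock in reversed(locks):
            lock.release()
```

Because two operations that name the same paths in a different order still acquire locks in the same sequence, neither can hold a lock the other needs while waiting for one the other holds.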
68 Per Transaction Cache
- Experimentation revealed many round trips to the database per transaction.
- Cache intermediate transaction results at NameNodes.
- We also use Memcached at each NameNode to cache mappings of: path -> {inode/blocks/replicas}
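The idea of caching intermediate results within one transaction can be sketched as follows. This is an illustration only, not the HopsFS implementation: `TransactionCache` and the `fetch_row` callable (standing in for one database round trip) are assumptions.

```python
class TransactionCache:
    """Caches rows read during one transaction so repeated lookups skip the database."""

    def __init__(self, fetch_row):
        self._fetch_row = fetch_row  # hypothetical callable: key -> row, one round trip each
        self._rows = {}
        self.round_trips = 0

    def get(self, key):
        """Return the row for key, going to the database only on the first request."""
        if key not in self._rows:
            self._rows[key] = self._fetch_row(key)
            self.round_trips += 1
        return self._rows[key]
```

A new cache object would be created at the start of each transaction and discarded at commit or abort, so cached rows never outlive the locks that protect them.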
69 Sometimes, Transactions Just Ain't Enough
Large subtree operations with millions of inodes can't be executed in a single transaction, due to the low timeouts for transactions (real-time).
Subtree operations: 4-phase protocol
- Sacrifices atomicity, but keeps isolation and consistency.
- Batch operations and multithreading for performance.
- Failed NameNodes handled transparently.
- Leases used to handle failed clients.
70 Leader Election using the Database (NDB)
- We need a leader NameNode to coordinate replication and lease management.
- Use NDB as shared memory for leader election.
- No more ZooKeeper, yay!
71 HopsFS Internal Protocol Scalability
- On 100PB+ clusters, internal protocols make up most of the network traffic for HDFS.
- Block Reporting and Exiting Safe Mode: batching and work stealing.
72 HopsFS Write Performance [Figure: performance plot] 1 Gbit network; nodes: 12-core Xeon, 2.8 GHz; 2-node NDB cluster.
73 HopsFS Read Performance [Figure: performance plot] 1 Gbit network; nodes: 12-core Xeon, 2.8 GHz; 2-node NDB cluster.
74-76 HopsFS Erasure Coding
- HDFS 2.x Triple Replication (300%)
- 2x Replication + XOR (220%)
- Reed-Solomon (140%)
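One way to arrive at the percentages above is to assume a stripe of 10 data blocks, with XOR adding 1 parity block (data and parity both stored twice) and Reed-Solomon adding 4 parity blocks (everything stored once). The slides do not state the stripe parameters, so these numbers are an assumption, as are the function names in the sketch.

```python
def storage_overhead_percent(data_blocks, parity_blocks, replicas):
    """Raw storage used, as a percentage of the logical data size."""
    return 100 * replicas * (data_blocks + parity_blocks) // data_blocks


def xor_parity(blocks):
    """XOR equal-length blocks together.

    Applied to the data blocks, this yields the parity block; applied to the
    surviving blocks plus the parity, it rebuilds the single missing block.
    """
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)
```

Under these assumptions the three schemes come out at 300, 220 and 140 percent. XOR parity can rebuild only one lost block per stripe, which is why it is paired with 2x replication here.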
77 HopsFS Erasure Coding [Figure panels: Data durability with Triple Replication; Data durability with Reed-Solomon]
78 Comparison with HDFS-RAID
79 We did the same for YARN
80 Apache Hadoop YARN HA/Scaleout Limitations
[Figure: Clients, Zookeeper, Primary RM, Standby RM, five NMs]
- The Resource Manager (RM) is a bottleneck.
- ZooKeeper throughput is not high enough to persist all RM state.
- The standby Resource Manager can only recover partial state.
- All running jobs must be restarted.
- RM state is not queryable.
81 Hops YARN: Present
[Figure: Client, three NDB nodes, Scheduler, two RTs, five NMs]
- NDB is faster than ZooKeeper; the complete state of the system is stored in the database.
- Provides transparent failover.
- The standby Resource Manager is replaced by several Resource Trackers.
- NodeManager heartbeats are handled by the Resource Trackers.
- Reducing the load on the Scheduler allows more NodeManagers and/or more frequent heartbeats.
82 Hops YARN: Future
[Figure: Client, three NDB nodes, two RMs, five NMs]
- The RM is a state machine.
- Almost no session state to manage.
- Transparent failover working.
83 Hops YARN
- HA-YARN (done)
- Distributed Resource Tracker Service (ongoing)
- Make YARN more interactive (ongoing): reduce NodeManager heartbeat time
84-86 Hops-Hadoop: Exabyte-Scale Hadoop
[Figure, built up over three slides: four NDB nodes; three NameNodes (NN) labelled HDFS; three ResourceManagers (RM) labelled YARN; twelve worker nodes, each running a DN and an NM]
87 The Hops Stack Continued
88-90 Bringing Data People Together
Data Owners
- Metadata, Ingestion
- Non-programmers
Data Scientists
- Data analysts
- Programmers
[Figure labels: HopsHub; Spark, Flink, Adam, Cuneiform; Hops-YARN; Hops-HDFS; Karamel/PaaS]
91 Perimeter Security and Multi-Tenancy
HopsHub
- Project-level RBAC
- Hadoop trusted proxy
- Analytics Plugin Framework: Adam, Cuneiform, Spark, Flink, MR
- REST APIs
Network Isolation
[Figure labels: Kerberos, LIMS, LDAP]
Related Hadoop Security Projects: Knox, Sentry, Rhino
92-94 Projects for Multi-Tenancy; Activity Trails [Screenshot labels: Project, Global Activity Trail]
95 Project Membership
96-98 File Browser (Iceberg) [Screenshot label: HDFS Files]
99-103 Upload Data
- Overcome 3 GB browser upload limit
- Automated Ingestion of Data: Apache Flume
104 Run Cuneiform Workflows on YARN
105 PaaS support with Chef/Karamel
Support for EC2, Vagrant, Bare Metal.
106 Conclusions
Hops will be the first European distribution of Hadoop when released.
- First beta release coming in Q
Lots of ideas for future work
- Tighter Spark, Flink integration
- BiobankCloud support
Hadoop Open Platform-as-a-Service (Hops)
Hadoop Open Platform-as-a-Service (Hops) Academics: PostDocs: PhDs: R/Engineers: Jim Dowling, Seif Haridi Gautier Berthou (SICS) Salman Niazi, Mahmoud Ismail, Kamal Hakimzadeh, Ali Gholami Stig Viaene
More informationManaging large clusters resources
Managing large clusters resources ID2210 Gautier Berthou (SICS) Big Processing with No Locality Job( /crawler/bot/jd.io/1 ) submi t Workflow Manager Compute Grid Node Job This doesn t scale. Bandwidth
More informationMANAGING RESOURCES IN A BIG DATA CLUSTER.
MANAGING RESOURCES IN A BIG DATA CLUSTER. Gautier Berthou (SICS) EMDC Summer Event 2015 www.hops.io @hopshadoop We are producing lot of data Where does they Come From? On-line services : PBs per day Scientific
More informationwww.biobankcloud.com Jim Dowling KTH Royal Institute of Technology, Stockholm SICS Swedish ICT CSHL Meeting on Biological Data Science, 2014
www.biobankcloud.com Jim Dowling KTH Royal Institute of Technology, Stockholm SICS Swedish ICT CSHL Meeting on Biological Data Science, 2014 Definition of a Biobank The Biobank concept is defined (by Swedish
More informationBig Data Technology Core Hadoop: HDFS-YARN Internals
Big Data Technology Core Hadoop: HDFS-YARN Internals Eshcar Hillel Yahoo! Ronny Lempel Outbrain *Based on slides by Edward Bortnikov & Ronny Lempel Roadmap Previous class Map-Reduce Motivation This class
More informationD2.3 Scalable and Highly Available HDFS
Project number: 317871 Project acronym: BIOBANKCLOUD Project title: Scalable, Secure Storage of Biobank Data Project website: http://www.biobankcloud.eu Project coordinator: Jim Dowling (KTH) Coordinator
More informationIntroduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
More informationHOPS: Hadoop Open Platform-as-a-Service
HOPS: Hadoop Open Platform-as-a-Service Alberto Lorente, Hamid Afzali, Salman Niazi, Mahmoud Ismail, Kamal Hakimazadeh, Hooman Piero, Jim Dowling jdowling@sics.se Scale Research Laboratory What is Hadoop?
More informationOverview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics
Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)
More informationBig Data With Hadoop
With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
More informationIntroduction to Hadoop
Introduction to Hadoop ID2210 Jim Dowling Large Scale Distributed Computing In #Nodes - BitTorrent (millions) - Peer-to-Peer In #Instructions/sec - Teraflops, Petaflops, Exascale - Super-Computing In #Bytes
More informationHadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
More informationApache HBase. Crazy dances on the elephant back
Apache HBase Crazy dances on the elephant back Roman Nikitchenko, 16.10.2014 YARN 2 FIRST EVER DATA OS 10.000 nodes computer Recent technology changes are focused on higher scale. Better resource usage
More informationThe Hadoop Distributed File System
The Hadoop Distributed File System Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com Presenter: Alex Hu HDFS
More informationData Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com
Data Warehousing and Analytics Infrastructure at Facebook Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com Overview Challenges in a Fast Growing & Dynamic Environment Data Flow Architecture,
More informationNoSQL Data Base Basics
NoSQL Data Base Basics Course Notes in Transparency Format Cloud Computing MIRI (CLC-MIRI) UPC Master in Innovation & Research in Informatics Spring- 2013 Jordi Torres, UPC - BSC www.jorditorres.eu HDFS
More informationHadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop, Why? Need to process huge datasets on large clusters of computers
More informationHow to Hadoop Without the Worry: Protecting Big Data at Scale
How to Hadoop Without the Worry: Protecting Big Data at Scale SESSION ID: CDS-W06 Davi Ottenheimer Senior Director of Trust EMC Corporation @daviottenheimer Big Data Trust. Redefined Transparency Relevance
More informationPrepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
More informationHDFS Under the Hood. Sanjay Radia. Sradia@yahoo-inc.com Grid Computing, Hadoop Yahoo Inc.
HDFS Under the Hood Sanjay Radia Sradia@yahoo-inc.com Grid Computing, Hadoop Yahoo Inc. 1 Outline Overview of Hadoop, an open source project Design of HDFS On going work 2 Hadoop Hadoop provides a framework
More informationRealtime Apache Hadoop at Facebook. Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens
Realtime Apache Hadoop at Facebook Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens Agenda 1 Why Apache Hadoop and HBase? 2 Quick Introduction to Apache HBase 3 Applications of HBase at
More informationApache Hadoop FileSystem and its Usage in Facebook
Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs
More informationDesign and Evolution of the Apache Hadoop File System(HDFS)
Design and Evolution of the Apache Hadoop File System(HDFS) Dhruba Borthakur Engineer@Facebook Committer@Apache HDFS SDC, Sept 19 2011 Outline Introduction Yet another file-system, why? Goals of Hadoop
More informationHADOOP MOCK TEST HADOOP MOCK TEST II
http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at
More informationTake An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data
More informationLecture Data Warehouse Systems
Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores
More informationNon-Stop Hadoop Paul Scott-Murphy VP Field Techincal Service, APJ. Cloudera World Japan November 2014
Non-Stop Hadoop Paul Scott-Murphy VP Field Techincal Service, APJ Cloudera World Japan November 2014 WANdisco Background WANdisco: Wide Area Network Distributed Computing Enterprise ready, high availability
More informationApache Hadoop FileSystem Internals
Apache Hadoop FileSystem Internals Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Storage Developer Conference, San Jose September 22, 2010 http://www.facebook.com/hadoopfs
More informationHadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook
Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future
More informationApache Sentry. Prasad Mujumdar prasadm@apache.org prasadm@cloudera.com
Apache Sentry Prasad Mujumdar prasadm@apache.org prasadm@cloudera.com Agenda Various aspects of data security Apache Sentry for authorization Key concepts of Apache Sentry Sentry features Sentry architecture
More informationENABLING GLOBAL HADOOP WITH EMC ELASTIC CLOUD STORAGE
ENABLING GLOBAL HADOOP WITH EMC ELASTIC CLOUD STORAGE Hadoop Storage-as-a-Service ABSTRACT This White Paper illustrates how EMC Elastic Cloud Storage (ECS ) can be used to streamline the Hadoop data analytics
More informationHadoop Scalability at Facebook. Dmytro Molkov (dms@fb.com) YaC, Moscow, September 19, 2011
Hadoop Scalability at Facebook Dmytro Molkov (dms@fb.com) YaC, Moscow, September 19, 2011 How Facebook uses Hadoop Hadoop Scalability Hadoop High Availability HDFS Raid How Facebook uses Hadoop Usages
More informationJournal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
Journal of science e ISSN 2277-3290 Print ISSN 2277-3282 Information Technology www.journalofscience.net STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) S. Chandra
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More informationEntering the Zettabyte Age Jeffrey Krone
Entering the Zettabyte Age Jeffrey Krone 1 Kilobyte 1,000 bits/byte. 1 megabyte 1,000,000 1 gigabyte 1,000,000,000 1 terabyte 1,000,000,000,000 1 petabyte 1,000,000,000,000,000 1 exabyte 1,000,000,000,000,000,000
More informationProgramming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview
Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce
More informationHadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela
Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance
More informationHadoop: Embracing future hardware
Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop
More informationHadoop Distributed File System. Jordan Prosch, Matt Kipps
Hadoop Distributed File System Jordan Prosch, Matt Kipps Outline - Background - Architecture - Comments & Suggestions Background What is HDFS? Part of Apache Hadoop - distributed storage What is Hadoop?
More information<Insert Picture Here> Big Data
Big Data Kevin Kalmbach Principal Sales Consultant, Public Sector Engineered Systems Program Agenda What is Big Data and why it is important? What is your Big
More informationHDFS Architecture Guide
by Dhruba Borthakur Table of contents 1 Introduction... 3 2 Assumptions and Goals... 3 2.1 Hardware Failure... 3 2.2 Streaming Data Access...3 2.3 Large Data Sets... 3 2.4 Simple Coherency Model...3 2.5
More informationextensible record stores document stores key-value stores Rick Cattel s clustering from Scalable SQL and NoSQL Data Stores SIGMOD Record, 2010
System/ Scale to Primary Secondary Joins/ Integrity Language/ Data Year Paper 1000s Index Indexes Transactions Analytics Constraints Views Algebra model my label 1971 RDBMS O tables sql-like 2003 memcached
More informationThere's Plenty of Room in the Cloud
There's Plenty of Room in the Cloud [Shameless reference to Feynman s talk from 1959] Lecturer: Zoran Dimitrijevic Altiscale, Inc. Spring 2015 CS290B -- Cloud Computing 50 Years of Moore
More informationSAS BIG DATA SOLUTIONS ON AWS SAS FORUM ESPAÑA, OCTOBER 16 TH, 2014 IAN MEYERS SOLUTIONS ARCHITECT / AMAZON WEB SERVICES
SAS BIG DATA SOLUTIONS ON AWS SAS FORUM ESPAÑA, OCTOBER 16 TH, 2014 IAN MEYERS SOLUTIONS ARCHITECT / AMAZON WEB SERVICES AWS GLOBAL INFRASTRUCTURE 10 Regions 25 Availability Zones 51 Edge locations WHAT
More informationApache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
More informationCloudera Enterprise Reference Architecture for Google Cloud Platform Deployments
Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Important Notice 2010-2016 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, Impala, and
More informationHadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction
More informationCommunicating with the Elephant in the Data Center
Communicating with the Elephant in the Data Center Who am I? Instructor Consultant Opensource Advocate http://www.laubersoltions.com sml@laubersolutions.com Twitter: @laubersm Freenode: laubersm Outline
More informationBig Data Management and Security
Big Data Management and Security Audit Concerns and Business Risks Tami Frankenfield Sr. Director, Analytics and Enterprise Data Mercury Insurance What is Big Data? Velocity + Volume + Variety = Value
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationIJFEAT INTERNATIONAL JOURNAL FOR ENGINEERING APPLICATIONS AND TECHNOLOGY
IJFEAT INTERNATIONAL JOURNAL FOR ENGINEERING APPLICATIONS AND TECHNOLOGY Hadoop Distributed File System: What and Why? Ashwini Dhruva Nikam, Computer Science & Engineering, J.D.I.E.T., Yavatmal. Maharashtra,
More informationCloudera Enterprise Reference Architecture for Google Cloud Platform Deployments
Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Important Notice 2010-2015 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, Impala, and
More informationCS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
More informationCS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase
Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform
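Alexey Vovchenko, Candidate of Technical Sciences, Senior Research Fellow, Faculty of Computational Mathematics and Cybernetics, Moscow State University; Institute of Informatics Problems, Russian Academy of Sciences
Alexey Vovchenko, Candidate of Technical Sciences, Senior Research Fellow, Faculty of Computational Mathematics and Cybernetics, Moscow State University; Institute of Informatics Problems, Russian Academy of Sciences Zettabytes Petabytes ABC Sharding A B C Id Fn Ln Addr 1 Fred Jones Liberty, NY 2 John Smith?????? 122+ NoSQL Database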
Hadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google's distributed
Big Data and Data Science: Behind the Buzz Words
Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing
Scalable Architecture on Amazon AWS Cloud
Scalable Architecture on Amazon AWS Cloud Kalpak Shah Founder & CEO, Clogeny Technologies kalpak@clogeny.com 1 * http://www.rightscale.com/products/cloud-computing-uses/scalable-website.php 2 Architect
Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy
Native Connectivity to Big Data Sources in MicroStrategy 10 Presented by: Raja Ganapathy Agenda MicroStrategy supports several data sources, including Hadoop Why Hadoop? How does MicroStrategy Analytics
Challenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
Hadoop & Spark Using Amazon EMR
Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?
Copyright 2012, Oracle and/or its affiliates. All rights reserved.
1 Oracle Big Data Appliance Releases 2.5 and 3.0 Ralf Lange Global ISV & OEM Sales Agenda Quick Overview on BDA and its Positioning Product Details and Updates Security and Encryption New Hadoop Versions
Hadoop Distributed File System. Dhruba Borthakur June, 2007
Hadoop Distributed File System Dhruba Borthakur June, 2007 Goals of HDFS Very Large Distributed File System 10K nodes, 100 million files, 10 PB Assumes Commodity Hardware Files are replicated to handle
Contents. Pentaho Corporation. Version 5.1. Copyright Page. New Features in Pentaho Data Integration 5.1. PDI Version 5.1 Minor Functionality Changes
Contents Pentaho Corporation Version 5.1 Copyright Page New Features in Pentaho Data Integration 5.1 PDI Version 5.1 Minor Functionality Changes Legal Notices https://help.pentaho.com/template:pentaho/controls/pdftocfooter
Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd's 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83-84
Index A Amazon Web Services (AWS), 50, 58 Analytics engine, 21-22 Apache Kafka, 38, 131 Apache S4, 38, 131 Apache Sqoop, 37, 131 Appliance pattern, 104-105 Application architecture, big data analytics
Introduction to Big Data Training
Introduction to Big Data Training The quickest way to be introduced to NOSQL/BIG DATA offerings Learn and experience Big Data Solutions including Hadoop HDFS, Map Reduce, NoSQL DBs: Document Based DB
MySQL High-Availability and Scale-Out architectures
MySQL High-Availability and Scale-Out architectures Oli Sennhauser Senior Consultant osennhauser@mysql.com 1 Introduction Who we are? What we want? 2 Table of Contents Scale-Up vs. Scale-Out MySQL Replication
Splice Machine: SQL-on-Hadoop Evaluation Guide www.splicemachine.com
REPORT Splice Machine: SQL-on-Hadoop Evaluation Guide www.splicemachine.com The content of this evaluation guide, including the ideas and concepts contained within, is the property of Splice Machine,
Sujee Maniyam, ElephantScale
Hadoop 2: New and Noteworthy Sujee Maniyam, ElephantScale SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member
GraySort and MinuteSort at Yahoo on Hadoop 0.23
GraySort and MinuteSort at Yahoo on Hadoop 0.23 Thomas Graves Yahoo! May, 2013 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters
HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics
HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics ESSENTIALS EMC ISILON Use the industry's first and only scale-out NAS solution with native Hadoop
Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB
Overview of Databases On MacOS Karl Kuehn Automation Engineer RethinkDB Session Goals Introduce Database concepts Show example players Not Goals: Cover non-macos systems (Oracle) Teach you SQL Answer what
Oracle Big Data Fundamentals Ed 1 NEW
Oracle University Contact Us: +90 212 329 6779 Oracle Big Data Fundamentals Ed 1 NEW Duration: 5 Days What you will learn In the Oracle Big Data Fundamentals course, learn to use Oracle's Integrated Big
Large scale processing using Hadoop. Ján Vaňo
Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine
MyISAM Default Storage Engine before MySQL 5.5 Table level locking Small footprint on disk Read Only during backups GIS and FTS indexing Copyright 2014, Oracle and/or its affiliates. All rights reserved.
Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing
Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer timo.aaltonen@tut.fi Assistants: Henri Terho and Antti
BookKeeper. Flavio Junqueira Yahoo! Research, Barcelona. Hadoop in China 2011
BookKeeper Flavio Junqueira Yahoo! Research, Barcelona Hadoop in China 2011 What's BookKeeper? Shared storage for writing fast sequences of byte arrays Data is replicated Writes are striped Many processes
Upcoming Announcements
Enterprise Hadoop Enterprise Hadoop Jeff Markham Technical Director, APAC jmarkham@hortonworks.com Page 1 Upcoming Announcements April 2 Hortonworks Platform 2.1 A continued focus on innovation within
Oracle NoSQL Database A Distributed Key-Value Store
Oracle NoSQL Database A Distributed Key-Value Store Charles Lamb, Consulting MTS The following is intended to outline our general product direction. It is intended for information
6.S897 Large-Scale Systems
6.S897 Large-Scale Systems Instructor: Matei Zaharia Fall 2015, TR 2:30-4, 34-301 bit.ly/6-s897 Outline What this course is about Logistics Datacenter environment What this Course is About Large-scale
HDFS Federation. Sanjay Radia Founder and Architect @ Hortonworks. Page 1
HDFS Federation Sanjay Radia Founder and Architect @ Hortonworks Page 1 About Me Apache Hadoop Committer and Member of Hadoop PMC Architect of core-hadoop @ Yahoo - Focusing on HDFS, MapReduce scheduler,
On-Prem MongoDB-as-a-Service Powered by the CumuLogic DBaaS Platform
On-Prem MongoDB-as-a-Service Powered by the CumuLogic DBaaS Platform Page 1 of 16 Table of Contents Table of Contents... 2 Introduction... 3 NoSQL Databases... 3 CumuLogic NoSQL Database Service...
Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam sastry.vedantam@oracle.com
Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam sastry.vedantam@oracle.com Agenda The rise of Big Data & Hadoop MySQL in the Big Data Lifecycle MySQL Solutions for Big Data Q&A
BIG DATA: STORAGE, ANALYSIS AND IMPACT GEDIMINAS ŽYLIUS
BIG DATA: STORAGE, ANALYSIS AND IMPACT GEDIMINAS ŽYLIUS WHAT IS BIG DATA? describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information
Deploying Hadoop with Manager
Deploying Hadoop with Manager SUSE Big Data Made Easier Peter Linnell / Sales Engineer plinnell@suse.com Alejandro Bonilla / Sales Engineer abonilla@suse.com 2 Hadoop Core Components 3 Typical Hadoop Distribution
Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay
Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability
Data Services Advisory
Data Services Advisory Modern Datastores An Introduction Created by: Strategy and Transformation Services Modified Date: 8/27/2014 Classification: DRAFT SAFE HARBOR STATEMENT This presentation contains
MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015
27/04/2015 Fundamentals of Distributed Systems CC5- MASSIVE DATA PROCESSING, AUTUMN 2015 Lecture 4: DFS & MapReduce I Aidan Hogan aidhog@gmail.com Inside Google circa 1997/98 MASSIVE DATA PROCESSING (THE
In Memory Accelerator for MongoDB
In Memory Accelerator for MongoDB Yakov Zhdanov, Director R&D GridGain Systems GridGain: In Memory Computing Leader 5 years in production 100s of customers & users Starts every 10 secs worldwide Over 15,000,000
Actian SQL in Hadoop Buyer's Guide
Actian SQL in Hadoop Buyer's Guide Contents Introduction: Big Data and Hadoop... 3 SQL on Hadoop Benefits... 4 Approaches to SQL on Hadoop... 4 The Top 10 SQL in Hadoop Capabilities... 5 SQL in Hadoop
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture
Open source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
What Next for DBAs in the Big Data Era
What Next for DBAs in the Big Data Era February 21st, 2015 Copyright 2013. Apps Associates LLC. 1 Satyendra Kumar Pasalapudi Associate Practice Director IMS @ Apps Associates Co Founder & President of
Big Data and Hadoop. Module 1: Introduction to Big Data and Hadoop. Module 2: Hadoop Distributed File System. Module 3: MapReduce
Big Data and Hadoop Module 1: Introduction to Big Data and Hadoop Learn about Big Data and the shortcomings of the prevailing solutions for Big Data issues. You will also get to know how Hadoop eradicates
So What's the Big Deal?
So What's the Big Deal? Presentation Agenda Introduction What is Big Data? So What is the Big Deal? Big Data Technologies Identifying Big Data Opportunities Conducting a Big Data Proof of Concept Big Data
Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP
Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools
Cloud Scale Distributed Data Storage. Jürmo Mehine
Cloud Scale Distributed Data Storage Jürmo Mehine 2014 Outline Background Relational model Database scaling Keys, values and aggregates The NoSQL landscape Non-relational data models Key-value Document-oriented
Maximizing Hadoop Performance and Storage Capacity with AltraHD TM
Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created
Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22nd October 2013 10:00 Session B - DB2 LUW
Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software 22nd October 2013 10:00 Session B - DB2 LUW 1 Agenda Big Data The Technical Challenges Architecture of Hadoop