The Best Database for Hadoop

1 The Best Database for Hadoop Justin Makeig, Director, Product Management, MarkLogic April 9, 2013

2 Disclaimer: Forward-looking Statements. All statements describing future releases and capabilities, estimated release dates, and content are plans only, and MarkLogic is under no obligation to develop, include or make available, commercially or otherwise, any specific feature or functionality in any MarkLogic product. Information is provided for general understanding and informational purposes only, and is subject to change at the sole discretion of MarkLogic in response to changing customer requirements, market conditions, delivery schedules and other factors. Information should not be distributed without written permission from MarkLogic. Slide 8

3 Agenda: What is Hadoop and why is it important? MarkLogic with Hadoop as compute infrastructure; MarkLogic with Hadoop as storage infrastructure; reference architectures; example implementation; Q and/or A. Slide 9

4 What is Hadoop? Distributed processing for large or computationally complex problems. Core tenets: scale out, not up; move processing, not data, in bandwidth-constrained environments; expect and embrace failure. Open-source Java implementation of Google's MapReduce framework. Evolved out of Nutch and Lucene in 2006 at Yahoo! Slide 10

5 Why is Hadoop important? Emerging compute and storage infrastructure; economics of commodity scale out vs. up; unstructured throughout; more data > clever algorithms; fault tolerant by design; momentum and community. Slide 11

6 Real-time applications Real-time applications Batch analytics Hadoop Slide 12

7 Real-time applications Real-time applications Magic? Batch analytics Hadoop Slide 13

8 Why must we choose? Legacy RDBMS: indexes, transactions, security, enterprise operations. NoSQL: flexible data model; commodity scale out; distributed, fault-tolerant; Hadoop sink/source. Slide 14

9 The best database for Hadoop Real-time applications Batch analytics Hadoop Slide 15

10 MarkLogic for Hadoop: Deploy MarkLogic into an existing Hadoop stack; real-time enterprise applications for Hadoop; less data movement, duplication over its life cycle; mixed workloads (index once, real-time or batch); cost-effective long-term and long-tail storage; leverage existing (or upcoming) infrastructure investments. [Diagram labels: Compute, Storage] Slide 16

11 Complementary approaches. MarkLogic: online applications, decision-making, real-time, distributed indexes. Hadoop: offline analytics, model-building, long-haul batch, distributed file system. Slide 17

12 Hadoop ecosystem. Query: Hive, HBase, Pig, Impala. Ingest: Flume, Sqoop, Scribe. Management: Ambari, Hue, Chukwa, Vaidya. Coordination: ZooKeeper, Oozie. Compute: MapReduce. Storage: HDFS. Slide 18

13 Hadoop ecosystem MapReduce: Distributed computation, divide and conquer HDFS: Distributed file system MapReduce: Compute HDFS: Storage Slide 19

14 MapReduce: divide and conquer. Break large or complex processing into small, independent pieces. Map: process or filter a chunk of the total input data. Reduce: aggregate and collate intermediate results. Map and Reduce processes work in parallel. Scale by adding workers, not bigger/faster workers. Centrally coordinated: if a worker goes down, reschedule its work to another. Slide 20
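The "centrally coordinated" point can be illustrated with a toy coordinator. This is a minimal sketch, not how Hadoop's scheduler is implemented: the function names (`run_with_rescheduling`, `healthy_worker`, `broken_worker`), the round-robin assignment, and the `max_attempts` guard are all assumptions made for the example.

```python
from collections import deque


def run_with_rescheduling(tasks, workers, max_attempts=100):
    """Hand each task to a worker; if that worker fails, requeue the task."""
    pending = deque(tasks)
    results = {}
    attempts = 0
    while pending:
        if attempts >= max_attempts:
            raise RuntimeError("gave up: too many failed attempts")
        task = pending.popleft()
        # Hypothetical policy: rotate through the workers in order.
        worker = workers[attempts % len(workers)]
        attempts += 1
        try:
            results[task] = worker(task)
        except RuntimeError:
            # The worker went down: put its task back for another worker.
            pending.append(task)
    return results


def healthy_worker(n):
    """Stand-in for a worker that completes its piece of work."""
    return n * n


def broken_worker(n):
    """Stand-in for a worker that has gone down."""
    raise RuntimeError("worker went down")


results = run_with_rescheduling([1, 2, 3, 4], [healthy_worker, broken_worker])
print(results)
```

Every task still completes even though half the workers always fail, because each piece of work is independent and can be handed to any other worker.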

15 MapReduce Example Problem: Sally needs to count the number of books in the library that have pictures of cats, grouped by publisher. Willow Creek Press (April 2005) Slide 21

16 Example: Single-Threaded. Sally looks at every page of every book on each of the library's seven shelves. If a given book has a picture of a cat, she notes the publisher and adds a tick. She sums up the ticks for each publisher. Reading: 30 seconds/book × 500 books/shelf × 7 shelves ≈ 29 hours. Aggregating: 0.1 seconds/tick × 3,500 ticks ≈ 6 minutes. Slide 22
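Sally's single-threaded approach might look like the sketch below. The book records, the `has_cat` flag, and the publisher "Acme" are made up for illustration; only the timing figures come from the slide.

```python
from collections import Counter


def count_cat_books(shelves):
    """Scan every book on every shelf, one at a time, tallying cat books per publisher."""
    ticks = Counter()
    for shelf in shelves:
        for book in shelf:
            if book["has_cat"]:
                ticks[book["publisher"]] += 1
    return ticks


# A tiny, made-up library of two shelves.
shelves = [
    [{"publisher": "Willow Creek Press", "has_cat": True},
     {"publisher": "Acme", "has_cat": False}],
    [{"publisher": "Acme", "has_cat": True},
     {"publisher": "Willow Creek Press", "has_cat": True}],
]
single_threaded = count_cat_books(shelves)

# The slide's time estimate, worked through in seconds.
reading_hours = 30 * 500 * 7 / 3600       # 105,000 seconds of reading
aggregating_minutes = 0.1 * 3500 / 60     # 350 seconds of summing ticks
print(single_threaded, round(reading_hours, 1), round(aggregating_minutes, 1))
```

The 3,500 ticks in the estimate correspond to the worst case in which every one of the 500 × 7 books has a cat picture.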

17 Example: MapReduce. Sally recruits nine friends. She assigns one friend to each of the seven shelves. For each book with a cat picture that a friend finds, that friend adds the publisher's name to their own running list of publishers (Map). When they all finish, Sally combines the lists and sorts the combined list by publisher. She divides the total list in half and gives one half to each of her remaining two friends. Each adds up the occurrences of each publisher (Reduce). Sally combines their results. Map: 30 seconds/book × 500 books/shelf ≈ 4 hours. Reduce: 0.1 second/entry × 1,750 entries ≈ 3 minutes. Slide 23
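The same count, organized the way Sally and her friends do it, can be sketched as follows. This is an illustration in plain Python threads, not Hadoop code; the sample books, the `has_cat` flag, and the publisher "Acme" are assumptions. Splitting the sorted list in half follows the slide; a real MapReduce job would typically partition by key instead, so that no publisher straddles two reducers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_shelf(shelf):
    """Map: one friend scans one shelf and lists the publisher of each cat book."""
    return [book["publisher"] for book in shelf if book["has_cat"]]


def reduce_counts(publishers):
    """Reduce: one friend adds up the occurrences of each publisher in their half."""
    return Counter(publishers)


def count_cat_books_mapreduce(shelves):
    # One mapper per shelf, all working in parallel.
    with ThreadPoolExecutor(max_workers=max(1, len(shelves))) as pool:
        lists = list(pool.map(map_shelf, shelves))
    # Sally combines the lists and sorts the combined list by publisher.
    combined = sorted(name for lst in lists for name in lst)
    mid = len(combined) // 2
    halves = [combined[:mid], combined[mid:]]
    # Two reducers, one per half.
    with ThreadPoolExecutor(max_workers=2) as pool:
        partials = list(pool.map(reduce_counts, halves))
    # Sally combines their results.
    return sum(partials, Counter())


# A tiny, made-up library of three shelves.
shelves = [
    [{"publisher": "Willow Creek Press", "has_cat": True},
     {"publisher": "Acme", "has_cat": False}],
    [{"publisher": "Acme", "has_cat": True},
     {"publisher": "Willow Creek Press", "has_cat": True}],
    [{"publisher": "Acme", "has_cat": True},
     {"publisher": "Willow Creek Press", "has_cat": True}],
]
totals = count_cat_books_mapreduce(shelves)

# The slide's time estimate: each phase takes as long as one worker's share.
map_hours = 30 * 500 / 3600        # 15,000 seconds per shelf
reduce_minutes = 0.1 * 1750 / 60   # 175 seconds per half
print(totals, round(map_hours, 1), round(reduce_minutes, 1))
```

Elapsed time drops because the seven shelves are read at once; the total amount of reading is unchanged.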

18 MapReduce on MarkLogic Export ETL Enrichment Connector for Hadoop Slide 24

19 Hadoop ecosystem MapReduce: Distributed computation, divide and conquer HDFS: Distributed file system HDFS: Storage Slide 25

20 Hadoop Distributed File System: Cheap, reliable storage for large files; default data storage for Hadoop; scales to hundreds of petabytes on commodity hardware; designed only for reading large, opaque files from start to finish; optimizes for aggregate throughput, not latency; write-once, read-many; automatic replication for fault tolerance and locality; file-level security designed to prevent accidental corruption. Slide 26

21 HDFS [Diagram: a client's file is split into 64 MB blocks A, B, and C; the Name Node tracks the blocks; each block is stored three times across the Data Nodes.] Slide 27
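The block-and-replica layout in the diagram can be sketched roughly as below. The 64 MB block size comes from the slide, and three copies per block matches both the diagram and the HDFS default replication factor. The round-robin placement and the node names `dn1` to `dn4` are simplifying assumptions; HDFS itself uses a rack-aware placement policy.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, as on the slide
REPLICATION = 3                 # each block appears three times in the diagram


def count_blocks(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold file_size bytes (the last may be partial)."""
    return (file_size + block_size - 1) // block_size


def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Build a name-node-style map: block index -> distinct data nodes holding a copy."""
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        # Hypothetical policy: consecutive nodes, starting one further along each time.
        placement[block] = [
            data_nodes[(block + i) % len(data_nodes)] for i in range(replication)
        ]
    return placement


file_size = 150 * 1024 * 1024   # a made-up 150 MB file
num_blocks = count_blocks(file_size)
placement = place_replicas(num_blocks, ["dn1", "dn2", "dn3", "dn4"])
print(num_blocks, placement)
```

Losing any single data node leaves at least two copies of every block, which is what lets the file system expect failure rather than prevent it.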

22 MarkLogic storage Database Hosts Forests File system Slide 28

23 Shared file system as storage SAN NAS File system Slide 29

24 Hadoop as shared file system. On HDFS: data and indexes, journals, offline archives, backups, binaries. Slide 30

25 Hadoop as a storage tier. ~$25/GB tier: low density for ingest performance; replication for HA. ~$1/GB tier (HDFS): high density for efficiency; shared-disk failover. Slide 31
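To see what the two price points imply, here is a small worked calculation. The ~$25/GB and ~$1/GB figures are the slide's; the 100,000 GB data set and the 10% "hot" fraction are hypothetical inputs chosen only for illustration.

```python
def blended_cost(total_gb, hot_fraction, hot_cost=25.0, cold_cost=1.0):
    """Return (total cost, cost per GB) when only hot_fraction sits on the expensive tier."""
    hot_gb = total_gb * hot_fraction
    cold_gb = total_gb - hot_gb
    total = hot_gb * hot_cost + cold_gb * cold_cost
    return total, total / total_gb


# Hypothetical: 100,000 GB overall, 10% kept on the high-performance tier.
total, per_gb = blended_cost(100_000, 0.10)
all_hot_total, _ = blended_cost(100_000, 1.0)
print(f"tiered: ${total:,.0f} (${per_gb:.2f}/GB); all on fast tier: ${all_hot_total:,.0f}")
```

Under these assumed inputs the tiered layout costs roughly $3.40/GB instead of $25/GB, since 90% of the data sits on the cheaper tier.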

26 Tiered storage with Hadoop: Optimize among cost, performance, and availability; partition and balance forest data by key (e.g. date); migrate partitions between hosts and storage; local disk, SAN, NAS, HDFS, and S3; offline to free resources, back online in seconds; MapReduce on forest data in HDFS without MarkLogic. Much more on tiered storage tomorrow at 10:30 am in Chelsea 5. Slide 32
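Partitioning by date and migrating older partitions to cheaper storage could be expressed as a simple policy like the one below. This is only a sketch of the idea: the function `tier_for`, the 30-day and 365-day thresholds, and the choice of three tiers are assumptions, and it does not represent any MarkLogic API.

```python
from datetime import date


def tier_for(partition_end, today, hot_days=30, warm_days=365):
    """Pick a storage tier for a date-keyed partition based on its age."""
    age_days = (today - partition_end).days
    if age_days <= hot_days:
        return "local disk"   # recent data: fastest storage
    if age_days <= warm_days:
        return "SAN"          # older data: shared storage
    return "HDFS"             # long-tail data: cheapest, high-density storage


today = date(2013, 4, 9)
partitions = [date(2013, 4, 1), date(2012, 10, 1), date(2010, 1, 1)]
assignments = {p.isoformat(): tier_for(p, today) for p in partitions}
print(assignments)
```

Re-running the policy as partitions age yields the list of partitions to migrate, which is the trade-off among cost, performance, and availability that the slide describes.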

27 Mixed batch and real-time workloads MarkLogic Database Hadoop MapReduce HDFS Slide 33

28 Hadoop for the Enterprise: Enterprise Hadoop meets enterprise NoSQL; certification and testing of a known configuration; integrated support and enterprise management tools; performance and security enhancements. Slide 34

29 The best database for Hadoop: Deploy MarkLogic into an existing Hadoop stack; real-time enterprise applications for Hadoop; less data movement, duplication over its life cycle; mixed workloads (index once, real-time or batch); cost-effective long-term and long-tail storage; leverage existing (or upcoming) infrastructure investments. Available today in Early Access. Slide 35
