Hadoop & MarkLogic. Presented by: Jim Clark, Senior Director, Product Management COPYRIGHT 2013 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.

Transcription

2 Agenda
- What is Hadoop and why is it important?
- MarkLogic with Hadoop as compute infrastructure
- MarkLogic with Hadoop as storage infrastructure
- Reference architectures
- Use cases
- Distribution strategy and Intel partnership
- Q and/or A
SLIDE: 2

3 Hadoop Partners SLIDE: 3

4 Hadoop Components SLIDE: 4

6 Why should we care? SLIDE: 6

7 What is it?
"Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. Hadoop is an Apache top-level project being built and used by a global community of contributors and users. It is licensed under the Apache License." (Wikipedia.org)
The Apache Hadoop framework is composed of the following modules:
- Hadoop Common
- Hadoop Distributed File System (HDFS)
- Hadoop MapReduce
- Hadoop YARN (MapReduce 2.0)
SLIDE: 7

8 [Diagram] Layers: Management, Query, Ingest, Coordination, MapReduce: Compute, HDFS: Storage. Components shown: Hive, HBase, Pig, Impala, Flume, Sqoop, Scribe, Ambari, Hue, Chukwa, Vaidya, YARN, ZooKeeper, Oozie. SLIDE: 8

9 Why is Hadoop important?
- Emerging compute and storage infrastructure
- Economics of commodity scale out vs. up
- Unstructured throughout
- More data > clever algorithms
- Fault tolerant by design
- Momentum and community
SLIDE: 9

10 Hadoop Hadoop Staging Analytics Persistence SLIDE: 10

11 Real-time applications Real-time applications Batch analytics Hadoop SLIDE: 11

12 Real-time applications Real-time applications Magic? Batch analytics Hadoop SLIDE: 12

13 Why must we choose?
Legacy RDBMS: Indexes; Transactions; Security; Enterprise operations
NoSQL: Flexible data model; Commodity scale out; Distributed, fault-tolerant; Hadoop sink/source
SLIDE: 13

14 The best database for Hadoop Real-time applications MarkLogic Batch analytics SLIDE: 14 Hadoop

15 MarkLogic for Hadoop
- Deploy MarkLogic into an existing Hadoop stack
- Real-time enterprise applications for Hadoop
- Less data movement, duplication over its life cycle
- Mixed workloads: Index once, real-time or batch
- Cost-effective long-term and long-tail storage
- Leverage existing (or upcoming) infrastructure investments
Compute Storage
SLIDE: 15

16 Complementary approaches
MarkLogic: Online applications; Decision-making; Real-time; Distributed indexes
Hadoop: Offline analytics; Model-building; Long-haul batch; Distributed file system
SLIDE: 16

17 Hadoop ecosystem MapReduce: Distributed computation, divide and conquer HDFS: Distributed file system MapReduce: Compute HDFS: Storage SLIDE: 17

18 MapReduce: Divide and conquer
- Break large or complex processing into small, independent pieces
- Map: Process or filter a chunk of the total input data
- Reduce: Aggregate and collate intermediate results
- Map and Reduce processes work in parallel
- Scale by adding workers, not bigger/faster workers
- Centrally coordinated: If a worker goes down, reschedule its work to another
SLIDE: 18

19 MapReduce Example Problem: Sally needs to count the number of books in the library that have pictures of cats, grouped by publisher. SLIDE: 19 Willow Creek Press (April 2005)

20 Example: Single-Threaded
- Sally looks at every page of every book on each of the library's seven shelves
- If a given book has a picture of a cat, she notes the publisher and adds a tick
- She sums up the ticks for each publisher
Reading: 30 seconds/book × 500 books/shelf × 7 shelves = about 29 hours
+ Aggregating: 0.1 seconds/tick × 3,500 ticks = about 6 minutes
SLIDE: 20
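The single-threaded approach can be sketched in a few lines of Python. This is only an illustration: the catalog below is made up (the publisher "Acme Books" and the shelf contents are hypothetical), and the constants at the end simply redo the slide's timing arithmetic.

```python
from collections import Counter

# Hypothetical catalog: each shelf is a list of (publisher, has_cat_picture) pairs.
shelves = [
    [("Willow Creek Press", True), ("Acme Books", False)],
    [("Acme Books", True), ("Willow Creek Press", True)],
]

def count_cat_books(shelves):
    """Scan every book on every shelf, one at a time, and tally by publisher."""
    ticks = Counter()
    for shelf in shelves:
        for publisher, has_cat in shelf:
            if has_cat:
                ticks[publisher] += 1  # one tick per book with a cat picture
    return ticks

# Timing arithmetic with the slide's figures.
SECONDS_PER_BOOK = 30
BOOKS_PER_SHELF = 500
SHELF_COUNT = 7
reading_hours = SECONDS_PER_BOOK * BOOKS_PER_SHELF * SHELF_COUNT / 3600
aggregating_minutes = 3500 / 10 / 60  # 0.1 seconds per tick, 3,500 ticks

print(dict(count_cat_books(shelves)))
print(f"reading: {reading_hours:.1f} h, aggregating: {aggregating_minutes:.1f} min")
```

Everything happens in one loop, so the total time grows in direct proportion to the number of books.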

21 Example: MapReduce
- Sally recruits nine friends
- She assigns one friend to each of the seven shelves
- For each book with a cat picture, the friend adds the publisher's name to her own running list of publishers (Map)
- When they all finish, Sally combines the lists and sorts the combined list by publisher
- She divides the total list in half and gives each half to the remaining two of her friends
- Each adds up the occurrences of each publisher (Reduce)
- Sally combines their results
Map: 30 seconds/book × 500 books/shelf = about 4 hours
+ Reduce: 0.1 second/entry × 1,750 entries = about 3 minutes
SLIDE: 21
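The same division of labor can be sketched in Python, with a thread pool standing in for Sally's friends. This is a toy illustration, not Hadoop code: the catalog, the publisher names other than Willow Creek Press, and the function names are all hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical catalog: seven shelves of (publisher, has_cat_picture) pairs.
PUBLISHERS = ["Willow Creek Press", "Acme Books", "Example House"]
shelves = [
    [(PUBLISHERS[(s + b) % 3], (s * b) % 2 == 1) for b in range(6)]
    for s in range(7)
]

def map_shelf(shelf):
    """Map: one friend scans one shelf and lists the publisher of each cat book."""
    return [publisher for publisher, has_cat in shelf if has_cat]

def reduce_entries(entries):
    """Reduce: one friend counts the occurrences of each publisher in her slice."""
    return Counter(entries)

def count_cat_books(shelves, reducers=2):
    # One mapper per shelf, all working at the same time.
    with ThreadPoolExecutor(max_workers=len(shelves)) as pool:
        lists = list(pool.map(map_shelf, shelves))
    # Sally combines the lists and sorts the combined list by publisher.
    combined = sorted(p for lst in lists for p in lst)
    # She splits the sorted list into one slice per reducer (ceiling division).
    chunk = -(-len(combined) // reducers)
    slices = [combined[i * chunk:(i + 1) * chunk] for i in range(reducers)]
    with ThreadPoolExecutor(max_workers=reducers) as pool:
        partials = list(pool.map(reduce_entries, slices))
    # Sally combines the partial counts; a publisher may appear in both slices.
    return sum(partials, Counter())

map_hours = 30 * 500 / 3600        # each friend reads a single shelf
reduce_minutes = 1750 / 10 / 60    # 0.1 second per entry, 1,750 entries each

print(dict(count_cat_books(shelves)))
print(f"map: {map_hours:.1f} h, reduce: {reduce_minutes:.1f} min")
```

Because each shelf is scanned independently, adding shelves only requires adding mappers; the wall-clock time for the map phase stays at roughly one shelf's worth of reading.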

22 MapReduce on MarkLogic Export ETL Enrichment Connector for Hadoop MarkLogic SLIDE: 22

23 Direct access MapReduce processing Batch and real-time No ETL or re-indexing Consistent migrations Online in seconds Open-source reader SLIDE: 23

24 Hadoop ecosystem MapReduce: Distributed computation, divide and conquer HDFS: Distributed file system HDFS: Storage SLIDE: 24

25 Hadoop Distributed File System (HDFS)
- Cheap, reliable storage for large files
- Default data storage for Hadoop
- Scales to hundreds of petabytes on commodity hardware
- Designed only for reading large, opaque files from start to finish
- Optimizes for aggregate throughput, not latency
- Write-once, read-many
- Automatic replication for fault tolerance and locality
- File-level security designed to prevent accidental corruption
SLIDE: 25

26 HDFS [Diagram: a file split into 64 MB blocks A, B, and C; a client, the Name Node, and Data Nodes that each hold copies of the blocks] SLIDE: 26
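A rough way to picture what the diagram shows is to split a file into fixed-size blocks and record which data nodes hold a copy of each block, which is the kind of bookkeeping the Name Node does. The sketch below is an assumption-laden simplification: the node names are hypothetical, a replication factor of 3 is used because each block appears three times in the diagram, and the round-robin placement is a stand-in for HDFS's real placement policy.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB blocks, as on the slide
REPLICATION = 3                # assumed: each block appears three times in the diagram

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) for each block of a file of file_size bytes."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]

def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Map each block index to `replication` distinct data nodes.

    Simplified round-robin placement, for illustration only.
    """
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the requested replication")
    return {
        block: [data_nodes[(block + r) % len(data_nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

# Hypothetical cluster and file: 150 MB splits into blocks of 64, 64, and 22 MB.
nodes = ["dn1", "dn2", "dn3", "dn4"]
blocks = split_into_blocks(150 * 1024 * 1024)
placement = place_replicas(len(blocks), nodes)

for index, (offset, length) in enumerate(blocks):
    print(f"block {index}: offset={offset} length={length} nodes={placement[index]}")
```

With every block stored on three different nodes, losing any single data node still leaves two copies of each block available.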

27 MarkLogic storage Database Hosts Forests File system SLIDE: 27

28 Shared file system as storage SAN NAS File system SLIDE: 28

29 Hadoop as shared file system HDFS Data and indexes Journals Offline archives Backups Binaries SLIDE: 29

30 Hadoop as a storage tier Low density for ingest performance Replication for HA HDFS High density for efficiency Shared-disk failover SLIDE: 30

31 Data Retention and Tiered Storage
- Provide multiple Service Level Agreements (SLAs) in a single system
- Decrease time and costs of ETL to bring offline content back online
- Empower your operations team without imposing burdens on your developers
SLIDE: 31

32 Tiered Storage Architecture
- Data tiers are defined based on indexes balanced into forests by tier
- Query one tier or the other tier or both at once!
- All with no downtime, and 100% consistency!
SLIDE: 32

33 Tiered Storage MarkLogic NoSQL Active ~$25/GB Hadoop Historical ~$1/GB SLIDE: 33

34 Tiered storage with Hadoop
- Optimize among cost, performance, and availability
- Partition and balance forest data by key (e.g. date)
- Migrate partitions between hosts and storage
- Local disk, SAN, NAS, HDFS, and S3
- Offline to free resources, back online in seconds
- MapReduce on forest data in HDFS without MarkLogic
SLIDE: 34
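Partitioning by a date key can be illustrated with a small Python sketch that assigns each document to a tier by its age. This is not MarkLogic's API: the function names are hypothetical, and the age cutoffs are invented for the example, since the slides name tiers but do not give boundaries.

```python
from datetime import date

# Hypothetical cutoffs in days; the slides do not state tier boundaries.
TIER_CUTOFFS = [
    (90, "active"),       # newer than 90 days
    (730, "historical"),  # newer than two years
]
OLDEST_TIER = "archive"

def tier_for(doc_date, today):
    """Pick a tier for a document from the age of its date key."""
    age_days = (today - doc_date).days
    for max_age, tier in TIER_CUTOFFS:
        if age_days < max_age:
            return tier
    return OLDEST_TIER

def partition_by_tier(docs, today):
    """Group document ids by tier; docs maps id -> date key."""
    partitions = {}
    for doc_id, doc_date in docs.items():
        partitions.setdefault(tier_for(doc_date, today), []).append(doc_id)
    return partitions

# Made-up documents, evaluated against a fixed "today" so the result is stable.
docs = {
    "trade-1": date(2013, 5, 1),
    "trade-2": date(2012, 6, 1),
    "trade-3": date(2010, 1, 1),
}
print(partition_by_tier(docs, today=date(2013, 6, 1)))
```

Because the tier depends only on the key, moving a partition to cheaper storage as it ages is a matter of re-evaluating the same rule at a later date.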

35 OK, so how should I use it? SLIDE: 35

36 Case Studies
- Operational Tradestore
- Compliance for Customer Onboarding
- Compliance for Legal Holds
- Publishing
SLIDE: 36

37 Tier 1 Bank: Operational trade store What are the bank's obligations? Trade stores Post-trade processing Trade execution Reporting ETL Analytics Reference data SLIDE: 37

38 Legacy trade store challenges
- Long development cycles for new instrument types
- Complex combinations of ETL and data models
- Limited visibility across the business
- Governance risk, maintenance costs of siloed infrastructure
- Varied SLAs and access patterns created inefficiencies
SLIDE: 38

39 Information lifecycle
Active: SSD, DAS, SAN, Hadoop
Historical: DAS, SAN, NAS, Hadoop, S3
Archive: NAS, Hadoop, S3
Axis: Time
SLIDE: 39

40 Active Active 96 TB Local 10K SAS, RAID10 Replication for HA Merge overhead for updates 20 hosts, 320 shards 4 TB of SSD cache SLIDE: 40

41 Compliance Active 96 Compliance 504 TB Shared NAS 63 hosts Effective 8 TB/host SLIDE: 41

42 Analytic Active 96 Compliance 504 Analytic 1,044 TB Hadoop 120 hosts Effective 12 TB/host 10 MarkLogic hosts SLIDE: 42

43 Online migration Active Compliance Analytic TB SLIDE: 43

44 Total Size (TB) ,044 Total Cost ($000) 592 2,066 2,080 Effective Unit Cost ($/GB) $25 $4 $1.50 Operational Compliance Analytic SLIDE: 44

45 Mixed batch and real-time workloads MarkLogic Database Hadoop MapReduce HDFS SLIDE: 45

46 KPMG: FATCA Compliance for Customer On-Boarding
- Thousands of rules, 1-2M accounts, 30-40M documents
- Encoding, adjusting, and matching rules must scale
- Impossible to pre-define dimensions, relationships
- Vet new accounts and show your work
- Real-time decision-making
SLIDE: 46

47 KPMG: FATCA Compliance for Customer On-Boarding MarkLogic Rules Documents SLIDE: 47

48 Tier 1 European Bank: Compliance and Legal Holds
- Accurately respond to discovery as part of litigation
- Hold, review, produce data across current, legacy systems
- Repatriate and reconcile distributed data
- Demonstrate fidelity and audit trail
- Reduce infrastructure and maintenance costs
SLIDE: 48

49 Tier 1 European Bank: Compliance and Legal Holds Ingest Query Oracle 100TB 40TB Mainframe MarkLogic MarkLogic 87 total systems Sybase Offline Replication Shared Storage NAS HDFS SLIDE: 49

50 McGraw-Hill Gains 120X Performance
Value:
- Allows McGraw-Hill customers to discover all articles related to a topic of interest quickly
- Increased McGraw-Hill business process efficiency from 5 days per quarter to 1 hour on demand
- Inspired McGraw-Hill to envision new opportunities for extracting value from data
Analytics (Text Analytics, Full-text Search):
- Provide text analysis of over 1.3 million documents (classification, correlation)
- Provide full-text search and semantic analysis
Data Management: MarkLogic and Intel (Hadoop/HDFS)
SLIDE: 50

51 The best database for Hadoop: MarkLogic
- Deploy MarkLogic into an existing Hadoop stack
- Real-time enterprise applications for Hadoop
- Less data movement, duplication over its life cycle
- Mixed workloads: Index once, real-time or batch
- Cost-effective long-term and long-tail storage
- Leverage existing (or upcoming) infrastructure investments
SLIDE: 51

52 Roadmap
- Hadoop is moving fast.
- Intel MarkLogic Partnership
  - We will continue to work closely as Intel and Cloudera build out a roadmap for Hadoop over the next quarters
  - Selecting the contenders over the pretenders for a software provider is critical
  - Not unlike Linux back in the day.
- Input from You!!
  - Use cases, business opportunities
  - Alignment with MarkLogic's Product Direction
SLIDE: 52

53 Distribution Certifications
Columns: MarkLogic Distribution, Today, Roadmap
Cloudera: TBD, TBD
Intel: TBD, TBD
Horton
Our flexible development process can certify distributions quickly
SLIDE: 53

54 Take-aways
- New and more data is both an opportunity and a threat
- Last generation of data management is not sufficient
- More copies, representations, transformations increase risk and slow innovation
- Index once and reuse across workloads, lifecycle
- NoSQL: indexing and updates for interactive apps
- Hadoop: staging, persistence, and analytics
SLIDE: 54