Trend Micro Big Data Platform and Apache Bigtop. 葉祐欣 (Evans Ye) Big Data Conference 2015

1 Trend Micro Big Data Platform and Apache Bigtop 葉祐欣 (Evans Ye) Big Data Conference 2015

2 Who am I Apache Bigtop PMC member Apache Big Data Europe 2015 Speaker Software Trend Micro Develop big data apps & infra Has some experience in Hadoop, HBase, Pig, Spark, Kafka, Fluentd, Akka, and Docker

3 Outline: Quick Intro to Bigtop; Trend Micro Big Data Platform; Mission-specific Platform; Big Data Landscape (3p); Bigtop 1.1 Release (6p)

4 Quick Intro to Bigtop

5 Linux Distributions

6 Hadoop Distributions

7 Hadoop Distributions We're fully open sourced!

8 How do I add patches?

9

10 From source code to packages Bigtop Packaging

11 Bigtop feature set: packaging, testing, deployment, and virtualization, so you can easily build your own Big Data Stack

12 Supported components

13 One click to build packages
$ git clone
$ docker run \
    --rm \
    --volume `pwd`/bigtop:/bigtop \
    --workdir /bigtop \
    bigtop/slaves:trunk-centos-7 \
    bash -l -c './gradlew rpm'

14 $ ./gradlew tasks

15 Easy to do CI ci.bigtop.apache.org

16 RPM/DEB packages

17 One click Hadoop provisioning: ./docker-hadoop.sh -c 3

18 One click Hadoop provisioning: bigtop/deploy image on Docker Hub; ./docker-hadoop.sh -c 3

19 One click Hadoop provisioning: bigtop/deploy image on Docker Hub; puppet apply runs in each of the 3 containers; ./docker-hadoop.sh -c 3. Just google "bigtop provisioner".

20 Should I use Bigtop?

21 If you want to build your own customised Big Data Stack

22 Curves ahead

23 Pros & cons. Bigtop: you need a talented Hadoop team; self-service (troubleshoot, find solutions, develop patches); add any patch at any time you want (additional effort); choose any version of any component you want (additional effort). Vendors (Hortonworks, Cloudera, etc.): better support, since they're the ones who write the code! $

24 Trend Micro Big Data Platform

25 Trend Micro Hadoop (TMH): uses Bigtop as the basis for our internal custom distribution of Hadoop; applies community and private patches to upstream projects to meet business and operational needs; the newest release, TMH7, is based on Bigtop 1.0 SNAPSHOT

26 Working with community made our life easier: knowing the community status made a TMH7 release based on Bigtop 1.0 SNAPSHOT possible

27 Working with community made our life easier: knowing the community status made a TMH7 release based on Bigtop 1.0 SNAPSHOT possible. Contribute: Bigtop Provisioner, packaging code, puppet recipes, bugfixes, CI infra, anything!

28 Working with community made our life easier: leverage Bigtop smoke tests and integration tests with Bigtop Provisioner to evaluate TMH7

29 Working with community made our life easier: leverage Bigtop smoke tests and integration tests with Bigtop Provisioner to evaluate TMH7. Contribute: feedback, evaluation, and use cases through production-level adoption

30 Trend Micro Big Data Stack Powered by Bigtop In-house Apps App A App B App C App D APIs and Interfaces Processing Engine Storage Kerberos Ad-hoc Query UDFs Pig MapReduce Hadoop HDFS Wuji Oozie HBase Solr Cloud Resource Management Hadoop YARN Deployment Hadooppet (prod) Hadoocker (dev)

31 Hadooppet: Puppet recipes to deploy and manage the TMH Big Data Platform; HDFS, YARN, and HA auto-configured; Kerberos and LDAP auto-configured; Kerberos cross-realm authentication auto-configured (so distcp can run across secured clusters)

32

33 Hadoocker: a DevOps toolkit for Hadoop app developers to develop and test their code on the Big Data Stack (preloaded images > dev & test env w/o deployment > support for end-to-end CI tests); a Hadoop env for apps to test against a new Hadoop distribution

34 Docker based dev & test env internal Docker registry Hadoop server TMH7 Hadoop client Hadoop app RESTful APIs data sample data ./execute.sh hadoop fs -put

35 Docker based dev & test env internal Docker registry Oozie(Wuji) Hadoop server TMH7 Hadoop client Dependency service data Hadoop app RESTful APIs sample data ./execute.sh Solr hadoop fs -put

36 Mission-specific Platform

37 Use case: real-time streaming data flows in; look up external info when data flows in; detect threats/malicious activities on streaming data; correlate with other historical data (batch query) to gather more info; can also run batch detections by specifying an arbitrary start time and end time; support investigation down to the raw log level

38 Lambda Architecture

39 receiver

40 buffer receiver

41 receiver transformation, lookup ext info buffer

42 receiver transformation, lookup ext info buffer streaming batch

43 receiver transformation, lookup ext info buffer streaming batch

44 Kafka: high-throughput, distributed publish-subscribe messaging system; supports multiple consumers attached to a topic; configurable partitions (shards) and replication factor; load-balances within the same consumer group, so each message is consumed only once per group
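
The consumer-group behaviour on slide 44 (Kafka in the SDACK recap) can be sketched as a plain-Python simulation. This does not use a real Kafka client; the round-robin assignment and the names `assign_partitions`, `analytics`, and `archiver` are assumptions made for illustration, and real Kafka offers several assignment strategies.

```python
from collections import defaultdict

def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of ONE group (round-robin).

    Each partition is owned by exactly one consumer in the group, so a
    message is processed once per group.
    """
    assignment = defaultdict(list)
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return dict(assignment)

partitions = ["topic-0", "topic-1", "topic-2"]

# Two independent (hypothetical) groups attached to the same topic: every
# group sees all partitions (fan-out), while members of a group share the load.
groups = {
    "analytics": ["a1", "a2", "a3"],
    "archiver": ["b1"],
}
plan = {group: assign_partitions(partitions, members)
        for group, members in groups.items()}

for group, assignment in plan.items():
    print(group, assignment)
```

Adding a member to a group spreads the same partitions over more consumers; adding a new group gives a second, independent reader of the whole topic.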

45 Cassandra: distributed NoSQL key-value storage, no SPOF; super fast on writes, suitable for data that keeps coming in; decent read performance if you design it right; build your data model around your queries; Spark Cassandra Connector; tunable consistency vs. availability (CAP theorem): choose A over C for availability, or vice versa. Reference: Dynamo: Amazon's Highly Available Key-value Store
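
The "configurable CA" point on slide 45 can be made concrete with the quorum-overlap rule used by Dynamo-style stores: a read is guaranteed to see the latest acknowledged write when R + W > N. The slide does not spell this rule out, so treat the sketch below as background arithmetic; the function names are hypothetical and the level names only mirror common consistency levels.

```python
def quorum(replication_factor):
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1

def reads_see_latest_write(n, w, r):
    """True when every read set overlaps every write set (R + W > N)."""
    return r + w > n

N = 3  # replication factor (assumed for the example)
levels = {"ONE": 1, "QUORUM": quorum(N), "ALL": N}

for write_level, w in levels.items():
    for read_level, r in levels.items():
        consistent = reads_see_latest_write(N, w, r)
        label = "consistent" if consistent else "eventually consistent"
        print(f"write={write_level:<6} read={read_level:<6} -> {label}")
```

Lower levels answer with fewer replicas and so stay available when nodes fail; higher levels trade that availability for consistency.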

46 Spark: fast, distributed, in-memory processing engine; one system for streaming and batch workloads (Spark Streaming)

47 Akka: high-performance concurrency framework for Java and Scala; actor model for message-driven processing; asynchronous by design to achieve high throughput; each message is handled in a single-threaded context (no locks or synchronization needed); let-it-crash model for fault tolerance and a self-healing system; clustering mechanism to scale out. Reference: The Road to Akka Cluster, and Beyond
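
The "single threaded context" point on slide 47 can be illustrated with a toy actor in plain Python. This is not Akka's API; `CounterActor`, `tell`, and `stop` are hypothetical names, and the sketch assumes one thread per actor for simplicity (Akka schedules actors on a shared dispatcher instead).

```python
import queue
import threading

class CounterActor:
    """Toy actor: private state plus a mailbox drained by one thread."""

    _STOP = object()

    def __init__(self):
        self._mailbox = queue.Queue()  # the only structure shared with senders
        self.count = 0                 # mutated only by the actor's own thread
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def tell(self, message):
        """Fire-and-forget send; returns immediately."""
        self._mailbox.put(message)

    def stop(self):
        """Ask the actor to finish after the messages already queued."""
        self._mailbox.put(self._STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is self._STOP:
                break
            if message == "increment":
                self.count += 1  # no lock: messages are handled one at a time

def send_many(actor, times):
    for _ in range(times):
        actor.tell("increment")

actor = CounterActor()
senders = [threading.Thread(target=send_many, args=(actor, 1000)) for _ in range(4)]
for sender in senders:
    sender.start()
for sender in senders:
    sender.join()
actor.stop()
print(actor.count)  # 4000
```

Four threads send concurrently, yet the counter needs no lock because only the actor's own thread ever touches it.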

48 Akka Streams: a DSL library for streaming computation on Akka; building blocks are Source, Flow, and Sink; a Materializer transforms each step into an Actor; back-pressure enabled by default. Reference: The Reactive Manifesto

49 No back-pressure Source Fast!!! Slow Sink

50 No back-pressure Source Fast!!! Slow Sink

51 With back-pressure Source Fast!!! Slow Sink

52 With back-pressure Source Fast!!! Slow Sink request 3 request 3
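
The "request 3" exchange on slide 52 can be sketched as demand-driven pulling: the sink signals how many elements it can take, and the source never emits more than that. This is a synchronous plain-Python simulation, not the Akka Streams or Reactive Streams API; `Source.pull`, `SlowSink`, and `max_in_flight` are hypothetical names.

```python
from collections import deque

class Source:
    """A fast producer that only emits when asked."""

    def __init__(self, items):
        self._items = deque(items)

    def pull(self, demand):
        """Emit at most `demand` items; never more than was requested."""
        batch = []
        while self._items and len(batch) < demand:
            batch.append(self._items.popleft())
        return batch

class SlowSink:
    """A slow consumer that requests a fixed number of items at a time."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.processed = []
        self.max_in_flight = 0  # largest batch ever held at once

    def drain(self, source):
        while True:
            batch = source.pull(self.batch_size)  # the "request 3" signal
            if not batch:
                break
            self.max_in_flight = max(self.max_in_flight, len(batch))
            self.processed.extend(batch)

source = Source(range(10))
sink = SlowSink(batch_size=3)
sink.drain(source)
print(sink.processed, sink.max_in_flight)
```

Because the source emits only on demand, the sink never holds more than three unprocessed elements, however fast the source could produce.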

53 Data pipeline with Akka Streams: scale up using balance and merge (balance > worker, worker, worker > merge) source:
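
The balance-and-merge shape on slide 53 can be sketched with threads and queues: a shared input queue hands each element to whichever worker is free (balance), and a shared output queue collects the results (merge). This is a plain-Python illustration, not the Akka Streams graph DSL; `run_pipeline` and its parameters are hypothetical names.

```python
import queue
import threading

def run_pipeline(items, worker_count, transform):
    """Fan items out to `worker_count` workers and merge their results."""
    inbox = queue.Queue()   # balance: an idle worker takes the next item
    outbox = queue.Queue()  # merge: all workers write to one output
    stop = object()

    def worker():
        while True:
            item = inbox.get()
            if item is stop:
                break
            outbox.put(transform(item))

    workers = [threading.Thread(target=worker) for _ in range(worker_count)]
    for w in workers:
        w.start()
    for item in items:
        inbox.put(item)
    for _ in workers:
        inbox.put(stop)  # one stop marker per worker
    for w in workers:
        w.join()

    results = []
    while not outbox.empty():
        results.append(outbox.get())
    return results

results = run_pipeline(range(6), worker_count=3, transform=lambda x: x * x)
print(sorted(results))  # [0, 1, 4, 9, 16, 25]
```

As with a balance/merge stage, the merged output order is not guaranteed, which is why the example sorts before printing.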

54 Data pipeline with Akka Streams: scale out using Docker $ docker-compose scale pipeline=3

55 Reactive Kafka: Akka Streams wrapper for Kafka; commits processed offsets back into Kafka; provides an at-least-once delivery guarantee

56 Message delivery guarantee. Actor model: at-most-once. Akka Persistence: at-least-once (persists a log to external storage, like a WAL). Reactive Kafka: at-least-once + back-pressure (writes offsets back into Kafka). At-least-once + idempotent writes = exactly-once
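
The last equation on slide 56 can be checked with a small simulation: a message is redelivered because its offset commit is lost, and only the idempotent write survives the replay unchanged. This is plain Python, not Reactive Kafka; `deliver_at_least_once`, the `crash_after` parameter, and the sample messages are assumptions made for illustration.

```python
def deliver_at_least_once(messages, crash_after):
    """Yield (offset, message), replaying from the last committed offset.

    The message at offset `crash_after` is processed, but its offset commit
    is lost once, so that message is delivered a second time.
    """
    committed = 0
    crashed = False
    while committed < len(messages):
        offset = committed
        yield offset, messages[offset]
        if offset == crash_after and not crashed:
            crashed = True  # commit lost: the same offset is redelivered
            continue
        committed = offset + 1

messages = [("k1", 10), ("k2", 20), ("k3", 30)]

naive_total = 0  # non-idempotent sink: adds on every delivery
store = {}       # idempotent sink: upsert keyed by message key
deliveries = []

for offset, (key, value) in deliver_at_least_once(messages, crash_after=1):
    deliveries.append(offset)
    naive_total += value
    store[key] = value

print(deliveries)           # [0, 1, 1, 2]
print(naive_total)          # 80, the replayed message is counted twice
print(sum(store.values()))  # 60, the replay overwrites the same key
```

The idempotent sink ends in the same state as if each message had been delivered exactly once, which is the effect the slide describes.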

57 Recap: SDACK Stack Spark: both streaming and batch analytics Docker: resource management (fine for one app) Akka: fine-grained, elastic data pipelines Cassandra: batch queries Kafka: durable buffer, fan-out to multiple consumers

58 Your mileage may vary

59 We're still evolving

60 Remember this:

61 The SMACK Stack: toolbox for a wide variety of data processing scenarios

62 SMACK Stack Spark: fast and general engine for large-scale data processing Mesos: cluster resource management system Akka: toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications Cassandra: distributed, highly available database designed to handle large amounts of data across datacenters Kafka: high-throughput, low-latency distributed pub-sub messaging system for real-time data feeds Source:

63 Reference. Spark Summit Europe 2015: Streaming Analytics with Spark, Kafka, Cassandra, and Akka (Helena Edelson). Big Data AW Meetup: SMACK Architectures (Anton Kirillov)

64 Big Data Landscape

65 Big Data moving trend: memory is faster than SSD/disk, and is getting cheaper; in-memory computing & fast data. Spark: in-memory batch/streaming engine. Flink: in-memory streaming/batch engine. Ignite: in-memory data fabric. Geode (incubating): in-memory database

66 Off-Heap, Off-Heap, Off-Heap: off-heap storage is JVM process memory outside of the heap, allocated and managed using native calls; its size is not limited by the JVM heap (only by physical memory); it is not subject to GC, which essentially removes long GC pauses. Used by Project Tungsten, Flink, Ignite, Geode, HBase

67 (Some) Apache Big Data APIs and Interfaces Processing Engine Storage Resource Management Flink ML, Gelly Flink Components Streaming, MLlib, GraphX Spark Pig Hadoop HDFS Hadoop YARN Hive Tez Hadoop Distribution } Phoenix Trafodion Bigtop HBase Slider Mesos Ambari Hadoop Management messaging system in memory data grid search engine NoSQL Kafka Ignite Geode Solr Cassandra

68 Bigtop 1.1 Release: expected January 2016

69 Bigtop 1.1 Release: Hadoop, Spark, Hive, Pig, Oozie, Flume, Zeppelin, Ignite Hadoop, Phoenix, Hue, Crunch 0.12; 24 components included!

70

71 Hadoop 2.6: heterogeneous storage (SSD + hard drive); placement policies (all_ssd, hot, warm, cold); archival storage (cost saving). HDFS-7285 (Hadoop 3.0): erasure coding reduces storage overhead from 3x to 1.5x
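
The 3x to 1.5x figure on slide 71 follows from simple arithmetic. The slide does not name the coding scheme; the sketch below assumes Reed-Solomon with 6 data blocks and 3 parity blocks, which yields exactly the quoted 1.5x. The function names and the 100 TB figure are hypothetical.

```python
def replication_overhead(replicas):
    """Raw bytes stored per logical byte with n-way replication."""
    return float(replicas)

def erasure_overhead(data_blocks, parity_blocks):
    """Raw bytes stored per logical byte with a (data, parity) code."""
    return (data_blocks + parity_blocks) / data_blocks

logical_tb = 100  # assumed dataset size, for illustration only

replicated = logical_tb * replication_overhead(3)
erasure_coded = logical_tb * erasure_overhead(6, 3)  # assumed RS(6,3)

print(f"3x replication: {replicated:.0f} TB raw")
print(f"RS(6,3) coding: {erasure_coded:.0f} TB raw")
print(f"saved:          {replicated - erasure_coded:.0f} TB")
```

Under that assumption the code tolerates the loss of any 3 of the 9 blocks in a group, compared with 2 lost copies under 3-way replication, while storing half as many raw bytes.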

72 Hadoop 2.7 Transparent encryption (encryption zone) Available in 2.6 Known issue: Encryption is sometimes done incorrectly (HADOOP-11343) Fixed in HDFS2015_Past_present_future.pdf

73 Rising star: Flink. Streaming dataflow engine; treats batch computing as fixed-length streaming; exactly-once by distributed snapshotting; event-time handling by watermarks
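
The "event time handling by watermarks" point on slide 73 can be sketched as follows: a window fires only once the watermark, here the highest event time seen minus an allowed lateness, has passed the end of the window. This is a conceptual plain-Python simulation, not Flink's API; `tumbling_counts`, its parameters, and the bounded out-of-orderness rule are assumptions made for illustration.

```python
from collections import defaultdict

def tumbling_counts(events, window_size, max_out_of_orderness):
    """Count events per event-time window, firing on watermark progress.

    `events` holds event timestamps in ARRIVAL order (possibly out of order).
    Returns (fired, late, still_open), where `fired` lists
    (window_start, count) pairs in firing order.
    """
    open_windows = defaultdict(int)
    fired, late = [], []
    watermark = float("-inf")
    for ts in events:
        start = ts - ts % window_size
        if start + window_size <= watermark:
            late.append(ts)  # its window has already fired
        else:
            open_windows[start] += 1
        watermark = max(watermark, ts - max_out_of_orderness)
        for window_start in sorted(open_windows):
            if window_start + window_size <= watermark:
                fired.append((window_start, open_windows.pop(window_start)))
    return fired, late, dict(open_windows)

arrivals = [1, 4, 3, 11, 9, 14, 2, 25]
fired, late, still_open = tumbling_counts(
    arrivals, window_size=10, max_out_of_orderness=3)

print(fired)       # [(0, 4), (10, 2)]
print(late)        # [2]
print(still_open)  # {20: 1}
```

The event with timestamp 9 arrives after 11 but is still counted in window [0, 10), because the watermark has not yet passed 10; the event with timestamp 2 arrives after that window fired and is reported as late.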

74 Bigtop Roadmap: integrate and package Apache Flink; re-implement Bigtop Provisioner using docker-machine, compose, and swarm; deploy containers on multiple hosts; support any kind of base image for deployment

75 Wrap up

76 Wrap up. Hadoop Distribution: choose Bigtop if you want more control. The SMACK Stack: toolbox for a wide variety of data processing scenarios. Big Data Landscape: in-memory, off-heap solutions are hot

77 Thank you! Questions?