Trend Micro Big Data Platform and Apache Bigtop. 葉 祐 欣 (Evans Ye) Big Data Conference 2015





Who am I
- Apache Bigtop PMC member
- Apache Big Data Europe 2015 speaker
- Software Engineer @ Trend Micro
- Develop big data apps & infra
- Has some experience in Hadoop, HBase, Pig, Spark, Kafka, Fluentd, Akka, and Docker

Outline
- Quick Intro to Bigtop
- Trend Micro Big Data Platform
- Mission-specific Platform
- Big Data Landscape
- Bigtop 1.1 Release

Quick Intro to Bigtop

Linux Distributions

Hadoop Distributions

Hadoop Distributions: we're fully open sourced!

How do I add patches?

From source code to packages Bigtop Packaging

Bigtop feature set
- Packaging
- Testing
- Deployment
- Virtualization
for you to easily build your own Big Data Stack

Supported components

One click to build packages
$ git clone https://github.com/apache/bigtop.git
$ docker run \
    --rm \
    --volume `pwd`/bigtop:/bigtop \
    --workdir /bigtop \
    bigtop/slaves:trunk-centos-7 \
    bash -l -c './gradlew rpm'

$ ./gradlew tasks

Easy to do CI: ci.bigtop.apache.org

RPM/DEB packages: www.apache.org/dist/bigtop

One click Hadoop provisioning
$ ./docker-hadoop.sh -c 3
- bigtop/deploy image on Docker Hub
- puppet apply runs in each of the three containers
Just google "bigtop provisioner"

Should I use Bigtop?

If you want to build your own customised Big Data Stack

Curves ahead

Pros & cons
Bigtop
- You need a talented Hadoop team
- Self-service: troubleshoot, find solutions, develop patches
- Add any patch at any time you want (additional effort)
- Choose any version of a component you want (additional effort)
Vendors (Hortonworks, Cloudera, etc.)
- Better support, since they're the guys who write the code!
- $

Trend Micro Big Data Platform

Trend Micro Hadoop (TMH)
- Use Bigtop as the basis for our internal custom distribution of Hadoop
- Apply community and private patches to upstream projects for business and operational needs
- Newest TMH7 is based on Bigtop 1.0 SNAPSHOT

Working with community made our life easier
- Knowing community status made the TMH7 release based on Bigtop 1.0 SNAPSHOT possible
- Contribute Bigtop Provisioner, packaging code, puppet recipes, bugfixes, CI infra, anything!
- Leverage Bigtop smoke tests and integration tests with Bigtop Provisioner to evaluate TMH7
- Contribute feedback, evaluation, and use cases through production-level adoption

Trend Micro Big Data Stack, powered by Bigtop (diagram)
- Layers: In-house Apps (App A, App B, App C, App D); APIs and Interfaces; Processing Engine; Storage; Resource Management (Hadoop YARN); Deployment (Hadooppet for prod, Hadoocker for dev)
- Components shown: Kerberos, Ad-hoc Query, UDFs, Pig, MapReduce, Hadoop HDFS, Wuji, Oozie, HBase, Solr Cloud

Hadooppet
- Puppet recipes to deploy and manage the TMH Big Data Platform
- HDFS, YARN, HA auto-configured
- Kerberos, LDAP auto-configured
- Kerberos cross-realm authentication auto-configured (for distcp to run across secured clusters)

Hadoocker
- A DevOps toolkit for Hadoop app developers to develop and test their code on the Big Data Stack
- Preloaded images > dev & test env w/o deployment > support for end-to-end CI tests
- A Hadoop env for apps to test against a new Hadoop distribution
- https://github.com/evans-ye/hadoocker

Docker-based dev & test env (diagram)
- Images come from an internal Docker registry
- Containers: Hadoop server (TMH7), Hadoop client, Hadoop app (RESTful APIs), dependency services (Oozie (Wuji), Solr)
- ./execute.sh loads sample data with hadoop fs -put

Mission-specific Platform

Use case
- Real-time streaming data flows in
- Look up external info when data flows in
- Detect threats/malicious activities on streaming data
- Correlate with other historical data (batch query) to gather more info
- Can also run batch detections by specifying an arbitrary start time and end time
- Support investigation down to raw log level

Lambda Architecture

(Diagram, built up step by step. Components: receiver; buffer; transformation, lookup ext info; streaming; batch)

Kafka
- High-throughput, distributed publish-subscribe messaging system
- Supports multiple consumers attached to a topic
- Configurable partition (shard) count and replication factor
- Load-balance within the same consumer group
- Each message is consumed only once within a group
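To make the consumer-group behaviour concrete, here is a minimal in-memory sketch. It is not the Kafka client API: the round-robin assignment, the group names, and the function names are all assumptions for illustration, and real Kafka assignors are more involved. It only shows the two properties from the slide: every group sees every message (fan-out), while inside a group each partition has exactly one owner (load balancing).

```python
from collections import defaultdict

def assign_partitions(partitions, consumers):
    """Round-robin sketch: each partition gets exactly one owner in the group."""
    assignment = defaultdict(list)
    for i, partition in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(partition)
    return dict(assignment)

def deliver(messages, groups):
    """messages: list of (partition, payload); groups: {group_name: [consumer, ...]}.

    Returns {(group, consumer): [payload, ...]}.
    """
    partitions = {partition for partition, _ in messages}
    received = defaultdict(list)
    for group, consumers in groups.items():
        # Invert the assignment so we can look up the owner of a partition.
        owner = {
            p: consumer
            for consumer, owned in assign_partitions(partitions, consumers).items()
            for p in owned
        }
        for partition, payload in messages:
            received[(group, owner[partition])].append(payload)
    return dict(received)

if __name__ == "__main__":
    msgs = [(0, "a"), (1, "b"), (2, "c")]
    # Hypothetical groups: two detectors share the load, one archiver reads everything.
    print(deliver(msgs, {"detect": ["d1", "d2"], "archive": ["x1"]}))
```

Adding a consumer to a group spreads the partitions thinner; adding a new group adds another full copy of the stream.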

Cassandra
- Distributed NoSQL key-value storage, no SPOF
- Super fast on writes, suitable for data that keeps coming in
- Decent read performance, if you design it right
- Build the data model around your queries
- Spark Cassandra Connector
- Configurable C/A trade-off (CAP theorem): choose A over C for availability, and vice versa
- Reference: Dynamo: Amazon's Highly Available Key-value Store
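The configurable consistency/availability trade-off can be sketched with the usual Dynamo-style quorum rule: with N replicas, a read that contacts R replicas is guaranteed to overlap a write acknowledged by W replicas when R + W > N. The snippet below is plain arithmetic, not a Cassandra driver call; only the three level names ONE, QUORUM, and ALL are modelled, and the function names are made up for this example.

```python
def replicas_required(level, replication_factor):
    """Number of replicas that must respond for a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported level: {level}")

def read_overlaps_write(write_level, read_level, replication_factor):
    """True when every read set must share at least one replica with every write set."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor

if __name__ == "__main__":
    for write_level, read_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ALL", "ONE")]:
        print(write_level, read_level, read_overlaps_write(write_level, read_level, 3))
```

Lower levels answer with fewer replicas up (favouring A); higher levels wait for more replicas (favouring C).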

Spark
- Fast, distributed, in-memory processing engine
- One system for streaming and batch workloads
- Spark Streaming

Akka
- High-performance concurrency framework for Java and Scala
- Actor model for message-driven processing
- Asynchronous by design to achieve high throughput
- Each message is handled in a single-threaded context (no locks or synchronization needed)
- Let-it-crash model for fault tolerance and a self-healing system
- Clustering mechanism to scale out
- Reference: The Road to Akka Cluster, and Beyond
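The "single-threaded context" point is easier to see in code. The sketch below imitates an actor with a Python thread and a queue; it is not Akka, and CounterActor, tell, and run_demo are hypothetical names. Many senders post messages concurrently, but only the actor's own thread ever touches its state, so the counter itself needs no lock (the mailbox queue is the only shared, thread-safe piece).

```python
import queue
import threading

class CounterActor:
    """A toy actor: one mailbox, one thread, private state."""

    _STOP = object()

    def __init__(self):
        self._mailbox = queue.Queue()
        self.count = 0
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def tell(self, amount):
        """Fire-and-forget send, callable from any thread."""
        self._mailbox.put(amount)

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is self._STOP:
                return
            # Only this thread mutates count, one message at a time.
            self.count += message

    def stop(self):
        """Drain everything queued so far, then shut down."""
        self._mailbox.put(self._STOP)
        self._thread.join()

def run_demo(senders=4, messages_per_sender=1000):
    actor = CounterActor()

    def send_many():
        for _ in range(messages_per_sender):
            actor.tell(1)

    threads = [threading.Thread(target=send_many) for _ in range(senders)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    actor.stop()
    return actor.count

if __name__ == "__main__":
    print(run_demo())
```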

Akka Streams
- Akka Streams is a DSL library for streaming computation on Akka
- Building blocks: Source > Flow > Sink
- Materializer to transform each step into an Actor
- Back-pressure enabled by default
- Reference: The Reactive Manifesto

No back-pressure (diagram): Source (fast!!!) > Sink (slow)

With back-pressure (diagram): Source (fast!!!) > Sink (slow); each stage sends demand upstream ("request 3")
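The "request 3" arrows can be sketched as a demand-driven loop: the sink asks for a bounded number of elements, and the source never emits more than was asked for. This is a synchronous simplification written for illustration (Akka Streams does this asynchronously between actors); Source, request, and run_with_back_pressure are hypothetical names, not the Akka Streams API.

```python
class Source:
    """A fast producer that only emits when asked."""

    def __init__(self, items):
        self._items = iter(items)

    def request(self, n):
        """Emit at most n items: demand, not production speed, sets the pace."""
        batch = []
        for _ in range(n):
            try:
                batch.append(next(self._items))
            except StopIteration:
                break
        return batch

def run_with_back_pressure(items, demand=3):
    """Drive a slow sink; return what it processed and the largest batch in flight."""
    source = Source(items)
    processed = []
    max_in_flight = 0
    while True:
        batch = source.request(demand)      # the sink signals demand, e.g. "request 3"
        if not batch:
            break
        max_in_flight = max(max_in_flight, len(batch))
        processed.extend(batch)             # the sink finishes this batch before asking again
    return processed, max_in_flight

if __name__ == "__main__":
    print(run_with_back_pressure(range(10), demand=3))
```

However fast the source could produce, the number of unprocessed elements never exceeds the demand the sink signalled, which is what keeps a slow sink from being overwhelmed.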

Data pipeline with Akka Streams
- Scale up using balance and merge
- (Diagram: balance > worker, worker, worker > merge)
- Source: http://doc.akka.io/docs/akka-stream-and-http-experimental/1.0/scala/stream-cookbook.html#working-with-flows
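The balance/merge shape can be approximated outside Akka with a shared input queue (balance: whichever worker is free takes the next element) and a shared output queue (merge: results arrive in completion order). This is a sketch of the topology only, using Python threads and hypothetical names; it is not the Akka Streams Balance or Merge API, and CPython threads will not speed up CPU-bound work.

```python
import queue
import threading

def balance_and_merge(items, worker_fn, worker_count=3):
    """Fan items out to worker_count workers, then fan results back in."""
    inbox = queue.Queue()    # "balance": any idle worker pulls the next item
    outbox = queue.Queue()   # "merge": results from all workers end up here
    stop = object()

    def worker():
        while True:
            item = inbox.get()
            if item is stop:
                return
            outbox.put(worker_fn(item))

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for item in items:
        inbox.put(item)
    for _ in threads:
        inbox.put(stop)      # one stop marker per worker
    for t in threads:
        t.join()

    results = []
    while not outbox.empty():
        results.append(outbox.get())
    return results           # order is not guaranteed, as with a merge stage

if __name__ == "__main__":
    print(sorted(balance_and_merge(range(6), lambda x: x * x)))
```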

Data pipeline with Akka Streams
- Scale out using docker
$ docker-compose scale pipeline=3

Reactive Kafka
- Akka Streams wrapper for Kafka
- Commits processed offsets back into Kafka
- Provides an at-least-once delivery guarantee
- https://github.com/softwaremill/reactive-kafka

Message delivery guarantee
- Actor Model: at-most-once
- Akka Persistence: at-least-once
  - Persist log to external storage (like WAL)
- Reactive Kafka: at-least-once + back-pressure
  - Write offset back into Kafka
- At-least-once + idempotent writes = exactly-once
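The last line of the slide is worth a worked example. In the sketch below a consumer writes first and commits its offset afterwards; a simulated crash between the two steps causes a redelivery on restart. An append-style sink then sees a duplicate, while a keyed upsert (an idempotent write) ends up with exactly one copy. The log layout, the message ids, and the function names are assumptions made for this example, not Reactive Kafka code.

```python
def consume(log, start_offset, sink, crash_before_commit_at=None):
    """Process log[start_offset:]; return the last committed offset."""
    committed = start_offset
    for offset in range(start_offset, len(log)):
        message_id, payload = log[offset]
        sink(message_id, payload)              # side effect happens first
        if offset == crash_before_commit_at:
            return committed                   # simulated crash: written but not committed
        committed = offset + 1                 # offset is written back after processing
    return committed

def run(log, crash_at):
    appended = []    # non-idempotent sink: duplicates show up
    upserted = {}    # idempotent sink: keyed by message id, replays overwrite

    def sink(message_id, payload):
        appended.append(payload)
        upserted[message_id] = payload

    committed = consume(log, 0, sink, crash_before_commit_at=crash_at)
    consume(log, committed, sink)              # restart from the last committed offset
    return appended, upserted

if __name__ == "__main__":
    log = [("m1", "a"), ("m2", "b"), ("m3", "c")]
    print(run(log, crash_at=1))
```

The redelivered message "b" appears twice in the appended list but only once in the upserted dictionary, which is the sense in which at-least-once delivery plus idempotent writes behaves like exactly-once.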

Recap: SDACK Stack
- Spark: both streaming and batch analytics
- Docker: resource management (fine for one app)
- Akka: fine-grained, elastic data pipelines
- Cassandra: batch queries
- Kafka: durable buffer, fan-out to multiple consumers

Your mileage may vary

We're still evolving

Remember this:

The SMACK Stack: a toolbox for a wide variety of data processing scenarios

SMACK Stack
- Spark: fast and general engine for large-scale data processing
- Mesos: cluster resource management system
- Akka: toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications
- Cassandra: distributed, highly available database designed to handle large amounts of data across datacenters
- Kafka: high-throughput, low-latency distributed pub-sub messaging system for real-time data feeds
Source: http://www.slideshare.net/akirillov/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka

Reference
- Spark Summit Europe 2015: Streaming Analytics with Spark, Kafka, Cassandra, and Akka (Helena Edelson)
- Big Data AW Meetup: SMACK Architectures (Anton Kirillov)

Big Data Landscape

Where Big Data is moving
- Memory is faster than SSD/disk, and is getting cheaper
- In-Memory Computing & Fast Data
- Spark: in-memory batch/streaming engine
- Flink: in-memory streaming/batch engine
- Ignite: in-memory data fabric
- Geode (incubating): in-memory database

Off-Heap, Off-Heap, Off-Heap
- Off-heap storage is JVM process memory outside of the heap, allocated and managed using native calls
- Its size is not limited by the JVM heap (it is limited by physical memory)
- It is not subject to GC, which essentially removes long GC pauses
- Used by: Project Tungsten, Flink, Ignite, Geode, HBase

(Some) Apache Big Data Components (diagram)
- Layers: APIs and Interfaces; Processing Engine; Storage; Resource Management
- Flink: Flink ML, Gelly
- Spark: Streaming, MLlib, GraphX
- Hadoop: Pig, Hive, Tez, HDFS, YARN
- Also shown: Phoenix, Trafodion, HBase, Slider, Mesos
- Hadoop distribution: Bigtop
- Hadoop management: Ambari
- Messaging system: Kafka
- In-memory data grid: Ignite, Geode
- Search engine: Solr
- NoSQL: Cassandra

Bigtop 1.1 Release: Jan 2016, I expect

Bigtop 1.1 Release
- Hadoop 2.7.1
- Spark 1.5.1
- Hive 1.2.1
- Pig 0.15.0
- Oozie 4.2.0
- Flume 1.6.0
- Zeppelin 0.5.5
- Ignite Hadoop 1.5.0
- Phoenix 4.6.0
- Hue 3.8.1
- Crunch 0.12
24 components included!

Hadoop 2.6
- Heterogeneous storage: SSD + hard drive
- Placement policies (all_ssd, hot, warm, cold)
- Archival storage (cost saving)
HDFS-7285 (Hadoop 3.0)
- Erasure coding to cut storage overhead from 3X to 1.5X
http://www.slideshare.net/hadoop_summit/reduce-storagecosts-by-5x-using-the-new-hdfs-tiered-storage-feature
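The 3X versus 1.5X figures follow from simple arithmetic. The slide does not name the coding scheme; the sketch below assumes a Reed-Solomon layout with 6 data blocks and 3 parity blocks, which is one layout that yields exactly 1.5X. The function names are made up for this example.

```python
def replication_overhead(replicas):
    """Raw bytes stored per byte of user data under plain replication."""
    return float(replicas)

def erasure_coding_overhead(data_blocks, parity_blocks):
    """Raw bytes stored per byte of user data under a (data, parity) erasure code."""
    return (data_blocks + parity_blocks) / data_blocks

def tolerated_losses(replicas=None, parity_blocks=None):
    """How many copies or blocks can be lost without losing data."""
    if replicas is not None:
        return replicas - 1
    return parity_blocks

if __name__ == "__main__":
    print("3-way replication:", replication_overhead(3), "x, tolerates", tolerated_losses(replicas=3))
    # Assumption: 6 data + 3 parity blocks per stripe.
    print("RS(6,3):", erasure_coding_overhead(6, 3), "x, tolerates", tolerated_losses(parity_blocks=3))
```

Under that assumption the erasure-coded layout stores half as many raw bytes as 3-way replication while tolerating the loss of three blocks per stripe instead of two replicas.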

Hadoop 2.7
- Transparent encryption (encryption zone)
  - Available in 2.6
  - Known issue: encryption is sometimes done incorrectly (HADOOP-11343)
  - Fixed in 2.7
http://events.linuxfoundation.org/sites/events/files/slides/HDFS2015_Past_present_future.pdf

Rising star: Flink
- Streaming dataflow engine
- Treats batch computing as fixed-length streaming
- Exactly-once by distributed snapshotting
- Event-time handling by watermarks
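Event-time handling by watermarks can be illustrated with a small simulation. This is not Flink code: the watermark here is simply "largest event time seen minus an allowed out-of-orderness", windows are fixed-size tumbling windows, and late events are dropped, all of which are simplifying assumptions with hypothetical names. A window fires once the watermark passes its end, so moderately out-of-order events still land in the right window.

```python
def tumbling_windows(events, window_size, max_out_of_orderness):
    """events: (event_time, value) pairs in arrival order.

    Returns [(window_start, [values])] in the order the windows fire.
    """
    open_windows = {}
    fired = []
    watermark = float("-inf")
    for event_time, value in events:
        start = event_time - event_time % window_size
        if start + window_size <= watermark:
            continue  # too late: this window has already fired
        open_windows.setdefault(start, []).append(value)
        # The watermark trails the largest event time seen so far.
        watermark = max(watermark, event_time - max_out_of_orderness)
        ready = sorted(s for s in open_windows if s + window_size <= watermark)
        for s in ready:
            fired.append((s, open_windows.pop(s)))
    # End of stream: flush whatever is still open.
    for s in sorted(open_windows):
        fired.append((s, open_windows.pop(s)))
    return fired

if __name__ == "__main__":
    arrivals = [(1, "a"), (4, "b"), (3, "c"), (12, "d"), (2, "late"), (15, "e")]
    print(tumbling_windows(arrivals, window_size=10, max_out_of_orderness=2))
```

In this run the event at time 3 arrives after the event at time 4 but is still counted, while the event at time 2 arrives after the watermark has reached 10 and is discarded.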

Bigtop Roadmap
- Integrate and package Apache Flink
- Re-implement Bigtop Provisioner using docker-machine, compose, swarm
- Deploy containers on multiple hosts
- Support any kind of base image for deployment

Wrap up

Wrap up
- Hadoop Distribution: choose Bigtop if you want more control
- The SMACK Stack: a toolbox for a variety of data processing scenarios
- Big Data Landscape: in-memory, off-heap solutions are hot

Thank you! Questions?