Cloud Computing Era Trend Micro
Three Major Trends to Change the World: Cloud Computing, Big Data, Mobile
What is Cloud Computing? The U.S. National Institute of Standards and Technology (NIST) definition covers Essential Characteristics, Service Models, and Deployment Models: an as-a-service business model that uses Internet technologies to deliver scalable and elastic IT capabilities to users.
It's About the Ecosystem: Cloud Computing (SaaS, PaaS, IaaS) and the Enterprise Data Warehouse generate structured and semi-structured Big Data, which leads to business insights and creates competition, innovation, and productivity.
What is Big Data?
What is the problem? Getting the data to the processors becomes the bottleneck. Quick calculation: at a typical disk transfer rate of 75 MB/sec, moving 100 GB of data to the processor takes 100,000 MB / 75 MB/sec, roughly 1,333 seconds, i.e. approx. 22 minutes!
The Era of Big Data: Are You Ready? Data for business and commercial analysis: 2011: multi-terabyte (TB); 2020: 35.2 ZB (1 ZB = 1 billion TB).
Who Needs It? Enterprise database vs. Hadoop.
Enterprise database, when to use: ad-hoc reporting (<1 sec), multi-step transactions, lots of inserts/updates/deletes.
Hadoop, when to use: affordable storage/compute, unstructured or semi-structured data, resilient auto scalability.
Hadoop!
The Apache Hadoop project was inspired by Google's MapReduce and Google File System papers. It is an open-source, flexible, and available architecture for large-scale computation and data processing on a network of commodity hardware. Open-source software plus commodity hardware means IT cost reduction.
Hadoop Core: MapReduce and HDFS
HDFS (Hadoop Distributed File System): redundant, fault tolerant, scalable, self-healing; write once, read many times; accessed via a Java API and a command-line tool.
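To make the Java API bullet concrete, here is a minimal sketch, not taken from the deck, that opens and prints a file from HDFS; the class name and the path /user/demo/input.txt are hypothetical. The command-line equivalent would be hadoop fs -cat /user/demo/input.txt.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);              // handle to the configured file system
    Path file = new Path("/user/demo/input.txt");      // hypothetical path
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(fs.open(file)))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);                      // stream the file's contents to stdout
      }
    }
  }
}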
MapReduce: the two phases, map and reduce, come from functional programming; redundant, fault tolerant, scalable, self-healing; Java API.
Hadoop Core: MapReduce and HDFS, both implemented in Java and accessed through Java APIs.
Word Count Example: input records are (key: byte offset, value: line), e.g. 0: "The cat sat on the mat" and 22: "The aardvark sat on the sofa"; the map phase emits (key: word, value: count) and the reduce phase emits (key: word, value: sum of counts).
The Hadoop Ecosystem
The Ecosystem is the System. Hadoop has become the kernel of the distributed operating system for Big Data. No one uses the kernel alone; the ecosystem is a collection of projects at Apache.
Relation Map: Hue (Web Console), Mahout (Data Mining), Oozie (Job Workflow & Scheduling), ZooKeeper (Coordination), Sqoop/Flume (Data Integration), MapReduce Runtime (Distributed Programming Framework), Pig/Hive (Analytical Languages), HBase (Columnar NoSQL DB), all on the Hadoop Distributed File System (HDFS).
ZooKeeper: Coordination Framework
What is ZooKeeper? A centralized service for maintaining configuration information and providing distributed synchronization; a set of tools to build distributed applications that can safely handle partial failures. ZooKeeper was designed to store coordination data: status information, configuration, and location information.
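A minimal sketch, not from the deck, of storing and reading a small piece of coordination data with the ZooKeeper Java client; the ensemble address, znode path, and value are hypothetical.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigDemo {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);
    // Hypothetical ensemble address; wait for the session to be established before using it.
    ZooKeeper zk = new ZooKeeper("zk1:2181", 10000, event -> {
      if (event.getState() == KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await();

    String path = "/demo-db-host";                     // hypothetical znode holding location info
    if (zk.exists(path, false) == null) {
      zk.create(path, "db01:3306".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }
    byte[] data = zk.getData(path, false, null);       // any client can read the same value
    System.out.println(new String(data));
    zk.close();
  }
}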
Flume / Sqoop: Data Integration Framework
What's the problem with data collection? Data collection is currently a priori and ad hoc. A priori: you decide what you want to collect ahead of time. Ad hoc: each kind of data source goes through its own collection path.
What is Flume (and how can it help)? A distributed data collection service that efficiently collects, aggregates, and moves large amounts of data. Fault tolerant, with many failover and recovery mechanisms. A one-stop solution for collecting data of all formats.
Flume Architecture: logs from many sources are collected by Flume nodes and delivered into HDFS.
Flume Sources and Sinks: local files, HDFS, stdin/stdout, Twitter, IRC, IMAP.
Sqoop: easy, parallel database import/export. What do you want to do? Import data from an RDBMS into HDFS, or export data from HDFS back into the RDBMS.
Sqoop sits between HDFS and the RDBMS and moves data in both directions.
Sqoop Examples
$ sqoop import --connect jdbc:mysql://localhost/world --username root --table City
...
$ hadoop fs -cat City/part-m-00000
1,Kabul,AFG,Kabol,1780000
2,Qandahar,AFG,Qandahar,237500
3,Herat,AFG,Herat,186800
4,Mazar-e-Sharif,AFG,Balkh,127800
5,Amsterdam,NLD,Noord-Holland,731200
...
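For the export direction mentioned on the previous Sqoop slide, a hedged sketch that pushes the imported City directory back into MySQL; it assumes the target table already exists, and the exact options can vary by Sqoop version.

$ sqoop export --connect jdbc:mysql://localhost/world --username root --table City --export-dir City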
Pig / Hive: Analytical Languages
Why Hive and Pig? Although MapReduce is very powerful, it can also be complex to master. Many organizations have business or data analysts who are skilled at writing SQL queries but not at writing Java code, and programmers who are skilled at writing code in scripting languages. Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce. Hive was initially developed at Facebook, Pig at Yahoo!
Hive, developed by Facebook. What is Hive? An SQL-like interface to Hadoop: a data warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop, using MapReduce for execution and HDFS for storage. Hive Query Language: basic SQL (SELECT, FROM, JOIN, GROUP BY), equi-joins, multi-table insert, multi-group-by; batch queries, e.g.
SELECT storeid, SUM(price) FROM purchases WHERE price > 100 GROUP BY storeid
Hive: SQL queries are submitted to Hive, which turns them into MapReduce jobs.
Pig, initiated by Yahoo! A high-level scripting language (Pig Latin) that processes data one step at a time. Simpler than writing MapReduce programs; easy to understand and easy to debug.
A = LOAD 'a.txt' AS (id, name, age, ...);
B = LOAD 'b.txt' AS (id, address, ...);
C = JOIN A BY id, B BY id;
STORE C INTO 'c.txt';
Pig: Pig Latin scripts are submitted to Pig, which turns them into MapReduce jobs.
Hive vs. Pig
Language: Hive uses HiveQL (SQL-like); Pig uses Pig Latin, a scripting language.
Schema: Hive table definitions are stored in a metastore; in Pig a schema is optionally defined at runtime.
Programmatic access: Hive offers JDBC and ODBC (a JDBC sketch follows below); Pig offers PigServer.
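As a rough illustration of the programmatic-access row above, a minimal Java sketch that runs the earlier Hive query over JDBC; it assumes a HiveServer2 endpoint at localhost:10000, placeholder credentials, and the hypothetical purchases table, and the driver/URL details may differ across Hive versions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcDemo {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");          // HiveServer2 JDBC driver
    Connection conn = DriverManager.getConnection(
        "jdbc:hive2://localhost:10000/default", "hive", "");    // assumed endpoint and credentials
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery(
        "SELECT storeid, SUM(price) FROM purchases WHERE price > 100 GROUP BY storeid");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getString(2));
    }
    rs.close();
    stmt.close();
    conn.close();
  }
}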
WordCount Example. Input: "Hello World Bye World" and "Hello Hadoop Goodbye Hadoop". For the given sample input the map emits <Hello, 1> <World, 1> <Bye, 1> <World, 1> <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>; the reduce just sums up the values and emits <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>.
WordCount Example In MapReduce

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input line.
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts for each word.
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
WordCount Example By Pig
A = LOAD 'wordcount/input' USING PigStorage() AS (token:chararray);  -- assumes one token per line
B = GROUP A BY token;
C = FOREACH B GENERATE group, COUNT(A) AS count;
DUMP C;
WordCount Example By Hive
CREATE TABLE wordcount (token STRING);
LOAD DATA LOCAL INPATH 'wordcount/input' OVERWRITE INTO TABLE wordcount;
SELECT token, count(*) FROM wordcount GROUP BY token;
The Story So Far: SQL runs through Hive, scripts through Pig, and Java code through MapReduce, all on top of HDFS; Sqoop exchanges data with SQL RDBMSs, and Flume ingests data from POSIX file systems and other sources.
HBase: Columnar NoSQL DB
Structured-data vs Raw-data
HBase: inspired by Google BigTable and coordinated by ZooKeeper. Low-latency random reads and writes; a distributed key/value store with a simple API: PUT, GET, DELETE, SCAN.
HBase Data Model: cells are versioned; table rows are sorted by row key; a region is a row range [start-key : end-key].
HBase Examples
hbase> create 'mytable', 'mycf'
hbase> list
hbase> put 'mytable', 'row1', 'mycf:col1', 'val1'
hbase> put 'mytable', 'row1', 'mycf:col2', 'val2'
hbase> put 'mytable', 'row2', 'mycf:col1', 'val3'
hbase> scan 'mytable'
hbase> disable 'mytable'
hbase> drop 'mytable'
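The same PUT and GET operations can be done programmatically; a minimal sketch, not from the deck, using the classic HBase Java client (HTable/Put/Get) against the mytable/mycf names from the shell example above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();           // reads hbase-site.xml
    HTable table = new HTable(conf, "mytable");                  // table from the shell example
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("mycf"), Bytes.toBytes("col1"), Bytes.toBytes("val1"));
    table.put(put);                                              // PUT
    Result result = table.get(new Get(Bytes.toBytes("row1")));   // GET
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("mycf"), Bytes.toBytes("col1"))));
    table.close();
  }
}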
Oozie: Job Workflow & Scheduling
What is Oozie? A Java web application; Oozie is a workflow scheduler for Hadoop ("crond for Hadoop"). Workflows are triggered by time or data availability and chain jobs (Job 1 ... Job 5) into a dependency graph.
Oozie Features: component independent; works with MapReduce, Hive, Pig, Sqoop, and Streaming.
Mahout: Data Mining
What is Mahout? A machine-learning tool: distributed and scalable machine-learning algorithms on the Hadoop platform, making it easier and faster to build intelligent applications.
Mahout Use Cases: Yahoo: spam detection; Foursquare: recommendations; SpeedDate.com: recommendations; Adobe: user targeting; Amazon: personalization platform.
Hue (Hadoop User Experience), developed by Cloudera. An Apache-licensed open-source project. HUE is a web UI for Hadoop and a platform for building custom applications with a nice UI library.
Hue HUE comes with a suite of applications File Browser: Browse HDFS; change permissions and ownership; upload, download, view and edit files. Job Browser: View jobs, tasks, counters, logs, etc. Beeswax: Wizards to help create Hive tables, load data, run and manage Hive queries, and download results in Excel format.
Hue: File Browser UI
Hue: Beeswax UI
Use Case Example: predict what a user likes based on his/her historical behavior and the aggregate behavior of people similar to him (see the Mahout sketch below).
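A minimal, single-machine sketch of that idea with Mahout's Taste recommender API, assuming a hypothetical ratings.csv of userID,itemID,preference lines; Mahout also ships distributed versions of the same algorithms as MapReduce jobs.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserBasedRecommenderDemo {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv"));        // hypothetical behavior history
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model); // "people similar to him"
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    List<RecommendedItem> top = recommender.recommend(1L, 3);            // top-3 items for user 1
    for (RecommendedItem item : top) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}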
Conclusion Today, we introduced: Why Hadoop is needed The basic concepts of HDFS and MapReduce What sort of problems can be solved with Hadoop What other projects are included in the Hadoop ecosystem
Recap: the Hadoop Ecosystem. Hue (Web Console), Mahout (Data Mining), Oozie (Job Workflow & Scheduling), ZooKeeper (Coordination), Sqoop/Flume (Data Integration), MapReduce Runtime (Distributed Programming Framework), Pig/Hive (Analytical Languages), HBase (Columnar NoSQL DB), Hadoop Distributed File System (HDFS).
Trend Micro Cloud-Based Anti-Virus: Case Study
Collaboration in the underground
Internet threats are growing explosively. Malware variants of every kind, spam, downloads from unknown sources, and other web-borne threats evade detection by traditional security systems and keep growing explosively, creating serious security risks. New unique malware discovered: 1M unique malware samples every month.
New Design Concept for Threat Intelligence: threat data comes from CDN / xSP partners, human intelligence, honeypots, and web crawlers, and from Trend Micro Mail Protection, Web Protection, and Endpoint Protection, with 150M+ worldwide endpoints/sensors.
The Challenges We Face: the concept is great, but 6 TB of data and 15 billion lines of logs are received daily. It becomes the Big Data challenge!
Issues to Address: turning raw data into information and then into threat intelligence/solutions, under effectively infinite volume, no tolerable delay, and targets (threats) that keep changing.
SPN High-Level Architecture: SPN feedback (HTTP POST, SPAM, CDN logs) flows through the L4 tier to the log receivers and log post-processing stages of the SPN infrastructure, which runs applications on MapReduce, HBase, and the Hadoop Distributed File System (HDFS), along with ad-hoc query (Pig), Circus (Ambari), and a message bus; the resulting feedback information feeds the Email Reputation Service, Web Reputation Service, and File Reputation Service.
Trend Micro Big Data processing capacity. The data volume the cloud-based anti-virus service must handle every day: 8.5 billion Web Reputation queries, 3 billion Email Reputation queries, 7 billion File Reputation queries; 6 TB of raw logs collected from around the world; connections from 150 million endpoint devices.
Trend Micro Web Reputation Services: technology, process, and operation. User traffic, honeypots, and web crawling generate 8 billion queries per day to the CDN cache (Akamai) and the high-throughput web service, where the rating server answers known threats (about 40% filtered). Roughly 4.8 billion unknown queries per day go through prefiltering (82% filtered), and about 860 million pages per day are downloaded into the Hadoop cluster for machine learning, data mining, and threat analysis (99.98% filtered), yielding about 25,000 malicious URLs per day. A malicious URL is blocked within 15 minutes once it goes online!
Big Data Cases
Google vs. Hadoop Ecosystem Chubby vs. Zookeeper MapReduce vs. MapReduce BigTable vs. HBase GFS vs. HDFS Ref: Google Cluster
Pioneer of Big Data Infrastructure Google
HBase Use Cases @ Facebook: Messages, Facebook Insights, Self-service Hashout, Operational Data Store, More Analytics/Hashout apps, Site Integrity, Social Graph, Search Indexing, Realtime Hive Updates, Cross-system Tracing, and more (rolled out 2010 to 2013).
Flagship App: Facebook Messages. Monthly data volume prior to launch: 15B x 1,024 bytes = 14 TB; 120B x 100 bytes = 11 TB.
Facebook Messages Now: Quick Stats. 11B+ messages/day; 90B+ data accesses; peak 1.5M ops/sec; ~55% read, 45% write; 20PB+ of total data, growing 400TB/month. Covers Messages, Emails, Chats, and SMS.
Facebook Messages: Requirements. Very high write volume (previously, chat was not persisted to disk); ever-growing data sets (old data rarely gets accessed); elasticity & automatic failover; strong consistency within a single data center; large scans / map-reduce support for migrations & schema conversions; bulk data import.
Physical Multi-tenancy. Real-time Ads Insights: real-time analytics for social plugins on top of HBase. Publishers get real-time distribution/engagement metrics (# of impressions, likes) and analytics by domain/URL/demographics and time periods. Uses HBase capabilities: efficient counters (single-RPC increments), TTL for purging old data. Needs massive write throughput & low latencies: billions of URLs, millions of counter increments/second. Operational Data Store.
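To make the efficient-counters point concrete, a minimal sketch, not from the deck, of a single-RPC atomic increment with the classic HBase Java client; the table, family, and qualifier names are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class ImpressionCounterDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "url_metrics");              // hypothetical per-URL metrics table
    long impressions = table.incrementColumnValue(
        Bytes.toBytes("http://example.com/page"),                // row key: the URL
        Bytes.toBytes("stats"),                                  // column family
        Bytes.toBytes("impressions"),                            // counter column
        1L);                                                     // atomic, single-RPC increment
    System.out.println("impressions = " + impressions);
    table.close();
  }
}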
Facebook Open Source Stack Memcached --> App Server Cache ZooKeeper --> Small Data Coordination Service HBase --> Database Storage Engine HDFS --> Distributed FileSystem Hadoop --> Asynchronous Map-Reduce Jobs
Questions?
Thank you!