Scaling Big Data Mining Infrastructure: The Smart Protection Network Experience. Chris Huang (黃振修), SPN (Smart Protection Network) Architect
About Me: SPN (Smart Protection Network) architect; SPN Hadoop infrastructure architect; speaker at Hadoop in Taiwan 2013; active member of Hadoop.TW. 9/10/2013 Confidential Copyright 2013 TrendMicro Inc.
The Journey to Big Data
Yesterday: ~40 Hadoop nodes; ~15 service/user accounts; 3 teams; <50 TB storage; <100 jobs per day
Today: ~200 Hadoop nodes; ~130 service/user accounts; 11 teams; ~500 TB storage; >16,000 jobs per day
One MapReduce job submitted every 5.4 seconds
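The 5.4-second figure follows from the "Today" job count spread evenly over a day. The sketch below reproduces that arithmetic and the growth between the "Yesterday" and "Today" slides. The dictionary keys and function names are illustrative, not from the talk, and because the "Yesterday" storage and job figures are upper bounds, the growth factors for those two are lower bounds.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Figures copied from the "Yesterday" and "Today" slides.
# "<50 TB" and "<100 jobs" are upper bounds, so their growth factors are lower bounds.
yesterday = {"nodes": 40, "accounts": 15, "teams": 3, "storage_tb": 50, "jobs_per_day": 100}
today = {"nodes": 200, "accounts": 130, "teams": 11, "storage_tb": 500, "jobs_per_day": 16_000}


def seconds_between_jobs(jobs_per_day: int) -> float:
    """Average gap between job submissions, assuming an even spread over the day."""
    return SECONDS_PER_DAY / jobs_per_day


def growth_factors(before: dict, after: dict) -> dict:
    """Ratio after/before for every metric present in `before`."""
    return {name: after[name] / before[name] for name in before}


if __name__ == "__main__":
    print(f"one job every {seconds_between_jobs(today['jobs_per_day']):.1f} s")
    for name, factor in growth_factors(yesterday, today).items():
        print(f"{name}: x{factor:.1f}")
```

Since the slide states more than 16,000 jobs per day, 5.4 seconds is the longest average gap consistent with that count.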
Why? Raw Data -> Actionable Intelligence
Collaboration in the underground
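Explosive growth of network threats. Virus variants of every kind, spam, unknown download sources, and more: these Internet-borne threats evade detection by traditional security systems and keep growing explosively, posing a serious information security threat. New unique malware discovered: 1M unique malware samples every month.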
Reality Check: skyrocketing volume, dangerous risks, avoiding detection.
Volume: new unique threats per hour (worldwide estimate*); chart values 2,400 and 12,600, axis 2007 to 2011. A new threat every 0.28 seconds.
Danger: threats found in enterprises (real-world data from 150+ assessments*), in the order listed on the slide: network worms 42%, data-stealing malware 56%, IRC bots 77%.
Complexity: targeting malware 100%.
52% of companies failed to report or remediate a cyber breach in 2011. (SAIC, 2011)
Two new pieces of malware are created every second. (Trend Micro, 2012)
A cyber intrusion occurs every 5 minutes. (US CERT, 2012)
The traditional approach is no longer sufficient!
Big Data Exploration
A new approach to cyber threat solutions. Data sources: CDN / xSP; researcher intelligence; honeypot; web crawler; Trend Micro Mail Protection; Trend Micro Web Protection; Trend Micro Endpoint Protection. 3+ billion worldwide sensors.
SPN: Smart Protection Network. Big data analytics (data mining, machine learning, modeling, correlation). Collects, identifies, protects. Daily stats: 7.2 TB of data correlated; 1B IP addresses; 90K malicious threats identified; 100+M good files.
SPN High Level Architecture (diagram labels): Service Delivery; SPN Feedback; Honey Pot; CDN/xSP; Log; API Server/Portal; Data Sourcing; Receiver; Service Platform; MySPN Platform; Hadoop Distributed File System (HDFS); Solr Cloud; Adhoc-Query (Pig); MapReduce; Oozie; HBase; APP 1; APP 2; Trend Message Exchange (Message Bus)
MySPN Ecosystem (diagram labels): Need Solution; Data Catalogue backed by SPN Infrastructure; TopCVE Service; APT KB Service; Census; FB Logs; Login; Customer; Develop Solution; Access; SSO Portal & API; Single Entry-Point; Dispatcher; All My Guard; Threat Connect; Dashboard; Service Catalog; Census Profile; MySPN Market Place; New App; Alert; VE DB; APT KB; Publish; Monitor; Service Platform SDK; New App; Trender; Operate; Implement; OPS; RD / Team
SPN Solution Architecture. Stages: Sourcing -> Processing & Analysis -> Validate & Create Solution -> Quality Assurance -> Solution Distribution -> Solution Adoption. Data types: File; Web / URL; Email; Domain; IP. Services: File Reputation Service; Web Reputation Service; Email Reputation Service. Other labels: Smart Protection; Customer; SPN Correlation; Community Intelligence (feedback loop).
Big Data Case Study
Case Study: Web Reputation Services. 200K+ new URLs created every day. Diagram steps surviving in the source: 1. Intercept URL; 4. Access page. Diagram nodes: Internet; Web Server; SPN Cloud.
8+ billion URLs processed daily.
Slide rows: Technology; Process; Operation; Trend Micro Products / Technology.
Technology: CDN cache; high-throughput web service; Hadoop cluster; web crawling; machine learning; data mining.
Process: user traffic / sourcing; CDN vendor; rating server for known threats; unknown & prefilter; page download; threat analysis.
Volumes: 8 billion/day; 4.8 billion/day; 860 million/day.
Filter rates: 40% filtered; 82% filtered; 99.98% filtered.
Result: 25,000 malicious URLs/day. A malicious URL is blocked within 15 minutes of going online!
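The volumes and filter rates on this slide are consistent with a simple funnel, which the sketch below reproduces. The pairing of each filter rate with a stage name is an assumption based on the order in which the labels appear on the slide. The stage names and the function name are illustrative only.

```python
# Starting volume and per-stage filter rates as listed on the slide.
# ASSUMPTION: each rate belongs to the stage named next to it here; the slide
# only gives the labels and numbers in this order.
DAILY_URLS = 8_000_000_000
STAGES = [
    ("CDN cache", 0.40),
    ("rating server for known threats", 0.82),
    ("unknown & prefilter", 0.9998),
]


def funnel(start: float, stages):
    """Return [(stage name, volume remaining after that stage), ...]."""
    remaining = start
    result = []
    for name, filtered_fraction in stages:
        remaining = remaining * (1.0 - filtered_fraction)
        result.append((name, remaining))
    return result


if __name__ == "__main__":
    for name, remaining in funnel(DAILY_URLS, STAGES):
        print(f"after {name}: {remaining:,.0f} URLs/day")
```

Under this reading, the first two stages leave 4.8 billion and about 864 million URLs per day, matching the slide's 4.8 billion and 860 million. The last stage leaves roughly 173,000 URLs per day for page download and threat analysis, from which the slide reports 25,000 malicious URLs per day.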
WRS Architecture Overview
Big Data Lessons Learned
How to Scale?
  Keep data unstructured first.
  If you really need structured data, use Google Protocol Buffers or JSON strings.
  Purify (clean) your data before processing.
  Leverage HBase more.
  Design row keys well to prevent hot-spotting.
  Use MapReduce to create Lucene indexes.
  Leverage SolrCloud for complex real-time use cases.
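The slide does not say which row key design was used. One common way to avoid HBase hot-spotting is to prefix ("salt") each key with a bucket derived from a hash, so that sequential writes spread across regions. The sketch below shows that idea in plain Python. The key layout, the bucket count, and the function name are all hypothetical, and no HBase client is involved.

```python
import hashlib

NUM_BUCKETS = 16               # hypothetical number of salt buckets (e.g. pre-split regions)
MAX_TS = 2**63 - 1             # largest signed 64-bit value, used to reverse timestamps


def salted_row_key(domain: str, timestamp_ms: int, num_buckets: int = NUM_BUCKETS) -> bytes:
    """Build a key of the form b"<bucket>|<domain>|<reversed timestamp>".

    - The bucket comes from a hash of the domain, so one domain always lands in
      the same bucket while different domains spread across buckets.
    - The reversed, zero-padded timestamp makes newer rows sort first within a domain.
    """
    digest = hashlib.sha256(domain.encode("utf-8")).digest()
    bucket = digest[0] % num_buckets
    reversed_ts = MAX_TS - timestamp_ms
    return f"{bucket:02d}|{domain}|{reversed_ts:019d}".encode("utf-8")


if __name__ == "__main__":
    print(salted_row_key("example.com", 1_378_771_200_000))
```

The trade-off of salting is that a scan over one logical key range has to fan out across all buckets, so the bucket count should stay small relative to the number of regions.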
Our Learning: Have a clear strategy first. Start small, scale quickly. Choose the right solution for the right problem.
Q&A
Thank You. Big Challenge, Big Opportunity.