Application and practice of parallel cloud computing in ISP. Guangzhou Institute of China Telecom Zhilan Huang 2011-10

Outline
- Mass data management problem
- Applications of parallel cloud computing in ISPs
- Practice of parallel cloud computing in China Telecom
- Conclusions

Mass data management problem

ISP mass operation and business data:
- User and billing data: By July 2011, China Telecom had more than 108 million CDMA users, 172 million fixed-line telephone subscribers and more than 70 million broadband users. Each user generates more than 150 KB of billing data per year.
- NetFlow data: The backbone network of China Telecom generates at least 10 TB of NetFlow data per day at a sampling rate of 1:5000.
- SOC operation data: The SOC generates massive amounts of warning information and device log data, terabytes in size.

Mass Internet data (web pages, pictures, videos, audio):
- Facebook is reportedly hosting 40 billion photos for over 600 million users.
- Wal-Mart handles more than 1 million customer transactions every hour, feeding databases estimated at more than 2.5 petabytes.

Key questions:
- How can massive data be stored economically and reliably?
- How can massive data be analyzed efficiently and effectively?
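To put the quoted figures on a common yearly scale, the arithmetic can be sketched as below. This is only a rough lower-bound estimate: it assumes the three subscriber counts are billed as separate accounts, that a year has 365 days, and that units are decimal (1 TB = 10^9 KB). None of these assumptions is stated on the slide.

```python
# Figures quoted on the slide (as of July 2011); all are lower bounds.
CDMA_USERS = 108_000_000
FIXED_LINE_USERS = 172_000_000
BROADBAND_USERS = 70_000_000
BILLING_KB_PER_USER_PER_YEAR = 150
NETFLOW_TB_PER_DAY = 10

KB_PER_TB = 10**9  # assumption: decimal units


def yearly_billing_tb():
    """Billing data per year in TB, assuming every subscription is billed separately."""
    users = CDMA_USERS + FIXED_LINE_USERS + BROADBAND_USERS
    return users * BILLING_KB_PER_USER_PER_YEAR / KB_PER_TB


def yearly_netflow_tb(days=365):
    """Sampled NetFlow data per year in TB."""
    return NETFLOW_TB_PER_DAY * days


billing_tb = yearly_billing_tb()
netflow_tb = yearly_netflow_tb()
print(f"billing data: at least {billing_tb:.1f} TB per year")
print(f"NetFlow data: at least {netflow_tb} TB per year")
```

Under these assumptions, the sampled NetFlow stream is larger than the billing data by well over an order of magnitude, which suggests why network operation data is a natural first target for a scale-out platform.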

Existing solutions

Ways to store massive data:
- Provide as much storage as needed: not economical.
- Discard historical data to reduce data size: loses information.
- Compress data before storing: makes later data processing less efficient.

Ways to analyze massive data:
- Statistical analysis, data mining and business intelligence.
- Traditional data analysis tools such as SPSS, Clementine and SAS can run only on one host or on a small cluster, which is not feasible for petabytes of data.

More powerful processing capability and more efficient analysis methods are needed for massive data.

Parallel cloud computing
- Introduced by Google, it has proved to be a powerful technology for massive data, with extreme scalability and simplicity.
- It is built on large clusters of commodity x86 PC servers.
- It has an open source implementation named Hadoop, which in a nutshell provides a reliable shared storage and analysis system.
- It uses MapReduce as its parallel programming model, which suits data-intensive applications.
- Other features: simple, flexible, fault-tolerant, scalable, economical.
- It is currently widely used by Internet companies such as Yahoo!, Last.fm, Facebook and the New York Times.
- It is also suitable for ISPs, which have many IT support systems that gather massive network operation data and business data.
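The MapReduce programming model mentioned above can be illustrated with a single-process sketch. This is not Hadoop's API: the function names and the flow-record layout (source address, destination address, byte count) are hypothetical, chosen only to show the map, shuffle and reduce phases on ISP-style data.

```python
from collections import defaultdict


def map_record(record):
    """Map phase: emit one (source address, byte count) pair per flow record."""
    src, _dst, byte_count = record
    yield src, byte_count


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_bytes(key, values):
    """Reduce phase: sum the byte counts collected for one source address."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence inside one process."""
    pairs = (pair for record in records for pair in map_record(record))
    return dict(reduce_bytes(key, values) for key, values in shuffle(pairs).items())


# Hypothetical flow records: (source, destination, bytes).
flows = [
    ("10.0.0.1", "10.0.1.9", 1200),
    ("10.0.0.2", "10.0.1.9", 300),
    ("10.0.0.1", "10.0.2.7", 800),
]
totals = run_job(flows)
print(totals)
```

On a real cluster the map and reduce functions run on many machines and the framework performs the shuffle across the network; because each map call sees only one record and each reduce call only one key, the work can be split freely among commodity servers.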

Mass data management problem in ISPs
- ISPs have many IT support systems, such as OSS/BSS, which have collected rich operation and business data over years of running.
- Typical IT support systems are built on Unix servers and SAN storage networks, which are very expensive and hard to extend.
- Data is well structured and managed by relational databases.
- Massive data is analyzed by traditional methods.
- System capacity needs to be carefully planned in advance.

Problems that ISPs are facing:
- Unix servers are too expensive to maintain and upgrade. In 2009, China Telecom spent more than 3 billion on IT support systems.
- Traditional data analysis methods cannot process mass data efficiently.

ISPs need more economical and more flexible architectures to solve the mass data management problem.

Applications of parallel cloud computing in ISPs
- Use the large-scale storage and processing capability of parallel cloud computing to analyze network operation data and user behavior, in order to optimize network and service quality and to develop customer consumption models.
- Run internal support systems (OSS/BSS) on parallel cloud computing.
- Deploy Hadoop on virtual hosts in Internet data centers (IDC) to provide parallel computing services for small enterprises.

Distributed Service Engine (DSE)

China Telecom has built a large-scale data processing platform named Distributed Service Engine (DSE). The platform is based on Hadoop, provides large-scale data processing capability to users, and is intended to support petabytes of data.

Functional capabilities of DSE:
- Distributed storage and computing
- Data integration
- Basic data analysis
- Basic information processing
- Information service
- Basic platform service
- Platform operation management system

DSE implementation

Hardware resources:
- 52 commodity PC servers with 396 CPU cores and 18 TB of storage capacity
- High-speed Internet connection with 512M bandwidth

Platform deployment:
- Xen at the hypervisor layer, with ECP as the management platform
- Hadoop, HBase and ZooKeeper at the distributed layer
- Mahout and Hive as distributed data mining and data warehouse tools
- Some basic text and image processing modules, such as an information retrieval module, have been developed and deployed.
- Some basic information services, such as a location service, are also provided.

Applications:
- Telecom systems: network traffic matrix analysis, PCA network traffic anomaly detection
- Internet applications: R2S2, RAFE and visual search

PCA network traffic anomaly detection on DSE

PCA (principal component analysis) can effectively detect anomalous network traffic, and parallel cloud computing technology can greatly improve the system's efficiency and scalability. This allows PCA-based approaches to handle telecom operators' large-scale network data.
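The slide does not describe how DSE applies PCA, so the following is only a minimal single-machine sketch of one common variant: model normal traffic with the leading principal component and score each time bin by the energy left in the residual. The traffic matrix, the injected spike and all function names are hypothetical, and a real detector would keep several components and use a statistically derived threshold.

```python
import math


def center_columns(rows):
    """Subtract each link's mean so PCA sees deviations from typical load."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of the centered traffic matrix."""
    n = len(centered)
    dims = range(len(centered[0]))
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in dims]
            for i in dims]


def top_component(cov, iterations=200):
    """Leading eigenvector of cov, found by power iteration (unit length)."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def residual_energy(centered, component):
    """Squared norm of each row after removing its projection on the component."""
    scores = []
    for row in centered:
        score = sum(x * c for x, c in zip(row, component))
        residual = [x - score * c for x, c in zip(row, component)]
        scores.append(sum(x * x for x in residual))
    return scores


# Hypothetical traffic matrix: 8 time bins (rows) by 3 links (columns).
# Normal traffic rises and falls together on all links in a 1:2:3 ratio.
loads = [10, 12, 14, 16, 18, 20, 22, 24]
traffic = [[load * 1.0, load * 2.0, load * 3.0] for load in loads]
traffic[5][0] += 15.0  # injected spike on link 0 in time bin 5

centered = center_columns(traffic)
scores = residual_energy(centered, top_component(covariance(centered)))
most_anomalous_bin = max(range(len(scores)), key=scores.__getitem__)
print(most_anomalous_bin, [round(s, 2) for s in scores])
```

The covariance sums are the part that parallelizes naturally: each mapper can accumulate partial sums over its share of the time bins and a reducer can add them up, while the small eigenvector computation stays on one node.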

Restaurant Review Search System (R2S2)
- R2S2 is a restaurant recommendation system based on analyzing and summarizing user comments from the Internet.
- R2S2 is provided to public users as an open web service.
- It crawled more than 1,300,000 user comment records for over 120,000 restaurants from eight well-known restaurant review websites.
- Using Hadoop MapReduce, the system processed 1,353,844 records in 10 hours on this cluster.

Restaurant Review Search System: labeled interface elements
1. Comment summary
2. Comment categories
3. Positive comments
4. Negative comments
5. Neutral comments
6. Popularity trend diagram
7. Emotion trend diagram

Image Processing System
- The system is deployed on a twenty-host cluster, with twelve physical machines and seven virtual machines.
- A distributed crawler developed from Nutch grabbed over 1 million pictures in about 24 hours, generating 50 GB of data. All image data is stored in HDFS.
- The SIFT algorithm is applied to detect and describe local features in images. The algorithm is implemented with MapReduce.
- It generates a 1000 × 128 high-dimensional data matrix for each picture and stores the matrix in HBase.
- The system is able to process 1 million pictures in 5 hours.
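The rates and sizes implied by these figures can be worked out as follows. This is a back-of-the-envelope sketch: it treats "over 1 million" as exactly 1 million, uses decimal units, and assumes 4 bytes per matrix value, which the slide does not state (the result scales linearly with that parameter).

```python
PICTURES = 1_000_000          # slide says "over 1 million"
CRAWL_HOURS = 24
CRAWLED_BYTES = 50 * 10**9    # 50 GB, assuming decimal units
SIFT_HOURS = 5
MATRIX_ROWS, MATRIX_COLS = 1000, 128


def per_second(count, hours):
    """Average items handled per second over the given number of hours."""
    return count / (hours * 3600)


def matrix_bytes(bytes_per_value=4):
    """Size of one feature matrix; 4 bytes per value is an assumption."""
    return MATRIX_ROWS * MATRIX_COLS * bytes_per_value


crawl_rate = per_second(PICTURES, CRAWL_HOURS)
sift_rate = per_second(PICTURES, SIFT_HOURS)
avg_picture_bytes = CRAWLED_BYTES / PICTURES
total_matrix_bytes = matrix_bytes() * PICTURES

print(f"crawl: {crawl_rate:.1f} pictures/s, {avg_picture_bytes / 1000:.0f} KB per picture")
print(f"SIFT:  {sift_rate:.1f} pictures/s")
print(f"feature matrices: {total_matrix_bytes / 10**9:.0f} GB in total")
```

Under the 4-byte assumption the stored feature matrices are roughly ten times larger than the crawled images themselves, so the size of the derived data, rather than the raw pictures, would drive the storage requirement.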

Conclusion
- Parallel cloud computing technology is effective for large-scale data processing, including telecom data processing and Internet data processing.
- The maintenance cost of the Hadoop platform is relatively high because of frequent hardware failures.
- The platform cannot reach carrier-grade level, which requires 99.999% availability.
- Effective tools are needed for tracking task status and failure causes.
- The platform does not support real-time applications.
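To make the 99.999% figure concrete, the downtime budget it implies can be computed directly. The sketch assumes a 365-day year and counts any unavailability against the budget; the 99.9% comparison value is added only for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year


def allowed_downtime_minutes(availability):
    """Yearly downtime budget, in minutes, for a given availability fraction."""
    return (1 - availability) * MINUTES_PER_YEAR


five_nines = allowed_downtime_minutes(0.99999)
three_nines = allowed_downtime_minutes(0.999)
print(f"99.999% availability: {five_nines:.2f} minutes of downtime per year")
print(f"99.9% availability:   {three_nines:.1f} minutes of downtime per year")
```

A budget of about five minutes per year leaves little room for manual recovery, which is consistent with the slide's points about frequent hardware failures and the need for better failure-tracking tools.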

Thank you
Zhilan Huang
Email: huangzl@gsta.com