Computing Issues for Big Data Theory, Systems, and Applications



Similar documents
Open source Google-style large scale data analysis with Hadoop

UPS battery remote monitoring system in cloud computing

BIG DATA CHALLENGES AND PERSPECTIVES

Advances in Natural and Applied Sciences

Telecom Data processing and analysis based on Hadoop

Large scale processing using Hadoop. Ján Vaňo

Applied research on data mining platform for weather forecast based on cloud storage

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

Massive Data Query Optimization on Large Clusters

Hadoop implementation of MapReduce computational model. Ján Vaňo

International Journal of Engineering Research ISSN: & Management Technology November-2015 Volume 2, Issue-6

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

Log Mining Based on Hadoop s Map and Reduce Technique

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

COMP9321 Web Application Engineering

Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme

Task Scheduling in Hadoop

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15


The Improved Job Scheduling Algorithm of Hadoop Platform

Indian Journal of Science The International Journal for Science ISSN EISSN Discovery Publication. All Rights Reserved

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Deep Learning Meets Heterogeneous Computing. Dr. Ren Wu Distinguished Scientist, IDL, Baidu

Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Big Data and Analytics: Challenges and Opportunities

Energy Efficient MapReduce

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

Design of Electric Energy Acquisition System on Hadoop

E6893 Big Data Analytics Lecture 2: Big Data Analytics Platforms

Industry 4.0 and Big Data

International Journal of Innovative Research in Computer and Communication Engineering

Fast Data in the Era of Big Data: Twitter s Real-

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

An efficient Join-Engine to the SQL query based on Hive with Hbase Zhao zhi-cheng & Jiang Yi

Open source large scale distributed data management with Google s MapReduce and Bigtable

ANALYSING THE FEATURES OF JAVA AND MAP/REDUCE ON HADOOP

NextGen Infrastructure for Big DATA Analytics.

Network Big Data: Facing and Tackling the Complexities Xiaolong Jin

HadoopRDF : A Scalable RDF Data Analysis System

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap

CloudRank-D:A Benchmark Suite for Private Cloud Systems

Big Data Explained. An introduction to Big Data Science.

Transforming the Telecoms Business using Big Data and Analytics

Challenges for Data Driven Systems

The 4 Pillars of Technosoft s Big Data Practice

BIG DATA What it is and how to use?

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

Research on IT Architecture of Heterogeneous Big Data

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Survey on Scheduling Algorithm in MapReduce Framework

IMPLEMENTATION OF P-PIC ALGORITHM IN MAP REDUCE TO HANDLE BIG DATA

Chapter 7. Using Hadoop Cluster and MapReduce

Chase Wu New Jersey Ins0tute of Technology

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Research of Postal Data mining system based on big data

Keywords Stock Exchange Market, Clustering, Hadoop, Map-Reduce

BPOE Research Highlights

The Big Data Paradigm Shift. Insight Through Automation

Fault Analysis in Software with the Data Interaction of Classes

Four Orders of Magnitude: Running Large Scale Accumulo Clusters. Aaron Cordova Accumulo Summit, June 2014

Big Data Management in the Clouds. Alexandru Costan IRISA / INSA Rennes (KerData team)

Extract Transform and Load Strategy for Unstructured Data into Data Warehouse Using Map Reduce Paradigm and Big Data Analytics

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Unified Big Data Processing with Apache Spark. Matei

Big Data Analytics Opportunities and Challenges

Data Refinery with Big Data Aspects

A Hadoop MapReduce Performance Prediction Method

Big Data Analytics Hadoop and Spark

BIG DATA ANALYSIS USING RHADOOP

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Efficient Data Replication Scheme based on Hadoop Distributed File System

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

A Big Data Analytical Framework For Portfolio Optimization Abstract. Keywords. 1. Introduction

Testing 3Vs (Volume, Variety and Velocity) of Big Data

NP-complete? NP-hard? Some Foundations of Complexity. Prof. Sven Hartmann Clausthal University of Technology Department of Informatics

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Danny Wang, Ph.D. Vice President of Business Strategy and Risk Management Republic Bank

AN EFFICIENT SELECTIVE DATA MINING ALGORITHM FOR BIG DATA ANALYTICS THROUGH HADOOP

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.

Brave New World: Hadoop vs. Spark

Big Data and Market Surveillance. April 28, 2014

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Metaheuristics in Big Data: An Approach to Railway Engineering

How Companies are! Using Spark

Computing at Scale: Resource Scheduling Architectural Evolution and Introduction to Fuxi System

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Hadoop: Distributed Data Processing. Amr Awadallah Founder/CTO, Cloudera, Inc. ACM Data Mining SIG Thursday, January 25 th, 2010

Extending Hadoop beyond MapReduce

A Multilevel Secure MapReduce Framework for Cross-Domain Information Sharing in the Cloud

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

Transcription:

Computing Issues for Big Data Theory, Systems, and Applications Beihang University Chunming Hu (hucm@buaa.edu.cn) Big Data Summit, with CyberC 2013 October 10, 2013. Beijing, China.

Bio of Myself Chunming Hu Got my Ph.D in 2006, Beihang University Associate Professor at Institute of Advanced Computing Technology (ACT), School of Computer Science, Beihang University Research Interests Service Computing(2001-2008) Grid Computing(2005-2009) Cloud Infrastructure and System Virtualization (2008-) Network System Virtualization Cloud-Client Computing Distributed Systems for Data Processing 2

Agenda Understanding the Big Data Background Computing Issues (4V 3I) Big Data Research at Beihang University RCBD, lead by Wenfei Fan 973 Project on Big Data, lead by Beihang University Early Experience on Big Data BD-Tractable by Preprocessing Performance Model for Hadoop Distributed Graph Pattern Matching Event Early Detection via Social Data Summary 3

Big Data in Cyberspace Large Scale with Rapid Updates Social Networks 4 Micro-blogger Provider in China: 800M Users, 200M tweets everyday, 20M+ Photos. Data doubled every 18 months Data in Cyberspace IDC Report : 2009: 0.8ZB 2012: 2.7 ZB 2020(E): 35ZB IDC Report Internet Search Baidu:1PB log data per Day. Handling 1000PB Google:Processing 20PB data everyday 1PB data in DVD: ~25km 1ZB=1PB 10 6 Airplane 15,000m Chomolung ma 8,800m 4

4V Features in Big Data Volume Velocity Variety Value In PB or EB Distributed data Dynamic Changes Updated constantly Heterogeneous Semi-structured or unstructured Biz opportunity Sensitive Data Data Deluge Wikipedia large and complex datasets, which is quite difficult to process using existing data management tools, and traditional data processing applications 5

Big Data Changes [Population] Statistical distribution Hypothesis Testing Sample Data Sampling in Statistics The Real World Knowledge New Statistical Theory & Math Tools

Big Data Changes [Population] Statistical distribution Hypothesis Testing Sample Data Sampling in Statistics The Real World Knowledge Modal based Prediction Mining Learning Problem -Specific Sample Data Pre-processing [Population ] Multiple Large Datasets Log, Device Capture Human Replay New Statistical Theory & Math Tools New Computing Theory & Algorithm Method? Large Scale Distributed Computing Infrastructure

Big Data Changes [Population] Statistical distribution Hypothesis Testing Sample Data Sampling in Statistics The Real World Knowledge Modal based Prediction? Mining Learning How to find Knowledge, rules Problem -Specific Sample Data Pre-processing? Multiple Large Datasets How to Re-sampling, Dimension Reduction? Log, Device Capture Human Replay Data Error? Deviation? New Statistical Theory & Math Tools New Computing Theory & Algorithm Method? Large Scale Distributed Computing Infrastructure

Focusing on the Computing 4-V Inexact 用 户 强 交 互 性 跨 多 通 道 快 Datasets are inexact: Noisy, Erros. Target are inexact. Eg. to find the macro trends. Avoid exact result to reduce cost Inexact but acceptable Results

Focusing on the Computing 4-V Inexact Incremental 用 户 强 交 互 性 跨 多 通 道 快 Data arrives continuesly Online/Realtime processing Hard to get an Static View of Data Batch/Full data is not enough

Focusing on the Computing 4-V Features of Big Data Computing Inexact Incremental Inductive 用 户 强 交 互 性 跨 多 通 道 快 Multi-source Datasets References between Datasets 973 Use the data correlations to adjust the errors Transfer Learning

Questions on Big Data Computing Question 1: Is there new Theory for Big Data Computing? Problem Algorithm Data Good: PTIME Bad: NP-Hard Ugly: PSPACE-hard, or EXPTIME-hard, undecidable Undecidable Problem In-feasible Problem approximate Algorithms (in PTIME) Decidable Problem BDuntractabl e Feasible Problem BD- Tractable 12

Questions on Big Data Computing Question 2: Is Hadoop a must to data processing? Different Computing requirements, User Scenarios Different Algorithm Design methods MapReduce (OSDI 2004) Handling All Data in a Distributed way Increm ental 3I MR is not the only solution Incremental Computing: Percolator by Google (OSDI 2010) New Algorithm Methods Resampling Query preserving data compression Partial evaluation and distributed processing Top-k query and terminating 13

Questions on Big Data Computing Question 3: How to handle the data computing algorithms in an operatable manner? Application specific Features Analysis Data Pattern, Arriving Rate, Query, General Purpose vs. Specific Purpose Domain-specific Knowledges Data Mining and ML Methods Distributed system Offline/Online Batch/Incremental/Streaming In-memory Computing 14

Agenda Understanding the Big Data Background Computing Issues (4V 3I) Big Data Research at Beihang University RCBD, lead by Wenfei Fan 973 Project on Big Data, lead by Beihang University Early Experience on Big Data BD-Tractable by Preprocessing Performance Model for Hadoop Distributed Graph Pattern Matching Event Early Detection via Social Data Summary 15

RCBD: International Research Centre on Big Data International Research Centre on Big Data (Founded in Sept 2012) Beihang U. U. Edinburgh HKUST U.Pennsylvania Baidu 973 16

973 Project on Big Data Computing Theory on Big Data (2014-2018) a 973 Project Funding from Ministry of Science & Technology, with 8 institutional organization Focusing on the Computing Issue of Big Data Processing 17

973 Project on Big Data WP5.Pilot Applications (Social Data, Internet Search Engine Data) WP4.Data Mining and Analyzing for Big Data WP3.Energy Efficient Distributed Data Processing WP1. Data Model and Understanding (Semantic/Visulization) WP2.Computing Complexity Theory and Algorithms Design 18

Agenda Understanding the Big Data Background Computing Issues (4V 3I) Big Data Research at Beihang University RCBD, lead by Wenfei Fan 973 Project on Big Data, lead by Beihang University Early Experience on Big Data BD-Tractable by Preprocessing Performance Model for Hadoop Distributed Graph Pattern Matching Event Early Detection via Social Data Summary 19

Some Early Experiences Theory BD-Tractable via Data Preprocessing Systems Performance Model for Hadoop Graph Pattern Matching Applications Using Mood to Detect Sudden Occurrence (Early Event Detection via Social Data) 20

BD-Tractable with Preprocessing Polynomial time queries become intractable on big data We want to be able to tell what queries are feasible on big data NP and beyond PTIME BD-tractable not BD-tractable BD-tractable queries: queries feasible on big data 21

BD-Tractable with Preprocessing How do we dealing with SQL querys on a large DATABASE? Scan through all the records? NO!! Using Index to get better query performance! B-Tree index, from O(n) to O(logn) Query Optimizations! Two steps of computing Set up the index : preprocessing Doing query on the index 22

BD-Tractable with Preprocessing A class Q of queries is BD-tractable if there exists a PTIME preprocessing function such that for any database D on which queries of Q are defined, D = (D) hence D is of polynomial size for all possible queries rewriting Q in Q defined on D, Q(D) can be computed by evaluating Q on D in parallel polylog time (NC) parallel log k ( D, Q ) Q 1 ( (D)) Q 2 ( (D)) D (D) Does it work? If a linear scan of D could be done in log( D ) time: 15 seconds when D is of 1 PB instead of 1.99 days 18 seconds when D is of 1 EB rather than 5.28 years BD-tractable queries are feasible on big data 23

Performance Model for Hadoop Accurate performance prediction can help optimize the MR jobs Cost-based scheduling strategies Query Optimization Multi-processing steps in MR Basic Idea: Benchmarking and Profiling on target machine Find out the parameters for the model 24

Performance Model for Hadoop User Program Hive Query Cost Predictor Progress Indicator Job Client MRPacker+ Cost Analyzor Parameter Optimizer Cost-based Scheduler JobTracker Other Scheduler Performance Sub System Performancer JHisCollector JSampler TaskTracker CDataCollector JDataCollector Task Task Task HDFS Data Node HDFS Master Computing Node Sub System Modules Hadoop Packages Modified New Data Flow

Distributed Graph Pattern Matching Graph patter matching Providing evaluation algorithms and optimizations for graph simulation in a distributed setting 26

Early Event Detection via Social Data Social Data reflect the physical world Event Detection via Social Data Event Population Trends Motion plays important role in social media How to detect the user motion through the weibo text?? 27

Early Event Detection via Social Data Classification 95 motion icons selected from 1000 icons Use the text with motion icons as the training sets 28

Early Event Detection via Social Data Abnormal event detection Mood Search http://xinqings.nlsde.buaa.edu.cn 29

icome: a Competition for Big Data http://www.icome.org.cn 30

Agenda Understanding the Big Data Background Computing Issues (4V 3I) Big Data Research at Beihang University RCBD, lead by Wenfei Fan 973 Project on Big Data, lead by Beihang University Early Experience on Big Data BD-Tractable by Preprocessing Performance Model for Hadoop Distributed Graph Pattern Matching Event Early Detection via Social Data Summary 31

Summary Big Data: from 4V to 3I Inexact Incremental Inductive Application-Driven Vertical Integration Theory, Algorithm, Distributed Systems, Mining & ML Methods Open Data Community Get more data from industry/government 32

Acknowledgement Part of the slides borrowed from Prof. Wenfei Fan at RCBD, Prof. Ke Xu at NLSDE, Beihang University Prof. Shuai Ma, Dr. Xuelian Lin at ACT, Beihang University References Wenfei Fan, FlorisGeerts, Frank Neven. Making Queries Tractable on Big Data with Preprocessing, VLDB 2013. Lin Xuelian, MengZide, XuChuan and Wang Meng.A Practical Performance Model for HadoopMapReduce. IEEE International Conference on Cluster Computing Workshops (ClusterW), 2012 Shuai Ma, Yang Cao, JinpengHuai, Tianyu Wo. Distributed Graph Pattern Matching, WWW 2012. Shuai Ma, Yang Cao, Wenfei Fan, JinpengHuai, Tianyu Wo. Capturing Topology in Graph Pattern Matching, VLDB 2012. Jichang Zhao, Li Dong, Junjie Wu, KeXu: MoodLens: an emoticon-based sentiment analysis system for chinese tweets. KDD 2012. 33

Thank You! School of Computer Science, Beihang University Chunming Hu (hucm@buaa.edu.cn) Big Data Summit, with CyberC 2013 October 10, 2013. Beijing, China.