Large-Scale Data Processing




Large-Scale Data Processing
Eiko Yoneki <eiko.yoneki@cl.cam.ac.uk>
http://www.cl.cam.ac.uk/~ey204
Systems Research Group, University of Cambridge Computer Laboratory

2010s: Big Data
Why Big Data now?
- Increase in storage capacity
- Increase in processing capacity
- Availability of data
Hardware and software technologies can now manage an ocean of data:
- Up to 2003: 5 exabytes
- 2012: 2.7 zettabytes (roughly 500x more)
- 2015: ~8 zettabytes (roughly 3x more than 2012)

Examples of Big Data
- Facebook: 300 PB data warehouse (600 TB/day), 1 billion users
- Twitter Firehose: 500 million tweets/day
- CERN: 15 PB/year, stored in relational databases
- Google: 40,000 search queries/second
- eBay: 9 PB of user data, growing by >50 TB/day
- Amazon Web Services: an estimated ~450,000 servers for AWS; S3 holds 450 billion objects, peaking at 290K requests/sec
- JPMorgan Chase: 150 PB on 50K+ servers, with 15K applications running

Scale-Up vs. Scale-Out
A popular approach to big data processing is to scale out: build on distribution and combine a theoretically unlimited number of machines into a single distributed store.
- Scale up: add resources to a single node in a system
- Scale out: add more nodes to a system

Big Data: Technologies
- Distributed infrastructure: cloud (e.g. Infrastructure as a Service: Amazon EC2, Google App Engine, Elastic, Azure); cf. multi-core (parallel computing)
- Storage: distributed storage (e.g. Amazon S3, Hadoop Distributed File System (HDFS), Google File System (GFS))
- Data model/indexing: high-performance schema-free databases (e.g. NoSQL DBs: Redis, BigTable, HBase, Neo4j)
- Programming model: distributed processing (e.g. MapReduce)

MapReduce Programming
- The target problem needs to be parallelisable
- It is split into a set of smaller computations (map)
- Each small piece of code is executed in parallel
- Results from the map operations are then synthesised into a result for the original problem (reduce)
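The map/shuffle/reduce split above can be sketched in plain Python. This is a minimal single-machine sketch, not the Hadoop API; the word-count task and the sample documents are invented for illustration.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical input documents; any iterable of strings would do.
docs = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: each document is processed independently, so in a real system
# this step can run on many machines in parallel.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle: group intermediate (key, value) pairs by key. In a real
# framework this grouping and network transfer happen automatically.
groups = defaultdict(list)
for key, value in chain.from_iterable(map_phase(d) for d in docs):
    groups[key].append(value)

# Reduce: combine all values for each key into the final result.
counts = {word: sum(values) for word, values in groups.items()}

print(counts["the"])  # 3
```

Because each map call touches only its own document and each reduce touches only one key's values, both phases parallelise without coordination; only the shuffle needs data movement.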

Programming
Non-standard programming models: data(flow)-parallel programming, e.g. MapReduce, Dryad/LINQ, Naiad, Spark
- MapReduce (Hadoop): two-stage fixed dataflow
- DAG (Directed Acyclic Graph) based (Dryad/Spark/Tez): more flexible dataflow model

Do we need new types of algorithms?
- Cannot always store all data: online/streaming algorithms ("Have we seen x before?", rolling average of the previous K items, incremental updating)
- Memory vs. disk becomes critical: algorithms with a limited number of passes (O(N^2) is impossible)
- Fast data processing: approximate algorithms, sampling; iterative operation (e.g. machine learning)
- Data has different relations to other data: algorithms for high-dimensional data (efficient multi-dimensional indexing)

Big Data Analytics Stack (diagram)

Hadoop Big Data Analytics Stack, including Storm (diagram)

Spark Big Data Analytics Stack (diagram)

How about Graph (Network) Data?
Examples (diagram): bipartite graphs of phrases in documents; brain networks (100 billion neurons, 700 trillion links - requires 100s of GB of memory); airline graphs; gene expression data; the Web (1.4 billion pages, 6.6 billion links); protein interactions [genomebiology.com]; social media data

Data-Parallel vs. Graph-Parallel
Is data-parallel enough for everything? Graph-parallel computation is hard:
- Data-parallel works when data can be randomly split to feed MapReduce (e.g. sort/search)
- Not every graph algorithm is parallelisable (interdependent computation)
- Graphs offer little data access locality
- High ratio of data access to computation

Graph-Parallel
Graph-parallel (graph-specific data-parallel):
- Vertex-based iterative computation model
- Uses the iterative Bulk Synchronous Parallel model
- Pregel (Google), Giraph (Apache), GraphLab, GraphChi (CMU)
Optimisation over data-parallel:
- GraphX/Spark (UC Berkeley)
Data-flow programming, a more general framework:
- Naiad (MSR)

Bulk Synchronous Parallel Model
- Computation is a sequence of iterations
- Each iteration is called a super-step
- Computation at each vertex proceeds in parallel
Example: computing the maximum vertex value
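The maximum-vertex-value example above can be sketched in a few lines of vertex-centric Python. This is a minimal single-machine sketch of the BSP idea, not the Pregel API; the tiny graph and its initial values are invented for illustration.

```python
# Hypothetical tiny graph as an adjacency list; each vertex starts
# with its own value and propagates the largest value it has seen.
edges = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
value = {0: 3, 1: 6, 2: 2, 3: 1}

changed = True
supersteps = 0
while changed:                       # each loop iteration = one super-step
    changed = False
    # Barrier semantics: messages sent in super-step t are only read in
    # super-step t+1, so updates are computed against a snapshot.
    snapshot = dict(value)
    for v, neighbours in edges.items():
        incoming = max(snapshot[u] for u in neighbours)
        if incoming > value[v]:
            value[v] = incoming      # vertex updates and stays active
            changed = True
    supersteps += 1

print(value)  # every vertex converges to the global maximum, 6
```

The computation halts when a super-step completes with no vertex changing, i.e. when all vertices "vote to halt" in Pregel terms.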

Big Data Analytics Stack (diagram), with layers for: query language, machine learning, streaming processing, execution engine, graph processing, Percolator, resource manager, storage (GFS), database

Single Computer?
- Use of powerful hardware/software parallelism
- SSDs as external memory
- GPUs for massive parallelism
- Exploit graph structure/algorithms for processing
Computation platform (diagram): multi-core CPU, cluster, or single computer; input from storage/stream; GPU for parallelism; HD/SSD as external memory

Conclusions
- The increasing capability of hardware and software will make big data accessible
- Challenges in data-intensive processing: data-parallel (MapReduce) and graph-parallel
- Graph-parallel is hard, but holds great potential for advanced data mining: data-driven computations and poor data locality mean we need to look beyond the computation model
- An inter-disciplinary approach is necessary: distributed systems, networking, databases, algorithms, statistics, machine learning

3Vs of Big Data (source: IBM)
- Volume: terabytes to petabytes scale
- Velocity: time-sensitive streaming; data must be processed quickly to maximise its value
- Variety: beyond structured data (e.g. text, audio, video, etc.)

MapReduce Programming Model (diagram)
The target problem needs to be parallelisable.

External Memory
There has been a lot of work on computation, but little attention to storage. The problems:
- Store a LARGE amount of graph-structured data (the majority of the data is edges)
- Move it to computation efficiently
Potential solutions - cost-effective but efficient storage:
- Move from RAM to SSDs (or hard disks)
- Reduce latency: runtime prefetching; streaming (edge-centric approach)
- Reduce storage requirements: compressed adjacency lists
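One concrete form of the "compressed adjacency list" mentioned above is the compressed sparse row (CSR) layout: two flat arrays replace per-vertex lists, so a vertex's neighbours are a contiguous slice that streams well from SSD or disk. A minimal in-memory sketch (the edge list is invented for illustration):

```python
# Hypothetical directed edge list over 4 vertices.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (2, 3)]
num_vertices = 4

# Count out-degrees, then prefix-sum them: offsets[v]..offsets[v+1]
# is the slice of `targets` holding vertex v's neighbours.
degree = [0] * num_vertices
for src, _ in edges:
    degree[src] += 1
offsets = [0]
for d in degree:
    offsets.append(offsets[-1] + d)

# Scatter each edge's target into its vertex's slice.
targets = [0] * len(edges)
cursor = list(offsets[:-1])   # next write position per vertex
for src, dst in edges:
    targets[cursor[src]] = dst
    cursor[src] += 1

def neighbours(v):
    # One contiguous read per vertex: friendly to prefetching and
    # sequential external-memory access.
    return targets[offsets[v]:offsets[v + 1]]

print(neighbours(2))  # [0, 3]
```

Compared with one list object per vertex, CSR stores only |V|+1 offsets plus |E| targets, and the contiguous layout is what makes edge-centric streaming and prefetching effective.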