Large-Scale Data Processing

Eiko Yoneki
Systems Research Group
University of Cambridge Computer Laboratory

2010s: Big Data
Why Big Data now?
- Increase of storage capacity
- Increase of processing capacity
- Availability of data
- Hardware and software technologies can manage an ocean of data
- up to exabytes zettabytes (500 x more)
- 2015: ~8 zettabytes (3 x more than 2012)

Examples of Big Data
- Facebook: 300 PB data warehouse (600 TB/day), 1 billion users
- Twitter Firehose: 500 million tweets/day
- CERN: 15 PB/year, stored in RDB
- Google: search queries/second
- eBay: 9 PB of user data, plus >50 TB/day
- Amazon Web Services: estimated ~450000 servers for AWS; S3 holds 450B objects, peak 290K requests/sec
- JPMorgan Chase: 150 PB on 50K+ servers with 15K apps running

Scale-Up vs. Scale-Out
- A popular solution for big data processing is to scale by building on distribution, combining a theoretically unlimited number of machines into a single distributed storage
- Scale up: add resources to a single node in a system
- Scale out: add more nodes to a system

Big Data: Technologies
- Distributed infrastructure: cloud (e.g. Infrastructure as a Service, Amazon EC2, Google App Engine, Elastic, Azure), cf. multi-core (parallel computing)
- Storage: distributed storage (e.g. Amazon S3, Hadoop Distributed File System (HDFS), Google File System (GFS))
- Data model/indexing: high-performance schema-free database (e.g. NoSQL DB: Redis, BigTable, HBase, Neo4j)
- Programming model: distributed processing (e.g. MapReduce)

MapReduce Programming
- The target problem needs to be parallelisable
- It is split into a set of smaller pieces of code (map)
- Each small piece of code is then executed in parallel
- Results from the map operations are synthesised into a result for the original problem (reduce)
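
The map and reduce steps above can be sketched with a word count, the usual introductory example. This is a minimal single-process illustration, not how Hadoop or any real framework is implemented; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for this sketch.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group intermediate values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each map_phase call is independent of the others, which is what would
    # let a real framework run them in parallel on different machines.
    intermediate = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())


counts = word_count(["big data big", "data processing"])
```

Here the two input strings stand in for two splits of a larger data set; the reduce step only ever sees values that share a key.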

Programming
- Non-standard programming models
- Data (flow) parallel programming, e.g. MapReduce, Dryad/LINQ, NAIAD, Spark
- MapReduce (Hadoop): two-stage fixed dataflow
- DAG (Directed Acyclic Graph) based (Dryad/Spark/Tez): more flexible dataflow model

Do we need new types of algorithms?
- Cannot always store all data
- Online/streaming algorithms
- Have we seen x before?
- Rolling average of previous K items
- Incremental updating
- Memory vs. disk becomes critical
- Algorithms with limited passes
- N^2 algorithms are infeasible at this scale; fast data processing is required
- Approximate algorithms, sampling
- Iterative operation (e.g. machine learning)
- Data has different relations to other data
- Algorithms for high-dimensional data (efficient multidimensional indexing)
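
Two of the streaming questions above, "rolling average of previous K items" and "have we seen x before?", can be sketched in a few lines. The slides do not name a specific technique for the membership question; a Bloom filter is used here as one common approximate answer, and the class names and parameters (num_bits, num_hashes) are assumptions made for this illustration.

```python
import hashlib
from collections import deque


class RollingAverage:
    """Average of the previous K items, updated incrementally in O(1)."""

    def __init__(self, k):
        self.window = deque(maxlen=k)  # assumes k >= 1
        self.total = 0.0

    def add(self, x):
        if len(self.window) == self.window.maxlen:
            # The oldest item is about to be evicted by append().
            self.total -= self.window[0]
        self.window.append(x)
        self.total += x
        return self.total / len(self.window)


class BloomFilter:
    """Approximate set membership in fixed memory.

    May report false positives, never false negatives.
    """

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive several hash positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


avg = RollingAverage(3)
averages = [avg.add(x) for x in [1, 2, 3, 4]]

seen = BloomFilter()
seen.add("x")
```

Both structures use memory that is independent of the stream length, which is the point when the data cannot all be stored.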

Big Data Analytics Stack

Big Data Analytics Stack (figure labels: Hadoop, Storm)

Big Data Analytics Stack (figure label: Spark)

How about Graph (Network) Data?
- Bipartite graph of phrases in documents
- Brain networks: 100B neurons (700T links), requires 100s of GB of memory
- Airline graphs
- Gene expression data
- Web: 1.4B pages (6.6B links)
- Protein interactions [genomebiology.com]
- Social media data

Data-Parallel vs. Graph-Parallel
- Data-Parallel for all? Graph-Parallel is hard!
- Data-Parallel (sort/search): randomly split data to feed MapReduce
- Not every graph algorithm is parallelisable (interdependent computation)
- Not much data access locality
- High data access to computation ratio

Graph-Parallel
- Graph-Parallel (graph-specific data parallel)
  - Vertex-based iterative computation model
  - Use of iterative Bulk Synchronous Parallel model
  - Pregel (Google), Giraph (Apache), GraphLab, GraphChi (CMU)
- Optimisation over data parallel
  - GraphX/Spark (U.C. Berkeley)
- Data-flow programming: more general framework
  - NAIAD (MSR)

Bulk Synchronous Parallel Model
- Computation is a sequence of iterations
- Each iteration is called a super-step
- Computation at each vertex runs in parallel
- Example: computing the maximum vertex value
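
The maximum-vertex-value example can be sketched as a sequential simulation of super-steps. This is an illustrative sketch in the spirit of vertex-centric systems, not the API of Pregel or Giraph; the function name bsp_max_value and the dictionary-based graph representation are assumptions made here. In each super-step a vertex reads the messages sent to it in the previous super-step, keeps the largest value it has seen, and forwards its value only if it changed.

```python
def bsp_max_value(values, edges):
    """Propagate the maximum vertex value through a graph in super-steps.

    values: dict mapping vertex -> initial value
    edges:  dict mapping vertex -> list of neighbouring vertices
    Returns the final values and the number of super-steps executed.
    """
    values = dict(values)

    # Super-step 0: every vertex sends its own value to its neighbours.
    inbox = {v: [] for v in values}
    for v in values:
        for n in edges.get(v, []):
            inbox[n].append(values[v])
    supersteps = 1

    # Continue until no messages are in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in values}
        # Each vertex only reads its own inbox, so this loop could run
        # in parallel across vertices.
        for v, messages in inbox.items():
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for n in edges.get(v, []):
                    outbox[n].append(values[v])
        # Barrier: messages sent now are delivered in the next super-step.
        inbox = outbox
        supersteps += 1

    return values, supersteps


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final, steps = bsp_max_value({"a": 3, "b": 6, "c": 2, "d": 1}, graph)
```

On this four-vertex chain every vertex ends up holding 6, the largest initial value. Vertices whose value did not change send nothing, so the computation stops once the maximum has reached every connected vertex.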

Big Data Analytics Stack (figure labels: Query Language, Machine learning, Streaming Processing, Execution Engine, Graph Processing, Percolator, Resource Manager, Storage, GFS, Database)

Single Computer?
- Use of powerful HW/SW parallelism
- SSDs as external memory
- GPU for massive parallelism
- Exploit graph structure/algorithm for processing
(Figure labels: Computation Platform; Multi-core CPU; Cluster; Single Computer; Parallelism Here; Input; Storage/Stream; GPU; HD/SSD (External Memory))

Conclusions
- Increasing capability of hardware and software will make big data accessible
- Challenges in data-intensive processing
- Data-Parallel (MapReduce) and Graph-Parallel
- Graph-Parallel is hard but has great potential for advanced data mining
- Data-driven computations and poor data locality mean we need to consider more than the computation model
- An inter-disciplinary approach is necessary: distributed systems, networking, databases, algorithms, statistics, machine learning

3Vs of Big Data
- Volume: terabytes or even petabytes scale
- Velocity: time-sensitive streaming, to maximise its value
- Variety: beyond structured data (e.g. text, audio, video)
SOURCE: IBM

MapReduce Programming Model
- The target problem needs to be parallelisable

External Memory
- Lots of work on computation, little attention to storage
- Store a LARGE amount of graph structure data (the majority of the data is edges)
- Efficiently move it to computation
- Potential solutions:
  - Cost-effective but efficient storage: move to SSDs (or HD) from RAM
  - Reduce latency: runtime prefetching, streaming (edge-centric approach)
  - Reduce storage requirements: compressed adjacency lists
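
As an illustration of the last point, one common way to compress an adjacency list is to sort the neighbour IDs, store the gaps between consecutive IDs, and encode each gap as a variable-length integer. The slides do not say which compression scheme is meant, so this is an assumed technique; the function names are hypothetical and vertex IDs are assumed to be non-negative integers.

```python
def encode_varint(n):
    """Encode a non-negative integer in 7-bit groups, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def compress_neighbours(neighbours):
    """Store sorted neighbour IDs as varint-encoded gaps."""
    out = bytearray()
    previous = 0
    for v in sorted(neighbours):
        out += encode_varint(v - previous)
        previous = v
    return bytes(out)


def decompress_neighbours(data):
    """Inverse of compress_neighbours: rebuild the sorted neighbour IDs."""
    neighbours = []
    current = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += gap
            neighbours.append(current)
            gap = 0
            shift = 0
    return neighbours


adjacency = [1000, 1003, 1010, 5000]
packed = compress_neighbours(adjacency)
restored = decompress_neighbours(packed)
```

Because neighbouring IDs are often close together, most gaps fit in one byte, so the packed form is smaller than storing each ID as a fixed-width integer. The decoder reads the bytes sequentially, which suits streaming from SSD or disk.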
