Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team
|
|
|
- Amos McDaniel
- 10 years ago
- Views:
Transcription
1 Software tools for Complex Networks Analysis Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team
2 MOTIVATION
3 Why do we need tools? Source : nature.com Visualization Properties extraction Complex queries Source : Boldi et al.
4 Graphs are everywhere RDF ( test1, writtenby, Sophie ) ( test1, publishedin, Journal ) ( test2, publishedin, Journal) SPARQL SELECT?s WHERE {?s writtenby?a.?a hasname Sophie.?s publishedin Journal. } Basically a sub-graph matching
5 Why are graphs different? Graphs can be large - Facebook : 720M users, 69B friends in Twitter : 537M accounts, 23.95B links in 2012 Low memory cost per vertex - 1 ID, 1 pointer/edge Low computation per vertex Graphs are not memory friendly - Random jumps to memory They are not hardware friendly!
6 Lots of frameworks Really lots of them - Matlab, NetworkX, GraphChi, Hadoop, Twister, Piccolo, Maiter, Pregel, Giraph, Hama, GraphLab, Pegasus, Snap, Neo4J, Gephi, Tulip, any DBMS, Why so many? - Not one size fits all - Different computational models - Different architecture
7 Possible taxonomy Generic vs Specialized - Hadoop vs GraphChi Shared vs Distributed Memory - GraphChi vs Pregel Synchronous vs Asynchronous - Giraph vs Maiter Single vs Multi threaded - NetworkX vs GraphiChi
8 NETWORKX 8
9 Overview A Python package for complex network analysis Simple API Very flexible - Can attach any data to vertices and edges - Supports visualization Graphs generators
10 Dependencies Supports Python 2.7 (preferred) or 3.0 If drawing support required - Numpy ( - Mathplotlib ( - Graphivz (
11 Creating an empty graph Adding nodes Examples >>> import networkx as nx >>> G=nx.Graph() >>> G.add_node(1) >>> G.add_nodes_from([2,3]) Adding edges >>> G.add_edge(2,3) >>> G.add_edges_from([(1,2),(1,3)])
12 Examples (2) Graph generators Stochastic graph generators Reading from files >>> K_5=nx.complete_graph(5) >>> K_3_5=nx.complete_bipartite_graph(3,5) >>> er=nx.erdos_renyi_graph(100,0.15) >>> ws=nx.watts_strogatz_graph(30,3,0.1) >>> ba=nx.barabasi_albert_graph(100,5) >>> red=nx.random_lobster(100,0.9,0.9) >>> mygraph=nx.read_gml("path.to.file")
13 Examples (3) Graph analysis >>> nx.connected_components(g) >>> nx.degree(g) >>> pr=nx.pagerank(g,alpha=0.9) Graph drawing >>> import matplotlib.pyplot as plt >>> nx.draw(g) >>> plt.show()
14 NetworkX - Conclusion Easy to use - Very good for prototyping/testing Centralized - Limited scalability Efficiency - Memory overhead
15 GRAPHCHI 15
16 Overview Single machine - Distributed systems are complicated! Disk-based system - Memory is cheap but limited Supports both static and dynamic graph Kyrola, Aapo and Blelloch, Guy and Guestrin, Carlos, GraphChi: Large-scale Graph Computation on Just a PC, Proceedings of OSDI 12
17 Computational Model Vertex centric - Vertices and Edges have associated values - Update a vertex values using edges values Typical update - Read values from edges - Compute new value - Update edges Asynchronous model - Always get the most recent value for edges - Schedule multiple updates
18 Storing graphs on disk Compressed Sparse Row (CSR) - Equivalent to adjacency sets - Store out-edges of vertex consecutively on Disk - Maintain index to adjacency sets for each vertex Very efficient for out-edges, not so for in-edges - Use Compressed Sparse Column (CSC) Changing edges values - On modification of out-edge : write to CSC - On reading of in-edge : read from CSR - Random read or random write L
19 Parallel Sliding Windows Minimize non sequential disk access 3 stages algorithm Storing graph on disk - Vertices V are split into P disjoints intervals - Store all edges that have destination in an interval in a Shard - Edges are stored by source order From Kyrola and al.
20 Parallel Sliding Windows (2) Loading subgraph of vertices in interval p - Load Shard(p) in memory Get in-edges immediately - Out-edges are stored in the P-1 other shards But ordered by sources, so easy to find Loading subgraph p+1 - Slide a window over all shards Each interval requires P sequential reads
21 Parallel updates Once interval loaded, update in parallel Data races - Only a problem if considering edge with both endpoints in interval - Enforce sequential update Write back result to disk - Current shard totally rewritten - Sliding window of other shards rewritten
22 Example
23 Example
24 Performance Mac Mini 2.5GHz, 8GB and 256GB SSD Shard creation
25 Performance (2)
26 GOOGLE PREGEL
27 Overview Directed graphs Distributed Framework Based on the Bulk Synchronous Parallel model Vertex Centric computation model Private framework with C++ API Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (SIGMOD '10)
28 Model of Computation (1) BSP : model for parallel programming - Takes into account communication/synchronization - Series of super-steps (iterations) Performs local computations Communicate with others Barrier From :
29 Model of Computation (2) Vertex Centric - Each vertex execute a function in parallel Can read messages sent at previous super-step Can send messages to be read at next super-step - Not necessarily following edges Can modify state of outgoing edges Run until all vertices agree to stop and no message in transit From Malewicz and al.
30 From Malewicz and al. Maximum Value Example
31 Implementation and Execution (1) User provides a graph, some input (vertex and edges values) and a program The program is executed on all nodes of a cluster - One node become the master, other are workers The graph is divided into partitions by the master - Vertex Id used to compute partition index (e.g. hash(id) mod N) Partitions are assigned to workers User input file is partitioned (no fancy hash) and sent to workers - If some input is not for the worker, it will pass it along
32 Implementation and Execution (2) The master request worker to perform superstep - At the end, each worker reports the number of active vertices for next superstep Aggregators can be used at end of super-step to reduce communications - Perform reduction on values before sending If no more active vertices, Master can halt computation What about failures? - Easy to checkpoint workers at end of superstep - If failure, rollback to previous checkpoint - If master fails too bad L
33 From Malewicz and al. PageRank in Pregel
34 From Malewicz and al. Performance
35 From Malewicz and al. Performance
36 MAPREDUCE
37 Map Reduce operations Input data are (key, value) pairs 2 operations available : map and reduce Map Takes a (key, value) and generates other (key, value) Reduce Takes a key and all associated values Generates (key, value) pairs A map-reduce algorithm requires a mapper and a reducer Re-popularized by Google - MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat, OSDI 04
38 Map Reduce example Compute the average grade of students For each course, the professor provides us with a text file Text file format : lines of student grade Algorithm (non map-reduce) For each student, collect all grades and perform the average Algorithm (map-reduce) Mapper Assume the input file is parsed as (student, grade) pairs So do nothing! Reducer Perform the average of all values for a given key
39 Map Reduce example Course 1 Bob 20 Brian 10 Paul 15 Course 2 Bob 15 Brian 20 Paul 10 Course 3 Bob 10 Brian 15 Paul 20 Map (Bob, 20) (Brian, 10) (Paul, 15) (Bob, 15) (Brian, 20) (Paul, 10) (Bob, 10) (Brian, 15) (Paul, 20) (Bob, [20, 15, 10]) (Brian, [10, 15, 20]) (Paul, [15, 20, 10]) (Bob, 15) (Brian 15) (Paul, 15) Reduce
40 Map Reduce example too easy J Ok, this was easy because We didn t care about technical details like reading inputs All keys are equals, no weighted average Now can we do something more complicated? Let s computed a weighted average Course 1 has weight 5 Course 2 has weight 2 Course 3 has weight 3 What is the problem now?
41 Map Reduce example Course 1 Bob 20 Brian 10 Paul 15 Course 2 Bob 15 Brian 20 Paul 10 Course 3 Bob 10 Brian 15 Paul 20 Map (Bob, 20) (Brian, 10) (Paul, 15) (Bob, 15) (Brian, 20) (Paul, 10) (Bob, 10) (Brian, 15) (Paul, 20) (Bob, [20, 15, 10]) (Brian, [10, 15, 20]) (Paul, [15, 20, 10]) (Bob, 15) (Brian 15) (Paul, 15) Reduce Should be able to discriminate between values
42 Map Reduce example - advanced How discriminate between values for a given key We can t unless the values look different New reducer Input : (Name, [course1_grade1, course2_grade2, course3_grade3]) Strip values from course indication and perform weighted average So, we need to change the input of the reducer which comes from the mapper New mapper Input : (Name, Grade) Output : (Name, coursename_grade) The mapper needs to be aware of the input file
43 Map Reduce example - 2 Course 1 Bob 20 Brian 10 Paul 15 Course 2 Bob 15 Brian 20 Paul 10 Course 3 Bob 10 Brian 15 Paul 20 Map (Bob, C1_20) (Brian, C1_10) (Paul, C1_15) (Bob, C2_15) (Brian, C2_20) (Paul, C2_10) (Bob, C3_10) (Brian, C3_15) (Paul, C3_20) (Bob, [C1_20, C2_15, C3_10]) (Brian, [C1_10, C2_15, C3_20]) (Paul, [C1_15, C2_20, C3_10]) (Bob, 16) (Brian, 14) (Paul, 14.5) Reduce
44 What is Hadoop? A set of software developed by Apache for distributed computing Many different projects MapReduce HDFS : Hadoop Distributed File System Hbase : Distributed Database. Written in Java - Bindings for your favorite languages available Can be deployed on any cluster easily
45 Hadoop Job An Hadoop job is composed of a map operation and (possibly) a reduce operation Map and reduce operations are implemented in a Mapper subclass and a Reducer subclass Hadoop will start many instances of Mapper and Reducer Decided at runtime but can be specified Each instance will work on a subset of the keys called a Splits
46 Hadoop workflow Source : Hadoop the definitive guide
47 Graphs and MapReduce How to write a graph algorithm in MapReduce? Graph representation? - Use adjacency matrix Line based representation - V 1 : 0, 0, 1 - V 2 : 1, 0, 1 - V 3 : 1, 1, 0 Size V 2 with tons of 0 V 1 V 2 V 3 V V V
48 Sparse matrix representation Only encode useful values, i.e. non 0 - V 1 : (V 3,1) - V 2 : (V 1,1), (V 3,1) - V 3 : (V 1,1), (V 2,1) And if equal weights - V 1 : V 3 - V 2 : V 1, V 3 - V 3 : V 1,V 2
49 Single Source Shortest Path Find the shortest path from one source node S to others Assume edges have weight 1 General idea is BFS - Distance(S) = 0 - For all nodes N reachable from S Distance(N) = 1 - For all nodes N reachable from other set of nodes M Distance(N) = 1+ min(distance(m)) - And start next iteration
50 MapReduce SSSP Data - Key : node N - Value : (d, adjacency list of N) d distance from S so far Map : - m adjacency list: emit (m, d + 1) Reduce : - Keep minimum distance for each node This basically advances the frontier by one hop - Need more iterations
51 MapReduce SSSP (2) How to maintain graph structure between iterations - Output adjacency list in mapper - Have special treatment in reducer Termination? - Eventually J - Stops when no new distance is found (any idea how?)
52 Seriously? MapReduce + Graphs is easy But everyone is MapReducing the world! - Because they are forced to - And because of Hadoop Hadoop gives - A scalable infrastructure (computation and storage) - Fault tolerance So let s use Hadoop as an underlying infrastructure
53 Giraph Built on top of Hadoop Vertex centric and BSP model J - Giraph jobs run as MapReduce Source :
54 Conclusion So many frameworks to choose from Criteria - What is the size of your graph? - What algorithms do you want to run? - How fast do you want your results? Distributed frameworks are no silver bullet - Steeper learning curve - Add new problems (data distribution, faults )
55 Resources Slides au/lectures/algorithms.pdf - GraphAlgorithms.pptx
LARGE-SCALE GRAPH PROCESSING IN THE BIG DATA WORLD. Dr. Buğra Gedik, Ph.D.
LARGE-SCALE GRAPH PROCESSING IN THE BIG DATA WORLD Dr. Buğra Gedik, Ph.D. MOTIVATION Graph data is everywhere Relationships between people, systems, and the nature Interactions between people, systems,
Evaluating partitioning of big graphs
Evaluating partitioning of big graphs Fredrik Hallberg, Joakim Candefors, Micke Soderqvist [email protected], [email protected], [email protected] Royal Institute of Technology, Stockholm, Sweden Abstract. Distributed
BSPCloud: A Hybrid Programming Library for Cloud Computing *
BSPCloud: A Hybrid Programming Library for Cloud Computing * Xiaodong Liu, Weiqin Tong and Yan Hou Department of Computer Engineering and Science Shanghai University, Shanghai, China [email protected],
Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing
/35 Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing Zuhair Khayyat 1 Karim Awara 1 Amani Alonazi 1 Hani Jamjoom 2 Dan Williams 2 Panos Kalnis 1 1 King Abdullah University of
Graph Processing and Social Networks
Graph Processing and Social Networks Presented by Shu Jiayu, Yang Ji Department of Computer Science and Engineering The Hong Kong University of Science and Technology 2015/4/20 1 Outline Background Graph
Optimization and analysis of large scale data sorting algorithm based on Hadoop
Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,
Analysis of Web Archives. Vinay Goel Senior Data Engineer
Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner
Overview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group) 06-08-2012
Overview on Graph Datastores and Graph Computing Systems -- Litao Deng (Cloud Computing Group) 06-08-2012 Graph - Everywhere 1: Friendship Graph 2: Food Graph 3: Internet Graph Most of the relationships
Large-Scale Data Processing
Large-Scale Data Processing Eiko Yoneki [email protected] http://www.cl.cam.ac.uk/~ey204 Systems Research Group University of Cambridge Computer Laboratory 2010s: Big Data Why Big Data now? Increase
Map-Reduce and Hadoop
Map-Reduce and Hadoop 1 Introduction to Map-Reduce 2 3 Map Reduce operations Input data are (key, value) pairs 2 operations available : map and reduce Map Takes a (key, value) and generates other (key,
Fast Iterative Graph Computation with Resource Aware Graph Parallel Abstraction
Human connectome. Gerhard et al., Frontiers in Neuroinformatics 5(3), 2011 2 NA = 6.022 1023 mol 1 Paul Burkhardt, Chris Waring An NSA Big Graph experiment Fast Iterative Graph Computation with Resource
Apache Hama Design Document v0.6
Apache Hama Design Document v0.6 Introduction Hama Architecture BSPMaster GroomServer Zookeeper BSP Task Execution Job Submission Job and Task Scheduling Task Execution Lifecycle Synchronization Fault
Big Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis. Pelle Jakovits
Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis Pelle Jakovits Outline Problem statement State of the art Approach Solutions and contributions Current work Conclusions
http://www.wordle.net/
Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely
Map-Based Graph Analysis on MapReduce
2013 IEEE International Conference on Big Data Map-Based Graph Analysis on MapReduce Upa Gupta, Leonidas Fegaras University of Texas at Arlington, CSE Arlington, TX 76019 {upa.gupta,fegaras}@uta.edu Abstract
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
Big Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany MapReduce II MapReduce II 1 / 33 Outline 1. Introduction
Big Graph Processing: Some Background
Big Graph Processing: Some Background Bo Wu Colorado School of Mines Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (Washington University) Mines CSCI-580, Bo Wu Graphs
A Performance Evaluation of Open Source Graph Databases. Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader
A Performance Evaluation of Open Source Graph Databases Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader Overview Motivation Options Evaluation Results Lessons Learned Moving Forward
This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.
Big Data Processing 2013-2014 Q2 April 7, 2014 (Resit) Lecturer: Claudia Hauff Time Limit: 180 Minutes Name: Answer the questions in the spaces provided on this exam. If you run out of room for an answer,
Outline. Motivation. Motivation. MapReduce & GraphLab: Programming Models for Large-Scale Parallel/Distributed Computing 2/28/2013
MapReduce & GraphLab: Programming Models for Large-Scale Parallel/Distributed Computing Iftekhar Naim Outline Motivation MapReduce Overview Design Issues & Abstractions Examples and Results Pros and Cons
Big Data and Scripting Systems beyond Hadoop
Big Data and Scripting Systems beyond Hadoop 1, 2, ZooKeeper distributed coordination service many problems are shared among distributed systems ZooKeeper provides an implementation that solves these avoid
Big Data and Scripting Systems build on top of Hadoop
Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform Pig is the name of the system Pig Latin is the provided programming language Pig Latin is
A Performance Analysis of Distributed Indexing using Terrier
A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search
Cloud Computing. Lectures 10 and 11 Map Reduce: System Perspective 2014-2015
Cloud Computing Lectures 10 and 11 Map Reduce: System Perspective 2014-2015 1 MapReduce in More Detail 2 Master (i) Execution is controlled by the master process: Input data are split into 64MB blocks.
Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing. October 29th, 2015
E6893 Big Data Analytics Lecture 8: Spark Streams and Graph Computing (I) Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing
Infrastructures for big data
Infrastructures for big data Rasmus Pagh 1 Today s lecture Three technologies for handling big data: MapReduce (Hadoop) BigTable (and descendants) Data stream algorithms Alternatives to (some uses of)
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
Data and Algorithms of the Web: MapReduce
Data and Algorithms of the Web: MapReduce Mauro Sozio May 13, 2014 Mauro Sozio (Telecom Paristech) Data and Algorithms of the Web: MapReduce May 13, 2014 1 / 39 Outline 1 MapReduce Introduction MapReduce
ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example
MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design
A Highly Efficient Runtime and Graph Library for Large Scale Graph Analytics
A Highly Efficient Runtime and Graph Library for Large Scale Graph Analytics Ilie Tanase 1, Yinglong Xia 1, Lifeng Nai 2, Yanbin Liu 1, Wei Tan 1 Jason Crawford 1, and Ching-Yung Lin 1 1 IBM Thomas J.
Hadoop SNS. renren.com. Saturday, December 3, 11
Hadoop SNS renren.com Saturday, December 3, 11 2.2 190 40 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December
Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics
Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics Sabeur Aridhi Aalto University, Finland Sabeur Aridhi Frameworks for Big Data Analytics 1 / 59 Introduction Contents 1 Introduction
Machine Learning over Big Data
Machine Learning over Big Presented by Fuhao Zou [email protected] Jue 16, 2014 Huazhong University of Science and Technology Contents 1 2 3 4 Role of Machine learning Challenge of Big Analysis Distributed
Analysis of MapReduce Algorithms
Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 [email protected] ABSTRACT MapReduce is a programming model
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
Apache Flink Next-gen data analysis. Kostas Tzoumas [email protected] @kostas_tzoumas
Apache Flink Next-gen data analysis Kostas Tzoumas [email protected] @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research
The Current State of Graph Databases
The Current State of Graph Databases Mike Buerli Department of Computer Science Cal Poly San Luis Obispo [email protected] December 2012 Abstract Graph Database Models is increasingly a topic of interest
Big Graph Analytics on Neo4j with Apache Spark. Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage
Big Graph Analytics on Neo4j with Apache Spark Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage My background I only make it to the Open Stages :) Probably because Apache Neo4j
Systems and Algorithms for Big Data Analytics
Systems and Algorithms for Big Data Analytics YAN, Da Email: [email protected] My Research Graph Data Distributed Graph Processing Spatial Data Spatial Query Processing Uncertain Data Querying & Mining
Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel
Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:
Efficient Analysis of Big Data Using Map Reduce Framework
Efficient Analysis of Big Data Using Map Reduce Framework Dr. Siddaraju 1, Sowmya C L 2, Rashmi K 3, Rahul M 4 1 Professor & Head of Department of Computer Science & Engineering, 2,3,4 Assistant Professor,
Jeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
Scalable Cloud Computing Solutions for Next Generation Sequencing Data
Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of
Subscriber classification within telecom networks utilizing big data technologies and machine learning
Subscriber classification within telecom networks utilizing big data technologies and machine learning Jonathan Magnusson Uppsala University Box 337 751 05 Uppsala, Sweden jonathanmagnusson@ hotmail.com
Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN
Hadoop MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Understanding Hadoop Understanding Hadoop What's Hadoop about? Apache Hadoop project (started 2008) downloadable open-source software library (current
Lecture Data Warehouse Systems
Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores
Hadoop Parallel Data Processing
MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for
Similarity Search in a Very Large Scale Using Hadoop and HBase
Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France
Introduction to NoSQL Databases. Tore Risch Information Technology Uppsala University 2013-03-05
Introduction to NoSQL Databases Tore Risch Information Technology Uppsala University 2013-03-05 UDBL Tore Risch Uppsala University, Sweden Evolution of DBMS technology Distributed databases SQL 1960 1970
Distributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
Information Processing, Big Data, and the Cloud
Information Processing, Big Data, and the Cloud James Horey Computational Sciences & Engineering Oak Ridge National Laboratory Fall Creek Falls 2010 Information Processing Systems Model Parameters Data-intensive
Spark and the Big Data Library
Spark and the Big Data Library Reza Zadeh Thanks to Matei Zaharia Problem Data growing faster than processing speeds Only solution is to parallelize on large clusters» Wide use in both enterprises and
Big Data Course Highlights
Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like
Review on the Cloud Computing Programming Model
, pp.11-16 http://dx.doi.org/10.14257/ijast.2014.70.02 Review on the Cloud Computing Programming Model Chao Shen and Weiqin Tong School of Computer Engineering and Science Shanghai University, Shanghai
An Experimental Comparison of Pregel-like Graph Processing Systems
An Experimental Comparison of Pregel-like Graph Processing Systems Minyang Han, Khuzaima Daudjee, Khaled Ammar, M. Tamer Özsu, Xingfang Wang, Tianqi Jin David R. Cheriton School of Computer Science, University
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraph Sebastian Schelter Invited talk at GameDuell Berlin 29th May 2012 the mandatory about me slide PhD student at the Database Systems and Information Management
MMap: Fast Billion-Scale Graph Computation on a PC via Memory Mapping
: Fast Billion-Scale Graph Computation on a PC via Memory Mapping Zhiyuan Lin, Minsuk Kahng, Kaeser Md. Sabrin, Duen Horng (Polo) Chau Georgia Tech Atlanta, Georgia {zlin48, kahng, kmsabrin, polo}@gatech.edu
MAPREDUCE Programming Model
CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce
Spark. Fast, Interactive, Language- Integrated Cluster Computing
Spark Fast, Interactive, Language- Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica UC
Integrating Big Data into the Computing Curricula
Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big
Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014
Beyond Watson: Predictive Analytics and Big Data U.S. National Security Agency Research Directorate - R6 Technical Report February 3, 2014 300 years before Watson there was Euler! The first (Jeopardy!)
Hadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park
Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable
Big Systems, Big Data
Big Systems, Big Data When considering Big Distributed Systems, it can be noted that a major concern is dealing with data, and in particular, Big Data Have general data issues (such as latency, availability,
Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY
Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person
Mining Large Datasets: Case of Mining Graph Data in the Cloud
Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large
Scalability! But at what COST?
Scalability! But at what COST? Frank McSherry Michael Isard Derek G. Murray Unaffiliated Microsoft Research Unaffiliated Abstract We offer a new metric for big data platforms, COST, or the Configuration
Large Scale Social Network Analysis
Large Scale Social Network Analysis DATA ANALYTICS 2013 TUTORIAL Rui Sarmento [email protected] João Gama [email protected] Outline PART I 1. Introduction & Motivation Overview & Contributions 2. Software
! E6893 Big Data Analytics Lecture 9:! Linked Big Data Graph Computing (I)
! E6893 Big Data Analytics Lecture 9:! Linked Big Data Graph Computing (I) Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and
Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! [email protected]
Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind
Introduction to Hadoop
Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process
Architectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
MapReduce and Hadoop Distributed File System
MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) [email protected] http://www.cse.buffalo.edu/faculty/bina Partially
Graph Mining on Big Data System. Presented by Hefu Chai, Rui Zhang, Jian Fang
Graph Mining on Big Data System Presented by Hefu Chai, Rui Zhang, Jian Fang Outline * Overview * Approaches & Environment * Results * Observations * Notes * Conclusion Overview * What we have done? *
Big Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Big Data Analytics 1 / 36 Outline
Fault Tolerance in Hadoop for Work Migration
1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous
Apache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
Fast Analytics on Big Data with H20
Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,
Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect
Matteo Migliavacca (mm53@kent) School of Computing Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Simple past - Traditional
Chapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010
Hadoop s Entry into the Traditional Analytical DBMS Market Daniel Abadi Yale University August 3 rd, 2010 Data, Data, Everywhere Data explosion Web 2.0 more user data More devices that sense data More
Cloud Computing at Google. Architecture
Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale
Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University 2014-05-12
Introduction to NoSQL Databases and MapReduce Tore Risch Information Technology Uppsala University 2014-05-12 What is a NoSQL Database? 1. A key/value store Basic index manager, no complete query language
Graphs, Networks and Python: The Power of Interconnection. Lachlan Blackhall - [email protected]
Graphs, Networks and Python: The Power of Interconnection Lachlan Blackhall - [email protected] A little about me Graphs Graph, G = (V, E) V = Vertices / Nodes E = Edges NetworkX Native graph
MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
A1 and FARM scalable graph database on top of a transactional memory layer
A1 and FARM scalable graph database on top of a transactional memory layer Miguel Castro, Aleksandar Dragojević, Dushyanth Narayanan, Ed Nightingale, Alex Shamis Richie Khanna, Matt Renzelmann Chiranjeeb
