Big Data in the Cloud: Enabling the Fourth Paradigm by Matching SMEs with Datacenters
1 Big Data in the Cloud: Enabling the Fourth Paradigm by Matching SMEs with Datacenters. (Map: we are here, 60km/35mi; founded 1842, pop: 13,000.) Alexandru Iosup, Delft University of Technology, The Netherlands. Team: Undergrad Tim Hegeman; Grads Yong Guo, Mihai Capota, Bogdan Ghit; Researchers Marcin Biczak, Otto Visser; Staff Henk Sips, Dick Epema. Collaborators*: Ana Lucia Varbanescu (UvA, Amsterdam), Claudio Martella (VU, Giraph), KIT, Intel Research Labs, IBM TJ Watson, SAP, Google Inc. MV, Salesforce SF. * Not their fault for any mistakes in this presentation. Or so they wish. May 14, 2014, 2nd ISO/IEC JTC 1 Study Group on Big Data, Amsterdam
2 Data at the Core of Our Society: The LinkedIn Example. A very good resource for matchmaking between the workforce and prospective employers. Vital for your company's life, as your Head of HR would tell you. Vital for the prospective employees. (Membership growth chart; Mar 2011: 69M members.) Sources: Vincenzo Cosenza, The State of LinkedIn, via Christopher Penn.
3 Data at the Core of Our Society: The LinkedIn Example. 3-4 new users every second. Great, if you can process this graph: opinion mining, hub detection, etc. (Membership growth chart; Mar 2011: 69M members.) Sources: Vincenzo Cosenza, The State of LinkedIn, via Christopher Penn.
4 Data at the Core of Our Society: The LinkedIn Example. 3-4 new users every second. Great, if you can process this graph: opinion mining, hub detection, etc., but fewer visitors (and page views): 139/277 million, a question of customer retention, hence time-based analytics. (Membership growth chart; Mar 2011: 69M members.) Sources: Vincenzo Cosenza, The State of LinkedIn, via Christopher Penn.
5 LinkedIn Is Part of the Data Deluge. Data Deluge = data generated by humans and devices (IoT): interacting, understanding, deciding, creating. Sources: IDC, EMC.
6 The Data Deluge Is a Challenge for Tech, But Good for Us[ers]. All human knowledge: until 2005, 150 Exabytes; 2010, 1,200 Exabytes. Online gaming (consumer): 2002, 20 TB/year/game; 2008, 1.4 PB/year/game (only stats). Public archives (science): 2006, GBs/archive; 2011, TBs/year/archive. (Chart: dataset size per year, 1 GB to 1 TB/yr scale, GTA and P2PTA archives.) Sources: Vincenzo Cosenza, The State of LinkedIn, via Christopher Penn.
7 The Challenge: The Three V's of Big Data. When you can, keep and process everything. Volume: more data vs. better models; exponential growth + iterative models; scalable storage and distributed queries. Velocity: speed of the feedback loop; gain competitive advantage: fast recommendations; analysis in near-real time to extract value. Variety (* new queries later): the data can become messy (text, video, audio, etc.); difficult to integrate into applications. Too big, too fast, does not comply with traditional DBs. Adapted from: Doug Laney, 3D Data Management: Controlling Data Volume, Velocity and Variety, META Group/Gartner report, Feb 2001.
8 The Opportunity, via a Detour (An Anecdotal Example): The Overwhelming Growth of Knowledge. "When 12 men founded the Royal Society in 1660, it was possible for an educated person to encompass all of scientific knowledge. [...] In the last 50 years, such has been the pace of scientific advance that even the best scientists cannot keep up with discoveries at frontiers outside their own field." (Tony Blair, PM Speech, May.) Professionals [...] already know they don't know [it all]. (Chart: number of publications over time.) Data: King, The Scientific Impact of Nations, Nature '04.
9 The Opportunity, via a Detour: From Hypothesis to Data. A thousand years ago, science was empirical, describing natural phenomena. In the last few hundred years, a theoretical branch: using models and generalizations, e.g., $\left(\frac{\dot a}{a}\right)^2 = \frac{4\pi G\rho}{3} - \frac{Kc^2}{a^2}$. In the last few decades, a computational branch: simulating complex phenomena. Today (the Fourth Paradigm): data exploration, unifying theory, experiment, and simulation. Data captured by instruments or generated by simulators is processed by software; information/knowledge is stored in computers; the scientist analyzes results using data management and statistics. The Fourth Paradigm is suitable for professionals who already know they don't know [enough to formulate good hypotheses], yet need to deliver quickly. Source: Jim Gray and The Fourth Paradigm.
10 The Vision: Everyone Is a Scientist! (the Fourth Paradigm). Data as an individual right, enabling a private lifestyle and modern societal services. Data as the workhorse in creating services for SMEs (~60% of gross value added, for many years). EC reasons to address Big Data challenges: >500 million people, >85 million employees, >3 trillion euros/year gross value added. Sources: European Commission Annual Reports 2012 & 2013, ECORYS, Eurostat, National Statistical Offices, DIW, DIW econ, London Economics.
11 Can We Afford This Vision, with the Current Technology and Resources? (An Anecdote.) Time magazine reported how many kWh it takes to stream 1 minute of video from the YouTube data centre. Based on Jay Walker's recent TED talk, 0.01 kWh of energy is consumed on average in downloading 1 MB over the Internet, and the average Internet device consumes around 0.001 kWh per minute of video streaming. For 1.6B downloads of this 17 MB file, each streamed for 4 minutes, the overall energy for this one pop video in one year is 312 GWh: more than some countries use in a year, 36 MW of 24/7/365 diesel, 100M liters of oil, thousands of cars running for a year, ... Source: Ian Bitterlin and Jon Summers, UoL, UK.
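The anecdote can be sanity-checked with a back-of-envelope calculation using only the figures quoted above (the quoted 312 GWh total presumably also includes data-centre overheads not reproduced here; variable names are ours, not from the original source):

```python
# Rough sanity check of the energy anecdote, using only figures from the slide.
downloads = 1.6e9           # downloads of the ~17 MB video file per year
file_mb = 17                # MB per download
kwh_per_mb = 0.01           # average Internet transfer energy per MB (Walker)
device_kwh_per_min = 0.001  # device energy per minute of streaming
minutes = 4                 # length of the video

transfer_gwh = downloads * file_mb * kwh_per_mb / 1e6      # kWh -> GWh
device_gwh = downloads * minutes * device_kwh_per_min / 1e6
print(transfer_gwh, device_gwh, transfer_gwh + device_gwh)  # ~272, ~6.4, ~278
```

At roughly 278 GWh from transfer and device energy alone, this lands in the same ballpark as the 312 GWh quoted on the slide.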
12 Can We Afford This Vision, with the Current Technology and Resources? Not with the current technology (in this presentation). Not with the current resources (energy, human, ...). (Charts: global power consumption; breakdown of EC power consumption.) Sources: DatacenterDynamics and Jon Summers, UoL, UK.
13 Our Big Data Team, PDS Group at TU Delft. Alexandru Iosup (TU Delft): Big Data & clouds, resource management, systems, benchmarking. Dick Epema (TU Delft): Big Data & clouds, resource management, systems. Bogdan Ghit (TU Delft): systems, workloads. Ana Lucia Varbanescu (U. Amsterdam): graph processing, benchmarking. Claudio Martella (VU Amsterdam): graph processing. Mihai Capota (TU Delft): Big Data apps, benchmarking. Yong Guo (TU Delft): graph processing, benchmarking. Marcin Biczak (TU Delft): Big Data & clouds, performance & development.
14 Agenda 1. Big Data, Our Vision, Our Team 2. Big Data on Clouds 1. The Big Data ecosystem 2. Understanding workloads 3. Benchmarking 4. How can clouds help? Elastic systems 3. Summary Ecosystem Modeling Benchmarking Elastic Systems
15 The Current Technology: Big Data = Systems of Systems. High-Level Language: Flume BigQuery, SQL, Meteor, JAQL, Hive, Pig, Sawzall, Scope, DryadLINQ, AQL. Programming Model: PACT, MapReduce Model, Pregel, Dataflow, Algebrix. Execution Engine: Flume Engine, Dremel Service Tree, Tera Data Engine, Azure Engine, Nephele, Haloop, Hadoop/YARN, Giraph, MPI/Erlang, Dryad, Hyracks. Storage Engine: S3, GFS, Tera Data Store, Azure Data Store, HDFS, Voldemort, LFS, CosmosFS, Asterix B-tree. * Plus Zookeeper, CDN, etc. Adapted from: Dagstuhl Seminar on Information Management in the Cloud.
16 The Problem: Monolithic Systems. Monolithic: an integrated stack (we can still learn from decades of software engineering); a fixed set of homogeneous resources (we forgot 2 decades of distributed systems); execution engines that do not coexist (we are now running MPI inside Hadoop maps, Hadoop jobs inside MPI processes, etc.); little performance information exposed (we forgot 4 decades of parallel systems). Stuck in stacks! (Example stack: Hive / MapReduce Model / Hadoop-YARN / HDFS, i.e., High-Level Language / Programming Model / Execution Engine / Storage Engine.) A. L. Varbanescu and A. Iosup, On Many-Task Big Data Processing: from GPUs to Clouds. Proc. of SC'12 (MTAGS).
17 Instead: Many-Task Big-Data Processing on Heterogeneous Resources, from GPUs to Clouds. 1. Take Big-Data processing applications. 2. Split them into many tasks. 3. Parallelize each task to match the resources. 4. Execute each task on the most efficient resource. 5. Exploit the massive parallelism, available now and increasing, in the combination of multi-core CPUs & GPUs. 6. Use the set of resources provided by local clusters. 7. And exploit the efficient elasticity of IaaS clouds. A. L. Varbanescu and A. Iosup, On Many-Task Big Data Processing: from GPUs to Clouds. Proc. of SC'12 (MTAGS).
18 A Generic Architecture for Many-Task Big Data Processing. Execute Big Data apps as many tasks using mixed resources, for: 1. High performance. 2. Elasticity. 3. Predictability. 4. Compatibility.
19 10 Main Challenges in 4 Categories (* list not exhaustive). High Performance: 1. Parallel architectures and algorithms, supported from the start. 2. Heterogeneous platforms: application and data decomposition. 3. Programmability by portability (OpenCL/ACC/...). Elasticity: 1. Performance- and cost-awareness under elasticity: elastic data. 2. Portfolio scheduling. 3. Social awareness. Predictability: 1. Modeling. 2. Benchmarking. Compatibility: 1. Interfacing with the application. 2. Storage management. Varbanescu and Iosup, On Many-Task Big Data Processing: from GPUs to Clouds, MTAGS, Proc. of SC13 (invited paper). Ad: read our article!
20 Agenda 1. Big Data, Our Vision, Our Team 2. Big Data on Clouds 1. The Big Data ecosystem 2. Understanding workloads 3. Benchmarking 4. How can clouds help? Elastic systems 3. Summary Ecosystem Modeling Benchmarking Elastic Systems
21 Statistical MapReduce Models From Long-Term Usage Traces. Real traces: Yahoo, Google, 2x Social Network Provider. de Ruiter and Iosup. A Workload Model for MapReduce. MSc thesis at TU Delft, Jun. Available online via TU Delft Library.
22 The BTWorld Use Case (When Long-Term Traces Do Not Exist): Collected Data. BitTorrent: swarms of people sharing files; 100M users; at some point, 35% of total Internet traffic. Data-driven project: data first, ask questions later. Over 14 TB of data, 1 file/tracker/sample; timestamped, multi-record files. Hash: unique id for a file. Tracker: unique id for a tracker. Information per file: seeders, leechers. Wojciechowski, Capota, Pouwelse, and Iosup. BTWorld: Towards Observing the Global BitTorrent File-Sharing Network. HPDC 2010.
23 The BTWorld Use Case (When Long-Term Traces Do Not Exist): Analyst Questions. How does the number of peers evolve over time? How long are files available? Did the legal bans and tracker take-downs impact BT? How does the location of trackers evolve over time? Etc. These questions need to be translated into queries. Hegeman, Ghit, Capotã, Hidders, Epema, Iosup. The BTWorld Use Case for Big Data Analytics: Description, MapReduce Logical Workflow, and Empirical Evaluation. IEEE BigData'13.
24 MapReduce-based Workflow for the BTWorld Use Case: Query Diversity. Queries use different operators and stress different parts of the system; the workflow is not modeled well by single-application benchmarks. Active Hashes (AH): SELECT timestamp, COUNT(DISTINCT(hash)) FROM logs GROUP BY timestamp; Global Top-K Trackers (TKT-G): SELECT * FROM logs NATURAL JOIN (SELECT tracker FROM TKTL GROUP BY tracker ORDER BY MAX(sessions) DESC LIMIT k); Hegeman, Ghit, Capotã, Hidders, Epema, Iosup. The BTWorld Use Case for Big Data Analytics: Description, MapReduce Logical Workflow, and Empirical Evaluation. IEEE BigData'13.
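To illustrate how such a query maps onto MapReduce, here is a minimal Python sketch of the Active Hashes query, with toy records standing in for the BTWorld logs (the record layout and all names here are our assumption for illustration, not the paper's actual schema):

```python
from collections import defaultdict

# Hypothetical toy records: (timestamp, tracker, hash, seeders, leechers)
logs = [
    (1, "t1", "h1", 10, 2),
    (1, "t1", "h2", 5, 1),
    (1, "t2", "h1", 3, 0),
    (2, "t1", "h1", 11, 2),
]

def map_phase(records):
    # Emit (timestamp, hash) pairs: SELECT timestamp, hash FROM logs
    for ts, tracker, h, *_ in records:
        yield ts, h

def reduce_phase(pairs):
    # GROUP BY timestamp, then COUNT(DISTINCT hash) per group
    groups = defaultdict(set)
    for ts, h in pairs:
        groups[ts].add(h)
    return {ts: len(hashes) for ts, hashes in sorted(groups.items())}

print(reduce_phase(map_phase(logs)))  # {1: 2, 2: 1}
```

The shuffle between map and reduce is what a real MapReduce engine would provide; here the grouping dictionary plays that role.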
25 MapReduce Is Now Part of Workflows. Use case: monitoring a large-scale distributed computing system with 160M users. Inter-query dependencies, diverse queries, new queries during the project. (Chart in log scale.) Hegeman, Ghit, Capotã, Hidders, Epema, Iosup. The BTWorld Use Case for Big Data Analytics: Description, MapReduce Logical Workflow, and Empirical Evaluation. IEEE BigData'13.
26 Agenda 1. Big Data, Our Vision, Our Team 2. Big Data on Clouds 1. The Big Data ecosystem 2. Understanding workloads 3. Benchmarking 4. How can clouds help? Elastic systems 3. Summary Ecosystem Modeling Benchmarking Elastic Systems
27 Performance: Our Team Also Includes... Ana Lucia Varbanescu (U. Amsterdam): performance modeling, parallel systems, multi-core systems. Jianbin Fang (TU Delft): parallel systems, multi-core systems, Tianhe/Xeon Phi. Jie Shen (TU Delft): performance evaluation, parallel systems, multi-core systems. Alexandru Iosup (TU Delft): performance modeling, performance evaluation.
28 Benchmarking MapReduce Systems: non-linear scaling.
29 SPEC Research Group (RG): The Research Group of the Standard Performance Evaluation Corporation. Mission statement: provide a platform for collaborative research efforts in the areas of computer benchmarking and quantitative system analysis; provide metrics, tools, and benchmarks for evaluating early prototypes and research results as well as full-blown implementations; foster interactions and collaborations between industry and academia. Ad: join us! More information:
30 Benchmarking: from a single kernel or a solitary-kernel suite to Big Data processing workflows. Derived from modeling: intra-query, intra-job, and inter-job data dependencies. Can benchmarking be realistic? Cost- and time-effective? Fair?
31 Our Method: a benchmark suite for performance evaluation of graph-processing platforms. 1. Multiple metrics, e.g., execution time; normalized: EPS, VPS; utilization. 2. Representative graphs with various characteristics, e.g., size, directivity, density. 3. Typical graph algorithms, e.g., BFS, connected components. Guo, Biczak, Varbanescu, Iosup, Martella, Willke. How Well do Graph-Processing Platforms Perform? An Empirical Performance Evaluation and Analysis.
32 Survey of graph algorithms. Class (examples; %): Graph Statistics (Diameter, PageRank; 16.1). Graph Traversal (BFS, SSSP, DFS; 46.3). Connected Component (Reachability, BiCC; 13.4). Community Detection (Clustering, Nearest Neighbor; 5.4). Graph Evolution (Forest Fire Model, PAM; 4.0). Other (Sampling, Partitioning).
33 Selection of algorithms. A1: General Statistics (STATS: # vertices and edges, LCC): single step, low processing, decision-making. A2: Breadth-First Search (BFS): iterative, low processing, building block. A3: Connected Component (CONN): iterative, medium processing, building block. A4: Community Detection (CD): iterative, medium or high processing, social networks. A5: Graph Evolution (EVO): iterative (multi-level), high processing, prediction. Challenge 4: algorithm selection.
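For reference, A2 (BFS) is the simplest of these building blocks; below is a minimal level-synchronous sketch in Python. The benchmarked platforms implement this in their own vertex-centric or MapReduce models; this sequential version only illustrates the algorithm itself:

```python
from collections import deque

def bfs(adjacency, source):
    """Level-synchronous BFS: returns the hop distance from source
    for every reachable vertex in a directed adjacency-list graph."""
    dist = {source: 0}
    frontier = deque([source])
    while frontier:
        v = frontier.popleft()
        for w in adjacency.get(v, ()):
            if w not in dist:          # first visit fixes the distance
                dist[w] = dist[v] + 1
                frontier.append(w)
    return dist

# Toy directed graph: 0 -> {1, 2}, 1 -> 3, 2 -> 3
g = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(bfs(g, 0))  # {0: 0, 1: 1, 2: 1, 3: 2}
```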
34 The Science: Which Algorithms? (DONE.) Our own survey, related to graph processing: academic publications (CIKM, ICDE, SIGKDD, SIGMOD, VLDB, CCGRID, HPDC, IPDPS, PPoPP, SC). Guo, Biczak, Varbanescu, Iosup, Martella, Willke. How Well do Graph-Processing Platforms Perform? An Empirical Performance Evaluation and Analysis, IPDPS'14.
35 Selection of graphs Number of vertices, edges, link density, size, directivity, etc. The Game Trace Archive
36 The Science: Dataset Sizes? Machines in the Cluster? Our own survey, related to graph processing: academic publications (CIKM, ICDE, SIGKDD, SIGMOD, VLDB, CCGRID, HPDC, IPDPS, PPoPP, SC). Dataset size: 100s of MB to 10s of GB. System size: <10 to 100s of nodes. Guo, Biczak, Varbanescu, Iosup, Martella, Willke. How Well do Graph-Processing Platforms Perform? An Empirical Performance Evaluation and Analysis, IPDPS'14.
37 Benchmarking Suite: Platforms and Process. Platforms: including Hadoop/YARN and Giraph. Process: evaluate baseline (out of the box) and tuned performance; evaluate performance on a fixed-size system (future: on an elastic-size system); evaluate scalability. Guo, Biczak, Varbanescu, Iosup, Martella, Willke. How Well do Graph-Processing Platforms Perform? An Empirical Performance Evaluation and Analysis. Guo, Biczak, Varbanescu, Iosup, Martella, Willke. Benchmarking Graph-Processing Platforms: A Vision. Proc. of ICPE 2014.
38 Experimental setup. Size: most experiments use 20 worker nodes; up to 50 worker nodes. DAS4: a multi-cluster Dutch grid/cloud. Intel Xeon 2.4 GHz CPUs (dual quad-core, 12 MB cache), 24 GB memory, 10 Gbit/s InfiniBand network and 1 Gbit/s Ethernet network. Utilization monitoring: Ganglia. HDFS used here as the distributed file system. Guo, Biczak, Varbanescu, Iosup, Martella, Willke. How Well do Graph-Processing Platforms Perform? An Empirical Performance Evaluation and Analysis.
39 BFS: results for all platforms, all data sets. No platform runs fastest for every graph. Not all platforms can process all graphs. Hadoop is the worst performer. Guo, Biczak, Varbanescu, Iosup, Martella, Willke. How Well do Graph-Processing Platforms Perform? An Empirical Performance Evaluation and Analysis.
40 Giraph: results for all algorithms, all data sets. Storing the whole graph in memory helps Giraph perform well. Giraph may crash when graphs or the number of messages are large. Guo, Biczak, Varbanescu, Iosup, Martella, Willke. How Well do Graph-Processing Platforms Perform? An Empirical Performance Evaluation and Analysis.
41 Horizontal scalability: BFS on Friendster (31 GB). Using more computing machines can reduce execution time. Tuning is needed for horizontal scalability, e.g., for GraphLab, split large input files into a number of chunks equal to the number of machines. Guo, Biczak, Varbanescu, Iosup, Martella, Willke. How Well do Graph-Processing Platforms Perform? An Empirical Performance Evaluation and Analysis.
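The GraphLab tuning step above amounts to pre-partitioning the edge list, one chunk per machine. A toy sketch of the idea (in practice a 31 GB file would be streamed line by line rather than held in memory, and `split_edges` is our illustrative name, not a GraphLab API):

```python
def split_edges(edges, num_chunks):
    """Partition an edge list into num_chunks contiguous, near-equal parts,
    one per machine, so each loader reads its own input file."""
    per = (len(edges) + num_chunks - 1) // num_chunks  # ceiling division
    return [edges[i * per:(i + 1) * per] for i in range(num_chunks)]

edges = [(u, u + 1) for u in range(10)]   # toy edge list: a 10-edge chain
chunks = split_edges(edges, 4)
print([len(c) for c in chunks])           # [3, 3, 3, 1]
```

With the number of chunks equal to the number of machines, each loader gets roughly the same share of edges, which is the balance the slide's tuning advice aims for.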
42 Additional Overheads: Data Ingestion Time. Data ingestion: a batch system does one ingestion, multiple processing; a transactional system does one ingestion, one processing. Data ingestion matters even for batch systems. Ingestion time by dataset (Amazon / DotaLeague / Friendster): HDFS: 1 second / 7 seconds / 5 minutes. Neo4J: 4 hours / days / n/a. Guo, Biczak, Varbanescu, Iosup, Martella, Willke. How Well do Graph-Processing Platforms Perform? An Empirical Performance Evaluation and Analysis.
43 GPUs vs CPUs: All-Pairs Shortest Path. Graph processing: it is possible to get better performance on GPUs than on CPUs; however, the algorithm and dataset also determine performance. Pender and Varbanescu. MSc thesis at TU Delft, Jun. TU Delft Library.
44 GPUs vs CPUs: BFS vs Data Format, E/V-based. However, the data format can also determine performance. Pender and Varbanescu. MSc thesis at TU Delft, Jun. TU Delft Library.
45 Agenda 1. Big Data, Our Vision, Our Team 2. Big Data on Clouds 1. The Big Data ecosystem 2. Understanding workloads 3. Benchmarking 4. How can clouds help? Elastic systems 3. Summary Ecosystem Modeling Benchmarking Elastic Systems
46 Elasticity: Our Team Elastically Includes... Alexandru Iosup (TU Delft): provisioning, allocation, elasticity, portfolio scheduling, isolation, multi-tenancy. Athanasios Antoniou (TU Delft): provisioning, allocation, isolation, utility. David Villegas (FIU/IBM): elasticity, utility. Kefeng Deng (NUDT): portfolio scheduling. Orna Agmon-Ben Yehuda (Technion): elasticity, utility.
47 Cloud Computing, the useful IT service: use only when you want, pay only for what you use!
48 IaaS Cloud Computing: Energy-Efficient IT Infrastructure Service
49 Elasticity, Performance, and Cost-Awareness: Why Dynamic Data Processing Clusters? Improve resource utilization: grow when the workload is too heavy, shrink when resources are idle. Fairness across multiple data processing clusters: redistribute idle resources, allocate resources for new MR clusters. Isolation per cluster: performance, failure, data, version. Ghit and Epema. Resource Management for Dynamic MapReduce Clusters in Multicluster Systems. MTAGS Best Paper Award.
50 The DAS-4 Infrastructure. Sites: VU (148 CPUs), UvA/MultimediaN (72), UvA (32), TU Delft (64), Leiden (32), Astron (46), connected by SURFnet6 10 Gb/s lambdas. Used for research in systems for over a decade. 1,600 cores (quad-core 2.4 GHz CPUs), GPUs, 180 TB storage, 10 Gbps InfiniBand, 1 Gbps Ethernet. Koala grid scheduler.
51 KOALA Grid Scheduler and MapReduce. Users submit jobs to deploy MR clusters. Koala schedules the MR clusters and stores their meta-data. The MR-Runner installs the MR cluster (placement, launching, and monitoring across sites). MR job submissions are transparent to Koala. Ghit and Epema. Resource Management for Dynamic MapReduce Clusters in Multicluster Systems. MTAGS Best Paper Award.
52 Resizing Mechanism. Two-level provisioning: Koala makes resource offers/reclaims; MR-Runners accept/reject the request. Grow-Shrink Policy (GSP): keep the MR cluster utilization F = totalTasks / availSlots between F_min and F_max, growing and shrinking in steps of size S_grow and S_shrink. Ghit and Epema. Resource Management for Dynamic MapReduce Clusters in Multicluster Systems. MTAGS Best Paper Award.
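The GSP decision rule can be sketched in a few lines of Python. The exact threshold semantics (grow above F_max, shrink below F_min) are our reading of the slide, not the paper's precise formulation:

```python
def grow_shrink_decision(total_tasks, avail_slots, f_min, f_max,
                         s_grow, s_shrink):
    """One step of a Grow-Shrink Policy (GSP) sketch: compare cluster
    utilization F = total_tasks / avail_slots against thresholds and
    return a resize delta in nodes (positive = grow, negative = shrink)."""
    f = total_tasks / avail_slots
    if f > f_max:
        return s_grow      # overloaded: request more nodes
    if f < f_min:
        return -s_shrink   # underloaded: release nodes
    return 0               # within [f_min, f_max]: keep current size

# 90 pending tasks on 40 slots -> F = 2.25 > F_max = 1.0 -> grow by 5
print(grow_shrink_decision(90, 40, 0.25, 1.0, 5, 5))  # 5
```

In the paper's setting this decision would run periodically in the MR-Runner, with Koala granting or reclaiming the requested nodes.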
53 Baseline Policies. Greedy-Grow Policy (GGP): only grow, with transient nodes, in steps of S_grow. Greedy-Grow-with-Data Policy (GGDP): grow with core nodes, in steps of S_grow. Ghit and Epema. Resource Management for Dynamic MapReduce Clusters in Multicluster Systems. MTAGS Best Paper Award.
54 Setup. 98% of Facebook jobs take less than a minute; Google reported computations with TBs of data. DAS-4. Two applications: Wordcount and Sort. Workload 1: single job, 100 GB; metric: makespan. Workload 2: single job, 40 GB and 50 GB; metric: makespan. Workload 3: stream of 50 jobs, 1-50 GB; metric: average job execution time. Ghit and Epema. Resource Management for Dynamic MapReduce Clusters in Multicluster Systems. MTAGS Best Paper Award.
55 Elastic MapReduce, TUD version. Two types of nodes: core nodes (compute and data storage, DataNode) and transient nodes (compute only, or compute plus data storage). Ghit and Epema. Resource Management for Dynamic MapReduce Clusters in Multicluster Systems. MTAGS Best Paper Award.
56 Transient Nodes. (Chart: performance for 10-40 transient nodes; higher is better.) Workload 2 (40 GB, 50 GB): Wordcount scales better than Sort on transient nodes. Ghit and Epema. Resource Management for Dynamic MapReduce Clusters in Multicluster Systems. MTAGS Best Paper Award.
57 Performance of Resizing using Static, Transient, and Core Nodes. (Chart: up to +20x better.) Big Data processing: it is possible to get better performance using elastic data processing, and we understand how for many scenarios (the key is balanced allocations). Sort + Wordcount (50 jobs, 1-50 GB). B. Ghit, N. Yigitbasi, A. Iosup, and D. Epema. Balanced Resource Allocations Across Multiple Dynamic MapReduce Clusters, SIGMETRICS.
58 Elasticity, Portfolio Scheduling: Why Portfolio Scheduling? Old scheduling aspects: hundreds of approaches, each targeting specific conditions; which to choose, and how to configure it? No one-size-fits-all policy. New scheduling aspects: new workloads (e.g., pretty much all Big Data), new data center architectures, new cost models (e.g., moving workloads to IaaS clouds). Developing a scheduling policy is risky and ephemeral; selecting a scheduling policy is risky and difficult.
59 What Is Portfolio Scheduling? In a Nutshell, for Elastic Big Data Processing. Create a set of scheduling policies (resource provisioning and allocation policies). Select the active policy online, at important moments (periodic selection, for example). The same principle applies to other changes: pricing model, system, ...
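The nutshell description above can be sketched in Python: at each decision point, simulate every candidate policy on the current workload and activate the one with the best utility. All names, the toy simulator, and the utility function here are illustrative assumptions, not the authors' actual API:

```python
def portfolio_select(policies, workload, simulate, utility):
    """Simulation-based portfolio scheduling sketch: score each candidate
    provisioning/allocation policy on the workload, return the best one."""
    best_policy, best_u = None, float("-inf")
    for policy in policies:
        outcome = simulate(policy, workload)   # e.g., predicted runtime, cost
        u = utility(outcome)
        if u > best_u:
            best_policy, best_u = policy, u
    return best_policy

# Toy example: two provisioning policies, utility = -(runtime + cost).
def simulate(policy, workload):
    runtime = sum(workload) / policy["slots"]  # perfectly divisible work
    cost = policy["slots"] * policy["price"]
    return {"runtime": runtime, "cost": cost}

utility = lambda o: -(o["runtime"] + o["cost"])
policies = [{"name": "few-big", "slots": 4, "price": 2.0},
            {"name": "many-small", "slots": 16, "price": 0.4}]
print(portfolio_select(policies, [10, 20, 30], simulate, utility)["name"])
```

A periodic portfolio scheduler would rerun this selection at every decision interval, swapping the active policy whenever the simulation ranks another one higher.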
60 Portfolio Scheduling (Technical). Periodic execution; simulation-based selection; a utility function. Alternatives to the simulator: expert human knowledge, running a workload sample in the real environment, mathematical analysis. Alternatives to the utility function: well-known and exotic functions (e.g., with α=β=1, K=100); R_J: total runtime of jobs, R_V: total runtime of VMs, S: slowdown. (SC'13 talk: Tue, Nov 19, 1:30p-2:00p.) Deng, Verboon, Iosup. A Periodic Portfolio Scheduler for Scientific Computing in the Data Center. JSSPP'13. Deng, Song, Ren, Iosup. Exploring Portfolio Scheduling for Long-Term Execution of Scientific Workloads in IaaS Clouds. SC'13. Agmon Ben-Yehuda, Schuster, Sharov, Silberstein, Iosup. ExPERT: Pareto-Efficient Task Replication on Grids and a Cloud. IPDPS'12.
61 Portfolio Scheduling for Online Gaming (also for Scientific Workloads). CoH = Cloud-based, Online, Hybrid scheduling. Intuition: keep rental cost low by finding a good mix of machine configurations and billing options; use on-demand cloud VMs. Main idea: run both a solver for an Integer Programming formulation and various heuristics, and periodically (at the deadline) pick the best schedule. Additional feature: can use reserved cloud instances. Evaluated with gaming (and scientific) workloads. Shen, Deng, Iosup, and Epema. Scheduling Jobs in the Cloud Using On-demand and Reserved Instances, EuroPar'13.
62 Agenda 1. Big Data, Our Vision, Our Team 2. Big Data on Clouds 1. The Big Data ecosystem 2. Understanding workloads 3. Benchmarking 4. How can clouds help? Elastic systems 3. Summary Ecosystem Modeling Benchmarking Elastic Systems
63 Conclusion: Take-Home Message. Big Data is necessary, but a grand challenge. Big Data = systems of systems; big data programming models have ecosystems. Stuck in stacks! Many trade-offs, many programming models, many problems. Towards a generic Big-Data processing system, looking at the execution engine: a thrilling moment for this! Predictability challenges: understanding workloads (modeling) and performance (benchmarking). Performance challenges: distributed/parallel from the beginning. Elasticity challenges: elastic data processing, portfolio scheduling, etc.
64 Thank you for your attention! Questions? Suggestions? Observations? More info: Alexandru Iosup, [email protected] (or google "iosup"). Do not hesitate to contact me. Parallel and Distributed Systems Group, Delft University of Technology.
A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani
A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani Technical Architect - Big Data Syntel Agenda Welcome to the Zoo! Evolution Timeline Traditional BI/DW Architecture Where Hadoop Fits In 2 Welcome to
Prepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
Ali Ghodsi Head of PM and Engineering Databricks
Making Big Data Simple Ali Ghodsi Head of PM and Engineering Databricks Big Data is Hard: A Big Data Project Tasks Tasks Build a Hadoop cluster Challenges Clusters hard to setup and manage Build a data
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 15 Big Data Management V (Big-data Analytics / Map-Reduce) Chapter 16 and 19: Abideboul et. Al. Demetris
Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce
Analytics in the Cloud Peter Sirota, GM Elastic MapReduce Data-Driven Decision Making Data is the new raw material for any business on par with capital, people, and labor. What is Big Data? Terabytes of
Extending Hadoop beyond MapReduce
Extending Hadoop beyond MapReduce Mahadev Konar Co-Founder @mahadevkonar (@hortonworks) Page 1 Bio Apache Hadoop since 2006 - committer and PMC member Developed and supported Map Reduce @Yahoo! - Core
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 14 Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HFDS) Chapter 14-15: Abideboul
Accelerating and Simplifying Apache
Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly
Hadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
The Future of Data Management
The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah (@awadallah) Cofounder and CTO Cloudera Snapshot Founded 2008, by former employees of Employees Today ~ 800 World Class
Cloud Computing Where ISR Data Will Go for Exploitation
Cloud Computing Where ISR Data Will Go for Exploitation 22 September 2009 Albert Reuther, Jeremy Kepner, Peter Michaleas, William Smith This work is sponsored by the Department of the Air Force under Air
An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov
An Industrial Perspective on the Hadoop Ecosystem Eldar Khalilov Pavel Valov agenda 03.12.2015 2 agenda Introduction 03.12.2015 2 agenda Introduction Research goals 03.12.2015 2 agenda Introduction Research
BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES
BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES Relational vs. Non-Relational Architecture Relational Non-Relational Rational Predictable Traditional Agile Flexible Modern 2 Agenda Big Data
HDP Hadoop From concept to deployment.
HDP Hadoop From concept to deployment. Ankur Gupta Senior Solutions Engineer Rackspace: Page 41 27 th Jan 2015 Where are you in your Hadoop Journey? A. Researching our options B. Currently evaluating some
Virtualizing Apache Hadoop. June, 2012
June, 2012 Table of Contents EXECUTIVE SUMMARY... 3 INTRODUCTION... 3 VIRTUALIZING APACHE HADOOP... 4 INTRODUCTION TO VSPHERE TM... 4 USE CASES AND ADVANTAGES OF VIRTUALIZING HADOOP... 4 MYTHS ABOUT RUNNING
BIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata
BIG DATA: FROM HYPE TO REALITY Leandro Ruiz Presales Partner for C&LA Teradata Evolution in The Use of Information Action s ACTIVATING MAKE it happen! Insights OPERATIONALIZING WHAT IS happening now? PREDICTING
Exploiting Cloud Heterogeneity for Optimized Cost/Performance MapReduce Processing
Exploiting Cloud Heterogeneity for Optimized Cost/Performance MapReduce Processing Zhuoyao Zhang University of Pennsylvania, USA [email protected] Ludmila Cherkasova Hewlett-Packard Labs, USA [email protected]
From Internet Data Centers to Data Centers in the Cloud
From Internet Data Centers to Data Centers in the Cloud This case study is a short extract from a keynote address given to the Doctoral Symposium at Middleware 2009 by Lucy Cherkasova of HP Research Labs
Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84
Index A Amazon Web Services (AWS), 50, 58 Analytics engine, 21 22 Apache Kafka, 38, 131 Apache S4, 38, 131 Apache Sqoop, 37, 131 Appliance pattern, 104 105 Application architecture, big data analytics
Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia
Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing
Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013
Petabyte Scale Data at Facebook Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013 Agenda 1 Types of Data 2 Data Model and API for Facebook Graph Data 3 SLTP (Semi-OLTP) and Analytics
Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.
Collaborative Big Data Analytics 1 Big Data Is Less About Size, And More About Freedom TechCrunch!!!!!!!!! Total data: bigger than big data 451 Group Findings: Big Data Is More Extreme Than Volume Gartner!!!!!!!!!!!!!!!
Parallel Computing. Benson Muite. [email protected] http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage
Parallel Computing Benson Muite [email protected] http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework
Hadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
COMP9321 Web Application Engineering
COMP9321 Web Application Engineering Semester 2, 2015 Dr. Amin Beheshti Service Oriented Computing Group, CSE, UNSW Australia Week 11 (Part II) http://webapps.cse.unsw.edu.au/webcms2/course/index.php?cid=2411
Big Data and Analytics: Challenges and Opportunities
Big Data and Analytics: Challenges and Opportunities Dr. Amin Beheshti Lecturer and Senior Research Associate University of New South Wales, Australia (Service Oriented Computing Group, CSE) Talk: Sharif
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected]
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected] Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb
Four Orders of Magnitude: Running Large Scale Accumulo Clusters. Aaron Cordova Accumulo Summit, June 2014
Four Orders of Magnitude: Running Large Scale Accumulo Clusters Aaron Cordova Accumulo Summit, June 2014 Scale, Security, Schema Scale to scale 1 - (vt) to change the size of something let s scale the
Hur hanterar vi utmaningar inom området - Big Data. Jan Östling Enterprise Technologies Intel Corporation, NER
Hur hanterar vi utmaningar inom området - Big Data Jan Östling Enterprise Technologies Intel Corporation, NER Legal Disclaimers All products, computer systems, dates, and figures specified are preliminary
Big Data Performance Growth on the Rise
Impact of Big Data growth On Transparent Computing Michael A. Greene Intel Vice President, Software and Services Group, General Manager, System Technologies and Optimization 1 Transparent Computing (TC)
Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems
Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Rekha Singhal and Gabriele Pacciucci * Other names and brands may be claimed as the property of others. Lustre File
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
Addressing Open Source Big Data, Hadoop, and MapReduce limitations
Addressing Open Source Big Data, Hadoop, and MapReduce limitations 1 Agenda What is Big Data / Hadoop? Limitations of the existing hadoop distributions Going enterprise with Hadoop 2 How Big are Data?
Apache Hadoop Ecosystem
Apache Hadoop Ecosystem Rim Moussa ZENITH Team Inria Sophia Antipolis DataScale project [email protected] Context *large scale systems Response time (RIUD ops: one hit, OLTP) Time Processing (analytics:
Where We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344
Where We Are Introduction to Data Management CSE 344 Lecture 25: DBMS-as-a-service and NoSQL We learned quite a bit about data management see course calendar Three topics left: DBMS-as-a-service and NoSQL
Big Data and Hadoop with components like Flume, Pig, Hive and Jaql
Abstract- Today data is increasing in volume, variety and velocity. To manage this data, we have to use databases with massively parallel software running on tens, hundreds, or more than thousands of servers.
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture
HiBench Introduction. Carson Wang ([email protected]) Software & Services Group
HiBench Introduction Carson Wang ([email protected]) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is
Energy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
Emerging Technology for the Next Decade
Emerging Technology for the Next Decade Cloud Computing Keynote Presented by Charles Liang, President & CEO Super Micro Computer, Inc. What is Cloud Computing? Cloud computing is Internet-based computing,
Big Data Explained. An introduction to Big Data Science.
Big Data Explained An introduction to Big Data Science. 1 Presentation Agenda What is Big Data Why learn Big Data Who is it for How to start learning Big Data When to learn it Objective and Benefits of
Big Data on Microsoft Platform
Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4
Analyzing Big Data with AWS
Analyzing Big Data with AWS Peter Sirota, General Manager, Amazon Elastic MapReduce @petersirota What is Big Data? Computer generated data Application server logs (web sites, games) Sensor data (weather,
Chapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
Big Data Big Data/Data Analytics & Software Development
Big Data Big Data/Data Analytics & Software Development Danairat T. [email protected], 081-559-1446 1 Agenda Big Data Overview Business Cases and Benefits Hadoop Technology Architecture Big Data Development
A Service for Data-Intensive Computations on Virtual Clusters
A Service for Data-Intensive Computations on Virtual Clusters Executing Preservation Strategies at Scale Rainer Schmidt, Christian Sadilek, and Ross King [email protected] Planets Project Permanent
Cloud Computing Summary and Preparation for Examination
Basics of Cloud Computing Lecture 8 Cloud Computing Summary and Preparation for Examination Satish Srirama Outline Quick recap of what we have learnt as part of this course How to prepare for the examination
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
Data Refinery with Big Data Aspects
International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 7 (2013), pp. 655-662 International Research Publications House http://www. irphouse.com /ijict.htm Data
The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn
The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn Presented by :- Ishank Kumar Aakash Patel Vishnu Dev Yadav CONTENT Abstract Introduction Related work The Ecosystem Ingress
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction
