Benchmarking and Ranking Big Data Systems
1 Benchmarking and Ranking Big Data Systems Xinhui Tian ICT, Chinese Academy of Sciences and University of Chinese Academy of Sciences INSTITUTE OF COMPUTING TECHNOLOGY
2 Outline
- BigDataBench
- BigDataBench Dwarfs
- BigData100
- Ranking big data systems
BPOE
3 Why Big Data Benchmarking?
- Measuring big data systems and architectures quantitatively
4 What is BigDataBench?
- An open source big data benchmarking project
- Search Google for BigDataBench
5 BigDataBench Users
- Industry users
  - Accenture, Broadcom, Samsung, Huawei, IBM
- About 20 academic groups have published papers using BigDataBench
6 BigDataBench 3.2 Overview
- BDGS (Big Data Generator Suite) for scalable data
- 15 real-world data sets: Wikipedia Entries, Amazon Movie Reviews, Google Web Graph, Facebook Social Network, E-commerce Transaction, ImageNet, English broadcasting audio, ProfSearch Resumes, DVD Input Streams, Image scene, Genome sequence data, Assembly of the human genome, SoGou Data, MNIST, MovieLens Dataset
- 37 workloads across application domains: Search Engine, Social Network, E-commerce, Multimedia, Bioinformatics
- Software stacks: Impala, NoSQL, Shark, Hadoop, RDMA, MPI, DataMPI
7 What's New in BigDataBench 3.2
- BigDataBench support for Flink: WordCount, Grep, Naïve Bayes, PageRank, K-means
- Streaming: JStorm, Spark Streaming
- Graph processing: GraphX, GraphLab, Flink Gelly
8 BigDataBench evolution
- BigDataBench: new software stacks (Flink, JStorm, GraphX, GraphLab); new workload types (streaming, graph processing); new data sets and workloads
- BigDataBench: application domains with 14 data sets and 33 workloads; same specifications, diverse implementations; multi-tenancy version; BigDataBench subset and simulator version
- BigDataBench: multidisciplinary effort; 32 workloads with diverse implementations
- BigDataBench: typical Internet service domains; an architectural perspective; 19 workloads and data generation tools
- BigDataBench: search engine; 6 workloads
- DCBench 1.0: data analytics workloads
- CloudRank 1.0: mixed data analytics workloads
9 BigDataBench Publications
- BigDataBench: a Big Data Benchmark Suite from Internet Services. 20th IEEE International Symposium on High Performance Computer Architecture (HPCA-2014).
- Characterizing data analysis workloads in data centers. IEEE International Symposium on Workload Characterization (IISWC 2013) (Best paper award).
- BigOP: generating comprehensive big data workloads as a benchmarking framework. 19th International Conference on Database Systems for Advanced Applications (DASFAA 2014).
- BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking. The Fourth Workshop on Big Data Benchmarking (WBDB 2014).
- Identifying Dwarfs Workloads in Big Data Analytics. arXiv preprint.
- BigDataBench-MT: A Benchmark Tool for Generating Realistic Mixed Data Center Workloads. arXiv preprint.
10 Outline
- BigDataBench
- BigDataBench Dwarfs
- BigData100
- Ranking big data systems
11 BigDataBench Methodology
- Application Domain 1, ..., Application Domain N
- Benchmark specification 1, ..., Benchmark specification N
  - Data models of different types & semantics
  - Data operations & workload patterns
- Real-world data sets, data generation tools, workloads with diverse implementations
- Multi-tenancy version: mix with different percentages
- BigDataBench subset: reduce benchmarking cost
12 BigDataBench Methodology with Dwarfs
- Application Domain 1, ..., Application Domain N
- Benchmark specification 1, ..., Benchmark specification N
  - Data models of different types & semantics
  - Dwarfs workloads identification
- Real-world data sets, data generation tools, workloads with diverse implementations
- Multi-tenancy version: mix with different percentages
- BigDataBench subset: reduce benchmarking cost
13 Dwarfs Workloads
- Finding the common atomic workloads in big data analytics workloads
- Representing the maximum number of big data analytics patterns using a minimum workload set
14 Inspiration
- Successful compute abstractions
  - Relational algebra: 5 primitive operations (Select, Project, Product, Union, Difference)
  - Parallel computing: 13 dwarfs (computational & communication patterns)
- Successful benchmarks
  - TPC-C: OLTP domain, functions of abstraction
  - HPCC: high performance computing, seven basic tests
15 Fundamental Issues
- What are the challenges for big data dwarfs?
- How to find the dwarfs workloads in big data analytics?
16 Challenges
- 80% of data growth is unstructured data
- Many techniques for processing big data exist, which brings greater complexity for identifying dwarfs workloads
- A machine learning library, scikit-learn, implements many algorithms, yet still far fewer than the total number of algorithms
- Massive application domains make us wonder where to start or how to achieve a wide range of coverage
- Techniques pictured: deep learning, classification, data mining, dimension reduction, clustering, machine learning, computer vision, natural language processing, signal processing, regression, association rule mining
17 Methodology of Dwarfs
- Application Domain 1, Application Domain 2, ..., Application Domain N
- Big data analytics processing techniques: machine learning, data mining, deep learning, computer vision, natural language processing, ...
- Representative algorithms, drawn from libraries (MLlib, Mahout), frameworks (Spark, Hadoop, GraphLab) and benchmarks (BigBench, LinkBench, ...)
- Frequently appearing operations
- Different combinations: Dwarfs 1, Dwarfs 2, ..., Dwarfs M
21 Algorithms Chosen to Investigate
- Computer Vision: MPEG-2, Scale-invariant feature transform, Image segmentation, Ray tracing
- Database Software: Needleman-Wunsch, Smith-Waterman, BLAST
- Deep Learning: CNN, DBN
- Data Mining & Machine Learning: C4.5/CART/ID3, Logistic regression, SVM, KNN, HMM, Maximum-entropy Markov model, Conditional random field, PageRank, HITS, Apriori, FP-growth, Principal component analysis, Back propagation, AdaBoost, MCMC, Connected component, Random forest
- Natural Language Processing: Latent semantic indexing, pLSI, Latent Dirichlet allocation, Index, Porter stemming, Sphinx speech recognition
- Recommendation: Demographic/content-based recommendation, Collaborative filtering
22 Frequently-appearing Operations
[Table: frequently appearing operations and the algorithms in which they appear]
- Operation categories: Similarity analysis, Sampling, Statistic operation, Set operations, Linear algebra, Sort, Graph operation, Transform operation, Logic operation
- Operations: Random sampling, Downsampling, Probability calculation, Jaccard similarity, Locality sensitive hashing, FFT, Convolution computation, DCT, Matrix/vector calculation, Partial sort, Quick sort, Top-k sort, BFS, DFS, Set union, Set difference, Encryption, Fingerprint, SimHash, MinHash
- Algorithms: KNN, K-means, Recommendation, MCMC, LDA, Feature matching, Image segmentation, LSI, pLSI, Latent Dirichlet allocation, MPEG-2, Association rules mining, Neural network, Back propagation, CNN, DBN, Speech recognition, Multimedia representation, Index, Decision tree, Connected component, Sphinx speech recognition, SIFT, Ray tracing, Database, Apriori, FP-Growth, SVM, HMM, MEMM, CRF, PageRank, HITS, Logistic regression
23 Dwarfs in Big Data Offline Analytics
- Linear algebra
- Sampling
- Transform operation
- Graph operation
- Logic operation
- Set operation
- Statistic operation
- Sort
24 Outline
- BigDataBench
- BigDataBench Dwarfs
- BigData100
- Ranking big data systems
25 BigData100 Ranking
- An open source project for benchmarking and ranking big data systems
- Uses benchmarks from BigDataBench
- Covers big data systems including Hadoop, Spark, Flink and many SQL-interface systems
26 Why Rank Big Data Systems?
- First, a quick glance at the current big data ecosystem
27 Big Data Systems
- Programming interface: SQL-like, DataFlow APIs, R language
- Computation engines: DB layer, graph layer, ML layer; DataFlow engines, Graph, MPP, Streaming
- Resource management
- Data storage: column-oriented format, distributed file systems, key-value storage, NoSQL, relational DB
- Data sources: structured data, unstructured data, semi-structured data
28 Computation Engine
- Programming model
- Intermediate data management
- Network transfer
- Local execution
- Task scheduling
- Fault tolerance
- Parallel computation
29 The Dataflow Model
- A computation job is represented as a directed acyclic graph (DAG) consisting of data sources, sinks and operators
- First used in the Microsoft Dryad system
- Google MapReduce can be seen as a special case of a DAG with only two types of operators
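The dataflow model described on this slide can be sketched in a few lines of Python. This is a toy, single-process illustration, not code from Dryad, MapReduce or BigDataBench: the `Dataflow` class, its `add`/`run` methods and the operator names are all hypothetical. It shows a job as a DAG of sources, operators and sinks, executed in topological order, with a MapReduce-style word count expressed as a DAG that has only a map operator and a reduce operator.

```python
from collections import defaultdict, deque


class Dataflow:
    """Toy dataflow job: named operators wired into a DAG (illustrative only)."""

    def __init__(self):
        self.funcs = {}    # operator name -> function applied to upstream results
        self.inputs = {}   # operator name -> list of upstream operator names

    def add(self, name, func, inputs=()):
        self.funcs[name] = func
        self.inputs[name] = list(inputs)
        return self

    def run(self):
        # Kahn's algorithm: an operator runs once all of its upstream operators finished.
        pending = {name: len(ups) for name, ups in self.inputs.items()}
        downstream = defaultdict(list)
        for name, ups in self.inputs.items():
            for up in ups:
                downstream[up].append(name)
        ready = deque(name for name, count in pending.items() if count == 0)
        results = {}
        while ready:
            name = ready.popleft()
            args = [results[up] for up in self.inputs[name]]
            results[name] = self.funcs[name](*args)
            for nxt in downstream[name]:
                pending[nxt] -= 1
                if pending[nxt] == 0:
                    ready.append(nxt)
        if len(results) != len(self.funcs):
            raise ValueError("cycle or undefined input: the job is not a valid DAG")
        return results


def map_words(lines):
    # "Map" operator: emit (word, 1) for every word.
    return [(word, 1) for line in lines for word in line.split()]


def reduce_counts(pairs):
    # "Reduce" operator: sum the values of each key.
    totals = defaultdict(int)
    for word, one in pairs:
        totals[word] += one
    return dict(totals)


def word_count(lines):
    # MapReduce as a special-case DAG: source -> map -> reduce -> sink.
    job = (Dataflow()
           .add("source", lambda: lines)
           .add("map", map_words, ["source"])
           .add("reduce", reduce_counts, ["map"])
           .add("sink", lambda counts: counts, ["reduce"]))
    return job.run()["sink"]
```

A general dataflow job differs only in that operators may have several inputs and arbitrary types, for example `Dataflow().add("x", lambda: 2).add("y", lambda: 3).add("sum", lambda a, b: a + b, ["x", "y"])`.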
30 DataFlow
[Figure: dataflow graphs of four systems]
- MapReduce: Input -> Map -> Sort -> Spill -> Merge -> Reduce -> Output (legend: intermediate data transformation, user function)
- Dryad: Input1, Input2; Operator1 to Operator5 connected by Channel1 to Channel4; Output
- Spark: Input1, Input2; RDD1 to RDD6 linked by Transformation1 to Transformation4, some RDDs cached; Action; Output
- Flink: Input1, Input2; PACT1 to PACT5 connected by Channel1 to Channel4; Output
31 Other Specific Models
- Pregel: Malewicz G, et al. Pregel: a system for large-scale graph processing[C]//Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, 2010.
- Impala: Kornacker M, Behm A, Bittorf V, et al. Impala: A Modern, Open-Source SQL Engine for Hadoop[C]//CIDR.
32 Why Rank Big Data Systems?
- Understanding the performance difference between general-purpose and specific systems
  - Performance comparison
- Figuring out how the design of a framework impacts the performance of specific types of workloads
  - Behavior analysis
33 Benchmark
- Current benchmarks
  - Micro workloads: WordCount, Grep
  - Iterative workloads: PageRank, KMeans
  - Interactive queries: Select, Aggregation, Join
  - Others: Naïve Bayes
34 Benchmark Behaviors: Spark Case
- Spark
  - Resilient distributed datasets (RDDs)
    - Immutable, partitioned collections of objects
    - Created through parallel transformations (map, filter, groupBy, join, ...) on data in stable storage
    - Can be cached for efficient reuse
[Figure: HDFS File -> filter (func = _.contains(...)) -> Filtered RDD -> map (func = _.split(...)) -> Mapped RDD]
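The RDD properties listed on this slide (immutable, partitioned, built by transformations, optionally cached) can be illustrated with a small pure-Python stand-in. `ToyRDD` and its methods are hypothetical names invented for this sketch; it is not Spark's implementation and runs in a single process. It mirrors the filter-then-map chain in the figure and shows that nothing is computed until a result is requested.

```python
class ToyRDD:
    """Immutable, partitioned, lazily evaluated collection (a toy, not Spark's RDD)."""

    def __init__(self, compute, parent=None, op="source"):
        self._compute = compute   # zero-argument function returning a list of partitions
        self.parent = parent      # the dataset this one was derived from (lineage)
        self.op = op
        self._cached = False
        self._cache = None

    @staticmethod
    def from_partitions(partitions):
        frozen = [tuple(p) for p in partitions]
        return ToyRDD(lambda: [list(p) for p in frozen])

    def map(self, f):
        # Returns a new dataset; the parent is never modified.
        return ToyRDD(lambda: [[f(x) for x in p] for p in self.partitions()], self, "map")

    def filter(self, pred):
        return ToyRDD(lambda: [[x for x in p if pred(x)] for p in self.partitions()], self, "filter")

    def cache(self):
        # Keep the computed partitions in memory after the first evaluation.
        self._cached = True
        return self

    def partitions(self):
        if self._cached:
            if self._cache is None:
                self._cache = self._compute()
            return self._cache
        return self._compute()

    def collect(self):
        # Action: triggers evaluation of the whole lineage.
        return [x for p in self.partitions() for x in p]

    def lineage(self):
        node, ops = self, []
        while node is not None:
            ops.append(node.op)
            node = node.parent
        return list(reversed(ops))


lines = ToyRDD.from_partitions([["ERROR a b", "INFO c"], ["ERROR d e"]])
errors = lines.filter(lambda s: "ERROR" in s)   # filtered dataset
fields = errors.map(lambda s: s.split())        # mapped dataset
```

Calling `fields.collect()` evaluates the chain partition by partition, and `fields.lineage()` lists the operations that would be replayed to rebuild a lost partition.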
35 Benchmark Behaviors
- WordCount, Grep, NaiveBayes, Select
  - I/O-intensive
  - Computation can be easily parallelized
  - No or little data transfer
[Figure: DataSet1 -> Map -> DataSet2]
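Why these workloads parallelize easily can be seen in a minimal sketch: each partition is processed independently, and only small results are combined at the end. The function names and the use of a thread pool are assumptions made for illustration; a real engine would run one task per partition on different nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def grep_partition(lines, needle):
    # Pure per-partition work: no data from other partitions is needed.
    return [line for line in lines if needle in line]


def count_partition(lines):
    # Per-partition word counts; only these small counters are merged later.
    return Counter(word for line in lines for word in line.split())


def parallel_grep(partitions, needle, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda p: grep_partition(p, needle), partitions)
        return [line for part in parts for line in part]


def parallel_word_count(partitions, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        total = Counter()
        for partial in pool.map(count_partition, partitions):
            total.update(partial)
        return total
```

Grep needs no combination step at all beyond concatenation, while WordCount moves only the per-partition counters, which is consistent with the "no or little data transfer" point above.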
36 Benchmark Behaviors
- PageRank, KMeans
  - Iterative computation
  - Data sets are joined and aggregated in each iteration
  - A lot of shuffle operations
[Figure: DataSet1 feeds Iteration 1, Iteration 2 and Iteration 3; each iteration applies GroupBy, Join and Map and yields DataSet2]
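The join, map and group-by steps of each iteration can be made concrete with a simplified, single-process PageRank. This is a sketch under assumptions: it uses the unnormalized update `(1 - d) + d * sum(contributions)`, ignores dangling-node correction, and the function name and parameters are chosen for illustration. In a distributed engine the grouping step is where the shuffle happens.

```python
from collections import defaultdict


def pagerank(edges, iterations=10, damping=0.85):
    # Static data set: adjacency lists built once from (src, dst) pairs.
    links = defaultdict(list)
    for src, dst in edges:
        links[src].append(dst)
    nodes = {n for edge in edges for n in edge}
    ranks = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Join + Map: pair each page's current rank with its out-links and
        # emit one contribution per link.
        contribs = [(dst, ranks[src] / len(dsts))
                    for src, dsts in links.items()
                    for dst in dsts]
        # GroupBy + aggregation: sum contributions per destination page.
        summed = defaultdict(float)
        for dst, value in contribs:
            summed[dst] += value
        # New rank data set for the next iteration.
        ranks = {n: (1 - damping) + damping * summed[n] for n in nodes}
    return ranks
```

Every iteration re-reads the static link data and regroups all contributions, which is why these workloads stress intermediate data management and network transfer far more than WordCount or Grep.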
37 Benchmark Behaviors
- Database queries
  - One DAG job that consists of many different transformations
  - Including complex aggregation and join algorithms
[Figure: DataSet1 -> Map and DataSet2 -> Map -> Join -> Partial Aggregation -> Full Aggregation -> DataSet3]
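The join followed by partial and full aggregation in the figure can be sketched as follows. The table layout (orders keyed by customer id, customers mapped to a region) and all function names are hypothetical; the sketch only illustrates the two-phase aggregation pattern, in which each partition first reduces its own rows so that only small per-group summaries have to be merged.

```python
from collections import defaultdict


def hash_join(left, right):
    # Inner join of (key, value) pairs: build a hash table on `right`, probe with `left`.
    table = defaultdict(list)
    for key, value in right:
        table[key].append(value)
    return [(key, lv, rv) for key, lv in left for rv in table.get(key, [])]


def partial_aggregate(rows):
    # Per-partition (sum, count) for each group.
    acc = {}
    for group, value in rows:
        total, count = acc.get(group, (0.0, 0))
        acc[group] = (total + value, count + 1)
    return acc


def full_aggregate(partials):
    # Merge the per-partition summaries and finish the average.
    merged = {}
    for partial in partials:
        for group, (total, count) in partial.items():
            t, c = merged.get(group, (0.0, 0))
            merged[group] = (t + total, c + count)
    return {group: t / c for group, (t, c) in merged.items()}


def average_amount_by_region(order_partitions, customers):
    partials = []
    for orders in order_partitions:
        joined = hash_join(orders, customers)                       # (customer_id, amount, region)
        rows = [(region, amount) for _, amount, region in joined]   # map to (group, value)
        partials.append(partial_aggregate(rows))
    return full_aggregate(partials)
```

An average cannot be merged directly, so the partial phase carries `(sum, count)` pairs and the division happens only in the full aggregation.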
38 Workload Behaviors on CPU and Memory (JVM)
- Simple batch workloads
  - WordCount, Grep, NaiveBayes, Select
- Iterative workloads
  - PageRank, KMeans
- Interactive queries
  - Aggregation, Join
39 CPU - Simple Batch Workloads
[Figure panels: WordCount, NaiveBayes, Grep]
40 CPU - Iterative Workloads
[Figure panels: PageRank, KMeans]
41 CPU - Query Processing
[Figure panels: Aggregation, Join]
42 Memory - JVM Memory Model
jvm-memory-model-and-garbage-collection-monitoring-tuning
43 JVM Memory Usage in Spark
- EC: Eden capacity
- EU: Eden usage
- OC: Old memory capacity
- OU: Old memory usage
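These four abbreviations match column names printed by the JDK's `jstat -gc` tool, although the slides do not say how the values were collected, so that link is an assumption. Under that assumption, a minimal sketch for turning one sample into Eden and old-generation utilization looks like this; the helper names and the sample numbers in the usage note are made up for illustration.

```python
def parse_gc_sample(header, row):
    # Pair whitespace-separated column names with their numeric values.
    names = header.split()
    values = [float(x) for x in row.split()]
    if len(names) != len(values):
        raise ValueError("header and row have different numbers of columns")
    return dict(zip(names, values))


def utilization(sample):
    # Fraction of each space in use: usage divided by capacity.
    return {
        "eden": sample["EU"] / sample["EC"],
        "old": sample["OU"] / sample["OC"],
    }
```

For example, `utilization(parse_gc_sample("EC EU OC OU", "1000.0 250.0 4000.0 1000.0"))` reports that a quarter of each space is in use for that illustrative sample.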
44 Simple Batch Workloads
[Figure panels: WordCount, NaiveBayes, Grep, Select]
45 Iterative Workloads
[Figure panels: PageRank, KMeans]
46 Query Processing
[Figure panels: Aggregation, Join]
47 Revisit the dataflow
[Figure: iterative dataflow with Iteration 1, Iteration 2 and Iteration 3, each applying GroupBy, Join and Map and yielding DataSet2]
[Figure: query dataflow with DataSet1 -> Map and DataSet2 -> Map -> Join -> Partial Aggregation -> Full Aggregation -> DataSet3]
48 Some Primary Results of BigData100
- Cluster configuration
  - Computation nodes: 16 nodes
  - CPU per node: 2 x Intel Xeon E5645, 12 cores
  - Memory per node: 32 GB
  - Disk per node: 1 TB x 2 SATA disks
  - Network: Gigabit Ethernet
49 Systems Selected
- Simple batch workloads and iterative workloads
  - Hadoop 2.7.1, Spark 1.5.1, Flink
- Interactive query processing
  - Spark 1.5.1, Hive (on Hadoop, on Tez and on Spark), Impala
50 Size of data sets
- WordCount, Grep
  - 1,600,000,000 English words from Wikipedia
  - About 300 GB
- Naïve Bayes
  - 5,700,000 reviews from Amazon Movie Reviews
  - 300 GB
51 Size of data sets
- PageRank
  - Google Web Graph, 16 million vertices, 99 million edges (unstructured graph)
- KMeans
  - Facebook Social Network, 460 million vectors
- Queries: subset of TPC-DS
  - item: 4 GB
  - customer: 13 GB
  - order: 100 GB
52 Results: Simple Batch Workloads
[Figure: run time (s) of WordCount and Grep on Hadoop, Spark and Flink]
53 Results: Naive Bayes
[Figure: run time (s) of Naive Bayes on Hadoop, Spark and Flink; one surviving data label: 563]
54 Observation
- Spark has the best performance on this kind of workload
  - Good data locality
  - One task for each partition
- One reason for the poor performance of Flink
  - High-level data locality: considers the whole file
55 Results: Iterative Workloads
[Figure: run time (s) of PageRank on Hadoop, Spark, Flink and Flink-delta, and of KMeans on Hadoop, Spark and Flink]
56 Observation
- Spark and Flink can be much faster than Hadoop
  - Advantage of the DAG model
- The delta iteration in Flink can be more efficient than the bulk model used in Spark
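The difference between bulk and delta iteration can be illustrated with connected components by label propagation. This is a single-process toy with hypothetical function names, not Flink's or Spark's implementation: the bulk version recomputes every vertex in every superstep, while the delta version keeps a workset and touches only vertices whose label changed in the previous superstep. Both return the labels together with the number of vertex evaluations performed.

```python
from collections import defaultdict


def build_neighbors(edges):
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def cc_bulk(vertices, edges):
    """Bulk iteration: the whole data set is recomputed until nothing changes."""
    neighbors = build_neighbors(edges)
    labels = {v: v for v in vertices}
    work = 0
    while True:
        new_labels = {}
        for v in vertices:
            work += 1
            new_labels[v] = min([labels[v]] + [labels[n] for n in neighbors[v]])
        if new_labels == labels:
            return labels, work
        labels = new_labels


def cc_delta(vertices, edges):
    """Delta iteration: only vertices in the workset are processed."""
    neighbors = build_neighbors(edges)
    labels = {v: v for v in vertices}   # solution set
    workset = set(vertices)             # vertices whose label changed last superstep
    work = 0
    while workset:
        changed = {}
        for v in workset:
            work += 1
            for n in neighbors[v]:
                # Offer v's label to each neighbor; keep the smallest offer.
                if labels[v] < changed.get(n, labels[n]):
                    changed[n] = labels[v]
        labels.update(changed)
        workset = set(changed)
    return labels, work
```

On a path of six vertices plus one separate edge, both functions produce the same labels, but the delta version performs fewer vertex evaluations because converged vertices drop out of the workset.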
57 Results: Queries - Select
[Figure: run time (s) on SparkSQL, HiveOnSpark, HiveOnTez, Hive and Impala]
58 Results: Queries - Aggregation
[Figure: run time (s) on SparkSQL, HiveOnSpark, HiveOnTez, Hive and Impala]
59 Results: Queries - Join
[Figure: run time (s) on SparkSQL, HiveOnSpark, HiveOnTez, Hive and Impala]
60 Observations
- Impala has the best performance on select and aggregation
  - Specific data structures designed for database queries
  - Good disk performance due to I/O buffer management and short-circuit local reads
- Impala performs poorly on the join operation in this setting
  - Heavy pressure on a small subset of nodes
61 Thanks
More informationMining Large Datasets: Case of Mining Graph Data in the Cloud
Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large
More informationNext-Gen Big Data Analytics using the Spark stack
Next-Gen Big Data Analytics using the Spark stack Jason Dai Chief Architect of Big Data Technologies Software and Services Group, Intel Agenda Overview Apache Spark stack Next-gen big data analytics Our
More informationBig Data and Data Science: Behind the Buzz Words
Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing