Making Sense at Scale with the Berkeley Data Analytics Stack
|
|
|
- Arnold Stephens
- 10 years ago
- Views:
Transcription
1 Making Sense at Scale with the Berkeley Data Analytics Stack Michael Franklin February 3, 2015 WSDM 2015 Shanghai UC BERKELEY
2 Agenda A Little Bit on Big Data AMPLab Overview BDAS Philosophy: Unification not Specialization Spark, GraphX, and other BDAS components Wrap Up: Thoughts on Big Data Software
3 Sources Driving Big Data It s All Happening On- line Every: Click Ad impression Billing event Fast Forward, pause, Friend Request Transaction Network message Fault User Generated (Web & Mobile).. Internet of Things / M2M Scientific Computing
4 Big Data A Bad Definition Data sets, typically consisting of billions or trillions of records, that are so vast and complex that they require new and powerful computational resources to process. - Dictionary.com <Big Data as a Problem!>
5 Big Data as a Resource For a long time, we thought that Tamoxifen was roughly 80% effective for breast cancer patients. But now we know much more: we know that it s 100% effective in 70% to 80% of the patients, and ineffective in the rest. - Tim O Reilly et al. How Data Science is Transforming Health Care With enough of the right data you can determine precisely who the treatment will work for. Thus, even a 1% effective drug could save lives
6 A Big Data Pattern 750,000+ downloads 6
7 AMPLab: Integrating 3 Resources Algorithms Machine Learning, Statistical Methods Prediction, Business Intelligence Machines Clusters and Clouds Warehouse Scale Computing People Crowdsourcing, Human Computation Data Scientists, Analysts
8 AMPLab Overview UC BERKELEY Berkeley s AMPLab has already left an indelible mark on world of information technology, and even the web. But we haven t yet experienced the full impact of the group Not even close. Derrick Harris, GigaOM, Aug 2, Students, Postdocs, Faculty and Staff from: Databases, Machine Learning, Systems, Security, and Networking Franklin Jordan Stoica Culler Goldberg Joseph Katz Patterson Recht Shenker 50/50 Split: Industry Sponsors and Govt White House Big Data Program: NSF CISE Expeditions in Computing and Darpa XData Fixed Timeline (ends Dec 2016); Collaborative Working Space See Dave Patterson How to Build a Bad Research Center, CACM March, 2014
9 A Nexus of Industrial Engagement Open Source Software Industrial-Strength Twice-yearly 3-day offsite retreats; AMPCamp training
10 Open Source Community Building MeetUp on (Aug 6, 2013) Spark Summit SF (June 30, 2014)
11 Big Data Ecosystem Evolution Pregel Giraph MapReduce General batch" processing Dremel Impala Storm Drill S4 Tez GraphLab Specialized systems (iterative, interactive and" streaming apps)
12 AMPLab Unification Philosophy Don t specialize MapReduce Generalize it! Two additions to Hadoop MR can enable all the models shown earlier! 1. General Task DAGs 2. Data Sharing For Users: Fewer Systems to Use Less Data Movement SparkSQL Streaming GraphX Spark MLbase
13 Berkeley Data Analytics Stack (Apache and BSD open source) In-house Apps Access and Interfaces Processing Engine Storage Resource Virtualization
14 It s only September but it s already clear that 2014 will be the year of Apache Spark -- Datanami, 9/15/14 In-Memory Dataflow System Developed in AMPLab and its predecessor the RADLab Alternative to Hadoop MapReduce x speedup for ML and interactive queries Central component of the BDAS Stack Graduated to Apache Foundation -> Apache Spark M. Zaharia, M. Choudhury, M. Franklin, I. Stoica, S. Shenker, Spark: Cluster Computing with Working Sets, USENIX HotCloud, 2010.
15 Apache Spark Contributors: contributors to current release
16 Apache Spark: Compared to Other Projects MapReduce YARN HDFS Storm Spark Activity in past 6 months 0 MapReduce YARN HDFS Storm Spark 2-3x Commits more activity than: Hadoop, Storm, MongoDB, NumPy, D3, Julia, Lines of Code Changed
17 Iteration in Map-Reduce Initial Model Map Reduce Learned Model w (0) w (1) Training Data w (2) w (3)
18 Cost of Iteration in Map-Reduce Initial Model w (0) Map Reduce Learned Model w (1) Training Data Read 2 Repeatedly w (2) load same data w (3)
19 Cost of Iteration in Map-Reduce Initial Model Map Reduce Learned Model w (0) w (1) Training Data Redundantly save output between w (2) stages w (3)
20 Dataflow View Map Reduce Training Data (HDFS) Map Reduce Map Reduce
21 Memory Opt. Dataflow Training Data (HDFS) Cached Load Map Map Reduce Reduce Map Reduce
22 Memory Opt. Dataflow View Training Data (HDFS) Map Map Reduce Reduce Efficiently move data between stages Map Reduce Spark: faster than Hadoop MapReduce
23 Resilient Distributed Datasets (RDDs) API: coarse-grained transformations (map, group-by, join, sort, filter, sample, ) on immutable collections Resilient Distributed Datasets (RDDs)» Collections of objects that can be stored in memory or disk across a cluster» Built via parallel transformations (map, filter, )» Automatically rebuilt on failure Rich enough to capture many models:» Data flow models: MapReduce, Dryad, SQL,» Specialized models: Pregel, Hama, M. Zaharia, et al, Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory cluster computing, NSDI 2012.
24 Abstraction: Dataflow Operators map filter groupby sort union join leftouterjoin rightouterjoin reduce count fold reducebykey groupbykey cogroup cross zip sample take first partitionby mapwith pipe save...
25 Fault Tolerance with RDDs RDDs track the series of transformations used to build them (their lineage)» Log one operation to apply to many elements» No cost if nothing fails Enables per-node recomputation of lost data messages = textfile(...).filter(_.contains( error )).map(_.split( \t )(2)) HadoopRDD path = hdfs:// FilteredRDD func = _.contains(...) MappedRDD func = _.split( )
26 A Unified System for SQL & ML Deep integration of SQL and Spark Both share the same set of workers and caches def logregress(points: RDD[Point]): Vector { var w = Vector(D, _ => 2 * rand.nextdouble - 1) for (i <- 1 to ITERATIONS) { val gradient = points.map { p => val denom = 1 + exp(-p.y * (w dot p.x)) (1 / denom - 1) * p.y * p.x }.reduce(_ + _) w -= gradient } w } val users = sql2rdd("select * FROM user u JOIN comment c ON c.uid=u.uid") Can move seamlessly between SQL and Machine Learning worlds val features = users.maprows { row => new Vector(extractFeature1(row.getInt("age")), extractfeature2(row.getstr("country")),...)} val trainedvector = logregress(features.cache()) R. Xin, J. Rosen, M. Zaharia, M. Franklin,S. Shenker, I. Stoica, Shark: SQL and Rich Analytics at Scale, SIGMOD 2013.
27 Spark Streaming Microbatch approach provides lower latency Additional operators provide windowed operations M. Zaharia, et al, Discretized Streams: Fault-Tollerant Streaming Computation at Scale, SOSP 2013.
28 Batch/Streaming Unification Batch and streaming codes virtually the same» Easy to develop and maintain consistency // count words from a file (batch) val file = sc.textfile("hdfs://.../pagecounts-*.gz") val words = file.flatmap(_.split(" ")) val wordcounts = words.map(x => (x, 1)).reduceByKey(_ + _) wordcounts.print() // count words from a network stream, every 10s (streaming) val ssc = new StreamingContext(args(0), "NetCount", Seconds(10),..) val lines = ssc.sockettextstream("localhost, 3456) val words = lines.flatmap(_.split(" ")) val wordcounts = words.map(x => (x, 1)).reduceByKey(_ + _) wordcounts.print() ssc.start()
29 Apache Spark v1.2 (12/14) Includes» Spark (core)» Spark Streaming» GraphX (alpha release)» MLlib» SparkSQL Query Processing Wide range of interfaces:» Python / interactive ipython» Scala / interactive scala shell» R / interactive R-shell» Java Now included in all major Hadoop Distributions
30 Graph Analytics Link Table Hyperlinks PageRank Top 20 Pages Title Link Title PR Raw Wikipedia < / > < / > < / > XML Top Communities Com. PR.. Discussion Table Editor Graph Community Detection User Community User Disc. User Com.
31 Separate Systems Tables Graphs
32 Dataflow Systems Separate Systems Graphs Table Row Row Row Result Row
33 Separate Systems 6. Before Dataflow Systems Graph Systems 7. After 8. After Table Row Dependency Graph Row Row Result Row
34 Difficult to Use Users must Learn, Deploy, and Manage multiple systems Leads to brittle and often complex interfaces 34
35 Efficiencies are Possible Hadoop 1340 Spark 354 GraphLab Runtime (in seconds, PageRank for 10 iterations) Live-Journal Graph Specialized Graph System can be faster than general MR computation
36 But Extensive data movement and duplication across the network and file system < / > < / > < / > XML HDFS HDFS HDFS HDFS Limited reuse internal data-structures across stages 36
37 GraphX Adding Graphs to the Mix Table View GraphX Unified Representation Graph View Tables and Graphs are composable views of the same physical data Each view has its own operators that exploit the semantics of the view to achieve efficient execution J. Gonzalez, R. Xin, A. Dave, D. Crankshaw, M. Franklin, I. Stoica, GraphX: Graph Processing in a Distributed Dataflow Framework OSDI Conf., Oct 2014
38 Representation Join Distributed Graphs Horizontally Partitioned Tables Vertex Programs Dataflow Operators Optimizations Advances in Graph Processing Systems Distributed Join Optimization Materialized View Maintenance
39 Property Graph Data Model Property Graph B C A D Vertex Property: User Profile Current PageRank Value Edge Property: F E Weights Timestamps
40 Encoding Property Graphs as Tables Property Graph B C A Vertex Cut F E Part. 1 D Part. 2 Vertex Table (RDD) Machine 1 Machine 2 A B C D E F Routing Table (RDD) A B C D E F Edge Table (RDD) A A B C A A E E B C C D E F D F
41 Graph Operators (Scala) class Graph [ V, E ] { def Graph(vertices: Table[ (Id, V) ], edges: Table[ (Id, Id, E) ]) // Table Views def vertices: Table[ (Id, V) ] def edges: Table[ (Id, Id, E) ] def triplets: Table [ ((Id, V), (Id, V), E) ] // Transformations def reverse: Graph[V, E] def subgraph(pv: (Id, V) => Boolean, pe: Edge[V,E] => Boolean): Graph[V,E] def mapv(m: (Id, V) => T ): Graph[T,E] def mape(m: Edge[V,E] => T ): Graph[V,T] // Joins def joinv(tbl: Table [(Id, T)]): Graph[(V, T), E ] def joine(tbl: Table [(Id, Id, T)]): Graph[V, (E, T)] // Computation def mrtriplets(mapf: (Edge[V,E]) => List[(Id, T)], reducef: (T, T) => T): Graph[T, E] } 41
42 Graph Operators (Scala) class Graph [ V, E ] { def Graph(vertices: Table[ (Id, V) ], edges: Table[ (Id, Id, E) ]) // Table Views def vertices: Table[ (Id, V) ] def edges: Table[ (Id, Id, E) ] def triplets: Table [ ((Id, V), (Id, V), E) ] // Transformations def reverse: Graph[V, E] def subgraph(pv: (Id, V) => Boolean, pe: Edge[V,E] => Boolean): Graph[V,E] def mapv(m: (Id, V) => T ): Graph[T,E] def mape(m: Edge[V,E] => T ): Graph[V,T] // Joins def joinv(tbl: Table GraphLab [(Id, T)]): Graph[(V, T), E ] def joine(tbl: Table [(Id, Id, T)]): Graph[V, (E, T)] // Computation def mrtriplets(mapf: (Edge[V,E]) => List[(Id, T)], reducef: (T, T) => T): Graph[T, E] } capture the Gather-Scatter pattern from 42
43 Graph System Optimizations Specialized Data-Structures Vertex-Cuts Partitioning Remote Caching / Mirroring Message Combiners Active Set Tracking 43
44 PageRank Benchmark EC2 Cluster of 16 x m2.4xlarge Nodes + 1GigE Runtime (Seconds) Twitter Graph (42M Vertices,1.5B Edges) x UK-Graph (106M Vertices, 3.7B Edges) 18x GraphX performs comparably to state-of-the-art graph processing systems.
45 The GraphX Stack (Lines of Code) PageRank (20) Connected Comp. (20) GraphLab/Pregel API (34) K-core (60) Triangle Count (50) LDA (220) SVD++ (110) GraphX (2,500) Spark (30,000) Some algorithms are more naturally expressed using the GraphX primitive operators
46 Graphs are just one stage. What about a pipeline?
47 A Small Pipeline in GraphX Raw Wikipedia Hyperlinks PageRank Top 20 Pages < / > < / > < / > XML HDFS HDFS Spark Giraph + Spark GraphLab + Spark GraphX Spark Preprocess Compute Spark Post Total Runtime (in Seconds) Timed end-to-end GraphX is the fastest
48 MLBase: Distributed ML Made Easy DB Query Language Analogy: Specify What not How MLBase chooses: Algorithms/Operators Ordering and Physical Placement Parameter and Hyperparameter Settings Featurization Leverages Spark for Speed and Scale T. Kraska, A. Talwalkar, J. Duchi, R. Griffith, M. Franklin, M. Jordan, MLBase: A Distributed Machine Learning System, CIDR 2013.
49 ML Pipeline Generation & Optimization Ex: Image Classification Pipeline MLBase Optimizer Focus: Better Resource Utilization Algorithmic Speedups Reduced Model Search Time E. Sparks, S. Venkataraman, M. Franklin, B. Recht, ML Pipelines, In Process. (see AMPLab Blog for an overview)
50 Introducing Velox: Model Serving Data Training Model Where do models go? Conference Papers Sales Reports Drive Actions 50
51 Driving Actions Suggesting Items at Checkout Low-Latency Fraud Detection Personalized Cognitive Assistance Internet of Things Rapidly Changing 51
52 Problem: Separate Systems Offline Analytics Systems 6. Before Online Serving Systems 7. After Sophisticated ML on static data. 8. After MongoDB Low-Latency data serving How do we serve low-latency predictions and train on live data? 52
53 Velox Model Serving System [CIDR 15] Decompose personalized predictive models: 53
54 Velox Model Serving System [CIDR 15] Decompose personalized predictive models: Feature Caching Feature Model Personalization Model Online Updates Approx. Features Batch Split Online Active Learning Order-of-magnitude reductions in prediction latencies. 54
55 Berkeley Data Analytics Stack (Apache and BSD open source) In-house Apps Access and Interfaces Processing Engine Storage Resource Virtualization
56 Big Data Architecture Open Questions & Research Issues What is the role of a unified stack in a fastchanging software landscape? Single-node vs. Cluster; Elastic Cloud vs HPC GPUs, FPGAs, XYZs,??? New memory hierarchies: SSDs, RDMA, Serving vs. Analytics Workloads What correctness guarantees are really needed?
57 Summary Big Data yes, there s hype but it is a Big Deal AmpLab project 6 Yrs, Cross-disciplinary team, Industry engagement Open Source development and community building BDAS philosophy: Unification Spark + SQL + Graphs + ML + The Big Data landscape continues to evolve Open Source enables Academic research to play a huge role
58 To find out more or get involved: UC BERKELEY amplab.berkeley.edu Thanks to NSF CISE Expeditions in Computing, DARPA XData, Founding Sponsors: Amazon Web Services, Google, and SAP, the Thomas and Stacy Siebel Foundation, all our industrial sponsors and partners, and all the members of the AMPLab Team.
What s next for the Berkeley Data Analytics Stack?
What s next for the Berkeley Data Analytics Stack? Michael Franklin June 30th 2014 Spark Summit San Francisco UC BERKELEY AMPLab: Collaborative Big Data Research 60+ Students, Postdocs, Faculty and Staff
Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY
Fast and Expressive Big Data Analytics with Python Matei Zaharia UC Berkeley / MIT UC BERKELEY spark-project.org What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop
Spark: Making Big Data Interactive & Real-Time
Spark: Making Big Data Interactive & Real-Time Matei Zaharia UC Berkeley / MIT www.spark-project.org What is Spark? Fast and expressive cluster computing system compatible with Apache Hadoop Improves efficiency
Spark and Shark. High- Speed In- Memory Analytics over Hadoop and Hive Data
Spark and Shark High- Speed In- Memory Analytics over Hadoop and Hive Data Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Cliff Engle, Michael Franklin, Haoyuan Li,
The Berkeley Data Analytics Stack: Present and Future
The Berkeley Data Analytics Stack: Present and Future Michael Franklin 27 March 2014 Technion Big Day on Big Data UC BERKELEY BDAS in the Big Data Context 2 Sources Driving Big Data It s All Happening
Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia
Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing
Beyond Hadoop with Apache Spark and BDAS
Beyond Hadoop with Apache Spark and BDAS Khanderao Kand Principal Technologist, Guavus 12 April GITPRO World 2014 Palo Alto, CA Credit: Some stajsjcs and content came from presentajons from publicly shared
Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY
Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person
Unified Big Data Analytics Pipeline. 连 城 [email protected]
Unified Big Data Analytics Pipeline 连 城 [email protected] What is A fast and general engine for large-scale data processing An open source implementation of Resilient Distributed Datasets (RDD) Has an
Spark. Fast, Interactive, Language- Integrated Cluster Computing
Spark Fast, Interactive, Language- Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica UC
Introduction to Big Data! with Apache Spark" UC#BERKELEY#
Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"
Big Data Research in the AMPLab: BDAS and Beyond
Big Data Research in the AMPLab: BDAS and Beyond Michael Franklin UC Berkeley 1 st Spark Summit December 2, 2013 UC BERKELEY AMPLab: Collaborative Big Data Research Launched: January 2011, 6 year planned
Spark and the Big Data Library
Spark and the Big Data Library Reza Zadeh Thanks to Matei Zaharia Problem Data growing faster than processing speeds Only solution is to parallelize on large clusters» Wide use in both enterprises and
Conquering Big Data with BDAS (Berkeley Data Analytics)
UC BERKELEY Conquering Big Data with BDAS (Berkeley Data Analytics) Ion Stoica UC Berkeley / Databricks / Conviva Extracting Value from Big Data Insights, diagnosis, e.g.,» Why is user engagement dropping?»
Apache Flink Next-gen data analysis. Kostas Tzoumas [email protected] @kostas_tzoumas
Apache Flink Next-gen data analysis Kostas Tzoumas [email protected] @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research
GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
GraphX: Unifying Data-Parallel and Graph-Parallel nalytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel rankshaw, nkur Dave, Michael Franklin, and Ion Stoica Strata 2014 *These slides
Ali Ghodsi Head of PM and Engineering Databricks
Making Big Data Simple Ali Ghodsi Head of PM and Engineering Databricks Big Data is Hard: A Big Data Project Tasks Tasks Build a Hadoop cluster Challenges Clusters hard to setup and manage Build a data
Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack
Apache Spark Document Analysis Course (Fall 2015 - Scott Sanner) Zahra Iman Some slides from (Matei Zaharia, UC Berkeley / MIT& Harold Liu) Reminder SparkConf JavaSpark RDD: Resilient Distributed Datasets
Spark and Shark: High-speed In-memory Analytics over Hadoop Data
Spark and Shark: High-speed In-memory Analytics over Hadoop Data May 14, 2013 @ Oracle Reynold Xin, AMPLab, UC Berkeley The Big Data Problem Data is growing faster than computation speeds Accelerating
Conquering Big Data with Apache Spark
Conquering Big Data with Apache Spark Ion Stoica November 1 st, 2015 UC BERKELEY The Berkeley AMPLab January 2011 2017 8 faculty > 50 students 3 software engineer team Organized for collaboration achines
Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic
Big Data Analytics with Spark and Oscar BAO Tamas Jambor, Lead Data Scientist at Massive Analytic About me Building a scalable Machine Learning platform at MA Worked in Big Data and Data Science in the
Moving From Hadoop to Spark
+ Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com [email protected] Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee
The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org
The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache
Data Science in the Wild
Data Science in the Wild Lecture 4 59 Apache Spark 60 1 What is Spark? Not a modified version of Hadoop Separate, fast, MapReduce-like engine In-memory data storage for very fast iterative queries General
The Berkeley AMPLab - Collaborative Big Data Research
The Berkeley AMPLab - Collaborative Big Data Research UC BERKELEY Anthony D. Joseph LASER Summer School September 2013 About Me Education: MIT SB, MS, PhD Joined Univ. of California, Berkeley in 1998 Current
CSE-E5430 Scalable Cloud Computing Lecture 11
CSE-E5430 Scalable Cloud Computing Lecture 11 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 30.11-2015 1/24 Distributed Coordination Systems Consensus
Architectures for massive data management
Architectures for massive data management Apache Spark Albert Bifet [email protected] October 20, 2015 Spark Motivation Apache Spark Figure: IBM and Apache Spark What is Apache Spark Apache
Apache Spark and Distributed Programming
Apache Spark and Distributed Programming Concurrent Programming Keijo Heljanko Department of Computer Science University School of Science November 25th, 2015 Slides by Keijo Heljanko Apache Spark Apache
MLlib: Scalable Machine Learning on Spark
MLlib: Scalable Machine Learning on Spark Xiangrui Meng Collaborators: Ameet Talwalkar, Evan Sparks, Virginia Smith, Xinghao Pan, Shivaram Venkataraman, Matei Zaharia, Rean Griffith, John Duchi, Joseph
Big Data Analytics Hadoop and Spark
Big Data Analytics Hadoop and Spark Shelly Garion, Ph.D. IBM Research Haifa 1 What is Big Data? 2 What is Big Data? Big data usually includes data sets with sizes beyond the ability of commonly used software
Brave New World: Hadoop vs. Spark
Brave New World: Hadoop vs. Spark Dr. Kurt Stockinger Associate Professor of Computer Science Director of Studies in Data Science Zurich University of Applied Sciences Datalab Seminar, Zurich, Oct. 7,
Streaming items through a cluster with Spark Streaming
Streaming items through a cluster with Spark Streaming Tathagata TD Das @tathadas CME 323: Distributed Algorithms and Optimization Stanford, May 6, 2015 Who am I? > Project Management Committee (PMC) member
Introduc8on to Apache Spark
Introduc8on to Apache Spark Jordan Volz, Systems Engineer @ Cloudera 1 Analyzing Data on Large Data Sets Python, R, etc. are popular tools among data scien8sts/analysts, sta8s8cians, etc. Why are these
Big Data Analytics. Lucas Rego Drumond
Big Data Analytics Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Apache Spark Apache Spark 1
Systems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de
Systems Engineering II Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de About me! Since May 2015 2015 2012 Research Group Leader cfaed, TU Dresden PhD Student MPI- SWS Research Intern Microsoft
Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter : @carbone
Hadoop2, Spark Big Data, real time, machine learning & use cases Cédric Carbone Twitter : @carbone Agenda Map Reduce Hadoop v1 limits Hadoop v2 and YARN Apache Spark Streaming : Spark vs Storm Machine
Berkeley Data Analytics Stack:! Experience and Lesson Learned
UC BERKELEY Berkeley Data Analytics Stack:! Experience and Lesson Learned Ion Stoica UC Berkeley, Databricks, Conviva Research Philosophy Follow real problems Focus on novel usage scenarios Build real
How Companies are! Using Spark
How Companies are! Using Spark And where the Edge in Big Data will be Matei Zaharia History Decreasing storage costs have led to an explosion of big data Commodity cluster software, like Hadoop, has made
Scaling Out With Apache Spark. DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf
Scaling Out With Apache Spark DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf Your hosts Mathijs Kattenberg Technical consultant Jeroen Schot Technical consultant
Machine- Learning Summer School - 2015
Machine- Learning Summer School - 2015 Big Data Programming David Franke Vast.com hbp://www.cs.utexas.edu/~dfranke/ Goals for Today Issues to address when you have big data Understand two popular big data
Big Data Processing. Patrick Wendell Databricks
Big Data Processing Patrick Wendell Databricks About me Committer and PMC member of Apache Spark Former PhD student at Berkeley Left Berkeley to help found Databricks Now managing open source work at Databricks
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University
Next-Gen Big Data Analytics using the Spark stack
Next-Gen Big Data Analytics using the Spark stack Jason Dai Chief Architect of Big Data Technologies Software and Services Group, Intel Agenda Overview Apache Spark stack Next-gen big data analytics Our
Spark: Cluster Computing with Working Sets
Spark: Cluster Computing with Working Sets Outline Why? Mesos Resilient Distributed Dataset Spark & Scala Examples Uses Why? MapReduce deficiencies: Standard Dataflows are Acyclic Prevents Iterative Jobs
Real Time Data Processing using Spark Streaming
Real Time Data Processing using Spark Streaming Hari Shreedharan, Software Engineer @ Cloudera Committer/PMC Member, Apache Flume Committer, Apache Sqoop Contributor, Apache Spark Author, Using Flume (O
Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect
Matteo Migliavacca (mm53@kent) School of Computing Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Simple past - Traditional
Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
Beyond Hadoop MapReduce Apache Tez and Apache Spark
Beyond Hadoop MapReduce Apache Tez and Apache Spark Prakasam Kannan Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 [email protected] ABSTRACT Hadoop MapReduce
Hadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
Learning. Spark LIGHTNING-FAST DATA ANALYTICS. Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia
Compliments of Learning Spark LIGHTNING-FAST DATA ANALYTICS Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia Bring Your Big Data to Life Big Data Integration and Analytics Learn how to power
Big Data and Scripting Systems beyond Hadoop
Big Data and Scripting Systems beyond Hadoop 1, 2, ZooKeeper distributed coordination service many problems are shared among distributed systems ZooKeeper provides an implementation that solves these avoid
An Open Source Memory-Centric Distributed Storage System
An Open Source Memory-Centric Distributed Storage System Haoyuan Li, Tachyon Nexus [email protected] September 30, 2015 @ Strata and Hadoop World NYC 2015 Outline Open Source Introduction to Tachyon
Big Data Analytics with Cassandra, Spark & MLLib
Big Data Analytics with Cassandra, Spark & MLLib Matthias Niehoff AGENDA Spark Basics In A Cluster Cassandra Spark Connector Use Cases Spark Streaming Spark SQL Spark MLLib Live Demo CQL QUERYING LANGUAGE
Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify
Big Data at Spotify Anders Arpteg, Ph D Analytics Machine Learning, Spotify Quickly about me Quickly about Spotify What is all the data used for? Quickly about Spark Hadoop MR vs Spark Need for (distributed)
SPARK USE CASE IN TELCO. Apache Spark Night 9-2-2014! Chance Coble!
SPARK USE CASE IN TELCO Apache Spark Night 9-2-2014! Chance Coble! Use Case Profile Telecommunications company Shared business problems/pain Scalable analytics infrastructure is a problem Pushing infrastructure
Spark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1
Spark ΕΡΓΑΣΤΗΡΙΟ 10 Prepared by George Nikolaides 4/19/2015 1 Introduction to Apache Spark Another cluster computing framework Developed in the AMPLab at UC Berkeley Started in 2009 Open-sourced in 2010
Introduction to Spark
Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks) Quick Demo Quick Demo API Hooks Scala / Java All Java libraries *.jar http://www.scala- lang.org Python Anaconda: https://
How To Create A Data Visualization With Apache Spark And Zeppelin 2.5.3.5
Big Data Visualization using Apache Spark and Zeppelin Prajod Vettiyattil, Software Architect, Wipro Agenda Big Data and Ecosystem tools Apache Spark Apache Zeppelin Data Visualization Combining Spark
Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC [email protected] http://elephantscale.com
Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC [email protected] http://elephantscale.com Spark Fast & Expressive Cluster computing engine Compatible with Hadoop Came
A Brief Introduction to Apache Tez
A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value
Big Data and Scripting Systems build on top of Hadoop
Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform Pig is the name of the system Pig Latin is the provided programming language Pig Latin is
Distributed DataFrame on Spark: Simplifying Big Data For The Rest Of Us
DATA INTELLIGENCE FOR ALL Distributed DataFrame on Spark: Simplifying Big Data For The Rest Of Us Christopher Nguyen, PhD Co-Founder & CEO Agenda 1. Challenges & Motivation 2. DDF Overview 3. DDF Design
Big Graph Analytics on Neo4j with Apache Spark. Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage
Big Graph Analytics on Neo4j with Apache Spark Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage My background I only make it to the Open Stages :) Probably because Apache Neo4j
Dell In-Memory Appliance for Cloudera Enterprise
Dell In-Memory Appliance for Cloudera Enterprise Hadoop Overview, Customer Evolution and Dell In-Memory Product Details Author: Armando Acosta Hadoop Product Manager/Subject Matter Expert [email protected]/
Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016
Big Data Approaches Making Sense of Big Data Ian Crosland Jan 2016 Accelerate Big Data ROI Even firms that are investing in Big Data are still struggling to get the most from it. Make Big Data Accessible
CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University
CS 555: DISTRIBUTED SYSTEMS [SPARK] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Streaming Significance of minimum delays? Interleaving
CS 294: Big Data System Research: Trends and Challenges
CS 294: Big Data System Research: Trends and Challenges Fall 2015 (MW 9:30-11:00, 310 Soda Hall) Ion Stoica and Ali Ghodsi (http://www.cs.berkeley.edu/~istoica/classes/cs294/15/) 1 Big Data First papers:»
Analytics on Spark & Shark @Yahoo
Analytics on Spark & Shark @Yahoo PRESENTED BY Tim Tully December 3, 2013 Overview Legacy / Current Hadoop Architecture Reflection / Pain Points Why the movement towards Spark / Shark New Hybrid Environment
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
Databricks. A Primer
Databricks A Primer Who is Databricks? Databricks was founded by the team behind Apache Spark, the most active open source project in the big data ecosystem today. Our mission at Databricks is to dramatically
Hadoop & Spark Using Amazon EMR
Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?
Tachyon: memory-speed data sharing
Tachyon: memory-speed data sharing Ali Ghodsi, Haoyuan (HY) Li, Matei Zaharia, Scott Shenker, Ion Stoica UC Berkeley Memory trumps everything else RAM throughput increasing exponentially Disk throughput
Apache Storm vs. Spark Streaming Two Stream Processing Platforms compared
Apache Storm vs. Spark Streaming Two Stream Platforms compared DBTA Workshop on Stream Berne, 3.1.014 Guido Schmutz BASEL BERN BRUGG LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MUNICH
Databricks. A Primer
Databricks A Primer Who is Databricks? Databricks vision is to empower anyone to easily build and deploy advanced analytics solutions. The company was founded by the team who created Apache Spark, a powerful
Tachyon: Reliable File Sharing at Memory- Speed Across Cluster Frameworks
Tachyon: Reliable File Sharing at Memory- Speed Across Cluster Frameworks Haoyuan Li UC Berkeley Outline Motivation System Design Evaluation Results Release Status Future Directions Outline Motivation
Shark: SQL and Rich Analytics at Scale
Shark: SQL and Rich Analytics at Scale Reynold S. Xin, Josh Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, Ion Stoica AMPLab, EECS, UC Berkeley {rxin, joshrosen, matei, franklin, shenker, istoica}@cs.berkeley.edu
Large-Scale Data Processing
Large-Scale Data Processing Eiko Yoneki [email protected] http://www.cl.cam.ac.uk/~ey204 Systems Research Group University of Cambridge Computer Laboratory 2010s: Big Data Why Big Data now? Increase
Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015
Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL May 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document
MapReduce: Simplified Data Processing on Large Clusters. Jeff Dean, Sanjay Ghemawat Google, Inc.
MapReduce: Simplified Data Processing on Large Clusters Jeff Dean, Sanjay Ghemawat Google, Inc. Motivation: Large Scale Data Processing Many tasks: Process lots of data to produce other data Want to use
Designing Agile Data Pipelines. Ashish Singh Software Engineer, Cloudera
Designing Agile Data Pipelines Ashish Singh Software Engineer, Cloudera About Me Software Engineer @ Cloudera Contributed to Kafka, Hive, Parquet and Sentry Used to work in HPC @singhasdev 204 Cloudera,
In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet
In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet Ema Iancuta [email protected] Radu Chilom [email protected] Buzzwords Berlin - 2015 Big data analytics / machine
Tachyon: A Reliable Memory Centric Storage for Big Data Analytics
Tachyon: A Reliable Memory Centric Storage for Big Data Analytics a Haoyuan (HY) Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, Ion Stoica June 30 th, 2014 Spark Summit @ San Francisco UC Berkeley Outline
This is a brief tutorial that explains the basics of Spark SQL programming.
About the Tutorial Apache Spark is a lightning-fast cluster computing designed for fast computation. It was built on top of Hadoop MapReduce and it extends the MapReduce model to efficiently use more types
Why Spark on Hadoop Matters
Why Spark on Hadoop Matters MC Srivas, CTO and Founder, MapR Technologies Apache Spark Summit - July 1, 2014 1 MapR Overview Top Ranked Exponential Growth 500+ Customers Cloud Leaders 3X bookings Q1 13
An Architecture for Fast and General Data Processing on Large Clusters
An Architecture for Fast and General Data Processing on Large Clusters Matei Zaharia Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2014-12
Challenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
Introduction to Big Data with Apache Spark UC BERKELEY
Introduction to Big Data with Apache Spark UC BERKELEY This Lecture Programming Spark Resilient Distributed Datasets (RDDs) Creating an RDD Spark Transformations and Actions Spark Programming Model Python
The Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics Platform Amir H. Payberah Swedish Institute of Computer Science [email protected] June 4, 2014 Amir H. Payberah (SICS) Stratosphere June 4, 2014 1 / 44 Big Data small data
Shark Installation Guide Week 3 Report. Ankush Arora
Shark Installation Guide Week 3 Report Ankush Arora Last Updated: May 31,2014 CONTENTS Contents 1 Introduction 1 1.1 Shark..................................... 1 1.2 Apache Spark.................................
Apache Flink. Fast and Reliable Large-Scale Data Processing
Apache Flink Fast and Reliable Large-Scale Data Processing Fabian Hueske @fhueske 1 What is Apache Flink? Distributed Data Flow Processing System Focused on large-scale data analytics Real-time stream
Hybrid Software Architectures for Big Data. [email protected] @hurence http://www.hurence.com
Hybrid Software Architectures for Big Data [email protected] @hurence http://www.hurence.com Headquarters : Grenoble Pure player Expert level consulting Training R&D Big Data X-data hot-line
Unlocking the True Value of Hadoop with Open Data Science
Unlocking the True Value of Hadoop with Open Data Science Kristopher Overholt Solution Architect Big Data Tech 2016 MinneAnalytics June 7, 2016 Overview Overview of Open Data Science Python and the Big
Big Data Frameworks Course. Prof. Sasu Tarkoma 10.3.2015
Big Data Frameworks Course Prof. Sasu Tarkoma 10.3.2015 Contents Course Overview Lectures Assignments/Exercises Course Overview This course examines current and emerging Big Data frameworks with focus
Big Data Frameworks: Scala and Spark Tutorial
Big Data Frameworks: Scala and Spark Tutorial 13.03.2015 Eemil Lagerspetz, Ella Peltonen Professor Sasu Tarkoma These slides: http://is.gd/bigdatascala www.cs.helsinki.fi Functional Programming Functional
Second Credit Seminar Presentation on Big Data Analytics Platforms: A Survey
Second Credit Seminar Presentation on Big Data Analytics Platforms: A Survey By, Mr. Brijesh B. Mehta Admission No.: D14CO002 Supervised By, Dr. Udai Pratap Rao Computer Engineering Department S. V. National
Accelerating Hadoop MapReduce Using an In-Memory Data Grid
Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for
HiBench Introduction. Carson Wang ([email protected]) Software & Services Group
HiBench Introduction Carson Wang ([email protected]) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is
