Big Data Frameworks: Scala and Spark Tutorial
1 Big Data Frameworks: Scala and Spark Tutorial Eemil Lagerspetz, Ella Peltonen Professor Sasu Tarkoma These slides:
2 Functional Programming Functional operations create new data structures; they do not modify existing ones. After an operation, the original data still exists in unmodified form. The program design implicitly captures data flows. The order of the operations is not significant.
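For example (a minimal sketch with standard Scala collections, not from the slides), an operation returns a new collection and leaves the original untouched:
val xs = List(3, 1, 2)
val doubled = xs.map(_ * 2)   // new list: List(6, 2, 4)
val sorted = xs.sorted        // new list: List(1, 2, 3)
println(xs)                   // the original is still List(3, 1, 2)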
3 Word Count in Scala
val lines = scala.io.Source.fromFile("textfile.txt").getLines
val words = lines.flatMap(line => line.split(" ")).toIterable
val counts = words.groupBy(identity).map(words => words._1 -> words._2.size)
val top10 = counts.toArray.sortBy(_._2).reverse.take(10)
println(top10.mkString("\n"))
Scala can be used to concisely express pipelines of operations. map, flatMap, filter, groupBy, etc. operate on entire collections with one element in the function's scope at a time. This allows implicit parallelism in Spark.
4 About Scala Scala is a statically typed language. Support for generics: case class MyClass(a: Int) extends Ordered[MyClass]. All variables and functions have types that are defined at compile time, so the compiler will find many unintended programming errors. The compiler will try to infer the type: for example, val x = 2 is implicitly of integer type. Use an IDE for complex types, e.g. IntelliJ IDEA with the Scala plugin. Everything is an object. Functions are defined using the def keyword. Laziness: avoiding the creation of objects except when absolutely necessary. Online Scala coding: A Scala Tutorial for Java Programmers
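A sketch of the Ordered example above (the compare method is an assumption, it is not shown on the slide), together with type inference:
case class MyClass(a: Int) extends Ordered[MyClass] {
  def compare(other: MyClass) = a - other.a   // defines the natural ordering
}
val x = 2                                              // inferred to be Int at compile time
val sortedObjs = List(MyClass(3), MyClass(1)).sorted   // List(MyClass(1), MyClass(3))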
5 Functions are objects
def noCommonWords(w: (String, Int)) = { // without the =, this would be a void (Unit) function
  val (word, count) = w
  word != "the" && word != "and" && word.length > 2
}
val better = top10.filter(noCommonWords)
println(better.mkString("\n"))
Functions can be passed as arguments and returned from other functions, for example as filters. They can be stored in variables. This allows flexible program flow control structures. Functions can be applied to all members of a collection, which leads to very compact code. Notice above: the return value of a function is always the value of its last expression.
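A short sketch (names are hypothetical) of storing a function in a variable and returning a function from another function:
val isLong: String => Boolean = _.length > 3              // function stored in a variable
def longerThan(n: Int): String => Boolean = _.length > n  // function returned from a function
val ws = List("the", "spark", "and", "scala")
println(ws.filter(isLong))                                 // List(spark, scala)
println(ws.filter(longerThan(4)))                          // List(spark, scala)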
6 Scala Notation _ is the default value or wild card. => is used to separate a match expression from the block to be evaluated. The anonymous function (x, y) => x + y can be replaced by _ + _. The v => v.method can be replaced by _.method. -> is the tuple delimiter. Iteration with for:
for (i <- 0 until 10) { // with 0 to 10, 10 is included
  println(s"item: $i")
}
Examples: import scala.collection.immutable._
lsts.filter(v => v.length > 2) is the same as lsts.filter(_.length > 2)
(2, 3) is equal to 2 -> 3
2 -> (3 -> 4) == (2, (3, 4))
2 -> 3 -> 4 == ((2, 3), 4)
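A few of these shorthands side by side, as a small sketch:
val add = (x: Int, y: Int) => x + y          // the full anonymous function; _ + _ is the shorthand
println(List(1, 2, 3).reduce(_ + _))         // 6
println(2 -> 3)                              // (2,3): -> builds a tuple
for (i <- 0 until 3) println(s"item: $i")    // prints item: 0, item: 1, item: 2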
7 Scala Examples map: lsts.map(x => x * 4) instantiates a new list by applying the function to each element of the input list. flatMap: lsts.flatMap(_.toList) uses the given function to create a new list, then places the resulting list elements at the top level of the collection. lsts.sort(_ < _): sorting in ascending order. fold and reduce functions combine adjacent list elements using a function, processing the list starting from the left or the right: lst.foldLeft(0)(_ + _) starts from 0 and adds the list values to it iteratively, starting from the left. tuples: a set of values enclosed in parentheses, e.g. (2, z, 3); access the fields with the underscore notation, e.g. (2, z, 3)._2. Notice above: single-statement functions do not need curly braces { }. Arrays are indexed with ( ), not [ ]; [ ] is used for type bounds (like Java's < >). REMEMBER: these do not modify the collection, but create a new one (you need to assign the return value): val sorted = lsts.sort(_ < _)
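The pieces above combined into a small runnable sketch (lsts is assumed to be a list of lists of Int):
val lsts = List(List(3, 1), List(2), List(5, 4))
val flat = lsts.flatMap(_.toList)            // List(3, 1, 2, 5, 4)
val byFour = flat.map(_ * 4)                 // List(12, 4, 8, 20, 16)
val sum = flat.foldLeft(0)(_ + _)            // 15: start from 0, add values from the left
val asc = flat.sortWith(_ < _)               // List(1, 2, 3, 4, 5); sortWith in current Scala versions
println((2, "z", 3)._2)                      // z: tuple fields are accessed with ._1, ._2, ...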
8 Implicit parallelism The map function has implicit parallelism, as we saw before. This is because the applications of the function to the individual list elements are independent of each other, so we can parallelize or reorder the execution. MapReduce and Spark build on this parallelism.
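A minimal illustration with Scala's parallel collections (.par, available in the Scala versions used here); the result does not depend on the execution order:
val nums = (1 to 1000).toList
val seqResult = nums.map(_ * 2).sum          // sequential
val parResult = nums.par.map(_ * 2).sum      // the same map, applied in parallel
println(seqResult == parResult)              // true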
9 Map and Fold is the Basis Map takes a function and applies it to every element in a list. Fold iterates over a list and applies a function to aggregate the results. The map operation can be parallelized: each application of the function happens in an independent manner. The fold operation has restrictions on data locality: elements of the list must be together before the function can be applied; however, the elements can be aggregated in groups in parallel.
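A sketch of the idea: the map applications are independent, and the fold-style aggregation can first be done in groups (partial sums) and then combined:
val data = (1 to 100).toList
val mapped = data.map(_ * 2)                       // each element handled independently
val groups = mapped.grouped(25).toList             // split into groups, e.g. partitions
val partials = groups.map(_.foldLeft(0)(_ + _))    // aggregate each group (could run in parallel)
println(partials.foldLeft(0)(_ + _))               // combine the partial results: 10100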
10 Apache Spark Spark is a general-purpose computing framework for iterative tasks. An API is provided for Java, Scala and Python. The model is based on MapReduce, enhanced with new operations and an engine that supports execution graphs. Tools include Spark SQL, MLlib for machine learning, GraphX for graph processing and Spark Streaming.
11 Obtaining Spark Spark can be obtained from the spark.apache.org site. Spark packages are available for many different HDFS versions. Spark runs on Windows and UNIX-like systems such as Linux and MacOS. The easiest setup is local, but the real power of the system comes from distributed operation. Spark runs on Java 6+, Python 2.6+ and Scala 2.10+. The newest version works best with Java 7+ and Scala 2.10.
12 Installing Spark We use Spark 1.2.1 or newer on this course. For local installation: download Spark, extract it to a folder of your choice and run bin/spark-shell in a terminal (or double-click bin/spark-shell.cmd on Windows). For the IDE, take the assembly jar from spark-1.2.1/assembly/target/scala-2.10 OR spark-1.2.1/lib. You need to have Java 6+. For pyspark: Python 2.6+.
13 For Cluster installations Each machine will need Spark in the same folder, and key-based passwordless SSH access from the master for the user running Spark. Slave machines need to be listed in the slaves file; see spark/conf/. For better performance, run Spark under the YARN scheduler. Further reading: running Spark on Amazon AWS EC2 and running Spark on Mesos.
14 First examples
# Running the shell with your own classes, a given amount of memory, and
# the local computer with two threads as slaves
./bin/spark-shell --driver-memory 1G \
  --jars your-project-jar-here.jar \
  --master "local[2]"
// And then creating some data
val data = 1 to 5000
data: scala.collection.immutable.Range.Inclusive = Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ...
// Creating an RDD for the data:
val ddata = sc.parallelize(data)
// Then selecting values less than 10
ddata.filter(_ < 10).collect()
res0: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
15 SparkContext sc A Spark program creates a SparkContext object, denoted by the sc variable in the Scala and Python shells. Outside the shell, a constructor is used to instantiate a SparkContext:
val conf = new SparkConf().setAppName("Hello").setMaster("local[2]")
val sc = new SparkContext(conf)
SparkContext is used to interact with the Spark cluster.
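A minimal standalone sketch (object name and app name are just placeholders) of what a Spark 1.x program looks like outside the shell:
import org.apache.spark.{SparkConf, SparkContext}
object Hello {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Hello").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val data = sc.parallelize(1 to 100)
    println(data.filter(_ % 2 == 0).count())   // 50
    sc.stop()
  }
}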
16 SparkContext master parameter Can be given to spark-shell, specified in code, or given to spark-submit. Code takes precedence, so don't hardcode this. Determines which cluster to utilize:
local - local with one worker thread
local[K] - local with K worker threads
local[*] - local with as many threads as your computer has logical cores
spark://host:port - connect to a Spark cluster, default port 7077
mesos://host:port - connect to a Mesos cluster, default port 5050
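Because code takes precedence, a common pattern (a sketch, not from the slides) is to leave the master out of the code and supply it to spark-submit or spark-shell instead:
val conf = new SparkConf().setAppName("Hello")   // no setMaster here
val sc = new SparkContext(conf)
// spark-submit --master spark://host:7077 ... (or --master "local[*]" for local testing)
// then supplies the master without recompiling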
17 Spark overview [Architecture diagram: a Driver Program running a SparkContext talks to a Cluster Manager; each Worker Node runs an Executor with Tasks and a Cache, on top of distributed storage.] SparkContext connects to a cluster manager, obtains executors on cluster nodes, sends application code to them, and sends tasks to the executors.
18 Example: Log Analysis
/* Java String functions (and all other functions too) also work in Scala */
val lines = sc.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("error"))
val messages = errors.map(_.split("\t")).map(_(1))
messages.persist()
messages.filter(_.contains("mysql")).count()
messages.filter(_.contains("php")).count()
19 WordCounting
/* When giving Spark file paths, those files need to be accessible with the same path from all slaves */
val file = sc.textFile("README.md")
val wc = file.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wc.saveAsTextFile("wc_out.txt")
wc.collect.foreach(println)
20 Join
val f1 = sc.textFile("README.md")
val sparks = f1.filter(_.startsWith("Spark"))
val wc1 = sparks.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
val f2 = sc.textFile("CHANGES.txt")
val sparks2 = f2.filter(_.startsWith("Spark"))
val wc2 = sparks2.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wc1.join(wc2).collect.foreach(println)
21 Transformations Transformations create a new dataset from an existing one. All transformations are lazy and are computed only when the results are needed. The transformation history is retained in RDDs, so calculations can be optimized and data can be recovered. Some operations can be given the number of tasks, which can be very important for performance: Spark and Hadoop prefer larger files and a smaller number of tasks if the data is small. However, the number of tasks should always be at least the number of CPU cores in the computer / cluster running Spark.
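A small sketch of laziness and of the optional number of tasks (the file name is hypothetical):
val lines = sc.textFile("data.txt")              // nothing is read yet
val pairs = lines.map(l => (l.length, 1))        // still lazy: only the lineage is recorded
val counts = pairs.reduceByKey(_ + _, 8)         // the second argument sets the number of tasks
counts.collect()                                 // the action triggers the actual computation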
22 Spark Transformations I/IV
map(func): Returns a new RDD formed by applying the function func to each element of the source.
filter(func): Returns a new RDD formed by selecting those elements of the source for which func returns true.
flatMap(func): Returns a new RDD formed by applying func to each element of the source; func can return a sequence of items for each input element.
mapPartitions(func): Similar functionality to map, but executed separately on each partition of the RDD. The function func must be of type (Iterator[T]) => Iterator[U] for an RDD of type T.
mapPartitionsWithIndex(func): Similar to the above, but func is also given an integer index of the partition. The function func must be of type (Int, Iterator[T]) => Iterator[U] for an RDD of type T.
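A sketch of mapPartitions and mapPartitionsWithIndex, showing the Iterator-to-Iterator signatures:
val rdd = sc.parallelize(1 to 10, 2)                                       // 2 partitions
val partialSums = rdd.mapPartitions(it => Iterator(it.sum))                // one partial sum per partition
val tagged = rdd.mapPartitionsWithIndex((i, it) => it.map(x => (i, x)))    // pair each element with its partition index
println(partialSums.collect().mkString(", "))                              // 15, 40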
23 Transformations II/IV
sample(withReplacement, frac, seed): Samples a fraction frac of the source data, with or without replacement (withReplacement), based on the given random seed.
union(other): Returns a union of the source dataset and the given dataset.
intersection(other): Returns the elements common to both RDDs.
distinct([numTasks]): Returns a new RDD that contains the distinct elements of the source dataset.
24 Spark Transformations III/IV
groupByKey([numTasks]): Returns an RDD of (K, Seq[V]) pairs for a source dataset of (K, V) pairs.
reduceByKey(func, [numTasks]): Returns an RDD of (K, V) pairs for a (K, V) input dataset, in which the values for each key are combined using the given reduce function func.
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]): Given an RDD of (K, V) pairs, returns an RDD of (K, U) pairs in which the values for each key are combined using the given combine functions and a neutral zero value.
sortByKey([ascending], [numTasks]): Returns an RDD of (K, V) pairs for a (K, V) input dataset where K implements Ordered, with the keys sorted in ascending or descending order (ascending boolean input variable).
join(inputDataset, [numTasks]): Given datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
cogroup(inputDataset, [numTasks]): Given datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples.
cartesian(inputDataset): Given datasets of types T and U, returns a combined dataset of (T, U) pairs that includes all pairs of elements.
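For example, aggregateByKey can compute a per-key (sum, count) pair, enough to derive an average (a sketch):
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val sumCount = pairs.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),      // seqOp: fold one value into (sum, count)
  (p, q) => (p._1 + q._1, p._2 + q._2))      // combOp: merge partial (sum, count) pairs
sumCount.collect()                           // e.g. Array((a,(4,2)), (b,(2,1)))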
25 Spark Transformations IV
pipe(command, [envVars]): Pipes each partition of the given RDD through a shell command (for example a bash script). Elements of the RDD are written to the stdin of the process, and lines written to its stdout are returned as an RDD of strings.
coalesce(numPartitions): Reduces the number of partitions in the RDD to numPartitions.
repartition(numPartitions): Increases or reduces the number of partitions in the RDD; implemented by reshuffling the data in a random manner for balancing.
repartitionAndSortWithinPartitions(partitioner): Repartitions the given RDD with the given partitioner and sorts the elements by their keys within each partition. This is more efficient than first repartitioning and then sorting.
26 Spark Actions I/II
reduce(func): Combines the elements of the input RDD with the given function func, which takes two arguments and returns one. The function should be commutative and associative for correct parallel execution.
collect(): Returns all the elements of the source RDD as an array to the driver program.
count(): Returns the number of elements in the source RDD.
first(): Returns the first element of the RDD (same as take(1)).
take(n): Returns an array with the first n elements of the RDD. Currently executed by the driver program (not in parallel).
takeSample(withReplacement, num, seed): Returns an array with a random sample of num elements of the RDD; the sampling is done with or without replacement (withReplacement) using the given random seed.
takeOrdered(n, [ordering]): Returns the first n elements of the RDD using natural or custom ordering.
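A few of these actions in a short sketch:
val rdd = sc.parallelize(Seq(5, 3, 8, 1, 9))
rdd.reduce(_ + _)        // 26: the function must be commutative and associative
rdd.count()              // 5
rdd.first()              // 5
rdd.take(3)              // Array(5, 3, 8)
rdd.takeOrdered(3)       // Array(1, 3, 5), natural ordering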
27 Spark Actions II
saveAsTextFile(path): Saves the elements of the RDD as a text file in the given local/HDFS/Hadoop directory. The system uses toString on each element to save the RDD.
saveAsSequenceFile(path): Saves the elements of the RDD as a Hadoop SequenceFile in the given local/HDFS/Hadoop directory. Only elements that conform to the Hadoop Writable interface are supported.
saveAsObjectFile(path): Saves the elements of the RDD using Java serialization. The file can be loaded with SparkContext.objectFile().
countByKey(): Returns (K, Int) pairs with the count of each key.
foreach(func): Applies the given function func to each element of the RDD.
28 Spark API For Python: see the Spark Programming Guide. Check which version's documentation (Stack Overflow, blogs, etc.) you are looking at; the API has had big changes between versions.
29 More information These slides: Intro to Apache Spark: Project that can be used to start (If using Maven): This is for Spark 1.0.2, so change the version in pom.xml.