CS 378 Big Data Programming. Lecture 24 RDDs
|
|
- Clyde Wilson
- 8 years ago
- Views:
Transcription
1 CS 378 Big Data Programming Lecture 24 RDDs
2 Review Assignment 10 Download and run Spark WordCount implementaeon WordCount alternaeve implementaeon
3 Basic RDD TransformaEons we ve discussed filter(function) map(function) flatmap(function) maptopair(function) How do these work for muleple pareeons? Is a shuffle over the network required? MLSS 2015 Big Data Programming 3
4 Filter Apply a funceon to each element of an RDD, return those elements that evaluate to true Source RDD element type: T Result RDD element type: T Java funceon (class) type: Function<T, Boolean> Java method: Boolean call(t t) MLSS 2015 Big Data Programming 4
5 Map Apply a funceon to each element of an RDD, return the result of applying the funceon RDD element type: T Result RDD element type: R Java funceon (class) type: Function<T, R> Java method: R call(t t) MLSS 2015 Big Data Programming 5
6 Flat Map Apply a funceon to each element of an RDD, return the result of applying the funceon (an Iterable) Source RDD element type: T Result RDD element type: R Java funceon (class) type: FlatMapFunction<T, R> Java method: Iterable<R> call(t t) MLSS 2015 Big Data Programming 6
7 Map to Pair Apply a funceon to each element of an RDD, return the result of applying the funceon (a key/value pair) Source RDD element type: T Result RDD element type: <K, V> Java funceon (class) type: PairFunction<T, K, V> Java method: Tuple2<K, V> call(t t) MLSS 2015 Big Data Programming 7
8 Basic RDD AddiEonal transformaeons distinct() union(otherrdd) intersection(otherrdd) subtract(otherrdd) cartesian(otherrdd) sample(withreplacement, fraction, [seed]) MLSS 2015 Big Data Programming 8
9 DisEnct Remove duplicates Source RDD element type: T Result RDD element type: T MLSS 2015 Big Data Programming 9
10 Union Compute the union of two RDDs Duplicates are not removed Source RDD element type: T Other RDD element type: T Result RDD element type: T MLSS 2015 Big Data Programming 10
11 IntersecEon Compute the interseceon of two RDDs Source RDD element type: T Other RDD element type: T Result RDD element type: T Requires a shuffle over the network. Why? Does union require a shuffle? MLSS 2015 Big Data Programming 11
12 Subtract Remove the elements of one RDD from another Source RDD element type: T Other RDD element type: T Result RDD element type: T Require a shuffle over the network? MLSS 2015 Big Data Programming 12
13 Cartesian Compute the cross product (Cartesian product) of two RDDs Source RDD element type: T Other RDD element type: U Result RDD element type: JavaPairRDD<T, U> Require a shuffle over the network? Very expensive in Eme and space MLSS 2015 Big Data Programming 13
14 Sample Sample from an RDD Source RDD element type: T Result RDD element type: T sample(withreplacement, fraction, [seed]) Require a shuffle over the network? MLSS 2015 Big Data Programming 14
15 Basic RDD AcEons we ve discussed collect() count() take(n) first() saveastextfile() MLSS 2015 Big Data Programming 15
16 Basic RDD AddiEonal aceons countbyvalue() Returns a map from RDD element to count (integer) Could we use this for WordCount? takeordered(n, comparator) Sort the RDD elements, then take() takesample(withreplacement, num) Start sampling to get num elements MLSS 2015 Big Data Programming 16
17 Basic RDD AddiEonal aceons reduce(function2<t,t,t> f) Returns an object of type T The funceon f should be commutaeve AssociaEve Does summing counts meet the specs for f? Does concatenaeng strings meet the specs for f? MLSS 2015 Big Data Programming 17
18 Assignment 11 Inverted index, implemented in Spark We ll use the same data (Assignment 3) Remove punctuaeon For each word, output a list of verses containing the word, in sorted order MLSS 2015 Big Data Programming 18
CS 378 Big Data Programming. Lecture 24 RDDs
CS 378 Lecture 24 RDDs Review Assignment 11: Inverted index in Spark We ll use the same data (Assignment 3) Remove punctuakon For each word, output a list of verses containing the word, in sorted order
More informationCS 378 Big Data Programming
CS 378 Big Data Programming Lecture 2 Map- Reduce CS 378 - Fall 2015 Big Data Programming 1 MapReduce Large data sets are not new What characterizes a problem suitable for MR? Most or all of the data is
More informationMachine- Learning Summer School - 2015
Machine- Learning Summer School - 2015 Big Data Programming David Franke Vast.com hbp://www.cs.utexas.edu/~dfranke/ Goals for Today Issues to address when you have big data Understand two popular big data
More informationCS 378 Big Data Programming. Lecture 2 Map- Reduce
CS 378 Big Data Programming Lecture 2 Map- Reduce MapReduce Large data sets are not new What characterizes a problem suitable for MR? Most or all of the data is processed But viewed in small increments
More informationBig Data Frameworks: Scala and Spark Tutorial
Big Data Frameworks: Scala and Spark Tutorial 13.03.2015 Eemil Lagerspetz, Ella Peltonen Professor Sasu Tarkoma These slides: http://is.gd/bigdatascala www.cs.helsinki.fi Functional Programming Functional
More informationBig Data and Scripting Systems beyond Hadoop
Big Data and Scripting Systems beyond Hadoop 1, 2, ZooKeeper distributed coordination service many problems are shared among distributed systems ZooKeeper provides an implementation that solves these avoid
More informationThe Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org
The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache
More informationBig Data Analytics. Lucas Rego Drumond
Big Data Analytics Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Apache Spark Apache Spark 1
More informationLambda Architecture. CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014
Lambda Architecture CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014 1 Goals Cover the material in Chapter 8 of the Concurrency Textbook The Lambda Architecture Batch Layer MapReduce
More informationBig Data and Scripting map/reduce in Hadoop
Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb
More informationHadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter : @carbone
Hadoop2, Spark Big Data, real time, machine learning & use cases Cédric Carbone Twitter : @carbone Agenda Map Reduce Hadoop v1 limits Hadoop v2 and YARN Apache Spark Streaming : Spark vs Storm Machine
More informationMapReduce: Simplified Data Processing on Large Clusters. Jeff Dean, Sanjay Ghemawat Google, Inc.
MapReduce: Simplified Data Processing on Large Clusters Jeff Dean, Sanjay Ghemawat Google, Inc. Motivation: Large Scale Data Processing Many tasks: Process lots of data to produce other data Want to use
More informationIntroduction to Big Data with Apache Spark UC BERKELEY
Introduction to Big Data with Apache Spark UC BERKELEY This Lecture Programming Spark Resilient Distributed Datasets (RDDs) Creating an RDD Spark Transformations and Actions Spark Programming Model Python
More informationIntroduction to Spark
Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks) Quick Demo Quick Demo API Hooks Scala / Java All Java libraries *.jar http://www.scala- lang.org Python Anaconda: https://
More informationApache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas
Apache Flink Next-gen data analysis Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research
More informationApache Spark and Distributed Programming
Apache Spark and Distributed Programming Concurrent Programming Keijo Heljanko Department of Computer Science University School of Science November 25th, 2015 Slides by Keijo Heljanko Apache Spark Apache
More informationCS 2316 Data Manipulation for Engineers
CS 2316 Data Manipulation for Engineers SQL Christopher Simpkins chris.simpkins@gatech.edu Chris Simpkins (Georgia Tech) CS 2316 Data Manipulation for Engineers SQL 1 / 26 1 1 The material in this lecture
More informationPrinciples of Software Construction: Objects, Design, and Concurrency. Distributed System Design, Part 2. MapReduce. Charlie Garrod Christian Kästner
Principles of Software Construction: Objects, Design, and Concurrency Distributed System Design, Part 2. MapReduce Spring 2014 Charlie Garrod Christian Kästner School of Computer Science Administrivia
More informationUnified Big Data Analytics Pipeline. 连 城 lian@databricks.com
Unified Big Data Analytics Pipeline 连 城 lian@databricks.com What is A fast and general engine for large-scale data processing An open source implementation of Resilient Distributed Datasets (RDD) Has an
More informationBig Data Analytics with Cassandra, Spark & MLLib
Big Data Analytics with Cassandra, Spark & MLLib Matthias Niehoff AGENDA Spark Basics In A Cluster Cassandra Spark Connector Use Cases Spark Streaming Spark SQL Spark MLLib Live Demo CQL QUERYING LANGUAGE
More informationProject DALKIT (informal working title)
Project DALKIT (informal working title) Michael Axtmann, Timo Bingmann, Peter Sanders, Sebastian Schlag, and Students 2015-03-27 INSTITUTE OF THEORETICAL INFORMATICS ALGORITHMICS KIT University of the
More informationHadoop and Map-reduce computing
Hadoop and Map-reduce computing 1 Introduction This activity contains a great deal of background information and detailed instructions so that you can refer to it later for further activities and homework.
More informationCrystal Reports Integration Plugin for JIRA
Crystal Reports Integration Plugin for JIRA Copyright 2008 The Go To Group Page 1 of 7 Table of Contents Crystal Reports Integration Plugin for JIRA...1 Introduction...3 Prerequisites...3 Architecture...3
More informationCS 378 Big Data Programming. Lecture 5 Summariza9on Pa:erns
CS 378 Big Data Programming Lecture 5 Summariza9on Pa:erns Review Assignment 2 Ques9ons? If you d like to use guava (Google collec9ons classes) pom.xml available for assignment 2 Includes dependency for
More informationHadoop WordCount Explained! IT332 Distributed Systems
Hadoop WordCount Explained! IT332 Distributed Systems Typical problem solved by MapReduce Read a lot of data Map: extract something you care about from each record Shuffle and Sort Reduce: aggregate, summarize,
More informationIntroduction to Cloud Computing
Introduction to Cloud Computing MapReduce and Hadoop 15 319, spring 2010 17 th Lecture, Mar 16 th Majd F. Sakr Lecture Goals Transition to MapReduce from Functional Programming Understand the origins of
More informationData Science in the Wild
Data Science in the Wild Lecture 4 59 Apache Spark 60 1 What is Spark? Not a modified version of Hadoop Separate, fast, MapReduce-like engine In-memory data storage for very fast iterative queries General
More informationApache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com http://elephantscale.com
Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com http://elephantscale.com Spark Fast & Expressive Cluster computing engine Compatible with Hadoop Came
More informationMAPREDUCE Programming Model
CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce
More informationCS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University
CS 555: DISTRIBUTED SYSTEMS [SPARK] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Streaming Significance of minimum delays? Interleaving
More informationKafka & Redis for Big Data Solutions
Kafka & Redis for Big Data Solutions Christopher Curtin Head of Technical Research @ChrisCurtin About Me 25+ years in technology Head of Technical Research at Silverpop, an IBM Company (14 + years at Silverpop)
More informationScaling Out With Apache Spark. DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf
Scaling Out With Apache Spark DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf Your hosts Mathijs Kattenberg Technical consultant Jeroen Schot Technical consultant
More informationIntroduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University 2014-05-12
Introduction to NoSQL Databases and MapReduce Tore Risch Information Technology Uppsala University 2014-05-12 What is a NoSQL Database? 1. A key/value store Basic index manager, no complete query language
More informationData Algorithms. Mahmoud Parsian. Tokyo O'REILLY. Beijing. Boston Farnham Sebastopol
Data Algorithms Mahmoud Parsian Beijing Boston Farnham Sebastopol Tokyo O'REILLY Table of Contents Foreword xix Preface xxi 1. Secondary Sort: Introduction 1 Solutions to the Secondary Sort Problem 3 Implementation
More informationCS 378 Big Data Programming. Lecture 1 Introduc:on
CS 378 Big Data Programming Lecture 1 Introduc:on Class Logis:cs Class meets MW, 9:30 AM 11:00 AM Office Hours GDC 4.706 MW 11:00 12:00 AM By appointment Email: dfranke@cs.utexas.edu Web page: cs.utexas.edu/~dfranke/courses/2015spring/cs378-
More informationBIG DATA APPLICATIONS
BIG DATA ANALYTICS USING HADOOP AND SPARK ON HATHI Boyu Zhang Research Computing ITaP BIG DATA APPLICATIONS Big data has become one of the most important aspects in scientific computing and business analytics
More informationFractions and Linear Equations
Fractions and Linear Equations Fraction Operations While you can perform operations on fractions using the calculator, for this worksheet you must perform the operations by hand. You must show all steps
More informationYahoo! Grid Services Where Grid Computing at Yahoo! is Today
Yahoo! Grid Services Where Grid Computing at Yahoo! is Today Marco Nicosia Grid Services Operations marco@yahoo-inc.com What is Apache Hadoop? Distributed File System and Map-Reduce programming platform
More informationBig Data and Scripting Systems build on top of Hadoop
Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform Pig is the name of the system Pig Latin is the provided programming language Pig Latin is
More informationRelational Database: Additional Operations on Relations; SQL
Relational Database: Additional Operations on Relations; SQL Greg Plaxton Theory in Programming Practice, Fall 2005 Department of Computer Science University of Texas at Austin Overview The course packet
More informationHadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart
Hadoop/MapReduce Object-oriented framework presentation CSCI 5448 Casey McTaggart What is Apache Hadoop? Large scale, open source software framework Yahoo! has been the largest contributor to date Dedicated
More informationLesson 7 Pentaho MapReduce
Lesson 7 Pentaho MapReduce Pentaho Data Integration, or PDI, is a comprehensive ETL platform allowing you to access, prepare and derive value from both traditional and big data sources. During this lesson,
More informationData Intensive Computing Handout 5 Hadoop
Data Intensive Computing Handout 5 Hadoop Hadoop 1.2.1 is installed in /HADOOP directory. The JobTracker web interface is available at http://dlrc:50030, the NameNode web interface is available at http://dlrc:50070.
More informationClasses and Objects in Java Constructors. In creating objects of the type Fraction, we have used statements similar to the following:
In creating objects of the type, we have used statements similar to the following: f = new (); The parentheses in the expression () makes it look like a method, yet we never created such a method in our
More informationLAB 2 SPARK / D-STREAM PROGRAMMING SCIENTIFIC APPLICATIONS FOR IOT WORKSHOP
LAB 2 SPARK / D-STREAM PROGRAMMING SCIENTIFIC APPLICATIONS FOR IOT WORKSHOP ICTP, Trieste, March 24th 2015 The objectives of this session are: Understand the Spark RDD programming model Familiarize with
More informationAnalyzing Big Data at. Web 2.0 Expo, 2010 Kevin Weil @kevinweil
Analyzing Big Data at Web 2.0 Expo, 2010 Kevin Weil @kevinweil Three Challenges Collecting Data Large-Scale Storage and Analysis Rapid Learning over Big Data My Background Studied Mathematics and Physics
More informationSpark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY
Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person
More informationExercise 1: Python Language Basics
Exercise 1: Python Language Basics In this exercise we will cover the basic principles of the Python language. All languages have a standard set of functionality including the ability to comment code,
More informationSpark Application Carousel. Spark Summit East 2015
Spark Application Carousel Spark Summit East 2015 About Today s Talk About Me: Vida Ha - Solutions Engineer at Databricks. Goal: For beginning/early intermediate Spark Developers. Motivate you to start
More informationTo reduce or not to reduce, that is the question
To reduce or not to reduce, that is the question 1 Running jobs on the Hadoop cluster For part 1 of assignment 8, you should have gotten the word counting example from class compiling. To start with, let
More informationThe Performance Characteristics of MapReduce Applications on Scalable Clusters
The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have
More informationReal-time Map Reduce: Exploring Clickstream Analytics with: Kafka, Spark Streaming and WebSockets. Andrew Psaltis
Real-time Map Reduce: Exploring Clickstream Analytics with: Kafka, Spark Streaming and WebSockets Andrew Psaltis About Me Recently started working at Ensighten on Agile Maketing Platform Prior 4.5 years
More informationOther Map-Reduce (ish) Frameworks. William Cohen
Other Map-Reduce (ish) Frameworks William Cohen 1 Outline More concise languages for map- reduce pipelines Abstractions built on top of map- reduce General comments Speci
More informationCS 378 Big Data Programming. Lecture 9 Complex Writable Types
CS 378 Big Data Programming Lecture 9 Complex Writable Types Review Assignment 4 - CustomWritable QuesIons/issues? Hadoop Provided Writables We ve used several Hadoop Writable classes Text LongWritable
More informationData Intensive Computing Handout 6 Hadoop
Data Intensive Computing Handout 6 Hadoop Hadoop 1.2.1 is installed in /HADOOP directory. The JobTracker web interface is available at http://dlrc:50030, the NameNode web interface is available at http://dlrc:50070.
More informationWriting Standalone Spark Programs
Writing Standalone Spark Programs Matei Zaharia UC Berkeley www.spark- project.org UC BERKELEY Outline Setting up for Spark development Example: PageRank PageRank in Java Testing and debugging Building
More informationCommon Patterns and Pitfalls for Implementing Algorithms in Spark. Hossein Falaki @mhfalaki hossein@databricks.com
Common Patterns and Pitfalls for Implementing Algorithms in Spark Hossein Falaki @mhfalaki hossein@databricks.com Challenges of numerical computation over big data When applying any algorithm to big data
More informationBig Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com
Big Data Primer Alex Sverdlov alex@theparticle.com 1 Why Big Data? Data has value. This immediately leads to: more data has more value, naturally causing datasets to grow rather large, even at small companies.
More informationMapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
More informationBig Data and Analytics by Seema Acharya and Subhashini Chellappan Copyright 2015, WILEY INDIA PVT. LTD. Introduction to Pig
Introduction to Pig Agenda What is Pig? Key Features of Pig The Anatomy of Pig Pig on Hadoop Pig Philosophy Pig Latin Overview Pig Latin Statements Pig Latin: Identifiers Pig Latin: Comments Data Types
More informationData Structures and Algorithms Written Examination
Data Structures and Algorithms Written Examination 22 February 2013 FIRST NAME STUDENT NUMBER LAST NAME SIGNATURE Instructions for students: Write First Name, Last Name, Student Number and Signature where
More informationHiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group
HiBench Introduction Carson Wang (carson.wang@intel.com) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is
More informationCS101 Lecture 24: Thinking in Python: Input and Output Variables and Arithmetic. Aaron Stevens 28 March 2011. Overview/Questions
CS101 Lecture 24: Thinking in Python: Input and Output Variables and Arithmetic Aaron Stevens 28 March 2011 1 Overview/Questions Review: Programmability Why learn programming? What is a programming language?
More informationBig Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park
Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable
More informationMR-(Mapreduce Programming Language)
MR-(Mapreduce Programming Language) Siyang Dai Zhi Zhang Shuai Yuan Zeyang Yu Jinxiong Tan sd2694 zz2219 sy2420 zy2156 jt2649 Objective of MR MapReduce is a software framework introduced by Google, aiming
More informationConvex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics
Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics Sabeur Aridhi Aalto University, Finland Sabeur Aridhi Frameworks for Big Data Analytics 1 / 59 Introduction Contents 1 Introduction
More informationProject 5 Twitter Analyzer Due: Fri. 2015-12-11 11:59:59 pm
Project 5 Twitter Analyzer Due: Fri. 2015-12-11 11:59:59 pm Goal. In this project you will use Hadoop to build a tool for processing sets of Twitter posts (i.e. tweets) and determining which people, tweets,
More informationAdvanced Data Management Technologies
ADMT 2015/16 Unit 15 J. Gamper 1/53 Advanced Data Management Technologies Unit 15 MapReduce J. Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Acknowledgements: Much of the information
More informationBig Data Storage, Management and challenges. Ahmed Ali-Eldin
Big Data Storage, Management and challenges Ahmed Ali-Eldin (Ambitious) Plan What is Big Data? And Why talk about Big Data? How to store Big Data? BigTables (Google) Dynamo (Amazon) How to process Big
More informationFast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY
Fast and Expressive Big Data Analytics with Python Matei Zaharia UC Berkeley / MIT UC BERKELEY spark-project.org What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop
More informationFP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data
FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:
More informationDduP Towards a Deduplication Framework utilising Apache Spark
DduP Towards a Deduplication Framework utilising Apache Spark Niklas Wilcke Datenbanken und Informationssysteme (ISYS) University of Hamburg Vogt-Koelln-Strasse 30 22527 Hamburg 1wilcke@informatik.uni-hamburg.de
More informationSystems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13
Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Hadoop Ecosystem Overview of this Lecture Module Background Google MapReduce The Hadoop Ecosystem Core components: Hadoop
More informationGetting t Ge o K tting t no o K w Scala no for fo r Da D t a a t a Sc S i c e i n e c n e c @TheTomFlaherty
Getting to Know Scala for Data Science @TheTomFlaherty Bio: I have been a Chief Architect for 20 years, where I first become enamored by Scala in 2006. I wrote a symbolic math application in Scala at Glaxo
More informationThe MapReduce Framework
The MapReduce Framework Luke Tierney Department of Statistics & Actuarial Science University of Iowa November 8, 2007 Luke Tierney (U. of Iowa) The MapReduce Framework November 8, 2007 1 / 16 Background
More informationIntroduction to Fractions
Section 0.6 Contents: Vocabulary of Fractions A Fraction as division Undefined Values First Rules of Fractions Equivalent Fractions Building Up Fractions VOCABULARY OF FRACTIONS Simplifying Fractions Multiplying
More informationWelcome to Introduction to programming in Python
Welcome to Introduction to programming in Python Suffolk One, Ipswich, 4:30 to 6:00 Tuesday Jan 14, Jan 21, Jan 28, Feb 11 Welcome Fire exits Toilets Refreshments 1 Learning objectives of the course An
More information1.6 The Order of Operations
1.6 The Order of Operations Contents: Operations Grouping Symbols The Order of Operations Exponents and Negative Numbers Negative Square Roots Square Root of a Negative Number Order of Operations and Negative
More informationMap Reduce / Hadoop / HDFS
Chapter 3: Map Reduce / Hadoop / HDFS 97 Overview Outline Distributed File Systems (re-visited) Motivation Programming Model Example Applications Big Data in Apache Hadoop HDFS in Hadoop YARN 98 Overview
More information1.00 Lecture 1. Course information Course staff (TA, instructor names on syllabus/faq): 2 instructors, 4 TAs, 2 Lab TAs, graders
1.00 Lecture 1 Course Overview Introduction to Java Reading for next time: Big Java: 1.1-1.7 Course information Course staff (TA, instructor names on syllabus/faq): 2 instructors, 4 TAs, 2 Lab TAs, graders
More informationCSCI 5417 Information Retrieval Systems Jim Martin!
CSCI 5417 Information Retrieval Systems Jim Martin! Lecture 9 9/20/2011 Today 9/20 Where we are MapReduce/Hadoop Probabilistic IR Language models LM for ad hoc retrieval 1 Where we are... Basics of ad
More informationChapter 2: Elements of Java
Chapter 2: Elements of Java Basic components of a Java program Primitive data types Arithmetic expressions Type casting. The String type (introduction) Basic I/O statements Importing packages. 1 Introduction
More informationCopy the.jar file into the plugins/ subfolder of your Eclipse installation. (e.g., C:\Program Files\Eclipse\plugins)
Beijing Codelab 1 Introduction to the Hadoop Environment Spinnaker Labs, Inc. Contains materials Copyright 2007 University of Washington, licensed under the Creative Commons Attribution 3.0 License --
More informationE6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics
E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and Big
More informationData Science in the Wild
Data Science in the Wild Lecture 3 Some slides are taken from J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 Data Science and Big Data Big Data: the data cannot
More informationBig Data Processing with Google s MapReduce. Alexandru Costan
1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:
More informationICAB4136B Use structured query language to create database structures and manipulate data
ICAB4136B Use structured query language to create database structures and manipulate data Release: 1 ICAB4136B Use structured query language to create database structures and manipulate data Modification
More informationStep 4: Configure a new Hadoop server This perspective will add a new snap-in to your bottom pane (along with Problems and Tasks), like so:
Codelab 1 Introduction to the Hadoop Environment (version 0.17.0) Goals: 1. Set up and familiarize yourself with the Eclipse plugin 2. Run and understand a word counting program Setting up Eclipse: Step
More informationIntroduction to Python
Introduction to Python Sophia Bethany Coban Problem Solving By Computer March 26, 2014 Introduction to Python Python is a general-purpose, high-level programming language. It offers readable codes, and
More informationHadoop, Hive & Spark Tutorial
Hadoop, Hive & Spark Tutorial 1 Introduction This tutorial will cover the basic principles of Hadoop MapReduce, Apache Hive and Apache Spark for the processing of structured datasets. For more information
More informationActivity 1: Using base ten blocks to model operations on decimals
Rational Numbers 9: Decimal Form of Rational Numbers Objectives To use base ten blocks to model operations on decimal numbers To review the algorithms for addition, subtraction, multiplication and division
More informationComputer Science III Advanced Placement G/T [AP Computer Science A] Syllabus
Computer Science III Advanced Placement G/T [AP Computer Science A] Syllabus Course Overview This course is a fast-paced advanced level course that focuses on the study of the fundamental principles associated
More informationCourse Syllabus. MATH 1350-Mathematics for Teachers I. Revision Date: 8/15/2016
Course Syllabus MATH 1350-Mathematics for Teachers I Revision Date: 8/15/2016 Catalog Description: This course is intended to build or reinforce a foundation in fundamental mathematics concepts and skills.
More informationSQL Database queries and their equivalence to predicate calculus
SQL Database queries and their equivalence to predicate calculus Russell Impagliazzo, with assistence from Cameron Helm November 3, 2013 1 Warning This lecture goes somewhat beyond Russell s expertise,
More informationMap Reduce & Hadoop Recommended Text:
Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately
More informationMap- reduce, Hadoop and The communica3on bo5leneck. Yoav Freund UCSD / Computer Science and Engineering
Map- reduce, Hadoop and The communica3on bo5leneck Yoav Freund UCSD / Computer Science and Engineering Plan of the talk Why is Hadoop so popular? HDFS Map Reduce Word Count example using Hadoop streaming
More informationHow to properly misuse Hadoop. Marcel Huntemann NERSC tutorial session 2/12/13
How to properly misuse Hadoop Marcel Huntemann NERSC tutorial session 2/12/13 History Created by Doug Cutting (also creator of Apache Lucene). 2002 Origin in Apache Nutch (open source web search engine).
More informationBig Data Technology Map-Reduce Motivation: Indexing in Search Engines
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process
More informationCloudera Certified Developer for Apache Hadoop
Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number
More informationLecture 4: Binary. CS442: Great Insights in Computer Science Michael L. Littman, Spring 2006. I-Before-E, Continued
Lecture 4: Binary CS442: Great Insights in Computer Science Michael L. Littman, Spring 26 I-Before-E, Continued There are two ideas from last time that I d like to flesh out a bit more. This time, let
More information