How To Write A Program On A Microsoft.Com (For Microsoft) With A Microsail (For Linux) And A Microosoft.Com Microsails (For Windows) (For Macintosh) ( For Microsils

Size: px
Start display at page:

Download "How To Write A Program On A Microsoft.Com (For Microsoft) With A Microsail (For Linux) And A Microosoft.Com Microsails (For Windows) (For Macintosh) ( For Microsils"

Transcription

1 Programming and Debugging Large- Scale Data Processing Workflows Christopher Olston Google Research (work done at Yahoo! Research, with many colleagues)

2 What I m Working on at Google We ve got fantasjc cloud building blocks: BigTable, MapReduce, Pregel, and on and on (and so do you: EC2, Hadoop, Redis, ZooKeeper, ) To build your app: 1. Think hard, and choose a few building blocks (BB s) 2. SJck your app logic into a blender, and pour it into the various BB abstracjons (keys/values, MR funcjons, callbacks, ) 3. Tune it: stupid map- reduce tricks ; set bazillions of flags 4. Hope that Your BB choices were right, and stay right for a while Nobody ever has to understand your app by reading the code You never a_empt big changes to your app logic or algorithms

3 What I m Working on at Google We ve got fantasjc cloud building blocks: BigTable, MapReduce, Pregel, and on and on (and so do you: EC2, Hadoop, Redis, ZooKeeper, ) To build your app: 1. Think hard, and choose a few building blocks (BB s) 2. SJck What your app if we logic could into separate a blender, the and app pour logic it into the various BB abstracjons from the assemblage (keys/values, of MR building funcjons, blocks? callbacks, ) 3. Tune it: stupid map- reduce tricks ; set bazillions of flags 4. Hope that Your BB choices were right, and stay right for a while Nobody ever has to understand your app by reading the code You never a_empt big changes to your app logic or algorithms

4 Research at Google High- risk/high- reward research happening across the company Successful research projects oben become successful products (e.g. speech recognijon) No pressure to publish incremental papers Interdisciplinary In my case: DB + PL + AI We re hiring J

5 Now, back to your regularly scheduled program

6 Context Elaborate processing of large data sets e.g.: web search pre- processing cross- dataset linkage web informajon extracjon inges)on storage & processing serving

7 Context storage & processing workflow manager e.g. Nova dataflow programming framework e.g. Pig distributed sorjng & hashing e.g. Map- Reduce Overview scalable file system e.g. GFS Detail: Inspector Gadget, RubySky Debugging aides: Before: example data generator During: instrumentajon framework Aber: provenance metadata manager

8 Pig: A High- Level Dataflow Language & RunJme for Hadoop Web browsing sessions with happy endings. Visits = load /data/visits as (user, url, time);! Visits = foreach Visits generate user, Canonicalize(url), time;!! Pages = load /data/pages as (url, pagerank);!! VP = join Visits by url, Pages by url;! UserVisits = group VP by user;! Sessions = foreach UserVisits generate flatten(findsessions(*));! HappyEndings = filter Sessions by BestIsLast(*);!! store HappyEndings into '/data/happy_endings';!

9 vs. map- reduce: less code! "The [Hofmann PLSA E/M] algorithm was implemented in pig in lines of pig-latin statements. Took a lot less compared to what it took in implementing the algorithm in Map-Reduce Java. Exactly that's the reason I wanted to try it out in Pig. It took 3-4 days for me to write it, starting from learning pig. " " -- Prasenjit Mukherjee, Mahout project" /20 the lines of code Minutes /16 the development Jme Hadoop Pig Hadoop Pig performs on par with raw Hadoop

10 vs. SQL: step- by- step style; lower- level control "I much prefer writing in Pig [Latin] versus SQL. The step-by-step method of" creating a program in Pig [Latin] is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data. " " -- Jasmine Novak, Engineer, Yahoo!" "PIG seems to give the necessary parallel programming construct (FOREACH, FLATTEN, COGROUP.. etc) and also give sufficient control back to the programmer (which purely declarative approach like [SQL on Map-Reduce] doesnʼt). " " -- Ricky Ho, Adobe Software"

11 Conceptually: A Graph of Data TransformaJons Find users who tend to visit good pages. Load Visits(user, url, Jme) Load Pages(url, pagerank) Transform to (user, Canonicalize(url), Jme) Join url = url Group by user Transform to (user, Average(pagerank) as avgpr) Filter avgpr > 0.5

12 Load Visits(user, url, Jme) Load Pages(url, pagerank) Illustrated! Transform to (user, Canonicalize(url), Jme) (Amy, cnn.com, 8am) (Amy, h_p:// 9am) (Fred, 11am) (Amy, 8am) (Amy, 9am) (Fred, 11am) Join url = url Group by user (Amy, 8am, 0.9) (Amy, 9am, 0.4) (Fred, 11am, 0.4) ( 0.9) ( 0.4) (Amy, { (Amy, 8am, 0.9), (Amy, 9am, 0.4) }) (Fred, { (Fred, 11am, 0.4) }) Transform to (user, Average(pagerank) as avgpr) (Amy, 0.65) (Fred, 0.4) Filter avgpr > 0.5 (Amy, 0.65)

13 Load Visits(user, url, Jme) Load Pages(url, pagerank) Illustrated! Transform to (user, Canonicalize(url), Jme) (Amy, cnn.com, 8am) (Amy, h_p:// 9am) (Fred, 11am) (Amy, 8am) (Amy, 9am) (Fred, 11am) Join url = url Group by user (Amy, 8am, 0.9) (Amy, 9am, 0.4) (Fred, 11am, 0.4) ( 0.9) ( 0.4) (Amy, { (Amy, 8am, 0.9), (Amy, 9am, 0.4) }) (Fred, { (Fred, 11am, 0.4) }) Transform to (user, Average(pagerank) as avgpr) ILLUSTRATE lets me check the output of my lengthy (Amy, 0.65) batch jobs and their (Fred, 0.4) custom functions without having to do a lengthy run of a long pipeline. [This feature] enables me to be productive. Filter " " avgpr > Russell Jurney, LinkedIn" (Amy, 0.65)

14 (Naïve Algorithm) Load Visits(user, url, Jme) Transform to (user, Canonicalize(url), Jme) (Amy, cnn.com, 8am) (Amy, h_p:// 9am) (Fred, 11am) (Amy, 8am) (Amy, 9am) (Fred, 11am) Join url = url Group by user Transform to (user, Average(pagerank) as avgpr) Load Pages(url, pagerank) ( 0.9) ( 0.4) Filter avgpr > 0.5

15 Original UI Prototype (2008)

16 InteracJve Graphics Programming UI by Bret Victor, Apple Designer (2012) Creators need an immediate connection to what they create If you make a change you need to see the effect of that immediately. " " -- Bret Victor, Apple"

17 Pig Today Open- source (the Apache Pig Project) Dev./support/training by Cloudera, Hortonworks Offered on Amazon ElasJc Map- Reduce Used by LinkedIn, Nexlix, Salesforce, Twi_er, Yahoo... At Yahoo, as of early 2011: 1000s of jobs/day 75%+ of Hadoop jobs Has an interacjve example- generator command, but no side- by- side UI L

18 Next: NOVA storage & processing workflow manager e.g. Nova dataflow programming framework e.g. Pig distributed sorjng & hashing e.g. Map- Reduce scalable file system e.g. GFS Debugging aides: Before: example data generator During: instrumentajon framework Aber: provenance metadata manager

19 Why a Workflow Manager? ConJnuous data processing (simulated on top of Pig/Hadoop stajc processing layer) Independent scheduling of workflow modules

20 Example RSS feed NEW Workflow template detecjon ALL news arjcles ALL news site templates ALL NEW template tagging NEW NEW shingling NEW shingle hashes seen ALL NEW NEW de- duping NEW unique arjcles

21 Data Passes Through Many Sub- Systems Nova datum X ingesjon Pig Map- Reduce low- latency processor serving datum Y GFS provenance of X? metadata queries

22 Ibis Project metadata Ibis metadata queries integrated metadata answers data processing sub- systems metadata manager users Benefits: Provide uniform view to users Factor out metadata management code Decouple metadata lifejme from data/subsystem lifejme Challenges: Overhead of shipping metadata Disparate data/processing granularijes

23 Example Data and Process GranulariJes Workflow Table MR program Pig script Web page Row Cell Column group Column AlternaJve Pig job MR job MR job phase MR task Pig logical operajon Pig physical operajon Task a_empt data granulari7es process granulari7es

24 What s Hard About MulJ- Granularity Provenance? Inference: Given relajonships expressed at one granularity, answer queries about other granularijes (the seman;cs are tricky here!) Efficiency: Implement inference without resorjng to materializing everything in terms of finest granularity (e.g. cells)

25 Inferring TransiJve Links map phase reduce phase IMDB web page Yahoo! Movies web page IMDB extracted table )tle year lead actor Avatar 2009 Worthington IncepJon 2010 DiCaprio Yahoo extracted table )tle year lead actor Avatar 2009 Saldana IncepJon 2010 DiCaprio map output 1 map output 2 combined table )tle year lead actor Avatar 2009 A1: Worthington A2: Saldana IncepJon 2010 DiCaprio Q: Which web page said Saldana?

26 Next: INSPECTOR GADGET storage & processing workflow manager e.g. Nova dataflow programming framework e.g. Pig distributed sorjng & hashing e.g. Map- Reduce scalable file system e.g. GFS Debugging aides: Before: example data generator During: instrumentajon framework Aber: provenance metadata manager

27 MoJvated by User Interviews Interviewed 10 Yahoo dataflow programmers (mostly Pig users; some users of other dataflow environments) Asked them how they (wish they could) debug

28 Summary of User Interviews # of requests feature 7 crash culprit determinajon 5 row- level integrity alerts 4 table- level integrity alerts 4 data samples 3 data summaries 3 memory use monitoring 3 backward tracing (provenance) 2 forward tracing 2 golden data/logic tesjng 2 step- through debugging 2 latency alerts 1 latency profiling 1 overhead profiling 1 trial runs

29 Our Approach Goal: a programming framework for adding these behaviors, and others, to Pig Precept: avoid modifying Pig or tampering with data flowing through Pig Approach: perform Pig script rewrijng insert special UDFs that look like no- ops to Pig

30 Pig w/ Inspector Gadget load filter load join IG coordinator group count store

31 Example: Integrity Alerts load filter load alert! join IG coordinator propagate alert to user group count store

32 Example: Crash Culprit Determina;on Phases 1 to n- 1: record counts Phase n: records load filter load join IG coordinator group Phases 1 to n- 1: maintain count lower bounds Phase n: maintain last- seen records count store

33 Example: Forward Tracing load load filter IG coordinator traced records join group tracing instrucjons report traced records to user count store

34 Flow end user result dataflow program + app. parameters applicajon IG driver library launch instrumented dataflow run(s) raw result(s) load load IG coordinator filter join dataflow engine run;me store

35 Agent & Coordinator APIs Agent Class init(args) tags = observerecord(record, tags) receivemessage(source, message) finish() Agent Messaging sendtocoordinator(message) sendtoagent(agentid, message) senddownstream(message) sendupstream(message) Coordinator Class init(args) receivemessage(source, message) output = finish() Coordinator Messaging sendtoagent(agentid, message)

36 ApplicaJons Developed For IG # of requests feature lines of code (Java) 7 crash culprit determinajon row- level integrity alerts 89 4 table- level integrity alerts 99 4 data samples 97 3 data summaries memory use monitoring N/A 3 backward tracing (provenance) forward tracing golden data/logic tesjng step- through debugging N/A 2 latency alerts latency profiling overhead profiling trial runs 93

37 RubySky: ScripJng Framework with Built- in Debugging Support Scenario: ad- hoc data analysis e.g. we just did a crawl; is it too biased toward Yahoo content? In this scenario, somejmes using new data; always running new code à Bugs are the norm à Debugging should be a first class cijzen!

38 RubySky Example :Crawl_data <= load(context, "/crawl_data", schema {string :url;...})!! # quick-and-dirty determination of "domain" from "url"! # (e.g. => yahoo.com)! :With_domains <= Crawl_data.foreach() do r! url = r[:url]! prefix = if (url[0, prefix.len] == prefix)! domain = url[prefix.len, url.index('/', prefix.len) - prefix.len]! else! throw Don t know how to parse this url: + url! end! Tuple.new([url, domain])! end!! :Yahoo_only <= With_domains.filter do r! r[:domain] =~ /^yahoo/! end!! :Grouped <= Yahoo_only.group!! :Grouped.each { r puts NUM. YAHOO URLs: + r[:yahoo_only].size }!

39 RubySky Example: w/debug stmts...! :With_domains <= Crawl_data.foreach() do r!...! else! domain = PUNT url! end! Tuple.new([url, domain])! end!! # examine a few extracted domains, as sanity check! :Domain_sample <= With_domains.foreach() do r! $count += 1! ($count < SAMPLE_SIZE)? [Tuple.new([r[:domain]])] : []! end! SHOW Domain_sample!! # alert if any extracted domain is empty or null! :Domain_missing <= With_domains.foreach() do r! (r[:domain] == "" or r[:domain] == nil)?! [Tuple.new(["ALERT! Missing domain from: " + r[:url]])] : []! end! SHOW Domain_missing!!...!

40 RubySky ExecuJon launch load client punt request response (code and/or data) domain samples empty domain alerts extract domain spy agent filter group answer (# yahoo URLs) count

41 Summary storage & processing workflow manager e.g. Nova dataflow programming framework e.g. Pig distributed sorjng & hashing e.g. Map- Reduce scalable file system e.g. GFS Nova Pig Debugging aides: Before: example data generator During: instrumentajon framework Aber: provenance metadata manager Dataflow Illustrator Inspector Gadget, RubySky Ibis

42 Related Work Pig: DryadLINQ, Hive, Jaql, Scope, rela;onal query languages Nova: BigTable, CBP, Oozie, Percolator, scien;fic workflow, incremental view maintenance Dataflow illustrator: [Mannila/Raiha, PODS 86], reverse query processing, constraint databases, hardware verifica;on & model checking Inspector gadget, RubySky: XTrace, taint tracking, aspect- oriented programming Ibis: Kepler COMAD, ZOOM user views, provenance management for databases & scien;fic workflows

43 Collaborators Shubham Chopra Tyson Condie Anish Das Sarma Alan Gates Pradeep Kamath Ravi Kumar Shravan Narayanamurthy Olga Natkovich Benjamin Reed Santhosh Srinivasan Utkarsh Srivastava Andrew Tomkins

Programming and Debugging Large- Scale Data Processing Workflows

Programming and Debugging Large- Scale Data Processing Workflows Programming and Debugging Large- Scale Data Processing Workflows Christopher Olston Google Research (work done at Yahoo! Research, with many colleagues) Context Elaborate processing of large data sets

More information

Programming and Debugging Large- Scale Data Processing Workflows. Christopher Olston and many others Yahoo! Research

Programming and Debugging Large- Scale Data Processing Workflows. Christopher Olston and many others Yahoo! Research Programming and Debugging Large- Scale Data Processing Workflows Christopher Olston and many others Yahoo! Research Context Elaborate processing of large data sets e.g.: web search pre- processing cross-

More information

web-scale data processing Christopher Olston and many others Yahoo! Research

web-scale data processing Christopher Olston and many others Yahoo! Research web-scale data processing Christopher Olston and many others Yahoo! Research Motivation Projects increasingly revolve around analysis of big data sets Extracting structured data, e.g. face detection Understanding

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Hadoop Ecosystem Overview of this Lecture Module Background Google MapReduce The Hadoop Ecosystem Core components: Hadoop

More information

Data processing goes big

Data processing goes big Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,

More information

Systems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de

Systems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de Systems Engineering II Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de About me! Since May 2015 2015 2012 Research Group Leader cfaed, TU Dresden PhD Student MPI- SWS Research Intern Microsoft

More information

Distributed Calculus with Hadoop MapReduce inside Orange Search Engine. mardi 3 juillet 12

Distributed Calculus with Hadoop MapReduce inside Orange Search Engine. mardi 3 juillet 12 Distributed Calculus with Hadoop MapReduce inside Orange Search Engine What is Big Data? $ 5 billions (2012) to $ 50 billions (by 2017) Forbes «Big Data is the new definitive source of competitive advantage

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich Big Data Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Google MapReduce The Hadoop

More information

Hadoop Job Oriented Training Agenda

Hadoop Job Oriented Training Agenda 1 Hadoop Job Oriented Training Agenda Kapil CK hdpguru@gmail.com Module 1 M o d u l e 1 Understanding Hadoop This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform. 1.1 Module

More information

Big Data Technology Pig: Query Language atop Map-Reduce

Big Data Technology Pig: Query Language atop Map-Reduce Big Data Technology Pig: Query Language atop Map-Reduce Eshcar Hillel Yahoo! Ronny Lempel Outbrain *Based on slides by Edward Bortnikov & Ronny Lempel Roadmap Previous class MR Implementation This class

More information

A Brief Introduction to Apache Tez

A Brief Introduction to Apache Tez A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value

More information

Chameleon: The Performance Tuning Tool for MapReduce Query Processing Systems

Chameleon: The Performance Tuning Tool for MapReduce Query Processing Systems paper:38 Chameleon: The Performance Tuning Tool for MapReduce Query Processing Systems Edson Ramiro Lucsa Filho 1, Ivan Luiz Picoli 2, Eduardo Cunha de Almeida 2, Yves Le Traon 1 1 University of Luxembourg

More information

Lecture 10: HBase! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 10: HBase! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 10: HBase!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the

More information

Data and Algorithms of the Web: MapReduce

Data and Algorithms of the Web: MapReduce Data and Algorithms of the Web: MapReduce Mauro Sozio May 13, 2014 Mauro Sozio (Telecom Paristech) Data and Algorithms of the Web: MapReduce May 13, 2014 1 / 39 Outline 1 MapReduce Introduction MapReduce

More information

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani Technical Architect - Big Data Syntel Agenda Welcome to the Zoo! Evolution Timeline Traditional BI/DW Architecture Where Hadoop Fits In 2 Welcome to

More information

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce

More information

Cloud Computing at Google. Architecture

Cloud Computing at Google. Architecture Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

More information

Cloud Computing. Up until now

Cloud Computing. Up until now Cloud Computing Lecture 19 Cloud Programming 2011-2012 Up until now Introduction, Definition of Cloud Computing Pre-Cloud Large Scale Computing: Grid Computing Content Distribution Networks Cycle-Sharing

More information

Big Data Too Big To Ignore

Big Data Too Big To Ignore Big Data Too Big To Ignore Geert! Big Data Consultant and Manager! Currently finishing a 3 rd Big Data project! IBM & Cloudera Certified! IBM & Microsoft Big Data Partner 2 Agenda! Defining Big Data! Introduction

More information

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future

More information

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

More information

Workshop on Hadoop with Big Data

Workshop on Hadoop with Big Data Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly

More information

Big Data and Scripting Systems build on top of Hadoop

Big Data and Scripting Systems build on top of Hadoop Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform Pig is the name of the system Pig Latin is the provided programming language Pig Latin is

More information

Inspector Gadget: A Framework for Custom Monitoring and Debugging of Distributed Dataflows

Inspector Gadget: A Framework for Custom Monitoring and Debugging of Distributed Dataflows Inspector Gadget: A Framework for Custom Monitoring and Debugging of Distributed Dataflows Christopher Olston and Benjamin Reed Yahoo! Research ABSTRACT We consider how to monitor and debug query processing

More information

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies Big Data: Global Digital Data Growth Growing leaps and bounds by 40+% Year over Year! 2009 =.8 Zetabytes =.08

More information

BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig

BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig Contents Acknowledgements... 1 Introduction to Hive and Pig... 2 Setup... 2 Exercise 1 Load Avro data into HDFS... 2 Exercise 2 Define an

More information

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks Hadoop Introduction Olivier Renault Solution Engineer - Hortonworks Hortonworks A Brief History of Apache Hadoop Apache Project Established Yahoo! begins to Operate at scale Hortonworks Data Platform 2013

More information

How To Scale Out Of A Nosql Database

How To Scale Out Of A Nosql Database Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI

More information

Qsoft Inc www.qsoft-inc.com

Qsoft Inc www.qsoft-inc.com Big Data & Hadoop Qsoft Inc www.qsoft-inc.com Course Topics 1 2 3 4 5 6 Week 1: Introduction to Big Data, Hadoop Architecture and HDFS Week 2: Setting up Hadoop Cluster Week 3: MapReduce Part 1 Week 4:

More information

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon.

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon. Building Scalable Big Data Infrastructure Using Open Source Software Sam William sampd@stumbleupon. What is StumbleUpon? Help users find content they did not expect to find The best way to discover new

More information

Data Management in the Cloud

Data Management in the Cloud Data Management in the Cloud Ryan Stern stern@cs.colostate.edu : Advanced Topics in Distributed Systems Department of Computer Science Colorado State University Outline Today Microsoft Cloud SQL Server

More information

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP Eva Andreasson Cloudera Most FAQ: Super-Quick Overview! The Apache Hadoop Ecosystem a Zoo! Oozie ZooKeeper Hue Impala Solr Hive Pig Mahout HBase MapReduce

More information

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture. Big Data Hadoop Administration and Developer Course This course is designed to understand and implement the concepts of Big data and Hadoop. This will cover right from setting up Hadoop environment in

More information

BIG DATA What it is and how to use?

BIG DATA What it is and how to use? BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

More information

Internals of Hadoop Application Framework and Distributed File System

Internals of Hadoop Application Framework and Distributed File System International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu

Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu Lecture 4 Introduction to Hadoop & GAE Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu Outline Introduction to Hadoop The Hadoop ecosystem Related projects

More information

Online Content Optimization Using Hadoop. Jyoti Ahuja Dec 20 2011

Online Content Optimization Using Hadoop. Jyoti Ahuja Dec 20 2011 Online Content Optimization Using Hadoop Jyoti Ahuja Dec 20 2011 What do we do? Deliver right CONTENT to the right USER at the right TIME o Effectively and pro-actively learn from user interactions with

More information

MySQL and Hadoop. Percona Live 2014 Chris Schneider

MySQL and Hadoop. Percona Live 2014 Chris Schneider MySQL and Hadoop Percona Live 2014 Chris Schneider About Me Chris Schneider, Database Architect @ Groupon Spent the last 10 years building MySQL architecture for multiple companies Worked with Hadoop for

More information

Hive Interview Questions

Hive Interview Questions HADOOPEXAM LEARNING RESOURCES Hive Interview Questions www.hadoopexam.com Please visit www.hadoopexam.com for various resources for BigData/Hadoop/Cassandra/MongoDB/Node.js/Scala etc. 1 Professional Training

More information

Analysis of Web Archives. Vinay Goel Senior Data Engineer

Analysis of Web Archives. Vinay Goel Senior Data Engineer Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

StratoSphere Above the Clouds

StratoSphere Above the Clouds Stratosphere Parallel Analytics in the Cloud beyond Map/Reduce 14th International Workshop on High Performance Transaction Systems (HPTS) Poster Sessions, Mon Oct 24 2011 Thomas Bodner T.U. Berlin StratoSphere

More information

Deploying Hadoop with Manager

Deploying Hadoop with Manager Deploying Hadoop with Manager SUSE Big Data Made Easier Peter Linnell / Sales Engineer plinnell@suse.com Alejandro Bonilla / Sales Engineer abonilla@suse.com 2 Hadoop Core Components 3 Typical Hadoop Distribution

More information

Introduction To Hive

Introduction To Hive Introduction To Hive How to use Hive in Amazon EC2 CS 341: Project in Mining Massive Data Sets Hyung Jin(Evion) Kim Stanford University References: Cloudera Tutorials, CS345a session slides, Hadoop - The

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Recap. CSE 486/586 Distributed Systems Data Analytics. Example 1: Scientific Data. Two Questions We ll Answer. Data Analytics. Example 2: Web Data C 1

Recap. CSE 486/586 Distributed Systems Data Analytics. Example 1: Scientific Data. Two Questions We ll Answer. Data Analytics. Example 2: Web Data C 1 ecap Distributed Systems Data Analytics Steve Ko Computer Sciences and Engineering University at Buffalo PC enables programmers to call functions in remote processes. IDL (Interface Definition Language)

More information

COURSE CONTENT Big Data and Hadoop Training

COURSE CONTENT Big Data and Hadoop Training COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop

More information

Challenges for Data Driven Systems

Challenges for Data Driven Systems Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2

More information

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable

More information

How To Analyze Log Files In A Web Application On A Hadoop Mapreduce System

How To Analyze Log Files In A Web Application On A Hadoop Mapreduce System Analyzing Web Application Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environment Sayalee Narkhede Department of Information Technology Maharashtra Institute

More information

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015 COSC 6397 Big Data Analytics 2 nd homework assignment Pig and Hive Edgar Gabriel Spring 2015 2 nd Homework Rules Each student should deliver Source code (.java files) Documentation (.pdf,.doc,.tex or.txt

More information

Application Development. A Paradigm Shift

Application Development. A Paradigm Shift Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the

More information

Unlocking Hadoop for Your Rela4onal DB. Kathleen Ting @kate_ting Technical Account Manager, Cloudera Sqoop PMC Member BigData.

Unlocking Hadoop for Your Rela4onal DB. Kathleen Ting @kate_ting Technical Account Manager, Cloudera Sqoop PMC Member BigData. Unlocking Hadoop for Your Rela4onal DB Kathleen Ting @kate_ting Technical Account Manager, Cloudera Sqoop PMC Member BigData.be April 4, 2014 Who Am I? Started 3 yr ago as 1 st Cloudera Support Eng Now

More information

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the

More information

Big Data Analytics Platform @ Nokia

Big Data Analytics Platform @ Nokia Big Data Analytics Platform @ Nokia 1 Selecting the Right Tool for the Right Workload Yekesa Kosuru Nokia Location & Commerce Strata + Hadoop World NY - Oct 25, 2012 Agenda Big Data Analytics Platform

More information

Apache Hadoop Ecosystem

Apache Hadoop Ecosystem Apache Hadoop Ecosystem Rim Moussa ZENITH Team Inria Sophia Antipolis DataScale project rim.moussa@inria.fr Context *large scale systems Response time (RIUD ops: one hit, OLTP) Time Processing (analytics:

More information

BIG DATA SERIES: HADOOP DEVELOPER TRAINING PROGRAM. An Overview

BIG DATA SERIES: HADOOP DEVELOPER TRAINING PROGRAM. An Overview BIG DATA SERIES: HADOOP DEVELOPER TRAINING PROGRAM An Overview Contents Contents... 1 BIG DATA SERIES: HADOOP DEVELOPER TRAINING PROGRAM... 1 Program Overview... 4 Curriculum... 5 Module 1: Big Data: Hadoop

More information

Replicating to everything

Replicating to everything Replicating to everything Featuring Tungsten Replicator A Giuseppe Maxia, QA Architect Vmware About me Giuseppe Maxia, a.k.a. "The Data Charmer" QA Architect at VMware Previously at AB / Sun / 3 times

More information

Implement Hadoop jobs to extract business value from large and varied data sets

Implement Hadoop jobs to extract business value from large and varied data sets Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to

More information

Parallel Computing: Strategies and Implications. Dori Exterman CTO IncrediBuild.

Parallel Computing: Strategies and Implications. Dori Exterman CTO IncrediBuild. Parallel Computing: Strategies and Implications Dori Exterman CTO IncrediBuild. In this session we will discuss Multi-threaded vs. Multi-Process Choosing between Multi-Core or Multi- Threaded development

More information

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer timo.aaltonen@tut.fi Assistants: Henri Terho and Antti

More information

Large Scale Text Analysis Using the Map/Reduce

Large Scale Text Analysis Using the Map/Reduce Large Scale Text Analysis Using the Map/Reduce Hierarchy David Buttler This work is performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract

More information

HBase Schema Design. NoSQL Ma4ers, Cologne, April 2013. Lars George Director EMEA Services

HBase Schema Design. NoSQL Ma4ers, Cologne, April 2013. Lars George Director EMEA Services HBase Schema Design NoSQL Ma4ers, Cologne, April 2013 Lars George Director EMEA Services About Me Director EMEA Services @ Cloudera ConsulFng on Hadoop projects (everywhere) Apache Commi4er HBase and Whirr

More information

A programming model in Cloud: MapReduce

A programming model in Cloud: MapReduce A programming model in Cloud: MapReduce Programming model and implementation developed by Google for processing large data sets Users specify a map function to generate a set of intermediate key/value

More information

Big Data and Hadoop for the Executive A Reference Guide

Big Data and Hadoop for the Executive A Reference Guide Big Data and Hadoop for the Executive A Reference Guide Overview The amount of information being collected by companies today is incredible. Wal- Mart has 460 terabytes of data, which, according to the

More information

Big Data and Scripting Systems build on top of Hadoop

Big Data and Scripting Systems build on top of Hadoop Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform interactive execution of map reduce jobs Pig is the name of the system Pig Latin is the

More information

Hadoop & Spark Using Amazon EMR

Hadoop & Spark Using Amazon EMR Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?

More information

Apache Hadoop: The Pla/orm for Big Data. Amr Awadallah CTO, Founder, Cloudera, Inc. aaa@cloudera.com, twicer: @awadallah

Apache Hadoop: The Pla/orm for Big Data. Amr Awadallah CTO, Founder, Cloudera, Inc. aaa@cloudera.com, twicer: @awadallah Apache Hadoop: The Pla/orm for Big Data Amr Awadallah CTO, Founder, Cloudera, Inc. aaa@cloudera.com, twicer: @awadallah 1 The Problems with Current Data Systems BI Reports + Interac7ve Apps RDBMS (aggregated

More information

Manifest for Big Data Pig, Hive & Jaql

Manifest for Big Data Pig, Hive & Jaql Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,

More information

Architectures for Big Data Analytics A database perspective

Architectures for Big Data Analytics A database perspective Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum

More information

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas Apache Flink Next-gen data analysis Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research

More information

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools

More information

Peers Techno log ies Pv t. L td. HADOOP

Peers Techno log ies Pv t. L td. HADOOP Page 1 Peers Techno log ies Pv t. L td. Course Brochure Overview Hadoop is a Open Source from Apache, which provides reliable storage and faster process by using the Hadoop distibution file system and

More information

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi Getting Started with Hadoop Raanan Dagan Paul Tibaldi What is Apache Hadoop? Hadoop is a platform for data storage and processing that is Scalable Fault tolerant Open source CORE HADOOP COMPONENTS Hadoop

More information

Big Data Technology CS 236620, Technion, Spring 2013

Big Data Technology CS 236620, Technion, Spring 2013 Big Data Technology CS 236620, Technion, Spring 2013 Structured Databases atop Map-Reduce Edward Bortnikov & Ronny Lempel Yahoo! Labs, Haifa Roadmap Previous class MR Implementation This class Query Languages

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh miles@inf.ed.ac.uk October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples

More information

Graph Mining on Big Data System. Presented by Hefu Chai, Rui Zhang, Jian Fang

Graph Mining on Big Data System. Presented by Hefu Chai, Rui Zhang, Jian Fang Graph Mining on Big Data System Presented by Hefu Chai, Rui Zhang, Jian Fang Outline * Overview * Approaches & Environment * Results * Observations * Notes * Conclusion Overview * What we have done? *

More information

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12 Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using

More information

Unified Batch & Stream Processing Platform

Unified Batch & Stream Processing Platform Unified Batch & Stream Processing Platform Himanshu Bari Director Product Management Most Big Data Use Cases Are About Improving/Re-write EXISTING solutions To KNOWN problems Current Solutions Were Built

More information

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter : @carbone

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter : @carbone Hadoop2, Spark Big Data, real time, machine learning & use cases Cédric Carbone Twitter : @carbone Agenda Map Reduce Hadoop v1 limits Hadoop v2 and YARN Apache Spark Streaming : Spark vs Storm Machine

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required. What is this course about? This course is an overview of Big Data tools and technologies. It establishes a strong working knowledge of the concepts, techniques, and products associated with Big Data. Attendees

More information

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS...253 PART 4 BEYOND MAPREDUCE...385

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS...253 PART 4 BEYOND MAPREDUCE...385 brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 1 Hadoop in a heartbeat 3 2 Introduction to YARN 22 PART 2 DATA LOGISTICS...59 3 Data serialization working with text and beyond 61 4 Organizing and

More information

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15 Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 15 Big Data Management V (Big-data Analytics / Map-Reduce) Chapter 16 and 19: Abideboul et. Al. Demetris

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

Best Practices for Hadoop Data Analysis with Tableau

Best Practices for Hadoop Data Analysis with Tableau Best Practices for Hadoop Data Analysis with Tableau September 2013 2013 Hortonworks Inc. http:// Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Apache Hadoop with Hortonworks

More information

Big Data Analytics. Chances and Challenges. Volker Markl

Big Data Analytics. Chances and Challenges. Volker Markl Volker Markl Professor and Chair Database Systems and Information Management (DIMA), Technische Universität Berlin www.dima.tu-berlin.de Big Data Analytics Chances and Challenges Volker Markl DIMA BDOD

More information

MapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12

MapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12 MapReduce Algorithms A Sense of Scale At web scales... Mail: Billions of messages per day Search: Billions of searches per day Social: Billions of relationships 2 A Sense of Scale At web scales... Mail:

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

Cloud Computing. Lecture 24 Cloud Platform Comparison 2014-2015

Cloud Computing. Lecture 24 Cloud Platform Comparison 2014-2015 Cloud Computing Lecture 24 Cloud Platform Comparison 2014-2015 1 Up until now Introduction, Definition of Cloud Computing Pre-Cloud Large Scale Computing: Grid Computing Content Distribution Networks Cycle-Sharing

More information

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack Apache Spark Document Analysis Course (Fall 2015 - Scott Sanner) Zahra Iman Some slides from (Matei Zaharia, UC Berkeley / MIT& Harold Liu) Reminder SparkConf JavaSpark RDD: Resilient Distributed Datasets

More information

Complete Java Classes Hadoop Syllabus Contact No: 8888022204

Complete Java Classes Hadoop Syllabus Contact No: 8888022204 1) Introduction to BigData & Hadoop What is Big Data? Why all industries are talking about Big Data? What are the issues in Big Data? Storage What are the challenges for storing big data? Processing What

More information

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University 2014-05-12

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University 2014-05-12 Introduction to NoSQL Databases and MapReduce Tore Risch Information Technology Uppsala University 2014-05-12 What is a NoSQL Database? 1. A key/value store Basic index manager, no complete query language

More information