Big Data for the JVM developer. Costin Leau,

Save this PDF as:
 WORD  PNG  TXT  JPG

Size: px
Start display at page:

Download "Big Data for the JVM developer. Costin Leau, Elasticsearch @costinl"

Transcription

1 Big Data for the JVM developer Costin Leau,

2 Agenda Data Trends Data Pipelines JVM and Big Data Tool Eco-system

3 Data Landscape

4 Data Trends

5 Enterprise Data Trends

6 Enterprise Data Trends

7 Enterprise Data Trends

8 Enterprise Data Trends Unstructured data No predefined model Often doesn t fit in RDBMS Pre-Aggregated Data Computed during data collection Counters Running Averages

9 Cost Trends Big Iron: $40k/CPU Hardware cost halving every 18 months Commodity Cluster: $1k/CPU

10 Cost Trends Big Iron: $40k/CPU Hardware cost halving every 18 months Commodity Cluster: $1k/CPU

11 Value of Data Value from Data Exceeds Hardware & Software costs US retail 60+% increase in net margin possible % annual productivity growth

12 Big Data Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze A subjective and moving target Big data in many sectors today range from 10 s of TB to multiple PB

13 (Big) Data Pipeline

14 Big Data Pipeline Real Time Streams Real-Time Processing (s4, storm) Analytics ETL Real Time Structured Database (hbase, Gemfire, Cassandra) Big SQL (Greenplum, AsterData, Etc ) Batch Processing Unstructured Data (HDFS)

15 Data Pipeline Collect Transform RT Analysis Ingest Batch Analysis Distribute Use Unstructured Data in Big Data Filesystem RT Processing HDFS Collect Interactive Processing HBase Cassandra Elasticsearch Transform SQL BIG SQL Data Grid

16 Data Pipeline Collect Transform RT Analysis Ingest Batch Analysis Distribute Use Data Presentation Data Analytics Interactive Processing Batch Processing (Hadoop) HBase Cassandra Elasticsearch SQL BIG SQL Data Grids Unstructured Data in Big Data Filesystem HDFS

17 Taming Big Data

18 JVM as the platform Portable Fast Secure Rich eco-system Massive adoption in the enterprise

19 JVM as the platform Map Reduce Framework (M/R) Hadoop Distributed File System (HDFS)

20 Storage - HDFS Distributed Scalable Portable Data Aware Commodity hardware Unstructured Data (HDFS)

21 Computation Map/Reduce

22 Counting Words aka Hello World

23 Computation Map/Reduce public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(object key, Text value, Context context) throws IOException, InterruptedException { StringTokenizer itr = new StringTokenizer(value.toString()); while (itr.hasmoretokens()) { word.set(itr.nexttoken());context.write(word, one); }}} public class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> { private IntWritable result = new IntWritable(); public void reduce(text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result); } }

24 Hadoop Streaming $HADOOP_HOME/bin/hadoop jar \ hadoop-streaming.jar \ -input myinputdirs \ -output myoutputdir \ -mapper /bin/cat \ -reducer /bin/wc

25 Hadoop Streaming $HADOOP_HOME/bin/hadoop jar \ hadoop-streaming.jar \ -input myinputdirs \ -output myoutputdir \ -mapper parseline.py \ -reducer /bin/wc

26 Cascading Mid-level abstraction on top of M/R Hides M/R plumbing through building blocks Handles process planning and scheduling JVM based (Java, Clojure* and Scala*) * External projects to Cascading

27 Cascading Counting Words Scheme sourcescheme = new TextLine(new Fields("line")); Tap source = new Hfs(sourceScheme, inputpath); Scheme sinkscheme = new TextLine(new Fields("word", "count")); Tap sink = new Hfs(sinkScheme, outputpath, SinkMode.REPLACE); Pipe assembly = new Pipe("wordcount"); String regex = "(?<!\\pl)(?=\\pl)[^ ]*(?<=\\pl)(?!\\pl)"; Function function = new RegexGenerator(new Fields("word"), regex); assembly = new Each(assembly, new Fields("line ), function ); assembly = new GroupBy(assembly, new Fields("word ) ); Aggregator count = new Count(new Fields("count )); assembly = new Every(assembly, count);

28 Scalding Counting Words package com.twitter.scalding.examples import com.twitter.scalding._ class WordCountJob(args : Args) extends Job(args) { TextLine( args("input") ).flatmap('line -> 'word) { line : String => tokenize(line) }.groupby('word) { _.size }.write( Tsv( args("output") ) ) def tokenize(text : String) : Array[String] = { text.tolowercase.replaceall("[^a-za-z0-9\\s]", "").split("\\s+") } }

29 Cascalog Counting Words (ns count-words.core (:use cascalog.api) (:require [cascalog.ops :as c])) (defmapcatop split [^String sentence] (.split sentence "\\s+")) (defn wordcount-query [src] (<- [?word?count] (src?textline) (split?textline :>?word) (c/count?count)))

30 Apache Pig High-level abstraction on top of M/R Procedural ETL scripting language Extensible (Java, Python, Ruby or Groovy)

31 Apache Pig input_lines = LOAD '/tmp/books' AS (line:chararray); -- Extract words from each line and put them into a pig bag -- datatype, then flatten the bag to get one word on each row words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word; -- filter out any words that are just white spaces filtered_ words = FILTER words BY word MATCHES '\\w+'; -- create a group for each word word_groups = GROUP filtered_words BY word; -- count the entries in each group word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word; -- order the records by count ordered_word_count = ORDER word_count BY count DESC; STORE ordered_word_count INTO '/tmp/number-of-words';

32 Apache Hive SQL-like abstraction on top of M/R Allows basic ETL Extensible (Java, Python, Ruby or Groovy)

33 Counting Words Hive -- import the file as lines CREATE EXTERNAL TABLE lines(line string) LOAD DATA INPATH books OVERWRITE INTO TABLE lines; -- create a virtual view that splits the lines SELECT word, count(*) FROM lines LATERAL VIEW explode(split(text, )) ltable as word GROUP BY word;

34 Eco-system Oozie HBase Mahout Spring for Apache Hadoop Kafka Elasticsearch Storm

35 Big Data Pipeline Real Time Streams Real-Time Processing (s4, storm) Analytics ETL Real Time Structured Database (hbase, Gemfire, Cassandra) Big SQL (Greenplum, AsterData, Etc ) Batch Processing Unstructured Data (HDFS)

36 Wrapping up

37 Wrap-up Rich eco-system Variety of tools/frameworks/solutions they all run on the JVM both good & bad Be agile start small and grow organically Iterate over your design a lot Focus on data, not the tools

38

Building Big Data Pipelines using OSS. Costin Leau Staff Engineer VMware @CostinL

Building Big Data Pipelines using OSS. Costin Leau Staff Engineer VMware @CostinL Building Big Data Pipelines using OSS Costin Leau Staff Engineer VMware @CostinL Costin Leau Speaker Bio Spring committer since 2006 Spring Framework (JPA, @Bean, cache abstraction) Spring OSGi/Dynamic

More information

Fewer Hoops with Hadoop: Spring & Hadoop

Fewer Hoops with Hadoop: Spring & Hadoop Fewer Hoops with Hadoop: Spring & Hadoop Costin Leau, Staff Engineer (@costinl) 2009 VMware Inc. All rights reserved Goal of the talk Improve your Hadoop experience through a POJO-based programming model

More information

MAPREDUCE - COMBINERS

MAPREDUCE - COMBINERS MAPREDUCE - COMBINERS http://www.tutorialspoint.com/map_reduce/map_reduce_combiners.htm Copyright tutorialspoint.com A Combiner, also known as a semi-reducer, is an optional class that operates by accepting

More information

BIG DATA APPLICATIONS

BIG DATA APPLICATIONS BIG DATA ANALYTICS USING HADOOP AND SPARK ON HATHI Boyu Zhang Research Computing ITaP BIG DATA APPLICATIONS Big data has become one of the most important aspects in scientific computing and business analytics

More information

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future

More information

Xiaoming Gao Hui Li Thilina Gunarathne

Xiaoming Gao Hui Li Thilina Gunarathne Xiaoming Gao Hui Li Thilina Gunarathne Outline HBase and Bigtable Storage HBase Use Cases HBase vs RDBMS Hands-on: Load CSV file to Hbase table with MapReduce Motivation Lots of Semi structured data Horizontal

More information

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? 可 以 跟 資 料 庫 結 合 嘛? Can Hadoop work with Databases? 開 發 者 們 有 聽 到

More information

Tutorial- Counting Words in File(s) using MapReduce

Tutorial- Counting Words in File(s) using MapReduce Tutorial- Counting Words in File(s) using MapReduce 1 Overview This document serves as a tutorial to setup and run a simple application in Hadoop MapReduce framework. A job in Hadoop MapReduce usually

More information

hadoop Running hadoop on Grid'5000 Vinicius Cogo vielmo@lasige.di.fc.ul.pt Marcelo Pasin pasin@di.fc.ul.pt Andrea Charão andrea@inf.ufsm.

hadoop Running hadoop on Grid'5000 Vinicius Cogo vielmo@lasige.di.fc.ul.pt Marcelo Pasin pasin@di.fc.ul.pt Andrea Charão andrea@inf.ufsm. hadoop Running hadoop on Grid'5000 Vinicius Cogo vielmo@lasige.di.fc.ul.pt Marcelo Pasin pasin@di.fc.ul.pt Andrea Charão andrea@inf.ufsm.br Outline 1 Introduction 2 MapReduce 3 Hadoop 4 How to Install

More information

Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网. Information Management. Information Management IBM CDL Lab

Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网. Information Management. Information Management IBM CDL Lab IBM CDL Lab Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网 Information Management 2012 IBM Corporation Agenda Hadoop 技 术 Hadoop 概 述 Hadoop 1.x Hadoop 2.x Hadoop 生 态

More information

Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems

Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems Processing of massive data: MapReduce 2. Hadoop 1 MapReduce Implementations Google were the first that applied MapReduce for big data analysis Their idea was introduced in their seminal paper MapReduce:

More information

Hadoop and Eclipse. Eclipse Hawaii User s Group May 26th, 2009. Seth Ladd http://sethladd.com

Hadoop and Eclipse. Eclipse Hawaii User s Group May 26th, 2009. Seth Ladd http://sethladd.com Hadoop and Eclipse Eclipse Hawaii User s Group May 26th, 2009 Seth Ladd http://sethladd.com Goal YOU can use the same technologies as The Big Boys Google Yahoo (2000 nodes) Last.FM AOL Facebook (2.5 petabytes

More information

DRIVING INNOVATION THROUGH DATA ACCELERATING BIG DATA APPLICATION DEVELOPMENT WITH CASCADING

DRIVING INNOVATION THROUGH DATA ACCELERATING BIG DATA APPLICATION DEVELOPMENT WITH CASCADING DRIVING INNOVATION THROUGH DATA ACCELERATING BIG DATA APPLICATION DEVELOPMENT WITH CASCADING Supreet Oberoi VP Field Engineering, Concurrent Inc GET TO KNOW CONCURRENT Leader in Application Infrastructure

More information

Getting to know Apache Hadoop

Getting to know Apache Hadoop Getting to know Apache Hadoop Oana Denisa Balalau Télécom ParisTech October 13, 2015 1 / 32 Table of Contents 1 Apache Hadoop 2 The Hadoop Distributed File System(HDFS) 3 Application management in the

More information

LANGUAGES FOR HADOOP: PIG & HIVE

LANGUAGES FOR HADOOP: PIG & HIVE Friday, September 27, 13 1 LANGUAGES FOR HADOOP: PIG & HIVE Michail Michailidis & Patrick Maiden Friday, September 27, 13 2 Motivation Native MapReduce Gives fine-grained control over how program interacts

More information

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics In Organizations Mark Vervuurt Cluster Data Science & Analytics AGENDA 1. Yellow Elephant 2. Data Ingestion & Complex Event Processing 3. SQL on Hadoop 4. NoSQL 5. InMemory 6. Data Science & Machine Learning

More information

Hadoop Lab Notes. Nicola Tonellotto November 15, 2010

Hadoop Lab Notes. Nicola Tonellotto November 15, 2010 Hadoop Lab Notes Nicola Tonellotto November 15, 2010 2 Contents 1 Hadoop Setup 4 1.1 Prerequisites........................................... 4 1.2 Installation............................................

More information

Introduc8on to Apache Spark

Introduc8on to Apache Spark Introduc8on to Apache Spark Jordan Volz, Systems Engineer @ Cloudera 1 Analyzing Data on Large Data Sets Python, R, etc. are popular tools among data scien8sts/analysts, sta8s8cians, etc. Why are these

More information

http://glennengstrand.info/analytics/fp

http://glennengstrand.info/analytics/fp Functional Programming and Big Data by Glenn Engstrand (September 2014) http://glennengstrand.info/analytics/fp What is Functional Programming? It is a style of programming that emphasizes immutable state,

More information

Hadoop Framework. technology basics for data scientists. Spring - 2014. Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN

Hadoop Framework. technology basics for data scientists. Spring - 2014. Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Hadoop Framework technology basics for data scientists Spring - 2014 Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Warning! Slides are only for presenta8on guide We will discuss+debate addi8onal

More information

IBM Big Data Platform

IBM Big Data Platform Mike Winer IBM Information Management IBM Big Data Platform The big data opportunity Extracting insight from an immense volume, variety and velocity of data, in a timely and cost-effective manner. Variety:

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Hadoop Streaming. 2012 coreservlets.com and Dima May. 2012 coreservlets.com and Dima May

Hadoop Streaming. 2012 coreservlets.com and Dima May. 2012 coreservlets.com and Dima May 2012 coreservlets.com and Dima May Hadoop Streaming Originals of slides and source code for examples: http://www.coreservlets.com/hadoop-tutorial/ Also see the customized Hadoop training courses (onsite

More information

Mrs: MapReduce for Scientific Computing in Python

Mrs: MapReduce for Scientific Computing in Python Mrs: for Scientific Computing in Python Andrew McNabb, Jeff Lund, and Kevin Seppi Brigham Young University November 16, 2012 Large scale problems require parallel processing Communication in parallel processing

More information

The Hadoop Eco System Shanghai Data Science Meetup

The Hadoop Eco System Shanghai Data Science Meetup The Hadoop Eco System Shanghai Data Science Meetup Karthik Rajasethupathy, Christian Kuka 03.11.2015 @Agora Space Overview What is this talk about? Giving an overview of the Hadoop Ecosystem and related

More information

Using distributed technologies to analyze Big Data

Using distributed technologies to analyze Big Data Using distributed technologies to analyze Big Data Abhijit Sharma Innovation Lab BMC Software 1 Data Explosion in Data Center Performance / Time Series Data Incoming data rates ~Millions of data points/

More information

Enterprise Data Storage and Analysis on Tim Barr

Enterprise Data Storage and Analysis on Tim Barr Enterprise Data Storage and Analysis on Tim Barr January 15, 2015 Agenda Challenges in Big Data Analytics Why many Hadoop deployments under deliver What is Apache Spark Spark Core, SQL, Streaming, MLlib,

More information

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon.

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon. Building Scalable Big Data Infrastructure Using Open Source Software Sam William sampd@stumbleupon. What is StumbleUpon? Help users find content they did not expect to find The best way to discover new

More information

HDInsight Essentials. Rajesh Nadipalli. Chapter No. 1 "Hadoop and HDInsight in a Heartbeat"

HDInsight Essentials. Rajesh Nadipalli. Chapter No. 1 Hadoop and HDInsight in a Heartbeat HDInsight Essentials Rajesh Nadipalli Chapter No. 1 "Hadoop and HDInsight in a Heartbeat" In this package, you will find: A Biography of the author of the book A preview chapter from the book, Chapter

More information

Big Data Too Big To Ignore

Big Data Too Big To Ignore Big Data Too Big To Ignore Geert! Big Data Consultant and Manager! Currently finishing a 3 rd Big Data project! IBM & Cloudera Certified! IBM & Microsoft Big Data Partner 2 Agenda! Defining Big Data! Introduction

More information

Moving From Hadoop to Spark

Moving From Hadoop to Spark + Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com sujee@elephantscale.com Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee

More information

Housekeeping. You can download a copy of these slides by clicking on the Resources widget in the bottom dock.

Housekeeping. You can download a copy of these slides by clicking on the Resources widget in the bottom dock. Housekeeping Welcome to today s ACM Webinar. The presentation starts at the top of the hour. If you are experiencing any problems/issues, refresh your console by pressing the F5 key on your keyboard in

More information

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter : @carbone

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter : @carbone Hadoop2, Spark Big Data, real time, machine learning & use cases Cédric Carbone Twitter : @carbone Agenda Map Reduce Hadoop v1 limits Hadoop v2 and YARN Apache Spark Streaming : Spark vs Storm Machine

More information

HDP Hadoop From concept to deployment.

HDP Hadoop From concept to deployment. HDP Hadoop From concept to deployment. Ankur Gupta Senior Solutions Engineer Rackspace: Page 41 27 th Jan 2015 Where are you in your Hadoop Journey? A. Researching our options B. Currently evaluating some

More information

Internals of Hadoop Application Framework and Distributed File System

Internals of Hadoop Application Framework and Distributed File System International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop

More information

Distributed DataFrame on Spark: Simplifying Big Data For The Rest Of Us

Distributed DataFrame on Spark: Simplifying Big Data For The Rest Of Us DATA INTELLIGENCE FOR ALL Distributed DataFrame on Spark: Simplifying Big Data For The Rest Of Us Christopher Nguyen, PhD Co-Founder & CEO Agenda 1. Challenges & Motivation 2. DDF Overview 3. DDF Design

More information

Word Count Code using MR2 Classes and API

Word Count Code using MR2 Classes and API EDUREKA Word Count Code using MR2 Classes and API A Guide to Understand the Execution of Word Count edureka! A guide to understand the execution and flow of word count WRITE YOU FIRST MRV2 PROGRAM AND

More information

Big Data 2012 Hadoop Tutorial

Big Data 2012 Hadoop Tutorial Big Data 2012 Hadoop Tutorial Oct 19th, 2012 Martin Kaufmann Systems Group, ETH Zürich 1 Contact Exercise Session Friday 14.15 to 15.00 CHN D 46 Your Assistant Martin Kaufmann Office: CAB E 77.2 E-Mail:

More information

Connecting Hadoop with Oracle Database

Connecting Hadoop with Oracle Database Connecting Hadoop with Oracle Database Sharon Stephen Senior Curriculum Developer Server Technologies Curriculum The following is intended to outline our general product direction.

More information

HADOOP IN ENTERPRISE FUTURE-PROOF YOUR BIG DATA INVESTMENTS WITH CASCADING. Supreet Oberoi Nov. 4-6, 2014 Big Data Expo Santa Clara

HADOOP IN ENTERPRISE FUTURE-PROOF YOUR BIG DATA INVESTMENTS WITH CASCADING. Supreet Oberoi Nov. 4-6, 2014 Big Data Expo Santa Clara DRIVING INNOVATION THROUGH DATA HADOOP IN ENTERPRISE FUTURE-PROOF YOUR BIG DATA INVESTMENTS WITH CASCADING Supreet Oberoi Nov. 4-6, 2014 Big Data Expo Santa Clara ABOUT ME I am a Data Engineer, not a Data

More information

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi Getting Started with Hadoop Raanan Dagan Paul Tibaldi What is Apache Hadoop? Hadoop is a platform for data storage and processing that is Scalable Fault tolerant Open source CORE HADOOP COMPONENTS Hadoop

More information

Moving Faster: Why Intent Media Chose Cascalog for Data Processing and Machine Learning. Kurt Schrader May 20, 2014

Moving Faster: Why Intent Media Chose Cascalog for Data Processing and Machine Learning. Kurt Schrader May 20, 2014 Moving Faster: Why Intent Media Chose Cascalog for Data Processing and Machine Learning Kurt Schrader May 20, 2014 Overview History of data processing at Intent Media Hello World in various data processing

More information

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin

More information

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora {mbalassi, gyfora}@apache.org The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache

More information

Hadoop vs Apache Spark

Hadoop vs Apache Spark Innovate, Integrate, Transform Hadoop vs Apache Spark www.altencalsoftlabs.com Introduction Any sufficiently advanced technology is indistinguishable from magic. said Arthur C. Clark. Big data technologies

More information

Hadoop + Clojure. Hadoop World NYC Friday, October 2, 2009. Stuart Sierra, AltLaw.org

Hadoop + Clojure. Hadoop World NYC Friday, October 2, 2009. Stuart Sierra, AltLaw.org Hadoop + Clojure Hadoop World NYC Friday, October 2, 2009 Stuart Sierra, AltLaw.org JVM Languages Functional Object Oriented Native to the JVM Clojure Scala Groovy Ported to the JVM Armed Bear CL Kawa

More information

INTEGRATING R AND HADOOP FOR BIG DATA ANALYSIS

INTEGRATING R AND HADOOP FOR BIG DATA ANALYSIS INTEGRATING R AND HADOOP FOR BIG DATA ANALYSIS Bogdan Oancea "Nicolae Titulescu" University of Bucharest Raluca Mariana Dragoescu The Bucharest University of Economic Studies, BIG DATA The term big data

More information

MapReduce with Apache Hadoop Analysing Big Data

MapReduce with Apache Hadoop Analysing Big Data MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside gavin.heavyside@journeydynamics.com About Journey Dynamics Founded in 2006 to develop software technology to address the issues

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

Implement Hadoop jobs to extract business value from large and varied data sets

Implement Hadoop jobs to extract business value from large and varied data sets Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to

More information

Hadoop Overview. July 2011. Lavanya Ramakrishnan Iwona Sakrejda Shane Canon. Lawrence Berkeley National Lab

Hadoop Overview. July 2011. Lavanya Ramakrishnan Iwona Sakrejda Shane Canon. Lawrence Berkeley National Lab Hadoop Overview Lavanya Ramakrishnan Iwona Sakrejda Shane Canon Lawrence Berkeley National Lab July 2011 Overview Concepts & Background MapReduce and Hadoop Hadoop Ecosystem Tools on top of Hadoop Hadoop

More information

DRIVING INNOVATION THROUGH DATA USING CASCADING TO BUILD DATA-CENTRIC APPLICATIONS ON SPARK

DRIVING INNOVATION THROUGH DATA USING CASCADING TO BUILD DATA-CENTRIC APPLICATIONS ON SPARK DRIVING INNOVATION THROUGH DATA USING CASCADING TO BUILD DATA-CENTRIC APPLICATIONS ON SPARK Supreet Oberoi VP Field Engineering, Concurrent Inc GET TO KNOW CONCURRENT Leader in Application Infrastructure

More information

Big Data Infrastructure at Spotify

Big Data Infrastructure at Spotify Big Data Infrastructure at Spotify Wouter de Bie Team Lead Data Infrastructure June 12, 2013 2 Agenda Let s talk about Data Infrastructure, how we did it, what we learned and how we ve failed Some Context

More information

HPCHadoop: MapReduce on Cray X-series

HPCHadoop: MapReduce on Cray X-series HPCHadoop: MapReduce on Cray X-series Scott Michael Research Analytics Indiana University Cray User Group Meeting May 7, 2014 1 Outline Motivation & Design of HPCHadoop HPCHadoop demo Benchmarking Methodology

More information

BIG DATA TECHNOLOGY. Hadoop Ecosystem

BIG DATA TECHNOLOGY. Hadoop Ecosystem BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big

More information

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb

More information

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart Hadoop/MapReduce Object-oriented framework presentation CSCI 5448 Casey McTaggart What is Apache Hadoop? Large scale, open source software framework Yahoo! has been the largest contributor to date Dedicated

More information

Distributed Calculus with Hadoop MapReduce inside Orange Search Engine. mardi 3 juillet 12

Distributed Calculus with Hadoop MapReduce inside Orange Search Engine. mardi 3 juillet 12 Distributed Calculus with Hadoop MapReduce inside Orange Search Engine What is Big Data? $ 5 billions (2012) to $ 50 billions (by 2017) Forbes «Big Data is the new definitive source of competitive advantage

More information

Yahoo! Grid Services Where Grid Computing at Yahoo! is Today

Yahoo! Grid Services Where Grid Computing at Yahoo! is Today Yahoo! Grid Services Where Grid Computing at Yahoo! is Today Marco Nicosia Grid Services Operations marco@yahoo-inc.com What is Apache Hadoop? Distributed File System and Map-Reduce programming platform

More information

Workshop on Hadoop with Big Data

Workshop on Hadoop with Big Data Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly

More information

Hadoop: Understanding the Big Data Processing Method

Hadoop: Understanding the Big Data Processing Method Hadoop: Understanding the Big Data Processing Method Deepak Chandra Upreti 1, Pawan Sharma 2, Dr. Yaduvir Singh 3 1 PG Student, Department of Computer Science & Engineering, Ideal Institute of Technology

More information

Big Data and Data Science Grows Up. Ron Bodkin Founder & CEO Think Big Analy8cs ron.bodkin@xthinkbiganaly8cs..com

Big Data and Data Science Grows Up. Ron Bodkin Founder & CEO Think Big Analy8cs ron.bodkin@xthinkbiganaly8cs..com Big Data and Data Science Grows Up Ron Bodkin Founder & CEO Think Big Analy8cs ron.bodkin@xthinkbiganaly8cs..com 1 Source IDC 2 Hadoop Open Source Distributed Cluster SoGware Distributed file system Java-

More information

Big Data Analytics* Outline. Issues. Big Data

Big Data Analytics* Outline. Issues. Big Data Outline Big Data Analytics* Big Data Data Analytics: Challenges and Issues Misconceptions Big Data Infrastructure Scalable Distributed Computing: Hadoop Programming in Hadoop: MapReduce Paradigm Example

More information

Programming in Hadoop Programming, Tuning & Debugging

Programming in Hadoop Programming, Tuning & Debugging Programming in Hadoop Programming, Tuning & Debugging Venkatesh. S. Cloud Computing and Data Infrastructure Yahoo! Bangalore (India) Agenda Hadoop MapReduce Programming Distributed File System HoD Provisioning

More information

ITG Software Engineering

ITG Software Engineering Introduction to Apache Hadoop Course ID: Page 1 Last Updated 12/15/2014 Introduction to Apache Hadoop Course Overview: This 5 day course introduces the student to the Hadoop architecture, file system,

More information

Introduction to MapReduce and Hadoop

Introduction to MapReduce and Hadoop Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology jie.tao@kit.edu Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process

More information

CprE 419 Lab 6: Log File Analysis Using Pig

CprE 419 Lab 6: Log File Analysis Using Pig CprE 419 Lab 6: Log File Analysis Using Pig Department of Electrical and Computer Engineering Iowa State University Spring 2014 Purpose The goal is to use the Pig platform and the scripting language Pig

More information

Modeling Big Data Systems by Extending the Palladio Component Model

Modeling Big Data Systems by Extending the Palladio Component Model München, 2015-11-06 Modeling Big Data Systems by Extending the Palladio Component Model 6 th Symposium on Software Performance (SSP) 2015 Johannes Kroß 1, Andreas Brunnert 1, Helmut Krcmar 2 1 fortiss

More information

Data Lake In Action: Real-time, Closed Looped Analytics On Hadoop

Data Lake In Action: Real-time, Closed Looped Analytics On Hadoop 1 Data Lake In Action: Real-time, Closed Looped Analytics On Hadoop 2 Pivotal s Full Approach It s More Than Just Hadoop Pivotal Data Labs 3 Why Pivotal Exists First Movers Solve the Big Data Utility Gap

More information

Lambda Architecture. CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014

Lambda Architecture. CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014 Lambda Architecture CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014 1 Goals Cover the material in Chapter 8 of the Concurrency Textbook The Lambda Architecture Batch Layer MapReduce

More information

The Internet of Things and Big Data: Intro

The Internet of Things and Big Data: Intro The Internet of Things and Big Data: Intro John Berns, Solutions Architect, APAC - MapR Technologies April 22 nd, 2014 1 What This Is; What This Is Not It s not specific to IoT It s not about any specific

More information

Outline of Tutorial. Hadoop and Pig Overview Hands-on

Outline of Tutorial. Hadoop and Pig Overview Hands-on Outline of Tutorial Hadoop and Pig Overview Hands-on 1 Hadoop and Pig Overview Lavanya Ramakrishnan Shane Canon Lawrence Berkeley National Lab October 2011 Overview Concepts & Background MapReduce and

More information

An Overview of Hadoop

An Overview of Hadoop 1 / 26 An Overview of The Ohio State University Department of Linguistics 2 / 26 What is? is a software framework for scalable distributed computing 3 / 26 MapReduce Follows Google s MapReduce framework

More information

Big Data Workshop. dattamsha.com

Big Data Workshop. dattamsha.com Big Data Workshop About Praveen Has more than15 years of experience working on various technologies. Is a Cloudera Certified Developer for Apache Hadoop CDH4 (CCD-410) with 95% score and got through the

More information

I/O Considerations in Big Data Analytics

I/O Considerations in Big Data Analytics Library of Congress I/O Considerations in Big Data Analytics 26 September 2011 Marshall Presser Federal Field CTO EMC, Data Computing Division 1 Paradigms in Big Data Structured (relational) data Very

More information

Data Science Analytics & Research Centre

Data Science Analytics & Research Centre Data Science Analytics & Research Centre Data Science Analytics & Research Centre 1 Big Data Big Data Overview Characteristics Applications & Use Case HDFS Hadoop Distributed File System (HDFS) Overview

More information

Big Data Weather Analytics Using Hadoop

Big Data Weather Analytics Using Hadoop Big Data Weather Analytics Using Hadoop Veershetty Dagade #1 Mahesh Lagali #2 Supriya Avadhani #3 Priya Kalekar #4 Professor, Computer science and Engineering Department, Jain College of Engineering, Belgaum,

More information

Real-time Data Analytics mit Elasticsearch. Bernhard Pflugfelder inovex GmbH

Real-time Data Analytics mit Elasticsearch. Bernhard Pflugfelder inovex GmbH Real-time Data Analytics mit Elasticsearch Bernhard Pflugfelder inovex GmbH Bernhard Pflugfelder Big Data Engineer @ inovex Fields of interest: search analytics big data bi Working with: Lucene Solr Elasticsearch

More information

HADOOP SDJ INFOSOFT PVT LTD

HADOOP SDJ INFOSOFT PVT LTD HADOOP SDJ INFOSOFT PVT LTD DATA FACT 6/17/2016 SDJ INFOSOFT PVT. LTD www.javapadho.com Big Data Definition Big data is high volume, high velocity and highvariety information assets that demand cost

More information

HBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367

HBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367 HBase A Comprehensive Introduction James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367 Overview Overview: History Began as project by Powerset to process massive

More information

Introduction to Hadoop. Owen O Malley Yahoo Inc!

Introduction to Hadoop. Owen O Malley Yahoo Inc! Introduction to Hadoop Owen O Malley Yahoo Inc! omalley@apache.org Hadoop: Why? Need to process 100TB datasets with multiday jobs On 1 node: scanning @ 50MB/s = 23 days MTBF = 3 years On 1000 node cluster:

More information

Introduction to Hadoop. Owen O Malley Yahoo Inc!

Introduction to Hadoop. Owen O Malley Yahoo Inc! Introduction to Hadoop Owen O Malley Yahoo Inc! omalley@apache.org Hadoop: Why? Need to process 100TB datasets with multiday jobs On 1 node: scanning @ 50MB/s = 23 days MTBF = 3 years On 1000 node cluster:

More information

CSE-E5430 Scalable Cloud Computing Lecture 3

CSE-E5430 Scalable Cloud Computing Lecture 3 CSE-E5430 Scalable Cloud Computing Lecture 3 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 21.9-2015 1/25 Writing Hadoop Jobs Example: Assume

More information

Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

More information

Play with Big Data on the Shoulders of Open Source

Play with Big Data on the Shoulders of Open Source OW2 Open Source Corporate Network Meeting Play with Big Data on the Shoulders of Open Source Liu Jie Technology Center of Software Engineering Institute of Software, Chinese Academy of Sciences 2012-10-19

More information

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools

More information

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable

More information

HADOOP. Revised 10/19/2015

HADOOP. Revised 10/19/2015 HADOOP Revised 10/19/2015 This Page Intentionally Left Blank Table of Contents Hortonworks HDP Developer: Java... 1 Hortonworks HDP Developer: Apache Pig and Hive... 2 Hortonworks HDP Developer: Windows...

More information

What s Ahead for Big Data?

What s Ahead for Big Data? GOTO Chicago 2013 dean@concurren6hought.com @deanwampler polyglotprogramming.com/talks What s Ahead for Big Data? Copyright Dean Wampler, 2011-2013, All Rights Reserved. Photos can only be used with permission.

More information

Information Builders Mission & Value Proposition

Information Builders Mission & Value Proposition Value 10/06/2015 2015 MapR Technologies 2015 MapR Technologies 1 Information Builders Mission & Value Proposition Economies of Scale & Increasing Returns (Note: Not to be confused with diminishing returns

More information

Cloud Computing Era. Trend Micro

Cloud Computing Era. Trend Micro Cloud Computing Era Trend Micro Three Major Trends to Chang the World Cloud Computing Big Data Mobile 什 麼 是 雲 端 運 算? 美 國 國 家 標 準 技 術 研 究 所 (NIST) 的 定 義 : Essential Characteristics Service Models Deployment

More information

Unified Big Data Analytics Pipeline. 连 城 lian@databricks.com

Unified Big Data Analytics Pipeline. 连 城 lian@databricks.com Unified Big Data Analytics Pipeline 连 城 lian@databricks.com What is A fast and general engine for large-scale data processing An open source implementation of Resilient Distributed Datasets (RDD) Has an

More information

Big Data and Data Science: Behind the Buzz Words

Big Data and Data Science: Behind the Buzz Words Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing

More information

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas Apache Flink Next-gen data analysis Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research

More information

Big Data Explained. An introduction to Big Data Science.

Big Data Explained. An introduction to Big Data Science. Big Data Explained An introduction to Big Data Science. 1 Presentation Agenda What is Big Data Why learn Big Data Who is it for How to start learning Big Data When to learn it Objective and Benefits of

More information

ITG Software Engineering

ITG Software Engineering Introduction to Cloudera Course ID: Page 1 Last Updated 12/15/2014 Introduction to Cloudera Course : This 5 day course introduces the student to the Hadoop architecture, file system, and the Hadoop Ecosystem.

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

Running Hadoop at Stirling

Running Hadoop at Stirling Running Hadoop at Stirling Kevin Swingler Summary The Hadoopserver in CS @ Stirling A quick intoduction to Unix commands Getting files in and out Compliing your Java Submit a HadoopJob Monitor your jobs

More information