STOCHASTIC ANALYTICS. Chris Jermaine Rice University

Size: px
Start display at page:

Download "STOCHASTIC ANALYTICS. Chris Jermaine Rice University cmj4@cs.rice.edu"

Transcription

1 STOCHASTIC ANALYTICS Chris Jermaine Rice University 1

2 What Is Meant by Stochastic Analytics? 2

3 What Is Meant by Stochastic Analytics? It is actually a relatively little-used term (to date!) It means tackling (Big Data) analytic tasks using stochastic models Will now define stochastic models and (Big Data) analytics 3

4 First: What Is a Stochastic Model? A model for some slice of reality which has a random component Here random means probabilistic There is typically a distribution over possible inputs and/or outcomes Why utilize randomness? 4

5 What Is a Stochastic Model? A model for some slice of reality which has a random component Here random means probabilistic There is typically a distribution over possible inputs and/or outcomes Why utilize randomness? Randomness provides a way to model uncertainty Have a data record describing J. Smith Both John Smith and Jane Smith are people in your data set Record a 50/50 chance of J. Smith referring to John/Jane 5

6 What Is a Stochastic Model? A model for some slice of reality which has a random component Here random means probabilistic There is typically a distribution over possible inputs and/or outcomes Why utilize randomness? Randomness provides a way to model uncertainty Provides a way to model missing data Missing a person s gender? 52% of people in the data set are women... Represent the gender via a distribution (52% female, 48% male) 6

7 What Is a Stochastic Model? A model for some slice of reality which has a random component Here random means probabilistic There is typically a distribution over possible inputs and/or outcomes Why utilize randomness? Randomness provides a way to model uncertainty Provides a way to model missing data Provides for a principled way to talk about beliefs I am 99.5% sure that this is gonna be a great presentation! 7

8 What Is a Stochastic Model? A model for some slice of reality which has a random component Here random means probabilistic There is typically a distribution over possible inputs and/or outcomes Why utilize randomness? Randomness provides a way to model uncertainty Provides a way to model missing data Provides for a principled way to talk about beliefs One great thing about randomness: The theory already exists Can leverage 100s of years of probability theory 8

9 What Are Big Data Analytics? The process of extracting useful (actionable) knowledge from a very large data set (or sets) 9

10 What Are Big Data Analytics? The process of extracting useful (actionable) knowledge from a very large data set (or sets) Examples: Segmenting/clustering customers for data reduction/viz Learning rules from data: beer implies diapers Learning to recognize Higgs boson in the aftermath of a collision 10

11 What Are Big Data Analytics? The process of extracting useful (actionable) knowledge from a very large data set (or sets) Back in the day (that is, the 1990 s), related buzzwords were: Data warehousing Building a large database where operational data go to retire On-Line Analytic Processing (OLAP) Combination data model / viz tool / query interface for a multidimensional database Data Mining The application of scalable intelligent knowledge discovery algorithms to data One of the pioneers in data mining (Usama Fayyad) worked at JPL 11

12 What Are Big Data Analytics? Before 2000s, many tools for Big Data analytics were DB-centric That is, they worked in concert with or were based upon RDBMS technology 12

13 What Are Big Data Analytics? What changed in early 2000s? 13

14 What Are Big Data Analytics? What changed in early 2000s? The rise of the big Internet company, esp. Google, Yahoo! The big Internet companies were never enamored of RDBMS tech 14

15 What Are Big Data Analytics? What changed in early 2000s? The rise of the big Internet company, esp. Google, Yahoo! The big Internet companies were never enamored of RDBMS tech Felt it was too inflexible Difficult to implement things like PageRank Didn t lend itself to analyzing sequence-based user data Bad for analyzing graphs Many other annoyances 15

16 What Are Big Data Analytics? What changed in early 2000s? The rise of the big Internet company, esp. Google, Yahoo! The big Internet companies were never enamored of RDBMS tech Felt it was too inflexible Too tied to proprietary hardware Too slow, didn t scale to 1000 s of machines By early-mid 2000s......the early Internet companies began to publicize the infrastructure they d built Most influential idea was MapReduce See for the seminal 2004 paper on MR In last 5 years or so, an entire NoSQL ecosystem for BDA has sprung up Tools include: Hadoop, Cassandra, HBase, Pig, Hive... 16

17 Soooo... We ve defined stochastic models... We ve defined Big Data analytics What is meant by Stochastic Analytics? The application of stochastic models/methods to problems in Big Data analytics 17

18 Soooo... We ve defined stochastic models... We ve defined Big Data analytics What is meant by Stochastic Analytics? The application of stochastic models/methods to problems in Big Data analytics With the wrinkle that the result of the analysis itself is stochastic/random Of key importance because it gives analyst an idea of risk/uncertainty of result 18

19 Soooo... We ve defined stochastic models... We ve defined Big Data analytics What is meant by Stochastic Analytics? The application of stochastic models/methods to problems in Big Data analytics With the wrinkle that the result of the analysis itself is stochastic/random Of key importance because it gives analyst an idea of risk/uncertainty of result I ll be illustrating this with an example But first a quick crash course on Bayesian statistics Important because the Bayesian approach underlies much of stochastic analytics 19

20 Bayesian Statistics Imagine that we come up with an exam for a college course Want to run a trial with a few students Goal is to see if this exam is too hard or too easy That is, want to say something about what we think the mean exam score will be And what the spread (or variance) of the score will be 20

21 So How Would a Bayesian Do This? First imagine a stochastic generative process for the data For the ith student, we might imagine score i ~ Normal ( μ, σ 2 ) ultimate goal is to guess the value of mu Means is sampled from a RV with a normal dist. Why Normal? Models typical bell shaped data Assume μ, σ 2 are also generated by sampling from RVs, and are unknown 21

22 So How Would a Bayesian Do This? How is the mean generated? We imagine μ ~ Normal (70, 100) Why Normal (70, 100)? Allows for all possible, reasonable exam averages 22

23 So How Would a Bayesian Do This? How is the variance (spread) generated? We imagine ~ InverseGamma (1, 1) σ 2 Why InverseGamma (1, 1)? Allows a really large range of positive vals σ 2 23

24 Thus, Our Generative Process Is Step 1: ~ InverseGamma (1, 1) σ 2 Step 2: μ ~ Normal (75, 100) Step 3: for each i, score i ~ Normal ( μ, σ 2 ) 24

25 Why Have a Generative Process? Given gen process, can use PDF to measure how likely a given pair is Specifically, p( μ, σ 2 ) = IG ( σ 2 1, 1) Normal ( μ 75, 100) So given any μ, σ 2 combo, we can say how likely we think it is This is our prior on mean and variance Note times cause the two values are indep. 25

26 Bayesian Inference To a Bayesian, inference is then the process of updating prior expectations after seeing the data After we see the data: p( μ, σ 2 data) p( data μσ, 2 ) p( μσ, 2 ) p( data) This is equation follows directly from Bayes Theorem Note: this is also a distribution = Known as a posterior distribution test scores 26

27 Pictorially You have a prior distribution on mean and variance: sigma 2 (variance of test scores) mu (average of test scores) 27

28 Pictorially You see some test scores sigma 2 (variance of test scores) mu (average of test scores) seem to average in the high 70 s student 1 student 2 student 3 student 4 student 5 student 6 28

29 Pictorially Then use Bayes to get a posterior distribution on mean and var: sigma 2 (variance of test scores) mu (average of test scores) much narrower after seeing data! 29

30 That s the Bayesian Approach Come up with a generative process Which includes prior distributions on the quantities like to est In our example, the variance and especially the mean See some data Use Bayes Theorem and data to update the priors This gives you a posterior dist The posterior is your estimate Why attractive for use in Big Data analytics? Answer is a dist So it gives you some idea of uncertainty/risk of inferences made using the data 30

31 Now Let s Look At an Example We have archived billions of sales records and want to know: What would my profits have been in 08 if I d cut all of my margins by 10%? How might we use archived data to guess this answer? Need to guess each customer s demand at new price Use to build a hypothetical, revised version our table storing sales in 2008 Finally, query this hypothetical table Here s one possibility......utilizing the Bayesian approach 31

32 Step 1: A Stochastic Demand Model First, define a customer demand model... Quantity purchased For a given customer, demand is a linear function (I know, could do better than linear...) Price of part A 32

33 Step 1: A Stochastic Demand Model First, define a customer demand model... Quantity purchased D 0 Gamma( k D, θ D ) P 0 Price of part A Gamma( k P, θ P ) Demand curve is generated via samples from twin Gamma distributions 33

34 Step 1: A Stochastic Demand Model First, define a customer demand model... Quantity purchased D 0 Gamma( k D, θ D ) P 0 Price of part A Gamma( k P, θ P ) Distribution over D 0, P 0 defines a whole family of possible demand curves... 34

35 Step 1: A Stochastic Demand Model First, define a customer demand model... Quantity purchased D 0 Gamma( k D, θ D ) P 0 Price of part A Gamma( k P, θ P ) Here s a new, possible demand curve... 35

36 Step 1: A Stochastic Demand Model First, define a customer demand model... Quantity purchased D 0 Gamma( k D, θ D ) P 0 Price of part A Gamma( k P, θ P ) And another... 36

37 Step 1: A Stochastic Demand Model First, define a customer demand model... Quantity purchased D 0 Gamma( k D, θ D ) P 0 Price of part A Gamma( k P, θ P ) And so on... 37

38 Step 1: A Stochastic Demand Model First, define a customer demand model... Quantity purchased D 0 Gamma( k D, θ D ) P 0 Price of part A Gamma( k P, θ P ) The PDF Ff (.) = Gamma( D 0 k D, θ D ) Gamma( P 0 k P, θ P ) is our prior distribution over demand curves 38

39 Step 2: Learn the Model To apply this model, we need to learn the prior That is, choose k P, k D, θ P, θ D to be realistic for sales in 2008 Reasonable tactic: use the warehoused data to perform an MLE That is, find the set of params that best describes all of the 2008 sales Do this by issuing computations over the warehouse 39

40 Step 3: Apply the Model - the Theory Now we have a prior model (PDF) for demand function f: Ff ( k P, k D, θ P, θ D ) Problem: the actual demand curve for each customer is unseen But can use observed demand to obtain a posterior demand model Ex: for ith sale, a customer bought d units at price p Then posterior demand model is given by: Ff ( i k P, k D, θ P, θ D, f i ( p) = d) 40

41 Step 3: Apply the Model - the Theory Intuitively, F(f i f i (p) = d) gives non-zero weight to all demand functions thru the point (p, d) 41

42 Step 3: Apply the Model - the Theory Intuitively, F(f i f i (p) = d) gives non-zero weight to all demand functions thru the point (p, d) Quantity purchased (p, d) So F accepts this one... Price of part A 42

43 Step 3: Apply the Model - the Theory Intuitively, F(f i f i (p) = d) gives non-zero weight to all demand functions thru the point (p, d) Quantity purchased (p, d) And this one... Price of part A 43

44 Step 3: Apply the Model - the Theory Intuitively, F(f i f i (p) = d) gives non-zero weight to all demand functions thru the point (p, d) Quantity purchased (p, d) And this one... Price of part A 44

45 Step 3: Apply the Model - the Theory Intuitively, F(f i f i (p) = d) gives non-zero weight to all demand functions thru the point (p, d) Quantity purchased (p, d) And this one... Price of part A 45

46 Step 3: Apply the Model - the Theory Intuitively, F(f i f i (p) = d) gives non-zero weight to all demand functions thru the point (p, d) Quantity purchased (p, d) But not this one Price of part A 46

47 Step 3: Apply the Model - the Theory Intuitively, F(f i f i (p) = d) gives non-zero weight to all demand functions thru the point (p, d) Quantity purchased (p, d) But not this one Price of part A Those demand curves that are given non-zero weight are ordered exactly as the prior would order them 47

48 Step 3: Apply the Model - the Theory Intuitively, F(f i f i (p) = d) gives non-zero weight to all demand functions thru the point (p, d) Quantity purchased (p, d) But not this one Price of part A Those demand curves that are given non-zero weight are ordered exactly as the prior would order them To guess the customer s demand at new price p : Sample a possible f i from the PDF Ff ( i k P, k D, θ P, θ D, f i ( p) = d) Compute f i ( p ), and we have a new demand! 48

49 Step 3: Apply the Model - the Practice To actually compute the overall profit under new prices: I d need to sample an f i from F(f i f i (p) = d) for every sale from 2008 Evaluate f i ( p ) for all those sales Compute the new profits 49

50 Step 3: Apply the Model - the Practice Issue: the profits are actually random You do this once, you get one answer You do this again, you get another answer How to handle this? Redo the computation many times (Monte Carlo) to obtain a distribution of results number of obs negative positive additional profits 0 1M don t do it! 50

51 Illustrative of Stochastic Analytics Begins with a stochastic model Model is applied at very fine granularity to big data set In contrast to summarize-then-analyze approach Result of analysis is a distribution, not a single result Often obtained using Bayesian techniques 51

52 Why Is It Important to Get a Distribution? Gives the final consumer a picture of uncertainty/risk Especially when applying Bayesian approach... lack of data comes through as very wide tails People increasingly aware of pitfalls of not quantifying risk mean:-$20m... would YOU choose this? -100M 0 100M... 1B loss 52

53 Systematic Quant. of Uncertainty/Risk Worthwhile polemic is Sam Savage (Stanford) Flaw of Averages Savage describes two f/laws : Flaw of averages (weak form): Flaw of averages (strong form): Mean correct, Variance ignored Sam Savage s book (why we underestimate risk) Wrong value of mean: f(e[x]) E[f(X)] 21 53

54 Another Example: Linear Regression 54

55 Another Example: Linear Regression In linear regression, have a number of regressors and outcomes Linear regression attempts to fit a line to these income (outcome) age (regressor) 55

56 Another Example: Linear Regression In linear regression, have a number of regressors and outcomes Linear regression attempts to fit a line to these income (outcome) age (regressor) 56

57 Another Example: Linear Regression In linear regression, have a number of regressors and outcomes Linear regression attempts to fit a line to these income (outcome) age (regressor) known age predicted income So as to predict the outcomes associated with a new data point 57

58 Another Example: Linear Regression In linear regression, have a number of regressors and outcomes Linear regression attempts to fit a line to these income (outcome) age (regressor) known age predicted income So as to predict the outcomes associated with a new data point Widely used in practice, esp. in Big Data analytics But typically LS version is applied; not the stochastic/bayesian variant 58

59 Bayesian Linear Regression Imagine the following stochastic generative process First, generate the variability of regression coefs, For each regressor, the regression coef c i ~ Laplace( b) v ~ InvGamma( 1, 1) Then generate the variance of the prediction error, σ 2 ~ InvGamma( 1, 1) Finally, for a data point x 0 = 1, x 1, x 2, x n, outcome y ~ Normal x i c i, σ 2 i 59

60 Bayesian Linear Regression Imagine the following stochastic generative process First, generate the variability of regression coefs, For each regressor, the regression coef c i ~ Laplace( b) v ~ InvGamma( 1, 1) Then generate the variance of the prediction error, σ 2 ~ InvGamma( 1, 1) Finally, for a data point x 0 = 1, x 1, x 2, x n, outcome y ~ Normal x i c i, σ 2 i income (outcome) intercept is c slope is c 0 1 age (regressor) 60

61 Bayesian Linear Regression Imagine the following stochastic generative process First, generate the variability of regression coefs, For each regressor, the regression coef c i ~ Laplace( b) v ~ InvGamma( 1, 1) Then generate the variance of the prediction error, σ 2 ~ InvGamma( 1, 1) Finally, for a data point x 0 = 1, x 1, x 2, x n, outcome y ~ Normal x i c i, σ 2 i income (outcome) given x 1 age (regressor) 61

62 Bayesian Linear Regression Imagine the following stochastic generative process First, generate the variability of regression coefs, For each regressor, the regression coef c i ~ Laplace( b) v ~ InvGamma( 1, 1) Then generate the variance of the prediction error, σ 2 ~ InvGamma( 1, 1) Finally, for a data point x 0 = 1, x 1, x 2, x n, outcome y ~ Normal x i c i, σ 2 i income (outcome) y is normally distributed around this point age (regressor) 62

63 Bayesian Linear Regression Imagine the following stochastic generative process First, generate the variability of regression coefs, For each regressor, the regression coef c i ~ Laplace( b) v ~ InvGamma( 1, 1) Then generate the variance of the prediction error, σ 2 ~ InvGamma( 1, 1) Finally, for a data point x 0 = 1, x 1, x 2, x n, outcome y ~ Normal x i c i, σ 2 i income (outcome) We imagine entire data set is generated in this way... age (regressor) 63

64 Bayesian Linear Regression Imagine the following stochastic generative process First, generate the variability of regression coefs, For each regressor, the regression coef c i ~ Laplace( b) v ~ InvGamma( 1, 1) Then generate the variance of the prediction error, σ 2 ~ InvGamma( 1, 1) Finally, for a data point x 0 = 1, x 1, x 2, x n, outcome y ~ Normal x i c i, σ 2 As an aside: why use Laplace distribution? i 64

65 Bayesian Linear Regression Then, imagine we have a huge data set... income (outcome)...we imagine generated using unknown params age (regressor) 65

66 Bayesian Linear Regression Then, imagine we have a huge data set... income (outcome) age (regressor) use Bayes rule to get posterior distribution on coefs (Challenging for high-d, Big Data!) 66

67 Bayesian Linear Regression Then, imagine we have a huge data set... income (outcome) given x 1 age (regressor) can use that distribution to generate a distribution of outcomes given a new data point 67

68 Bayesian Linear Regression income (outcome) age (regressor) Why is this Stochastic Analytics? We are using a stochastic model In conjunction with a very large data set To perform a (toy) analytic task: predicting income of a person given an age And we obtain information re the uncertainty of our prediction 68

69 Soooo... We ve seen two examples illustrating stochastic analytics Many more exist Almost any analytic task can be made stochastic by applying appropriate model Likely to grow in importance as application of ML to Big Data becomes common Why? The Bayesian approach is common in modern ML......natural synergy between SA and the Bayesian approach 69

70 Epilogue: How Might One Implement This? Well, you could use our system: MCDB/SimSQL MCDB/SimSQL is itself built on Hadoop/MapReduce That is, a high level spec in MCDB/SimSQL ends up running on Hadoop Apache Hadoop is the de-facto open-source implementation of MapReduce 70

71 What Is MapReduce? It is a simple Big Data processing paradigm To process a data set, you have two pieces of user-supplied code: A map code And a reduce code These are run (potentially over a very large compute cluster built using commodity hardware) using three data processing phases A map phase A shuffle phase And a reduce phase 71

72 The Map Phase Assume that the input data are stored in a huge file This file contains a simple list of pairs of type (key1,value1) And assume we have a user-supplied function of the form map (key1,value1) That outputs a list of pairs of the form (key2, value2) In the map phase of the MapReduce computation this map function is called for every record in the input data set Instances of map run in parallel all over the compute cluster 72

73 The Reduce Phase Assume we have a user-supplied function of the form reduce (key2,list <value2>) That outputs a list of value2 objects In the reduce phase of the MapReduce computation this reduce function is called for every key2 value output by the shuffle Instances of reduce run in parallel all over the compute cluster The output of all of those instances is collected in a (potentially) huge output file 73

74 The Shuffle Phase The shuffle phase accepts all of the (key2, value2) pairs from the map phase And it groups them together So that all of the pairs From all over the cluster Having the same key2 value Are merged into a single (key2, list <value2>) pair Called a shuffle because this is where a potential all-to-all data transfer happens 74

75 Why Do People Like MapReduce? Can specify a computation (such as WordCount) In a few lines of code (Java in the case of Hadoop) MapReduce platform takes care of distributing to 1000 s of cores Great for data parallel tasks Those where have a lot of data and can move computation to data Codes can be much simpler than codes in other DP paradigms 75

76 Questions? 76

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Advanced Big Data Analytics with R and Hadoop

Advanced Big Data Analytics with R and Hadoop REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

More information

BIG DATA What it is and how to use?

BIG DATA What it is and how to use? BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

More information

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required. What is this course about? This course is an overview of Big Data tools and technologies. It establishes a strong working knowledge of the concepts, techniques, and products associated with Big Data. Attendees

More information

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12 Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using

More information

How To Handle Big Data With A Data Scientist

How To Handle Big Data With A Data Scientist III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution

More information

Architectures for Big Data Analytics A database perspective

Architectures for Big Data Analytics A database perspective Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum

More information

ANALYTICS CENTER LEARNING PROGRAM

ANALYTICS CENTER LEARNING PROGRAM Overview of Curriculum ANALYTICS CENTER LEARNING PROGRAM The following courses are offered by Analytics Center as part of its learning program: Course Duration Prerequisites 1- Math and Theory 101 - Fundamentals

More information

How To Scale Out Of A Nosql Database

How To Scale Out Of A Nosql Database Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI

More information

Big Data and Data Science: Behind the Buzz Words

Big Data and Data Science: Behind the Buzz Words Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

Business Intelligence for Big Data

Business Intelligence for Big Data Business Intelligence for Big Data Will Gorman, Vice President, Engineering May, 2011 2010, Pentaho. All Rights Reserved. www.pentaho.com. What is BI? Business Intelligence = reports, dashboards, analysis,

More information

OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP

OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP 1 KALYANKUMAR B WADDAR, 2 K SRINIVASA 1 P G Student, S.I.T Tumkur, 2 Assistant Professor S.I.T Tumkur Abstract- Product Review System

More information

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin

More information

PS 271B: Quantitative Methods II. Lecture Notes

PS 271B: Quantitative Methods II. Lecture Notes PS 271B: Quantitative Methods II Lecture Notes Langche Zeng zeng@ucsd.edu The Empirical Research Process; Fundamental Methodological Issues 2 Theory; Data; Models/model selection; Estimation; Inference.

More information

Application Development. A Paradigm Shift

Application Development. A Paradigm Shift Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the

More information

Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing

Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing Big Data Rethink Algos and Architecture Scott Marsh Manager R&D Personal Lines Auto Pricing Agenda History Map Reduce Algorithms History Google talks about their solutions to their problems Map Reduce:

More information

Big Data Big Deal? Salford Systems www.salford-systems.com

Big Data Big Deal? Salford Systems www.salford-systems.com Big Data Big Deal? Salford Systems www.salford-systems.com 2015 Copyright Salford Systems 2010-2015 Big Data Is The New In Thing Google trends as of September 24, 2015 Difficult to read trade press without

More information

MapReduce with Apache Hadoop Analysing Big Data

MapReduce with Apache Hadoop Analysing Big Data MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside gavin.heavyside@journeydynamics.com About Journey Dynamics Founded in 2006 to develop software technology to address the issues

More information

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social

More information

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Big Data Too Big To Ignore

Big Data Too Big To Ignore Big Data Too Big To Ignore Geert! Big Data Consultant and Manager! Currently finishing a 3 rd Big Data project! IBM & Cloudera Certified! IBM & Microsoft Big Data Partner 2 Agenda! Defining Big Data! Introduction

More information

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Cloudera Certified Developer for Apache Hadoop

Cloudera Certified Developer for Apache Hadoop Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number

More information

Big Data and Market Surveillance. April 28, 2014

Big Data and Market Surveillance. April 28, 2014 Big Data and Market Surveillance April 28, 2014 Copyright 2014 Scila AB. All rights reserved. Scila AB reserves the right to make changes to the information contained herein without prior notice. No part

More information

Towards running complex models on big data

Towards running complex models on big data Towards running complex models on big data Working with all the genomes in the world without changing the model (too much) Daniel Lawson Heilbronn Institute, University of Bristol 2013 1 / 17 Motivation

More information

White Paper: What You Need To Know About Hadoop

White Paper: What You Need To Know About Hadoop CTOlabs.com White Paper: What You Need To Know About Hadoop June 2011 A White Paper providing succinct information for the enterprise technologist. Inside: What is Hadoop, really? Issues the Hadoop stack

More information

The Internet of Things and Big Data: Intro

The Internet of Things and Big Data: Intro The Internet of Things and Big Data: Intro John Berns, Solutions Architect, APAC - MapR Technologies April 22 nd, 2014 1 What This Is; What This Is Not It s not specific to IoT It s not about any specific

More information

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib

More information

Bayesian networks - Time-series models - Apache Spark & Scala

Bayesian networks - Time-series models - Apache Spark & Scala Bayesian networks - Time-series models - Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly

More information

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford SQL VS. NO-SQL Adapted Slides from Dr. Jennifer Widom from Stanford 55 Traditional Databases SQL = Traditional relational DBMS Hugely popular among data analysts Widely adopted for transaction systems

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

Workshop on Hadoop with Big Data

Workshop on Hadoop with Big Data Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly

More information

Testing Big data is one of the biggest

Testing Big data is one of the biggest Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

Speak<geek> Tech Brief. RichRelevance Distributed Computing: creating a scalable, reliable infrastructure

Speak<geek> Tech Brief. RichRelevance Distributed Computing: creating a scalable, reliable infrastructure 3 Speak Tech Brief RichRelevance Distributed Computing: creating a scalable, reliable infrastructure Overview Scaling a large database is not an overnight process, so it s difficult to plan and implement

More information

Big Data Analysis and HADOOP

Big Data Analysis and HADOOP Big Data Analysis and HADOOP B.Jegatheswari and M.Muthulakshmi III year MCA AVC College of engineering, Mayiladuthurai. Email ID: jjega.cool@gmail.com Mobile: 8220380693 Abstract: - Digital universe with

More information

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani Technical Architect - Big Data Syntel Agenda Welcome to the Zoo! Evolution Timeline Traditional BI/DW Architecture Where Hadoop Fits In 2 Welcome to

More information

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume

More information

Integrating Hadoop. Into Business Intelligence & Data Warehousing. Philip Russom TDWI Research Director for Data Management, April 9 2013

Integrating Hadoop. Into Business Intelligence & Data Warehousing. Philip Russom TDWI Research Director for Data Management, April 9 2013 Integrating Hadoop Into Business Intelligence & Data Warehousing Philip Russom TDWI Research Director for Data Management, April 9 2013 TDWI would like to thank the following companies for sponsoring the

More information

L1: Introduction to Hadoop

L1: Introduction to Hadoop L1: Introduction to Hadoop Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: December 1, 2014 Today we are going to learn... 1 General

More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

Accelerating and Simplifying Apache

Accelerating and Simplifying Apache Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly

More information

BIG DATA IS MESSY PARTNER WITH SCALABLE

BIG DATA IS MESSY PARTNER WITH SCALABLE BIG DATA IS MESSY PARTNER WITH SCALABLE SCALABLE SYSTEMS HADOOP SOLUTION WHAT IS BIG DATA? Each day human beings create 2.5 quintillion bytes of data. In the last two years alone over 90% of the data on

More information

Oracle Big Data SQL Technical Update

Oracle Big Data SQL Technical Update Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical

More information

Using Tableau Software with Hortonworks Data Platform

Using Tableau Software with Hortonworks Data Platform Using Tableau Software with Hortonworks Data Platform September 2013 2013 Hortonworks Inc. http:// Modern businesses need to manage vast amounts of data, and in many cases they have accumulated this data

More information

Cloud Computing at Google. Architecture

Cloud Computing at Google. Architecture Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

More information

HBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367

HBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367 HBase A Comprehensive Introduction James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367 Overview Overview: History Began as project by Powerset to process massive

More information

Big Data Explained. An introduction to Big Data Science.

Big Data Explained. An introduction to Big Data Science. Big Data Explained An introduction to Big Data Science. 1 Presentation Agenda What is Big Data Why learn Big Data Who is it for How to start learning Big Data When to learn it Objective and Benefits of

More information

Internals of Hadoop Application Framework and Distributed File System

Internals of Hadoop Application Framework and Distributed File System International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop

More information

Big Data Introduction

Big Data Introduction Big Data Introduction Ralf Lange Global ISV & OEM Sales 1 Copyright 2012, Oracle and/or its affiliates. All rights Conventional infrastructure 2 Copyright 2012, Oracle and/or its affiliates. All rights

More information

CS 378 Big Data Programming. Lecture 2 Map- Reduce

CS 378 Big Data Programming. Lecture 2 Map- Reduce CS 378 Big Data Programming Lecture 2 Map- Reduce MapReduce Large data sets are not new What characterizes a problem suitable for MR? Most or all of the data is processed But viewed in small increments

More information

Applying Semantics to Unstructured Data (Big and Getting Bigger)

Applying Semantics to Unstructured Data (Big and Getting Bigger) Applying Semantics to Unstructured Data (Big and Getting Bigger) Wednesday, November 30, 2012 4:00 5:00 Bryan Bell Vice President, Enterprise Solutions, Expert System Lynda Moulton, Analyst & Consultant,

More information

Implement Hadoop jobs to extract business value from large and varied data sets

Implement Hadoop jobs to extract business value from large and varied data sets Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to

More information

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,

More information

Apache Hadoop: The Big Data Refinery

Apache Hadoop: The Big Data Refinery Architecting the Future of Big Data Whitepaper Apache Hadoop: The Big Data Refinery Introduction Big data has become an extremely popular term, due to the well-documented explosion in the amount of data

More information

Session 1: IT Infrastructure Security Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges

Session 1: IT Infrastructure Security Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges Session 1: IT Infrastructure Security Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges James Campbell Corporate Systems Engineer HP Vertica jcampbell@vertica.com Big

More information

Online Content Optimization Using Hadoop. Jyoti Ahuja Dec 20 2011

Online Content Optimization Using Hadoop. Jyoti Ahuja Dec 20 2011 Online Content Optimization Using Hadoop Jyoti Ahuja Dec 20 2011 What do we do? Deliver right CONTENT to the right USER at the right TIME o Effectively and pro-actively learn from user interactions with

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately

More information

DAMA NY DAMA Day October 17, 2013 IBM 590 Madison Avenue 12th floor New York, NY

DAMA NY DAMA Day October 17, 2013 IBM 590 Madison Avenue 12th floor New York, NY Big Data Analytics DAMA NY DAMA Day October 17, 2013 IBM 590 Madison Avenue 12th floor New York, NY Tom Haughey InfoModel, LLC 868 Woodfield Road Franklin Lakes, NJ 07417 201 755 3350 tom.haughey@infomodelusa.com

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.

More information

Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012

Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012 Big Data Buzzwords From A to Z By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012 Big Data Buzzwords Big data is one of the, well, biggest trends in IT today, and it has spawned a whole new generation

More information

CS 378 Big Data Programming

CS 378 Big Data Programming CS 378 Big Data Programming Lecture 2 Map- Reduce CS 378 - Fall 2015 Big Data Programming 1 MapReduce Large data sets are not new What characterizes a problem suitable for MR? Most or all of the data is

More information

Big Data and Hadoop for the Executive A Reference Guide

Big Data and Hadoop for the Executive A Reference Guide Big Data and Hadoop for the Executive A Reference Guide Overview The amount of information being collected by companies today is incredible. Wal- Mart has 460 terabytes of data, which, according to the

More information

A bit about Hadoop. Luca Pireddu. March 9, 2012. CRS4Distributed Computing Group. luca.pireddu@crs4.it (CRS4) Luca Pireddu March 9, 2012 1 / 18

A bit about Hadoop. Luca Pireddu. March 9, 2012. CRS4Distributed Computing Group. luca.pireddu@crs4.it (CRS4) Luca Pireddu March 9, 2012 1 / 18 A bit about Hadoop Luca Pireddu CRS4Distributed Computing Group March 9, 2012 luca.pireddu@crs4.it (CRS4) Luca Pireddu March 9, 2012 1 / 18 Often seen problems Often seen problems Low parallelism I/O is

More information

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing Introduction to Data Mining and Machine Learning Techniques Iza Moise, Evangelos Pournaras, Dirk Helbing Iza Moise, Evangelos Pournaras, Dirk Helbing 1 Overview Main principles of data mining Definition

More information

HiBench Installation. Sunil Raiyani, Jayam Modi

HiBench Installation. Sunil Raiyani, Jayam Modi HiBench Installation Sunil Raiyani, Jayam Modi Last Updated: May 23, 2014 CONTENTS Contents 1 Introduction 1 2 Installation 1 3 HiBench Benchmarks[3] 1 3.1 Micro Benchmarks..............................

More information

Web Data Mining: A Case Study. Abstract. Introduction

Web Data Mining: A Case Study. Abstract. Introduction Web Data Mining: A Case Study Samia Jones Galveston College, Galveston, TX 77550 Omprakash K. Gupta Prairie View A&M, Prairie View, TX 77446 okgupta@pvamu.edu Abstract With an enormous amount of data stored

More information

Large-Scale Test Mining

Large-Scale Test Mining Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding

More information

SQL + NOSQL + NEWSQL + REALTIME FOR INVESTMENT BANKS

SQL + NOSQL + NEWSQL + REALTIME FOR INVESTMENT BANKS Enterprise Data Problems in Investment Banks BigData History and Trend Driven by Google CAP Theorem for Distributed Computer System Open Source Building Blocks: Hadoop, Solr, Storm.. 3548 Hypothetical

More information

Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster. Nov 7, 2012

Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster. Nov 7, 2012 Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster Nov 7, 2012 Who I Am Robert Lancaster Solutions Architect, Hotel Supply Team rlancaster@orbitz.com @rob1lancaster Organizer of Chicago

More information

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop

More information

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum Big Data Analytics with EMC Greenplum and Hadoop Big Data Analytics with EMC Greenplum and Hadoop Ofir Manor Pre Sales Technical Architect EMC Greenplum 1 Big Data and the Data Warehouse Potential All

More information

INDUS / AXIOMINE. Adopting Hadoop In the Enterprise Typical Enterprise Use Cases

INDUS / AXIOMINE. Adopting Hadoop In the Enterprise Typical Enterprise Use Cases INDUS / AXIOMINE Adopting Hadoop In the Enterprise Typical Enterprise Use Cases. Contents Executive Overview... 2 Introduction... 2 Traditional Data Processing Pipeline... 3 ETL is prevalent Large Scale

More information

Sunnie Chung. Cleveland State University

Sunnie Chung. Cleveland State University Sunnie Chung Cleveland State University Data Scientist Big Data Processing Data Mining 2 INTERSECT of Computer Scientists and Statisticians with Knowledge of Data Mining AND Big data Processing Skills:

More information

Data Warehousing in the Age of Big Data

Data Warehousing in the Age of Big Data Data Warehousing in the Age of Big Data Krish Krishnan AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD * PARIS SAN DIEGO SAN FRANCISCO SINGAPORE SYDNEY TOKYO Morgan Kaufmann is an imprint of Elsevier

More information

Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

More information

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction Data Mining and Exploration Data Mining and Exploration: Introduction Amos Storkey, School of Informatics January 10, 2006 http://www.inf.ed.ac.uk/teaching/courses/dme/ Course Introduction Welcome Administration

More information

TRAINING PROGRAM ON BIGDATA/HADOOP

TRAINING PROGRAM ON BIGDATA/HADOOP Course: Training on Bigdata/Hadoop with Hands-on Course Duration / Dates / Time: 4 Days / 24th - 27th June 2015 / 9:30-17:30 Hrs Venue: Eagle Photonics Pvt Ltd First Floor, Plot No 31, Sector 19C, Vashi,

More information

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON Overview * Introduction * Multiple faces of Big Data * Challenges of Big Data * Cloud Computing

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh miles@inf.ed.ac.uk October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples

More information

BIG DATA AND THE ENTERPRISE DATA WAREHOUSE WORKSHOP

BIG DATA AND THE ENTERPRISE DATA WAREHOUSE WORKSHOP BIG DATA AND THE ENTERPRISE DATA WAREHOUSE WORKSHOP Business Analytics for All Amsterdam - 2015 Value of Big Data is Being Recognized Executives beginning to see the path from data insights to revenue

More information

SIAM PP 2014! MapReduce in Scientific Computing! February 19, 2014

SIAM PP 2014! MapReduce in Scientific Computing! February 19, 2014 SIAM PP 2014! MapReduce in Scientific Computing! February 19, 2014 Paul G. Constantine! Applied Math & Stats! Colorado School of Mines David F. Gleich! Computer Science! Purdue University Hans De Sterck!

More information

Big Data: Beyond the Hype

Big Data: Beyond the Hype Big Data: Beyond the Hype Why Big Data Matters to You WHITE PAPER Big Data: Beyond the Hype Why Big Data Matters to You By DataStax Corporation October 2011 Table of Contents Introduction...4 Big Data

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining 1 Why Data Mining? Explosive Growth of Data Data collection and data availability Automated data collection tools, Internet, smartphones, Major sources of abundant data Business:

More information

Big Data Storage, Management and challenges. Ahmed Ali-Eldin

Big Data Storage, Management and challenges. Ahmed Ali-Eldin Big Data Storage, Management and challenges Ahmed Ali-Eldin (Ambitious) Plan What is Big Data? And Why talk about Big Data? How to store Big Data? BigTables (Google) Dynamo (Amazon) How to process Big

More information

This Symposium brought to you by www.ttcus.com

This Symposium brought to you by www.ttcus.com This Symposium brought to you by www.ttcus.com Linkedin/Group: Technology Training Corporation @Techtrain Technology Training Corporation www.ttcus.com Big Data Analytics as a Service (BDAaaS) Big Data

More information

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter : @carbone

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter : @carbone Hadoop2, Spark Big Data, real time, machine learning & use cases Cédric Carbone Twitter : @carbone Agenda Map Reduce Hadoop v1 limits Hadoop v2 and YARN Apache Spark Streaming : Spark vs Storm Machine

More information

Navigating Big Data business analytics

Navigating Big Data business analytics mwd a d v i s o r s Navigating Big Data business analytics Helena Schwenk A special report prepared for Actuate May 2013 This report is the third in a series and focuses principally on explaining what

More information

White Paper: Hadoop for Intelligence Analysis

White Paper: Hadoop for Intelligence Analysis CTOlabs.com White Paper: Hadoop for Intelligence Analysis July 2011 A White Paper providing context, tips and use cases on the topic of analysis over large quantities of data. Inside: Apache Hadoop and

More information

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15 Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 15 Big Data Management V (Big-data Analytics / Map-Reduce) Chapter 16 and 19: Abideboul et. Al. Demetris

More information

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, 2012. Viswa Sharma Solutions Architect Tata Consultancy Services

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, 2012. Viswa Sharma Solutions Architect Tata Consultancy Services Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, 2012 Viswa Sharma Solutions Architect Tata Consultancy Services 1 Agenda What is Hadoop Why Hadoop? The Net Generation is here Sizing the

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information

The big data revolution

The big data revolution The big data revolution Friso van Vollenhoven (Xebia) Enterprise NoSQL Recently, there has been a lot of buzz about the NoSQL movement, a collection of related technologies mostly concerned with storing

More information