STOCHASTIC ANALYTICS. Chris Jermaine Rice University
|
|
- Solomon Marsh
- 8 years ago
- Views:
Transcription
1 STOCHASTIC ANALYTICS Chris Jermaine Rice University 1
2 What Is Meant by Stochastic Analytics? 2
3 What Is Meant by Stochastic Analytics? It is actually a relatively little-used term (to date!) It means tackling (Big Data) analytic tasks using stochastic models Will now define stochastic models and (Big Data) analytics 3
4 First: What Is a Stochastic Model? A model for some slice of reality which has a random component Here random means probabilistic There is typically a distribution over possible inputs and/or outcomes Why utilize randomness? 4
5 What Is a Stochastic Model? A model for some slice of reality which has a random component Here random means probabilistic There is typically a distribution over possible inputs and/or outcomes Why utilize randomness? Randomness provides a way to model uncertainty Have a data record describing J. Smith Both John Smith and Jane Smith are people in your data set Record a 50/50 chance of J. Smith referring to John/Jane 5
6 What Is a Stochastic Model? A model for some slice of reality which has a random component Here random means probabilistic There is typically a distribution over possible inputs and/or outcomes Why utilize randomness? Randomness provides a way to model uncertainty Provides a way to model missing data Missing a person s gender? 52% of people in the data set are women... Represent the gender via a distribution (52% female, 48% male) 6
7 What Is a Stochastic Model? A model for some slice of reality which has a random component Here random means probabilistic There is typically a distribution over possible inputs and/or outcomes Why utilize randomness? Randomness provides a way to model uncertainty Provides a way to model missing data Provides for a principled way to talk about beliefs I am 99.5% sure that this is gonna be a great presentation! 7
8 What Is a Stochastic Model? A model for some slice of reality which has a random component Here random means probabilistic There is typically a distribution over possible inputs and/or outcomes Why utilize randomness? Randomness provides a way to model uncertainty Provides a way to model missing data Provides for a principled way to talk about beliefs One great thing about randomness: The theory already exists Can leverage 100s of years of probability theory 8
9 What Are Big Data Analytics? The process of extracting useful (actionable) knowledge from a very large data set (or sets) 9
10 What Are Big Data Analytics? The process of extracting useful (actionable) knowledge from a very large data set (or sets) Examples: Segmenting/clustering customers for data reduction/viz Learning rules from data: beer implies diapers Learning to recognize Higgs boson in the aftermath of a collision 10
11 What Are Big Data Analytics? The process of extracting useful (actionable) knowledge from a very large data set (or sets) Back in the day (that is, the 1990 s), related buzzwords were: Data warehousing Building a large database where operational data go to retire On-Line Analytic Processing (OLAP) Combination data model / viz tool / query interface for a multidimensional database Data Mining The application of scalable intelligent knowledge discovery algorithms to data One of the pioneers in data mining (Usama Fayyad) worked at JPL 11
12 What Are Big Data Analytics? Before 2000s, many tools for Big Data analytics were DB-centric That is, they worked in concert with or were based upon RDBMS technology 12
13 What Are Big Data Analytics? What changed in early 2000s? 13
14 What Are Big Data Analytics? What changed in early 2000s? The rise of the big Internet company, esp. Google, Yahoo! The big Internet companies were never enamored of RDBMS tech 14
15 What Are Big Data Analytics? What changed in early 2000s? The rise of the big Internet company, esp. Google, Yahoo! The big Internet companies were never enamored of RDBMS tech Felt it was too inflexible Difficult to implement things like PageRank Didn t lend itself to analyzing sequence-based user data Bad for analyzing graphs Many other annoyances 15
16 What Are Big Data Analytics? What changed in early 2000s? The rise of the big Internet company, esp. Google, Yahoo! The big Internet companies were never enamored of RDBMS tech Felt it was too inflexible Too tied to proprietary hardware Too slow, didn t scale to 1000 s of machines By early-mid 2000s......the early Internet companies began to publicize the infrastructure they d built Most influential idea was MapReduce See for the seminal 2004 paper on MR In last 5 years or so, an entire NoSQL ecosystem for BDA has sprung up Tools include: Hadoop, Cassandra, HBase, Pig, Hive... 16
17 Soooo... We ve defined stochastic models... We ve defined Big Data analytics What is meant by Stochastic Analytics? The application of stochastic models/methods to problems in Big Data analytics 17
18 Soooo... We ve defined stochastic models... We ve defined Big Data analytics What is meant by Stochastic Analytics? The application of stochastic models/methods to problems in Big Data analytics With the wrinkle that the result of the analysis itself is stochastic/random Of key importance because it gives analyst an idea of risk/uncertainty of result 18
19 Soooo... We ve defined stochastic models... We ve defined Big Data analytics What is meant by Stochastic Analytics? The application of stochastic models/methods to problems in Big Data analytics With the wrinkle that the result of the analysis itself is stochastic/random Of key importance because it gives analyst an idea of risk/uncertainty of result I ll be illustrating this with an example But first a quick crash course on Bayesian statistics Important because the Bayesian approach underlies much of stochastic analytics 19
20 Bayesian Statistics Imagine that we come up with an exam for a college course Want to run a trial with a few students Goal is to see if this exam is too hard or too easy That is, want to say something about what we think the mean exam score will be And what the spread (or variance) of the score will be 20
21 So How Would a Bayesian Do This? First imagine a stochastic generative process for the data For the ith student, we might imagine score i ~ Normal ( μ, σ 2 ) ultimate goal is to guess the value of mu Means is sampled from a RV with a normal dist. Why Normal? Models typical bell shaped data Assume μ, σ 2 are also generated by sampling from RVs, and are unknown 21
22 So How Would a Bayesian Do This? How is the mean generated? We imagine μ ~ Normal (70, 100) Why Normal (70, 100)? Allows for all possible, reasonable exam averages 22
23 So How Would a Bayesian Do This? How is the variance (spread) generated? We imagine ~ InverseGamma (1, 1) σ 2 Why InverseGamma (1, 1)? Allows a really large range of positive vals σ 2 23
24 Thus, Our Generative Process Is Step 1: ~ InverseGamma (1, 1) σ 2 Step 2: μ ~ Normal (75, 100) Step 3: for each i, score i ~ Normal ( μ, σ 2 ) 24
25 Why Have a Generative Process? Given gen process, can use PDF to measure how likely a given pair is Specifically, p( μ, σ 2 ) = IG ( σ 2 1, 1) Normal ( μ 75, 100) So given any μ, σ 2 combo, we can say how likely we think it is This is our prior on mean and variance Note times cause the two values are indep. 25
26 Bayesian Inference To a Bayesian, inference is then the process of updating prior expectations after seeing the data After we see the data: p( μ, σ 2 data) p( data μσ, 2 ) p( μσ, 2 ) p( data) This is equation follows directly from Bayes Theorem Note: this is also a distribution = Known as a posterior distribution test scores 26
27 Pictorially You have a prior distribution on mean and variance: sigma 2 (variance of test scores) mu (average of test scores) 27
28 Pictorially You see some test scores sigma 2 (variance of test scores) mu (average of test scores) seem to average in the high 70 s student 1 student 2 student 3 student 4 student 5 student 6 28
29 Pictorially Then use Bayes to get a posterior distribution on mean and var: sigma 2 (variance of test scores) mu (average of test scores) much narrower after seeing data! 29
30 That s the Bayesian Approach Come up with a generative process Which includes prior distributions on the quantities like to est In our example, the variance and especially the mean See some data Use Bayes Theorem and data to update the priors This gives you a posterior dist The posterior is your estimate Why attractive for use in Big Data analytics? Answer is a dist So it gives you some idea of uncertainty/risk of inferences made using the data 30
31 Now Let s Look At an Example We have archived billions of sales records and want to know: What would my profits have been in 08 if I d cut all of my margins by 10%? How might we use archived data to guess this answer? Need to guess each customer s demand at new price Use to build a hypothetical, revised version our table storing sales in 2008 Finally, query this hypothetical table Here s one possibility......utilizing the Bayesian approach 31
32 Step 1: A Stochastic Demand Model First, define a customer demand model... Quantity purchased For a given customer, demand is a linear function (I know, could do better than linear...) Price of part A 32
33 Step 1: A Stochastic Demand Model First, define a customer demand model... Quantity purchased D 0 Gamma( k D, θ D ) P 0 Price of part A Gamma( k P, θ P ) Demand curve is generated via samples from twin Gamma distributions 33
34 Step 1: A Stochastic Demand Model First, define a customer demand model... Quantity purchased D 0 Gamma( k D, θ D ) P 0 Price of part A Gamma( k P, θ P ) Distribution over D 0, P 0 defines a whole family of possible demand curves... 34
35 Step 1: A Stochastic Demand Model First, define a customer demand model... Quantity purchased D 0 Gamma( k D, θ D ) P 0 Price of part A Gamma( k P, θ P ) Here s a new, possible demand curve... 35
36 Step 1: A Stochastic Demand Model First, define a customer demand model... Quantity purchased D 0 Gamma( k D, θ D ) P 0 Price of part A Gamma( k P, θ P ) And another... 36
37 Step 1: A Stochastic Demand Model First, define a customer demand model... Quantity purchased D 0 Gamma( k D, θ D ) P 0 Price of part A Gamma( k P, θ P ) And so on... 37
38 Step 1: A Stochastic Demand Model First, define a customer demand model... Quantity purchased D 0 Gamma( k D, θ D ) P 0 Price of part A Gamma( k P, θ P ) The PDF Ff (.) = Gamma( D 0 k D, θ D ) Gamma( P 0 k P, θ P ) is our prior distribution over demand curves 38
39 Step 2: Learn the Model To apply this model, we need to learn the prior That is, choose k P, k D, θ P, θ D to be realistic for sales in 2008 Reasonable tactic: use the warehoused data to perform an MLE That is, find the set of params that best describes all of the 2008 sales Do this by issuing computations over the warehouse 39
40 Step 3: Apply the Model - the Theory Now we have a prior model (PDF) for demand function f: Ff ( k P, k D, θ P, θ D ) Problem: the actual demand curve for each customer is unseen But can use observed demand to obtain a posterior demand model Ex: for ith sale, a customer bought d units at price p Then posterior demand model is given by: Ff ( i k P, k D, θ P, θ D, f i ( p) = d) 40
41 Step 3: Apply the Model - the Theory Intuitively, F(f i f i (p) = d) gives non-zero weight to all demand functions thru the point (p, d) 41
42 Step 3: Apply the Model - the Theory Intuitively, F(f i f i (p) = d) gives non-zero weight to all demand functions thru the point (p, d) Quantity purchased (p, d) So F accepts this one... Price of part A 42
43 Step 3: Apply the Model - the Theory Intuitively, F(f i f i (p) = d) gives non-zero weight to all demand functions thru the point (p, d) Quantity purchased (p, d) And this one... Price of part A 43
44 Step 3: Apply the Model - the Theory Intuitively, F(f i f i (p) = d) gives non-zero weight to all demand functions thru the point (p, d) Quantity purchased (p, d) And this one... Price of part A 44
45 Step 3: Apply the Model - the Theory Intuitively, F(f i f i (p) = d) gives non-zero weight to all demand functions thru the point (p, d) Quantity purchased (p, d) And this one... Price of part A 45
46 Step 3: Apply the Model - the Theory Intuitively, F(f i f i (p) = d) gives non-zero weight to all demand functions thru the point (p, d) Quantity purchased (p, d) But not this one Price of part A 46
47 Step 3: Apply the Model - the Theory Intuitively, F(f i f i (p) = d) gives non-zero weight to all demand functions thru the point (p, d) Quantity purchased (p, d) But not this one Price of part A Those demand curves that are given non-zero weight are ordered exactly as the prior would order them 47
48 Step 3: Apply the Model - the Theory Intuitively, F(f i f i (p) = d) gives non-zero weight to all demand functions thru the point (p, d) Quantity purchased (p, d) But not this one Price of part A Those demand curves that are given non-zero weight are ordered exactly as the prior would order them To guess the customer s demand at new price p : Sample a possible f i from the PDF Ff ( i k P, k D, θ P, θ D, f i ( p) = d) Compute f i ( p ), and we have a new demand! 48
49 Step 3: Apply the Model - the Practice To actually compute the overall profit under new prices: I d need to sample an f i from F(f i f i (p) = d) for every sale from 2008 Evaluate f i ( p ) for all those sales Compute the new profits 49
50 Step 3: Apply the Model - the Practice Issue: the profits are actually random You do this once, you get one answer You do this again, you get another answer How to handle this? Redo the computation many times (Monte Carlo) to obtain a distribution of results number of obs negative positive additional profits 0 1M don t do it! 50
51 Illustrative of Stochastic Analytics Begins with a stochastic model Model is applied at very fine granularity to big data set In contrast to summarize-then-analyze approach Result of analysis is a distribution, not a single result Often obtained using Bayesian techniques 51
52 Why Is It Important to Get a Distribution? Gives the final consumer a picture of uncertainty/risk Especially when applying Bayesian approach... lack of data comes through as very wide tails People increasingly aware of pitfalls of not quantifying risk mean:-$20m... would YOU choose this? -100M 0 100M... 1B loss 52
53 Systematic Quant. of Uncertainty/Risk Worthwhile polemic is Sam Savage (Stanford) Flaw of Averages Savage describes two f/laws : Flaw of averages (weak form): Flaw of averages (strong form): Mean correct, Variance ignored Sam Savage s book (why we underestimate risk) Wrong value of mean: f(e[x]) E[f(X)] 21 53
54 Another Example: Linear Regression 54
55 Another Example: Linear Regression In linear regression, have a number of regressors and outcomes Linear regression attempts to fit a line to these income (outcome) age (regressor) 55
56 Another Example: Linear Regression In linear regression, have a number of regressors and outcomes Linear regression attempts to fit a line to these income (outcome) age (regressor) 56
57 Another Example: Linear Regression In linear regression, have a number of regressors and outcomes Linear regression attempts to fit a line to these income (outcome) age (regressor) known age predicted income So as to predict the outcomes associated with a new data point 57
58 Another Example: Linear Regression In linear regression, have a number of regressors and outcomes Linear regression attempts to fit a line to these income (outcome) age (regressor) known age predicted income So as to predict the outcomes associated with a new data point Widely used in practice, esp. in Big Data analytics But typically LS version is applied; not the stochastic/bayesian variant 58
59 Bayesian Linear Regression Imagine the following stochastic generative process First, generate the variability of regression coefs, For each regressor, the regression coef c i ~ Laplace( b) v ~ InvGamma( 1, 1) Then generate the variance of the prediction error, σ 2 ~ InvGamma( 1, 1) Finally, for a data point x 0 = 1, x 1, x 2, x n, outcome y ~ Normal x i c i, σ 2 i 59
60 Bayesian Linear Regression Imagine the following stochastic generative process First, generate the variability of regression coefs, For each regressor, the regression coef c i ~ Laplace( b) v ~ InvGamma( 1, 1) Then generate the variance of the prediction error, σ 2 ~ InvGamma( 1, 1) Finally, for a data point x 0 = 1, x 1, x 2, x n, outcome y ~ Normal x i c i, σ 2 i income (outcome) intercept is c slope is c 0 1 age (regressor) 60
61 Bayesian Linear Regression Imagine the following stochastic generative process First, generate the variability of regression coefs, For each regressor, the regression coef c i ~ Laplace( b) v ~ InvGamma( 1, 1) Then generate the variance of the prediction error, σ 2 ~ InvGamma( 1, 1) Finally, for a data point x 0 = 1, x 1, x 2, x n, outcome y ~ Normal x i c i, σ 2 i income (outcome) given x 1 age (regressor) 61
62 Bayesian Linear Regression Imagine the following stochastic generative process First, generate the variability of regression coefs, For each regressor, the regression coef c i ~ Laplace( b) v ~ InvGamma( 1, 1) Then generate the variance of the prediction error, σ 2 ~ InvGamma( 1, 1) Finally, for a data point x 0 = 1, x 1, x 2, x n, outcome y ~ Normal x i c i, σ 2 i income (outcome) y is normally distributed around this point age (regressor) 62
63 Bayesian Linear Regression Imagine the following stochastic generative process First, generate the variability of regression coefs, For each regressor, the regression coef c i ~ Laplace( b) v ~ InvGamma( 1, 1) Then generate the variance of the prediction error, σ 2 ~ InvGamma( 1, 1) Finally, for a data point x 0 = 1, x 1, x 2, x n, outcome y ~ Normal x i c i, σ 2 i income (outcome) We imagine entire data set is generated in this way... age (regressor) 63
64 Bayesian Linear Regression Imagine the following stochastic generative process First, generate the variability of regression coefs, For each regressor, the regression coef c i ~ Laplace( b) v ~ InvGamma( 1, 1) Then generate the variance of the prediction error, σ 2 ~ InvGamma( 1, 1) Finally, for a data point x 0 = 1, x 1, x 2, x n, outcome y ~ Normal x i c i, σ 2 As an aside: why use Laplace distribution? i 64
65 Bayesian Linear Regression Then, imagine we have a huge data set... income (outcome)...we imagine generated using unknown params age (regressor) 65
66 Bayesian Linear Regression Then, imagine we have a huge data set... income (outcome) age (regressor) use Bayes rule to get posterior distribution on coefs (Challenging for high-d, Big Data!) 66
67 Bayesian Linear Regression Then, imagine we have a huge data set... income (outcome) given x 1 age (regressor) can use that distribution to generate a distribution of outcomes given a new data point 67
68 Bayesian Linear Regression income (outcome) age (regressor) Why is this Stochastic Analytics? We are using a stochastic model In conjunction with a very large data set To perform a (toy) analytic task: predicting income of a person given an age And we obtain information re the uncertainty of our prediction 68
69 Soooo... We ve seen two examples illustrating stochastic analytics Many more exist Almost any analytic task can be made stochastic by applying appropriate model Likely to grow in importance as application of ML to Big Data becomes common Why? The Bayesian approach is common in modern ML......natural synergy between SA and the Bayesian approach 69
70 Epilogue: How Might One Implement This? Well, you could use our system: MCDB/SimSQL MCDB/SimSQL is itself built on Hadoop/MapReduce That is, a high level spec in MCDB/SimSQL ends up running on Hadoop Apache Hadoop is the de-facto open-source implementation of MapReduce 70
71 What Is MapReduce? It is a simple Big Data processing paradigm To process a data set, you have two pieces of user-supplied code: A map code And a reduce code These are run (potentially over a very large compute cluster built using commodity hardware) using three data processing phases A map phase A shuffle phase And a reduce phase 71
72 The Map Phase Assume that the input data are stored in a huge file This file contains a simple list of pairs of type (key1,value1) And assume we have a user-supplied function of the form map (key1,value1) That outputs a list of pairs of the form (key2, value2) In the map phase of the MapReduce computation this map function is called for every record in the input data set Instances of map run in parallel all over the compute cluster 72
73 The Reduce Phase Assume we have a user-supplied function of the form reduce (key2,list <value2>) That outputs a list of value2 objects In the reduce phase of the MapReduce computation this reduce function is called for every key2 value output by the shuffle Instances of reduce run in parallel all over the compute cluster The output of all of those instances is collected in a (potentially) huge output file 73
74 The Shuffle Phase The shuffle phase accepts all of the (key2, value2) pairs from the map phase And it groups them together So that all of the pairs From all over the cluster Having the same key2 value Are merged into a single (key2, list <value2>) pair Called a shuffle because this is where a potential all-to-all data transfer happens 74
75 Why Do People Like MapReduce? Can specify a computation (such as WordCount) In a few lines of code (Java in the case of Hadoop) MapReduce platform takes care of distributing to 1000 s of cores Great for data parallel tasks Those where have a lot of data and can move computation to data Codes can be much simpler than codes in other DP paradigms 75
76 Questions? 76
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationAdvanced Big Data Analytics with R and Hadoop
REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional
More informationHadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More informationESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
More informationBIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
More informationYou should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.
What is this course about? This course is an overview of Big Data tools and technologies. It establishes a strong working knowledge of the concepts, techniques, and products associated with Big Data. Attendees
More informationHadoop. http://hadoop.apache.org/ Sunday, November 25, 12
Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using
More informationHow To Handle Big Data With A Data Scientist
III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution
More informationArchitectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
More informationANALYTICS CENTER LEARNING PROGRAM
Overview of Curriculum ANALYTICS CENTER LEARNING PROGRAM The following courses are offered by Analytics Center as part of its learning program: Course Duration Prerequisites 1- Math and Theory 101 - Fundamentals
More informationHow To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI
More informationBig Data and Data Science: Behind the Buzz Words
Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing
More informationBig Data Technology Map-Reduce Motivation: Indexing in Search Engines
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process
More informationCSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
More informationBusiness Intelligence for Big Data
Business Intelligence for Big Data Will Gorman, Vice President, Engineering May, 2011 2010, Pentaho. All Rights Reserved. www.pentaho.com. What is BI? Business Intelligence = reports, dashboards, analysis,
More informationOPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP
OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP 1 KALYANKUMAR B WADDAR, 2 K SRINIVASA 1 P G Student, S.I.T Tumkur, 2 Assistant Professor S.I.T Tumkur Abstract- Product Review System
More informationParallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data
Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin
More informationPS 271B: Quantitative Methods II. Lecture Notes
PS 271B: Quantitative Methods II Lecture Notes Langche Zeng zeng@ucsd.edu The Empirical Research Process; Fundamental Methodological Issues 2 Theory; Data; Models/model selection; Estimation; Inference.
More informationApplication Development. A Paradigm Shift
Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the
More informationBig Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing
Big Data Rethink Algos and Architecture Scott Marsh Manager R&D Personal Lines Auto Pricing Agenda History Map Reduce Algorithms History Google talks about their solutions to their problems Map Reduce:
More informationBig Data Big Deal? Salford Systems www.salford-systems.com
Big Data Big Deal? Salford Systems www.salford-systems.com 2015 Copyright Salford Systems 2010-2015 Big Data Is The New In Thing Google trends as of September 24, 2015 Difficult to read trade press without
More informationMapReduce with Apache Hadoop Analysing Big Data
MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside gavin.heavyside@journeydynamics.com About Journey Dynamics Founded in 2006 to develop software technology to address the issues
More informationLecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
More informationThe Performance Characteristics of MapReduce Applications on Scalable Clusters
The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have
More informationBig Data With Hadoop
With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
More informationBig Data Too Big To Ignore
Big Data Too Big To Ignore Geert! Big Data Consultant and Manager! Currently finishing a 3 rd Big Data project! IBM & Cloudera Certified! IBM & Microsoft Big Data Partner 2 Agenda! Defining Big Data! Introduction
More informationW H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract
W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the
More informationBig Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
More informationCloudera Certified Developer for Apache Hadoop
Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number
More informationBig Data and Market Surveillance. April 28, 2014
Big Data and Market Surveillance April 28, 2014 Copyright 2014 Scila AB. All rights reserved. Scila AB reserves the right to make changes to the information contained herein without prior notice. No part
More informationTowards running complex models on big data
Towards running complex models on big data Working with all the genomes in the world without changing the model (too much) Daniel Lawson Heilbronn Institute, University of Bristol 2013 1 / 17 Motivation
More informationWhite Paper: What You Need To Know About Hadoop
CTOlabs.com White Paper: What You Need To Know About Hadoop June 2011 A White Paper providing succinct information for the enterprise technologist. Inside: What is Hadoop, really? Issues the Hadoop stack
More informationThe Internet of Things and Big Data: Intro
The Internet of Things and Big Data: Intro John Berns, Solutions Architect, APAC - MapR Technologies April 22 nd, 2014 1 What This Is; What This Is Not It s not specific to IoT It s not about any specific
More informationHadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
More informationBayesian networks - Time-series models - Apache Spark & Scala
Bayesian networks - Time-series models - Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly
More informationSQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford
SQL VS. NO-SQL Adapted Slides from Dr. Jennifer Widom from Stanford 55 Traditional Databases SQL = Traditional relational DBMS Hugely popular among data analysts Widely adopted for transaction systems
More informationHadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
More informationWorkshop on Hadoop with Big Data
Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly
More informationTesting Big data is one of the biggest
Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing
More informationDATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
More informationSpeak<geek> Tech Brief. RichRelevance Distributed Computing: creating a scalable, reliable infrastructure
3 Speak Tech Brief RichRelevance Distributed Computing: creating a scalable, reliable infrastructure Overview Scaling a large database is not an overnight process, so it s difficult to plan and implement
More informationBig Data Analysis and HADOOP
Big Data Analysis and HADOOP B.Jegatheswari and M.Muthulakshmi III year MCA AVC College of engineering, Mayiladuthurai. Email ID: jjega.cool@gmail.com Mobile: 8220380693 Abstract: - Digital universe with
More informationA Tour of the Zoo the Hadoop Ecosystem Prafulla Wani
A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani Technical Architect - Big Data Syntel Agenda Welcome to the Zoo! Evolution Timeline Traditional BI/DW Architecture Where Hadoop Fits In 2 Welcome to
More informationHow to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning
How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume
More informationIntegrating Hadoop. Into Business Intelligence & Data Warehousing. Philip Russom TDWI Research Director for Data Management, April 9 2013
Integrating Hadoop Into Business Intelligence & Data Warehousing Philip Russom TDWI Research Director for Data Management, April 9 2013 TDWI would like to thank the following companies for sponsoring the
More informationL1: Introduction to Hadoop
L1: Introduction to Hadoop Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: December 1, 2014 Today we are going to learn... 1 General
More informationHadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
More informationAccelerating and Simplifying Apache
Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly
More informationBIG DATA IS MESSY PARTNER WITH SCALABLE
BIG DATA IS MESSY PARTNER WITH SCALABLE SCALABLE SYSTEMS HADOOP SOLUTION WHAT IS BIG DATA? Each day human beings create 2.5 quintillion bytes of data. In the last two years alone over 90% of the data on
More informationOracle Big Data SQL Technical Update
Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical
More informationUsing Tableau Software with Hortonworks Data Platform
Using Tableau Software with Hortonworks Data Platform September 2013 2013 Hortonworks Inc. http:// Modern businesses need to manage vast amounts of data, and in many cases they have accumulated this data
More informationCloud Computing at Google. Architecture
Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale
More informationHBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367
HBase A Comprehensive Introduction James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367 Overview Overview: History Began as project by Powerset to process massive
More informationBig Data Explained. An introduction to Big Data Science.
Big Data Explained An introduction to Big Data Science. 1 Presentation Agenda What is Big Data Why learn Big Data Who is it for How to start learning Big Data When to learn it Objective and Benefits of
More informationInternals of Hadoop Application Framework and Distributed File System
International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop
More informationBig Data Introduction
Big Data Introduction Ralf Lange Global ISV & OEM Sales 1 Copyright 2012, Oracle and/or its affiliates. All rights Conventional infrastructure 2 Copyright 2012, Oracle and/or its affiliates. All rights
More informationCS 378 Big Data Programming. Lecture 2 Map- Reduce
CS 378 Big Data Programming Lecture 2 Map- Reduce MapReduce Large data sets are not new What characterizes a problem suitable for MR? Most or all of the data is processed But viewed in small increments
More informationApplying Semantics to Unstructured Data (Big and Getting Bigger)
Applying Semantics to Unstructured Data (Big and Getting Bigger) Wednesday, November 30, 2012 4:00 5:00 Bryan Bell Vice President, Enterprise Solutions, Expert System Lynda Moulton, Analyst & Consultant,
More informationImplement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
More informationRole of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,
More informationApache Hadoop: The Big Data Refinery
Architecting the Future of Big Data Whitepaper Apache Hadoop: The Big Data Refinery Introduction Big data has become an extremely popular term, due to the well-documented explosion in the amount of data
More informationSession 1: IT Infrastructure Security Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges
Session 1: IT Infrastructure Security Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges James Campbell Corporate Systems Engineer HP Vertica jcampbell@vertica.com Big
More informationOnline Content Optimization Using Hadoop. Jyoti Ahuja Dec 20 2011
Online Content Optimization Using Hadoop Jyoti Ahuja Dec 20 2011 What do we do? Deliver right CONTENT to the right USER at the right TIME o Effectively and pro-actively learn from user interactions with
More informationMap Reduce & Hadoop Recommended Text:
Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately
More informationDAMA NY DAMA Day October 17, 2013 IBM 590 Madison Avenue 12th floor New York, NY
Big Data Analytics DAMA NY DAMA Day October 17, 2013 IBM 590 Madison Avenue 12th floor New York, NY Tom Haughey InfoModel, LLC 868 Woodfield Road Franklin Lakes, NJ 07417 201 755 3350 tom.haughey@infomodelusa.com
More informationINTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.
More informationBig Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012
Big Data Buzzwords From A to Z By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012 Big Data Buzzwords Big data is one of the, well, biggest trends in IT today, and it has spawned a whole new generation
More informationCS 378 Big Data Programming
CS 378 Big Data Programming Lecture 2 Map- Reduce CS 378 - Fall 2015 Big Data Programming 1 MapReduce Large data sets are not new What characterizes a problem suitable for MR? Most or all of the data is
More informationBig Data and Hadoop for the Executive A Reference Guide
Big Data and Hadoop for the Executive A Reference Guide Overview The amount of information being collected by companies today is incredible. Wal- Mart has 460 terabytes of data, which, according to the
More informationA bit about Hadoop. Luca Pireddu. March 9, 2012. CRS4Distributed Computing Group. luca.pireddu@crs4.it (CRS4) Luca Pireddu March 9, 2012 1 / 18
A bit about Hadoop Luca Pireddu CRS4Distributed Computing Group March 9, 2012 luca.pireddu@crs4.it (CRS4) Luca Pireddu March 9, 2012 1 / 18 Often seen problems Often seen problems Low parallelism I/O is
More informationIntroduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing
Introduction to Data Mining and Machine Learning Techniques Iza Moise, Evangelos Pournaras, Dirk Helbing Iza Moise, Evangelos Pournaras, Dirk Helbing 1 Overview Main principles of data mining Definition
More informationHiBench Installation. Sunil Raiyani, Jayam Modi
HiBench Installation Sunil Raiyani, Jayam Modi Last Updated: May 23, 2014 CONTENTS Contents 1 Introduction 1 2 Installation 1 3 HiBench Benchmarks[3] 1 3.1 Micro Benchmarks..............................
More informationWeb Data Mining: A Case Study. Abstract. Introduction
Web Data Mining: A Case Study Samia Jones Galveston College, Galveston, TX 77550 Omprakash K. Gupta Prairie View A&M, Prairie View, TX 77446 okgupta@pvamu.edu Abstract With an enormous amount of data stored
More informationLarge-Scale Test Mining
Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding
More informationSQL + NOSQL + NEWSQL + REALTIME FOR INVESTMENT BANKS
Enterprise Data Problems in Investment Banks BigData History and Trend Driven by Google CAP Theorem for Distributed Computer System Open Source Building Blocks: Hadoop, Solr, Storm.. 3548 Hypothetical
More informationExtending the Enterprise Data Warehouse with Hadoop Robert Lancaster. Nov 7, 2012
Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster Nov 7, 2012 Who I Am Robert Lancaster Solutions Architect, Hotel Supply Team rlancaster@orbitz.com @rob1lancaster Organizer of Chicago
More informationMonitis Project Proposals for AUA. September 2014, Yerevan, Armenia
Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop
More informationBig Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum
Big Data Analytics with EMC Greenplum and Hadoop Big Data Analytics with EMC Greenplum and Hadoop Ofir Manor Pre Sales Technical Architect EMC Greenplum 1 Big Data and the Data Warehouse Potential All
More informationINDUS / AXIOMINE. Adopting Hadoop In the Enterprise Typical Enterprise Use Cases
INDUS / AXIOMINE Adopting Hadoop In the Enterprise Typical Enterprise Use Cases. Contents Executive Overview... 2 Introduction... 2 Traditional Data Processing Pipeline... 3 ETL is prevalent Large Scale
More informationSunnie Chung. Cleveland State University
Sunnie Chung Cleveland State University Data Scientist Big Data Processing Data Mining 2 INTERSECT of Computer Scientists and Statisticians with Knowledge of Data Mining AND Big data Processing Skills:
More informationData Warehousing in the Age of Big Data
Data Warehousing in the Age of Big Data Krish Krishnan AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD * PARIS SAN DIEGO SAN FRANCISCO SINGAPORE SYDNEY TOKYO Morgan Kaufmann is an imprint of Elsevier
More informationDistributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
More informationData Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction
Data Mining and Exploration Data Mining and Exploration: Introduction Amos Storkey, School of Informatics January 10, 2006 http://www.inf.ed.ac.uk/teaching/courses/dme/ Course Introduction Welcome Administration
More informationTRAINING PROGRAM ON BIGDATA/HADOOP
Course: Training on Bigdata/Hadoop with Hands-on Course Duration / Dates / Time: 4 Days / 24th - 27th June 2015 / 9:30-17:30 Hrs Venue: Eagle Photonics Pvt Ltd First Floor, Plot No 31, Sector 19C, Vashi,
More informationBIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON
BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON Overview * Introduction * Multiple faces of Big Data * Challenges of Big Data * Cloud Computing
More informationIntroduction to Hadoop
Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh miles@inf.ed.ac.uk October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples
More informationBIG DATA AND THE ENTERPRISE DATA WAREHOUSE WORKSHOP
BIG DATA AND THE ENTERPRISE DATA WAREHOUSE WORKSHOP Business Analytics for All Amsterdam - 2015 Value of Big Data is Being Recognized Executives beginning to see the path from data insights to revenue
More informationSIAM PP 2014! MapReduce in Scientific Computing! February 19, 2014
SIAM PP 2014! MapReduce in Scientific Computing! February 19, 2014 Paul G. Constantine! Applied Math & Stats! Colorado School of Mines David F. Gleich! Computer Science! Purdue University Hans De Sterck!
More informationBig Data: Beyond the Hype
Big Data: Beyond the Hype Why Big Data Matters to You WHITE PAPER Big Data: Beyond the Hype Why Big Data Matters to You By DataStax Corporation October 2011 Table of Contents Introduction...4 Big Data
More informationIntroduction to Data Mining
Introduction to Data Mining 1 Why Data Mining? Explosive Growth of Data Data collection and data availability Automated data collection tools, Internet, smartphones, Major sources of abundant data Business:
More informationBig Data Storage, Management and challenges. Ahmed Ali-Eldin
Big Data Storage, Management and challenges Ahmed Ali-Eldin (Ambitious) Plan What is Big Data? And Why talk about Big Data? How to store Big Data? BigTables (Google) Dynamo (Amazon) How to process Big
More informationThis Symposium brought to you by www.ttcus.com
This Symposium brought to you by www.ttcus.com Linkedin/Group: Technology Training Corporation @Techtrain Technology Training Corporation www.ttcus.com Big Data Analytics as a Service (BDAaaS) Big Data
More informationHadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter : @carbone
Hadoop2, Spark Big Data, real time, machine learning & use cases Cédric Carbone Twitter : @carbone Agenda Map Reduce Hadoop v1 limits Hadoop v2 and YARN Apache Spark Streaming : Spark vs Storm Machine
More informationNavigating Big Data business analytics
mwd a d v i s o r s Navigating Big Data business analytics Helena Schwenk A special report prepared for Actuate May 2013 This report is the third in a series and focuses principally on explaining what
More informationWhite Paper: Hadoop for Intelligence Analysis
CTOlabs.com White Paper: Hadoop for Intelligence Analysis July 2011 A White Paper providing context, tips and use cases on the topic of analysis over large quantities of data. Inside: Apache Hadoop and
More informationDepartment of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 15 Big Data Management V (Big-data Analytics / Map-Reduce) Chapter 16 and 19: Abideboul et. Al. Demetris
More informationHadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, 2012. Viswa Sharma Solutions Architect Tata Consultancy Services
Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, 2012 Viswa Sharma Solutions Architect Tata Consultancy Services 1 Agenda What is Hadoop Why Hadoop? The Net Generation is here Sizing the
More informationIntroduction to Big Data! with Apache Spark" UC#BERKELEY#
Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"
More informationThe big data revolution
The big data revolution Friso van Vollenhoven (Xebia) Enterprise NoSQL Recently, there has been a lot of buzz about the NoSQL movement, a collection of related technologies mostly concerned with storing
More information