Big Data: Big N
V.C., 14.387 Note, December 2, 2014




Examples of Very Big Data

- Congressional record text: in the 100 GB range
- Nielsen's scanner data: 5 TBs
- Medicare claims data: in the 100 TB range
- Facebook: 200,000 TBs

See "Nuts and Bolts of Big Data," NBER lecture, by Gentzkow and Shapiro. The non-econometric portion of our slides draws on theirs.

Map Reduce & Hadoop The basic idea is that you need to divide work among the cluster of computers since you can t store and analyze the data on a single computer. Simple but powerful algorithm framework. Released by Google around 2004; Hadoop is an open-source version. Map-Reduce algorithm has the following steps: 1. Map: processes chunks of data to produce summaries 2. Reduce: combines summaries from different chunks to produce a single output file

Examples

- Count words in docs: Map takes document i to a set of (word, count) pairs, C_i; Reduce collapses {C_i} by summing the counts within each word.
- Hospital records: Map takes hospital i to the records H_i for patients who are 65+; Reduce appends the elements of {H_i}.
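To make the word-count example concrete, here is a minimal single-machine sketch in Python of the map and reduce steps. In a real Hadoop job the framework would shard the documents across machines and shuffle the intermediate pairs; the `docs` list and function names here are purely illustrative:

```python
from collections import Counter

def map_doc(doc):
    """Map: document -> (word, count) pairs C_i, stored as a Counter."""
    return Counter(doc.split())

def reduce_counts(counters):
    """Reduce: collapse the C_i by summing counts within each word."""
    total = Counter()
    for c in counters:
        total.update(c)  # adds counts word by word
    return total

docs = ["big data big n", "map reduce on big data"]
word_counts = reduce_counts(map_doc(d) for d in docs)
print(word_counts)  # Counter({'big': 3, 'data': 2, ...})
```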

Map-Reduce Functionality

- Partitions data across machines
- Schedules execution across nodes
- Manages communication across machines
- Handles errors and machine failures

[Figure 1: Execution overview, from Dean and Ghemawat's MapReduce paper. The user program forks a master and worker processes; the master assigns map tasks over the input splits and assigns reduce tasks; map workers read the splits and write intermediate files to their local disks; reduce workers remotely read the intermediate files and write the final output files.]

Inverted Index: the map function parses each document and emits a sequence of ⟨word, document ID⟩ pairs. The reduce function accepts all pairs for a given word and emits the corresponding list of document IDs. The target environment is large clusters of commodity PCs connected with switched Ethernet, where machines are typically dual-processor x86 boxes.
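A sketch of the inverted-index example in the same single-machine style; `map_doc` and `reduce_pairs` are hypothetical names for the two phases:

```python
from collections import defaultdict

def map_doc(doc_id, text):
    """Map: parse a document and emit (word, doc_id) pairs."""
    return [(word, doc_id) for word in set(text.split())]

def reduce_pairs(pairs):
    """Reduce: for each word, collect the sorted list of document IDs."""
    index = defaultdict(list)
    for word, doc_id in pairs:
        index[word].append(doc_id)
    return {w: sorted(ids) for w, ids in index.items()}

docs = {0: "big data", 1: "big n"}
pairs = [p for d, t in docs.items() for p in map_doc(d, t)]
print(reduce_pairs(pairs))  # {'big': [0, 1], 'data': [0], 'n': [1]}
```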

Amazon Web Services

- Data centers owned and run by Amazon; you can rent virtual computers on a minute-by-minute basis.
- More than 80% of the cloud computing market; nearly 3,000 employees.
- Cost per machine: $0.01 to $4.00 per hour.
- Several services within AWS: S3 (storage), EC2 (individual machines), Elastic MapReduce (distributes the data for Hadoop clusters).

Distributed and Recursive Computing of Estimators

We want to compute the least squares estimator
$$\hat\beta \in \arg\min_b \; n^{-1} \sum_{i=1}^n (y_i - x_i'b)^2.$$
The sample size $n$ is so large that we can't load the data into a single machine. What can we do, whether we have a single machine or many machines? Use classical sufficiency ideas to distribute the jobs across machines, spatially or in time.

The OLS Example

We know that $\hat\beta = (X'X)^{-1}(X'Y)$. Hence we can do everything we want with just
$$X'X, \quad X'Y, \quad n, \quad S_0,$$
where $S_0$ is a small random sample $(Y_i, X_i)_{i \in I_0}$ of size $n_0$, with $n_0$ large but small enough that the data can be loaded into a single machine. We need $X'X$ and $X'Y$ to compute the estimator; we need $S_0$ to compute robust standard errors; and we need to know $n$ to scale these standard errors appropriately.

The OLS Example Continued

Terms like $X'X$ and $X'Y$ are sums, so their computation can be distributed over many machines:

1. Suppose machine $j$ stores sample $S_j = (X_i, Y_i)_{i \in I_j}$ of size $n_j$.
2. Then we map each $S_j$ to the sufficient statistics
$$T_j = \left( \sum_{i \in I_j} X_i X_i', \; \sum_{i \in I_j} X_i Y_i, \; n_j \right)$$
for each $j$.
3. We then collect $(T_j)_{j=1}^M$ and reduce them further to
$$T = \sum_{j=1}^M T_j = (X'X, X'Y, n).$$
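A minimal numpy sketch of this map/reduce scheme. The list `chunks` of $(X, y)$ blocks is simulated here; in a real deployment each block would live on a different machine:

```python
import numpy as np

def map_chunk(X, y):
    """Map: a data chunk -> its sufficient statistics (X'X, X'Y, n_j)."""
    return X.T @ X, X.T @ y, len(y)

def reduce_stats(stats):
    """Reduce: sum the per-chunk statistics to obtain X'X, X'Y, n."""
    XtX = sum(s[0] for s in stats)
    XtY = sum(s[1] for s in stats)
    n = sum(s[2] for s in stats)
    return XtX, XtY, n

# simulated chunks standing in for per-machine samples S_j
rng = np.random.default_rng(0)
chunks = [(rng.normal(size=(1_000, 3)), rng.normal(size=1_000)) for _ in range(5)]

stats = [map_chunk(X, y) for X, y in chunks]
XtX, XtY, n = reduce_stats(stats)
beta_hat = np.linalg.solve(XtX, XtY)  # (X'X)^{-1} X'Y
```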

The LASSO Example

The Lasso estimator minimizes
$$(Y - X\beta)'(Y - X\beta) + \lambda \|\Psi\beta\|_1, \quad \Psi = \mathrm{diag}(X'X),$$
or equivalently
$$Y'Y - 2\beta'X'Y + \beta'X'X\beta + \lambda \|\Psi\beta\|_1.$$
Hence, in order to compute the Lasso and estimate the noise level needed to tune $\lambda$, we only need to know
$$Y'X, \quad X'X, \quad n, \quad S_0.$$
Computation of the sums can be distributed across machines.
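Once $G = X'X$ and $c = X'Y$ have been assembled, the Lasso can be solved on a single machine by coordinate descent over these sufficient statistics alone, without touching the raw data again. A sketch, with the penalty weights `psi` taken as the diagonal of $X'X$ as on the slide (other scalings of $\Psi$ are common in practice):

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_from_stats(G, c, lam, psi, n_iter=500, tol=1e-8):
    """Coordinate descent for  b'Gb - 2 c'b + lam * sum_k psi_k |b_k|,
    using only the sufficient statistics G = X'X and c = X'Y."""
    beta = np.zeros(len(c))
    for _ in range(n_iter):
        beta_old = beta.copy()
        for k in range(len(c)):
            # residual correlation for coordinate k, net of its own contribution
            rho = c[k] - G[k] @ beta + G[k, k] * beta[k]
            beta[k] = soft_threshold(rho, lam * psi[k] / 2.0) / G[k, k]
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

# e.g., reusing XtX and XtY from the OLS sketch above:
# beta_lasso = lasso_from_stats(XtX, XtY, lam=2.0, psi=np.diag(XtX))
```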

The Two Stage Least Squares

The estimator takes the form
$$(X'P_Z X)^{-1} X'P_Z Y = \big(X'Z(Z'Z)^{-1}Z'X\big)^{-1} X'Z(Z'Z)^{-1}Z'Y.$$
Thus we only need to know
$$Z'Z, \quad X'Z, \quad Z'Y, \quad n, \quad S_0.$$
Computation of the sums can be distributed across machines.
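A short numpy sketch, assuming the cross-product matrices $Z'Z$, $X'Z$, and $Z'Y$ have already been reduced from the per-machine sums:

```python
import numpy as np

def tsls_from_stats(ZZ, XZ, ZY):
    """2SLS from cross products: (X'Z (Z'Z)^{-1} Z'X)^{-1} X'Z (Z'Z)^{-1} Z'Y,
    where ZZ = Z'Z, XZ = X'Z, ZY = Z'Y."""
    ZZinv_ZX = np.linalg.solve(ZZ, XZ.T)   # (Z'Z)^{-1} Z'X
    A = XZ @ ZZinv_ZX                      # X'Z (Z'Z)^{-1} Z'X
    b = XZ @ np.linalg.solve(ZZ, ZY)       # X'Z (Z'Z)^{-1} Z'Y
    return np.linalg.solve(A, b)
```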

Digression: Ideas of Sufficiency are Extremely Useful in Other Contexts

Motivated by J. Angrist, "Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from Social Security Administrative Records," AER, 1990. We have a small sample $S_0 = (Z_i, Y_i)_{i \in I_0}$, where the $Z_i$ are instruments (that also include exogenous covariates) and the $Y_i$ are earnings. In ML speak, this is called labeled data (they call the $Y_i$ "labels", how uncool). We also have huge ($n \gg n_0$) samples of unlabeled data (no $Y_i$ recorded), from which we can obtain $Z'X$, $X'X$, $Z'Z$ via distributed computing (if needed). We can compute the final 2SLS-like estimator as
$$\big(X'Z(Z'Z)^{-1}Z'X\big)^{-1} X'Z(Z'Z)^{-1} \cdot \frac{n}{n_0} \sum_{i \in I_0} Z_i Y_i.$$
We can compute standard errors using $S_0$.
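A sketch of this labeled/unlabeled construction, reusing `tsls_from_stats` from above. The data are simulated here for illustration; in practice $Z'Z$ and $X'Z$ would come from distributed sums over the full sample, and only the labeled subsample $(Z_0, Y_0)$ would carry outcomes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n0 = 100_000, 1_000
Z = rng.normal(size=(n, 2))                       # instruments
X = Z @ np.array([[1.0], [0.5]]) + rng.normal(size=(n, 1))
Y = X @ np.array([2.0]) + rng.normal(size=n)      # earnings, mostly unobserved

ZZ, XZ = Z.T @ Z, X.T @ Z     # full-sample moments (distributed in practice)
Z0, Y0 = Z[:n0], Y[:n0]       # the small labeled subsample S_0

ZY_hat = (n / n0) * (Z0.T @ Y0)            # rescaled estimate of Z'Y
beta_hat = tsls_from_stats(ZZ, XZ, ZY_hat)  # close to the true coefficient 2.0
```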

Exponential Families and Non-Linear Examples

Consider estimation by MLE based on exponential families. Assume the data are $W_i \sim f_\theta$, where
$$f_\theta(w) = \exp\big(T(w)'\theta + \varphi(\theta)\big).$$
Then the MLE maximizes
$$\sum_{i=1}^n \log f_\theta(W_i) = \left(\sum_{i=1}^n T(W_i)\right)'\theta + n\varphi(\theta) =: T'\theta + n\varphi(\theta).$$
The sufficient statistic $T$ can be obtained via distributed computing. We also need an $S_0$ to obtain standard errors. Going beyond such quasi-linear examples can be difficult, but is possible.
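As a concrete instance, take the Poisson family with natural parameter $\theta$, where (up to a carrier term in $w$) $f_\theta(w) = \exp(w\theta - e^\theta)$, so $T(w) = w$ and the MLE solves $e^{\hat\theta} = T/n$. A sketch with simulated per-machine chunks:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical per-machine chunks of Poisson(3) draws
chunks = [rng.poisson(3.0, size=10_000) for _ in range(4)]

# Map: chunk -> (sum of sufficient statistics T(w) = w, chunk size)
stats = [(w.sum(), len(w)) for w in chunks]

# Reduce: total sufficient statistic and total sample size
T = sum(s[0] for s in stats)
n = sum(s[1] for s in stats)

theta_hat = np.log(T / n)  # MLE of the natural parameter, since e^theta = T/n
```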

M- and GMM-Estimation

These ideas can be pushed further using one-step or approximate minimization principles. Here is a very crude form of one possible approach. Suppose that $\hat\theta$ minimizes $\sum_{i=1}^n m(W_i, \theta)$. Then, given an initial estimator $\hat\theta_0$ computed on $S_0$, we can do Newton iterations to approximate $\hat\theta$:
$$\hat\theta_{j+1} = \hat\theta_j - \left( \sum_{i=1}^n \partial_\theta^2 m(W_i, \hat\theta_j) \right)^{-1} \sum_{i=1}^n \partial_\theta m(W_i, \hat\theta_j).$$
Each iteration involves the sufficient statistics
$$\sum_{i=1}^n \partial_\theta^2 m(W_i, \hat\theta_j), \quad \sum_{i=1}^n \partial_\theta m(W_i, \hat\theta_j),$$
which can be obtained via distributed computing.
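A sketch of one such distributed Newton iteration for the logistic-regression case, taking $m(W_i, \theta)$ to be the negative log-likelihood; `chunks` is again a simulated stand-in for per-machine data blocks, and in practice the starting value would come from $S_0$:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def chunk_grad_hess(X, y, theta):
    """Map: per-chunk gradient and Hessian sums of the negative log-likelihood."""
    p = sigmoid(X @ theta)
    grad = X.T @ (p - y)
    hess = (X * (p * (1 - p))[:, None]).T @ X
    return grad, hess

def newton_step(chunks, theta):
    """Reduce the per-chunk (grad, hess) by summing, then take one Newton step."""
    grads, hessians = zip(*(chunk_grad_hess(X, y, theta) for X, y in chunks))
    return theta - np.linalg.solve(sum(hessians), sum(grads))

# simulated per-machine chunks
rng = np.random.default_rng(0)
theta_true = np.array([1.0, -0.5])
chunks = []
for _ in range(4):
    X = rng.normal(size=(5_000, 2))
    y = rng.binomial(1, sigmoid(X @ theta_true))
    chunks.append((X, y))

theta = np.zeros(2)          # in practice, theta_0 estimated on S_0
for _ in range(5):           # a few Newton iterations
    theta = newton_step(chunks, theta)
```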

Conclusions

- We discussed the large-p case, which is difficult. Approximate sparsity was used as a generalization of the usual parsimonious approach used in empirical work. A sea of opportunities for exciting empirical and theoretical work.
- We discussed the large-n case, which is less difficult. Here the key is distributed computing. Also, big-n samples often come in unlabeled form, so you need to be creative in order to make good use of them. This is an ocean of opportunities.

One More Thing

On-line course evaluations: please complete them. This is important to us. Thank you!