Big Data. Lyle Ungar, University of Pennsylvania



Similar documents
ANALYTICS CENTER LEARNING PROGRAM

Hadoop Ecosystem B Y R A H I M A.

How To Scale Out Of A Nosql Database

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014

The Internet of Things and Big Data: Intro

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Transforming the Telecoms Business using Big Data and Analytics

How To Create A Data Visualization With Apache Spark And Zeppelin

W H I T E P A P E R. Building your Big Data analytics strategy: Block-by-Block! Abstract

Big Data and Data Science: Behind the Buzz Words

Big Data and Industrial Internet

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

Step by Step: Big Data Technology. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 25 August 2015

Big Data on Microsoft Platform

SAS BIG DATA SOLUTIONS ON AWS SAS FORUM ESPAÑA, OCTOBER 16 TH, 2014 IAN MEYERS SOLUTIONS ARCHITECT / AMAZON WEB SERVICES

Unified Big Data Processing with Apache Spark. Matei

Large scale processing using Hadoop. Ján Vaňo

Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Data Warehouse design

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.

Moving From Hadoop to Spark

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Big Data Analytics Hadoop and Spark

Big Data Multi-Platform Analytics (Hadoop, NoSQL, Graph, Analytical Database)

Native Connectivity to Big Data Sources in MSTR 10

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

Shark Installation Guide Week 3 Report. Ankush Arora

Big Data Technologies Compared June 2014

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Big Data Explained. An introduction to Big Data Science.

TOP 8 TRENDS FOR 2016 BIG DATA

Big Data Too Big To Ignore

Ali Ghodsi Head of PM and Engineering Databricks

A Big Data Storage Architecture for the Second Wave David Sunny Sundstrom Principle Product Director, Storage Oracle

Unified Big Data Analytics Pipeline. 连 城

The? Data: Introduction and Future

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

NoSQL and Hadoop Technologies On Oracle Cloud

Sunnie Chung. Cleveland State University

Dell In-Memory Appliance for Cloudera Enterprise

Large-Scale Data Processing

Agenda. Big Data. Dell Cloud Solutions A Dell Story Summary. Concepts Market Trends and Challenges Dell Solutions

Customized Report- Big Data

The 3 questions to ask yourself about BIG DATA

BIG DATA-AS-A-SERVICE

How Transactional Analytics is Changing the Future of Business A look at the options, use cases, and anti-patterns

Big Data and Apache Hadoop s MapReduce

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Big Data Analytics. Lucas Rego Drumond

Hadoop. Sunday, November 25, 12

So What s the Big Deal?

Cloud Computing and Big Data What Technical Writers Need to Know

BIG DATA TRENDS AND TECHNOLOGIES

Trends and Research Opportunities in Spatial Big Data Analytics and Cloud Computing NCSU GeoSpatial Forum

INTRODUCTION TO CASSANDRA

TE's Analytics on Hadoop and SAP HANA Using SAP Vora

BIRT in the World of Big Data

Next presentation starting soon Business Analytics using Big Data to gain competitive advantage

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Questionnaire about the skills necessary for people. working with Big Data in the Statistical Organisations

Architectures for massive data management

HPC ABDS: The Case for an Integrating Apache Big Data Stack

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

HDP Enabling the Modern Data Architecture

Map-Reduce for Machine Learning on Multicore

P4.1 Reference Architectures for Enterprise Big Data Use Cases Romeo Kienzler, Data Scientist, Advisory Architect, IBM Germany, Austria, Switzerland

Oracle Big Data Fundamentals Ed 1 NEW

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

From Spark to Ignition:

Architectures for Big Data Analytics A database perspective

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Microsoft Research Windows Azure for Research Training

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

The Inside Scoop on Hadoop

# Not a part of 1Z0-061 or 1Z0-144 Certification test, but very important technology in BIG DATA Analysis

Chapter 1. Contrasting traditional and visual analytics approaches

How To Get The Most Out Of Big Data

Global SME Big Data Market

White Paper: Datameer s User-Focused Big Data Solutions

I/O Considerations in Big Data Analytics

Building Out Your Cloud-Ready Solutions. Clark D. Richey, Jr., Principal Technologist, DoD

Microsoft Research Microsoft Azure for Research Training

Chase Wu New Jersey Ins0tute of Technology

Mining Large Datasets: Case of Mining Graph Data in the Cloud

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Transcription:

Big Data Big data will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus. McKinsey Data Scientist: The Sexiest Job of the 21st Century - Davenport and Patil, Harvard Business Review 2012 Why is data science booming as a field? Lyle Ungar, University of Pennsylvania

Everybody is selling big data 2

What is Big Data? high volume, velocity and/or variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation. (Gartner) 1 Terabyte = 1024 Gigabytes 1 Petabyte = 1024 Terabytes 1 Exabyte = 1024 Petabytes 1 Zettabyte = 1024 Petabytes 3

Who does big data? IBM, HP, EMC, Teradata, Oracle, SAP, Microsoft, AWS, Vmware, Google Splunk, MemSQL, Palantir, Trifacta, Datameer, Neo, DataStax, Infobright, Fractal Analytics http://www.datamation.com/applications/30-big-data-companies-leading-the-way-1.html 4

Big Data u Big n vs. big p u How is big data different? l use available large-scale data rather than hoping for annotated data u Semi-parametric or non-parametric methods 5

Different methods work best at scale u Confusion set disambiguation l Choose the correct word in the set given the context {principle, principal} {then, than} (to, two, too} {weather, whether} 6

The unreasonable effectiveness of data 7

How to handle big data? u Dimensionality reduction u Sampling u Hadoop/MapReduce 8

Big Data: different philosophy Different data handling: u Mostly unstructured data objects (Schema-less NoSQL) u Vast number of attributes and data sources u Data sources added and/or updated frequently u Quality is unknown Different programming philosophy: u Distributed Programming u Fault Tolerant External References http://developer.yahoo.com/hadoop/ http://code.google.com/edu/parallel/mapreduce-tutorial.html 9

What is the slowest part of big data analysis? A) Multiplying X X B) Inverting a matrix (X X) -1? C) Reading from X from disk to memory? D) Other? 10

Data Parallelism 11

Model Parallelism 12

Map-Reduce Dataflow Data is divided across multiple machine ( mappers ) Each mapper does the same thing to different data Results are combined ( reduced ) http://codecapsule.com/ 2010/04/15/efficient-countingmapreduce/ 13

How easy is it to do in map-reduce? u Linear regression u Linear regression with feature selection u SVM u k-nn u K-means / EM A) Easy B) Hard C) Impossible 14

Good tools u LDA l Mallet l Factorie u Deep Nets l Theano l Caffe, Torch l Tensorflow? 15

Hadoop Slide from cloudera 16

In Hadoop u Hive l data warehouse infrastructure providing data summarization, query, and analysis. u Pig, Crunch l high-level platform for creating MapReduce programs used with Hadoop, pipelining code u Mahout u Solr l.scalable machine learning and data mining l u Hue enterprise search platform built on Apache Lucene l visualization 17

Spark u Combines SQL, streaming, and complex analytics u Often runs on Hadoop l or Mesos, standalone, or in the cloud u Bindings to l Java, Scala, Python, R l NLTK l u MLlib Machine Learning Library l Faster than Mahout Seems to be replacing Hadoop 18

Increasingly use a deep stack 19

Increasing in the cloud u X as a Service l SaaS (software) l PaaS (platform) l IaaS (infrastructure) u It s easy to spin these up on AWS l Or MS Azure 20

Tools are changing rapidly u Currently hot: l SMACK: Spark, Mesos, Akka, Cassandra and Kafka l Deep Net package of the week But the fundamental we learned in this class are not changing! 21