Training for Big Data

Similar documents
Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics

Transforming the Telecoms Business using Big Data and Analytics

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, Viswa Sharma Solutions Architect Tata Consultancy Services

The Internet of Things and Big Data: Intro

How Companies are! Using Spark

Integrating a Big Data Platform into Government:

Enabling Manufacturing Transformation in a Connected World. John Shewchuk Technical Fellow DX

Educational Opportunities in Big Data

Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014

From Data to Insight: Big Data and Analytics for Smart Manufacturing Systems

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

The 3 questions to ask yourself about BIG DATA

Towards a Thriving Data Economy: Open Data, Big Data, and Data Ecosystems

How To Handle Big Data With A Data Scientist

Making big data simple with Databricks

BIG DATA TRENDS AND TECHNOLOGIES

Interactive data analytics drive insights

Are You Ready for Big Data?

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

Big Data. White Paper. Big Data Executive Overview WP-BD Jafar Shunnar & Dan Raver. Page 1 Last Updated

BIG DATA AND MICROSOFT. Susie Adams CTO Microsoft Federal

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Advanced In-Database Analytics

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: Vol. 1, Issue 6, October Big Data and Hadoop

Big Data and its use for Communications Service Providers. IoT, Washington, DC April 9, 2014 Sanjay Mishra

BIG Big Data Public Private Forum

BIG DATA: STORAGE, ANALYSIS AND IMPACT GEDIMINAS ŽYLIUS

Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Are You Ready for Big Data?

Trends and Research Opportunities in Spatial Big Data Analytics and Cloud Computing NCSU GeoSpatial Forum

Why Big Data in the Cloud?

Agenda. Big Data & Hadoop ViPR HDFS Pivotal Big Data Suite & ViPR HDFS ViON Customer Feedback #EMCVIPR

Big Data and Transactional Databases Exploding Data Volume is Creating New Stresses on Traditional Transactional Databases

5 Keys to Unlocking the Big Data Analytics Puzzle. Anurag Tandon Director, Product Marketing March 26, 2014

CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Using Data Mining and Machine Learning in Retail

Intro to Big Data and Business Intelligence

A Strategic Approach to Unlock the Opportunities from Big Data

Connected Product Maturity Model

In-Memory Analytics for Big Data

From Data to Foresight:

The Bloor Group. The Pillars of Data Science

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Capgemini Big Data Analytics Sandbox for Financial Services

White Paper. Version 1.2 May 2015 RAID Incorporated

Business Intelligence meets Big Data: An Overview on Security and Privacy

Hortonworks and ODP: Realizing the Future of Big Data, Now Manila, May 13, 2015

Big Data Executive Survey

WROX Certified Big Data Analyst Program by AnalytixLabs and Wiley

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

Challenges and Lessons from NIST Data Science Pre-pilot Evaluation in Introduction to Data Science Course Fall 2015

HPC technology and future architecture

Systems of Discovery The Perfect Storm of Big Data, Cloud and Internet-of-Things

Apache Hadoop: The Big Data Refinery

The Rise of Industrial Big Data

Hadoop Usage At Yahoo! Milind Bhandarkar

Mike Maxey. Senior Director Product Marketing Greenplum A Division of EMC. Copyright 2011 EMC Corporation. All rights reserved.

Luncheon Webinar Series May 13, 2013

Proposal for the Theme on Big Data. Analytics. Qiang Yang, HKUST Jiannong Cao, PolyU Qi-man Shao, CUHK. May 2015

SQLstream Blaze and Apache Storm A BENCHMARK COMPARISON

The Big Data methodology in computer vision systems

Strengthening the decision making process with data intelligence in publishing industry CONTEC 2014 Frankfurt Germany - October 7 th 2014

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

A Demonstration of a Robust Context Classification System (CCS) and its Context ToolChain (CTC)

Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster. Nov 7, 2012

Oracle Big Data Building A Big Data Management System

From Spark to Ignition:

How Big Is Big Data Adoption? Survey Results. Survey Results Big Data Company Strategy... 6

Cloudera Enterprise Data Hub in Telecom:

Workshop on Hadoop with Big Data

Big Data and Data Science: Behind the Buzz Words

Big Data a threat or a chance?

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

UNIFY YOUR (BIG) DATA

Big Data and Industrial Internet

Big Data Hope or Hype?

Big Data: Overview and Roadmap eglobaltech. All rights reserved.

The 4 Pillars of Technosoft s Big Data Practice

BIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata

Statistical Challenges with Big Data in Management Science

Internet of Things. Opportunity Challenges Solutions

Software Engineering for Big Data. CS846 Paulo Alencar David R. Cheriton School of Computer Science University of Waterloo

Manifest for Big Data Pig, Hive & Jaql

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

CIS 4930/6930 Spring 2014 Introduction to Data Science Data Intensive Computing. University of Florida, CISE Department Prof.

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

Big Data Storage Challenges for the Industrial Internet of Things

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

雲 端 運 算 願 景 與 實 現 馬 維 英 博 士 微 軟 亞 洲 研 究 院 常 務 副 院 長

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

Transcription:

Training for Big Data Learnings from the CATS Workshop Raghu Ramakrishnan Technical Fellow, Microsoft Head, Big Data Engineering Head, Cloud Information Services Lab

Store any kind of data What is Big Data? Files, relations, docs, logs, graphs, multimedia HDFS Do any type of analysis SQL queries, ML, image processing, log analytics Hive, Map-Reduce, Mahout, In any mode Batch, interactive, streaming Hive, Spark, Storm At any scale terabytes or more or more than your old system Elastic capacity, commodity hardware

Why is it Important? Data is Now a Core Asset Big Data systems can be transformative More data: More is observable, measurable Less cost: Data acquisition, storage, compute, cloud Easier: Elastic capacity, clean-as-you-go Scalable analytic tools: Can get more out of data We can cost-effectively do things we couldn t contemplate before; the Fourth Paradigm can be brought to bear on a wide range of domains: Data-driven science, commerce, government, social programs, manufacturing, medicine Biggest bottleneck trained people

Big Data Applications

Web Scale 1.5+M read rps 4 Key Platforms 10,000+ servers 10 Geo Zones ~2B User Ids ~750M Uniq Users* 175+M Users in US 102B emails/month 13+B Ad serves/day 285+M Mail users 40+ Countries 11B visits/month All Data from July/Aug. Worldwide unless indicated to the contrary * Yahoo!-branded sites 5 10/15/2014

CORE Modeling Overview Offline Modeling Exploratory data analysis Regression, feature selection, collaborative filtering (factorization) Seed online models & explore/exploit methods at good initial points Reduce the set of candidate items Large amount of historical data (user event streams) Online Learning Online regression models, time-series models Model the temporal dynamics Provide fast learning for per-item models Near real-time user feedback Explore/Exploit Multi-armed bandits Find the best way of collecting realtime user feedback (for new items)

Traditional ecology Field work Experiments Theorizing (Slide courtesy Drew Purves, MSR)

Some huge questions we can t answer Will forests accelerate or slow climate change? How can we safely genetically engineer crops? How can we feed 9+ billion humans, with less water, less oil and less phosphorous? Is there enough to water for both agriculture, and How many species industry, in the are there on Earth? future? How can we predict them? (Slide courtesy Drew Purves, MSR) What would we do if a new disease hit wheat? Or the pollinators died out? How can we optimize supply chains to minimize environmental impact? Will the world become more fire prone as the climate changes?

Enter a new kind of ecology? Traditional Ecology Qualitative insights Driven by academic curiosity Fragmented into subdisciplines, divorced from other fields of study Divorced from policy Huge shortage of all data Computation and statistics an afterthought a necessary evil! A disparate set of software tools Joined up Ecology Quantitative predictions Driven by society s needs Integrated across subdisciplines, and with other fields of study Connected to policy Huge abundance of some data, huge shortage of other data Computational techniques and statistics central and exciting! An interoperable suite of software tools (Slide courtesy Drew Purves, MSR)

IoT: Connected Devices, Continuous Observations, Instantaneous Action, Rich History http://blogs.cisco.com/news/the-internet-of-things-infographic/ Gartner: 50 billion intelligent devices by 2015 and 275 Exabytes per day of data being sent across the Internet by 2020

Data Created by People Kinect The Kinect is an array of sensors. Depth, audio, RGB camera SDK provides a 3D virtual skeleton. 20 points around the body. 30 frames per second. Inexpensive and ubiquitous. Between 60-70M sold by May 2013.

Big Data Training What should we teach? To whom, and how? Analysis of Big Data requires cross-disciplinary skills, including the ability to make modeling decisions while trading off optimization and approximation, and being attentive to system robustness. Training Students to Extract Value from Big Data (NAE report on a workshop organized by CATS)

Skills to Teach Data management Machine Learning Parallel computing Security Software engineering Statistical analysis and inference Visualization Grounding in application domain Formulating real problems in terms of actionable outcomes from analyzing specific data Objective evaluation of outcomes and iterative improvement Very cross-disciplinary; ubiquitous in applicability across domains; any given application typically requires careful use of domain expertise

Data Analysis Pipeline Ask a general question Frame it in light of available and relevant data Understanding scope of inferences, sampling, biases Query and transform the data to prepare it Exploratory analysis Get a feel for the data, and what insights it could offer Understanding randomness, variability, uncertainty, sparsity Modeling Dimension reduction, feature selection, prediction, classification, parameter estimation Model forensics, interpretation Residuals, model comparison, model uncertainty Actionable insights Integration into real-world context

Curricula, Programs Should Big Data and Data Science be a new discipline, or seen as a complementary skillset for other disciplines? New major or minor? Certificate programs? Masters programs? How should curricula/programs be developed? Role of parent disciplines/departments? Do we need to design new courses to build effective curricula? Or can we create curricula by taking collections of courses from the parent disciplines? Should we introduce at undergrad or grad level? Role of intense courses ( boot camps ), MOOCs, etc.

Big Data Links H.V. Jagadish et al., CACM, July 2014 Big Data and its Technical Challenges D. Agrawal et al., CACM 56(6):92-101 (2013) Content Recommendation on Web Portals NAE Workshop Report, 2014 Training Students to Extract Value from Big Data Gary Marcus, New Yorker April 3 2013 http://www.newyorker.com/online/blogs/elements/2013/04/steamrolled-by-big-data.html May 23 2013 http://www.newyorker.com/online/blogs/culture/2012/05/google-knowledge-graph.html Alan Feuer, NYT, Mar 23, 2013 http://www.nytimes.com/2013/03/24/nyregion/mayor-bloombergs-geek-squad.html?pagewanted=all&_r=0 Quentin Hardy, NYT, Nov 28, 2012 http://bits.blogs.nytimes.com/2012/11/28/jeff-hawkins-develops-a-brainy-big-data-company/ Steve Lohr, NYT, Feb 11, 2012 http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact in the-world.html Economist, Nov 2011 http://www.economist.com/blogs/dailychart/2011/11/big data 0 Forbes special report on Big Data, February 2012 http://www.forbes.com/special report/data-driven.html T. Hey et al., (Eds) (2013) The Fourth Paradigm: Data-Intensive Scientific Discovery. J. Manyika et al., McKinsey Global Institute, May 2011 Big Data: The Next Frontier for Innovation, Competition, and Productivity. NPR, Nov 2011 http://www.npr.org/2011/11/30/142893065/the search for analysts-to-make-sense-of-big-data