CIS4930/6930 Data Science: Large-scale Advanced Data Analysis Fall 2011. Daisy Zhe Wang CISE Department University of Florida

Vital Information
- Instructor: Daisy Zhe Wang
- Office: E456
- Class time: Tuesdays 3-5pm, Thursdays 4-5pm
- Office hours: right after class (one hour)
- TA: none
- Course page: http://www.cise.ufl.edu/class/cis6930fa11lad/ (read announcements frequently!)

Overview
- Trend: Bigger Data and Deeper Analysis
- Data Science: uses advanced analysis over large-scale data to create data products

Data Science: A Working Definition
Data Science is the science that uses computer science, statistics and machine learning, visualization, and human-computer interaction to collect, clean, integrate, analyze, visualize, and interact with data in order to create data products.

Course Goal
In this course, we will have in-depth discussions of recent publications related to Data Science. I will put the most emphasis on the systems, applications, and algorithms for large-scale advanced data analysis.

This Course Will
- Give you exposure to research topics and existing work in Data Science.
- Ask you to critique the papers we are going to read.
- Strongly encourage you to explore new research problems, come up with better solutions, and make contributions.

This Course Will NOT
- Teach you statistics, machine learning, or database systems.
- Teach you programming.
- Teach you how to be an expert in map-reduce, statistical packages, or parallel databases.

Expectations
Require:
- Information and Database Systems I (CIS4301)
- Data structures and algorithms, coding (C, Java)
- Good maths and statistics background knowledge
Encourage:
- Actively participate in discussions in the classroom
- Read Data Science literature in general
- Experience in Machine Learning, NLP, Data Mining
Academic honesty

Course Outline
- Data Analysis I: Systems and Frameworks
- Data Collection, Cleaning, and Integration
- Data Analysis II: Applications and Algorithms
- Interface Design and Data Visualization

Textbooks
- Not required, but recommended.
- Class notes + papers.

Additional Reading Pointers
- Data Science Summit (Strata) (http://www.datascientistsummit.com/)
- Kaggle Competitions (http://www.kaggle.com/)
- Data Science course at Berkeley (http://datascienc.es/)
- Conferences and Journals: VLDB, ICDE, SIGMOD, CIDR, KDD, ICDM

Grading: How can I get an A?
- Homework (25%)
- Project (55%)
- Presentations (20%)
- Participation (5% bonus)
- Novelty in Project (5% bonus)
- Late submission: 20% penalty per day, for up to 5 days.
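The weights above can be turned into a small calculation. This is only a sketch: the weights, bonus sizes, and late-penalty rate come from the slide, but the function names are hypothetical, and two details are assumptions the slide does not state: that each bonus adds 5 percentage points to the total, and that the late penalty is deducted from the score of the late item and reaches zero after 5 days.

```python
# Weights and rates are taken from the grading slide (in percent).
WEIGHTS = {"homework": 25, "project": 55, "presentation": 20}
BONUS_POINTS = 5            # per bonus (participation, project novelty)
LATE_PENALTY_PER_DAY = 20   # percent per day
MAX_LATE_DAYS = 5


def late_adjusted(score, days_late):
    """Return a 0-100 score after the late penalty.

    Assumption: the penalty is a percentage of the earned score, so a
    submission 5 or more days late is worth nothing.
    """
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    days = min(days_late, MAX_LATE_DAYS)
    return score * (100 - LATE_PENALTY_PER_DAY * days) / 100


def final_grade(homework, project, presentation, bonuses=0):
    """Weighted total of the three components (each 0-100) plus bonuses.

    Assumption: each earned bonus adds 5 percentage points to the total.
    """
    if not 0 <= bonuses <= 2:
        raise ValueError("at most two bonuses are available")
    weighted = (WEIGHTS["homework"] * homework
                + WEIGHTS["project"] * project
                + WEIGHTS["presentation"] * presentation) / 100
    return weighted + BONUS_POINTS * bonuses


if __name__ == "__main__":
    hw = late_adjusted(90, days_late=1)   # one day late
    print(final_grade(hw, 80, 100, bonuses=1))
```

Under these assumptions, a homework score of 80 handed in two days late counts as 48, and scores of 90, 80, and 100 with no bonus give a total of 86.5.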

Homework (25%)
- Literature reviews (due before class in hard copy with your name and ID; do NOT send email)
  - Main contributions (goals, techniques, evaluations)
  - Positive critiques
  - Negative critiques
- This Thursday: Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, Caleb Welton: MAD Skills: New Analysis Practices for Big Data. PVLDB 2(2): 1481-1492 (2009)
- Next Thursday: Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun: Map-Reduce for Machine Learning on Multicore. NIPS 2006: 281-288

Presentation (20%)
- Select 1-2 papers from the reading list (I will announce how to sign up)
- Prepare a 1-hour presentation
- Send me the presentation (ppt) 1 week before you present in class
- Improve the presentation according to feedback
- Deliver the presentation
- Lead the discussion

Project (55%)
- Work in groups of 2-3 people
- Project proposal (1-2 pages) (Sep 27)
  - Form the groups
- Project mid-term evaluation (Nov 3/8)
  - Novelties, Progress, and Deliverables
- Project final presentation and demo (Dec 6/8)
- Final Report (Dec 15)

Bonuses (5% each)
- Class participation
- Novelty in the Project

An Overview of Research Topics in Data Science

Goal of Data Science
Turn data into data products.

Data
- Application Databases
- Wireless Sensor Data, Seismic, Astronomy Data
- Text Data (Webpages, Wikipedia, Emails, Enterprise Documents)
- Social Media Data (Twitter, Blogs, Social Networks)
- Software Log Data (Server, API, Database Logs, Click Streams)
- Images, Videos, Music
- Scientific Data, Medical, Microarray, Genome Data
Data is getting larger and more diverse.

Data Products: Twitter
- Text Analysis
- Spam Filter/Similarity Search
- User Sentiment/Satisfaction/Feedback
- News Breakout
- Trend and Topics
200 million users as of 2011, generating over 200 million tweets and handling over 1.6 billion search queries per day.

Data Products: Netflix
- Personalized Movie Ratings
- Movie Recommendations
- Similar Movies
- Movie Categories (e.g., '80s movies with a strong female lead, Kung Fu movies)
Blockbuster is out of business.

Data Products: LinkedIn/Facebook
- People you may know
- Applications you may like
- Jobs/Events you might be interested in
- Classifier for bad users and bad content
- With high accuracy, Facebook can guess whether you are single or married
Who does not have a LinkedIn/Facebook account?

Data Products: Splunk
- Degradation, Failure Detection
- Identify Security Breaches
- Event Monitoring
- Troubleshooting Tools
- Cross-platform Event Correlation
Founded 2004; rumor has it that it is close to an IPO.

Data Products: Google
- Web Search
- News Recommendation Engine
- Google Maps
- Google Ads
- Google Analytics
Still the hottest IT company to work for now (the Microsoft of the '90s, the IBM of the '70s).

Techniques Used in Data Science
- Statistics
- Machine Learning
- Data Management
- Visualization
- HCI

Related Research Areas
- Databases, Systems
- Data Warehouse, ETL, Data Cubes
- Statistics and Machine Learning
- Data Mining
- Data Visualization
- Human-Computer Interaction
- Privacy

Challenges in Data Science
- Preparing Data (Noisy, Incomplete, Diverse, Streaming ...)
- Analyzing Data (Scalable, Accurate, Real-time, Advanced Methods, Probabilities and Uncertainties ...)
- Representing Analysis Results, i.e., the data product (Story-telling, Interactive, Explainable ...)

Sexy Job in the Next 10 Years
"The sexy job in the next ten years will be statisticians. The ability to take data, to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it: that's going to be a hugely important skill."
-- Hal Varian, Google Chief Economist, 2009

Sexy Job in the Next 10 Years
"The sexy job in the next ten years will be data scientists. The ability to take data, to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it: that's going to be a hugely important skill."
-- Hal Varian, Google Chief Economist, 2009

Skill Set of a Data Scientist
- Data Management
  - Data collection, storage, cleaning, filtering, integration
- Statistics and Machine Learning
  - Data modeling, inference, prediction, pattern recognition
- Interface and Data Visualization
  - HCI design, visualization, story-telling

The Life of Data (state-of-the-art)
[Figure: diagram with the labels Users, Interface, Collect, Clean, Integrate, Analysis, Visualization, Data Sources]

Data Collection
- Data in the First Mile
- Collect data effectively
  - Sensors: acquisitional data collection
  - Surveys: model-based re-ordering of questions (Usher)
- Collect structured data from the Web
  - WebTables
  - Scuba
  - Extraction from text (more later ...)

Data Cleaning
- Deal with dirty, noisy, incomplete data
  - E.g., Census, Sensor Networks, Text Extractions
- Interactive Data Transformation and Cleaning
  - Data Wrangler, 2011
  - Potter's Wheel, 2001
- Pay-as-you-go user feedback
- More automatic cleaning (avoiding boring tasks) is still a research challenge

Data Integration
- Merging multiple data sources
  - Schema Mapping
  - Entity Resolution (SERF)
- Querying over multiple data sources
- Pay-as-you-go data integration
- Probabilistic data integration
- Fusion Tables: web-centered data integration
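To make the entity-resolution item concrete, here is a minimal sketch of the idea: decide which records from different sources refer to the same real-world entity. It is an illustrative toy, not the SERF algorithm. The function names, the greedy clustering strategy, and the 0.85 similarity threshold are all assumptions chosen for this example.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(cleaned.split())


def similarity(a, b):
    """String similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def resolve(records, threshold=0.85):
    """Greedily group records that look like the same entity.

    Each record is compared with the first member of every existing
    cluster; it joins the first cluster that is similar enough,
    otherwise it starts a new cluster. The threshold is an assumed value.
    """
    clusters = []
    for record in records:
        for cluster in clusters:
            if similarity(record, cluster[0]) >= threshold:
                cluster.append(record)
                break
        else:
            clusters.append([record])
    return clusters


if __name__ == "__main__":
    names = ["Joseph M. Hellerstein", "joseph m hellerstein", "Andrew Y. Ng"]
    for group in resolve(names):
        print(group)
```

Real systems replace the pairwise string comparison with richer match functions and avoid comparing every pair, but the input and output have the same shape: a list of raw records in, groups of co-referring records out.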

Data Analysis (I): Systems and Frameworks
- MADLib
- Mahout
- Spark
- Map-Reduce Online
- SciDB
- RIOT
- MauveDB
- DataPath
- Dremel: Google Analytics
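Several of the frameworks listed above build on the map-reduce programming model. As a rough illustration of that model only, the sketch below counts words in a single Python process. It does not use Hadoop or any of the listed systems, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce sequentially over all documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big analysis"]))
```

In a real framework the map and reduce calls run in parallel on many machines and the shuffle moves data across the network. The user still supplies just the two functions, which is what makes the model attractive for large-scale analysis.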

Data Analysis (II): Applications and Algorithms
- Text Analysis (OpenIE, Computational Journalism, DBLife, BayesStore)
  - Classification
  - Information extraction
  - Relation extraction
  - Reference reconciliation (Co-reference)
- On-line advertisement (MADLib)
- Fraud Detection (Splash)

Data Analysis (II): Applications and Algorithms (cont.)
- Risk Management (MCDB)
- Probabilistic Databases (MystiQ, Trio, BayesStore ...)
- Mechanical Turk (CrowdDB, CrowdFlow, CrowdSearch ...): a hot topic now!
- Recommendation Systems, Log Analysis

Interface Design and Data Visualization
- Polaris, Tableau (Stanford)
- Spreadsheet (e.g., Excel)
- (MIT) Interactive Querying interface
- (Berkeley) CONTROL
- Interactive Cleaning (D^p)
- Query-by-Example
- (IBM) ManyEyes

Course Project (I): An Application of Data Science
- Using a framework we discussed in class
  - Relational Database, Parallel Database
  - Hadoop, Map-Reduce, Mechanical Turk
- Using a statistical/machine learning algorithm
  - Text Analysis (Classification, Information Extraction, Entity Resolution)
  - A/B Testing, MCMC simulation

Course Project (II): Improving a Data Science Framework
- New interface design for data cleaning/integration/querying/feedback
- New technology to improve crowdsourcing services
- New framework supporting Data Science applications

Research Directions
- Interactive Query-Driven Text Analysis
  - Information/Relation Extraction
  - Reference Reconciliation
  - Classification
- Pay-as-you-go Machine Learning
  - On-line Learning
  - Quality Control, Lineage
- Probabilistic Knowledge Base
  - Probabilistic Database + Crowd-sourcing

Homework Today
- Think about the project and form groups early (project proposal due Sep 27)
- Reviews due:
  - This Thursday: Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, Caleb Welton: MAD Skills: New Analysis Practices for Big Data. PVLDB 2(2): 1481-1492 (2009)
  - Next Thursday: Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun: Map-Reduce for Machine Learning on Multicore. NIPS 2006: 281-288