CAS CS 565, Data Mining



Similar documents
Information Management course

B490 Mining the Big Data. 0 Introduction

Introduction to Data Mining

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction

Search and Data Mining: Techniques. Applications Anya Yarygina Boris Novikov

Principles of Dat Da a t Mining Pham Tho Hoan hoanpt@hnue.edu.v hoanpt@hnue.edu. n

Syllabus. HMI 7437: Data Warehousing and Data/Text Mining for Healthcare

CSE 427 CLOUD COMPUTING WITH BIG DATA APPLICATIONS

Mathematics for Business Analysis I Fall 2007

College of Health and Human Services. Fall Syllabus

Dynamic Data in terms of Data Mining Streams

Introduction to Data Mining. Lijun Zhang

CPSC 340: Machine Learning and Data Mining. Mark Schmidt University of British Columbia Fall 2015

GET 114 Computer Programming Course Outline. Contact: Office Hours: TBD (or by appointment)

Statistics for BIG data

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

COURSE SYLLABUS. Enterprise Information Systems and Business Intelligence

Server Load Prediction

Introduction. A. Bellaachia Page: 1

The University of Jordan

Introduction to Data Science: CptS Syllabus First Offering: Fall 2015

Data Mining Analytics for Business Intelligence and Decision Support

CSCI-599 DATA MINING AND STATISTICAL INFERENCE

Big Data Systems CS 5965/6965 FALL 2015

Data, Measurements, Features

Let the data speak to you. Look Who s Peeking at Your Paycheck. Big Data. What is Big Data? The Artemis project: Saving preemies using Big Data

Principles of Data Mining by Hand&Mannila&Smyth

Pierce College Online Math. Math 115. Section #0938 Fall 2013

CS Data Science and Visualization Spring 2016

Office: LSK 5045 Begin subject: [ISOM3360]...

Statistics 215b 11/20/03 D.R. Brillinger. A field in search of a definition a vague concept

Cleveland State University

Syllabus for MATH 191 MATH 191 Topics in Data Science: Algorithms and Mathematical Foundations Department of Mathematics, UCLA Fall Quarter 2015

Introduction to Data Mining

Lluis Belanche + Alfredo Vellido. Intelligent Data Analysis and Data Mining

Web Mining Seminar CSE 450. Spring 2008 MWF 11:10 12:00pm Maginnes 113

IC05 Introduction on Networks &Visualization Nov

CISC 432/CMPE 432/CISC 832 Advanced Database Systems

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier

Analytics on Big Data

Executive Master of Public Administration. QUANTITATIVE TECHNIQUES I For Policy Making and Administration U6310, Sec. 03

Scalable Machine Learning - or what to do with all that Big Data infrastructure

San José State University CS160, Software Engineering, Sections 1, 2, and 4, Fall, 2015

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

How To Learn Data Analytics

CSE 544 Principles of Database Management Systems. Magdalena Balazinska (magda) Spring 2006 Lecture 1 - Class Introduction

Data Warehousing and Data Mining

Principles of Data Mining

CORE CLASSES: IS 6410 Information Systems Analysis and Design IS 6420 Database Theory and Design IS 6440 Networking & Servers (3)

ANALYTICS CENTER LEARNING PROGRAM

Assessment for Master s Degree Program Fall Spring 2011 Computer Science Dept. Texas A&M University - Commerce

INFO/CS 4302 Web Information Systems. FT 2012 Week 1: Course Introduction

MATH 1111 College Algebra Fall Semester 2014 Course Syllabus. Course Details: TR 3:30 4:45 pm Math 1111-I4 CRN 963 IC #322

Online Basic Statistics

DATA MINING TECHNIQUES SUPPORT TO KNOWLEGDE OF BUSINESS INTELLIGENT SYSTEM

Perspectives on Data Mining

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

Concept and Applications of Data Mining. Week 1

CSci 538 Articial Intelligence (Machine Learning and Data Analysis)

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

Algorithmic Techniques for Big Data Analysis. Barna Saha AT&T Lab-Research

A Statistical Text Mining Method for Patent Analysis

Section Format Day Begin End Building Rm# Instructor. 001 Lecture Tue 6:45 PM 8:40 PM Silver 401 Ballerini

Three Perspectives of Data Mining

Government of Russian Federation. Faculty of Computer Science School of Data Analysis and Artificial Intelligence

Introduction to Data Mining

B669 Sublinear Algorithms for Big Data

CHAPTER 1 INTRODUCTION

MS1b Statistical Data Mining

Classification and Prediction

MA2823: Foundations of Machine Learning

Eastern Washington University Department of Computer Science. Questionnaire for Prospective Masters in Computer Science Students

Sanjeev Kumar. contribute

A Time Efficient Algorithm for Web Log Analysis

Data Mining. Knowledge Discovery, Data Warehousing and Machine Learning Final remarks. Lecturer: JERZY STEFANOWSKI

SAS JOINT DATA MINING CERTIFICATION AT BRYANT UNIVERSITY

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

Collaborations between Official Statistics and Academia in the Era of Big Data

Metrics for Managers: Big Data and Better Answers. Fall Course Syllabus DRAFT. Faculty: Professor Joseph Doyle E

College information system research based on data mining

Bachelor of Games and Virtual Worlds (Programming) Subject and Course Summaries

CSE 544 Principles of Database Management Systems. Magdalena Balazinska (magda) Fall 2007 Lecture 1 - Class Introduction

Project Report. 1. Application Scenario

Introduction. Mathematics 41P Precalculus: First Semester

CS 2750 Machine Learning. Lecture 1. Machine Learning. CS 2750 Machine Learning.

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset

Introduction to Data Mining Techniques

LAUREA MAGISTRALE - CURRICULUM IN INTERNATIONAL MANAGEMENT, LEGISLATION AND SOCIETY. 1st TERM (14 SEPT - 27 NOV)

Data Mining Solutions for the Business Environment

Subject Description Form

MAT 103B College Algebra Part I Winter 2016 Course Outline and Syllabus

Knowledge Discovery Process and Data Mining - Final remarks

MAC 2233, STA 2023, and junior standing

Transcription:

CAS CS 565, Data Mining

Course logistics Course webpage: http://www.cs.bu.edu/~evimaria/cs565-10.html Schedule: Mon Wed, 4-5:30 Instructor: Evimaria Terzi, evimaria@cs.bu.edu Office hours: Mon 2:30-4pm, Tues 10:30am- 12 (or by appointment) Mailing list : cascs565a1-l@bu.edu

Topics to be covered (tentative) Introduction to data mining and prototype problems Frequent pattern mining Frequent itemsets and association rules Clustering Dimensionality reduction Classification Link analysis ranking Recommendation systems Time-series data Privacy-preserving data mining

Syllabus Sept 8 Sept 13 Sept 15, 20 Sept 22, 27, 29, Oct 4 Oct 6, 12 Oct 11 Oct 13 Oct 18, 20, 25, 27 Nov 1, 3, 8, 10 Nov 15, 17, 22 Nov 24, 29 Dec 6, 08 Week starting Dec 13 Introduction to data mining Basic algorithms and prototype problems Frequent itemsets and association rules Clustering algorithms Dimensionality reduction Holiday-Tues instead Midterm exam Classification Link-analysis ranking Recommendation systems Time series analysis Privacy-preserving data mining Final exam; exact date to be determined

Course workload Three programming assignments (30%) Three problem sets (20%) Midterm exam (20%) Final exam (30%) Late assignment policy: 10% per day up to three days; credit will be not given after that Incompletes will not be given

Textbooks D. Hand, H. Mannila and P. Smyth: Principles of Data Mining. MIT Press, 2001 Jiawer Han and Micheline Kamber: Data Mining: Concepts and Techiques. Second Edition. Morgan Kaufmann Publishers, March 2006 Toby Segaran: Programming Collective Intelligence: Building Smart Web 2.0 Applications. O Reilly Research papers (pointers will be provided)

Prerequisites Basic algorithms: sorting, set manipulation, hashing Analysis of algorithms: O-notation and its variants, perhaps some recursion equations, NP-hardness Programming: some programming language, ability to do small experiments reasonably quickly Probability: concepts of probability and conditional probability, expectations, binomial and other simple distributions Some linear algebra: e.g., eigenvector and eigenvalue computations

Above all The goal of the course is to learn and enjoy The basic principle is to ask questions when you don t understand Say when things are unclear; not everything can be clear from the beginning Participate in the class as much as possible

Introduction to data mining Why do we need data analysis? What is data mining? Examples where data mining has been useful Data mining and other areas of computer science and statistics Some (basic) data-mining tasks

Why do we need data analysis Really really lots of raw data data!! Moore s law: more efficient processors, larger memories Communications have improved too Measurement technologies have improved dramatically It possible to store and collect lots of raw data The data-analysis methods are lagging behind Need to analyze the raw data to extract knowledge

The data is also very complex Multiple types of data: tables, time series, images, graphs, etc Spatial and temporal aspects Large number of different variables Lots of observations large datasets

Example: transaction data Billions of real-life customers: e.g., walmart, safeway customers, etc Billions of online customers: e.g., amazon, expedia, etc.

Example: document data Web as a document repository: billions of web pages Wikipedia: 4 million articles (and counting) Online collections of scientific articles

Example: network data Web: 50 billion pages linked via hyperlinks Facebook: 400 million users MySpace: 300 million users Instant messenger: ~1billion users Blogs: 250 million blogs worldwide, presidential candidates run blogs

Example: genomic sequences http://www.1000genomes.org/page.php Full sequence of 1000 individuals 310^9 nucleotides per person 310^12 nucleotides Lots more data in fact: medical history of the persons, gene expression data

Example: environmental data Climate data (just an example) http://www.ncdc.gov/oa/climate/ghcn-monthly/index.php a database of temperature, precipitation and pressure records managed by the National Climatic Data Center, Arizona State University and the Carbon Dioxide Information Analysis Center 6000 temperature stations, 7500 precipitation stations, 2000 pressure stations

We have large datasets so what? Goal: obtain useful knowledge from large masses of data Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data analyst Tell me something interesting about the data; describe the data Exploratory analysis on large datasets

What can data-mining methods do? Extract frequent patterns There are lots of documents that contain the phrases association rules, data mining and efficient algorithm Extract association rules 80% of the walmart customers that buy beer and sausage also buy mustard Extract rules If occupation=phd student then income < 20K

What can data-mining methods do? Rank web-query results What are the most relevant web-pages to the query: Student housing BU? Find good recommendations for users Recommend amazon customers new books Recommend facebook users new friends/groups Find groups of entities that are similar (clustering) Find groups of facebook users that have similar friends/interests Find groups amazon users that buy similar products Find groups of walmart customers that buy similar products

Goal of this course Describe some problems that can be solved using datamining methods Discuss the intuition behind data-mining methods that solve these problems Illustrate the theoretical underpinnings of these methods Show how these methods can be useful in practice

Data mining and related areas How does data mining relate to machine learning? How does data mining relate to statistics? Other related areas?

Data mining vs machine learning Machine learning methods are used for data mining Classification, clustering Amount of data makes the difference Data mining deals with much larger datasets and scalability becomes an issue Data mining has more modest goals Automating tedious discovery tasks, not aiming at human performance in real discovery Helping users, not replacing them

Data mining vs. statistics tell me something interesting about this data what else is this than statistics? The goal is similar Different types of methods In data mining one investigates lot of possible hypotheses Data mining is more exploratory data analysis In data mining there are much larger datasets algorithmics/scalability is an issue

Data mining and databases Ordinary database usage: deductive Knowledge discovery: inductive Inductive reasoning is exploratory New requirements for database management systems Novel data structures, algorithms and architectures are needed

Data mining and algorithms Lots of nice connections A wealth of interesting research questions We will focus on some of these questions later in the course

Some simple data-analysis tasks Given a stream or set of numbers (identifiers, etc) How many numbers are there? How many distinct numbers are there? What are the most frequent numbers? How many numbers appear at least K times? How many numbers appear only once? etc

Finding the majority element A neat problem A stream of identifiers; one of them occurs more than 50% of the time How can you find it using no more than a few memory locations? Suggestions?

Finding the majority element (solution) A = first item you see; count = 1 for each subsequent item B if (A==B) count = count + 1 else { count = count - 1 if (count == 0) {A=B; count = 1} } endfor return A Why does this work correctly?

Finding the majority element (solution and correctness proof) A = first item you see; count = 1 for each subsequent item B if (A==B) count = count + 1 else { endfor } return A count = count - 1 if (count == 0) {A=B; count = 1} Basic observation: Whenever we discard element u we also discard a unique element v different from u

Finding a number in the top half Given a set of N numbers (N is very large) Find a number x such that x is *likely* to be larger than the median of the numbers Simple solution Sort the numbers and store them in sorted array A Any value larger than A[N/2] is a solution Other solutions?

Finding a number in the top half efficiently A solution that uses small number of operations Randomly sample K numbers from the file Output their maximum median N/2 items N/2 items Failure probability (1/2)^K