Attend Part 1 (2-3pm) to get 1 point extra credit. Polo will announce on Piazza options for DL students.

Similar documents
Big Data Analytics Process & Building Blocks

Introduction to Data Mining

Scaling Up HBase, Hive, Pegasus

Scaling Up 2 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. HBase, Hive

Big Data Analytics Building Blocks. Simple Data Storage (SQLite)

Big Data Analytics Building Blocks; Simple Data Storage (SQLite)

Big Data Analytics Building Blocks. Simple Data Storage (SQLite)

Text Analytics (Text Mining)

Big Data Analytics Building Blocks; Simple Data Storage (SQLite)

CSE 6040 Computing for Data Analytics: Methods and Tools. Lecture 1 Course Overview

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

Machine Learning using MapReduce

Analytics on Big Data

BIG DATA What it is and how to use?

CPSC 340: Machine Learning and Data Mining. Mark Schmidt University of British Columbia Fall 2015

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Data Mining for Business Analytics

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Using Data Mining and Machine Learning in Retail

Machine Learning for Data Science (CS4786) Lecture 1

DATABASES often contain uncertain and imprecise references

Machine Learning and Data Mining. Fundamentals, robotics, recognition

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Information Management course

MS1b Statistical Data Mining

Chapter 20: Data Analysis

The Need for Training in Big Data: Experiences and Case Studies

Introduction to Artificial Intelligence G51IAI. An Introduction to Data Mining

Using multiple models: Bagging, Boosting, Ensembles, Forests

The Data Mining Process

Foundations of Artificial Intelligence. Introduction to Data Mining

OUTLIER ANALYSIS. Data Mining 1

Statistical Validation and Data Analytics in ediscovery. Jesse Kornblum

Data Mining + Business Intelligence. Integration, Design and Implementation

B490 Mining the Big Data. 0 Introduction

Big Data and Analytics: Challenges and Opportunities

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

Ensemble Learning Better Predictions Through Diversity. Todd Holloway ETech 2008

Anomaly detection. Problem motivation. Machine Learning

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

CS Data Science and Visualization Spring 2016

COMP9321 Web Application Engineering

IC05 Introduction on Networks &Visualization Nov

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION

Big Data Systems CS 5965/6965 FALL 2014

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Machine Learning CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Clustering Big Data. Efficient Data Mining Technologies. J Singh and Teresa Brooks. June 4, 2015

BIRCH: An Efficient Data Clustering Method For Very Large Databases

TDWI Best Practice BI & DW Predictive Analytics & Data Mining

Big Data Analytics and Healthcare

Let the data speak to you. Look Who s Peeking at Your Paycheck. Big Data. What is Big Data? The Artemis project: Saving preemies using Big Data

SoSe 2014: M-TANI: Big Data Analytics

Office: LSK 5045 Begin subject: [ISOM3360]...

TIETS34 Seminar: Data Mining on Biometric identification

Bringing Big Data Modelling into the Hands of Domain Experts

Quick Introduction of Data Mining Techniques

A Case of Study on Hadoop Benchmark Behavior Modeling Using ALOJA-ML

Solving Big Data Problems in Computer Vision with MATLAB Loren Shure

Big Data and Marketing

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

Data Mining: Introduction. Lecture Notes for Chapter 1. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

! E6893 Big Data Analytics Lecture 5:! Big Data Analytics Algorithms -- II

International Journal of Engineering Research ISSN: & Management Technology November-2015 Volume 2, Issue-6

Outline. What is Big data and where they come from? How we deal with Big data?

DEGREE CURRICULUM BIG DATA ANALYTICS SPECIALITY. MASTER in Informatics Engineering

The Lane s Gifts v. Google Report

Challenges and Lessons from NIST Data Science Pre-pilot Evaluation in Introduction to Data Science Course Fall 2015

Bayesian networks - Time-series models - Apache Spark & Scala

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Introduction. A. Bellaachia Page: 1

How To Cluster

Principles of Dat Da a t Mining Pham Tho Hoan hoanpt@hnue.edu.v hoanpt@hnue.edu. n

Lecture: Mon 13:30 14:50 Fri 9:00-10:20 ( LTH, Lift 27-28) Lab: Fri 12:00-12:50 (Rm. 4116)

What is Data Mining? Data Mining (Knowledge discovery in database) Data mining: Basic steps. Mining tasks. Classification: YES, NO

Machine Learning Capacity and Performance Analysis and R

CS 493N: Big Data Engineering - Overview

CSE4334/5334 Data Mining Lecturer 2: Introduction to Data Mining. Chengkai Li University of Texas at Arlington Spring 2016

Concept and Project Objectives

: Introduction to Machine Learning Dr. Rita Osadchy

How To Create A Data Science System

2015 The MathWorks, Inc. 1

Using Data Mining for Mobile Communication Clustering and Characterization

Software Engineering for Big Data. CS846 Paulo Alencar David R. Cheriton School of Computer Science University of Waterloo

Chapter ML:XI (continued)

E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics

Cluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Data Mining: Overview. What is Data Mining?

Principles of Data Mining by Hand&Mannila&Smyth

Reconstructing Self Organizing Maps as Spider Graphs for better visual interpretation of large unstructured datasets

Social Media Mining. Data Mining Essentials

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

DEMYSTIFYING BIG DATA. What it is, what it isn t, and what it can do for you.

Predictive Modeling Techniques in Insurance

Visualization methods for patent data

Mammoth Scale Machine Learning!

BIG DATA VISUALIZATION. Team Impossible Peter Vilim, Sruthi Mayuram Krithivasan, Matt Burrough, and Ismini Lourentzou

Transcription:

Attend Part 1 (2-3pm) to get 1 point extra credit. Polo will announce on Piazza options for DL students. Data Science/Data Analytics and Scaling to Big Data with MathWorks Using Data Analytics to turn large volumes of complex data into actionable information can help you improve design and decision-making processes. However, developing effective analytics and integrating them into business systems can be challenging. In this seminar you will learn approaches and techniques available in MATLAB to tackle these challenges Data Science/Data Analytics (Part 1) Fri, Feb 6, 2-3pm Klaus 1116 * Access and Explore data: Accessing, exploring, and analyzing data stored in files, the web, and databases * Data Munging: Techniques for cleaning, exploring, visualizing, and combining complex multivariate data sets * Developing predictive models: Prototyping, testing, and refining predictive models using machine learning methods * Integrating analytics with systems: Integrating and running analytics within enterprise business systems and interactive web applications Scale to Big Data (Part 2) Fri, Feb 6, 3-4pm Klaus 2443 * Work with out-of-memory datasets with MATLAB * MapReduce algorithms and Hadoop integration in MATLAB

CSE 6242 / CX 4242 Data Mining Concepts & Tasks Duen Horng (Polo) Chau Georgia Tech Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos

Final words about Data Integration

Freebase (a graph of entities) a large collaborative knowledge base consisting of metadata composed mainly by its community members Wikipedia. 4

Crowd-sourcing Approaches: Freebase http://wiki.freebase.com/wiki/what_is_freebase%3f 5

We need ways to identify the many ways that one thing may be called. How? 6

Entity Resolution (A hard problem in data integration) Polo Chau P. Chau Duen Horng Chau Duen Chau D. Chau 7

Why is Entity Resolution so Important?

D-Dupe Interactive Data Deduplication and Integration TVCG 2008 University of Maryland Bilgic, Licamele, Getoor, Kang, Shneiderman http://linqs.cs.umd.edu/basilic/web/publications/2008/kang:tvcg08/kang-tvcg08.pdf http://www.cs.umd.edu/projects/linqs/ddupe/ (skip to 0:55) 10

Polo Poalo

Numerous similarity functions Excellent read: http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf Euclidean distance Euclidean norm / L2 norm Manhattan distance Jaccard Similarity e.g., overlap of nodes #neighbors String edit distance e.g., Polo Chau vs Polo Chan Canberra distance 13

Core components: Similarity functions Determine how two entities are similar. D-Dupe s approach: Attribute similarity + relational similarity Similarity score for a pair of entities 14

Attribute similarity (a weighted sum) 15

Data Mining Concepts & Tasks

http://www.amazon.com/data-science-businessdata-analytic-thinking/dp/1449361323

1. Classification or class Probability Estimation Predict which of a (small) set of classes an entity belong to. Cancer testing (yes, no) Movie genre (action, drama, etc.) sports (win, loss) email spam filter (spam, or not) gesture detection (pinch, swipe ) planet zone habitable or not gene prediction news types (sports, entertainment) virus scanning (malware or not) 18

2. Regression ( value estimation ) Predict the numerical value of some variable for an entity. point value of wine (50-100) credit score (start with classification; default or not) stock prices wall street relationship between price and sales weather sports and game scores 19

3. Similarity Matching Find similar entities (from a large dataset) based on what we know about them. recommending items you may want to buy find similar gene sequences (that may be repeating, or does similar things) online dating building auditing (energy consumption) patent search carpool matching (find people to carpool) detecting fake identies 20

4. Clustering (unsupervised learning) Group entities together by their similarity. (User provides # of clusters) cluster people into demographics groups (young, old, etc) cluster people by accents (y all, you all) hierarchical clustering for metabolomics clustering images on the web (cat?) ~=dimensionality reduction 21

5. Co-occurrence grouping (Many names: frequent itemset mining, association rule discovery, market-basket analysis) Find associations between entities based on transactions that involve them (e.g., bread and milk often bought together) http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girlwas-pregnant-before-her-father-did/ 22

6. Profiling / Pattern Mining / Anomaly Detection (unsupervised) Characterize typical behaviors of an entity (person, computer router, etc.) so you can find trends and outliers. Examples? computer instruction prediction removing noise from experiment (data cleaning) detect anomalies in network traffic moneyball weather anomalies (e.g., big storm) google sign-in (alert) smart security camera embezzlement trending articles 23

7. Link Prediction / Recommendation Predict if two entities should be connected, and how strongly that link should be. linkedin/facebook: people you may know amazon/netflix: because you like terminator suggest other movies you may also like 24

8. Data reduction ( dimensionality reduction ) Shrink a large dataset into smaller one, with as little loss of information as possible 1. if you want to visualize the data (in 2D/3D) 2. faster computation/less storage 3. reduce noise 25

Start Thinking About Project! What problems do you want to solve? Using what (large) datasets? What techniques do you need? 26