CSE4334/5334 Data Mining Lecture 2: Introduction to Data Mining. Chengkai Li, University of Texas at Arlington, Spring 2016





Big Data http://dilbert.com/strip/2012-07-29

Big Data http://www.ibmbigdatahub.com/infographic/four-vs-big-data

Big Data: The 4 Vs
o Volume
o Variety
o Velocity
o Veracity

Volume: How much data is out there?
http://www.sciencedaily.com/releases/2013/05/130522085217.htm
http://www.storagenewsletter.com/rubriques/marketreportsresearch/ibm-cmo-study/

Variety: Types of Data
Structured Data
  o (relational) database tables
  o CSV/TSV files
Semi-structured Data
  o XML
  o JSON
  o RDF
Unstructured Data
  o text data (documents, Web pages, short texts (e.g., social media))
Multimedia Data (images, videos, audio)
Other types of data
  o matrices, graphs, sequences, time-series, spatio-temporal
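As a small illustration of the difference between structured and semi-structured data, the sketch below reads the same record once from CSV text and once from JSON text, using only the Python standard library. The record and its field names are made up for this example and do not come from the lecture.

```python
import csv
import io
import json

# Hypothetical record, invented for illustration only.
CSV_TEXT = "name,city,tags\nAlice,Arlington,music;sports\n"
JSON_TEXT = '{"name": "Alice", "city": "Arlington", "tags": ["music", "sports"]}'

# Structured: every row has the same flat columns, and every value is a string.
csv_rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
csv_record = csv_rows[0]
# A multi-valued field has to be packed into one cell and split by convention.
csv_tags = csv_record["tags"].split(";")

# Semi-structured: nesting and lists are part of the format itself.
json_record = json.loads(JSON_TEXT)
json_tags = json_record["tags"]

print(csv_record["name"], csv_tags)
print(json_record["name"], json_tags)
```

Both readers recover the same values; the difference is that the JSON text carries the list structure itself, while the CSV text relies on an agreed separator inside a cell.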

Velocity: Streaming Data
Stock Trades
Highway Sensors
Weather Data
Social Media
Telephone Calls
Video Streaming

http://mashable.com/2012/06/22/data-created-every-minute/

Datasets
Amazon Public Data Sets
Data.gov
Linked Open Data
Knowledge Bases, Encyclopedia
Yahoo! Webscope
Network/Graph Datasets
UCI Machine Learning Repository
UCR Time Series Classification/Clustering
Time Series Data Library
KDnuggets Dataset List
KDD Cup Datasets

Amazon Public Data Sets http://aws.amazon.com/public-data-sets/
o NASA NEX: A collection of Earth science data sets maintained by NASA, including climate change projections and satellite images of the Earth's surface
o Common Crawl Corpus: A corpus of web crawl data composed of over 5 billion web pages
o 1000 Genomes Project: A detailed map of human genetic variation
o Google Books Ngrams: A data set containing Google Books n-gram corpora
o US Census Data: US demographic data from the 1980, 1990, and 2000 US Censuses
o Freebase Data Dump: A data dump of all the current facts and assertions in the Freebase system, an open database covering millions of topics

Data.gov http://www.data.gov/ (137,608 datasets)
o Consumer Complaint Database
o U.S. International Trade in Goods and Services: Monthly report that provides national trade data including imports, exports, and balance of payments for goods and services.
o DTV Reception Maps
o Climate Data Online
o Food Access Research Atlas: presents a spatial overview of food access indicators for low-income and other census tracts using different measures of supermarket...
o U.S. Hourly Precipitation Data
o Great Chile Earthquake of May 22, 1960
o Consumer Expenditure Survey
o Campus Security Data
o Farmers Markets Geographic Data: longitude and latitude, state, address, name, and zip code of Farmers Markets in the United States
o Crimes - 2001 to present (City of Chicago)

Linked Data http://linkeddata.org/ (hundreds of datasets, billions of RDF triples)

Knowledge Bases, Encyclopedia
o Wikipedia, DBpedia
o Freebase/Google Knowledge Graph
o YAGO
o Probase
o LibraryThing

Yahoo! Webscope Datasets
o Language Data
o Graph and Social Data
o Ratings and Classification Data
o Advertising and Market Data
o Competition Data
o Computing Systems Data
o Image Data

Stanford Large Network Dataset Collection http://snap.stanford.edu/data/
o Social networks: online social networks, edges represent interactions between people
o Networks with ground-truth communities: ground-truth network communities in social and information networks
o Communication networks: email communication networks with edges representing communication
o Citation networks: nodes represent papers, edges represent citations
o Collaboration networks: nodes represent scientists, edges represent collaborations (co-authoring a paper)
o Web graphs: nodes represent webpages and edges are hyperlinks
o Amazon networks: nodes represent products and edges link commonly co-purchased products
o Internet networks: nodes represent computers and edges represent communication
o Road networks: nodes represent intersections and edges represent roads connecting the intersections

Time Series Data Library http://robjhyndman.com/tsdl/

KDnuggets Dataset List http://www.kdnuggets.com/datasets/index.html

KDD Cup Datasets http://www.sigkdd.org/kddcup/index.php

Data in Every Application Area
o Business: e-commerce, transactions (retailers, banking, credit cards), ratings, reviews, stock trading
o Web, social media (YouTube, Flickr, ...), and social networks (Facebook, Twitter, ...)
o News
o Science: bioinformatics, scientific experiments, environment, climate, astronomy
o Logs and measurements
o Personal information: emails, calendars, digital photos, videos
o Transportation
o Telecommunication
o Education
o Entertainment (film, music, gaming, ...)
o Sports
o Health care
o Crime, security

What is Data Mining?
Data mining (knowledge discovery from data)
o Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data

What is not Data Mining?
Retrieve data instead of knowledge or pattern
Not interesting
  o trivial
  o explicit
  o known
  o useless

Example: What is not Data Mining?
What is not Data Mining?
  o Look up a phone number in a phone directory
  o Query a Web search engine for information about "Amazon"
What is Data Mining?
  o Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly in the Boston area)
  o Group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)

Knowledge Discovery (KDD) Process
This is a view from typical database systems and data warehousing communities
Data mining plays an essential role in the knowledge discovery process
[Figure: Databases -> Data Cleaning, Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

Data Mining in Business Intelligence
[Figure: pyramid of layers, with increasing potential to support business decisions toward the top]
Layers, top to bottom:
  o Decision Making
  o Data Presentation: Visualization Techniques
  o Data Mining: Information Discovery
  o Data Exploration: Statistical Summary, Querying, and Reporting
  o Data Preprocessing/Integration, Data Warehouses
  o Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA

KDD Process: A Typical View from ML and Statistics
This is a view from typical machine learning and statistics communities
Input Data -> Data Pre-Processing -> Data Mining -> Post-Processing
Data Pre-Processing
  o Data integration
  o Normalization
  o Feature selection
  o Dimension reduction
Data Mining
  o Pattern discovery
  o Association & correlation
  o Classification
  o Clustering
  o Outlier analysis
Post-Processing
  o Pattern evaluation
  o Pattern selection
  o Pattern interpretation
  o Pattern visualization

Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Machine Learning, Pattern Recognition, Statistics, Applications, Visualization, Algorithm, Database Technology, High-Performance Computing]

Data Mining Software
Free, open-source
  o RapidMiner
  o Weka: data mining tool in Java
  o SCaVis: scientific computation and visualization, Java
  o Orange: Python suite
  o Scikit-learn: Python machine learning library
  o NumPy/SciPy/IPython/mlpy (Python modules for scientific computing, scientific library, interactive computing, machine learning)
  o R: statistical computing and graphics
  o RattleGUI: data mining GUI using R
  o Octave: numerical analysis
  o Shogun: machine learning toolkit in C++
Text Mining Tools
  o NLTK (NLP Toolkit): NLP suite for Python
  o SenticNet API: sentiment analysis
  o Stanford NLP software
  o UIMA
Large-Scale Data Processing, Machine Learning
  o Apache Mahout
  o GraphLab
  o MapReduce/Hadoop
  o Spark
  o Pregel/Giraph
Commercial
  o Matlab
  o Oracle Data Mining
  o SAS
  o IBM SPSS
  o Microsoft SQL Server Analysis Services
  o HP Vertica

Data Mining Tasks
Prediction Methods
Description Methods
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996

Data Mining Tasks...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation/Anomaly Detection [Predictive]

Classification: Definition
Given a collection of records (training set)
  o Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
  o Accuracy is assessed on a test set.

Classification Example

Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set -> Learn Classifier -> Model; the Model is then applied to the Test Set.
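The slide's "Learn Classifier -> Model" step can be sketched in a few lines. The slide does not say which classifier is used, so the example below assumes a small decision tree grown by greedily minimizing Gini impurity; that choice, and all function names, are illustrative assumptions. Taxable income is entered in thousands.

```python
from collections import Counter

# Training set from the slide: (refund, marital_status, income_in_K, cheat).
TRAIN = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]
# Test set from the slide; the class ("Cheat") is unknown.
TEST = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]
INCOME = 2  # index of the only numeric attribute


def gini(rows):
    """Gini impurity of the class labels (last field) in rows."""
    n = len(rows)
    return 1.0 - sum((c / n) ** 2 for c in Counter(r[-1] for r in rows).values())


def candidate_tests(rows):
    """Equality tests for categorical columns, midpoint thresholds for income."""
    tests = []
    for col in (0, 1):
        tests += [(col, v) for v in sorted({r[col] for r in rows})]
    incomes = sorted({r[INCOME] for r in rows})
    tests += [(INCOME, (a + b) / 2) for a, b in zip(incomes, incomes[1:])]
    return tests


def passes(row, test):
    col, value = test
    return row[col] <= value if col == INCOME else row[col] == value


def build_tree(rows):
    """Return a class label (leaf) or a (test, yes_subtree, no_subtree) tuple."""
    labels = Counter(r[-1] for r in rows)
    if len(labels) == 1:
        return next(iter(labels))
    best = None
    for test in candidate_tests(rows):
        yes = [r for r in rows if passes(r, test)]
        no = [r for r in rows if not passes(r, test)]
        if not yes or not no:
            continue  # this test does not split the rows
        score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(rows)
        if best is None or score < best[0]:
            best = (score, test, yes, no)
    if best is None:
        return labels.most_common(1)[0][0]
    _, test, yes, no = best
    return (test, build_tree(yes), build_tree(no))


def predict(tree, row):
    while isinstance(tree, tuple):
        test, yes_branch, no_branch = tree
        tree = yes_branch if passes(row, test) else no_branch
    return tree


tree = build_tree(TRAIN)
train_accuracy = sum(predict(tree, r) == r[-1] for r in TRAIN) / len(TRAIN)
predictions = [predict(tree, r) for r in TEST]
print(train_accuracy, predictions)
```

A tree grown until every leaf is pure fits these ten training records exactly, which says nothing about how well it classifies unseen records; that is why accuracy has to be judged on records held out from training.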

Classification: Application 1
Direct Marketing
  o targeting
  o {buy, don't buy} class attribute
From [Berry & Linoff] Data Mining Techniques, 1997

Classification: Application 2 Fraud Detection

Classification: Application 3
Customer Attrition/Churn:
From [Berry & Linoff] Data Mining Techniques, 1997

Classification: Application 4
Sky Survey Cataloging
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996

Classifying Galaxies
Courtesy: http://aps.umn.edu
Class: Stages of Formation (Early, Intermediate, Late)
Attributes: Image features, Characteristics of light waves received, etc.
Data Size: 72 million stars, 20 million galaxies
  o Object Catalog: 9 GB
  o Image Database: 150 GB

Clustering Definition Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that Similarity Measures:

Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.
  o Intracluster distances are minimized
  o Intercluster distances are maximized
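A minimal sketch of Euclidean-distance-based clustering in 3-D follows. The slide does not name an algorithm, so k-means (Lloyd's algorithm) is assumed here as one common choice; the six points and the initial centroids are invented for illustration.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm with given initial centroids; returns (labels, centroids)."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


# Hypothetical 3-D points forming two well-separated groups.
POINTS = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (10, 10, 10), (11, 10, 10), (10, 11, 10)]
labels, centroids = kmeans(POINTS, centroids=[POINTS[0], POINTS[-1]])
print(labels)
print(centroids)
```

k-means directly minimizes only the within-cluster squared distances; on well-separated data such as this, small intracluster distances and large intercluster distances go together, but the algorithm does not optimize the latter explicitly.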

Clustering: Application 1 Market Segmentation:

Clustering: Application 2 Document Clustering:

Illustrating Document Clustering
Clustering Points: 3204 Articles of Los Angeles Times.
Similarity Measure: How many words are common in these documents (after some word filtering).

Category       Total Articles  Correctly Placed
Financial      555             364
Foreign        341             260
National       273             36
Metro          943             746
Sports         738             573
Entertainment  354             278
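The table's counts can be turned into placement rates with a few lines of arithmetic. The sketch below only uses the numbers from the slide; "placement rate" is a label chosen here for correctly placed divided by total.

```python
# (category, total articles, correctly placed), copied from the slide.
RESULTS = [
    ("Financial", 555, 364),
    ("Foreign", 341, 260),
    ("National", 273, 36),
    ("Metro", 943, 746),
    ("Sports", 738, 573),
    ("Entertainment", 354, 278),
]

total = sum(t for _, t, _ in RESULTS)
correct = sum(c for _, _, c in RESULTS)

# Per-category placement rate, then the overall rate.
for name, t, c in RESULTS:
    print(f"{name:<14}{c / t:6.1%}")
print(f"{'Overall':<14}{correct / total:6.1%}")
```

The totals add up to the 3204 articles stated on the slide, and roughly 70% of them are placed correctly overall. The National category stands out: only 36 of its 273 articles land in the right cluster.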

Clustering of S&P 500 Stock Data
Observe Stock Movements every day.
Clustering points: Stock-{UP/DOWN}
Similarity Measure: Two points are more similar if the events described by them frequently happen together on the same day.
We used association rules to quantify a similarity measure.

Discovered Clusters (Industry Group):
1. Technology1-DOWN: Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
2. Technology2-DOWN: Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
3. Financial-DOWN: Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
4. Oil-UP: Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP

Association Rule Discovery: Definition
Given a set of records each of which contains some number of items from a given collection;

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
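The two rules on the slide can be checked against the five transactions. The slide does not define how rules are scored, so the sketch below assumes the two standard measures: support (fraction of transactions containing all items of the rule) and confidence (fraction of transactions containing the antecedent that also contain the consequent). Function names are illustrative.

```python
# Transactions from the slide, keyed by TID.
TRANSACTIONS = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}


def support_count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for items in TRANSACTIONS.values() if itemset <= items)


def rule_stats(antecedent, consequent):
    """Return (support, confidence) of the rule antecedent --> consequent."""
    both = support_count(antecedent | consequent)
    support = both / len(TRANSACTIONS)
    confidence = both / support_count(antecedent)
    return support, confidence


milk_coke = rule_stats({"Milk"}, {"Coke"})
diaper_milk_beer = rule_stats({"Diaper", "Milk"}, {"Beer"})
print("{Milk} --> {Coke}:", milk_coke)
print("{Diaper, Milk} --> {Beer}:", diaper_milk_beer)
```

Under these definitions, {Milk} --> {Coke} holds in 3 of the 4 transactions that contain Milk, and {Diaper, Milk} --> {Beer} holds in 2 of the 3 transactions that contain both Diaper and Milk.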

Association Rule Discovery: Application 1
Marketing and Sales Promotion:
  o Let the rule discovered be {Bagels, ...} --> {Potato Chips}
  o Potato Chips as consequent
  o Bagels in the antecedent
  o Bagels in antecedent and Potato Chips in consequent =>

Association Rule Discovery: Application 2 Supermarket shelf management.

Association Rule Discovery: Application 3 Inventory Management:

Deviation/Anomaly Detection
Detect significant deviations from normal behavior
Applications:
Typical network traffic at University level may reach over 100 million connections per day
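One simple way to make "significant deviation from normal behavior" concrete is a z-score rule: flag any value that lies more than a chosen number of standard deviations from the mean. The slide does not prescribe a method, so this rule, the threshold of 2, and the daily connection counts below (in millions) are all assumptions made for illustration.

```python
import statistics


def flag_deviations(values, threshold=2.0):
    """Return (index, value) pairs whose z-score magnitude exceeds threshold."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical: nothing deviates
    return [
        (i, v) for i, v in enumerate(values) if abs(v - mean) / spread > threshold
    ]


# Hypothetical daily connection counts, in millions.
DAILY_CONNECTIONS = [100, 102, 98, 101, 99, 100, 250]
flagged = flag_deviations(DAILY_CONNECTIONS)
print(flagged)
```

Note that the extreme value itself inflates both the mean and the standard deviation, so with few observations a stricter threshold can miss it; median-based measures are a common, more robust alternative.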

Challenges of Data Mining
Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data