CSE4334/5334 Data Mining Lecture 2: Introduction to Data Mining Chengkai Li University of Texas at Arlington Spring 2016
Big Data http://dilbert.com/strip/2012-07-29
Big Data http://www.ibmbigdatahub.com/infographic/four-vs-big-data
Big Data The 4 Vs o Volume o Variety o Velocity o Veracity
Volume: How much data is out there? http://www.sciencedaily.com/releases/2013/05/130522085217.htm http://www.storagenewsletter.com/rubriques/marketreportsresearch/ibm-cmo-study/
Variety: Types of Data Structured Data o (relational) database tables o CSV/TSV files Semi-structured Data o XML o JSON o RDF Unstructured Data o text data (documents, Web pages, short texts (e.g., social media)) Multimedia Data (images, videos, audios) Other types of data o matrices, graphs, sequences, time-series, spatio-temporal
Velocity: Streaming Data Stock Trades Highway Sensors Weather Data Social Media Telephone Calls Video Streaming
http://mashable.com/2012/06/22/data-created-every-minute/
Datasets Amazon Public Data Sets Data.gov Linked Open Data Knowledge Bases, Encyclopedia Yahoo! Webscope Network/Graph Datasets UCI Machine Learning Repository UCR Time Series Classification/Clustering Time Series Data Library KDnuggets Dataset List KDD Cup Datasets
Amazon Public Data Sets http://aws.amazon.com/public-data-sets/
o NASA NEX: A collection of Earth science data sets maintained by NASA, including climate change projections and satellite images of the Earth's surface
o Common Crawl Corpus: A corpus of web crawl data composed of over 5 billion web pages
o 1000 Genomes Project: A detailed map of human genetic variation
o Google Books Ngrams: A data set containing Google Books n-gram corpuses
o US Census Data: US demographic data from 1980, 1990, and 2000 US Censuses
o Freebase Data Dump: A data dump of all the current facts and assertions in the Freebase system, an open database covering millions of topics
Data.gov http://www.data.gov/ (137,608 datasets) o Consumer Complaint Database o U.S. International Trade in Goods and Services: Monthly report that provides national trade data including imports, exports, and balance of payments for goods and services. o DTV Reception Maps o Climate Data Online o Food Access Research Atlas presents a spatial overview of food access indicators for low-income and other census tracts using different measures of supermarket... o U.S. Hourly Precipitation Data o Great Chile Earthquake of May 22, 1960 o Consumer Expenditure Survey o Campus Security Data o Farmers Markets Geographic Data: longitude and latitude, state, address, name, and zip code of Farmers Markets in the United States o Crimes - 2001 to present (City of Chicago)
Linked Data http://linkeddata.org/ (hundreds of datasets, billions of RDF triples)
Knowledge Bases, Encyclopedia o Wikipedia, DBpedia o Freebase/Google Knowledge Graph o YAGO o Probase o LibraryThing
Yahoo! Webscope Datasets o Language Data o Graph and Social Data o Ratings and Classification Data o Advertising and Market Data o Competition Data o Computing Systems Data o Image Data
Stanford Large Network Dataset Collection http://snap.stanford.edu/data/
o Social networks: online social networks, edges represent interactions between people
o Networks with ground-truth communities: ground-truth network communities in social and information networks
o Communication networks: email communication networks with edges representing communication
o Citation networks: nodes represent papers, edges represent citations
o Collaboration networks: nodes represent scientists, edges represent collaborations (co-authoring a paper)
o Web graphs: nodes represent webpages and edges are hyperlinks
o Amazon networks: nodes represent products and edges link commonly copurchased products
o Internet networks: nodes represent computers and edges communication
o Road networks: nodes represent intersections and edges roads connecting the intersections
Time Series Data Library http://robjhyndman.com/tsdl/
KDnuggets Dataset List http://www.kdnuggets.com/datasets/index.html
KDD Cup Datasets http://www.sigkdd.org/kddcup/index.php
Data in Every Application Area
o Business: e-commerce, transactions (retailers, banking, credit cards), ratings, reviews, stock trading, Web, social media (YouTube, Flickr, ...), and social networks (Facebook, Twitter, ...)
o News
o Science: bioinformatics, scientific experiments, environment, climate, astronomy
o Logs and measurements
o Personal information: emails, calendars, digital photos, videos
o Transportation
o Telecommunication
o Education
o Entertainment (film, music, gaming, ...)
o Sports
o Health care
o Crime, security
What is Data Mining? Data mining (knowledge discovery from data) o Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
What is not Data Mining? Retrieve data instead of knowledge or pattern Not interesting o trivial o explicit o known o useless
Example: What is not Data Mining?
What is not Data Mining?
o Look up a phone number in a phone directory
o Query a Web search engine for information about Amazon
What is Data Mining?
o Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly in the Boston area)
o Group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)
Knowledge Discovery (KDD) Process
o This is a view from typical database systems and data warehousing communities
o Data mining plays an essential role in the knowledge discovery process
[Figure: Databases -> Data Cleaning, Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
Data Mining in Business Intelligence
[Figure: layered pyramid; the potential to support business decisions increases toward the top]
o Decision Making
o Data Presentation: Visualization Techniques
o Data Mining: Information Discovery
o Data Exploration: Statistical Summary, Querying, and Reporting
o Data Preprocessing/Integration, Data Warehouses
o Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
KDD Process: A Typical View from ML and Statistics
Input Data -> Data Pre-Processing -> Data Mining -> Post-Processing
o Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
o Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis
o Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
This is a view from typical machine learning and statistics communities
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Machine Learning, Pattern Recognition, Statistics, Applications, Visualization, Algorithm, Database Technology, High-Performance Computing]
Data Mining Software
Free, open-source
o RapidMiner
o Weka: data mining tool in Java
o SCaVis: scientific computation and visualization, Java
o Orange: Python suite
o Scikit-learn: Python machine learning library
o NumPy/SciPy/IPython/mlpy (Python modules for scientific computing, scientific library, interactive computing, machine learning)
o R: statistical computing and graphics
o RattleGUI: data mining GUI using R
o Octave: numerical analysis
o Shogun: machine learning toolkit in C++
Text Mining Tools
o NLTK (NLP Toolkit): NLP suite for Python
o SenticNet API: sentiment analysis
o Stanford NLP software
o UIMA
Large-Scale Data Processing, Machine Learning
o Apache Mahout
o GraphLab
o MapReduce/Hadoop
o Spark
o Pregel/Giraph
Commercial
o Matlab
o Oracle Data Mining
o SAS
o IBM SPSS
o Microsoft SQL Server Analysis Services
o HP Vertica
Data Mining Tasks Prediction Methods Description Methods From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Data Mining Tasks... Classification [Predictive] Clustering [Descriptive] Association Rule Discovery [Descriptive] Sequential Pattern Discovery [Descriptive] Regression [Predictive] Deviation/Anomaly Detection [Predictive]
Classification: Definition
o Given a collection of records (training set)
  o Each record contains a set of attributes; one of the attributes is the class.
o Find a model for the class attribute as a function of the values of the other attributes.
o Goal: previously unseen records should be assigned a class as accurately as possible.
  o A test set is used to estimate the accuracy of the model.
Classification Example

Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set -> Learn Classifier -> Model; the Model is then applied to the Test Set.
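The "Learn Classifier -> Model" step can be sketched in plain Python. This is only an illustration under stated assumptions, not the method used in the lecture: it grows a small binary decision tree by minimizing weighted Gini impurity (a standard technique chosen here for brevity), and the names `build`, `predict`, and `passes` are hypothetical. Incomes are written in thousands.

```python
from collections import Counter

# Training set from the slide: (refund, marital status, income in K, cheat)
TRAIN = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
# Test set from the slide: the class ("Cheat") is unknown
TEST = [
    ("No", "Single", 75), ("Yes", "Married", 50), ("No", "Married", 150),
    ("Yes", "Divorced", 90), ("No", "Single", 40), ("No", "Married", 80),
]

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def passes(row, attr, value):
    """Numeric test is 'x <= value'; categorical test is 'x == value'."""
    x = row[attr]
    return x <= value if isinstance(x, (int, float)) else x == value

def build(rows):
    """Return a class label (leaf) or a tuple (attr, value, if_true, if_false)."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:
        return labels[0]
    best = None
    for attr in range(3):
        for value in sorted({r[attr] for r in rows}):
            left = [r for r in rows if passes(r, attr, value)]
            right = [r for r in rows if not passes(r, attr, value)]
            if not left or not right:
                continue  # this test does not split the node
            score = (len(left) * gini([r[-1] for r in left])
                     + len(right) * gini([r[-1] for r in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, attr, value, left, right)
    if best is None:  # identical records with different classes
        return Counter(labels).most_common(1)[0][0]
    _, attr, value, left, right = best
    return (attr, value, build(left), build(right))

def predict(tree, row):
    """Walk from the root to a leaf and return its class label."""
    while isinstance(tree, tuple):
        attr, value, if_true, if_false = tree
        tree = if_true if passes(row, attr, value) else if_false
    return tree

tree = build(TRAIN)
predictions = [predict(tree, row) for row in TEST]
print(predictions)
```

Because the tree is grown until every leaf is pure, it reproduces the training labels exactly. That says nothing about accuracy on unseen records, which is why a separate test set is needed.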
Classification: Application 1 Direct Marketing targeting {buy, don't buy} class attribute From [Berry & Linoff] Data Mining Techniques, 1997
Classification: Application 2 Fraud Detection
Classification: Application 3 Customer Attrition/Churn: From [Berry & Linoff] Data Mining Techniques, 1997
Classification: Application 4 Sky Survey Cataloging From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Classifying Galaxies (Courtesy: http://aps.umn.edu)
o Class: Stages of Formation (Early, Intermediate, Late)
o Attributes: Image features, Characteristics of light waves received, etc.
o Data Size: 72 million stars, 20 million galaxies; Object Catalog: 9 GB; Image Database: 150 GB
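Clustering Definition
o Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  o data points in one cluster are more similar to one another
  o data points in separate clusters are less similar to one another
o Similarity Measures: e.g., Euclidean distance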
Illustrating Clustering Euclidean Distance Based Clustering in 3-D space. Intracluster distances are minimized Intercluster distances are maximized
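Euclidean-distance-based clustering in 3-D can be sketched with k-means. The slide does not name an algorithm, so k-means is an assumption here, and the six points and the function name `kmeans` are hypothetical illustration data, not from the lecture.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of the points assigned to it."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids

# Hypothetical 3-D points forming two visibly separate groups
group_a = [(1, 1, 1), (1, 2, 1), (2, 1, 2)]
group_b = [(8, 8, 8), (9, 8, 7), (8, 9, 9)]
points = group_a + group_b

# Deliberately poor start: both initial centroids lie in the same group
clusters, centroids = kmeans(points, centroids=points[:2])
for cluster, centroid in zip(clusters, centroids):
    print(cluster, "centroid:", centroid)
```

Each iteration cannot increase the total squared distance of points to their assigned centroid, which is one way of making intracluster distances small. The result depends on the starting centroids in general, though on this toy data the two groups are recovered.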
Clustering: Application 1 Market Segmentation:
Clustering: Application 2 Document Clustering:
Illustrating Document Clustering
o Clustering Points: 3204 Articles of Los Angeles Times.
o Similarity Measure: How many words are common in these documents (after some word filtering).

Category       Total Articles  Correctly Placed
Financial      555             364
Foreign        341             260
National       273             36
Metro          943             746
Sports         738             573
Entertainment  354             278
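The similarity measure above (number of shared words after filtering) might look like the following sketch. The stop-word list, the three example sentences, and the function names are hypothetical; the actual filtering used for the Los Angeles Times articles is not described on the slide.

```python
# Hypothetical filter list; the slide does not say which words were removed
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}

def content_words(text):
    """Lowercase the text, strip surrounding punctuation, drop stop words."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return {w for w in words if w and w not in STOP_WORDS}

def common_word_count(doc_a, doc_b):
    """Similarity = number of distinct words the two documents share."""
    return len(content_words(doc_a) & content_words(doc_b))

sports_1 = "The Lakers won the game in overtime."
sports_2 = "Overtime win sends the Lakers to the finals."
finance = "Stocks fell as the Fed raised rates."

print(common_word_count(sports_1, sports_2))  # shares "lakers" and "overtime"
print(common_word_count(sports_1, finance))   # shares no content words
```

A clustering algorithm would then group documents with high pairwise counts. In practice the counts are often normalized by document length, which this sketch does not do.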
Clustering of S&P 500 Stock Data
o Observe Stock Movements every day.
o Clustering points: Stock-{UP/DOWN}
o Similarity Measure: Two points are more similar if the events described by them frequently happen together on the same day.
o We used association rules to quantify a similarity measure.

Discovered Clusters (cluster number, industry group, members)
1  Technology1-DOWN: Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
2  Technology2-DOWN: Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
3  Financial-DOWN: Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
4  Oil-UP: Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
Association Rule Discovery: Definition
o Given a set of records, each of which contains some number of items from a given collection, find rules that predict the occurrence of an item based on the occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
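One common way to judge such rules is by their support and confidence. These two measures are not defined on this slide, so treat them as an assumption of the sketch below; the function names are hypothetical. The transactions are the five records listed above.

```python
# The five transactions from the slide, keyed by position (TID 1..5)
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset):
    """Number of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in TRANSACTIONS)

def support(antecedent, consequent):
    """Fraction of all transactions containing both sides of the rule."""
    return count(antecedent | consequent) / len(TRANSACTIONS)

def confidence(antecedent, consequent):
    """Among transactions with the antecedent, the fraction that also
    contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)

RULES = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in RULES:
    print(sorted(lhs), "-->", sorted(rhs),
          f"support={support(lhs, rhs):.2f}",
          f"confidence={confidence(lhs, rhs):.2f}")
```

For {Milk} --> {Coke}, Milk appears in 4 transactions and Milk with Coke in 3, giving support 3/5 and confidence 3/4. For {Diaper, Milk} --> {Beer}, the antecedent appears in 3 transactions and all three items in 2, giving support 2/5 and confidence 2/3.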
Association Rule Discovery: Application 1
Marketing and Sales Promotion:
o Let the rule discovered be {Bagels, ...} --> {Potato Chips}
o Potato Chips as consequent
o Bagels in the antecedent
o Bagels in antecedent and Potato chips in consequent =>
Association Rule Discovery: Application 2 Supermarket shelf management.
Association Rule Discovery: Application 3 Inventory Management:
Deviation/Anomaly Detection Detect significant deviations from normal behavior Applications: Typical network traffic at University level may reach over 100 million connections per day
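A very simple way to flag "significant deviations from normal behavior" is a z-score test. The slide does not prescribe a method, so this is an assumed technique, and the daily connection counts below (in millions) are made-up illustration values, not measurements.

```python
import statistics

def zscore_anomalies(values, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold`
    population standard deviations away from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical: nothing deviates
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mean) / stdev > threshold]

# Hypothetical daily connection counts, in millions
daily_connections_millions = [98, 101, 99, 102, 100, 97, 103, 100, 250]

anomalies = zscore_anomalies(daily_connections_millions)
print(anomalies)  # only the last day (250) exceeds the threshold
```

Note that the outlier itself inflates the mean and standard deviation, so this test can miss anomalies when several occur together. Robust statistics such as the median are a common alternative.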
Challenges of Data Mining Scalability Dimensionality Complex and Heterogeneous Data Data Quality Data Ownership and Distribution Privacy Preservation Streaming Data