Lecture 3: Data Mining.
|
|
- Christine Floyd
- 7 years ago
- Views:
Transcription
1 Lecture 3: Data Mining 1 1
2 Roadmap What is data mining? Data Mining Tasks Classification/Decision Tree Clustering Association Mining Data Mining Algorithms Decision Tree Construction Frequent 2-itemsets Frequent Itemsets (Apriori) Clustering/Collaborative Filtering 2 2
3 What is Data Mining? Discovery of useful, possibly unexpected, patterns in data. Subsidiary issues: Data cleansing: detection of bogus data. E.g., age = 150. Visualization: something better than megabyte files of output. Warehousing of data (for retrieval). 3 3
4 Typical Kinds of Patterns 1. Decision trees: succinct ways to classify by testing properties. 2. Clusters: another succinct classification by similarity of properties. 3. Bayes, hidden-markov, and other statistical models, frequent-itemsets: expose important associations within data. 4 4
5 Example: Clusters x x x x x x x x x x x x x x x x x xx x x x x x x x x x x x x x x x x x x x x x x 5 5
6 Applications (Among Many) Intelligence-gathering. Total Information Awareness. Web Analysis. PageRank. Marketing. Run a sale on diapers; raise the price of beer. Detective? 6 6
7 Cultures Databases: concentrate on large-scale (non-mainmemory) data. AI (machine-learning): concentrate on complex methods, small data. Statistics: concentrate on inferring models. 7 7
8 Models vs. Analytic Processing To a database person, data-mining is a powerful form of analytic processing --- queries that examine large amounts of data. Result is the data that answers the query. To a statistician, data-mining is the inference of models. Result is the parameters of the model. 8 8
9 Meaningfulness of Answers A big risk when data mining is that you will discover patterns that are meaningless. Statisticians call it Bonferroni s principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap. 9 9
10 Examples A big objection to TIA was that it was looking for so many vague connections that it was sure to find things that were bogus and thus violate innocents privacy. The Rhine Paradox: a great example of how not to conduct scientific research
11 Rhine Paradox --- (1) David Rhine was a parapsychologist in the 1950 s who hypothesized that some people had Extra-Sensory Perception. He devised an experiment where subjects were asked to guess 10 hidden cards --- red or blue. He discovered that almost 1 in 1000 had ESP --- they were able to get all 10 right! 11 11
12 Rhine Paradox --- (2) He told these people they had ESP and called them in for another test of the same type. Alas, he discovered that almost all of them had lost their ESP. What did he conclude? Answer on next slide
13 Rhine Paradox --- (3) He concluded that you shouldn t tell people they have ESP; it causes them to lose it
14 A Concrete Example This example illustrates a problem with intelligencegathering. Suppose we believe that certain groups of evil-doers are meeting occasionally in hotels to plot doing evil. We want to find people who at least twice have stayed at the same hotel on the same day
15 10 9 people being tracked days. The Details Each person stays in a hotel 1% of the time (10 days out of 1000). Hotels hold 100 people (so 10 5 hotels). If everyone behaves randomly (I.e., no evil-doers) will the data mining detect anything suspicious? 15 15
16 Calculations --- (1) Probability that persons p and q will be at the same hotel on day d : 1/100 * 1/100 * 10-5 = Probability that p and q will be at the same hotel on two given days: 10-9 * 10-9 = Pairs of days: 5*
17 Calculations --- (2) Probability that p and q will be at the same hotel on some two days: 5*10 5 * = 5* Pairs of people: 5* Expected number of suspicious pairs of people: 5*10 17 * 5*10-13 = 250,
18 Conclusion Suppose there are (say) 10 pairs of evil-doers who definitely stayed at the same hotel twice. Analysts have to sift through 250,010 candidates to find the 10 real cases. Not gonna happen. But how can we improve the scheme? 18 18
19 Data Mining Tasks Data mining is the process of semi-automatically analyzing large databases to find useful patterns Prediction based on past history Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age,..) and past history Predict if a pattern of phone calling card usage is likely to be fraudulent Some examples of prediction mechanisms: Classification Given a new item whose class is unknown, predict to which class it belongs Regression formulae Given a set of mappings for an unknown function, predict the function result for a new parameter value 19 19
20 Data Mining (Cont.) Descriptive Patterns Associations Find books that are often bought by similar customers. If a new such customer buys one such book, suggest the others too. Associations may be used as a first step in detecting causation E.g. association between exposure to chemical X and cancer, Clusters E.g. typhoid cases were clustered in an area surrounding a contaminated well Detection of clusters remains important in detecting epidemics 20 20
21 Decision Trees Example: Conducted survey to see what customers were interested in new model car Want to select customers for advertising campaign sale custid car age city newcar c1 taurus 27 sf yes c2 van 35 la yes c3 van 40 sf yes c4 taurus 22 sf yes c5 merc 50 la no c6 taurus 25 la no training set 21 21
22 One Possibility Y age<30 N sale custid car age city newcar c1 taurus 27 sf yes c2 van 35 la yes c3 van 40 sf yes c4 taurus 22 sf yes c5 merc 50 la no c6 taurus 25 la no city=sf Y N likely unlikely car=van Y N likely unlikely 22 22
23 Another Possibility Y car=taurus N sale custid car age city newcar c1 taurus 27 sf yes c2 van 35 la yes c3 van 40 sf yes c4 taurus 22 sf yes c5 merc 50 la no c6 taurus 25 la no city=sf Y N likely unlikely age<45 Y N likely unlikely 23 23
24 Issues Decision tree cannot be too deep would not have statistically significant amounts of data for lower decisions Need to select tree that most reliably predicts outcomes 24 24
25 Clustering income education age 25 25
26 Another Example: Text Each document is a vector e.g., < > contains words 1,4,5,... Clusters contain similar documents Useful for understanding, searching documents international news business sports 26 26
27 Issues Given desired number of clusters? Finding best clusters Are clusters semantically meaningful? 27 27
28 Association Rule Mining sales records: transaction id customer id tran1 cust33 p2, p5, p8 tran2 cust45 p5, p8, p11 tran3 cust12 p1, p9 tran4 cust40 p5, p8, p11 tran5 cust12 p2, p9 tran6 cust12 p9 products bought market-basket data Trend: Products p5, p8 often bough together Trend: Customer 12 likes product p
29 Association Rule Rule: {p 1, p 3, p 8 } Support: number of baskets where these products appear High-support set: support threshold s Problem: find all high support sets 29 29
30 Finding High-Support Pairs Baskets(basket, item) SELECT I.item, J.item, COUNT(I.basket) FROM Baskets I, Baskets J WHERE I.basket = J.basket AND I.item < J.item GROUP BY I.item, J.item HAVING COUNT(I.basket) >= s; 30 30
31 Example basket item t1 p2 t1 p5 t1 p8 t2 p5 t2 p8 t2 p basket item1 item2 t1 p2 p5 t1 p2 p8 t1 p5 p8 t2 p5 p8 t2 p5 p11 t2 p8 p check if count s 31 31
32 Issues Performance for size 2 rules big basket item t1 p2 t1 p5 t1 p8 t2 p5 t2 p8 t2 p basket item1 item2 t1 p2 p5 t1 p2 p8 t1 p5 p8 t2 p5 p8 t2 p5 p11 t2 p8 p even bigger! Performance for size k rules 32 32
33 Roadmap What is data mining? Data Mining Tasks Classification/Decision Tree Clustering Association Mining Data Mining Algorithms Decision Tree Construction Frequent 2-itemsets Frequent Itemsets (Apriori) Clustering/Collaborative Filtering 33 33
34 Classification Rules Classification rules help assign new objects to classes. E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk? Classification rules for above example could use a variety of data, such as educational level, salary, age, etc. person P, P.degree = masters and P.income > 75,000 P.credit = excellent person P, P.degree = bachelors and (P.income 25,000 and P.income 75,000) P.credit = good Rules are not necessarily exact: there may be some misclassifications Classification rules can be shown compactly as a decision tree
35 Decision Tree 35 35
36 Decision Tree Construction NAME Balance Employed Age Default Matt 23,000 No 30 yes Ben 51,100 No 48 yes Chris 68,000 Yes 55 no Jim 74,000 No 40 no David 23,000 No 47 yes John 100,000 Yes 49 no Employed Yes Class=Not Default <50K No Balance Root >=50K Node Leaf Class=Yes Default <45 Age >=45 Class=Not Default Class=Yes Default 36 36
37 Construction of Decision Trees Training set: a data sample in which the classification is already known. Greedy top down generation of decision trees. Each internal node of the tree partitions the data into groups based on a partitioning attribute, and a partitioning condition for the node Leaf node: all (or most) of the items at the node belong to the same class, or all attributes have been considered, and no further partitioning is possible
38 Finding the Best Split Point for Numerical Attributes Complete Class Histograms Counts class-2 class-1 The data comes from a IBM Quest synthetic dataset for function 0 Age Gain Function Best Split Point gain gain function In-core algorithms, such as C4.5, will just online sort the numerical attributes! Age 38 38
39 Best Splits Pick best attributes and conditions on which to partition The purity of a set S of training instances can be measured quantitatively in several ways. Notation: number of classes = k, number of instances = S, fraction of instances in class i = p i. The Gini measure of purity is defined as [ k Gini (S) = 1 - i- - 1 p 2 i When all instances are in a single class, the Gini value is 0 It reaches its maximum (of 1 1 /k) if each class the same number of instances
40 Best Splits (Cont.) Another measure of purity is the entropy measure, which is defined as entropy (S) = p i- 1 ilog 2 p i k i- When a set S is split into multiple sets Si, I=1, 2,, r, we can measure the purity of the resultant set of sets as: purity(s 1, S 2,.., S r ) = r i= = 1 S i S purity (S i ) The information gain due to particular split of S into S i, i = 1, 2,., r Information-gain (S, {S 1, S 2,., S r ) = purity(s ) purity (S 1, S 2, S r ) 40 40
41 Finding Best Splits Categorical attributes (with no meaningful order): Multi-way split, one child for each value Binary split: try all possible breakup of values into two sets, and pick the best Continuous-valued attributes (can be sorted in a meaningful order) Binary split: Sort values, try each as a split point E.g. if values are 1, 10, 15, 25, split at 1, 10, 15 Pick the value that gives best split Multi-way split: A series of binary splits on the same attribute has roughly equivalent effect 41 41
42 Decision-Tree Construction Algorithm Procedure GrowTree (S ) Partition (S ); Procedure Partition (S) if ( purity (S ) > δ p or S < δ s ) then return; for each attribute A evaluate splits on attribute A; Use best split found (across all attributes) to partition S into S 1, S 2,., S r, for i = 1, 2,.., r Partition (S i ); 42 42
43 Finding Association Rules We are generally only interested in association rules with reasonably high support (e.g. support of 2% or greater) Naïve algorithm 1. Consider all possible sets of relevant items. 2. For each set find its support (i.e. count how many transactions purchase all items in the set). Large itemsets: sets with sufficiently high support 43 43
44 Example: Association Rules How do we perform rule mining efficiently? Observation: If set X has support t, then each X subset must have at least support t For 2-sets: if we need support s for {i, j} then each i, j must appear in at least s baskets 44 44
45 Algorithm for 2-Sets (1) Find OK products those appearing in s or more baskets (2) Find high-support pairs using only OK products 45 45
46 Algorithm for 2-Sets INSERT INTO okbaskets(basket, item) SELECT basket, item FROM Baskets GROUP BY item HAVING COUNT(basket) >= s; Perform mining on okbaskets SELECT I.item, J.item, COUNT(I.basket) FROM okbaskets I, okbaskets J WHERE I.basket = J.basket AND I.item < J.item GROUP BY I.item, J.item HAVING COUNT(I.basket) >= s; 46 46
47 Counting Efficiently One way: threshold = 3 basket I.item J.item t1 p5 p8 t2 p5 p8 t2 p8 p11 t3 p2 p3 t3 p5 p8 t3 p2 p sort basket I.item J.item t3 p2 p3 t3 p2 p8 t1 p5 p8 t2 p5 p8 t3 p5 p8 t2 p8 p count & remove count I.item J.item 3 p5 p8 5 p12 p
48 Counting Efficiently Another way: threshold = 3 basket I.item J.item t1 p5 p8 t2 p5 p8 t2 p8 p11 t3 p2 p3 t3 p5 p8 t3 p2 p scan & count count I.item J.item 1 p2 p3 2 p2 p8 3 p5 p8 5 p12 p18 1 p21 p22 2 p21 p remove count I.item J.item 3 p5 p8 5 p12 p keep counter array in memory 48 48
49 Yet Another Way basket I.item J.item t1 p5 p8 t2 p5 p8 t2 p8 p11 t3 p2 p3 t3 p5 p8 t3 p2 p (2) scan & remove false positive (1) scan & hash & count count bucket 1 A 5 B 2 C 1 D 8 E 1 F basket I.item J.item t1 p5 p8 t2 p5 p8 t2 p8 p11 t3 p5 p8 t5 p12 p18 t8 p12 p in-memory hash table threshold = 3 in-memory counters count count I.item J.item 3 p5 p8 1 p8 p11 5 p12 p (3) scan& count I.item J.item 3 p5 p8 5 p12 p (4) remove 49
50 Discussion Hashing scheme: 2 (or 3) scans of data Sorting scheme: requires a sort! Hashing works well if few high-support pairs and many lowsupport ones frequency iceberg queries threshold item-pairs ranked by frequency 50 50
51 Review (1) What s classification problem? What s decision tree? How to build it? What frequent itemset mining problem? How to use SQL to express finding frequent pairs problem? What s the alternatives to find such pairs? 51 51
52 Review (2) Decision Tree Construction NAME Balance Employed Age Default Matt 23,000 No 30 yes Ben 51,100 No 48 yes Chris 68,000 Yes 55 no Jim 74,000 No 40 no David 23,000 No 47 yes John 100,000 Yes 49 no Recursively construct the decision tree Best test condition for a node Categorical attributes, best subset test Numerical attributes, best split point Partition the dataset Leaf Employed Yes Class=Not Default <50K No Class=Yes Default Balance <45 >=50K Age Root Node >=45 Class=Not Default Class=Yes Default 52 52
53 Review (3) Finding High-Support Pairs Baskets(basket, item) SELECT I.item, J.item, COUNT(I.basket) FROM Baskets I, Baskets J WHERE I.basket = J.basket AND I.item < J.item GROUP BY I.item, J.item HAVING COUNT(I.basket) >= s; 53 53
54 Frequent Itemsets Mining TID Transactions Desired frequency 50% (support level) { A, B, E } { B, D } { A, B, E } { A, C } { B, C } { A, C } { A, B } { A, B, C, E } { A, B, C } { A, C, E } {A},{B},{C},{A,B}, {A,C} Down-closure (apriori) property If an itemset is frequent, all of its subset must also be frequent 54 54
55 Lattice for Enumerating Frequent Itemsets null A B C D E AB AC AD AE BC BD BE CD CE DE Found to be Infrequent ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE ABCD ABCE ABDE ACDE BCDE Pruned supersets 55 ABCDE 55
56 Apriori L 0 = C 1 = { 1-item subsets of all the transactions } For ( k=1; C k 0; k++ ) { * support counting * } for all transactions t D for all k-subsets s of t if k C k s.count++ { * candidates generation * } L k = { c C k c.count>= min sup } C k+1 = apriori_gen( L k ) Answer = U k L k 56 56
57 Clustering Clustering: Intuitively, finding clusters of points in the given data such that similar points lie in the same cluster Can be formalized using distance metrics in several ways Group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized Centroid: point defined by taking average of coordinates in each dimension. Another metric: minimize average distance between every pair of points in a cluster Has been studied extensively in statistics, but on small data sets Data mining systems aim at clustering techniques that can handle very large data sets E.g. the Birch clustering algorithm! 57 57
58 K-Means Clustering 58 58
59 Hierarchical Clustering Example from biological classification (the word classification here does not mean a prediction mechanism) chordata mammalia reptilia leopards humans snakes crocodiles Other examples: Internet directory systems (e.g. Yahoo!) Agglomerative clustering algorithms Build small clusters, then cluster small clusters into bigger clusters, and so on Divisive clustering algorithms Start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones 59 59
60 Collaborative Filtering Goal: predict what movies/books/ a person may be interested in, on the basis of Past preferences of the person Other people with similar past preferences The preferences of such people for a new movie/book/ One approach based on repeated clustering Cluster people on the basis of preferences for movies Then cluster movies on the basis of being liked by the same clusters of people Again cluster people based on their preferences for (the newly created clusters of) movies Repeat above till equilibrium Above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest 60 60
61 Other Types of Mining Text mining: application of data mining to textual documents cluster Web pages to find related pages cluster pages a user has visited to organize their visit history classify Web pages automatically into a Web directory Data visualization systems help users examine large volumes of data and detect patterns visually Can visually encode large amounts of information on a single screen Humans are very good a detecting visual patterns 61 61
62 Data Streams What are Data Streams? Continuous streams Huge, Fast, and Changing Why Data Streams? The arriving speed of streams and the huge amount of data are beyond our capability to store them. Real-time processing Window Models Landscape window (Entire Data Stream) Sliding Window Damped Window Mining Data Stream 62 62
63 A Simple Problem Finding frequent items Given a sequence (x 1, x N ) where x i [1,m], and a real number θbetween zero and one. Looking for x i whose frequency > θ Naïve Algorithm (m counters) The number of frequent items 1/ θ Problem: N>>m>>1/θ P (Nθ) N 63 63
64 KRP algorithm Karp, et. al (TODS 03) m=12 N=30 Θ=0.35 1/θ = 3 N/ ( 1/θ ) Nθ 64 64
65 Enhance the Accuracy m=12 Θ=0.35 1/θ = 3 Deletion/Reducing = 9 N=30 NΘ>10 ε=0.5 Θ(1- ε )= /(θ ε) = 6 (ε*θ)n (Θ-(ε*Θ))N
66 Frequent Items for Transactions with Fixed Length Each transaction has 2 items Θ=0.60 1/θ 2=
Chapter 20: Data Analysis
Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification
More informationBig Data Analytics. Genoveva Vargas-Solar http://www.vargas-solar.com/big-data-analytics French Council of Scientific Research, LIG & LAFMIA Labs
1 Big Data Analytics Genoveva Vargas-Solar http://www.vargas-solar.com/big-data-analytics French Council of Scientific Research, LIG & LAFMIA Labs Montevideo, 22 nd November 4 th December, 2015 INFORMATIQUE
More informationData Mining Apriori Algorithm
10 Data Mining Apriori Algorithm Apriori principle Frequent itemsets generation Association rules generation Section 6 of course book TNM033: Introduction to Data Mining 1 Association Rule Mining (ARM)
More informationCOMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction
COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised
More informationUnique column combinations
Unique column combinations Arvid Heise Guest lecture in Data Profiling and Data Cleansing Prof. Dr. Felix Naumann Agenda 2 Introduction and problem statement Unique column combinations Exponential search
More informationData Mining: Partially from: Introduction to Data Mining by Tan, Steinbach, Kumar
Data Mining: Association Analysis Partially from: Introduction to Data Mining by Tan, Steinbach, Kumar Association Rule Mining Given a set of transactions, find rules that will predict the occurrence of
More informationClassification and Prediction
Classification and Prediction Slides for Data Mining: Concepts and Techniques Chapter 7 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser
More informationData Mining: Foundation, Techniques and Applications
Data Mining: Foundation, Techniques and Applications Lesson 1b :A Quick Overview of Data Mining Li Cuiping( 李 翠 平 ) School of Information Renmin University of China Anthony Tung( 鄧 锦 浩 ) School of Computing
More informationData Mining: An Overview. David Madigan http://www.stat.columbia.edu/~madigan
Data Mining: An Overview David Madigan http://www.stat.columbia.edu/~madigan Overview Brief Introduction to Data Mining Data Mining Algorithms Specific Eamples Algorithms: Disease Clusters Algorithms:
More informationData Mining for Knowledge Management. Classification
1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh
More informationGerry Hobbs, Department of Statistics, West Virginia University
Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit
More informationBenchmark Databases for Testing Big-Data Analytics In Cloud Environments
North Carolina State University Graduate Program in Operations Research Benchmark Databases for Testing Big-Data Analytics In Cloud Environments Rong Huang Rada Chirkova Yahya Fathi ICA CON 2012 April
More informationAssociation Analysis: Basic Concepts and Algorithms
6 Association Analysis: Basic Concepts and Algorithms Many business enterprises accumulate large quantities of data from their dayto-day operations. For example, huge amounts of customer purchase data
More informationChapter 22: Advanced Querying and Information Retrieval. Decision Support Systems
Chapter 22: Advanced Querying and Information Retrieval! Decision-Support Systems " Data Analysis # OLAP # Extended aggregation features in SQL Windowing and ranking " Data Mining " Data Warehousing! Information-Retrieval
More informationPart 2: Community Detection
Chapter 8: Graph Data Part 2: Community Detection Based on Leskovec, Rajaraman, Ullman 2014: Mining of Massive Datasets Big Data Management and Analytics Outline Community Detection - Social networks -
More informationData Mining Algorithms Part 1. Dejan Sarka
Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses
More informationOutline. What is Big data and where they come from? How we deal with Big data?
What is Big Data Outline What is Big data and where they come from? How we deal with Big data? Big Data Everywhere! As a human, we generate a lot of data during our everyday activity. When you buy something,
More information131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10
1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom
More informationThe Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
More informationData Mining Association Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 6. Introduction to Data Mining
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/24
More informationAnalytics on Big Data
Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis
More informationDecision Trees from large Databases: SLIQ
Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values
More informationIntroduction to Data Mining
Introduction to Data Mining 1 Why Data Mining? Explosive Growth of Data Data collection and data availability Automated data collection tools, Internet, smartphones, Major sources of abundant data Business:
More informationAssociation Rule Mining
Association Rule Mining Association Rules and Frequent Patterns Frequent Pattern Mining Algorithms Apriori FP-growth Correlation Analysis Constraint-based Mining Using Frequent Patterns for Classification
More informationLearning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal
Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether
More informationLecture Notes on Database Normalization
Lecture Notes on Database Normalization Chengkai Li Department of Computer Science and Engineering The University of Texas at Arlington April 15, 2012 I decided to write this document, because many students
More informationSocial Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
More informationMachine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
More informationProfessor Anita Wasilewska. Classification Lecture Notes
Professor Anita Wasilewska Classification Lecture Notes Classification (Data Mining Book Chapters 5 and 7) PART ONE: Supervised learning and Classification Data format: training and test data Concept,
More informationData Mining Classification: Decision Trees
Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous
More informationData Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction
Data Mining and Exploration Data Mining and Exploration: Introduction Amos Storkey, School of Informatics January 10, 2006 http://www.inf.ed.ac.uk/teaching/courses/dme/ Course Introduction Welcome Administration
More informationIntroduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing
Introduction to Data Mining and Machine Learning Techniques Iza Moise, Evangelos Pournaras, Dirk Helbing Iza Moise, Evangelos Pournaras, Dirk Helbing 1 Overview Main principles of data mining Definition
More informationData Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland
Data Mining and Knowledge Discovery in Databases (KDD) State of the Art Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland 1 Conference overview 1. Overview of KDD and data mining 2. Data
More informationData Mining Techniques
15.564 Information Technology I Business Intelligence Outline Operational vs. Decision Support Systems What is Data Mining? Overview of Data Mining Techniques Overview of Data Mining Process Data Warehouses
More informationFinding Frequent Patterns Based On Quantitative Binary Attributes Using FP-Growth Algorithm
R. Sridevi et al Int. Journal of Engineering Research and Applications RESEARCH ARTICLE OPEN ACCESS Finding Frequent Patterns Based On Quantitative Binary Attributes Using FP-Growth Algorithm R. Sridevi,*
More informationFoundations of Artificial Intelligence. Introduction to Data Mining
Foundations of Artificial Intelligence Introduction to Data Mining Objectives Data Mining Introduce a range of data mining techniques used in AI systems including : Neural networks Decision trees Present
More informationAn Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
More informationChapter 4 Data Mining A Short Introduction. 2006/7, Karl Aberer, EPFL-IC, Laboratoire de systèmes d'informations répartis Data Mining - 1
Chapter 4 Data Mining A Short Introduction 2006/7, Karl Aberer, EPFL-IC, Laboratoire de systèmes d'informations répartis Data Mining - 1 1 Today's Question 1. Data Mining Overview 2. Association Rule Mining
More informationMAXIMAL FREQUENT ITEMSET GENERATION USING SEGMENTATION APPROACH
MAXIMAL FREQUENT ITEMSET GENERATION USING SEGMENTATION APPROACH M.Rajalakshmi 1, Dr.T.Purusothaman 2, Dr.R.Nedunchezhian 3 1 Assistant Professor (SG), Coimbatore Institute of Technology, India, rajalakshmi@cit.edu.in
More informationInformation Management course
Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 01 : 06/10/2015 Practical informations: Teacher: Alberto Ceselli (alberto.ceselli@unimi.it)
More informationClustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016
Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with
More informationRecognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28
Recognition Topics that we will try to cover: Indexing for fast retrieval (we still owe this one) History of recognition techniques Object classification Bag-of-words Spatial pyramids Neural Networks Object
More informationLecture 10: Regression Trees
Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,
More informationClustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca
Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?
More informationFuzzy Association Rules
Vienna University of Economics and Business Administration Fuzzy Association Rules An Implementation in R Master Thesis Vienna University of Economics and Business Administration Author Bakk. Lukas Helm
More informationBig Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising
More informationCS104: Data Structures and Object-Oriented Design (Fall 2013) October 24, 2013: Priority Queues Scribes: CS 104 Teaching Team
CS104: Data Structures and Object-Oriented Design (Fall 2013) October 24, 2013: Priority Queues Scribes: CS 104 Teaching Team Lecture Summary In this lecture, we learned about the ADT Priority Queue. A
More informationUsing Data Mining and Machine Learning in Retail
Using Data Mining and Machine Learning in Retail Omeid Seide Senior Manager, Big Data Solutions Sears Holdings Bharat Prasad Big Data Solution Architect Sears Holdings Over a Century of Innovation A Fortune
More informationSTATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and
Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table
More informationStatic Data Mining Algorithm with Progressive Approach for Mining Knowledge
Global Journal of Business Management and Information Technology. Volume 1, Number 2 (2011), pp. 85-93 Research India Publications http://www.ripublication.com Static Data Mining Algorithm with Progressive
More informationData Mining. Concepts, Models, Methods, and Algorithms. 2nd Edition
Brochure More information from http://www.researchandmarkets.com/reports/2171322/ Data Mining. Concepts, Models, Methods, and Algorithms. 2nd Edition Description: This book reviews state-of-the-art methodologies
More informationData Mining. 1 Introduction 2 Data Mining methods. Alfred Holl Data Mining 1
Data Mining 1 Introduction 2 Data Mining methods Alfred Holl Data Mining 1 1 Introduction 1.1 Motivation 1.2 Goals and problems 1.3 Definitions 1.4 Roots 1.5 Data Mining process 1.6 Epistemological constraints
More informationDEVELOPMENT OF HASH TABLE BASED WEB-READY DATA MINING ENGINE
DEVELOPMENT OF HASH TABLE BASED WEB-READY DATA MINING ENGINE SK MD OBAIDULLAH Department of Computer Science & Engineering, Aliah University, Saltlake, Sector-V, Kol-900091, West Bengal, India sk.obaidullah@gmail.com
More informationClustering through Decision Tree Construction in Geology
Nonlinear Analysis: Modelling and Control, 2001, v. 6, No. 2, 29-41 Clustering through Decision Tree Construction in Geology Received: 22.10.2001 Accepted: 31.10.2001 A. Juozapavičius, V. Rapševičius Faculty
More informationDecision Support Optimization through Predictive Analytics - Leuven Statistical Day 2010
Decision Support Optimization through Predictive Analytics - Leuven Statistical Day 2010 Ernst van Waning Senior Sales Engineer May 28, 2010 Agenda SPSS, an IBM Company SPSS Statistics User-driven product
More informationPrinciples of Data Mining by Hand&Mannila&Smyth
Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences
More informationProtein Protein Interaction Networks
Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics
More informationIntroduction to Learning & Decision Trees
Artificial Intelligence: Representation and Problem Solving 5-38 April 0, 2007 Introduction to Learning & Decision Trees Learning and Decision Trees to learning What is learning? - more than just memorizing
More informationData Mining Part 5. Prediction
Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification
More informationChapter 6: Episode discovery process
Chapter 6: Episode discovery process Algorithmic Methods of Data Mining, Fall 2005, Chapter 6: Episode discovery process 1 6. Episode discovery process The knowledge discovery process KDD process of analyzing
More informationBusiness Problems and Data Science Solutions
CSCI E-84 A Practical Approach to Data Science Ramon A. Mata-Toledo, Ph.D. Professor of Computer Science Harvard Extension School Unit 1 - Lecture 2 February, Wednesday 3, 2016 Business Problems and Data
More informationMethods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL
Paper SA01-2012 Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL ABSTRACT Analysts typically consider combinations
More informationFrequent item set mining
Frequent item set mining Christian Borgelt Frequent item set mining is one of the best known and most popular data mining methods. Originally developed for market basket analysis, it is used nowadays for
More informationSPATIAL DATA CLASSIFICATION AND DATA MINING
, pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal
More informationData Mining Techniques and its Applications in Banking Sector
Data Mining Techniques and its Applications in Banking Sector Dr. K. Chitra 1, B. Subashini 2 1 Assistant Professor, Department of Computer Science, Government Arts College, Melur, Madurai. 2 Assistant
More informationExtend Table Lens for High-Dimensional Data Visualization and Classification Mining
Extend Table Lens for High-Dimensional Data Visualization and Classification Mining CPSC 533c, Information Visualization Course Project, Term 2 2003 Fengdong Du fdu@cs.ubc.ca University of British Columbia
More informationDHL Data Mining Project. Customer Segmentation with Clustering
DHL Data Mining Project Customer Segmentation with Clustering Timothy TAN Chee Yong Aditya Hridaya MISRA Jeffery JI Jun Yao 3/30/2010 DHL Data Mining Project Table of Contents Introduction to DHL and the
More informationData Mining for Business Analytics
Data Mining for Business Analytics Lecture 2: Introduction to Predictive Modeling Stern School of Business New York University Spring 2014 MegaTelCo: Predicting Customer Churn You just landed a great analytical
More informationData Mining Techniques Chapter 6: Decision Trees
Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................
More informationMining Association Rules. Mining Association Rules. What Is Association Rule Mining? What Is Association Rule Mining? What is Association rule mining
Mining Association Rules What is Association rule mining Mining Association Rules Apriori Algorithm Additional Measures of rule interestingness Advanced Techniques 1 2 What Is Association Rule Mining?
More informationPractical Applications of DATA MINING. Sang C Suh Texas A&M University Commerce JONES & BARTLETT LEARNING
Practical Applications of DATA MINING Sang C Suh Texas A&M University Commerce r 3 JONES & BARTLETT LEARNING Contents Preface xi Foreword by Murat M.Tanik xvii Foreword by John Kocur xix Chapter 1 Introduction
More informationDistributed Apriori in Hadoop MapReduce Framework
Distributed Apriori in Hadoop MapReduce Framework By Shulei Zhao (sz2352) and Rongxin Du (rd2537) Individual Contribution: Shulei Zhao: Implements centralized Apriori algorithm and input preprocessing
More informationData Mining Solutions for the Business Environment
Database Systems Journal vol. IV, no. 4/2013 21 Data Mining Solutions for the Business Environment Ruxandra PETRE University of Economic Studies, Bucharest, Romania ruxandra_stefania.petre@yahoo.com Over
More informationUniversité de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr
Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr WEKA Gallirallus Zeland) australis : Endemic bird (New Characteristics Waikato university Weka is a collection
More informationTransforming the Telecoms Business using Big Data and Analytics
Transforming the Telecoms Business using Big Data and Analytics Event: ICT Forum for HR Professionals Venue: Meikles Hotel, Harare, Zimbabwe Date: 19 th 21 st August 2015 AFRALTI 1 Objectives Describe
More informationHOW TO USE MINITAB: DESIGN OF EXPERIMENTS. Noelle M. Richard 08/27/14
HOW TO USE MINITAB: DESIGN OF EXPERIMENTS 1 Noelle M. Richard 08/27/14 CONTENTS 1. Terminology 2. Factorial Designs When to Use? (preliminary experiments) Full Factorial Design General Full Factorial Design
More informationMining Sequence Data. JERZY STEFANOWSKI Inst. Informatyki PP Wersja dla TPD 2009 Zaawansowana eksploracja danych
Mining Sequence Data JERZY STEFANOWSKI Inst. Informatyki PP Wersja dla TPD 2009 Zaawansowana eksploracja danych Outline of the presentation 1. Realtionships to mining frequent items 2. Motivations for
More informationBIRCH: An Efficient Data Clustering Method For Very Large Databases
BIRCH: An Efficient Data Clustering Method For Very Large Databases Tian Zhang, Raghu Ramakrishnan, Miron Livny CPSC 504 Presenter: Discussion Leader: Sophia (Xueyao) Liang HelenJr, Birches. Online Image.
More informationBig Data & Scripting Part II Streaming Algorithms
Big Data & Scripting Part II Streaming Algorithms 1, 2, a note on sampling and filtering sampling: (randomly) choose a representative subset filtering: given some criterion (e.g. membership in a set),
More informationIntroduction to Data Mining
Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association
More informationBuilding A Smart Academic Advising System Using Association Rule Mining
Building A Smart Academic Advising System Using Association Rule Mining Raed Shatnawi +962795285056 raedamin@just.edu.jo Qutaibah Althebyan +962796536277 qaalthebyan@just.edu.jo Baraq Ghalib & Mohammed
More informationIntroduction. A. Bellaachia Page: 1
Introduction 1. Objectives... 3 2. What is Data Mining?... 4 3. Knowledge Discovery Process... 5 4. KD Process Example... 7 5. Typical Data Mining Architecture... 8 6. Database vs. Data Mining... 9 7.
More informationA Study of Detecting Credit Card Delinquencies with Data Mining using Decision Tree Model
A Study of Detecting Credit Card Delinquencies with Data Mining using Decision Tree Model ABSTRACT Mrs. Arpana Bharani* Mrs. Mohini Rao** Consumer credit is one of the necessary processes but lending bears
More informationData Warehousing and Data Mining. A.A. 04-05 Datawarehousing & Datamining 1
Data Warehousing and Data Mining A.A. 04-05 Datawarehousing & Datamining 1 Outline 1. Introduction and Terminology 2. Data Warehousing 3. Data Mining Association rules Sequential patterns Classification
More informationAn Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015
An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content
More informationAzure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
More informationClassification On The Clouds Using MapReduce
Classification On The Clouds Using MapReduce Simão Martins Instituto Superior Técnico Lisbon, Portugal simao.martins@tecnico.ulisboa.pt Cláudia Antunes Instituto Superior Técnico Lisbon, Portugal claudia.antunes@tecnico.ulisboa.pt
More informationHow To Cluster
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main
More informationA COGNITIVE APPROACH IN PATTERN ANALYSIS TOOLS AND TECHNIQUES USING WEB USAGE MINING
A COGNITIVE APPROACH IN PATTERN ANALYSIS TOOLS AND TECHNIQUES USING WEB USAGE MINING M.Gnanavel 1 & Dr.E.R.Naganathan 2 1. Research Scholar, SCSVMV University, Kanchipuram,Tamil Nadu,India. 2. Professor
More informationUNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS
UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS Dwijesh C. Mishra I.A.S.R.I., Library Avenue, New Delhi-110 012 dcmishra@iasri.res.in What is Learning? "Learning denotes changes in a system that enable
More informationA Data Mining Tutorial
A Data Mining Tutorial Presented at the Second IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN 98) 14 December 1998 Graham Williams, Markus Hegland and Stephen
More informationOverview. Background. Data Mining Analytics for Business Intelligence and Decision Support
Mining Analytics for Business Intelligence and Decision Support Chid Apte, PhD Manager, Abstraction Research Group IBM TJ Watson Research Center apte@us.ibm.com http://www.research.ibm.com/dar Overview
More informationCHAPTER-12. Analytical Characterization : Analysis of Attribute Relevance
CHAPTER-12 Analytical Characterization : Analysis of Attribute Relevance 12.1 Introduction 12.2 Methods of Attribute Relevance Analysis 12.3 Review Questions 12.4 References 12. Analytical Characterization
More informationEffective Pruning for the Discovery of Conditional Functional Dependencies
Effective Pruning for the Discovery of Conditional Functional Dependencies Jiuyong Li 1, Jiuxue Liu 1, Hannu Toivonen 2, Jianming Yong 3 1 School of Computer and Information Science, University of South
More informationData Mining with SQL Server Data Tools
Data Mining with SQL Server Data Tools Data mining tasks include classification (directed/supervised) models as well as (undirected/unsupervised) models of association analysis and clustering. 1 Data Mining
More informationWhite Paper. Redefine Your Analytics Journey With Self-Service Data Discovery and Interactive Predictive Analytics
White Paper Redefine Your Analytics Journey With Self-Service Data Discovery and Interactive Predictive Analytics Contents Self-service data discovery and interactive predictive analytics... 1 What does
More information!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"
!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"!"#"$%&#'()*+',$$-.&#',/"-0%.12'32./4'5,5'6/%&)$).2&'7./&)8'5,5'9/2%.%3%&8':")08';:
More informationData Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier
Data Mining: Concepts and Techniques Jiawei Han Micheline Kamber Simon Fräser University К MORGAN KAUFMANN PUBLISHERS AN IMPRINT OF Elsevier Contents Foreword Preface xix vii Chapter I Introduction I I.
More informationSummary Data Mining & Process Mining (1BM46) Content. Made by S.P.T. Ariesen
Summary Data Mining & Process Mining (1BM46) Made by S.P.T. Ariesen Content Data Mining part... 2 Lecture 1... 2 Lecture 2:... 4 Lecture 3... 7 Lecture 4... 9 Process mining part... 13 Lecture 5... 13
More informationFacebook Friend Suggestion Eytan Daniyalzade and Tim Lipus
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information
More information