Big Data Analysis. Rajen D. Shah (Statistical Laboratory, University of Cambridge) joint work with Nicolai Meinshausen (Seminar für Statistik, ETH
1 Big Data Analysis Rajen D Shah (Statistical Laboratory, University of Cambridge) joint work with Nicolai Meinshausen (Seminar für Statistik, ETH Zürich) University of Cambridge Mathematical Sciences Showcase 29 January 2014 Rajen Shah (Cambridge) Big Data Analysis 29 Jan / 25
2 What is Big Data? The size of the data is such that computational considerations become important when choosing which algorithm to use.
4 Large-scale classification with binary data
Customer 1a: {cheese, eggs, juice, milk, ...}
Customer 2a: {cereal, cheese, eggs, milk, ...}
Customer 3a: {cheese, crisps, eggs, milk, ...}
Customer 1b: {cereal, eggs, juice, milk, ...}
Customer 2b: {book, cheese, crisps, DVD, eggs, ...}
Customer 3b: {cheese, crisps, juice, milk, ...}
Given two groups of customers, the aim is to find a collection of items that is often bought together within one group, but only rarely in the other group. One can also imagine having two groups of documents. Here the aim would be to find groups of words that occur frequently together in one class, but not in the other.
5 Alternative view of the data: a binary matrix with one row per customer (1a, 2a, 3a, 1b, 2b, 3b) and one column per item (book, cereal, cheese, crisps, DVD, eggs, juice, milk).
6 Searching for subsets: the same customer-by-item matrix, searched for subsets of columns.
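A minimal sketch of the binary view and of a brute-force subset search, assuming the baskets contain only the items actually listed on the customer slide (anything hidden behind the trailing ellipsis is unknown and ignored here). The names `binary_row`, `prevalence` and `best_subsets` are illustrative, not taken from the talk.

```python
from itertools import combinations

# Columns of the binary matrix, in the order shown on the slide.
ITEMS = ["book", "cereal", "cheese", "crisps", "DVD", "eggs", "juice", "milk"]

# Baskets as listed on the slide (assumption: no further items).
GROUP_A = {
    "1a": {"cheese", "eggs", "juice", "milk"},
    "2a": {"cereal", "cheese", "eggs", "milk"},
    "3a": {"cheese", "crisps", "eggs", "milk"},
}
GROUP_B = {
    "1b": {"cereal", "eggs", "juice", "milk"},
    "2b": {"book", "cheese", "crisps", "DVD", "eggs"},
    "3b": {"cheese", "crisps", "juice", "milk"},
}


def binary_row(basket):
    """Encode one basket as a 0/1 row over ITEMS."""
    return [1 if item in basket else 0 for item in ITEMS]


def prevalence(itemset, group):
    """Fraction of baskets in the group that contain every item of itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in group.values()) / len(group)


def best_subsets(size):
    """Brute force: score every item subset of the given size by the
    difference in prevalence between group A and group B."""
    scored = []
    for subset in combinations(ITEMS, size):
        gap = prevalence(subset, GROUP_A) - prevalence(subset, GROUP_B)
        scored.append((gap, subset))
    top = max(gap for gap, _ in scored)
    return top, sorted(s for gap, s in scored if gap == top)
```

On these six baskets, the three-item search singles out {cheese, eggs, milk}, which every group-a customer bought and no group-b customer bought in full. The exhaustive loop over `combinations` is exactly what becomes infeasible as the number of items grows.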
10 Scaling of the algorithm. With $p$ different variables, the number of subsets of size 2 (or potential two-way interactions) is roughly $p^2/2$. The number of three-way interactions is roughly $p^3/6$. In general, the number of $d$-way interactions will be $O(p^d)$. Take $p = 10{,}000$. Then the number of two-way interactions is roughly $5 \times 10^7$, i.e. 50 million. Three-way interactions: more than $10^{11}$, i.e. 100 billion.
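The arithmetic can be checked directly with exact binomial coefficients. This is only a small sketch; the helper name `interaction_counts` is made up for illustration.

```python
from math import comb


def interaction_counts(p, max_order=3):
    """Exact number of d-way interactions (subsets of size d) for d = 2..max_order."""
    return {d: comb(p, d) for d in range(2, max_order + 1)}


counts = interaction_counts(10_000)
print(counts[2])  # 49995000, close to p**2 / 2 = 50 million
print(counts[3])  # 166616670000, above 100 billion
```

The exact counts sit slightly below $p^2/2$ and $p^3/6$ because a subset cannot repeat a variable, but the orders of magnitude are the same.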
11 Restricting the search space: the customer-by-item binary matrix again, with rows 1a, 2a, 3a, 1b, 2b, 3b and columns book, cereal, cheese, crisps, DVD, eggs, juice, milk.
18 Decision trees and related methods. Decision trees like CART (Breiman et al., 1984) build up interactions / patterns greedily, starting from the individual variables. They often work well, but give no guarantee that a strong interaction will be found. Alternative strategies such as linear models, logistic regression and association rule mining techniques all have the same drawback.
19 Toy example where most current methods fail: a table of binary entries for the variables Var1 to Var9.
20 Idea: look at rows rather than columns (same table of Var1 to Var9).
22 Arrange the search in a tree. Computing intersections between large sets can be time-consuming. However, computing an intersection between a small and a large set is very cheap. We should try to re-use intersections between large sets that we have calculated.
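The cost argument can be illustrated with a short sketch, assuming each observation is stored as a set of its active variables. The function names are hypothetical. The intersection scans only the smaller set and probes the larger one, so its cost is proportional to the smaller size; keeping a running intersection re-uses earlier work.

```python
def intersect(small, large):
    """Intersect two sets by scanning the smaller one and probing the larger."""
    if len(small) > len(large):
        small, large = large, small
    return {x for x in small if x in large}


def running_intersections(observations):
    """Yield the intersection of the first k observations for k = 1, 2, ...

    Each step re-uses the previous (already small) intersection instead of
    recomputing it from the original large sets.
    """
    current = None
    for obs in observations:
        current = set(obs) if current is None else intersect(current, obs)
        yield current
```

Once the running intersection has shrunk, every further step is cheap regardless of how large the next observation is.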
23 Example: Tic-Tac-Toe data. Dataset with endgames of Tic-Tac-Toe games. Learn the rules of the game (or probabilities of winning) by looking at the database. Each variable is coded as binary (e.g. is the first square occupied by a black stone?). Marginal effects are weak.
24 Arranging the search on a tree. Random Intersection Tree: intersections are shown in the nodes, random observations along edges.
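A rough sketch of one such tree, based only on the description on the slide: each edge draws a random observation and each node stores the intersection of its parent's set with that observation. The root choice, the `depth` and `branching` parameters and the function name are assumptions made for illustration, not the authors' exact algorithm.

```python
import random


def random_intersection_tree(observations, depth, branching, rng):
    """Return the non-empty item sets found at the leaves of one tree.

    observations: list of sets (active variables per row, from one class).
    """
    leaves = set()
    # Root: a single randomly chosen observation (assumption).
    stack = [(frozenset(rng.choice(observations)), 0)]
    while stack:
        node, level = stack.pop()
        if level == depth or not node:
            if node:
                leaves.add(node)  # surviving set is a candidate interaction
            continue
        for _ in range(branching):
            # Edge: draw a random observation; child: intersect with parent.
            child = node & rng.choice(observations)
            stack.append((frozenset(child), level + 1))
    return leaves


# Usage on the three group-a baskets listed on the customer slide.
group_a = [
    {"cheese", "eggs", "juice", "milk"},
    {"cereal", "cheese", "eggs", "milk"},
    {"cheese", "crisps", "eggs", "milk"},
]
candidates = random_intersection_tree(group_a, depth=3, branching=2, rng=random.Random(0))
```

Every candidate returned here contains {cheese, eggs, milk}, because that set is present in every observation and so survives every intersection. Candidates would then be checked against the other class to see whether they are rare there.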
26 Analysis of the algorithm. We can try to study the average number of operations required to recover an interaction with a given probability. One can show that for the coin-flipping toy example, the number of operations required is just over $O(p)$ (there are some additional $\log(p)$ terms). More generally, whenever the prevalence of the target interaction is high and the data is sparse, Random Intersection Trees tend to perform quite well. To speed up evaluation of the prevalence of candidate interactions, we use a method based on a technique from Computer Science called min-wise hashing (Broder, 1998).
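For readers unfamiliar with min-wise hashing, here is a generic sketch of the basic technique, not the authors' prevalence estimator. Under a random permutation of the universe, two sets share the same minimum rank with probability equal to their Jaccard similarity, so the fraction of matching minima over many permutations estimates it. Explicit seeded permutations stand in for hash functions here, and all names are illustrative.

```python
import random


def make_permutations(universe_size, num_perm, seed=0):
    """Random rank assignments for elements 0..universe_size-1."""
    rng = random.Random(seed)
    perms = []
    for _ in range(num_perm):
        ranks = list(range(universe_size))
        rng.shuffle(ranks)
        perms.append(ranks)
    return perms


def minhash_signature(items, permutations):
    """One minimum rank per permutation; items must be non-empty."""
    return [min(perm[i] for i in items) for perm in permutations]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of permutations on which the two minima coincide."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Usage: sets of observation indices in which a variable is active.
perms = make_permutations(universe_size=100, num_perm=50)
sig_x = minhash_signature(set(range(0, 40)), perms)
sig_y = minhash_signature(set(range(20, 60)), perms)
estimate = estimate_jaccard(sig_x, sig_y)  # true Jaccard similarity is 20/60
```

The signatures are short and fixed-length, so comparing two variables costs time proportional to the number of permutations rather than to the number of observations.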
27 Discussion. Simple problems in the small data setting can become interesting in the big data setting. Sometimes solutions to big data problems must draw on ideas from both Computer Science and Statistics.