Decision Trees and Random Forests. Reference: Leo Breiman,

Size: px
Start display at page:

Download "Decision Trees and Random Forests. Reference: Leo Breiman,"

Transcription

1 Decision Trees and Random Forests Reference: Leo Breiman, 1. Decision trees Example (Guerts, Fillet, et al., Bioinformatics 2005): Patients to be classified: normal vs. diseased

2 Decision trees Classification of biomarker data: large number of values (e.g., microarray or mass spectrometry analysis of biological sample)

3 Decision trees Mass spectrometry parameters or gene expression parameters (around 15k values) Given new patient with biomarker data, is s/he normal or ill?

4 Decision trees Needed: selection of relevant variables from many 8 x 3 3 3œ" Number 8 of known examples in H œ ÖÐ ßC Ñ is small (characteristic of data mining problems) Assume we have for each biological sample a feature vector x, and will classify it: diseased: Cœ"à normal: Cœ "Þ Goal: find function 0ÐxÑ C which predicts C from x.

5 Decision trees How to estimate error of 0ÐxÑ and avoid over-fitting the small dataset H? Use cross-validation to test predictor part of the sample HÞ 0ÐxÑ in an unexamined For biological sample, feature vector x œðbßáßb ". Ñ consists of features ( or biomarkers or attributes) B3 œ E3 describing the biological sample from which x is obtained.

6 The decision tree approach Decision tree approach to finding predictor 0ÐxÑ œ C on data set H: based Š form a tree whose nodes are attributes B3 œ E3 in x Š decide which attributes E3 to look at first in predicting C from x find those with highest information gain - place these at top of tree Š then use recursion to form sub-trees based on attributes not used in the higher nodes:

7 The decision tree approach Advantages: interpretable, easy to use, scalable, robust

8 Decision tree example Example 1 (Moore): UCI data repository ( MPG ratings of cars: Goal: predict MPG rating of a car from a set of attributes E 3

9 Decision tree example Examples (each row is attribute set for a sample car): R. Quinlan

10 Decision tree example Simple assessment of information gain: how much does a particular attribute E 3 help to classify a car with respect to MPG?

11 Decision tree example Begin the decision tree: start with most informative

12 Decision tree example criterion, cylinders:

13 Decision tree example Recursion: build next level of tree. Initially have: Now build sub-trees: split each set of cylinder numbers into

14 Decision tree example further groups- Resulting next level:

15 Final tree: Decision tree example

16 Decision tree example

17 Points: Decision tree example è Don't split node if all records have same value (e.g. cylinders œ'ñ è Don't split node if can't have more than 1 child (e.g. acceleration œ medium)

18 Pseudocode: Program Tree(Input, Output) If all output values are the same, then return leaf (terminal) node which predicts the unique output If input values are balanced in a leaf node (e.g. 1 good, 1 bad in acceleration) then return leaf predicting majority of outputs on same level (e.g. bad in this case) Else find attribute with highest information gain E 3 If attribute at current node has values E 7 3

19 then Return internal (non-leaf) node with 7 children Build child 3 by calling Tree(NewIn, NewOut), where NewIn œ values in dataset consistent with value E 3 and all values above this node

20 Another decision tree: prediction of wealth from census data (Moore):

21 Prediction of age from census:

22 Prediction of gender from census: A. Moore

23 2. Important point: always cross-validate It is important to test your model on new data (test data) which are different from the data used to train the model (training data). This is cross-validation. Cross-validation error 2% is good; 40% is poor.

24 3. Background: mass spectroscopy What does a mass spectrometer do? 1. It measures masses of molecules better than any other technique. 2. It can give information about chemical structures of molecules.

25 Mass spectroscopy How does it work? 1. Takes unknown molecule Mß adds 3 protons to it giving it charge 3(forming MH Ñ 2. Accelerates ion MH 3 in known electric field I. 3. Measures time of flight along a known distance H. 4. Time X of flight is inversely proportional to electric charge 3 and proportional to mass 7 of ion. 3

26 Thus Mass spectroscopy Xº3Î7 So mass spectrometer measures ratio of charge 3 (also known as D) and 7, i.e., 3Î7 œ DÎ7Þ With a large number of molecules in a biosample, this gives a spectrum of DÎ7 values, which allows identification of molecules in sample (here IgG œ immunoglobin G)

27 Mass spectroscopy

28 Mass spectroscopy What are the measurements good for? To identify, verify, and quantify: metabolites, proteins, oligonucleotides, drug candidates, peptides, synthetic organic chemicals, polymers Applications of Mass Spectrometry Biomolecule characterization Pharmaceutical analysis Proteins and peptides Oligonucleotides

29 Mass spectroscopy How does a mass spectrometer work?

30 Mass spectroscopy [Source: Sandler Mass Spectroscopy]

31 Two types of ionization: Mass spectroscopy 1. Electrospray ionization (ESI):

32 Mass spectroscopy SMS

33 3 Mass spectroscopy [above MH denotes molecule with 3 protons (H ) attached]

34 2. MALDI: Mass spectroscopy SMS

35 Mass spectroscopy Mass analyzers separate ions based on their mass-tocharge ratio (m/z) Operate under high vacuum Measure mass-to-charge ratio of ions (m/z)

36 Components: Mass spectroscopy 1. Quadrupole Mass Analyzer (filter) Uses a combination of RF and DC voltages to operate as a mass filter before masses are accelerated. Has four parallel metal rods. Lets one mass pass through at a time. Can scan through all masses or only allow one fixed mass.

37 Mass spectroscopy

38 Mass spectroscopy 2. Time-of-flight (TOF) Mass Analyzer Accelerates ions with electric field, detects them, analyzes flight time.

39 Mass spectroscopy

40 Mass spectroscopy Ions are formed in pulses. The drift region is field free. Measures the time for ions to reach the detector. Small ions reach the detector before large ones. Ion trap mass analyzer:

41 Mass spectroscopy

42 Mass spectroscopy 3. Detector: Ions are detected with a microchannel plate:

43 Mass spectroscopy Mass spectrum shows the results: ESI Spectrum of Trypsinogen (MW 23983)

44 Mass spectroscopy

45 Mass spectroscopy 4. Dimensional reduction (G. Izmirlian): Sometimes we perform a dimension reduction by reducing mass spectrum information of human subject 3 to store only peaks:

46 Mass spectroscopy

47 Mass spectroscopy Then have (compressed) peak information in feature vector x œðbßáßb ". Ñ, >2 with B5 œ location of 5 mass spectrum peak (above a fixed threshold). Compressed or not, outcome value to feature vector subject 3 is C œ "Þ 3 x 3 for

48 5. Random forest example Example (Guerts, et al.): Normal/sick dichotomy for RA and for IBD (above - Geurts, et al.): we now build a forest of decision trees based on differing attributes in the nodes:

49 Random forest application For example: Could use mass spectroscopy data to determine disease state

50 Random forest application Mass Spec segregates protein and other molecules through spectrum of 7ÎD ratios ( 7 œ mass; D œ charge).

51 Random forest application Geurts, et al.

52 Random Forests: Advantages: accurate, easy to use (Breiman software), fast, robust Disadvantages: difficult to interpret More generally: How to combine results of different predictors (e.g. decision trees)? Random forests are examples of ensemble methods, which combine predictions of weak classifiers :Ð Ñ. 3 x

53 Ensemble methods: observations 1. Boosting: As seen earlier, take linear combination of predictions :Ð 3 xñ by classifiers 3(assume these are decision trees) 0ÐxÑ œ " + 3: 3 ÐxÑ, (1) if tree predicts illness where :Ð Ñœ " 3 3 x œ, " otherwise >2 and predict Cœ" if 0ÐxÑ! and Cœ " if 0ÐxÑ!Þ 3

54 Ensemble methods: observations 2. Bagging: Take a vote: majority rules (equivalent in this case to setting + 3 œ " for all 3 in (1) above). Example of a Bagging algorithm is random forest, where a forest of decision trees takes a vote.

55 General features of a random forest: If original feature vector x has. features E ßáßE,. ". Each tree uses a random selection of 7 È. features 7 ÖE 3 4œ" chosen from all features E", E# ß á, E. ; 4 the associated feature space is different for each tree and denoted by Jß"Ÿ5ŸOœ 5 # trees. (Often O œ # trees is large; e.g., O œ &!!ÑÞ For each split in a tree node based on a given variable choose the variable from information content. E 3

56 Information content in a node To compute information content of a node: Assume input set to node is W: then information content of node R is

57 Information content in a node MÐRÑ œ lwl LÐWÑ lw l LÐW Ñ lw l LÐW Ñß P P V V where lwl œ input sample size; lwpßvl œ size of leftß right subclasses of W LÐWÑ œ Shannon entropy of W œ ": log : 3œ " 3 # 3 with : œ TÐG s lwñ œ proportion of class G in sample W

58 Information content in a node [later we will use Gini index, another criterion] Thus LÐWÑ œ "variablity" or "lack of full information" in the probabilities : 3 forming sample W input into current node R. MÐRÑ œ "information from node R" Þ For each variable E3, average over all nodes R in all trees involving this variable to find average information content LavÐE3Ñof E3.

59 Information content in a node (a) Rank all variables E 3 according to information content (b) For each fixed 8" 8 use only the first 8" variables. Select which minimizes prediction error. 8 "

60 Information content in a node Geurts, et al.

61 Application to: Random forests: application ì early diagnosis of Rehumatoid arthritis ì rapid diagnosis of inflammatory bowel diseases (IBD) 3 patient groups (University Hospital of Liege):

62 Random forests: application RA IBD Disease patients Negative controls Inflammatory controls Total Mass spectra obtained by SELDI-TOF mass spectrometry on chip arrays:

63 Random forests: application ì Hydrophobic (H4) ì weak cation-exchange (CM10) ì strong cation-exchange (Q10)

64 Random forests: application Feature vectors: x J consists of about 15,000 values in each case. Effective dimension reduction method: Discretize horizontally and vertically to go from 15,000 to 300 variables

65 Random forests: application

66 Random forests: application Sensitivity and specificity:

67 Random forests: application Accuracy measures: DT=Decision tree; RF=random forest; 5NN = 5-nearest neighbors; Note on sensitivity and specificity: use confusion matrix Actual Condition Test outcome True False Positive TP FP Negative FN TN

68 Sensitivity œ Random forests: application XT XT œ XT JR Total positives Specificity œ XR XR JT œ XR Total negatives Positive predictive value œ XT XT JT œ XT Total predicted positives

69 Random forests: application Variable ranking on the IBD dataset: 10 most important variables in spectrum:

70 Random forests: application RF-based (tree ensemble) - based variable ranking vs. variable ranking by individual variable : values:

71 Random forests: application RA IBD

72 6. RF software: Spider: html Leo Breiman: software.htm WEKA machine learning software

Decision Trees and Random Forests. Reference: Leo Breiman, http://www.stat.berkeley.edu/~breiman/randomforests

Decision Trees and Random Forests. Reference: Leo Breiman, http://www.stat.berkeley.edu/~breiman/randomforests Decision Trees and Random Forests Reference: Leo Breiman, http://www.stat.berkeley.edu/~breiman/randomforests 1. Decision trees Example (Guerts, Fillet, et al., Bioinformatics 2005): Patients to be classified:

More information

Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

More information

Data Mining Classification: Decision Trees

Data Mining Classification: Decision Trees Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous

More information

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 lakshmi.mahanra@gmail.com

More information

Functional Data Analysis of MALDI TOF Protein Spectra

Functional Data Analysis of MALDI TOF Protein Spectra Functional Data Analysis of MALDI TOF Protein Spectra Dean Billheimer dean.billheimer@vanderbilt.edu. Department of Biostatistics Vanderbilt University Vanderbilt Ingram Cancer Center FDA for MALDI TOF

More information

Introduction to Learning & Decision Trees

Introduction to Learning & Decision Trees Artificial Intelligence: Representation and Problem Solving 5-38 April 0, 2007 Introduction to Learning & Decision Trees Learning and Decision Trees to learning What is learning? - more than just memorizing

More information

Aiping Lu. Key Laboratory of System Biology Chinese Academic Society APLV@sibs.ac.cn

Aiping Lu. Key Laboratory of System Biology Chinese Academic Society APLV@sibs.ac.cn Aiping Lu Key Laboratory of System Biology Chinese Academic Society APLV@sibs.ac.cn Proteome and Proteomics PROTEin complement expressed by genome Marc Wilkins Electrophoresis. 1995. 16(7):1090-4. proteomics

More information

Data Mining. Nonlinear Classification

Data Mining. Nonlinear Classification Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

More information

Data Mining for Knowledge Management. Classification

Data Mining for Knowledge Management. Classification 1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh

More information

Comparison of Data Mining Techniques used for Financial Data Analysis

Comparison of Data Mining Techniques used for Financial Data Analysis Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract

More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376 Course Director: Dr. Kayvan Najarian (DCM&B, kayvan@umich.edu) Lectures: Labs: Mondays and Wednesdays 9:00 AM -10:30 AM Rm. 2065 Palmer Commons Bldg. Wednesdays 10:30 AM 11:30 AM (alternate weeks) Rm.

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

Preprocessing, Management, and Analysis of Mass Spectrometry Proteomics Data

Preprocessing, Management, and Analysis of Mass Spectrometry Proteomics Data Preprocessing, Management, and Analysis of Mass Spectrometry Proteomics Data M. Cannataro, P. H. Guzzi, T. Mazza, and P. Veltri Università Magna Græcia di Catanzaro, Italy 1 Introduction Mass Spectrometry

More information

Fast Analytics on Big Data with H20

Fast Analytics on Big Data with H20 Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,

More information

Electrospray Ion Trap Mass Spectrometry. Introduction

Electrospray Ion Trap Mass Spectrometry. Introduction Electrospray Ion Source Electrospray Ion Trap Mass Spectrometry Introduction The key to using MS for solutions is the ability to transfer your analytes into the vacuum of the mass spectrometer as ionic

More information

Using Random Forest to Learn Imbalanced Data

Using Random Forest to Learn Imbalanced Data Using Random Forest to Learn Imbalanced Data Chao Chen, chenchao@stat.berkeley.edu Department of Statistics,UC Berkeley Andy Liaw, andy liaw@merck.com Biometrics Research,Merck Research Labs Leo Breiman,

More information

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised

More information

Classification and Prediction

Classification and Prediction Classification and Prediction Slides for Data Mining: Concepts and Techniques Chapter 7 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser

More information

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Ensembles 2 Learning Ensembles Learn multiple alternative definitions of a concept using different training

More information

Using multiple models: Bagging, Boosting, Ensembles, Forests

Using multiple models: Bagging, Boosting, Ensembles, Forests Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or

More information

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA D.Lavanya 1 and Dr.K.Usha Rani 2 1 Research Scholar, Department of Computer Science, Sree Padmavathi Mahila Visvavidyalayam, Tirupati, Andhra Pradesh,

More information

13C NMR Spectroscopy

13C NMR Spectroscopy 13 C NMR Spectroscopy Introduction Nuclear magnetic resonance spectroscopy (NMR) is the most powerful tool available for structural determination. A nucleus with an odd number of protons, an odd number

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

Model Combination. 24 Novembre 2009

Model Combination. 24 Novembre 2009 Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy

More information

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05 Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

More information

Decision Trees. Andrew W. Moore Professor School of Computer Science Carnegie Mellon University. www.cs.cmu.edu/~awm awm@cs.cmu.

Decision Trees. Andrew W. Moore Professor School of Computer Science Carnegie Mellon University. www.cs.cmu.edu/~awm awm@cs.cmu. Decision Trees Andrew W. Moore Professor School of Computer Science Carnegie Mellon University www.cs.cmu.edu/~awm awm@cs.cmu.edu 42-268-7599 Copyright Andrew W. Moore Slide Decision Trees Decision trees

More information

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-1, Issue-6, January 2013 Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing

More information

1. Classification problems

1. Classification problems Neural and Evolutionary Computing. Lab 1: Classification problems Machine Learning test data repository Weka data mining platform Introduction Scilab 1. Classification problems The main aim of a classification

More information

LC-MS/MS for Chromatographers

LC-MS/MS for Chromatographers LC-MS/MS for Chromatographers An introduction to the use of LC-MS/MS, with an emphasis on the analysis of drugs in biological matrices LC-MS/MS for Chromatographers An introduction to the use of LC-MS/MS,

More information

Studying Auto Insurance Data

Studying Auto Insurance Data Studying Auto Insurance Data Ashutosh Nandeshwar February 23, 2010 1 Introduction To study auto insurance data using traditional and non-traditional tools, I downloaded a well-studied data from http://www.statsci.org/data/general/motorins.

More information

Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

More information

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

More information

SVM Ensemble Model for Investment Prediction

SVM Ensemble Model for Investment Prediction 19 SVM Ensemble Model for Investment Prediction Chandra J, Assistant Professor, Department of Computer Science, Christ University, Bangalore Siji T. Mathew, Research Scholar, Christ University, Dept of

More information

Mass Spectrometry Based Proteomics

Mass Spectrometry Based Proteomics Mass Spectrometry Based Proteomics Proteomics Shared Research Oregon Health & Science University Portland, Oregon This document is designed to give a brief overview of Mass Spectrometry Based Proteomics

More information

An Approach to Detect Spam Emails by Using Majority Voting

An Approach to Detect Spam Emails by Using Majority Voting An Approach to Detect Spam Emails by Using Majority Voting Roohi Hussain Department of Computer Engineering, National University of Science and Technology, H-12 Islamabad, Pakistan Usman Qamar Faculty,

More information

Better credit models benefit us all

Better credit models benefit us all Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis

More information

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde

More information

Distributed forests for MapReduce-based machine learning

Distributed forests for MapReduce-based machine learning Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication

More information

Lecture 10: Regression Trees

Lecture 10: Regression Trees Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

More information

Integrated Data Mining Strategy for Effective Metabolomic Data Analysis

Integrated Data Mining Strategy for Effective Metabolomic Data Analysis The First International Symposium on Optimization and Systems Biology (OSB 07) Beijing, China, August 8 10, 2007 Copyright 2007 ORSC & APORC pp. 45 51 Integrated Data Mining Strategy for Effective Metabolomic

More information

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577 T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or

More information

Chapter 7. Feature Selection. 7.1 Introduction

Chapter 7. Feature Selection. 7.1 Introduction Chapter 7 Feature Selection Feature selection is not used in the system classification experiments, which will be discussed in Chapter 8 and 9. However, as an autonomous system, OMEGA includes feature

More information

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step

More information

HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION

HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION Chihli Hung 1, Jing Hong Chen 2, Stefan Wermter 3, 1,2 Department of Management Information Systems, Chung Yuan Christian University, Taiwan

More information

Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms

Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms Yin Zhao School of Mathematical Sciences Universiti Sains Malaysia (USM) Penang, Malaysia Yahya

More information

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

Random forest algorithm in big data environment

Random forest algorithm in big data environment Random forest algorithm in big data environment Yingchun Liu * School of Economics and Management, Beihang University, Beijing 100191, China Received 1 September 2014, www.cmnt.lv Abstract Random forest

More information

Protein Protein Interaction Networks

Protein Protein Interaction Networks Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

More information

Fraud Detection for Online Retail using Random Forests

Fraud Detection for Online Retail using Random Forests Fraud Detection for Online Retail using Random Forests Eric Altendorf, Peter Brende, Josh Daniel, Laurent Lessard Abstract As online commerce becomes more common, fraud is an increasingly important concern.

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News

Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News Sushilkumar Kalmegh Associate Professor, Department of Computer Science, Sant Gadge Baba Amravati

More information

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether

More information

Predicting borrowers chance of defaulting on credit loans

Predicting borrowers chance of defaulting on credit loans Predicting borrowers chance of defaulting on credit loans Junjie Liang (junjie87@stanford.edu) Abstract Credit score prediction is of great interests to banks as the outcome of the prediction algorithm

More information

Factors Influencing LC/MS/MS Moving into Clinical and Research Laboratories

Factors Influencing LC/MS/MS Moving into Clinical and Research Laboratories Factors Influencing LC/MS/MS Moving into Clinical and Research Laboratories Matthew Clabaugh, Market Development AACC Workshop St. Louis, MO September 17,18 2013 Factors Influencing LCMSMS Moving into

More information

Introduction to mass spectrometry (MS) based proteomics and metabolomics

Introduction to mass spectrometry (MS) based proteomics and metabolomics Introduction to mass spectrometry (MS) based proteomics and metabolomics Tianwei Yu Department of Biostatistics and Bioinformatics Rollins School of Public Health Emory University September 10, 2015 Background

More information

Gerry Hobbs, Department of Statistics, West Virginia University

Gerry Hobbs, Department of Statistics, West Virginia University Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

More information

Pesticide Analysis by Mass Spectrometry

Pesticide Analysis by Mass Spectrometry Pesticide Analysis by Mass Spectrometry Purpose: The purpose of this assignment is to introduce concepts of mass spectrometry (MS) as they pertain to the qualitative and quantitative analysis of organochlorine

More information

Introduction to Proteomics 1.0

Introduction to Proteomics 1.0 Introduction to Proteomics 1.0 CMSP Workshop Tim Griffin Associate Professor, BMBB Faculty Director, CMSP Objectives Why are we here? For participants: Learn basics of MS-based proteomics Learn what s

More information

Keywords Data mining, Classification Algorithm, Decision tree, J48, Random forest, Random tree, LMT, WEKA 3.7. Fig.1. Data mining techniques.

Keywords Data mining, Classification Algorithm, Decision tree, J48, Random forest, Random tree, LMT, WEKA 3.7. Fig.1. Data mining techniques. International Journal of Emerging Research in Management &Technology Research Article October 2015 Comparative Study of Various Decision Tree Classification Algorithm Using WEKA Purva Sewaiwar, Kamal Kant

More information

Email Spam Detection A Machine Learning Approach

Email Spam Detection A Machine Learning Approach Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn

More information

OplAnalyzer: A Toolbox for MALDI-TOF Mass Spectrometry Data Analysis

OplAnalyzer: A Toolbox for MALDI-TOF Mass Spectrometry Data Analysis OplAnalyzer: A Toolbox for MALDI-TOF Mass Spectrometry Data Analysis Thang V. Pham and Connie R. Jimenez OncoProteomics Laboratory, Cancer Center Amsterdam, VU University Medical Center De Boelelaan 1117,

More information

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-19-B &

More information

Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods

Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods Jerzy B laszczyński 1, Krzysztof Dembczyński 1, Wojciech Kot lowski 1, and Mariusz Paw lowski 2 1 Institute of Computing

More information

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an

More information

Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set

Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set Overview Evaluation Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes training set, validation set, test set holdout, stratification

More information

Roulette Sampling for Cost-Sensitive Learning

Roulette Sampling for Cost-Sensitive Learning Roulette Sampling for Cost-Sensitive Learning Victor S. Sheng and Charles X. Ling Department of Computer Science, University of Western Ontario, London, Ontario, Canada N6A 5B7 {ssheng,cling}@csd.uwo.ca

More information

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam

More information

0 10 20 30 40 50 60 70 m/z

0 10 20 30 40 50 60 70 m/z Mass spectrum for the ionization of acetone MS of Acetone + Relative Abundance CH 3 H 3 C O + M 15 (loss of methyl) + O H 3 C CH 3 43 58 0 10 20 30 40 50 60 70 m/z It is difficult to identify the ions

More information

How To Cluster

How To Cluster Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

Classification of Bad Accounts in Credit Card Industry

Classification of Bad Accounts in Credit Card Industry Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition

More information

Tutorial for Proteomics Data Submission. Katalin F. Medzihradszky Robert J. Chalkley UCSF

Tutorial for Proteomics Data Submission. Katalin F. Medzihradszky Robert J. Chalkley UCSF Tutorial for Proteomics Data Submission Katalin F. Medzihradszky Robert J. Chalkley UCSF Why Have Guidelines? Large-scale proteomics studies create huge amounts of data. It is impossible/impractical to

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification

More information

Overview. Triple quadrupole (MS/MS) systems provide in comparison to single quadrupole (MS) systems: Introduction

Overview. Triple quadrupole (MS/MS) systems provide in comparison to single quadrupole (MS) systems: Introduction Advantages of Using Triple Quadrupole over Single Quadrupole Mass Spectrometry to Quantify and Identify the Presence of Pesticides in Water and Soil Samples André Schreiber AB SCIEX Concord, Ontario (Canada)

More information

Mining the Software Change Repository of a Legacy Telephony System

Mining the Software Change Repository of a Legacy Telephony System Mining the Software Change Repository of a Legacy Telephony System Jelber Sayyad Shirabad, Timothy C. Lethbridge, Stan Matwin School of Information Technology and Engineering University of Ottawa, Ottawa,

More information

Machine Learning Logistic Regression

Machine Learning Logistic Regression Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.

More information

MHI3000 Big Data Analytics for Health Care Final Project Report

MHI3000 Big Data Analytics for Health Care Final Project Report MHI3000 Big Data Analytics for Health Care Final Project Report Zhongtian Fred Qiu (1002274530) http://gallery.azureml.net/details/81ddb2ab137046d4925584b5095ec7aa 1. Data pre-processing The data given

More information

Machine Learning Final Project Spam Email Filtering

Machine Learning Final Project Spam Email Filtering Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE

More information

Professor Anita Wasilewska. Classification Lecture Notes

Professor Anita Wasilewska. Classification Lecture Notes Professor Anita Wasilewska Classification Lecture Notes Classification (Data Mining Book Chapters 5 and 7) PART ONE: Supervised learning and Classification Data format: training and test data Concept,

More information

Monday Morning Data Mining

Monday Morning Data Mining Monday Morning Data Mining Tim Ruhe Statistische Methoden der Datenanalyse Outline: - data mining - IceCube - Data mining in IceCube Computer Scientists are different... Fakultät Physik Fakultät Physik

More information

AB SCIEX TOF/TOF 4800 PLUS SYSTEM. Cost effective flexibility for your core needs

AB SCIEX TOF/TOF 4800 PLUS SYSTEM. Cost effective flexibility for your core needs AB SCIEX TOF/TOF 4800 PLUS SYSTEM Cost effective flexibility for your core needs AB SCIEX TOF/TOF 4800 PLUS SYSTEM It s just what you expect from the industry leader. The AB SCIEX 4800 Plus MALDI TOF/TOF

More information

Data Mining Techniques for Prognosis in Pancreatic Cancer

Data Mining Techniques for Prognosis in Pancreatic Cancer Data Mining Techniques for Prognosis in Pancreatic Cancer by Stuart Floyd A Thesis Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUE In partial fulfillment of the requirements for the Degree

More information

Agilent 6000 Series LC/MS Solutions CONFIDENCE IN QUANTITATIVE AND QUALITATIVE ANALYSIS

Agilent 6000 Series LC/MS Solutions CONFIDENCE IN QUANTITATIVE AND QUALITATIVE ANALYSIS Agilent 6000 Series LC/MS Solutions CONFIDENCE IN QUANTITATIVE AND QUALITATIVE ANALYSIS AGILENT 6000 SERIES LC/MS SOLUTIONS UNRIVALED ACHIEVEMENT AT EVERY LEVEL The Agilent 6000 Series of LC/MS instruments

More information

Boosting Applied to Classification of Mass Spectral Data

Boosting Applied to Classification of Mass Spectral Data Journal of Data Science 1(2003), 391-404 Boosting Applied to Classification of Mass Spectral Data K. Varmuza 1,PingHe 2,3 and Kai-Tai Fang 2 1 Vienna University of Technology, 2 Hong Kong Baptist University

More information

Classification algorithm in Data mining: An Overview

Classification algorithm in Data mining: An Overview Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department

More information

Time-of-Flight Mass Spectrometry

Time-of-Flight Mass Spectrometry Time-of-Flight Mass Spectrometry Technical Overview Introduction Time-of-flight mass spectrometry (TOF MS) was developed in the late 1940 s, but until the 1990 s its popularity was limited. Recent improvements

More information

SELDI-TOF Mass Spectrometry Protein Data By Huong Thi Dieu La

SELDI-TOF Mass Spectrometry Protein Data By Huong Thi Dieu La SELDI-TOF Mass Spectrometry Protein Data By Huong Thi Dieu La References Alejandro Cruz-Marcelo, Rudy Guerra, Marina Vannucci, Yiting Li, Ching C. Lau, and Tsz-Kwong Man. Comparison of algorithms for pre-processing

More information

Beating the MLB Moneyline

Beating the MLB Moneyline Beating the MLB Moneyline Leland Chen llxchen@stanford.edu Andrew He andu@stanford.edu 1 Abstract Sports forecasting is a challenging task that has similarities to stock market prediction, requiring time-series

More information

Waters Core Chromatography Training (2 Days)

Waters Core Chromatography Training (2 Days) 2015 Page 2 Waters Core Chromatography Training (2 Days) The learning objective of this two day course is to teach the user core chromatography, system and software fundamentals in an outcomes based approach.

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

More information

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati tnpatil2@gmail.com, ss_sherekar@rediffmail.com

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for

More information

Data mining techniques: decision trees

Data mining techniques: decision trees Data mining techniques: decision trees 1/39 Agenda Rule systems Building rule systems vs rule systems Quick reference 2/39 1 Agenda Rule systems Building rule systems vs rule systems Quick reference 3/39

More information

D-optimal plans in observational studies

D-optimal plans in observational studies D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

More information

Package trimtrees. February 20, 2015

Package trimtrees. February 20, 2015 Package trimtrees February 20, 2015 Type Package Title Trimmed opinion pools of trees in a random forest Version 1.2 Date 2014-08-1 Depends R (>= 2.5.0),stats,randomForest,mlbench Author Yael Grushka-Cockayne,

More information

CHE334 Identification of an Unknown Compound By NMR/IR/MS

CHE334 Identification of an Unknown Compound By NMR/IR/MS CHE334 Identification of an Unknown Compound By NMR/IR/MS Purpose The object of this experiment is to determine the structure of an unknown compound using IR, 1 H-NMR, 13 C-NMR and Mass spectroscopy. Infrared

More information

ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS

ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS Abstract D.Lavanya * Department of Computer Science, Sri Padmavathi Mahila University Tirupati, Andhra Pradesh, 517501, India lav_dlr@yahoo.com

More information

Supervised Feature Selection & Unsupervised Dimensionality Reduction

Supervised Feature Selection & Unsupervised Dimensionality Reduction Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or

More information