Statistical text mining using R. Tom Liptrot The Christie Hospital
1 Statistical text mining using R Tom Liptrot The Christie Hospital
2 Motivation
4 Example 1: Dickens to matrix. Example 2: Electronic patient records
5 Dickens to Matrix: a bag of words
IT WAS the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way - in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.
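As a rough illustration of the bag-of-words idea, the sketch below counts words in a short excerpt using base R only, ignoring word order. The variable names (txt, words, bag) are chosen here for illustration and are not taken from the slides.

```r
# Bag of words: keep word counts, discard word order.
txt <- "it was the best of times, it was the worst of times"

# Lower-case, then split on anything that is not a letter
words <- strsplit(tolower(txt), "[^a-z]+")[[1]]
words <- words[words != ""]   # drop empty tokens, if any

bag <- table(words)           # one count per distinct word
sort(bag, decreasing = TRUE)  # most frequent words first
```

The table is the whole representation: two texts with the same counts are indistinguishable to any model built on it.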
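6 Dickens to Matrix: a matrix
Rows are documents, columns are words:
$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$$
# Example matrix syntax
A = matrix(c(1, rep(0, 6), 2), nrow = 4)                  # dense base R matrix
library(slam)
S = simple_triplet_matrix(c(1, 4), c(1, 2), c(1, 2))      # sparse triplet form (i, j, v)
library(Matrix)
M = sparseMatrix(i = c(1, 4), j = c(1, 2), x = c(1, 2))   # sparse Matrix-package form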
7 Dickens to Matrix: tm package
library(tm)   # load the tm package
corpus_1 <- Corpus(VectorSource(txt))   # creates a corpus from a vector
corpus_1 <- tm_map(corpus_1, content_transformer(tolower))
corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))
corpus_1 <- tm_map(corpus_1, removePunctuation)
corpus_1 <- tm_map(corpus_1, stemDocument)
corpus_1 <- tm_map(corpus_1, stripWhitespace)
Text after lower-casing:
it was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of light, it was the season of darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to heaven, we were all going direct the other way- in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.
8 Dickens to Matrix: stopwords
corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))   # next step: remove English stop words
Input to this step: the lower-cased text shown on slide 7.
9 Dickens to Matrix: stopwords
corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))
Text after stop word removal:
best times, worst times, age wisdom, age foolishness, epoch belief, epoch incredulity, season light, season darkness, spring hope, winter despair, everything us, nothing us, going direct heaven, going direct way- short, period far like present period, noisiest authorities insisted received, good evil, superlative degree comparison.
10 Dickens to Matrix: punctuation
corpus_1 <- tm_map(corpus_1, removePunctuation)
Text after punctuation removal:
best times worst times age wisdom age foolishness epoch belief epoch incredulity season light season darkness spring hope winter despair everything us nothing us going direct heaven going direct way short period far like present period noisiest authorities insisted received good evil superlative degree comparison
11 Dickens to Matrix: stemming
corpus_1 <- tm_map(corpus_1, stemDocument)
Text after stemming:
best time worst time age wisdom age foolish epoch belief epoch incredul season light season dark spring hope winter despair everyth us noth us go direct heaven go direct way short period far like present period noisiest author insist receiv good evil superl degre comparison
12 Dickens to Matrix: cleanup
corpus_1 <- tm_map(corpus_1, stripWhitespace)
Text after collapsing repeated whitespace:
best time worst time age wisdom age foolish epoch belief epoch incredul season light season dark spring hope winter despair everyth us noth us go direct heaven go direct way short period far like present period noisiest author insist receiv good evil superl degre comparison
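The cleaning steps on slides 7 to 12 can be approximated in base R, which may help show what each tm transformation does. This is only a sketch under stated assumptions: the stop word list is a tiny hypothetical one (tm's stopwords("english") is far longer), stemming is left out because it needs an external stemmer, and punctuation is stripped before stop words rather than after.

```r
# Hypothetical, deliberately tiny stop word list for illustration only
stop_words <- c("it", "was", "the", "of")

clean_text <- function(x, stop_words) {
  x <- tolower(x)                                # lower-case
  x <- gsub("[[:punct:]]+", " ", x)              # punctuation -> space
  tokens <- strsplit(x, "[[:space:]]+")[[1]]     # split on runs of whitespace
  tokens <- tokens[tokens != "" & !(tokens %in% stop_words)]  # drop stop words
  paste(tokens, collapse = " ")                  # single spaces between words
}

cleaned <- clean_text("IT WAS the best of times, it was the worst of times,",
                      stop_words)
cleaned
```

Splitting into tokens and pasting them back together plays the role of stripWhitespace here, since it leaves exactly one space between words.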
13 Dickens to Matrix: Term Document Matrix
tdm <- TermDocumentMatrix(corpus_1)
tdm
<<TermDocumentMatrix (terms: 35, documents: 1)>>
Non-/sparse entries: 35/0
Sparsity           : 0%
Maximal term length: 10
Weighting          : term frequency (tf)
class(tdm)
[1] "TermDocumentMatrix"    "simple_triplet_matrix"
dim(tdm)
[1] 35  1
Term counts:
age        2    epoch    2    insist   1    short  1
author     1    everyth  1    light    1    spring 1
belief     1    evil     1    like     1    superl 1
best       1    far      1    noisiest 1    time   2
comparison 1    foolish  1    noth     1    way    1
dark       1    good     1    period   2    winter 1
degre      1    heaven   1    present  1    wisdom 1
despair    1    hope     1    receiv   1    worst  1
direct     2    incredul 1    season   2
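To make the structure of a term document matrix concrete, here is a small base R sketch with two toy documents instead of one. The documents and the names docs, terms and tdm_small are made up for illustration; tm builds the same kind of object in sparse form.

```r
# Two toy, already-cleaned documents (hypothetical)
docs <- c(d1 = "best time worst time", d2 = "age wisdom age foolish")

tokens <- strsplit(docs, " ", fixed = TRUE)   # one token vector per document
terms  <- sort(unique(unlist(tokens)))        # vocabulary across all documents

# One column of counts per document, one row per term
tdm_small <- vapply(
  tokens,
  function(tok) as.vector(table(factor(tok, levels = terms))),
  integer(length(terms))
)
rownames(tdm_small) <- terms
tdm_small
```

Most entries are zero even in this tiny case, which is why packages such as slam and Matrix store only the non-zero cells.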
14 Dickens to Matrix: Ngrams
15 Dickens to Matrix: Ngrams
library(RWeka)
# Tokeniser returning all 1- to 4-grams
four_gram_tokeniser <- function(x) {
  RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 1, max = 4))
}
tdm_4gram <- TermDocumentMatrix(corpus_1, control = list(tokenize = four_gram_tokeniser))
dim(tdm_4gram)
[1]
First n-gram counts:
age                     2    author insist receiv good    1    dark                    1
age foolish             1    belief                       1    dark spring             1
age foolish epoch       1    belief epoch                 1    dark spring hope        1
age foolish epoch belief 1   belief epoch incredul        1    dark spring hope winter 1
age wisdom              1    belief epoch incredul season 1    degre                   1
age wisdom age          1    best                         1    degre comparison        1
age wisdom age foolish  1    best time                    1    despair                 1
author                  1    best time worst              1    despair everyth         1
author insist           1    best time worst time         1    despair everyth us      1
author insist receiv    1    comparison                   1    despair everyth us noth 1
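For readers without RWeka, the following base R sketch shows what a 1-to-4-gram tokeniser produces. The function name ngrams and its arguments are illustrative assumptions, not part of tm or RWeka.

```r
# All n-grams of length min_n..max_n from a vector of tokens (illustrative helper)
ngrams <- function(tokens, min_n = 1, max_n = 4) {
  out <- character(0)
  for (n in min_n:max_n) {
    if (length(tokens) < n) break               # not enough tokens for this n
    starts <- seq_len(length(tokens) - n + 1)   # start position of each n-gram
    out <- c(out, vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
      character(1)
    ))
  }
  out
}

grams <- ngrams(c("best", "time", "worst", "time"))
table(grams)
```

Four tokens already give ten n-grams, which hints at why the n-gram matrix for the patient notes has so many more columns than the single-word one.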
16-17 Electronic patient records: Gathering structured medical data
- Doctor enters structured data directly
- Trained staff extract structured data from typed notes
18 Electronic patient records: example text
Diagnosis: Oesophagus lower third squamous cell carcinoma, T3 N2 M0
History: X year old lady who presented with progressive dysphagia since X and was known at X Hospital. She underwent an endoscopy which found a tumour which was biopsied and is a squamous cell carcinoma. A staging CT scan picked up a left upper lobe nodule. She then went on to have an EUS at X this was performed by Dr X and showed an early T3 tumour at 35-40cm of 4 small 4-6mm para-oesophageal nodes, between 35-40cm. There was a further 7.8mm node in the AP window at 27cm, the carina was measured at 28cm and aortic arch at 24cm, the conclusion T3 N2 M0. A subsequent PET CT scan was arranged - see below. She can manage a soft diet such as Weetabix, soft toast, mashed potato and gets occasional food stuck. Has lost half a stone in weight and is supplementing with 3 Fresubin supplements per day. Performance score is 1.
19 Electronic patient records: targets
20 Electronic patient records: steps
1. Identify patients where we have both structured data and notes (c.20k)
2. Extract notes and structured data from SQL database
3. Make term document matrix (as shown previously) (60m x 20k)
4. Split data into training and development set
5. Train classification model using training set
6. Assess performance and tune model using development set
7. Evaluate system performance on independent dataset
8. Use system to extract structured data where we have none
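Step 4 above, splitting into training and development sets, might look like the base R sketch below. The number of documents and the 80/20 split are assumptions made for illustration; the slides do not state the proportion used.

```r
set.seed(1)      # make the random split reproducible
n_docs <- 100    # hypothetical number of documents

# Randomly pick 80% of the row indices for training (assumed proportion)
train_idx <- sample(n_docs, size = 0.8 * n_docs)
dev_idx   <- setdiff(seq_len(n_docs), train_idx)   # the rest is the development set

length(train_idx)
length(dev_idx)
```

The same index vectors would then be used to subset both the document rows of the matrix and the vector of structured labels, so the two stay aligned.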
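21 Electronic patient records: predicting disease site using the elastic net
$$\hat{\beta} = \arg\min_{\beta} \; \underbrace{\lVert y - X\beta \rVert^2}_{\text{OLS}} + \underbrace{\lambda_2 \lVert \beta \rVert^2}_{\text{RIDGE}} + \underbrace{\lambda_1 \lVert \beta \rVert_1}_{\text{LASSO}}$$
# Fits an elastic net model, classifying into oesophagus or not,
# selecting lambda through cross-validation
library(glmnet)
dim(tdm)   # 22,843 documents, 677,017 n-grams
# Note: tdm must be either a matrix or a sparse Matrix (e.g. dgCMatrix), NOT a simple_triplet_matrix
# Note: alpha is left at the glmnet default of 1 (lasso penalty only); a value between 0 and 1 mixes in the ridge penalty
mod_oeso <- cv.glmnet(x = tdm, y = disease_site == 'Oesophagus', family = "binomial")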
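To see how the three terms of the elastic net objective combine, the sketch below evaluates it for a toy problem in base R. The function enet_objective and the toy values are illustrative assumptions. glmnet itself parametrises the penalty differently (a single lambda plus a mixing weight alpha) and uses a logistic loss for family = "binomial", so this is the textbook squared-error form from the slide rather than what glmnet optimises.

```r
# Squared-error elastic net objective: OLS + ridge + lasso (illustrative helper)
enet_objective <- function(beta, X, y, lambda1, lambda2) {
  rss   <- sum((y - X %*% beta)^2)    # OLS term: residual sum of squares
  ridge <- lambda2 * sum(beta^2)      # ridge term: squared L2 norm
  lasso <- lambda1 * sum(abs(beta))   # lasso term: L1 norm
  rss + ridge + lasso
}

X    <- diag(2)      # toy design matrix
y    <- c(1, 2)      # toy response
beta <- c(1, 1)      # candidate coefficients

obj <- enet_objective(beta, X, y, lambda1 = 0.5, lambda2 = 0.1)
obj   # 1 (OLS) + 0.2 (ridge) + 1 (lasso) = 2.2
```

The lasso term is what drives many coefficients to exactly zero, which matters when there are far more n-grams than documents.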
22 Electronic patient records: The Elastic Net
# Plots non-zero coefficients from the elastic net model
coefs <- coef(mod_oeso, s = mod_oeso$lambda.1se)[, 1]
coefs <- coefs[coefs != 0]
coefs <- coefs[order(abs(coefs), decreasing = TRUE)]
# coefs[-1] drops the first (largest) entry, assumed here to be the intercept
barplot(coefs[-1], horiz = TRUE, col = 2)
P(site = Oesophagus) = 0.03
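The non-zero coefficients turn into a probability through the logistic function. The sketch below shows the arithmetic with made-up coefficient values and made-up feature names; none of these numbers come from the fitted model on the slides.

```r
# Hypothetical coefficients from a fitted logistic model
coefs <- c("(Intercept)" = -3.5, "oesophagus" = 2.0, "dysphagia" = 1.0)

# Hypothetical document: both n-grams occur once
x_new <- c(oesophagus = 1, dysphagia = 1)

# Linear predictor: intercept plus the sum of coefficient * count
eta <- coefs[["(Intercept)"]] + sum(coefs[names(x_new)] * x_new)

p_baseline <- plogis(coefs[["(Intercept)"]])   # probability with no n-grams present
p_new      <- plogis(eta)                      # probability for this document
c(baseline = p_baseline, document = p_new)
```

Each positive coefficient raises the log-odds by a fixed amount per occurrence, which is what the bar plot of coefficients visualises.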
23 Electronic patient records: classification performance: primary disease site
- Training set = 20,000 patients; test set = 4,000 patients
- 80% of patients can be classified with 95% accuracy (the remaining 20% can be done by human abstractors)
- AUC = 90%
- Next step is a full formal evaluation on an independent dataset
- Working in combination with a rules-based approach from Manchester University
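The two kinds of figure quoted above, an AUC and an accuracy on the subset of patients the model is confident about, can be computed as in this base R sketch. The scores, labels and the 0.25 confidence margin are invented for illustration and do not reproduce the reported results.

```r
# AUC via the rank-sum (Mann-Whitney) formula; illustrative helper
auc <- function(score, is_pos) {
  r     <- rank(score)
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

score  <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2)            # hypothetical predicted probabilities
is_pos <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)   # hypothetical true labels

auc_value <- auc(score, is_pos)

# Only auto-classify cases whose score is far from 0.5 (assumed margin of 0.25);
# the remainder would be passed to human abstractors
confident <- abs(score - 0.5) > 0.25
coverage  <- mean(confident)
accuracy  <- mean((score[confident] > 0.5) == is_pos[confident])

c(auc = auc_value, coverage = coverage, accuracy = accuracy)
```

Widening the margin trades coverage for accuracy, which is the trade-off behind a statement of the form "80% of patients at 95% accuracy".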
24 Electronic patient records: Possible extensions
- Classification (hierarchical)
- Cluster analysis (KNN)
- Time
- Survival
- Drug toxicity
- Quality of life
25 Thanks
26 Books example
library(RCurl)   # getURL
library(XML)     # htmlParse, xpathSApply, xmlGetAttr
library(plyr)    # alply, llply, aaply
get_links <- function(address, link_prefix = '', link_suffix = '') {
  page <- getURL(address)
  # Convert to R
  tree <- htmlParse(page)
  # Get all link elements
  links <- xpathSApply(tree, path = "//*/a", fun = xmlGetAttr, name = "href")
  # Convert to vector
  links <- unlist(links)
  # Add prefix and suffix
  paste0(link_prefix, links, link_suffix)
}
# The URL strings in the next call are missing from the source
links_authors <- get_links(" '/', link_prefix ='
links_text <- alply(links_authors, 1, function(.x) {
  get_links(.x, link_prefix = .x, link_suffix = '')
})
books <- llply(links_text, function(.x) {
  aaply(.x, 1, getURL)
})
28 Principal components analysis
## Code to get the first n principal components
## from a large sparse term document matrix of class dgCMatrix
library(irlba)
n       <- 5                # number of components to calculate
m       <- nrow(tdm)        # terms in tdm matrix
xt.x    <- crossprod(tdm)
x.means <- colMeans(tdm)
xt.x    <- (xt.x - m * tcrossprod(x.means)) / (m - 1)
svd     <- irlba(xt.x, nu = 0, nv = n, tol = 1e-10)
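The crossprod step above computes the covariance matrix of the columns without ever centring the sparse matrix itself. A small dense check in base R, using random counts as stand-in data, suggests why this works: the result matches cov(), and its eigenvalues match the variances reported by prcomp(). The matrix X below is synthetic and only for illustration.

```r
set.seed(42)
X <- matrix(rpois(40, lambda = 1), nrow = 10)   # 10 rows, 4 columns of toy counts

m       <- nrow(X)
x_means <- colMeans(X)

# Covariance from the uncentred cross-product: (X'X - m * mean mean') / (m - 1)
cov_trick <- (crossprod(X) - m * tcrossprod(x_means)) / (m - 1)

same_as_cov <- all.equal(cov_trick, cov(X), check.attributes = FALSE)

# Eigen-decomposition of the covariance gives the principal components
eig <- eigen(cov_trick, symmetric = TRUE)
same_as_prcomp <- all.equal(eig$values, prcomp(X)$sdev^2)

c(cov = isTRUE(same_as_cov), prcomp = isTRUE(same_as_prcomp))
```

Centring a sparse matrix directly would make it dense, so working from the cross-product keeps memory use manageable; a truncated solver such as irlba then extracts only the leading components.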
29 PCA plot
plot(svd$v[i, c(2, 3)] + 1, col = books_df$author, log = 'xy', xlab = 'PC2', ylab = 'PC3')
[Figure: scatter plot of PC2 (x-axis) against PC3 (y-axis), log scales, points coloured by author: ARISTOTLE, BURROUGHS, DICKENS, KANT, PLATO, SHAKESPEARE]
Didacticiel Études de cas
1 Theme Data Mining with R The rattle package. R (http://www.r project.org/) is one of the most exciting free data mining software projects of these last years. Its popularity is completely justified (see
Chapter ML:XI (continued)
Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained
Data Mining Techniques for Prognosis in Pancreatic Cancer
Data Mining Techniques for Prognosis in Pancreatic Cancer by Stuart Floyd A Thesis Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUE In partial fulfillment of the requirements for the Degree
Maximum Likelihood Estimation by R
Maximum Likelihood Estimation by R MTH 541/643 Instructor: Songfeng Zheng In the previous lectures, we demonstrated the basic procedure of MLE, and studied some examples. In the studied examples, we are
SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING
AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations
Using multiple models: Bagging, Boosting, Ensembles, Forests
Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or
MATH 423 Linear Algebra II Lecture 38: Generalized eigenvectors. Jordan canonical form (continued).
MATH 423 Linear Algebra II Lecture 38: Generalized eigenvectors Jordan canonical form (continued) Jordan canonical form A Jordan block is a square matrix of the form λ 1 0 0 0 0 λ 1 0 0 0 0 λ 0 0 J = 0
How To Make A Credit Risk Model For A Bank Account
TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző [email protected] 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions
LUNG CANCER SCREENING: UNDERSTANDING LUNG NODULES. 1-800-298-2436 LungCancerAlliance.org
LUNG CANCER SCREENING: UNDERSTANDING LUNG NODULES 1-800-298-2436 LungCancerAlliance.org 1 1 CONTENTS What is a Nodule?...3 Finding Nodules...4 If a Nodule Is Found...5 What Happens Next?...7 Questions
Morphological analysis on structural MRI for the early diagnosis of neurodegenerative diseases. Marco Aiello On behalf of MAGIC-5 collaboration
Morphological analysis on structural MRI for the early diagnosis of neurodegenerative diseases Marco Aiello On behalf of MAGIC-5 collaboration Index Motivations of morphological analysis Segmentation of
Why Ensembles Win Data Mining Competitions
Why Ensembles Win Data Mining Competitions A Predictive Analytics Center of Excellence (PACE) Tech Talk November 14, 2012 Dean Abbott Abbott Analytics, Inc. Blog: http://abbottanalytics.blogspot.com URL:
bowel cancer screening
bowel cancer screening take control of your health www.bupacromwellhospital.com Bowel cancer is the third most common cancer in both men and women in the UK. The condition is treatable if diagnosed in
