Identification of Promising Couples using Machine Learning
|
|
|
- Edgar Gibson
- 9 years ago
- Views:
Transcription
1 Identification of Promising Couples using Machine Learning Yao Liu Chris Labiak clabiak Matt Kliemannn mtkliema Kaustubh Srivastava kaustubh Yao Xiao xyaoinum Abstract Divorces, or more generally, failed relationships, are tremendous social issues that not only affect the two people involved, but also friends and family. In this project, we try to address this social issue by finding out what features are indicative of an unpromising relationship and use these features to predict the outcome of a romantic relationship over a 4 year time frame. The existing social science research is mostly based on simple statistical models. Our project aims to address this issue by applying machine learning algorithms. Specifically, using a variety of feature selection and classification methods, we have been able to achieve predictions with accuracy in the low eighty percent range and discovered several new important factors that affect a relationships outcome, such as meeting on the internet having a negative impact. 1 Introduction 1.1 Motivation This project aims to predict whether or not a couple (either married or in a romantic relationship) will break in the next four years. We also aim to identify the features that are most indicative of a breakup. The motivation of pursuing this project is to tackle the widespread social issue of failed relationships. According to current social studies, divorce rates have doubled over the past two decades among persons over age 35. In addition, younger couples have become increasingly more critical about choosing their significant other, resulting in a low marriage rate among younger people which can have significant socioeconomic consequences [Kennedy and Ruggles, 2014]. As a result, the identification of important factors predicting a long lasting relationship is an important task. By solving this problem, we can get great insight into what the most significant features are that make a couple stay together for a long period of time. These features could then be used by individuals as guidelines when looking for a partner to increase the chances of being successful couples. As a result, this study may help alleviate increasing divorce rate issues and make it easier for people to find their most compatible partner. 1.2 Challenges The dataset is from the survey How Couples Meet and Stay Together [Rosenfeld et al., 2011]. This collection consists of the initial survey given in 2009 and three follow up surveys in 2010, 2011, and 2013 (referred to as wave 2, 3, 4), respectively. 1
2 The initial survey is terminated early if the respondent is not in an active relationship - therefore our baseline is respondents who were in a relationship during the initial survey. The follow up surveys were also no longer conducted if a respondent broke up or got divorced in the prior survey. We classify a respondent s relationship outcome as positive if they indicated they divorced or ended the relationship during one of the follow up surveys. If they responded that they were still in the same relationship during the 4th wave, we consider the respondent relationship to be a negative example. We have 1874 examples, 450 of which are positive examples (failed relationships). This limited size of data set makes it hard to train predictive models. In addition, among the dataset, the ratio between positive and negative examples is imbalanced (1:4 ratio). A limited number of failed relationship examples makes it hard to discover important factors that caused break-up in the relationship. Furthermore, to investigate the social issues on same-sex couples, the survey recruits more same-sex couples intentionally. However, the number of same-sex couples samples is not enough for us to train models to learn the different behaviors in relationships for two types of couples [Weisshaar, 2014]. 2 Method 2.1 Assumptions We do not consider the length of time that couples been together prior to the the start of the study aside from as a feature. Our model considers the 4 year outlook at any point of time in the relationship. We do not currently consider the weighting information provided by the survey firm to account for oversampling of parts of the population in the survey. For example, approximately 15% of respondents were same-sex couples while this is closer to 2% in the US general population. We do not consider the variables that may evolve after wave 1 such as income and education levels because followup surveys are not conducted once a couple has broken up. We assume the respondents answers to be unbiased and reflect the true nature/ reality of their relationship (brought about by culture difference) [Thomas, 2011] 2.2 Feature Preprocessing The dataset requires some significant preprocessing in order to train easier and more understandable models. Besides pre-existing binary data, are two major types of data included in this dataset which special preprocessing: continuous and categorical Categorical Features Many of the features are in categorical form. For example, the feature indicating race holds values: 1 for Caucasian, 2 for Black, 3 for Hispanic, 4 for Asian, and 5 for other races. This sort of representation lacks an ordering to the categories and would distort interclass distances. First, the missing data is imputed using the most frequent value of existing data for that feature. Most frequent value is used since other basic approaches like median or mean arent built for categorical data. Next, the categorical features are transformed into binary values through the one-hot encoding (one-of-k encoding) method provide by Scikit Learn. This method transforms a categorical feature with [M] possible values into [M] binary features where only one is true. 2
3 2.2.2 Continuous Features For continuous features, we rectify missing values using median value imputation based on the median of values present for that feature. We also do feature standardization on continuous features by scaling them with zero mean and unit variance. This enables us to use different models such as SVM with gaussian kernel. After this process, we have 138 features to utilize. The entire feature set will serve as the baseline. 2.3 Feature Selection One of the major challenges in our project is that we have a large number of features compared to our small sample size. To address this issue, we propose two feature selection methods. The first method is to select features based on subjective evaluation from relevant social science academic research. The other method is to apply and evaluate various feature selection algorithms.[iavindrasana et al., 2009] Subjective Feature Selection There have been several social studies working on the same data set [Weisshaar, 2014][Rosenfeld and Thomas, 2012] [Thomas, 2011]. We incorporate features from conclusions and statistical analysis process of these studies as subjective features. The features we selected are household income [Weisshaar, 2014], same sex couples [Weisshaar, 2014], existing divorce [Kennedy and Ruggles, 2014], cohabitating couple [Thomas, 2011], years of education [Thomas, 2011], met through friends [Thomas, 2011], met at school [Thomas, 2011], met at church [Thomas, 2011] and met through internet [Rosenfeld and Thomas, 2012] Algorithmic Feature Selection We do univariate feature selection using chi-squared score. The top 20 features based on score are selected. [Pedregosa et al., 2011] We do multivariate feature selection using a variety of algorithms. We use filter methods such as fisher score based feature selection, correlation-based feature selection. fast correlation-based filter feature selection. [Zhao et al., 2010] We also use embedded methods such as L1-SVM with linear kernel feature selection and L1-logistic regression feature selection. L1 regularization enforces sparsity which causes only a few features to be selected. [Pedregosa et al., 2011] [Iavindrasana et al., 2009] Another embedded method used is recursive feature elimination (evaluated using linear SVM) [Pedregosa et al., 2011] 2.4 Model Training After feature selection, the dataset was classified by the following supervised learning classifiers: Gaussian naive Bayes, Support Vector Machine with linear kernel, k-nearest Neighbors, Decision Tree, Ridge Regression, L1 Regularized Logistic Regression, and Adaboost using decision tree. Each classifier model was trained using 5-fold cross validation and is evaluated using a training-testing split to avoid overfitting issues. 2.5 Model Evaluation 1. Overall Accuracy 2. Precision and Recall 3. Area Under ROC Curve (AUC) 3 Related Works One category of related work is social science work. We refer to the statistical prediction process and conclusion of these works as our first subjective method of features selection [Weisshaar, 2014][Rosenfeld and Thomas, 2012] [Thomas, 2011]. These works are based on social 3
4 science, so they only do simple regression and test the statistical significance, which may induce overfitting problems. In our case, we combine knowledge from these research work and train the models through machine learning algorithms. Gottman is the expert in marriage outcome research. He was able to construct a short term divorce prediction model (7- years) for married couples which is 93% accurate. [Gottman and Levenson, 2000] His model primary uses features about marital interaction and emotion. He found that negative affect during marital conflict most predicted breakup. Our model differs in that the feature set we use primarily consists of demographic data. In addition, we predict the outcome of a relationship in general, not only married couples. Finally, we do extensive cross validation on our dataset, which is not typically done in research about relationship outcomes. [Heyman and Slep, 2001] The second category of related work we consider is clinical machine learning problem because of the similar issues such as unbalanced dataset, no clear boundary, large number of features and factor analysis issues etc. We refer to this work on feature selection and evaluation methods [Iavindrasana et al., 2009]. 4 Experimental Results A machine learning feature selection approach will provide better classification power than subjectively selecting features, as is conventionally done in social science research. In addition, the features selected by machine learning algorithms have significantly increased the power to correctly classify positive examples (higher AUC). This is important to identify the most contributing factors in an unsuccessful relationship. The most predictive feature of a relationship success turned out to be a high relationship quality (1). Another very predictive feature of relationship success include living together (2) for a long time (4), which confirms with the social science work [Thomas, 2011]. We can also confirm that the longer a couple is in a relationship, they are less likely to breakup [Thomas, 2011]. Specifically, features 6, 10, 11, 12, 15. The only downside is meeting at an older age (16), which has a negative weight. This is likely related to being divorced (9), as divorced people who are back on the market at an older age do not find success. Other features that reflect positively on a couples outcome include positive parental approval (14), high household income (8). Meeting on the internet turns out surprisingly to reflect negatively on relationship success (17), which is indicated as an increasing intermediary to look for partners in recent years in social science work [Rosenfeld and Thomas, 2012]. Finally, if the respondent is paying for their home in some manner (3,5), that has a positive impact on their chance for increasing relationship success, as compared to someone who lives for free. 4
5 Results over selected feature selection algorithms and different classifiers have been listed above. Most models have achieved classification accuracy in the low 80% range. SVM model with features selected by fisher scores provides the best combination of accuracy, recall, and precision results. Using features obtained from the Fisher test and trained on a linear SVM classifier, we obtained the weights each feature is allocated in the model. Some features with very low weights have not been shown. 5 Conclusion From the experiment results above, the introduction of machine learning algorithms to this social science field has increased the predictive power of models. Moreover, it helps discover new important factors such as negative impact of meeting couple through internet and how a divorced person has a reduced chance of having a new successful relationship. We are able to calculate a four year relationship success rate with approximately 82% accuracy. Even though the data set itself only has few subjective and emotional evaluation of the relationship from the survey respondent, the self described relationship quality has been listed as most important factor, than other demographic statistics, by the best performance training model (SVM with top 20 features selected based on fisher scores). This also confirms the important role 5
6 of emotional factors that play in a romantic relationship, which is revealed by Gottmans work [Gottman and Levenson, 2000]. Based on the comparison between social science work and our approach, various changes can be done in the future to help better understand this social problem. The first aspect of future work may focus on decreasing the bias of our study by referring to social science methodology, such as using multiple imputation through SVD to fill in missing values [Howell, 2007], taking into account of weighting information [Winship and Radbill, 1994] and considering evolving values during the period with event history models [Weisshaar, 2014]. Another aspect of future work can focus on constructing new features based on social science domain knowledge on this problem, such as difference in race. Lastly, our work doesnt solely consider married couples or dating couples as is conventionally done by social science researchers. We considered both and our models tended to select marriage status as a useful feature and weight it highly. This informs us we should consider both instances separately - evaluating success prior to marriage and during marriage individually. [Kennedy and Ruggles, 2014] [Weisshaar, 2014] [Rosenfeld and Thomas, 2012] [Thomas, 2011] [Heyman and Slep, 2001] [Rosenfeld et al., 2011] [Gottman and Levenson, 2000] [Iavindrasana et al., 2009] [Guyon and Elisseeff, 2003] [Pedregosa et al., 2011] [Zhao et al., 2010] [Metz, 1978] [Hanley and McNeil, 1982] References [Gottman and Levenson, 2000] Gottman, J. M. and Levenson, R. W. (2000). The timing of divorce: Predicting when a couple will divorce over a 14-year period. Journal of Marriage and Family, 62(3): [Guyon and Elisseeff, 2003] Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research, 3: [Hanley and McNeil, 1982] Hanley, J. A. and McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (roc) curve. Radiology, 143(1): [Heyman and Slep, 2001] Heyman, R. E. and Slep, A. M. S. (2001). The hazards of predicting divorce without crossvalidation. Journal of Marriage and Family, 63(2): [Howell, 2007] Howell, D. C. (2007). The treatment of missing data. The Sage handbook of social science methodology, pages [Iavindrasana et al., 2009] Iavindrasana, J., Cohen, G., Depeursinge, A., Müller, H., Meyer, R., and Geissbuhler, A. (2009). Clinical data mining: a review. Yearb Med Inform, 2009: [Kennedy and Ruggles, 2014] Kennedy, S. and Ruggles, S. (2014). Breaking up is hard to count: The rise of divorce in the united states, Demography, 51(2): [Metz, 1978] Metz, C. E. (1978). Basic principles of roc analysis. In Seminars in nuclear medicine, volume 8, pages Elsevier. [Pedregosa et al., 2011] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12: [Rosenfeld and Thomas, 2012] Rosenfeld, M. J. and Thomas, R. J. (2012). Searching for a mate the rise of the internet as a social intermediary. American Sociological Review, 77(4): [Rosenfeld et al., 2011] Rosenfeld, M. J., Thomas, R. J., and Falcon, M. (2011). and 2014.how couples meet and stay together. waves 1, 2, and 3: Public version 3.04, plus wave 4 supplement version 1.02 [computer files]. [Thomas, 2011] Thomas, R. J. (2011). How americans (mostly don t) find an interracial partner: Race and ethnic differences in the use of social foci and networks for couple formation. [Weisshaar, 2014] Weisshaar, K. (2014). Earnings equality and relationship stability for same-sex and heterosexual couples. Social Forces, 93(1): [Winship and Radbill, 1994] Winship, C. and Radbill, L. (1994). Sampling weights and regression analysis. Sociological Methods & Research, 23(2):
7 [Zhao et al., 2010] Zhao, Z., Morstatter, F., Sharma, S., Alelyani, S., Anand, A., and Liu, H. (2010). Advancing feature selection research asu feature selection repository. School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe. 7
Car Insurance. Prvák, Tomi, Havri
Car Insurance Prvák, Tomi, Havri Sumo report - expectations Sumo report - reality Bc. Jan Tomášek Deeper look into data set Column approach Reminder What the hell is this competition about??? Attributes
Improving Credit Card Fraud Detection with Calibrated Probabilities
Improving Credit Card Fraud Detection with Calibrated Probabilities Alejandro Correa Bahnsen, Aleksandar Stojanovic, Djamila Aouada and Björn Ottersten Interdisciplinary Centre for Security, Reliability
Spotting false translation segments in translation memories
Spotting false translation segments in translation memories Abstract Eduard Barbu Translated.net [email protected] The problem of spotting false translations in the bi-segments of translation memories
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
Classification of Electrocardiogram Anomalies
Classification of Electrocardiogram Anomalies Karthik Ravichandran, David Chiasson, Kunle Oyedele Stanford University Abstract Electrocardiography time series are a popular non-invasive tool used to classify
Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL
Paper SA01-2012 Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL ABSTRACT Analysts typically consider combinations
Active Learning SVM for Blogs recommendation
Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the
Classification of Bad Accounts in Credit Card Industry
Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition
How to Win at the Track
How to Win at the Track Cary Kempston [email protected] Friday, December 14, 2007 1 Introduction Gambling on horse races is done according to a pari-mutuel betting system. All of the money is pooled,
An Ensemble Learning Approach for the Kaggle Taxi Travel Time Prediction Challenge
An Ensemble Learning Approach for the Kaggle Taxi Travel Time Prediction Challenge Thomas Hoch Software Competence Center Hagenberg GmbH Softwarepark 21, 4232 Hagenberg, Austria Tel.: +43-7236-3343-831
Predicting Flight Delays
Predicting Flight Delays Dieterich Lawson [email protected] William Castillo [email protected] Introduction Every year approximately 20% of airline flights are delayed or cancelled, costing
e-commerce product classification: our participation at cdiscount 2015 challenge
e-commerce product classification: our participation at cdiscount 2015 challenge Ioannis Partalas Viseo R&D, France [email protected] Georgios Balikas University of Grenoble Alpes, France [email protected]
Chapter 6. The stacking ensemble approach
82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described
Cross-Validation. Synonyms Rotation estimation
Comp. by: BVijayalakshmiGalleys0000875816 Date:6/11/08 Time:19:52:53 Stage:First Proof C PAYAM REFAEILZADEH, LEI TANG, HUAN LIU Arizona State University Synonyms Rotation estimation Definition is a statistical
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
Predict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, [email protected] Department of Electrical Engineering, Stanford University Abstract Given two persons
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
UNDERSTANDING THE EFFECTIVENESS OF BANK DIRECT MARKETING Tarun Gupta, Tong Xia and Diana Lee
UNDERSTANDING THE EFFECTIVENESS OF BANK DIRECT MARKETING Tarun Gupta, Tong Xia and Diana Lee 1. Introduction There are two main approaches for companies to promote their products / services: through mass
E-commerce Transaction Anomaly Classification
E-commerce Transaction Anomaly Classification Minyong Lee [email protected] Seunghee Ham [email protected] Qiyi Jiang [email protected] I. INTRODUCTION Due to the increasing popularity of e-commerce
BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376
Course Director: Dr. Kayvan Najarian (DCM&B, [email protected]) Lectures: Labs: Mondays and Wednesdays 9:00 AM -10:30 AM Rm. 2065 Palmer Commons Bldg. Wednesdays 10:30 AM 11:30 AM (alternate weeks) Rm.
MHI3000 Big Data Analytics for Health Care Final Project Report
MHI3000 Big Data Analytics for Health Care Final Project Report Zhongtian Fred Qiu (1002274530) http://gallery.azureml.net/details/81ddb2ab137046d4925584b5095ec7aa 1. Data pre-processing The data given
Data Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
Random Forest Based Imbalanced Data Cleaning and Classification
Random Forest Based Imbalanced Data Cleaning and Classification Jie Gu Software School of Tsinghua University, China Abstract. The given task of PAKDD 2007 data mining competition is a typical problem
Predictive Data modeling for health care: Comparative performance study of different prediction models
Predictive Data modeling for health care: Comparative performance study of different prediction models Shivanand Hiremath [email protected] National Institute of Industrial Engineering (NITIE) Vihar
The Number of Children Being Raised by Gay or Lesbian Parents. Corbin Miller Joseph Price. Brigham Young University. Abstract
The Number of Children Being Raised by Gay or Lesbian Parents Corbin Miller Joseph Price Brigham Young University Abstract We use data from the American Community Survey and National Survey of Family Growth
Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100
Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100 Erkan Er Abstract In this paper, a model for predicting students performance levels is proposed which employs three
Data Mining Techniques for Prognosis in Pancreatic Cancer
Data Mining Techniques for Prognosis in Pancreatic Cancer by Stuart Floyd A Thesis Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUE In partial fulfillment of the requirements for the Degree
Feature Selection with Monte-Carlo Tree Search
Feature Selection with Monte-Carlo Tree Search Robert Pinsler 20.01.2015 20.01.2015 Fachbereich Informatik DKE: Seminar zu maschinellem Lernen Robert Pinsler 1 Agenda 1 Feature Selection 2 Feature Selection
Comparison of machine learning methods for intelligent tutoring systems
Comparison of machine learning methods for intelligent tutoring systems Wilhelmiina Hämäläinen 1 and Mikko Vinni 1 Department of Computer Science, University of Joensuu, P.O. Box 111, FI-80101 Joensuu
CS 6220: Data Mining Techniques Course Project Description
CS 6220: Data Mining Techniques Course Project Description College of Computer and Information Science Northeastern University Spring 2013 General Goal In this project, you will have an opportunity to
Beating the NCAA Football Point Spread
Beating the NCAA Football Point Spread Brian Liu Mathematical & Computational Sciences Stanford University Patrick Lai Computer Science Department Stanford University December 10, 2010 1 Introduction Over
Analysis Tools and Libraries for BigData
+ Analysis Tools and Libraries for BigData Lecture 02 Abhijit Bendale + Office Hours 2 n Terry Boult (Waiting to Confirm) n Abhijit Bendale (Tue 2:45 to 4:45 pm). Best if you email me in advance, but I
MACHINE LEARNING IN HIGH ENERGY PHYSICS
MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!
A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic
A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic Report prepared for Brandon Slama Department of Health Management and Informatics University of Missouri, Columbia
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
Cross Validation. Dr. Thomas Jensen Expedia.com
Cross Validation Dr. Thomas Jensen Expedia.com About Me PhD from ETH Used to be a statistician at Link, now Senior Business Analyst at Expedia Manage a database with 720,000 Hotels that are not on contract
Employer Health Insurance Premium Prediction Elliott Lui
Employer Health Insurance Premium Prediction Elliott Lui 1 Introduction The US spends 15.2% of its GDP on health care, more than any other country, and the cost of health insurance is rising faster than
Data Mining Yelp Data - Predicting rating stars from review text
Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University [email protected] Chetan Naik Stony Brook University [email protected] ABSTRACT The majority
Early Detection of Diabetes from Health Claims
Early Detection of Diabetes from Health Claims Rahul G. Krishnan, Narges Razavian, Youngduck Choi New York University {rahul,razavian,yc1104}@cs.nyu.edu Saul Blecker, Ann Marie Schmidt NYU School of Medicine
Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product
Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product Sagarika Prusty Web Data Mining (ECT 584),Spring 2013 DePaul University,Chicago [email protected] Keywords:
Beating the MLB Moneyline
Beating the MLB Moneyline Leland Chen [email protected] Andrew He [email protected] 1 Abstract Sports forecasting is a challenging task that has similarities to stock market prediction, requiring time-series
Supervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
Microsoft Azure Machine learning Algorithms
Microsoft Azure Machine learning Algorithms Tomaž KAŠTRUN @tomaz_tsql [email protected] http://tomaztsql.wordpress.com Our Sponsors Speaker info https://tomaztsql.wordpress.com Agenda Focus on explanation
Applied Mathematical Sciences, Vol. 7, 2013, no. 112, 5591-5597 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2013.
Applied Mathematical Sciences, Vol. 7, 2013, no. 112, 5591-5597 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2013.38457 Accuracy Rate of Predictive Models in Credit Screening Anirut Suebsing
A Logistic Regression Approach to Ad Click Prediction
A Logistic Regression Approach to Ad Click Prediction Gouthami Kondakindi [email protected] Satakshi Rana [email protected] Aswin Rajkumar [email protected] Sai Kaushik Ponnekanti [email protected] Vinit Parakh
Experiments in Web Page Classification for Semantic Web
Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address
Data Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
Better credit models benefit us all
Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis
Predicting Student Performance by Using Data Mining Methods for Classification
BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 13, No 1 Sofia 2013 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.2478/cait-2013-0006 Predicting Student Performance
The Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
Machine Learning Logistic Regression
Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.
Business Analytics and Credit Scoring
Study Unit 5 Business Analytics and Credit Scoring ANL 309 Business Analytics Applications Introduction Process of credit scoring The role of business analytics in credit scoring Methods of logistic regression
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
Azure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
Sentiment analysis using emoticons
Sentiment analysis using emoticons Royden Kayhan Lewis Moharreri Steven Royden Ware Lewis Kayhan Steven Moharreri Ware Department of Computer Science, Ohio State University Problem definition Our aim was
Predicting borrowers chance of defaulting on credit loans
Predicting borrowers chance of defaulting on credit loans Junjie Liang ([email protected]) Abstract Credit score prediction is of great interests to banks as the outcome of the prediction algorithm
Machine learning for algo trading
Machine learning for algo trading An introduction for nonmathematicians Dr. Aly Kassam Overview High level introduction to machine learning A machine learning bestiary What has all this got to do with
A Survey of Classification Techniques in the Area of Big Data.
A Survey of Classification Techniques in the Area of Big Data. 1PrafulKoturwar, 2 SheetalGirase, 3 Debajyoti Mukhopadhyay 1Reseach Scholar, Department of Information Technology 2Assistance Professor,Department
Machine Learning in Spam Filtering
Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov [email protected] Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.
L25: Ensemble learning
L25: Ensemble learning Introduction Methods for constructing ensembles Combination strategies Stacked generalization Mixtures of experts Bagging Boosting CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna
Predicting Gold Prices
CS229, Autumn 2013 Predicting Gold Prices Megan Potoski Abstract Given historical price data up to and including a given day, I attempt to predict whether the London PM price fix of gold the following
Data Mining Methods: Applications for Institutional Research
Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014
CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19
PREFACE xi 1 INTRODUCTION 1 1.1 Overview 1 1.2 Definition 1 1.3 Preparation 2 1.3.1 Overview 2 1.3.2 Accessing Tabular Data 3 1.3.3 Accessing Unstructured Data 3 1.3.4 Understanding the Variables and Observations
Data Mining: Overview. What is Data Mining?
Data Mining: Overview What is Data Mining? Recently * coined term for confluence of ideas from statistics and computer science (machine learning and database methods) applied to large databases in science,
Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris
Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines
An Introduction to Data Mining
An Introduction to Intel Beijing [email protected] January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail
Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data
CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear
A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions
A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center
Role of Customer Response Models in Customer Solicitation Center s Direct Marketing Campaign
Role of Customer Response Models in Customer Solicitation Center s Direct Marketing Campaign Arun K Mandapaka, Amit Singh Kushwah, Dr.Goutam Chakraborty Oklahoma State University, OK, USA ABSTRACT Direct
Detecting Obfuscated JavaScripts using Machine Learning
Detecting Obfuscated JavaScripts using Machine Learning Simon Aebersold, Krzysztof Kryszczuk, Sergio Paganoni, Bernhard Tellenbach, Timothy Trowbridge Zurich University of Applied Sciences, Switzerland
Orange: Data Mining Toolbox in Python
Journal of Machine Learning Research 14 (2013) 2349-2353 Submitted 3/13; Published 8/13 Orange: Data Mining Toolbox in Python Janez Demšar Tomaž Curk Aleš Erjavec Črt Gorup Tomaž Hočevar Mitar Milutinovič
LCs for Binary Classification
Linear Classifiers A linear classifier is a classifier such that classification is performed by a dot product beteen the to vectors representing the document and the category, respectively. Therefore it
The Artificial Prediction Market
The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory
Binary Logistic Regression
Binary Logistic Regression Main Effects Model Logistic regression will accept quantitative, binary or categorical predictors and will code the latter two in various ways. Here s a simple model including
Predicting earning potential on Adult Dataset
MSc in Computing, Business Intelligence and Data Mining stream. Business Intelligence and Data Mining Applications Project Report. Predicting earning potential on Adult Dataset Submitted by: xxxxxxx Supervisor:
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information
Drugs store sales forecast using Machine Learning
Drugs store sales forecast using Machine Learning Hongyu Xiong (hxiong2), Xi Wu (wuxi), Jingying Yue (jingying) 1 Introduction Nowadays medical-related sales prediction is of great interest; with reliable
Data Mining Part 5. Prediction
Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification
Scalable Developments for Big Data Analytics in Remote Sensing
Scalable Developments for Big Data Analytics in Remote Sensing Federated Systems and Data Division Research Group High Productivity Data Processing Dr.-Ing. Morris Riedel et al. Research Group Leader,
Comparison of Data Mining Techniques used for Financial Data Analysis
Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract
Data Mining Classification: Decision Trees
Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous
Predict the Popularity of YouTube Videos Using Early View Data
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050
VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter
VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,
III. DATA SETS. Training the Matching Model
A Machine-Learning Approach to Discovering Company Home Pages Wojciech Gryc Oxford Internet Institute University of Oxford Oxford, UK OX1 3JS Email: [email protected] Prem Melville IBM T.J. Watson
Knowledge Discovery from patents using KMX Text Analytics
Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs [email protected] Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers
An Analysis of the Health Insurance Coverage of Young Adults
Gius, International Journal of Applied Economics, 7(1), March 2010, 1-17 1 An Analysis of the Health Insurance Coverage of Young Adults Mark P. Gius Quinnipiac University Abstract The purpose of the present
Car Insurance. Havránek, Pokorný, Tomášek
Car Insurance Havránek, Pokorný, Tomášek Outline Data overview Horizontal approach + Decision tree/forests Vertical (column) approach + Neural networks SVM Data overview Customers Viewed policies Bought
How can we discover stocks that will
Algorithmic Trading Strategy Based On Massive Data Mining Haoming Li, Zhijun Yang and Tianlun Li Stanford University Abstract We believe that there is useful information hiding behind the noisy and massive
Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal
Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether
MEUSE: RECOMMENDING INTERNET RADIO STATIONS
MEUSE: RECOMMENDING INTERNET RADIO STATIONS Maurice Grant [email protected] Adeesha Ekanayake [email protected] Douglas Turnbull [email protected] ABSTRACT In this paper, we describe a novel Internet
MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS
MAXIMIZING RETURN ON DIRET MARKETING AMPAIGNS IN OMMERIAL BANKING S 229 Project: Final Report Oleksandra Onosova INTRODUTION Recent innovations in cloud computing and unified communications have made a
Hong Kong Stock Index Forecasting
Hong Kong Stock Index Forecasting Tong Fu Shuo Chen Chuanqi Wei [email protected] [email protected] [email protected] Abstract Prediction of the movement of stock market is a long-time attractive topic
Choosing the Best Classification Performance Metric for Wrapper-based Software Metric Selection for Defect Prediction
Choosing the Best Classification Performance Metric for Wrapper-based Software Metric Selection for Defect Prediction Huanjing Wang Western Kentucky University [email protected] Taghi M. Khoshgoftaar
A Novel Feature Selection Method Based on an Integrated Data Envelopment Analysis and Entropy Mode
A Novel Feature Selection Method Based on an Integrated Data Envelopment Analysis and Entropy Mode Seyed Mojtaba Hosseini Bamakan, Peyman Gholami RESEARCH CENTRE OF FICTITIOUS ECONOMY & DATA SCIENCE UNIVERSITY
Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer
Machine Learning Chapter 18, 21 Some material adopted from notes by Chuck Dyer What is learning? Learning denotes changes in a system that... enable a system to do the same task more efficiently the next
Sentiment Analysis of Movie Reviews and Twitter Statuses. Introduction
Sentiment Analysis of Movie Reviews and Twitter Statuses Introduction Sentiment analysis is the task of identifying whether the opinion expressed in a text is positive or negative in general, or about
