ECBDL 14: Evolu/onary Computa/on for Big Data and Big Learning Workshop July 13 th, 2014 Big Data Compe//on
|
|
|
- Jonathan Reynolds
- 10 years ago
- Views:
Transcription
1 ECBDL 14: Evolu/onary Computa/on for Big Data and Big Learning Workshop July 13 th, 2014 Big Data Compe//on Jaume Bacardit The Interdisciplinary Compu/ng and Complex BioSystems (ICOS) research group Newcastle University
2 Outline Descrip/on of the dataset and evalua/on Final ranking and presenta/ons of par/cipants Overall analysis of the compe//on
3 DATASET DESCRIPTION AND EVALUATION
4 Source of dataset: Protein Structure Predic/on Protein Structure Predic/on (PSP) aims to predict the 3D structure of a protein based on its primary sequence Primary Sequence 3D Structure
5 Protein Structure Predic/on Beside the overall 3D PSP (an op/miza/on problem), several structural aspects can be predicted for each protein residue Coordina/on number Solvent accessibility Etc. These problems can be modelled in may ways: Regression or classifica/on problems Low/high number of classes Balanced/unbalanced classes Adjustable number of a]ributes Ideal benchmarks h]p://ico2s.org/datasets/psp_benchmark.html
6 Contact Map Two residues of a chain are said to be in contact if their distance is less than a certain threshold Contact Map (CM): binary matrix that contains a 1 for a cell if the residues at the row & column are in contact, 0 otherwise This matrix is very sparse, in real proteins there are less than 2% of contacts Highly unbalanced dataset Contact helices sheets
7 Characterisa/on of the contact map problem (631 variables) Three types of input informa/on were used 1. Detailed informa/on of three different windows of residues centered around The two target residues (2x) The middle point between them 552 variables 2. Informa/on about the connec/ng segment between the two target residues ( 38 variables) 3. Informa/on about the whole chain ( 41 variables) 1 3 2
8 A]ribute relevance (es/mated from the ICOS contact map predictor) Colour = group of features Bubble size = a]ribute relevance in our rule- based contact map predictor All a]ributes were used in our models (but some rarely)
9 Contact Map dataset A diverse set of 3262 proteins with known structure were selected 90% of this set was used for training 10% for test Instances were generated for pairs of AAs at least 6 posi/ons apart in the chain The resul/ng training set contained 32 million pairs of AA and 631 a]ributes Less than 2% of those are actual contacts +60GB of disk space Test set of 2.89M instances
10 Evalua/on Four metrics are computed for each submission True Posi/ve Rate (TP/P) True Nega/ve Rate (TN/N) Accuracy (TP+TN)/(P+N) Final score of TPR x TNR The final score was selected to promote predic/ng the minority class (P) of the problem
11 Submission of predic/ons Aler registra/on each team was given a submission code To submit predic/ons teams had to specify The team name The submission code Upload a file with the predic/ons (one predicted class per row) A brief descrip/on of method and resources
12 OVERALL SCORE AND PARTICIPANT S PRESENTATIONS
13 Overall scores Team Name Predic8ons TPR TNR Acc TPR TNR Efdamis ICOS UNSW HyperEns PUC- Rio_ICA EmeraldLogic LidiaGroup Total: 364 predic/ons submi]ed TPR TNR TPR TNR
14 Timeline
15 LidiaGroup, Universidade da Coruña One- layer neural network using all the data (with oversampling) and only 100 features. Resources not specified Also tested one- class classifica/on score Team LidiaGroup Date
16 EmeraldLogic Linear Gene/c Programming- style 0.46 learning algorithm Hardware Intel i7-3930k CPU, Nvidia Titan Black GPU, 32GB RAM 70 minutes CPU+GPU training /me plus 0.3 man hours post analysis score Team EmeraldLogic Date
17 PUC- Rio_ICA GeForce GTX Titan Linear Gene/c Programming Resources not specificed score Team PUC Rio_ICA Date
18 HyperEns Standard and Budgeted SVMs with bayesian op/misa/on of parameters (C, γ, cost- sensi/ve class errors) Best model: 4.7 days of parameter op/misa/on in a 16- core machine score Team HyperEns Date
19 UNSW Hybrid Learning Classifier System/ Deep Learning Architecture Each run used 8250 CPU hours in a 24- core machine score Team UNSW Date
20 ICOS Team ICOS Ensemble of rule sets learnt from samples with 1:1 ra/o of posi/ve- nega/ve examples Using the BioHEL evolu/onary machine learning algorithm to generate rule sets Training /me for best solu/on ~3000 CPU hours score Date
21 Efdamis Hadoop- based solu/on Pipeline of random over- sampling, evolu/onary feature weigh/ng and Random Forests Best model: 39h of Wall- clock /me in a 144- core cluster (not used exclusively) score Team Efdamis Date
22 Computa/onal effort vs score Team Name Wall- clock hours CPU hours Score Efdamis 39h ICOS Best case: 12h 2998h UNSW 70h 8250h HyperEns 21.1h 337.6h PUC- Rio_ICA???? EmeraldLogic 70 (CPU+GPU) LidiaGroup????
23 Learning paradigm vs score Team Name Learning paradigm Score Efdamis Random Forest ICOS Ensemble of Rule sets UNSW LCS of Deep Learning classifiers HyperEns SVM PUC- Rio_ICA Linear GP EmeraldLogic ~Linear GP LidiaGroup 1- layer NN
24 Evolu/onary vs Non Evolu/onary Totally evolu/onary: ICOS (2 nd ), UNSW (3 rd ), PUC- Rio_ICA (5 th ), EmeraldLogic (6 th ) Par/ally evolu/onary: EFDAMIS (1 st ) Non- evolu/onary: HyperEns (4 th ), LidiaGroup (7 th )
25 What went well Diversity of Learning paradigms Computa/on framework Contribu/on of evolu/onary computa/on Total flexibility of strategy for par/cipants, lightweight submission system Teams enjoyed it, learnt a lot from experience
26 What could be be]er Analysis of computa/onal effort is quite qualita/ve Compe//on plaworm More sophis/cated engine giving richer informa/on Separate valida/on (for leaderboard) and test (for final score) sets
27 Next one? Data donors! This challenge was possible because the dataset, un/l the compe//on, was private We need data sources that can remain non- public while the compe//on runs Other types of datasets Prize? Challenge with uniform computa/onal budget but at the same /me flexibility of strategy
28 ECBDL 14: Evolu/onary Computa/on for Big Data and Big Learning Workshop July 13 th, 2014 Big Data Compe//on Thanks to all par/cipants!
Data Mining. Supervised Methods. Ciro Donalek [email protected]. Ay/Bi 199ab: Methods of Computa@onal Sciences hcp://esci101.blogspot.
Data Mining Supervised Methods Ciro Donalek [email protected] Supervised Methods Summary Ar@ficial Neural Networks Mul@layer Perceptron Support Vector Machines SoLwares Supervised Models: Supervised
Ensemble Methods. Adapted from slides by Todd Holloway h8p://abeau<fulwww.com/2007/11/23/ ensemble- machine- learning- tutorial/
Ensemble Methods Adapted from slides by Todd Holloway h8p://abeau
Sense Making in an IOT World: Sensor Data Analysis with Deep Learning
Sense Making in an IOT World: Sensor Data Analysis with Deep Learning Natalia Vassilieva, PhD Senior Research Manager GTC 2016 Deep learning proof points as of today Vision Speech Text Other Search & information
COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments
Contents List of Figures Foreword Preface xxv xxiii xv Acknowledgments xxix Chapter 1 Fraud: Detection, Prevention, and Analytics! 1 Introduction 2 Fraud! 2 Fraud Detection and Prevention 10 Big Data for
Azure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
A tour on big data classification: Learning algorithms, Feature selection, and Imbalanced Classes
A tour on big data classification: Learning algorithms, Feature selection, and Imbalanced Classes Francisco Herrera Research Group on Soft Computing and Information Intelligent Systems (SCI 2 S) Dept.
SQream Technologies Ltd - Confiden7al
SQream Technologies Ltd - Confiden7al 1 Ge#ng Big Data Done On a GPU- Based Database Ori Netzer VP Product 26- Mar- 14 Analy7cs Performance - 3 TB, 18 Billion records SQream Database 400x More Cost Efficient!
Microsoft Azure Machine learning Algorithms
Microsoft Azure Machine learning Algorithms Tomaž KAŠTRUN @tomaz_tsql [email protected] http://tomaztsql.wordpress.com Our Sponsors Speaker info https://tomaztsql.wordpress.com Agenda Focus on explanation
Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o [email protected]
Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o [email protected] Informa(on & Communica(on Technology Sec(on (ICTS) Interna(onal Centre for Theore(cal Physics (ICTP) Mul(ple Socket
Design and Evalua.on of a Real- Time URL Spam Filtering Service
Design and Evalua.on of a Real- Time URL Spam Filtering Service Kurt Thomas, Chris Grier, Jus.n Ma, Vern Paxson, Dawn Song University of California, Berkeley Interna.onal Computer Science Ins.tute Mo.va.on
Some Security Challenges of Cloud Compu6ng. Kui Ren Associate Professor Department of Computer Science and Engineering SUNY at Buffalo
Some Security Challenges of Cloud Compu6ng Kui Ren Associate Professor Department of Computer Science and Engineering SUNY at Buffalo Cloud Compu6ng: the Next Big Thing Tremendous momentum ahead: Prediction
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
Data Management in the Cloud: Limitations and Opportunities. Annies Ductan
Data Management in the Cloud: Limitations and Opportunities Annies Ductan Discussion Outline: Introduc)on Overview Vision of Cloud Compu8ng Managing Data in The Cloud Cloud Characteris8cs Data Management
Fast Analytics on Big Data with H20
Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,
Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification
The Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
W6.B.1. FAQs CS535 BIG DATA W6.B.3. 4. If the distance of the point is additionally less than the tight distance T 2, remove it from the original set
http://wwwcscolostateedu/~cs535 W6B W6B2 CS535 BIG DAA FAQs Please prepare for the last minute rush Store your output files safely Partial score will be given for the output from less than 50GB input Computer
Advanced analytics at your hands
2.3 Advanced analytics at your hands Neural Designer is the most powerful predictive analytics software. It uses innovative neural networks techniques to provide data scientists with results in a way previously
Classification of Bad Accounts in Credit Card Industry
Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition
Car Insurance. Havránek, Pokorný, Tomášek
Car Insurance Havránek, Pokorný, Tomášek Outline Data overview Horizontal approach + Decision tree/forests Vertical (column) approach + Neural networks SVM Data overview Customers Viewed policies Bought
Hadoop s Advantages for! Machine! Learning and. Predictive! Analytics. Webinar will begin shortly. Presented by Hortonworks & Zementis
Webinar will begin shortly Hadoop s Advantages for Machine Learning and Predictive Analytics Presented by Hortonworks & Zementis September 10, 2014 Copyright 2014 Zementis, Inc. All rights reserved. 2
CS 378 Big Data Programming. Lecture 1 Introduc:on
CS 378 Big Data Programming Lecture 1 Introduc:on Class Logis:cs Class meets MW, 9:30 AM 11:00 AM Office Hours GDC 4.706 MW 11:00 12:00 AM By appointment Email: [email protected] Web page: cs.utexas.edu/~dfranke/courses/2015spring/cs378-
Intelligent Heuristic Construction with Active Learning
Intelligent Heuristic Construction with Active Learning William F. Ogilvie, Pavlos Petoumenos, Zheng Wang, Hugh Leather E H U N I V E R S I T Y T O H F G R E D I N B U Space is BIG! Hubble Ultra-Deep Field
Experiments on cost/power and failure aware scheduling for clouds and grids
Experiments on cost/power and failure aware scheduling for clouds and grids Jorge G. Barbosa, Al0no M. Sampaio, Hamid Harabnejad Universidade do Porto, Faculdade de Engenharia, LIACC Porto, Portugal, [email protected]
Data Stream Algorithms in Storm and R. Radek Maciaszek
Data Stream Algorithms in Storm and R Radek Maciaszek Who Am I? l Radek Maciaszek l l l l l l Consul9ng at DataMine Lab (www.dataminelab.com) - Data mining, business intelligence and data warehouse consultancy.
SEIZE THE DATA. 2015 SEIZE THE DATA. 2015
1 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Deep dive into Haven Predictive Analytics Powered by HP Distributed R and
Nodes, Ties and Influence
Nodes, Ties and Influence Chapter 2 Chapter 2, Community Detec:on and Mining in Social Media. Lei Tang and Huan Liu, Morgan & Claypool, September, 2010. 1 IMPORTANCE OF NODES 2 Importance of Nodes Not
PaRFR : Parallel Random Forest Regression on Hadoop for Multivariate Quantitative Trait Loci Mapping. Version 1.0, Oct 2012
PaRFR : Parallel Random Forest Regression on Hadoop for Multivariate Quantitative Trait Loci Mapping Version 1.0, Oct 2012 This document describes PaRFR, a Java package that implements a parallel random
Scalable Developments for Big Data Analytics in Remote Sensing
Scalable Developments for Big Data Analytics in Remote Sensing Federated Systems and Data Division Research Group High Productivity Data Processing Dr.-Ing. Morris Riedel et al. Research Group Leader,
CS231M Project Report - Automated Real-Time Face Tracking and Blending
CS231M Project Report - Automated Real-Time Face Tracking and Blending Steven Lee, [email protected] June 6, 2015 1 Introduction Summary statement: The goal of this project is to create an Android
Optimizing a 3D-FWT code in a cluster of CPUs+GPUs
Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Gregorio Bernabé Javier Cuenca Domingo Giménez Universidad de Murcia Scientific Computing and Parallel Programming Group XXIX Simposium Nacional de la
Learning from Diversity
Learning from Diversity Epitope Prediction with Sequence and Structure Features using an Ensemble of Support Vector Machines Rob Patro and Carl Kingsford Center for Bioinformatics and Computational Biology
Data Mining: STATISTICA
Data Mining: STATISTICA Outline Prepare the data Classification and regression 1 Prepare the Data Statistica can read from Excel,.txt and many other types of files Compared with WEKA, Statistica is much
How To Identify A Churner
2012 45th Hawaii International Conference on System Sciences A New Ensemble Model for Efficient Churn Prediction in Mobile Telecommunication Namhyoung Kim, Jaewook Lee Department of Industrial and Management
Understanding the Benefits of IBM SPSS Statistics Server
IBM SPSS Statistics Server Understanding the Benefits of IBM SPSS Statistics Server Contents: 1 Introduction 2 Performance 101: Understanding the drivers of better performance 3 Why performance is faster
Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui
Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching
Characterizing Task Usage Shapes in Google s Compute Clusters
Characterizing Task Usage Shapes in Google s Compute Clusters Qi Zhang 1, Joseph L. Hellerstein 2, Raouf Boutaba 1 1 University of Waterloo, 2 Google Inc. Introduction Cloud computing is becoming a key
ANALYTICAL TECHNIQUES FOR DATA VISUALIZATION
ANALYTICAL TECHNIQUES FOR DATA VISUALIZATION CSE 537 Ar@ficial Intelligence Professor Anita Wasilewska GROUP 2 TEAM MEMBERS: SAEED BOOR BOOR - 110564337 SHIH- YU TSAI - 110385129 HAN LI 110168054 SOURCES
The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices
WS on Models, Algorithms and Methodologies for Hierarchical Parallelism in new HPC Systems The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices
Pa8ern Recogni6on. and Machine Learning. Chapter 4: Linear Models for Classifica6on
Pa8ern Recogni6on and Machine Learning Chapter 4: Linear Models for Classifica6on Represen'ng the target values for classifica'on If there are only two classes, we typically use a single real valued output
Deep Learning For Text Processing
Deep Learning For Text Processing Jeffrey A. Bilmes Professor Departments of Electrical Engineering & Computer Science and Engineering University of Washington, Seattle http://melodi.ee.washington.edu/~bilmes
Using MATLAB to Measure the Diameter of an Object within an Image
Using MATLAB to Measure the Diameter of an Object within an Image Keywords: MATLAB, Diameter, Image, Measure, Image Processing Toolbox Author: Matthew Wesolowski Date: November 14 th 2014 Executive Summary
Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD
Predictive Analytics Techniques: What to Use For Your Big Data March 26, 2014 Fern Halper, PhD Presenter Proven Performance Since 1995 TDWI helps business and IT professionals gain insight about data warehousing,
What is Data Science? Data, Databases, and the Extraction of Knowledge Renée T., @becomingdatasci, November 2014
What is Data Science? { Data, Databases, and the Extraction of Knowledge Renée T., @becomingdatasci, November 2014 Let s start with: What is Data? http://upload.wikimedia.org/wikipedia/commons/f/f0/darpa
Machine learning for algo trading
Machine learning for algo trading An introduction for nonmathematicians Dr. Aly Kassam Overview High level introduction to machine learning A machine learning bestiary What has all this got to do with
High Productivity Data Processing Analytics Methods with Applications
High Productivity Data Processing Analytics Methods with Applications Dr. Ing. Morris Riedel et al. Adjunct Associate Professor School of Engineering and Natural Sciences, University of Iceland Research
CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.
CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes
Bringing Big Data Modelling into the Hands of Domain Experts
Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks [email protected] 2015 The MathWorks, Inc. 1 Data is the sword of the
How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning
How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume
Similarity Search in a Very Large Scale Using Hadoop and HBase
Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France
E-commerce Transaction Anomaly Classification
E-commerce Transaction Anomaly Classification Minyong Lee [email protected] Seunghee Ham [email protected] Qiyi Jiang [email protected] I. INTRODUCTION Due to the increasing popularity of e-commerce
Steven C.H. Hoi School of Information Systems Singapore Management University Email: [email protected]
Steven C.H. Hoi School of Information Systems Singapore Management University Email: [email protected] Introduction http://stevenhoi.org/ Finance Recommender Systems Cyber Security Machine Learning Visual
Big-data Analytics: Challenges and Opportunities
Big-data Analytics: Challenges and Opportunities Chih-Jen Lin Department of Computer Science National Taiwan University Talk at 台 灣 資 料 科 學 愛 好 者 年 會, August 30, 2014 Chih-Jen Lin (National Taiwan Univ.)
Improving Demand Forecasting
Improving Demand Forecasting 2 nd July 2013 John Tansley - CACI Overview The ideal forecasting process: Efficiency, transparency, accuracy Managing and understanding uncertainty: Limits to forecast accuracy,
Predicting Flight Delays
Predicting Flight Delays Dieterich Lawson [email protected] William Castillo [email protected] Introduction Every year approximately 20% of airline flights are delayed or cancelled, costing
Big Data Use Cases. At Salesforce.com. Narayan Bharadwaj Director, Product Management Salesforce.com. @nadubharadwaj
Big Data Use Cases At Salesforce.com Narayan Bharadwaj Director, Product Management Salesforce.com @nadubharadwaj Safe harbor Safe harbor statement under the Private Securi9es Li9ga9on Reform Act of 1995:
Decision Trees from large Databases: SLIQ
Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values
BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376
Course Director: Dr. Kayvan Najarian (DCM&B, [email protected]) Lectures: Labs: Mondays and Wednesdays 9:00 AM -10:30 AM Rm. 2065 Palmer Commons Bldg. Wednesdays 10:30 AM 11:30 AM (alternate weeks) Rm.
HT2015: SC4 Statistical Data Mining and Machine Learning
HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric
SURVEY REPORT DATA SCIENCE SOCIETY 2014
SURVEY REPORT DATA SCIENCE SOCIETY 2014 TABLE OF CONTENTS Contents About the Initiative 1 Report Summary 2 Participants Info 3 Participants Expertise 6 Suggested Discussion Topics 7 Selected Responses
ENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS
ENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS Michael Affenzeller (a), Stephan M. Winkler (b), Stefan Forstenlechner (c), Gabriel Kronberger (d), Michael Kommenda (e), Stefan
This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.
Big Data Processing 2013-2014 Q2 April 7, 2014 (Resit) Lecturer: Claudia Hauff Time Limit: 180 Minutes Name: Answer the questions in the spaces provided on this exam. If you run out of room for an answer,
CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #18: Dimensionality Reduc7on
CS 5614: (Big) Data Management Systems B. Aditya Prakash Lecture #18: Dimensionality Reduc7on Dimensionality Reduc=on Assump=on: Data lies on or near a low d- dimensional subspace Axes of this subspace
Programming Exercise 3: Multi-class Classification and Neural Networks
Programming Exercise 3: Multi-class Classification and Neural Networks Machine Learning November 4, 2011 Introduction In this exercise, you will implement one-vs-all logistic regression and neural networks
IBM SPSS Direct Marketing 23
IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release
Title. Introduction to Data Mining. Dr Arulsivanathan Naidoo Statistics South Africa. OECD Conference Cape Town 8-10 December 2010.
Title Introduction to Data Mining Dr Arulsivanathan Naidoo Statistics South Africa OECD Conference Cape Town 8-10 December 2010 1 Outline Introduction Statistics vs Knowledge Discovery Predictive Modeling
The Role of Size Normalization on the Recognition Rate of Handwritten Numerals
The Role of Size Normalization on the Recognition Rate of Handwritten Numerals Chun Lei He, Ping Zhang, Jianxiong Dong, Ching Y. Suen, Tien D. Bui Centre for Pattern Recognition and Machine Intelligence,
Machine Learning Capacity and Performance Analysis and R
Machine Learning and R May 3, 11 30 25 15 10 5 25 15 10 5 30 25 15 10 5 0 2 4 6 8 101214161822 0 2 4 6 8 101214161822 0 2 4 6 8 101214161822 100 80 60 40 100 80 60 40 100 80 60 40 30 25 15 10 5 25 15 10
Using multiple models: Bagging, Boosting, Ensembles, Forests
Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or
Conclusions and Future Directions
Chapter 9 This chapter summarizes the thesis with discussion of (a) the findings and the contributions to the state-of-the-art in the disciplines covered by this work, and (b) future work, those directions
BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING
BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING Xavier Conort [email protected] Session Number: TBR14 Insurance has always been a data business The industry has successfully
Understanding and Detec.ng Real- World Performance Bugs
Understanding and Detec.ng Real- World Performance Bugs Gouliang Jin, Linhai Song, Xiaoming Shi, Joel Scherpelz, and Shan Lu Presented by Cindy Rubio- González Feb 10 th, 2015 Mo.va.on Performance bugs
Challenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
IBM SPSS Direct Marketing 22
IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release
HPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware
Data Mining in CRM & Direct Marketing. Jun Du The University of Western Ontario [email protected]
Data Mining in CRM & Direct Marketing Jun Du The University of Western Ontario [email protected] Outline Why CRM & Marketing Goals in CRM & Marketing Models and Methodologies Case Study: Response Model Case
Predictive Modeling and Big Data
Predictive Modeling and Presented by Eileen Burns, FSA, MAAA Milliman Agenda Current uses of predictive modeling in the life insurance industry Potential applications of 2 1 June 16, 2014 [Enter presentation
An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework
An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework Jakrarin Therdphapiyanak Dept. of Computer Engineering Chulalongkorn University
Car Insurance. Prvák, Tomi, Havri
Car Insurance Prvák, Tomi, Havri Sumo report - expectations Sumo report - reality Bc. Jan Tomášek Deeper look into data set Column approach Reminder What the hell is this competition about??? Attributes
Run$me Query Op$miza$on
Run$me Query Op$miza$on Robust Op$miza$on for Graphs 2006-2014 All Rights Reserved 1 RDF Join Order Op$miza$on Typical approach Assign es$mated cardinality to each triple pabern. Bigdata uses the fast
Risk pricing for Australian Motor Insurance
Risk pricing for Australian Motor Insurance Dr Richard Brookes November 2012 Contents 1. Background Scope How many models? 2. Approach Data Variable filtering GLM Interactions Credibility overlay 3. Model
Sentiment Analysis. D. Skrepetos 1. University of Waterloo. NLP Presenation, 06/17/2015
Sentiment Analysis D. Skrepetos 1 1 Department of Computer Science University of Waterloo NLP Presenation, 06/17/2015 D. Skrepetos (University of Waterloo) Sentiment Analysis NLP Presenation, 06/17/2015
RANDOM PROJECTIONS FOR SEARCH AND MACHINE LEARNING
= + RANDOM PROJECTIONS FOR SEARCH AND MACHINE LEARNING Stefan Savev Berlin Buzzwords June 2015 KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data sparsity:
COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers
COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Instructor: ([email protected]) TAs: Pierre-Luc Bacon ([email protected]) Ryan Lowe ([email protected])
FUNNELBRAIN ONLINE MARKETING GET EDUCATED ON THE SITE THE DELIVERS QUALIFIED STUDENTS TO YOUR SCHOOL. FunnelBrain
FUNNELBRAIN ONLINE MARKETING GET EDUCATED ON THE SITE THE DELIVERS QUALIFIED STUDENTS TO YOUR SCHOOL FunnelBrain ABOUT FUNNELBRAIN Founded in 2008, by Internet execu4ves from REALTOR.com, WebMD and educa4onal
Big Data Analytics. Benchmarking SAS, R, and Mahout. Allison J. Ames, Ralph Abbey, Wayne Thompson. SAS Institute Inc., Cary, NC
Technical Paper (Last Revised On: May 6, 2013) Big Data Analytics Benchmarking SAS, R, and Mahout Allison J. Ames, Ralph Abbey, Wayne Thompson SAS Institute Inc., Cary, NC Accurate and Simple Analysis
Na#onal Asbestos Forum 2013: Advance in Medical Research on Asbestos- Related Diseases
Na#onal Asbestos Forum 2013: Advance in Medical Research on Asbestos- Related Diseases Professor Nico van Zandwijk Asbestos Diseases Research Ins#tute Content List of Asbestos- Related Diseases Epidemiology
