
ECBDL'14: Evolutionary Computation for Big Data and Big Learning Workshop, July 13th, 2014
Big Data Competition
Jaume Bacardit (jaume.bacardit@ncl.ac.uk)
The Interdisciplinary Computing and Complex BioSystems (ICOS) research group, Newcastle University

Outline
- Description of the dataset and evaluation
- Final ranking and presentations of participants
- Overall analysis of the competition

DATASET DESCRIPTION AND EVALUATION

Source of dataset: Protein Structure Prediction
Protein Structure Prediction (PSP) aims to predict the 3D structure of a protein based on its primary sequence.
[Figure: primary sequence -> 3D structure]

Protein Structure Prediction
Beside the overall 3D PSP (an optimization problem), several structural aspects can be predicted for each protein residue:
- Coordination number
- Solvent accessibility
- Etc.
These problems can be modelled in many ways:
- Regression or classification problems
- Low/high number of classes
- Balanced/unbalanced classes
- Adjustable number of attributes
Ideal benchmarks: http://ico2s.org/datasets/psp_benchmark.html

Contact Map
Two residues of a chain are said to be in contact if their distance is less than a certain threshold.
Contact Map (CM): a binary matrix that contains a 1 in a cell if the residues at its row and column are in contact, and 0 otherwise.
This matrix is very sparse: in real proteins fewer than 2% of residue pairs are in contact, so the dataset is highly unbalanced.
[Figure: example contact map, with helices and sheets marked]
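
To make the definition concrete, here is a minimal sketch of building such a matrix from residue coordinates; the 8 Å threshold and the use of raw 3D positions are illustrative assumptions, as the slide does not state the competition's exact distance definition.

```python
# Minimal sketch of a contact map, assuming an 8 angstrom threshold on
# pairwise distances between residue positions (an assumption for
# illustration; the actual threshold is not stated on this slide).
import numpy as np

def contact_map(coords, threshold=8.0):
    """coords: (n_residues, 3) array of 3D positions -> binary n x n matrix."""
    diff = coords[:, None, :] - coords[None, :, :]   # pairwise displacements
    dist = np.sqrt((diff ** 2).sum(axis=-1))         # pairwise distances
    return (dist < threshold).astype(np.int8)        # 1 = in contact

coords = np.random.rand(50, 3) * 30.0   # toy "chain" of 50 residues
cm = contact_map(coords)
print(f"contact density: {cm.mean():.1%}")
```

Note that entries near the diagonal are trivially contacts; this is why, as described below, instances were only generated for residue pairs well separated in the sequence.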

Characterisation of the contact map problem (631 variables)
Three types of input information were used:
1. Detailed information from three different windows of residues, centred around the two target residues (2x) and the middle point between them (552 variables)
2. Information about the connecting segment between the two target residues (38 variables)
3. Information about the whole chain (41 variables)
[Figure: diagram of the three information sources along the chain]

Attribute relevance (estimated from the ICOS contact map predictor)
- Colour = group of features
- Bubble size = attribute relevance in our rule-based contact map predictor
- All attributes were used in our models (but some rarely)

Contact Map dataset
- A diverse set of 3262 proteins with known structure was selected: 90% of this set was used for training, 10% for test
- Instances were generated for pairs of AAs at least 6 positions apart in the chain
- The resulting training set contained 32 million pairs of AAs and 631 attributes; less than 2% of those are actual contacts (+60GB of disk space)
- Test set of 2.89M instances
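
The pairing rule alone is simple to state in code. This sketch only enumerates candidate pairs; the extraction of the 631 attributes per pair is elided, and the helper name is ours:

```python
# Sketch of instance enumeration: one instance per pair of amino acids at
# least 6 positions apart in the chain. Feature extraction is elided.
from itertools import combinations

def candidate_pairs(chain_length, min_separation=6):
    for i, j in combinations(range(chain_length), 2):
        if j - i >= min_separation:
            yield i, j

# e.g. a 100-residue chain yields 4465 candidate pairs
print(sum(1 for _ in candidate_pairs(100)))
```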

Evaluation
Four metrics are computed for each submission:
- True Positive Rate, TPR = TP/P
- True Negative Rate, TNR = TN/N
- Accuracy = (TP+TN)/(P+N)
- Final score = TPR x TNR
The final score was selected to promote predicting the minority class (P) of the problem.
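
In code, the four metrics reduce to a few lines; this is a minimal sketch assuming labels encoded as 1 (contact) and 0 (non-contact):

```python
# Minimal sketch of the four competition metrics as defined on this slide.
def evaluate(y_true, y_pred):
    P = sum(1 for t in y_true if t == 1)
    N = sum(1 for t in y_true if t == 0)
    TP = sum(1 for t, p in zip(y_true, y_pred) if t == p == 1)
    TN = sum(1 for t, p in zip(y_true, y_pred) if t == p == 0)
    tpr, tnr = TP / P, TN / N
    acc = (TP + TN) / (P + N)
    return tpr, tnr, acc, tpr * tnr   # final score = TPR x TNR

print(evaluate([1, 0, 0, 0], [1, 0, 1, 0]))  # -> (1.0, 0.667, 0.75, 0.667)
```

Because accuracy alone would reward predicting everything as non-contact (over 98% accuracy on this data), the TPR x TNR product forces models to get some of both classes right, or score near zero.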

Submission of predictions
After registration each team was given a submission code. To submit predictions, teams had to specify:
- The team name
- The submission code
- Upload a file with the predictions (one predicted class per row)
- A brief description of method and resources
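
The expected upload is just a flat file; a hypothetical sketch, where the file name and 0/1 label encoding are assumptions:

```python
# Hypothetical sketch of the predictions file: one predicted class per row,
# in test-set order. The file name and 0/1 encoding are assumptions.
predictions = [0, 1, 0, 0, 1]   # toy stand-in for the 2.89M test predictions
with open("predictions.txt", "w") as f:
    f.writelines(f"{p}\n" for p in predictions)
```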

OVERALL SCORES AND PARTICIPANTS' PRESENTATIONS

Overall scores

Team Name    | Predictions | TPR      | TNR      | Acc      | TPR x TNR
Efdamis      | 156         | 0.730432 | 0.730183 | 0.730188 | 0.533349
ICOS         | 63          | 0.703210 | 0.730155 | 0.729703 | 0.513452
UNSW         | 51          | 0.699159 | 0.727631 | 0.727153 | 0.508730
HyperEns     | 10          | 0.640027 | 0.763378 | 0.761308 | 0.488583
PUC-Rio_ICA  | 40          | 0.657092 | 0.714599 | 0.713634 | 0.469558
EmeraldLogic | 17          | 0.686926 | 0.669737 | 0.670025 | 0.460059
LidiaGroup   | 27          | 0.653042 | 0.695753 | 0.695036 | 0.454356

Total: 364 predictions submitted.
[Figure: bar chart of final scores (TPR x TNR) per team]

Timeline

LidiaGroup, Universidade da Coruña
- One-layer neural network using all the data (with oversampling) and only 100 features
- Resources not specified
- Also tested one-class classification
[Figure: Team LidiaGroup score over time, 2014-06-20 to 2014-06-26]
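
As a rough single-machine sketch of the ingredients named on this slide (the feature-selection method and hidden-layer size are our assumptions, not LidiaGroup's actual setup):

```python
# Sketch only: reduce to 100 features, then a one-hidden-layer network.
# SelectKBest/ANOVA and the layer size are assumptions for illustration.
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    SelectKBest(f_classif, k=100),             # 631 -> 100 features
    MLPClassifier(hidden_layer_sizes=(32,)),   # a single hidden layer
)
# Random oversampling of the minority class would be applied to the
# training set before calling model.fit(X_train, y_train).
```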

EmeraldLogic
- Linear Genetic Programming-style learning algorithm
- Hardware: Intel i7-3930K CPU, Nvidia Titan Black GPU, 32GB RAM
- 70 minutes of CPU+GPU training time plus 0.3 man-hours of post-analysis
[Figure: Team EmeraldLogic score over time, 2014-05-20 to 2014-05-29]

PUC-Rio_ICA
- Linear Genetic Programming
- Hardware: GeForce GTX Titan; other resources not specified
[Figure: Team PUC-Rio_ICA score over time, 2014-06-09 to 2014-06-24]

HyperEns
- Standard and Budgeted SVMs with Bayesian optimisation of parameters (C, γ, cost-sensitive class errors)
- Best model: 4.7 days of parameter optimisation on a 16-core machine
[Figure: Team HyperEns score over time, 2014-06-02 to 2014-06-16]
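
For illustration, a cost-sensitive SVM in scikit-learn exposes exactly the parameters the slide names; the values below are placeholders for what a Bayesian optimiser would search over (this is not HyperEns's budgeted-SVM code):

```python
# Illustrative sketch of a cost-sensitive SVM; C, gamma and the per-class
# error costs are the quantities a Bayesian optimiser would tune.
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.98], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = SVC(C=1.0, gamma="scale", class_weight={0: 1.0, 1: 10.0})
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```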

UNSW
- Hybrid Learning Classifier System / Deep Learning architecture
- Each run used 8250 CPU hours on a 24-core machine
[Figure: Team UNSW score over time, 2014-06-15 to 2014-06-30]

ICOS
- Ensemble of rule sets learnt from samples with a 1:1 ratio of positive to negative examples
- Using the BioHEL evolutionary machine learning algorithm to generate the rule sets
- Training time for the best solution: ~3000 CPU hours
[Figure: Team ICOS score over time, 2014-05-26 to 2014-06-30]
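
The sampling idea generalises beyond BioHEL; here is a sketch with a decision tree standing in for the evolved rule sets (BioHEL itself is not reproduced here):

```python
# Sketch of the balanced-ensemble idea: each member is trained on a sample
# with a 1:1 positive/negative ratio; predictions are majority-voted.
# A decision tree stands in for BioHEL's evolved rule sets.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def balanced_ensemble(X, y, n_members=10, seed=0):
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    members = []
    for _ in range(n_members):
        idx = np.concatenate([pos, rng.choice(neg, len(pos), replace=False)])
        members.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return members

def vote(members, X):
    return (np.mean([m.predict(X) for m in members], axis=0) >= 0.5).astype(int)
```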

Efdamis
- Hadoop-based solution
- Pipeline of random over-sampling, evolutionary feature weighting and Random Forests
- Best model: 39h of wall-clock time on a 144-core cluster (not used exclusively)
[Figure: Team Efdamis score over time, 2014-05-19 to 2014-06-30]
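
A single-machine illustration of the pipeline's stages follows; Efdamis implemented these as Hadoop/MapReduce jobs, and the feature weights below are placeholders for their evolutionary feature weighting:

```python
# Single-machine sketch of the stages named on the slide: random
# oversampling of the minority class, per-feature weighting, Random Forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def oversample(X, y, seed=0):
    rng = np.random.default_rng(seed)
    pos = np.where(y == 1)[0]
    extra = rng.choice(pos, size=(y == 0).sum() - len(pos), replace=True)
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]

def fit_pipeline(X, y, feature_weights):
    Xb, yb = oversample(X, y)
    Xw = Xb * feature_weights   # scale features by their learned weights
    return RandomForestClassifier(n_estimators=100).fit(Xw, yb)
```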

Computational effort vs score

Team Name    | Wall-clock hours | CPU hours | Score
Efdamis      | 39h              | ---       | 0.533349
ICOS         | Best case: 12h   | 2998h     | 0.513452
UNSW         | 70h              | 8250h     | 0.508730
HyperEns     | 21.1h            | 337.6h    | 0.488583
PUC-Rio_ICA  | ?                | ?         | 0.469558
EmeraldLogic | 70 (CPU+GPU)     | ---       | 0.460059
LidiaGroup   | ?                | ?         | 0.454356

Learning paradigm vs score

Team Name    | Learning paradigm                | Score
Efdamis      | Random Forest                    | 0.533349
ICOS         | Ensemble of rule sets            | 0.513452
UNSW         | LCS of Deep Learning classifiers | 0.508730
HyperEns     | SVM                              | 0.488583
PUC-Rio_ICA  | Linear GP                        | 0.469558
EmeraldLogic | ~Linear GP                       | 0.460059
LidiaGroup   | 1-layer NN                       | 0.454356

Evolutionary vs Non-Evolutionary
- Totally evolutionary: ICOS (2nd), UNSW (3rd), PUC-Rio_ICA (5th), EmeraldLogic (6th)
- Partially evolutionary: EFDAMIS (1st)
- Non-evolutionary: HyperEns (4th), LidiaGroup (7th)

What went well
- Diversity of learning paradigms and computation frameworks
- Contribution of evolutionary computation
- Total flexibility of strategy for participants, with a lightweight submission system
- Teams enjoyed it and learnt a lot from the experience

What could be better
- Analysis of computational effort is quite qualitative
- Competition platform: a more sophisticated engine giving richer information
- Separate validation (for the leaderboard) and test (for the final score) sets

Next one?
- Data donors! This challenge was possible because the dataset, until the competition, was private
- We need data sources that can remain non-public while the competition runs
- Other types of datasets
- Prize?
- A challenge with a uniform computational budget but, at the same time, flexibility of strategy

ECBDL'14: Evolutionary Computation for Big Data and Big Learning Workshop, July 13th, 2014. Big Data Competition.
Thanks to all participants!