Data Mining and Machine Learning in Bioinformatics



Similar documents
Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Introduction to Data Mining

Data Mining: Introduction. Lecture Notes for Chapter 1. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

Foundations of Artificial Intelligence. Introduction to Data Mining

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

Information Management course

Big Data. Introducción. Santiago González

Introduction of Information Visualization and Visual Analytics. Chapter 4. Data Mining

Data Mining on Social Networks. Dionysios Sotiropoulos Ph.D.

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Introduction to Artificial Intelligence G51IAI. An Introduction to Data Mining

Introduction to Data Mining

Data Mining Part 5. Prediction

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

Data, Measurements, Features

Classification algorithm in Data mining: An Overview

Learning is a very general term denoting the way in which agents:

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research

A Review of Data Mining Techniques

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

Data Mining. Yeow Wei Choong Anne Laurent

Data Mining Applications in Higher Education

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

Network Machine Learning Research Group. Intended status: Informational October 19, 2015 Expires: April 21, 2016

How To Cluster

Machine Learning and Data Mining. Fundamentals, robotics, recognition

MA2823: Foundations of Machine Learning

Machine Learning and Statistics: What s the Connection?

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

The Tonnabytes Big Data Challenge: Transforming Science and Education. Kirk Borne George Mason University

Principles of Data Mining by Hand&Mannila&Smyth

Machine Learning: Overview

Big Data Analytics. Genoveva Vargas-Solar French Council of Scientific Research, LIG & LAFMIA Labs

An Overview of Knowledge Discovery Database and Data mining Techniques

MS1b Statistical Data Mining

Association Technique on Prediction of Chronic Diseases Using Apriori Algorithm

Azure Machine Learning, SQL Data Mining and R

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Welcome. Data Mining: Updates in Technologies. Xindong Wu. Colorado School of Mines Golden, Colorado 80401, USA

Introduction to Pattern Recognition

Learning outcomes. Knowledge and understanding. Competence and skills

Classification and Prediction

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA

How To Use Neural Networks In Data Mining

DATA MINING - 1DL105, 1Dl111

Semi-Supervised and Unsupervised Machine Learning. Novel Strategies

Machine Learning CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Data Warehousing and Data Mining

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

life science data mining

DATA MINING TECHNIQUES AND APPLICATIONS

Statistics for BIG data

Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors

Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD

Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

The Data Mining Process

Analytics on Big Data

An Empirical Study of Application of Data Mining Techniques in Library System

Maschinelles Lernen mit MATLAB

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Knowledge Discovery and Data Mining

Keywords data mining, prediction techniques, decision making.

Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions

College of Health and Human Services. Fall Syllabus

International Journal of Innovative Research in Computer and Communication Engineering

Data Warehousing and Data Mining

Data Mining On Diabetics

Database Marketing, Business Intelligence and Knowledge Discovery

Social Media Mining. Data Mining Essentials

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup

Statistics Graduate Courses

Distributed forests for MapReduce-based machine learning

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

Introduction to data mining. Example of remote sensing image analysis

Big Data Analytics. The Hype and the Hope* Dr. Ted Ralphs Industrial and Systems Engineering Director, Laboratory

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction

Data Mining: Overview. What is Data Mining?

Course Requirements for the Ph.D., M.S. and Certificate Programs

Chapter 12 Discovering New Knowledge Data Mining

Big Data: Rethinking Text Visualization

Using the Grid for the interactive workflow management in biomedicine. Andrea Schenone BIOLAB DIST University of Genova

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

Hadoop Operations Management for Big Data Clusters in Telecommunication Industry

Machine Learning. Mausam (based on slides by Tom Mitchell, Oren Etzioni and Pedro Domingos)

Specific Usage of Visual Data Analysis Techniques

Data Integration. Lectures 16 & 17. ECS289A, WQ03, Filkov

A Survey on Pre-processing and Post-processing Techniques in Data Mining

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

International Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: X DATA MINING TECHNIQUES AND STOCK MARKET

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Learning from Big Data in

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

Data Mining Algorithms Part 1. Dejan Sarka

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr

Transcription:

Data Mining and Machine Learning in Bioinformatics PRINCIPAL METHODS AND SUCCESSFUL APPLICATIONS Ruben Armañanzas http://mason.gmu.edu/~rarmanan Adapted from Iñaki Inza slides http://www.sc.ehu.es/isg

OUTLINE n Intro to bioinf and data mining n Basic vocabulary and main techniques in machine learning Bioinformatic applications

BIOINFORMATICS SUCCINCT INTRO n Bioinformatics applies methods of information science for the analysis, modeling, and knowledge discovery of biological processes in living organisms n It brings together several disciplines molecular biology, mathematics, chemistry, physics, and informatics, with the aim of understanding life

GENERAL SCHEME OF ML APPS IN BIOINFORMATICS

DATA MINING ROOTS n Data collected and stored at enormous speeds (GB/hour). Data collections-floods, which were not envisioned to be analyzed few years ago, are being collected and warehoused: remote sensors on a satellite telescopes scanning the skies microarrays generating gene expression data scientific simulations generating terabytes of data electronic purchases and transactions

DATA MINING ROOTS n Computers and storage systems have become cheaper and more powerful n Since 90 s, much more data is being stored than analyzed (around 5-10%) n Data tsunami : in 2010 enterprises stored 7 exabytes (10 18 bytes)= 7,000,000,000 GB n Traditional data analysis techniques unfeasible for raw data

DEFINITION: DATA MINING

BIG DATA n New technological concept n Related to the challenges exposed to manipulate massive datasets (petabytes, exabytes): Capture and storage Processing and computing Analysis and mining n Demands the development of new platforms: MapReduce, Hadoop,

DEFINITION: MACHINE LEARNING n Machine Learning refers to the application of induction algorithms, which is one step in the knowledge discovery process n Training examples are either externally supplied, or supplied by a previous stage of the data mining process. n Machine Learning is the field of scientific study that concentrates on induction algorithms and on other algorithms that can be said to learn n Kohavi & Provost: Glossary of ML Terms: http://ai.stanford.edu/~ronnyk/glossary.html

DM + ML: MAIN TASKS n Prediction Methods Use some variables to predict unknown or future values of other variables n Supervised classification: nominal variable to be predicted n Regression: ordinal variable to be predicted n Description Methods Find human-interpretable patterns that describe the data n Clustering unsupervised classification n Association rule discovery n Feature selection: discover the key predictive features n Outlier detection

SUPERVISED CLASSIFICATION n Given a collection of records-samples (training set ) Each record contains a set of attributes-features-predictors Each record belongs to a class, our variable of interest (variable to be predicted)

SUPERVISED CLASSIFICATION n Find a model for class attribute as a function of the values of other attributes. There is a broad range of model types: Decision trees, Bayesian networks, neural networks n Goal: previously unseen records should be assigned a class as accurately as possible A test set is used to estimate the accuracy of the model. There is a broad range of techniques for accuracy estimation: crossvalidation, hold-out, bootstrap,

SUPERVISED CLASSIFICATION: the standard scenario

SUPERVISED CLASSIFICATION: models

SUPERVISED CLASSIFICATION: models

BIOMEDICAL INFORMATICS - BIOINFORMATICS DIAGNOSIS AND PROGNOSIS OF DISEASES BIOMARKER DISCOVERY

BIOMEDICAL INFORMATICS - BIOINFORMATICS DIAGNOSIS AND PROGNOSIS OF DISEASES BIOMARKER DISCOVERY

UNSUPERVISED CLASSIFICATION CLUSTERING n Given a collection of records-samples (training set ) Each record contains a set of attributes-features-predictors No target feature (class) which supervises the learning process n Find groups of cases with: n n Large intra-group homogeneity Large inter-groups heterogeneity Difficult evaluation-measure of these properties à no recognition rate Number of groups

CLUSTERING: MODELS

DNA MICROARRAY CLUSTERING n Find genes with similar expression profiles à a way to infer the function of genes whose function is unknown n Biclustering a classic concept in fashion again: Hartigan JA (1972). "Direct clustering of a data matrix". Journal of the American Statistical Association 67 (337)

SEMI-SUPERVISED CLASSIFICATION n Given a collection of records-samples (training set ) Each record contains a set of attributes-features-predictors A small subset of the samples is categorized (known class value) Most of the samples do not show a class value. Why? n Categorization: human-time consuming task n No knowledge to categorize the samples Can a learning process which takes advantage of unlabeled samples, construct a better supervised classification model?

PREDICTION OF GENES RELATED TO CANCER n n n n It is already known that certain genes are related to cancer For the rest of the genes it can not be stated that they are not related to cancer Helpful to prioritize, for oncogenic experts, the depth-study of specific genes More difficult than semi-supervised classification: one-class (partially supervised)

OTHER TYPES OF CLASSIFICATION PROBLEMS MULTIDIMENSIONAL CLASSIFICATION n Several class variables to be jointly predicted n Learn relationships between class variables n New term: Joint accuracy

MULTIDIMENSIONAL CLASSIFICATION APPLICATIONS

MULTILABEL CLASSIFICATION

MULTIPLE INSTANCE LEARNING

MULTIPLE INSTANCE LEARNING

ASSOCIATION RULES n Given a set of records each of which contain some number of items from a given collection; Depedendency rules which will predict occurrence of an item based on occurrences of other items. Rules are composed of antecedent and consequence parts: IF-THEN form No class concept: any item can be in the antecedent or consequence part Support and Confidence concepts