TOWARD BIG DATA ANALYSIS WORKSHOP




TOWARD BIG DATA ANALYSIS WORKSHOP (邁向巨量資料分析研討會) Abstracts, 2015.06.05-06

MATRIX VISUALIZATION OF BIG DATA Chun-Houh Chen Institute of Statistical Science, Academia Sinica Visualization and exploratory data analysis (EDA) will play an important role in the deep analytics of big data, but open problems remain, along with techniques and environments yet to be developed. This talk first uses several examples to introduce the applications and methodology of matrix visualization (MV) in symbolic data analysis, together with the software module we have developed, igap (Generalized Association Plots for Interval-Type Symbolic Data). We then attempt to use igap with the Hadoop computing environment to resolve the difficulties MV faces in computing, sorting, and rendering relationship matrices for big data, providing a possible EDA tool for exploring big data.

APPLICATIONS OF STATISTICS TO CANCER GENOMICS BIG DATA Yu-Chin Hsu 1, Jan-Gowth Chang 2 and Grace S. Shieh 1, 1 Institute of Statistical Science, Academia Sinica, 2 Department of Laboratory Medicine, and Center of RNA Biology and Clinical Application, China Medical University Hospital and China Medical University, Taiwan. We first introduce two types of next-generation sequencing (NGS) data, which are at the frontier of biotechnology. Both DNA target sequencing (Illumina, 1000x) and exome sequencing (Illumina, 60x) are used to find mutations in the genes of 10 local endometrial cancer patients. There are 3.2 × 10^9 bp in the human genome, and each of the aforementioned data sets easily reaches 10-50 GB. Here, distinguishing artifacts (due to PCR and sequencing errors) from true mutations in cancer is of interest. Although GATK (DePristo et al., 2011) provides some statistics to filter artifacts from true variants (SNPs in particular), there is room for improvement for cancer genomics data. We have developed two novel statistics, in addition to adopting four features of GATK, and applied them to the two aforementioned NGS data sets of the 10 endometrial cancer patients. The performance of our method is compared with that of GATK. We further applied our method, with thresholds for some statistics trained on the endometrial cancer NGS data, to exome-sequencing data of 12 thyroid cancer patients to predict true and false mutations; biological validation is ongoing.

FACE RECOGNITION BY MPCA AND PIRE Ting-Li Chen Institute of Statistical Science, Academia Sinica Face recognition can be viewed as an image classification problem. Linear discriminant analysis (LDA) is a typical method for classification. One key step of LDA is to standardize the data. However, the standardization step involves a matrix inversion that cannot be carried out directly for this problem, because the dimension is larger than the sample size (large p, small n). Sliced inverse regression (SIR) can be viewed as an extension of LDA, and the partial inverse regression estimator (PIRE) adapts the SIR approach to the large-p-small-n setting. We apply PIRE to the face recognition problem. Before the PIRE step, we first process the images with multilinear principal component analysis (MPCA) to reduce the dimension. In our simulation studies, MPCA+PIRE produces better results than existing methods on several benchmark face recognition image datasets.
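The large-p-small-n obstacle to LDA's standardization step can be seen directly: with fewer samples than dimensions, the sample covariance matrix is rank-deficient, so the inverse that standardization requires does not exist. A minimal numerical illustration (the sizes are hypothetical, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 100                      # far fewer samples than dimensions
X = rng.normal(size=(n, p))
S = np.cov(X, rowvar=False)         # p x p sample covariance matrix

# With n samples the sample covariance has rank at most n - 1 < p,
# so the matrix inversion required by LDA's standardization fails.
print(np.linalg.matrix_rank(S), "of", p)
```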

SOLVING FUSED GROUP LASSO PROBLEMS VIA BLOCK SPLITTING ALGORITHMS Tso-Jung Yen Institute of Statistical Science, Academia Sinica In this paper we propose a distributed optimization method for solving the fused group lasso problem, in which the penalty function is a sum of Euclidean distances between pairs of parameter vectors. As a result, the penalty function is not separable in these parameter vectors. A common way to make it separable is to introduce a set of auxiliary variables that represent the differences between pairs of parameter vectors. This representation can be seen as a linear operator on the joint vector of the parameter vectors, and the resulting augmented Lagrangian has a coupling quadratic term involving this linear representation. Even though the linear representation is separable in the parameter vectors, the coupling quadratic term is not. To make the coupling quadratic term separable as well, we further introduce a set of equality constraints that connect each parameter vector to a group of paired auxiliary variables. With these newly introduced equality constraints, we derive a modified augmented Lagrangian that is separable either in the parameter vectors or in the paired auxiliary variables. This separability lets us solve the fused group lasso problem with an iterative algorithm in which most tasks can be carried out independently in parallel. We evaluate the performance of the parallel algorithm by carrying out fused group lasso estimation for regression models on simulated data sets. Our results show that the parallel algorithm has a massive advantage over its non-parallel counterpart in computational time and memory usage. In addition, with additional steps in each iteration, the parallel algorithm obtains parameter values almost identical to those of the non-parallel algorithm.
Keywords: Fused lasso; Group lasso; Euclidean distance; Alternating direction method of multipliers; Block splitting algorithms.
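The pairwise Euclidean penalty described above becomes separable once each difference beta_i - beta_j is replaced by an auxiliary variable z_ij; the penalty update for each z_ij is then an independent block soft-thresholding, which is what parallelizes. A toy sketch of that per-pair step (the values and the tau setting are illustrative, not from the paper):

```python
import numpy as np

def group_soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_2 (block soft-thresholding).

    In the splitting scheme, each auxiliary variable z_ij = beta_i - beta_j
    receives this update independently of every other pair, which is what
    makes the penalty step embarrassingly parallel.
    """
    norm = np.linalg.norm(v)
    if norm <= tau:
        return np.zeros_like(v)
    return (1.0 - tau / norm) * v

# One parallelizable penalty step over all pairs (illustrative values).
betas = {0: np.array([1.0, 2.0]), 1: np.array([1.1, 1.9]), 2: np.array([5.0, 0.0])}
pairs = [(0, 1), (0, 2), (1, 2)]
tau = 0.2
z = {p: group_soft_threshold(betas[p[0]] - betas[p[1]], tau) for p in pairs}
print(z[(0, 1)])  # the nearly-equal pair is shrunk exactly to zero
```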

APPLICATION OF SEQUENTIAL METHODS FOR ANALYZING BIG DATA FROM HEALTH CLOUD Charlotte Wang 1,2, Yu-Tseng Chu 2, Yuan-Chin Ivan Chang 2, Mei-Shu Lai 1 1 Institute of Epidemiology and Preventive Medicine, National Taiwan University 2 Institute of Statistical Science, Academia Sinica With advances in information and biomedical technology, people have more opportunities to receive various medical examinations, such as traditional biochemical tests and genetic tests, and a massive volume of health-related records is being accumulated. In addition, governments, medical institutes, and researchers collect and set up datasets for diverse purposes, such as the National Health Insurance database, surveys of all kinds, electronic health records, screening data, and epidemiological cohort databases. When researchers explore health-related issues by combining a variety of databases, the volume, variety, and veracity of the data become major challenges for analysis. A possible solution is to reduce an intractably huge amount of data to computationally manageable numbers. Moreover, sequential methods, a class of statistical hypothesis tests widely used in clinical trials, do not assume a fixed sample size in advance and evaluate data as they are collected; sampling stops according to a pre-defined stopping rule once significant results are observed. In this study, we exploit these characteristics and advantages of sequential methods and apply them to analyze big data from the health cloud. Using the National Health Insurance Database as an example, we demonstrate how to apply sequential methods to data analysis and how to modify the operating techniques for existing datasets, rather than for data collected sequentially as in clinical trials. Keywords: Big data, health cloud, sequential method.
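As a concrete instance of a sequential method with a pre-defined stopping rule, Wald's sequential probability ratio test stops sampling as soon as the accumulated evidence crosses a boundary, so the sample size is data-driven rather than fixed. The sketch below is a generic textbook illustration, not the modified procedure of this study; the rates and error levels are illustrative:

```python
import math
import random

def sprt(stream, p0=0.5, p1=0.7, alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test for a Bernoulli rate.

    H0: success rate p0 vs H1: success rate p1. Sampling stops as soon
    as the log-likelihood ratio crosses either boundary.
    """
    upper = math.log((1 - beta) / alpha)   # cross above -> reject H0
    lower = math.log(beta / (1 - alpha))   # cross below -> accept H0
    llr, n = 0.0, 0
    for x in stream:
        n += 1
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "reject H0", n
        if llr <= lower:
            return "accept H0", n
    return "inconclusive", n

random.seed(0)
# A stream with true success rate 0.7; the test typically stops early.
decision, n_used = sprt(random.random() < 0.7 for _ in range(10_000))
print(decision, n_used)
```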

DETECTION OF GENE-GENE INTERACTIONS USING MULTISTAGE SPARSE AND LOW-RANK REGRESSION Hung Hung Institute of Epidemiology & Preventive Medicine, National Taiwan University Finding an efficient and computationally feasible way to deal with the curse of high dimensionality is a daunting challenge faced by modern biological science. The problem becomes even more severe when interactions are the research focus. To improve the performance of statistical analyses, we propose sparse and low-rank (SLR) screening, based on the combination of a low-rank interaction model and Lasso screening. SLR models the interaction effects with a low-rank matrix to achieve a parsimonious parametrization. The low-rank model increases the efficiency of statistical inference, and hence SLR screening can detect gene-gene interactions more accurately than conventional methods. We also discuss incorporating SLR screening into the Screen-and-Clean approach (Wasserman and Roeder, 2009; Wu et al., 2010), which incurs a smaller penalty from the Bonferroni correction and can assign p-values to the identified variables in high-dimensional models. We apply the proposed screening procedure to the Warfarin dosage study and the CoLaus study. The results suggest that the new procedure can identify main and interaction effects that would have been omitted by conventional screening methods.
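The parsimony of a low-rank interaction model can be seen in a rank-1 sketch: a full pairwise-interaction matrix needs on the order of p^2 parameters, while B = u v^T needs only 2p, and its quadratic form factorizes so the interaction effect is computable in O(p). The sizes here are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 100
u, v = rng.normal(size=p), rng.normal(size=p)
B = np.outer(u, v)                  # rank-1 interaction matrix: 2p parameters
x = rng.normal(size=p)

# The interaction effect x' B x collapses to (u.x)(v.x): O(p), not O(p^2).
full = x @ B @ x
factored = (u @ x) * (v @ x)
print(np.isclose(full, factored), p * (p + 1) // 2, "vs", 2 * p)
```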

BIG DATA CASE STUDY: MALWARE AND HONEYNET LOG ANALYSIS Hsing-Yen Ann (安興彥) National Center for High-performance Computing (國家高速網路與計算中心) To resist the various threats from the Internet, NCHC has built a honeynet on TANet. With the help of the honeynet, we can observe the hackers who run botnets to search for and attack new victims. As our honeynet grows, the large amount of data can no longer be easily handled by relational databases. In this talk, I will share our experience with malware and honeynet log analysis.

SCORE-SCALE DECISION TREE FOR PAIRED COMPARISON DATA Yu-Shan Shih Department of Mathematics, National Chung Cheng University Paired comparison data are collected by comparing objects in pairs. A new decision tree method for analyzing such data is proposed. It finds the preference patterns (ranks) of the subjects based on covariates. A scoring system is applied first, and the total score associated with each object for each subject is counted. The GUIDE regression tree for multiple responses is then applied to the score outcomes, and the mean scores of the objects are used to give the preference scale of the subjects. This preference ranking is identical to that given by the Bradley-Terry model when the 2-1-0 scoring system is employed. The usefulness of our tree method is demonstrated through simulation and data analysis.
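The 2-1-0 scoring step can be sketched directly: each win contributes 2, each tie 1, and each loss 0 to an object's total for a given subject. A minimal illustration (the objects and comparison outcomes are made up for the example):

```python
from collections import defaultdict

def score_objects(comparisons, objects, win=2, tie=1, loss=0):
    """Total 2-1-0 scores for one subject's paired comparisons.

    Each comparison is (a, b, outcome) with outcome "a", "b", or "tie".
    Per the abstract, ranking objects by these totals matches the
    Bradley-Terry preference order under the 2-1-0 scoring system.
    """
    scores = defaultdict(int)
    for obj in objects:
        scores[obj] = 0
    for a, b, outcome in comparisons:
        if outcome == "a":
            scores[a] += win
            scores[b] += loss
        elif outcome == "b":
            scores[b] += win
            scores[a] += loss
        else:
            scores[a] += tie
            scores[b] += tie
    return dict(scores)

# One subject: A beats B, A ties C, C beats B.
subject = [("A", "B", "a"), ("A", "C", "tie"), ("B", "C", "b")]
print(score_objects(subject, ["A", "B", "C"]))  # {'A': 3, 'B': 0, 'C': 3}
```

These per-subject score vectors are what the multi-response regression tree is then fitted to.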

BAYESIAN VARIABLE SELECTION FOR MULTI-RESPONSE LINEAR REGRESSION Ray-Bing Chen Department of Statistics, National Cheng Kung University This paper studies the variable selection problem in high-dimensional linear regression with multiple response vectors that share the same or similar subsets of predictor variables, to be selected from a large set of candidate variables. In the literature, this problem is called multi-task learning, support union recovery, or simultaneous sparse coding, depending on the context. We propose a Bayesian method for solving this problem by introducing two nested sets of binary indicator variables. In the first set, each indicator is associated with a predictor variable (regressor), indicating whether that variable is active for any of the response vectors. In the second set, each indicator is associated with both a predictor variable and a response vector, indicating whether that variable is active for that particular response vector. The variable selection problem can then be solved by sampling from the posterior distributions of the two sets of indicator variables. We develop a Gibbs sampling algorithm for posterior sampling and demonstrate the performance of the proposed method on both simulated and real data sets.
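The nesting of the two indicator sets can be sketched as a length-p vector gamma (is the variable active for any response?) and a p-by-q matrix eta (is it active for a particular response?), with eta taking effect only where gamma is 1. This is only a structural illustration with hypothetical dimensions, not the sampler itself:

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 6, 3                               # predictors, response vectors

gamma = np.array([1, 0, 1, 0, 0, 1])      # first level: active anywhere?
eta = rng.integers(0, 2, size=(p, q))     # second level: active per response

# Nesting: the second-level indicator matters only when the first is on,
# so coefficient beta_jk enters the model only where active[j, k] == 1.
active = gamma[:, None] * eta
print(active)
```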

ACTIVE LEARNING: SUBJECT AND VARIABLE SELECTION Hsiang-Ling Hsu Institute of Statistics, National University of Kaohsiung In this work, we are interested in active learning for binary classification problems based on logistic models. Following the concept of active learning, we modify a sequential design procedure into a subject-selection strategy that selects unlabeled points into the training set. Furthermore, we implement a stepwise variable-selection procedure to identify a proper classification model based on the current training set. The proposed active learning algorithm thus iteratively performs these two procedures to add training points and then update the classification model. Simulations are used to demonstrate the advantages of the proposed algorithm. Compared with the classification results obtained from the whole training set and the full model, the simulation results show that the proposed algorithm achieves almost the same classification rates with far fewer training points and a more compact classification model. This is joint work with Professor Yuan-chin Ivan Chang and Professor Ray-Bing Chen. Keywords: active learning algorithm, backward elimination, D-efficiency criterion, forward selection, single feature optimization
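The subject-selection step of an active learning loop can be sketched with plain uncertainty sampling: pick the unlabeled point whose predicted class probability under the current logistic fit is closest to 0.5. This is a simpler stand-in for the D-efficiency-based sequential design named in the abstract, and the weights and pool below are purely illustrative:

```python
import numpy as np

def select_next(w, unlabeled):
    """Index of the unlabeled point most uncertain under logistic weights w.

    For a logistic model, the predicted probability is closest to 0.5
    where |x . w| is smallest, i.e. nearest the decision boundary.
    (Uncertainty sampling; a stand-in for the D-efficiency criterion.)
    """
    margins = np.abs(unlabeled @ w)
    return int(np.argmin(margins))

w = np.array([1.0, -1.0])                     # current fitted weights
pool = np.array([[3.0, 0.0],                  # confidently positive
                 [0.2, 0.1],                  # near the boundary
                 [0.0, 4.0]])                 # confidently negative
print(select_next(w, pool))  # 1: the point nearest the decision boundary
```

In the full algorithm this selection alternates with a stepwise variable-selection pass before the model is refitted.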