Current Topics in Information-theoretic Data Mining

Similar documents
Social Media Mining. Data Mining Essentials

So, how do you pronounce. Jilles Vreeken. Okay, now we can talk. So, what kind of data? binary. * multi-relational

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Unsupervised Data Mining (Clustering)

FlowMergeCluster Documentation

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

How To Cluster

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach

Community Detection Proseminar - Elementary Data Mining Techniques by Simon Grätzer

How To Cluster Of Complex Systems

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

MIMO CHANNEL CAPACITY

Supervised Feature Selection & Unsupervised Dimensionality Reduction

SCAN: A Structural Clustering Algorithm for Networks

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

How To Make A Credit Risk Model For A Bank Account

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

How To Solve The Cluster Algorithm

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Categorical Data Visualization and Clustering Using Subjective Factors

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Cluster Analysis: Advanced Concepts

BIRCH: An Efficient Data Clustering Method For Very Large Databases

Clustering Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

Clustering Very Large Data Sets with Principal Direction Divisive Partitioning

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms. Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

Data Mining for Knowledge Management. Classification

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Big Graph Processing: Some Background

DATA ANALYSIS II. Matrix Algorithms

Lecture 10: Regression Trees

Nimble Algorithms for Cloud Computing. Ravi Kannan, Santosh Vempala and David Woodruff

Cluster Analysis. Isabel M. Rodrigues. Lisboa, Instituto Superior Técnico

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University

Clustering UE 141 Spring 2013

Chapter 7. Cluster Analysis

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #18: Dimensionality Reduc7on

REVIEW OF ENSEMBLE CLASSIFICATION

Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014

Fast Analytics on Big Data with H20

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Apache Hama Design Document v0.6

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

Learning Gaussian process models from big data. Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu

Introduction to Clustering

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

Learning outcomes. Knowledge and understanding. Competence and skills

Discovering Local Subgroups, with an Application to Fraud Detection

Neural Network Add-in

Data Mining Algorithms Part 1. Dejan Sarka

Krishna Institute of Engineering & Technology, Ghaziabad Department of Computer Application MCA-213 : DATA STRUCTURES USING C

Architectures for Big Data Analytics A database perspective

MapReduce Approach to Collective Classification for Networks

The Data Mining Process

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu

Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics

Compact Representations and Approximations for Compuation in Games

Data Mining with R. Decision Trees and Random Forests. Hugh Murrell

Hadoop SNS. renren.com. Saturday, December 3, 11

Clustering Via Decision Tree Construction

Information Management course

Environmental Remote Sensing GEOG 2021

Component Ordering in Independent Component Analysis Based on Data Power

Lecture 18: Applications of Dynamic Programming Steven Skiena. Department of Computer Science State University of New York Stony Brook, NY

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

DATA PREPARATION FOR DATA MINING

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Mining Social-Network Graphs

Anomaly detection in data represented as graphs

Assessing Data Mining: The State of the Practice

Standardization and Its Effects on K-Means Clustering Algorithm

Chapter ML:XI (continued)

TIETS34 Seminar: Data Mining on Biometric identification

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction

Didacticiel - Études de cas

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

BIG DATA What it is and how to use?

ANALYTICS IN BIG DATA ERA

Bisecting K-Means for Clustering Web Log data

K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

A Method for Decentralized Clustering in Large Multi-Agent Systems

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier

Lecture 7: NP-Complete Problems

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

MSCA Introduction to Statistical Concepts

Unsupervised Learning and Data Mining. Unsupervised Learning and Data Mining. Clustering. Supervised Learning. Supervised Learning

Data Mining Lab 5: Introduction to Neural Networks

Transcription:

Current Topics in Information-theoretic Data Mining NINA HUBIG, ANNIKA TONCH

Helmholtz-Hochschul research group ikdd Applications: Neuroscience, Diabetes research.

Outline 1. Introduction 2. General Information 3. Short Presentation of Topics 4. Selection of Topics

Information-theoretic Data Mining INTRODUCTION

Example Clustering: Find a natural grouping of the data objects. How many clusters? What to do with outliers?

The Algorithm K-Means 1) Initialize K cluster centers randomly. 2) Assign points to the closest center. 3) Update centers. 4) Iterate 2) und 3) until convergence. + fast convergence, + well-defined objective function, + gives a model describing the result. J k n ( j) j 1 i 1 x i c j 2

We need a quality criterion for clustering bad better perfect

Measuring Clustering Quality by Data Compression Data compression is a good criterion for - the required number of clusters - the goodness of a cluster structure - the quality of a cluster description How can a cluster be compressed?

Measuring Clustering Quality by Data Compression Data compression is a good criterion for 0.4 0.2 19% 2.3 bit pdf Lapl(3.5,1) (x) - the required number of clusters - the goodness of a cluster structure - the quality of a cluster description by a pdf How can a cluster be compressed? 0 1 2 3 4 5 6 7 0 0.2 0.4 1 2 3 4 5 6 7 5% 4.3 bit pdf Gauss(3.5,1) (y) Minimum Description Length (MDL) Principle: Automatic balance of Goodness-of-fit and model complexity

Key Idea Data compression is a very general measure for: - The amount of any kind of non-random information in any kind of data, - The success of any kind of data mining technique.

General Information ABOUT THE SEMINAR

Goals of the Seminar Learn how to: Read scientific papers Discover the state-of-the-art on a specific topic Write a scientific report Do a scientific presentation

The Seminar in Practice ECTS: 3 Credits (Bachelor), 6 Credits (Master) Master students get the harder papers ;) Presentation: 20 min presentation/10 min questions. Download the template from the seminar web page Write a report (max 8 pages). 3-4 pages Bachelor students 5-6 pages Master students Attendance and participation of the seminar meetings ASK the lecturers ;) Seminar days: February 19-20, time to be announced at the website.

Contents of the Report

Contents of the Presentation

Topic Selection FIND YOUR OWN PAPER

Mining Numerical and Mixed Data BASIC CLUSTERING FINDING ALTERNATIVE CLUSTERINGS MIXED (NUMERICAL, CATEGORICAL DATA)

Algorithm RIC: Robust Information-theoretic Clustering (KDD 2006) uniform Start with an arbitrary partitioning 1. Robust Fitting (RF): Purifies individual clusters from noise, determines a stable model. gaussian gaussian subspace cluster laplacian 2. Cluster Merging (CM): Stiches clusters which match well together. Additional value-add: Description of the cluster content by assigning model distribution functions to the individual coordinates. Free from sensitve parameter settings! Helmholtz-Hochschul-Nachwuchsgruppe ikdd, Claudia Plant

A Nonparametric Information- Theoretic Clustering Algorithm first google pick for information theoretical clustering ;) close to machine learning uses entropy and mutual information as quality function a bit different than our MDL-based approaches!

mincentropy: a Novel Information Theoretic Approach for the Generation of Alternative Clusterings Aims at finding different alternative clusterings for the same data set Uses a general entropy as objective function (not Shannon) can also be used semi-supervised (close to machine learning topics)

INCONCO: Interpretable Clustering of Numerical and Categorical Objects Uses Minimum Description Length (MDL) ;) Tackles mixed-type attributes: numerical, categorical data Clusters by revealing dependency patterns among attributes by using and extended Cholesky decomposition

Dependency Clustering across measurement scales Uses MDL ;) supports mixed-type attributes finds attribute dependencies regardless the measurement scale

Relevant overlapping subspace clusters on categorical data Focus on subspace clustering on categorical data. Non redundant approach Parameter free /automized

Graph Mining CLUSTERING WEIGHTED GRAPHS SUMMARIZATION, STRUCTURE MINING

Fully Automatic Cross-Associations Content Finding structures in datasets (parameter-free, fully automatic, scalable to very large matrices) Input data: binary matrix (for example gained by graph data) Rearrangement of rows and columns according to the smallest coding costs suggested by MDL

Weighted Graph Compression for Parameter-free Clustering With PaCCo Clustering weighted graphs (parameter-free, fully automatic, reduced runtime) Input data: adjacency matrix (containing weight information) Downsplitting of the clusters according to the smallest coding costs suggested by MDL

Summarization-based Mining Bipartite Graphs Mining bipartite graphs Transforming the original graph into a compact summary graph controlled by MDL Contributions: Clustering, hidden structure Mining, link prediction

Subdue: Compression-Based Frequent Pattern Discovery in Graph Data Discovering interesting patterns Input data: single graph or set of graphs (labeled or unlabeled) Outputting substructures that best compress the input data set according to MDL

Compression-based Graph Mining Exploiting Structure Primitives Graph clusterer that distinguishes different pattern in graphs Suitable for sparse graphs Minimum Description Length compression leads to favorizing stars or cliques

PICS: Parameter-free Identification of Cohesive Subgroups in Large Attributed Graphs Summarizes Graphs with node Attributes Fully Automatic Linear runtime

VOG: Summarizing and Understanding Large Graphs Compressing a graph with structure patterns: cliques, hubs, chains near linear runtime Newest paper on the line ;)

Mining Connection Pathways for Marked Nodes in Large Graphs determining connection pathways different ways of link analysis NP hard problem (travelling salesman) Uses minimum description length

Vielen Dank für die Aufmerksamkeit