Challenging multidimensional data. Maria Cristina F. de Oliveira cristina@icmc.usp.br http://vicg.icmc.usp.br

Challenging multidimensional data
Maria Cristina F. de Oliveira
cristina@icmc.usp.br
http://vicg.icmc.usp.br
Latin American e-science Workshop 2013

Data intelligence: the future
[Figure: pipeline diagram. Labels: Input (patient history, reports, sensors, images); Preprocessing (cleaning, selection, ...); Transformation (discretization, binarization, ...); Data Mining (clustering, classification, regression, ...); Repository (history of patients); Knowledge]

Outline
- abstract data visualization
- visualizing high-dimensional data
- an approach to visualize high-dimensional data
- what are the challenges
- why you should consider working on this

Scientists & data: a real scenario
- http://www.ifsc.usp.br/nbionet
- developing biosensors: nanostructured films (thin molecular films) of biologically relevant materials
- molecular interactions between different materials produce electrical responses that can be measured, e.g., with impedance spectroscopy

Scientists & data: examples
- sensor to detect the presence of antibodies for Chagas disease or leishmaniasis in blood samples
- sensor to detect phytic acid at very low concentrations
- electronic tongues
- test a wide variety of sensor configurations to obtain optimal selectivity and sensitivity: lots of measurements, very dynamic scenario

Scientists & data: research questions
- finding an optimal sensor (thin film architecture) or optimizing the performance of an existing sensor
- understanding/explaining why it is optimal
- How do they analyze their data?
  - very limited repertoire of tools, e.g., PCA at one particular frequency of the spectra (throw away the rest!)

High-dimensional data
[Figure: table of numeric data values. Data given as a dimensional embedding (feature space) and/or as pairwise distances (adimensional data)]

Techniques
- Point placement: 2D or 3D similarity-based layouts
- Projection-based
  - variations on MDS or other dimension reduction approaches
  - data mapped to a low-dimensional visual space
  - preserving distances vs. neighborhoods, global vs. local control, segregation
- Tree-based
  - hierarchy of similarity relations
  - variations on tree layouts
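As a rough illustration of the projection-based family listed above, here is a minimal sketch of classical (Torgerson) MDS in NumPy. It is a textbook variant, not one of the techniques developed by the group; the function name `classical_mds` and the toy data are assumptions made for this example.

```python
import numpy as np

def classical_mds(dist, k=2):
    """Embed n points in k dimensions from an (n, n) matrix of pairwise distances."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-centre the squared distances to obtain a Gram (inner product) matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # eigh returns eigenvalues in ascending order; keep the k largest.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:k]
    # Clip small negative eigenvalues (round-off or non-Euclidean input).
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Toy data (hypothetical): 50 points in a 10-dimensional feature space.
rng = np.random.default_rng(0)
points = rng.normal(size=(50, 10))
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
layout = classical_mds(dist, k=2)  # 2D coordinates for a similarity-based layout
print(layout.shape)
```

Classical MDS needs a full eigendecomposition, which is one reason faster techniques are of interest for large data sets.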

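$$X \subset \mathbb{R}^m \;\xrightarrow{\;f\;}\; Y \subset \mathbb{R}^k$$

- $\delta: x_i, x_j \to \mathbb{R}$, $x_i, x_j \in X$
- $d: y_i, y_j \to \mathbb{R}$, $y_i, y_j \in Y$
- $f: X \to Y$, $|\delta(x_i, x_j) - d(f(x_i), f(x_j))| \approx 0$, $\forall x_i, x_j \in X$

$$E = \frac{\sum_{ij} \left(\delta(x_i, x_j) - d(y_i, y_j)\right)^2}{\sum_{ij} \delta(x_i, x_j)^2}$$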
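The error measure E above can be computed directly from two distance matrices. The sketch below assumes Euclidean distance for both δ and d; the helper names `pairwise_distances` and `stress` and the random data are illustrative, not taken from the slides.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance matrix for an (n, dims) array of points."""
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(delta, d):
    """E = sum_ij (delta_ij - d_ij)^2 / sum_ij delta_ij^2."""
    delta = np.asarray(delta, dtype=float)
    d = np.asarray(d, dtype=float)
    return ((delta - d) ** 2).sum() / (delta ** 2).sum()

# Hypothetical example: X in R^5, Y obtained by simply keeping two coordinates.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
Y = X[:, :2]
print(stress(pairwise_distances(X), pairwise_distances(Y)))
```

E is 0 when every pairwise distance is preserved exactly and grows as the layout distorts distances, so it can be used to compare projections of the same data.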

[Figure panels: LSP (2008); LAMP (2011); NJ trees (2007); PLP (2011); PLMP (2010); HiPP (2008)]

Visualization & Imaging faculty
- Fernando Paulovich
- João E. S. Batista
- Luis Gustavo Nonato
- Maria Cristina
- Moacir Ponti
- Rosane Minghim

Multidimensional projection
- old idea, new techniques
- current techniques must comply with requirements imposed by visualization-oriented applications:
  - speed (low computation cost)
  - capability to handle very large & massive data
  - interactivity (allow user intervention)

[Figures: ~600 scientific papers; ~2000 RSS news feeds (2006)]

Back to the scientists: biosensor data analysis
- Problem: distinguish blood samples with antibodies for leishmaniasis from samples with antibodies for Chagas disease (caused by Trypanosoma cruzi). Many false positives in clinical exams
- 8 types of analytes
- 25 different substances (some analytes at different concentrations), 9 samples each

Sammon's mapping: four sensors
[Figure: Sammon's mapping plot. Legend: Buffer Tris-HCl 5 mM; Negative + buffer; Leishmania + buffer; Cruzi + buffer; Serum A, negative; Serum B w/ Leishmania; Serum C w/ Cruzi; Mixture + buffer]
Perinotto et al., Anal. Chem. 2010; Paulovich et al., Anal. Bioanal. Chem. 2011

Back to the scientists: biosensor data analysis
- configuration of 4 sensors
  - bare electrode, PAMAM/antigen Leish electrode, PAMAM/antigen T. cruzi electrode, PAMAM/PVS electrode
- measurements at 58 frequencies, 2 values each (real & imaginary): 116 data attributes for each sensor
- 25 x 9 = 225 samples, each described by 4 x 116 = 464 attributes (capacitance spectra)
- Data normalization: zero mean, unit standard deviation
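The data layout and normalization described above can be sketched as follows. The slide does not say whether normalization is applied per attribute or globally; this sketch assumes per attribute (per column), and it uses synthetic random numbers as a stand-in for the real capacitance spectra. The function name `standardize` is also an assumption.

```python
import numpy as np

# Dimensions taken from the slide.
n_sensors, n_freqs, n_parts = 4, 58, 2      # real & imaginary parts
n_samples = 25 * 9                          # 25 substances x 9 samples = 225
n_attrs = n_sensors * n_freqs * n_parts     # 4 x 116 = 464

# Synthetic stand-in for the measured data (NOT the real spectra).
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=3.0, size=(n_samples, n_attrs))

def standardize(matrix):
    """Scale each column to zero mean and unit standard deviation."""
    matrix = np.asarray(matrix, dtype=float)
    mean = matrix.mean(axis=0)
    std = matrix.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return (matrix - mean) / std

normalized = standardize(data)
print(normalized.shape)
```

Standardizing first keeps attributes with large numeric ranges from dominating the distances that a projection then tries to preserve.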

Scientists & data: why visualization
- Exploratory scenario
- Flexibility
- Rapid feedback
- User knowledge input
- Multidisciplinary & applied
- Lots of room for novel contributions, both in applications and in fundamental aspects of CS

Application: music
[Figure: similarity map (LSP + DTW) of 1,300 songs (MIDI): classical (blue), rock (red) and Latin country (green), with musical icons relative to the selections]
Vargas et al., Visualizing the structure of music. Submitted, IEEE InfoVis 2013
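The similarity map above relies on dynamic time warping (DTW) as the dissimilarity between songs. Below is a minimal sketch of the textbook DTW recurrence on one-dimensional numeric sequences; which features Vargas et al. extract from the MIDI files is not stated here, so the pitch sequences in the example are purely hypothetical.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping cost between two 1D sequences (absolute difference as local cost)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = best cumulative cost of aligning a[:i] with b[:j].
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i, j] = step + min(cost[i - 1, j],      # repeat b[j-1]
                                    cost[i, j - 1],      # repeat a[i-1]
                                    cost[i - 1, j - 1])  # advance both
    return cost[n, m]

# Hypothetical pitch sequences: the second holds one note longer.
melody = [60, 62, 64, 65]
stretched = [60, 62, 62, 64, 65]
print(dtw_distance(melody, stretched))
```

Because DTW tolerates local stretching in time, two renditions of the same melody at different tempos can still come out as similar. A matrix of such pairwise dissimilarities is the kind of input a projection technique can then lay out in 2D.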

Application: text, web search
Gomez-Nieto et al., Similarity Preserving Snippet-Based Visualization of Web Search Results. Submitted, IEEE Trans. Visualization & Computer Graphics

Challenges
- many other applications, e.g., social nets, biological images, volume data...
- Big data: handling massive data is still difficult
- Data representation & metric of dissimilarity
- Choice of techniques vs. data characteristics
  - Distance distribution, spatial distribution, structural relationships, noise,...
- Dimensionality curse: lossy process
- Validation
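One well-known facet of the distance distribution and dimensionality issues listed above is distance concentration: for many data distributions, the nearest and farthest neighbors of a point become relatively closer as the number of dimensions grows. The sketch below illustrates this on uniform random data, which is an assumption of the example (real data may behave differently); the function name `relative_contrast` is also illustrative.

```python
import numpy as np

def relative_contrast(n_points, dim, seed=0):
    """(d_max - d_min) / d_min for distances from one query point to all others."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))  # uniform in the unit hypercube
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    return (dists.max() - dists.min()) / dists.min()

for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(500, dim), 3))
```

When the contrast is small, a dissimilarity measure separates neighbors from non-neighbors poorly, which affects both the choice of metric and the quality of any layout built from it.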

Challenges
- Better metaphors
- Quality assessment & evaluation
- Deployment: user in the loop
- Many types of problems & applications
- Diverse skills required...

Partners & collaborators
- Guilherme P. Telles and Hélio Pedrini, IC-UNICAMP
- Alneu Lopes, LABIC, ICMC-USP
- William Schwartz, UFMG
- Milton Shimabukuro, Danilo Medeiros Eler, UNESP, Pres. Prudente
- Paulo Pagliosa, UFMS
- Lars Linsen, Jacobs University, Germany
- Alexandru Telea, University of Groningen, the Netherlands
- Charl Botha, T.U. Delft, the Netherlands
- Haim Levkowitz, University of Massachusetts Lowell, USA
- Cláudio Silva, NYU-Poly, USA
- Osvaldo N. Oliveira Jr., IFSC-USP, and nbionet research network, http://www.ifsc.usp.br/nbionet/
- Armando Vieira, Biology Department, UFSCar

- Paulovich et al., Information visualization techniques for sensing and biosensing, Analyst, 2011.
- Siqueira Jr. et al., Strategies to optimize biosensors based on impedance spectroscopy to detect phytic acid using layer-by-layer films, Analytical Chemistry, 2010.
- Perinotto et al., Biosensors for efficient diagnosis of leishmaniasis: innovations in bioanalytics for a neglected disease, Analytical Chemistry, 2010.
- Joia et al., Local affine multidimensional projection, IEEE Trans. Visualization & Computer Graphics, 2011.
- Paulovich et al., Two-phase mapping for projecting massive data sets, IEEE Trans. Visualization & Computer Graphics, 2010.
- Paulovich et al., Least Square Projection: A fast high-precision multidimensional projection technique and its application to document mapping, IEEE Trans. Visualization & Computer Graphics, 2008.
- Paiva et al., Improved similarity trees and their application to visual data classification, IEEE Trans. Visualization & Computer Graphics, 2011.

VICG @ ICMC
Collaborators, postdocs & students welcome!
http://vicg.icmc.usp.br
cristina@icmc.usp.br