Challenging multidimensional data
Maria Cristina F. de Oliveira
cristina@icmc.usp.br
http://vicg.icmc.usp.br
Latin American e-science Workshop 2013
Data intelligence: the future
[Diagram labels: INPUT (patient history, reports, sensors, images); Repository (history of patients); Preprocessing (cleaning, selection, ...); Transformation (discretization, binarization, ...); Data Mining (clustering, classification, regression, ...); Knowledge]
Outline
• abstract data visualization
• visualizing high-dimensional data
• an approach to visualize high-dimensional data
• what are the challenges
• why you should consider working on this
Scientists & data: a real scenario
• http://www.ifsc.usp.br/nbionet
• developing biosensors: nanostructured films (thin molecular films) of biologically relevant materials
• molecular interactions between different materials produce electrical responses that can be measured, e.g., with impedance spectroscopy
Scientists & data: examples
• sensor to detect the presence of antibodies for Chagas disease or leishmaniasis in blood samples
• sensor to detect phytic acid at very low concentrations
• electronic tongues
• test a wide variety of sensor configurations to obtain optimal selectivity and sensitivity: lots of measurements, very dynamic scenario
Scientists & data: research questions
• finding an optimal sensor (thin film architecture) or optimizing the performance of an existing sensor
• understanding/explaining why it is optimal
• How do they analyze their data?
  • very limited repertoire of tools, e.g., PCA at one particular frequency of the spectra (throw away the rest!)
High-dimensional data
[Figure: a table of numeric values (feature space) and/or pairwise distances (adimensional data), mapped to a low-dimensional embedding]
Techniques
• Point placement: 2D or 3D similarity-based layouts
• Projection-based
  • variations on MDS or other dimension reduction approaches
  • data mapped to low-dimensional visual space
  • preserving distances vs. neighborhoods, global vs. local control, segregation
• Tree-based
  • hierarchy of similarity relations
  • variations on tree layouts
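The "preserving distances vs. neighborhoods" trade-off can be made concrete with a small measure of how many nearest neighbors survive a projection. The sketch below is only an illustration: the function names, the choice of Euclidean distance, and the toy points are assumptions, not taken from the slides.

```python
import math

def knn_indices(points, i, k):
    """Indices of the k nearest neighbors of points[i] (ties broken by index)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: (math.dist(points[i], points[j]), j))
    return set(others[:k])

def neighborhood_preservation(X, Y, k):
    """Average fraction of each point's k nearest neighbors in X kept in Y."""
    n = len(X)
    total = 0.0
    for i in range(n):
        shared = knn_indices(X, i, k) & knn_indices(Y, i, k)
        total += len(shared) / k
    return total / n

# Hypothetical toy data: four 2D points that lie on a line.
X = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
Y_faithful = [(0.0,), (1.0,), (5.0,), (6.0,)]   # keeps the ordering along the line
Y_scrambled = [(0.0,), (5.0,), (1.0,), (6.0,)]  # swaps the two middle points

np_faithful = neighborhood_preservation(X, Y_faithful, k=1)
np_scrambled = neighborhood_preservation(X, Y_scrambled, k=1)
```

A layout can score well on this measure while distorting large distances, which is one reason distance-based and neighborhood-based criteria are usually reported separately.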
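$X \subset \mathbb{R}^m \xrightarrow{\,f\,} Y \subset \mathbb{R}^k$
• $\delta: x_i, x_j \to \mathbb{R}$, $x_i, x_j \in X$
• $d: y_i, y_j \to \mathbb{R}$, $y_i, y_j \in Y$
• $f: X \to Y$, $|\delta(x_i, x_j) - d(f(x_i), f(x_j))| \approx 0$, $\forall x_i, x_j \in X$
$$E = \frac{\sum_{ij} \left(\delta(x_i, x_j) - d(y_i, y_j)\right)^2}{\sum_{ij} \delta(x_i, x_j)^2}$$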
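The error measure E on this slide can be computed directly from its definition. The following is a minimal sketch, assuming Euclidean distance for both δ and d (the slide leaves both open) and using hypothetical toy points; summing over pairs i < j gives the same ratio as summing over all ordered pairs.

```python
import math

def normalized_stress(X, Y, delta=math.dist, d=math.dist):
    """E = sum (delta(x_i, x_j) - d(y_i, y_j))^2 / sum delta(x_i, x_j)^2."""
    num = 0.0
    den = 0.0
    n = len(X)
    for i in range(n):
        for j in range(i + 1, n):
            dx = delta(X[i], X[j])  # dissimilarity in the original space
            dy = d(Y[i], Y[j])      # distance in the visual space
            num += (dx - dy) ** 2
            den += dx ** 2
    return num / den

# Hypothetical toy data: four points in R^3 that all lie on the plane z = 0.
X = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
Y_good = [(x, y) for x, y, _ in X]  # drops the constant coordinate
Y_bad = [(x,) for x, _, _ in X]     # keeps only the first coordinate

stress_good = normalized_stress(X, Y_good)  # all distances preserved
stress_bad = normalized_stress(X, Y_bad)    # some distances collapse to 0
```

Passing other callables for `delta` and `d` covers the case where the input is described by a non-Euclidean dissimilarity rather than by coordinates.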
[Figure labels: LSP, 2008; LAMP, 2011; NJ trees, 2007; PLP, 2011; PLMP, 2010; HiPP, 2008]
Visualization & Imaging faculty
Fernando Paulovich, João E. S. Batista, Luis Gustavo Nonato, Maria Cristina, Moacir Ponti, Rosane Minghim
Multidimensional projection
• old idea, new techniques
• current techniques must comply with requirements imposed by visualization-oriented applications:
  • speed (low computation cost)
  • capability to handle very large & massive data
  • interactivity (allow user intervention)
[Figure labels: ~600 scientific papers; ~2000 RSS news feeds (2006)]
Back to the scientists: biosensor data analysis
• Problem: distinguish blood samples with antibodies for leishmaniasis from samples with antibodies for Chagas disease (caused by Trypanosoma cruzi). Many false positives in clinical exams
• 8 types of analytes
• 25 different substances (some analytes at different concentrations), 9 samples each
Sammon's mapping: four sensors
[Figure legend: Buffer Tris-HCl 5 mM; Negative + buffer; Leishmania + buffer; Cruzi + buffer; Serum A Negative; Serum B w/ Leishmania; Serum C w/ Cruzi; Mixture + buffer]
Perinotto et al., Anal. Chem. 2010
Paulovich et al., Anal. Bioanal. Chem. 2011
Back to the scientists: biosensor data analysis
• configuration of 4 sensors
  • bare electrode, PAMAM/antigen Leish electrode, PAMAM/antigen T. cruzi electrode, PAMAM/PVS electrode
• measurements at 58 frequencies, 2 values each (real & imaginary): 116 data attributes for each sensor
• 25 x 9 = 225 samples, each described by 4 x 116 = 464 attributes (capacitance spectra)
• Data normalization: mean 0, standard deviation 1
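The attribute counts and the normalization step on this slide can be checked with a few lines of code. This is a sketch under assumptions: the constant and function names are illustrative, the toy table is made up, and the population standard deviation is used because the slide does not say which estimator was applied.

```python
import statistics

N_SENSORS = 4
N_FREQUENCIES = 58
PARTS_PER_FREQUENCY = 2  # real and imaginary parts

ATTRS_PER_SENSOR = N_FREQUENCIES * PARTS_PER_FREQUENCY  # 116
ATTRS_PER_SAMPLE = N_SENSORS * ATTRS_PER_SENSOR         # 464
N_SAMPLES = 25 * 9                                      # 225

def standardize_columns(rows):
    """Rescale every attribute (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]  # assumption: population std
    return [
        # A constant column has std 0; map it to 0.0 instead of dividing by zero.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical toy table: 3 samples x 2 attributes on very different scales.
toy = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
z = standardize_columns(toy)
z_columns = list(zip(*z))
```

Without this step, attributes measured on larger scales would dominate any distance computed between samples.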
Scientists & data: why visualization
• Exploratory scenario
• Flexibility
• Rapid feedback
• User knowledge input
• Multidisciplinary & applied
• Lots of room for novel contributions, both in applications and in fundamental aspects of CS
Application: music
Similarity map (LSP + DTW) of 1,300 songs (MIDI): classical (blue), rock (red) and Latin country (green), with musical icons for the selections.
Vargas et al., Visualizing the structure of music. Submitted, IEEE InfoVis 2013.
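DTW (dynamic time warping) compares sequences that may be stretched or compressed in time. The sketch below shows the textbook dynamic-programming form on sequences of numbers; how the songs were actually encoded for the similarity map is not stated on the slide, so the MIDI pitch lists here are hypothetical.

```python
def dtw_distance(a, b):
    """Textbook dynamic time warping cost between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

# Hypothetical pitch sequences (MIDI note numbers).
melody = [60, 62, 64, 65]
melody_slow = [60, 60, 62, 62, 64, 65]  # same notes, some held longer
melody_variant = [60, 62, 64, 67]       # last note differs by 2 semitones

same_tune = dtw_distance(melody, melody_slow)
different_ending = dtw_distance(melody, melody_variant)
```

The resulting pairwise dissimilarities are the kind of input a distance-based projection consumes when no feature vectors are available.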
Application: text, web search
Gomez-Nieto et al., Similarity Preserving Snippet-Based Visualization of Web Search Results. Submitted, IEEE Trans. Visualization & Computer Graphics.
Challenges
• many other applications, e.g., social nets, biological images, volume data...
• Big data: handling massive data still difficult
• Data representation & metric of dissimilarity
• Choice of techniques vs. data characteristics
  • Distance distribution, spatial distribution, structural relationships, noise,...
• Dimensionality curse: lossy process
• Validation
Challenges
• Better metaphors
• Quality assessment & evaluation
• Deployment: user in the loop
• Many types of problems & applications
• Diverse skills required...
Partners & collaborators
• Guilherme P. Telles and Hélio Pedrini, IC-UNICAMP
• Alneu Lopes, LABIC, ICMC-USP
• William Schwartz, UFMG
• Milton Shimabukuro, Danilo Medeiros Eler, UNESP, Pres. Prudente
• Paulo Pagliosa, UFMS
• Lars Linsen, Jacobs University, Germany
• Alexandru Telea, University of Groningen, the Netherlands
• Charl Botha, T.U. Delft, the Netherlands
• Haim Levkowitz, University of Massachusetts Lowell, USA
• Cláudio Silva, NYU-Poly, USA
• Osvaldo N. Oliveira Jr., IFSC-USP, and nbionet research network, http://www.ifsc.usp.br/nbionet/
• Armando Vieira, Biology Department, UFSCar
• Paulovich et al., Information visualization techniques for sensing and biosensing, Analyst, 2011.
• Siqueira Jr. et al., Strategies to optimize biosensors based on impedance spectroscopy to detect phytic acid using layer-by-layer films, Analytical Chemistry, 2010.
• Perinotto et al., Biosensors for efficient diagnosis of leishmaniasis: innovations in bioanalytics for a neglected disease, Analytical Chemistry, 2010.
• Joia et al., Local affine multidimensional projection, IEEE Trans. Visualization & Computer Graphics, 2011.
• Paulovich et al., Two-phase mapping for projecting massive data sets, IEEE Trans. Visualization & Computer Graphics, 2010.
• Paulovich et al., Least Square Projection: A fast high-precision multidimensional projection technique and its application to document mapping, IEEE Trans. Visualization & Computer Graphics, 2008.
• Paiva et al., Improved similarity trees and their application to visual data classification, IEEE Trans. Visualization & Computer Graphics, 2011.
VICG@ICMC
Collaborators, postdocs & students welcome!!
http://vicg.icmc.usp.br
cristina@icmc.usp.br