19º SIMPÓSIO BRASILEIRO DE BANCOS DE DADOS. 18 a 20 de Outubro de 2004, Brasília, Distrito Federal, Brasil. ANAIS


19º SIMPÓSIO BRASILEIRO DE BANCOS DE DADOS
18 a 20 de Outubro de 2004, Brasília, Distrito Federal, Brasil

ANAIS

Promoção: SBC – Sociedade Brasileira de Computação, Comissão Especial de Banco de Dados; ACM – Association for Computing Machinery, SIGMOD – Special Interest Group on Management of Data
Apoio: VLDB Endowment
Edição: Sérgio Lifschitz (Pontifícia Universidade Católica do Rio de Janeiro – PUC-Rio)
Organização: Murilo S. de Camargo (Universidade de Brasília – UnB)
Realização: Departamento de Ciência da Computação da Universidade de Brasília – UnB

CIP – CATALOGAÇÃO NA PUBLICAÇÃO

Simpósio Brasileiro de Bancos de Dados (19.: 2004 out.: Brasília)
Anais / Edição Sérgio Lifschitz. Rio de Janeiro: Pontifícia Universidade Católica do Rio de Janeiro, 2004.
p.: il.
ISBN
Conhecido também como SBBD.
1. Bancos de Dados. I. Lifschitz, Sérgio. II. SBBD (19.: 2004: Brasília).

Esta obra foi impressa a partir de originais entregues, já compostos pelos autores.
Capa: Fernando Ribeiro
Editoração: Diogo Rispoli, Hugo Lousa, Wantuil Firmiano Jr.

19th BRAZILIAN SYMPOSIUM ON DATABASES
October 18-20, 2004, Brasília, Distrito Federal, Brazil

PROCEEDINGS

Promotion: SBC – Brazilian Computing Society, Special Committee on Databases; ACM – Association for Computing Machinery, SIGMOD – Special Interest Group on Management of Data
Support: VLDB Endowment
Editor: Sérgio Lifschitz (Pontifícia Universidade Católica do Rio de Janeiro – PUC-Rio)
Organization: Murilo S. de Camargo (Universidade de Brasília – UnB)
Realization: Departamento de Ciência da Computação da Universidade de Brasília – UnB

APRESENTAÇÃO

O Simpósio Brasileiro de Bancos de Dados (SBBD) é um evento promovido anualmente pela Sociedade Brasileira de Computação (SBC), através de sua Comissão Especial de Bancos de Dados. Neste ano de 2004, a 19ª edição do evento acontece na cidade de Brasília, capital federal, junto com o Simpósio Brasileiro de Engenharia de Software (SBES). O Departamento de Ciência da Computação da Universidade de Brasília (UnB) é o responsável pela organização dos dois eventos.

O SBBD é o principal fórum nacional de discussão e apresentação de resultados de pesquisa na área de bancos de dados. Desde 1998, o SBBD conta com o apoio do Grupo de Interesse em Gerência de Dados da Association for Computing Machinery (ACM-SIGMOD). Este ano, assim como nos dois anteriores, o SBBD contou também com o apoio institucional e financeiro do VLDB Endowment, confirmando assim o reconhecimento de sua importância como evento científico por parte da comunidade internacional.

O principal objetivo do SBBD é difundir resultados recentes de pesquisa na área através de sessões técnicas para divulgação e discussão de artigos técnicos em bancos de dados e outras áreas diretamente relacionadas. Este ano, o SBBD recebeu 111 submissões de artigos técnicos, dos quais 25 foram selecionados para inclusão nestes anais. O processo de seleção foi efetuado pelos membros do Comitê de Programa, formado, em sua maioria, por pesquisadores de instituições brasileiras, mas também com grande participação (40%) de importantes pesquisadores de diversos países e continentes. Em conjunto com avaliadores adicionais externos ao comitê, foi realizada uma revisão criteriosa dos trabalhos submetidos. A pequena taxa de aceitação (22%) é comparável à dos melhores eventos mundiais em bancos de dados, atestando assim a alta qualidade dos trabalhos selecionados para apresentação no evento.

Como acontece desde 1998, os artigos selecionados para apresentação no SBBD são também avaliados por uma comissão especialmente designada para escolher o vencedor do Prêmio José Mauro de Castilho, anunciado na cerimônia especial do SBBD/SBES 2004. Este prêmio, uma homenagem a um dos pesquisadores pioneiros da área de bancos de dados no Brasil, é um reconhecimento à excelente qualidade técnica do artigo. Se os autores assim desejarem, o artigo premiado poderá vir a ser publicado, em versão estendida e novamente revisada, no Journal of the Brazilian Computer Society, editado pela SBC.

O programa do SBBD inclui ainda a apresentação de tutoriais, voltados para temas de pesquisa em bancos de dados avançados, e alguns mini-cursos complementares à formação tradicional em bancos de dados normalmente oferecida nos cursos de graduação em computação. O programa do SBBD também conta com a terceira edição do Workshop de Teses e Dissertações em Bancos de Dados (WTDBD). Como novidade, teremos pela primeira vez uma Sessão de Demos, com apresentação de ferramentas e protótipos desenvolvidos em diversas instituições de ensino e pesquisa.

Outros destaques da programação do SBBD são as palestras convidadas, especialmente apoiadas pelo VLDB Endowment, que serão proferidas por pesquisadores de renome internacional. Este ano, temos a honra de ter como palestrantes Antonio Furtado, da PUC do Rio de Janeiro; Serge Abiteboul, do INRIA Futurs (França); e Raghu Ramakrishnan, da Universidade de Wisconsin, em Madison (E.U.A.). O SBBD retoma em sua programação um painel de discussão sobre tema relevante na área de bancos de dados, contando com a presença de alguns dos pesquisadores internacionais convidados e de outros pesquisadores com forte atuação no Brasil e no exterior.

Inúmeras pessoas e instituições contribuíram para que a realização do SBBD 2004 fosse possível. Gostaria de agradecer, antes de mais nada, a todos os membros do Comitê de Organização, em particular ao Coordenador Geral, Murilo Camargo, pela oferta de organizar o SBBD e o SBES com tão pouca antecedência. A proposta do Murilo, apoiada em sua experiência como coordenador geral do SBBD e do SBES em 1999, deu a tranqüilidade a todos de que teríamos um excelente evento este ano, e isso, por si só, é digno de registro e reconhecimento. A seriedade e a dedicação de todos da equipe de organização, em sua maioria ligados ao Departamento de Ciência da Computação da UnB, garantem a ótima qualidade do evento, que pode ser atestada por todos os participantes.

Meus agradecimentos também ao Comitê Diretivo do SBBD, em particular ao coordenador da Comissão Especial de Bancos de Dados, Alberto Laender, pelo apoio em todos os momentos, em particular naqueles mais críticos. Aos colegas Carlos Alberto Heuser e Sandra de Amo, os meus sinceros agradecimentos pela coordenação dos Tutoriais e do WTDBD, respectivamente. Cabe destacar aqui a iniciativa dos colegas Carina Friedrich, Eduardo Kroth e José Palazzo, pela proposta e conseqüente organização da primeira Sessão de Demos. Cabe mencionar também o apoio de nossa colega Cláudia Bauzer Medeiros, Presidente da SBC, assim como de toda a equipe da secretaria da SBC em Porto Alegre, em particular Gabriela Conceição, pelo apoio e promoção permanentes do evento.

Gostaria de agradecer especialmente a todos os autores que submeteram artigos e, claro, a todos os avaliadores, em particular aos membros do Comitê de Programa, pelo cuidadoso trabalho de avaliação realizado, que resultou em um programa técnico da melhor qualidade. Cabe destacar que alguns colegas foram responsáveis também por revisar as versões para o inglês de alguns artigos submetidos originalmente em português. Foram várias as sugestões recebidas para qualificar e classificar os artigos, todas as participações buscando, sem exceção, o rigor técnico e o melhor equilíbrio possível na avaliação.

Por fim, gostaria de agradecer aos desenvolvedores do WIMPE, sistema que permitiu a submissão e a realização do processo de revisão dos artigos. Os colegas Guilherme Travassos e Leonardo Murta, da COPPE Sistemas, UFRJ, nos deram um grande apoio para a configuração do sistema, cedendo gentilmente programas auxiliares para alocação e distribuição de artigos para revisão. No Departamento de Informática da PUC-Rio tive o apoio fundamental da equipe de suporte na instalação do WIMPE e na hospedagem do site web do evento. Gostaria de destacar o auxílio fundamental do aluno de doutorado da PUC-Rio José Maria Monteiro, desde o primeiro momento, em todas as atividades e tarefas relacionadas com a programação do SBBD. Apesar do risco de esquecer alguns outros nomes, gostaria de citar também o apoio de diversos outros colegas e alunos da PUC-Rio, como é o caso de Luiz Fernando Bessa Seibel, Manuel Antônio Junior e Marcos Vaz Salles. Last but not least, não poderia deixar de agradecer o suporte recebido de agências governamentais, como CNPq, CAPES, FINEP e FAP-DF, além de inúmeras empresas e instituições que, com o seu apoio e patrocínio, viabilizaram a realização deste Simpósio.

Rio de Janeiro e Brasília, outubro de 2004.

Sérgio Lifschitz
Coordenador do Comitê de Programa do SBBD 2004

FOREWORD

The Brazilian Symposium on Databases (SBBD) is an annual event promoted by the Brazilian Computing Society (SBC) through its Database Special Committee. In 2004, the 19th edition of SBBD takes place in Brasília, Brazil's capital, once again jointly held with the Brazilian Symposium on Software Engineering (SBES). The Department of Computer Science of the Universidade de Brasília (UnB) is in charge of organizing both events.

SBBD is the main national forum for the discussion and presentation of database research results. Since 1998, SBBD has been held in cooperation with the ACM SIGMOD. In addition, since 2002 it has also been supported by the VLDB Endowment, thus confirming the international community's recognition of its importance as a scientific event in the area.

The main goal of SBBD is to disseminate recent research results in the database area by means of technical sessions, invited talks, tutorials, and short courses. This year, SBBD received 111 technical paper submissions (the same number as in 2003), 25 of which have been selected for publication in these proceedings. The Program Committee members, mostly Brazilian researchers, were responsible for the reviewing process, with a large participation (40%) of important researchers from different countries and continents. Together with additional external referees, they carried out a rigorous reviewing process. The relatively small acceptance rate (22%), comparable to that of the best international database events, attests to the high quality of the papers selected for presentation during the event.

As has happened since 1998, a special committee appointed to select the winner of the José Mauro de Castilho Award has reviewed the best papers selected by the Program Committee. This award, a tribute to one of the pioneering database researchers in Brazil, recognizes the high quality of the winning paper and is announced during the SBBD/SBES ceremony. If the corresponding authors agree, this year's winning paper will be published, in an extended version and after a new reviewing process, in the Journal of the Brazilian Computer Society, edited by SBC.

The SBBD program also includes tutorial presentations on advanced database research topics, as well as some short courses that complement the conventional database knowledge taught in typical computer science undergraduate programs. There is also the third edition of the Database Thesis and Dissertations Workshop (WTDBD). A new program item this year is the Demonstration Session, with presentations of prototypes and tools developed in many different research institutions.

An important part of the SBBD technical program is our invited talks, specially supported by the VLDB Endowment and given, as always, by internationally known researchers. In 2004 we have the pleasure of having as our guest speakers Antonio Furtado, from PUC-Rio; Serge Abiteboul, from INRIA Futurs, France; and Raghu Ramakrishnan, from the University of Wisconsin-Madison, USA. These and other important researchers also take part in the SBBD panel, which discusses a relevant issue for all those interested in the database area.

Several people and institutions have contributed to make SBBD 2004 possible. First of all, I would like to thank all members of the Organizing Committee, especially the General Chair, Murilo Camargo, for his proposal to organize SBBD/SBES this year with little more than one year for its preparation. Murilo's offer, backed by his previous experience as general chair of SBBD/SBES in 1999, assured us of an excellent event this year, and we must recognize and register this fact. The very serious work of all members of the Organizing Committee, mostly attached to the Computer Science Department at UnB, has guaranteed a very high quality event this year, which can be appreciated by all participants.

My deep appreciation goes to all SBBD Steering Committee members, particularly the Database Special Commission Chair, Alberto Laender, for their permanent support, especially during the most critical situations. Special thanks are due to my colleagues Carlos Alberto Heuser and Sandra de Amo, who coordinated the tutorials and the WTDBD, respectively. I would like to mention our colleagues Carina Friedrich, Eduardo Kroth and José Palazzo, for their idea of a Demonstration Session during SBBD and for its organization. I would also like to thank Cláudia Bauzer Medeiros, SBC president, and all the SBC staff in Porto Alegre, particularly Gabriela Conceição, for their continuous support.

I would also like to thank all authors who submitted papers and all reviewers, in particular the Program Committee members, for their careful review of the submitted papers, which resulted in a high-standard technical program. Some of these colleagues also helped a lot with the final English versions of papers originally submitted in Portuguese. There were many suggestions to improve the evaluation and classification process, all of them aiming at technical correctness and the most balanced possible evaluation.

I would also like to thank the WIMPE development team for providing the conference management system that enabled the submission and paper reviewing process to be carried out electronically. Guilherme Travassos and Leonardo Murta, both from COPPE Sistemas at UFRJ, gave us important help with the WIMPE configuration and kindly offered auxiliary tools for paper assignment and distribution to referees. The systems support team at PUC-Rio helped a lot with the WIMPE installation and web site hosting. Even at the risk of being unfair by not citing the names of other colleagues, I would like to mention the help of Luiz Fernando Bessa Seibel, Manuel Antônio Junior and Marcos Vaz Salles, all from PUC-Rio. I am particularly grateful to José Maria Monteiro, who was always available and ready to work on all tasks related to the SBBD program. Finally, I should also thank the support given by governmental agencies such as CNPq, CAPES, FINEP and FAP-DF, besides the local companies and many different institutions that sponsored this symposium.

Rio de Janeiro and Brasília, October 2004.

Sérgio Lifschitz
SBBD 2004 Program Committee Chair

19º Simpósio Brasileiro de Bancos de Dados / 19th Brazilian Symposium on Databases
Comissão Especial de Banco de Dados da SBC / SBC Database Special Commission

Comitê Diretivo / Steering Committee
Carlos Alberto Heuser (INF UFRGS)
Alberto H. F. Laender (DCC UFMG) – Coordenador/Chair
Sérgio Lifschitz (DI PUC-Rio)
Marta de Lima Queiróz Mattoso (COPPE UFRJ)
Ana Carolina Salgado (CIN UFPE)

Coordenador Geral / General Chair
Murilo S. de Camargo (Computação UnB)

Coordenador do Comitê de Programa do SBBD / SBBD Program Committee Chair
Sérgio Lifschitz (Informática PUC-Rio)

Coordenador dos Tutoriais SBBD / SBBD Tutorials Chair
Carlos Alberto Heuser (Informática UFRGS)

Coordenadora do Workshop de Teses e Dissertações em Bancos de Dados / Database Thesis and Dissertations Workshop Chair
Sandra de Amo (Computação UFU)

Coordenadores da Sessão de Demos em Bancos de Dados / Database Demos Session Chairs
Carina Friedrich Dorneles (INF UFRGS)
Eduardo Kroth (UNISC RS)

Membros do Comitê de Programa / Program Committee Members

Fernanda Baião (COPPE UFRJ, Rio de Janeiro)
Alvaro Barbosa (INF UFES, Espírito Santo)
Karin Becker (INF PUCRS, Rio Grande do Sul)
Mauro Biajiz (Computação UFSCar, São Paulo)
Nicole Bidoit-Tollu (LRI Univ. Paris Sud, France)
Angelo Brayner (Computação UNIFOR, Ceará)
Maria Helena Braz (ICIST UT Lisboa, Portugal)
Luc Bouganim (INRIA Rocquencourt, France)
Stefano Ceri (Politecnico di Milano, Italy)
Nina Edelweiss (INF UFRGS, Rio Grande do Sul)
João E. Ferreira (IME USP, São Paulo)
Renato Ferreira (DCC UFMG, Minas Gerais)
Fernando Fonseca (CIN UFPE, Pernambuco)
Juliana Freire (OGI/OHSU and Univ. Utah, USA)
Pedro Furtado (Univ. Coimbra, Portugal)
Silvia Gordillo (LIFIA-UNLP, Argentina)
Theo Haerder (Univ. Kaiserslautern, Germany)
Carmem S. Hara (INF UFPR, Paraná)
Carlos Hurtado (Computación Univ. Chile, Chile)
Cirano Iochpe (INF UFRGS, Rio Grande do Sul)
Ramamohanarao Kotagiri (University of Melbourne, Australia)
Alberto H. F. Laender (DCC UFMG, Minas Gerais)
Bernadette F. Lóscio (Fac. Sete de Setembro, Ceará)
Javam C. Machado (Computação UFC, Ceará)
Geovane C. Magalhães (IC UNICAMP/CPqD, São Paulo)
Ioana Manolescu (INRIA Futurs, France)
Rubens N. Melo (DI PUC-Rio, Rio de Janeiro)
Regina Motz (Univ. de la República, Uruguay)
Ana Maria de C. Moura (IME RJ, Rio de Janeiro)
Edleno S. de Moura (DCC UFAM, Amazonas)
Vincent Oria (New Jersey Inst. Technology, USA)
M. Tamer Ozsu (Univ. Waterloo, Canada)
Oscar Pastor (U. Politecnica Valencia, Spain)
Mario Piattini (UCL Madrid, Spain)
Philippe Picouet (LUSSI ENST Bretagne, France)
Paulo Pires (NCE UFRJ, Rio de Janeiro)
Alain Pirotte (Univ. de Louvain, Belgium)
Alexandre Plastino (IC UFF, Rio de Janeiro)
Fabio Andre M. Porto (IME/LNCC, Rio de Janeiro and EPFL, Switzerland)
Rodolfo Resende (DCC UFMG, Minas Gerais)
Marcus C. Sampaio (DSC UFCG, Paraíba)
Luiz Fernando Bessa Seibel (DI PUC-Rio, Rio de Janeiro)
Paulo Pinheiro da Silva (KSL Stanford, USA)
Altigran S. da Silva (DCC UFAM, Amazonas)
Kian-Lee Tan (School of Computing, National University of Singapore, Singapore)
Caetano Traina Jr. (ICMC USP São Carlos, São Paulo)
Alejandro Vaisman (U. Buenos Aires, Argentina)
Victor Vianu (Univ. California San Diego, USA)
Vânia Maria P. Vidal (Computação UFC, Ceará)
Geraldo Xexeo (COPPE UFRJ, Rio de Janeiro)
Osmar Zaiane (Univ. Alberta, Canada)

Avaliadores Adicionais / Additional Reviewers

Mirian Halfeld Ferrari Alves, Claudio Baptista, Maria da Conceição Soares Batista, Karla Borges, Benjamin Buffereau, Maria Luiza Campos, Vitor Campos, Joyce Carvalho, Ricardo Rodrigues Ciferri, Eduardo Fernandez-Medina, Inhaúma Ferraz, Robson Fidalgo, Miguel Fornari, Renata Galante, Ana Cristina Garcia, Geórgia Gomes, Edward Hermann Haeusler, Guillermo Hess, Manuel de Castro Junior, Juliano Palmieri Lage, José Antônio Fernandes de Macêdo, Adriana Marotta, Marta Mattoso, Renato Campos Mauro, Mirella Moro, Simone Moura, Martin Musicante, Mario Nascimento, Melise Paula, Duncan Ruiz, Ulrich Schiel, Diva Silva, Pedro Manoel da Silveira, Sean Siqueira, Luciana Thomé, Valéria Times, Maria Salete Marcon Gomes Vaz

Membros do Comitê de Tutoriais / Tutorials Committee Members
Alberto H. F. Laender (DCC UFMG, Minas Gerais)
Cláudia Bauzer Medeiros (UNICAMP, São Paulo)
Marta Mattoso (COPPE UFRJ, Rio de Janeiro)
Duncan Ruiz (PUCRS, Rio Grande do Sul)

Membros do Comitê do Workshop de Teses e Dissertações em Bancos de Dados / Database Thesis and Dissertations Workshop Committee Members
Altigran S. da Silva (DCC UFAM, Amazonas)
Astério Tanaka (UNIRIO, Rio de Janeiro)
Caetano Traina Jr. (ICMC USP São Carlos, São Paulo)
Denise Giuliato (UFU, Minas Gerais)
Ilmério Reis da Silva (UFU, Minas Gerais)
José Palazzo M. de Oliveira (INF UFRGS, Rio Grande do Sul)
Karin Becker (INF PUCRS, Rio Grande do Sul)
Maria Luiza Campos (IM UFRJ, Rio de Janeiro)
Marina T. Pires Vieira (UNIMEP e Computação UFSCar, São Paulo)
Rodolfo S. Ferreira Resende (DCC UFMG, Minas Gerais)
Ulrich Schiel (DSC UFCG, Paraíba)

Membros do Comitê da Sessão de Demos em Bancos de Dados / Database Demos Session Committee Members
Altigran S. da Silva (DCC UFAM, Amazonas)
Caetano Traina Jr. (ICMC USP, São Carlos)
Décio Fonseca (CIN UFPE, Pernambuco)
José Palazzo M. de Oliveira (INF UFRGS, Rio Grande do Sul)
Luiz Fernando B. Seibel (DI PUC-Rio, Rio de Janeiro)
Mariano Consens (University of Toronto, Canada)
Marta Mattoso (COPPE UFRJ, Rio de Janeiro)
Wagner Meira Jr. (DCC UFMG, Minas Gerais)

Membros do Comitê de Organização / Organizing Committee Members
Alba Cristina M. A. de Mello (UnB, Brasília DF)
Célia Ghedini Ralha (UnB, Brasília DF)
Daniel Arruda Santos Anjos (UnB, Brasília DF)
Diogo de Carvalho Rispoli (UnB, Brasília DF)
Hugo Antônio de Azevedo Lousa (UnB, Brasília DF)
João José Costa Gondim (UnB, Brasília DF)
José Carlos Ralha (UnB, Brasília DF)
Maria Emília (UnB, Brasília DF)
Murilo S. de Camargo (UnB, Brasília DF)
Rafael de Timóteo de Souza (UnB, Brasília DF)
Ricardo P. Jacobi (UnB, Brasília DF)
Ricardo Puttini (UnB, Brasília DF)
Robson de O. Albuquerque (UnB, Brasília DF)
Soemes Castilho Dias (UnB, Brasília DF)
Wantuil Firmiano Júnior (UnB, Brasília DF)

Sumário / Contents

Resumos Tutoriais SBBD / SBBD Tutorials Abstracts

XML Query Processing: Storage and Query Model Interplay ..... 1
Ioana Manolescu (INRIA Futurs and LRI, Gemo Group, France)

Web Semântica e Serviços Web: Automatizando a Integração de Recursos na Web ..... 2
Maria Luiza Machado Campos, Paulo de Figueiredo Pires (DCC-IM/UFRJ, Brazil)

Mineração do Uso da Web ..... 3
Karin Becker, Mariângela Vanzin (INF PUCRS, Brazil)

Artigos das Palestras Convidadas – Sessão VLDB Endowment / VLDB Endowment Session – Invited Papers

Narratives over Real-life and Fictional Domains ..... 4
Antonio L. Furtado (Pontifícia Universidade Católica do Rio de Janeiro – PUC-Rio, Brazil)

Active XML, Security and Access Control ..... 13
Serge Abiteboul (INRIA Futurs, LRI-Université Paris-Sud and Xyleme SA, France), Omar Benjelloun, Bogdan Cautis (INRIA Futurs and LRI-Université Paris-Sud, France) and Tova Milo (Tel Aviv University, Israel)

The EDAM Project: Exploratory Data Analysis and Management at Wisconsin ..... 23
Raghu Ramakrishnan (University of Wisconsin-Madison, U.S.A.)

Artigos Selecionados pelo Comitê de Programa do SBBD / Papers Selected by the SBBD Program Committee

Sessões Técnicas SBBD (ST) / Technical Sessions (TS)

ST1 MINERAÇÃO DE DADOS / TS1 DATA MINING

Visual Analysis of Feature Selection for Data Mining Processes ..... 33
Humberto L. Razente, Fabio Jun Takada Chino, Maria Camila N. Barioni, Agma J. M. Traina, Caetano Traina Jr. (Universidade de São Paulo – USP, São Carlos, Brazil)

An Apriori-based Approach for First-Order Temporal Pattern Mining ..... 48
Sandra de Amo, Daniel A. Furtado (Universidade Federal de Uberlândia, Minas Gerais – UFU, Brazil), Arnaud Giacometti (Université de Tours, France), Dominique Laurent (Université de Cergy-Pontoise, France)

A Hypotheses-based Method for Identifying Skewed Itemsets ..... 63
Juliano Brito da Justa Neves (Universidade Federal de São Carlos, São Paulo – UFSCar, Brazil), Marina Teresa Pires Vieira (Universidade Federal de São Carlos, São Paulo – UFSCar and Universidade Metodista de Piracicaba – UNIMEP, Brazil)

ST2 PROCESSAMENTO DE CONSULTAS DISTRIBUÍDAS E PARALELAS / TS2 PARALLEL AND DISTRIBUTED QUERY PROCESSING

Aplicação de um Filtro Seletivo para a Otimização de Consultas Paralelas Usando o Agrupamento Prévio ..... 78
Nilton Cézar de Paula (Universidade Estadual do Mato Grosso do Sul – UEMS, Brazil), José Craveiro da Costa Neto (Universidade Federal do Mato Grosso do Sul – UFMS, Brazil)

Adaptive Virtual Partitioning for OLAP Query Processing in a Database Cluster ..... 92
Alexandre A. B. Lima, Marta Mattoso (COPPE/Universidade Federal do Rio de Janeiro – UFRJ, Brazil), Patrick Valduriez (INRIA/Université de Nantes, France)

Integrating Heterogeneous Data Sources in Flexible and Dynamic Environments ..... 106
Angelo Brayner, Marcelo Meirelles (Universidade de Fortaleza, Ceará – UNIFOR, Brazil)

ST3 ARMAZÉNS DE DADOS / TS3 DATA WAREHOUSING

AQUAWARE: A Data Quality Support Environment for Data Warehousing ..... 121
Glenda Carla Moura Amaral, Maria Luiza Machado Campos (NCE/Universidade Federal do Rio de Janeiro – UFRJ, Brazil)

A Framework for Data Quality Evaluation in a Data Integration System ..... 134
Verónika Peralta, Raúl Ruggia (Universidad de la República, Uruguay), Zoubida Kedad, Mokrane Bouzeghoub (Université de Versailles, France)

Providing Multidimensional and Geographical Integration Based on a GDW and Metamodels ..... 148
Robson N. Fidalgo (Faculdade Integrada do Recife and Universidade Federal de Pernambuco – UFPE, Brazil), Valeria C. Times, Joel Silva, Fernando F. Souza, Ana C. Salgado (Universidade Federal de Pernambuco – UFPE, Brazil)

ST4 MÉTODOS DE ACESSO / TS4 ACCESS METHODS

DBM-Tree: A Dynamic Metric Access Method Sensitive to Local Density Data ..... 163
Marcos R. Vieira, Caetano Traina Jr., Fabio J. T. Chino, Agma J. M. Traina (Universidade de São Paulo – USP, São Carlos, Brazil)

Twisting the Metric Space to Achieve Better Metric Trees ..... 178
César Feijó Nadvorny, Carlos Alberto Heuser (Universidade Federal do Rio Grande do Sul – UFRGS, Brazil)

i-fox: Um Índice Eficiente e Compacto para Dados XML ..... 191
Cynthia P. Santiago, Javam C. Machado (Universidade Federal do Ceará – UFC, Brazil)

ST5 LEARNING OBJECTS, RECUPERAÇÃO DE INFORMAÇÃO E CONTROLE DE CONCORRÊNCIA / TS5 LEARNING OBJECTS, INFORMATION RETRIEVAL AND CONCURRENCY CONTROL

Query Processing in ROSA Data Model ..... 204
Fábio Coutinho (Instituto Militar de Engenharia – IME-RJ, Brazil), Fábio Porto (IME-RJ, Brazil and École Polytechnique Fédérale de Lausanne – EPFL, Switzerland)

Dependence among Terms in Vector Space Model ..... 219
Ilmério Reis Silva, João Nunes de Souza, Karina Silveira Santos (Universidade Federal de Uberlândia, Minas Gerais – UFU, Brazil)

A Lock Manager for Collaborative Processing of Natively Stored XML Documents ..... 230
Michael P. Haustein, Theo Härder (University of Kaiserslautern, Germany)

ST6 ARMAZENAMENTO E CONSISTÊNCIA DE DADOS / TS6 DATA STORAGE AND CONSISTENCY

FramePersist: An Object Persistence Framework for Mobile Device Applications ..... 245
Katy C. P. Magalhães, Windson V. Carvalho, Fabrício Lemos, Javam C. Machado, Rossana M. C. Andrade (Universidade Federal do Ceará – UFC, Brazil)

On Coding Navigation Paths for In-Memory Navigation in Persistent Object Stores ..... 259
Markus Kirchberg, Klaus-Dieter Schewe, Alexei Tretiakov (Massey University, New Zealand), Alexander Kuckelberg (RWTH Aachen, Germany)

A Pull-Based Approach for Incremental View Maintenance in Mobile DWs ..... 269
Marcus Costa Sampaio, Cláudio de Souza Baptista, Elvis Rodrigues da Silva, Fábio Luiz Leite Jr., Plácido Marinho Dias (Universidade Federal de Campina Grande, Paraíba – UFCG, Brazil)

ST7 CONSULTAS NA WEB / TS7 WEB QUERYING

Towards Cost-based Optimization for Data-intensive Web Service Computations ..... 283
Nicolaas Ruberg, Gabriela Ruberg (Universidade Federal do Rio de Janeiro – UFRJ, Brazil), Ioana Manolescu (INRIA Futurs, France)

Optimizing Ranking Calculation in Web Search Engines: a Case Study ..... 298
Miguel Costa, Mário J. Silva (Universidade de Lisboa, Portugal)

Siphoning Hidden-Web Data through Keyword-Based Interfaces ..... 309
Luciano Barbosa (OGI/OHSU, U.S.A.), Juliana Freire (OGI/OHSU and University of Utah, U.S.A.)

ST8 ASPECTOS DE MODELAGEM DE DADOS / TS8 DATA MODELING ISSUES

Matching of XML Schemas and Relational Schemas ..... 322
Sergio L. S. Mergen, Carlos Alberto Heuser (Universidade Federal do Rio Grande do Sul – UFRGS, Brazil)

Computing the Dependency Basis for Nested List Attributes ..... 335
Sven Hartmann and Sebastian Link (Massey University, New Zealand)

Modelagem de Bibliotecas Digitais usando a Abordagem 5S: Um Estudo de Caso ..... 350
David Patricio Viscarra del Pozo, Lena Veiga e Silva, Alberto H. F. Laender (Universidade Federal de Minas Gerais – UFMG, Brazil), Marcos André Gonçalves (Virginia Tech, USA)

Classificação de Restrições de Integridade em Bancos de Dados Temporais de Versões ..... 363
Robson Leonardo Ferreira Cordeiro, Clesio Saraiva dos Santos, Nina Edelweiss, Renata de Matos Galante (Universidade Federal do Rio Grande do Sul – UFRGS, Brazil)

XML Query Processing: Storage and Query Model Interplay

Ioana Manolescu
INRIA-Futurs and LRI, Gemo Group, France
iona.manolescu@inria.fr

Abstract. XML data management has received a lot of attention from various perspectives. In this tutorial, we attempt to systematize the existing techniques from the persistent database perspective, in which documents are stored once in a persistent repository and queried/updated many times afterwards. We survey existing techniques for storing XML documents, in tight connection with the execution frameworks that such techniques enable; our purpose is to highlight the tight connection between storage and querying, and their mutual influences.

Web Semântica e Serviços Web: Automatizando a Integração de Recursos na Web

Maria Luiza Machado Campos, Paulo de Figueiredo Pires
DCC-IM, Universidade Federal do Rio de Janeiro (UFRJ)
{mluiza, paulopires}@nce.ufrj.br

Abstract. Web services can be defined as modular, independent, self-describing programs that can be discovered and invoked over the Internet or a corporate intranet. Through Web services technology, one can encapsulate pre-existing business processes, publish them as services, dynamically discover published services, and exchange information across corporate boundaries. Web services are thus a key technology for promoting interoperability among applications over the World Wide Web and for providing resource integration on a broader scale. Despite this potential, the technology does not offer description mechanisms precise enough to let programs interact automatically. For this to become feasible, the ideas proposed by the Semantic Web must be applied to the world of Web services, through the use of more precise descriptors and of ontologies, which describe a vocabulary of terms for communication between humans and automated agents. The goal of this tutorial is to discuss the relationship between the Web services and Semantic Web approaches, to describe the technologies and tools that underpin these concepts, and to show how such concepts and technologies can be used in the development of modern Web information systems. The topics covered represent important knowledge for undergraduate students in informatics and computer science, since they constitute a real trend in the development and integration of systems in today's Web environment.

Mineração do Uso da Web

Karin Becker, Mariângela Vanzin
FACIN, Pontifícia Universidade Católica do Rio Grande do Sul (PUC-RS)
{kbecker, mvanzin}@inf.pucrs.br

Abstract. The incessant flow of accesses to Internet pages reflects the most diverse contents, as well as ever more distinct personal habits and needs, resulting in extremely rich and diversified usage patterns. Web Usage Mining (WUM) is the area devoted to the investigation of clickstreams (i.e., the sequences of page visits made by users), aiming not only to reconstruct the steps followed by users, but mainly to discover which patterns may be interesting for the application domain. Such usage patterns can help organizations plan marketing and sales strategies, understand the lifetime value of their customers, run promotional campaigns, design sites according to user needs, and so on. This tutorial focuses on Web Usage Mining. It presents the phases that make up the WUM process, addressing for each phase the problems and challenges involved, as well as the main techniques proposed. Tools that support the various WUM phases are also covered. To ease understanding, the WUM process is presented through a case study carried out in a distance-learning environment at PUCRS.

Narratives over Real-life and Fictional Domains

Antonio L. Furtado
Departamento de Informática, Pontifícia Universidade Católica do R.J.
Rua Marquês de S. Vicente, Rio de Janeiro, Brasil

Abstract. Information systems encompass not only descriptions but also narratives, to be discussed here. In order to treat narratives, conceptual level specifications should include the definition of operations for the application being considered, which may refer to a real-life or to a fictional domain. It turns out that the same plan-oriented methods are adequate in both cases.

1. Introduction

Conceptual models, such as the Entity-Relationship (ER) model, provide the basic elements to build the description of a state of the mini-world of interest. A state consists of the set of all facts holding at a given instant. In the ER model, facts are assertions about the existence of instances of entities, the relationship instances holding among these entity instances, and the current values of the attributes of both entity and relationship instances. To express the requirements which a valid state must obey, integrity constraints are declared (e.g. salaries cannot be less than a specified minimum); it is often necessary to add further integrity constraints for the requirements concerning valid transitions between states (e.g. salaries cannot decrease).

However, information systems also comprise, besides factual descriptions, the story or stories which happen in their mini-world, mostly through the deliberate action of human agents. We call these stories narratives, and suggest that our basically technological viewpoint might profitably be complemented with considerations borrowed from literary analysis. Our approach assumes a compact representation of narratives by plots. As facts are kept in database tables, plots can be recorded in some adequate Log structure. We can proceed to the interpretation of narratives through their plots, by examining the plots against a three-level conceptual schema.

These notions are motivated informally in sections 2, 3 and 4, and the overall three-level schema approach to conceptual design is briefly summarized in section 5. Section 6 outlines the interpretation process. Uses of narratives in real-life domains are enumerated in section 7, and section 8 suggests the relevance of studies on fictional domains to practical applications. Section 9 provides some implementation considerations on how to keep a Log in relational tables. Section 10 concludes the paper.
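To make the state-based view concrete, here is a minimal sketch (our illustration, not code from the paper) of a database state as a set of ER facts, with one static and one transition integrity constraint over salaries; the fact encoding, the names, and the minimum-salary value are all assumptions.

```python
MIN_SALARY = 1000  # hypothetical statutory minimum

# A state is the set of all facts holding at an instant: entity instances,
# relationship instances and attribute values, each encoded as a tuple.
state_0 = {
    ("employee", "alice"),            # entity instance
    ("works_for", "alice", "acme"),   # relationship instance
    ("salary", "alice", 1500),        # attribute value
}

def salaries(state):
    """Collect the salary attribute values held in a state."""
    return {f[1]: f[2] for f in state if f[0] == "salary"}

def valid_state(state):
    """Static constraint: no salary below the specified minimum."""
    return all(v >= MIN_SALARY for v in salaries(state).values())

def valid_transition(old, new):
    """Transition constraint: salaries cannot decrease between states."""
    before, after = salaries(old), salaries(new)
    return all(after.get(emp, v) >= v for emp, v in before.items())

# A raise maps state_0 into a new valid state.
state_1 = (state_0 - {("salary", "alice", 1500)}) | {("salary", "alice", 1600)}
assert valid_state(state_1) and valid_transition(state_0, state_1)
```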

2. Conceptual-level operations

To model operations, still at the conceptual level, we should use a compatible domain-oriented terminology, as opposed to implementation-oriented commands such as insert, delete, modify. So, if the mini-world involves entities such as Salesmen, Products and Customers, meaningful operations may well include Order, Buy, Pay, etc.

Formally speaking, operations are functions which map current states into new states. As such, they can be conveniently defined in terms of facts holding, respectively, at current and new states, using the STRIPS formalism [9]. Each operation is denoted by its signature, comprising the name of the operation and the types of its parameters. Optionally, the roles played by the parameters can also be indicated, using some nomenclature in the line of Fillmore's case grammars [8]. The pre-conditions part of the definition of an operation is a conjunction of positive and negative facts that must hold before an instance of the operation is executed. The post-conditions part of an operation is formed by two sets of facts: those that are asserted and those that are negated by the operation. The pre-conditions and post-conditions of the operations can usually be mutually adjusted so that all integrity constraints are obeyed, thus ensuring that only valid states will ever be reached, and that the transitions themselves are valid. Also part of the definition of an operation is the agent (or agents) authorized to perform it. For example, the Order operation would usually be assigned to Customers.

The execution of an operation, as it maps a state into another state, produces an event in the mini-world. In much the same way that descriptions consist of facts, events are the building-blocks of narratives, a concept to be discussed below, always in the context of our domain-oriented approach [3].

3. Narratives and plots

Intuitively, narratives have to do with stories occurring in the mini-world. One would point out that temporal databases have a related purpose, since they represent the time-ordered sequence of states reached during the lifetime of the database. However, they do not propose to register directly which specific actions effected the transitions. As mentioned in the previous section, we denote an event by the execution of a predefined operation. Let op_1 = A: Op(t, a_1, a_2, ..., a_n) represent the execution of operation Op at time instant t, with arguments a_1, a_2, ..., a_n, through the initiative of agent A. As an example, consider Alpha: Order(Alpha, #787, 100), where Alpha is a customer, #787 a product, and 100 the quantity ordered. Let op_2, op_3 and op_4 be other events, similarly indicated. The sequence op_1, op_2, op_3, op_4, which is called a plot, can be used to denote a series of events occurring in the mini-world. Hence, a plot is a compact way to summarize a narrative.

The question now is: to what extent are such summaries able to convey at least some of the rich amount of information that we expect to find in narratives in their usual natural language format? In isolation, a plot is minimally informative, even if the names of the operations are reasonably expressive. But it gains a new dimension when interpreted against an adequate conceptual schema, where both the entity classes, with their properties, and the allowed operations are defined.

4. Situations, goals, typical plans

When an agent proposes to execute an operation, he normally has in mind some goal of his interest. If the pre-conditions of the operation do not currently hold, he may need to execute other operations beforehand, which may in turn require still other operations, and so on, in the recursive backward chaining process enabled by the STRIPS formalism. The process leads, therefore, from single operations to plans (complex operations), consisting of partially-ordered sets of one or more operations to be executed.

One immediately notes that plans and plots are similarly composed. By regarding a plot as expressing the plan of an agent, we are stressing the intentional character of the plot. But we must go a step further to realize that a plot can be far more complex, resulting from some combination of plans, which may contain operations executed by more than one agent. Moreover, there may be conflicts [17] between the combined plans; as a consequence, some of these plans may fail due to negative interferences, with the consequence that some of their constituent operations will not be executed. So, we must distinguish between a plot p expressing one or more possibly conflicting plans, and a plot p_ex corresponding to the actual (or simulated) and possibly incomplete execution of p. Suppose two customers order the same product in quantities whose total exceeds what is currently available; the executed plot would still contain both orders, but from some point on the operations for the normal continuation of the purchase and delivery process would be missing (or would indicate reduced values) for at least one of the customers.

How do the goals themselves arise? Very often they manifest themselves only if some favourable situation holds at the current state. More precisely, if a situation S, consisting of a logical expression involving positive and negative facts, holds in the current state, agent A will be committed to goal G (also consisting of a logical expression involving positive and negative facts), which we express through situation-goal rules of the form A: S → G. Certain requirements in S may coincide with the pre-conditions of operations, but others may simply correspond to motivating circumstances; for example, for a certain customer, the goal of acquiring a product might only arise in a situation where the product is being offered with a promotional discount. On the other hand, not all effects of operations are necessarily contemplated in G; one inevitable effect of paying is to decrease the balance of the payer's bank account, which is clearly not part of what a customer sees as a goal.

The STRIPS formalism immediately suggests the application of plan-generation algorithms [18]. As soon as a rule A: S → G is successfully applied, i.e. S is found to hold in the current state, one such algorithm would be used to generate a plan P, adequate to move into a new state where G holds. Although this is definitely a desirable capability to have available, one must bear in mind that it is not the only approach utilized in practice. Instead of generating a brand new plan, one may take, "from the shelves", some typical plan used in the past, with or without adaptations. This leads to the useful policy of keeping a Library of Typical Plans (LTP). An LTP can be accessed according to different criteria. A most useful access structure is an ASG-Index [7], whereby one can retrieve all typical plans associated with each situation-goal rule A: S → G. A sketch of this arrangement appears below.
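The fragment below is a minimal sketch (ours, not the authors' IPG system) of one situation-goal rule A: S → G and a tiny Library of Typical Plans accessed through an ASG-Index keyed on the (agent, situation, goal) triple; apart from the customer Alpha and product #787 taken from the running example, all names are assumptions.

```python
def discount_situation(state):
    """S: a product of interest is offered with a promotional discount."""
    return ("on_promotion", "#787") in state

def owns_goal(state):
    """G: the customer owns the product."""
    return ("owns", "Alpha", "#787") in state

# Situation-goal rule of the form  A: S -> G
rules = [("Alpha", discount_situation, owns_goal)]

# ASG-Index: (agent, situation, goal) -> typical plans, each a (here
# totally) ordered list of operation signatures.
ltp = {
    ("Alpha", "discount_situation", "owns_goal"): [
        [("Order", "Alpha", "#787"),
         ("Pay", "Alpha", "#787"),
         ("Deliver", "Vendor", "#787")],
    ],
}

def applicable_plans(state):
    """If S holds (and G does not yet), retrieve A's typical plans for G,
    instead of generating a brand new plan by backward chaining."""
    for agent, s, g in rules:
        if s(state) and not g(state):
            yield agent, ltp.get((agent, s.__name__, g.__name__), [])

state = {("on_promotion", "#787")}
for agent, plans in applicable_plans(state):
    print(agent, "may adopt one of:", plans)
```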

5. Three-level conceptual schemas

We present below an itemized summary of our three-level proposal for conceptual schema specification, as discussed in the previous sections. The method for interpreting plots, to be outlined in the next section, assumes the online availability of such schemas.

1. At the static level, facts are classified according to the Entity-Relationship model. Thus, a fact may refer either to the existence of an entity instance, or to the values of its attributes, or to its relationships with other entity instances. Entity classes may form an is-a hierarchy. All kinds of facts are denoted by predicates. The set of all facts holding at a given instant of time constitutes a database state.

2. The dynamic level covers the events happening in the mini-world represented in the database. Thus, a real-world event is perceived as a transition between database states. Our dynamic level schemas specify a fixed repertoire of domain-specific operations, as the only way to cause state transitions. Accordingly, we equate the notion of event with the execution of an operation. Operations are formally specified by the facts that should or should not hold as pre-conditions and by the facts added or deleted as the effect of execution.

3. The behavioural level models how agents are expected to act in the context of the system. To each individual agent (or agent class) A, we assign a set of goal-inference rules. A goal-inference rule A: S → G has, as antecedent, a situation S and, as consequent, a goal G, both of which are first-order logic expressions having database facts as terms. The meaning of the rule is that, if S is true at a database state, agent A will be motivated to act in order to bring about a state in which G holds. In addition, we indicate the typical plans (partially ordered sequences of operations) usually employed by the agents to achieve their goals.

6. Interpreting plots of narratives

6.1. Event-by-event reading

The interpretation process begins with a sequential traversal of the given plot, taking each constituent event to be examined in turn. Note that events, represented by terms of the form op = A: Op(t, a_1, a_2, ..., a_n), must appear in the plot in increasing temporal order, according to the time-stamp parameter t. Let p = op_1, op_2, ..., op_m be a plot. For each event op_i, by examining the conceptual schema and, whenever necessary, consulting the database state current at the instant when op_i is executed, it is possible to determine (a sketch of such a replay loop follows this list):

1. The signature of the event, i.e. the name of the operation executed and the types and values of its parameters.

2. The agent who executed the operation.

3. The positive and negative facts of the pre-conditions, instantiated with the database values retrieved.

4. The effects of the operation, i.e. the set of facts asserted and the set of facts negated, instantiated with the database values retrieved.
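As a minimal sketch of the event-by-event reading (our code, under assumed STRIPS-style operation definitions in the spirit of section 2), the loop below replays a plot from a state s_0, checks each event's pre-conditions, applies its asserted/negated facts, and reports who acted and with what effects; the tiny schema and all names are illustrative assumptions.

```python
OPERATIONS = {
    "Order": {
        "agent_class": "Customer",
        "pre": lambda a, s: ("available", a[1]) in s,       # product in stock
        "add": lambda a: {("ordered", a[0], a[1])},
        "del": lambda a: set(),
    },
    "Pay": {
        "agent_class": "Customer",
        "pre": lambda a, s: ("ordered", a[0], a[1]) in s,
        "add": lambda a: {("paid", a[0], a[1])},
        "del": lambda a: {("ordered", a[0], a[1])},
    },
}

def interpret(plot, state):
    """Replay a plot event by event, in increasing time-stamp order,
    printing items 1-4 (signature, agent, pre-conditions, effects)."""
    for agent, op, t, *args in sorted(plot, key=lambda e: e[2]):
        d = OPERATIONS[op]
        assert d["pre"](args, state), f"{op} pre-conditions fail at t={t}"
        added, deleted = d["add"](args), d["del"](args)
        state = (state - deleted) | added                    # simulated execution
        print(f"t={t}: {agent} did {op}{tuple(args)}; +{added} -{deleted}")
    return state

# Events follow the pattern A: Op(t, a_1, ..., a_n) of the running example.
s0 = {("available", "#787")}
plot = [("Alpha", "Order", 1, "Alpha", "#787"),
        ("Alpha", "Pay",   2, "Alpha", "#787")]
interpret(plot, s0)
```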

Checking the database state s_i holding just before the execution of each op_i is straightforward in a temporal database environment. Another possibility, if the state s_0 at which the execution of p began is available, is to obtain s_i through the sequential simulated execution of op_1, op_2, ..., op_{i-1}, starting from s_0.

6.2. Trying to find the intentions of agents

The previous analysis explains what effects characterize the state transitions, how (i.e. through which operations) they were achieved, who acted, and why it was possible to act at each stage. However, this elementary sort of why question addressed in item 3 above, simply verifying the satisfaction of pre-conditions, does not contribute much to explain the behaviour of agents. A more ambitious interpretation step may involve, given a situation-goal rule A: S → G, the determination of:

1. The occurrence of the facts in S, duly instantiated, at state s_0 or at some intermediate state s_i reached during the execution of a plot p.

2. The occurrence of the facts in G, duly instantiated, at an intermediate state s_j occurring later than s_i.

3. The agent A to whom the rule A: S → G is of interest, assuming that the two preceding items were successfully determined.

And, having found that the rule is applicable, a further step can be tried if an LTP is available, equipped with an ASG-Index. Let p' be the subsequence of p leading from state s_i to state s_j, and let p'' be a plan to achieve G, extracted as a partial subsequence of p' (determined as in plan-generation, and leaving out events in p' not covered by the backward chaining process). We can now try to match plan p'' against the plan patterns stored in the LTP at entry A-S-G. If the match succeeds for some plan pattern p* at the entry, then p* is recognized as the typical plan chosen by A on this occasion (more on this in the next section). If the match fails, then p'' is regarded as a non-typical plan, possibly deserving attention as an exception to usual behaviour. But p'' may also be considered for addition to the set of plan patterns at entry A-S-G. The criterion for inclusion could simply be its repeated occurrence in a number of plots (if "typical" is understood as "frequently used"); alternatively, it may involve some more elaborate considerations. It should be noted that this treatment of plans for plot interpretation can be generalized to become a viable strategy for building the LTP itself incrementally [7]. One can start with the ASG-Index, and populate the initially empty LTP with plans periodically extracted from the Log.

7. Uses of narratives in real-life domains

Narratives produced by agents interacting in the mini-world of an information system can be usefully analysed as cases. Resorting to case studies has for a long time been a standard practice in Business Administration. Let us see some possibilities offered by environments where narratives, in plot representation, are continuously recorded in a Log, and patterns of typical plans are kept in an LTP.

As the system is being used, one can match a small number of observed actions of an agent A against the LTP to try to discover what A is trying to do, noting that his actions may occasionally match more than one possible plan. In contrast to plan-generation, this process is called plan-recognition [10]. If the LTP is equipped with an ASG-Index, recognizing a plan also implies goal-recognition and, consequently, access to alternative plans indicated by the same A-S-G entry. The early recognition of an agent's plan, under execution but still not completed, allows problem detection and prediction of probable future outcomes, and hence provides support for decision-making. A simulation capability can be helpful in this connection. And, if employed with systematically generated sample situations, simulation is also a useful tool for maintenance and redesign, not only of the implemented information system but also of the three-level conceptual schema itself.

Recognizing goals and plans leads to the classification of the agents involved, allowing one to determine who are the good or bad customers, which customers are eligible for a given benefit, etc. In particular, classification together with the detection of specific goals and plans can contribute in a decisive way to effectively customized cooperative user interfaces, able to help the agents use the system in the most satisfactory way.

We suggested that an agent who is aware of his goals can choose between plan-generation and the reuse of some plan "out of the shelves", which means a deliberate access to the appropriate A-S-G entry of the LTP to find what typical plans are applicable. If a typical plan needs changes prior to execution, in view of obstacles or other special circumstances, the plan-generation algorithm can be utilized to perform the adaptation, as we have done in our experiments. Another attractive possibility would be to attack the adaptation problem from the perspective of case-based reasoning techniques [15], always recalling that typical plans, and plots in general, as summaries of narratives, can be viewed as representations of cases.

Another very general purpose of the LTP emerges especially if, as suggested at the end of the previous section, it is built from analyses of the accumulated Log, by extracting and generalizing frequently adopted plots leading to the identified goals. If so, it opens knowledge discovery / data mining opportunities, since typical plans show how agents are actually behaving, reflecting their habits and prejudices.

8. The relevance of fictional domains

One major contribution to our long-term research project was the literary study of the Russian scholar V. Propp [13], who, at the beginning of the XXth century, proposed a characterization of fairy-tales in terms of 31 functions, performed by 7 different types of dramatis personae (characters). We came to realize that:

1. Propp's functions, e.g. Villainy, Struggle, Victory, etc., are operations that cause state transitions in the fictitious mini-world, being, as such, equally amenable to a STRIPS-like treatment by way of pre-conditions and post-conditions.

2. Propp's characters (hero, villain, victim, donor, helper, dispatcher, false hero) are classes of agents, to whom the execution of specific functions is distributed.

3. Hence, literary genres, such as fairy-tales, exhibit a strong analogy with practical application domains.

Also, in parallel with the notion of typical plans in an LTP repository, a very reputed index has been compiled, listing a large number of types and motifs occurring in different kinds of folktales, produced in all countries along the centuries [1].

Rhetoricians define a number of tropes which turn or alter the meaning of a word. The list of "four master tropes" [5] comprises metaphor, metonymy, synecdoche and irony. The notion of tropes can be extended to treat whole plans, allowing one to articulate a highly creative mechanism for plan adaptation. The special power of metaphor has been reviewed in a seminal book [11]. This understanding allowed us to treat business application domains and literary genres from the same perspective, that is, with the same methods and employing the same software tools. Our Prolog tool IPG (Interactive Plot Generator) [3] is able to handle both kinds of plots. As a matter of fact, our experiments with fictional domains are having a "practical" application, since we have an ongoing project on digital entertainment [14] along these lines.

But, apart from noticing these curious similarities, experience with literary genres led us to consider a special kind of attributes which, besides being essential in fictional domains, have an importance in business environments that has not been fully acknowledged and explored yet: the attributes related to affective aspects of behaviour, such as drives and emotions [16]. Indeed, in the real everyday world, a pre-condition for a customer to order a product from a given vendor is often that he feels happy with the vendor's service. Drives (e.g. basic physical needs, such as hunger and thirst, to which it is legitimate to add social needs, such as the urge to acquire money or prestige) and emotions (e.g. anger, disgust, fear, joy, sadness, surprise) can be modelled as entity or relationship attributes, and their values can be conveniently registered in some numerical scale. Moreover, such attributes can figure in pre-conditions, post-conditions, situations and goals. The affective aspects are an important element in the classification of agents. A simulation tool is reported in [6] for training salesmen, by simulating their interaction with clients with four different personalities: dominant, political, steady, and wary; the same actions of a salesman were expected to elicit different reactions in each case.

9. Some implementation considerations

In a relational database environment, entity and relationship classes are conveniently mapped into tables. The current state of the mini-world at a given instant of time is thus represented by the contents of such tables, duly updated until that time. In order to allow the tabular representation of plots, a simple strategy can be adopted; a code sketch follows the two items below. For each conceptual operation Op(a_1, a_2, ..., a_n):

a) A table OP is created, with n + 1 columns C_0, C_1, ..., C_n, with the requirement that the rows be kept ordered by the values, of type time stamp, in column C_0.

b) A stored procedure P_OP is programmed, so that its execution will produce the desired effects of the operation and, in addition, will store in table OP a tuple <t, a_1, a_2, ..., a_n>, where t is the internal clock time when the procedure is executed.
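The following is a minimal sketch of items a) and b), using SQLite and an ordinary Python function in place of a stored procedure; the table, column and operation names are our assumptions, not a prescribed schema. The final query hints at how plots of interest are selected from the Log, as discussed next.

```python
import sqlite3, time

db = sqlite3.connect(":memory:")
# a) One OP table per conceptual operation, with time-stamp column c0 first.
db.execute("CREATE TABLE order_op (c0 REAL, customer TEXT, product TEXT, qty INT)")
db.execute("CREATE TABLE stock (product TEXT, qty INT)")   # part of the state
db.execute("INSERT INTO stock VALUES ('#787', 500)")

# b) P_Order: produce the effects of Order and append a log tuple <t, a1..an>.
def order(customer, product, qty):
    (available,) = db.execute(
        "SELECT qty FROM stock WHERE product = ?", (product,)).fetchone()
    if available < qty:              # pre-condition / integrity-constraint check
        raise ValueError("insufficient stock")
    db.execute("UPDATE stock SET qty = qty - ? WHERE product = ?",
               (qty, product))
    db.execute("INSERT INTO order_op VALUES (?,?,?,?)",
               (time.time(), customer, product, qty))

order("Alpha", "#787", 100)
# Merging all OP tables by c0 yields the time-ordered Log S_Log; selecting
# the tuples that mention one entity instance extracts a candidate plot:
print(db.execute(
    "SELECT * FROM order_op WHERE customer = 'Alpha' ORDER BY c0").fetchall())
```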

25 The contents of the OP tables at a given instant of time will then provide a time-ordered Log of the operations executed until that time. The Log can be visualized as an ordered sequence S Log by merging together all terms Op(t, a 1, a 2,..., a n ), for each tuple <t, a 1, a 2,..., a n > of each table OP, so that the order imposed by the C 0 columns is preserved. Clearly, any plot p will then correspond to an ordered partial sub-sequence of S Log. There are at least two criteria to establish which partial sub-sequences denote plots of interest. One possibility is to take those terms that refer to the same entity instance. For example: take only the terms that refer to customer Alpha. A simple extension is to also include terms referring to other related instances so, if Alpha ordered product #787, take also the terms where this product is involved. A second criterion is to take terms referring to the same transaction. A transaction is usually understood as a series of actions (executions of operations) initiated by some agent to achieve some purpose. If the designers of the conceptual schema find useful to introduce this notion, then one extra parameter should be added to operations (and an extra column to the OP tables), so that the selection of terms of the same transaction can be easily done. It should be noted that the implementation of operations as stored procedures has the additional advantage of providing an effective way to enforce integrity constraints. The assumption is, of course, that they are correctly programmed according to the defined preconditions and post-conditions, and that they are the only means authorized to users to update the database. In object-oriented terminology, this discipline is equivalent to the requirement that objects can only be handled through a predefined set of messages. Besides relational database software, we find useful to have available some language supporting pattern-matching and logic inference, especially Prolog or Datalog, preferably in combination with Constraint Programming algorithms [12]. 10. Concluding remarks Narratives have been an object of study by literary theoreticians for a considerably long time. It is likely that many of their notions, such as Propp's functions and dramatis personae, will prove helpful to the analysis of narratives arising in daily real-world activities. We are only beginning to look at such contributions, especially in the field of narratology [4, 2]. Human decision-making, as we all know, depends on a complex combination of cognitive and affective aspects. In order to predict and effectively simulate the behaviour of cooperating and competing agents, entity and relationship attributes reflecting their drives and emotions, including those referring to their inter-personal relations, cannot be left out. To analyse their behaviour, the narratives of their participation is the mini-worlds of information systems can provide invaluable insights. And these little dramas need an interdisciplinary approach for their interpretation, in which literary theory will surely play an important role. References 1. Aarne, A.: The Types of the Folktale: A Classification and Bibliography. Translated and enlarged by Stith Thompson, FF Communications, 184. Helsinki: Suomalainen Tiedeakatemia, Bal, M.: Narratology - Introduction to the Theory of Narrative. University of Toronto Press,

3. Ciarlini, A. E. M. and Furtado, A. L.: Understanding and Simulating Narratives in the Context of Information Systems. Proc. of the 21st International Conference on Conceptual Modeling, Tampere.
4. Culler, J.: Structuralist Poetics. London: Routledge & Kegan.
5. Culler, J.: Literary Theory. Oxford University Press.
6. Elliott, C.: Using the Affective Reasoner to Support Social Simulations. In Proc. of the 13th International Joint Conference on Artificial Intelligence.
7. Furtado, A. L. and Ciarlini, A. E. M.: Constructing Libraries of Typical Plans. In Proc. of the Thirteenth International Conference on Computer Advanced Information System Engineering, Interlaken.
8. Fillmore, C.: The case for case. In Universals in Linguistic Theory. E. Bach and R. Harms (eds.). Holt, Rinehart and Winston.
9. Fikes, R. E. and Nilsson, N. J.: STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 2(3-4).
10. Kautz, H. A.: A formal theory of plan recognition and its implementation. In Reasoning about Plans, J. F. Allen et al. (eds.), Morgan Kaufmann, San Mateo.
11. Lakoff, G. and Johnson, M.: Metaphors We Live By. University of Chicago Press, 2nd ed.
12. Marriott, K. and Stuckey, P. J.: Programming with Constraints. MIT Press.
13. Propp, V.: Morphology of the Folktale. Austin: University of Texas Press.
14. Scientific American, Volume 283, Number 5, November. [Special issue on digital entertainment].
15. Schank, R. C., Kass, A. and Riesbeck, C. K.: Inside Case-based Explanation. Hillsdale: Lawrence Erlbaum Associates.
16. Velásquez, J. D.: Modeling emotions and other motivations in synthetic agents. In Proc. of the Fourteenth National Conference on Artificial Intelligence, Providence, 10-15.
17. Wilensky, R.: Planning and Understanding. Addison-Wesley Publishing Company.
18. Yang, Q., Tenenberg, J. and Woods, S.: Abstraction in nonlinear planning. Technical Report CS-92-32, University of Waterloo, Canada.

Active XML, Security and Access Control

Serge Abiteboul (1,2), Omar Benjelloun (1), Bogdan Cautis (1), Tova Milo (3)
(1) INRIA-Futurs and LRI, Université Paris-Sud  (2) Xyleme SA  (3) Tel Aviv University

Abstract

XML and Web services are revolutionizing the automatic management of distributed information, somewhat in the same way that HTML, Web browsers and search engines modified human access to worldwide information. We argue in this paper that the combination of XML and Web services allows for a novel distributed data management paradigm, where the exchanged information mixes materialized and intensional, active, information. We illustrate the flexibility of this approach by presenting Active XML, a language based on embedding Web service calls in XML data. We focus on two particular issues, namely security and access control.

1. Introduction

The field of distributed data management has centered for many years around the relational model. More recently, the Web has made the worldwide (or intranet) publication of data much simpler, by relying on HTML as a standard language, universally understood by Web browsers, on plain-text search engines and on query forms. However, because HTML is more of a presentation language for documents than a data model, and because of limitations of the core HTTP protocol, the management of distributed information remained cumbersome. The situation is today dramatically improving with the introduction of XML and Web services. The Extensible Markup Language, XML [19], is a self-describing, semi-structured data model that is becoming the standard format for data exchange over the Web. Web services [22] provide an infrastructure for distributed computing at large, independently of any platform, system or programming language. Together, they provide the appropriate framework for distributed management of information.

Active XML (AXML, for short) is a declarative framework that harnesses these emerging standards for the integration and management of distributed Web data. An AXML document is an XML document where some of the data is given explicitly, while some portions are given only intensionally, by means of embedded calls to Web services. By calling the services, one can obtain up-to-date information. In particular, AXML provides control over the activation of service calls both from the client side (pull) and from the server side (push).

In AXML, all communications occur through service calls. Moreover, AXML encourages an approach to data exchange based on active messages, that is, messages that are AXML documents. The latter can be used both for the input parameters of service calls and for the returned results. Choosing which parts of a document should be given explicitly and which should be given in an active/intensional manner may be motivated and influenced by various parameters. These include logical considerations, such as knowledge enrichment or context awareness, and physical considerations, like performance or capabilities.

The research was partly funded by the RNTL Project e.dot and the French ACI MDP2P.

In the present paper, we focus on the use of AXML for the management of security and access control in the context of distributed data exchange. Such aspects are typically considered in isolation from query processing. Indeed, the historical data models (e.g., the relational model) do not directly address such issues. This is perhaps acceptable in a centralized context, where a unique system is in charge of all data management aspects. It is much less so in the context of distributed data management, in particular when the various systems that are involved are autonomous. We will argue here that the model we recently proposed, namely Active XML, overcomes this separation.

It should be noted that the idea of mixing data and code is not new. Functions embedded in data were already present in relational systems [15] as stored procedures. Also, method calls form a key component of object-oriented databases [8]. In the Web context, scripting languages such as PHP or JSP have made popular the integration of (query) processing inside HTML or XML documents. Embedding calls to Web services in XML documents is just one step further, but is indeed a very important one. The novelty here is that, since both XML and Web services are becoming standards, AXML documents can be universally understood, and therefore can be exchanged.

The rest of this paper is organized as follows. We first briefly recall some key aspects of XML, Web services (Section 2) and Active XML (Section 3). The following two sections informally discuss security and access control. The last section is a conclusion.

2. XML and Web services

XML

In this section, we briefly discuss the context of our work, i.e., XML and Web services. XML is a new data exchange format promoted by the W3C [21] and widely adopted by industry. An XML document can be viewed as a labeled ordered tree, as seen on the example of Figure 1 (we will see in the next section that this XML document is also an Active XML document). XML is becoming a lingua franca, or more precisely an agreed-upon syntax, that most pieces of software can understand or will shortly do. Unlike HTML, XML does not provide any information about the document presentation. This is typically provided externally, using a CSS or XSL stylesheet.

XML documents may be typed using a language called XML Schema [20]. A schema mainly enforces structural relationships between labels of elements in the document tree. For instance, it may request a movie element to consist of a title, zero or more actors, and reviews. The typing proposed by XML Schema is very flexible, in the sense that it can describe, for instance, an HTML webpage as well as a relational database instance, thus marrying the document world with the structured or semistructured world of databases. The presence of structure in XML documents enables the use of queries beyond keyword search, using query languages such as XPath or XQuery.

<directory>
  <movies>
    <director>Hitchcock</director>
    <sc>Hitchcock</sc>                <!-- embedded call to allocine.com -->
    <movie>
      <title>Vertigo</title>
      <actor>J. Stewart</actor>
      <actor>K. Novak</actor>
      <reviews>
        <sc>Vertigo</sc>              <!-- embedded call to cine.com -->
      </reviews>
    </movie>
    <movie>
      <title>Psycho</title>
      <actor>N. Bates</actor>
      <reviews>
        <sc>Psycho</sc>               <!-- embedded call to cine.com -->
      </reviews>
    </movie>
  </movies>
</directory>

Figure 1: An Active XML document and its tree representation (the tree form, in which the data returned by the call to allocine.com is shaded, is not reproduced).
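As a side illustration of the structural queries mentioned above (not part of the original paper), the following sketch loads a plain-XML version of the Figure 1 document with Python's standard ElementTree module and evaluates two XPath-style path queries over it.

    import xml.etree.ElementTree as ET

    doc = """
    <directory><movies><director>Hitchcock</director>
      <movie><title>Vertigo</title><actor>J. Stewart</actor><actor>K. Novak</actor></movie>
      <movie><title>Psycho</title><actor>N. Bates</actor></movie>
    </movies></directory>"""

    root = ET.fromstring(doc)

    # All movie titles: a structural query, not a keyword search.
    titles = [t.text for t in root.findall("./movies/movie/title")]

    # Actors of every movie (label-based navigation over the ordered tree).
    actors = [a.text for a in root.findall(".//movie/actor")]

    print(titles)   # ['Vertigo', 'Psycho']
    print(actors)   # ['J. Stewart', 'K. Novak', 'N. Bates']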

Web Services

Web services form a major step in the evolution of the Web. Based on Web services, Web servers, which originally provided HTML pages for human consumption, become gateways to distributed resources. Although most of the hype around Web services comes from e-commerce, one of their main current uses is for the management of distributed information. If XML provides the data model, Web services provide the adequate abstraction level to describe the various actors in data management, such as databases, wrappers or mediators, and to manage the communications between them.

Web services in fact consist of an array of emerging standards. To find the desired service, one can query the yellow pages: a UDDI [18] directory (Universal Description, Discovery and Integration). Then, to understand how to interact with it, one relies on WSDL [23] (Web Service Definition Language), something like Corba's IDL. One can then access the service using SOAP [17], an XML-based lightweight protocol for the exchange of information. Of course, life is more complicated, so one often has to sequence operations (see Web Services Choreography [24]) and consider issues such as confidentiality, transactions, etc.

XML and Web services are nothing really new from a technical viewpoint. However, their use as a large-scale infrastructure for data sharing and communication provides a new, trendy environment in which to utilize some old ideas in a new context. For instance, tree automata techniques regained interest as they best model essential manipulations on XML, like typing and querying. This new distributed setting for data and computation also raises many new challenges for computer science research in general, and data management in particular. For instance, data sources ought to be discovered and their information integrated dynamically, and this may involve data restructuring and semantic mediation. Classical notions such as data consistency and query complexity must be redefined to accommodate the huge size of the Web and its fast speed of change.

3. Active XML

To illustrate the power of combining XML and Web services, we briefly describe Active XML, a framework based on the idea of embedding calls to Web services in XML documents. This section is based on work done in the context of the Active XML project [7]. In Active XML (AXML for short), parts of the data are given explicitly, while other parts consist of calls to Web services that generate more data. AXML is based on a P2P architecture, where each AXML peer is a repository of persistent AXML documents. It acts as a client, by activating Web service calls embedded in its documents, and also acts as a server, by providing Web services that correspond to queries or updates over its repository of documents. The activation of calls can be finely controlled to happen periodically, in reaction to some particular event (in the style of database triggers), or in a lazy way, whenever a call may contribute data to the answer of a query.

AXML is an XML dialect, as illustrated by the document in Figure 1. (Note that the syntax is simplified in the example for presentation purposes.) The sc elements are used to denote embedded service calls. Here, reviews are obtained from cine.com, and information about more Hitchcock movies may be obtained from allocine.com. The data obtained from the call to allocine.com corresponds to the shaded part of the tree.
In case the relationship between the data and the service call is maintained, we say that the data is guarded by the call.
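The following sketch of call activation is ours, not the AXML implementation: a document tree holds sc nodes, and activating a call splices the returned fragment into the document alongside the call, which is kept so that the returned data remains guarded by it. The service registry, the service attribute and the service names are hypothetical stand-ins for real Web service endpoints.

    import xml.etree.ElementTree as ET

    # Hypothetical local stand-ins for remote Web services.
    SERVICES = {
        "allocine.com/moviesByDirector":
            lambda arg: ET.fromstring("<movie><title>The Birds</title></movie>"),
    }

    def activate_calls(parent):
        """Activate every embedded <sc> call under 'parent' and splice the
        returned fragment into the document, next to the call."""
        for child in list(parent):
            if child.tag == "sc":
                result = SERVICES[child.get("service")](child.text)
                idx = list(parent).index(child)   # current position of the call
                parent.insert(idx + 1, result)    # result stays guarded by the call
            else:
                activate_calls(child)             # results may embed further calls

    doc = ET.fromstring(
        '<movies><director>Hitchcock</director>'
        '<sc service="allocine.com/moviesByDirector">Hitchcock</sc></movies>')
    activate_calls(doc)
    print(ET.tostring(doc, encoding="unicode"))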

The data obtained by a call to a Web service may be viewed as intensional (it is originally not present). It may also be viewed as dynamic, since the same service call possibly returns different data when called at different times. When a service call is activated, the data it returns is inserted in the document that contains it. Therefore, documents evolve in time as a consequence of call activations. Of particular importance is thus the decision to activate a particular service call. In some cases, this activation is decided by the peer hosting the document. For instance, a peer may decide to call a service only when the data it provides is requested by a user; the same peer may choose to refresh the data returned by another call on a periodic basis, say weekly. In other cases, the service provider may decide to send updates to the client, for instance because the latter registered to a subscription-based, continuous service.

A key aspect of this approach is that AXML peers exchange AXML documents, i.e., documents with embedded service calls. Let us highlight an essential difference between the exchange of regular XML data and that of AXML data. In frameworks such as Sun's JSP or PHP, dynamic data is supported by programming constructs embedded inside documents. Upon request, all the code is evaluated and replaced by its result, to obtain a regular, fully materialized HTML or XML document. But since Active XML documents embed calls to Web services, and the latter provide a standardized interface, one does not need to materialize all the service calls before sending some data. Instead, a more flexible data exchange paradigm is possible, where the sender sends an XML document with embedded service calls (namely, an AXML document) and gives the receiver the freedom to materialize the data if and when needed.

More motivations for exchanging AXML data

We now briefly consider some extra motivations for peers to exchange active messages. A first family of motivations concerns enabling a client to autonomously reuse pieces of information in the result of a service invocation, without having to invoke the service again. Another main motivation is to provide dynamic information that adapts to changes over time. Also, an active answer allows the service to return directly some partial extensional information, along with some service calls to obtain more (in the style of the answers of search engines). In general, active answers provide a more flexible paradigm for dealing with information, supporting summarization, context awareness, access to metadata, access to methods to enrich data, etc.

Furthermore, besides such logical motivations for using active answers, the notion of an active answer may be useful for guiding the evaluation of queries, when information and query processing are distributed. To see one aspect that typically plays an important role in the performance and quality of Web servers, consider the issue of freshness. Suppose some Web server S publishes a book catalog (as a Web service) and that a comparative shopping site S' accesses this catalog service regularly. The cache of S' will contain the retrieved data, but information such as book prices that change rapidly will often be stale. The Web server S may decide to make the answers of its catalog service active, e.g. by returning the price information intensionally (i.e. as service calls).
Then, the book price information can be refreshed by calling the corresponding services, without having to reload the entire catalog. This results in savings in communication.

Supporting techniques

We have briefly discussed XML and Web services and the advantages of exchanging active data, and presented AXML, which enables such a style of data exchange. To conclude this section, we mention three important issues in this setting, and recent works performed in these directions.

To call or not to call. Suppose someone asks a query about the Vertigo movie. We may choose whether or not to call cine.com to obtain the reviews before sending the data. This decision may be guided by considerations such as performance, cost, access rights, security, etc. Now, if we choose to activate the service call, it may return a document with embedded service calls, and we have to decide whether to activate those or not, and so on, recursively. We introduce in [12] a technique to decide whether some calls should be activated or not, based on typing. First, a negotiation between the peers determines the schema of the data to exchange. Then, some complex automata manipulations are used to cast the query answer to the type that has been agreed upon. The general problem has deep connections with alternating automata, i.e., automata alternating between universal and existential states [14].

Lazy service calls and query composition. As mentioned earlier, it is possible in AXML to specify that a call is activated only when the data it returns may be needed, e.g., to answer a query. Suppose that a user has received some active document and wants to extract some information from it, by evaluating a query. A difficulty is then to decide whether the activation of a particular call is needed or not to answer that query. For instance, if someone asks for information about the actors of The 39 Steps by Hitchcock, we need to call allocine.com to get more movies by this director. Furthermore, if this service is sophisticated enough, we may be able to ask only for information about that particular movie (i.e., to push the selection to the source). Algorithms for guiding the invocation of relevant calls, and for pushing queries to them, are presented in [2]. Some surprising connections between this problem and optimization techniques for deductive databases and logic programming are exhibited in [6].

Cost model. We describe in [4] a framework for the management of distribution and replication in the context of AXML. We introduce a cost model for query evaluation and show how it applies to user queries and service calls. In particular, we describe an algorithm that, for a given peer, chooses the data and services that the peer should replicate in order to improve the efficiency of maintaining and querying its dynamic data. This is a first step towards controlling the use of intensional answers in a distributed setting.

Efficient query processing of structured and centralized data was made feasible by relational databases and their sound logical foundation [15, 5]. Deep connections with descriptive complexity have been exhibited [10, 5]. For the management of answers with active components, we are now at a stage where a field is building up and a formal foundation is still in its infancy. Some recent results are presented in [1, 6]. The development of this foundation remains a main challenge for researchers in the field.
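As an informal illustration of lazy activation (our own sketch, not the algorithm of [2]), the function below answers a query over an active document and activates an embedded call only when the extensional part of the document cannot already answer it; the services and data are invented.

    # Hypothetical services: name -> function returning a list of (title, actors).
    SERVICES = {
        "allocine.com/moviesByDirector":
            lambda director: [("The 39 Steps", ["R. Donat", "M. Carroll"])],
    }

    def actors_of(document, wanted_title):
        """Answer 'who acts in wanted_title?' over an active document,
        activating embedded calls lazily, only if the answer may need them."""
        # 1. Try the extensional part first.
        for title, actors in document["movies"]:
            if title == wanted_title:
                return actors
        # 2. Only now activate embedded calls that may contribute the answer.
        for service, arg in document["calls"]:
            fetched = SERVICES[service](arg)       # selection could be pushed here
            document["movies"].extend(fetched)     # materialize the result
            for title, actors in fetched:
                if title == wanted_title:
                    return actors
        return None

    doc = {"movies": [("Vertigo", ["J. Stewart", "K. Novak"])],
           "calls": [("allocine.com/moviesByDirector", "Hitchcock")]}
    print(actors_of(doc, "The 39 Steps"))   # activates the call lazily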

4. Security

A main goal of the present paper is to show how AXML provides a uniform framework for addressing standard query processing issues as well as issues, such as security for data management, that are typically considered separately. We believe that, in a Web context where various functionalities may be supported by different peers, it is of particular importance to provide an abstract model for distributed data management that captures these various viewpoints in a unique framework. Such a uniform model allows addressing issues such as data exchange protocol verification in a rigorous manner.

Security is a critical issue (arguably the most critical one) for Web applications. Not surprisingly, there is a lot of activity around XML, Web services and security. The W3C is promoting standardization efforts for security, e.g., for XML encryption [27], XML signature [25] or cryptographic keys [28]. The Security Assertion Markup Language [26] is an XML-based framework promoted by the Oasis consortium for exchanging security information.

Let us first illustrate by an example how basic security features may be supported in AXML. Suppose for instance that an AXML peer Bob wants to send a message to an AXML peer Alice, with a portion of the message encrypted. In AXML, the portion of the message to be encrypted is a sub-tree rooted at some particular node, say n. To encrypt it, Bob has to remove the children sub-trees of n, say t1, ..., tm, and to replace them with their encrypted value. In an AXML setting, encrypted XML data will be represented using the now standard XML Encryption, with the following syntax:

  <EncryptedData Id? Type? MimeType? Encoding?>
    <EncryptionMethod/>?
    <ds:KeyInfo>
      <EncryptedKey>?
      <AgreementMethod>?
      <ds:KeyName>?
      <ds:RetrievalMethod>?
      <ds:*>?
    </ds:KeyInfo>?
    <CipherData>
      <CipherValue>?
      <CipherReference URI?>?
    </CipherData>
    <EncryptionProperties>?
  </EncryptedData>

We will rely on a public-key encryption scheme. Each participant has a unique public/private key pair, denoted PUK/PRK. As usual, the private key is kept private, while the public one is made accessible to the world, through the following service:

  PublicKey@peer() -> string

A similar PrivateKey@peer() service exists, that can only be invoked by the peer itself. We also assume that each peer has available the following generic services, that respectively perform encryption and decryption:

  encrypt@local(publicKey, data) -> encryptedData
  decrypt@local(privateKey, encryptedData) -> data

Now, Bob first rewrites the data (to be sent) by replacing the children of n with:

  decrypt(PrivateKey@Alice(), encrypt(PublicKey@Alice(), t1, ..., tm))

Note that the resulting message has the same semantics (intensional content) as the original message. They only differ by the materialization of service calls. The exchange of information in AXML is guided by typing. The typing of the interface (typically the output type of a service) will specify that the information sent should not contain the service call encrypt. Thus, before sending the data, the encrypt service call will have to be performed by Bob, and Bob will send:

  decrypt(PrivateKey@Alice(), E)

where E is the encrypted value of t1, ..., tm, i.e. the result of encrypt(PublicKey@Alice(), t1, ..., tm). When receiving the message, Alice will have to decrypt it; indeed, she is the only one able to perform the decryption, since she is the only one who can perform PrivateKey@Alice(). This is of course a simple setting. More complicated distributed exchange protocols may be supported by AXML. For instance, it is very simple to capture signatures (by switching public and private keys in the above example), authentication or delegation of privileges.

5. Access Control

Suppose we want to control the access to some resource, say F. Then we can simply hide F and let users access it via a controlled service G. So, for instance, to obtain F(a) for some input data a, a peer will call G(a, l), where l is some login information, such as user name and password, possibly encrypted. The service G simply checks that the user has the proper access rights, calls F if the check succeeds, gets the result and returns it to the user. Note that this typically happens in a distributed setting, with the user, the database and the access control manager residing on different peers. Note also that alternative strategies are possible. For instance, the user may call the database, which calls the access control manager to check that this particular user possesses the proper access rights.

To ensure the privacy of data, one may also choose to use a finer-grained control over the requests initiated by users. In [29], we adopted the GUPster approach, which unifies access control and source descriptions by relying on a single query language to specify both. We use AXML and a single query rewriting mechanism to enforce them. GUPster access control can be naturally incorporated into AXML documents by providing it as Web services. More precisely, we use filtering services, which enforce the access control rules on the AXML data given as their parameter. These services are used to protect some AXML data by filtering the queries that can be evaluated on them, according to a set of access control rules. Note that the protected data does not have to be sent extensionally as a parameter, but can be represented intensionally by a service call, and thus hidden from the filtering service.

Let us see in more detail how this works. Suppose we want to evaluate a query on an AXML document for a particular user. This results in a call to some data guarded by a filtering service. The query is pushed to the filtering service. The filtering service rewrites it based on the access control rules defined for the particular user. The rewriting is performed by a GUPster service. Then, the resulting query is evaluated either directly (lazily) by the filtering service or via a request to the data source. The choice of a specific evaluation strategy is controlled by the input/output types specified for the filtering services, using techniques introduced in [12].
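A minimal sketch of the controlled-service pattern just described, with hypothetical names and rights throughout: G wraps the hidden resource F and consults an access-rights check before invoking it.

    # Hypothetical access-rights table: login credentials -> set of resources.
    RIGHTS = {("alice", "secret"): {"catalog"}}

    def F(a):
        """The hidden resource; never exposed directly to users."""
        return f"result of F({a})"

    def G(a, login):
        """The controlled service: checks access rights, then calls F.
        'login' plays the role of the (possibly encrypted) credentials l."""
        if "catalog" not in RIGHTS.get(login, set()):
            raise PermissionError("access denied")
        return F(a)                     # rights verified: delegate to F

    print(G("book #42", ("alice", "secret")))   # authorized call succeeds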

6. Conclusion

The relational model has been an important breakthrough for data management in centralized information systems. It brought a cleanly formalized data model, with strong logical foundations. The SQL query language was essential for the adoption of this model, since it gave a syntax to define queries. With the advent of the Web, new data management paradigms are being considered, to take advantage of what is essentially becoming an inherently distributed, planet-scale information system. Semistructured data, and its standard incarnation XML, are being recognized as the suitable model and language for data representation and exchange on the Web. XQuery, the query language for XML standardized by the W3C, is often advertised as the future SQL of the Web. However, XQuery is just one element of a solution to the issue of designing a language for Web data, since it primarily allows one to query centralized collections of documents. In some sense, it fails to capture the distributed essence of the Web.

With Active XML, we propose a first step towards a suitable data model and language for Web data management. The main contribution essentially consists in introducing intensional portions in semistructured documents, and enabling the exchange of this new kind of data. More precisely, we propose Active XML, a model for distributed data management, based on XML and Web services. In this model, AXML documents can integrate information from other Web sources, through embedded calls to Web services. AXML services open new perspectives for dynamic collaboration among systems on the Web, by enabling the exchange of AXML data.

The focus of the present paper was on aspects such as security and access control in an AXML setting. Typically, one would like to consider each aspect separately, in the style of aspect-oriented programming [13]. For instance, one would prefer to ignore security issues when designing a data management application. We are currently working on extending AXML to do just that. The idea is to abstract some particular aspect, such as security, using rewrite rules. The application programmer may then ignore this aspect while designing the application. The rewrite rules are then in charge of automatically rewriting the data (for instance at the time it is exchanged) to meet the requirements of the aspect. This allows for a more modular approach. We have designed and implemented such an extension of AXML. We are currently validating it by considering various settings, such as security management, access control, transaction processing or distributed query optimization.

References

[1] S. Abiteboul, O. Benjelloun, B. Cautis, I. Manolescu, T. Milo, N. Preda, Lazy Query Evaluation for Active XML, SIGMOD.
[2] S. Abiteboul, O. Benjelloun, B. Cautis, I. Fundulaki, T. Milo, A. Sahuguet, An Electronic Patient Record on Steroids: Distributed, Peer to Peer, Secure and Privacy Conscious (demo), VLDB.
[3] S. Abiteboul, O. Benjelloun, T. Milo, Positive Active XML, in Proc. of ACM PODS, 2004.
[4] S. Abiteboul, P. Buneman, D. Suciu, Data on the Web, Morgan Kaufmann Publishers.
[5] S. Abiteboul, A. Bonifati, G. Cobena, I. Manolescu, T. Milo, Active XML Documents with Distribution and Replication, ACM SIGMOD.
[6] S. Abiteboul, R. Hull, V. Vianu, Foundations of Databases, Addison-Wesley.

[7] S. Abiteboul, T. Milo, Web Services meet Datalog, 2003, submitted.
[8] The AXML project, INRIA.
[9] The Object Database Standard: ODMG-93, editor R. G. G. Cattell, Morgan Kaufmann, San Mateo, California.
[10] H. Comon, M. Dauchet, R. Gilleron, F. Jacquemard, D. Lugiez, S. Tison, M. Tommasi, TATA: Tree Automata Techniques and Applications.
[11] F. M. Cuenca-Acuna, C. Peery, R. P. Martin, T. D. Nguyen, Using Gossiping to Build Content Addressable Peer-to-Peer Information Sharing Communities, Department of Computer Science, Rutgers University.
[12] N. Immerman, Descriptive Complexity, Springer.
[13] G. Kiczales et al., Aspect-Oriented Programming, Proceedings of the European Conference on Object-Oriented Programming.
[14] M. Lenzerini, Data Integration: A Theoretical Perspective, ACM PODS 2002, Madison, Wisconsin, USA.
[15] T. Milo, S. Abiteboul, B. Amann, O. Benjelloun, F. Dang Ngoc, Exchanging Intensional XML Data, ACM SIGMOD.
[16] M. T. Ozsu, P. Valduriez, Principles of Distributed Database Systems, Prentice-Hall.
[17] A. Muscholl, T. Schwentick, L. Segoufin, Active Context-Free Games, Symposium on Theoretical Aspects of Computer Science.
[18] J. D. Ullman, Principles of Database and Knowledge-Base Systems, Volumes I and II, Computer Science Press.
[19] The SOAP Specification, version 1.2.
[20] Universal Description, Discovery and Integration of Web Services (UDDI).
[21] The Extensible Markup Language (XML).
[22] XML Typing Language (XML Schema).
[23] The World Wide Web Consortium (W3C).
[24] The W3C Web Services Activity.
[25] The Web Services Description Language (WSDL).
[26] W3C, Web Services Choreography.
[27] W3C XML Signature specification.
[28] The Security Assertion Markup Language.
[29] W3C XML Encryption specification.
[30] W3C XML Key Management specification.

The EDAM Project: Exploratory Data Analysis and Management at Wisconsin

Raghu Ramakrishnan (1)
Computer Sciences Department, University of Wisconsin-Madison
raghu@cs.wisc.edu

(1) This paper reports on joint work. Prof. J. J. Schauer at UW-Madison and Profs. D. G. Gross and D. R. Musicant at Carleton College are co-PIs on the EDAM project, which is supported by NSF ITR grant IIS. J-Y. Cai, B-C. Chen, L. Chen, Z. Huang, M. M. Shafer, and S. J. Wright are collaborators on various technical results mentioned here; see [7,10].

Abstract

Data mining has been a very active area of research in the database, machine learning, and mathematical programming communities in recent years. EDAM (Exploratory Data Analysis and Management) is a joint project between researchers in Atmospheric Chemistry and Computer Science at Carleton College and the University of Wisconsin-Madison that aims to develop data mining techniques for advancing the state of the art in analyzing atmospheric aerosol datasets. There is a great need to better understand the sources, dynamics, and compositions of atmospheric aerosols. The traditional approach for particle measurement, which is the collection of bulk samples of particulates on filters, is not adequate for studying particle dynamics and real-time correlations. This has led to the development of a new generation of real-time instruments that provide continuous or semi-continuous streams of data about certain aerosol properties. However, these instruments have added a significant level of complexity to atmospheric aerosol data, and dramatically increased the amounts of data to be collected, managed, and analyzed. Our ability to integrate the data from all of these new and complex instruments now lags far behind our data-collection capabilities, and severely limits our ability to understand the data and act upon it in a timely manner. In this paper, we present an overview of the EDAM project. While atmospheric aerosol analysis is an important and challenging domain that motivates us with real problems and serves as a concrete test of our results, our objective is to develop techniques that have broader applicability, and to explore some fundamental challenges in data mining that are not specific to any given application domain.

1. Introduction

Increasing concern over the role of atmospheric particles (aerosols) in global climate change, human health, and the Earth's ecosystem has created a great need to better understand the composition, origin, and influence of atmospheric pollutants. The traditional approach for particle measurement, which is the collection of bulk samples of particulates on filters, is not adequate for studying particle dynamics and real-time correlations. This has led to the development of a new generation of real-time instruments, including aerosol mass spectrometers, e.g., [9,11,13,15,18], which provide continuous or

semi-continuous streams of data about certain aerosol properties. However, these instruments have added a significant level of complexity to atmospheric aerosol data, and dramatically increased the amounts of data to be collected, managed, and analyzed. Our ability to integrate the data from all of these new and complex instruments now lags far behind our data-collection capabilities, and severely limits our ability to understand the data and act upon it in a timely manner.

Data mining has been a very active area of research in the database, machine learning, and mathematical programming communities in recent years, and there is a wealth of techniques that can be brought to bear on atmospheric aerosol datasets. In particular, we show how a powerful class of analysis techniques developed for analyzing customers' purchase histories can, unexpectedly, be brought to bear on mass spectrometry data by preprocessing it appropriately. Unfortunately, while some of these techniques are available in commercial data analysis products, many of the most useful ideas are of very recent origin, and are still at the research stage. Further, atmospheric aerosol analysis raises a number of challenges for which there is currently no satisfactory solution. These range from how to incorporate scientists' domain knowledge to data provenance, data validation and collaboration support. Large datasets that are gathered in real-time require robust quality-assurance protocols to ensure data reliability, and improved data management tools are a necessary component to achieve this goal.

The objectives of the EDAM project can be summarized as follows. We aim to apply and advance the state of the art in data mining in the following main ways:

Applying Currently Available Data Mining Techniques to Environmental Monitoring Tasks: A number of currently available techniques can be applied to real-time and semi-continuous atmospheric aerosol data streams to greatly mitigate pressing bottlenecks. Time-series analysis, clustering, and decision trees are well-known techniques for which robust software is available (both in commercial tools, and in freely distributed source code form). Rather surprisingly, a broad class of techniques (association rules and sequential patterns) developed for analyzing customer transactions is also applicable: we show how mass spectrometry data can be approximated in a form that mimics customer transactions. However, for many aerosol data analysis tasks, it is not clear what existing techniques (if any) are applicable, and how best to apply them. The Carleton and UW groups both include computer scientists as well as domain scientists (i.e., chemists, environmental engineers, and atmospheric scientists), because we anticipate that the key to solving such problems will be close, day-to-day interdisciplinary collaborations.

Developing Novel Mining Paradigms: There is no framework that enables scientists to create multi-step analyses using one or more mining techniques, and to focus the patterns generated by these techniques by incorporating domain knowledge into the analysis. We aim to generalize and adapt existing algorithms, or develop new ones when necessary, to create a suite of algorithms for traditional analysis tasks that can be easily combined and trained with a variety of additional knowledge.
A common pattern of multi-step mining arises when we want to find correlations between (parts of) different datasets, and is motivated by problems arising in combining mass spectrometry and environmental monitoring data.

This paper is organized as follows. In Section 2, we describe many tasks in environmental data analysis that can benefit from data mining techniques. As an example, we consider one of these tasks, labeling mass spectra, in Section 3. We then discuss, in Section 4, some data mining themes being investigated in the EDAM project, with environmental monitoring as one of the application areas but with broader applicability.

2. Data Mining for Environmental Monitoring

The EDAM project is driven by a number of data analysis and mining tasks arising in environmental monitoring, specifically, monitoring atmospheric aerosol particles. Examples of tasks we have considered include the following:

Interpreting ATOFMS Data: We discuss our results in this area in Section 3.

Scaling ATOFMS to External Measurements: The ATOFMS data stream is not internally calibrated to represent the mass concentration of each element present in aerosol samples. To this end, limited experiments that seek to calibrate the ATOFMS response with chemical measurements of aerosol samples collected on filters have suggested that, when the ATOFMS data is averaged over a moderate number of particles, the measurements can be directly converted to mass concentrations. Previous efforts relied on samples of aerosols collected with filter samplers that were co-located with the ATOFMS and analyzed in a laboratory for their average chemical composition, which was then plotted against the average ATOFMS data stream during the same time periods as the filter-based sample.

Particle Classification and Clustering: A major goal of particle analysis is to understand the sources of aerosols in the atmosphere. Classifying or clustering particles on the basis of their composition is one approach to identifying likely sources for the particles. Since tools have been developed that use chemical measurements of filter-based aerosol samples to understand their sources, the results of different clustering algorithms can be compared to filter-based source apportionment results to understand the utility of different clustering algorithms.

Using ATOFMS Data to Understand Aerosol Dynamics: Ambient datasets, where the ATOFMS instrument samples particles continuously over a period of time, have already been collected at several locations by the project team, and we anticipate collecting many similar datasets in the near future. The current capability to analyze this data to detect trends, internal structure, and correlations is significantly limited.

Fusion of ATOFMS Stream with Other Data Streams: The fusion of ATOFMS data with meteorological data and data on the concentration of gas-phase pollutants (e.g., ozone, carbon monoxide, sulfur dioxide, and nitrogen oxides) can be used to locate the sources of different aerosols. Combining ATOFMS data with other data is also key to understanding optical and chemical properties of aerosols. For example, relating the chemical composition of the carbonaceous fraction of aerosols to their light absorption properties has proven to be an intractable problem using traditional techniques. Since the ATOFMS provides a data stream with a less specific measurement of organic and elemental carbon that relates to the bulk chemical properties of the carbon-containing species, there is potential for the ATOFMS to help understand the chemical properties of aerosols that control the light-absorbing properties of the carbonaceous aerosols.
Through data mining tools, we hope to isolate the ATOFMS characteristics that best correlate with the light-absorbing properties of the aerosols.
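As a toy illustration of such stream fusion (our sketch; the fields and readings are invented, and pandas is assumed available), the snippet below aligns an hourly ATOFMS-derived carbon series with an ozone series on their time stamps and computes their correlation.

    import pandas as pd

    # Hypothetical hourly series: an ATOFMS-derived carbon signal and ozone.
    atofms = pd.DataFrame({
        "time": pd.date_range("2004-07-01", periods=6, freq="h"),
        "elemental_carbon": [0.8, 1.1, 1.4, 1.3, 0.9, 0.7],
    })
    gas = pd.DataFrame({
        "time": pd.date_range("2004-07-01", periods=6, freq="h"),
        "ozone_ppb": [31, 35, 42, 40, 33, 30],
    })

    # Fuse the two streams on their time stamps, then correlate.
    fused = pd.merge(atofms, gas, on="time")
    print(fused["elemental_carbon"].corr(fused["ozone_ppb"]))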

3. Interpreting ATOFMS Data: Labeling Spectra

In this section, we take a closer look at one of the specific environmental monitoring tasks of interest to us. Mass spectrometry techniques are widely used in many disciplines of science, engineering, and biology for the identification and quantification of elements, chemicals and biological materials. Historically, the specificity of mass spectrometry has been aided by upstream separation to remove mass spectral interference between different species. Examples include gas- (GCMS) and liquid-chromatography mass spectrometry (LCMS), and off-line wet chemistry clean-up techniques often employed upstream of inductively coupled plasma mass spectrometry (ICPMS). Historically, these techniques have been employed in laboratory settings that did not require real-time data collection, and the time required for separation and clean-up has been acceptable.

Figure 1: Mass spectrum labeling. An unlabeled spectrum (intensity versus m/z) is translated into a labeled spectrum whose peaks carry ion labels such as Na+, Al+, K2Cl+, Ca+, Fe+, FeO+, Fe2O+, Ba+, BaO+, BaCl+, BaFe+ and BaFeO+.

In the past decade, a wide range of real-time mass spectrometry instruments have been employed, and the nature of these instruments often precludes separation and clean-up steps. The mass spectrum produced for a particle in real-time by one of these instruments is therefore comprised of overlaid mass spectra from several substances, and the overlap between these spectra makes it difficult to identify the underlying substances. The Aerosol Time-of-Flight Mass Spectrometer (ATOFMS) [9,13,15,18] is an example. It is currently available commercially, and is used to monitor the size-resolved chemical composition of airborne particles. This instrument can obtain mass spectra for up to about 250 particles per minute, producing a time-series with unusual complexity. We have applied our results to ATOFMS data, although the data analysis challenges we describe are equally applicable to other real-time instruments that utilize mass spectrometry, such as the Aerosol Mass Spectrometer (AMS) [11].

Mass spectrum labeling consists of translating the raw plot of intensity versus mass-to-charge (m/z) value into a list of chemical substances or ions and their rough quantities (the quantities are omitted in Figure 1) present in the particle. Labeling spectra allows us to think of a stream of mass spectra as a time-series of observations, one per collected particle, where each observation is a set of ion-quantity pairs. This is similar to a time-series of transactions, each recording the items purchased by a customer in a single visit to a store, e.g., [1,4]. This analogy makes a wide range of association rule [3] and sequential pattern algorithms, e.g., [1], applicable to the analysis of labeled mass spectrometry data.

Our work to date in labeling mass spectra is summarized in two papers. In [10], we consider labeling individual spectra. Our results in this area include the following:

We introduce a new and important class of data mining problems involving the analysis of mass spectra. Labeling also allows us to view spectra abstractly as customer transactions and borrow analysis techniques such as frequent itemsets

and association rules. There is also a deeper connection to market-basket analysis: our formulation of labeling allows us to search for buying patterns of interest ("phenomena" [12]) rather than just sets of items frequently purchased together; this offers a promising direction for broad application of our results.

We introduce a rigorous framework for labeling and present an elegant theoretical characterization of ambiguity, which arises because of the presence of substances with overlapped spectra in the given input spectrum, and which controls whether or not there is a unique way to label the input spectrum. We extend the labeling framework to account for practical complexities such as noise, errors, and the presence of unknown substances, and present two algorithms for labeling a spectrum, together with several optimizations and theoretical results characterizing their behavior.

A major difficulty for researchers is the effort involved in generating accurately labeled ground-truth datasets. Such datasets are invaluable for training machine learning algorithms, for tuning algorithms for a given application context using domain knowledge, and for comparing different algorithms. We present a detailed synthetic data generator that is based on real mass spectra, conforms to realistic problem scenarios, and allows us to produce labeled spectra while controlling several fundamental parameters such as ambiguity and noise.

We discuss and rigorously define a metric for measuring the quality of labeling, and present a thorough evaluation of our labeling algorithm and a comparison with machine learning approaches, showing that our algorithms, although slower, achieve uniformly superior accuracy without relying on training datasets. In many real settings, it is unrealistic to expect labeled training sets (e.g., when deploying an instrument in a new location, or when the ambient conditions change significantly), and the ability to label without training sets is essential. However, some additional validation and calibration is required if we are to go beyond simple detection of compounds and achieve quantitatively reliable estimates; we are investigating learning from aggregates in this context (Section 4.2).

Finally, we apply our labeling algorithm to a collection of real spectra and compare our results with hand-labeling by domain scientists. While these experiments involve a small collection of spectra, because of the labor involved in hand-labeling, they demonstrate that our techniques, while requiring some improvement to handle practical complications, are nonetheless effective enough (achieving 93% accuracy in detecting true labels) to be immediately useful, and we have already deployed a tool incorporating our labeling techniques.

Our second main class of results looks at the problem of labeling a stream of spectra, and is presented in [7]. The instruments of interest analyze randomly sampled aerosol particles continuously. Thus, our true interest lies not in the labels of individual particles but in how the environment is evolving over time; e.g., rather than ask for the amount of mercury in a given particle, we want to find out the average amount of mercury observed per minute and how this changes over time. Group labeling offers many opportunities for more efficient as well as more accurate labeling.

There are several directions for future work:

Better labeling algorithms: Can we use the algorithms presented here to label spectra, and use the results as training sets for machine learning algorithms?
Can we adaptively learn important parameters such as the error bound? Can we address

peak drift? The overall objective is a fast hybrid algorithm that does not require manually labeled training data and that can adapt to changing input streams automatically.

Utilizing domain knowledge for labeling: Scientists usually have a priori knowledge about the spectra they collect; e.g., if ion A appears in one particle, then it is likely that ion B will also appear in the same particle. How can we let them specify this information and exploit it for better labeling? This is crucial for choosing the true label from the alternatives that arise due to ambiguity. As another example, domain knowledge could be used to produce synthetic training data that closely mimics expected real datasets, using the detailed generator that we described.

Discovering unknown signatures: A limitation of our framework is that it does not distinguish between noise and unknown signatures, and the precision of labeling is influenced by the existence of unknowns. Can we discover unknowns?

Applying our techniques to diverse real datasets: In this paper, we have abstracted many complications that arise in practice. We plan to validate our techniques by applying them to real spectral datasets, with ground-truth data collected simultaneously using filter-based techniques. This will also allow us to explore accurate calibration of quantity information.

Group labeling: Scientists are interested in many other kinds of labels (e.g., commonly occurring ions) when considering group labeling. We are investigating the design and implementation of a language for describing spectral patterns. This also appears to be of value when applied to customer transaction data, and properly generalizes the widely studied class of frequent itemset patterns.

Validation on real data: It is important to recognize that our experiment on a real data set is only a first step. A careful study requires considerable effort in acquiring and manually analyzing spectral data for a variety of applications, with varying characteristics (e.g., degree of ambiguity, number of unknowns, noise) and for instruments with vastly different characteristics from the ATOFMS. For example, in our experiment, most of the signatures in the library were relatively simple, in that they only contain one or two peaks, due to the nature of the process used to form ions in the ATOFMS. The Aerosol Mass Spectrometer (AMS) [11] represents the opposite extreme. Because the ATOFMS uses laser ablation techniques to form ions, the ion fragments that are formed are predominantly individual atoms or clusters of a few atoms. In contrast, the AMS uses thermal vaporization and electron impact (EI) ionization, which leads to significantly less molecular fragmentation and more complex signatures [18]. In-depth validation is beyond the scope of this paper, but is a central theme in our ongoing research.

4. New Mining Paradigms

In this section, we outline some grand-challenge problems that motivate us. These are problems that go beyond the specific domain of atmospheric aerosols, though they have many concrete applications in this domain. For citations and more details, see [17].

4.1 Multi-Step Mining

There is no general framework for systematically applying one or more analysis techniques to (parts of) a dataset in a multi-step mining process: there is neither a framework for the specification of such multi-step mining strategies, nor a framework for

optimizing the computation by taking the interplay of the different mining steps into account. Since much of the time involved in data mining efforts is spent in user-driven, iterative, exploratory application of data mining algorithms, rather than in the execution time of the algorithms themselves, progress on a compositional framework for multi-step mining can have a significant payoff, by reducing the real bottleneck in most mining efforts: the time taken by an analyst to digest the result of each analysis step and to set up the next step. Completely automatic discovery of truly useful insights is, in our opinion, a pipe dream in most application scenarios. A more realistic goal, perhaps, is a framework for a user to describe a space of exploratory sequences, together with notions of interestingness for the patterns or insights discovered thereby, and for the system to find ways to explore this space intelligently and efficiently.

Domain knowledge can be exploited in one of two ways: by using it to focus the patterns found by a given mining technique, and by using it to select appropriate techniques for different tasks and to combine the results. While much remains to be done, there has already been some work showing how mining algorithms can be adapted to incorporate prior knowledge. We seek to generalize and adapt existing algorithms, or develop new ones when necessary, to create a suite of algorithms for traditional analysis tasks that can be easily combined and trained with a variety of additional knowledge. We now describe one class of multi-step mining strategies that we intend to pursue to this end, called subset mining.

Database queries and data mining algorithms allow us to discover various properties of a given dataset, ranging from simple aggregates, such as average temperature by location, to more complex patterns, such as clusters and classes. Typically, data mining tasks consist of finding interesting patterns in a dataset. Often, however, the questions of interest have the form "Is there some subset of the data that is interesting?" The interestingness of a data subset can be measured in a number of ways, e.g., through a database query that computes an aggregate, or a data mining query that identifies a class of patterns and defines a measure of interestingness over patterns. We define subset mining to be the class of analysis tasks that search over subsets of a given dataset to identify interesting subsets. The distinguishing characteristic is that some computation is (in principle, at least) carried out over (all or several, possibly overlapping) subsets of a dataset. Clearly, the complexity of the base computation is amplified manifold by the number of potential subsets over which it must be iterated. Often, however, domain knowledge can be brought to bear on which subsets to consider potentially interesting, and there may be structural relationships between these subsets (e.g., disjointness or a predictable form of overlap). Exploiting these characteristics to arrive at an efficient evaluation plan can make the difference between whether or not a given subset mining task is computable, given reasonable computing resources.

We introduce the subset mining paradigm through an example. Consider a table (HgReadings) of hourly reactive mercury-level readings, with one row for each reading, and another table of particulate sulfate ion-concentration readings (IonReadings), also measured hourly.
If we want to find all times at which the reactive mercury level is high (above some threshold), this is a simple selection query over the first table. If we want to find the average concentration of, say, particulate sulfate ion, this is a simple aggregate query on the second table. Combining these queries, we can ask for the average

concentration of sulfate ion when the reactive mercury level is high. All three queries are readily expressed in SQL, the standard database query language. In contrast, consider the following query, which is of great interest in atmospheric studies seeking to understand the sources of reactive gaseous mercury: Are certain ranges of reactive mercury levels strongly correlated with unusually high concentrations of particulate sulfate ion? This is an example of a subset mining query, and it is not expressible in SQL without significant restrictions. As another example, if we have identified clusters based on the location of each reading, we can readily refine the previous query to ask whether there are such correlations (for some ranges of reactive mercury levels) at certain locations.

The main challenge is that we must consider all possible reactive mercury ranges and, for each, carry out (at a minimum) a SQL query that calculates the corresponding sulfate ion concentrations. In addition, like typical data mining questions, this query involves inherently fuzzy criteria ("strong correlation", "unusually high") in whose precise formulation we enjoy some latitude.

To summarize, there are three main parts to a subset mining query: (1) a criterion that generates several subsets of a table; (2) a correspondence (typically a relational expression, but possibly a data mining model, such as a decision tree that predicts a class-label column) that generates a second set for each of these subsets; and (3) a measure of interestingness for the second set (which indirectly serves as a similar measure for the original subset). To see why subset mining queries are especially useful for the integrated analysis of multiple datasets, using a combination of mining techniques, observe that steps (1) and (3) could both be based on the results of (essentially any) mining technique, rather than just simple SQL-style selections and aggregation. The computationally challenging aspect arises from the potentially large number of subsets involved, and from the computationally intensive nature of the criteria used for subset generation and interestingness measurement. (As a special case, we note that part (2) may be omitted, in which case we enumerate and identify interesting subsets of a single table.)

We note that subset mining is closely related to the subgroup discovery problem studied in inductive logic programming. Given a description language L and a valuation function d, the subgroup discovery problem consists of finding a set S of sentences such that the valuation of each sentence in S is greater than the valuation of any sentence not in S, and, further, each sentence is shorter than some threshold length k. Intuitively, each sentence describes a group of data objects, and we want to find the most concisely described groups with the highest value. Subset mining differs from subgroup discovery in two main ways. First, we have chosen to focus on an important three-step analysis pattern, and want to develop algebraic optimization approaches to exploit the connections between these steps. Second, we have not limited ourselves to a language-centric definition of subgroups; the three steps in our approach can use arbitrary (but well-defined) black boxes. There are numerous concrete instances in the environmental monitoring tasks described in Section 2 where the paradigm of subset mining is valuable; a sketch of the mercury/sulfate example appears below.
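Here is the promised sketch (ours; table contents, ranges and thresholds are invented): it enumerates candidate mercury ranges (step 1), collects the corresponding sulfate readings (step 2), and flags ranges whose average sulfate concentration is unusually high relative to the overall mean (step 3).

    # Hypothetical hourly readings: (hour, reactive mercury level) and
    # {hour: particulate sulfate concentration}.
    hg_readings  = [(0, 1.2), (1, 3.8), (2, 4.1), (3, 0.9), (4, 3.5), (5, 1.0)]
    ion_readings = {0: 2.0, 1: 6.5, 2: 7.1, 3: 1.8, 4: 6.0, 5: 2.2}

    overall_avg = sum(ion_readings.values()) / len(ion_readings)

    # Step (1): candidate subsets = hours whose Hg level falls in a range.
    # Step (2): correspondence = the sulfate readings for those hours.
    # Step (3): interestingness = average sulfate well above the overall mean.
    for lo in range(0, 5):
        hi = lo + 1
        hours = [h for h, hg in hg_readings if lo <= hg < hi]
        if not hours:
            continue
        avg = sum(ion_readings[h] for h in hours) / len(hours)
        if avg > 1.4 * overall_avg:          # crude "unusually high" test
            print(f"Hg in [{lo},{hi}): avg sulfate {avg:.2f} "
                  f"(overall {overall_avg:.2f})")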
A significant technical challenge will be to optimize subset mining, at least for several particular instances (i.e., query classes). This will require research into cost estimation for the data mining techniques used as components of the subset mining instance, as well as ways to push abstract constraints derived from the subset-generation component into the interestingness-measure component (and vice versa). While there are parallels in database query optimization and the evaluation of relational algebra operations, we expect that these issues will have to be tackled on a case-by-case basis for the different data mining techniques used to instantiate the subset mining framework. These are admittedly difficult challenges, but success is not all-or-nothing. Ultimately, we believe that the ability to identify interesting subsets of a large dataset (and not just specific patterns of interest) will be a significant step forward in the area of data mining, and that if we are able to articulate and efficiently support at least some instances of the paradigm, others in the field will extend the effort by addressing how other mining techniques can be supported in the subset mining context.

4.2 Learning from Aggregates

The task of labeling aerosol mass spectra poses an interesting challenge, and a natural approach to this problem offers promise for other problems, in particular privacy-preserving mining from aggregate information. The challenge is basically that it is unrealistic to expect labeled training data for tuning the labeling algorithm when we deploy the ATOFMS (or other mass spectrometry) instrument in a new environment. The reason for the lack of labeled spectra is simple: while the instruments continuously generate spectra for hundreds of particles per minute, manual labeling is still slow and tedious, and requires considerable domain expertise. On the other hand, it is reasonable to expect that, for certain compounds of interest, we can co-locate physical filters with the spectrometer and analyze these filters periodically. Thus, if we want to track mercury levels, we can periodically get accurate estimates of the total amount of mercury deposited on the filter and extrapolate to get the average concentration of mercury over that period. This gives rise to an interesting question: can we use this aggregate information about mercury levels to cross-check and refine our labeling algorithm for individual aerosol particles? The connection, of course, is that in principle the aggregate mercury level is due to the cumulative effect of the mercury contained in all the individual particles.

This question can be generalized as follows. Suppose that we have a table, say Customers, that we do not wish to divulge. However, we are willing to materialize and share views that compute aggregates over partitions of this table, using SQL's group-by clause. Can we use a given collection of views to learn models that are as accurate as those learnable from the Customers table itself? A related question is which aggregate views allow us to learn a sufficiently accurate model while satisfying a given privacy policy (dictating what alternative collections of aggregate views can be shared). The sketch below illustrates the question for the special case of a linear model.
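Whether this is possible in general is exactly the open question posed above; for a linear model the answer is illustrative and simple, because expectation is linear: E[y | group] = w . E[x | group], so per-group means are valid training points. The sketch below (NumPy only; the data and the group-by key are hypothetical) is an illustration of this special case, not a general answer.

```python
# Sketch: fitting a linear model from group-by aggregates instead of
# raw rows. Because expectation is linear, group means of features
# and target are consistent training points for a linear model.
import numpy as np

rng = np.random.default_rng(0)
n, d, n_groups = 10_000, 3, 50
X = rng.normal(size=(n, d))                 # private per-row features
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=n)
groups = rng.integers(0, n_groups, size=n)  # the SQL group-by key

# Materialized aggregate views: one row of means per (non-empty) group.
X_agg = np.vstack([X[groups == g].mean(axis=0)
                   for g in range(n_groups) if np.any(groups == g)])
y_agg = np.array([y[groups == g].mean()
                  for g in range(n_groups) if np.any(groups == g)])

# Least-squares fit on the aggregates only.
w_hat, *_ = np.linalg.lstsq(X_agg, y_agg, rcond=None)
print(w_hat)  # close to w_true, without access to individual rows
```

Nonlinear models do not enjoy this property, which is why the choice of aggregate views, and its interaction with privacy policies, remains an interesting problem.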

5. Conclusions

There is a general trend towards emphasizing applications, or even application domains, as an integral part of Computer Science research, and the EDAM project is an example of this trend. Environmental monitoring tasks are the motivation for much of the data mining research being carried out in the project, and the results offer promise of advancing the state of the art in monitoring atmospheric aerosols. At the same time, these tasks often inspired problem formulations that have much broader applicability; all in all, it has been a very stimulating approach to core Computer Science research.


Visual Analysis of Feature Selection for Data Mining Processes

Humberto L. Razente, Fabio Jun Takada Chino, Maria Camila N. Barioni, Agma J. M. Traina, Caetano Traina Jr.

ICMC - Instituto de Ciências Matemáticas e de Computação, USP - Universidade de São Paulo
Avenida do Trabalhador Sãocarlense, São Carlos, SP, Brazil
{hlr, chino, mcamila, agma, caetano}@icmc.usp.br

Abstract

The amount of data collected in the last decades has become a source of valuable information, allowing organizations to improve their competitiveness. However, the associated data analysis processes that transform it into useful information have become hard work. In many cases, the data is composed of many items and many dimensions of interest, making their comprehension awkward. The elimination of correlated features may reduce the complexity of the analysis techniques. The visual comparison of the results supplied by dimensionality reduction techniques also allows a better understanding of the results and may lead to the discovery of correlations among the features. This work presents a new technique named Vertical Data Splitting Visualization, which allows overlapping different mappings of the same multidimensional dataset into a three-dimensional Euclidean space, enabling the visual observation of existing correlations among features. The experiments showed that it is linearly scalable in both the number of features and the number of tuples mapped, and that it is fast, allowing interaction with datasets of hundreds of thousands of tuples.

1. Introduction

According to [9], the ability to collect and store data has far outpaced the ability to process and utilize it. This scenario has produced the phenomenon called data tombs, where data is deposited to merely rest in peace, since it will probably never be accessed again. Data mining techniques are highly desirable in order to make real use of this data and allow the extraction of useful information for the organizations that collected it. However, before these techniques can be applied, some preprocessing is usually needed. In this phase, one of the most important points is the high dimensionality of the datasets submitted to the analysis algorithms, since it impacts their performance. Using dimensionality reduction techniques in the preprocessing phase has become usual, since they reduce the complexity and improve the performance of data manipulation and data mining techniques. The aim of dimensionality reduction techniques is to achieve a compact dataset representation while reducing the loss of information. Dimensionality reduction techniques can be divided into feature extraction techniques and feature selection techniques. Feature extraction techniques change feature values and group information into a number of features smaller than the original dimensionality. On the other hand, feature selection techniques choose the most relevant existing features, without changing them.

In spite of the existence of many data mining tools, there is still a large gap between the human ability to find answers and the ability of the existing techniques to present this information in a meaningful manner [10]. Thus, the integration of dimensionality reduction techniques with information visualization approaches presents great potential to help in the discovery of hidden information in databases. Nowadays there are several information visualization techniques proposed in the literature, but the majority of them are not intuitive. The technique presented in this paper aims at helping feature selection using visual analysis. It allows the comparison of two multidimensional mappings of the same dataset, using two distinct subsets of features. Our goal is to give the user an intuitive view of how to select features for subsequent analysis processes, such as classification, clustering, outlier detection and pattern recognition. The technique was implemented as an extension of the FastMapDB tool [23, 3]. The tool's goal is to create multidimensional mappings of data stored in relational database management systems (RDBMS), allowing the interactive visualization of this data in 3D representations while preserving the distances among the objects.

The remainder of the paper is organized as follows. Section 2 presents related work. Section 3 briefly describes the FastMap algorithm, the feature extraction algorithm used in the experiments. Section 4 discusses the proposed visualization technique. Section 5 gives experimental results achieved with synthetic and real datasets, and presents benchmark results. Section 6 discusses the conclusions of this work.

2. Related Work

Automatic information discovery in databases has become a task of great importance. Nowadays there are several techniques and algorithms developed to help in this process, such as dimensionality reduction techniques and data visualization. According to [12], dimensionality reduction techniques can be divided into two groups: feature extraction and feature selection techniques. The general approach of the techniques based on feature extraction is to change the original representation of the dataset in order to reduce its dimensionality. It is important to note that these techniques must avoid losing inherent characteristics of the stored information [11]. One of the most used dimensionality reduction processes, which changes the spatial axes represented in the original dataset, is Principal Component Analysis [14]. Other well-known techniques are the SVD (Singular Value Decomposition) [17], the DFT (Discrete Fourier Transform) [19] and the DWT (Discrete Wavelet Transform) [7]. One of the landmarks among these techniques is FastMap [8]. This technique allows cluster recognition, exception detection and pattern recognition by mapping data from high-dimensional spaces to lower-dimensional spaces, preserving the characteristics of the original dataset through the computation of distances between pairs of objects. The FastMap technique allows the mapping and visualization of several data types, not only numbers but also strings or categorical data, as long as there is a distance defined among them. It is important to note that this technique has linear complexity in both the number of objects and the number of features.

The dimensionality reduction done by feature selection techniques is based on a process of choosing the most significant features to describe the dataset. The demand for the development of these techniques has attracted researchers' attention in the machine learning field (an interesting description of the existing methods can be found in [5]).
The fractal theory [4] has been employed in the database field to perform this task, since it obtains good results and is very fast. The idea of using the fractal theory to select features is to choose those features that most affect the correlation fractal dimension [24, 2, 22]. The embedded dimension of a dataset is defined as the number of its features, and the intrinsic dimension is defined as the smallest number of dimensions in which the data can be embedded without an expressive loss of information [4]. The intrinsic dimension of a dataset represents the spatial dimension of the object it describes, regardless of the space in which it is embedded. For instance, a line segment embedded in a 20-dimensional space has intrinsic dimension 1 and embedded dimension 20. The motivation to use the fractal theory comes from the observation that most real-world datasets have highly correlated features; in other words, the intrinsic (or fractal) dimension of these datasets is smaller than the embedded dimension.

An example of an algorithm that uses the fractal theory to select the most significant features is proposed in [24]. The main idea of this algorithm, called FDR (Fractal Dimension Reduction), is to drop the irrelevant features (that is, those that are not relevant for the characterization of the dataset). It is based on computing the correlation fractal dimension of the dataset considering all features, and then the fractal dimension of the dataset excluding one feature at a time. The features that do not affect the dataset's fractal dimension are considered correlated with at least one of the other features, and thus they can be dropped. A minimal sketch of this backward-elimination idea is shown below.
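The following sketch illustrates the idea under simplifying assumptions: it estimates the correlation fractal dimension by box-counting (the slope of the log of the sum of squared cell occupancies versus the log of the cell size) and greedily drops the feature whose removal changes the estimate the least. It is an illustration of the principle, not the FDR implementation of [24].

```python
# Sketch of the FDR idea: box-counting estimate of the correlation
# fractal dimension D2, plus greedy backward elimination of features.
import numpy as np

def correlation_fractal_dimension(data, levels=range(1, 8)):
    """D2 estimate: slope of log(sum of squared cell counts) vs log(r)."""
    data = (data - data.min(axis=0)) / (np.ptp(data, axis=0) + 1e-12)
    log_r, log_s2 = [], []
    for lv in levels:
        r = 2.0 ** -lv                      # cell side at this level
        cells = np.floor(data / r).astype(np.int64)
        _, counts = np.unique(cells, axis=0, return_counts=True)
        log_r.append(np.log(r))
        log_s2.append(np.log(np.sum(counts.astype(float) ** 2)))
    return np.polyfit(log_r, log_s2, 1)[0]  # the slope is D2

def fdr_select(data, n_keep):
    """Keep n_keep features, dropping the least influential ones."""
    kept = list(range(data.shape[1]))
    while len(kept) > n_keep:
        base = correlation_fractal_dimension(data[:, kept])
        # Drop the feature whose removal changes the dimension least:
        # it is (approximately) correlated with the remaining ones.
        drop = min(kept, key=lambda f: abs(
            base - correlation_fractal_dimension(
                data[:, [k for k in kept if k != f]])))
        kept.remove(drop)
    return kept
```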

Besides the dimensionality reduction techniques, some examples of visualization techniques proposed in the literature (based on geometric projections, icons, hierarchical representations and pixels) are presented in [15]. As far as the authors are aware, none of these techniques allows the comparison of data embedded in different spaces. There is also a large variety of commercial and public-domain visualization systems that include visualization techniques for data analysis, such as XmdvTool [21], HD-Eye [13] and VisDB [16]. Some of these systems also include a visual programming environment, such as IBM Visualization Data Explorer [1] and SGI Explorer [25]. Other systems combine analytical techniques and simulation with visualization, such as MatLab [18], Mathematica [26] and FastMapDB [23, 3]. However, none of these systems allows the discovery of correlations through the visual comparison of different data mappings, which is the goal of this work.

3. The FastMap Algorithm

The FastMap algorithm [8] is based on the projection of the n objects of a dataset in an E-dimensional space over an axis defined by a pair of objects from the dataset called pivots. The algorithm requires that the distances between pairs of objects in the original space can be computed with any user-defined distance function. The algorithm assumes that the distance between pairs of objects in the target space is the Euclidean distance, so the projection of the objects can be calculated using the Cosine Law. For a triangle O_a O_i O_b, the Cosine Law is intuitively presented in Figure 1 and is defined by Equation 1:

d_{b,i}^2 = d_{a,i}^2 + d_{a,b}^2 - 2 x_i d_{a,b}    (1)

Figure 1 illustrates the projection of object O_i with regard to the pivots O_a and O_b, where x_i is the projected distance between the object O_i and the pivot O_a. Solving Equation 1 for the projection gives x_i = (d_{a,i}^2 + d_{a,b}^2 - d_{b,i}^2) / (2 d_{a,b}). Once the mapping problem is solved for k = 1 (where k is the dimension of the target space), the remaining (E - 1) dimensions are projected onto an (E - 1)-dimensional hyperplane that intercepts the line O_a O_b, and the objects are then mapped onto this hyperplane.
Intuitively, the method treats each distance between a pair of objects as a spring between the objects, and tries to rearrange the positions of the n objects to minimize the stress of the springs. The stress function (Equation 2) returns the average relative error of the distances in the mapped space. Using \hat{d}_{i,j} to denote the distance between objects O_i and O_j in the mapping, and d_{i,j} to denote the Euclidean distance between these two objects, the distortion of the mapping is as small as the summation of the stresses between every pair of objects:

stress = \sum_{i,j} (\hat{d}_{i,j} - d_{i,j})^2 / \sum_{i,j} (d_{i,j})^2    (2)

[Figure 1: Projecting O_i on the line O_a O_b [8].]

Figure 2 shows the effect of the algorithm while trying to find the position of the object a with regard to the objects S_1, S_2 and S_3. With regard to S_1 and S_3 the object could be positioned as a', with regard to S_1 and S_2 it could be positioned as a'', and with regard to S_2 and S_3 it could be positioned as a'''. The object a is finally placed at the center of the figure, where it minimizes the stress function.

[Figure 2: The effect of the stress function in the mapping.]

The pivots are pairs of objects that are both far away from each other and close to the border of the dataset. The algorithm needs a pair of objects for each dimension of the target (mapped) space, chosen so that the axes approximate an orthogonal spatial basis. Finding the two objects farthest from one another would require computing the distances between every pair of objects, leading to O(n^2) distance calculations. Instead, the algorithm uses a heuristic to find a pair of objects whose distance is close to that of the farthest pair, as follows. First, an object is chosen at random, and the algorithm searches for the object farthest from it; that object is then used, in turn, to search for the object farthest from it. This process is repeated a small number of times, until objects close to the border of the dataset are found (empirically, 5 repetitions are usually enough [8]). Using this heuristic, the complexity of the FastMap algorithm is O(n). A minimal sketch of the algorithm follows.
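The sketch below follows the description above (farthest-point pivot heuristic, Cosine-Law projection, recursion on the residual distances). It is an illustration in Python with NumPy, restricted to the Euclidean distance for simplicity; it is not the FastMapDB implementation.

```python
# Minimal FastMap sketch: pivot heuristic, Cosine-Law projection
# (Equation 1 solved for x_i), and recursion on residual distances.
import numpy as np

def fastmap(X, k=3, n_heuristic=5, seed=0):
    """Map rows of X to k dimensions, approximately preserving
    Euclidean distances (any metric could be plugged into d2)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    P = np.zeros((n, k))  # output coordinates

    def d2(i, j, col):
        # Squared residual distance at recursion level `col`:
        # original squared distance minus the projections so far.
        base = np.sum((X[i] - X[j]) ** 2)
        return base - np.sum((P[i, :col] - P[j, :col]) ** 2)

    rng = np.random.default_rng(seed)
    for col in range(k):
        # Pivot heuristic: start at a random object and repeatedly
        # jump to the object farthest from the current one (5 rounds
        # are usually enough [8]); the last two endpoints are pivots.
        a = int(rng.integers(n))
        b = a
        for _ in range(n_heuristic):
            b, a = a, max(range(n), key=lambda j: d2(a, j, col))
        dab2 = d2(a, b, col)
        if dab2 <= 1e-12:
            break  # residual distances exhausted; stop early
        dab = np.sqrt(dab2)
        # Cosine-Law projection of every object on the line O_a O_b.
        for i in range(n):
            P[i, col] = (d2(a, i, col) + dab2 - d2(b, i, col)) / (2 * dab)
    return P
```

Calling fastmap(data, k=3) yields 3D coordinates of the kind the tool renders and compares.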

4. Vertical Data Splitting Visualization

The technique proposed in this work aims at showing two or more overlapped visualizations, each one created from a different subset of features over the same objects. Humans comprehend and capture visual information easily when it is embedded in three-dimensional (3D) representations, so this work explores representing information in 3D visualizations. The execution of two multidimensional mappings from different sets of features over the same dataset generates two distinct 3D mapped spaces; however, there is no correlation between these spaces. This work is based on the following conjecture: if the two mappings differ only by the exchange of correlated features between the two sets of features, then the distribution of the objects in both mappings should be similar, differing only by affine geometric transformations (translation, scale and rotation). Accordingly, we conjectured that if we adjust the two mappings as if they were the same, considering the pivots, then the other objects will be at very close positions in both mappings. Therefore, by comparing the displacement of each object between the two mappings, it is possible to see how much of the properties of one dataset are preserved in the second mapping. This is the conjecture upon which this work relies.

Intuitively, we can imagine that this technique maps a relation considering two blocks of information, each composed of a subset of the features of the relation, as if they were two vertical blocks of data. This is the reason we call it the Vertical Data Splitting Visualization. As an example, Figure 3 presents two visualizations of the same dataset, using a different feature subset for each visualization. This visualization uses the Education dataset, where each tuple corresponds to one of the 5508 Brazilian cities and is composed of 10 features with continuous values related to education rates for the year 2000, as defined in Table 1. Figure 3(a) shows a two-dimensional projection of the mapping obtained using the features {a1, a2, a3, a4, a5, a6, a7, a8, a9, a10}, and Figure 3(b) shows the mapping obtained using the features {a1, a4, a6, a9, a10}. In this figure we show the mappings spaced apart; however, in the FastMapDB tool, using animation and color resources, it is intuitive to distinguish them, and also to visualize them overlapped to highlight their differences.

Feature  Description
a1       % of 7 to 14 year-old kids attending school
a2       % of 7 to 14 year-old kids illiterate
a3       % of 7 to 14 year-old kids at least 1 school year late
a4       % of illiterate people older than 15
a5       % of illiterate people older than 25
a6       average number of school years for people older than 25
a7       % of people older than 25 that attended up to 4 school years
a8       % of people older than 25 that attended up to 8 school years
a9       % of people older than 25 that attended at least 11 school years
a10      average gross income

Table 1: Schema of the Education dataset. Source: IPEA, the Brazilian Institute of Applied Economic Research.

We can observe that the two mappings produce very similar 3D visualizations, and this tells us that the features {a2, a3, a5, a7, a8}, used in the mapping of Figure 3(a) but not in that of Figure 3(b), do not affect the general properties of the objects in the original space, indicating that they are correlated with at least part of the remaining features {a1, a4, a6, a9, a10}. Therefore, they can be obtained from the others, at least approximately. The feature set {a1, a4, a6, a9, a10} was chosen by running the FDR algorithm [24] over the Education dataset. As can be visually perceived in Figure 3, this selection is indeed meaningful.

[Figure 3: Education dataset mapped to 3 dimensions. The approximate fractal dimension of the dataset is 4.57. (a) Three-dimensional mapping considering the features {a1, a2, a3, a4, a5, a6, a7, a8, a9, a10}. (b) Three-dimensional mapping considering the features {a1, a4, a6, a9, a10}.]

In order to help the comparison between the mappings produced by the Vertical Data Splitting Visualization, it is necessary to minimize the differences in scale, rotation, translation and skewing caused by the independent mappings of the two feature subsets, adjusting the relative placement of both point sets in a single space. This can be performed by computing the composed transformation that changes the coordinates of one mapping into the coordinates of the other. These transformations can introduce adjustment errors, which grow as the differences between the distances of each pair of objects chosen for the adjustment increase. After computing the adjustments for scale, rotation and translation, each object is mapped and adjusted accordingly. Thereafter, a line segment can be drawn linking the representations of the same object in the two mappings. Objects represented at the same position in both mappings are plotted as a dot, while objects that change position are represented by line segments that grow with the displacement between the two representations. Moreover, different objects with equivalent displacements are rendered as line segments with similar directions, forming displacement fields; a minimal sketch of this rendering is shown below. The Vertical Data Splitting Visualization needs a few objects to be used as references for the adjustment; the pivots chosen by the FastMap algorithm are good candidates, since they lie on the border of the dataset. We developed two methods to adjust pairs of mappings, called Topological Fit and Best Fit; both use the set of pivots to adjust the mappings, as follows.
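As an aside, the dot-or-segment rendering just described can be sketched in a few lines. Matplotlib is assumed here purely for illustration; the tool itself provides its own interactive 3D viewer.

```python
# Sketch of the displacement rendering: after two 3D mappings P1 and
# P2 of the same objects are adjusted into a common space, draw each
# object as a segment from its position in P1 to its position in P2,
# or as a dot when the displacement is (near) zero.
import numpy as np
import matplotlib.pyplot as plt

def plot_displacements(P1, P2, eps=1e-3):
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    for p, q in zip(P1, P2):
        if np.linalg.norm(q - p) < eps:
            ax.scatter(*p, s=4, c="k")                 # unchanged object
        else:
            ax.plot(*zip(p, q), lw=0.5, c="tab:red")   # displacement segment
    plt.show()
```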

4.1 Topological Fit

The first method developed for the visualization adjustment is called Topological Fit. The idea is to adjust the objects in a predefined order, one object at a time, using geometric transformations that do not deform the mapping, i.e., a rigid-body geometric transformation (limited to translations, rotations and equal-factor scales). Figure 4 exemplifies the process. First, select two sets p and q with 3 objects each, p from one mapping and q from the other. Using these sets, find the transformation matrix M_T that translates the object q_0 to the position of the object p_0, the matrix M_{RS1} that rotates and scales the object q_1 to the position of the object p_1 (Figure 4(a)), and finally the matrix M_{R2} that rotates the object q_2 around the axis formed by the objects p_0 and p_1, approximating it to the object p_2 (Figure 4(b)). The transformation matrices are the usual ones for transforming points in 3D space, as shown in Figure 5 for the translation matrix M_T.

[Figure 4: Basic idea of the Topological Fit. (a) Translation of q_0 to the position of p_0, followed by the rotation and scale that move q_1 to the position of p_1. (b) Rotation around the axis formed by p_0 and p_1 that approximates q_2 to p_2.]

To obtain the transformation matrix M that computes the Topological Fit, these transformations are composed as M = M_T M_{RS1} M_{R2}. The composition is computed only once and then applied to all mapped objects of the second mapping, which keeps the method scalable. The composition is computed using homogeneous coordinates. As the FastMapDB tool maps the data to a three-dimensional space, the number of pivot objects is equal to 6. The Topological Fit is a rigid-body linear procedure, so any set of 3 objects chosen among these 6 always leads to the same distortion; we therefore use the pivots in the order chosen by the FastMap algorithm. A minimal sketch of the composition is shown below. Figure 6 presents an adjustment example computed with real data, showing all six pivots of the Education dataset.
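One way to realize this composition, sketched below, builds an orthonormal frame on each pivot triple and assembles translation, uniform scale and rotation into a single homogeneous 4x4 matrix; it maps q_0 exactly onto p_0 and q_1 exactly onto p_1, and brings q_2 into the plane of p_0, p_1, p_2. This is an illustration under our own formulation, not the FastMapDB implementation.

```python
# Sketch of a rigid-body (translation + rotation + equal-factor scale)
# fit of pivot triple q onto pivot triple p, as one homogeneous matrix.
# Assumes the pivot triples are non-degenerate (not collinear).
import numpy as np

def frame(c0, c1, c2):
    """Right-handed orthonormal frame spanned by a pivot triple."""
    e1 = (c1 - c0) / np.linalg.norm(c1 - c0)
    u = (c2 - c0) - np.dot(c2 - c0, e1) * e1
    e2 = u / np.linalg.norm(u)
    return np.column_stack([e1, e2, np.cross(e1, e2)])

def topological_fit(p, q):
    """4x4 matrix M with M q0 -> p0, M q1 -> p1, and q2 rotated
    about the p0-p1 axis towards p2 (Figure 4)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    s = np.linalg.norm(p[1] - p[0]) / np.linalg.norm(q[1] - q[0])
    R = frame(p[0], p[1], p[2]) @ frame(q[0], q[1], q[2]).T
    M = np.eye(4)
    M[:3, :3] = s * R               # rotation and equal-factor scale
    M[:3, 3] = p[0] - s * R @ q[0]  # translation
    return M

# Applying M to all points of the second mapping (rows of P2):
# P2_adj = (M @ np.c_[P2, np.ones(len(P2))].T).T[:, :3]
```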

M_T = \begin{pmatrix} 1 & 0 & 0 & x_{p_0} - x_{q_0} \\ 0 & 1 & 0 & y_{p_0} - y_{q_0} \\ 0 & 0 & 1 & z_{p_0} - z_{q_0} \\ 0 & 0 & 0 & 1 \end{pmatrix}

Figure 5: Transformation matrix M_T that translates the object q_0 to the position p_0.

[Figure 6: Three-dimensional adjustment of the Education dataset pivots. The features selected are the same used in Figure 3. (a) Non-adjusted pivots. (b) Topological Fit adjustment of the pivots (pivot objects 0, 1 and 2).]

4.2 Best Fit

The second method developed for the visualization adjustment tries to include all six pivots of both sets in the adjustment process using linear operations. However, the Best Fit adjustment does not limit the transformation to translations, rotations and scales. Figure 7 illustrates the problem, which consists in finding the transformation matrix that brings the pairs of reference objects as close as possible to each other:

\begin{pmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{pmatrix} \begin{pmatrix} x_1 & \cdots & x_6 \\ y_1 & \cdots & y_6 \\ z_1 & \cdots & z_6 \\ 1 & \cdots & 1 \end{pmatrix} = \begin{pmatrix} \bar{x}_1 & \cdots & \bar{x}_6 \\ \bar{y}_1 & \cdots & \bar{y}_6 \\ \bar{z}_1 & \cdots & \bar{z}_6 \\ 1 & \cdots & 1 \end{pmatrix}

Figure 7: Computing the transformation matrix that best fits the 1st set of objects to the 2nd set of objects. The sets are composed of the objects chosen as references for the adjustment in both mappings, and their homogeneous coordinates (x, y, z, 1) are arranged in columns.

The Best Fit method is based on the least squared error calculation [6], which distributes the error among the objects. Expanding the matrices of Figure 7 with the objects chosen for the adjustment, we obtain a system of 4n equations and 16 variables, where n is the number of objects used (in this case, n = 6). A linear approximation of this system produces the errors e(object_1), e(object_2), ..., e(object_n), which individually can be positive, negative or zero. The minimum squared error E is the sum of the individual squared errors, as defined in Equation 3:

E = [e(object_1)]^2 + [e(object_2)]^2 + ... + [e(object_n)]^2    (3)

In the next step the equation system is expanded into the standard form Ax = b, where A is the coefficient matrix of the variables, x is composed of the variables, and b is composed of the right-hand sides of the system. Ax = b has an exact solution for x if and only if the equations are mutually consistent; otherwise, the system has no exact solution, and we search for the minimum squared-error solution: the vector x satisfying the normal equations (Equation 4):

A^T A x = A^T b    (4)

Finally, the least squared error method uses a Gaussian elimination procedure to solve the simultaneous linear equations, based on partial pivoting by exchanging equations: the exchange ensures that the absolute value of the pivot entry in the active row is always greater than or equal to that of any element below it in the same matrix column. Figure 8 presents an example of this adjustment using the Education dataset; a minimal least-squares sketch follows.

[Figure 8: Three-dimensional adjustment of the Education dataset pivots. The features selected are the same used in Figure 3. (a) Non-adjusted pivots. (b) Best Fit adjustment of the pivots.]

Detailed algorithms for both the Topological Fit and the Best Fit methods can be found in [20].
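The sketch below solves the least-squares problem of Figure 7 with NumPy, whose lstsq routine handles the normal equations A^T A x = A^T b internally; it is an illustration, not the Gaussian-elimination implementation described in the paper.

```python
# Sketch of the Best Fit adjustment: find, in the least-squares sense,
# the 4x4 homogeneous matrix M mapping the reference (pivot) points of
# one mapping onto the other.
import numpy as np

def best_fit(p, q):
    """p, q: (n, 3) arrays of reference points (n >= 4, here n = 6).
    Returns the 4x4 matrix M minimizing ||M Qh - Ph||^2 column-wise."""
    Ph = np.c_[np.asarray(p, float), np.ones(len(p))]   # (n, 4)
    Qh = np.c_[np.asarray(q, float), np.ones(len(q))]   # (n, 4)
    # M Qh^T = Ph^T  <=>  Qh M^T = Ph, solved row by row via lstsq,
    # which computes the minimum squared-error solution of Equation 4.
    Mt, *_ = np.linalg.lstsq(Qh, Ph, rcond=None)
    return Mt.T

# Usage: M = best_fit(pivots_map1, pivots_map2)
# P2_adj = (M @ np.c_[P2, np.ones(len(P2))].T).T[:, :3]
```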

5. Experiments

In this section we present the results of applying the presented techniques to the synthetic Synthsierp dataset (to show the overall behavior of the techniques) and to the real-world Education dataset (to evaluate their usefulness). The results show two aspects: how the visualization resources help to understand synthetic and real datasets, and the scalability of the technique with respect to the number of features and the number of objects in a dataset. This paper aims at presenting the Vertical Data Splitting Visualization technique; due to space limitations, the extensive interaction tools and options implemented to support these concepts as extensions of the FastMapDB tool are not discussed. It is important to note that all visualizations produced by the tool are 3D and highly interactive. The tool is available for download.

5.1 Synthsierp Dataset

In this section we present visualizations of overlapped mappings of the Synthsierp dataset, whose schema is presented in Table 2. The features {b1, b2} correspond to the 2-dimensional coordinates of points of a Sierpinsky triangle in the (x, y) plane. Feature b3 holds random values in [0, 1]. The features b4 and b5 are computed from the previous ones by the expressions b4 = (b1)^2 + b2 and b5 = 5 b3 / 3, so we know they are correlated with the features b1, b2 and b3. The approximate fractal dimension of this dataset is about 2.6 (by construction, roughly the Sierpinsky dimension of 1.58 plus 1 for the independent random feature), which means that 3 features should be enough to represent the fundamental characteristics of the dataset.

Feature  Description
b1       X Sierpinsky triangle coordinate in [0, 1]
b2       Y Sierpinsky triangle coordinate in [0, 1]
b3       random value in [0, 1]
b4       (b1)^2 + b2
b5       5 b3 / 3

Table 2: Schema of the Synthsierp dataset. (A generator sketch for a dataset of this shape is given at the end of this subsection.)

Figure 9 presents the overlap of the mappings for the feature sets {b1, b2, b3, b4, b5} and {b1, b3, b5}. We can notice that the selection of the features {b1, b3, b5} is not a good choice to characterize the dataset. In fact, the visualization shows that failing to include the feature b2, or a feature correlated to it, in the subset of attributes does not adequately preserve the characteristics of the dataset.

[Figure 9: Three-dimensional visualizations of mappings of the Synthsierp dataset using non-correlated subsets. (a) Mapping of the feature set {b1, b2, b3, b4, b5}. (b) Mapping of the feature set {b1, b3, b5}.]

In the FastMapDB tool, the FDR algorithm elects the n most significant features, where n is the first integer greater than the fractal dimension of the dataset. In this way, FDR chooses the features {b1, b2, b3}. Figure 10 shows the overlapped visualizations of the two mappings considering the features {b1, b2, b3, b4, b5} and the features {b1, b2, b3}. In Figure 10(a) no adjustment was performed; we can only notice that the distributions of the generated objects are similar. Figure 10(b) presents the result of applying the Topological Fit, where a rigid-body transformation was applied. The result approximates the objects, adjusting them with respect to three reference objects (the third reference object sets the objects in the depth of the figure, which is not perceived in this two-dimensional view). Figure 10(c) presents the overlapped visualization after applying the Best Fit. The resulting visualization allows a better perception of the existing correlation among the features, since there is a smaller displacement between the objects. In Figures 9 and 10, only the dots that represent the objects were drawn, to avoid the cluttering that the line segments between the pairs of objects would cause. In the tool, the user can employ this resource to perform trend analysis and identify cluster movements.
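For reproducibility, the following sketch generates a dataset with the shape of Table 2, using the chaos game for the Sierpinsky coordinates; the tuple count is a free parameter, since the exact count used in the paper is not given here.

```python
# Sketch of a Synthsierp-like generator (Table 2): Sierpinsky triangle
# coordinates via the chaos game, one random feature, and two features
# derived from them (so b4 and b5 are correlated by construction).
import numpy as np

def synthsierp(n, seed=0):
    rng = np.random.default_rng(seed)
    vertices = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])
    pts = np.empty((n, 2))
    p = rng.random(2)
    for i in range(n):  # chaos game: jump halfway to a random vertex
        p = (p + vertices[rng.integers(3)]) / 2.0
        pts[i] = p
    b1, b2 = pts[:, 0], pts[:, 1]
    b3 = rng.random(n)
    b4 = b1 ** 2 + b2       # correlated with b1 and b2
    b5 = 5.0 * b3 / 3.0     # correlated with b3
    return np.column_stack([b1, b2, b3, b4, b5])
```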

[Figure 10: Three-dimensional mapping of the Synthsierp dataset. Overlap of the mappings for the feature sets {b1, b2, b3, b4, b5} and {b1, b2, b3} with: (a) no adjustment; (b) Topological Fit adjustment; (c) Best Fit adjustment.]

It should be noted that although the Best Fit adjustment presents a better pairing of the two mappings, it induces a skew in the visualization that does not occur in the Topological Fit adjustment. Therefore, each method has situations where it is preferable over the other.

5.2 Education Dataset

In this section we present visualizations of overlapped mappings of the Education dataset (Table 1). In a first experiment, we randomly chose five attributes to compare with the full set of attributes. Figure 11 presents the overlapped visualization of the mapped datasets considering the feature sets {a1, a2, a3, a4, a5, a6, a7, a8, a9, a10} and {a1, a2, a4, a5, a6}. As can be seen, the visual distribution of the objects in these figures does not allow one to identify any correlation.

[Figure 11: Three-dimensional mapping of the Education dataset. This selection of features does not allow the mappings to overlap. The mapping used the feature sets {a1, a2, a3, a4, a5, a6, a7, a8, a9, a10} and {a1, a2, a4, a5, a6} with: (a) no adjustment; (b) Topological Fit adjustment; (c) Best Fit adjustment.]

The approximate fractal dimension of the Education dataset is 4.57, and the FDR algorithm selected the features {a1, a4, a6, a9, a10} as the five most significant ones. Figure 12(a) presents the overlapped visualizations of the mappings of the Education dataset considering the feature sets {a1, a2, a3, a4, a5, a6, a7, a8, a9, a10} and {a1, a4, a6, a9, a10}. The result, in spite of the similar form, does not allow one to decide with certainty about the identification of existing correlations.

Figure 12(b) presents the visualization of the overlapped mappings after applying the proposed Topological Fit technique. The results are now closer, but not enough to allow a clear identification of the correlations. Figure 12(c) presents the overlapped visualizations of the mappings after applying the Best Fit method. Whereas it is still possible to notice small differences between the mappings in this figure, it is possible to recognize that the essential behavior of the dataset is preserved even when using just those five attributes.

[Figure 12: Three-dimensional mapping of the Education dataset. Visualizations of the overlapped mappings generated using the feature sets {a1, a2, a3, a4, a5, a6, a7, a8, a9, a10} and {a1, a4, a6, a9, a10}, with: (a) no adjustment; (b) Topological Fit adjustment; (c) Best Fit adjustment.]

Figure 12(c) allows verifying that the IPEA (Education) research data is well detailed, but a reduced feature subset can be used to perform analyses with excellent approximation. Therefore, once the features % of 7 to 14 year-old kids attending school (a1), % of illiterate people older than 15 (a4), average number of school years for people older than 25 (a6), % of people older than 25 that attended at least 11 school years (a9) and average gross income (a10) are selected, features such as % of 7 to 14 year-old kids at least 1 school year late (a3) and % of people older than 25 that attended up to 8 school years (a8) do not substantially affect the conclusions.

5.3 Scalability

The following experiments were run on an AMD Athlon XP computer with 512 MB of main memory and a 40 GB hard disk with an average seek time of 10 ms, running the Microsoft Windows 2000 Professional operating system. For the first scalability experiment we used two datasets, one with 50,000 and the other with 100,000 tuples, composed of 24 randomly generated, uniformly distributed features, the worst-case scenario for processing attributes. The graph of Figure 13(a) presents the scalability of the tool with respect to the number of features selected to be mapped. For the 100,000-tuple dataset, the time spent to map 12 features to 3 dimensions was approximately 2 seconds, while the time spent to map 24 features was under 4 seconds. Figure 13(b) presents the scalability of the tool with respect to the number of objects mapped. In this experiment we used a dataset of 1,200,000 objects with six randomly generated, uniformly distributed features, stored in an Interbase version 5.6 DBMS. The reading curve presents the time spent reading the dataset into main memory, and the mapping curve presents the time spent in the FastMap process on the described computer, with an increasing number of objects. The time spent in the mapping process for the full dataset was approximately 15 seconds. The third scalability test considered the time to show the mapped results. The prototype


More information

Considering Human Factors in Workflow Development

Considering Human Factors in Workflow Development Considering Human Factors in Workflow Development Lucinéia Heloisa Thom 1,2, Cirano Iochpe 1,2, Ida Gus 2 1 Instituto de Informática Universidade Federal do Rio Grande do Sul (UFRGS) Caixa Postal 15.064

More information

Subject of the Internship: [236] "Extending CHR with Components"

Subject of the Internship: [236] Extending CHR with Components Subject of the Internship: [236] "Extending CHR with Components" This internship s theme was defined in a cooperation by my Master Thesis advisor, Prof. Jacques Robin and the Internship s tutor, Dr. François

More information

The Need of Continuous Teachers Training in the Coaching of Vocational Education 1

The Need of Continuous Teachers Training in the Coaching of Vocational Education 1 Casanova, M. P., Santos, C. e Silva, M. A. (2014). The Need of Continuous Teachers Training in the Coaching of Vocational Education. In ECER 2014. The Past, Present and Future of Educational Research in

More information

13th International Public Communication of Science and Technology Conference 5-8 May 2014, Salvador, Brazil

13th International Public Communication of Science and Technology Conference 5-8 May 2014, Salvador, Brazil POP Science UFU: science popularization on initiatives at public schools and broadcast on radio, tv, newspapers and internet, made by professors and students of the Journalism course Adriana C. Omena Santos,

More information

Client and market relations

Client and market relations Client and market relations Our relations with clients and the market are underpinned by ethics and independence and are one of fundamentals for ensuring the quality of our work and the sustainability

More information

LEARNING OBJECTS AND VIRTUAL REALITY IN TEACHING DESCRIPTIVE GEOMETRY

LEARNING OBJECTS AND VIRTUAL REALITY IN TEACHING DESCRIPTIVE GEOMETRY 1 LEARNING OBJECTS AND VIRTUAL REALITY IN TEACHING DESCRIPTIVE GEOMETRY 04/2009 Alvaro José Rodrigues de Lima - EBA - GERGAV - UFRJ, alvarogd@globo.com Luciana Guimarães Rodrigues de Lima - LATEC- GERGAV

More information

How To Teach Bioethics In Brazilian Dentistry

How To Teach Bioethics In Brazilian Dentistry Cariology Braz Oral Res 2006;20(4):285-9 Bioethics Teaching of Bioethics in Dental Graduate Programs in Brazil Ensino de bioética em programas de pós-graduação em Odontologia no Brasil Carolina Patrícia

More information

Boletim Técnico. Esta implementação consiste em atender a legislação do intercâmbio eletrônico na versão 4.0 adotado pela Unimed do Brasil.

Boletim Técnico. Esta implementação consiste em atender a legislação do intercâmbio eletrônico na versão 4.0 adotado pela Unimed do Brasil. Produto : Totvs + Foundation Saúde + 11.5.3 Chamado : TFOQEI Data da criação : 27/08/12 Data da revisão : 10/09/12 País(es) : Brasil Banco(s) de Dados : Esta implementação consiste em atender a legislação

More information

Erasmus Mundus External Cooperation Windows. EU-BRAZIL STARTUP Call for applications outcome 08/02/2010

Erasmus Mundus External Cooperation Windows. EU-BRAZIL STARTUP Call for applications outcome 08/02/2010 L15a0900007 L15a0900027 L15a0900041 L15a0900051 L15a0900057 L15a0900080 L15a0900081 L15a0900084 L15a0900088 L15a0900090 L15a0900093 L15a0900095 L15a0900096 L15a0900099 L15a0900106 L15a0900107 L15a0900115

More information

Evaluation of the image quality in computed tomography: different phantoms

Evaluation of the image quality in computed tomography: different phantoms Artigo Original Revista Brasileira de Física Médica.2011;5(1):67-72. Evaluation of the image quality in computed tomography: different phantoms Avaliação da qualidade de imagem na tomografia computadorizada:

More information

INTEGRATING STRATEGIC AND SUSTAINABLE DESIGN APPROACH FOR INNOVATION PROCESS: A CEMENT CASE STUDY

INTEGRATING STRATEGIC AND SUSTAINABLE DESIGN APPROACH FOR INNOVATION PROCESS: A CEMENT CASE STUDY 68 CRISTIANO ALVES Dr.; Universidade Federal do Rio Grande do Norte cralves@dcdesign.com.br SUSIE MACEDO Ma.; SENAI/ FIERN GILENO NEGREIROS Me.; SENAI/FIERN gilenonegreiros@rn.senai.br KILDER RIBEIRO Dr.;

More information

THINK SUCCESS MAKE IT HAPPEN ANNA NOT MISSING HER ENGLISH CLASS. myclass AN ENGLISH COURSE THAT FITS YOUR LIFE

THINK SUCCESS MAKE IT HAPPEN ANNA NOT MISSING HER ENGLISH CLASS. myclass AN ENGLISH COURSE THAT FITS YOUR LIFE THINK SUCCESS MAKE IT HAPPEN ANNA NOT MISSING HER ENGLISH CLASS myclass AN ENGLISH COURSE THAT FITS YOUR LIFE Porquê myclass Why myclass? A importância do Inglês é fundamental tanto na construção da sua

More information

EvolTrack: A Plug-in-Based Infrastructure for Visualizing Software Evolution

EvolTrack: A Plug-in-Based Infrastructure for Visualizing Software Evolution EvolTrack: A Plug-in-Based Infrastructure for Visualizing Software Evolution Cláudia Werner 1, Leonardo Murta 2, Marcelo Schots 1, Andréa M. Magdaleno 1,3, Marlon Silva 1, Rafael Cepêda 1, Caio Vahia 1

More information

The scientific production in health and biological sciences of the top 20 Brazilian universities

The scientific production in health and biological sciences of the top 20 Brazilian universities Brazilian Journal of Medical and Biological Research (2006) 39: 1513-1520 The scientific production of the top 20 Brazilian universities ISSN 0100-879X Concepts and Comments 1513 The scientific production

More information

Endnote Web tutorial for BJCVS/RBCCV

Endnote Web tutorial for BJCVS/RBCCV Oliveira MAB, SPECIAL et al. - Endnote ARTICLE Web tutorial for BJCVS/RBCCV Endnote Web tutorial for BJCVS/RBCCV Tutorial do Endnote Web para o BJCVS/RBCCV Marcos Aurélio Barboza de Oliveira 1, MD, PhD;

More information

GEOSPATIAL METADATA RETRIEVAL FROM WEB SERVICES

GEOSPATIAL METADATA RETRIEVAL FROM WEB SERVICES GEOSPATIAL METADATA RETRIEVAL FROM WEB SERVICES Recuperação de metadados geoespaciais a partir de serviços web IVANILDO BARBOSA Instituto Militar de Engenharia Praça General Tibúrcio, 80 Praia Vermelha

More information

PUBLIC PARTICIPATION IN THE SEA PROCESS OF BRAZIL

PUBLIC PARTICIPATION IN THE SEA PROCESS OF BRAZIL PUBLIC PARTICIPATION IN THE SEA PROCESS OF BRAZIL 1. Introduction Maria José Ferreira Berti (FZEA-USP)ad.mari@hotmail.com Ana Paula Alves Dibo (EESC-USP) anapauladibo@yahoo.com.br Roberta Sanches (EESC-USP)admrosanches@yahoo.com.br

More information

EXAME DE PROFICIÊNCIA EM INGLÊS PARA PROCESSOS SELETIVOS DE PROGRAMAS DE PÓS-GRADUAÇÃO DA UFMG

EXAME DE PROFICIÊNCIA EM INGLÊS PARA PROCESSOS SELETIVOS DE PROGRAMAS DE PÓS-GRADUAÇÃO DA UFMG IDIOMA ÁREA Centro de Extensão da Faculdade de Letras da Universidade Federal de Minas Gerais CENEX-FALE/UFMG Av. Antonio Carlos, 6627 Faculdade de Letras Sala 1000-A Belo Horizonte - MG - CEP: 31270-901

More information

CURRICULUM VITAE. Assistant Professor at Departamento de Gestão de Empresas, Universidade de Évora.

CURRICULUM VITAE. Assistant Professor at Departamento de Gestão de Empresas, Universidade de Évora. CURRICULUM VITAE Name: Cesaltina Maria Pacheco Pires Place and Date of Birth: Janeiro de Baixo, Portugal, 03/10/63 Home Address: Quinta do Evaristo, 101, Apartado 442, 7002-505 Évora, Portugal. Phone:

More information

ENERGY EFFICIENCY IN PRODUCTION ENGINEERING COURSES

ENERGY EFFICIENCY IN PRODUCTION ENGINEERING COURSES ENERGY EFFICIENCY IN PRODUCTION ENGINEERING COURSES Antonio Vanderley Herrero Sola UTFPR sola@pg.cefetpr.br Antonio Augusto de Paula Xavier UTFPR augustox@cefetpr.br João Luiz Kovaleski UTFPR kovaleski@pg.cefetpr.br

More information

ENGLISH VERSION Regulation of the University of Lisbon School of Law Journal (Lisbon Law Review) - Cyberlaw by CIJIC. CHAPTER I General rules

ENGLISH VERSION Regulation of the University of Lisbon School of Law Journal (Lisbon Law Review) - Cyberlaw by CIJIC. CHAPTER I General rules ENGLISH VERSION Regulation of the University of Lisbon School of Law Journal (Lisbon Law Review) - Cyberlaw by CIJIC CHAPTER I General rules Article 1 Object This Regulation contains the rules applicable

More information

Arcivaldo Pereira da Silva Mine Engineer, MBA Project Management e Master Engineer

Arcivaldo Pereira da Silva Mine Engineer, MBA Project Management e Master Engineer Arcivaldo Pereira da Silva Mine Engineer, MBA Project Management e Master Engineer RUA FLAVITA BRETAS, 79, APTO 1101 BAIRRO LUXEMBURGO - BELO HORIZONTE / MG TEL: +55 (31) 32975086 / (31) 93031473 arcivaldo@ig.com.br

More information

Lecture Notes in Computer Science: Collaborative Software

Lecture Notes in Computer Science: Collaborative Software Lecture Notes in Computer Science: Collaborative Software André Tiago Magalhães do Carmo 1, César Barbosa Duarte 1, Paulo Alexandre Neves Alves de Sousa 1 and Ricardo Filipe Teixeira Gonçalves 1. 1 Departamento

More information

SBMF 2015 ANAIS PROCEEDINGS. 18 th BRAZILIAN,SYMPOSIUM,ON,FORMAL,METHODS September,21>22,,2015 Belo,Horizonte,,MG,,Brazil,

SBMF 2015 ANAIS PROCEEDINGS. 18 th BRAZILIAN,SYMPOSIUM,ON,FORMAL,METHODS September,21>22,,2015 Belo,Horizonte,,MG,,Brazil, I SBMF 2015 18 th BRAZILIANSYMPOSIUMONFORMALMETHODS September21>222015 BeloHorizonte MGBrazil ANAIS PROCEEDINGS COORDENADORESDOCOMITÊDEPROGRAMADOSBMF2015 PROGRAM'COMMITTEE'CHAIRS'OF'SBMF'2015 MárcioCornélio(UFPEBrazil)

More information

Seu servidor deverá estar com a versão 3.24 ou superior do Mikrotik RouterOS e no mínimo 4 (quatro) placas de rede.

Seu servidor deverá estar com a versão 3.24 ou superior do Mikrotik RouterOS e no mínimo 4 (quatro) placas de rede. Provedor de Internet e Serviços - (41) 3673-5879 Balance PCC para 3 links adsl com modem em bridge (2 links de 8mb, 1 link de 2mb). Seu servidor deverá estar com a versão 3.24 ou superior do Mikrotik RouterOS

More information

Mobile First, Cloud First

Mobile First, Cloud First DX VIPP Pilot available 15 countries worldwide 4 industry verticals Healthcare Manufacturing H Government Retail US Germany India UK Israel Canada Netherlands Australia France Russia Brazil Italy China

More information

Progress Report 2011-2013

Progress Report 2011-2013 Progress Report 2011-2013 Instituto de Pesquisas Energéticas e Nucleares São Paulo - Brasil 2014 Progress Report 2011-2013 Instituto de Pesquisas Energéticas e Nucleares São Paulo - Brasil 2014 Progress

More information

Towards Requirements Engineering Process for Embedded Systems

Towards Requirements Engineering Process for Embedded Systems Towards Requirements Engineering Process for Embedded Systems Luiz Eduardo Galvão Martins 1, Jaime Cazuhiro Ossada 2, Anderson Belgamo 3 1 Universidade Federal de São Paulo (UNIFESP), São José dos Campos,

More information

TRANSACÇÕES. PARTE I (Extraído de SQL Server Books Online )

TRANSACÇÕES. PARTE I (Extraído de SQL Server Books Online ) Transactions Architecture TRANSACÇÕES PARTE I (Extraído de SQL Server Books Online ) Microsoft SQL Server 2000 maintains the consistency and integrity of each database despite errors that occur in the

More information

Practice Areas. Tax Planning and identification of opportunities to reduce tax burden;

Practice Areas. Tax Planning and identification of opportunities to reduce tax burden; R u a Sa m u e l M o r s e, 1 3 4, c o n j. 1 7 2 C E P 0 4 5 7 6-0 6 0 B r o o k l i n N o v o S ã o P a ul o S P B r a s i l T e l. : ( 5 5 1 1 ) 4 0 9 5-4 7 0 0 F a x. : ( 5 5 1 1 ) 4 0 9 5-4 7 1 2

More information

José M. F. Moura, Director of ICTI at Carnegie Mellon Carnegie Mellon Victor Barroso, Director of ICTI in Portugal www.cmu.

José M. F. Moura, Director of ICTI at Carnegie Mellon Carnegie Mellon Victor Barroso, Director of ICTI in Portugal www.cmu. José M. F. Moura, Director of ICTI at Victor Barroso, Director of ICTI in Portugal www.cmu.edu/portugal Portugal program timeline 2005: Discussions and meeting with Ministry of Science Technology, Higher

More information

Strategic Planning in Universities from Pará, Brazil. Contributions to the Achievement of Institutional Objectives

Strategic Planning in Universities from Pará, Brazil. Contributions to the Achievement of Institutional Objectives Scientific Papers (www.scientificpapers.org) Journal of Knowledge Management, Economics and Information Technology Strategic Planning in Universities from Pará, Brazil. Contributions to the Achievement

More information

Interface Design for Mobile Devices Workshop [IDMD]

Interface Design for Mobile Devices Workshop [IDMD] Interface Design for Mobile Devices Workshop [IDMD] Future Places Porto Mónica Mendes & Nuno Correia Porto October 2009 Interface Design for Mobile Devices Workshop Mónica Mendes & Nuno Correia Future

More information

Vitor Emanuel de Matos Loureiro da Silva Pereira

Vitor Emanuel de Matos Loureiro da Silva Pereira Vitor Emanuel de Matos Loureiro da Silva Pereira Full name: Vitor Emanuel de Matos Loureiro da Silva Pereira Birth Date: 09-02-1972 Gender: M Nationality: PORTUGAL Email: v.pereira@fam.ulusiada.pt URL:

More information

WEB-Based Automatic Layout Generation Tool with Visualization Features

WEB-Based Automatic Layout Generation Tool with Visualization Features WEB-Based Automatic Layout Generation Tool with Visualization Features João D. Togni* André I. Reis R. P. Ribas togni@inf.ufrgs.br andreis@inf.ufrgs.br rpribas@inf.ufrgs.br Instituto de Informática UFRGS

More information

QUESTIONÁRIOS DE AVALIAÇÃO: QUE INFORMAÇÕES ELES REALMENTE NOS FORNECEM?

QUESTIONÁRIOS DE AVALIAÇÃO: QUE INFORMAÇÕES ELES REALMENTE NOS FORNECEM? QUESTIONÁRIOS DE AVALIAÇÃO: QUE INFORMAÇÕES ELES REALMENTE NOS FORNECEM? Grizendi, J. C. M grizendi@acessa.com Universidade Estácio de Sá Av. Presidente João Goulart, 600 - Cruzeiro do Sul Juiz de Fora

More information

Supervised internship data evolution and the internationalization of engineering courses

Supervised internship data evolution and the internationalization of engineering courses Supervised internship data evolution and the internationalization of engineering courses R. Vizioli 1 PhD student Department of Mechanical Engineering at the Escola Politécnica of the University of São

More information

SUPPLEMENTAL INFORMATION 4.10 STUDENTS PARTICIPATING OF INTERNATIONAL ACADEMIC EXCHANGE PROGRAMS 2009 2015

SUPPLEMENTAL INFORMATION 4.10 STUDENTS PARTICIPATING OF INTERNATIONAL ACADEMIC EXCHANGE PROGRAMS 2009 2015 SUPPLEMENTAL INFORMATION.0 STUDENTS PARTICIPATING OF INTERNATIONAL ACADEMIC EXCHANGE PROGRAMS 00 0 Number of Students from PUC School of Architecture who participated in international exchange programs

More information

Marketing strategies for workplace safety consultants of small Business

Marketing strategies for workplace safety consultants of small Business Marketing strategies for workplace safety consultants of small Business Tonny Kerley de Alencar Rodrigues (tonny.rodrigues@coppead.ufrj.br) Coppead/Federal University of Rio de Janeiro Átila de Melo Lira

More information

Natal RN 23 a 25 de Maio de 2007 ANAIS 2007. Departamento de Informática e Matemática Aplicada Universidade Federal do Rio Grande do Norte

Natal RN 23 a 25 de Maio de 2007 ANAIS 2007. Departamento de Informática e Matemática Aplicada Universidade Federal do Rio Grande do Norte XI SIMPÓSIO BRASILEIRO DE LINGUAGENS DE PROGRAMAÇÃO Natal RN 23 a 25 de Maio de 2007 ANAIS 2007 Promoção Comissão Especial de Linguagens de Programação SBC - Sociedade Brasileira de Computação Organização

More information

Prova Escrita de Inglês

Prova Escrita de Inglês EXAME FINAL NACIONAL DO ENSINO SECUNDÁRIO Prova Escrita de Inglês 11.º Ano de Escolaridade Continuação bienal Decreto-Lei n.º 139/2012, de 5 de julho Prova 550/1.ª Fase 8 Páginas Duração da Prova: 120

More information