ISUM 2012 Conference Proceedings Editorial Board


Editor: Dr. Moisés Torres Martínez

Primary Reviewers
Dr. Alonso Ramírez Manzanares (UG)
Dr. Andrei Tchernykh (CICESE)
Dr. Amilcar Meneses Viveros (CINVESTAV)
Dr. Gabriel Merino Hernández (UG)
Dr. Jesus Cruz Guzmán (UNAM)
Dr. Juan Carlos Chimal Enguía (IPN)
Dr. Juan Manuel Ramírez Alcaraz (UCOLIMA)
Dr. Luis Miguel de la Cruz Salas (UNAM)
Dr. Manuel Aguilar Cornejo (UAM-Iztapalapa)
Dr. Mauricio Carrillo Tripp (CINVESTAV)
Dr. Miguel Angel Moreles Vázquez (CIMAT)
Dr. Moisés Torres Martínez (UdG)
Dr. Octavio Castillo Reyes (U. Veracruzana)
Dr. Ramón Castañeda Priego (UG)
Dr. René Luna García (IPN)
Dr. Salvador Botello Rionda (CIMAT)
Dr. Salvador Herrera Velarde (UG)

Editorial Board Coordination
Iliana Concepción Gómez Zúñiga
Verónica Lizette Robles Dueñas
Leticia Benitez Badillo
Carlos Vázquez Cholico

Formatting
Gloria Elizabeth Hernández Esparza
Cynthia Lynnette Lezama Canizales
Pedro Gutiérrez García
Angie Fernández Olimón

The ISUM 2012 Conference Proceedings are published by the Coordinación General de Tecnologías de Información (CGTI), Universidad de Guadalajara, on December 19. Authors are responsible for the contents of their papers. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior permission of the Coordinación General de Tecnologías de Información at the Universidad de Guadalajara.
ISBN:
Work: ISUM Conference Proceedings: Where Supercomputing, Science and Technologies Meet
Editor: Dr. Moises Torres Martinez
Publisher: Universidad de Guadalajara
Date: December 13, 2012
All rights reserved. Copyright 2012 Universidad de Guadalajara
Ave. Juárez No. 976, Piso 2
Col. Centro, C.P., Guadalajara, Jal., México
Edited in México
DIRECTOR'S MESSAGE

Supercomputing in México is making significant leaps among the scientific community, which is making greater use of these powerful systems to conduct research. The 3rd International Supercomputing Conference in Mexico (ISUM 2012) was the gathering point for national and international researchers who shared their work using the power of supercomputing and their research in the area. As is customary, the conference invited some of the best researchers in the area, including high-profile personalities in academia such as Ian Foster, William Thigpen, Rick Stevens, Richard Jorgensen, and Alexander Tkatchenko, who shared their insight on the various topics of supercomputing. Various industry speakers also contributed their insight on emerging technologies. These conference speakers gave attendees a new international foresight on the latest research being conducted and the new emerging technologies. Similarly, the conference presenters contributed to its success with more than 100 presentations covering a wide spectrum of topics in the field.

This third edition of the ISUM has certainly demonstrated its growth, with vast participation of presenters from throughout the country as well as researchers from the United States, Europe and Latin America. The rich diversity of presenters enabled researchers to create collaborations in areas where they found commonalities in their respective work. It is great to see that ISUM has created the space for these interactions to occur, because it is through such collaborations that some of the most complex problems our society faces today will be solved by the country's brightest minds.
This book presents the works of some of the brightest minds in the country working on the myriad of topics covering supercomputing, demonstrating that the country is serious about growing research and development through the use of high performance computing. As this conference continues to evolve, we will continue to see greater advancement in the uses and research of supercomputing in the country. Thus, my deepest congratulations to the ISUM national committee, the University of Guanajuato, Centro de Investigación en Matemáticas (CIMAT), Laboratorio Nacional de Genómica para la Biodiversidad del Centro de Investigación de Estudios Avanzados (LANGEBIO CINVESTAV-IRAPUATO) and the Universidad de Guadalajara for their tireless work in organizing this important event to advance science and technology in México. Special congratulations to all the authors who contributed to this publication, because without their work it would not have been possible. On behalf of the University of Guadalajara and the Coordinating Office of Information Technologies, kudos to all conference participants for making this event a memorable one. I invite you to browse through this publication and read the research work that most interests you, and I look forward to seeing your participation in the next ISUM event and your contributions to future publications.

Ing. León Felipe Rodríguez Jacinto
General Director
Coordinating Office of Information Technologies
University of Guadalajara
CONTENTS

Foreword
  Miguel Ángel Navarro Navarro, Vicerrector Ejecutivo de la Universidad de Guadalajara

Preface
  Moisés Torres Martínez

Acknowledgements

Introduction: The Role of Supercomputing in Building Digital Cities and Government: A Perspective from México
  Moisés Torres Martínez

Applications

Matlab in Numerical Analysis Teaching
  Carlos Figueroa Navarro, Lamberto Castro Arce, Luis Manuel Lozano Cota

Evolutionary Analysis of Genes and Genomes
  Javier Carpinteyro Ponce, Zurisadai Miguel Muñoz González, Mario Iván Alemán Duarte, David José Martínez Cano, Cecilio Valadez Cano, Luis José Delaye Arredondo

Cloud Computing, a Choice for the Information Management in the MSMEs
  Jorge Bernal Arias, Nelly Beatriz Santoyo Rivera, Daniel Jorge Montiel García

A Multiobjective Genetic Algorithm for the Biclustering of Gene Expression Data
  Jorge Enrique Luna Taylor, Carlos Alberto Brizuela Rodríguez

Sorter & Identifier of Oaks Through Mobile Devices
  Luis Enrique Jaime Meza, Nelly Beatriz Santoyo Rivera, Daniel Jorge Montiel García

SCI
  Juan Carlos González Córdoba, Nelly Beatriz Santoyo Rivera, Daniel Jorge Montiel García

An Interface for the Virtual Observatory of the University of Guanajuato
  René Alberto Ortega Minakata, Juan Pablo Torres Papaqui, Heinz-Joachim Andernach, Hermenegildo Fernández Santos

Architectures

Speech Coding Using Significant Impulse Modeling and Recycling of Redundant Waveform
  Ismael Rosas Arriaga, Juan Carlos García Infante, Juan Carlos Sánchez García

Performance Study of Cellular Genetic Algorithms Implemented on GPU
  Javier Arellano Verdejo, Ricardo Barrón Fernández, Salvador Godoy Calderón, Edgar Alfonso García Martínez

Grids

Online Scheduling of Multiple Workflows on a Homogeneous Computational Grid
  Adán Hirales Carbajal, Alfredo Cristóbal Salas

Energy Efficiency of Online Scheduling Strategies in Two-Level Hierarchical Grids
  Alonso Mitza Aragón Ayala, Andrei Tchernykh, Ramin Yahyapour, Raúl Valente Ramírez Velarde

Genetic Algorithms for Job Scheduling in Two-Level Hierarchical Grids: Crossover Operators Comparison
  Victor Hugo Yaurima Basaldúa, Andrei Tchernykh

Infrastructure

Performance Evaluation of Infrastructure as a Service Clouds with SLA Constraints
  Anuar Lezama Barquet, Uwe Schwiegelshohn, Ramin Yahyapour, Andrei Tchernykh

Performance Comparison of Hadoop Running on FreeBSD on UFS Versus Hadoop Running on FreeBSD on ZFS
  Hugo Francisco González Robledo, Víctor Manuel Fernández Mireles

Acceleration of Selective Cationic Antibacterial Peptides Computation: A Comparison of FPGA and GPU Approaches
  Dulce María Garcia Ordaz, Miguel Octavio Arias Estrada, Marco Aurelio Nuño Maganda, Carlos Polanco González, Gabriel Del Rio Guerra

Computational Fluid Dynamics in Solid Earth Sciences, a HPC Challenge
  Marina Manea, Vlad Constantin Manea, Mihai Pomeran, Lucian Besutiu, Luminita Daniela Zlagnean

Parallel Computing

A Parallel PSO for a Watermarking Application on a GPU
  Edgar Eduardo García Cano Castillo, Katya Rodríguez Vázquez

Analysis of Genetic Expression with Microarrays Using GPU-Implemented Algorithms
  Isaac Villa Medina, Eduardo Romero Vivas, Fernando Daniel Von Borstel Luna

Parallelization of Three Hybrid Schemes Based on Covariance Matrix Self-Adaptation Evolution Strategy (CMSA-ES) and Differential Evolution (DE)
  Francisco Ezequiel Jurado Monzón, Victor Eduardo Cardoso Nungaray, Arturo Hernández Aguirre

Simulation of Infectious Disease Outbreaks over Large Populations Through Stochastic Cellular Automata Using CUDA on GPUs
  Héctor Miguel Cuesta Arvizu, José Sergio Ruiz Castilla, Adrián Trueba Espinosa

Parallelizing a New Algorithm for Determining the Matrix Pfaffian by Means of Mathematica Software
  Héctor Eduardo González, José Juan Carmona Lemus

Overthrow: A New Algorithm for Multi-Dimensional Optimization Composed by a Single Rule and Only One Parameter
  Herman Guillermo Dolder

Color- and Motion-Based Particle Filter Target Tracking in a Network of Overlapping Cameras with Multi-Threading and GPGPU
  Jorge Francisco Madrigal Díaz, Jean-Bernard Hayet, Mariano José Juan Rivera Meraz

Load Balancing for Parallel Computations with the Finite Element Method
  José Luis González García, Ramin Yahyapour, Andrei Tchernykh

Parallelization in the Navier-Stokes Equations for Visual Simulation
  Mario Arciga Alejandre, José Miguel Vargas Felix, Salvador Botello Rionda

Solution of Finite Element Problems Using Hybrid Parallelization with MPI and OpenMP
  José Miguel Vargas Felix, Salvador Botello Rionda

Design and Optimization of Tunnel Boring Machines by Simulating the Cutting Rock Process Using the Discrete Element Method
  Roberto Carlos Medel Morales, Salvador Botello Rionda

Pattern Recognition Library Implemented on CUDA
  Sonia Abigail Martínez Salas, Federico Hernán Martínez López, Angel Alberto Monreal González, Amilcar Meneses Viveros, Flavio Arturo Sánchez Garfias

Parallel Processing Strategy to Solve the Coupled Thermal-Mechanic Problem for a 4D System Using the Finite Element Method
  Victor Eduardo Cardoso Nungaray, Roberto Carlos Medel Morales, José Miguel Vargas Felix, Salvador Botello Rionda

A Parallelized Particle Tracing Code for CFD Simulations in Earth Sciences
  Vlad Constantin Manea, Marina Manea, Mihai Pomeran, Lucian Besutiu, Luminita Daniela Zlagnean

Scientific Visualization

Tridimensional-Temporal-Thematic Hydroclimate Modeling of Distributed Parameters for the San Miguel River Basin
  María del Carmen Heras Sánchez, José Alfredo Ochoa Granillo, Christopher John Watts Thorp, Juan Arcadio Saiz Hernández, Raúl Gilberto Hazas Izquierdo, Miguel Ángel Gómez Albores

Appendix I
  Conference Keynote Speakers
  Organizing Committee
  Author Index
FOREWORD

"Where Supercomputing, Science and Technologies Meet" could not have been a better slogan for the 3rd International Supercomputing Conference in México (ISUM), hosted in the historic city of Guanajuato, Guanajuato, México. It is representative of the tremendous shift the country is experiencing in the way research scientists are thinking about solutions to some of the most complex science problems we face in our society today. Supercomputing is recognized as important to the advancement of science and technology in the country, and over the past five years we have seen a growth in the use of these powerful systems in science research. The result has been a growth of new supercomputing centers around the country that give researchers access to the most sophisticated systems available today, allowing them to analyze data at incredible speeds and obtain faster results to their research questions. These supercomputing centers are making an impact on the country's advancement of science through computing power capable of virtually recreating the physical world on the computer screen with a stunning degree of precision and sophistication. Everything from weather systems to biochemical interactions to car crashes to air pollution to high-speed subatomic particle collisions can now be simulated, manipulated, and observed at the scientist's will. There is no doubt that these powerful systems are playing a significant role in transforming societies around the world and will continue to do so as the technology evolves.

In the same manner that supercomputing is transforming societies around the world, in México we are seeing tremendous progress in its use, not only in academia but in industry and government as well. In the introduction of this book, Dr.
Torres makes a solid argument that supercomputing is a critical element for projects like digital cities and digital government; such projects are expected to make life, business, travel, city planning and so on more convenient and more effective. The power of parallel computing is critical to the demands of digital cities because of its processing power. He also suggests that without high performance computing, projects of this caliber could not be optimally implemented. This introduction gives an overview of the importance of high performance computing to our ever-changing society, especially in México. Similarly, the book provides a series of studies conducted mostly in Mexico, in collaboration with countries such as the United States, Russia, the United Kingdom, Germany, Romania, Paraguay and Argentina. The studies cover a wide array of topics, including architectures, parallel computing, scientific visualization, grids, applications and infrastructure. Each study contributes to the advancement of supercomputing, science and technology innovation. For example, the article "A Multiobjective Genetic Algorithm for the Biclustering of Gene Expression Data" reports that experiments on two sets of biological data widely used as test cases show the algorithm performs better than others currently reported in the literature. An important feature of the algorithm is that it does not require a local search, contrary to some current algorithms, which must maintain the MSR below the threshold by means of that technique. This work is but one example of the works presented at the 2012 ISUM conference and shared in this publication, which gathers the most outstanding work
from the more than 300 attendees. This publication also represents a snapshot of the research being conducted around the country in supercomputing and its use in high-end research across multiple disciplines. It is breathtaking to see the high quality of work published in this book, because we know it comes from seasoned and young researchers from México and abroad who work together to resolve some of the most complex problems we face in supercomputing, science and technology today. It is encouraging to know that the country is moving forward in its quest to improve science and technology research and development with the talent of individuals who choose to contribute their time, effort and knowledge to the betterment of our society. With my utmost admiration and respect, I congratulate the more than 81 authors who contributed to this book and the evaluation committee for their time and effort in selecting the best works to compile it. On behalf of the University of Guadalajara, kudos to the University of Guanajuato, Centro de Investigación en Matemáticas (CIMAT), Laboratorio Nacional de Genómica para la Biodiversidad del Centro de Investigación de Estudios Avanzados (LANGEBIO CINVESTAV-IRAPUATO) and the national ISUM Committee for their work in organizing this important event, which has become the supercomputing event in México over the past three years. It is clear that it has made a national and international impact in fostering the use of supercomputing. This book is a true testament to its acceptance among the research community in México and abroad. The University of Guadalajara is most proud to have contributed to the success of this event through the vast participation of its researchers, graduate and undergraduate students, who presented their work and contributed to this third volume of the ISUM conference proceedings.
Without any reservations, I invite you to read through the research articles presented in this book; as a researcher or a graduate student, you will find excellent work that will help you advance in your own lines of study. I also encourage you to consider publishing your work in future editions of this publication.

Dr. Miguel Angel Navarro Navarro
Executive Vice Chancellor
University of Guadalajara
PREFACE

Mexican scientists today are making significant strides in the way they approach the analysis of data to acquire faster results in their respective research. High Performance Computing (HPC) is without any doubt a key element in the way researchers tackle some of the most contemporary problems of our society. It is no secret that a few years back researchers used antiquated computing power to attempt to solve complex problems, whether in mathematics, econometrics, weather, or other fields. However, this approach is changing rapidly with the growing use of supercomputing among researchers and the growth of supercomputing centers around the country, the latest being the ABACUS Center in the state of México, which is expected to join the list of the top 500 supercomputers in the world. Perhaps being on this list is not the most important part; of greater significance is the use of compute power to accelerate results on some of the most researched problems in our society today.

Central to the advances of supercomputing in the country was the 3rd International Supercomputing Conference (ISUM 2012), which took place in the historic city of Guanajuato, Guanajuato, México. The conference's aim was to gather the scientific and technological community to share their research in the many areas of supercomputing. ISUM 2012 drew over 300 attendees from throughout the country, including the presence of 35 universities and 12 research institutions. It also drew the international community, with 8 universities representing Europe, the United States, and Latin America. The conference included internationally known speakers who presented their research on the most contemporary themes in supercomputing.

A well-renowned researcher from the U.S.,
Ian Foster from Argonne National Laboratory, presented on the need to move research IT services out to so-called cloud providers to free researchers from the burden of managing IT, giving them back their time to focus on research and empowering them to go beyond the scope of what was previously possible. His compatriot William Thigpen, from the NASA Advanced Supercomputing Division, presented on the High-End Computing Capability Project: Enabling NASA Science and Engineering, and emphasized the importance of High Performance Computing to NASA's projects. In addition, Professor Rick Stevens from Argonne National Laboratory presented his work on how large-scale computing and knowledge bases will accelerate the development of biological theory and modeling. He discussed requirements of the emerging science of systems biology and how this new science may exploit technology under development for petascale computers and life science clouds. Professor Richard Jorgensen from the Laboratorio Nacional de Genómica para la Biodiversidad of CINVESTAV in México presented on the iPlant Collaborative, a cyberinfrastructure-centered community for enabling the solution of grand challenge problems in the plant sciences. The Collaborative is designed to enable multidisciplinary teams to address grand challenges and to provide access to world-class physical infrastructure, for example persistent storage and compute power via national and international resources such as the TeraGrid, as well as to provide analytical tools and services that promote interactions, communications and collaborations and advance the understanding and use of computational thinking in plant biology. And last but not least, Professor Alexander
Tkatchenko, from the Organic Functional Materials and Intermolecular Interactions Theory Department of the Fritz Haber Institute in Germany, presented the focus of his research group: constructing a systematic hierarchy of efficient methods for modeling van der Waals (vdW) interactions with high accuracy and the capacity to predict new phenomena in advanced functional materials. Starting from quantum-mechanical first principles, this work unifies concepts from quantum chemistry, density-functional theory, and statistical mechanics.

These keynote speakers set the stage for the conference with insight into their research. The conference followed with 100 presentations from conference attendees, 7 industry keynote addresses, 23 poster presentations, and 7 workshops that made this event a memorable one for all those who had the opportunity to attend. It is rather satisfying to see this much interest from academia and industry in the integration of high performance computing in research. We are seeing a paradigm shift in the way researchers are tackling science and technology research problems in México, and this ISUM conference made it clear that the use of HPC in México is no longer a luxury for researchers; it is a necessity for conducting serious science and technology research.

In the same manner, we see in this volume the important work of 31 research articles by more than 81 authors covering six research topics: architectures, parallel computing, scientific visualization, grids, applications and infrastructure. These works represent the essence of the research conducted in the more than fifty-seven participating academic institutions from México, Europe, the United States, and Latin America.
The scope of these works varied. For example, the joint work from the Universidad Autónoma del Estado de México and Centro Universitario UAEM Texcoco presented the "Simulation of Infectious Disease Outbreaks over Large Populations through Stochastic Cellular Automata Using CUDA on GPUs". This work presents a performance analysis for an epidemiological event simulation with cellular automata in large populations, using different methods of parallel processing. Works of this type and the various topics covered characterize this collection of studies, which is a representative sample of the work being conducted in High Performance Computing in the country today.

It is with much gratitude that I thank the many authors who contributed to this work, and the review committee who contributed their time to make this book a collection of quality work. On behalf of the National Committee, I invite you to read about this pioneering work and to participate in the upcoming International Supercomputing Conference in Mexico, sharing your research with the scientific community by presenting and submitting your work for publication.

Dr. Moisés Torres Martínez
Chair, ISUM National Committee
ACKNOWLEDGEMENTS

This publication would not have been possible without the contributions of participants representing institutions from México, Latin America, the United States, and the European Union who participated in this 3rd International Supercomputing Conference in México 2012 with presentations and paper submittals. It is a great honor to have had the participation of the many authors who contributed to this publication, and of the conference attendees, whose presentations, questions, and interaction made this conference a success. In addition, this conference was made possible by the important contributions of the following people, who made this event a success. We gratefully thank everyone for their individual contribution.

Universidad de Guadalajara
Dr. Marco Antonio Cortés Guardado, Rector General
Dr. Miguel Ángel Navarro Navarro, Vicerrector Ejecutivo
Lic. José Alfredo Peña Ramos, Secretario General
Mtra. Carmen Enedina Rodríguez Armenta, Coordinadora General de Planeación y Desarrollo Institucional
Ing. León Felipe Rodríguez Jacinto, Coordinador General, Coordinación General de Tecnologías de Información
Mtra. Alejandra M. Velasco González, Secretario, Coordinación General de Tecnologías de Información

Executive Committee in Guanajuato
Dr. Mauricio Carrillo Tripp, Laboratorio Nacional de Genómica para la Biodiversidad, CINVESTAV
Dr. Salvador Botello, Centro de Investigación en Matemáticas, A.C.
Dr. Alonso Ramírez Manzanares, Departamento de Matemáticas, Universidad de Guanajuato
Dr. Ramón Castañeda Priego, División de Ciencias e Ingeniería, Campus León, Universidad de Guanajuato
Dr. Jesús Gerardo Valdés Vázquez, Rectoría de Campus Guanajuato, Universidad de Guanajuato
Special recognition goes to the ISUM National Committee 2012, because without their participation and dedication in organizing this event, ISUM 2012 would not have been possible.

Alonso Ramírez Manzanares (UG)
Andrei Tchernykh (CICESE)
César Carlos Diaz Torrejón (IPICyT-CNS)
Cynthia L. Lezama Canizales (IPICyT-CNS)
Fabiola Elizabeth Delgado Barragán (UdeG)
Jaime Klapp Escribano (ININ)
José Lozano (CICESE)
Juan Carlos Chimal Enguía (IPN)
Juan Carlos Rosas Cabrera (UAM)
Juan Manuel Ramirez Alcaraz (UCOLIMA)
Lizbeth Heras Lara (UNAM)
Mauricio Carrillo Tripp (CINVESTAV)
Manuel Aguilar Cornejo (UAM)
Ma. Del Carmen Heras Sanchez (UNISON)
Moisés Torres Martínez (UdeG)
René Luna García (IPN)
Salma Jalife (CUDI)
Salvador Botello Rionda (CIMAT)
Salvador Castañeda (CICESE)
Verónica Lizette Dueñas (UdeG)
INTRODUCTION

The Role of Supercomputing in Building Digital Cities and Government: A Perspective from México

Dr. Moisés Torres Martínez
University of Guadalajara
General Coordinating Office of Information Technologies

Abstract

Digital city projects are a hot topic, emerging and evolving throughout the globe. México is well positioned to transform into a digital México, but a great deal of planning and investment is required for this to become a reality. In this paper we examine the state of the digital city in México and the role supercomputing can play as digital city projects are implemented across the country. The city of Guadalajara, recently selected to become a digital city, is taken as an example to examine how such a project can be carried out successfully, capitalizing on the existing infrastructure and the use of supercomputing. The paper gives a perspective on the importance of integrating supercomputing into the digital project to ensure that a digital city is a truly smart city.

Introduction

The wave of digital cities and government around the globe has been attracting more attention, as they are expected to make life, business, travel, city planning and so on more convenient and effective. Current urban environments are not adapted to the massive migration of population towards cities. This means new challenges in the fields of security, environmental issues, transportation systems, water distribution and, more generally, resource management will rapidly occur. These cities are faced with challenges such as waste or misuse of resources, which could be remedied by increasing real-time information. In digital cities, people will arrive just in time for their public transportation, as exact information is provided to their devices in due time. Even parking your car will be easier, as free parking spots around you are shown on your portable device.
Needless to say, government agencies will operate more efficiently by sharing information and making better use of information and computing technology resources. Real-time information that is always available will optimize time, save energy and make everyday life easier. Digital cities and governments around the world are making significant leaps in their implementation. For example, Vienna is ranked #1 among the top smart cities on the planet, scoring
handsomely as an innovation city and regional green city and in quality of life and digital governance. Vienna is establishing bold smart-city targets and tracking its progress toward them with programs like the Smart Energy Vision 2050, Roadmap 2020, and Action Plan. Vienna's planners are incorporating stakeholder consultation processes into building and executing carbon reduction, transportation and land-use planning changes, in the hope of making the city a major European player in smart city technologies. Similarly, cities like Toronto, Paris, New York, London, Tokyo, Berlin, Copenhagen, Hong Kong and Barcelona were ranked among the top ten cities (Top Smart Cities, 2012). There is no doubt that these cities are leading the pack; in Latin America, the only one mentioned on the list was Sao Paulo, Brazil. This is not surprising, since Brazil has demonstrated technological leadership in Latin America over the last few years and is making significant leaps in this area. México is still absent from these lists of top digital cities, since it is just beginning to evolve in this area. There is strong momentum from the new government to create digital cities and government around the country. One of the new attempts is the Digital Government office newly created by the new president, and it is beginning to pay off slightly: according to the United Nations E-Government Survey 2012, México moved its ranking from 39th to 28th. It also recently ranked eighth in e-government in the American continent and fifth in Latin America. It is clear that the country is making progress in the development of digital cities, but there is a long journey to reach a spot among the top ten digital cities in the globe, especially since México played a leading role in Latin America for more than thirty years but has seen a decline in its overall leadership over the last decade, particularly in the technology sector, while countries like Brazil and Chile have taken the lead.
It is unquestionable that developing countries cannot afford to be left behind in the development of digital cities, since the economic benefits are significant. Supercomputing in Mexico has grown a great deal over the past five years, with new centers being built throughout the nation and existing centers being updated, moving research and development to the technological edge. As these centers grow throughout the country, it only makes sense that digital city plans include these high performance computers as critical to the computing power and the research and development needed to advance digital cities in the country.

Background

Digital cities

The concept of the digital city is derived from the concept of the digital Earth, first brought forward by Al Gore. The definition is as follows: a digital city is an open and complex application system based on internet technology and city information resources (Gore, 1998). The digital city must integrate modern information technology and communication technology. Its aim is to promote sustainable development in the fields of environment, planning, health, transportation,
etc. Since the broader concept of the digital earth shelters the digital city, the term digital city will be used here to also include digital government. The history of digital cities in its early days can be approached from either a socio-technical or a virtual-physical dimension; these approaches have gradually been interwoven. With a tradition of community-based, privately managed, grass-roots organizations in the United States, the Cleveland Free-Net was established in 1986, followed by the Blacksburg Electronic Village in 1991 and the Seattle Community Network. President Clinton's National Information Infrastructure initiative formalized the information network and ushered in a new era. It specified concrete target values for six items, such as universal access and scientific and technological research. Targeted research areas included the economy, medicine, lifelong education, and administrative services (Yasuoka et al., 2010). The European experiment with digital cities, known as TeleCities, began as an alliance of 100 cities in 20 countries. They focused on sharing best practices, project plans and success stories. Amsterdam's De Digitale Stad (DDS) in 1994 and the high-speed metropolitan network in Helsinki in 1995 offer other approaches. Asia began its digital city efforts with the Singapore IT2000 Master Plan. Singapore launched Singapore One: One Network for Everyone in 1996 to develop a broadband communications infrastructure; Korea and Malaysia followed up with their own versions. The Koala Network was born in Japan with the assistance of a local prefecture; it set up an information center in 1985, connected to the Web in 1994 and promoted community networks. The Kyoto digital city project was initiated in 1998 to develop a social information infrastructure for everyday urban life. This wave of new digital cities around the globe has influenced México to consider the implementation of digital cities.
In 2012, the Massachusetts Institute of Technology (MIT) conducted a study of eleven cities in México as candidates to create a digital city, and the city of Guadalajara was chosen due to the versatility of the more than 700 technology companies in its metropolitan area (Sistema de Gobierno Digital, 2012). Although the project has been halted by the political issues the city is facing, there is no question that the conversations and negotiations are important to the city, and with the new government it will come to fruition within the next couple of years. Prior to this initiative, the federal government in México pushed the concept of digital cities and digital government to improve the federal administration (Sistema de Gobierno Digital, 2008). The digital city in Mexico is moving in the right direction, with the new government placing a high value on the country's digital evolution. Nevertheless, it is not moving fast enough to make a significant impact on the country's digital transformation.

Architectures

As the cities around Mexico begin to examine the architectures needed to implement a digital city, it is clear their construction is an integrated application of modern technologies. Those technologies
are, as suggested by Quianjun et al. (2008), the internet, communication, grid computing, the spatial information grid, the Global Positioning System (GPS), Geographical Information Systems (GIS), Location-Based Services (LBS), data mining, virtual reality, etc. But the spatial data infrastructure is the foundation, including a variety of databases, such as a spatial database, image database, geocoding database, population dataset, resource database, etc. In the middle tier there are functional servers, for example a spatial data server, spatial routing and gateway server, geocoding server, and attribute data server. This geo-information architecture is what Quianjun et al. suggest, but of course there are others. For example, Anthopoulos and Fitsilis (2011) suggest Enterprise Architectures: a strategic information asset base that identifies the mission, the information necessary to perform the mission, the technologies necessary to perform the mission, and the transitional processes for implementing new technologies in response to changing mission needs. These translate into five layers: stakeholder, service, business, infrastructure, and information. Others, like Ishida (2000), suggested a three-layer architecture, implemented in the Kyoto Digital City, comprising interaction, interface, and information layers. What Ishida suggests in his work is that the three-layer architecture met the design of the Kyoto digital city; each city is unique in its requirements, and the appropriate architecture depends on those requirements and on the vision of each digital city.

Supercomputing

Similarly, we have to examine how supercomputing can be a critical element of the digital city infrastructure. Studies suggest that digital cities can benefit from the use and integration of high performance computing. For example, Quianjun et al. (2008) consider grid computing technologies essential for digital cities.
Others suggest that high performance computing is essential to the execution of applications, and that parallelism is necessary to scale and speed up the applications of digital cities (Zhu & Fan, 2008). Zhu & Fan further suggest that the power of parallel computing is suitable for processing the applications of digital cities at large scale or in real time. These studies suggest that supercomputing could be of great benefit to digital urban management, which could include public safety, traffic control, planning, construction, land resources and housing administration, health, water, power, communications, etc. All these aspects of urban management, and more, are potential users of the power of supercomputing; of digital cities attempting to convert urban management to digital urban management without the power of high performance computing, it could be said that they are not reaching true digital status. Supercomputing is a powerful addition to the implementation of digital cities due to the parallel processing power it makes available to execute applications and conduct data analyses at high speed.
Supercomputing in México

In México we have seen a trend of growth in supercomputing centers throughout the nation. This growth includes new supercomputing centers like ABACUS, a world-class national initiative for science and technology specialized in applied mathematics and high performance computing at the Center for Research and Advanced Studies of the Instituto Politécnico Nacional (Cinvestav). This center is expected to house one of the most powerful supercomputers in the country. Like ABACUS, there are supercomputing centers such as the National Supercomputing Center (CNS-IPICyT), Kan Balam (Universidad Nacional Autónoma de México), Aitzaloa (Universidad Autónoma Metropolitana) and many other established centers primarily dedicated to research. As these centers continue to grow and evolve around the country, and telecommunications continue to improve, the country will be in a better position technologically to advance science and technology to higher levels. These centers can also serve as support for the digital city implementations around the country, since their processing and storage power is certainly an asset to these projects.

Digital México: Mexico's Digital Agenda

Digital cities around the globe are making significant strides in improving the quality of everyday life of their citizens. The digital city of the 21st century is defined by the transformation of the basic frameworks of human interaction. Those interactions (social, economic, political, and cultural) are informed by the interplay of history with the opportunities and challenges of the new digital urban reality. The digital city also represents a framework for sustainable growth and greater socioeconomic harmony. What does this mean for Mexican society? And how can it achieve the successful implementation of a digital city?
México has historically been a country with rich cultural traditions transcending its early indigenous and European roots; embedded in those traditions is the modern México of the 21st century, which has the eleventh largest economy in the world and emerging markets (United Nations, 2012). The modern México is moving in the direction of becoming a digital México (e-México). In 2011, the Digital Mexico agenda was presented to the federal government by a commission composed of national Information Technologies (IT) associations (AMIPCI, AMITI, ANIEI, Canieti), legislators and consultants. The agenda focused on public policy for the uses of Information and Communications Technology (ICT) to contribute to México's socioeconomic development. The document presents an analysis of connectivity indicators, putting México into an international context, and helps quantify the country's digital gap, i.e. how access to ICT and the internet varies across individuals, homes, businesses and geographic zones. It also looks at how ICT has been used and appropriated for social and economic development ends. The document lays out the myriad challenges the country has to address in communication,
accessibility, infrastructure, services, and equity of access in urban and rural areas. The agenda is not clear on the inclusion of supercomputing as an important element of research and development. This is concerning because it suggests the authors do not see the importance of supercomputing for addressing some of the most challenging problems the country faces today, or the benefits of integrating these systems in the new digital Mexico strategic plan (Agendadigital.mx, 2012). The digital México agenda was presented for a second time to the federal government in 2012, and it continues to be a conversation topic in the new federal government. The reality of this agenda is that it has not made it into the federal budget to initiate its implementation. Although funding for the implementation stages of the digital México agenda seems promising with the new federal government under President Enrique Peña Nieto, the reality is that there is no funding for it yet. The conversations on this topic remain active, and there is much expectation that funding for the initial implementation stages will take effect during this presidential term. Certainly, it is of utmost importance to move the digital agenda forward, because the economic and social benefits to the country are significant. The effective implementation of this digital agenda will move the country forward in creating a more amenable and technological environment that will benefit the common citizen and attract foreign investors, creating more jobs and a healthier economy.

Guadalajara Digital City

It is noteworthy that the city of Guadalajara was chosen to be a digital city, which has brought much attention in the state of Jalisco and is an important step nationally toward creating a digital city in the country. It is an important step towards a digital México; however, it faces delays similar to those of the country's digital agenda.
This project is at a halt due to the political challenges the city is facing with the land that was allocated to the project and the uncertainty of funding to develop it as it was conceived. Since the digital México agenda and the digital city in Guadalajara seem to be coordinated and are not moving fast enough, it is important to ask, as these projects evolve: how can a digital city be successfully implemented? The first opportunity in México to create a digital city provides the single most potent opportunity for imagining and building a common future. In the context of the global economy, in which goods and services transcend municipal and national boundaries with the click of a mouse, it is vital that we transcend the parochial attitudes of individualism vs. collectivism that often dominate projects of this caliber in México. What is meant by this is that the project cannot be driven solely by the state or local governments; it is therefore suggested that it be collaborative and innovative. To be collaborative, the project must include government, private industry, and academia. This three-pronged approach has to be a real collaboration in which each stakeholder has ownership of some aspect of the project, because it is common for political influence to push the project toward one side of the three-pronged approach, and that side does not necessarily have
the resources, manpower or capacity to take on a project of this magnitude. Without the real collaboration of all the stakeholders, it would be difficult for a digital city to be implemented effectively and with much innovation. Innovation is what is preached in the implementation of digital cities around the globe, but new ideas alone do not characterize innovation. Innovation is characterized by the novelty of capitalizing and building on existing infrastructures to advance projects faster, more effectively and more economically. In this case, the digital city requires not only physical infrastructure but the technological infrastructure to support it all around. In the case of Guadalajara, the strongest communication infrastructures belong to the University of Guadalajara, and this is an asset that ought to be exploited and made an integral part of the digital city network infrastructure. There is a strong sense that the academic institutions in the state hold the key to the digital city. This includes the major universities, institutes and research centers. It is not only the fiber optic infrastructure, which is currently being updated and expanded; it is also the scientific inquiry that is so valuable to the evolution of the digital city. It would be erroneous for the Guadalajara Digital City project not to capitalize on the infrastructures available in academic institutions like the University of Guadalajara and others that have much to contribute.

Role of supercomputing in a Digital City

As discussed earlier, supercomputing is often underestimated by projects of this caliber because of the fundamental belief that these types of high performance computers are only used to solve mathematical problems that require powerful processing.
This is partially true; however, nowadays the power of high performance computing is not limited to this purpose. We are seeing more and more the capability of these systems being used to solve some of the most common problems faced in urban cities, for example in public safety, transportation, weather, and even construction (i.e. rendering). These systems are no longer only for the research sector of our society, as is often perceived, and as a result of this perception these systems are overlooked in projects like the digital city. Having said this, to capitalize on existing infrastructure (computing or network), we do have to look to the local universities to examine what is available in terms of high performance computing to integrate into such projects. For example, if the local university has a state-of-the-art supercomputing center, it makes sense to examine how its power can be used in the digital city. However, if the computing power needed does not exist, perhaps the planning should include the creation of a supercomputing center to meet the needs of the project. It is rather difficult to believe a true digital city project can do without high performance computing, because the multiple tasks required to run a digital city require the power of parallel processing, and high performance computing is an ideal solution to these computing needs. The role of supercomputing in a digital city is fundamentally important, and in order for a digital city to be implemented effectively, supercomputing has to be an integral part of its computing infrastructure. Cities like Tokyo, New York, Barcelona, and London are using high performance computing.
Conclusion

A digital city is an integrated place where we work, study and live. This place is characterized by high efficiency, convenience, security and comfort. A digital city can maximize the city's strengths, improve its social environment, sustain rapid economic development, increase the cultural and living quality of its people, and promote national sustainable development. Therefore, in order to improve the sustainable development of society, we must quicken the implementation of digital cities in the country. In this paper we intended to present the current state of the digital city in México, its challenges, and a perspective on creating a digital city with a robust infrastructure, including the use of supercomputing, to accomplish a successful implementation. México is well positioned to implement a digital city with a high degree of success, and it will do so with the effective implementation of architectures, computing power, and robust telecommunications. The digital city will create a rather effective digital government, giving its citizens the most efficient services and making their daily life more convenient and effective. This book presents 31 articles that cover a wide array of topics including applications, architectures, grids, infrastructure, parallel computing and scientific visualization. These works represent the quality of work being produced throughout the country on these topics, and although they are not directly tied to a digital city project, they do contribute to the advancement of supercomputing in México. It is an honor to have had the confidence of the 81 authors who contributed to this work, allowing me to edit their work and compile this book to share with the scientific community. I invite you to browse through the articles of your choice and to consider publishing your work in future publications.

References

Agenda Digital.MX, handbook. (2012). Secretaría de Comunicaciones y Transportes.

Anthopoulos, L., Fitsilis, P. (2011).
From Digital to Ubiquitous Cities: Defining a Common Architecture for Urban Development.

Anthopoulos, L., Fitsilis, P. (2013). Digital Cities: Towards Connected Citizens and Governance. In Digital Literacy: Concepts, Methodologies, Tools, and Applications.

Asgarkhani, M. (2005). The Effectiveness of e-Service in Local Government: A Case Study. The Electronic Journal of e-Government, Volume 3, Issue 4, pp.

Gore, A. (1998). The Digital Earth: Understanding our Planet in the 21st Century. isde5.org/al_gore_speech.htm
Ishida, T. (2002). Digital City Kyoto: Social Information Infrastructure for Everyday Life. ACM.

Quianjun, M., Deren, L., Yanli, T. (2010). Research on the Progress and Direction of Digital City. Heilongjiang Bureau of Surveying and Mapping, P.R. China. XXXVII/congress/4_pdf/19.pdf

Subsecretaría de la Función Pública, Sistema de Gobierno Digital. (2012). portal3/doc/gobierno_digital.pdf

The Top 10 Smart Cities on the Planet. (2002). smartcitiesontheplanet

Transforming E-Government Knowledge through Public Management Research. (2009). Public Management Review, 11:6.

United Nations E-Government Survey. (2012).

Yasuoka, M., Ishida, T., Aurigi, A. (2010). The Advancement of World Digital Cities. Handbook of Ambient Intelligence and Smart Environments. Springer-Verlag, US.

Zhu, D., Fan, J. Application of Parallel Computing in Digital City. IEEE.
Applications
Where Supercomputing Science and Technologies Meet

Matlab in Numerical Analysis Teaching

Carlos Figueroa N. 1, Lamberto Castro Arce 2, Luis M. Lozano Cota 3
1 Departamento de Ingeniería Industrial, Universidad de Sonora; 2 Departamento de Ciencias e Ingeniería, Universidad de Sonora, Unidad Regional Sur; 3 Departamento de Ciencias e Ingeniería, Universidad de Sonora, Unidad Regional Sur; UNISON

Abstract

Many problems in physics and engineering are modeled using mathematical functions. That situation often involves solving an equation by finding its roots; other cases require finding the solution of a system of linear equations. On the other hand, an interesting case is solving the eigenvalue equation of a physical model. In this work, graphs and computer calculations are presented, and Matlab programs are the vehicle for exposing them, with the underlying goal of teaching numerical algorithms.

Keywords: eigenvalues, linear systems, roots of an equation

1. Introduction

The solutions of a scalar equation f(x) = 0 are called zeros or roots. This paper shows the ability of approximate methods, such as graphical analysis, and serves to illustrate the educational advantages offered by Matlab [1]. For the graphical method, the example of black-body radiation is developed, to determine the wavelength of maximum emission in terms of temperature. On the other hand, many physical models can be represented as systems of simultaneous equations using vectors and matrices. An interesting example is solving the system of linear equations of an electrical circuit; the solution is determined easily by applying the commands of the program. Finally, the solution of a matrix polynomial equation leads to f(λ) = det(A − λI). Obtaining the roots of the polynomial f(λ) means solving a problem of eigenvalues λ, where A is a matrix and I is the identity matrix. For this case, the example of masses connected to springs is shown.
The idea of this paper is to show the facilities and advantages of using Matlab in teaching numerical analysis to engineering undergraduate students [2]. Working with this software allows more agile development of topics, and learning becomes easier.
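The same root-finding workflow that the following sections carry out with Matlab's fzero can be sketched in any language. Below is a minimal pure-Python bisection stand-in (fzero itself uses a faster hybrid method); the two function forms are reconstructions of this paper's equations, whose minus signs were lost in extraction, so treat the exact expressions as assumptions.

```python
import math

def bisect(f, a, b, tol=1e-12):
    """Bisection: repeatedly halve [a, b] while f changes sign across it."""
    fa = f(a)
    if fa * f(b) > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    while b - a > tol:
        m = (a + b) / 2.0
        if fa * f(m) <= 0:   # sign change in the left half
            b = m
        else:                # sign change in the right half
            a, fa = m, f(m)
    return (a + b) / 2.0

# f(x) = (1/2) e^(x/3) - sin(x): the paper's first example, two positive roots
f = lambda x: 0.5 * math.exp(x / 3.0) - math.sin(x)
r1 = bisect(f, 0.5, 1.0)   # first root, near 0.68
r2 = bisect(f, 1.5, 2.0)   # second positive root

# g(x) = e^(-x) + x/5 - 1: the Wien displacement equation, root near 4.965
g = lambda x: math.exp(-x) + x / 5.0 - 1.0
xw = bisect(g, 4.0, 6.0)
```

Bisection is slow but unconditionally convergent once a sign change is bracketed, which is exactly the role the graphical inspection plays before calling fzero.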
3rd INTERNATIONAL SUPERCOMPUTING CONFERENCE IN MÉXICO

2. Graphic method

In order to solve an equation by determining a root or zero, it is necessary to find the x that satisfies f(x) = 0. An a priori, approximate solution is obtained by the graphical method [3]; for example, to solve

f(x) = (1/2)e^(x/3) − sin(x) = 0,  x > 0

When looking at the graphical solution in Figure 1, two points where f(x) = 0 are highlighted; these show the two roots of f(x), near 0.67 and 1.94. The roots of (1/2)e^(x/3) = sin(x) are observed in Figure 2.

Figure 1. This function has two roots.

Figure 2. The root is at the intersection of the curves.

It is possible to verify the result by using the fzero Matlab command. The program runs in the format shown next; it is important to write, after the function, the approximate root. First, for the root close to 0.67:

x1 = fzero('0.5*exp(x/3)-sin(x)', 0.67)

Second, for the root close to 1.94:

x2 = fzero('0.5*exp(x/3)-sin(x)', 1.94)

Below is an example from quantum physics; it is a famous exercise, since these ideas generated a revolution in physics.

Radiation of a black body. Planck's law for the so-called black-body radiation, as a function of the wavelength λ and the temperature T, is written as [4]

E(λ,T) = (8πhc/λ^5) · 1/(e^(hc/λkT) − 1)

where E is the energy as a function of wavelength and temperature, k is the Boltzmann constant, h is the Planck constant and c is the speed of light. To find the wavelength λ for which the energy is maximum at a given temperature, it is necessary to solve

∂E/∂λ = 0
Defining x = hc/(kTλmax) and differentiating, the following equation is generated:

e^(−x) + x/5 − 1 = 0

In Figures 3 and 4 the solution can be found by visualizing; in other words, x ≈ 4.9. This is a famous result, because it corresponds to the Wien displacement law. With the fzero command of Matlab it is obtained:

x1 = fzero('exp(-x)+0.2*x-1', 4.9)

Figure 3. For x > 0, the root is near 4.9.

Figure 4. The functions separated algebraically.

3. Linear simultaneous equations

A system of linear equations can be expressed in compact form as

Ax = y

where A is a square matrix (m = n). Here is an example: Figure 5 shows an electrical network connected to three terminals. The aim is to obtain the voltages at nodes a, b, c [3]. It is known that the electric current from node to node is determined by Ohm's law.
Using Kirchhoff's laws, the node equations are obtained; grouping similar terms and combining them, the result is identified as a system of linear equations.

Figure 5. Circuit with 3 nodes, similar to a star with node a.

For the circuit in Figure 5, the equations are expressed in matrix form, and the Matlab code that solves them is described next (the off-diagonal entries are the negated branch conductances):

a(1,1)=1/3+1/4+1/4; a(1,2)=-1/4; a(1,3)=-1/4;
a(2,1)=a(1,2); a(2,2)=1/4+1/4+1/3; a(2,3)=-1/3;
a(3,1)=a(1,3); a(3,2)=a(2,3); a(3,3)=1/3+1/4+1/3;
y(1)=5; y(2)=0; y(3)=7/3;
y=y';
x=a\y

Running the program yields the node voltages x.
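MATLAB's backslash operator can be checked against a hand-rolled solver. Here is a pure-Python sketch of Gaussian elimination with partial pivoting; note that the minus signs on the off-diagonal conductances were lost in the extracted text, so the matrix below is a reconstruction and should be treated as an assumption, not the paper's exact data.

```python
# Gaussian elimination with partial pivoting: a stand-in for x = A \ y.
def solve(A, y):
    n = len(A)
    # build an augmented matrix (copies, so the inputs are untouched)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for k in range(n):
        # partial pivoting: swap the row with the largest pivot into place
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        # eliminate column k below the pivot
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    # back substitution
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

# Reconstructed symmetric conductance matrix of the three-node circuit
# (diagonal: sum of conductances at the node; off-diagonal: negated
# branch conductances). Values are assumptions based on the garbled code.
A = [[1/3 + 1/4 + 1/4, -1/4,             -1/4],
     [-1/4,             1/4 + 1/4 + 1/3, -1/3],
     [-1/4,            -1/3,              1/3 + 1/4 + 1/3]]
y = [5.0, 0.0, 7/3]
x = solve(A, y)   # node voltages
```

A quick sanity check is to substitute x back into A and confirm that the injected currents y are reproduced.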
4. Eigenvalue problems

The eigenvalues of a matrix come from the function defined by

f(λ) = det(A − λI)

The function f(λ) is the characteristic polynomial, and this polynomial's roots are called the eigenvalues of matrix A. In Matlab, the command eig(A) obtains the solution. Consider a system consisting of masses and springs, as in Figure 6, whose displacements y1, y2 of masses m1, m2 are given by a pair of coupled differential equations [3]. Proposing a solution of the form Yk(t) = yk e^(jλt), where λ is the frequency and j = √−1, and taking the first and second derivatives of Yk(t), leads in matrix form to the eigenvalue equation.

Figure 6. Masses and springs system.

Introducing the data into Matlab [5],

A=[ ; ];
eig(A)

obtains the eigenvalues of A. The frequency λ for an eigenvalue γ is calculated from (2πλ)^2 = γ, which yields λ in Hz.

5. Conclusions

This work has exposed, through applied exercises, the educational advantages offered by using appropriate software in engineering and physics lessons. The graphical method focuses on approximately locating the zeros or roots of an equation; it should be clarified that a more accurate solution requires numerical methods, such as bisection or Newton's method. For systems of linear equations, it is possible to obtain a solution expeditiously, with a considerable saving of the time required by the implied linear algebra. The same holds for obtaining
the matrix eigenvalues, as in the case of the mass-spring system. The foregoing can open new ways to teach physics, by virtue of having more time to analyze and discuss the ideas underlying each topic. Thus, concepts of quantum mechanics, dynamical systems and circuits can be treated more widely.

6. References

[1] Desmond J. Higham, Nicholas J. Higham, Matlab Guide, Society for Industrial and Applied Mathematics, Vol. 43, No. 4, 2001, pp.
[2] John D. Carter, Matemática, Society for Industrial and Applied Mathematics, Vol. 46, No. 3, 2004, pp.
[3] Nakamura, S., Análisis Numérico y Visualización Gráfica con Matlab, Prentice Hall, EEUU.
[4] Vázquez, Luis, et al., Métodos Numéricos para la Física y la Ingeniería, McGraw-Hill, España.
[5] Gilat, Amos, Matlab: una Introducción con Aplicaciones, Reverté, España.
Evolutionary Analysis of Genes and Genomes

Javier Carpinteyro, Zurisadai Muñoz, Mario Alemán, David Martínez, Cecilio Valadez, Luis Delaye
Departamento de Ingeniería Genética, Unidad Irapuato, CINVESTAV

Abstract

In silico biology is now central to the biological sciences. Here we describe four different areas of research developed in our group: i) detection of natural selection on protein-coding genes of symbiotic/parasitic bacteria from the genus Wolbachia and in fungi from the phylum Ascomycota; ii) identification of secreted proteins in fungi from the genus Trichoderma; iii) an attempt to rationally design a symbiotic bacterium from a parasitic one, Neorickettsia sennetsu; and iv) study of the metabolic evolution of a symbiotic cyanobacterium. Each of these projects requires a set of specialized software to manipulate and make sense of complex biomolecular information.

Keywords: computational biology, symbiosis, natural selection, evolution.

1. Introduction

Dobzhansky claimed in the 1970s that "nothing in biology makes sense except in the light of evolution" [1]. Nowadays, Dobzhansky's claim could be rephrased to "nothing in biology makes sense except in the light of evolution and bioinformatics". This addition is due in part to the large amount of genetic and genomic information deposited in public databases. For instance, the last release of GenBank contains more than 380,000 species named at the genus level (or lower), and new taxa are being added at a rate of over 3,800 per month [2]. In parallel to this data flood, a plethora of bioinformatic tools has been developed to analyze and make biological sense of all this information. As an example, a popular web site devoted to compiling evolutionary analysis software lists a total of 374 different phylogeny packages [3]. Here we describe the application of some of these and other software tools to address specific biological questions.
2. Natural selection analyses

Jacques Monod, in a now classic book, pointed out that one of the main properties characterizing all living beings is that of being "objects endowed with a project" [4]. Monod called this property teleonomy. According to Monod, the specific project is represented in the structures of living beings, which in turn are used to accomplish their function (i.e., their performances). Of course, there is a direct link from Monod's observation to Darwinian evolutionary theory: the structures used by living beings are adaptations which evolved by means of natural selection [5]. Nowadays, the large number of DNA sequences, combined with the existence of software like PAML [6] or HYPHY [7] capable of detecting natural selection on protein-coding genes, opens the possibility of studying the process of evolutionary molecular adaptation in unprecedented detail. The study of natural selection on protein-coding DNA sequences rests on the assumption that adaptation acts mostly at the protein level. The algorithms implemented in the software above look at homologous genes to compare the rate of replacement substitutions dN (those which change the coded amino acid) to the rate of silent substitutions dS (those that do not change the coded amino acid). The ratio of the two quantities is often called omega:

ω = dN/dS

It is assumed that an omega < 1 is caused by purifying selection (mutations that lead to an amino acid change are eliminated from the population); an omega > 1 is caused by positive (or diversifying) selection (mutations that change the amino acid are selected for in the population); and an omega ≈ 1 is caused by neutral evolution (natural selection does not distinguish between the two kinds of substitution). The parameter omega is estimated under a maximum likelihood framework.
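As a toy illustration of the three ω regimes just described, the classification step can be written in a few lines. This is only the final interpretation rule with an arbitrary tolerance, nothing like the maximum-likelihood estimation of dN and dS performed by PAML or HYPHY.

```python
# Classify a selection regime from substitution rates dN and dS.
# The tolerance band around omega = 1 is an arbitrary illustrative choice.
def selection_regime(dn, ds, tol=0.1):
    omega = dn / ds
    if omega < 1 - tol:
        return omega, "purifying"
    if omega > 1 + tol:
        return omega, "positive"
    return omega, "neutral"

print(selection_regime(0.02, 0.40))  # strongly purifying: omega = 0.05
print(selection_regime(0.55, 0.25))  # candidate positive selection: omega = 2.2
```

In the real analyses, omega is not computed from raw counts like this but estimated jointly with branch lengths and codon frequencies under a likelihood model.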
It is important to remark that it is possible to estimate omega on each branch of a phylogenetic tree, or at each codon position in a multiple alignment. If omega is estimated on a multiple alignment, then residues predicted to be under each kind of selection regime (purifying, diversifying or neutral) can be mapped onto a tertiary protein structure when available (Figure 1).

Figure 1. Prediction of the selective pressure on residues of the enzyme Gal1p from Saccharomyces cerevisiae using codeml from the PAML package. Red: purifying selection; green: neutrality; magenta: positive selection. Residues participating in the active site are shown in space-fill representation.

We are currently studying the process of molecular adaptation in two different biological systems. On the one hand, we are looking for the footprints of natural selection
in bacterial genes from the genus Wolbachia; on the other hand, we are studying selection in genes from Ascomycota fungi.

2.1 Natural selection in symbiotic or parasitic species. Bacteria of the genus Wolbachia are commonly found in association with eukaryotes. This association is often parasitic when it is with arthropods, and symbiotic when it is with nematodes [8]. To date, four Wolbachia genomes have been sequenced: two parasitic strains from distinct species of Drosophila (wMel and wRi); one parasitic strain of the mosquito Culex quinquefasciatus (wPip); and one symbiotic strain of the nematode Brugia malayi (wBm). The availability of sequenced genomes from closely related Wolbachia species offers the opportunity to study the process of molecular adaptation to different lifestyles. In particular, we are interested in identifying those genes under different natural selection regimes in parasitic versus symbiotic species.

2.2 Natural selection in the evolution of new protein function. One central question in molecular evolution is how new protein functions evolve. Genome sequences of Ascomycota fungi are an ideal model to study the role of natural selection in the evolution of new protein functions. The phylum Ascomycota has among its members Saccharomyces cerevisiae, one of the best-studied model organisms [9]. The S. cerevisiae genome is the result of a complete genome duplication [10], and it is known that some of the duplicated genes evolved new functions. We are interested in looking at omega indices for pairs of duplicated genes. The study can be extended to 23 different sequenced species thanks to a recent phylogenetic study that identified all groups of orthologous and paralogous genes [11].
3. Identification of secreted proteins

It is known that in some pathogens and symbionts there are small proteins that are secreted into the host in order to manipulate its cellular response and favor the invasion of the pathogen or the symbiotic relationship. These proteins are collectively known as effectors [12]. Fungi of the genus Trichoderma are able to colonize the roots of a plant and to mycoparasitize different kinds of fungi [13]. In order to look for effector proteins involved in mycoparasitic or symbiotic relationships, we have analyzed the sequenced genomes of Trichoderma atroviride, Trichoderma virens and Trichoderma reesei using different bioinformatic tools. We first made a compilation of scientific articles where the existence of effector proteins in fungi has been demonstrated experimentally. By this method, we identified 80 effector proteins. We then used the sequences of these proteins to look for homologs in the genomes of the three Trichoderma species by using BLAST [14]. We have also identified effector domains among Trichoderma spp. protein sequences by using Pfam definitions and the HMMER software [15, 16]. Finally, we have identified typical effector motifs like RXLR (where R stands for arginine, L for leucine and X for any amino acid) in Trichoderma spp. proteins by using
3rd INTERNATIONAL SUPERCOMPUTING CONFERENCE IN MÉXICO 2012

Perl scripts. We have identified 316 proteins likely to function as effectors in the genomes of T. atroviride, T. virens and T. reesei. These include 8 families, among them hydrophobins, metalloproteases, Nep1 (necrosis and ethylene-inducing peptide), serine proteases, Nuk7, LysM repeats and elicitors such as SM1, plus more than 200 sequences with an RXLR translocation motif. Some of these candidates are being evaluated for their ability to be translocated into Arabidopsis thaliana cells using GFP fusions.

4. Engineering a parasite into a symbiont

Symbiosis is an ecological phenomenon widespread in nature. Symbiogenesis, the evolutionary outcome of symbiosis, is a central process in evolution [17]. Mitochondria and chloroplasts are the outcome of symbiotic relationships of alphaproteobacteria and cyanobacteria, respectively, with the primitive eukaryotic nucleocytoplasm [18]. A scientifically relevant question is whether it is possible to transform a parasite into a symbiont by genetic engineering and synthetic biology. We have decided to explore this hypothesis by studying genomic properties of the human pathogen Neorickettsia sennetsu. N. sennetsu is an alphaproteobacterium, a co-descendant of mitochondria, that causes a severe but not lethal infection in humans. Where to begin? In principle, one should eliminate from the genome of N. sennetsu the genes causing damage to the host, and leave only the genes important for the survival of the bacterium and for the desired symbiotic relationship. Therefore, we have made a BLAST comparison of the gene content of N. sennetsu with that of other alphaproteobacteria, including the free-living species Candidatus Pelagibacter ubique. By this comparison we have identified a set of ~250 genes that are very likely to be essential for cell survival in a wide range of environmental conditions.
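The comparative step described above can be pictured as a set intersection over BLAST hit lists: a gene is kept as an essential candidate only if a homolog was found in every reference genome. The sketch below is illustrative only, not the authors' pipeline; the gene names and hit sets are invented placeholders.

```python
# Illustrative sketch of the comparative step described in the text:
# a query gene survives only if BLAST found a homolog for it in every
# reference genome. Hit sets and gene names are invented placeholders.

def candidate_essential(query_genes, hits_by_genome):
    """Genes of the query genome with a homolog in every reference genome."""
    core = set(query_genes)
    for hits in hits_by_genome.values():
        core &= hits  # drop genes missing from this reference genome
    return core

hits_by_genome = {
    "Candidatus_Pelagibacter_ubique": {"rpoB", "gyrA", "ftsZ"},
    "Ehrlichia_chaffeensis":          {"rpoB", "gyrA", "virB4"},
}
print(sorted(candidate_essential({"rpoB", "gyrA", "ftsZ", "virB4"},
                                 hits_by_genome)))  # -> ['gyrA', 'rpoB']
```

In a real analysis the per-genome hit sets would be parsed from BLAST output rather than written by hand.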
We have also identified a set of ~60 genes, shared only with other Rickettsia, that may be important for the interaction with the host. In order to have a better understanding of the biochemistry of N. sennetsu, we have made an in silico reconstruction of its metabolism (Figure 2) using the PathwayTools software [19]. We have been able to integrate 338 protein-coding ORFs into 236 enzymatic reactions. Most central pathways are complete, with the exception of 8 missing enzymes. From this metabolic reconstruction we infer that N. sennetsu uses glutamine as its main carbon and energy source. Finally, we have made a BLAST comparison of N. sennetsu proteins with those proteins predicted to be present in the mitochondrion (whether mitochondrially or nuclear encoded). By this analysis we identified 239 proteins shared between the two systems. The next step will be to identify a minimal symbiotic system (which genes to keep and which to eliminate from the genome of N. sennetsu) based on this information.

Figure 2. Representation of the metabolic pathways reconstructed in N. sennetsu
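The kind of completeness bookkeeping behind the statement above (most pathways complete, 8 enzymes missing) can be sketched as follows. The actual reconstruction was done with PathwayTools; the pathway and reaction identifiers here are hypothetical and for illustration only.

```python
# Hypothetical sketch: given the enzymatic reactions recovered in a
# metabolic reconstruction, list the reactions of each reference pathway
# that are still missing. All identifiers are invented for illustration.

def missing_reactions(pathways, reconstructed):
    """Map each pathway name to its reactions absent from the model."""
    return {name: [r for r in rxns if r not in reconstructed]
            for name, rxns in pathways.items()}

pathways = {
    "glycolysis": ["HEX1", "PGI", "PFK", "FBA", "TPI"],
    "TCA cycle":  ["CS", "ACO", "IDH"],
}
model = {"HEX1", "PGI", "FBA", "TPI", "CS", "ACO", "IDH"}
print(missing_reactions(pathways, model))
# -> {'glycolysis': ['PFK'], 'TCA cycle': []}
```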
5. Evolutionary analyses of the photosynthetic function in Paulinella chromatophora

An important feature of the eukaryotic cell is the presence of organelles: structures with specialized functions whose origins are attributed to endosymbiotic processes between different free-living prokaryotes [20]. Plastids (the organelles responsible for the photosynthetic function in eukaryotes) evolved about a billion years ago [21]. The details of the first steps in the evolution of plastids have been erased by the antiquity of their origin. However, less ancient cases of symbiotic associations between eukaryotic hosts and photosynthetic cyanobacteria can be used to study, by analogy, the origin of plastids and the early steps in the evolution of symbiosis. Paulinella chromatophora is a protist containing photosynthetic bodies (plastids) acquired by endosymbiosis with cyanobacteria of the genus Synechococcus in relatively recent times (60 million years ago) [22]. This makes P. chromatophora an interesting model to study the evolution of the symbiotic photosynthetic function. With the use of PathwayTools [19], we are reconstructing in silico the metabolic pathways of the chromatophore (photosynthetic body) of P. chromatophora from its genomic sequence. At the same time, we are reconstructing the metabolic pathways of a free-living cyanobacterium (Synechococcus sp. WH 5701), which has been shown to be the closest relative of the chromatophore of P. chromatophora. Once both metabolic models are finished, we will perform flux balance analysis (FBA) on both metabolisms to look for a rationale of the metabolic changes from the free-living to the symbiotic lifestyle.

6. Conclusions

Biology is evolving into an information-driven science. Nowadays it is possible to contribute to biological research purely from an in silico approach.
Here we have discussed four areas, namely: a) the study of how proteins evolve new functions and how similar genomes adapt to diverse lifestyles; b) pattern recognition on protein sequences to identify a set of proteins with a common general function; c) an attempt to rationally design a symbiotic cell from a parasitic one; and d) the study of the evolution of the photosynthetic function in a symbiotic system. However difficult these projects may be, they are only possible within modern biology.

7. References

[1] T. Dobzhansky, Nothing in Biology Makes Sense Except in the Light of Evolution, The American Biology Teacher, 1973, Vol. 35, pp.
[2] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W. Sayers, GenBank, Nucleic Acids Res, 2011 January; 39(Database issue): D32-D37.
[3]
[4] Monod, J. Chance and Necessity: An Essay on the Natural Philosophy of Modern Biology, New York, Alfred A. Knopf, ISBN ,
[5] Darwin, C. The Origin of Species by Means of Natural Selection, or the Preservation of Favoured
Races in the Struggle for Life (6th ed.). London: John Murray. ISBN ,
[6] html.
[7] http://octamonkey.ucsd.edu/hyphywiki/index.php/main_page
[8] Werren, J. H., Baldo, L., and Clark, M. E. Wolbachia: master manipulators of invertebrate biology, Nature Reviews Microbiology, 2008, Vol. 6, pp.
[9]
[10] Wolfe, K., and Shields, D.C. Molecular evidence for an ancient duplication of the entire yeast genome, Nature, 1997, Vol., pp.
[11] Wapinski, I., Pfeffer, A., Friedman, N. and Regev, A. Natural history and evolutionary principles of gene duplication in fungi, Nature, 2007, Vol. 449, pp.
[12] Stergiopoulos, I., and de Wit, P.J. Fungal effector proteins, Annu Rev Phytopathol, 2009, Vol. 47, pp.
[13] Harman, G.E., Howell, C.R., Viterbo, A., Chet, I. and Lorito, M. Trichoderma species: opportunistic, avirulent plant symbionts, Nature Reviews Microbiology, 2004, Vol. 2, pp.
[16] R.D. Finn, J. Clements, and S.R. Eddy. HMMER web server: interactive sequence similarity searching, Nucleic Acids Research, 2011, Web Server Issue 39:W29-W37.
[17] Margulis, L. and Sagan, D. Acquiring Genomes: A Theory of the Origins of Species, Basic Books; 1st edition. ISBN-10: ,
[18] Margulis, L. and Sagan, D. Microcosmos: Four Billion Years of Microbial Evolution, University of California Press. ISBN-10: ,
[19] Karp, P.D., Paley, S., Romero, P. The Pathway Tools software, Bioinformatics, 2002, 18 Suppl 1:S
[20] Bhattacharya, D., Archibald, J., Weber, A. How do endosymbionts become organelles? Understanding early events in plastid evolution, BioEssays, 2007, 29, pp.
[21] Yoon, H.S., Nakayama, T., Reyes-Prieto, A., Andersen, R.A., Boo, S.M., Ishida, K. and Bhattacharya, D. A single origin of the photosynthetic organelle in different Paulinella lineages, BMC Evolutionary Biology, 2009, 9:98.
[22] Bodył, A., Mackiewicz, P., and Stiller, J.W. The intracellular cyanobacteria of Paulinella chromatophora: endosymbionts or organelles?
TRENDS in Microbiology, 2007, 15, pp.
[14] Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., and Madden, T.L. BLAST+: architecture and applications, BMC Bioinformatics, 2009, Dec 15;10:421.
[15] Finn, R.D., Mistry, J., Tate, J., Coggill, P., Heger, A., Pollington, J.E., Gavin, O.L., Gunasekaran, P., Ceric, G., Forslund, K., Holm, L., Sonnhammer, E.L., Eddy, S.R. and Bateman, A. The Pfam protein families database, Nucleic Acids Research, 2010, 38; Database issue; D
Cloud Computing, a Choice for Information Management in MSMEs

Jorge Bernal Arias 1, Nelly Beatriz Santoyo Rivera 2, Daniel Jorge Montiel García 3
1 Master in Information Technology, Bachelor in Management; 2 Master of Information Systems, Computer Systems Engineer; 3 Master of Information Systems, Computer Systems Engineer. Instituto Tecnológico Superior de Irapuato, Carretera Irapuato-Silao Km. 12.5, Irapuato, Guanajuato, México.

Abstract

In the current conditions of globalized markets, having timely and complete information is a great contribution to decision making and, therefore, to guiding all administrative and logistical efforts toward gaining a better position in those markets. This is clear to all companies, primarily the large corporations that invest heavily in information technologies, worrying about being at the forefront of their markets. But in the MSMEs, either the situation outlined above is not well understood, or they do not have the resources to establish their own information management strategy.

Keywords: Cloud Computing, Databases, Applications in the Cloud, Mobile Devices, Information Management, Strategic Management, MSMEs, Economic Development.

1. Introduction

Whichever situation arises, the MSMEs (Micro, Small and Medium Enterprises) lack the ability to manage timely and valid information, so they are almost always out of the competition. Considering that there are market sectors where small firms are more active than large corporations, and that at any given time they can make the leap to better scenarios, it is possible to provide them a low-cost information management system that meets information-industry standards. Analyzing the problems found in the MSMEs, the most common finding is the absence of a functional structure for computing and information management.
Given this situation, an analysis was made of what is available both internally and externally, concluding that cloud computing offers a solid alternative to the problems identified: the development of a database management system and its associated application, which will be available to MSMEs at affordable prices.
2. Development

The aim is to demonstrate that emerging technologies, especially cloud computing, can redefine the concepts of information management of enterprises, especially MSMEs, allowing them to focus their efforts on the strategic and logistic aspects of their business processes, so they can establish themselves firmly and transcend within their market. Based on this framework, a case study could be developed that uses cloud computing services to solve general problems of MSMEs in two phases:

- Internal information management, which facilitates the management of an MSME and redirects its efforts toward meeting its business objectives.
- Management of a corporate image, i.e., generating a Web application that includes, in addition to the above, links to mobile technologies that allow, at low cost, competitive schemes in information management, creating value in their production processes.

In the environment of MSMEs, particularly the micro enterprises, the problems for their subsistence in their market are many and notable: competition with established businesses; new but unknown or unnecessary products; weak or undefined administrative structures; financial problems (especially cash flow); conflicts in planning and implementing the business plan; and mainly, what concerns us here, an inadequate or nonexistent scheme for managing information. Even though the aforementioned scheme should start from the fundamental objectives of the company, its business and its internal and/or external information needs, there is a factor that marks this conflict and makes it grow: the general ignorance, and therefore low acceptance, of computational tools as the main component for managing information.
There is a general belief that informatics solutions are expensive, and cost-benefit studies that would show the opposite are not considered; or it is believed that the use of spreadsheets or office suites is enough to manage the information. In the end, each member of the company manages and stores information in isolation from the rest of the team, creating conflicts of duplicated information or lack of validation; moreover, the knowledge generated is not shared and therefore creates no value. Another aspect to consider is related to the equipment, which has very different characteristics. In hardware: old and obsolete office equipment coexisting with heads or owners handling last-generation laptops, tablets and smartphones. In software: different versions of office suites, use of expired or unlicensed applications, and a wrong definition of Internet resources. From the above, the solution would be to build an information system based on a database that can be accessed both locally and remotely, to meet all the information needs of the staff and their managers, and also of potential customers or suppliers. This raises issues of accessibility, use of computing resources and their potential acquisition, which in theory would generate high costs that the entrepreneur is not always willing to cover.
An alternative to consider is the concept of cloud computing, which, in a theoretical approach, offers computing resources and services (infrastructure, platform, applications) as Internet services, without the users having to worry about their internal operation, so no knowledge thereof is needed [3]. Moreover, the following benefits of cloud computing should be considered [4]:

- Costs. The user pays only for the services used, i.e., the consumption of resources (memory, processing, storage), leaving all the infrastructure problems to the IT provider.
- Competitiveness. The user can have access to new technologies at an affordable price, so the ones who gain the competitive advantage are those who make better use of their computing and information resources.
- Availability. In this aspect, the provider must guarantee constant access to the customer. Since access is through a Web application, it must ensure accessibility from any computing equipment: PCs, laptops, tablets, smartphones, etc.
- Scalability. Any updates, whether of hardware or software, must be transparent to the user, so that every application remains available; all software updates should be immediate.

All of this should allow users to focus their efforts on the strategic and logistic aspects of their business processes, letting the MSME establish itself solidly and transcend in its market. This places on the provider the responsibility for the implementation, configuration and maintenance of the technological infrastructure that supports the execution of the application, and thus the reliability of the information handled.
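As a toy illustration of the cost argument above, the pay-per-use model can be compared against an up-front infrastructure purchase. All figures below are invented placeholders, not real provider rates.

```python
# Toy illustration of the pay-per-use cost benefit described above.
# The hourly rate and the server price are invented placeholders.

def monthly_cloud_cost(hours_used, rate_per_hour):
    """Pay only for consumed resources."""
    return hours_used * rate_per_hour

def months_to_break_even(server_cost, monthly_cost):
    """Months of cloud usage whose total equals the up-front server cost."""
    return server_cost / monthly_cost

monthly = monthly_cloud_cost(hours_used=200, rate_per_hour=0.25)
print(monthly)                                # -> 50.0
print(months_to_break_even(2400.0, monthly))  # -> 48.0
```

Under these placeholder figures, an MSME would pay the equivalent of the up-front server only after four years of usage, without bearing maintenance or upgrade costs.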
From its origins, the concept of cloud computing has meant unlimited access to Internet services, with Google and Amazon being the major companies that gave a strong impulse to this paradigm, as their infrastructure facilitated the use and adoption of the new environment. As this concept developed, new enterprises added to that impulse: the likes of IBM, HP, Oracle and Microsoft began offering communication services, dedicated Internet access, hosting services for Web applications, development and maintenance services for these, and document support and sharing services. In this context, applications have emerged, such as Google Apps and Amazon EC2, among others, that ease the sharing of documents on the network and provide some of the basic services of the cloud. In addition, Oracle and Microsoft are developing platforms that offer integral services for the development of cloud applications, such as Oracle On Demand, Google App Engine or Microsoft Azure. Although there are already some solutions available to businesses, large corporations are the only ones who can access them, due to the cost of application development, and also because they already have the necessary platform, leaving out the small businesses. In Mexico, there are some independent companies dedicated to cloud services, like Serv.Net; however, they are just starting with hosting services for Web applications and their development and maintenance, but basically
they are hosting third-party products, which does not ensure the privacy and security required. In this project, the first thing to do is to lay out the theoretical scheme of cloud computing as a framework, describing the three layers that form it:

- Software as a Service (SaaS). Consists of the delivery of a complete application as a service.
- Platform as a Service (PaaS). The basic idea is to provide a service platform on which to develop software over the network.
- Infrastructure as a Service (IaaS). Establishes the offer of resources over the Internet instead of local servers and their needed infrastructure.

Also, the concepts of virtualization (related to the platform) and clustering are explained, as resources for the improvement of cloud services.

3. Conclusions

Under the present conditions, both in today's globalized markets, especially for MSMEs, and in the context of information technologies, it is necessary to search for solutions that are innovative, yet accessible in cost and scope, so that MSMEs can have timely and reliable information that lets them compete in more or less favorable conditions within their markets. A viable alternative is to build information management systems on cloud computing, whose benefits would allow small entrepreneurs to channel their efforts towards the improvement of their production processes and, consequently, a better position within their market. For example, they would be freed of costly equipment and network structures, a few Internet connections being enough, whose cost would be low, plus the extra expense related to the use of the tools employed.
At this point, the creation of information management systems on cloud computing is entirely feasible, either through products generated by large companies like Microsoft, Google, Oracle or Amazon, or through developments made by companies dedicated to the creation of fit-for-purpose IT solutions, where the current proposal is embedded.

4. References

[1] Ian Sommerville, Ingeniería de Software, México: Pearson Educación, 2002, pp.
[2] Josep Curto Díaz; Jordi Conesa Caralt, Introducción al Business Intelligence, Barcelona: Editorial UOC, 2010.
[3] Francisco Carlos Martínez Godínez, Beatriz Verónica Gutiérrez Galán, Cómputo en Nube: Ventajas y Desventajas. Available at seguridad.unam.mx/numero08/c%c3%b3mputoennubeventajasydesventajas
[4] Google Apps for Business. Available at index.html#utm_source=google&utm_medium=ha&utm_campaign=latamappsarsk&utm_content=009&term=cloud%20computing
[5] Amazon EC2. Available at com/es/ec2//
[6] Oracle On Demand. Available at
oracle.com/us/products/ondemand/index.html
[7] Fernando Mollón, Oracle apuesta por virtualización y cómputo en la nube. Available at oracleapuestaporvirtualizacionycomputoenlanube
[8] Microsoft Azure. Available at microsoft.com/esmx/cloud?ocid=bannmxjtcDPUTCCLOUDFY12
[9] Roger López Rodríguez, Punto de Vista. Available at cloudcomputing/2010/11/30/
[10] Serv.Net. Available at mx/
A Multiobjective Genetic Algorithm for the Biclustering of Gene Expression Data

Jorge E. Luna-Taylor 1, Carlos A. Brizuela 2
1 Departamento de Sistemas y Computación, Instituto Tecnológico de La Paz, Boulevard Forjadores No. 4720, La Paz, B.C.S., México.
2 Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada, Carretera Ensenada-Tijuana 3918, Ensenada, B.C.N., México.

Abstract

In recent years, the development of new technologies for the design of DNA microarrays has generated a large volume of biological data, which requires the development of parallel computational methods for its functional interpretation. On these data sets, bicluster construction algorithms attempt to identify associations of genes and experimental conditions where the genes exhibit a high correlation under each given condition. In this paper, we introduce a new multiobjective genetic algorithm that, unlike other evolutionary proposals, does not require a local search for the identification of optimal biclusters. The proposed algorithm is simpler and performed better than the ones found in the current literature on two real gene expression data sets.

Keywords: biclustering, gene expression, multiobjective genetic algorithm, DNA microarray.

1. Introduction

The increased use of microarray technology has generated a large volume of biological data, which necessitates the development of parallel computational methods for its functional interpretation. To address this challenge it is necessary to apply data mining techniques. Among these techniques, with regard to gene expression databases, clustering has become one of the most used approaches as a first step in the work of discovering new knowledge. However, the results of clustering methods applied to genes have been limited.
This limitation causes difficulty in analyzing the expression of genes for a given set of experimental conditions. To overcome this limitation, various algorithms have been proposed to cluster genes and conditions simultaneously. These algorithms are called biclustering algorithms, and their aim is to identify groups of genes that exhibit a high correlation across a given set of experimental conditions.
The search for biclusters in gene expression arrays is a very attractive computational challenge. The work of Cheng and Church [1] is significant since it introduced the concept of bicluster applied to the analysis of gene expression for the first time, and proposed an original algorithm for bicluster construction. Despite some limitations of this algorithm (as discussed by Rodriguez et al. [2] and by Aguilar [3]), it has been used as a basis for evaluating and comparing the performance of a wide variety of more recent and elaborate algorithms. Madeira and Oliveira [4] presented a classification of biclustering methods based mainly on two aspects: i) the type of biclusters that the algorithms are able to find, and ii) the computational technique used. There are algorithms that seek biclusters with constant values, e.g., m-clustering [5], based on the divide-and-conquer approach, and DCC [6], based on a combination of clustering of rows and columns. Other methods identify biclusters with columns or rows of constant values, such as CTWC [7], the δ-patterns [8] based on a greedy approach, and Gibbs [9]. Some methods, like δ-biclusters [1] and FLOC [10, 11], use greedy approaches; pClusters [12] uses exhaustive search; Plaid Models [13] and PRM [14, 15] are based on the identification of distribution parameters. There are also methods that seek biclusters with patterns of coherent evolution (OPSMs [16] and xMotifs [17]) using greedy search, and SAMBA [18] and OP-Clusters [19], which perform exhaustive search. Rodriguez et al. [2] add to this classification methods that use stochastic search. In this branch, algorithms such as SEBI [20] and Simulated Annealing [21] are included.
Despite the existence of a large number of biclustering algorithms, there are still many significant challenges to overcome:

- The scarce information available to define the type of specific biclusters to search for.
- The amount of noise in the data matrices.
- The computation time, due to the complex calculations often required.
- The absence of data in the input matrices.
- The existence of user parameters that strongly influence the final results.
- The lack of assessment methods for the generated results.
- The multiobjective nature of the problem, since the MSR and the bicluster size must be optimized at the same time.

In this paper, we introduce an evolutionary algorithm for the biclustering problem. The algorithm considers the biclusters themselves as individuals in a population to evolve. The objective is to minimize the MSR value of the biclusters while simultaneously maximizing their size. Experiments are performed on two reference data sets (yeast Saccharomyces cerevisiae and human lymphoma B-cells). The problem to solve is formally defined next.

2. Biclustering analysis of gene expression

Cheng and Church [1] introduced the concept of bicluster within the context of gene expression data analysis. A bicluster is a subset
of genes and a subset of conditions with a high level of similarity. The similarity is considered as a consistency measure between genes and conditions in the bicluster. Within this context, we can define biclustering as the process of grouping genes and conditions simultaneously, searching for biclusters of maximum size and maximum similarity within a gene expression data matrix. Madeira and Oliveira [4] present a formal approach to the biclustering problem. The input data is an n-by-m matrix, where each element a_ij is usually a real value. In the case of gene expression arrays, a_ij represents the expression level of gene i under condition j. More generally, one considers the data matrix A with a set X of rows and a set Y of columns, where the element a_ij corresponds to a value representing the relationship between row i and column j. The matrix A with n rows and m columns is defined by its set of rows, X = {x_1, ..., x_n}, and its set of columns, Y = {y_1, ..., y_m}. (X, Y) is used to denote the matrix A. If I ⊆ X and J ⊆ Y are subsets of the rows and columns of A, respectively, then A_IJ = (I, J) denotes the submatrix of A containing only the elements a_ij belonging to the rows in I and the columns in J. Given the matrix A, a cluster of rows is a subset of rows that have a similar behavior across the set of all columns. This means that a cluster of rows A_IY = (I, Y) is a subset of rows defined over the set of all columns Y, where I = {i_1, ..., i_k} is a subset of rows (I ⊆ X and k ≤ n). A cluster of rows (I, Y) can thus be defined as a k-by-m submatrix of the data matrix A. Similarly, a cluster of columns is a subset of columns that have a similar behavior across the set of all rows. A cluster A_XJ = (X, J) is a subset of columns defined over the set of all rows X, where J = {j_1, ..., j_s} is a subset of columns (J ⊆ Y and s ≤ m).
A cluster of columns A_XJ = (X, J) can be defined as an n-by-s submatrix of the data matrix A. A bicluster is a subset of rows that have a similar behavior across a subset of columns, and vice versa. The bicluster A_IJ = (I, J) is a subset of rows of X and a subset of columns of Y, where I = {i_1, ..., i_k} is a subset of rows (I ⊆ X and k ≤ n), and J = {j_1, ..., j_s} is a subset of columns (J ⊆ Y and s ≤ m). A bicluster (I, J) can be defined as a k-by-s submatrix of the data matrix A. The specific problem addressed by biclustering algorithms is defined as follows: given a data matrix A, identify a set of biclusters B_k = (I_k, J_k) such that each bicluster B_k satisfies some property of homogeneity. The exact features of the homogeneity of biclusters vary according to the statement of the problem. Although the complexity of the biclustering problem depends on the exact formulation, and specifically on the function used to evaluate the quality of a bicluster, most variants of this problem are NP-hard.

3. Related work

Recently, several algorithms based on a variety of techniques have been proposed to find biclusters, for example, BBC [22], Reactive GRASP [23], RAP [24], GS Binary PSO [25] and TreeBic [26], among others. In general, it is difficult to evaluate and compare biclustering methods, since the results obtained strongly depend on the scenario under consideration. Prelic et al. [27]
present an evaluation and comparison of five outstanding methods. The evaluated methods were: CC [1], Samba [18], OPSM [16], ISA [28, 29] and xMotif [17]. To evaluate the methods, both artificial and real data sets were used. With the artificial data, biclusters with constant and additive values were tested; the methods were also tested with systematically increasing noise, and with increasing overlap between the created biclusters. As for the real data, the biological information used included GO annotations [30, 31], maps of metabolic pathways [31], and information on protein-protein interactions [32, 31]. In general, the methods ISA, Samba and OPSM perform well; while some methods perform better under certain scenarios, they show lower performance in others. Mitra and Banka [33] introduced a multiobjective evolutionary algorithm (MOEA) with the addition of local search. The objective is to find biclusters of large size, with MSR values below a predefined threshold. Their method was evaluated using two sets of gene expression data referenced in the literature: yeast and human B-cell lymphoma. The yeast data they used is a collection of 2884 genes under 17 conditions, with 34 null entries with value -1, indicating missing data. The expression data of human B cells [37] contains 4026 genes under 96 conditions, with 12.3% of the values missing. The results of this method were compared with FLOC [11], DBF [35] and CC [1], using as comparison criteria the MSR and the size of the biclusters obtained by each method. In addition, the algorithms determine the biological significance of the biclusters in connection with information on the yeast cell cycle. The biological relevance is determined based on the statistical significance computed using the GO annotation database [36]. As for the comparison based on the MSR and the size of the biclusters obtained, MOEA obtained results far superior to those obtained by the other methods.
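The MSR criterion used in these comparisons is the mean squared residue introduced by Cheng and Church [1]: for a bicluster (I, J), it averages the squared residues a_ij - a_iJ - a_Ij + a_IJ, where a_iJ, a_Ij and a_IJ are the row, column and overall means. A minimal sketch of its computation (plain Python; the toy matrix is invented for illustration):

```python
# Sketch of the mean squared residue (MSR) of Cheng and Church, the
# homogeneity measure the algorithms above minimize. For each element
# of the bicluster (I, J), the residue is
#     a_ij - row_mean_i - col_mean_j + overall_mean.

def msr(A, I, J):
    row_mean = {i: sum(A[i][j] for j in J) / len(J) for i in I}
    col_mean = {j: sum(A[i][j] for i in I) / len(I) for j in J}
    all_mean = sum(row_mean[i] for i in I) / len(I)
    return sum(
        (A[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
        for i in I for j in J
    ) / (len(I) * len(J))

# A perfectly additive bicluster (every row is a shifted copy of the
# others) has MSR 0, the ideal value.
A = [[1, 2, 3],
     [2, 3, 4],
     [5, 6, 7]]
print(msr(A, [0, 1, 2], [0, 1, 2]))  # -> 0.0
```

A low MSR alone is not enough (a single cell trivially has MSR 0), which is why size must be maximized at the same time, as the multiobjective formulations above do.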
Dharan and Nair [23] proposed the Reactive GRASP method. The statistical significance of the found biclusters is assessed to see how well they correspond with the known gene annotation [33]. For this purpose, the SGD GO gene ontology term finder package [36] was used. The performed tests show that Reactive GRASP is able to find biclusters with higher statistical significance than the basic GRASP [23] and CC [1]. Das and Idicula [25] proposed an algorithm based on greedy search combined with PSO. The tests were conducted on expression data of the cell cycle of the yeast Saccharomyces cerevisiae. The data used is based on [34], and consists of 2884 genes under 17 conditions. The results were compared with those of SEBI [20], CC [1], FLOC [11], DBF [35] and Modified Greedy [25]. The comparison was based on the MSR (also named MSE) presented by [1], and on the size of the biclusters. The GS Binary PSO outperformed the other methods, except for DBF, on the MSR, and showed competitive results in the size of the biclusters. Caldas and Kaski [26] proposed a method based on a hierarchical model (TreeBic). The model assumes that the samples or conditions in a microarray are grouped in a tree structure, where nodes correspond to subsets of the hierarchy. Each node is associated with a subset of genes for which the samples are highly homogeneous. The tests were conducted on a collection of 199 miRNAs profiled from 218 human tissues from healthy and tumor
3rd INTERNATIONAL SUPERCOMPUTING CONFERENCE IN MÉXICO 2012

cell lines. The results were compared with those obtained by the Samba [18], Plaid [13], CC [1] and OPSM [16] methods. TreeBic performed better in general, both in terms of the proportion of biclusters enriched for at least one tissue or GO category, and in terms of the total number of tissues and GO categories enriched. Despite these results, the TreeBic method had the second lowest number of biclusters generated.

4. Proposed algorithm

We propose a multiobjective genetic algorithm where each individual in the population is a bicluster. The objectives are to minimize the MSR and to maximize the bicluster size. Unlike the MOEA proposed in [33], this algorithm does not require a local search to keep the biclusters under the MSR threshold δ. Instead, it considers being under or over the threshold as the first condition when selecting the best biclusters. This brings two important advantages: first, it avoids the parameter α required by the local search, which heavily influences the obtained results; second, it reduces the execution time, allowing a larger number of individuals and generations to be used.

4.1 Representation of biclusters

A bicluster is represented as a binary string where the first part of the bits corresponds to the genes and the second part to the conditions. A value of one at position j of the first part indicates that gene j is included in the bicluster; the same applies to the condition part of the string. A bicluster consists of the expression values of the selected genes under the selected conditions. Figure 1A shows an example of the binary representation of a bicluster. The bicluster corresponding to the binary string in Figure 1A is shown in Figure 1C; it is extracted from the expression matrix presented in Figure 1B.

Figure 1. Representation of a bicluster. A) The binary string representing the bicluster. B) An array of gene expression data.
C) Bicluster values comprising the selected expression values (shaded) of the matrix in Figure 1B.

4.2 Multiobjective genetic algorithm

Algorithm 1 starts by creating a population of n biclusters. Each bicluster is created by selecting at random two genes and two conditions of the expression matrix, such that the MSR does not exceed the threshold δ. If the threshold is exceeded, the selected pair is discarded and the process is repeated until the MSR falls under the threshold. The nondominated front of each bicluster is computed as in [33]. The nondominated front is calculated based on the
concept of dominance. A bicluster i dominates a bicluster j if either of the following conditions holds:

1. The MSR of i (MSR_i) is less than or equal to MSR_j, and the size of i (size_i) is larger than size_j.
2. size_i is greater than or equal to size_j, and MSR_i is less than MSR_j.

For a bicluster to belong to a nondominated front, it must not be dominated by any other bicluster in the population. Once the biclusters of the first front are identified, they are set aside for the identification of the second front of biclusters. This process is repeated successively until no dominated biclusters remain. Step 3 computes the crowding distance of each individual as done by Mitra and Banka [33]. This distance is a measure of the degree of saturation of the search space (in terms of size and MSR): the more similar the MSR and size of an individual are to those of the rest of the population, the lower its crowding distance. This distance is used to maintain diversity in the population. Once the nondominated fronts and the crowding distances are computed, the selection of the best individuals is performed. Binary tournament selection with crowding is applied: first the individuals within the population are randomly rearranged, and then each pair of adjacent individuals conducts a tournament. An individual i is chosen over an individual j if it meets any of the following conditions:

1. The MSR of i is below the threshold δ, and the MSR of j is not below the threshold.
2. Both MSRs are on the same side of the threshold δ, and i is in a front with a lower index than j.
3. Both MSRs are on the same side of the threshold δ, both belong to the same front, and the crowding distance of i is greater than that of j.

Crossover is applied to the selected individuals. For this process, individuals are taken in pairs (parents), and two new biclusters (children) are created for each pair of parents.
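The dominance test, the tournament selection, and the crossover and mutation operators described in this section can be sketched as follows (an illustrative Python sketch; the authors implemented the method in another language, and field names such as front and crowding are assumptions of this sketch):

```python
import random

def dominates(i, j):
    """Bicluster i = (msr_i, size_i) dominates j under the two
    conditions stated above."""
    (msr_i, size_i), (msr_j, size_j) = i, j
    return (msr_i <= msr_j and size_i > size_j) or \
           (size_i >= size_j and msr_i < msr_j)

def tournament(i, j, delta):
    """Binary tournament with crowding: each individual is a dict
    with 'msr', 'front' and 'crowding' keys (illustrative names)."""
    below_i, below_j = i['msr'] < delta, j['msr'] < delta
    if below_i != below_j:              # condition 1
        return i if below_i else j
    if i['front'] != j['front']:        # condition 2
        return i if i['front'] < j['front'] else j
    return i if i['crowding'] > j['crowding'] else j  # condition 3

def crossover(p1, p2, n_genes):
    """Two-point crossover on the binary strings: one cut inside the
    gene part, one cut inside the condition part."""
    g = random.randrange(1, n_genes)            # cut in gene part
    c = random.randrange(n_genes + 1, len(p1))  # cut in condition part
    child1 = p1[:g] + p2[g:n_genes] + p2[n_genes:c] + p1[c:]
    child2 = p2[:g] + p1[g:n_genes] + p1[n_genes:c] + p2[c:]
    return child1, child2

def mutate(bits):
    """Flip one randomly chosen bit (include or exclude one gene
    or condition)."""
    k = random.randrange(len(bits))
    return bits[:k] + [1 - bits[k]] + bits[k + 1:]
```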
For each child, two random crossover points are selected in the binary strings corresponding to both parents. The first crossover point is set to a bit position corresponding to a gene, and the second to a position corresponding to a condition. The child takes from one parent the genes found to the left of the first crossover point, and from the other parent the genes to the right. The same procedure is applied to the conditions. The parent bicluster whose genes are taken
from the left side of the first crossover point is chosen randomly. Figure 2 shows an example of the crossover of two biclusters. Child one is created by taking the genes to the left of the first crossover point in parent one and the genes to the right of the first crossover point in parent two; it takes the conditions to the left of the second crossover point from parent two and the conditions to the right of the second crossover point from parent one. Child two is created in the reverse order of child one.

Figure 2. Example of the crossover of two biclusters.

Subsequently, the mutation process is applied to a percentage of the biclusters in the children population. Mutation of a bicluster is done by selecting a random bit in the string and flipping its value. If the bit has a zero value, it is changed to one, which means that a gene or condition that was not considered in the bicluster is now included. Figure 3 shows an example of a mutation in a bicluster. In this example the tenth bit was randomly selected and modified. The bit corresponds to the position of a gene; its value was changed from zero to one, which means that the expression values of gene number 10 in the expression matrix for the selected conditions (shaded values) will be included in the bicluster.

Figure 3. Example of the mutation of a bicluster.

After the mutation, a process that combines both populations (parents and children) is carried out. This process consists in treating all the biclusters from both populations as a single population. For this combined population, the nondominated fronts and crowding distances are recalculated. Subsequently, the biclusters of the combined population are ordered according to the following criteria:

1. First come the biclusters that are below the threshold δ.
2.
Within the biclusters on the same side of the threshold, those belonging to a front with a lower index come first.
3. Among the biclusters on the same side of the threshold and in the same front, those with a larger crowding distance come first.

Once the individuals of the combined population are ordered, the first n are selected; these constitute the next generation of biclusters. This process stops when a number of generations ng has passed without change in the size of the largest bicluster with MSR below
threshold is reached.

5. Experimental results

We applied the multiobjective genetic algorithm to two data sets used as test cases. The first corresponds to the expression of 2884 genes under 17 conditions of the yeast Saccharomyces cerevisiae, containing 34 null values. The second corresponds to the expression of 4026 genes under 96 conditions of human B-cell lymphoma, with 47,639 null values corresponding to 12.3% of the full set. Both data sets were taken from the site med.harvard.edu/ [37]. The experiments were performed using an MSR threshold of δ = 300 for the yeast data and δ = 1200 for the lymphoma data. Although these values have no biological justification, they have been used extensively to evaluate and compare a variety of biclustering methods. In the case of the yeast set, null values were replaced by random values in the range 0 to 800; in the case of the lymphoma set, null values were replaced by random values in the range -800 to 800. Both the threshold values selected for the MSR and the strategy and range used to replace the null values were established in this manner in order to allow a more direct comparison with the results reported in other studies. The experiments consisted of 30 runs with each data set, using populations of 50 individuals and setting a value of 400 for the number of generations without improvement. A selection rate of 90% was used, with crossover and mutation rates of 100% and 50%, respectively. The method was coded and implemented in C#; the experiments were performed under Windows using Visual Studio Ultimate 2010 on a laptop with a 1.73 GHz processor and 1.00 GB of RAM. The algorithm receives as input a text file with the matrix of expression data to be processed.
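The null-replacement strategy just described can be sketched as follows (Python for illustration; whether the original implementation drew integer or real values is not stated, so this sketch assumes uniform real values):

```python
import numpy as np

def fill_nulls(expr, null_mask, low, high, seed=0):
    """Replace missing entries with uniform random values in
    [low, high], e.g. 0..800 for the yeast set and -800..800
    for the lymphoma set (ranges taken from the text)."""
    rng = np.random.default_rng(seed)
    out = expr.astype(float).copy()
    out[null_mask] = rng.uniform(low, high, size=int(null_mask.sum()))
    return out
```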
The output is another text file with the constructed biclusters, the values that were used to replace the null values in the matrix, and descriptive information on the best biclusters built. The average MSR value, the number of genes, the number of conditions, and the average and maximum size of the discovered biclusters were used as assessment criteria. Table 1 shows a comparison of the results obtained on the yeast data set. For this comparison, the FLOC [11], DBF [35] and MOEA [33] algorithms, together with the one presented by Cheng and Church [1], are considered; this is a representative group of biclustering algorithms that have been analyzed frequently in the literature. The results reported for these algorithms were taken from the work of Mitra and Banka [33]. The proposed algorithm (called MOGA) significantly outperforms the other algorithms in the size of the biclusters discovered under the defined threshold. MOGA obtains larger biclusters, even larger than those of MOEA, which already exceeded the performance of the other algorithms. A very important advantage of MOGA with respect to MOEA is that it does not require a local search to keep the biclusters below the threshold, which avoids handling the parameter α (used in various methods), whose proper choice largely influences the results.
Table 1. Comparative results of biclustering methods on data from the yeast Saccharomyces cerevisiae, using an MSR threshold δ = 300.

The results of the algorithm on the lymphoma data were compared with the results reported by Mitra and Banka [33], which were the best results in the literature (see Table 2). The table shows that MOGA outperforms the best MOEA result, both in terms of the size of the biclusters found and in terms of the CI value. The CI (Consistency Index), introduced by Mitra and Banka, represents the relationship between the MSR of a bicluster and its size. This ratio indicates how well the two requirements of a bicluster are met: i) the expression levels of the genes are similar over a range of conditions, i.e., the bicluster must have a low MSR, and ii) the size is as large as possible. A bicluster is considered better the smaller its CI value.

Table 2. Best biclusters found on the data set of the human B-lymphoma cells, using an MSR threshold δ = 1200.

6. Conclusions

A new multiobjective genetic algorithm for the biclustering of gene expression data has been proposed. Experiments conducted on two biological data sets that have been widely used as test cases show that the proposed algorithm performs better than others currently reported in the literature. An important feature of our algorithm is that it does not require a local search, contrary to some current algorithms, which need this technique to keep the MSR below the threshold. The experiments focused on the discovery of large biclusters with MSR below predefined thresholds for both data sets, thresholds widely accepted by the scientific community. Future work will assess the biological significance of the generated biclusters based on ontological annotations. The proposed evolutionary algorithm will also be redesigned to work with parallel models of computation, which will allow us to deal with larger instances.

7. References

[1] Y.
Cheng and G. M. Church, Biclustering of expression data, Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB 00), 2000.
[2] D. S. Rodríguez, J. C. Riquelme, and J. S. Aguilar, Análisis de datos de expresión genética mediante técnicas de biclustering, tech. rep., Universidad de Sevilla.
[3] J. Aguilar, Shifting and scaling patterns from gene expression data, Bioinformatics, vol. 21, 2005.
[4] S. C. Madeira and A. L. Oliveira, Biclustering algorithms for biological data analysis: a survey, IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 1, no. 1, 2004.
[5] J. A. Hartigan, Direct clustering of a data matrix, Journal of the American Statistical Association (JASA), vol. 67, no. 337, 1972.
[6] S. Busygin, G. Jacobsen, and E. Kramer, Double conjugated clustering applied to leukemia microarray data, in Proceedings of the 2nd SIAM International Conference on Data Mining, Workshop on Clustering High Dimensional Data.
[7] G. Getz, E. Levine, and E. Domany, Coupled two-way clustering analysis of gene microarray data, Proceedings of the National Academy of Sciences USA, 2000.
[8] A. Califano, G. Stolovitzky, and Y. Tu, Analysis of gene expression microarrays for phenotype classification, in Proceedings of the International Conference on Computational Molecular Biology, 2000.
[9] Q. Sheng, Y. Moreau, and B. D. Moor, Biclustering microarray data by Gibbs sampling, Bioinformatics, vol. 19, no. 2, 2003, pp. ii196-ii205.
[10] J. Yang, W. Wang, H. Wang, and P. Yu, δ-clusters: Capturing subspace correlation in a large data set, Proceedings of the 18th IEEE International Conference on Data Engineering, 2002.
[11] J. Yang, W. Wang, H. Wang, and P. Yu, Enhanced biclustering on expression data, Proceedings of the 3rd IEEE Conference on Bioinformatics and Bioengineering, 2003.
[12] H. Wang, W. Wang, J. Yang, and P. S. Yu, Clustering by pattern similarity in large data sets, Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, 2002.
[13] L. Lazzeroni and A. Owen, Plaid models for gene expression data, Statistica Sinica, vol. 12, 2002.
[14] E. Segal, B. Taskar, A. Gasch, N. Friedman, and D. Koller, Rich probabilistic models for gene expression, Bioinformatics, vol. 17, no. Suppl. 1, 2001, pp. S243-S252.
[15] E. Segal, B. Taskar, A. Gasch, N.
Friedman, and D. Koller, Decomposing gene expression into cellular processes, Proceedings of the Pacific Symposium on Biocomputing, vol. 8, 2003.
[16] A. Ben-Dor, B. Chor, R. Karp, and Z. Yakhini, Discovering local structure in gene expression data: The order-preserving submatrix problem, in Proceedings of the 6th International Conference on Computational Biology (RECOMB 02), 2002.
[17] T. M. Murali and S. Kasif, Extracting conserved gene expression motifs from gene expression data, Proceedings of the Pacific Symposium on Biocomputing, vol. 8, 2003.
[18] A. Tanay, R. Sharan, and R. Shamir, Discovering statistically significant biclusters in gene expression data, Bioinformatics, vol. 18, no. Suppl. 1, 2002, pp. S136-S144.
[19] J. Liu and W. Wang, OP-Cluster: Clustering by tendency in high dimensional space, Proceedings of the 3rd IEEE International Conference on Data
Mining, 2003.
[20] F. Divina and J. S. Aguilar, Biclustering of expression data with evolutionary computation, IEEE Transactions on Knowledge and Data Engineering, vol. 18, 2006.
[21] K. Bryan, P. Cunningham, and N. Bolshakova, Biclustering of expression data using simulated annealing, 18th IEEE Symposium on Computer-Based Medical Systems (CBMS 05), 2005.
[22] J. Gu and J. S. Liu, Bayesian biclustering of gene expression data, BMC Genomics, vol. 9, no. Suppl. 1, 2008, p. S4.
[23] S. Dharan and A. S. Nair, Biclustering of gene expression data using reactive greedy randomized adaptive search procedure, BMC Bioinformatics, vol. 10, no. Suppl. 1, 2009, p. S27.
[24] G. Pandey, G. Atluri, M. Steinbach, C. L. Myers, and V. Kumar, An association analysis approach to biclustering, ACM SIGKDD, ACM New York, NY, USA, 2009.
[25] S. Das and S. M. Idicula, Greedy search-binary PSO hybrid for biclustering gene expression data, International Journal of Computer Applications, vol. 2, no. 3, 2010.
[26] J. Caldas and S. Kaski, Hierarchical generative biclustering for microRNA expression analysis, RECOMB, 2010.
[27] A. Prelic, S. Bleuler, P. Zimmermann, A. Wille, P. Bühlmann, W. Gruissem, L. Hennig, L. Thiele, and E. Zitzler, A systematic comparison and evaluation of biclustering methods for gene expression data, Bioinformatics, vol. 22, 2006.
[28] J. Ihmels, G. Friedlander, S. Bergmann, O. Sarig, Y. Ziv, and N. Barkai, Revealing modular organization in the yeast transcriptional network, Nature Genetics, vol. 31, 2002.
[29] J. Ihmels, S. Bergmann, and N. Barkai, Defining transcription modules using large-scale gene expression data, Bioinformatics, vol. 20, 2004.
[30] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M.
Ringwald, G. M. Rubin, and G. Sherlock, Gene Ontology: tool for the unification of biology, Nature Genetics, vol. 25, no. 1, 2000.
[31] A. P. Gasch, P. T. Spellman, C. M. Kao, O. Carmel-Harel, M. B. Eisen, G. Storz, D. Botstein, and P. O. Brown, Genomic expression programs in the response of yeast cells to environmental changes, Molecular Biology of the Cell, vol. 11, 2000.
[32] A. Wille, P. Zimmermann, E. Vranova, A. Furholz, O. Laule, S. Bleuler, L. Hennig, A. Prelic, P. von Rohr, L. Thiele, E. Zitzler, W. Gruissem, and P. Bühlmann, Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana, Genome Biology, vol. 5, 2004, p. R92.
[33] S. Mitra and H. Banka, Multi-objective evolutionary biclustering of gene expression data, Journal of the Pattern Recognition Society, vol. 39, 2006.
[34] S. Tavazoie, J. Hughes, M. Campbell, R. Cho, and G. M. Church, Systematic determination of genetic network architecture, Nat. Genet., vol. 22, no. 3, 1999.
[35] Z. Zhang, A. Teo, B. Ooi, and K. Tan, Mining deterministic biclusters in gene expression data, Proceedings of the Fourth IEEE Symposium on Bioinformatics and Bioengineering (BIBE 04), 2004.
[36] SGD GO TermFinder (http://db.yeastgenome.org/cgi-bin/go/gotermfinder)
[37] Harvard Molecular Technology Group and Lipper Center for Computational Genetics (http://arep.med.harvard.edu)
Sorter & Identifier of Oaks Through Mobile Devices

Luis Enrique Jaime Meza 1, Nelly Beatriz Santoyo Rivera 2 and Daniel Jorge Montiel García 2
1 Master in information technologies, 2 Master in information systems, computational systems; Instituto Tecnológico Superior de Irapuato

Abstract

Computer-based identification of images has made very large progress, but we have not taken full advantage of it; a neglected area is our biodiversity. Through the development of this research, we determine whether it is possible to identify a plant species (tree, plant, fungus) through image processing on mobile devices.

1. Introduction

Computer-based identification of images has made very large progress, but we have not taken full advantage of it; a neglected area is our biodiversity. Guanajuato is currently the only state that has not catalogued the families of oak trees in its territory, and our purpose is to identify the families that live in our state. The biology department of the Technological Institute of Irapuato has worked on identifying the families of these trees, but its identification methods were inaccurate. An experiment was conducted that compared flower petals; the method consisted of identification based on the shape and color of each petal, and the successful outcome of the comparison process led to the proposal to apply the same procedure to photographic images. Considering that excessive deforestation is currently carried out for urbanization or agriculture, millions of hectares of trees are cut down without anyone bothering to find out which class they belong to [1]. Certainly, many areas of trees and plants that are not yet endangered, but are about to be, are unduly deforested for lack of an analysis determining which plants live in those places.
Early identification of varieties of trees and plants on a mobile device will reveal the species, and thus help to assess, determine or decide whether an area is suitable for exploitation. Because there is an extensive biodiversity of oaks in Guanajuato, with a large set of differences and similarities, biologists have decided to classify them, taking into account the typical characteristics of each species (their leaves, color, size, location, etc.) that make it different from the others. However, making this classification has proved very difficult, because discovering which species a specimen belongs to requires extensive experience in the recognition of oaks: the parts of the tree must be analyzed very thoroughly to avoid confusion with other species due to the similarity of their characteristics [2]. The system is based on the C language, since this allows adding libraries and methods that help with fast programming of the modules; algorithms are required for the detection and identification of images, and in this case we use Canny edge detection on the image.
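The Canny edge detection used by the system starts from image gradients. A minimal sketch of that gradient step with Sobel kernels (Python with NumPy for illustration, while the system itself is written in C):

```python
import numpy as np

# Sobel kernels: KX responds to horizontal intensity changes (Gx),
# KY to vertical ones (Gy).
KX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
KY = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], float)

def filter2(img, k):
    """Naive 'valid' 2-D correlation, sufficient for the sketch."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = (img[i:i + 3, j:j + 3] * k).sum()
    return out

def gradient(img):
    """Edge strength (magnitude) and direction (degrees)."""
    gx, gy = filter2(img, KX), filter2(img, KY)
    return np.hypot(gx, gy), np.degrees(np.arctan2(gy, gx))
```

A full Canny pipeline would precede this with Gaussian smoothing and follow it with non-maximum suppression and hysteresis thresholding, as the Development section describes.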
2. Development

The development of our system is based on the detection of the shape of oak leaves, which differ markedly from one another. The need for a deep identification leads us to perform several tests of color, size and shape before our expert system makes a deduction. For this reason the image is passed through various methods containing pattern recognition algorithms, among them the Canny algorithm [3], which focuses on the detection of edges (this step helps to identify the shape of our leaf) and involves three key steps:

Noise reduction. The Canny detection algorithm uses an edge filter based on the first derivative of a Gaussian. Since it is susceptible to the noise present in unprocessed image data, the original image is first transformed with a Gaussian filter. The result is a slightly blurred version of the original image, which is no longer affected by single-pixel noise to any significant degree.

Figure 1. An example of a 5x5 Gaussian filter.

Finding the intensity gradient of the image. An edge in an image may point in different directions, so the Canny algorithm uses four filters to detect horizontal, vertical and diagonal edges in the blurred image. An edge detection operator (Roberts, Prewitt or Sobel, for example) returns the first derivative in the horizontal direction (Gx) and in the vertical direction (Gy). From these, the edge gradient and direction can be determined.

Figure 2. Obtaining the edge gradient and direction.

Pattern recognition, within the group of identification methods and sequences, is the last step in the image processing; it is what allows us to make the best choice among the alternatives. Here is a flowchart of the recognition [4]:

Figure 3. Flowchart.

The following describes the steps to be considered when dealing with images: The first step is the acquisition of the image; this requires a camera and the ability to digitize the signal it produces.
The next step is the preprocessing of the image. The objective of this step is to improve the image so as to increase the likelihood of success in the subsequent steps. The aim of some methods used in this stage
is to increase the contrast, remove noise, etc. Then comes segmentation: in this step the image is partitioned into its constituent parts or objects. The data obtained in the segmentation stage must then be converted to a structure appropriate for further processing. The description step, also called feature selection, extracts quantitative information on the relevant features that distinguish one object from another. The last step is recognition and interpretation: recognition is the process that assigns an identifier to an object based on the information provided by its descriptors, and interpretation assigns a meaning to the recognized object. In general, processing systems that include recognition and interpretation are associated with image analysis applications aimed at extracting information [5].

Figure 4. Stages of image processing.

The system reduces by 90% the classification time of the process carried out manually in the Biology Department of the Technological Institute of Irapuato; however, there remains a need to collect a good sample of leaves, since the system showed errors and inconsistencies with damaged leaves, whether the damage was caused by external factors or occurred at collection time. Extensive research was conducted by consulting different sources of information, such as the dichotomous key of the oaks of Mexico, for the collection and analysis; a distinction was made among these sources to select the main features that are similar in most species, such as pubescence, shape and leaf margin. The results shown by the system are many; the highlights are listed below. The system is tangible, stable and 80% reliable. This conclusion was reached because of the differences the tests showed between the human classifiers and the system. There is a relatively large database housing most of the families of oaks that exist in the Mexican Republic, but there is still no classification and identification of those living in our own state. We aim to improve the current version of the system by testing different recognition algorithms that allow us to reduce the margin of error, and by enlarging our database of already classified images, to increase the speed and accuracy of the system. Based on this ongoing research, software has been developed that provides the taxonomic identification of the leaves of oaks for the proper classification of the species in the state of Guanajuato. Among the species that have been identified and classified stand out Q. Oxydenia J. T. Howell, Q. acutifolia Née, Q. affinis Scheidweiler, Q. agrifolia ssp., and Q. acatenangensis Trel.
3. Conclusions

The present work has laid just the basics of a computer system for the identification of oaks in the state of Guanajuato: quick and easy recognition of oak leaves for the correct classification of most species in the state. The system decreases by 80% the time of the classification that was done manually in the Department of Biology of the Technological Institute of Irapuato, but it does not decrease the time taken by the leaf collection, since that involves several factors outside the development of the system. Extensive research was conducted by consulting different sources of information, such as the dichotomous key of the oaks of Mexico, for the collection and analysis; a distinction was made among these sources to select the main features that are similar in most species, such as pubescence, shape and leaf margin.

References

[1] Juan Martínez Cruz, Oswaldo Téllez Valdés and Guillermo Ibarra Manríquez, Structure of the oaks of the Sierra de Santa Rosa de Guanajuato, Mexico, Journal of Biodiversity, 80.
[2] Silvia González Morales and Mary Muñiz Chain, Software for the taxonomic identification of oaks in Guanajuato, December.
[3] Juan Martínez Cruz, Oswaldo Téllez Valdés and Guillermo Ibarra Manríquez, Structure of the oaks of the Sierra de Santa Rosa de Guanajuato, Mexico, Journal of Biodiversity, 80.
[4] J. Canny, A Computational Approach to Edge Detection, IEEE Trans. Pattern Analysis and Machine Intelligence, 8(6):679-698, 1986.
[5] Gary Bradski and Adrian Kaehler, Learning OpenCV: Computer Vision with the OpenCV Library, O'Reilly.
SCI

Juan Carlos González Córdoba 1, Nelly Beatriz Santoyo Rivera 2 and Daniel Jorge Montiel García 2
1 Master in information technologies, 2 Master in information systems, computational systems, Instituto Tecnológico Superior de Irapuato

Abstract

Since the beginnings of civilization, writing has been the greatest legacy of humanity, allowing us to transmit our knowledge through the generations. We have tried to reproduce this human quality programmatically inside computers, but the way in which the brain performs these actions remains one of the most complex and unresolved topics today. With this in mind, Alan Turing (1950) proposed the Turing Test, which poses a formal challenge: the possibility of a computer understanding us the same way our neighbors do, as discussed in the article Computing Machinery and Intelligence [1]. This gives us the pattern of using artificial intelligence techniques [2] [3] [4] [5] to attempt a possible conversation. In this investigation, expert systems were used: a base of facts that serves as a priori knowledge is needed, together with the formulation of rules of inference for later integration into an inference engine, to which a search method is applied to reach the solution. Among the results obtained is the ability to conduct a simple conversation with the user within the margins of a context defined by its knowledge base.

Keywords: Turing Test, knowledge base, inference engine.

1.
Introduction

Alan Turing posed the question Can machines think? in his article Computing Machinery and Intelligence. In that article he proposes the Turing test, which consists of making a machine capable of holding a conversation with a person without the person seeing the machine; the conversation should be fluent and consistent, so that the person thinks they are talking with another person when in reality it is a machine that is responding. Turing proposed that if a machine can behave intelligently in all aspects, then it should be considered intelligent [6]. The Loebner Prize is a chatbot competition in which, to this day, the first-place prize has not been awarded, as no robot has been sufficiently convincing as a human being. This competition, held annually since 1990, is sponsored by Hugh Loebner and several universities. This year the winner was Bruce Wilcox with his program Suzette, who was credited with the $3,000 prize for the third place and the bronze medal. Two prizes will be awarded only once, and so far nobody has won them. One consists of a silver medal and $25,000, which corresponds to the first chatbot that is indistinguishable
from a human; the other, a gold medal and $100,000, will be awarded to the first computer program capable of emulating human intelligence using, in addition to text, voice and other elements of visual communication. Once the «gold medal» of the chatbots is obtained, this competition will no longer be held, since the Turing test will have been completely passed, and then we will be able to say that machines are as intelligent as we are [7] [8]. It is on the area of chatbots that this project focuses: the goal is to develop an Intelligent Conversation Software (SCI, by its Spanish acronym) that is capable of holding a conversation with a person through a graphical user interface, and that lets the user contribute to the knowledge base of the software by adding their own questions and the possible answers to be given by the program.

2. Development

Because the Turing test requires creating a program that behaves intelligently, we investigated how some chatbots are programmed, finding that the vast majority of these bots have three parts:

A knowledge base.
An inference engine.
A user interface.

To develop the project, a methodology consisting of three steps is proposed:

Planning: a process focused on researching the topic and developing the theoretical framework.
Execution: here the tools to be used are defined (in this case the C# programming language and XML files) and the Intelligent Conversation Software is developed with them.
Analysis: this phase is oriented to the analysis of the program's performance and the detection of execution errors. Here the SCI was presented to different people so that they could use it, providing feedback and user opinions on the SCI.

The SCI architecture is divided into the three parts mentioned above; the functions of those parts are detailed below.
Knowledge Base

The knowledge base represents the knowledge of the system and of the problem in the form of descriptive facts and rules of logical inference. To represent the system's knowledge base we decided to use an XML (eXtensible Markup Language) file. This XML file has the following structure:

<base>
  <categoria>
    <patron></patron>
    <answers></answers>
  </categoria>
</base>

The node <base> is the root node that stores the entire knowledge base used by the system, and everything inside this node (which
is between the tags <base> and </base>) will be used by the SCI to validate and understand the questions asked by the user and all the possible answers to be given by the system. Specifically, each element of the knowledge base is stored within a <categoria> node; inside this node there are <patron> and <answers> nodes, which store, respectively, the questions accepted or recognized by the system and the list of possible answers to be given by the system. The following is an example of how information is stored within the system's knowledge base:

<base>
  <categoria>
    <patron>hola</patron>
    <answers>hola</answers>
  </categoria>
  <categoria>
    <patron>como te llamas</patron>
    <patron>cual es tu nombre</patron>
    <answers>me llamo GENIALO</answers>
    <answers>soy GENIALO</answers>
    <answers>mi nombre es GENIALO</answers>
    <answers>GENIALO</answers>
    <answers>yo soy GENIALO</answers>
  </categoria>
</base>

The example shows that within <base> there are two <categoria> nodes containing the information that the system recognizes or accepts from the user (in this case a greeting, "hola", and a question, "como te llamas"). The recognized phrases are delimited within <patron> nodes and, as might be expected, the <answers> nodes contain the answers the system gives to their respective questions; both the questions and the answers are grouped within their respective <categoria> node. So, continuing the example, if the user asks "como te llamas" or "cual es tu nombre", the system would respond "mi nombre es GENIALO", "me llamo GENIALO", or any of the other possible answers related to the question, selected at random by the system.

Inference motor

The inference motor is the way in which the system attempts to reach a valid conclusion by looking for rules whose conditions are met.
Stated another way, it is the way the system takes the information entered by the user, consults the knowledge base, and checks which of the facts forming part of the knowledge base matches the request; the matching fact is accepted and the system proceeds to give an answer. Each time the system does not understand the question asked by the user (because it is not part of its knowledge base), the user can add it to the knowledge base if he or she wishes (this is discussed in more detail in the section concerning the user interface).

The inference engine of the SCI system is based on SQL-style queries written in C# (LINQ), which are responsible for searching the XML file that forms the system's knowledge base. The following is an example of how the inference engine performs the search:

var query = from p in docbase.Elements("base").Elements("categoria")
            where (string)p.Element("patron") == mensajeusuario
            select p;
This section of code performs the query against the XML file of the knowledge base, where mensajeusuario is the user's request. One of the suggested responses entered into the knowledge base is then selected from the search result. The selection of the response to return is done through the following code:

foreach (var registro in query)
{
    int i = registro.Elements("answers").Count();
    int index = rnd.Next(i);
    string s = registro.Elements("answers").ElementAt(index).Value;
    mensajeanterior = mensajeusuario;
    return s;
}

This section of code is responsible for randomly selecting one of the possible answers to be given by the SCI system. If the pattern is not within the machine's knowledge base, the system answers that it does not understand what the user has entered; at this point the user can choose to add the sentence to the knowledge base by clicking the configuration button. Any phrase the user enters has all its exclamation marks, question marks and periods removed, and everything the user types is converted to lower case, because these symbols multiply the number of ways in which the same phrase can be written, for example "HOLA!!!!".

User's interface

The program interface consists of two panels (conversation and message) and a configuration button. The conversation panel consists of a non-editable text box, which shows the conversation held with the program, while the message panel consists of a text box in which to write what the user wants to say to the system, a submit button to send the message to the system, and a clean button which clears the text box that displays the conversation with the system.
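For illustration only, the lookup-and-answer behavior just described (normalize the phrase, match it against the <patron> nodes, pick a random <answers> node) can be sketched in Python; the paper's actual implementation is in C# with LINQ, and all names below are hypothetical:

```python
import random
import re
import xml.etree.ElementTree as ET

def normalizar(frase):
    """Strip punctuation (!, ?, ., inverted marks) and lower-case the phrase."""
    return re.sub(r"[!?.¡¿]", "", frase).lower().strip()

def responder(base, mensaje_usuario):
    """Return a random stored answer for a recognized pattern, or None."""
    pregunta = normalizar(mensaje_usuario)
    for categoria in base.findall("categoria"):
        if pregunta in (p.text for p in categoria.findall("patron")):
            return random.choice([a.text for a in categoria.findall("answers")])
    return None  # not in the knowledge base: the UI offers to add it

base = ET.fromstring(
    "<base><categoria>"
    "<patron>hola</patron><answers>hola</answers>"
    "</categoria></base>"
)
print(responder(base, "HOLA!!!!"))  # -> hola (after normalization)
print(responder(base, "adios"))    # -> None
```

The normalization step mirrors the behavior described above: "HOLA!!!!" and "hola" collapse to the same pattern, so one <categoria> covers every punctuation variant of the phrase.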
The configuration button is the one that allows the user to contribute to the system's knowledge base, giving the user the opportunity to attach their own questions to the knowledge base along with the list of possible answers to be given by the SCI system. It was designed this way to help append information to the knowledge base: the more the system interacts with users, and the more they in turn contribute to it, the closer the system will come to its goal, which is to carry on a fluid conversation with someone. This, however, creates a problem: the large amount of data and information that accumulates in the XML file representing the SCI system's knowledge base. Figure 1 shows the SCI system interface.
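As a sketch of what the configuration button must do internally (append a new <categoria> with its <patron> and <answers> children), here is an illustrative Python version using ElementTree; the function name and sample phrases are hypothetical, and the paper's implementation is in C#:

```python
import xml.etree.ElementTree as ET

def agregar_categoria(base, patrones, respuestas):
    """Append a new <categoria> with its <patron> and <answers> nodes."""
    categoria = ET.SubElement(base, "categoria")
    for p in patrones:
        ET.SubElement(categoria, "patron").text = p
    for a in respuestas:
        ET.SubElement(categoria, "answers").text = a
    return base

base = ET.fromstring("<base></base>")
agregar_categoria(base, ["que hora es"], ["no tengo reloj"])
print(ET.tostring(base, encoding="unicode"))
# -> <base><categoria><patron>que hora es</patron><answers>no tengo reloj</answers></categoria></base>
```

Serializing the tree back to the XML file after each append is what makes every user interaction enlarge the knowledge base, which is exactly the growth problem noted above.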
Figure 1. System Interface SCI

The interface that allows appending information to the knowledge base consists of two text boxes and a list. The first box holds the data that will become the content of the <patron> node, while the second box is used to generate the content of the list, each of whose elements corresponds to an <answers> node. Figure 2 shows the interface to append data to the knowledge base.

Figure 2. Interface to append data to the knowledge base

3. Conclusions

Today the question of whether a machine will be able to pass the Turing test has been replaced by the question of how long it will take for a machine to pass it. This is due to the daily advances in the field of artificial intelligence; a brief look at robotic technology in the Far East is enough to realize that machines may already have matched human intelligence, and some may even have surpassed it. But then, why have they not passed the Turing test? I dare say it is because machines do not have any human imperfections. So the challenge is not to make machines more intelligent than man, as this has already been achieved, but to make machines ever more human, and when that is done, I am sure a machine will finally pass the Turing test. If you wonder why the Loebner Prize judges can still distinguish those machines, I think the answer is simple: just compare ourselves with some technological device now in common use, say, a scientific calculator, which does far more than the basic operations. How long does it take a person to perform a multiplication using only mental ability? Now, how long does it take a calculator to perform the same operation? The difference is huge! So, following that logic, I could simply ask for the result of an arithmetic operation to tell whether I am dealing with a person or a machine.
So we conclude that the challenge is not to build a smart chatbot capable of passing the Turing test, but to build a smart chatbot that is human enough to hold a fluent conversation with a person, and in that conversation to react as a person, think like a person, and have the mental agility of
a person and, above all, to have and make the same mistakes a person makes, because, let's face it, there is no perfect human being.

4. References

[1] Alan Turing, "Computing Machinery and Intelligence", Mind, 1950.

[2] John McCarthy and Patrick J. Hayes, "Some philosophical problems from the standpoint of artificial intelligence", Computer Science Department, Stanford University, 1969.

[3] Michael L. Mauldin, "Chatterbots, tinymuds, and the Turing test: entering the Loebner prize competition", Carnegie Mellon University Center for Machine Translation, 1994.

[4] S. Harnad, "Minds, Machines and Searle", Journal of Theoretical and Experimental Artificial Intelligence, 1989.

[5] Amir Karniel, Ilana Nisky, Guy Avraham, Bat Chen Peles, and Shelly Levy-Tzedek, "A Turing-like Handshake Test for Motor Intelligence", EuroHaptics, 2010.

[6] html

[7] http://www.pensamientoscomputables.com/entrada/premio/loebner/2010/chatbot/alanturing

[8] html
3rd INTERNATIONAL SUPERCOMPUTING CONFERENCE IN MÉXICO 2012

An Interface for the Virtual Observatory of the University of Guanajuato

René A. Ortega-Minakata 1, Juan P. Torres-Papaqui 1, Heinz Andernach 1 and Hermenegildo Fernández-Santos 2

1 Departamento de Astronomía, Universidad de Guanajuato; 2 Maestría en Medios Interactivos, Universidad Tecnológica de la Mixteca. {rene, papaqui,

Abstract

We present the first attempts to build a user-friendly interface for the Virtual Observatory of the University of Guanajuato. The data tables will be accessible to the public through PHP scripts and SQL database managers, such as MySQL and PostgreSQL, all administered through phpmyadmin and pgmyadmin. Although it is not yet public, this interface will be the basis upon which the final front end for our VO will be built. Furthermore, we present a preliminary version of a web front end to the publicly available stellar population synthesis code STARLIGHT (starlight.ufsc.br), which will be made available with our VO. This front end aims to provide easy and flexible access to the code itself, letting users fit their own observed spectra with their preferred combination of physical and technical parameters, rather than making available only the results of fitting a specific sample of spectra with predefined parameters.

Keywords: High Performance Computing: applications; Scientific Visualization; Databases: Public Access

1. Introduction

Astronomy is nowadays a highly computational science, with astrophysical models being tested numerically with high performance computing, but also with large amounts of new observational and theoretical data constantly produced, analyzed, archived and made available to the public. To cope with this highly complex task, a global network of both astronomers and computer scientists has been formed as the so-called International Virtual Observatory Alliance (IVOA; www.ivoa.net).
In [1] we described the rationale behind the project of the Virtual Observatory of the University of Guanajuato (VOUG), its implementation strategies and data model. In this paper we present the first attempts to build a user-friendly interface for our VO, using PHP scripts and SQL database managers, as well as the phpmyadmin and pgmyadmin administrators. We also present a preliminary version of a web front end to the publicly available stellar population synthesis code STARLIGHT (starlight.ufsc.br), which will be made available with our VO. This astrophysical code fits the observed optical spectrum of a galaxy or stellar system with a set of spectra of simple
stellar populations (those with a single age and metallicity), assembled from both stellar evolutionary models and libraries of observed stellar spectra. As a result, STARLIGHT infers relevant stellar parameters from the galaxy's spectrum, such as the mass in stars, mean stellar age and mean stellar metallicity, as well as the star formation history of the galaxy. For a detailed description of the technique used by STARLIGHT, see [2].

2. A user-friendly interface for the Virtual Observatory of the University of Guanajuato

In its first implementation, our VO service will make public several databases containing diverse astrophysical information for a large sample of galaxies. The user will be able to query the collection of tables in which this information is stored with an SQL search. The tables will be organized in different databases according to their content, and column-by-column descriptions will be made available. A PHP script will allow the user to select a database to query and direct them to the query page. Figure 1 shows an example of the database selection page and the description page.

Figure 1. Example of the database selection page (top) and the column-by-column description page (bottom) of the Virtual Observatory service to be made available at the University of Guanajuato

All the tables and databases will be administered by either phpmyadmin or pgmyadmin. The user will write his/her SQL search in a text box, which in turn will be run by one of the above administrators within the selected database. Through a PHP script, the results of the query will be dumped to the screen on the service page, and the page will reload. A results file will also be made available to the user through a download button for a limited period of time (15 days), after which the file will be eliminated automatically using the crontab daemon. Figure 2 shows an example of how phpmyadmin will help administer databases and tables.
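The automatic 15-day cleanup described above (handled by a PHP script run from the crontab daemon) can be sketched as follows. This is an illustrative Python version, not the service's PHP code, and the directory path is a placeholder:

```python
import os
import time

RESULTS_DIR = "/var/www/vo/results"   # hypothetical location of result files
MAX_AGE_DAYS = 15                     # retention period stated in the text

def delete_old_results(results_dir=RESULTS_DIR, max_age_days=MAX_AGE_DAYS, now=None):
    """Remove result files older than max_age_days; meant to be run by cron."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for name in os.listdir(results_dir):
        path = os.path.join(results_dir, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed.append(name)
    return removed
```

A crontab entry running the real cleanup script once a day (for example at 03:00) would then be enough to enforce the 15-day limit.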
Table 1 lists the PHP scripts for the process described in
this section.

Table 1. PHP scripts for the VO database query service.

default.php: Home page for the VO database query service. Lets the user select the database to be queried.

search.php: Controls the type of query to be made: a general query (gives the user limited freedom, but no prior knowledge of SQL is required) or a specific query (the user introduces his own SQL query statement in a text box).

query.php: Manages the environment to show the query results.

specific.php: Executes the query, displays the results, and manages the download button that allows the user to get a results file.

download.php: Executes the download of the results file.

delete.php: Deletes old user-generated files after 15 days. This script is run automatically using the crontab daemon.

Figure 2. Example of administration of the different databases (top) and of the collection of tables within a single database (bottom) using phpmyadmin

3. A web front end to the STARLIGHT code

In a second phase, our VO will offer a front end to the astrophysical code STARLIGHT. As explained before, STARLIGHT is used to obtain meaningful information from a galaxy or any stellar system by fitting its observed optical spectrum with a set of spectra of simple stellar populations (those with a single age and metallicity), assembled from both stellar evolutionary models and libraries of observed stellar spectra. The information obtained as a result of this fit includes relevant stellar parameters such as the mass in stars,
mean stellar age and mean stellar metallicity, as well as the star formation history of the galaxy. So far, STARLIGHT is available as source code from the STARLIGHT group (starlight.ufsc.br), and the user needs to learn how to use the code and how it works in detail in order to fit their own set of spectra, one by one. The results of applying STARLIGHT to a large sample of galaxies are also available on the same web page (also as a VO service), but these are the results of fits with a predefined set of astrophysical and technical parameters for the whole sample. Our front end aims to provide easy and flexible access to the code itself, letting users fit their own observed spectra with their preferred combination of physical and technical parameters, without the need to learn how the source code works in detail and to run the code on each single spectrum. Through a series of PHP scripts, the user will decide the level of control of the input parameters he/she desires, depending on his/her knowledge of the involved technicalities (beginner or expert), and upload the files containing the spectra to be fitted. Figure 3 shows the welcome page of the STARLIGHT front end service and the description page for the different modes. On the welcome page, the user informs the service of the number of files (spectra) to be processed, and also chooses the mode (beginner or expert) in which the code will be run.

Figure 3. Welcome (top) and description page (bottom) of our STARLIGHT front end service

Once all spectra are uploaded and configurations specified (depending on the running mode), a PHP script will create a task
and add it to a job queue. The task list will be run automatically and periodically using the crontab daemon, and each element of the job queue will be deleted once it has been run. Upon completion of the proper STARLIGHT run, the task will send a confirmation of completion (or detection of errors) to the user. The user must then return to the web interface and retrieve the result files through a link created by another PHP script. Table 2 lists the PHP scripts for the operation of the STARLIGHT front end described in this section.

Table 2. PHP scripts for the STARLIGHT front end that will be available at our VO service

4. Conclusions

The availability of astrophysical data for large samples of objects makes it imperative to find solutions to handle this large amount of information. The standard established by the IVOA marks the direction in which all astronomical data services should move. The Virtual Observatory of the University of Guanajuato will contribute to this task by using a set of standard information technology tools, such as database managers (MySQL, PostgreSQL), administrators (phpmyadmin, pgmyadmin) and languages (PHP, SQL), to construct a VO service, according to the IVOA standard, that allows the creation of our own astrophysical databases (see [3] and [4] for examples). Furthermore, a user-friendly front end to the STARLIGHT code will help users employ this useful astrophysical code with their own set of technical and physical parameters without the need to learn the specifics of the source code. This flexible and easy access to an otherwise complex code will increase the potential information that can be obtained from using such a code, eventually allowing for an extrapolation of this implementation to similar astrophysical codes.

5. References

[1] J.P. Torres-Papaqui, R.A. Ortega-Minakata, J.M. Islas-Islas, I. Plauchu-Frayn, D.M. Neri-Larios, and R.
Coziol, "The Virtual Observatory at the University of Guanajuato: Identifying and Understanding the Physical Processes behind the Evolution and Environment of the Galaxies", More than Research, Volume 2, ISUM 2011 Conference Proceedings, 2011, Editor: Moisés Torres Martínez.
[2] R. Cid Fernandes, A. Mateus, L. Sodré, Jr., G. Stasińska, and J.M. Gomes, "Semi-empirical analysis of Sloan Digital Sky Survey galaxies - I. Spectral synthesis method", 2005, Monthly Notices of the Royal Astronomical Society, 358, 363.

[3] H. Andernach, E. Tago, M. Einasto, J. Einasto, and J. Jaaniste, "Redshifts and Distribution of ACO Clusters of Galaxies", 2005, Astronomical Society of the Pacific Conference Series, 329, 283 (see arxiv.org/abs/astro-ph/).

[4] H. Andernach, "Safeguarding old and new journal tables for the VO: Status for extragalactic and radio data", 2009, Data Science Journal, 8, 41.
Architectures
Speech Coding Using Significant Impulse Modeling and Recycling of Redundant Waveform

Ismael Rosas A., Juan C. García I., Juan C. Sánchez G.
IPN ESIME Culhuacán

Abstract

This paper proposes two waveform coding algorithms in the time domain. The first performs Significant Impulse Modeling (MIS), which acts as an endpoint detector and a downsampler, but using only the detection and selection of significant peaks and valleys. The second algorithm is the Recycler of Redundant Waveform (RFOR), which uses a fuzzy algorithm and a cumulative memory. The fuzzy algorithm has the function of determining the degree of similarity between redundant waveform blocks. Each pattern is accumulated in a knowledge base and, if it appears repeatedly in the incoming signal, it is used to replace the corresponding block with a code that indicates where the pattern belongs in memory.

Keywords: Signal processing, speech coding, fuzzy logic, vector quantization.

1. Introduction

The digitalization of voice signals sample by sample has given rise to conventional PCM coding, which is based on scalar quantization [1]. However, when a set of values of a waveform is quantized jointly as a single vector or entity, the process is known as vector quantization (VQ) [2]. This entity is encoded by a binary word, which identifies an approximation of the original vector. Each vector is encoded by comparison with a set of stored reference vectors, which are known as patterns [3]. Each pattern is used to represent the input vectors that are somehow identified as similar to that pattern [1]. The best set of patterns in the codebook, i.e. the set of reference patterns stored in memory, is selected by the encoding process in accordance with an adequate fidelity measure, and a binary word is used to identify each pattern in the codebook of patterns [2].
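As a generic illustration of the VQ encoding step just described (not the specific algorithm of this paper), selecting the codebook pattern that minimizes a squared-error fidelity measure can be written as follows; the codebook contents are invented for the example:

```python
def quantize(vector, codebook):
    """Return the index of the codebook pattern with minimum squared error."""
    def sq_error(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_error(vector, codebook[i]))

codebook = [
    [0.0, 0.0, 0.0, 0.0],    # pattern 0: a silence-like block
    [1.0, -1.0, 1.0, -1.0],  # pattern 1: an alternating block
]
print(quantize([0.9, -1.1, 0.8, -0.9], codebook))  # -> 1
```

The returned index is the "binary word" of the text: transmitting it instead of the block itself is what yields the compression, at the cost of the approximation error of the chosen pattern.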
The size of the codebook and the definition of its training population (for those that are updated during measurement) are two critical parameters that determine the efficiency of a VQ [1-5]. There are several models that reduce both storage and computational load, but these do not always match the vector patterns of the incoming signal, due to phase shift [1-8]. This paper proposes two algorithms that change the perspective of VQ: instead of adjusting the patterns to the input vectors [3-4], the input vectors themselves are recycled to exploit redundancy. These algorithms are the MIS, which prevents the phase shift and reduces the number of samples needed to model voice, and the RFOR, which
achieves recognition of non-trained patterns through the recycling of the signal itself. Figure 1 shows the general architecture, which is basically an adapted vector quantizer that replaces the training patterns with recycled ones, adding components that allow an adequate fidelity-measure comparison and efficient recycling.

2. General architecture

In the block diagram of figure 1, the voice signal is first fitted for the RFOR by the MIS algorithm. The RFOR quantizes the signal as a vector, which is compared with a pattern in memory. The differences between them are evaluated by a fuzzy system, and if the differences are too great for them to be considered similar, the RFOR calls another pattern from memory. This process is repeated until one of the following two cases occurs: a pattern in memory similar to the input vector is found, or no pattern satisfying the similarity conditions can be found. In the first case, the input vector is encoded by an index that identifies the pattern in memory, plus a proportion adjustment. In the second case, the coding is performed by the MIS (this process is illustrated in figure 2).

Figure 2. Block diagram of speech decoder

3. Significant Impulse Modeling

In order to use linear interpolation for the decoding, the signal is modeled based on its direction and strength properties, allowing pulses having the same direction and close strength to be omitted. This reduces the number of samples needed to reproduce a signal. Although the linear correlation gives good results when using these features to find the relationship between two signals, the modeling is not always equally benefited, particularly at high frequencies, since it adds components to the signal which therefore become noise. For a voice signal, however, it is sufficient, as shown in figure 3.

Figure 1.
Block diagram of speech coder

A binary word is needed to identify which coding algorithm was used, so that the decoder chooses either linear interpolation for the MIS or an adjusted vector magnitude σ in the case of the RFOR.
The C adjustment is made in terms of the signal-to-noise ratio, so that in conditions of low noise (about 16 dB) a reduction of approximately 50% in the samples needed and a correlation coefficient greater than 0.99 can be obtained. In this way the benefits of downsampling (in terms of reduced samples) are obtained without drawbacks in terms of voice quality (see figure 3). Background on modeling with reduced excitation sequences can be found in the RPE algorithm [9].

Figure 3. Comparison of the spectrum between linear modeling (below) and the original signal (above).

To achieve this modeling, IF-THEN rules are used in terms of direction and strength (magnitude); equations 1 and 2 show these rules respectively.

(1)

(2)

Where x_i is the i-th sample of the signal, S(x) is the signal direction and M is the direction model. Likewise, I is the vector formed by interpolation of the samples skipped by M, and V_x is the vector formed by the samples omitted by M.

4. Simulation results of the MIS

The constant C determines the degree to which the MIS reduces the number of samples needed.

Figure 4. Comparison between the original signal and its model.

Once the signal is modeled by the MIS, the number of significant changes in the signal is counted to determine whether a signal frame is voiced or unvoiced, with results similar to those of a zero-crossing detector.

5. Recycling of Redundant Waveform

To describe the RFOR in detail, let us assume that the frame size, in ms, is approximately the average
greatest elongation that a pattern can have [3, 10]. The RFOR pattern search is based on the premise of comparing the patterns with a frame of the incoming signal that is shifted by one sample at each iteration. However, this implies a large computational load because of the number of iterations. One solution is to increase the number of shifted samples at each iteration, but this would introduce pattern recognition problems. The proposed solution is to shift the frame of the signal at each significant permutation, so that the number of displaced samples varies according to the changes in the signal, where such changes are described by the MIS.

Figure 5. RFOR block diagram.

Having defined the above conditions, the patterns are compared in terms of their direction and strength, where the direction has previously been defined by the MIS and the difference in magnitude is defined by equation (3). However, before any comparison it is necessary to standardize their scales according to equation (4):

e = (1/n) Σ_{i=1..n} (x_i − y_i)²    (3)

x'_k = (x_k − min(x)) / (max(x) − min(x))    (4)

Figure 6. MF for Direction.

Figure 7. MF for difference in magnitude.

Both comparisons are evaluated by a fuzzy Mamdani algorithm. The membership functions (MF) are sigmoidal (as shown in figures 6 and 7) and determine the degree to which each part of the antecedent satisfies each rule; since there are two rules (direction and difference), the fuzzy AND operator is used to unify them. Once the antecedent is reduced to a single number, the consequent is defined by aggregation of the two rules, pattern similar or pattern different, each with its respective MF. Finally, for the defuzzification, the centroid method is used [11-13].
As mentioned above, the output of the fuzzy system is subject to a likeness condition (given by equation 5) to allow a more subjective comparison of the waveforms.

(5)

Where SC is the coded signal, SD is the output of the fuzzy system, P is the size of the displacement and L is the length of the pattern.

6. Simulation results of the RFOR

In a 30-second voice signal containing words considered in Spanish phonetics as voiced consonants [15], the RFOR achieved the recycling of many patterns in speech frames, between 20% and 70%, depending on whether the frame is mostly voiced or mostly unvoiced, as well as on the size of the knowledge base. Figure 8 shows the last syllable of the Spanish word alrededor, of which 90 ms were recycled in 8 patterns (shown in figure 8 by the minimum segments of the black line). This saves about 50% of the voice frame. In addition, the MIS reduction needs only 22.6 ms of the original frame for modeling the speech, which implies a considerable total compression rate.

Figure 8. Comparison between the original speech and the signal reconstructed by RFOR and MIS.

As seen in figure 9, the correlation between the recycled patterns and the original signal is not perfect, but more important than a perfect correlation is that the significant changes are in phase. This means that it is possible to reproduce the majority of the frequency components. The fuzzy comparator achieves this purpose in most cases.

Figure 9. Comparison between the original signal and the 8 recycled patterns.
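The scale standardization of equation (4) and a sigmoidal membership function of the kind shown in figures 6 and 7 can be sketched generically in Python; the slope and center parameters below are illustrative assumptions, not the paper's values:

```python
import math

def minmax_normalize(x):
    """Standardize a block's scale to [0, 1] before comparison (equation 4)."""
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x]

def sigmoid_mf(d, slope=10.0, center=0.5):
    """Sigmoidal membership: degree to which a magnitude difference d is 'small'."""
    return 1.0 / (1.0 + math.exp(slope * (d - center)))

print(minmax_normalize([2.0, 4.0, 6.0]))  # -> [0.0, 0.5, 1.0]
```

Standardizing both the pattern and the signal frame to the same [0, 1] range is what allows the fuzzy comparator to judge shape similarity independently of the overall amplitude of each block.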
7. Conclusions

The algorithms described above were simulated with a speech signal input to the architecture with a sampling frequency of 8 kHz and 8 bits of resolution. Under these conditions, compression rates starting at 0.2 are achieved for the MIS, with additional compression for the RFOR; when the two are combined, a compression ratio of up to 0.04 is achieved. The decoded voice quality can be improved with a 3 kHz low-pass filter applied to the reconstructed signal. An acceptable quality is achieved at a compression rate of 0.15, according to an MOS (Mean Opinion Score) test on a population of 20 people. In this way, this architecture suggests that defining the input vector through a window that is displaced according to the changes in the voice signal largely avoids discarding similar waveforms as patterns because of a dephasing of the window, which would otherwise require more complex detection or an algorithm that forces detection.

8. References

[1] A. Gersho, V. Cuperman, "Vector Quantization: A pattern-matching Technique for Speech Coding", IEEE Communications Magazine Vol. 21, IEEE Communications Society, 1983.

[2] J. Makhoul, S. Roucos, H. Gish, "Vector quantization in speech coding", Proceedings of the IEEE Vol. 73, IEEE, 1985.

[3] A. M. Kondoz, Digital Speech Coding for Low Bit Rate Communication Systems, John Wiley & Sons, UK.

[4] A. Vasuki, P. Vanathi, "A review of vector quantization techniques", IEEE Potentials Vol. 25, IEEE, 2006.

[5] N.B. Karayiannis, "A methodology for constructing fuzzy algorithms for learning vector quantization", IEEE Transactions on Neural Networks Vol. 8, IEEE Computational Intelligence Society, 1997.

[6] C.E. Pedreira, "Learning vector quantization with training data selections", IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 28, IEEE Computer Society, 2006.

[7] C.W. Tsai, C.Y. Lee, M.C. Chiang, and C.S.
Yang, "A fast VQ codebook generation algorithm via pattern reduction", Pattern Recognition Letters Vol. 30, Elsevier B.V., 2009.

[8] G.E. Tsekouras, D. Darzentas, I. Drakoulaki, A.D. Niros, "Fast fuzzy vector quantization", IEEE International Conference on Fuzzy Systems, IEEE, Barcelona, 2010.

[9] P. Kroon, E. Deprettere, R. Sluyter (AT&T Bell Laboratories, Murray Hill, NJ), "Regular Pulse Excitation: A novel approach to effective and efficient multipulse coding of speech", IEEE Transactions on Acoustics, Speech and Signal Processing Vol. 34, IEEE Signal Processing Society, 1986.

[10] I. McLoughlin, Applied Speech and Audio Processing with Matlab Examples, Cambridge University Press, UK.

[11] E. Mamdani, "Applications of Fuzzy Algorithms for Control of Simple Dynamic Plant", Proc. IEE Vol. 121, 1974.

[12] L. Zadeh, "Fuzzy Sets", Information and Control, Vol. 8, 1965.

[13] T. Takagi, M. Sugeno, "Fuzzy Identification
Where Supercomputing Science and Technologies Meet

of Systems and its Applications to Modelling and Control, IEEE Transactions on Systems, Man, and Cybernetics, Vol. 15, 1985, pp.
[14] K. M. Passino, Fuzzy Control, Addison-Wesley, USA.
[15] J. I. Hualde, The Sounds of Spanish, Cambridge University Press, UK.
Performance Study of Cellular Genetic Algorithms Implemented on GPU

Javier Arellano-Verdejo, Ricardo Barrón Fernández, Salvador Godoy-Calderón, Edgar A. García-Martínez
Laboratorio de Inteligencia Artificial, Centro de Investigación en Computación (CIC), Instituto Politécnico Nacional (IPN), México

Abstract

Cellular Genetic Algorithms (cGAs) are a special kind of Evolutionary Algorithm (EA) that perform heuristic search in the style of Genetic Algorithms but over highly structured populations. Their distinctive feature is that they organize the search space with a neighborhood structure that limits and governs the evolutionary relations among individuals. Each evolving individual can interact only with its designated neighbors, thus forming a connected graph that shapes the development of the reproductive and mutation processes. The intrinsically parallel structure of cGAs can be exploited and studied further using Graphics Processing Units (GPUs); this paper presents the implementation techniques as well as the conclusions that emerge from that performance study.

Keywords: Cellular Genetic Algorithms, Evolutionary Algorithms, GPU, CUDA.

1. Introduction

The current need to solve increasingly complex optimization problems has promoted intense research on a wide variety of solution methods, including direct, heuristic and evolutionary ones. Among those methods, Cellular Genetic Algorithms (cGAs) constitute a special case that deserves greater attention [1][2]. cGAs are a special kind of Evolutionary Algorithm (EA) that perform heuristic search in the style of Genetic Algorithms (GA) but over highly structured populations.
Their distinctive feature is that they organize the search space with a neighborhood structure that limits and governs the evolutionary relations among individuals. Each evolving individual can interact only with its designated neighbors, thus forming a connected graph that shapes the development of the reproductive and mutation processes (see Figure 1).

1 Authors wish to express their gratitude to IPN, SIP-IPN, CONACyT and SNI, from Mexico, for their financial support of this research, particularly through grants SIP and SIP
The partial isolation of each individual in the population produces a slow diffusion rate for candidate solutions, which in turn increases the exploration capability of the algorithm [1]. To study the effect of the neighborhood topology on the overall performance of the algorithm, several different types of neighborhood are used, with Linear5, Compact9 (also known as a Moore neighborhood) and Compact13 being the most popular ones (see Figure 2) [5]. Aside from the type of neighborhood selected, all cGAs deploy the evolving individuals over a standard grid, usually with a toroidal interconnection topology. Each individual is eventually replaced by the best-fitted offspring produced by its designated neighbors. Figure 3 shows high-level pseudocode for a cGA. With the aid of any selection method (binary tournament, roulette, etc.), two parents are selected from the neighborhood of each individual in the grid. Recombination of those parents can be done using a standard crossover technique followed by mutation [2], and the final replacement is generally done with the best-fitted offspring [5].

Figure 1. A cGA grid highlighting a central individual and its neighbors.

(a) Linear 5 (b) Compact 9 (c) Compact 13
Figure 2. Different types of neighborhood used in cGAs.
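On a toroidal grid the neighborhoods above reduce to simple modular index arithmetic. The sketch below (illustrative helper names, not code from the paper; the Compact13 shape assumes the common definition of the 3x3 Moore block plus the four axial cells at distance two) builds each neighborhood with wrap-around borders:

```python
def linear5(i, j, rows, cols):
    """Linear5 neighborhood of cell (i, j): the cell plus its
    north, south, west and east neighbors, with toroidal wrap-around."""
    return [(i, j),
            ((i - 1) % rows, j), ((i + 1) % rows, j),
            (i, (j - 1) % cols), (i, (j + 1) % cols)]

def compact9(i, j, rows, cols):
    """Compact9 (Moore) neighborhood: the 3x3 block centered on (i, j)."""
    return [((i + di) % rows, (j + dj) % cols)
            for di in (-1, 0, 1) for dj in (-1, 0, 1)]

def compact13(i, j, rows, cols):
    """Compact13: the Moore block plus the four axial cells at distance two
    (one common definition of this neighborhood; stated here as an assumption)."""
    return compact9(i, j, rows, cols) + [
        ((i - 2) % rows, j), ((i + 2) % rows, j),
        (i, (j - 2) % cols), (i, (j + 2) % cols)]
```

With a toroidal topology every cell has a full-size neighborhood, so no border cases are needed.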
The intrinsically parallel structure of cGAs can be studied further. This paper presents a performance study of three different models of cGA implementation on GPU, tested with four numerical optimization problems. When evolving synchronously, offspring are generated in parallel for all cells in the grid and, most importantly, the replacement of all individuals is done at the same time.

Initialization:
  parallel for each cell i in the grid do
    Generate a random individual i
  end parallel for each
Evolution:
  while not termination condition do
    parallel for each cell i do
      Evaluate individual i
      Select parents from the neighborhood
      Produce offspring of selected parents
      Evaluate offspring and select best
      Replace individual i with best offspring
    end parallel for each
  end while

Figure 3. General pseudocode for a cellular grid.

2. Parallel implementation on GPU

As with any other parallel program, the underlying architecture of the supporting platform turns out to be of the utmost importance [3][4]. GPUs represent a low-cost, high-performance option for the solution of scientific problems with a high computational cost. Although widely used nowadays for scientific computing, GPUs lack standardization, so the specific model for implementing cGAs on GPU must be carefully selected [6]. Three different models for GPU implementation were tested: a sequential non-parallel version, a master-slave version and a completely parallel synchronous version [8][9].

Figure 4. Serial model for implementing a cGA on GPU.

Figure 5. Master-slave model for implementing a cGA on GPU.
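The synchronous loop of Figure 3 can be sketched in plain Python as a CPU model of the parallel-for semantics: all offspring are computed from the old grid, and every replacement is applied at the same time. Function names, the Moore neighborhood choice, the bitstring encoding and the parameter values are illustrative assumptions, not the paper's code:

```python
import random

def moore(i, j, rows, cols):
    """Compact9 (Moore) neighborhood on a toroidal grid."""
    return [((i + di) % rows, (j + dj) % cols)
            for di in (-1, 0, 1) for dj in (-1, 0, 1)]

def tournament(pop, fit, cells):
    """Binary tournament selection among the given neighborhood cells."""
    (ai, aj), (bi, bj) = random.sample(cells, 2)
    a, b = pop[ai][aj], pop[bi][bj]
    return a if fit(a) >= fit(b) else b

def cga_step(pop, fit, pm=0.05):
    """One synchronous generation: every cell breeds from its own neighborhood,
    and all replacements happen at once, as in Figure 3."""
    rows, cols = len(pop), len(pop[0])
    new_pop = [row[:] for row in pop]
    for i in range(rows):
        for j in range(cols):
            cells = moore(i, j, rows, cols)
            p1 = tournament(pop, fit, cells)
            p2 = tournament(pop, fit, cells)
            cut = random.randrange(1, len(p1))              # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [g ^ 1 if random.random() < pm else g
                     for g in child]                        # bit-flip mutation
            if fit(child) >= fit(pop[i][j]):                # keep the best-fitted
                new_pop[i][j] = child
    return new_pop
```

Because replacement is per-cell elitist, no cell's fitness ever decreases. In a fully parallel GPU version, the two nested loops would correspond to one thread per cell.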
Figure 6. Fully parallel synchronous model for implementing a cGA on GPU.

Figures 4, 5 and 6 show the pseudocode of each implementation model. All programs were targeted at an Nvidia GeForce GTX 480 GPU, which has compute capability 2.0 and a maximum of 1024 threads per block. All programs were coded using version 4.1 of CUDA.

3. Experimental results

Three metrics were selected for evaluating the performance of each parallel program. The traditional computational throughput index (V) is calculated as the reciprocal of the total processing time in seconds (T) of the parallel program, V = 1/T. The speedup index (S) measures how many times the tested program runs faster than the equivalent sequential implementation; it is calculated as the ratio between both throughput indexes, S = V / V_reference. Finally, the efficiency index (E) of a parallel program is calculated as the quotient between its speedup and the total number of cores used, E = S/n. All three indexes were measured for each run [10]. Since the sequential implementation sets the reference yield, its global speedup as well as its efficiency are always 1. Tables 3, 4 and 5 show the average performance indexes for all implementations. The graphics showing total execution time, computational throughput, speedup and computational efficiency can be seen in Figure 7. Each implementation model was tested 30 times, with 15,000 generations and 4 test functions. All experimental parameters are listed in Table 1.

Table 1. Experimental parameters for the implemented cGAs.
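The three indexes reduce to a few lines. The helper below (an illustrative sketch, not code from the paper) computes V = 1/T, S = V / V_reference and E = S/n from the measured run times:

```python
def performance_indexes(t_parallel, t_reference, n_cores):
    """Throughput V = 1/T, speedup S = V / V_reference (equivalently
    T_reference / T_parallel), and efficiency E = S / n for n cores."""
    v = 1.0 / t_parallel
    s = v * t_reference          # same as t_reference / t_parallel
    e = s / n_cores
    return v, s, e
```

For the sequential reference run itself, t_parallel equals t_reference and n is 1, so S = E = 1, matching the text above.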
Table 2. Test functions used.

Table 3. Average results for the Sequential Implementation
Table 4. Average results for the Master-Slave model

Table 5. Average results for the Fully Parallel model

Figure 7. Time, throughput, speedup and efficiency graphics for all models
4. Conclusions

Cellular Genetic Algorithms (cGAs) combine the advantages of segmenting the search space and of structuring the neighborhood of every individual. As a result, cGAs allow an ordered and homogeneous search. Moreover, the intrinsically parallel nature of cGAs shows enormous potential for efficient parallel implementations. In this paper, three implementation models of a cGA were built on an Nvidia GeForce GTX 480 GPU handling a maximum of 1024 threads per block. All three models were run 30 times over a set of four test functions. For performance evaluation three indexes were used: computational throughput, speedup, and parallel efficiency. Experimental results show the overwhelming superiority of the fully parallel implementation compared to the master-slave model. In turn, the master-slave model does not differ much from the sequential implementation. Presumably this effect is caused by the communication overhead induced by the constant data transfer between the host (CPU) and the device (GPU).

5. References

[1] Enrique Alba and B. Dorronsoro. The exploration/exploitation tradeoff in dynamic cellular evolutionary algorithms. IEEE Transactions on Evolutionary Computation, 9(2), April.
[2] Enrique Alba. Parallel evolutionary algorithms can achieve superlinear performance. Information Processing Letters, 82(1):7-13.
[3] Gerardo A. Laguna Sánchez, Mauricio Olguín Carbajal, Ricardo Barrón Fernández. Introducción a la programación de códigos paralelos con CUDA y su ejecución en un GPU multihilo. Journal of Applied Research and Technology, vol. 7, No. 3.
[4] Randima Fernando. GPU Gems: real-time graphics programming. NVIDIA.
[5] Marco Tomassini. Spatially Structured Evolutionary Algorithms: Artificial Evolution in Space and Time. Natural Computing Series. Springer-Verlag.
[6] E. Alba and J. M. Troya. Cellular evolutionary algorithms: Evaluating the influence of ratio. In M. Schoenauer, K.
Deb, G. Rudolph, X. Yao, E. Lutton, J. J. Merelo, and H.-P. Schwefel, editors, Parallel Problem Solving from Nature, PPSN VI, volume 1917 of Lecture Notes in Computer Science, pp. Springer.
[7] D. Andre and J. R. Koza. A parallel implementation of genetic programming that achieves superlinear performance. Journal of Information Sciences, 106(3-4).
[8] T. Bäck. Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms. Oxford University Press.
[9] H. Bersini and V. Detour. Asynchrony induces stability in cellular automata based models. In R. A. Brooks and P. Maes, editors, Artificial Life IV, pp. MIT Press, Cambridge.
[10] E. Cantú-Paz and D. E. Goldberg. Predicting speedups of idealized bounding cases of parallel genetic algorithms. In T. Bäck, editor, Proceedings of the Seventh International Conference on Genetic Algorithms, pp. Morgan Kaufmann.
Grids
Online Scheduling of Multiple Workflows on a Homogeneous Computational Grid

Adán Hirales Carbajal 1, Andrei Tchernykh 1, Alfredo Cristóbal Salas 2
1 Computer Science Department, CICESE Research Center, Ensenada, BC, México, 2 Universidad Veracruzana, Poza Rica de Hidalgo, Veracruz, México
{ahirales,

Abstract

In this paper, we present an experimental study of online non-preemptive multiple parallel workflow scheduling strategies on a homogeneous computational Grid. We analyze scheduling strategies that consist of two stages: adaptive job allocation and parallel machine scheduling. We apply these strategies in the context of executing the Cybershake, Epigenomics, Genome, Inspiral, LIGO, Montage, and SIPHT workflow applications. Their performance is evaluated based on a joint analysis of three metrics: competitive factor, mean waiting time, and critical path slowdown.

Keywords: Grid Computing, Workflow Scheduling, Online Scheduling

1. Introduction

The problem of scheduling jobs with precedence constraints is an important problem in scheduling theory and has been shown to be NP-hard [1]. It arises in many industrial and scientific applications. In this work, we focus on scheduling such jobs in Grid computing environments. A number of workflow scheduling algorithms for single parallel computers have been studied in the literature [1, 2, 3, 4, 5, 6]. Most of them are designed to schedule only a single Directed Acyclic Graph (DAG) at a time, and assume precise knowledge of machine and job characteristics. These assumptions keep such algorithms far from practical application in real production systems. Only a few studies have addressed scheduling of multiple DAGs. In [7], the authors discussed clustering DAG tasks into chains and allocating them to single machine resources. In [8], a fairness-based approach for scheduling multiple DAGs on a heterogeneous multiprocessor is proposed.
In [9], the authors focus on developing strategies for scheduling Parallel Task Graphs (PTG), providing fairness and makespan minimization. Similarly, the authors assume a deterministic scheduling model. Many Grid scheduling problems are inherently dynamic and online, since job characteristics and arrival information are unknown. Very few online DAG scheduling studies have been conducted. In [10], online scheduling of multiple DAGs is addressed; the authors proposed two strategies based on aggregating DAGs into a single DAG. In [11], an Online Workflow Management (OWM) strategy for scheduling multiple mixed-parallel workflows is proposed. OWM includes three stages: workflow scheduling, task scheduling,
and resource allocation. In their approach, DAGs are labelled independently, but scheduling is conducted on a single priority queue at the Grid layer. The authors model job arrival times as a Homogeneous Poisson Process (HPP) with a constant rate parameter λ (job arrival rate). However, Non-Homogeneous Poisson Processes (NHPP) have been shown to describe numerous arrival time phenomena better than their HPP counterpart, as they model the time-dependent arrival rates λ(t) present in many real production environments. There are three main drawbacks of these approaches: the premise of knowledge of the exact job execution times, the use of a single optimization criterion, and the HPP. In this paper, we propose multiple DAG scheduling strategies that do not use exact runtime information, make a joint analysis of three metrics, and use an NHPP release time model. Since the Grid context is multi-objective in its nature, we use multi-criteria decision support that addresses both user and system related goals. We extend the results presented in [12] by evaluating workflow scheduling in the online context. As there are as yet no established online workflow traces, we consider the Cybershake, Epigenomics, Genome, Inspiral, LIGO, Montage, and SIPHT workflow applications and model their release times by an NHPP. We consider a non-clairvoyant execution, where the scheduler has no knowledge of the real execution length of the tasks in the workflow. User time estimates are used. We evaluate the workflow scheduling strategy named MWGS2online (Online Multiple Workflow Grid Scheduling with 2 stages). The stages of the strategy consist of adaptive allocation (WGS_Alloc) and parallel machine scheduling (PS): MWGS2online = WGS_Alloc + PS. The rest of this paper is divided into five sections. Section 2 formally presents our Grid scheduling model. Section 3 discusses related work. We introduce workflow scheduling algorithms and classify them in Section 4.
Experimental setup, performance analysis methodology, and experimental results are presented in Section 5. Conclusions are presented in Section 6.

2. The scheduling model

First, we address an online non-preemptive, non-clairvoyant multiple parallel workflow scheduling problem on a homogeneous computational Grid, where n workflow jobs J_1, J_2, ..., J_n must be scheduled on parallel identical machines (sites) N_1, N_2, ..., N_m. Let m_i be the size of machine N_i (its number of identical processors), and m_{1,m} be the total number of processors in the Grid. Assume without loss of generality that the machines are arranged in non-descending order of their sizes, m_1 ≤ m_2 ≤ ... ≤ m_m. Jobs are scheduled on a job-by-job basis; no rescheduling is allowed. A workflow is a composition of tasks subject to precedence constraints. Workflows are modeled by a directed acyclic graph G_j = (V_j, E_j), where V_j is the set of tasks and E_j = {(T_u, T_v) | T_u, T_v ∈ V_j, u ≠ v} is the set of edges between tasks in V_j, with no cycles. Each edge (T_u, T_v) ∈ E_j represents the precedence constraint between tasks T_u and T_v, such that T_u must be completed before the execution of T_v is initiated. Each workflow job J_j is described by the tuple (G_j, size_j, p_j, p_j^G, p̂_j, p̂_j^G, cpn_j, r_j), with G_j = (V_j, E_j); its size_j, which is referred to as the workflow
processor requirement or maximum degree of parallelism; critical path execution time p_j; total workflow execution time p_j^G; user run time estimate p̂_j; workflow run time estimate p̂_j^G; the number of tasks in the critical path cpn_j; and the job release time r_j. The graph size (width) is the cardinality of the largest set of nodes in G_j that are disjoint (there is no path connecting any pair of them). Such a set represents nodes that might be executed simultaneously. Job arrival times are based on a non-homogeneous Poisson model. The Poisson distribution is commonly used to model the number of arrivals. Each workflow task T_k ∈ J_j is a sequential application and is described by a tuple with its release date r_k, execution time, and user run time estimate. Due to the online scheduling model, the release date of input tasks is set to r_j; however, the release date r_k of a task is not available before the task is released. Tasks are released over time according to the precedence constraints. A task can start its execution only after all its dependencies have been satisfied. Note that input tasks have no precedence constraints. At its release date, a task must be immediately and irrevocably allocated to a single machine. However, we do not demand that a specific processor be immediately assigned to a task at its release date; that is, the processor allocation of a task can be delayed. Tasks are scheduled on a task-by-task basis. We use g(T_k) = N_i to denote that task T_k is allocated to machine N_i, and n_i to denote the number of tasks allocated to site i. A machine must execute a task by allocating a processor to it for an uninterrupted period of time. The total workflow processing time p_j^G and the critical path execution cost p_j are unknown until the job has completed its execution. They represent the time requirements of all tasks of the job, and the time requirements of the tasks that belong to the critical path.
We allow multi-site workflow execution; hence, tasks of a job J_j can run on different sites. We restrict ourselves to jobs that tolerate latency, since sites may not be at the same geographical location. We also assume that the resources are stable and dedicated to the Grid. Workflow execution is evaluated by the following criteria: approximation factor, critical path waiting time, and critical path slowdown. Let c_j and c_k be the completion times of job J_j and task T_k, respectively. The approximation factor of a strategy is defined as ρ = C_max / C*_max, where C_max is the makespan of a schedule and C*_max is the optimal makespan. The waiting time of a task is the difference between the completion time of the task, its execution time, and its release date. The waiting time cpw_j of a critical path is the difference between the completion time of the job and the length of its critical path. It takes into account the waiting times of all tasks in the critical path. The critical path slowdown cps_j = 1 + cpw_j / p_j is the relative critical path waiting time and evaluates the quality of the critical path execution. A slowdown of one indicates zero waiting times for critical path tasks, while a value greater than one indicates that the critical path completion time is increased by the waiting times of critical path tasks. The approximation factor is used to qualify the efficiency of the scheduling algorithms. We use the approximation factor over the makespan criterion. Both metrics differ in constants
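The three criteria follow directly from the definitions above. The sketch below is illustrative (not the simulator's code); it assumes, as stated in the text, that the critical path waiting time is the job completion time measured from its release date minus the critical path length:

```python
def approximation_factor(c_max, c_max_opt):
    """rho = C_max / C*_max (C*_max may be a lower bound, as used later in the paper)."""
    return c_max / c_max_opt

def task_waiting_time(c_k, p_k, r_k):
    """Task completion time minus its execution time minus its release date."""
    return c_k - p_k - r_k

def critical_path_slowdown(c_j, r_j, p_j):
    """cps_j = 1 + cpw_j / p_j, with the critical path waiting time cpw_j taken
    as the job completion time (from its release date r_j) minus the critical
    path length p_j. A value of 1 means zero waiting on the critical path."""
    cpw = (c_j - r_j) - p_j
    return 1.0 + cpw / p_j
```

A job that finishes exactly at r_j + p_j thus has a slowdown of exactly one, matching the interpretation given in the text.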
regardless of the scheduler being used. In addition, to estimate the quality of each workflow scheduling execution, two workflow metrics are used. They are commonly used to express the objectives of different stakeholders of Grid scheduling (end-users, local resource providers, and Grid administrators).

3. Related work

In this section, we briefly review single and multiple workflow scheduling problems. Strategies are categorized in terms of the scheduling problem that is addressed and the characteristics of the proposed scheduling strategy (Table I). Two classification criteria are used for the reviewed work. First, scheduling problems are characterized using the three-field notation (α, β, γ) introduced by Graham et al. [13]. This notation describes the machine environment (α), job characteristics (β), and objective function (γ). Subsequently, scheduling strategies are categorized as list scheduling based (list), clustering based (cluster), task duplication based (duplication), or guided search based (meta), as proposed by Topcuoglu et al. in [14]. Most works evaluate performance on parallel machines P_m or heterogeneous computational Grids Q_m. In heterogeneous machines, processor speeds differ, but the speed of each processor is usually assumed constant and does not depend on the task being executed. The number of processors in a parallel machine is often considered constant. Three types of Grids are distinguished in the literature: computational Grids G_c Q_m, data Grids G_d Q_m, and service Grids G_s Q_m. Computational Grids are composed of parallel machines that differ in size and processing speed. Data Grids include the characteristics of computational Grids, but machines are bounded by memory and storage constraints. Service Grids, in contrast, provide processing/storage services accessed through well-defined interfaces.
Service providers may offer QoS guarantees agreed between stakeholders, defined in contracts often referred to as Service Level Agreements (SLA). Most studies focus on developing heuristics that find a solution with regard to a single optimization criterion, e.g., C_max. Multi-objective problems are more complex, since they involve the arbitration of two or more conflicting objectives, for instance, minimization of the schedule length C_max under budget constraints w_j. Metaheuristics are used to find an optimal solution in a discrete solution space. Popular scheduling metaheuristics include genetic algorithms, ant colony optimization, and simulated annealing. Newer studies are beginning to use Pareto optimality to address multi-objective scheduling problems.
Table I. Scheduling problem characteristics

Machine admissibility (Adm.) is used to identify a candidate set of resources that can satisfy a given job's resource requirements. Strategies that include admissibility tend to reduce the amount of work needed to make scheduling or allocation decisions, by selecting a subset of admissible resources and confining subsequent decision making to that subset. The concept of admissibility has been addressed from different perspectives (see, for instance, [27, 28]). In computational Grids and multiprocessor settings, resources are often evaluated in terms of a single constraint, e.g., requested machine size, requested memory, etc. However, jobs in real production systems entail multiple resource constraints; thus the problem of matching job requirements to available resources becomes intrinsically more complex. Contemporary studies in the field of service Grids have proposed SLA specifications and management strategies to address this problem. We do not elaborate on the subject, since the field of study is broad and outside the scope of this work. Table II lists several workflow scheduling strategies proposed during the last decade. DAG scheduling strategies may include task labelling (Label), clustering (Cluster), and allocation (Alloc) of independent or clustered tasks at the Grid layer. At the machine level, local queue ordering (Queue) and scheduling policies (Parallel Scheduling, PS) may be applied. The Workflow Grid Scheduling (WGS) stages can be regarded as WGS = Label + Cluster + Alloc + Queue + PS. This scheduling model is not inflexible, as some stages can be re-executed, enabling adaptability under the conditions of uncertainty that may arise in the execution context. In the first stage, workflow tasks are labelled using user-provided information or structural characteristics obtained from analysing the DAG representation of the workflow.
Often, labels of one DAG are independent of those of other DAGs. They may be computed once and remain
static through the execution of the workflow, or updated upon the occurrence of an event. Labelling information has been used to describe properties of DAGs such as task level and critical path length. Labels have also been used to cluster tasks. The Upward Rank (UR) and Downward Rank (DR) are amongst the most common labelling strategies. In the second stage, a list of tasks that are ready to be submitted is maintained at the Grid layer. Independent tasks, with no predecessors or whose predecessors have completed execution, are entered into the list. Upon completion of a task, its immediate successors may become available and are therefore inserted into the list. If clustering is performed, clustered tasks are inserted into the list. An allocation strategy is used to select a suitable resource for each task in the list. Task allocation may be immediate or delayed until a set of tasks is buffered in the list. Well-known allocation strategies include Earliest Finishing Time (EFT), Absolute Earliest Finishing Time (AEFT), Earliest Start Time (EST), and Earliest Completion Time (ECT). An allocation strategy based on any of these criteria aims to select the resource that minimizes the criterion. Often, knowledge of local machine schedules is required. Once a resource is selected, the task is stored in its local job queue.

Table II. Workflow scheduling strategies characteristics

In the third stage, task labels may be used to arrange local queues in increasing or decreasing order of task labels. Labels are therefore interpreted as priorities. This feature is seldom used, as machine administrative domains are independent and prioritization policies are often set locally. In the final stage, each parallel machine scheduler assigns tasks from its local queue to a candidate processor via a reservation mechanism, an insertion-based criterion, or a backfilling scheduling policy.
As tasks complete execution, the parallel machine scheduler signals the Grid broker so that the
scheduling process continues. Many heuristics have been designed under the assumption that information concerning the state of machines, their local schedules, and the jobs they hold is known. Unfortunately, due to security issues such a scenario is very unlikely, since Grid brokers often see local machines as black-box resource providers. Information monitoring systems provide a way to acquire machine information without breaching security constraints. Several specifications are underway; the two most well known are GLUE and CIM.

4. The workflow scheduling strategies

In this section, we present details of the MWGS2online = WGS_Alloc + PS workflow scheduling strategy. The allocation strategies WGS_Alloc and the site scheduling algorithms PS are discussed. In the WGS_Alloc stage, the list of tasks that are ready to be started is maintained. Tasks from the list are allocated to suitable resources using a given optimization criterion. In the PS stage, a PS algorithm is applied, for each parallel machine independently, to the jobs that were allocated during the WGS_Alloc stage. Note that we study workflow scheduling strategies that are based only on allocation policies designed for scheduling independent tasks. We distinguish strategies depending on the type and amount of information they require. In order to provide a performance comparison, we use workloads from a parametric workload generator that produces workflows resembling those of real workflow applications such as Cybershake, Epigenomics, Genome, Inspiral, LIGO, Montage, and SIPHT [29].

4.1. Task Allocation (WGS_Alloc)

The list of tasks that are ready to be started is maintained. Independent tasks with no predecessors and tasks whose predecessors have completed their execution are entered into the list. Upon completion of a workflow task, its immediate successors may become available and are entered into the list.
Allocation policies are responsible for selecting a suitable site for task allocation. We use the adaptive task allocation strategies presented in [27, 30] as independent task allocation policies (Table III). We distinguish allocation strategies depending on the amount of information they require. Four levels of available information are considered.

Level 1: Information about the status of available resources is available.
Level 2: Once a task has been submitted, the site with the least load per processor is known.
Level 3: All information of level 2 and the job run time estimates are available.
Level 4: All information of level 3, all local schedules, and site status are available.

Level 1-4 information may be provided via a Grid information service. The number and the sizes of the parallel machines are known. In [12], it was found that schedulers with minimal information requirements in their job allocation phase provide good performance results when user run time estimates are used instead of exact execution times. The strategies with the minimum observed degradation were MaxAR, MLB, and MPL. Conversely, allocation of jobs based on user run time estimates and information on local schedules produced results with high percentages of degradation. In this work, we evaluate the level 1-3 allocation policies
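A level-2 style policy of the kind described above can be sketched as follows. This is an illustrative "least load per processor" rule over the ready list, not the exact MLB or MaxAR definitions of [12]; the function and variable names are assumptions:

```python
def allocate_ready_tasks(ready, site_sizes, site_loads):
    """Allocate each ready task to the site with the least load per processor
    (a level-2 style policy; illustrative, not the paper's exact rules).
    `ready` is a list of (task, work) pairs; `site_sizes` maps site -> number
    of processors; `site_loads` maps site -> pending work, updated in place."""
    placement = {}
    for task, work in ready:
        site = min(site_sizes, key=lambda s: site_loads[s] / site_sizes[s])
        placement[task] = site
        site_loads[site] += work
    return placement
```

Because the load map is updated as tasks are placed, consecutive ready tasks spread across sites instead of all landing on the initially least-loaded one.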
proposed in [12], and MST, as it also produced small degradation results. In addition to the strategies listed in Table III, we evaluate the Round Robin (RRobin) strategy, as it is found in real Grid systems [31].

4.2. Job arrival rate

In order to understand the job arrival rates present in real production environments, we analyzed the RICC and SHARCNET rigid job workloads available in [32].

Table III. Task allocation strategies

The logs correspond to heavily utilized systems. An average job arrival rate for weekdays (T = 86400 sec.) of λ = 3300 was found. Only jobs with completed status were considered. We use the rate function λ(t) = {35, 50, 40, 50, 80, 100, 80, 60, 50, 40, 30}, simulating low-rate and high-rate job arrivals during the morning and afternoon, respectively. Jobs are released over a period of 11 hours, starting at 8:00 am. The maximum rate is λ(6) = 137.5, which is the average rate per hour estimated for rigid jobs. The thinning method is used to generate the NHPP [33].

5. Experimental validation

5.1. Experimental setup

All experiments are performed using the Grid scheduling simulator tgsf (Teikoku Grid Scheduling Framework). tgsf is a standard trace-based simulator that is used to study Grid resource management problems. Design details of the simulator are described in [34]. The Cybershake, Epigenomics, Genome, Inspiral, LIGO, Montage, and SIPHT workflows [35] are used. They are publicly available via the Pegasus project portal [29], and include V_j, E_j, size_j, p_j, p_j^G and cpn_j.
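The thinning method for generating an NHPP can be sketched as follows. This is an illustrative implementation, not the simulator's code; it assumes a piecewise-constant hourly rate like the λ(t) table above, with t measured in hours:

```python
import math
import random

def nhpp_thinning(rate, horizon, lam_max):
    """Generate NHPP arrival times on [0, horizon) by thinning: draw candidate
    arrivals from a homogeneous Poisson process with rate lam_max and accept a
    candidate at time t with probability rate(t) / lam_max."""
    t, arrivals = 0.0, []
    while True:
        # exponential inter-arrival time of the dominating homogeneous process
        t += -math.log(1.0 - random.random()) / lam_max
        if t >= horizon:
            return arrivals
        if random.random() <= rate(t) / lam_max:
            arrivals.append(t)

# Hourly rates over an 11-hour day, as in the lambda(t) table above.
hourly = [35, 50, 40, 50, 80, 100, 80, 60, 50, 40, 30]
rate = lambda t: hourly[int(t)]   # t in hours since 8:00 am
```

A workload is then obtained with nhpp_thinning(rate, 11.0, max(hourly)); the dominating rate must be at least the maximum of λ(t) for the acceptance probability to stay within [0, 1].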
Examples of the workflow types are shown in [12]. These workflows were executed using the Pegasus Workflow Management System on the TeraGrid. The workload, containing 1009 workflows ( tasks) released over a period of 11 hours, is used in the experiments for comprehensive analysis. In this paper, we also use TeraGrid site information for the experimental analysis. We have chosen three computational sites with a total of 5376 processors, which comprise the three smallest machines in the TeraGrid. We chose the TeraGrid as a testbed since it has been used in workflow scheduling studies [21]. Background workload (locally generated jobs), which is an important issue in non-dedicated Grid environments, is not addressed. Communication cost is not considered. Users of large parallel computers are typically required to provide run time estimates for submitted jobs. These are used by production system schedulers to bound job execution time and to avoid wasted resource consumption. Thus, when a job reaches such a limit, it is typically aborted. To our knowledge, run time estimates for workflows are not provided. Thus, for simulation purposes, we set , where c is a randomly generated value uniformly distributed between 0 and 5, and , for each T_v ∈ V_j, modeling the inaccuracy of the user run time estimates as proposed in [36]. Note that only level 3 and 4 allocation strategies use run time estimates. Tables IV and V summarize the Grid properties, site properties, and workflow scheduling parameters.

5.2. Performance analysis methodology

A good scheduling algorithm should schedule jobs to achieve high Grid performance while satisfying various user demands in an equitable fashion. Often, resource providers and users have different, conflicting performance goals: from minimizing response time to optimizing resource utilization. Grid resource management involves multiple objectives and may use multi-criteria decision support, for instance based on Pareto optimality.
However, it is very difficult to achieve the fast solutions needed for Grid resource management by using Pareto dominance. The problem is very often simplified to a single-objective problem or to different methods of combining objectives. There are various ways to model preferences; for instance, they can be given explicitly by stakeholders to specify the importance of every criterion or a relative importance between criteria. Due to the different nature of the criteria, the actual difference may have a different meaning. In order to provide effective guidance in choosing the best strategy, we performed a joint analysis of several metrics according to the methodology used in Ramírez-Alcaraz et al. [30]. The authors use an approach to multicriteria analysis assuming equal importance of each metric. The goal is to find a well-performing strategy under all test cases, with the expectation that it will also perform well under other conditions. First, we evaluate the degradation in performance of each strategy under each of the three metrics:

- Approximation factor ρ = C_max / C*_max
- Mean critical path waiting time
- Mean critical path slowdown
They are well-known performance metrics commonly used to express the objectives of different stakeholders of Grid scheduling (end-users, local resource providers, and Grid administrators). The objective of the mean critical path waiting time and slowdown metrics is to evaluate the quality of the critical path execution in the context of scheduling several workflows. In the experimental analysis, the lower bound of the optimal completion time is used instead of the optimal makespan. The degradation in performance is computed relative to the best performing strategy for the metric, as follows: 100 · strategy_metric / best_metric − 100. Thus, each strategy is now characterized by 3 numbers, reflecting its relative performance degradation under the test cases. In [30], these 3 values (assuming equal importance of each metric) are averaged and ranked. The best strategy, with the lowest average performance degradation, has rank 1. However, some metrics might have little variation, by nature or in a given scenario, and averaging would give them a lesser impact on the overall score. In this paper, a different approach is taken: we average the individual ranks instead of the metric degradations themselves. Note that we try to identify strategies which perform reliably well in different scenarios; that is, we try to find a compromise that considers all of our test cases. For example, the rank of a strategy by average performance degradation may not match its rank for any of the metrics individually.

Experimental results

In this section, the two-stage scheduling system is simulated with various types of input processes and constraints. The following simulation results compare the performance of the 6 strategies (5 MWGS2 strategies designed for rigid parallel jobs, and Rand). We conduct their comprehensive performance evaluation for one scenario considering three metrics.
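The rank-averaging step described above can be sketched as follows, with made-up degradation values. Per metric, the lowest degradation gets rank 1; ranks are then averaged across metrics with equal weights.

```python
def ranks(values):
    """Rank strategies by a metric's degradation: lowest degradation -> rank 1."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def average_ranks(degradations):
    """degradations[m][s]: % degradation of strategy s under metric m.
    Average each strategy's per-metric ranks (equal metric weights)."""
    per_metric = [ranks(row) for row in degradations]
    n = len(degradations[0])
    return [sum(pm[s] for pm in per_metric) / len(per_metric) for s in range(n)]

# Three metrics x three strategies (illustrative numbers only).
deg = [[0.0, 5.0, 100.0],
       [2.0, 0.0, 44.0],
       [1.0, 3.0, 23.0]]
avg = average_ranks(deg)  # strategy 3 is consistently worst here
```

Averaging ranks rather than raw degradations means a metric with tiny absolute variation still contributes fully to the overall score, which is exactly the motivation given in the text.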
Analysis of the 6 two-stage strategies is performed based on workload A. We select the 3 best performing strategies of the previous experiment and analyze them.

Scheduling strategies

In this section, we evaluate the scheduling strategies MWGS2online = WGS_Alloc + PS using workload A [12]. First, we compare allocation strategies considering the approximation factor, mean waiting time, and mean critical path slowdown independently. Then, we perform a joint analysis of these metrics according to the methodology described in Sec. 4.2, and present their ranking when considering the average over all metrics. A small percentage of degradation indicates that the performance of a strategy for a given metric is close to the performance of the best performing strategy for the same metric. Therefore, small degradations represent better results. Figs. 4.1 to 4.3 show the performance degradation of all strategies for ρ, cpw, and cps. Fig. 4.4 shows the mean degradation of the strategies when considering the average over all metrics. Fig. 4.5 shows the ranking for all test
cases.

From Fig. 4.1a we observe that MST, RRobin, and Rand are the strategies with the worst approximation factors, with 100, 44, and 23 percent degradation in performance, respectively. The MLB, MLp, and MaxAR strategies have small percentages of degradation in all metrics. The best approximation factor was achieved by MLp in both test cases. Results also show that a small mean critical path waiting time degradation significantly increases the mean critical path slowdown (Figs. 4.2, 4.3). Degradation occurs when, during the scheduling period of a job, newly arrived jobs contend for resources, thus increasing task waiting time and critical path length. The performance degradation of Rand is smaller than that of MST and RRobin for all metrics; this occurs because the range of processors corresponding to site 1 is nearly two times greater than the range of processors of the other two sites. As machine sizes become similar and the number of sites increases, we expect an increase in Rand's performance degradation. The strategies with the minimum mean degradation under workload A are MaxAR and MLp (Fig. 4.4). We assumed equal importance of each metric. Ranking results also show that the best 3 strategies are MLp, MaxAR, and MLB. Nevertheless, the optimal completion time is unknown, as results are based on the best performing strategy.

6. Conclusions

Effective workflow management requires an efficient allocation of tasks to limited resources over time and is currently the subject of many research projects. Multiple workflows and the online scheduling model add a new complexity to the problem.
The manner in which allocation of a workflow task can be done depends not only on the workflow properties and constraints, but also on the unpredictable workload generated by other workflows in the distributed context. This work extends the earlier results presented in [12] by considering the online scheduling of workflows. Contrary to other studies, job arrival times are modeled as a Non-Homogeneous Poisson Process. We determine the rate λ by analyzing workload traces from highly utilized systems. We analyze strategies that consist of two stages: adaptive allocation and parallel machine scheduling. We conduct a comprehensive performance evaluation study of 6 workflow scheduling strategies in a homogeneous Grid via simulation. In order to provide effective guidance in choosing the best strategy, we performed a joint analysis of three metrics (approximation factor, mean critical path waiting time, and critical path slowdown) according to a degradation methodology that considers multicriteria analysis assuming equal importance of each metric.
[Figure 4.1. Approximation factor performance degradation (all strategies)]
[Figure 4.2. Mean critical path waiting time performance degradation (all strategies)]
[Figure 4.3. Mean critical path slowdown performance degradation (all strategies)]
[Figure 4.4. Mean degradation of all strategies]
When we examine the overall Grid performance based on real data, we find that MaxAR and MLp achieve an appropriate distribution of jobs over the Grid. This result is consistent with the findings in [12]. However, the length of the critical path corresponding to jobs whose tasks have not been entirely scheduled tends to increase as jobs arrive and are scheduled. In future work, we will evaluate our scheduling strategies under more realistic conditions, such as background workload and heterogeneous resources.

7. References

[1] M. L. Pinedo, Scheduling: Theory, Algorithms, and Systems, 3rd Edition, Springer, (2008).
[2] C. McCreary, A. A. Khan, J. J. Thompson, M. E. McArdle, A comparison of heuristics for scheduling DAGs on multiprocessors, in: International Parallel and Distributed Processing Symposium / International Parallel Processing Symposium, pp , (1994).
[3] Y.-K. Kwok, I. Ahmad, Dynamic critical-path scheduling: An effective technique for allocating task graphs to multiprocessors, IEEE Transactions on Parallel and Distributed Systems, 7, pp , (1996).
[4] Y.-K. Kwok, I. Ahmad, Static scheduling algorithms for allocating directed task graphs to multiprocessors, ACM Comput. Surv., 31 (4), pp , (1999).
[5] J. Leung, L. Kelly, J. H. Anderson, Handbook of Scheduling: Algorithms, Models, and Performance Analysis, CRC Press, Inc., Boca Raton, FL, USA, (2004).
[6] S. Rajakumar, V. P. Arunachalam, V. Selladurai, Workflow balancing strategies in parallel machine scheduling, International Journal of Advanced Manufacturing Technology, 23, pp , (2004).
[7] L. F. Bittencourt, E. R. M. Madeira, Towards the scheduling of multiple workflows on computational grids, Journal of Grid Computing, 8, pp , (2010).
[8] H. Zhao, R. Sakellariou, Scheduling multiple DAGs onto heterogeneous systems, in: Parallel and Distributed Processing Symposium, 20th International, IPDPS 06, IEEE Computer Society, Washington, DC, USA, pp.
14, (2006).
[9] T. N'Takpé, F. Suter, Concurrent scheduling of parallel task graphs on multi-clusters using constrained resource allocations, in: International Parallel and Distributed Processing Symposium / International Parallel Processing Symposium, pp. 1–8, (2009).
[10] L. Zhu, Z. Sun, W. Guo, Y. Jin, W. Sun, W. Hu, Dynamic multi-DAG scheduling algorithm for optical grid environment, Network Architectures, Management, and Applications V (Proceedings Volume), 6784 (1), p. 1122, (2007).
[11] C.-C. Hsu, K.-C. Huang, F.-J. Wang, Online scheduling of workflow applications in grid environments, Future Gener. Comput. Syst., 27, pp , (2011).
[12] A. Hirales-Carbajal, A. Tchernykh, R. Yahyapour, J. L. González-García, T. Röblitz, J. M. Ramírez-Alcaraz, Multiple workflow scheduling strategies with user run time estimates on a grid, Journal of Grid Computing, Springer-Verlag, Netherlands, DOI: /s
[13] M. R. Garey, R. L. Graham, Bounds for multiprocessor scheduling with resource
constraints, SIAM Journal on Computing, 4, pp , (1975).
[14] H. Topcuoglu, S. Hariri, M.-Y. Wu, Performance-effective and low-complexity task scheduling for heterogeneous computing, IEEE Trans. Parallel Distrib. Syst., 13 (3), pp , (2002).
[15] Z. Shi, J. Dongarra, Scheduling workflow applications on processors with different capabilities, Future Generation Computer Systems, 22, pp , (2006).
[16] A. Ramakrishnan, G. Singh, H. Zhao, E. Deelman, R. Sakellariou, K. Vahi, K. Blackburn, D. Meyers, M. Samidi, Scheduling data-intensive workflows onto storage-constrained distributed resources, in: CCGRID 07: Proceedings of the 7th IEEE Symposium on Cluster Computing and the Grid, pp , (2007).
[17] F. R. L. Cicerre, E. R. M. Madeira, L. E. Buzato, A hierarchical process execution support for grid computing, in: Proceedings of the 2nd workshop on Middleware for grid computing, MGC 04, ACM, New York, NY, USA, pp , (2004).
[18] L. F. Bittencourt, E. R. M. Madeira, A dynamic approach for scheduling dependent tasks on the Xavantes Grid middleware, in: MCG 06: Proceedings of the 4th international workshop on Middleware for grid computing, ACM, New York, NY, USA, pp , (2006).
[19] L. F. Bittencourt, E. R. M. Madeira, On the distribution of dependent tasks over non-dedicated grids with high bandwidth, in: III Workshop TIDIA FAPESP, Proceedings III Workshop TIDIA, Sao Paulo, Brazil, (2006).
[20] Y. Gong, M. E. Pierce, G. C. Fox, Dynamic resource-critical workflow scheduling in heterogeneous environments, in: Job Scheduling Strategies for Parallel Processing, pp. 1–15, (2009).
[21] G. Singh, M.-H. Su, K. Vahi, E. Deelman, B. Berriman, J. Good, D. S. Katz, G. Mehta, Workflow task clustering for best effort systems with Pegasus, in: MG 08: Proceedings of the 15th ACM Mardi Gras conference, ACM, New York, NY, USA, pp. 1–8, (2008).
[22] T. A. Henzinger, V. Singh, T. Wies, D.
Zufferey, Scheduling large jobs by abstraction refinement, in: Proceedings of the sixth conference on Computer systems, EuroSys 11, ACM, New York, NY, USA, pp , (2011).
[23] S. C. Kim, S. Lee, J. Hahm, Push-pull: Deterministic search-based DAG scheduling for heterogeneous cluster systems, IEEE Transactions on Parallel and Distributed Systems, 18 (11), pp , (2007).
[24] E. Saule, D. Trystram, Analyzing scheduling with transient failures, Inf. Process. Lett., 109 (11), pp , (2009).
[25] Y. Jia, B. Rajkumar, Scheduling scientific workflow applications with deadline and budget constraints using genetic algorithms, Scientific Programming, 14 (3), pp , (2006).
[26] S. Ranaweera, D. Agrawal, A scalable task duplication based scheduling algorithm for heterogeneous systems, in: Proceedings International Conference on Parallel Processing, pp , (2000).
[27] A. Tchernykh, U. Schwiegelshohn, R. Yahyapour, N. Kuzjurin, Online hierarchical job scheduling on grids with admissible allocation, J. of Scheduling 13 (2010).
[28] A. Quezada-Pina, A. Tchernykh, J.
L. González-García, A. Hirales-Carbajal, V. Miranda-López, J. M. Ramírez-Alcaraz, U. Schwiegelshohn and R. Yahyapour, Adaptive job scheduling on hierarchical Grids, Future Generation Computer Systems, Elsevier Science, (2011, accepted for publication).
[29] Workflow generator (August 2010). https://confluence.pegasus.isi.edu/display/pegasus/WorkflowGenerator.
[30] J. M. Ramirez-Alcaraz, A. Tchernykh, R. Yahyapour, U. Schwiegelshohn, A. Quezada-Pina, J. L. Gonzalez-García, A. Hirales-Carbajal, Job allocation strategies with user run time estimates for online scheduling in hierarchical grids, Journal of Grid Computing, 9, pp , (2011).
[31] K. Lee, N. W. Paton, R. Sakellariou, E. Deelman, A. A. A. Fernandes, G. Mehta, Adaptive workflow processing and execution in Pegasus, Concurrency and Computation: Practice and Experience, 21, pp , (2009).
[32] Standard workload archive (August 2010). index.html.
[33] X.-T. Wang, S.-Y. Zhang, S. Fan, Nonhomogeneous fractional Poisson processes, Chaos Solitons Fractals, 31 (1), pp , (2007).
[34] A. Hirales-Carbajal, A. Tchernykh, T. Roblitz, R. Yahyapour, A grid simulation framework to study advance scheduling strategies for complex workflow applications, in: IEEE International Symposium on Parallel Distributed Processing, Workshops and PhD Forum (IPDPSW), pp. 1–8, (2010).
[35] S. Bharathi, A. Chervenak, E. Deelman, G. Mehta, M.-H. Su, K. Vahi, Characterization of scientific workflows, in: Third Workshop on Workflows in Support of Large-Scale Science, WORKS 2008, pp. 1–10, (2008).
[36] C. B. Lee, Y. Schwartzman, J. Hardy, A. Snavely, Are user runtime estimates inherently inaccurate?, in: Job Scheduling Strategies for Parallel Processing, pp , (2004).
Energy Efficiency of Online Scheduling Strategies in Two-level Hierarchical Grids

Alonso Aragón Ayala 1, Andrei Tchernykh 1, Ramin Yahyapour 2, Raúl Ramírez Velarde 3

1 Computer Science Department, CICESE Research Center, Carretera Ensenada-Tijuana No. 3918, Zona Playitas, CP 22860, Ensenada, BC, México; 2 GWDG, University of Göttingen, Göttingen, Germany; 3 Computer Science Department, ITESM, Av. Eugenio Garza Sada 2501 Sur, Col. Tecnológico, CP 64849, Monterrey, NL, México

Abstract

In this paper, we focus on energy-efficient non-preemptive parallel job scheduling in Grid computing. Our strategies turn resources off when they are not needed and turn them back on to provide the desired QoS. We analyze online scheduling algorithms on two-level hierarchical Grids. Scheduling is split into a site allocation part and a local machine scheduling part. At the first stage, we allocate a suitable machine for each job using a given selection criterion. At the second stage, a local scheduling algorithm is applied to each machine for the jobs allocated during the previous stage. We consider 4 metrics for scheduling evaluation: waiting time, response time, bounded slowdown, and energy consumption. To show the practical applicability of our methods, we perform a comprehensive study of the performance of the proposed strategies and their derivatives using simulation. We take into account several issues that are critical for practical adoption of the scheduling algorithms: we use Grid workloads based on real production traces, consider two Grid scenarios based on heterogeneous HPC systems, and consider the problem of scheduling jobs with unspecified execution time requirements, as only user run time estimates are available in real execution environments.

Keywords: Grid, scheduling, energy-aware

1. Introduction

In recent years, Green IT has emerged as a new research domain [1].
There are several efforts to reduce energy consumption in large-scale systems (datacenters, Grids, Clouds) and improve their energy efficiency. Diverse aspects of the problem are discussed in the literature to cope with the new challenges of multi-domain distributed systems, including software and hardware aspects. In this paper, we focus on energy-efficient scheduling on a large-scale distributed system, namely the Grid. The Grid is a platform that offers an integrated architecture and aggregation of geographically distributed resources.
Grids are designed to cope with the overall system's peak load, but in lower-load periods excess capacity can be turned off. Studies of the usage of Grid resources show that the utilization of a Grid site may vary from less than 20% to over 90% on a day-to-day basis [2]. This means that there is an opportunity for energy saving mechanisms that switch unused resources on and off to adapt the available capacity to actual demands. Computational Grids have different constraints from those of traditional high-performance computing systems, such as heterogeneous resources. Energy efficiency in Grids is further complicated by the fact that the several sites that comprise the Grid probably have different ownerships and priorities. Therefore, a power-aware scheduling algorithm designed for the Grid should consider non-equal site policies.

Scheduling algorithms can be classified as static or dynamic. Static scheduling algorithms [3] assume that all information required for scheduling decisions (characteristics of jobs and sites) is known in advance. Scheduling decisions are made at job release time and remain constant during runtime. These assumptions may not apply to Grid environments, due to the difficulty of achieving and maintaining 100% information accuracy. In contrast, dynamic scheduling algorithms [4, 5] use runtime information to make scheduling decisions. One of the major drawbacks of dynamic algorithms is the possible inaccuracy of the performance prediction information that the algorithm uses for scheduling purposes.

Due to the size and dynamicity of Grids, the process of allocating jobs to available resources is more complex than in traditional high-performance systems. Many studies propose either a distributed resource management system [6] or a centralized one [7, 8], while real implementations have a combination of centralized and decentralized architectures [9]. These hierarchical multilayer systems have been considered in [10–17].
The highest layer, or Grid-layer scheduler, has a general view of job requests but is unaware of specific details of the state of the resources. Local resource management systems, the lower-level layer, know the resource state and the jobs that are forwarded to them. In very large systems, additional layers may exist in between. Therefore, an efficient resource management system for Grids requires a suitable combination of scheduling algorithms that support such multilayer structures. In this paper, we conduct a performance evaluation study of energy-aware two-layer online Grid scheduling strategies. The first layer allocates a job to a suitable machine using a given criterion, while the second applies machine-dependent scheduling algorithms to the allocated jobs. Given that Grid resources do not typically share the same management system and are connected via wide area networks, migration between resources requires a significant overhead and can be very challenging. Therefore, we do not consider multisite execution or job migration: once a job has been allocated to a machine, it must be executed on this machine. We assume rigid parallel jobs; that is, the jobs have a specific degree of parallelism and must be assigned to the specified number of processors during their execution. We restrict ourselves to machines with different numbers of identical processors, as the architectures of individual cores and their clock frequencies tend to be rather
similar.

After we formally present our Grid scheduling model in Section 2, we introduce the energy model in Section 3. Next, we discuss the metrics we will use in Section 4. Section 5 presents the algorithms, followed by the experimental setup in Section 6 and results in Section 7. Finally, we conclude with a summary in Section 8.

2. Grid model

We address an online scheduling problem: n parallel jobs J_1, J_2, …, J_n must be scheduled on m parallel machines (sites) N_1, N_2, …, N_m. Let m_i be the number of identical processors or cores of machine N_i. We denote the total number of processors belonging to machines N_1 to N_m by m_{1,m} = Σ_{i=1}^{m} m_i. We assume, without loss of generality, that the machines are arranged in non-descending order of their number of processors, that is, m_1 ≤ m_2 ≤ … ≤ m_m holds. Each job J_j is described by a tuple (r_j, size_j, p_j, p̃_j): its release time r_j ≥ 0, its size 1 ≤ size_j ≤ m_m, its execution time p_j, and its user runtime estimate p̃_j. A machine must execute a job J_j by allocating size_j processors for an uninterrupted period of time p_j. Multisite execution is not allowed, so in order to execute job J_j on machine N_i, size_j ≤ m_i must hold. We use g_j = i to denote that job J_j is allocated to machine N_i, while n_i is the number of jobs allocated to machine N_i. c_j is the completion time of job J_j, while the makespan is C_max = max_{J_j}{c_j}. We use C*_max to denote the optimal makespan. Also, w_j = p_j · size_j and w̃_j = p̃_j · size_j are the work and estimated work of job J_j. At its release date, each job must be immediately allocated to a single machine. However, processor allocation of a job can be delayed until the required number of processors becomes available. A job must be executed by allocating exactly size_j processors for an uninterrupted period of time p_j.
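The model above can be made concrete with a small sketch; names and values here are illustrative, not taken from the paper's simulator.

```python
from dataclasses import dataclass

@dataclass
class Job:
    r: float      # release time r_j >= 0
    size: int     # processors required, 1 <= size_j <= m_m
    p: float      # execution time p_j
    p_est: float  # user run time estimate

machines = [64, 128, 256]            # m_i, identical cores per machine
assert machines == sorted(machines)  # m_1 <= m_2 <= ... <= m_m
total = sum(machines)                # m_{1,m}

def feasible_sites(job, machines):
    """A job can only run on machine N_i if size_j <= m_i holds."""
    return [i for i, m_i in enumerate(machines) if job.size <= m_i]

j = Job(r=0.0, size=100, p=3600.0, p_est=7200.0)
print(feasible_sites(j, machines))  # -> [1, 2]: only the two larger sites fit
work = j.p * j.size                 # w_j = p_j * size_j
```

This feasibility check is the constraint behind the no-multisite assumption: a 100-processor job simply cannot be placed on the 64-processor site, so the allocation layer must pick among the remaining machines.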
Since we do not allow multisite execution or co-allocation of jobs on different machines, a job J_j can only run on machine N_i if size_j ≤ m_i holds. We assume all resources are dedicated to the Grid. We use g_j = i to denote that job J_j is allocated to machine N_i, while n_i is the total number of jobs allocated to machine N_i. The completion time of job J_j of instance I in a schedule S is denoted by c_j(S, I), and the makespan of a schedule S and instance I is C_max(S, I) = max_{J_j}{c_j(S, I)}. We use C*_max(I) to denote the optimal makespan of instance I. When possible and without causing ambiguity, we will omit instance I and schedule S.

3. Energy model

Site N_i with m_i cores consists of CH_i chassis, BO_i boards per chassis, and CR_i cores per board. For simplicity and without loss of generality, we assume that all cores are identical. Each component of the Grid consumes a predefined amount of energy when it is on, plus an extra amount when it is working at a given moment. For each of the n_i^{startsite} times a site is turned on, we must wait an amount of time equal to T^{startsite} before the site can process a job; during that start period, the site consumes P^{startsite} power. The total turn-on consumption is E_i^{startsite} = n_i^{startsite} · T_i^{startsite} · P_i^{startsite}. Energy consumption of the Grid can be
represented as a sum of consumption in operational and start modes, E^{grid} = E^{opgrid} + E^{startgrid}, where E^{opgrid} = Σ_{t=1}^{C_max} P^{opgrid}(t) denotes the energy the Grid consumes while it is operational, and E^{startgrid} = Σ_{i=1}^{m} E_i^{startsite} denotes the energy it consumes when starting up. We use P^{opgrid}(t) = Σ_{i=1}^{m} q_i(t) · (P^{idlesite} + P_i^{opsite}(t)) to represent the power consumption of the Grid at a given time t, where q_i(t) is equal to 1 if site i is on at time t, and 0 otherwise. P^{idlesite} is a constant value that represents the power consumption of a site when it is on, and P_i^{opsite}(t) denotes the extra power consumption of site i when it is operating: P_i^{opsite}(t) = Σ_{ch=1}^{CH_i} r_ch(t) · (P^{idlechassis} + P_ch^{opchassis}(t)), where r_ch(t) is equal to 1 if chassis ch = 1, …, CH_i is on at time t, and 0 otherwise. P^{idlechassis} is a constant value that represents the power consumption of a chassis when it is on, and P_ch^{opchassis}(t) denotes the extra power consumption of chassis ch when it is operating: P_ch^{opchassis}(t) = Σ_{bo=1}^{BO_i} s_bo(t) · (P^{idleboard} + P_bo^{opboard}(t)), where s_bo(t) is equal to 1 if board bo = 1, …, BO_i is on at time t, and 0 otherwise. P^{idleboard} represents the power consumption of a board when it is on, and P_bo^{opboard}(t) denotes the extra power consumption of a board when it is operating. Finally, P_bo^{opboard}(t) = Σ_{cr=1}^{CR_i} w_cr(t) · (P^{idlecore} + v_cr(t) · P^{workcore}) represents the power consumption of all the cores on board bo, where w_cr(t) is equal to 1 if core cr = 1, …, CR_i is on at time t, and 0 otherwise. P^{idlecore} is a constant value that represents the power consumption of an idle core; v_cr(t) is equal to 1 if core cr is working at a given time t, and 0 otherwise; and P^{workcore} is a constant value representing the extra power consumption of a working core.
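The hierarchical power model (core inside board inside chassis inside site) can be sketched as follows. The power constants are illustrative placeholders, not the values from Table 2.

```python
# Power constants (watts); illustrative values only.
P_IDLE_SITE, P_IDLE_CHASSIS, P_IDLE_BOARD = 200.0, 100.0, 50.0
P_IDLE_CORE, P_WORK_CORE = 10.0, 25.0

def board_power(cores_on, cores_working):
    # Sum over cores of w_cr * (P_idlecore + v_cr * P_workcore):
    # every powered-on core pays idle power, working cores pay extra.
    return cores_on * P_IDLE_CORE + cores_working * P_WORK_CORE

def chassis_power(boards):
    # boards: one (cores_on, cores_working) pair per powered-on board.
    return sum(P_IDLE_BOARD + board_power(on, work) for on, work in boards)

def site_power(chassis_list):
    # chassis_list: one board list per powered-on chassis of the site.
    return P_IDLE_SITE + sum(P_IDLE_CHASSIS + chassis_power(b) for b in chassis_list)

# One site, one chassis, two boards: 8 cores on / 4 working, 8 on / 0 working.
p = site_power([[(8, 4), (8, 0)]])  # instantaneous power, in watts
```

Summing this instantaneous power over each simulated time step, plus the start-up terms, gives E^{grid} as defined above.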
4. Metrics

For our simulation experiments, we consider four metrics: mean waiting time, t_w = (1/n) Σ_{j=1}^{n} (c_j − p_j − r_j); mean response time, TA = (1/n) Σ_{j=1}^{n} (c_j − r_j); mean bounded slowdown; and the energy consumption of the Grid, E^{grid} = Σ_t P^{grid}(t). We denote the Grid machine model by GPm, and characterize the problem as GPm | r_j, size_j, p_j, p̃_j | {TA, t_w, ρ, E^{grid}}, using the three-field notation (α | β | γ) introduced by Graham et al. [18]. This notation describes the machine environment (α), job characteristics (β), and objective function (γ).

5. Algorithms

We consider algorithms that have knowledge only of the number of unfinished jobs in the system, their processor requirements, and user runtime estimates.

5.1 Allocation Strategies

As already discussed, our Grid scheduling algorithms can be split into a global allocation part and a local scheduling part. Hence, we regard them as a two-stage scheduling strategy: MPS = MPS_Alloc + PS. All algorithms proceed on a job-by-job basis. We consider three levels of information available to our allocation strategies (see Table 1).

Level 1: There is no information on the processing time of jobs; only the processor requirement is known. But our algorithm may use information from previously performed
allocations.

Level 2: We know the job runtime estimate p̃_j along with the information of Level 1.

Level 3: We have access to the information of Level 2 and to all local schedules as well.

5.2 Parallel Machine Scheduling

Once a job has been allocated to a machine, the local resource management system (LRMS) generates a schedule for that machine.

[Table 1. Allocation strategies]

Different scheduling algorithms can be used
by the LRMS; many systems apply the First-Come First-Served (FCFS) algorithm, which schedules jobs in the order of their arrival. In the experimental parts of this paper, we assume that the LRMS uses an online parallel scheduling algorithm: the First-Come First-Served policy with EASY backfilling [8], where the scheduler may use later jobs to fill holes in the schedule. In order to apply EASY backfilling, user-estimated runtimes are used.

6. Experimental setup

Two fundamental issues have to be addressed when setting up a simulation environment for performance evaluation. On one hand, representative workload traces are needed. On the other hand, an appropriate testing environment should be set up to obtain reproducible and comparable results. In this section, we present the most common Grid trace-based simulation setups with an emphasis on these two issues.

6.1 Grid Configuration

We consider two Grid scenarios for evaluation: Grid1 and Grid2. In Grid1, we consider seven HPC machines with a total of 4,442 processors, as presented in Table 3. In Grid2, we consider nine sites with a total of 2,194 processors (see Table 4). The respective logs are used to create the Grid log. Details of the log characteristics can be found in the Parallel Workloads Archive (PWA) [20] and the Grid Workloads Archive (GWA) [21].

6.2 Power configuration

We consider homogeneous processors with identical power requirements. Each processor uses a 5-second time lapse: if it is idle for this period, it turns off to save power. For a board, this is instant: when all processors on a board are turned off, the board turns off immediately. Finally, for a site, we consider that if the site is off when a job arrives, we must wait 20 seconds for it to start processing. Table 2 defines the energy consumption of each component in the Grid.

[Table 2. Power configuration]
6.3 Workload

To test job execution performance in a dedicated Grid environment, we use Grid workloads based on real production traces. Carefully reconstructed traces from real supercomputers provide a very realistic job stream for simulation-based performance evaluation of Grid job scheduling algorithms. Background workload (locally generated jobs), which is an important issue in non-dedicated Grid environments, is not addressed. We consider logs from the PWA and GWA, and apply a filter to remove invalid jobs and normalize time zones. We remove jobs with the following parameters: job number ≤ 0; submit time < 0; runtime ≤ 0; number of allocated processors ≤ 0; requested time ≤ 0; user ID ≤ 0; status = 0, 4, 5 (0 = job failed; 4 = partial execution, job failed; 5 = job was
cancelled, either before starting or during the run). It is known that job demand varies with the time of the day and the day of the week. Moreover, sites are located in different time zones. This is the reason for the mentioned normalization of the workloads: shifting the workloads by a certain time interval represents a more realistic setup. We transformed the workloads so that all traces begin on the same weekday and at the same time of day. To this end, we removed all jobs until the first Monday at midnight. Note that the alignment is relative to local time; hence the time differences corresponding to the original time zones are maintained.

[Table 3. Grid1 characteristics]
[Table 4. Grid2 characteristics]

7. Simulation results

In this section, we analyze 13 allocation strategies (Lbal_S, Lbal_T, Lbal_W, MCT, MLB, MLp, MPL, MST, MWT, MWWT_S, MWWT_T, MWWT_W, Rand) together with the EASY backfilling local scheduling algorithm.

7.1 Methodology

Users and resource providers usually have different and often conflicting goals. Hence, Grid resource management should involve multicriteria decision support. The general methodology applies Pareto optimality;
however, it is very difficult to achieve the fast solutions needed for Grid resource management. The problem may be simplified to a single-objective problem by combining all the objectives into one. We performed a joint analysis of several metrics, according to the methodology proposed in [22]. The authors present an approach to multicriteria analysis that considers equal importance of each metric. The goal is to find a well-performing strategy under all test cases, expecting it to perform well under other Grid conditions.

[Table 5. Performance degradation for strategies in Grid1]

First, we calculate the performance degradation of each strategy under each of the four metrics. This is done relative to the best performing strategy for the metric, as follows: 100 · strategy_metric / best_metric − 100. Each strategy is characterized by four numbers, reflecting its relative performance degradation under our test cases. In the second step, we take the average of these four values (assuming equal importance for each metric) and rank the strategies. The best strategy, with the lowest average performance degradation, has rank 1; the worst has rank 13. Then we calculate the average performance degradation and ranking. The rank may not be the same as for any of the individual metrics.

7.2 Experimental analysis

Results of all strategies under our metrics are presented in three tables. Table 5 gathers results from the Grid1 scenario and Table 6 shows the results for the Grid2 scenario. Table 7 presents the results of each strategy considering both Grid scenarios. The C_AVG column represents an average value considering only the performance metrics, that is, excluding E^{grid}, and T_AVG is an average value over all four metrics. The numbers in these tables represent performance degradations normalized to the best value found for each metric.
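The construction of these normalized tables and the ranking can be sketched as follows, with made-up raw metric values; each metric is normalized to its best (lowest) value, so the best strategy shows 1.00.

```python
def normalize(values):
    """Normalize a metric across strategies to its best (lowest) value:
    the best strategy gets 1.00, others show values above 1.00."""
    best = min(values)
    return [v / best for v in values]

def rank_strategies(metric_values):
    """metric_values[m][s]: raw value of strategy s under metric m.
    Average each strategy's normalized values over the metrics
    (equal weights), then rank: rank 1 = lowest average."""
    norm = [normalize(row) for row in metric_values]
    n = len(metric_values[0])
    avg = [sum(row[s] for row in norm) / len(norm) for s in range(n)]
    order = sorted(range(n), key=lambda s: avg[s])
    rank = [0] * n
    for pos, s in enumerate(order, start=1):
        rank[s] = pos
    return avg, rank

# Two metrics x three strategies, made-up raw values.
avg, rank = rank_strategies([[10.0, 12.0, 20.0], [5.0, 4.0, 6.0]])
# avg is approximately [1.12, 1.10, 1.75]; rank -> [2, 1, 3]
```

A strategy that is best on one metric can still lose the overall ranking, which is exactly the pattern observed for MWWT_W below.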
The value 1.00 is the best for that metric. We also consider different conditions and test cases. Grid1 is characterized by seven sites with an average of 634 processors and
jobs per site, while Grid2 is characterized by nine smaller sites with 243 processors and 4280 jobs per site, on average. As seen from Table 5, the strategy MWWT_W has the best performance for the Grid1 scenario if we consider only the energy consumption of the Grid, but it is one of the worst in the other three metrics. On the other hand, Lbal_S performs well on the performance metrics and, although its energy consumption is higher than that of the others, it is the best strategy for this scenario considering the average of all metrics, as shown in the last column. Table 6 shows that, for the Grid2 scenario, the best strategy for energy saving is still MWWT_W but, similar to the previous scenario, its performance under the other metrics is poor. MST proves to be the best strategy for the other metrics and the best for this scenario overall, with an energy consumption 3.49% higher than that of MWWT_W.

Table 6. Performance degradation for strategies in Grid2

We present the strategy ranking in Table 8, showing the performance metrics, the energy consumption, and the overall ranking. MWWT_W is the best strategy considering energy consumption for both scenarios; MST is the best strategy considering all metrics and scenarios.

Table 7. Performance degradation for strategies in both scenarios
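The C_AVG and T_AVG columns can be illustrated with a small sketch (the row of normalized values is invented):

```python
# Sketch of the C_AVG and T_AVG columns: averages of the normalized
# degradations excluding and including the Grid energy metric.
# The row values are invented for illustration.
row = {"slowdown": 1.05, "response": 1.10, "wait": 1.20, "energy": 1.00}
performance_metrics = ["slowdown", "response", "wait"]   # E_grid excluded
c_avg = sum(row[m] for m in performance_metrics) / len(performance_metrics)
t_avg = sum(row.values()) / len(row)
```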
Table 8. Strategies ranking

8. Conclusions and future work

As computational Grids become larger and more accessible, minimizing energy consumption gains relevance alongside traditional performance objectives. In this work, we have studied the performance of several Grid allocation strategies, taking into account their energy consumption, to find a compromise between energy consumption and quality of service. Simulation results show that, in terms of minimizing energy consumption, bounded slowdown, response time and mean waiting time, MST is a robust and effective strategy, reducing the energy consumption of the Grid without losing much performance quality.

9. References

[1] A.-C. Orgerie, L. Lefèvre and J.-P. Gelas, Demystifying energy consumption in Grids and Clouds.
[2] A.-C. Orgerie, L. Lefèvre, and J.-P. Gelas, How an experimental grid is used: The Grid5000 case and its impact on energy usage, in: Proc. 8th IEEE Int. Symp. on Cluster Computing and the Grid (CCGrid2008), May.
[3] C. Kim, H. Kameda, An algorithm for optimal static load balancing in distributed computer systems, IEEE Transactions on Computers 41 (1992).
[4] K. Lu, R. Subrata, A.Y. Zomaya, Towards decentralized load balancing in a computational grid environment, in: Proceedings of the 1st International Conference on Grid and Pervasive Computing (published in Springer-Verlag Lecture Notes in Computer Science), Taichung, Taiwan, 2006.
[5] R. Subrata, A.Y. Zomaya, B. Landfeldt, Artificial life techniques for load balancing in computational grids, Journal of Computer and System Sciences 73 (2007).
[6] Ernemann, C., Yahyapour, R.: Applying economic scheduling methods to Grid environments. In: Grid Resource Management: State of the Art and Future Trends. Kluwer, Dordrecht (2004).
[7] Ernemann, C., Hamscher, V., Schwiegelshohn, U., Yahyapour, R., Streit, A.: On advantages of Grid computing for parallel job scheduling. In: 2nd IEEE/ACM International Symposium on Cluster
Computing and the Grid, pp. 39. IEEE Computer Society (2002).
[8] Ernemann, C., Hamscher, V., Yahyapour, R.: Benefits of global Grid computing for job scheduling. In: Fifth IEEE/ACM International Workshop on Grid Computing (Grid '04), in conjunction with SuperComputing 2004. IEEE Computer Society, Pittsburgh (2004).
[9] Vázquez-Poletti, J.L., Huedo, E., Montero, R.S., Llorente, I.M.: A comparison between two Grid scheduling philosophies: EGEE WMS and GridWay. Multiagent and Grid Systems. Grid Computing, High Performance and Distributed Applications 3 (2007).
[10] Schwiegelshohn, U., Yahyapour, R.: Attributes for communication between Grid scheduling instances. In: Nabrzyski, J., Schopf, J.M., Weglarz, J. (eds.) Grid Resource Management: State of the Art and Future Trends. Kluwer, Norwell (2004).
[11] Kurowski, K., Nabrzyski, J., Oleksiak, A., Weglarz, J.: A multicriteria approach to two-level hierarchy scheduling in Grids. J. Sched. 11 (2008).
[12] Zikos, S., Karatza, H.D.: Resource allocation strategies in a 2-level hierarchical Grid system. In: Annual Simulation Symposium (ANSS), Ottawa, Ont. (2008).
[13] Chunlin, L., Layuan, L.: Multilevel scheduling for global optimization in Grid computing. Comput. Electr. Eng. 34 (2008).
[14] Wäldrich, O., Wieder, P., Ziegler, W.: A meta-scheduling service for co-allocating arbitrary types of resources. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Wasniewski, J. (eds.) Parallel Processing and Applied Mathematics, vol. 3911. Springer, Heidelberg (2006).
[15] Tchernykh, A., Ramírez, J., Avetisyan, A., Kuzjurin, N., Grushin, D., Zhuk, S.: Two-level job scheduling strategies for a computational Grid. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Wasniewski, J. (eds.) 6th International Conference on Parallel Processing and Applied Mathematics PPAM 2005, LNCS, vol. 3911. Springer, Heidelberg (2006).
[16] Zhuk, S., Chernykh, A., Avetisyan, A., Gaissaryan, S., Grushin, D., Kuzjurin, N., Pospelov, A., Shokurov, A.: Comparison of scheduling heuristics for Grid resource broker. In: Third International IEEE Conference on Parallel Computing Systems (PCS 2004). IEEE, Colima, Colima, México (2004).
[17] Pugliese, A., Talia, D., Yahyapour, R.: Modeling and supporting Grid scheduling. Journal of Grid Computing 6 (2008).
[18] Graham, R.L., Lawler, E.L., Lenstra, J.K., Rinnooy Kan, A.H.G.: Optimization and approximation in deterministic sequencing and scheduling: a survey. In: Hammer, P.L., Johnson, E.L., Korte, B.H. (eds.) Annals of Discrete Mathematics 5. Discrete Optimization II. North-Holland, Amsterdam (1979).
[19] Naroska, E., Schwiegelshohn, U.: On an on-line scheduling problem for parallel jobs. Inf. Process. Lett. 81 (2002).
[20] Parallel Workloads Archive. cs.huji.ac.il/labs/parallel/workload/
[21] Grid Workloads Archive, TU Delft. ewi.tudelft.nl
[22] Tsafrir, D., Etsion, Y., Feitelson, D.G.: Backfilling using system-generated predictions rather than user runtime estimates. IEEE Trans. Parallel Distrib. Syst. 18 (2007).
[23] Tchernykh, A., Schwiegelshohn, U., Yahyapour, R., Kuzjurin, N.: Online hierarchical job scheduling on Grids with admissible allocation. Journal of Scheduling, Volume 13, Issue 5. Springer-Verlag, Netherlands.
[24] Ariel Quezada-Pina, Andrei Tchernykh, José Luis González-García, Adán Hirales-Carbajal, Vanessa Miranda-López, Juan Manuel Ramírez-Alcaraz, Uwe Schwiegelshohn, Ramin Yahyapour. Adaptive job scheduling on hierarchical Grids. Future Generation Computer Systems, Elsevier Science.
Genetic Algorithms for Job Scheduling in Two-Level Hierarchical Grids: Crossover Operators Comparison

Víctor Hugo Yaurima Basaldúa 1, Andrei Tchernykh 2

1 CESUES Superior Studies Center, Carretera a Sonoyta km, San Luis Río Colorado, Sonora, 83400, México; 2 CICESE Research Center, Carretera a Tijuana-Ensenada km. 107, Ensenada, B.C.N., 22860, México

Abstract

This paper considers the application of genetic algorithms to offline scheduling of parallel jobs in a two-stage hierarchical Grid. In the first stage, jobs are assigned to a site responsible for the distribution of jobs to data centers. In the second stage, the jobs in the data centers are scheduled locally. While genetic algorithms have the disadvantage of excessive processing time to find a schedule, they can obtain high-quality results. In addition, they can easily incorporate multiple optimization objectives. In this paper, we use a genetic algorithm in the first stage, compare six crossover operators and three mutation operators, and conduct a multivariate statistical analysis of variance.

Keywords: Genetic algorithm, crossover operator, mutation operator, offline scheduling.

1. Introduction

Due to the size and dynamic nature of Grids, the process of allocating computational jobs to available Grid resources must be done in an automatic and efficient fashion. Various scheduling systems have been proposed and implemented in different types of Grids. However, there are still many open issues in this field, including the consideration of multiobjective optimization and multiple layers of scheduling. In this paper, we present experimental work in the field of multiobjective scheduling in a two-layer computational Grid. At the first layer, we select the most suitable machine for each job using a given criterion. At the second layer, a local scheduling algorithm is applied to each machine independently.
In such an environment, one of the big challenges is to provide scheduling that allows more efficient use of resources and satisfies other demands. The optimization criteria are often in conflict. For instance, resource providers and users have different objectives: providers strive for high utilization of resources, while users are interested in a fast response. We provide solutions that consider both goals. An aggregation method for the criteria and a scalar function to normalize
them are used. We examine the overall Grid performance based on real data and present a comprehensive comparative analysis of six crossover operators, three mutation operators, five values of crossover probability, and five values of mutation probability. To tune up the genetic algorithm, a multifactorial analysis of variance is applied. After formally presenting our Grid scheduling model in Section 2, we discuss related work in Section 3. We introduce genetic scheduling algorithms and discuss their application to Grid scheduling in Section 4. The genetic algorithm calibration is presented in Section 5. Finally, we conclude with a summary in Section 6.

2. Model

We address an offline scheduling problem: parallel jobs J_1, J_2, …, J_n must be scheduled on parallel machines (sites) N_1, N_2, …, N_m. Let m_i be the number of identical processors or cores of machine N_i. Assume without loss of generality that machines are arranged in non-descending order of their numbers of processors, that is, m_1 ≤ m_2 ≤ … ≤ m_m holds. Each job J_j is described by a tuple (size_j, p_j, p̃_j): its size 1 ≤ size_j ≤ m_m, also called processor requirement or degree of parallelism, its execution time p_j, and a user runtime estimate p̃_j. The release date of a job is zero; all jobs are available before the scheduling process starts. The job processing time is unknown until the job has completed its execution (non-clairvoyant case). The user runtime estimate p̃_j is provided by the user. A machine must execute a job by exclusively allocating exactly size_j processors to it for an uninterrupted period of time p_j. As we do not allow multi-site execution and co-allocation of processors from different machines, a job J_j can only run on machine N_i if size_j ≤ m_i holds. Two criteria are considered: the makespan C_max = max{C_i}, i = 1, …, m, where C_i is the maximum completion time on machine N_i, and the mean turnaround time TA = (1/n) Σ_{j=1..n} c_j, where c_j is the completion time of job J_j.
We denote our Grid model by GPm. In the three-field notation (α|β|γ) introduced in [6], our scheduling problem is characterized as GPm | size_j, p_j, p̃_j | OWA, where OWA = w_1·C_max + w_2·TA is the multicriteria aggregation operator and w_1, w_2 are the linear combination weights. The scheduling problem at the second stage is denoted as Pm | size_j, p_j, p̃_j | C_max.

3. Related Work

3.1 Hierarchical scheduling

Scheduling algorithms for two-layer Grid models can be split into a global allocation part and a local scheduling part. Hence, we regard MPS (Multiple machine Parallel Scheduling) as a two-stage scheduling strategy: MPS = MPS_Alloc + PS [20]. At the first stage, we allocate a suitable machine for each job using a genetic algorithm. At the second stage, the PS (single machine Parallel Scheduling) algorithm is applied to each machine independently for the jobs allocated during the previous stage.
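A minimal sketch of the two criteria of Section 2, with an invented job-to-machine assignment:

```python
# Minimal sketch of the two criteria; the job-to-machine assignment and
# the completion times c_j are invented for illustration.
completion = {          # job -> (machine index i, completion time c_j)
    "J1": (0, 10.0), "J2": (0, 25.0), "J3": (1, 18.0), "J4": (1, 30.0),
}
n = len(completion)

# C_i is the maximum completion time on machine N_i; C_max is their maximum
C = {}
for i, c in completion.values():
    C[i] = max(C.get(i, 0.0), c)
c_max = max(C.values())

# TA: mean turnaround time (release dates are zero, so c_j is the turnaround)
ta = sum(c for _, c in completion.values()) / n

# linear aggregation w1*C_max + w2*TA with assumed weights
w1, w2 = 0.5, 0.5
objective = w1 * c_max + w2 * ta
```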
3.2 Multicriteria scheduling

Several studies deal with scheduling in Grids considering only one criterion (e.g., the EGEE Workload Management System [1], the NorduGrid broker [4], eNANOS [16], GridWay [10]). Various Grid resource management systems involve multiple objectives and may use multicriteria decision support. Dutot et al. [3] considered task scheduling on resources taking into account the maximum task completion time and the average completion time. Siddiqui et al. [18] presented task scheduling based on resource negotiation, advance reservation, and user preferences. The user preferences are modeled by utility functions, in which users must enter the values and levels of negotiation. A general multicriteria decision methodology based on Pareto optimality can be applied. However, it is very difficult to achieve the fast solutions needed for Grid resource management by using Pareto dominance. The problem is very often simplified to a single-objective one, or the objectives are combined. There are various ways to model preferences; for instance, they can be given explicitly by stakeholders to specify the importance of every criterion or the relative importance between criteria. This can be done by defining criteria weights or by ranking the criteria by importance. In order to provide effective guidance in choosing the best strategy, Ramírez et al. [15] performed a joint analysis of several metrics according to the methodology proposed in [19]. They introduce an approach to multicriteria analysis assuming equal importance of each metric. The goal is to find a robust and well-performing strategy under all test cases, with the expectation that it will also perform well under other conditions, e.g., with different Grid configurations and workloads. Kurowski et al. [11] used an aggregation criteria method for modeling the preferences of participants (owners, system managers and users).
The authors considered two-stage hierarchical Grids, taking into account the stakeholders' preferences, assumed unknown processing times of the tasks, and studied the impact of the size of the batch of tasks on the efficiency of schedules. Yaurima-Basaldúa et al. [23] apply genetic algorithms with two objectives using the aggregation method and compare five crossover operators. Lorpunmanee et al. [12] presented task allocation strategies for the different sites of a Grid and proposed a model for task scheduling considering multiple criteria. They concluded that such scheduling can be performed efficiently using GAs. In this paper, we consider a two-criteria scheduling problem. We propose a genetic algorithm as a strategy for allocating jobs to resources. It uses an aggregation criteria method and a weight-generating function representing the relative importance of each criterion. We present an experimental analysis of this problem and compare the obtained results with strategies aimed at optimizing a single criterion. In this paper, the Ordered Weighted Averaging (OWA) operator [11] is applied: OWA(x_1, x_2, …, x_k) = Σ_{c=1..k} w_c · s(x)_σ(c), where w_c is the weight, c = 1, …, k, and x_c is a value associated with the satisfaction of the c-th criterion. The values are ordered by a permutation σ such that s(x)_σ(1) ≥ s(x)_σ(2) ≥ … ≥ s(x)_σ(k). The weights (w_c), c = 1, …, k, are non-negative and Σ_{c=1..k} w_c = 1.
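The OWA aggregation can be sketched as follows (the function name and the sample values are ours, not the paper's):

```python
# Sketch of the OWA aggregation: criterion satisfaction values are sorted
# in non-increasing order and combined with non-negative weights that sum
# to one. Function name and sample values are invented for illustration.
def owa(satisfactions, weights):
    assert all(w >= 0.0 for w in weights)
    assert abs(sum(weights) - 1.0) < 1e-9
    ordered = sorted(satisfactions, reverse=True)  # s_sigma(1) >= ... >= s_sigma(k)
    return sum(w * s for w, s in zip(weights, ordered))

equal = owa([0.2, 0.8], [0.5, 0.5])        # equal weights -> arithmetic mean
decreasing = owa([0.2, 0.8], [0.7, 0.3])   # decreasing weights
```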
128 Where Supercomputing Science and Technologies Meet If all the weights are set to the same value, OWA behaves as the arithmetic mean. In this case, high values of some criterion compensate low values of the other ones. In OWA approach, a compromise solution is provided. The highest weight is and the subsequent ones are decreasing, but never reaching 0. That means that both the worst criterion and the mean of criteria are taken into account. In this way stakeholders are able to evaluate schedules using multiple criteria. The goal is to find a weighting scheme that provides the best possible mean value for all stakeholders evaluations and the highest possible value for the worst case evaluation. To achieve this, weight w 1 must be relatively large, while weight w k should be small, where k denotes the number of criteria. The remaining weights decrease in value from w 1 to w k according to: 4. Genetic Algorithm Scheduling algorithms for two layer Grid models can be split into a global allocation part and a local scheduling part. Hence, we regard MPS as a two stage scheduling strategy: MPS=MPS_Alloc+PS. At the first stage, we use a genetic algorithm (GA) to allocate a suitable machine for each job (MPS_ Alloc=GA). At the second stage, the parallel job scheduling algorithm PS is applied to each machine independently for jobs allocated during the previous stage. As PS we use the wellknown strategy BackfillingEASY [21, 22]. GA is a wellknown search technique used to find solutions to optimization problems [9]. Candidate solutions are encoded by chromosomes (also called genomes or individuals). The set of initial individuals forms the population. Fitness values are defined over the individuals and measures the quality of the represented solution. The genomes are evolved through the genetic operators generation by generation to find optimal or nearoptimal solutions. Three genetic operators are repeatedly applied: selection, crossover, and mutation. 
The selection picks chromosomes to mate and produce offspring. The crossover combines two selected chromosomes to generate individuals of the next generation. The mutation randomly reorganizes the structure of genes in a chromosome so that a new combination of genes may appear in the next generation. The individuals are evolved until some stopping criterion is met. The OWA operator is applied as the fitness function (Section 2). Each solution is encoded by a matrix with m rows and n columns, where m is the number of machines in the Grid and n is the number of jobs. Row i = 0, …, m−1 represents the local queue of machine N_i. Note that machines are arranged in non-descending order of their number of processors, m_0 ≤ m_1 ≤ … ≤ m_{m−1}. A job J_j can only run on machine N_i if size_j ≤ m_i holds. The available set of machines for a job J_j is defined to be the machines with indexes {f_j, …, m}, where f_j is the smallest index i such that m_i ≥ size_j. The selection picks chromosomes to produce offspring. Binary tournament selection, known as an effective variant of parent selection, is used. Two individuals are drawn randomly from the population, and the one with the highest fitness value wins the tournament. This process is repeated
twice in order to select two parents.

4.1 Crossover operators

The crossover operator produces new solutions by combining existing ones. It takes parts of the solution encodings of two existing solutions (parents) and combines them into a single solution (child). The crossover operator is applied with a certain probability (P_c). In this paper, six operators are considered.

One Segment Crossover for Matrix (OSXM). Based on the crossover operator OSX (One Segment Crossover) [7]. Two random points S1 and S2 from 0 to vmax (the maximum index used) are selected. The child inherits the columns from position 0 to S1 from parent 1. It also inherits the columns from S1 to S2 from parent 2, but only the elements that have not already been copied from parent 1. Finally, the child inherits the elements from parent 1 that have not yet been copied.

Two Point Crossover for Matrix (TPM). Based on the crossover operator Two Point Crossover [13]. Two random points S1 and S2 from 0 to vmax are selected. The columns from position 0 to S1 and from S2 to vmax are copied from parent 1. The remaining elements are copied from parent 2, but only if they have not already been copied.

Order Based Crossover for Matrix (OBXM). Based on the crossover operator OBX (Order Based Crossover) [5]. A binary mask is used. Mask values equal to one indicate that the corresponding columns are copied from parent 1 to the child. The rest of the elements are copied from parent 2, but only if they have not already been copied. The mask values are generated randomly and uniformly.

Precedence Preservative Crossover for Matrix (PPXM). Based on the crossover operator PPX (Precedence Preservative Crossover) [2].
Random binary mask values equal to one indicate that the corresponding columns are copied from parent 1 to the child, and values equal to zero indicate that columns are copied from parent 2; this is done by iterating over the parents' columns that have not yet been copied, in order from left to right.

Order Segment Crossover for Matrix with Setup (OSXMS). Based on the crossover operator OSX (Order Segment Crossover) [7]. Two points are chosen randomly. The columns from position 1 to the first point are copied from parent 1. The elements are ordered by the number of processors required, for subsequent insertion into the child at the position according to the number of required processors. The columns from the first point to the second point are copied from parent 2, but only if the elements have not already been copied. Finally, the remaining elements are copied from parent 1, considering the elements not yet copied.

Order Based Crossover for Matrix with Setup (OBXMS). Based on the crossover operator OBX (Order Based Crossover) [5]. A binary mask is used. Mask values equal to one indicate that the corresponding columns are copied from parent 1 to the child. The rest of the elements are copied from parent 2, but only if they have not already been copied. The mask values are generated randomly and uniformly.

4.2 Mutation

The mutation operator produces small changes in an offspring with probability P_m. It prevents all solutions from falling into a local optimum and extends the search space of the algorithm. Three operators, Insert, Swap and
Switch, adapted for the two-dimensional encoding, are considered:

1) Queue_Insert. Two points are randomly selected. The element at the second point is inserted at the first point, shifting the rest. Note that this mutation preserves most of the order and adjacency information.
2) Queue_Swap. This mutation randomly selects two points and swaps their elements.
3) Queue_Switch. This mutation selects a random column and swaps its elements with the next column.

5. GA calibration

5.1 Workload

The accuracy of the evaluation relies heavily upon the workloads applied. For testing job execution performance in a dedicated Grid environment, we use a Grid workload based on real production traces. Carefully reconstructed traces from real supercomputers provide a very realistic job stream for simulation-based performance evaluation of Grid job scheduling algorithms. Background workload (locally generated jobs), which is an important issue in non-dedicated Grid environments, is not addressed. Four logs from the PWA (Parallel Workloads Archive) [14] (Cornell Theory Center, High Performance Computing Center North, Swedish Royal Institute of Technology and Los Alamos National Lab) and one from the GWA (Grid Workloads Archive) [8] (Advanced School for Computing and Imaging) have been used. The archives contain real workload traces from different machine installations. They provide information on individual jobs, including submission time and resource requirements. For creating suitable Grid scenarios, we integrate several logs by merging users and their jobs. The premise for integrating several logs of machines in production use into a Grid log is the following: Grid logs contain jobs submitted by users of different sites, and a Grid execution context could be composed of these sites. Unifying these sites into a Grid entails merging their users and jobs.
It should be mentioned that merging several independent logs to simulate a computational Grid workload does not guarantee a representation of a real Grid with the same machines and users. Nevertheless, it is a good starting point for evaluating Grid scheduling strategies based on real logs, given the lack of publicly available Grid workloads. Time zone normalization, profiled time interval normalization, and invalid job filtering are applied.

5.2 Calibration parameters

A method of experimental design is adapted from Ruiz and Maroto [17], where the following steps are defined: (a) test all instances produced with the possible combinations of parameters; (b) obtain the best solution for each instance; (c) apply the Multifactor Analysis of Variance (ANOVA) with a 95% confidence level to find the most influential parameters; (d) set the algorithm parameters based on the selected parameter values; (e) calculate the relative difference of the calibrated algorithm and other adapted algorithms over the best solutions. Table 1 shows the parameters that were set for the calibration. Hence, 6 × 3 × 5 × 5 = 450 different algorithm alternatives were considered. 30 executions of the workload were realized, in total 450 × 30 = 13,500 experiments.
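The full-factorial design can be checked with a short sketch; note that the probability levels partly lost from Table 1 are replaced here by assumed placeholder values:

```python
from itertools import product

# Sketch of the full-factorial calibration design. Operator names come
# from the text; some probability levels were lost in extraction, so the
# placeholder values marked below are assumptions.
crossovers = ["OSXM", "TPM", "OBXM", "PPXM", "OSXMS", "OBXMS"]
mutations = ["Queue_Insert", "Queue_Swap", "Queue_Switch"]
pc_levels = [0.5, 0.6, 0.7, 0.8, 0.9]        # fifth level assumed
pm_levels = [0.005, 0.01, 0.05, 0.1, 0.2]    # first and last levels assumed

variants = list(product(crossovers, mutations, pc_levels, pm_levels))
experiments = len(variants) * 30             # 30 workload executions each
```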
The performance of the proposed algorithm is measured as the percentage of the relative distance of the obtained solution from the best one (Relative Increment over the Best Solution, RIBS). The RIBS is calculated as RIBS = 100 × (Heu_sol − Best_sol) / Best_sol, where Heu_sol is the value of the objective function obtained by the considered algorithm, and Best_sol is the best value obtained during the testing of all possible parameter combinations.

Table 1. Calibration parameters

Parameters                     Levels
Crossover operators            OSXM, TPM, OBXM, PPXM, OSXMS, OBXMS
Mutation operators             Queue_Insert, Queue_Swap, Queue_Switch
Crossover probability          0.5, 0.6, 0.7, 0.8,
Mutation probability           , 0.01, 0.05, 0.1,
Population                     individuals
Number of jobs in individual   200
Selection                      Binary tournament
Stop criterion                 The algorithm is stopped if the fitness value of the best chromosome is not improved 4 times

5.3 Analysis of variance

To assess the statistical difference among the experimental results and to observe the effect of the different parameters on the result quality, ANOVA is applied. The analysis of variance is used to determine which factors have a significant effect and which are the most important ones. The parameters of the Grid scheduling problem are considered as factors, and their values as levels. We assume that there is no interaction between the factors.

Table 2. Analysis of variance for RIBS (Type III Sums of Squares)

Source                     Mean Square   F-Ratio   P-Value
MAIN EFFECTS
A: crossover
B: mutation
C: crossover probability
D: mutation probability
RESIDUAL
TOTAL

The F-Ratio is the ratio between the mean square of a factor and the mean square of the residuals. A high F-Ratio means that the factor affects the response variable (see Table 2). The P-Value shows the statistical significance of the factors. Factors whose P-Value is less than 0.05 have a statistically significant effect on the response variable (RIBS) with a 95% level of confidence. According to the F-Ratio, the most important factor is the crossover operator.
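A toy one-factor example of the F-Ratio computation used in this analysis (the group data are invented):

```python
# Toy one-factor ANOVA mirroring Table 2: the F-ratio is the mean square
# of the factor over the mean square of the residuals. The per-operator
# RIBS samples below are invented for illustration.
groups = {
    "OSXM": [1.0, 1.2, 0.9],
    "TPM":  [2.0, 2.1, 1.9],
    "OBXM": [1.1, 1.0, 1.2],
}
values = [v for g in groups.values() for v in g]
grand_mean = sum(values) / len(values)

# between-group (factor) and within-group (residual) sums of squares
ss_factor = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                for g in groups.values())
ss_residual = sum((v - sum(g) / len(g)) ** 2
                  for g in groups.values() for v in g)

df_factor = len(groups) - 1
df_residual = len(values) - len(groups)
f_ratio = (ss_factor / df_factor) / (ss_residual / df_residual)
```

A large F-ratio here signals that the factor (the operator choice) explains far more variance than the residual noise.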
Figure 1. Crossover operators

Figure 1 shows the means and 95% LSD intervals of the crossover operator factor. The six operators are presented in the following order: 1 OSXM, 2 TPM, 3 OBXM, 4 PPXM, 5 OSXMS and 6 OBXMS. The vertical axis shows the values of RIBS. We can see that OSXMS is the best crossover among the six tested, followed by OBXM.

Figure 2. Execution time of crossover operators

Figure 2 shows that OBXMS has the best time, followed by OSXMS. The vertical axis represents the time in seconds and the horizontal axis represents the execution number.

6. Conclusions

We addressed a two-objective genetic algorithm calibration for scheduling jobs in a two-stage computational Grid. We conducted a comprehensive comparison of six known crossover operators, three mutation operators adapted for the two-dimensional encoding, five values of the crossover probability, and five values of the mutation probability. 450 different genetic algorithms were analysed, with 30 instances each; 13,500 experiments were evaluated. We used the aggregation criteria method for modeling the preferences of two objectives: C_max and TA. To assess the statistical difference among the experimental results and observe the impact of the different parameters on the result quality, the ANOVA technique was applied. The crossover operator is shown to have a statistically significant effect. It plays the most important role in genetic algorithms applied to scheduling in computational Grids. Of the six compared operators, OSXMS obtained the best result, followed by OBXM. The obtained results may serve as a starting point for future heuristic Grid scheduling algorithms that can be implemented in real computational Grids.
7. References

[1] Avellino, G., Beco, S., Cantalupo, B., Maraschini, A., Pacini, F., Terracina, A., Barale, S., Guarise, A., Werbrouck, A., Sezione di Torino, Colling, D., Giacomini, F., Ronchieri, E., Gianelle, A., Peluso, R., Sgaravatto, M., Mezzadri, M., Prelz, F., Salconi, L.: The EU DataGrid workload management system: towards the second major release. In: 2003 Conference for Computing in High Energy and Nuclear Physics. University of
California, La Jolla, California, USA (2003).
[2] Bierwirth, C., Mattfeld, D., Kopfer, H.: On permutation representations for scheduling problems. In: Voigt, H.-M., Ebeling, W., Rechenberg, I., Schwefel, H.-P. (eds.) Parallel Problem Solving from Nature PPSN IV, Berlin, Germany, LNCS, vol. 1141. Springer (1996).
[3] Dutot, P., Eyraud, L., Mounie, G., Trystram, D.: Models for scheduling on large scale platforms: which policy for which application? In: Parallel and Distributed Processing Symposium, Proceedings, 18th International.
[4] Elmroth, E., Tordsson, J.: An interoperable, standards-based Grid resource broker and job submission service. In: First International Conference on e-Science and Grid Computing, 2005. IEEE Computer Society, Melbourne, Vic. (2005).
[5] Gen, M., Cheng, R.: Genetic Algorithms & Engineering Optimization. John Wiley & Sons, New York, 1997, 512 pp.
[6] Graham, R.L., Lawler, E.L., Lenstra, J.K., Rinnooy Kan, A.H.G.: Optimization and approximation in deterministic sequencing and scheduling: a survey. In: Hammer, P.L., Johnson, E.L., Korte, B.H. (eds.) Annals of Discrete Mathematics 5. Discrete Optimization II. North-Holland, Amsterdam (1979).
[7] Guinet, A., Solomon, M.: Scheduling hybrid flowshops to minimize maximum tardiness or maximum completion time. Int. J. Production Research, Vol. 34, No. 6 (1996).
[8] GWA Grid Workloads Archive, tudelft.nl
[9] Holland, J.: Adaptation in Natural and Artificial Systems. University of Michigan Press.
[10] Huedo, E., Montero, R.S., Llorente, I.M.: Evaluating the reliability of computational grids from the end user's point of view. Journal of Systems Architecture, 52(12).
[11] Kurowski, K., Nabrzyski, J., Oleksiak, A., Węglarz, J.: A multicriteria approach to two-level hierarchy scheduling in Grids. Journal of Scheduling, 11(5).
[12] Lorpunmanee, S., M. Noor, A. Hanan and S.
Srinoy: Genetic algorithm in Grid scheduling with multiple objectives. In: Proceedings of the 5th WSEAS Int. Conf. on Artificial Intelligence, Knowledge Engineering and Data Bases, Madrid, Spain.
[13] Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. 3rd edn. Springer-Verlag, Berlin Heidelberg New York (1996).
[14] PWA Parallel Workloads Archive, cs.huji.ac.il/labs/parallel/workload/
[15] Ramírez-Alcaraz, J.M., Tchernykh, A., Yahyapour, R., Schwiegelshohn, U., Quezada-Pina, A., González-García, J.L., Hirales-Carbajal, A.: Job allocation strategies with user run time estimates for online scheduling in hierarchical Grids. J. Grid Computing (2011) 9:95-116, Springer-Verlag, Netherlands.
[16] Rodero, I., Corbalan, J., Badía, R.M., Labarta, J.: eNANOS Grid resource broker. In: Advances in Grid Computing. European Grid Conference (EGC 2005). Springer, Amsterdam (2005).
[17] Ruiz, R., Maroto, C. (2006): A genetic algorithm for hybrid flowshops with sequence dependent setup times and machine eligibility. European Journal of Operational Research, 169.
134 Where Supercomputing Science and Technologies Meet [18] Siddiqui, M., Villazon, A., Fahringer, T.: Grid Capacity Planning with Negotiationbased Advance Reservation for Optimized QoS. In: Proceedings of the 2006 ACM/IEEE conference on Supercomputing (SC 06). ACM, New York, NY, USA, Article org/ / (2006). [19] Tsafrir, D., Etsion, Y., Feitelson, D.G.: Backfilling using systemgenerated predictions rather than user runtime estimates. IEEE Trans. Parallel Distrib. Syst. 18, (2007). [20] Tchernykh A., Schwiegelsohn U., Yahyapour R., Kuzjurin N. Online Hierarchical Job Scheduling on Grids with Admissible Allocation. Journal of Scheduling, Volume 13, Issue 5, pp SpringerVerlag, Netherlands, [21] Lifka, D. A The ANL/IBM SP Scheduling System. In: IPPS 95: Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing, London, UK: SpringerVerlag. [22] Skovira, J., Chan, W., Zhou, H., Lifka, D.: The EASY  LoadLeveler API Project. In: Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing(IPPS 96), Feitelson, D.G., Rudolph, L. (eds.). SpringerVerlag, London, UK, (1996). [23] YaurimaBasaldúa, V., Tchernykh, A., Castro García, Y., VillagómezRamos, V., Burtseva, L. Genetic algorithm calibration for two objective scheduling parallel jobs on hierarchical Grids. In: Parallel Processing and Applied Mathematics, Wyrzykowski et al. (Eds.). 9TH International Conference on Parallel Processing and Applied Mathematics  PPAM September 1114, 2011, Torun, Poland, LNCS, SpringerVerlag,
Infrastructure
Performance Evaluation of Infrastructure as a Service Clouds with SLA Constraints

Anuar Lezama Barquet 1, Andrei Tchernykh 1, Uwe Schwiegelshohn 2, Ramin Yahyapour 3
1 Computer Science Department, CICESE Research Center, Ensenada, BC, México; 2 Robotics Research Institute, Technische Universität Dortmund, Dortmund, Germany; 3 GWDG, University of Göttingen, Göttingen, Germany

Abstract

In this paper, we present an experimental study of job scheduling algorithms in Infrastructure as a Service (IaaS) clouds. We analyze different system service levels, which are distinguished by the amount of computing power a customer is guaranteed to receive within a time frame and by the price per processing time unit. We analyze different scenarios for this model; these scenarios combine a single service level with single and parallel machines. We apply our algorithms to real workload traces available to the HPC community. To provide a performance comparison, we perform a joint analysis of several metrics. A case study is given.

Keywords: Cloud Computing, Infrastructure as a Service, Quality of Service, Scheduling.

1. Introduction

Infrastructure as a Service clouds allow users to consume computational power on an on-demand basis. The focus of this kind of cloud is managing the virtual machines (VMs) created by users to execute their jobs on the cloud resources. However, issues remain that prevent widespread adoption of this new paradigm. The main concern is that it has to provide Quality of Service (QoS) guarantees [1]. The use of Service Level Agreements (SLAs) is a fundamentally new approach to job scheduling. With this approach, schedulers are based on satisfying QoS constraints.
The main idea is to provide different levels of service, each addressing a different set of customers for the same services in the same SLA, and to establish bilateral agreements between a service provider and a service consumer that guarantee job delivery time depending on the level of service selected. Basically, SLAs contain information such as the latest job finish time, the time reserved for job execution, the number of CPUs required, and the price per time unit. The shifting emphasis of Grids and Clouds towards a service-oriented paradigm led to the adoption of SLAs as a very important concept, but at the same time led to the problem of satisfying stringent SLAs. There has been a significant amount of research on various topics related to SLAs: admission control techniques [2]; incorporation of SLAs into the Grid/Cloud architecture [3]; specifications of SLAs [4]
[5]; usage of SLAs for resource management; SLA-based scheduling [6]; SLA profits [7]; automatic negotiation protocols [8]; economic aspects associated with the usage of SLAs for service provision [9]; etc. Little is known about the worst-case efficiency of SLA scheduling solutions. There are only very few theoretical results on SLA scheduling, and most of them address real-time scheduling with given deadlines. Baruah and Haritsa [10] discuss the online scheduling of sequential independent jobs on real-time systems. They presented the algorithm ROBUST (Resistance to Overload By Using Slack Time), which guarantees a minimum slack factor for every task. The slack factor f of a task is defined as the ratio of its relative deadline over its execution time requirement; it is a quantitative indicator of the tightness of the task deadline. The algorithm provides an effective processor utilization (EPU) of (f-1)/f during the overload interval. They show that, given enough processors, online scheduling algorithms can be designed with performance guarantees arbitrarily close to that of optimal uniprocessor scheduling algorithms. A more complete study is presented in [11] by Schwiegelshohn et al. The authors theoretically analyze the single machine (SM) and parallel machine (PM) models subject to jobs with single (SSL) and multiple service levels (MSL). Their analysis is based on the competitive factor, which is measured as the ratio between the income of the infrastructure provider obtained by the scheduling algorithm and the optimal income. They provide worst-case performance bounds for four greedy acceptance algorithms, named SSL-SM, SSL-PM, MSL-SM, and MSL-PM, and two restricted acceptance algorithms, MSL-SM-R and MSL-PM-R. All of them are based on the adaptation of the preemptive EDD (Earliest Due Date) algorithm for scheduling jobs with deadlines. In this paper, we adopt the IaaS cloud model proposed in [11].
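For intuition on the slack factor used throughout this paper, the following tiny illustration (our own, not taken from [10]) shows how ROBUST's guaranteed effective processor utilization (f-1)/f approaches 1 as deadlines loosen:

```python
def epu(f):
    """Effective processor utilization guaranteed during overload for tasks
    with slack factor f = relative deadline / execution time (f > 1)."""
    return (f - 1) / f

for f in (2, 5, 10, 100):
    print(f, epu(f))  # 0.5, 0.8, 0.9, 0.99: looser deadlines raise the bound
```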
To show the practicability and competitiveness of the algorithms, we conduct a comprehensive simulation study of their performance and derivatives. We take into account an issue that is critical for the practical adoption of scheduling algorithms: we use workloads based on real production traces of heterogeneous HPC systems. We study two greedy algorithms, SSL-SM and SSL-PM. SSL-SM accepts every new job for a single machine if this job and all previously accepted jobs can be completed in time. SSL-PM accepts jobs considering all available processors of parallel machines. Key properties of SLAs should be observed to provide benefits for real installations. Since SLAs are often considered successors of the service-oriented real-time paradigm with deadlines, we start with a simple model with a single service level on a single computer, and extend it to a single SLA service level on multiple computers. One of the most basic SLA models provides a relative deadline as a function of the job execution time, with a constant usage parameter per service level. This model does not match every real SLA, but the assumptions are nonetheless reasonable: it is still a valid basic abstraction of SLAs that can be formalized and treated automatically. We address an online scheduling problem. The jobs arrive one by one, and after the arrival of a new job the decision maker must decide whether to reject the job or to schedule it on
one of the machines. The problem is online because the decision maker has to decide without any information about the following jobs. We measure the performance of the algorithms by a set of metrics including the competitive factor and the number of accepted jobs.

2. Scheduling model

2.1 Formal definition

In this work, we consider the following model. A user submits jobs to the service provider, which has to guarantee some service level (SL). Let S = {S_1, S_2, ..., S_i, ..., S_k} be the set of service levels offered by the provider. For a given service level S_i, the user is charged a cost u_i per execution time unit, depending on the urgency of the submitted job; u_max = max_i {u_i} denotes the maximum cost. This urgency is expressed by the slack factor f_i >= 1. The total number of jobs submitted to the system is n_r. Each job J_j from the released job set J^r = {J_1, J_2, ..., J_{n_r}} is described by a tuple (r_j, p_j, S_i, d_j): its release date r_j >= 0, its execution time p_j, and its SL S_i. The deadline d_j of each job is calculated at its release as d_j = r_j + f_i * p_j. The maximum deadline is denoted by d_max = max_j {d_j}. The execution time p_j becomes known at time r_j. Once the job is released, the provider has to decide, before any other job arrives, whether the job is accepted or not. In order to accept job J_j, the provider must ensure that some machine in the system is capable of completing it before its deadline. In the case of acceptance, the acceptance of further jobs must not cause job J_j to miss its deadline. Once a job is accepted, the scheduler uses some heuristic to schedule it. Finally, the set of accepted jobs J = {J_1, J_2, ..., J_n} is a subset of J^r, and n is the number of jobs successfully accepted and executed.

2.2 Metrics

We used several metrics to evaluate the performance of our scheduling algorithms and SLAs.
In contrast to traditional scheduling problems, classic scheduling metrics such as C_max become irrelevant for evaluating systems scheduled through SLAs. One objective function represents the goal of the infrastructure provider, who wants to maximize his total income. A job J_j with service level S_i generates income in the case of acceptance and zero otherwise. The competitive factor

c_v = ( sum_{j=1}^{n} u_i * p_j ) / V*(A)

is defined as the ratio of the total income generated by an algorithm to the optimal income V*(A). Due to the maximization of the income, a larger competitive factor is better than a smaller one. Note that in the evaluation of our experiments we use an upper bound V^(A) of the optimal income, as we are, in general, not able to determine the optimal income:

V^(A) = min( u_max * sum_{j=1}^{n_r} p_j , u_max * d_max * m ).

The first bound is the sum of the processing times of all released jobs multiplied by the maximum price per unit execution of all available SLAs. The second bound is the maximum deadline of all released jobs multiplied by the maximum price per unit
execution value and the number of machines in the system. Due to our admission control policy, the system does not execute jobs whose deadlines cannot be met; therefore, this is also an upper bound on the total processing time of work the system can execute. In our experiments we analyze the SSL-SM and SSL-PM algorithms; hence only one SL is used, and we do not need to take u_max into account when calculating the competitive factor. We also calculate the number of rejected jobs and use it as a measure of the capacity of the system to respond to the incoming flow of jobs. Finally, we calculate the mean waiting time of the jobs as

MWT = (1/n) * sum_{j=1}^{n} (c_j - p_j),

where c_j is the completion time of job J_j.

3. Experimental Setup

3.1 Algorithms

In our experiments, we use the algorithms SSL-SM and SSL-PM based on the EDD (Earliest Due Date) algorithm, which gives priority to jobs according to their deadlines. Jobs that have been admitted but not yet completed are kept in a queue, ordered by nondecreasing deadlines; jobs are taken from the head of the queue for execution. When a new job is released, it is placed in the queue according to its deadline. EDD is an optimal algorithm for minimizing the lateness on a single machine; in our case, this corresponds to minimizing the number of rejected jobs. Gupta and Palis [12] showed that for the hard real-time scheduling model, in which a job must be completed if it was admitted for execution, there cannot exist an algorithm with competitive ratio greater than 1 - (1/f_i) + e for m >= 1 machines, where e > 0 is arbitrarily small. They proposed an algorithm that achieves a competitive ratio of at least 1 - (1/f_i), and demonstrated that it is an optimal scheduler for hard real-time scheduling with m machines.
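The EDD queue and greedy acceptance described above can be sketched as follows. This is our own minimal single-machine illustration (in the spirit of SSL-SM), representing each job by its remaining work and its deadline d_j = r_j + f_i * p_j; it is not the authors' implementation:

```python
def edd_feasible(accepted, candidate, now):
    """Single-machine feasibility check: can all accepted jobs plus the
    candidate still meet their deadlines when run in EDD order from `now`?
    Jobs are (remaining_work, deadline) pairs."""
    t = now
    for remaining, deadline in sorted(accepted + [candidate], key=lambda j: j[1]):
        t += remaining        # job completes after all earlier-deadline work
        if t > deadline:      # this job would miss its deadline: infeasible
            return False
    return True

def admit(accepted, candidate, now):
    """Greedy acceptance: accept the released job iff feasibility is preserved."""
    if edd_feasible(accepted, candidate, now):
        accepted.append(candidate)
        return True
    return False

# One accepted job with 5 units of work left and deadline 20; at t=10 a job
# with p=4 and deadline 12 cannot fit, but one with p=2 and deadline 12 can.
queue = [(5.0, 20.0)]
print(admit(queue, (4.0, 12.0), 10.0))  # False
print(admit(queue, (2.0, 12.0), 10.0))  # True
```

With a single service level, the slack factor enters only through the deadlines, which is why larger values of f let this test accept more work.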
The admission test, also proposed by Gupta and Palis, consists in verifying that all already accepted jobs whose deadlines are greater than that of the incoming job will still be completed before their deadlines.

3.2 Workload

In order to evaluate the performance of SLA scheduling, we perform a series of experiments using traces of HPC jobs obtained from the Parallel Workloads Archive (PWA) [13] and the Grid Workloads Archive (GWA) [14]. These traces are logs from real parallel computer systems and give us good insight into how our proposed schemes will perform with real users. The predominance of jobs with low parallelism in real logs is well known. However, some jobs require multiple processors; in that case, we consider that the machines in our model have enough capacity to process them, so we can abstract away their parallelism. We assume that IaaS clouds are a promising alternative to computational centers; therefore, we can expect the workload submitted to a cloud to have characteristics similar to workloads submitted to actual parallel and grid systems. In our log, we considered nine traces from: DAS2 (University of Amsterdam),
DAS2 (Delft University), DAS2 (Utrecht University), DAS2 (Leiden University), KTH, DAS2 (Vrije University), HPC2N, CTC, and LANL. Details of the log characteristics can be found in the PWA [13] and GWA [14]. To obtain valid statistical values, 30 experiments over one-week intervals are simulated for each SLA. We calculate job deadlines based on the real processing times of the jobs.

4. Experimental Results

4.1 Single machine model

For the first set of experiments, with a single-machine system, we performed runs for 12 different values of the slack factor: 1, 2, 5, 10, 15, 20, 25, 50, 100, 200, 500, and 1000. We do not expect a real SLA to provide slack factors of more than 50, but large values are important in order to study the expected system performance as the slack factor tends to infinity. Figures 1-5 show simulation results of the SSL-SM algorithm. They present the percentage of rejected jobs, the total processing time of accepted jobs, the mean waiting time, the mean number of interruptions per job, and the mean competitive factor. Figure 1 shows the percentage of rejected jobs for the SSL-SM algorithm. We see that the number of rejected jobs decreases as the slack factor increases.

Figure 1. Percentage of rejected jobs for the SSL-SM algorithm.

Figure 2. Total processing time for the SSL-SM algorithm.
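The metrics of Section 2.2 can be computed from the outcome of such a simulation roughly as follows (our sketch, not the authors' simulator; with a single service level, u_max cancels out of the competitive factor, as noted above):

```python
def competitive_factor_bound(accepted_p, released_p, u_max, d_max, m):
    """Ratio of the algorithm's income to the upper bound on the optimal
    income, min(u_max * sum of released p_j, u_max * d_max * m).
    accepted_p/released_p are lists of processing times p_j."""
    income = u_max * sum(accepted_p)
    upper = min(u_max * sum(released_p), u_max * d_max * m)
    return income / upper

def mean_waiting_time(completions, proc_times):
    """MWT = (1/n) * sum(c_j - p_j), with c_j the completion time of job j."""
    return sum(c - p for c, p in zip(completions, proc_times)) / len(completions)

# Two of three released jobs accepted on one machine, all deadlines <= 8:
print(competitive_factor_bound([3, 4], [3, 4, 10], u_max=1.0, d_max=8, m=1))  # 0.875
```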
Large slack values increase the flexibility to accept new jobs by delaying the execution of already accepted ones. When the slack factor equals one, the system cannot accept new jobs until the completion of the already accepted job. We observe that the percentage of rejected jobs with a slack factor of one is slightly lower than with slack factors from 2 to 25. However, this does not mean that this slack factor allows the system to execute more computational work, as we can see in Figure 2. Figure 2 shows the total processing time of accepted jobs for the given slack factors. It increases as the slack factor increases, meaning that the scheduler is able to exploit the increased flexibility of the jobs. Figure 3 shows the mean waiting time versus the slack factor. It demonstrates that increasing total processing time causes increasing waiting times. We also evaluate the number of interruptions per job. Figure 4 shows the mean number of interruptions per job. We see that for small slack factors the number of interruptions is greater than for larger slack factors. Mean values are below one interruption per job. Moreover, if the slack factor is greater than ten, the number of interruptions per job is stable between 0.2 and 0.3. This fact is important: keeping the number of interruptions low limits the overhead of the system. Figure 5 shows the mean competitive factors, representing the infrastructure provider's objective of maximizing his total income. Note that a larger competitive factor is better than a smaller one. The competitive factor is lowest when the slack factor is one and improves as the slack factor is increased up to 5, where the mean competitive factor reaches its maximum. Past this point, the competitive factor decreases until the slack factor equals 200.
We consider that at this point the deadlines of the jobs are much larger than their processing times. If the slack factor is between 200 and 500, the competitive factor increases again because the maximum deadline gets close to the sum of the processing times.

Figure 3. Mean waiting time of the jobs for the SSL-SM algorithm.
Figure 4. Mean number of interruptions per job for the SSL-SM algorithm.

Figure 5. Mean competitive factor of the SSL-SM algorithm.

When the deadlines of all jobs tend to infinity, the competitive factor is optimal, as expected. In a real Cloud scenario, the slack factor can be dynamically adjusted in response to changes in the configuration and/or the workload. To this end, the past workload within a given time interval can be analyzed to determine an appropriate slack factor. The time interval for this adaptation should be set according to the dynamics of the workload characteristics and of the IaaS configuration.

4.2 Multiple machines model

In this section, we present the results of simulations of the SSL-PM algorithm with two and three machines. We plot them together with the SSL-SM results to analyze how the system performance changes with the number of machines.

Figure 6. Percentage of rejected jobs for the SSL-PM algorithm.

Figures 6-11 show the percentage of rejected jobs, the total processing time of accepted
jobs, the mean waiting time, the mean number of interruptions per job, the efficiency, and the mean competitive factor. Figure 6 presents the percentage of rejected jobs. We see that increasing the number of machines has a limited effect on the acceptance of jobs when the slack factor is small; larger values of the slack factor have more impact on the number of accepted jobs. Figure 7 shows the total processing time of accepted jobs. The processing time increases as more machines are added to the system. However, doubling and tripling the processing capacity does not cause the same increase in the processing time. This effect can be seen clearly when the slack factor is large. We can conclude that an increase in the processing capacity is more effective with smaller slack factors.

Figure 7. Total processing time for the SSL-PM algorithm.

Figure 8 shows the mean waiting time for varying slack factors. We see that an increase of the total processing time, as a result of larger slack factors, also causes an increase of the waiting time. We can also see that adding more machines to the system makes the increase of the mean waiting time less significant.

Figure 8. Mean waiting time for the SSL-PM algorithm.

Figure 9 shows the mean number of interruptions per job. We can see that increasing the number of machines increases the number of interruptions. The increase is not considerable, and it stabilizes as the slack factor is increased. The number of interruptions is maximal at a slack factor of two for all three models. Figure 10 shows the execution efficiency. This metric indicates the relative amount of useful work the system executes during the interval between the release time of the first job and the completion of the last one.
Figure 9. Mean number of interruptions per job for the SSL-PM algorithm.

We can see that the decrease of the efficiency, at least for moderate slack factors, mainly depends on the number of machines. Figure 11 presents the competitive factor for varying slack factors. We can see that for the two- and three-machine configurations the maximum competitive factor is obtained with a slack factor of two; as already noted, for the single-machine configuration the best competitive factors are obtained with slack factors of five and two. We can also observe that as the slack factor is increased, the competitive factor decreases. This happens until the slack factor becomes large enough to create a significant difference between job deadlines and their processing times, which can be seen clearly at a slack factor of 200 for the single-machine configuration and 100 for two and three machines. In the case of the two- and three-machine configurations, for slack factors above 500 the competitive factor almost reaches the optimum.

Figure 10. Execution efficiency for the SSL-PM algorithm.

Figure 11. Competitive factor for the SSL-PM algorithm.
4.3 Execution costs

In the IaaS scenario, cloud providers offer computing resources to customers on a pay-as-you-go basis. The price per time unit depends on the services selected by the customer. This charge depends not only on the cost the user is willing to invest, but also on the cost of infrastructure maintenance. In order to estimate a tariff function for users that depends on the slack factor, we first have to consider the total cost a provider has to recover from the execution of jobs. Let us assume that the provider pays a flat rate u_u for the use/maintenance of the resources. The total maintenance cost of job processing can be calculated as

co_t = u_u * sum_{j=1}^{n_r} p_j,

and the cost per time unit as

co_u = co_t / sum_{j=1}^{n} p_j,

where sum_{j=1}^{n_r} p_j is the sum of the processing times of all released jobs, u_u is the price per unit of maintenance, m is the number of machines, and sum_{j=1}^{n} p_j is the sum of the processing times achieved by the algorithm. We consider u_u equal to 8.5 cents per hour, which is the price that Amazon EC2 charges for a small processing unit [15].

Figure 12. Execution cost per hour.

Figure 12 shows the execution cost per hour for varying slack factors. As we can see, the cost to the system administration for processing jobs with a small slack factor is greater than for jobs with a looser slack factor. Moreover, the costs are greater if fewer machines are used. The reason is that a system with fewer machines and a small slack factor rejects most of the jobs in the interval, so the execution is costly, whereas configurations that execute more jobs have lower costs per execution time unit. A clear profit is generated if the price charged per time unit exceeds this cost.

5. Conclusions and future work

The use of Service Level Agreements (SLAs) is a fundamentally new approach to job scheduling.
With this approach, schedulers are based on satisfying QoS constraints. The main idea is to provide different levels of service, each addressing different set of 114
148 Where Supercomputing Science and Technologies Meet customers. While a large number of service levels leads to high flexibility for the customers, it also produces a significant management overhead. Hence, a suitable tradeoff must be found and adjusted dynamically, if necessary. While theoretical worst case IaaS scheduling models are beginning to emerge, fast statistical techniques applied to real data have empirically been shown to be effective. In this paper, we provide an experimental study of two greedy acceptance algorithms SSLSM and SSLPM with known worst case performance bounds. They are based on the adaptation of the preemptive EDD algorithm for job scheduling with different service levels, different number of machines. Our study results in several contributions: Firstly, we identify several service levels to make scheduling decisions with respect to job acceptance; We consider and analyze two test cases with single machine and parallel machine; We estimate a cost function for different service levels; We show that the slack factor can be dynamically adjusted in response to the changes in the configuration and/or the workload. To this end, the past workload within a given time interval can be analyzed to determine an appropriate slack factor. The time interval for this adaptation depends on the dynamics of the workload characteristics and IaaS configuration. Though our model of IaaS is simplified, it is still a valid basic abstraction of SLAs that can be formalized and treated automatically. In this paper, we explored only a few scenarios of using SLAs. The IaaS clouds are usually large scale and vary significantly. It is not possible to fulfill all QoS constraints from the service provider perspective, if a single service level is used. Hence, a balance between number of service levels, and number of resources needs to be found and adjusted dynamically. System can have several specific service levels (e.g. 
Bronze, Silver, Gold) and algorithms to keep the system at the QoS specified in the SLA. However, further study of algorithms for multiple service classes and of resource allocation algorithms is required to assess their actual efficiency and effectiveness. This will be the subject of future work for a better understanding of service levels in IaaS clouds. Moreover, other scenarios of the problem, with different types of SLAs and workloads combining jobs with and without SLAs, still need to be addressed. In future work, we will also address the elasticity of slack factors to increase profit while giving better QoS to the users.

6. References

[1] S. Garg and S. Gopalaiyengar, "SLA-based Resource Provisioning for Heterogeneous Workloads in a Virtualized Cloud Datacenter," Algorithms and Architectures for Parallel Processing, 2011.

[2] L. Wu, S. Kumar Garg, and R. Buyya, "SLA-based admission control for a Software-as-a-Service provider in Cloud computing environments," in Cluster, Cloud and Grid Computing (CCGrid), 11th IEEE/ACM International Symposium on, 2011.

[3] P. Patel, A. Ranabahu, and A. Sheth, "Service Level Agreement in Cloud Computing," in Cloud Workshops at OOPSLA.

[4] A. Andrieux, K. Czajkowski, A. Dan, and K. Keahey, "Web Services Agreement Specification (WS-Agreement)," Global Grid Forum.
[5] "Review and summary of cloud service level agreements," dw/cloud/library/clrev2slapdf.pdf, pp. 1-10.

[6] L. Wu, S. K. Garg, and R. Buyya, "SLA-Based Resource Allocation for Software as a Service Provider (SaaS) in Cloud Computing Environments," in 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, May 2011.

[7] A. L. Freitas, N. Parlavantzas, and J. Pazat, "Cost Reduction Through SLA-driven Self-Management," in Web Services (ECOWS), 2011 Ninth IEEE European Conference on, 2011.

[8] G. Cosmin Silaghi, L. Dan Şerban, and C. Marius Litan, "A Framework for Building Intelligent SLA Negotiation Strategies under Time Constraints," in Economics of Grids, Clouds, Systems, and Services, vol. 6296, J. Altmann and O. Rana, Eds. Springer Berlin / Heidelberg, 2010.

[9] M. Macías, G. Smith, O. Rana, J. Guitart, and J. Torres, "Enforcing Service Level Agreements Using an Economically Enhanced Resource Manager," in Economic Models and Algorithms for Distributed Systems, D. Neumann, M. Baker, J. Altmann, and O. Rana, Eds. Birkhäuser Basel, 2010.

[10] S. K. Baruah and J. R. Haritsa, "Scheduling for overload in real-time systems," IEEE Transactions on Computers, vol. 46, no. 9, 1997.

[11] U. Schwiegelshohn and A. Tchernykh, "Online Scheduling for Cloud Computing and Different Service Levels," in 26th IEEE International Parallel & Distributed Processing Symposium, 2012.

[12] B. D. Gupta and M. A. Palis, "Online real-time preemptive scheduling of jobs with deadlines on multiple machines," Journal of Scheduling, vol. 4, no. 6, Nov. 2001.

[13] D. Feitelson, Parallel Workloads Archive.

[14] A. Iosup et al., "The Grid Workloads Archive," Future Generation Computer Systems, vol. 24, no. 7, Jul. 2008.

[15] Amazon, Elastic Cloud, Pricing. [Online]. Available: pricing/.
Performance Comparison of Hadoop Running on FreeBSD on UFS versus Hadoop Running on FreeBSD on ZFS

Hugo González, Victor Fernández
Academia de Tecnologías de la Información y Telemática, Universidad Politécnica de San Luis Potosí, Urbano Villalón 500, La Ladrillera. web:

Abstract

The Hadoop software [1] is currently widely used to handle huge amounts of data through the MapReduce framework [2]. Yahoo! holds the records for sorting 1 terabyte and 1 petabyte of information [3] using large Hadoop clusters. One of the advantages of this system is that it can run on commodity clusters, from a few to thousands of networked computers. FreeBSD [4] is an open-source Unix-like operating system that is very robust and powerful and does not need as many resources as other Unix systems, such as Solaris. Like Solaris, FreeBSD has support for the Zettabyte File System (ZFS) [5], which, according to its developers at Sun/Oracle, is the most advanced file system in the world. Within the literature and the official tutorials, there are no reports of running Hadoop on FreeBSD or of its performance on ZFS. We think the combination of FreeBSD and ZFS under Hadoop is promising because of its achievable performance. In order to demonstrate this, we established a Hadoop cluster of 20 nodes connected in a Gigabit Ethernet network, in which we compare the performance of the UFS file system against ZFS through a series of standard tests based on TeraSort [6], where a large amount of information is generated and the time to sort it using MapReduce is measured. Another test finds the frequency of words in dozens of classical books. Hadoop and MapReduce can be used to process huge amounts of data using clusters or grids and to compare the efficiency of FreeBSD with ZFS. Based on the results, we can recommend (or not) the use of ZFS as the file system and of FreeBSD as an option for running Hadoop.

Keywords: Big Data, Hadoop, ZFS, FreeBSD

1.
Introduction

Big Data systems are designed for extremely large data sets; they provide high scalability and often handle terabytes or petabytes of data which, if managed by traditional systems, would be too expensive and difficult. Such systems consist of data sets that grow so large that they become awkward to work with using traditional database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualization [7]. FreeBSD was used to compare both file systems. FreeBSD is an advanced
operating system for x86-compatible, amd64-compatible, ARM, IA64, PowerPC, PC98, and UltraSPARC architectures. It is derived from BSD, the version of UNIX developed at the University of California, Berkeley, and is developed and maintained by a large team of individuals. Additional platforms are in various stages of development [4].

2. ZFS description

The ZFS file system is a revolutionary new file system that fundamentally changes the way file systems are administered, with features and benefits not found in any other file system available today. ZFS is robust, scalable, and easy to administer. ZFS setups scale from standalone servers up to larger hybrid server farms with several operating systems [8]. ZFS was first publicly introduced in the OpenSolaris operating system in late 2005, followed by a first public release in Solaris Express. The port to FreeBSD was written by Pawel Jakub Dawidek in April 2007. Since then ZFS has undergone many substantial improvements and enhancements, including a level 2 cache, deduplication, and encryption. In August 2010 the OpenSolaris project was discontinued and ZFS development continued in closed code. Therefore, the latest available public ZFS pool version is 28, without encryption; encryption was introduced in closed source and made available for public testing in Oracle Solaris 11 Express in late 2010. The basic ZFS features are pooled storage, transactional semantics, checksums and self-healing, scalability, instant snapshots and clones, dataset compression, and simplified delegable administration [9].

3. Hadoop

The Apache Hadoop project was developed as open-source software, a collection of related subprojects for scalable and distributed computing. The Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.
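As an illustration of this programming model, word counting (the kind of test run later on classical books) can be expressed as a mapper and a reducer. This Python sketch is ours, not part of the paper's setup; it mimics the map, shuffle/sort, and reduce phases locally rather than on a cluster:

```python
from itertools import groupby

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    # Reduce phase: sum all counts emitted for one word.
    return word, sum(counts)

def run_job(lines):
    # Shuffle/sort phase: group the map output by key, as the framework would,
    # then feed each group to the reducer.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(reducer(k, (c for _, c in group))
                for k, group in groupby(pairs, key=lambda kv: kv[0]))

print(run_job(["to be or not to be"]))
# {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

On a real cluster, functions with this shape would run as many parallel map and reduce tasks over HDFS blocks instead of an in-memory list.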
Hadoop is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Its libraries are designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers [1]. Hadoop implements the MapReduce computational paradigm, in which each application is divided into many small fragments of work; the fragments are executed or re-executed on any node in the cluster. In addition, Hadoop provides another important subproject, the Hadoop Distributed File System (HDFS), which stores data on the compute nodes. Both MapReduce and HDFS are designed so that node failures are automatically handled by the framework [7].

4. MapReduce
MapReduce is a software framework for easily writing applications that process large data sets in parallel on clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. A MapReduce job is a unit of work that splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner; the map outputs are sorted and
then input to the reduce tasks. Both the input and the output of the job are stored in a file system. MapReduce takes care of scheduling tasks, monitoring them and re-executing failed tasks. The distributed file system and MapReduce run on the same nodes so that the framework can effectively schedule tasks on nodes where the data is already present [10]. In general, the MapReduce framework consists of two types of nodes that control the job execution process: a jobtracker and a number of tasktrackers. The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker [7].

5. TeraSort
TeraSort is a group of benchmark and testing tools included in the Hadoop distribution. There are other tools, but this is the one that produces the results we are looking for. These are very popular tools for benchmarking and stress-testing Hadoop clusters; the performance measurements can then be shared or compared. These tools are used to compare large commodity clusters, and even to compete to break records. The Hadoop distribution includes a jar of programming examples, which contains the three programs used in TeraSort. The list of example tests is shown next.

$ bin/hadoop jar hadoop-examples.jar
An example program must be given as the first argument. Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  dbcount: An example job that counts the pageview counts from a database.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets.
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using the Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sleep: A job that sleeps at each map and reduce task.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort.
  terasort: Run the terasort.
  teravalidate: Check the results of the terasort.
  wordcount: A map/reduce program that counts the words in the input files.
Figure 1. Hadoop examples

From this list we will use Teragen, TeraSort, Teravalidate and Wordcount for our benchmarks. The TestDFSIO test was also used in order to measure the input/output performance of the cluster. This test is in the test jar, as shown next.
$ bin/hadoop jar hadoop-test.jar
An example program must be given as the first argument. Valid program names are:
  DFSCIOTest: Distributed i/o benchmark of libhdfs.
  DistributedFSCheck: Distributed checkup of the file system consistency.
  MRReliabilityTest: A program that tests the reliability of the MR framework by injecting faults/failures.
  TestDFSIO: Distributed i/o benchmark.
  dfsthroughput: measure hdfs throughput.
  filebench: Benchmark SequenceFile(Input|Output)Format (block, record compressed and uncompressed), Text(Input|Output)Format (compressed and uncompressed).
  loadgen: Generic map/reduce load generator.
  mapredtest: A map/reduce test check.
  mrbench: A map/reduce benchmark that can create many small jobs.
  nnbench: A benchmark that stresses the namenode.
  testarrayfile: A test for flat files of binary key/value pairs.
  testbigmapoutput: A map/reduce program that works on a very big non-splittable file and does identity map/reduce.
  testfilesystem: A test for FileSystem read/write.
  testipc: A test for ipc.
  testmapredsort: A map/reduce program that validates the MapReduce framework's sort.
  testrpc: A test for rpc.
  testsequencefile: A test for flat files of binary key/value pairs.
  testsequencefileinputformat: A test for sequence file input format.
  testsetfile: A test for flat files of binary key/value pairs.
  testtextinputformat: A test for text input format.
  threadedmapbench: A map/reduce benchmark that compares the performance of maps with multiple spills over maps with 1 spill.
Figure 2. Hadoop tests

6. Problem statement
We want to know whether using ZFS improves the performance of a Hadoop cluster. To achieve this goal we compared the performance of Hadoop on top of ZFS, running on a FreeBSD commodity cluster, versus the same configuration using UFS. We used the well-known benchmark and stress tests TeraSort and Wordcount.
7. Methodology of the experiment
7.1 The laboratory
We used commodity hardware available in a computer classroom: 20 personal computers, each with an Intel Dual Core CPU at 3.2 GHz, 2 GB of RAM and a 160 GB hard drive, all connected through a Gigabit Ethernet switch.

7.2 The installation process for each scenario, one with UFS and the other with ZFS
For the first scenario, with UFS:
1. Install FreeBSD 9.0 with UFS on one computer.
2. Install the official Java JRE.
3. Install the latest stable distribution of Hadoop.
4. Clone the machine onto the other 18 computers.
5. Configure the network on the 19 computers.
6. Customize and adjust each of the 19 configurations.
7. Configure the cluster with 1 server and 18 job slaves.
8. Set up HDFS in the cluster.
9. Run the tests.
For the second scenario, with ZFS, step 1 installs FreeBSD 9.0 with ZFS on one computer; steps 2 through 9 are identical to the first scenario.

7.3 The benchmarks
Wordcount test:
- Get the 100 most-read books from the Project Gutenberg web page in txt format.
- Move them to HDFS.
- Run the Wordcount test on the directory holding the books.
- Measure the time.

Input/output test:
- Run TestDFSIO write with X files of Y file size.
- Measure the time.
- Run TestDFSIO read with the same X and Y parameters.
- Repeat for incrementally larger X and Y.

TeraSort test:
- Run Teragen with X rows of data to generate.
- Measure the time.
- Run TeraSort to sort the previously generated data.
- Measure the time.
- Run Teravalidate to validate the sorted data.
- Measure the time.
- Repeat the TeraSort tests with different values of X.

We ran into some trouble because the available disk space was not enough for some tests.

8. Results
We present the most representative tables comparing UFS and ZFS, and briefly explain each one.

Wordcount test: We used the top 100 books from the Gutenberg web site, downloaded in txt format, about 50 MB of data in total. We put them on HDFS and ran the Wordcount test with the default parameters. The results are shown in Table 1. The times were practically identical, so the performance is very similar; the run time is very short, which is why we could not see much difference.
Table 1. Wordcount test times (UFS vs ZFS)

Input/output tests: This selected sample of the test worked with 30 files of 5000 MB each (about 150 GB of data): first the write test and then the read test. The data were generated by the commands shown below.

time bin/hadoop jar hadoop-test.jar TestDFSIO -write -nrFiles 30 -fileSize 5000
time bin/hadoop jar hadoop-test.jar TestDFSIO -read -nrFiles 30 -fileSize 5000

The results are shown in Table 2 for write and Table 3 for read. In this test we could see that ZFS is faster at generating the data but slower at reading it; the average I/O rate for ZFS is better on write and worse on read.

Table 2. Write times
Table 3. Read times

There is not much data yet, but we can see the same behavior: writing data is faster on ZFS, but reading is slower, and with more data the differences grow. We also ran tests with modified parameters, setting the number of map tasks and reduce tasks to 100 each.

TeraSort tests: The most representative tests with the default parameters are shown in Table 4.

Table 4. Tests with default parameters
Table 5. Tests with map and reduce tasks modified to 100
The behavior is almost the same, but the generated data set is very small; the sort even takes longer than the generation of the rows. We then generated 1 million rows and compared different parameters in the TeraSort task. Here the differences are more noticeable because of the time taken and the size of the data generated. As a general note, we could not work with more than 1 million rows: the programs aborted with errors due to disk space problems.

Table 6. Different parameters in the TeraSort test

9. Conclusions
With these tests we can discard the hypothesis that ZFS is a better file system than UFS for use in a Hadoop cluster. ZFS is better for other tasks and has many improvements but, for this specific workload, the read operation is worse than on UFS, or at best equal for small amounts of data. Since Hadoop is used for big data, from our own experience we cannot recommend ZFS for a Hadoop cluster running on hardware with only one hard drive per computer.

10. Future work
Next summer we will compare UFS versus the EXT4 file system, with two different operating systems, on the same hardware. A report from Google says that having more hard disks helps to improve read performance, so we should repeat these tests with more than one disk per computer; the results for ZFS might then change.

11. References
[1] Hadoop web site.
[2] Jeffrey Dean and Sanjay Ghemawat, MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008).
[3] O'Malley, O. and Murthy, A.C., Winning a 60 second dash with a yellow elephant. http://sortbenchmark.org/yahoo2009.pdf
[4] FreeBSD web site.
[5] Dawidek, P.J., Porting the ZFS file system to the FreeBSD operating system. AsiaBSDCon 2007.
[6] O'Malley, O., Terabyte sort on Apache Hadoop. sortbenchmark.org.
[7] T.
White, Hadoop: The Definitive Guide, O'Reilly Media, Inc. (2009).
[8] Oracle, Oracle Solaris ZFS Administration Guide (2010).
[9] M. Matuska, ZFS and FreeBSD, BSD Magazine, February.
[10] MapReduce web site.
Acceleration of Selective Cationic Antibacterial Peptides Computation: A Comparison of FPGA and GPU Approaches
Garcia-Ordaz D. 1, Arias-Estrada M. 2, Nuño-Maganda M. 3, Polanco C. 4, and Del Rio G. 5
1,2 National Institute for Astrophysics, Optics and Electronics, 3 Universidad Politécnica de Victoria (TECNOTAM), 4,5 Universidad Nacional Autónoma de México (UNAM).

Abstract
Prediction of the physicochemical properties of peptide sequences can be used for the identification of Selective Cationic Amphipatic Antibacterial Peptides (SCAAP), with possible applications in the treatment of different diseases. Exhaustive computation of the physicochemical properties of peptide sequences can reduce the SCAAP search space, but the combinatorial complexity of these calculations makes this a high-performance computing problem. There are several alternatives to deal with the high computational demand: the use of supercomputer resources has been the traditional approach, but other computational alternatives are cost-effective ways to bring supercomputing power to the desktop. In this work we compare the acceleration and performance of SCAAP computation (for peptides 9 amino acids in length) on three different platforms: traditional PC computation, FPGA acceleration through a custom architecture implementation, and GPU acceleration with CUDA-C programming. The comparison is carried out with four physicochemical property codes used to identify peptide sequences with potential selective antibacterial activity. The FPGA acceleration reaches a 100x speedup compared to a PC, and the CUDA-C implementation shows a 1000x speedup at the desktop level, with the potential of increasing the number of peptides being explored. A discussion of performance and trade-offs for the implementation on each platform is given.
Keywords: Physicochemical properties, selective cationic amphipatic antibacterial peptides, SCAAP, GPU, CUDA-C, FPGA

1. Introduction
Nowadays antibacterial peptides hold a strategic position for pharmaceutical drug applications and are the subject of intense research activity, since they are used by the immune system of all living organisms for protection against pathogenic bacteria. Bioinformatics research has recently been oriented toward finding fast and efficient techniques to predict the impact of antibacterial peptide action, due to the increasing resistance that pathogenic agents have developed against multiple drugs. These techniques can help to enhance the costly and laborious chemical
158 Where Supercomputing Science and Technologies Meet synthetic approach as well as the subsequent trial and error experiments to identify peptide with antibacterial activity in complex organisms. One feature of the antibacterial peptides is the presence of an amphipathic alphahelix. This feature corresponds to the predominance of hydrophobic (Leucine, Alanine) and cationic (Lysine) amino acids in the linear sequence of these peptides. It is important to note that such feature does not determine any influence on the toxicity or selectivity of the peptide once in contact with the target membrane [1, 2]. To improve natural antibacterial peptides, researchers have been replacing and/or removing nativelyoccurring amino acids known for its antibacterial action [4] aiming to reduce the size while keeping or increasing its toxicity [5]. Another technique consists of joining two peptides that individually do not exhibit antibacterial properties but combined turn out to be highly toxic [6]. From these studies, it is possible to derive rules for assigning an antibacterial activity to peptide sequences. This has led to a shift in the search of antibacterial peptides: although a source of peptides with antibacterial properties is natural diversity [3], the research efforts are now aimed at synthetic strategies. In fact, contemporary efforts to construct antibacterial peptides are the result of joint computational and/or mathematical methods to simulate peptide variations and then evaluate and qualify these variations to ultimately determine if the peptide complies with the required purposes. To get efficient antibacterial peptides by measuring the potential action of each altered peptide with the previous methods would result in a combination of possibilities that exceed by far the capacity of the verification methods known in the laboratory. 
For instance, the number of possible peptides formed from a peptide 9 amino acids in length would be 20^9 = 512,000,000,000 peptide sequences (note that there are 20 amino acids constituting any peptide sequence); assuming a time of 0.01 seconds to compute the physicochemical properties of each sequence, this becomes a computational bottleneck for current computers (the computation time is 5.12x10^9 seconds, about 162 years using 1 CPU, or almost 2 months using 1,000 CPUs). Thus, implementations aimed at reducing the computation time of the physicochemical properties of peptides may help researchers expand their ability to screen for antibacterial peptides. In this paper we compare the performance on three platforms (Fortran 77 on Linux Fedora 14, Handel-C on an FPGA ADM-XPL board, and CUDA-C on GPU) of a proposed method [7] to identify the antibacterial peptide subgroup with highly selective toxicity to bacteria, hereinafter referred to as Selective Cationic Amphipatic Antibacterial Peptides (SCAAP). A SCAAP is characterized by being less than 60 amino acids in length, not adopting an alpha-helicoidal structure in neutral aqueous solution, and showing a therapeutic index higher than 75 [8]. The therapeutic index of a peptide is defined as the ratio between the minimum inhibitory concentrations observed against mammalian and bacterial cells [9, 10, 11]; i.e., the higher the value, the more specific the peptide for bacterial-like membranes. Hence SCAAP display strong lytic activity against bacteria but exhibit no toxicity against normal eukaryotic cells such as erythrocytes
[10].

2. Method
2.1 SCAAP method
Peptides are synthesized linearly as an amino acid sequence [12]; this representation gives each peptide a unique blueprint. From this sequence, mathematical and computational algorithms of varying complexity have been designed to measure a variety of physicochemical properties [12]. Among the properties of this linear peptide representation, four define whether a peptide falls within the SCAAP category [7], as described below:
1. Mean hydrophobicity (MH) [12]. This is the mean of the hydrophobicities of the amino acids, normalized to 1, over all amino acids; SCAAP values range upward from 0.35.
2. Mean net charge (MC) [12], determined by MC = [(Arg+Lys) - (Asp+Glu)] / n, where Arg, Lys, Asp and Glu are the number of times the amino acids Arginine (Arg), Lysine (Lys), Aspartic acid (Asp) and Glutamic acid (Glu) appear in the peptide sequence, and n is the length of the peptide; for SCAAP, MC must exceed a fixed multiple of MH.
3. Isoelectric point (IP) [12], the pH at which the peptide has neutral charge; SCAAP range from 10.8 to 11.8.
4. Hydrophobic moment (HM) [12], which represents the peptide's amphipathicity; SCAAP range from 0.40 to 0.60.

The method has been reported previously in the literature, and the first software implementations were carried out in Fortran and C [7, 8].

2.2 FPGA Acceleration Approach
A first attempt to accelerate the computation of the SCAAP method was developed with FPGA technology [7], with encouraging results. FPGAs are devices in which the user can design a custom architecture adapted to the algorithm and replicate it in the device to achieve parallelism. The SCAAP algorithm is well suited to massive parallelization since it tests the 4 conditions explained in the previous section independently for each peptide sequence, without depending on computations for other peptide combinations.
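The four conditions of Section 2.1 can be expressed compactly in software. The sketch below is illustrative plain Python, not the paper's Fortran/C code: the MH lower bound and the IP/HM windows come from the text, while the MC-versus-MH coefficient is an explicit placeholder (the paper's exact constant is not reproduced here).

```python
from collections import Counter

def mean_net_charge(seq):
    """MC = [(Arg+Lys) - (Asp+Glu)] / n, on one-letter codes."""
    c = Counter(seq)
    return (c["R"] + c["K"] - c["D"] - c["E"]) / len(seq)

def is_scaap_candidate(seq, mean_hydro, iso_point, hydro_moment):
    """Apply the four SCAAP conditions in cheap-to-expensive order.
    MC_COEFF is a placeholder, not the published value."""
    MH_MIN = 0.35        # lower MH bound from the text
    MC_COEFF = 1.0       # placeholder coefficient
    if mean_hydro < MH_MIN:
        return False
    if mean_net_charge(seq) <= MC_COEFF * mean_hydro:
        return False
    if not 10.8 < iso_point < 11.8:   # IP window from the text
        return False
    return 0.40 < hydro_moment < 0.60  # HM window from the text

print(mean_net_charge("KKKRRAAAD"))  # (3 + 2 - 1 - 0) / 9
```

Ordering the tests from cheapest to most expensive matters for performance; both the FPGA and GPU implementations described below exploit the same early-rejection structure.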
Therefore a high degree of parallelism can be implemented for the SCAAP algorithm, limited only by the FPGA capacity and by an external arbitration mechanism to coordinate and read results out of the device(s) into a host computer. The general FPGA-based architecture is as follows: the prediction for each peptide is divided into 4 basic blocks, where each block performs one of the 4 physicochemical property calculations. The hardware architecture modules include:
1. Hydrophobicity Hardware Module (HHM): implements the MH algorithm with four storage modules (Amino acid Memory (AM), Hydrophobicity Memory (HM), Peptide Charge Register (PCR) and Hydrophobicity Register (HR)) and one Length Adjusting Module (LAM).
2. Mean Net Charge Hardware Module (MNCHM): implements the MC algorithm using a set of registers required to accelerate this calculation (Net Charge Register, NCR); the final values are registered in the FNCR memory of this module.
3. Isoelectric Point Charge Hardware Module (IPCHM): implements the pI algorithm; several multilevel memory accesses in parallel are used to accelerate the computation (temporal registers R0, R1, RM), and the result is ultimately stored in the Isoelectric Point Charge Register (IPCR).
4. Helical Hydrophobic Moment Hardware Module (HHMHM): implements the μH algorithm. This module actually computes the MC and MH values used by the MNCHM; it also stores the Sine (SM) and Cosine (CM) values that accelerate the μH calculation; however, this acceleration is counterbalanced by the large number of cycles required to execute this module. The values in the HM, SM and CM are stored in three temporal registers (TR0, TR1 and TR2), and the final result is stored in the Mean Value Register (MVR).
5. Global Control Unit (GCU): defines the execution order of the hardware modules; this is currently achieved by activating one module at a time, changing the control signal of Multiplexer 1 (MUX1) in each hardware module.

Every peptide sequence was represented as a fixed-point number and generated on the FPGA board, thus eliminating the overhead of CPU-FPGA data communication.

Figure 1. Hardware architecture for the SCAAP algorithm. The architecture can be replicated at least 4 times inside the FPGA device (depending on the FPGA used for the implementation) to increase parallelism.
We validated our peptide representation in the FPGA by comparing the results obtained with the FPGA against the software version (see the Design and Implementation section) and observed a 100% match in all tested
sequences: of the peptide sequences analyzed, the same 4,984 were predicted as SCAAPs by our FPGA board and by the software version.

2.3 GPU Approach
GPUs for general-purpose computing have gained interest over the last 5 years, first by adapting algorithms to OpenGL code (which is compiled to the parallel structures of GPUs); more recently Nvidia has proposed a generic GPU architecture, named Fermi, that can be used as a general-purpose computing coprocessor programmed in the CUDA-C language. CUDA-C is an extension of standard ANSI C that supports sequential and parallel code in the same source file, under an abstraction of the parallel multithreading structure organized as blocks of threads in a single grid. This software abstraction is convenient since it is independent of the architecture of the computing device, and it combines CPU-GPU processing in a single source code. For our experiments, the SCAAP algorithm was implemented in CUDA-C. The kernel evaluates subsets of 20^4 peptides out of the complete 20^9 possible peptide sequences of length 9. Each subset has 20^4 = 160,000 peptide sequences, computed with an individual thread per sequence. The sequences of each subset are computed in two phases. The first phase is done outside the kernel: positions 0 to 4 of the sequences are calculated by generating all possible combinations of the amino acids. The second phase is performed inside the kernel: since each thread analyzes one peptide sequence, the kernel receives the first fragment of the sequence, obtained in the first phase, and the remaining four amino acids are determined inside the kernel. These last four characters are chosen according to the block and thread indices: the thread identifiers select the seventh and eighth characters, and the block identifiers select the fifth and sixth characters.
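The index-to-sequence mapping just described can be mimicked in plain Python (the real implementation is a CUDA-C kernel; function names here are illustrative):

```python
# Two-phase enumeration: the host fixes the first five residues, and
# each (block, thread) index pair selects the last four. Following the
# text, block indices pick characters 5-6 and thread indices pick 7-8.

AMINO = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids listed in the paper

def thread_sequence(prefix5, bx, by, tx, ty):
    """The 9-residue sequence evaluated by thread (tx, ty) of
    block (bx, by) for a given 5-residue host prefix."""
    assert len(prefix5) == 5
    return prefix5 + AMINO[bx] + AMINO[by] + AMINO[tx] + AMINO[ty]

def grid_sequences(prefix5):
    """All 20^4 sequences covered by one kernel launch
    (a 20x20 grid of blocks, each holding 20x20 threads)."""
    for bx in range(20):
        for by in range(20):
            for tx in range(20):
                for ty in range(20):
                    yield thread_sequence(prefix5, bx, by, tx, ty)
```

For the prefix "AAAAA" this enumerates exactly the first evaluation set described in Section 3, from AAAAAAAAA to AAAAAYYYY.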
Amino = { 'A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y' };

The kernel is launched with a grid of 20x20 blocks and blocks of 20x20 threads. In this way all possible combinations of the 20^9 sequences can be created across multiple calls from the CPU to the GPU. Figure 2 shows the general flow of data from the CPU to the device in kernel batches. In the kernel, each thread computes the four characteristics:
a) Mean hydrophobicity
b) Mean net charge
c) Isoelectric point
d) Helical hydrophobic moment
The program evaluates the first two properties, the mean hydrophobicity (hydro) and the mean net charge (charge), and then tests the first condition: the peptide is rejected if the net charge does not exceed a polynomial function of the mean hydrophobicity (i.e., charge <= P(hydro) discards the sequence).

Figure 2. Host (CPU) calls to the device in batches of 20^2 blocks of 20^2 threads each. In order to cover the 20^9 sequences, the CPU launches the kernel 20^5 times.

If the condition is passed, the isoelectric point is calculated; some of its calculations are made beforehand outside the device, with the results stored in constant memory. Then a second condition is evaluated: 10.8 < Isoelectric Point < 11.8. If true, the helical hydrophobic moment is calculated and a third condition is checked: 0.40 < Hydrophobic Moment < 0.60. If it passes, the evaluated sequence is flagged; at the end of the kernel execution, the flagged threads are stored and communicated to the CPU.

Figure 3. Grid and block organization. The device grid is organized as 20x20 blocks and each block contains 20x20 threads. Each individual thread computes a unique peptide sequence composed of the 5 most significant values set by the host and the 4 least significant values set by the position of the thread/block.

3. Experiments and Results
From a previous experiment [7], the FPGA implementation and evaluation were carried out on a Virtex-II Pro FPGA device with 1.5 million equivalent gates. The results are summarized and compared with the CPU and GPU implementations in Table 2. The GPU implementation was evaluated on three different Nvidia boards: GeForce GT 230M, Tesla C2070 and GeForce GTX 480. The GeForce GT 230M card has 48 CUDA cores, runs up to 158 gigaflops and has
a processor clock of 1100 MHz. The Tesla C2070 card has 448 CUDA cores, a 1.15 GHz processor core clock, 515 gigaflops of double-precision floating-point performance and 1030 gigaflops of single-precision floating-point performance. Finally, the GeForce GTX 480 card has 480 CUDA cores and a processor clock of 1401 MHz. Since the program works with a grid of 20x20 blocks and blocks of 20x20 threads, the minimum group of sequences to be evaluated has a size of 20^4 peptides. The program was evaluated with several sets of peptides. The first set comprises all sequences from AAAAAAAAA to AAAAAYYYY, where the first five amino acids remain the same and the last four change, for a total of 20^4 sequences. The second set goes from AAAAAAAAA to AAAAYYYYY, changing only the last five amino acids, for a total of 20^5 sequences. The remaining sets are shown in Table 1. Two kinds of sets were used, one starting with the amino acid Alanine (A) and the other with Arginine (R); this was done because in former experiments [7] a large number of antibacterial peptides was found in the range of sequences starting with R. This also means that the algorithm takes more time analyzing these sequences, since they are more likely to pass the first two conditions and reach the third one, which takes longer than the others, increasing the global execution time. For these reasons we decided to include this set to compare execution times.

Table 1. Sets of analyzed sequences. When the peptide starts with R, the execution time increases: in former experiments a large number of antibacterial peptides were found to start with R, so sequences in this range are more likely to be evaluated under the second and fourth conditions.

Figure 4.
Execution times for the three Nvidia GPU boards and different sequence set sizes.

As expected, when the program was tested with peptides starting with R, the time taken to analyze the sequences increased. A large number of antibacterial peptides lie in the range of peptides starting with R; since the SCAAP algorithm stops evaluating a peptide as soon as it fails a condition, the probability that a sequence reaches the third and fourth conditions increases in this range. The last condition takes more time than the others, as shown in figure 5. Since the CUDA architecture arranges threads in warps, and a warp cannot finalize until all its threads have finished, if at least one thread in a warp holds an antibacterial peptide, the rest of the threads must wait until that peptide has passed the fourth condition. Despite this drawback, the performance penalty is not large, as shown in figure 5. On the other hand, compared with the software and FPGA implementations, the performance increase is large. Table 2 compares the computing time of the FPGA, the Fortran 77 CPU code, and the Nvidia boards used in this work. The execution time per peptide shown in Table 2 for the Nvidia cards is obtained by dividing the execution times from the graphs above by the number of sequences evaluated, and averaging. The time per grid is the time required by the GPU board to analyze the set of 20^4 sequences computed in the grid during one kernel launch (20x20 blocks of 20x20 threads, hence 20^4 peptides tested).
In Table 2 it can be observed that the time required by the FPGA to analyze a single peptide is almost the same as the time the GTX 480 board needs to evaluate an entire set of sequences.

Table 2. Execution time comparison among all the approaches.

As shown in Table 2, the GeForce GTX 480 and Tesla C2070 boards give the best performance. The Tesla device is the high-end Nvidia GPU parallel processor, and its only differences from the GeForce GTX 480 are the amount of global memory and internal cache.
Since the CUDA implementation of the peptide search algorithm does not use this memory, the overall performance is comparable and does not benefit much from the Tesla board's resources.

4. Conclusion
In previous work we concluded that FPGA devices are convenient for highly parallel bioinformatics computation, with a perfect match for peptide computation. However, after exploring the GPU approach, we found a larger parallelism increase with easier coding: in our experiments, the GPU approach outperformed the FPGA approach by several orders of magnitude.
On the other hand, a last-generation FPGA would increase performance and reduce the GPU/FPGA gap, but would still leave the GPU approach with more than a 1000x performance lead. Furthermore, the CUDA abstraction of parallelism is well suited to expressing the solution as a high-level parallel structure arranged as blocks of threads launched as kernels from the host computer; these are not restricted to the actual hardware size of the GPU board, allowing easy scaling to higher-performance boards, as shown in our experiments. We consider the GPU coprocessor approach with the CUDA-C language a powerful tool for bioinformatics that allowed us to increase computational performance and that could help explore faster, or longer, peptide chains in the near future. Finally, the FPGA approach remains interesting but, if the base architecture cannot be replicated 100x or 1000x in a single device, it is more convenient to choose the GPU/CUDA option, where off-the-shelf boards contain on the order of 500 core processors, outperforming FPGA, CPU and multicore CPU approaches by orders of magnitude.

5. References
[1] Unger T., Oren Z., Shai Y., The effect of cyclization of magainin 2 and melittin analogues on structure, function and model membrane interactions: implication to their mode of action, Biochemistry, 2001, 40.
[2] Yeaman M.R., Yount N.Y., Mechanisms of antimicrobial peptide action and resistance, Pharmacol Rev. 2003, Mar;55(1).
[3] Boman, H.G., Peptide antibiotics and their role in innate immunity, Annu. Rev. Immunol. 1995, 13.
[4] Saugar J.M., Rodriguez-Hernandez M.J., de la Torre B.G., Pachon-Ibanez M.E., Fernandez-Reyes M., Andreu D., Pachon J., Rivas L., Activity of cecropin A-melittin hybrid peptides against colistin-resistant clinical strains of Acinetobacter baumannii: molecular basis for the differential mechanisms of action, Antimicrob Agents Chemother.
2006, 50:
[5] Cao Y., Yu R.Q., Lio Y., Zhou H.X., Song L.L., Cao Y., Qiao D.R., Design, recombinant expression, and antibacterial activity of the cecropins-melittin hybrid antimicrobial peptides, Curr Microbiol. 2010, Sep;61:
[6] Zhu W.L., Park Y., Park I.S., Park Y.S., Kim Y., Hahm K.S., Shin S.Y., Improvement of bacterial cell selectivity of melittin by a single Trp mutation with a peptoid residue, Protein Pept Lett. 2006, 13:
[7] Polanco Gonzalez C., Nuno Maganda M.A., Arias-Estrada M., del Rio G., An FPGA Implementation to Detect Selective Cationic Antibacterial Peptides, PLoS ONE 2011, 6(6): e doi: /journal.pone
[8] Del Rio G., Castro-Obregon S., Rao R., Ellerby M.H., Bredesen D.E., APAP, a sequence-pattern recognition approach identifies substance P as a potential apoptotic peptide, FEBS Lett. 1999, 494:
[9] Ellerby H.M., Arap W., Ellerby L.M., Kain R., Andrusiak R., Rio G.D., Krajewski S., Lombardo C.R., Rao R., Ruoslahti E., Bredesen D.E., Pasqualini R., Anticancer activity of targeted proapoptotic peptides, Nat Med. 1999, 5:
[10] Shin S.Y., Kang J.H., Janq S.Y., Kim Y., Kim K.L., Kahm K.S., Effects of the hinge region of cecropin
Where Supercomputing Science and Technologies Meet

A(1-8)-magainin 2(1-12), a synthetic antimicrobial peptide, on liposomes, bacteria and tumor cells, Biochim Biophys Acta 2000, 1463:
[11] Hausman, Robert E., Cooper, Geoffrey M., The Cell: A Molecular Approach, 2004, Washington, D.C.: ASM Press, 51.
[12] Polanco C., Samaniego J.L., Detection of selective cationic amphipatic antibacterial peptides by Hidden Markov models, ABP, 2009, Vol. 56, No. 1/2009.
Computational Fluid Dynamics in Solid Earth Sciences: a HPC Challenge

Vlad Constantin Manea 1,2, Marina Manea 1,2, Mihai Pomeran 2, Lucian Besutiu 2, Luminita Zlagnean 2
1 Computational Geodynamics Laboratory, Centro de Geociencias, Universidad Nacional Autónoma de México, México; 2 Solid Earth Dynamics Department, Institute of Geodynamics of the Romanian Academy, Romania

Abstract

The solid earth sciences have recently started to move towards implementing High Performance Computing (HPC) research facilities. One of the key tenets of HPC is performance, which strongly depends on the software-hardware interaction. In this paper we present benchmark results from two HPC systems. Testing a Computational Fluid Dynamics (CFD) code specific to earth sciences (CitcomS), the HPC system Horus, based on Gigabit Ethernet, performed reasonably well compared with its counterpart CyberDyn, based on InfiniBand QDR fabric. Nevertheless, the HPCC CyberDyn, with its low-latency, high-speed QDR network dedicated to MPI traffic, outperformed the HPCC Horus. Because of the high-resolution simulations needed in geodynamics, HPC facilities used in earth sciences would benefit from a larger upfront investment in future systems based on high-speed interconnects.

Keywords: High performance computing cluster, numerical modeling, computational fluid dynamics

1. Introduction

In recent years, modeling and computation have come to play a central role in modern earth sciences [1, 2, 3], in part because of their dependence on fine spatial grids and small integration time steps for numerically solving the systems of equations that express a physical process mathematically [4, 5, 6].
The solid earth sciences have started to move towards implementing high-performance computing research facilities, as can be seen at many universities and research centers: Argonne National Laboratory, California Institute of Technology, Johns Hopkins University, Purdue University, Los Alamos National Laboratory, University of California Berkeley, University of California San Diego, Woods Hole Oceanographic Institution, Australian National University, Cardiff University, Geological Survey of Norway or Monash University, to name just some of them. When optimized for the unique needs of the solid earth community, and coupled with the existing open-source community software, such a high-performance computing infrastructure certainly provides a key tool enabling rapid major advances in this exciting area of research (CIG - Computational Infrastructure for Geodynamics - geodynamics.org). Numerical methods have now progressed to the point that numeric simulations have become a central part of modern earth sciences, in particular of geodynamics [7, 8, 9, 10, 11]. Such computational systems are structured specifically for the solid earth community's simulation needs, which include a large number of computing cores, fast and reliable storage capacity, and a considerable amount of memory, everything configured in a system designed for long runtimes. One of the key tenets of HPC is performance [12, 13, 14, 15], and designing an HPC solution tailored to a specific research field such as solid earth sciences that represents an optimum price/performance ratio is often a challenge. HPC system performance strongly depends on the software-hardware interaction, and therefore prior knowledge of how well specific parallelized software performs on different HPC architectures can weigh significantly on choosing the final configuration [15, 16].

2. HPCC System Configurations

In this paper we present and evaluate benchmark results from two different HPC systems: one low-end HPCC (Horus) with 300 cores and 1.6 TFlops theoretical peak performance, and one high-end HPCC (CyberDyn) with 1344 cores and 11.7 TFlops theoretical peak performance (Table 1). Horus uses CentOS 5 and the open-source Rocks Cluster Distribution 5.1 (http://www.rocksclusters.org) as operating environment. CyberDyn employs Bright Cluster Manager (http://www.brightcomputing.com) and Scientific Linux (www.scientificlinux.org). Table 1.
Comparison between the HPCC Horus and HPCC CyberDyn configurations

Master node
- Server model: Horus: 1 x Dell PE; CyberDyn: Dell R715 (failover configured)
- Processor type: Horus: AMD Opteron, 2 x dual-core 2.8 GHz; CyberDyn: AMD Opteron, 2 x twelve-core 2.1 GHz
- Memory: Horus: 8 GB RAM; CyberDyn: 68 GB RAM
- Cards: Horus: PERC 5i Integrated RAID Controller; CyberDyn: PERC H200 Integrated RAID Controller, Broadcom NetXtreme dual-port SFP+ Direct Attach 10 GbE NIC

Computing nodes
- Server model: Horus: 38 x Dell SC1435/PE; CyberDyn: Dell R815
- Processor type: Horus: AMD 2 x quad-core 2.6 GHz and Intel 2 x quad-core 2.6 GHz; CyberDyn: AMD 4 x twelve-core 2.1 GHz
- Memory: Horus: GB/node; CyberDyn: GB/node
- Network fabric: Horus: 1000T Ethernet (1 Gbit/sec); CyberDyn: InfiniBand QDR (40 Gbit/sec)
- Network, MPI traffic: Horus: HP ProCurve 48G, 48-port unmanaged high-performance switch, 10 GbE and stacking capable; CyberDyn: QLogic BS01 36-port InfiniBand Quad Data Rate
- Management/IPMI: Horus: the same switch is used for MPI traffic and cluster management; CyberDyn: 2 x Dell PowerConnect, port-managed Layer 3, 10 GbE and stacking capable
- Storage: Horus: 2 x Direct Attached Storage arrays connected to the master node, each with a 15 TB RAID 5 volume; CyberDyn: 1 x Panasas 8 Series with 40 TB storage
Cooling
- Horus: Liebert 8 ton; CyberDyn: Liebert 2 x 7 ton; precision cooling on both (temperature 21 ±2 °C, humidity 50% ±5%)

Apart from the number of computing cores, the main difference between the two HPC systems is the interconnect. The HPC Horus system uses a centralized Gigabit Ethernet network for administrative traffic, data sharing (NFS or other protocols) and MPI or application processing traffic. The second and larger HPC system, CyberDyn, uses two internal networks. The first, a Gigabit Ethernet network, is used for scheduling, node maintenance and basic logins, while the second internal network is QDR InfiniBand and is dedicated exclusively to computational parallel MPI traffic. In order to study the Earth's mantle flow in detail, large HPC facilities and specific tools are required. The software benchmark used in this paper is the open-source package CitcomS, which is widely used in the solid earth community (www.geodynamics.org). CitcomS is a computational fluid dynamics code based on the finite element method, and is designed to solve thermal convection problems relevant to the Earth's mantle [17, 18].
Written in C, the code runs on a variety of parallel computers, including shared- and distributed-memory platforms, and is based on domain decomposition. As mentioned above, the FEM software used for this comparative benchmark is CitcomS. This parallelized numeric code requires a library implementing the MPI standard, and on both HPCC systems CitcomS is compiled using OpenMPI. On both HPC systems we use Sun Grid Engine (SGE) as the job scheduler. For these benchmarks we selected a series of different Earth's mantle convection problems, from simple purely thermal convection to the more complex thermochemical convection problem. All numeric simulations were performed within full-spherical and regional shell domains. We used different mesh resolutions and numbers of computing cores, with finer meshes assigned more cores. Each test was performed several times to ensure that we obtained consistent, repeatable and accurate results.
3. Benchmark results

Below we present in detail the benchmark results obtained on both HPCC systems. Because HPCC Horus is limited to 300 cores, the results are comparable only up to 192 computing cores for full-spherical models and 256 computing cores for regional models. On both HPC systems we performed a series of benchmarks using three different FEM simulations: two simple thermal convection problems, one as a regional model and the other as a full-spherical model, and a more complex thermochemical simulation as a full-spherical model. In the case of purely thermal convection simulations, on both HPC systems we obtained similar performance for grid sizes up to 129x129x65 nodes (Figures 1 and 2). As the mesh size and the number of computing cores increase, the HPCC CyberDyn starts outperforming the HPCC Horus because of its low-latency, high-speed QDR network dedicated to MPI traffic.

Figure 1. Comparison between benchmark results on HPCC Horus and HPCC CyberDyn for a purely thermal convection FEM simulation in a regional model, showing the influence of mesh size on wall time as a function of the number of processors. To the right, three different evolutionary stages of the thermal convection simulation are shown as temperature isosurfaces (visualization performed with the open-source software OpenDX). Warm colors correspond to high temperature, and cold colors represent low temperature inside the Earth's mantle. The orange sphere at the initial stage represents the Earth's iron core.
Figure 2. Comparison between benchmark results on HPCC Horus and HPCC CyberDyn for a purely thermal convection FEM simulation in a full-spherical model, showing the influence of mesh size on wall time as a function of the number of processors. To the right, three different evolutionary stages of the thermal convection simulation are shown as temperature isosurfaces (visualization performed with OpenDX). Warm colors correspond to high temperature, and cold colors represent low temperature inside the Earth's mantle. The orange sphere at the initial stage represents the Earth's iron core.

A thermochemical simulation performed in a full-spherical shell represents the third comparative test between HPCC Horus and HPCC CyberDyn. These simulations involve particles (or tracers) in order to track the thermochemical changes inside the model. The tracers are generated pseudo-randomly, with a total number equal to the number of tracers per element times the total number of finite elements. In our simulations we used 20 tracers per element, and therefore varied the total number of tracers from 1 million to 500 million, depending on the model resolution. The benchmark results show that for complex numeric simulations HPCC CyberDyn performs better than Horus, because of the fast low-latency InfiniBand QDR network fabric and the high-performance Panasas storage. We also found that for very high-resolution models the maximum number of computing cores that offers the minimum wall time is around 384 (see Figure 3). Although Horus is slower than CyberDyn for these high-end FEM simulations, we can
observe a continuous decrease in wall time for almost all model resolutions and numbers of computing cores. This result demonstrates that HPCC Horus still has real potential to expand, probably to over 500 computing cores in the future.

Figure 3. Comparison between benchmark results on HPCC Horus and HPCC CyberDyn for a thermochemical convection FEM simulation in a full-spherical model, showing the influence of mesh size on wall time as a function of the number of processors. To the right, three different evolutionary stages of the thermochemical convection simulation are shown as temperature isosurfaces (visualization performed with OpenDX). Warm colors correspond to high temperature, and cold colors represent low temperature inside the Earth's mantle. The orange sphere at the initial stage represents the Earth's iron core.

4. Discussion and conclusions

The high-speed InfiniBand interconnect offers the possibility of exploiting the full potential of large clusters, and represents a key component that positively influences both scalability and performance on large HPC systems. Testing a CFD code specific to earth sciences, the HPC system Horus, based on Gigabit Ethernet, performed remarkably well compared with its counterpart CyberDyn, which is based on InfiniBand QDR fabric. HPC systems based on Gigabit Ethernet are still a quite popular, cost-effective choice, but one suitable only for small and perhaps medium-size high-performance clusters running CFD codes specific to earth sciences. On the other hand, for medium and large HPC systems running earth sciences CFD codes, a low-latency, high-bandwidth fabric such as InfiniBand is
highly recommended. Since we are presently moving towards high-resolution simulations for geodynamic predictions that require the same scale as observations (from several to thousands of kilometers), HPC facilities used in earth sciences would benefit from a larger upfront investment in future systems based on high-speed interconnects.

5. Acknowledgments

We gratefully acknowledge the use of the HPCC Horus at the Computational Geodynamics Laboratory of the Centro de Geociencias, UNAM, Mexico, and of the HPCC CyberDyn at the Institute of Geodynamics of the Romanian Academy for all numeric simulations. This research has been conducted through the CYBERDYN project (POS CCE O212_ID 593).

6. References

[1] Gerya, T., Introduction to Numerical Geodynamic Modelling, Cambridge University Press, Cambridge, UK.
[2] Ismail-Zadeh, A., and P. Tackley, Computational Methods for Geodynamics, Cambridge University Press, Cambridge, UK.
[3] B.J.P. Kaus, D.W. Schmid, and S.M. Schmalholz, Numerical modelling in structural geology, Journal of Structural Geology, submitted, 2011. Available online: kausb/kss10_Review_MS_JSG.pdf
[4] Press, W.H., S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery, Numerical Recipes in C: The Art of Scientific Computing, 2nd ed., Cambridge University Press, Cambridge, UK.
[5] Hughes, T.J.R., The Finite Element Method, Dover Publications.
[6] Spiegelman, M., Myths and Methods in Modeling, Columbia University Course Lecture Notes, 2004. Available online: http://www.ldeo.columbia.edu/~mspieg/mmm/course.pdf
[7] L.N. Moresi, F. Dufour, and H.B. Muhlhaus, A Lagrangian integration point finite element method for large deformation modeling of viscoelastic geomaterials, Journal of Computational Physics, 184, 2003, pp.
[8] Choi, E., M. Gurnis, S. Kientz, and C. Stark, SNAC, User Manual, Version. Available online:
[9] E. Choi, L. Lavier, and M.
Gurnis, Thermomechanics of mid-ocean ridge segmentation, Physics of the Earth and Planetary Interiors, 171, 2008, pp.
[10] Landry, W., L. Hodkinson, and S. Kientz, GALE, User Manual, Version, 2010. Available online: software/gale/gale.pdf
[11] Tan, E., M. Gurnis, L. Armendariz, L. Strand, and S. Kientz, CitcomS, User Manual, Version, 2012. Available online: citcoms.pdf
[12] P. Terry, A. Shan, and P. Huttunen, Improving Application Performance on HPC Systems with Process Synchronization, Linux Journal, 127, 2004, pp. 68, 70.
[13] J. Borill, L. Oliker, J. Shalf, H. Shan, A.
Uselton, HPC global file system performance analysis using a scientific application derived benchmark, Parallel Computing, 35(6), 2009, pp.
[14] H. Yi, Performance evaluation of GAIA supercomputer using NPB multi-zone benchmark, Computer Physics Communications, 182(1), 2011, pp.
[15] H. Jin, D. Jespersen, P. Mehrotra, R. Biswas, L. Huang, and B. Chapman, High performance computing using MPI and OpenMP on multicore parallel systems, Parallel Computing, 37(9), 2011, pp.
[16] A. Gray, I. Bethune, R. Kenway, L. Smith, M. Guest, C. Kitchen, P. Calleja, A. Korzynski, S. Rankin, M. Ashworth, A. Porter, I. Todorov, M. Plummer, E. Jones, L. Steenman-Clark, B. Ralston, and C. Laughton, Mapping application performance to HPC architecture, Computer Physics Communications, 183(3), 2012, pp.
[17] S. Zhong, M.T. Zuber, L.N. Moresi, and M. Gurnis, The role of temperature-dependent viscosity and surface plates in spherical shell models of mantle convection, Journal of Geophysical Research, 105, 2000, pp. 11,063-11,082.
[18] E. Tan, E. Choi, P. Thoutireddy, M. Gurnis, and M. Aivazis, GeoFramework: Coupling multiple models of mantle convection within a computational framework, Geochemistry, Geophysics, Geosystems, 7, Q06001, 2006, doi: /2005gc
Parallel Computing
A Parallel PSO for a Watermarking Application on a GPU

Edgar García Cano 1, Katya Rodríguez 2
1 Posgrado en Ciencia e Ingeniería de la Computación; 2 Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas; Universidad Nacional Autónoma de México, Ciudad Universitaria, Coyoacán, México, D.F., México

Abstract

In this paper, a study of the usability, advantages and disadvantages of using the Compute Unified Device Architecture (CUDA) to implement population-based algorithms, specifically Particle Swarm Optimization (PSO) [5], is presented. In order to test the performance of the proposed algorithm, a watermark-hiding application is implemented, and the PSO is used to optimize the positions where the watermark is to be inserted. This application uses the insertion/extraction algorithm proposed by Shieh et al. [1]. The whole algorithm was implemented both sequentially and on the CUDA architecture. The fitness function used in the optimization algorithm is the union of two objectives: fidelity and robustness. Fidelity and robustness are measured using the Mean Squared Error (MSE) and Normalized Correlation (NC), respectively; these functions are evaluated using a Pareto dominance scheme.

Keywords: Parallel PSO, watermarking, CUDA, GPU.

1. Introduction

The digital age brought a new way to share information (files, audio, video, images, etc.), but there is no guarantee against someone else using it without authorization; this is why watermarking emerged as a new way to protect information. Digital watermarking consists of inserting a pattern into an image, video or audio file, which helps to copyright the information in those files. Image watermarking is divided into two groups: visible and invisible watermarks. A visible watermark is a visible semi-transparent text or image overlaid on the original image.
It allows the original image to be viewed, but it still provides copyright protection by marking the image as the owner's property. Visible watermarks are more robust against image transformations (especially if a semi-transparent watermark is placed over the whole image); thus, they are preferable for strong copyright protection of intellectual property in digital format [2]. An invisible watermark is an embedded image that cannot be perceived by the human eye.
Only electronic devices (or specialized software) can extract the hidden information to identify the copyright owner. Invisible watermarks are used to mark specialized digital content (text, images or even audio content) to prove its authenticity [2]. Evolutionary computation, a subfield of Artificial Intelligence, uses population-based models to solve optimization problems. The main feature of these models is that they are inspired by the mechanisms of natural evolution. Another set of algorithms, based on biological models and classified as bio-inspired, comprises Ant Colony and Swarm-based algorithms; these solve problems in a different way, based on the behavior of animals or systems that took centuries to evolve. In recent years, new and cheaper technologies such as the CUDA architecture have emerged with the concept of massive parallelism for general-purpose problems. The advantage of this technology is that anyone with a personal computer can take advantage of massive parallelism to accelerate procedures. The process of watermarking can be applied to copyright any sort of digital information. In some fields, like financial banking, it is necessary to process a large quantity of information as fast as possible. On one hand, this is a reason to look to a new and cheaper technology such as CUDA to accelerate the process. On the other hand, the need to make the watermarking process robust against modifications such as cropping, rotation, flipping, scaling, changes in the colors, etc., was the reason to use an optimization process. The idea of using PSO as the optimization algorithm comes from the fact that it has few parameters to adjust. The paper is organized as follows: section 2 explains how the watermarking algorithm works and the metrics used to evaluate the watermarked image quality. In section 3, an overview of the PSO algorithm is presented.
Section 4 shows the optimization algorithm used in this work. In section 5, tests and results are presented and analyzed. Finally, a discussion of the advantages and disadvantages of using GPUs as a technology to implement population-based algorithms is presented.

2. Methods

With the vast volume of information flowing on the Internet, watermarking is widely used to protect the authenticity of this information. The need to copyright a huge quantity of digital files, spending as little time as possible and avoiding information loss, motivated the use of Shieh's watermarking algorithm, Particle Swarm Optimization as an optimizer, and finally a GPU based on the CUDA architecture to accelerate the process.

2.1 The watermarking algorithm

Shieh et al. [1] proposed an algorithm to insert and extract a watermark based on the Discrete Cosine Transform (DCT). This transformation is used because it is not necessary to have the original cover image to extract the watermark; when dealing with a huge number of images, it would be very expensive to store all cover images for watermark extraction. Figure 1 shows the adjustment of the Shieh algorithm used in this work.
Once the original image is loaded in memory and after the DCT, the ratio values are calculated using the DC and AC coefficients. The next step calculates the relation between the image content and the embedding frequency bands (polarities). Then, the watermark image is inserted in the selected bands of each 8x8 block. Quantization is used as an attack on the watermarked image, and it is necessary for the optimization process. Finally, the IDCT is computed and the watermarked image is obtained (for more details about the CUDA implementation see [3]).

Figure 1. Watermarking algorithm

2.1.1 Watermarking metrics

In order to evaluate the performance of a watermark algorithm in hiding the information, some metrics have been proposed. The watermark algorithm has to be capable of hiding the mark data while preventing distortions of the image. In order to measure fidelity and robustness in a simple way, spending the shortest time possible, the MSE and the NC were used.

Watermark fidelity: The fidelity represents the similarity of the watermarked image to the original image. Thus, the mean squared error (equation 1) was utilized to measure fidelity. It ought to be close to zero for a good approximation between the non-watermarked and the watermarked image.

Watermark robustness: The robustness represents the resistance of the watermark against attacks (compression, rotation and scaling, amongst others) on the watermarked image. The normalized correlation NC (equation 2) is used to measure robustness. It applies the logical operation exclusive disjunction, also called exclusive or; the bitwise operations are faster, reducing the runtime. The exclusive-or calculation is shown in Table 1. The NC value between the original watermark (W) and the extracted watermark (W') must be close to zero to prevent loss of the watermark image information.
The NC and the MSE are computed for each 8x8 block, as shown in Figure 2. This was done with the purpose of dividing the data on the GPU as much as possible. When measuring the MSE in each block, just 64 comparisons are needed, and they are executed at the same time as those in the other blocks. The sequential process needs 512x512 evaluations, one after another, for a 512x512 image. The same applies to the NC: instead of being calculated for the whole image (as in the sequential form), it was computed for each block.

Figure 2. Block organization to calculate the MSE and NC.

2.2 Particle Swarm Optimization

Particle Swarm Optimization (PSO) is a population-based stochastic optimization technique developed by Dr. Eberhart and Dr. Kennedy in 1995, inspired by the social behavior of bird flocking or fish schooling [10]. PSO shares many similarities with evolutionary computation techniques such as Genetic Algorithms (GA). The system is initialized with a population of random solutions and searches for optima by updating iterations. However, unlike GA, PSO has no evolution operators such as crossover and mutation. In PSO, the potential solutions, called particles, fly through the problem space by following the current optimum particles. It has been successfully applied to many problems in several fields such as Biomedicine (Karssemeijer 2006 [7]) and Energy Conversion (J. Heo 2006 [4]), image analysis being one of its most frequent applications, as in biomedical images (Mark P. Wachowiak 2004 [9]) and microwave imaging (M. Donelli 2005 [6] and T. Huang 2007 [8]), amongst others.

2.2.1 Basic PSO algorithm
Another best value that is tracked by the particle swarm optimizer is the best value, obtained so far by any particle in the neighbors of the particle. This location is called lbest. When a particle takes all the population as its topological neighbors, the best value is a global best and is called gbest. The PSO concept consists of, at each time step, changing the velocity of (accelerating) each particle toward its lbest and gbest locations. Acceleration is weighted by a random term, with separate random numbers being generated for acceleration toward lbest and gbest locations. After finding the two best values (lbest and gbest), the particle i updates its velocity and position with equations 3 and 4, where i = 1, 2, 3 NS. 147
Vi = Vi + ϕ1 r1 (Bi - Xi) + ϕ2 r2 (Bgi - Xi)   (3)
Xi = Xi + Vi   (4)

ϕ1 and ϕ2 are positive constants called acceleration coefficients, NS is the total number of particles in the swarm, r1 and r2 are random vectors whose components are generated in [0,1], and g represents the index of the best particle in the neighborhood. The other vectors are Xi = [x1, x2, ..., xiD], the position of the i-th particle; Vi = [v1, v2, ..., viD], the velocity of the i-th particle; Bi, the best historical value found by the i-th particle; and Bgi, the best value found in the neighborhood of the i-th particle [5] (for more details about the CUDA implementation see [3]).

2.3 The optimization algorithm

The objective of the optimization is to find the best set of frequency bands in which to insert the watermark within the image. Different frequency bands are tested through the iterations of the algorithm to find the best solution. At the end of the execution, the application yields the watermarked image and a matrix with all the best positions (frequency bands) in which to insert the complete watermark. The algorithm in charge of the watermark optimization is the PSO, and it uses Pareto dominance to evaluate the fitness function through the MSE (fidelity) and NC (robustness). This process is detailed as follows:

1. Using the DCT idea of splitting the image into 8x8 blocks, each block is used as a swarm. An image of 512x512 has 4096 blocks; hence each block will be a swarm. The number of particles per swarm is specified as a configuration parameter of the algorithm.

2. Each particle has a position vector. The vector size depends on the number of watermark bits to insert in each block of the image. If the watermark size is 128x128 and it is divided uniformly among the 4096 blocks of the image, then 4 bits are inserted in each block. Each position corresponds to a band in the 8x8 block where the watermark bits are inserted.
3. At the beginning, all the swarms are initialized randomly (each swarm must have the same number of particles). If 4 bits are to be inserted, 4 bands are required, so 4 random numbers must be generated between 1 and 63. This means that each particle consists of 4 bands (a positions vector). If each swarm has 5 particles, every particle has a set of 4 bands used to originate 5 different solutions. To generate solution 1, all the particles with index 1 are taken and joined from every swarm; to generate solution 2, all the particles with index 2 are taken and joined from every swarm, and so on. This procedure is shown in Figure 3. After the insertion and extraction operations, the MSE (equation 1) and the NC (equation 2) are calculated; based on the MSE and NC, the fitness value is estimated.

4. One of the particles must be selected as the global best. Among the best options generated, one is chosen to be the global best. To choose the local best particle, the MSE and the NC are added up; if the new value is closer to zero than the old one, the new particle replaces the old one; otherwise, the old one continues in the process.

5. In the last step, the velocity and the new position of the particles are calculated according to equations 3 and 4. This generates the new bands, and a new iteration begins. Figure 4 shows the whole algorithm.

Figure 3. How the solutions are generated: taking particles P1 and P2 from the different swarms, bands B1, B2, B3 and B4 generate the corresponding solution.

Figure 4. The optimization algorithm
3. Tests and results

All tests were executed on two different servers whose features are shown in Tables 2 and 3. Figure 5 shows the original image used in the algorithm and Figure 6 the watermark image. For the experiments, the images are 512x512 pixels in gray scale.

3.1 Proposed algorithm results

Figures 7 and 8 show the runtime of the watermarking optimization. The first and second tables show five experiments with different numbers of iterations using the sequential and CUDA versions. These experiments were made to compare the running time of the algorithm and the quality of the results, based on the idea that the operations executed on the GPU should be faster than those computed on the CPU. The third table shows the runtime of the sequential implementation with 10 and 30 iterations, but without random number generation. This was done to check whether the CUDA implementation is still faster than the sequential implementation once random number generation is removed. With random numbers, the sequential version is remarkably slower: generating different numbers consumes a large amount of CPU time. In the sequential version the random numbers are generated with the C function drand48, which returns a pseudorandom number in the range [0.0, 1.0). On the GPU, the random numbers are generated using the cuRAND library.

Figure 5. Original image (Barbara)

Figure 6. Watermark image (© 2012 BancTec, Inc. All rights reserved)
Figure 7. Runtime for PSO on Geogpus

In the case of the GPU, the random numbers are generated directly in the GPU's constant memory; there is no need to transfer them from the host to the device. This is why the difference in runtime between using random numbers or not in CUDA is minimal. Reviewing the initial and final fitness values (Figures 7 and 8), it is noteworthy that the sequential version gives better results than the GPU. In all cases the runtimes indicate that the GPU is faster than the CPU, even when all data have been loaded or when static numbers are used in the CPU version. Thus, at least for this version of the application, if the user wants a good optimization of the watermarking, the sequential version should be used; by contrast, if the user needs a quick approximation, the GPU version ought to be applied (for other results see [3]).

Figure 8. Runtime for PSO on Uxdea

4. Conclusions

Since there is no standard configuration for the blocks, the threads or the memory treatment on a GPU, it is necessary to analyze and design the procedures involved in the application to take advantage of the parallelism. In order to use parallel programming on a GPU, it is necessary to shift from sequential to parallel thinking, learning how to divide a huge problem into small ones (divide and conquer) to obtain the best performance. For example, the calculation of the NC required only 4 threads to do the comparisons, but in the case of the MSE 64 threads working at the same time were used. Therefore, the
configuration of the blocks and threads for an application on a GPU must be carefully analyzed. Different options to implement the PSO were analyzed; the implemented version uses as many swarms as the number of blocks into which the image is divided for the DCT. This divides a big problem into small ones, which suits the parallel paradigm. As established, there is no standard configuration for the CUDA architecture, so the configuration was chosen according to the needs of each function. The PSO needs to evaluate two vectors: velocity and position. Position depends on velocity, which is why velocity must be computed first. If there are 4096 swarms (4096 blocks) and each swarm has five particles, then each particle needs to update its velocity vector. The number of operations to be calculated on a CPU is 4096 (swarms) * 5 (particles) * 1 (operation) = 20,480 operations, one after another. On the GPU the same operations are executed, but there are 4096 blocks with 5 threads each working in parallel, one operation per thread, hence 20,480 threads working at the same time. If one CPU thread spent 1 second per operation the runtime would be 20,480 s, whereas on the GPU the 20,480 threads working at the same time would take 1 second to finish the calculation. This example considers neither the speed of the processors (CPU or GPU) nor the upload/download of data to/from the GPU. The velocity vector needs random numbers to be calculated (see Equation 3). To generate them, the cuRAND library was used. This library makes it easy to generate a lot of numbers in a short time; the problem is memory. If a large quantity of these numbers is generated and held in global memory, there might be a shortage of space to store other data.
In one iteration of the PSO, two random numbers are used to calculate each velocity value. If there are 4096 blocks with 5 particles each, 40,960 random numbers per iteration are needed. There is another type of memory on the GPU, the constant memory. This memory is loaded on the GPU but cannot be changed afterwards. It was chosen to store the random numbers because they are not modified during the calculation of the velocity values. Another feature that needs to be considered (and that varies from GPU to GPU) is the speed of the processor. This is evident in the experiments, because the Geogpus server is faster than the Uxdea server. After this analysis, it can be said that the use of CUDA helps to improve the performance of the application and that a population-based algorithm can be implemented on it, as long as the developer is aware of the features of this technology.

5. Acknowledgements

The first author, a student at the Posgrado en Ciencia e Ingeniería de la Computación at the Universidad Nacional Autónoma de México, wants to express his gratitude for the support received from CONACYT (scholarship number 37617). We also want to express our gratitude for the support received from PAPIIT (project number ).
6. References

[1] Chin-Shiuh Shieh (2004). Genetic watermarking based on transform-domain techniques. Pattern Recognition.

[2] ByteScout (2011). Digital watermark types.

[3] Edgar García-Cano (2012). A parallel bio-inspired watermarking algorithm on a GPU. M.Sc. thesis, Posgrado en Ciencia e Ingeniería de la Computación, UNAM.

[4] J. Heo, K. Lee & R. Garduno-Ramirez (2006). Multiobjective control of power plants using particle swarm optimization techniques. IEEE Transactions on Energy Conversion, vol. 21, no. 2.

[5] Ammar Mohemmed, Mark Johnston & Mengjie Zhang (2009). Particle swarm optimization based multi-prototype ensembles. GECCO.

[6] M. Donelli & A. Massa (2005). Computational approach based on a particle swarm optimizer for microwave imaging of two-dimensional dielectric scatterers. IEEE Transactions on Microwave Theory and Techniques, vol. 53, no. 5.

[7] S. Selvan, C. Xavier, N. Karssemeijer, J. Sequeira, R. Cherian & B. Dhala (2006). Parameter estimation in stochastic mammogram model by heuristic optimization technique. IEEE Transactions on Information Technology in Biomedicine, vol. 10, no. 4.

[8] T. Huang & A. S. Mohan (2007). A microparticle swarm optimizer for the reconstruction of microwave images. IEEE Transactions on Antennas and Propagation, vol. 55, no. 3.

[9] Mark P. Wachowiak & Renata Smolikova (2004). An approach to multimodal biomedical image registration utilizing particle swarm optimization. IEEE Transactions on Evolutionary Computation, vol. 8, no. 3.

[10] J. Kennedy & R. C. Eberhart (1995). Particle swarm optimization. Proceedings of the IEEE International Conference on Neural Networks.
Analysis of Genetic Expression with Microarrays Using GPU Implemented Algorithms

Isaac Villa-Medina 1,2, Eduardo Romero-Vivas 1, Fernando D. Von Borstel 1
1 Centro de Investigaciones Biológicas del Noroeste, S.C., Mar Bermejo 195, Col. Playa Palo de Santa Rita, C.P. , La Paz, B.C.S., México; 2 Instituto Tecnológico de La Paz, Boulevard Forjadores de Baja California Sur No. 4720, C.P. , La Paz, B.C.S., México

Abstract

DNA microarrays are used to analyze simultaneously the expression levels of thousands of genes under multiple conditions; however, the massive amount of data generated makes its analysis a challenge and an ideal candidate for massively parallel processing. Among the available technologies, General Purpose computation on Graphics Processing Units (GPGPU) is an efficient, cost-effective alternative to a Central Processing Unit (CPU). This paper presents the implementation of algorithms using the Compute Unified Device Architecture (CUDA) to determine statistical significance in the evaluation of gene expression levels for a microarray hybridization experiment designed and carried out at the Centro de Investigaciones Biológicas del Noroeste, S.C. (CIBNOR). The results are compared with respect to traditional implementations.

Keywords: GPU, Microarray, CUDA.

1. Introduction

Recent technological advances in molecular biology and genomics have triggered an explosion in the amount of information generated. Prominent examples of this growth can be observed in public databases of DNA sequences such as GenBank or UniProt, where the amount of information doubles approximately every 6 months. Technologies such as next-generation sequencing or the use of microarrays to analyze gene expression allow large-scale analyses covering a large proportion of the genome of an organism, in contrast to only a few years ago, when techniques allowed genes to be analyzed only separately.
The ability of DNA microarrays to analyze simultaneously the expression levels of thousands of genes under multiple conditions has revolutionized molecular biology, impacting academia and fields in medicine and the pharmaceutical, biotech, agrochemical and food industries. Today the cost of analyzing this information, in terms of economics, time and resources, tends to be higher than the cost of generating it [1]. This growth in the amount of information generated in each experiment
requires the use of new analysis technologies that match the dimension of the data. Bioinformatics, understood as the application of mathematics, statistics and information technologies to the analysis of genomic and proteomic data, has become the accepted solution to this challenge. One of the main features of microarrays is the large volume of data generated; therefore, one of the greatest challenges in this area involves the handling and interpretation of these data. The size of the information generated by microarrays and its analysis make them ideal candidates for parallel processing architectures, taking advantage of the many-core and multicore processors that are revolutionizing high-performance computing. However, the use of clusters and supercomputers has remained exclusive to laboratories and universities with large resources. Meanwhile, the development of many-core architectures such as Graphics Processing Units (GPUs), and specifically the Compute Unified Device Architecture (CUDA) proposed by NVIDIA in 2006 [2-4], allows the development of bioinformatic analysis algorithms on low-cost devices with high computing power. There are only a few studies using GPUs for microarray analyses. For example, an algorithm based on GPUs for the classification of genes expressed in a microarray has been developed recently [5]. This paper reports the implementation of algorithms in CUDA to determine statistical significance in the evaluation of gene expression levels for a microarray hybridization experiment designed at CIBNOR, and compares the results with respect to traditional implementations.

2. Materials and Methods

2.1 Microarrays

DNA microarrays are devices that can measure the expression levels of thousands of genes in parallel.
A microarray is a solid crystalline surface, usually a microscope slide, to which specific DNA molecules are attached for the purpose of detecting the presence and abundance of labeled complementary molecules (nucleic acids) in a biological sample (via Watson-Crick duplex hybridization). Most microarray experiments involve labeled nucleic acids derived from the messenger RNA (mRNA) of a tissue sample of an organism, i.e. molecules involved in the (coding) process of generating a protein, and therefore the degree of expression of a gene can be measured by quantifying the relative abundance of bound molecules [6]. Figure 1 shows the most commonly used experimental design for microarrays. The first step in the process is to extract genetic material from tissues in two different biological conditions, such as an abnormal condition and a normal control. Then the samples are labeled with different fluorophores, red for the sample tissue (with Cy5) and green for the control tissue (with Cy3), and hybridized on the microarray slide. These markers identify the DNA complementary to the nucleic acids of interest in the sample by emitting red and green light, respectively, when illuminated by a laser. Both images are combined to obtain a color image where overexpressed genes acquire shades of red, inhibited genes shades of green, and genes that remained in the same condition in both samples are shown in yellow. Afterwards,
an estimate of the signal intensity in each case is carried out, with corrections to normalize and adjust the signal to the dark background. The overexpression or underexpression of a given gene is represented as the ratio of the two signal intensities (sample over control), as defined in Equation 1. With this formula, genes that are overexpressed by a factor of 2 give a ratio of 2, whereas underexpressed genes give values such as 0.5. Hence, it is preferable to use a logarithmic transformation with base 2, so that a doubly overexpressed gene generates a value of 1 whereas a gene underexpressed by half generates a value of -1, making the interpretation of results more intuitive given the natural symmetry of biological phenomena [6]. (1)

2.2 Statistical analysis of differential expression

Each gene spot gives a measure of expression that compares two samples for a given experiment. However, in order to represent the variability among a population of organisms, repetitions of the experiment with different individuals are required to identify genes that are consistently differentially expressed. Setting an expression threshold and averaging the readings over the total number of organisms is not appropriate, as it neither reflects the extent to which the expression levels vary for each individual nor takes into account the size of the sample, i.e. the number of organisms involved in the study. Therefore, a hypothesis test shall be used to determine whether a gene is differentially expressed. The null hypothesis for this experiment is that there is no difference in expression between the two tissues. If this hypothesis were true, variability in the data would only represent the variability between individuals or measurement error. The selection of differentially expressed genes should not be based on the ratio defined in Equation 1 but on a predefined value p (p = 0.001), i.e. the probability of observing a degree of change by chance.
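The base-2 transformation can be illustrated with a one-liner (assuming Equation 1 is the usual sample-over-control intensity ratio; the function name is ours):

```cpp
#include <cmath>

// log2 expression value: +1 for a doubly over-expressed gene,
// -1 for a gene expressed at half the control level, 0 for no change.
double logExpression(double sampleIntensity, double controlIntensity) {
    return std::log2(sampleIntensity / controlIntensity);
}
```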
For the purpose of this study, the paired t-test was selected, calculated as shown in Equation 2:

t = X / (S / sqrt(n)) (2)

Figure 1. Experimental design and use of microarrays
where X is the average of the ratios defined in Equation 1, S is the standard deviation calculated with Equation 3, and n is the number of biological replicates of the experiment.

S = sqrt( sum_i (x_i - X)^2 / (n - 1) ) (3)

The p value is calculated by comparing the statistic with a t-distribution with the appropriate number of degrees of freedom, in this case the number of replicates minus one.

2.3 Design of the microarray

As part of the SAGARPA-CONACYT 2009-II project entitled "Functional genomics application as a strategy for improvement of the shrimp industry", a microarray was designed specifically for shrimp from unique sequences from public databases (GenBank) and subtractive libraries generated at the Centro de Investigaciones Biológicas del Noroeste, S.C. (CIBNOR). The selection of sequences, preprocessing, assembly and design of probes was carried out at CIBNOR, while the physical printing of the microarray was done by the company Biodiscovery, LLC (dba MYcroarray). Experimental challenges under various biological conditions were carried out at CIBNOR, while the microarray hybridization and scanning were performed at the DNA Microarrays Unit of the Institute of Cellular Physiology, UNAM. Figure 2 shows an example of the microarray image generated for a given experiment and a zoomed view. The displayed image is the result of combining the images of the challenge-condition slide and the control slide, containing genes arranged in 160 rows and 384 columns divided into two blocks. Each point represents a unique sequence of 70 bases, representative of the gene of interest. The zoomed view of a section shows that not all gene spots in the microarray have the same intensity.

Figure 2.
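Equations 2 and 3 together are the standard paired t statistic; a serial C++ sketch (variable names ours, not the paper's CUDA code) makes the computation concrete:

```cpp
#include <cmath>
#include <vector>

// Paired t statistic: t = mean / (s / sqrt(n)), with s the sample standard
// deviation (n - 1 denominator, Equation 3). The inputs are the per-replicate
// log ratios from Equation 1.
double pairedT(const std::vector<double>& x) {
    const std::size_t n = x.size();
    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= n;
    double ss = 0.0;  // sum of squared deviations from the mean
    for (double v : x) ss += (v - mean) * (v - mean);
    double s = std::sqrt(ss / (n - 1));          // Equation 3
    return mean / (s / std::sqrt((double)n));    // Equation 2
}
```

For two replicates with ratios 2 and 4, the mean is 3, s is sqrt(2), and t works out to exactly 3.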
Microarray image and zoom view

The SPOTFINDER program was used for image analysis of the microarrays to determine the expression level of each gene from the set of points that form the image; it generates a table of maximum, minimum and average intensity and background.

2.4 Experimental design

In order to evaluate the use of parallel processing algorithms for the analysis of microarray gene expression, the paired t-test parametric analysis was implemented on GPU cards with routines developed in CUDA.
From the hybridization data of a microarray of genes, several subsets of data were generated by varying the number of genes selected for analysis and the number of replicates of the experiment. The computer on which the project was developed has the following features: Intel Core2 Duo E8400 processor at 3.00 GHz with 2.0 GB RAM, 100 GB hard drive, Fedora 12 operating system, with a GeForce 9800 GT (112 CUDA cores, CUDA compute capability 1.1, 1024 MB dedicated memory and a 256-bit memory interface).

2.5 CUDA implementation

Figure 3 shows a flowchart of the operations and CUDA functions used to perform the paired t-test defined in Equation 2. The three-dimensional matrix of microarray data in global memory was mapped onto a two-dimensional array, as illustrated in Figure 4.

Figure 4. Mapping of the data matrix to global memory

The functions required for the computation are described below.

Step 1. Compute the average of the data. This requires an algorithm that adds the elements of a row and subsequently divides by the number of columns, thus obtaining the average. The required function is:

sumatoria(). Sum by row. Receives as input parameters the array of data to be processed and a vector; it computes the sum over all columns of each row and stores the result in the vector passed as a parameter. Each thread of each block is responsible for loading one element of the array into the block's shared buffer, and the operations are performed using this buffer.

Figure 3. Flow diagram of the paired t-test calculation in CUDA
Step 2. Obtain the standard deviation. This requires the sum of the squares of the differences between the samples and the average. The following algorithms are used:

divisionesc(). Division by a scalar. Divides a vector by a scalar value, both received as input parameters. Each thread of each block loads one vector element into the buffer and then performs the operation.

restapow2(). Squared difference. Receives as input parameters an array and two vectors; it subtracts from the elements of each row the corresponding value in the vector, squares the results and stores them.

promedio(). Average. Uses the previous algorithms to obtain the average of each row.

raiz(). Square root. Computes the square root of each vector element.

Step 3. Calculate the value of t. To complete the calculation, the following algorithm divides two vectors element by element.
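The per-block shared-memory reduction that sumatoria() performs can be mimicked serially. This sketch is our own, not the authors' kernel: it halves the number of active "threads" on each pass, exactly as a CUDA block reduction over a shared buffer does.

```cpp
#include <vector>

// Serial analogue of a CUDA block reduction: buf plays the role of the
// per-block shared buffer, and each inner-loop index i stands for one thread.
double rowSum(std::vector<double> buf) {
    // pad to a power of two with zeros so the halving always lines up
    std::size_t n = 1;
    while (n < buf.size()) n <<= 1;
    buf.resize(n, 0.0);
    for (std::size_t stride = n / 2; stride > 0; stride /= 2)
        for (std::size_t i = 0; i < stride; ++i)  // one "thread" per i
            buf[i] += buf[i + stride];
    return buf[0];  // the reduced sum ends up in element 0
}
```

The row average of Step 1 is then just rowSum(row) divided by the number of columns, i.e. the job that divisionesc() does on the GPU.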
divisionmat(). Vector division. Performs element-by-element division of two vectors and stores the result in a third vector.

3. Results

We compared the computation time of the GPU implementation against the time obtained with a serial CPU implementation, varying the number of genes involved in the analysis and the number of replicates in each experiment. Figures 5 and 6 show the processing time of the algorithm using the CPU and the GPU for different numbers of genes analyzed and numbers of replicates.

Figure 5. Computation time using the CPU

Figure 6. Computation time using the GPU

Figure 7. Computation time ratio CPU / GPU

4. Discussion

Figure 5 shows how the runtime of the t-test varies with respect to the number of genes involved in each replicate and to the number of replicates n. For a given number of replicates there is a linear increase with the number of genes involved. For a greater number of replicates the slope becomes larger, as more values are used for each calculation. This behavior corresponds to what one would expect from the serial implementation on a CPU. Figure 6 shows the corresponding times for the same calculation, but now implemented
on the GPU. One can observe that the processing times remain approximately equal, on the order of hundred-thousandths of a second, regardless of the increase in the number of genes or the number of replicates. Figure 7 shows the advantage of the parallel calculation over the serial implementation: the process can be performed 5 to 30 times faster than the CPU implementation, depending on the number of genes involved.

5. Conclusion

Despite the GPU computation being 5 to 30 times faster, the magnitude of the time spent on the CPU and the GPU does not, at first glance, justify a parallel implementation, since both run in fractions of a second. However, we must take into account that only the most basic parametric statistic has been implemented: a t-test for a single study with paired data. Microarray technology, however, is used in more complex experiments, where there may be multiple groups in which more than one condition is analyzed. Such experiments require more sophisticated analyses known as ANOVA and generalized linear models. Both techniques are similar to the t-test in that they require the variability in the data to follow a normal distribution. Bootstrap analysis can be applied to both techniques to generate the data distributions without Gaussian assumptions. For that, new data sets of the same dimensions are generated from the original data, and it is common to produce millions of these sets to generate the distributions [6]. In these cases the speed advantage of the GPU implementations presented here is fully justified.

6. Acknowledgements

The authors acknowledge the support of the SAGARPA-CONACYT 2009-II project entitled "Functional genomics application as a strategy for improvement of the shrimp industry".

7. References

[1] A. Sboner, X. J. Mu, D. Greenbaum, R. K. Auerbach and M. B.
Gerstein, "The real cost of sequencing: higher than you think!", Genome Biology, Vol. 12, 2011.

[2] NVIDIA, cuda.html

[3] J. Sanders and E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming, 1st Ed., Addison-Wesley Professional.

[4] D. B. Kirk and W. W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, 1st Ed., Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

[5] A. Benso, S. Di Carlo, G. Poltano and A. Sevino, "GPU acceleration for statistical gene classification", IEEE Intl. Conf. on Automation Quality and Testing Robotics (AQTR 2010), Vol. 2, May 2010, Cluj-Napoca, Romania.

[6] S. Dov, Microarray Bioinformatics, Cambridge: Cambridge University Press.
Parallelization of Three Hybrid Schemes Based on the Covariance Matrix Self-Adaptation Evolution Strategy (CMSA-ES) and Differential Evolution (DE)

Víctor E. Cardoso, Francisco E. Jurado, Arturo Hernández
Departamento de Computación y Matemáticas Industriales, Centro de Investigación en Matemáticas, A.C., Jalisco S/N, Col. Valenciana, C.P. , Guanajuato, Gto., México

Abstract

In this document we propose three hybridization schemes between the CMSA-ES algorithm described by [2] and the differential evolution (DE) techniques explained in [1]. The goal is to balance the disadvantages of the CMSA-ES with the advantages of DE, and vice versa, in order to maximize the probability of finding the global optimum on the test functions. The first proposal, called CMSA-ES/CDE, consists of adapting the covariance matrix C of the CMSA-ES by applying the differential operator directly to the matrix. To accomplish this, three subpopulations of size μ are selected from the λ offspring individuals. The second proposal is identified as CMSA-ES/xDE, because the differential operator creates a candidate individual (from the λ population) to compete in a tournament with the recombinant individual of the CMSA-ES algorithm; the winner is used to compute the covariance matrix of the next generation. The third proposal is known as CMSA-ES/x:<μ(DE)>, because it produces μ individuals with the differential operator from the λ offspring created at each iteration of the CMSA-ES algorithm, which are later used to generate the recombinant individual x. The algorithms were written in parallel using the C++ programming language with OpenMP; it will be shown that the parallelization reduces the computing time from hours to minutes.
Finally, we show the results of each algorithm on a benchmark of five known functions, varying the number of dimensions and reporting the success rate of each one. The proposed algorithms have a performance better than or equivalent to their basic versions.

Keywords: Parallel Algorithms, Evolutionary

1. Introduction

Before explaining how the hybridization between CMSA-ES and DE is performed, we briefly explain how these methods work individually. Some of their advantages
and disadvantages will be discussed. Each algorithm is explained step by step, to serve as a reference for the hybrid versions that follow. Regarding notation, N(0,1) denotes a random number generated with a normal distribution with mean zero and variance one; in the same way, N(0,I) represents a vector of random numbers generated with a multivariate normal distribution with zero mean in all dimensions and covariance matrix equal to the identity matrix. The same convention applies to U(0,1), which denotes a random number generated with a uniform distribution on the (0,1) range.

2. Covariance Matrix Self-Adaptation Evolution Strategy (CMSA-ES)

There are two main components needed to efficiently build an evolution strategy that works on any continuous domain:

1. A method of covariance matrix self-adaptation capable of capturing the shape of the search space in the shortest time.

2. A routine able to adjust the step size.

The CMSA-ES algorithm uses only two parameters, τ and τ_c, whose values follow from the convergence analysis in [2], where N is the dimension size and μ is the number of individuals selected from the λ new offspring. Algorithm 1 shows a simple version of the CMSA-ES, simplifying the learning rule of the covariance matrix and using self-adaptation for σ. In Algorithm 1, the population with μ individuals is denoted by P_μ; in like manner, P_λ denotes a population of λ individuals. The recombinant individual is represented as x^(r), while the recombination of a population P_pop is defined, in its simplest case, as the average of the individuals in P_pop, where the subscript pop refers to some population, for example μ or λ. Furthermore, each individual is associated with a step size, a search direction and a velocity.
It is assumed that, before doing the recombination again, the P_μ population has been selected from the μ best
individuals of the λ population, where μ < λ. The covariance matrix C is initialized as the identity matrix, so in the first iteration its square root is also the identity. Computing the square root of C by spectral decomposition requires solving the eigenvalue problem; although this provides additional information about the adaptation of the search space in the neighborhood of the population state, the computational cost is high and not always needed, so we can exploit the symmetry of the covariance matrix and approximate its square root with the lower triangular matrix L of the Cholesky decomposition of C, as explained in [2]. It is important to note that this decomposition can be computed before starting the iterations over P_λ.

3. Differential Evolution

This method is based on the differential operator. This operator generates a new, improved candidate solution by calculating the difference between a number of predecessor solutions randomly chosen from the population. Depending on the optimization function, the author in [1] proposes five different schemes of the differential operator, which suit the solution in different ways:

DE/rand/1
DE/rand/2
DE/rand-to-best
DE/best/1
DE/best/2

The differential parameters (a scale factor and a crossover rate) help adapt the schemes to the actual optimization problem. On the other hand, if the indexes s1, s2, s3, s4 and s5 are randomly chosen, then they must all be distinct from each other and from the current index i. Once the s indexes are chosen (one only needs to extract the number of indexes required by the selected scheme; e.g. DE/best/1 only requires s1 and s2), a new candidate is created and then used to compete in a tournament against the individual in the ith position. This process repeats iteratively until either the system converges or a maximum number of iterations is reached.
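The Cholesky shortcut mentioned above can be sketched with a textbook decomposition (our own illustration, not the paper's code), used to turn N(0, I) samples into steps with covariance C:

```cpp
#include <cmath>
#include <vector>

using Mat = std::vector<std::vector<double>>;

// Plain textbook Cholesky: returns lower-triangular L with L * L^T = C.
Mat cholesky(const Mat& C) {
    std::size_t n = C.size();
    Mat L(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j <= i; ++j) {
            double s = C[i][j];
            for (std::size_t k = 0; k < j; ++k) s -= L[i][k] * L[j][k];
            L[i][j] = (i == j) ? std::sqrt(s) : s / L[j][j];
        }
    return L;
}

// y = x_r + sigma * L * z maps a standard normal z into a mutation step
// whose covariance is sigma^2 * C, avoiding the spectral square root.
std::vector<double> sampleStep(const std::vector<double>& xr, double sigma,
                               const Mat& L, const std::vector<double>& z) {
    std::vector<double> y(xr);
    for (std::size_t i = 0; i < xr.size(); ++i)
        for (std::size_t k = 0; k <= i; ++k)  // L is lower triangular
            y[i] += sigma * L[i][k] * z[k];
    return y;
}
```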
Algorithm 2 shows the implementation of an optimizer using differential evolution; P_n represents the population of size n, and the different s are the indexes required by the operator. It is recommended to choose n to be at least 10 times greater than the problem dimension N. DE is effective at finding globally optimal solutions for smooth functions, provided the right search scheme is chosen. For example, a good scheme choice for the sphere function is one that involves the best candidate solution. Conversely, if the function has many local optima, the same approach may not give good results.
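The mutation and tournament at the core of DE can be sketched for vectors as follows (names are ours; the hybrids later apply the same DE/best/1 operator at the matrix level):

```cpp
#include <vector>

// DE/best/1 mutation: candidate = best + F * (a - b), with F the scale factor.
std::vector<double> deBest1(const std::vector<double>& best,
                            const std::vector<double>& a,
                            const std::vector<double>& b, double F) {
    std::vector<double> c(best.size());
    for (std::size_t d = 0; d < c.size(); ++d)
        c[d] = best[d] + F * (a[d] - b[d]);
    return c;
}

// Greedy minimisation tournament: keep whichever vector scores lower under f.
template <class Fn>
std::vector<double> tournament(const std::vector<double>& current,
                               const std::vector<double>& candidate, Fn f) {
    return f(candidate) < f(current) ? candidate : current;
}
```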
Where Supercomputing Science and Technologies Meet

4. Hybrid Proposal: The CMSA-ES/CDE

The objective of this strategy is to decrease the systematic errors produced during the covariance matrix approximation; hence, for each generation of the CMSA-ES, the C matrix is adjusted using the DE/best/1 operator schema. The algorithm is based on Algorithm 1; the difference is that once the new population is created, μ candidates must be randomly chosen to extract a subpopulation P_α, and the same idea is applied to extract P_β. The covariance matrix C is then calculated using DE, where γ is a differential parameter, P_α and P_β are subpopulations extracted from P_λ, and P_μ represents the μ best individuals of P_λ. Notice that it is possible to control the weight of the differential operator on the algorithm through the γ parameter: if γ = 0, then Algorithm 3 becomes the same as Algorithm 1. If the previous equations are substituted into the calculation of C, the combined update is obtained. Algorithm 3 shows how the CMSA-ES/CDE is implemented.
5. Hybrid Proposal: The CMSA-ES/xDE

As stated in the previous sections, the CMSA-ES uses the covariance matrix information to guide the population in the direction that minimizes the function. This matrix is constructed using the μ best individuals of the P_λ population; hence, advantage can be taken of the DE operator. In other words, if the covariance matrix is constructed using only the μ best candidates, it is likely that some of the exploration data will be disregarded; if, on the other hand, the candidates are randomly extracted from the population, the right direction may be lost. In order to obtain a balance between exploration and exploitation, the DE/rand/1 scheme is used. The candidate created with this operator is put into a tournament with the recombinant. This tournament is carried out with a probability p_t at each iteration, which ensures that information from the original covariance matrix is not lost; note that if p_t = 1 the tournament will always occur.

6. Hybrid Proposal: The CMSA-ES/x: <μ(DE)>

This hybrid strategy is significantly different from the previous proposals, since it promotes the exploration and direction path of DE as new populations are created, while also exploiting the information about global orientation provided by the CMSA-ES covariance matrix. The algorithm reaches these goals by using the CMSA-ES as the base framework and embedding DE in it to generate the μ individuals that will give birth to the new recombinant; this means that the new strategy follows the CMSA-ES workflow to generate the P_λ population, and then uses the DE/best/1 operator schema to create the μ individuals of the P_μ population. This hybrid shows that the CMSA-ES and DE genetic strategies can work together without the need to strip off any feature of either algorithm.
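The probabilistic tournament described for the CMSA-ES/xDE can be sketched as follows; a small Python illustration with hypothetical helper names, not the paper's implementation.

```python
import random

def probabilistic_tournament(candidate, recombinant, f, p_t, rng):
    """With probability p_t, hold a tournament between the DE candidate and
    the recombinant (lower f wins); otherwise keep the recombinant, so the
    information carried by the original covariance matrix is never lost."""
    if rng.random() < p_t:
        return candidate if f(candidate) < f(recombinant) else recombinant
    return recombinant
```

With p_t = 1 the tournament always occurs, and with p_t = 0 the algorithm degenerates to the plain recombinant, matching the text.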
Just like in the previous proposal, if the control parameter γ = 0, then the hybrid behaves just as the standard CMSA-ES. Algorithm 5 shows the implementation of the hybrid strategy described in this section.
7. Random Number Generator with Normal Distribution

The problem of generating random numbers in parallel has to be considered with care, because most pseudorandom generators, including the standard C function rand(), use a seed based on the system's internal clock; this means the numbers cannot be accessed simultaneously [3], hence these functions are not thread safe. However, there exists another C function, rand_r(), which generates random numbers from a uniform distribution; it is not part of the C89 standard, but it is available on many systems, which still allows the portability of the code. Once we are able to generate random numbers from a uniform distribution simultaneously, it is possible to build a routine to generate random numbers from any other distribution. In this particular case, numbers from a normal distribution are required. The Box-Muller transform [4] allows the mapping of uniformly distributed numbers to a normal distribution. The mapping is made by the following equations:

Z_1 = sqrt(-2 ln U_1) cos(2π U_2)
Z_2 = sqrt(-2 ln U_1) sin(2π U_2)

where Z_1 and Z_2 are independent variables from a normal distribution with parameters μ = 0 and σ = 1, and U_1 and U_2 are random numbers from the uniform distribution taking values in (0, 1).

8. Test Functions

The different strategies were tested using the following functions: Sphere, Rosenbrock, Rastrigin, Griewank and Ackley.
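The Box-Muller mapping above can be sketched directly; a minimal Python version in which a per-stream seeded generator plays the role that rand_r() plays in C.

```python
import math
import random

def box_muller(u1, u2):
    """Map two uniform(0,1) samples to two independent N(0,1) samples."""
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)

def normal_stream(seed, n):
    """Each worker owns its own seeded generator (as with rand_r() in C),
    so streams can be drawn simultaneously without sharing state."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        u1 = rng.random() or 1e-12   # guard against u1 == 0 (log undefined)
        z1, z2 = box_muller(u1, rng.random())
        out.extend((z1, z2))
    return out[:n]
```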
The following images show the visualization of the test functions in two dimensions, and the following table includes the parameters that were used for testing. The argument that minimizes the Rosenbrock function is a vector with 1's in all of its entries.
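The test-function definitions themselves were lost in extraction, but the parallelization section names Sphere, Rosenbrock, Rastrigin, Griewank and Ackley; their standard forms can be sketched as follows (standard textbook definitions, not necessarily the paper's exact parameterizations).

```python
import math

def sphere(x):
    return sum(v * v for v in x)

def rosenbrock(x):
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
               for i in range(len(x) - 1))

def rastrigin(x):
    return 10.0 * len(x) + sum(v * v - 10.0 * math.cos(2.0 * math.pi * v) for v in x)

def griewank(x):
    s = sum(v * v for v in x) / 4000.0
    p = math.prod(math.cos(v / math.sqrt(i + 1)) for i, v in enumerate(x))
    return s - p + 1.0

def ackley(x):
    n = len(x)
    a = -20.0 * math.exp(-0.2 * math.sqrt(sum(v * v for v in x) / n))
    b = -math.exp(sum(math.cos(2.0 * math.pi * v) for v in x) / n)
    return a + b + 20.0 + math.e
```

As the text notes, Rosenbrock is minimized at the all-ones vector; the other four are minimized at the origin.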
9. Parallelization

The parallel implementation of the proposed algorithms reduced the execution time by up to 90%. The implementation was made using C++ and OpenMP ([5] contains a detailed description of its use). Before making the implementation efficient, the bottleneck points had to be detected; once this information was available, the routines could be designed accordingly. Two different approaches were used for the parallel implementation. The first processes the candidates in P_λ in parallel, taking advantage of the independence of the data of each individual. The second consists of processing the objective function in parallel, which is particularly efficient when the dimension of the objective function is large or when the cost of a function evaluation depends on the dimension (as with Sphere, Rosenbrock, Rastrigin, Griewank and Ackley); in general, it is also a good idea to use this approach when the function evaluations account for a latency greater than 80%. Since the creation and destruction of execution threads is very time consuming, the first strategy works only for large values of λ, while the second works much better for large dimensions; in any other case it may be a better idea to parallelize over the serial version of the program, which implies running many instances in parallel.

10. Results

In this section, the results of the executions of algorithms 1, 3, 4 and 5 are shown. Each algorithm was run 30 times with the following parameters: λ=4 λ=50 and γ=0.1; each execution is considered successful if the resulting error is lower than the one shown in the error column. Notice that the hybrid strategies work better on the functions that contain many local optima; this is because the information of the best-adapted candidate does not weigh as heavily as in the non-hybrid versions. It is also important to note that the CMSA-ES/xDE converges faster than the CMSA-ES, since it is influenced by the DE operator, which makes stagnation in a local optimum less likely to occur.
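The paper's OpenMP code is not reproduced here; as a language-neutral illustration of the first approach (evaluating the candidates of P_λ in parallel, since each evaluation is independent), a small Python sketch using the standard library in place of OpenMP:

```python
from concurrent.futures import ThreadPoolExecutor

def sphere(x):
    return sum(v * v for v in x)

def evaluate_population(f, population, workers=4):
    """Map the objective over the population with a pool of workers.
    Each candidate's evaluation is independent, so no synchronization is
    needed beyond collecting the results in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(f, population))
```

As the text points out, this pays off only when λ is large enough to amortize the cost of creating the workers.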
11. Conclusions

The three proposed hybrids are based on the use of self-adaptive mutations instead of only the self-adaptive step length. Calibrating the hybrids requires empirical knowledge, since the parameters are very sensitive. The proposed algorithms perform better when working with big populations. Strong evolution strategies use big populations and are highly parallelizable, which is why the proposed strategies have great potential in different applications.

References

[1] Feoktistov, V. Differential Evolution: In Search of Solutions. Springer (2006).
[2] Beyer, H. Covariance Matrix Adaptation Revisited: the CMSA Evolution Strategy. Vorarlberg University of Applied Sciences (2008).
[3] Marsaglia, G. Random Number Generators. Journal of Modern Applied Statistical Methods (2003).
[4] Box, G. and Muller, M. A Note on the Generation of Random Normal Deviates. The Annals of Mathematical Statistics (1958).
[5] Chapman, B., Jost, G. and van der Pas, R. Using OpenMP: Portable Shared Memory Parallel Programming. MIT Press (2008).
Simulation of Infectious Disease Outbreaks over Large Populations through Stochastic Cellular Automata Using CUDA on GPUs

Héctor Cuesta-Arvizu, José Ruiz-Castilla, Adrián Trueba-Espinosa
Universidad Autónoma del Estado de México, Centro Universitario UAEM Texcoco, Av. Jardín Zumpango s/n, Fraccionamiento el Tejocote, C.P. 56259, Texcoco, Estado de México, México

Abstract

In science, a large number of areas benefit from the reduction of computational time achieved with Graphics Processing Units (GPUs); in epidemiology, this comes through speeding up the simulation of scenarios with large populations, where the processing time is very significant. This article introduces an epidemiological event simulation with a model based on Stochastic Cellular Automata (SCA), which implements the main features of a large-scale infectious disease: contact, neighborhood, trajectories and transmissibility. A case study is simulated with an implementation of the SCA algorithm for an infectious disease of type SEIRS (Susceptible, Exposed, Infected, Recovered and Susceptible) over a population of 1,000,000 individuals, parallelized through a process-balancing algorithm implemented in C-CUDA. The results given by the GPU-parallelized software are compared against those of a parallelized multithreaded CPU implementation, and show that the computation time can be significantly reduced through the use of C-CUDA.

Keywords: Epidemiology, Cellular Automata, GPU, Stochastic Model, CUDA, Simulation.

1. Introduction

Epidemiology is defined as the study of the distribution and determinants of health-related states or events in human populations and its application to the prevention and control of health problems [8]. Epidemiological events caused by a new virus like H1N1 [4], and by new varieties of bioterrorism, can cause severe human and economic losses.
A priority is therefore the study and simulation of the spread of pathogens in a population.

Figure 1. The triangle of epidemiology.
Figure 1 shows the factors favoring the distribution of an agent of infectious disease in a population over time and the environment, where the interactions between the environment and the host (the individuals of the population) are driven by behavior (demographic, geographic or social). The modeling of epidemiological events can be treated, from the mathematical point of view, with variants of the SIR model (Susceptible, Infected and Recovered) developed in 1927 by W. O. Kermack and A. G. McKendrick [13], which categorizes the individuals of the population into different groups according to their state during an outbreak. Such models can be simulated with systems of differential equations or with hidden Markov models. From a computational point of view, Cellular Automata (CA) are widely used [3] because of the ease with which they represent natural rule systems in discrete steps and their ability to represent the interaction between individuals through their neighborhood. This makes the CA an option for Monte Carlo simulations, where a stochastic environment can mimic the behavior of nature and different simulation strategies can be tried. On the other hand, the development of these simulations carries a high cost in processing time, since hundreds of thousands of random numbers have to be generated and millions of interactions among the individuals of the studied population have to be computed. The emergence of computer equipment with graphics processing units (GPUs), which offer a large number of cores that can perform simple operations in parallel, allows researchers to exploit this computing power [1].

2. SEIRS Model

The SEIRS epidemiological model (Susceptible, Exposed, Infected, Recovered and Susceptible) describes the course of an infectious disease as shown in Figure 2.
Starting with a susceptible population (S) that comes into contact with an infected population (I), and taking into account a period of incubation or exposure to the pathogen (E), once the infection period ends the individual enters a recovered state (R), preserving that state for a period of immunity to the pathogen; once this resistance is lost, the individual returns to the susceptible state. Such diseases are called endemic because the recovered individuals keep returning to the susceptible population, which gradually stabilizes the curves of the different populations.

Figure 2. SEIRS state diagram.
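The S -> E -> I -> R -> S cycle described above can be sketched as a discrete-time update for one individual. This is a minimal Python illustration; the period lengths are hypothetical placeholders, not the paper's calibrated influenza parameters.

```python
def seirs_step(state, timer, contact_with_infected,
               incubation=2, infectious=4, immunity=10):
    """Advance one individual one time step (one day) through the SEIRS cycle.
    `timer` counts the days already spent in the current state."""
    if state == "S":
        return ("E", 0) if contact_with_infected else ("S", 0)
    if state == "E":
        return ("I", 0) if timer + 1 >= incubation else ("E", timer + 1)
    if state == "I":
        return ("R", 0) if timer + 1 >= infectious else ("I", timer + 1)
    # state == "R": immunity wanes and the individual becomes susceptible again
    return ("S", 0) if timer + 1 >= immunity else ("R", timer + 1)
```

Because recovered individuals re-enter S, repeated application of this rule over a population produces the endemic, stabilizing curves described in the text.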
3. Cellular Automata

In the early 1950s, Von Neumann and Stanislaw Ulam conceived the concept of Cellular Automata (CA) as a model to represent physical events through discrete rules. The cells are most often arranged to constitute a regular spatial lattice [2]. Each cell changes its state based on the states of the surrounding cells, whose range may vary depending on the neighborhood (see Figure 3). Each time the rules are applied to all cells, a new generation is obtained.

Figure 3. Von Neumann (a), Moore (b), Extended Moore (c) and Global (d) neighborhoods.

3.1 Formalism

A cellular automaton is a network of automata defined on a regular graph, with a finite state machine that is identical for each unit cell. Generally, cellular automata are defined on Euclidean lattices of one or two dimensions, forming a homogeneous graph on whose integral points the cell units are placed [3]. A cellular automaton is defined over a Cayley graph Γ = (V, E) (see Figure 4) [7], where V and E are the vertices and arcs [3]. At each vertex there is a copy of a finite state machine with an input alphabet Q^d (d copies of the state alphabet Q) and a local transition function δ.

Figure 4. Cellular automaton represented in a Cayley graph.

3.2 Contact Model

The way in which each cell interacts with the others is conditioned by the evolution rules of the CA. In the case of the SEIRS model, the contact distribution is defined as follows: a) the number of contacts across the whole population is given by (N × CR)/2, where N represents the total population of the CA and CR is the average number of contacts
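The neighborhoods of Figure 3 can be sketched concretely; a small Python illustration generating an extended Moore neighborhood with periodic (toroidal) boundaries, the boundary condition the implementation section describes.

```python
def moore_neighborhood(i, j, rows, cols, level=1):
    """Cells within Chebyshev distance `level` of (i, j), with periodic
    (toroidal) boundaries so every cell has the same number of neighbors."""
    cells = []
    for di in range(-level, level + 1):
        for dj in range(-level, level + 1):
            if di == 0 and dj == 0:
                continue
            cells.append(((i + di) % rows, (j + dj) % cols))
    return cells
```

With level = 1 this is the classic Moore neighborhood (8 cells); with level = 4, the 4-level extended Moore neighborhood used in the case study (80 cells).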
per individual; b) the number of contacts across the infected population is given by (I × CR)/2, where I represents the infected population of the CA and CR is the average number of contacts per individual. In both distributions the number of contacts is divided by two because every interaction involves two contacts.

4. Basic Concepts of CUDA and the GPU

GPUs are many-core processors that enable high performance. Originally used for entertainment and graphics processing, their capacity to perform multiple operations simultaneously offers many possibilities for general-purpose parallel computing, with problems such as very large dataset analysis, text processing, multimedia and data processing, and, in general, any of the so-called Big Data problems. Currently there are different ways to do this, such as OpenCL, CUDA, etc. This article focuses on GPU processing with CUDA (Compute Unified Device Architecture), for which we use an NVIDIA GeForce GTX 260 with 30 MPs (multiprocessors), 8 SPs (scalar processors) per MP, and 1 GB of global memory. CUDA is an API provided by NVIDIA Corporation. It is natively implemented in the C language, although there are implementations of the API for other languages (see Figure 5), including Java (through JCuda). It allows great performance on problems with high parallelism by making it possible to speed up the software. However, not everything is parallelizable, and we need to identify which parts of our algorithm can be optimized, and organize the parallel memory accesses so as to avoid bottlenecks. Parallelizing programs primarily allows solving problems, and running programs, in less time, and makes it possible to work on problems that require a lot of computing power within a reasonable run time. Figure 6 shows how the GPU is organized internally, with concepts such as thread, block and grid.
The grid has two dimensions of blocks, and each block has three dimensions of threads, where each thread has an address.

Figure 5. Development environment for the CUDA architecture [11].
Figure 6. Organization of threads, blocks and grids on the GPU [11].
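The grid/block/thread hierarchy determines each thread's global address. A Python sketch of the standard index computation (the corresponding CUDA expression is given in the comment); the helper names are illustrative.

```python
def global_thread_id(block_idx, block_dim, thread_idx):
    """Flatten a (block, thread) coordinate into a unique global index,
    mirroring the CUDA expression blockIdx.x * blockDim.x + threadIdx.x
    for the one-dimensional case."""
    return block_idx * block_dim + thread_idx

def cell_of_thread(gid, cols):
    """Map a global thread index to a (row, col) cell of the CA matrix."""
    return gid // cols, gid % cols
```

This is how one thread of the kernel can be made responsible for one cell of the 1000 x 1000 CA lattice used in the case study.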
5. Implementation

This paper presents a comparative analysis of the performance obtained with CPU parallelism, through a thread pool programmed in the C# language, and with GPU parallelism in C-CUDA. The CPU implementation is developed in C# through a thread-pool manager that handles a structure of 4 threads and an asynchronous process balancer that resolves the queue of messages between processes; for this case, being a high-level language, there is no control over the low-level details of message passing. The parallelized code uses a data source in shared memory in order to exploit the underlying processor topology. The GPU implementation is given by a decomposition of the CA matrix into processes, with a shared-memory model. A toroidal hyperspace with periodic boundary conditions is implemented in order to preserve a uniform neighborhood in every region of the matrix. The distribution of the processes is obtained through a matrix decomposition in which each sub-block of the CA is a smaller matrix; these blocks are mapped to processes assigned by the process balancer. Each of the generated matrices has the same structure as the original but with reduced dimensions, so that it can be distributed among the GPU cores with a multi-start. Finally, we can note that the block decomposition shown is reminiscent of the one used in the Cholesky decomposition [12], an example of the divide-and-conquer method.

6. Experimental Results

For this case we used data from the influenza virus, an RNA virus of the family Orthomyxoviridae, considering a closed population in an idealized environment where demographic or geographic variables are not considered (see Table II).

Table II. Simulation parameters
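The block decomposition of the CA matrix described above can be sketched as follows; a Python illustration in which the tile size is an assumption (the paper's formula for the decomposition is not reproduced in this extraction).

```python
def decompose(rows, cols, block):
    """Split a rows x cols lattice into tiles of at most block x block cells;
    each tile is described by (row0, col0, height, width) so it can be
    assigned to a worker (on the GPU, to a grid block)."""
    tiles = []
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            tiles.append((r0, c0, min(block, rows - r0), min(block, cols - c0)))
    return tiles
```

Each tile has the same structure as the original lattice but reduced dimensions, which is what allows the balancer to distribute them among the GPU cores with a multi-start.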
Rules for the spread of the disease (see Figure 2): an individual (S) changes its state to (E) after interaction with (I) and evolves according to its incubation and infectious periods; after staying in (R) for its resistance period, the individual returns to state (S). An extended Moore neighborhood of 4 levels is considered. A multi-start per event is considered, given by the random distribution of the index cases. A periodic boundary condition is considered, so that all individuals have the same neighborhood rules.

Figure 7. SEIRS curve.

Figure 7 shows the curve of disease progression resulting from the simulation, which is consistent with the endemic curve of the SEIR model [8].

7. Performance Analysis

Due to the nature of the CA, the generation of random numbers is essential to the performance of the application. As shown in Figure 8, generating the random numbers takes 18 seconds (0.3 minutes) on the CPU, in contrast to the GPU version, which takes only 3 seconds (0.05 minutes), a factor of about 6 times faster. Knowing that the distribution is uniform, and using the same base (seed) for generating the random numbers, this determines a considerable reduction in computation time for GPU processing.

Figure 8. CPU vs GPU performance contrast.

On the side of the contacts between the individuals of the CA, with a population of one million cells in a 1000 x 1000 matrix, per iteration the CPU takes 260 seconds while the GPU solves it in 35 seconds, a factor of about 7 times faster. Table III shows the performance of the simulated CA, considering each iteration of the CA as one day, which is observed to be consistent with the values given in Figure 9. The simulation was run 10 times, providing similar results with a deviation of 0.03 seconds for the GPU version and 0.18 seconds for the CPU version.
The response time was taken from a log record at the start and at the end of the simulation.
Figure 9. CPU vs GPU contact-rate time contrast.
Table III. Results of the case study.

8. Conclusions

In this paper we present a performance analysis of an epidemiological event simulation with CA over large populations, using different methods of parallel processing. There is a difference of 7 times between the performance reported by the GPU implementation and that reported by the CPU implementation. This decrease in computation time allows the simulation of highly populated cities, enables the analysis of epidemiological events, and supports the decisions of public health systems. Additionally, work in progress is focused on the use of different contact distributions in the simulations, contrasting the different implementations in order to corroborate the data presented here.

9. Acknowledgments

The authors would like to thank the Mexico State Autonomous University (UAEM) and the Mexico State Council for Science and Technology (COMECYT) for supporting young people in academic activities.

10. References

[1] Chang Xu, Steven R. Kirk and Samantha Jenkins (2009), Tiling for Performance Tuning on Different Models of GPUs, Second International Symposium on Information Science and Engineering 2009.
[2] Armin R. Mikler, Angel Bravo-Salgado and Courtney D. Corley (2009), Global Stochastic Contact Modeling of Infectious
Diseases, International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing 2009.
[3] Héctor Cuesta-Arvizu, Ángel Bravo-Salgado, Armin R. Mikler and Adrián Trueba-Espinosa (2011), Modelado para estudio de brotes epidémicos usando un Autómata Celular Estocástico Global, IEEE ROC&C 2011, CP-12.
[4] Jorge X. Velasco-Hernandez and Maria Conceicao A. Leite (2011), A model for A(H1N1) epidemic in Mexico including social isolation, Salud Pública de México, Vol. 53, No. 1.
[5] Stephen Wolfram (2002), A New Kind of Science, Wolfram Media.
[6] Linda J. S. Allen (2005), An Introduction to Stochastic Epidemic Models, Department of Mathematics and Statistics, Texas Tech University.
[7] Douglas B. West (1996), Introduction to Graph Theory, Prentice Hall.
[8] Ray M. Merrill (2010), Introduction to Epidemiology, Fifth Edition, Jones and Bartlett Publishers.
[9] Qihang Huang, Zhiyi Huang, Paul Werstein and Martin Purvis (2008), GPU as a General Purpose Computing Resource, Proceedings of the 2008 Ninth International Conference on Parallel and Distributed Computing, Applications and Technologies.
[10] NVIDIA, NVIDIA CUDA Programming Guide, compute/cuda/2_1/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.1.pdf.
[11] NVIDIA, Technical Brief: NVIDIA GeForce GTX 200 GPU Architectural Overview, GeForce_GTX_200_GPU_Technical_Brief.pdf.
[12] David Kirk and Wen-mei W. Hwu (2008), Programming Massively Parallel Processors: the CUDA Experience, presented at the Taiwan 2008 CUDA Course.
[13] Kermack, W. O. and McKendrick, A. G. (1927), A Contribution to the Mathematical Theory of Epidemics, Proceedings of the Royal Society.
[14] Philipp K. Janert (2010), Data Analysis with Open Source Tools, O'Reilly.
Parallelizing a New Algorithm for Determining the Matrix Pfaffian by Means of Mathematica Software

González, H.E., Carmona L., J.J.
Instituto Nacional de Investigaciones Nucleares, Gerencia de Sistemas, AP , México, D.F.

Abstract

The only known method for calculating the Pfaffian of a skew-symmetric matrix requires the calculation of its determinant and then of the square root of the value obtained. The new method proposed in this paper consists in the successive application of the Gauss-Bareiss linear transformation to a skew-symmetric matrix, followed by a numerical simplification, in order to arrive at the same result as the known method but without using the value of the determinant. The new method has been tested by means of a code developed on a Linux platform, on a cluster of AMD Barcelona processors.

Keywords: Pfaffian, skew-symmetric matrix, perfect matchings, domino tilings

1. Introduction

In linear algebra, every square matrix has an associated polynomial, its characteristic polynomial, obtained from its so-called secular or characteristic equation. This polynomial encodes several important properties of the matrix, especially its eigenvalues, its determinant and its trace (or spur). The characteristic polynomial is the algebraic expression on the left-hand side of the characteristic equation:

det(A - xI) = 0   (1)

where det represents the determinant of the square matrix A, I is the identity matrix of the same dimension, and x is a variable.
The characteristic polynomial of a square matrix A can be described by the following expression:

(-1)^n (x^n - c_{n-1} x^{n-1} + c_{n-2} x^{n-2} - ... ± c_0) = 0   (2)
where c_{n-1}, c_{n-2}, ..., c_0 are the coefficients, for whose calculation a wide variety of methods has been devised. After the earliest one, by Le Verrier [1], there followed improvements by D. K. Faddeev and I. S. Sominsky [2], enabling the computation of the inverse matrix and the eigenvalues. Working on the design of efficient parallel algorithms for determinant calculation, Valiant [3] developed a way to obtain the characteristic coefficients according to the method of Paul Samuelson, but through a purely combinatorial formalism. Thus Valiant presented his closed-walks theorem, within the frame of graph theory, a branch of operations research and applied mathematics, whose demonstration required the Samuelson algorithm. This was the first attempt at explaining the determinant calculation by means of combinatorial manipulation alone. This achievement, along with the purely combinatorial proofs of the Cayley-Hamilton theorem by Rutherford [4] and Straubing [5], inspired Mahajan and Vinay [6] to discover a combinatorial algorithm for calculating the characteristic coefficients whose proof is independent of linear transformations. The same authors [7] then developed a combinatorial arsenal intended to analyze and interpret algorithms for calculating characteristic polynomials and determinants. Yet all these papers constitute a collection of new interpretations and proofs of already known results and, often, they have paved the way for a group of combinatorialists to prove matrix identities using graph theory. Two main schools of thought can be identified in solving the problem of finding the characteristic coefficients: the algebraic and the combinatorial approaches; the keen reader is referred to [8, 9].
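The Faddeev-Sominsky refinement of Le Verrier's method mentioned above can be sketched as follows. This is the textbook Faddeev-LeVerrier recursion in Python, not the paper's code: each coefficient is obtained from matrix products and traces alone.

```python
def char_poly(A):
    """Faddeev-LeVerrier: coefficients [1, c_{n-1}, ..., c_0] of
    det(xI - A) = x^n + c_{n-1} x^{n-1} + ... + c_0."""
    n = len(A)

    def matmul(X, Y):
        return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]

    M = [[0.0] * n for _ in range(n)]
    coeffs = [1.0]
    c = 1.0
    for k in range(1, n + 1):
        for i in range(n):          # M_k = A (M_{k-1} + c_{k-1} I)
            M[i][i] += c
        M = matmul(A, M)
        c = -sum(M[i][i] for i in range(n)) / k
        coeffs.append(c)
    return coeffs
```

For A = [[2, 1], [1, 2]] this yields x^2 - 4x + 3, whose roots 1 and 3 are the eigenvalues, illustrating why these recursions also deliver the determinant (the constant term, up to sign) and the trace.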
It is, finally, worth mentioning that the problem of finding the eigenvalues of a square matrix is ever present in a vast range of subjects, such as nuclear science and technology [10], genetics [11], as well as mechanical, chemical, electrical and electronic engineering, astrophysics, cosmology, etc.

2. The Pfaffian and a New Method

2.1 Pfaffians

The determinant of an antisymmetric matrix, commonly called a skew-symmetric matrix by mathematicians (A^T = -A), turns out to be the square of an expression called the Pfaffian [12]; for instance,

det A = det [  0      a_12   a_13   a_14
              -a_12   0      a_23   a_24
              -a_13  -a_23   0      a_34
              -a_14  -a_24  -a_34   0   ]
      = (a_12 a_34 - a_13 a_24 + a_14 a_23)^2   (3)

The determinant of a skew-symmetric matrix of odd order is always null; this property was proven by Cayley in 1849 [13]. More recently, Knuth [14] pointed out, in a brief history of the Pfaffian, that it is but a determinant, closely related to that of the skew-symmetric matrix. Johann Friedrich Pfaff introduced the function named
after him in 1815 [12] while developing a general method to solve first-order partial differential equations. Formally speaking, the Pfaffian of a skew-symmetric matrix A ∈ R^{2n×2n} can be defined as [15]:

Pf(A) = Σ_{perfect matchings M} sign(M) · weight(M)   (4)

where the perfect matchings are M = {(i_1, j_1), (i_2, j_2), ..., (i_n, j_n)}, assuming that i_k < j_k for each k. The sign of M corresponds to that of the permutation

π = ( 1    2    3    4   ...  2n-1  2n
      i_1  j_1  i_2  j_2 ...  i_n   j_n )   (5)

considered as a permutation of 1, 2, ..., 2n-1, 2n, and the weight of M is

weight(M) = Π_{k=1}^{n} a_{i_k j_k}   (6)

The Pfaffian deserves particular interest given its close relationship to matchings, as seen in [16, 17]. For instance, in some graphs, such as planar ones, it is possible to count the perfect matchings in polynomial time by means of Pfaffians. The relationship shows that determinants can be seen as a particular case of Pfaffians, corresponding to bipartite graphs. In this sense, the Pfaffian plays a more fundamental role than expected, which deserves a correct understanding, despite the prevalence of its associated determinant in most applications. The present study aims at contributing to that understanding by proposing a method of calculating the Pfaffian independently of the use of the determinant.

2.2 The New Partition-Transformation Method

A new algorithm to calculate the characteristic polynomial of a square matrix has been proposed by González [18], and it constitutes an antecedent of the presently proposed method. The algorithm is inspired by one of the most powerful paradigms of algorithmic design: divide and conquer. Thus, the algorithm is based upon the recursive solution of a problem by splitting it into two or more subproblems of similar nature, continuing the process to the point at which the subproblems are simple enough to be solved directly.
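Definition (4)-(6) can be sketched directly as a recursive expansion over perfect matchings. This exponential-time Python illustration is useful only as a reference check for small matrices, not as a practical algorithm; 0-based indices are used internally.

```python
def pfaffian(A, idx=None):
    """Pfaffian of a skew-symmetric matrix via the matching expansion:
    match the first remaining index with each later index j in turn, with
    alternating sign, and recurse on the rest."""
    if idx is None:
        idx = list(range(len(A)))
    if not idx:
        return 1.0
    i0, rest = idx[0], idx[1:]
    total = 0.0
    for pos, j in enumerate(rest):
        sign = (-1) ** pos              # sign of the induced permutation
        sub = rest[:pos] + rest[pos + 1:]
        total += sign * A[i0][j] * pfaffian(A, sub)
    return total
```

For a 4 x 4 skew-symmetric matrix this reproduces a_12 a_34 - a_13 a_24 + a_14 a_23, whose square is the determinant shown in equation (3).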
At the end, a combination of the solutions to the subproblems provides a solution to the original problem. This principle can be adapted to the calculation of a characteristic polynomial on the basis of the iterative decomposition into submatrices obtained from the original one by means of two processes: partition (S) and transformation (M). The polynomial coefficients become the sums of the traces of all the submatrices obtained from the application of either the linear transformation called Gauss-Bareiss or the division-free transformation.
2.3 Partition Process

The partition process can be defined as the elimination of the first row and the first column of a matrix: S maps the n×n matrix (a_ij) to the (n-1)×(n-1) submatrix formed by the entries a_ij with i, j ≥ 2 (7). This step requires neither any numerical operation nor the assignment of any additional memory.

2.4 Modification Process

The Gauss-Bareiss transformation G acts on the trailing (n-1)×(n-1) block of the matrix, replacing each entry by

a'_ij = (a_11 a_ij - a_i1 a_1j) / r   (8)

where the pivot is initialized as r = 1 and, every time the transformation is applied, the pivot is redefined as the first element of the input matrix (r = a_11). At the end of the transformations, the matrix is reduced, in the limit, to one single number. The process can be illustrated by Figure 1 [19].

Figure 1. Tree structure of the algorithm in the case of a matrix with n = 5.

The original matrix is found at the top of the tree and, starting at the left-hand side, successive partitions are carried out until a submatrix of size n = 2 is produced. At the limit of possible partitions, one single number is obtained by applying the Gauss-Bareiss modification; such a number is added to the traces of all the matrices modified once, e.g. C_3 in Figure 1. On the right-hand side, the original matrix is modified until reaching size n-1; thus, eventually, the limit is attained where the matrix is reduced to a single number by applying the Gauss-Bareiss modification. This number is added to the traces of the n-times-modified matrices, in the example above, C_0. It should be noticed that this single number corresponds to the value of the determinant of the original matrix, whereby the trace is unique. Clearly, the partition-modification process is applied to every obtained submatrix, so that all the submatrix traces contributing to the coefficients C_1 and C_2 are added, leading to those of the characteristic polynomial.
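The right-hand-side trajectory of the tree (repeated Gauss-Bareiss modifications until a single number, the determinant, remains) can be sketched as follows. This Python illustration uses the standard fraction-free Bareiss update as a stand-in for the paper's exact formulation, which this extraction does not preserve.

```python
def bareiss_step(A, r):
    """One Gauss-Bareiss modification: drop the first row and column and set
    a'_ij = (a_11 * a_ij - a_i1 * a_1j) / r. Returns (new matrix, new pivot)."""
    n = len(A)
    B = [[(A[0][0] * A[i][j] - A[i][0] * A[0][j]) / r for j in range(1, n)]
         for i in range(1, n)]
    return B, A[0][0]

def determinant(A):
    """Apply the transformation, with the pivot initialized to 1, until a
    single number remains; with exact (integer) input every division is
    exact, and the final number is det(A)."""
    r = 1
    while len(A) > 1:
        A, r = bareiss_step(A, r)
    return A[0][0]
```

For example, [[1, 2, 3], [4, 5, 6], [7, 8, 10]] reduces in two steps to -3, its determinant, with every intermediate entry remaining an integer; this exactness is what the Pfaffian method below exploits.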
Where Supercomputing Science and Technologies Meet

2.5 Numerical Example 1

In this case, the characteristic polynomial of the matrix becomes the quartic given in equations (10) and (11).

3. Calculating the Pfaffian with the New Method

It can be seen that, following the right-hand-side trajectory of the tree, the determinant is calculated. This will be exemplified in the case of a skew-symmetric matrix as far as the transformation n−2, keeping the matrix size an even number so as to prevent it from going to zero.

3.1 Numerical Example 2

The calculation of the Pfaffian of a size-4 skew-symmetric matrix can be illustrated as follows. It should be observed how the Pfaffian is obtained by dividing the last pivot (in brackets) by the successive ones, two transformations sufficing instead of three. Obviously, were a third transformation performed, the determinant would be obtained; 4−2 = 2 Gauss-Bareiss transformations are thus enough to arrive at the expression.

3.2 Numerical Example 3

Now the Pfaffian of a size-6 skew-symmetric matrix will be calculated. As in the size-4 example, the calculation requires only the pivots after the exchange, and only 6−2 = 4 Gauss-Bareiss transformations in order to obtain the result.
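The size-4 case can be checked independently: for a 4×4 skew-symmetric matrix the Pfaffian has the closed form a12·a34 − a13·a24 + a14·a23, and its square equals the determinant. The sketch below verifies that identity; the helper names are illustrative, and the cofactor determinant stands in for whichever determinant routine is at hand.

```python
def det(m):
    """Determinant by cofactor expansion along the first row (small matrices)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j in range(len(m)):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total

def pfaffian4(a):
    """Closed-form Pfaffian of a 4x4 skew-symmetric matrix."""
    return a[0][1] * a[2][3] - a[0][2] * a[1][3] + a[0][3] * a[1][2]

def skew4(a12, a13, a14, a23, a24, a34):
    """Build a 4x4 skew-symmetric matrix from its upper-triangular entries."""
    return [[0,    a12,  a13,  a14],
            [-a12, 0,    a23,  a24],
            [-a13, -a23, 0,    a34],
            [-a14, -a24, -a34, 0]]

A = skew4(1, 2, 3, 4, 5, 6)
assert pfaffian4(A) ** 2 == det(A)   # Pf(A)^2 = det(A)
```

This is the check the text alludes to: the determinant of a skew-symmetric matrix of even size is the perfect square of its Pfaffian.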
This iteration-reduced calculation is crucial for its mathematical interpretation and highly relevant for symbolic calculus, as seen in the analytical expressions of the Pfaffians of matrices of sizes 4 and 6, which differ widely from each other. The conventional calculation using the determinant leads to an analytical expression only after a strenuous process, as in the case of a size-6 matrix. Thus, in addition to the determinant calculation, a long factorization follows in order to simplify the expression, all on top of the intrinsically slow symbolic mathematical computer processing. One should realize the number of terms that have to be analyzed, compared and, finally, verified prior to the calculation, so as to constitute a perfect square trinomial, which ensures that the expression is a Pfaffian. The high consumption of computer memory is not difficult to imagine. On the contrary, the new method renders the analytical result directly, without any additional complications resulting from the obligatory use of the associated determinant. Here the permanent is involved, which is similar to the determinant except that every product of the sum is written with a positive sign.

3.3 Numerical Example 4

The traditional method. Let us go back to the above example in order to calculate the determinant of the size-4 matrix by means of minors and cofactors, beginning with the first row.

The new proposed method. As an additional example, the Pfaffian of a size-4 matrix is obtained in analytic form, according to the new proposed method.
Observe the way in which the calculation expands into algebraic terms that must later be simplified by the identification of a perfect square trinomial. The term to be squared is precisely the Pfaffian, which could have been obtained without calculating the determinant. These instances reveal how powerful the proposed method can be, particularly in the case of large matrices, where the number of calculated algebraic terms increases enormously.

4. Computational implementation

The following part of the study has been carried out on the Mathematica platform, as it allows both symbolic and numerical calculation. From the latter point of view, the possibility of parallelizing the determinant calculation was explored on the basis of some mathematical theorems. These indicated that parallelization is possible with matrices of sizes 4,000 to 10,000, randomly generated with integer entries in the [−20, 20] range, provided that the method proposed here is applied, as seen in Table I.

In Table I, Time refers to the duration of the calculation in seconds, whereas Sp stands for speed-up, the acceleration quantified relative to the number of processor cores in use. Notice that, for one and the same matrix size, the shortest time is not necessarily achieved with the maximal number of processors but with the higher Sp values, which indicates that the workload is distributed more uniformly among the processors, improving the efficiency. However, as the matrix size grows, the time shortens by using more processors, suggesting a massively parallel computation. It has to be admitted that a greater number of processor cores, actually available in the cluster, should have been considered in Table I. However, the Mathematica license used only allows up to 30; otherwise its cost would increase beyond the budget.
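The workload split reported in Table I can be imitated in miniature outside Mathematica. Under the proposed transformation a pivot step reduces to independent 2×2 elementary determinants (the reduction detailed in Example 5 below), which are embarrassingly parallel; the sketch uses Python's concurrent.futures as an illustrative stand-in for Mathematica's parallel kernels, with helper names that are not from the paper.

```python
from concurrent.futures import ThreadPoolExecutor

def elementary_blocks(a, i):
    """The 2x2 blocks A[[{i,k},{i,j}]] for k, j ranging over i+1..n-1."""
    n = len(a)
    return [[[a[i][i], a[i][j]],
             [a[k][i], a[k][j]]] for k in range(i + 1, n)
                                 for j in range(i + 1, n)]

def det2(b):
    """Determinant of a 2x2 block."""
    return b[0][0] * b[1][1] - b[0][1] * b[1][0]

def parallel_row_determinants(a, i, workers=4):
    """Evaluate every elementary determinant of pivot row i concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(det2, elementary_blocks(a, i)))

A = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]
print(parallel_row_determinants(A, 0))  # [-3, -6, -6, -11]
```

Each block is independent of the others, so the map distributes evenly across the workers; this is the fine-grain decomposition that a massive number of cores would exploit.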
That is why this first approach has focused rather on exploring the parallelization of the process.

4.1 Numerical Example 5

In order to clarify the transformation and workload distribution mechanics, the following example of parallel calculation distribution under the proposed linear transformation is presented:
By means of the following command

B = ParallelTable[A[[{i, k}, {i, j}]], {k, i + 1, Length[A]}, {j, i + 1, Length[A]}] // MatrixForm

the matrix A above is reduced to one dimension and re-expressed in the form of 2×2 elementary determinants to be solved in parallel among all the processors.

4.2 Analytical (Symbolic) Example 6

The Pfaffian of the skew-symmetric matrix of size 16 below can be calculated. This matrix has been obtained from a planar chess-like 4×4 array, where the white and black squares represent the nodes of a bipartite graph with a Pfaffian orientation. This incidence matrix allows the number of perfect matchings to be obtained in polynomial time by calculating the Pfaffian algebraically [21]. The following modified matrix is then obtained once the partial pivot has been changed so as to avoid the first matrix element being null.

The traditional method. With the Mathematica command one obtains the result in seconds. In this manner the results for the numerical matrices shown in Table I were obtained. It can be concluded that only with a massive number of processor cores would it be possible to notice any advantage with respect to the proposed sequential process.

The proposed method. By programming in a sequential way,
the same result is found in seconds. The advantage of the new method is clear.

5. Conclusions

Regarding the previous results, the advantages in processing time yielded by handling symbolic matrices in serial form can clearly be seen. However, further testing of GPU programming in the C language is required. All in all, the conveniently modified Gauss-Bareiss linear transformation applied to a skew-symmetric matrix, either analytically or numerically, can enable the parallelization of the algorithm, provided that a massive number of processor cores is available. To this purpose the algorithm has to be programmed in the C language, which can be compiled on any GPU. The latter is the type of graphics processor which allows the massive incorporation of cores for the calculation of the Pfaffian by means of fine-grain parallelization. In this manner something never done before (not even with the Mathematica software) would be accomplished: running the determinant routine in parallel.

6. References

[1] Le Verrier, U.J.J., "Sur les variations séculaires des éléments elliptiques des sept planètes principales: Mercure, Vénus, la Terre, Mars, Jupiter, Saturne et Uranus", J. Math. Pures Appl., 4 (1840).
[2] Faddeev, D.K., and Sominskii, I.S., Sbornik zadach po vysshei algebre (Collection of Problems on Higher Algebra), 2nd ed., Gostekhizdat, Moscow, Russia (1949).
[3] Valiant, L.G., "Why is Boolean complexity theory difficult?", in S. Paterson, ed., London Math. Soc. Lecture Notes Ser., 169 (1991).
[4] Rutherford, D.E., "The Cayley-Hamilton theorem for semirings", Proc. Roy. Soc. Edinburgh Sect. A, 66 (1964).
[5] Straubing, H., "A combinatorial proof of the Cayley-Hamilton theorem", Discrete Math., 43 (1983).
[6] Mahajan, M., Vinay, V., "Determinant: combinatorics, algorithms, complexity", Chicago J. Theoret. Comput. Sci., article 5 (1997).
[7] Mahajan, M., Vinay, V., "Determinant:
old algorithms, new insights", SIAM J. Discrete Math., 12, No. 4 (1999).
[8] Davenport, J.H., Siret, Y., Tournier, E., Computer Algebra: Systems and Algorithms for Algebraic Computation, Academic Press, New York, USA (1988).
[9] Faddeev, D.K., Faddeeva, V.N., Computational Methods of Linear Algebra, W.H. Freeman and Company, San Francisco, USA (1963).
[10] Duderstadt, J.J., Hamilton, L.J., Nuclear Reactor Analysis, John Wiley and Sons, New York, USA (1976).
[11] Barash, D., "Spectral decomposition of the Laplacian matrix applied to RNA folding prediction", Proceedings of the IEEE Computational Systems Bioinformatics Conference (2003).
[12] Muir, T., A Treatise on the Theory of Determinants, MacMillan and Co., London, 1882; repr. Dover, New York.
[13] Cayley, A., "On the theory of permutants", Cambridge and Dublin Mathematical Journal, 7 (1852); reprinted in his Collected Mathematical Papers, 2.
[14] Knuth, D.E., "Overlapping Pfaffians", Electron. J. Comb., 3 (1996), No. 2, article R5, 13 pp.
[15] Rote, G., "Division-free algorithms for the determinant and the Pfaffian: algebraic and combinatorial approaches", Lectures of the Graduate Program Computational Discrete Mathematics, Berlin.
[17] Stembridge, J.R., "Nonintersecting paths, Pfaffians, and plane partitions", Adv. Math., 83.
[18] González, H.E., "Un método nuevo para los coeficientes del polinomio característico", XVI Congreso Técnico Científico ININ-SUTIN, México.
[19] Ortega Pacheco, D., Ortega-González, V., González, H.E., Figueroa-Nazuno, J., "A new computational method for finding the characteristic polynomial", Décimo Sexta Reunión de Otoño de Comunicaciones, Computación, Electrónica y Exposición Industrial (ROC&C 2006), Memorias, Acapulco, Gro., México.
[20] Ortega Pacheco, D., Ortega-González, V., González, H.E., Figueroa-Nazuno, J., "Análisis e implementación de un nuevo algoritmo para la determinación de los coeficientes de la ecuación secular", Informe Técnico, Centro de Investigación en Computación, IPN, México.
[21] Kasteleyn, P.W., "Graph theory and crystal physics", in: F. Harary, ed., Graph Theory and Theoretical Physics, Academic Press, New York (1967).
Overthrow: A New Algorithm for Multi-Dimensional Optimization Composed by a Single Rule and Only One Parameter

Herman Guillermo Dolder
Head of Engineering and Infrastructure Department
Instituto Provincial de Vivienda, Tierra del Fuego, Argentina
Francisco Gonzales 651, Ushuaia, Tierra del Fuego, Argentina

Abstract

Optimization techniques using mathematical algorithms tend to suffer from the same limitations as their subject of analysis, that is, stagnation in local maxima. In this case, the failure to obtain significant new advances in algorithm performance is due to the asymptotic approach to the so-called wall of complexity. The purpose of this paper is to introduce a new algorithm developed from scratch, that is, without relying on, modifying, or expanding any existing algorithm, using as its fundamental premise the search for extreme simplicity. The algorithm has been named OVERTHROW, and consists of a single rule and a single parameter. Despite its extreme conceptual simplicity, this algorithm achieves a promising balance between exploration of the general environment and optimization of the local (or global) maxima it finds. In this paper some striking features of this algorithm are explored, such as how its overall behavior appears complex, organic, and at times almost intelligent, even though its basic component is extremely straightforward. We conclude that the behavior of the algorithm is strongly molded by the shape of the optimization function. This also yields, as a byproduct of the analysis, a sort of pattern recognition or fingerprint of the optimization function tested, which allows the analysis tool to be adjusted to address the problem more specifically. We finally present some of the variations that can be implemented from this basic algorithm, explaining the scope, advantages and disadvantages observed in each.
Keywords: optimization algorithm, multidimensional

1. Introduction

Since its beginnings, research on optimization techniques for multidimensional problems has regularly produced revolutionary results, which quickly branched out into many areas of research. Most of these techniques were inspired by natural phenomena, whether physical, chemical,
biological or even sociological. As an example we will name some of these techniques as emblematic of each decade:

1960s: Random Search
1970s: Genetic Algorithms
1980s: Tabu Search
1990s: Particle Swarm Optimization
2000s: Cuckoo Search

While these techniques are usually very simple in their inception, in the new lines of research that arise from them simplicity is often sacrificed in pursuit of effectiveness, so that, sooner or later, all reach a level of complexity that makes it almost impossible for them to achieve significant new results. This state corresponds to the complexity wall and is comparable to a local maximum in the search for an optimal optimization algorithm. Fortunately, despite the vast production that this field of research has shown so far, there is evidence that it is still broad enough to allow exploration and new findings in areas as yet unvisited. This paper aims to present a new optimization technique called Overthrow, explaining its main features, and finally outlining some of its possible variations.

2. Objectives

The main challenge in the field of optimization algorithms is to develop techniques that achieve an appropriate balance between the extensive search of the environment (exploration) and the intensive search of the maxima found in the process (exploitation). It is also important, since metaheuristic techniques cannot guarantee that a global maximum will be found, that there be some clear way to determine when the analysis should be terminated. In turn, for all practical purposes, the best techniques are those that require the fewest a priori definitions, which usually come in the form of initial parameters. Furthermore, considering that the complexity of the algorithm adds, in computational time, to the complexity of the function to be optimized, it is important for algorithms to be as simple as possible.
Finally, in order to take full advantage of new architectures, it would be highly desirable for the algorithm to be easily parallelizable. In short, a good optimization algorithm should achieve the following objectives:

- Achieve a balance between exploration and exploitation
- Define when to end the analysis
- Have few initial parameters
- Be simple in computational terms
- Be easily parallelizable

3. Proposed methodology

Just like many previous algorithms, Overthrow is based on the interaction of a group of particles, in which the multidimensional position of each particle corresponds to a tentative solution to the problem being optimized. In contrast to the complex behavior of the global set, this algorithm is characterized by an extremely simple interaction between the particles. Because of its simplicity, this algorithm initially requires the definition of two parameters, namely the quantity of particles to be used (n) and the ratio of final distance to initial distance between particles (s), and
one rule, which governs the interaction of the particles. However, as will be discussed later, for most practical cases the value of s can be fixed (or set as a variable dependent on n) without affecting the performance of the algorithm, thus leaving n as the only parameter that needs to be defined. The name of the algorithm, Overthrow, refers to the rule that governs the interaction between particles, which can be expressed as follows: first, the algorithm randomly selects a pair of particles and compares their relative fitness. Then, based on this comparison, it makes the less fit particle overthrow the fitter one, taking its place and forcing the overthrown one to move to a new position in the multidimensional space. This new position is determined by keeping the original direction between the two particles and varying (shortening) their distance by the preset ratio s. In this way, the overthrown particle ends up in a position in the multidimensional space opposite to that originally occupied by the particle that overthrew it.

3.1 Mathematical description of the methodology

The foregoing can be expressed mathematically as follows. Given F, the multidimensional fitness function to maximize, and calling d the dimensionality of the fitness function, the initial parameters are established:

n = number of particles
s = ratio of final distance to initial distance

An initial population is randomly generated, where P_i = {x_i1, x_i2, ..., x_id} is the i-th particle. A pair of particles is randomly selected, P_u vs. P_v. We compare the fitness of both particles and modify their positions accordingly:

If F(P_u) < F(P_v):
  P_u = {x_v1, x_v2, x_v3, ..., x_vd}  (P_u takes the place of P_v)
  P_v = {x_v1 + s(x_v1 - x_u1), x_v2 + s(x_v2 - x_u2), x_v3 + s(x_v3 - x_u3), ..., x_vd + s(x_vd - x_ud)}  (P_v is moved to a place opposite the original P_u)

If F(P_u) > F(P_v):
  P_v = {x_u1, x_u2, x_u3, ..., x_ud}  (P_v takes the place of P_u)
  P_u = {x_u1 + s(x_u1 - x_v1), x_u2 + s(x_u2 - x_v2), x_u3 + s(x_u3 - x_v3), ..., x_ud + s(x_ud - x_vd)}  (P_u is moved to a place opposite the original P_v)

If F(P_u) = F(P_v):
  P_u and P_v remain unchanged.

Here F(P_i) is the result of evaluating the fitness function F on the i-th particle.
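The update rule above can be transcribed directly; the sketch below is a literal Python rendering of the published rule, with function names and plain-list particles as illustrative choices.

```python
def overthrow_step(p_u, p_v, fitness, s):
    """Apply one Overthrow comparison to particles p_u and p_v.

    The less fit particle takes the fitter particle's position ("overthrows"
    it); the overthrown particle jumps past its old position, away from the
    challenger, with the pair's distance scaled by the ratio s.
    Returns the new (p_u, p_v) positions.
    """
    f_u, f_v = fitness(p_u), fitness(p_v)
    if f_u < f_v:
        new_u = list(p_v)                                     # p_u takes p_v's place
        new_v = [xv + s * (xv - xu) for xu, xv in zip(p_u, p_v)]
        return new_u, new_v
    if f_u > f_v:
        new_v = list(p_u)                                     # p_v takes p_u's place
        new_u = [xu + s * (xu - xv) for xu, xv in zip(p_u, p_v)]
        return new_u, new_v
    return list(p_u), list(p_v)                               # equal fitness: no move
```

For example, when maximizing F(x) = −(x1² + x2²) with s = 0.75, comparing P_u = (1, 1) against P_v = (0, 0) sends P_u to (0, 0) and the overthrown P_v to (−0.75, −0.75); the pair's distance has been shortened to 0.75 of its original value, as the rule requires.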
By modifying the population this way, a new pair of particles is then randomly selected, and the process is repeated until a predetermined cut condition is achieved. This cut condition will be addressed later on. At this point it is important to draw attention to the computational simplicity of the algorithm, since, as noted, it only requires the computation of additions, subtractions, and multiplications.

3.2 Graphic description of the methodology

The following is a graphic example of the above methodology. In this example, the fitness function is simply the distance to the center in two dimensions. The three-dimensional fitness landscape defined by this function is a cone of unitary slope toward the center, with its apex downwards. In this particular case the function is optimized by searching for the minimum. The quantity of particles n equals 5, and the value of s is 0.75. In the upper left corner of each graph the pair of particles being compared at each step is indicated. On each graph the motion of the particles under comparison is indicated by a dotted line, showing the particles in their final positions after comparison.

Figure 1. Function "distance to the center" in two dimensions (n = 5, s = 0.75). Source: Generated by the author
As can be seen in the example above, in these early steps no better solution than the initial one (1.605) is found; on the contrary, the average of the particles' values worsens from its initial value to a final 3.223, passing through an even higher maximum. So far the algorithm would appear to be quite ineffective at finding an optimum. In the next two graphs, however, the behavior of the algorithm applied to the same function after a much larger number of steps can be seen. The top graph represents the best value found, and the lower one the highest of the standard deviations of the particles, taken from the standard deviations calculated in each dimension. This maximum standard deviation (MSD) will be used hereafter as representative of the dispersion of the particles during the evolution of the algorithm, and will also be used as a termination criterion (cut condition) for the algorithm.

Figure 2. Function "distance to the center" in two dimensions. Source: Generated by the author

It can be seen in this case, as in the one before, that during the first iterations no better solution is found than the one obtained by generating a random initial population. It can also be seen how in the initial iterations the MSD increases, which implies that the particles are spreading out. What is remarkable is that at some point the algorithm starts to find better solutions (about the 100th evaluation in this case) and the population begins to converge towards the global optimum, which in the case of this function is unique.¹ The reason for this is that the particles globally lose their mobility as their values improve. In this way it is the fitness function itself that modifies the behavior of the algorithm, the maxima acting as attractors and the minima as repellers of the particles.

4. Results

In the previous section a particular case using an extremely simple fitness function was analyzed.
This function is easily solved by almost any optimization algorithm. In this section, however, the behavior of the algorithm acting on some fitness functions of greater complexity will be shown. To start with, the results of a run of the algorithm on Rastrigin's function in four dimensions are presented. Rastrigin's function has many local maxima, which causes algorithms in which exploitation predominates over exploration to generally fail in its optimization, since they soon stagnate in one of those local maxima. The following figure shows the shape of the surface (landscape) of this function when it is defined in two dimensions.

¹ A run of the algorithm optimizing the function "distance to the center" in two dimensions with n = 16 can be seen online.
Figure 3. Rastrigin's function in two dimensions. Source:

As can be observed in the graph below, the results of the run show that the population has a first stage of rapid expansion, followed by an extended period of exploration in which the dispersion remains stable, and finally a convergence towards the solution. In this case the analysis is terminated when the MSD falls below a preset threshold. It is very interesting to notice how, at about the 37,000th evaluation, the algorithm decides to stay with a solution; its pattern of behavior therefore changes and it converges towards that solution. Given that the algorithm starts from a randomly generated population, and that the selection of the pair of particles to be compared at each step is also performed randomly, two different runs of the algorithm will produce two different results; nevertheless, the characteristics of the MSD plot display a similarity.² Next, a comparison of the MSD of two runs of the algorithm on Ackley's function in three dimensions is shown.

Figure 4. Rastrigin's function in four dimensions (s = 0.75). Source: Generated by the author

Figure 5. Ackley's function in two dimensions. Source:

Figure 6. Comparison of the MSD of two runs on Ackley's function in three dimensions (s = 0.75). Source: Generated by the author

² A run of the algorithm optimizing Rastrigin's function in two dimensions can be seen in cjrsgeir0ua.
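The complete loop — random pairing repeated until the dispersion collapses — can be sketched as follows. The sampling bounds, the MSD cut-off of 1e-3, the iteration cap, and the seed are illustrative assumptions (the paper's exact cut-off value is not reproduced here), and Rastrigin's function is negated because the rule as stated maximizes fitness.

```python
import math
import random

def rastrigin(x):
    """Rastrigin's test function; global minimum 0 at the origin."""
    return 10 * len(x) + sum(xi * xi - 10 * math.cos(2 * math.pi * xi) for xi in x)

def max_std(pop):
    """Maximum of the per-dimension standard deviations (the MSD)."""
    n, d = len(pop), len(pop[0])
    msd = 0.0
    for j in range(d):
        col = [p[j] for p in pop]
        mean = sum(col) / n
        var = sum((c - mean) ** 2 for c in col) / n
        msd = max(msd, math.sqrt(var))
    return msd

def overthrow(fitness, d, n=16, s=0.75, msd_cut=1e-3, max_steps=20_000, seed=0):
    """Run Overthrow, maximizing `fitness`, until the MSD falls below msd_cut."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-5.12, 5.12) for _ in range(d)] for _ in range(n)]
    best = max(pop, key=fitness)[:]
    for _ in range(max_steps):
        u, v = rng.sample(range(n), 2)           # random pair, distinct indices
        pu, pv = pop[u], pop[v]
        if fitness(pu) < fitness(pv):            # pu overthrows pv
            pop[u], pop[v] = pv[:], [xv + s * (xv - xu) for xu, xv in zip(pu, pv)]
        elif fitness(pu) > fitness(pv):          # pv overthrows pu
            pop[v], pop[u] = pu[:], [xu + s * (xu - xv) for xu, xv in zip(pu, pv)]
        for p in (pop[u], pop[v]):               # track the best solution found
            if fitness(p) > fitness(best):
                best = p[:]
        if max_std(pop) < msd_cut:               # cut condition on the dispersion
            break
    return best

# Maximize the negated Rastrigin function in two dimensions.
best = overthrow(lambda p: -rastrigin(p), d=2)
```

Plotting max_std(pop) over the iterations of such a run reproduces the expansion, exploration, and convergence stages described in the text.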
Both runs reached the function's global optimum, but they did so performing a different number of evaluations of that function. However, a similarity between the general shape of the MSD in both runs is seen. In the cases presented, a pattern in the behavior of the algorithm can be found. There is always a first stage of expansion, followed by an exploration stage, finishing with a stage of decision and convergence. The extension and amplitude of these stages are in some way a description of the characteristics of the fitness function being evaluated.

Figure 7. Characterization of the MSD's sectors. Source: Generated by the author

Figure 8. Salomon's function in two dimensions. Source:

In some particular cases, such as in Salomon-type functions, it has been noted that following a convergence that appears to be definitive, and after a significant number of evaluations, new expansions, explorations, and convergences are produced again. These renewed expansions, however, are often smaller than the original (measured as dispersion). In some cases this allowed the algorithm to jump to another maximum, even better than the one that produced the initial decision and convergence. This is explained by the shape of the local maxima in functions which present quite uniform valleys.

4.1 Influence of the initial parameters

It is still necessary to assess the influence of the initial parameters n and s on the behavior of the algorithm.

4.1.1 Number of particles

To analyze the influence of n, the number of particles, Rosenbrock's function in three dimensions will be used. This function generates a characteristic trace of the MSD, showing a defined initial expansion, followed by a convergence towards a plateau in which the MSD stabilizes, and finally a lengthy convergence.³

Figure 9. Rosenbrock's function in two dimensions. Source:

³ A run of the algorithm optimizing Rosenbrock's function in two dimensions can be seen in cs0xzbagmu4.
Runs were carried out with different values of n, arriving in all cases except n = 4 at the globally optimal solution of the function. Below is the evolution of the MSD during the evolution of the algorithm.

Figure 10. Comparison between runs on Rosenbrock's function in three dimensions, varying the number of particles (s = 0.75). Source: Generated by the author

As seen, as the number of particles increases, the initial dispersion is greater, resulting in a larger explored area, and the greater the number of evaluations subsequently required to reach convergence. In the case of the run with n = 4, which is not included in the graph above, the convergence was so rapid that the global maximum was not reached. It is interesting to note that, regardless of the number of particles used, the fingerprints of the MSD are similar in the geometric sense of the word.

4.1.2 Ratio of distances

In all the examples given so far, the value of s (the ratio between the final distance and the initial distance between each pair of compared particles) has been 0.75, without any justification. The following question will help to understand the importance of this value: what is the value of s that will maintain a stable population of n particles, without collapsing or expanding, if they jump over each other in a random manner? To address this issue we used a dummy fitness function consisting of a flat fitness landscape, the algorithm being modified, obviously, so that the compared particles change their positions even though both have the same fitness value. The answer to the question for a population of n = 2 is trivial, because the equilibrium value of s under these conditions is clearly 1, since with that value both particles always retain the original distance. However, for n = 4 the equilibrium s drastically decreases, to a value near 0.6, while in a population with n = 25 the equilibrium s drops to about 0.1. Below, the behavior of the population when non-homogeneous fitness functions are used is discussed, with the aim of observing the influence of the fitness function on the equilibrium s. First, the fitness function "distance to the center" will be used, fixing the value of n = 25 while varying the value of s.
First, the fitness function distance to the center will be used, fixing the value of n = 25, while varying the value of s Ratio of distances In all examples given so far the value of s (the ratio between the final distance and the initial distance between each pair of compared particles) has been 0.75 without any justification. Close View 196
230 Where Supercomputing Science and Technologies Meet Figure 12. Comparison between runs of Rosenbrock s function varying the value of s (n = 25) Source: Generated by author Distant View Figure 11. Comparison between runs of the distance to center varying the value of s (n = 25) Source: Generated by author As can be observed, for values of s between 0.75 and 0.90 the behavior of the algorithm does not change much. However, for a value of s = 0.95 the algorithm begins to appear unstable, converging to the solution only after a much larger number of evaluations. For s = 1 there is no convergence. Next the same analysis is performed for Rosenbrock s function. Here it s seen that for values lower than 0.80 the convergence is rapid, almost independently of the value of s. However, for s equal to 0.80 the algorithm begins to show some instability, which gets worse for s equal to Notwithstanding, even despite the large dispersion that occurs in these values, eventually convergence is also achieved. That is why, after having discussed these and many other cases, it can be concluded that an empirical value of s = 0.75 has a good performance at a very large variety of fitness functions, and that s the reason why it has been used by default in almost all the previous analyzes. Using lower values produces convergence too quickly in some cases, resulting in the loss of the algorithm s ability to explore, while using higher values, in other cases results in algorithm instability. However, it has been observed that the algorithm finds the two dimensional Rosenbrock s function s solution at approximately the same number of evaluations, both for runs with values of n = 4 and s = 0.98 to run with values of n = 25 and s = A deeper analysis of the issue is still necessary, as it would be interesting to establish a mathematical correlation between n and s that could make the algorithm independent of either one of these initial parameters. 
In conclusion, the observed difference between the values of equilibrium s in the 197
231 3rd INTERNATIONAL SUPERCOMPUTING CONFERENCE IN MÉXICO 2012 dummy and real optimization functions reinforces what was stated previously, that is that the maximum in the fitness function plays the role of attractor of the particles since they produce convergence of the particles with much higher values of s. 5. Discussion and findings Observing how the algorithm behaves while dealing with different fitness functions, attention is called to some particular characteristics that stand out in the preceding sections. First, observing the initial dispersion of the particles, a subsequent convergence towards a point would seem counterintuitive. However, this convergence is finally produced by the attracting effect which is shaped by the form of the fitness function. It is important to note, too, how the algorithm changes its behavior even when the rules that define it remain the same. The stages of expansion, exploration and convergence are manifested in greater or lesser extent in all applications. The shape of the MSD (its fingerprint ) allows the inference of some characteristics of the function being optimized. Thus, functions with many local maxima have extended exploration MSD plateaus, while others with steep maximum tend to produce rapid convergence. Furthermore its noted that local maxima that occur in plateaus produce small further expansions after the initial convergence. The fact that the shape, not the size of the fingerprint is independent of the number of particles used in the algorithm is also interesting. One of the benefits of having the fingerprint is the ability to implement changes to the algorithm in order to address that particular problem more effectively. Thus, if a too rapid convergence is observed, it can be addressed by increasing the value of n (or s), or if a slow convergence is observed, some of the variations discussed below can be applied, as it could be a mutation. 
The following section introduces some of these variations of the algorithm, which are proposed to improve performance in particular situations. Most of these variations can be combined with one another, as shown in the practical case presented in the Appendix.

Variations proposed for the algorithm

Final distance determined randomly from the initial distance

In this case the final distance d_f is determined randomly from the initial distance d_i by a formula involving a, a randomly generated number in (0, 1), and b, any real number. Using this variation the movements will not be uniform, with the particles ending up sometimes closer to and sometimes farther from what s indicates. However, if an infinite quantity of values of d_f were randomly calculated, their average would equal s·d_i.
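The paper's exact formula is not reproduced in the extracted text, so the sketch below uses an assumed stand-in, d_f = s·d_i·(b + 1)·a^b with a ~ U(0, 1), chosen only because it has the stated property: the average over infinitely many draws equals s·d_i (for b > -1, since E[a^b] = 1/(b + 1)). The formula, function name, and defaults are illustrative, not the authors':

```python
import numpy as np

rng = np.random.default_rng(42)

def random_final_distance(d_i, s=0.5, b=1.0):
    """Draw a random final distance whose expected value is s * d_i.

    Assumed stand-in for the variation's formula: a ~ U(0, 1), and
    (b + 1) * a**b has unit mean for b > -1."""
    a = rng.random()
    return s * d_i * (b + 1.0) * a ** b
```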
Figure 13. Movement of the particles when the final distance is determined randomly from the initial distance (the separation in the dotted line represents the probability of motion). Source: generated by the author.

Random direction

In this case the displaced particle moves to some point on the n-sphere of radius s·d_i. This variation becomes more complicated as the dimension of the n-sphere increases, because the mathematical complexity of determining uniformly distributed points on its surface grows very rapidly with the dimensionality.

Figure 14. Movement of the particles when the direction changes randomly. Source: generated by the author.

Self-regulated distance

In this variation an objective MSD is set and compared with the MSD calculated at each step. If the computed MSD is lower than the target MSD, the value of s is increased; if it is higher, the value of s is decreased. This variation guarantees constant exploration, at the cost of never achieving convergence, so some termination criterion for the algorithm should be established.

Conditional mobility

In this variation the particles move only if the final situation is better than the initial one; in other words, if the final position of the displaced particle is better than the initial position of the moved particle. An even stricter variant may require that the particles move only if the final position of both particles is better than the starting position of each one. These variations lead to the concentration of particles in several local maxima at the same time, without reaching convergence on a single point, which can sometimes be useful. For example, using this variation on Himmelblau's function, the particles concentrate in the four equal optima at the same time and never converge to one of them, as they do in the original algorithm.
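The conditional mobility rule can be sketched as follows. The move rule (each particle stepping a fraction of the way toward a randomly chosen partner, so that the final distance is s times the initial one) and the simplified acceptance test (comparing only the moving particle's own fitness before and after) are our assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def conditional_move_sweep(particles, fitness, s=0.5):
    """One sweep of the 'conditional mobility' variation (a sketch):
    a displacement is kept only if it improves the particle's fitness.

    particles: (n, d) array; fitness: callable on a d-vector."""
    n, _ = particles.shape
    for i in range(n):
        j = rng.integers(n)  # partner particle chosen at random
        # candidate position: fraction s of the way towards the partner
        candidate = particles[i] + s * (particles[j] - particles[i])
        if fitness(candidate) > fitness(particles[i]):  # conditional step
            particles[i] = candidate
    return particles
```

Because moves that worsen a particle are rejected, sub-populations can settle on several local maxima at once instead of collapsing onto a single point.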
Mutation

This variation includes the random displacement of a particle to another place in its surroundings, according to a certain rate. The mutation can be done randomly to any point of the initial region in which the particles were generated, or randomly centered on the place from which the particle mutates, or even randomly centered on the best-positioned particle of the set. The mutation range, that is, the maximum distance from the defined mutation center, can also be a variable, either preset or determined on the fly based on other factors, such as the MSD. To improve the performance of the algorithm it is recommended that the best-positioned particle of all always be exempted from mutation. This can easily be accomplished by forcing the mutation to always occur on a particle that has been surpassed.

Elimination

In this variation, particles that exceed preset limits are removed from the population. They can either be returned, by being randomly regenerated at another position, or not.

Explosions

This variation consists of the regular and systematic extinction of all particles except the best-positioned one, and the subsequent regeneration of a new random population that includes the survivor. Removal can be conditioned on a number of evaluations, or on an MSD threshold.

Levels

This variation consists of generating populations using the best particles of previous runs. This can be performed on two or more levels, taking into account that the number of runs required depends on n, the number of particles, and x, the number of levels.

Expansion-Contraction

In this variation a recurrent cycle of expansions and contractions of the population is established. The expansions can be produced, for instance, by increasing the value of s, and the subsequent contractions can be caused by decreasing s.
As in variation H), the cycles can be specified based on a number of evaluations, or based on the MSD for the contractions and a Minimal Standard Deviation (MiSD) for the expansions.

6. Conclusions

The proposed algorithm largely meets the stated objectives, namely:

- It achieves a fair balance between exploration and exploitation, with its behavior governed by the fitness function.
- It defines when to terminate the analysis, since it converges to a solution.
- It has few initial parameters, s and n, although, as indicated, s can be preset or defined as a function of n, leaving the
latter as the only initial parameter.
- It is simple in computational terms, since it involves only d subtractions, d multiplications and d additions for each evaluation of the fitness function (where d is the dimensionality of the fitness function).
- It is easily parallelizable, needing only a not-too-tight locking strategy, since the resilience of the algorithm is high.

7. References

[1] Berry, M., Linoff, G.: Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management.
[2] Holland, J.: Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor.
[3] Kennedy, J., Eberhart, R.: Particle Swarm Optimization. Proceedings of the IEEE International Conference on Neural Networks.
[4] Kennedy, J., Zomaya, A.: Swarm Intelligence. Handbook of Nature-Inspired and Innovative Computing. Springer, US.
[5] Luke, S.: Essentials of Metaheuristics. (http://cs.gmu.edu/~sean/book/metaheuristics/).
[6] Matyas, J.: Random Optimization.
[7] Nelder, J.A., Mead, R.: A Simplex Method for Function Minimization. Computer Journal 7.
[8] Rich, E.: Inteligencia Artificial. McGraw-Hill Interamericana.
[9] Weise, T.: Global Optimization Algorithms: Theory and Application.
[10] Wierstra, D., Schaul, T., Peters, J., Schmidhuber, J. (2008): Natural Evolution Strategies. Proceedings of the IEEE Congress on Evolutionary Computation (CEC), Hong Kong, China.
[11] Williams, E., Crossley, W. (1998): Empirically-derived population size and mutation rate guidelines for a genetic algorithm with uniform crossover.
Color and Motion-Based Particle Filter Target Tracking in a Network of Overlapping Cameras with Multi-Threading and GPGPU

Francisco Madrigal*, Jean-Bernard Hayet, Mariano Rivera
Computer Science Group, Centro de Investigación en Matemáticas, Guanajuato, GTO., México

Abstract

This paper describes an efficient implementation of multiple-target, multiple-view tracking in video-surveillance streams. It takes advantage of the capacities of multi-core CPUs and of graphical processing units, under the CUDA framework. Our algorithm runs several instances of a recursive Bayesian filter, the particle filter. For tracking a single person, it updates a Monte Carlo representation of the posterior distribution over the target position and velocity. To do so, it uses a probabilistic motion model, e.g. constant velocity, and a likelihood associated to the observations on targets. At this first level (single video streams), the multi-threading library TBB is used to parallelize the processing of the per-target independent particle filters. At a higher level, we rely on GPGPU computing through CUDA to fuse target-tracking data collected on multiple video streams, by resolving the data association problem. Tracking results are presented on some challenging tracking datasets.

Keywords: Particle filter; multi-view; tracking; GPGPU; multi-threading

1. Introduction

One of the most recent and striking trends in the market of public safety has been the skyrocketing development of video-surveillance systems. Thousands of cameras have invaded most metropolitan downtowns, with the principal motivation of using them as a strong dissuasive tool against potential criminals and, when possible, with the aim of providing live monitoring tools and giving forensic evidence to solve crimes.
However, the results so far have been quite disappointing, mostly because in many situations human agents are left on their own with dozens of video streams to monitor. Hence, research efforts for the development of next-generation video-surveillance systems have focused on the automation of monitoring tasks, for example with automatically generated alerts when suspect events take place. One of the key elements for that purpose is the tracking system, i.e. the program in charge of detecting the people present in the observed scene and of preserving their identities along the whole video sequence. This is not a trivial task in real-life situations. Because of the projective nature of a camera, occlusions among visible
people are unavoidable, and they make tracking difficult. Indeed, the system has to maintain the presence of occluded persons while they are not observable, and it must take care not to swap the identities of persons crossing each other. One of the most efficient strategies to overcome this problem is a hardware one, i.e. relying on multiple cameras with overlapping fields of view [1,2,4,6,7]. With the help of several views of the same scene, most occlusion problems can be solved. The present article proposes a software architecture for such a multi-camera tracking system, taking full advantage of recent advances in parallel computing, namely multi-threading and GPGPU. We first give an overview of the tracking strategy, then describe the two main components of our system: (1) local trackers running in separate threads for each video stream; (2) data association of the local trackers' estimates to global trackers. We present results on public tracking datasets, study the effect of parallelizing the local tracking and data association tasks, and finally draw some conclusions about the whole approach.

2. Overview

Our approach is summed up in Fig. 1. The video streams corresponding to the different cameras are processed separately, and, for each of them, a set of trackers is run. One tracker is associated to one person, and continuously estimates the position of this person in the image. We refer to each view with indices j = 1...V, where V is the number of views. Trackers in view j are then denoted by T_k^j, for k = 1...N_j.

Figure 1. Overview of our tracking architecture.

Note that these trackers estimate the position of targets in the image, in pixel coordinates. We suppose that we have geometric knowledge of how image points are mapped to the real world, which is assumed planar.
It is well-known that, when considering a planar scene, the relation between image coordinates and real-world coordinates is given by a homography [2]. In our case, these homographies are obtained after a calibration phase. All of these trackers run according to a recursive Bayesian technique, and are implemented as sequential Monte Carlo filters, also named particle filters [2,5,6,7]. We describe them in detail in the next section. Multi-threading is used at this step, with the help of the Threading Building Blocks (TBB) library from Intel [9]. For the next step, i.e. the fusion of the information collected in the video streams into estimates at a global level, the estimates that we get from the local trackers are first converted into real-world coordinates (through the image-to-scene homography) so that the fusion can be done properly.
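Converting a local tracker's pixel estimate to ground-plane coordinates is the standard homography application mentioned above. A minimal sketch (H itself would come from the calibration phase; the function name is ours):

```python
import numpy as np

def to_ground_plane(H, pixel):
    """Map a pixel coordinate (u, v) to planar world coordinates
    through a 3x3 homography H, using homogeneous coordinates."""
    u, v = pixel
    x, y, w = H @ np.array([u, v, 1.0])  # lift to homogeneous coordinates
    return np.array([x / w, y / w])      # dehomogenize
```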
As shown in Fig. 1, this fusion relies on trackers similar to the previous ones. They provide estimates of the real-world coordinates of each of the persons under camera scrutiny. Again, these trackers are kept independent, and are referred to as T_k, for k = 1...N, where N is the number of persons tracked globally. Their observations come from the projected estimates of the local trackers. The main problem is then to associate to each particle (weighted sample), for each filter, at most one observation from each camera, a problem also known as data association. As this problem has a combinatorial nature, it remains complex even if the computational burden can be lightened. Hence, as explained in the upcoming section, we make use of general-purpose GPU computing to massively accelerate the data association and perform the updates of all the particles in parallel. Our prior for this Bayesian inference problem is given by the assumption of a constant-velocity linear model, i.e. for target k in view j,

X_{t+1}^{j,k} = F X_t^{j,k} + lambda^j nu_t,    (1)

where nu_t is a zero-mean Gaussian noise and lambda^j is a scale factor, evaluated at the mean of X_t^{j,k}. This factor, which scales the amount of noise, incorporates the geometric knowledge about view j, if available. For example, it scales down the motions at large distances from the camera, and amplifies those close to it. We describe this mapping in more detail in [2].

3. Local Trackers

As mentioned above, the local trackers in view j, in charge of estimating the position in the images of the observed targets, rely on recursive Bayesian estimators. Let the state of the tracker (the 4 × 1 vector holding all the quantities of interest, namely the position in x and y and the velocity in x and y) be X_t^{j,k} = (x, y, vx, vy)^T, where the additional index t refers to the time instant.
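The constant-velocity prior can be sketched in matrix form; the transition matrix and the way the noise scale enters are illustrative choices, standing in for the view-dependent factor described above:

```python
import numpy as np

def predict_cv(state, dt=1.0, noise_scale=0.0, rng=None):
    """Constant-velocity prediction for the 4x1 state [x, y, vx, vy]:
    position advances by velocity * dt; optional zero-mean Gaussian
    noise, scaled by noise_scale (the view-dependent factor)."""
    F = np.array([[1.0, 0.0, dt, 0.0],
                  [0.0, 1.0, 0.0, dt],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
    out = F @ state
    if rng is not None and noise_scale > 0.0:
        out = out + noise_scale * rng.standard_normal(4)  # zero-mean Gaussian
    return out
```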
In this Bayesian scheme, we use prior information on the target motion and observations (image cues) extracted from the video stream to derive the posterior distribution p(X_t^{j,k} | I_{1:t}^j), where I_{1:t}^j is the collection of all observations (images) collected in view j up to time t.

Figure 2. Background subtraction: (a) input image, frame 332 of PETS'09; (b) segmentation results, six blobs formed from the pedestrians in the scene.

The second element in Bayesian inference is the inclusion of observations. Here, we use a somewhat classical probabilistic generative model of the appearance of the object we are tracking, derived from the one in [5] and based on a combination of color and motion histograms. For these two cues, likelihoods are evaluated through the Bhattacharyya histogram distance D between a reference histogram and the current histogram corresponding to the target state. We then define the corresponding likelihood as

p(I_t^j | X_t^{j,k}) ∝ prod_c exp( - (D_c)^2 / (2 sigma_c^2) ),    (2)
where the exponent c refers to the channel in the HSV space (i.e., we use one histogram per channel). Note that sigma_c^2 is the variance on the Bhattacharyya distance, and it is specific to each cue channel. Reference histograms are initialized in the first frame in which the target is detected, and updated anytime the quality of tracking is evaluated as good. Note also that we incorporate a bit of spatial information along with the color distribution, with two histograms per channel instead of one, defined on the upper and lower halves of the bounding box. The previous likelihood definition stays the same, except that the histograms in each channel have double the dimension. Moreover, we incorporate an image motion model [2,5] based on absolute differences between consecutive images. These differences are accumulated in a histogram, and incorporated again in Eq. (2).

Algorithm 1. Thread description for local tracker in view j
  Initialize when the target is detected and form the corresponding set of particles.
  while the target is not lost do
    Prediction: predict the target position by using the prior of Eq. (1) and sampling particles from the previous set of particles.
    Correction: update the particle weights by multiplying them by the likelihood of Eq. (2).
    Evaluate tracking quality: evaluate the tracker quality based on the unnormalized weights and update the "lost" flag.
    Normalization: normalize the particle weights.
    Resampling: based on the particle weights, resample the particles.
  end while

Figure 3. Projected estimates: the ellipses depict the variance of the projected single-target estimators on the ground plane. Colors correspond to the camera IDs (red for view 1 and green for view 2). For each view, we want to associate at most one of these observations to particles on the ground.
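Algorithm 1 and the histogram likelihood can be sketched together in a few lines. The Gaussian shape on the Bhattacharyya distance follows the text; the variance default, the function names, and the omission of the tracker-quality test are our simplifications:

```python
import numpy as np

def bhattacharyya_distance(h_ref, h):
    """Bhattacharyya distance between two normalized histograms."""
    bc = np.sum(np.sqrt(h_ref * h))       # Bhattacharyya coefficient
    return np.sqrt(max(1.0 - bc, 0.0))

def likelihood(h_ref, h, sigma2=0.05):
    """Gaussian-shaped likelihood on the Bhattacharyya distance;
    sigma2 stands in for the cue-specific variance of Eq. (2)."""
    d = bhattacharyya_distance(h_ref, h)
    return np.exp(-d * d / (2.0 * sigma2))

def particle_filter_step(particles, weights, predict, measure, rng):
    """One pass of the Algorithm 1 loop: prediction, correction,
    normalization, resampling (quality evaluation omitted).

    `measure` maps a predicted state to its observation likelihood,
    e.g. by comparing histograms at that state with `likelihood`."""
    particles = np.array([predict(p) for p in particles])             # prediction
    weights = weights * np.array([measure(p) for p in particles])     # correction
    weights = weights / weights.sum()                                 # normalization
    idx = rng.choice(len(particles), size=len(particles), p=weights)  # resampling
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```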
Target detection, which serves to initialize the trackers, is based on blobs. A background subtraction algorithm generates a binary image with multiple blobs corresponding to zones in motion in the image (see Fig. 2). These blobs may correspond to people in the scene or to other artifacts. Each blob is segmented and filtered, leaving those that have the dimensions of a person. A blob generates a new tracker if and only if there is no other blob close to it. Hence, to sum up, we have Algorithm 1 running on a per-thread basis, for each tracker, in each view.

4. Data Association and Global Trackers

In the previous section, we described the different threads running to estimate individual target states in all video streams. In this section, we propose to perform associations across the different views. The idea is the following: again, independent particle filters are run, associated to physical entities evolving in the ground plane. We associate observations to these trackers: for each view, we try to associate at most one observation to any particle on the ground. These observations are simply the projections on the ground plane of the mean estimate of each local tracker, represented in Fig. 3 by the colored ellipses, which depict the covariance matrices of these distributions. The mapping is done with the same homography mentioned above.

Figure 4. Association between particles and projected estimates according to the Mahalanobis distance (yellow line). We only consider for association the elements with a distance under a threshold (orange circle).

To do the association, we propose a per-particle approach: each tracker T_k in the ground plane (see Fig. 1) is modeled by a set of L particles. The local trackers in each view j give us a distribution of positions on the ground plane, through the geometric mapping of their particles onto the plane.
We sum up these distributions, at this level, through their first two moments (mean and covariance). For each particle, and for each view, we then have to decide which of the local tracker estimates, if any, it will take as a source of observation. Each particle is described by its candidate state and its corresponding weight. All the remaining processes presented here are done in parallel, for all k and l. Consider one of these particles, and forget for the moment the tracker to which it is associated and its particle index.
Figure 5. In (a), the set of global trackers and the associated observations, i.e. the reprojections of the local trackers' means. The yellow arrows are the estimated velocities. In (b), the constructed trajectories of a few targets in the ground plane.

The data association problem then consists in choosing one of the local trackers, on the basis of the Mahalanobis distance defined by

D_k^2 = (x - mu_k^j)^T (Sigma_k^j)^{-1} (x - mu_k^j),

where x is the particle position and mu_k^j, Sigma_k^j are the mean and covariance of the projected estimate of local tracker k in view j. The idea is then to choose, for each view, the local tracker k that minimizes this Mahalanobis distance, whenever the minimal distance passes under a given threshold. This idea is illustrated in Fig. 4. Now, as the number of potential candidates is large, as the number of particles itself is typically of a few hundred, and as these operations are independent, we chose to implement this step on GPGPU, in the CUDA framework [8]. The entire process of data association and global tracking is done with the help of this framework. To speed up memory operations, we use mapped pinned memory (cudaHostAllocMapped), which allows us to use RAM instead of GPU memory and avoid spending so much time copying all the data. We first perform the prediction step (the same as for the local trackers), applying Eq. (1) to each particle in an independent thread. Then, for all particles we calculate the Mahalanobis distances and keep only those under the threshold and with minimal distance. After that, we sum all distances to form a matrix of weights between the observations of view j and the global trackers. With this matrix, we find the association with the lowest cost using the Hungarian algorithm. We repeat this for all views. Once the association problem is solved, we update the weights of all particles with their Mahalanobis distances. The normalization step is also done in parallel, using the reduction technique to sum all the weights and then divide each of them, on a per-thread basis.
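The per-particle association rule can be sketched on the CPU as follows; the paper's implementation runs this massively in parallel under CUDA, and the threshold value and data layout below are illustrative:

```python
import numpy as np

def mahalanobis2(x, mean, cov):
    """Squared Mahalanobis distance from point x to a Gaussian (mean, cov)."""
    d = x - mean
    return float(d @ np.linalg.inv(cov) @ d)

def associate(particle, observations, threshold=9.0):
    """Pick, among one view's projected local-tracker estimates, the
    one with minimal Mahalanobis distance to this particle, provided
    it passes under the threshold; return its index, or None.

    observations: list of (mean, covariance) pairs on the ground plane."""
    best, best_d = None, threshold
    for i, (mean, cov) in enumerate(observations):
        d = mahalanobis2(particle, mean, cov)
        if d < best_d:
            best, best_d = i, d
    return best
```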
5. Results

We have tested our multiple-view tracking algorithm on the widely known PETS 2009 dataset [3], which serves as a common benchmark in the area of tracking for video-surveillance applications. In Fig. 6, we give a few examples of results, from a run of our algorithm along a 795-frame video sequence across two views. This sequence is challenging, as it is moderately densely crowded and several occlusions occur, so that tracking in a single view would be difficult. We apply our algorithm here by fusing the local trackers on two views. The first two rows depict three tracking frames on views 1 and 2, respectively. The third row shows the results of the global trackers, integrating the reprojections of the local trackers as observations. As can be noted in the
first two rows, the local trackers keep a precise estimation of the pedestrians' trajectories, and, as can be noted in the third, the reprojected estimates agree quite well on the ground-plane estimation.

Figure 6. Some tracking results: the first two rows depict the local trackers on two frames, on two overlapping views. The third row gives the results of the global trackers, integrating the reprojections of the local trackers as observations.

Figure 7. Performance comparison on the data association problem: CPU vs. GPU, in terms of errors (a) and computation time with synthetic and PETS 2009 data (b, c).
Last, in Fig. 7, we study the effect of the GPU implementation and compare it to the CPU one, in a synthetic experiment consisting of one tracker following a point in a plane (with additive Gaussian noise) and on real data coming from PETS 2009, while varying the number of particles (in both plots, the horizontal axis is the log of the number of particles). As can be seen on the left, the tracking errors are quite similar in both versions, even if they are a bit higher in the GPU case for small numbers of particles. In the center and on the right, one can see the dramatic improvement in terms of computation time: compared with the CPU, the time cost, which is linear in the number of particles, is reduced by a large constant factor, which is important when considering real-time application of the particle filtering framework with unknown data association. Indeed, with unknown data association, the real state space is very large and requires large numbers of particles, which typically makes CPU-based implementations quite ineffective.

6. Conclusions

We propose a multiple-view tracker implementation that relies on particle filters at two levels and exploits two types of parallelization. At the local level, on raw video streams, multiple threads are run, for each video and each detected target, that update the estimates of the position and velocity of the target. At the higher level, another particle filter is run that fuses the information from the local trackers at the ground-plane stage; data association, i.e. the selection, among all reprojected local trackers, of the closest one to the global trackers, is done by parallelizing all Mahalanobis distance computations through GPGPU with CUDA. The approach is shown to be functional on moderately dense video sequences and, above all, computation-time improvements are shown that justify the use of GPGPU at the data association step.

7. References

[1] J.
Black and T. Ellis. Multi-camera image measurement and correspondence. Measurement, 32(1), pp. 61-71, (2002).
[2] W. Du, J. Hayet, J. Verly, and J. Piater. Ground-target tracking in multiple cameras using collaborative particle filters and principal axis-based integration. IPSJ Trans. on Computer Vision and Applications, (1), (2009).
[3] A. Ellis and J. Ferryman. PETS2010 and PETS2009 evaluation of results using individual ground truth single views. IEEE Int. Conf. on Advanced Video and Signal Based Surveillance, (2010).
[4] M. Liem and D. Gavrila. Multi-person tracking with overlapping cameras in complex, dynamic environments. Proc. of the British Machine Vision Conf., (2009).
[5] P. Perez, J. Vermaak, and A. Blake. Data fusion for visual tracking with particles. Proc. of the IEEE, 92(3), (2004).
[6] W. Qu, D. Schonfeld, and M. Mohamed. Distributed Bayesian multiple-target tracking in crowded environments using multiple collaborative cameras. EURASIP J. on Advanced Signal Processing, pp. 1-16, (2007).
[7] J. Yao and J. Odobez. Multi-camera 3D person tracking with particle filter in a surveillance environment. European Signal Processing Conf., (2008).
[8] NVIDIA CUDA Programming Guide 3.2, cuda/3_2/toolkit/docs/CUDA_C_Programming_Guide.pdf, (2011).
[9] Intel Threading Building Blocks (TBB). threadingbuildingblocks.org.
Load Balancing for Parallel Computations with the Finite Element Method

José Luis González García (1), Ramin Yahyapour (2), Andrei Tchernykh (3)
(1) GWDG, Göttingen, 37077, Germany; (2) GWDG, Göttingen, 37077, Germany; (3) CICESE Research Center, Ensenada, Baja California, México

Abstract

In this paper, we give an overview of efforts to improve current load-balancing techniques and the efficiency of the finite element method (FEM) on large-scale parallel machines. The FEM is used to numerically approximate solutions of partial differential equations (PDEs) as well as integral equations. The PDE domain is discretized into a mesh of information. Distributing the mesh among the processors of a parallel computer, also known as the mesh-partitioning problem, has been shown to be NP-complete. Many efforts are focused on graph partitioning to parallelize and distribute the mesh of information. To address this problem, a variety of general-purpose libraries and techniques have been developed, providing great effectiveness. Today's large simulations require new techniques to scale on clusters of thousands of processors and to be resource-aware, due to the increasing use of heterogeneous computing architectures as found in many-core computer systems. Existing libraries and algorithms need to be enhanced to support more complex applications and hardware architectures.

Keywords: Finite element method, Graph partitioning, Load balancing.

1. Introduction

The finite element method (FEM) is a powerful tool widely used for predicting the behavior of real-world objects with respect to mechanical stresses, vibrations, heat conduction, etc. FEM applications have computational, communication and memory costs too large for sequential implementations to be useful in practice. Parallel systems allow FEM applications to overcome this problem [1]. Partial differential equations (PDEs) are used to describe the problem.
The PDE domain is discretized into a mesh of information, and the PDEs are then transformed into a set of linear equations defined on these elements [2]. In general, iterative methods such as Conjugate Gradient (CG) or Multigrid (MG) are employed to solve the linear systems [3], [4]. The parallelization of numerical simulation algorithms usually follows the single-program multiple-data (SPMD) paradigm: thus, the mesh must be partitioned and distributed [5], [6]. Distributing the mesh among the processors of a parallel computer has been shown to be NP-complete [7], [8]. In recent years, therefore, much effort has been focused on developing suitable heuristics based on the graph-partitioning problem [9-16].
In the next section we present a background on FEM computations and basic concepts in the area. Section III presents the load-balancing problem in parallel FEM computations and relevant previous work. Trends in the field and a discussion of new ideas and approaches that consider the newly emerging requirements are given in Section IV, while Section V closes the paper with some conclusions.

2. Background

In this section, we give an overview of the FEM. We refer the reader to [4] for a more extensive description. We also describe some available FEM frameworks.

2.1 FEM and PDEs

PDEs are often used to model physical phenomena such as the flow of air around a wing, the distribution of temperature on a plate, or the propagation of a crack [3], [6], and they rarely have an explicit solution. The most widely used approach to solving PDEs is to discretize them into a mesh. The FEM replaces the original function by a function with some degree of smoothness over the global domain. A structure with a complex geometry is modeled by a number of small connected cells (elements, nodes). The matrices that arise from these discretizations are generally large and sparse, so the equations are typically solved by iterative methods such as CG or MG [3], [4]. This work was partially supported by CONACYT.

2.2 Solvers and preconditioners

The FEM solver solves a set of matrix equations that approximate the physical phenomena under study. The first iterative methods introduced were based on relaxation of the coordinates, such as Jacobi, Gauss-Seidel, and SOR [4]. Other techniques utilize a projection process to approximate the solution of the linear system; the Krylov subspace methods are considered among the most important of these. MG methods were initially designed for the solution of discretized elliptic PDEs.
Later, they were enhanced to handle other PDE problems, as well as problems not described by PDEs. As described in [4], a preconditioner is a form of implicit or explicit modification of an original linear system that makes it easier to solve by a given iterative method.

2.3 Meshes

The quality of the solution depends heavily on the accuracy of the discretization. The elements of the mesh have to be small in order to allow an accurate approximation, and regions with large gradients are not known in advance. Adaptive techniques allow the solution error to be kept under control while computation costs are minimized [17].

2.4 Parallelization of numerical simulations

Parallel efficiency depends heavily on two factors: the distribution of the data (mesh) among the processors, and the communication overhead of boundary data. During the computations, the mesh is refined and
coarsened several times. Hence, the workload changes unpredictably and a new distribution of the mesh is required. The application has to be interrupted for a load-balancing step, and the new distribution of the mesh must be obtained without changing the location of too many elements.

2.5 FEM frameworks and simulators

A variety of FEM tools and frameworks have been developed in recent years. Ready-to-use software is also available for commercial use. We mention the most relevant tools, including their main features. Charm++ [18] is based on multi-partition decomposition and data-driven execution. The framework separates the numerical algorithms from the parallel implementation. Heister et al. [19] focus on the design of efficient data structures and algorithms for today's new requirements. They have enhanced the library deal.II [20] to take advantage of the power of large clusters. Dolfin [21] employs novel techniques for automated code generation. Mathematical notations are used to express finite element (FE) variational forms, from which low-level code is automatically generated, compiled, and integrated with implementations of meshes and linear algebra. Dolfin differs from many other projects, such as Sundance [22] and Life [23], [24], among others, in that it relies more on code generation. FEAST [25] is a finite-element-based solver toolkit for the simulation of PDE problems on parallel HPC systems. It is the successor of the established FE packages FEAT and FEATFLOW [26]. The next version, FEAST2, is currently under development and will include new features such as 3D support. The ESI Group [27] develops a wide selection of software for different applications such as biomechanics, casting, crash, electromagnetics, and fluid dynamics, among others.

3. The load-balancing problem in parallel computations with FEM

This section presents information related to load-balancing techniques.
We mainly focus on load-balancing through graph/mesh partitioning methods, an area in which much previous work has been done.

3.1 Description and factors leading to imbalance

Load-balancing maximizes application performance by keeping processor idle time and interprocessor communication overhead as low as possible. To minimize the overall computation time, all processors should hold the same amount of computational work, and data dependencies between processors should be minimized. The most important causes of load imbalance in parallel FEM applications are the dynamism of the problem over time (in computational and communication costs) and the adaptive refinement of meshes during the calculation. Other causes may include interference from other users in a time-shared system. Heterogeneity in either the computing resources or the solver can also result in load imbalance and poor performance.
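As a concrete illustration of the balance objective described above, a common scalar measure of imbalance is the ratio of the maximum per-processor load to the average load. This is a minimal sketch; the function name and the idea of a rebalancing threshold are ours, not taken from any of the cited libraries.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Imbalance factor of a workload distribution: max load / average load.
// A value of 1.0 means perfect balance; a dynamic load balancer would
// typically trigger repartitioning once this ratio exceeds a chosen
// threshold (e.g. 1.05).
double imbalance(const std::vector<double>& loads) {
    double total = std::accumulate(loads.begin(), loads.end(), 0.0);
    double avg = total / static_cast<double>(loads.size());
    double mx = *std::max_element(loads.begin(), loads.end());
    return mx / avg;
}

// Example: loads {100, 100, 100, 140} give 140 / 110, i.e. the busiest
// processor does about 27% more work than the average.
```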
3.2 Multiphase problems

Multiphase problems consist of several separate phases (e.g., crash simulations consist of two phases: computation of forces and contact detection). Often, separate decompositions are used for each phase, and data needs to be communicated between phases from one decomposition to the other [28]. Obtaining a single decomposition that is good with respect to both phases would remove the need for communication between phases. Clearly, computing this single decomposition is more complex, as each processor would have multiple workloads corresponding to each phase.

3.3 Load balancing through graph partitioning

Mesh-based PDE problems are often expressed as graphs. Graph vertices represent the data (or work) to be partitioned, and edges represent relationships between vertices. The number of boundary edges approximates the volume of communication needed during computation. Vertices and edges can be weighted to reflect the associated computation and communication costs, respectively. The goal of graph partitioning, then, is to assign equal total vertex weight to partitions while minimizing the weight of cut edges. Each graph vertex represents only an approximation of the amount of work, since the computational work is dominated by the cost of the local subdomain solutions. Another limitation of graphs is the type of systems they can represent [29]; to address this drawback, hypergraphs are used to model PDE problems. Different graph representations can be used; we refer the reader to [30] for details. The output of a graph partitioner is an array indicating, for each graph vertex, the process (subdomain) to which it should be migrated. In short, the graph-partitioning problem is to divide the set of vertices of a graph into subsets (subdomains) no larger than a given maximum size, so as to minimize some cost function (e.g., the total cost of the edge cut).
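The edge-cut cost function just mentioned is straightforward to evaluate for a given partition. A minimal sketch; the types and names are ours, not taken from any cited partitioner.

```cpp
#include <cassert>
#include <vector>

// An undirected edge between vertices u and v with communication weight w.
struct Edge { int u, v, w; };

// Weighted edge cut: the total weight of edges whose endpoints lie in
// different subdomains. part[i] is the subdomain assigned to vertex i.
// As noted in the text, this is only an approximation of the real
// communication volume incurred by the solver.
int edge_cut(const std::vector<Edge>& edges, const std::vector<int>& part) {
    int cut = 0;
    for (const Edge& e : edges)
        if (part[e.u] != part[e.v]) cut += e.w;
    return cut;
}

// Example: the path graph 0-1-2-3 with unit weights, split into
// subdomains {0,1} and {2,3}, cuts only the edge (1,2), so the cut is 1.
```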
For the purposes of this paper, we use the definition of graph partitioning presented in [31]. Let G = G(V, E) be an undirected graph with vertex set V and edge set E, which represents the data dependencies in the mesh. We assume that the graph is connected and that both vertices and edges are weighted with positive integer values; |v| denotes the weight of vertex v, and |e|, |S_p|, and |E_c| denote the weights of edge e, subdomain S_p, and edge-cut E_c, respectively. Given that the mesh needs to be distributed to P processors, define a partition π to be a mapping of V into P disjoint subdomains S_p such that ⋃_p S_p = V. The graph-partitioning problem is then to find a partition π such that |S_p| ≤ S for every subdomain, where S is the given maximum subdomain size, and |E_c| is minimized. Note that perfect balance is not always possible for graphs with non-unitary vertex weights. To date, algorithms have been used almost exclusively to minimize the edge-cut weight. This metric is only an approximation of communication volume and usually does not model the real costs [29]. It has been demonstrated that it can be extremely effective to vary the cost function based on knowledge of the solver [32]. A more appropriate metric is the number of
boundary vertices. It models the resulting communication volume more accurately but is unfortunately harder to optimize [29].

3.3.1 Partitioning algorithms

Many methods have been proposed in the literature to deal with the partitioning of finite element graphs on distributed-memory multicomputers, and they have been implemented in several graph-partitioning libraries. Greedy methods are based on the graph connectivity. Typically, the first subdomain of a partition is initialized with a single vertex, and further vertices are added until the required subdomain size is reached. Then a new subdomain is initialized with an unassigned vertex and built up in the same greedy fashion. The greedy approach usually produces very compact subdomains initially, but the last subdomain often consists of all leftover elements and its shape is not smooth. Different methods try to solve this problem [14], [33]-[35]. Geometric partitioning methods [36], [37] are quite fast, but they often provide worse partitions than more expensive methods such as spectral ones. Furthermore, geometric methods are applicable only if coordinate information for the graph is available, and they can induce higher communication costs for some applications. Diffusive methods have been proposed primarily because of their simplicity and their analogy with the physical process of diffusion: work diffuses in a natural way through the multiprocessor network [10], [38]-[45]. More elaborate methods, called spectral methods, use connectivity measures based on the second smallest eigenvalue of the graph's Laplacian. These methods [13], [46] are quite expensive, but combined with fast multilevel contraction schemes they belong to the state of the art in graph-partitioning software [47], [48]. Multilevel methods provide excellent graph partitions [12], [46], [49], and the basic idea behind them is very simple.
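Before turning to the multilevel details, the greedy growth strategy described above can be sketched as follows, assuming unweighted vertices and breadth-first growth; real partitioners add seed selection and boundary smoothing on top of this basic scheme.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Greedy graph growing: seed a subdomain with a single vertex, absorb
// unassigned neighbours breadth-first until the target size is reached,
// then seed the next subdomain from an unassigned vertex. Returns the
// subdomain index of each vertex. Note how leftover vertices simply fall
// into the current subdomain, which is why the last subdomain of a
// greedy partition often has a poor shape.
std::vector<int> greedy_partition(const std::vector<std::vector<int>>& adj, int parts) {
    const int n = static_cast<int>(adj.size());
    const int target = (n + parts - 1) / parts;  // desired subdomain size
    std::vector<int> part(n, -1);
    int current = 0, filled = 0;
    for (int seed = 0; seed < n; ++seed) {
        if (part[seed] != -1) continue;
        std::queue<int> q;
        q.push(seed); part[seed] = current; ++filled;
        while (!q.empty() && filled < target) {
            int u = q.front(); q.pop();
            for (int v : adj[u])
                if (part[v] == -1 && filled < target) {
                    part[v] = current; ++filled; q.push(v);
                }
        }
        if (filled >= target) { ++current; filled = 0; }
    }
    return part;
}
```

On the path graph 0-1-2-3 with two subdomains this yields the natural split {0,1} / {2,3}.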
The reader should refer to [12] for further details; here we briefly describe the phases of the multilevel algorithm. A series of progressively smaller and coarser graphs, G_i = (V_i, E_i), is created from the original graph G_0 = (V_0, E_0) such that |V_i| > |V_{i+1}|. The coarser graph G_{i+1} is constructed from G_i by finding a maximal matching M_i ⊆ E_i of G_i and collapsing together the vertices that are incident on each edge of the matching. Vertices that are not incident on any edge of the matching are simply copied to G_{i+1}. When vertices v and u are collapsed to form vertex w, the weight of w is set to |w| = |v| + |u|, while the edges incident on w are set equal to the union of the edges incident on v and u, minus the edge (v, u). In the case where a vertex z in G_i has edges to both v and u, i.e., (z, v) and (z, u), the weight of the resulting edge in G_{i+1} is set to |(z, v)| + |(z, u)|. Thus, during successive coarsening levels, the weights of both vertices and edges increase. Maximal matchings can be computed in different ways [12], [31], [50], and the method used to compute the matching greatly affects both the quality of the bisection and the time required during the uncoarsening phase [51]. The next phase is to compute a balanced partition of the coarsest graph G_k = (V_k, E_k).
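One coarsening level of the scheme just described can be sketched as follows, using the heavy-edge matching heuristic of [12] to choose the matching; the data layout and names are ours, and merging the edge lists of collapsed pairs is omitted for brevity.

```cpp
#include <cassert>
#include <vector>

// A neighbour v reached through an edge of weight w.
struct Nbr { int v, w; };

// One level of coarsening: visit vertices in order; match each unmatched
// vertex with the unmatched neighbour reached through its heaviest edge
// (heavy-edge matching); collapse each matched pair into one coarse
// vertex whose weight is the sum of the pair's weights. Vertices left out
// of the matching are copied to the coarse graph unchanged. Returns the
// coarse-vertex index of each fine vertex and fills coarse_wgt.
std::vector<int> coarsen(const std::vector<std::vector<Nbr>>& adj,
                         const std::vector<int>& vwgt,
                         std::vector<int>& coarse_wgt) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> match(n, -1), cmap(n, -1);
    int nc = 0;  // number of coarse vertices created so far
    for (int u = 0; u < n; ++u) {
        if (match[u] != -1) continue;
        int best = -1, bestw = 0;
        for (const Nbr& e : adj[u])
            if (match[e.v] == -1 && e.v != u && e.w > bestw) { best = e.v; bestw = e.w; }
        match[u] = (best == -1) ? u : best;  // unmatched vertices pair with themselves
        cmap[u] = nc;
        coarse_wgt.push_back(vwgt[u]);
        if (best != -1) { match[best] = u; cmap[best] = nc; coarse_wgt[nc] += vwgt[best]; }
        ++nc;
    }
    return cmap;
}
```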
The k-way partitioning problem is most frequently solved by recursive bisection. It is also possible to compute a k-way partition directly, but the coarsening phase may become more expensive to perform. It is well known that recursive bisection can produce arbitrarily worse partitions than direct k-way partitioning [52]. During the uncoarsening phase, the partition of the coarsest graph G_k is projected back towards the original graph by going through the graphs G_{k-1}, G_{k-2}, ..., refining the partition at each graph level. Local refinement heuristics [12] must be applied to improve the partition of each G_i.

3.3.2 Graph partitioning software

Multilevel graph-partitioning software is available in the form of public-domain libraries such as Chaco [48], METIS [53], and SCOTCH [54]. The performance of this software has been compared several times in recent years [31], [55]-[57], but due to the large number of configuration parameters of each library it is hard to reach clear conclusions. Jostle [58] is suitable for partitioning unstructured meshes for use on distributed-memory parallel computers. Jostle also has a variety of built-in experimental algorithms and modes of operation, such as optimizing subdomain aspect ratio, and it is very suitable for dynamic repartitioning. METIS is a set of serial programs for partitioning graphs, partitioning finite element meshes, and producing fill-reducing orderings for sparse matrices. METIS is capable of minimizing the subdomain connectivity as well as the number of boundary vertices. Chaco addresses three classes of problems. First, it computes graph partitions using a variety of approaches with different properties. Second, Chaco intelligently embeds the partitions it generates into several different topologies. Third, Chaco can use spectral methods to sequence graphs in a manner that preserves locality. PARTY [59] offers a variety of different partitioning methods in a very simple and easy way.
It can be used standalone or through a library interface, and it provides default settings for an easy and fast start. The PARTY partitioning library provides interfaces to the Chaco library, whose central methods can be invoked from the PARTY environment. SCOTCH is a software package for static mapping, partitioning, and sparse matrix block ordering of graphs and meshes. It is based on the Dual Recursive Bipartitioning (DRB) mapping algorithm and several graph bipartitioning heuristics [60].

3.3.3 Hypergraph partitioning software

Serial hypergraph-partitioning libraries are available, such as hMETIS [61], [62], PaToH [63], [64], and Mondriaan [65]. But for large-scale and dynamic applications, parallel hypergraph partitioners are needed. The load-balancing library Zoltan [66], [67] also includes a serial hypergraph partitioner which uses multilevel strategies developed for graph partitioning [12], [46].

3.4 Load balancing libraries

The DRAMA [68] library performs a parallel computation of a mesh reallocation
that rebalances the costs of the application code based on the DRAMA cost model, which is able to take into account dynamically changing computational and communication requirements. The library provides the application program with sufficient information to enable an efficient migration of the data between processes. The Zoltan Parallel Data Services Toolkit [67] is unique in providing dynamic load balancing and related capabilities to a wide range of dynamic, unstructured, and/or adaptive applications. Zoltan supports many applications through its data-structure-neutral design. Whereas similar libraries focus on specific applications (DRAMA, for example, supports only mesh-based applications), Zoltan does not require applications to have specific data structures. The Dynamic Resource Utilization Model (DRUM) [69] provides applications with aggregated information about the computation and communication capabilities of an execution environment; it encapsulates the details of hardware resources, capabilities, and interconnection topology. UMPAL [70] is an integrated tool consisting of five components: a partitioner, a load balancer, a simulator, a visualization tool, and a web interface. The partitioner uses three partitioning libraries: Jostle, METIS, and PARTY. The partitioning results are then optimized by the Dynamic Diffusion Method (DDM) [42], the Directed Diffusion Method (DD) [71], or the Multilevel Diffusion Method (MD) [58]. The web interface provides a means for users to use UMPAL via the Internet and integrates the other four parts.

4. Current trends

In this section, we present current trends related to load-balancing techniques. Often, FEM libraries restrict their use to small systems, and this becomes a limitation when thousands of cores are available. This has led to a significant disparity between current hardware and the software running on it. Heister et al.
[19] propose parallel data structures and algorithms to deal with massively parallel simulations; they enhanced the deal.II library to overcome the problem. Another library designed for massively parallel simulations is ALPS [72]. It is based on the p4est library [73], but it lacks the extensive support infrastructure of deal.II, and it is not publicly available. As the cost of a load-balancing step can be relatively large, such a step is necessary only when the degree of imbalance is high. Therefore, it is important to determine the influence of the imbalance on the total cost of a numerical simulation in order to decide whether the load-balancing step should be performed. Olas et al. [74] introduce a new dynamic load balancer for NuscaS [75] based on a performance model. This model estimates the cost of the load-balancing step, as well as the execution time of a computation step performed with either a balanced or an unbalanced workload. Many supercomputers are constructed as networks of shared-memory multiprocessors with complex and non-homogeneous interconnection topologies. Grid computing enables the use of geographically distributed systems as a single resource. This paradigm introduces new and difficult problems in resource management due to the extreme
computational and network heterogeneity. To distribute data effectively on such systems, load balancers must be resource-aware; that is, they must take into account the heterogeneity of the execution environment. Some attempts to address this issue are [76]-[80].

5. Conclusions and future work

In this paper, we have presented an overview of efforts to improve current load-balancing techniques and the efficiency of FEM on large-scale parallel machines. The research can be extended in a number of directions, including the development of more complex cost functions and prediction models.

6. References

[1] T. Olas, K. Karczewski, A. Tomas, and R. Wyrzykowski, "FEM computations on clusters using different models of parallel programming," in Parallel Processing and Applied Mathematics, vol. 2328, R. Wyrzykowski, J. Dongarra, M. Paprzycki, and J. Waśniewski, Eds. Berlin, Germany: Springer-Verlag, 2006.
[2] O. C. Zienkiewicz and R. L. Taylor, The Finite Element Method: The Basis, 5th ed., vol. 1. Oxford, U.K.: Butterworth-Heinemann, 2000.
[3] S. Blazy, W. Borchers, and U. Dralle, "Parallelization methods for a characteristic's pressure correction scheme," in Flow Simulation with High-Performance Computers II, E. H. Hirschel, Ed. Braunschweig/Wiesbaden: Friedrich Vieweg & Sohn Verlagsgesellschaft mbH, 1996.
[4] Y. Saad, Iterative Methods for Sparse Linear Systems, 2nd ed. Philadelphia, U.S.A.: Society for Industrial and Applied Mathematics, 2003.
[5] R. Diekmann, D. Meyer, and B. Monien, "Parallel decomposition of unstructured FEM-meshes," in Parallel Algorithms for Irregularly Structured Problems, vol. 980, A. Ferreira and J. Rolim, Eds. Springer Berlin/Heidelberg, 1995.
[6] R. Diekmann, U. Dralle, F. Neugebauer, and T. Römke, "PadFEM: A portable parallel FEM-tool," in High-Performance Computing and Networking, vol. 1067, H. Liddell, A. Colbrook, B. Hertzberger, and P. Sloot, Eds.
Berlin, Germany: Springer Berlin/Heidelberg, 1996.
[7] M. R. Garey, D. S. Johnson, and L. Stockmeyer, "Some simplified NP-complete graph problems," Theoretical Computer Science, vol. 1, no. 3, 1976.
[8] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. San Francisco, U.S.A.: W. H. Freeman and Company, 1979.
[9] R. Diekmann, B. Monien, and R. Preis, "Using helpful sets to improve graph bisections," in Interconnection Networks and Mapping and Scheduling Parallel Computations, vol. 21, D. F. Hsu, A. L. Rosenberg, and D. Sotteau, Eds. American Mathematical Society, 1995.
[10] C. Farhat, "A simple and efficient automatic FEM domain decomposer," Computers & Structures, vol. 28, no. 5.
[11] B. A. Hendrickson and R. Leland, "An
improved spectral graph partitioning algorithm for mapping parallel computations," SIAM Journal on Scientific Computing, vol. 16, no. 2.
[12] G. Karypis and V. Kumar, "A fast and high quality multilevel scheme for partitioning irregular graphs," SIAM Journal on Scientific Computing, vol. 20, no. 1.
[13] A. Pothen, H. D. Simon, and K.-P. Liou, "Partitioning sparse matrices with eigenvectors of graphs," SIAM Journal on Scientific Computing, vol. 11, no. 3.
[14] H. D. Simon, "Partitioning of unstructured problems for parallel processing," Computing Systems in Engineering, vol. 2, no. 2-3.
[15] C. M. Fiduccia and R. M. Mattheyses, "A linear-time heuristic for improving network partitions," in 19th Conference on Design Automation, 1982.
[16] B. W. Kernighan and S. Lin, "An efficient heuristic procedure for partitioning graphs," Bell System Technical Journal, vol. 49, no. 2, Feb.
[17] R. Verfürth, "A posteriori error estimation and adaptive mesh-refinement techniques," Journal of Computational and Applied Mathematics, vol. 50, no. 1-3.
[18] M. A. Bhandarkar and L. V. Kalé, "A parallel framework for explicit FEM," in Proceedings of the 7th International Conference on High Performance Computing, 2000.
[19] T. Heister, M. Kronbichler, and W. Bangerth, "Massively parallel finite element programming," in Recent Advances in the Message Passing Interface, vol. 6305, R. Keller, E. Gabriel, M. Resch, and J. Dongarra, Eds. Springer Berlin/Heidelberg, 2010.
[20] W. Bangerth, R. Hartmann, and G. Kanschat, "deal.II: A general-purpose object-oriented finite element library," ACM Transactions on Mathematical Software, vol. 33, no. 4, Aug.
[21] A. Logg and G. N. Wells, "DOLFIN: Automated finite element computing," ACM Transactions on Mathematical Software, vol. 37, no. 2, p. 28, Apr.
[22] K. Long, R. Kirby, J. Benk, B. van B. Waanders, and P. Boggs, Sundance. [Online]. Available: Sundance/html/.
[Accessed: 10-Nov-2011].
[23] C. Prud'homme, "Life: Overview of a unified C++ implementation of the finite and spectral element methods in 1D, 2D and 3D," in Applied Parallel Computing. State of the Art in Scientific Computing, vol. 4699, B. Kågström, E. Elmroth, J. Dongarra, and J. Waśniewski, Eds. Berlin, Germany: Springer-Verlag, 2007.
[24] C. Prud'homme, V. Chabannes, and the Feel++ Group, Feel++. [Online]. Available: https://forge.imag.fr/projects/life/. [Accessed: 10-Dec-2011].
[25] S. Turek, D. Göddeke, C. Becker, S. H. M. Buijssen, and H. Wobker, "FEAST: Realization of hardware-oriented numerics for HPC simulations with finite elements," Concurrency and Computation: Practice and Experience, vol. 22, no. 16.
[26] S. Turek, Efficient Solvers for Incompressible Flow Problems: An Algorithmic and Computational Approach, 1st ed. Berlin, Germany: Springer-Verlag, 1999.
[27] ESI Group. [Online]. [Accessed: 09-Jan-2012].
[28] S. Plimpton, S. Attaway, B. A. Hendrickson, J. Swegle, C. Vaughan, and D. Gardner, "Parallel transient dynamics simulations: Algorithms for contact detection and smoothed particle hydrodynamics," Journal of Parallel and Distributed Computing, vol. 50, no. 1-2.
[29] B. A. Hendrickson, "Graph partitioning and parallel solvers: Has the emperor no clothes?," in Proceedings of the 5th International Symposium on Solving Irregularly Structured Problems in Parallel, 1998.
[30] A. Basermann et al., "Dynamic load-balancing of finite element applications with the DRAMA library," Applied Mathematical Modelling, vol. 25, no. 2, Dec.
[31] C. H. Walshaw and M. Cross, "Mesh partitioning: A multilevel balancing and refinement algorithm," SIAM Journal on Scientific Computing, vol. 22, no. 1, Jun.
[32] D. Vanderstraeten and R. Keunings, "Optimized partitioning of unstructured finite element meshes," International Journal for Numerical Methods in Engineering, vol. 38, no. 3.
[33] T. Goehring and Y. Saad, "Heuristic algorithms for automatic graph partitioning," Minneapolis, U.S.A.
[34] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Transactions on Communications, vol. 28, no. 1.
[35] C. H. Walshaw, M. Cross, and M. G. Everett, "A localized algorithm for optimizing unstructured mesh partitions," International Journal of High Performance Computing Applications, vol. 9, no. 4.
[36] M. T. Heath and P. Raghavan, "A Cartesian parallel nested dissection algorithm," SIAM Journal on Matrix Analysis and Applications, vol. 16, no. 1.
[37] G. L. Miller, S.-H. Teng, W. Thurston, and S. A.
Vavasis, "Automatic mesh partitioning," in Graph Theory and Sparse Matrix Computation, vol. 56, A. George, J. R. Gilbert, and J. W. H. Liu, Eds. Springer-Verlag, 1993.
[38] G. Horton, "A multi-level diffusion method for dynamic load balancing," Parallel Computing, vol. 19, no. 2.
[39] H. Meyerhenke and S. Schamberger, "Balancing parallel adaptive FEM computations by solving systems of linear equations," in Euro-Par 2005 Parallel Processing, vol. 3648, J. Cunha and P. Medeiros, Eds. Springer Berlin/Heidelberg, 2005.
[40] S. Schamberger, "A shape optimizing load distribution heuristic for parallel adaptive FEM computations," in Parallel Computing Technologies, vol. 3606, V. Malyshkin, Ed. Springer Berlin/Heidelberg, 2005.
[41] G. Cybenko, "Dynamic load balancing for distributed memory multiprocessors," Journal of
Parallel and Distributed Computing, vol. 7, no. 2.
[42] C.-J. Liao, "Efficient partitioning and load-balancing methods for finite element graphs on distributed memory multicomputers," Feng Chia University, Seatwen, Taichung, Taiwan.
[43] R. Elsässer, B. Monien, and R. Preis, "Diffusion schemes for load balancing on heterogeneous networks," Theory of Computing Systems, vol. 35, no. 3.
[44] S. Schamberger, "On partitioning FEM graphs using diffusion," in Proceedings of the 18th International Parallel and Distributed Processing Symposium, 2004.
[45] A. Heirich and S. Taylor, "A parabolic load balancing method," Pasadena, U.S.A.
[46] B. A. Hendrickson and R. Leland, "A multilevel algorithm for partitioning graphs," in Proceedings of the 1995 ACM/IEEE Conference on Supercomputing (CD-ROM), 1995.
[47] B. A. Hendrickson and R. Leland, Chaco: Software for partitioning graphs. [Online]. [Accessed: 11-Jan-2012].
[48] B. A. Hendrickson and R. Leland, "The Chaco user's guide: Version 2.0," Albuquerque, U.S.A.
[49] G. Karypis and V. Kumar, "Analysis of multilevel graph partitioning," in Proceedings of the 1995 ACM/IEEE Conference on Supercomputing (CD-ROM), 1995, p. 29.
[50] A. Abou-Rjeili and G. Karypis, "Multilevel algorithms for partitioning power-law graphs," in International Parallel and Distributed Processing Symposium, 2006.
[51] G. Karypis and V. Kumar, "Analysis of multilevel graph partitioning," Minneapolis, U.S.A.
[52] H. D. Simon and S.-H. Teng, "How good is recursive bisection," Moffett Field, U.S.A.
[53] G. Karypis, "METIS: A software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices," Minneapolis, U.S.A.
[54] F. Pellegrini, "Scotch and libScotch 5.1 user's guide," Talence, France.
[55] G. Karypis and V. Kumar, "Multilevel k-way partitioning scheme for irregular graphs," Journal of Parallel and Distributed Computing, vol. 48, no.
1, Jan.
[56] R. Diekmann, R. Preis, F. Schlimbach, and C. H. Walshaw, "Shape-optimized mesh partitioning and load balancing for parallel adaptive FEM," Parallel Computing, vol. 26, no. 12.
[57] R. Battiti and A. A. Bertossi, "Greedy, prohibition, and reactive heuristics for graph partitioning," IEEE Transactions on Computers, vol. 48, no. 4.
[58] C. H. Walshaw, "The serial JOSTLE library user guide: Version 3.0," London, U.K.
[59] R. Preis, "The PARTY graph-partitioning library: User manual, version 1.99," Paderborn, Germany.
[60] F. Pellegrini, "Static mapping by dual recursive bipartitioning of process and architecture graphs," in Proceedings of the 1994 Scalable High Performance Computing Conference, 1994.
[61] G. Karypis, hMETIS: Hypergraph & circuit partitioning. [Online]. [Accessed: 11-Jan-2012].
[62] G. Karypis and V. Kumar, "hMETIS: A hypergraph partitioning package, version 1.5.3," Minneapolis, U.S.A.
[63] Ü. V. Çatalyürek, PaToH v3.2. [Online]. [Accessed: 11-Jan-2012].
[64] Ü. V. Çatalyürek and C. Aykanat, "PaToH: Partitioning tool for hypergraphs," Columbus, U.S.A.
[65] R. Bisseling, Mondriaan for sparse matrix partitioning. [Online]. [Accessed: 11-Jan-2012].
[66] Sandia National Laboratories, Zoltan: Parallel partitioning, load balancing and data-management services. [Online]. Available: cs.sandia.gov/zoltan/. [Accessed: 11-Jan-2012].
[67] K. D. Devine, E. G. Boman, R. T. Heaphy, B. A. Hendrickson, and C. Vaughan, "Zoltan data management services for parallel dynamic applications," Computing in Science & Engineering, vol. 4, no. 2.
[68] B. Maerten, D. Roose, A. Basermann, J. Fingberg, and G. Lonsdale, "DRAMA: A library for parallel dynamic load balancing of finite element applications," in Euro-Par 1999 Parallel Processing, vol. 1685, P. Amestoy et al., Eds. Springer Berlin/Heidelberg, 1999.
[69] J. Faik, "A model for resource-aware load balancing on heterogeneous and non-dedicated clusters," Rensselaer Polytechnic Institute, Troy, U.S.A.
[70] W. C. Chu, D.-L. Yang, J.-C. Yu, and Y.-C. Chung, "UMPAL: An unstructured mesh partitioner and load balancer on the World Wide Web," Journal of Information Science and Engineering, vol. 17, no. 4.
[71] Y. F. Hu and R. J. Blake, "An optimal dynamic load balancing algorithm," Daresbury, U.K.
[72] C. Burstedde, M. Burtscher, O. Ghattas, G. Stadler, T. Tu, and L. C.
Wilcox, "ALPS: A framework for parallel adaptive PDE solution," Journal of Physics, vol. 180, no. 1, p. 8.
[73] C. Burstedde, L. C. Wilcox, and O. Ghattas, "p4est: Scalable algorithms for parallel adaptive mesh refinement on forests of octrees," SIAM Journal on Scientific Computing, vol. 33, no. 3.
[74] T. Olas, R. Leśniak, R. Wyrzykowski, and P. Gepner, "Parallel adaptive finite element package with dynamic load balancing for 3D thermo-mechanical problems," in Parallel Processing and Applied Mathematics, vol. 6067, R. Wyrzykowski, J. Dongarra, K. Karczewski, and J. Waśniewski, Eds. Springer Berlin/Heidelberg, 2010.
[75] R. Wyrzykowski, T. Olas, and N. Sczygiol, "Object-oriented approach to finite element modeling on clusters," in Applied Parallel Computing. New Paradigms for HPC in Industry and Academia, vol. 1947, T. Sørevik, F. Manne, A. H. Gebremedhin, and R. Moe, Eds. Berlin, Germany: Springer-Verlag, 2001.
[76] K. D. Devine et al., "New challenges in dynamic load balancing," Applied Numerical Mathematics, vol. 52, no. 2-3, Feb.
[77] S. Sinha and M. Parashar, "Adaptive system sensitive partitioning of AMR applications on heterogeneous clusters," Cluster Computing, vol. 5, no. 4.
[78] C. H. Walshaw and M. Cross, "Multilevel mesh partitioning for heterogeneous communication networks," Future Generation Computer Systems, vol. 17, no. 5, Mar.
[79] T. Minyard and Y. Kallinderis, "Parallel load balancing for dynamic execution environments," Computer Methods in Applied Mechanics and Engineering, vol. 189, no. 4.
[80] J. D. Teresco, M. W. Beall, J. E. Flaherty, and M. S. Shephard, "A hierarchical partition model for adaptive finite element computation," Computer Methods in Applied Mechanics and Engineering, vol. 184, no. 2-4.
Parallelization in the Navier-Stokes Equations for Visual Simulation

Arciga A. M., Vargas M., Botello S.
CIMAT, Apartado Postal 402, Guanajuato, Gto., México.
E-mail: {arciga, miguelvargas,

Abstract

We present a simple way to simulate fluids. The goal is not to obtain accurate solutions of the Navier-Stokes equations, but to have a small, fast, and simple algorithm with real-time results, running on a conventional (at this time) multicore PC. We want unconditionally stable results that are visually similar to the real solutions. The procedure is to work separately with each term of the Navier-Stokes equations (operator splitting). For the diffusive term we use a finite difference scheme, centered in space and backward in time. For the convective term we implement a semi-Lagrangian scheme. In each time step we perform a correction to the velocity field to ensure mass conservation (Helmholtz decomposition). We present a way to parallelize the convective solver, which is used three times during each time step.

Keywords: Navier-Stokes equations, fluid simulation, parallelized semi-Lagrangian scheme, OpenMP.

1. Introduction

The most common kinds of fluids are liquids and smoke, and they are very often seen even in virtual environments such as video games. There, people try to simulate fluids, and fast simulators are needed to achieve user-fluid interaction in real time. The Navier-Stokes equations (1)-(2) describe fluid movement [8]. They are quite complex, and obtaining high-precision solutions generally requires algorithms so complicated that they are usually impermissibly time-consuming to incorporate into a video game; it is worse if the geometry is not simple and/or the boundary conditions change in time, which is the case when we have object-fluid interaction.
In a given region Ω of two- or three-dimensional space, with appropriate boundary and initial conditions for the main variables, which are the velocity u and the density distribution ρ, the system of equations (1)-(2) gives us the velocity u(x,t) of the fluid and the density distribution ρ(x,t) at a later time t > 0 at the position x ∈ Ω. Note that the shapes of equations (1) and (2) are quite similar; this kind of equation is
known as an advection-diffusion equation with source. Sometimes the term convection is used as a synonym of advection, so equations (1)-(2) are also called convection-diffusion equations with source. The advective terms generate the movement or translation; the diffusive terms κ∇²ρ and μ∇²u generate the spreading, smoothing, or diffusion of the variables ρ and u, respectively. The source terms S and f generate, in this case, new density or new velocity at a given instant; for instance, in equation (2) f is an acceleration term defining external forces: it could be the acceleration of gravity, the momentum of an oar, the pressure of a spoon moving in a coffee cup, etc. Equation (1) says that the change of the density is due to a translation with velocity u, plus some diffusion with rate κ, plus an increment due to the source term S. The change of the velocity is described in a similar way by equation (2), except that now the advective part means that the movement or translation is caused by the velocity itself. This self-advection is precisely what gives the equations their greatest difficulty and complexity: it is where the nonlinearity lies, and it makes the equations tightly coupled. In this work we present a simple way to simulate fluids using the Navier-Stokes equations. Our goal is to obtain solutions that look very similar to the real solutions; the code will be as simple as possible, and the simulation can be seen in real time. For the latter we can take advantage of the fact that part of the code can be parallelized. To show the simulation we use C/C++, OpenGL, and OpenMP, applying the parallelization to a couple of loops that perform part of the intensive work. Fortunately, in this case, using OpenMP is as simple as including the desired header (modern compilers do not even need the header <omp.h>) and putting a very simple instruction before the loop to parallelize.
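The parallelization really is that simple. A minimal sketch of a grid sweep parallelized with one OpenMP instruction, using the source step of the density equation as an example; the flattened array layout and the function name are ours:

```cpp
#include <cassert>
#include <vector>

// Source step of the density equation: rho += dt * s at every cell.
// The single "#pragma omp parallel for" line before the loop is the only
// instruction needed to divide the iterations among the available cores.
// If the compiler is invoked without OpenMP support, the pragma is
// ignored and the loop simply runs serially.
void add_source(std::vector<double>& rho, const std::vector<double>& s, double dt) {
    const int n = static_cast<int>(rho.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        rho[i] += dt * s[i];  // iterations are independent: safe to parallelize
}
```

Each iteration touches a distinct cell, so no synchronization is needed; the diffusive and convective sweeps parallelize over the grid in the same way.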
In the next section we describe the solver.

2. The Solver Structure

The main pseudocode of the solver repeats, at each time step, the three tasks referenced below: Update the velocity field, Update the density distribution, and Show the results. The way of proceeding in this code is the one suggested by the method called Operator Splitting [5], widely used in CFD, which consists in splitting the equation into as many equations as operators the original equation has. Let us explain briefly what the method is; there are several splitting techniques, and we review the simplest one, due to the Russians Marchuk and Yanenko. Consider a problem of the form du/dt + (A_1 + A_2 + ... + A_S)u = 0 with initial condition u(t_n) = u^n: in each time step we solve, successively, one subproblem per operator A_i, each over the full step Δt, taking the result of one subproblem as the initial data of the next.
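As a concrete scalar illustration (a stand-in model problem of our own, not the paper's equations), the sketch below applies the Marchuk-Yanenko sequence to du/dt = -(a+b)u, split into the two subproblems du/dt = -au and du/dt = -bu, together with the second-order Strang variant discussed next:

```cpp
#include <cmath>

// Marchuk-Yanenko splitting for du/dt = -(a+b)*u: solve each operator
// exactly over the full step dt, chaining the results.
double marchuk_yanenko_step(double u, double a, double b, double dt) {
    u *= std::exp(-a * dt);   // subproblem 1: du/dt = -a*u
    u *= std::exp(-b * dt);   // subproblem 2: du/dt = -b*u
    return u;
}

// Strang (fractional) splitting: half a step of the first operator, a full
// step of the second, then the remaining half step of the first; second
// order in time under suitable conditions.
double strang_step(double u, double a, double b, double dt) {
    u *= std::exp(-a * dt * 0.5);
    u *= std::exp(-b * dt);
    u *= std::exp(-a * dt * 0.5);
    return u;
}
```

In this linear scalar toy the two operators commute, so both variants reproduce the exact solution u·exp(-(a+b)Δt); the splitting error, and the advantage of the Strang ordering, appears precisely when the operators do not commute.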
At the end of the set of subproblems we obtain u^(S), which is the approximation of the solution u at the time t_{n+1} = (n+1)Δt. Each subproblem can be solved by any convergent numerical method. This method is of linear (first) order in time. Another technique, known as Dimensional Splitting, consists in splitting each operator into operators of lower dimension, usually one dimension. We can also do Fractional Splitting, in which we do Operator Splitting but, during each cycle, one or more operators are solved over a fraction of the time step; then, in a later stage of the same cycle, we solve the same operator over the remaining time. Under suitable conditions this improves the accuracy to second order in time. This Fractional Splitting, or Strang Splitting [5], is applied in the case S=2.

Now, considering the equations that concern us, we focus on the solver for the density, the part Update the density distribution. To solve equation (1) we solve, successively, the following set of three subproblems: a source step (ρ.sou), a diffusion step (ρ.dif) and an advection step (ρ.adv).

The first subproblem (ρ.sou) is the easiest: if s is the array holding the source for the density distribution, the pseudocode simply adds Δt·s to the density at each cell. The instruction "For each cell i,j" visits each node of the domain; for the sake of simplicity, in the examples below we use a 2D rectangular domain. (Modern compilers do not need the header <omp.h>.)

For the second subproblem (ρ.dif) we use a finite difference scheme, centered in space and backward in time; with this
we gain unconditional stability. In this scheme, ρ^n_{i,j} is the numerical value approximating ρ(x_i, y_j, t_n), where (x_i, y_j) are the spatial nodes of the discretized domain and t_n = nΔt. The left-hand side of the scheme approximates the time derivative of ρ, while the terms on the right-hand side approximate the spatial partials; we recall that the Laplacian operator Δ is the sum of the second order spatial partials. Rearranging terms, the finite difference scheme becomes a pentadiagonal linear system for the unknowns ρ^{n+1}_{i,j}. It is also diagonally dominant, so we can use the iterative Gauss-Seidel method. It is possible to use better solvers, such as Conjugate Gradient [6], in which a couple of matrix-vector multiplications can be parallelized, but for the sake of code simplicity we use the Gauss-Seidel method. Because our scheme is implicit, it is unconditionally stable; this means that errors remain bounded, or that the solutions do not blow up, regardless of the sizes Δx, Δy, Δt. A von Neumann analysis ensures that linear schemes of this kind are stable [7].

We recall that the Gauss-Seidel iteration to approximate the solution of a linear system Ax=b of size n sweeps the unknowns, updating each one with the most recent values of the others; our pseudocode follows this pattern. The variable num_iter determines the maximum number of iterations of the method. The function set_bnd() sets the values on the boundary. The idea here is to work with an array x of size (N+2)×(N+2) for the spatial domain and use the bands x(i,0), x(i,N+1), x(0,i) and x(N+1,i), i=1,...,N, for the border. Each instruction "For each cell i,j" is performed over the interior of the domain (x without the bands). The function sol_lin is called with the coefficients of the scheme.

For the third and last subproblem (ρ.adv) we use a semi-Lagrangian scheme, which consists in considering each node as a fluid particle; but instead of going forward in time and carrying this particle along, we ask where the particle was in our domain a single time step back.
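In the spirit of the routines just described, a compact sketch of the diffusion solve and the semi-Lagrangian advection follows. It is modeled on Stam's solver [1]; the names sol_lin and num_iter match the text, the remaining signatures are assumptions of this sketch, and the boundary treatment done by set_bnd() is omitted for brevity:

```cpp
#include <vector>

// Index into a flattened (N+2)x(N+2) grid with a one-cell border.
inline int IX(int i, int j, int N) { return i + (N + 2) * j; }

// Gauss-Seidel sweeps for the pentadiagonal system
//   c*x[i,j] - a*(x[i-1,j] + x[i+1,j] + x[i,j-1] + x[i,j+1]) = x0[i,j].
void sol_lin(int N, std::vector<float>& x, const std::vector<float>& x0,
             float a, float c, int num_iter) {
    for (int k = 0; k < num_iter; ++k)
        for (int j = 1; j <= N; ++j)
            for (int i = 1; i <= N; ++i)
                x[IX(i,j,N)] = (x0[IX(i,j,N)]
                    + a * (x[IX(i-1,j,N)] + x[IX(i+1,j,N)]
                         + x[IX(i,j-1,N)] + x[IX(i,j+1,N)])) / c;
}

// Implicit diffusion step (rho.dif) on a unit square: a = dt*kappa/(dx*dx),
// and c = 1 + 4a makes the system diagonally dominant.
void diffuse(int N, std::vector<float>& x, const std::vector<float>& x0,
             float kappa, float dt, int num_iter) {
    float a = dt * kappa * N * N;
    sol_lin(N, x, x0, a, 1 + 4 * a, num_iter);
}

// Semi-Lagrangian advection (rho.adv): trace each cell center back along
// (u,v) one time step and bilinearly interpolate rho0 there. Each cell is
// independent of the others, so the outer loop can be parallelized.
void advect(int N, std::vector<float>& rho, const std::vector<float>& rho0,
            const std::vector<float>& u, const std::vector<float>& v, float dt) {
    float dt0 = dt * N;
    #pragma omp parallel for
    for (int j = 1; j <= N; ++j)
        for (int i = 1; i <= N; ++i) {
            float x = i - dt0 * u[IX(i,j,N)];   // backtraced position,
            float y = j - dt0 * v[IX(i,j,N)];   // clamped to the grid
            if (x < 0.5f) x = 0.5f; else if (x > N + 0.5f) x = N + 0.5f;
            if (y < 0.5f) y = 0.5f; else if (y > N + 0.5f) y = N + 0.5f;
            int i0 = (int)x, i1 = i0 + 1;
            int j0 = (int)y, j1 = j0 + 1;
            float s1 = x - i0, s0 = 1 - s1;     // pairs of linear weights
            float t1 = y - j0, t0 = 1 - t1;
            rho[IX(i,j,N)] = s0 * (t0 * rho0[IX(i0,j0,N)] + t1 * rho0[IX(i0,j1,N)])
                           + s1 * (t0 * rho0[IX(i1,j0,N)] + t1 * rho0[IX(i1,j1,N)]);
        }
}
```

Note that advect writes into rho while reading only rho0, which is why the text keeps two arrays per quantity and swaps them after each update.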
We seek the neighboring nodes and do a weighted average with the
values from these neighbors. The pseudocode makes this concrete: we calculate the density distribution rho, which is the advection of rho_0 with the two-dimensional velocity (u,v). Note that the computation at each node is independent of the others, so it can be parallelized; this is indicated with the line #parallel, and using C/C++ with OpenMP the instruction is the parallel-for directive. In particular, the pseudocode tells us that we are going to need two arrays for the density, and after each update we need to swap these two arrays. The same holds for the velocity field; in the two-dimensional case, for example, u, u_old and v, v_old for the x and y components of the velocity, respectively.

Now we will see how to do the weighted average with the values of the neighbors. First we take the integer casts i_0 = int(x), j_0 = int(y). We are going to use the four nodes (i_0, j_0), (i_0+1, j_0), (i_0, j_0+1) and (i_0+1, j_0+1). The weights are pairs of linear interpolations. For example, the weight (i_0+1-x) is a value between 0 and 1, and (x-i_0) takes the complementary value in the same interval. Therefore, if the value of x is closer to i_0, we give greater weight to the values at i_0 than to those at i_0+1. The remaining weights do the same in the other direction. The process described above is stable (the value at each node remains bounded): each new value is a convex combination of old values, so for every node and every time it is bounded by the initial density distribution.

We have now finished the part Update the density distribution of the main pseudocode, and with it most of the work, because the part Update the velocity field uses many of the routines already written; we only need to call them with the respective values: the constant μ instead of κ, and the term f instead of S. Thus, to solve the diffusive part of equation (2) we apply the same diffusion solve to each velocity component.
And for the advective part we apply the same advection routine to each velocity component.

A not minor detail is missing: we know that equations (1)-(2) come from two physical conservation laws, the conservation of mass and of momentum. To guarantee that the velocity field is conservative we make a little correction in each time step. To do this we need the Fundamental Theorem of Vector Analysis, also known as the Helmholtz Decomposition Theorem, which says that any vector field can be seen as the sum of two vector fields: one without divergence, called solenoidal, and another irrotational (without rotor). Mathematically, in our case, the theorem says the computed velocity field decomposes as the solenoidal part we want minus a gradient term ∇φ, where the negative gradient is the part without rotor. We look for the value ∇φ (the non-conservative part) so that we can subtract it from the velocity field; this way we get a purely solenoidal field. In order to get ∇φ we apply the divergence operator to the decomposition, obtaining Δφ equal to the divergence of the computed field. This equation for the unknown variable φ is called the Poisson Equation, after the French mathematician.

To solve the Poisson equation we again use a finite difference scheme, this time choosing centered differences for the second order partials and also centered differences for the first order partial derivatives of the divergence. If we take Δx=Δy, this gives again a pentadiagonal linear system with the same stencil that appeared in the previous scheme (equally distributed nodes), for which we have already written a routine that solves it with the Gauss-Seidel method. Now we write the pseudocode for the correction of the velocity field so that it satisfies mass conservation: in each time step, after the update of the velocity field, we solve the Poisson equation and then subtract ∇φ from the velocity field.
Note that the cycles in the previous code have been parallelized. In the first cycle we calculate the divergence of the field with centered differences; then we call the routine that solves the linear system to get the scalar field φ. In the last cycle we use the value of φ to correct the velocity field, subtracting the gradient of φ, for which we again use centered differences.

The last thing we need from the main pseudocode at the beginning of this section is to show the results, in our case to display the density distribution. To do this we use OpenGL directives and the GLUT toolkit.

3. Results

In the following examples we use a square grid of nodes, time step dt=0.1, and values κ=10^(-6), μ=0. We recall that one of our goals was to get real-time simulations, and we achieved this without the power of a new graphics card: in the tests we used a modest laptop with an Intel Centrino processor with 2 cores at 2.0 GHz, 4 GB of RAM and no external graphics card.

3.1 Example 1
3.2 Example 2

4. Conclusions

We have described a numerical solver for the Navier-Stokes equations with the intention of obtaining visually acceptable simulations of fluid flows. We have identified the parts of the code for which it is convenient to perform the calculation in parallel, and we have used OpenMP directives to distribute those calculations. We present results consisting of frames of the simulation, obtained on a conventional PC with modest features (two cores, without a dedicated graphics card). The results were highly successful.

5. References

[1] Jos Stam, Real-Time Fluid Dynamics for Games, Proceedings of the Game Developer Conference, March.
[2] R. Courant, E. Isaacson and M. Rees, On the Solution of Nonlinear Hyperbolic Differential Equations by Finite Differences, Communications on Pure and Applied Mathematics, 5, 1952.
[3] D. Enright, S. Marschner and R. Fedkiw, Animation and Rendering of Complex Water Surfaces, in SIGGRAPH 2002 Conference Proceedings, Annual Conference Series, July 2002.
[4] R. Fedkiw, J. Stam and H. W. Jensen, Visual Simulation of Smoke, in SIGGRAPH 2001 Conference Proceedings, Annual Conference Series, August 2001.
[5] Helge Holden et al., Splitting Methods for Partial Differential Equations with Rough Solutions: Analysis and Matlab Programs, European Mathematical Society.
[6] Jonathan Richard Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain, School of Computer Science, Carnegie Mellon University, Pittsburgh.
[7] Randall J. LeVeque, Finite Difference Methods for Ordinary and Partial Differential Equations, University of Washington.
[8] A. J. Chorin and J. E. Marsden, A Mathematical Introduction to Fluid Mechanics, 3rd ed., Springer.
Solution of Finite Element Problems Using Hybrid Parallelization with MPI and OpenMP

Miguel Vargas-Félix, Salvador Botello-Rionda
Centre for Mathematical Research (CIMAT)

Abstract

The finite element method is used to solve problems like solid deformation and heat diffusion in domains with complex geometries. This kind of geometry requires discretization with millions of elements, which is equivalent to solving systems of equations with millions of variables. The aim is to use computer clusters to solve these systems. The solution method used is Schur substructuring; with it, it is possible to divide a large system of equations into many small ones and solve them more efficiently. This method allows parallelization: MPI (Message Passing Interface) is used to distribute the systems of equations, each one solved on a computer of a cluster with a solver implemented with OpenMP. The systems of equations are sparse.

Keywords: Finite element method, parallel computing, domain decomposition, sparse matrices, solvers.

1. Problem description

We present the formulation of two traditional problems that can be solved using the finite element method. These formulations will be used to simulate examples with complex geometries.

1.1 Solid deformation

We want to calculate the linear inner displacements of a solid resulting from forces or displacements imposed on its boundaries. The displacement vector inside the domain is defined as u(x,y,z) = (u(x,y,z), v(x,y,z), w(x,y,z))^T, and the strain vector is

ε = (ε_x, ε_y, ε_z, γ_xy, γ_yz, γ_zx)^T = (∂u/∂x, ∂v/∂y, ∂w/∂z, ∂u/∂y + ∂v/∂x, ∂v/∂z + ∂w/∂y, ∂w/∂x + ∂u/∂z)^T = E u,

where E is the matrix of partial derivative operators.
The stress vector σ is defined as σ = (σ_x, σ_y, σ_z, τ_xy, τ_yz, τ_zx)^T, where σ_x, σ_y and σ_z are normal stresses and τ_xy, τ_yz and τ_zx are tangential stresses. Stress and strain are related by

σ = D ε,   (1)

where D is called the constitutive matrix; it depends on the Young moduli and Poisson coefficients characteristic of the media. The solution is found using the finite element method with Galerkin weighted residuals; this means that we solve the integral problem in each element using a weak formulation. The integral expression of equilibrium in elasticity problems can be obtained using the principle of virtual work [1, pp. 65-71]:

∫_V δε^T σ dV = ∫_V δu^T b dV + ∫_A δu^T t dA + Σ_i δu_i^T q_i,   (2)

where b, t and q are the vectors of mass, boundary and punctual forces, respectively. The weight functions of the weak formulation are chosen to be the interpolation (shape) functions of the element: if M is the number of nodes of the element and u_i is the displacement at the i-th node, we have

u = Σ_{i=1}^{M} N_i u_i.   (3)

Using (3) we can rewrite (1) as ε = E Σ_{i=1}^{M} N_i u_i, or in a more compact form ε = B u, with B = (E N_1, E N_2, ..., E N_M). Substituting into (2) and integrating, we obtain a system of equations for each element,

K^e u^e = f_b^e + f_t^e + q^e,   (4)

with element stiffness matrix K^e = ∫_{V^e} B^T D B dV. All element systems are assembled into a global system of equations K u = f. K is called the stiffness matrix; if enough boundary conditions are applied, it will be symmetric positive definite (SPD). By construction it is sparse, with storage requirements of order O(n), where n is the total number of nodes in the domain. By solving this system we obtain the displacements of all nodes in the domain, and the stresses can then be recovered with σ = D B u.
1.2 Heat diffusion

The other problem that we want to solve is the stationary case of heat diffusion; it is modeled by the Poisson-type equation

∂/∂x(k ∂φ/∂x) + ∂/∂y(k ∂φ/∂y) + ∂/∂z(k ∂φ/∂z) = S(x,y,z),   (5)

where φ(x,y,z) is the unknown temperature distributed over the domain. Let us define the flux vector q = -k (∂φ/∂x, ∂φ/∂y, ∂φ/∂z)^T.
Boundary conditions can be of Dirichlet or Neumann type. In complex domains it is complicated to obtain a solution φ that satisfies (5) exactly, so we look for an approximate solution φ̂ that satisfies f(φ̂) = 0 in the sense of a weighted integral,

∫_Ω W f(φ̂) dΩ = 0,

where W = W(x,y) is a weighting function. Reformulating the problem as a weighted integral that incorporates the boundary conditions, and integrating by parts, yields the weak form of (5). There are several ways to select the weight functions when the element equations are built; we use the Galerkin method, in which the shape functions themselves are used as weight functions.

2. Schur complement method

This is a domain decomposition method with no overlapping [2]. The basic idea is to split a large system of equations into smaller systems that can be solved independently on different computers in parallel.

Figure 1. Domain discretization (left), partitioning (right)

We start with a system of equations resulting from a finite element problem,

K d = f,   (6)

where K is a symmetric positive definite matrix of size n×n. If we divide the geometry into p partitions, the idea is to split the workload and let each partition be handled by a computer in the cluster. We can reorder the variables of the system of equations to obtain the block-arrow form

[ K_1^II   0       ...  0       K_1^IB ] [ d_1^I ]   [ f_1^I ]
[ 0        K_2^II  ...  0       K_2^IB ] [ d_2^I ]   [ f_2^I ]
[ ...           ...         ...        ] [  ...  ] = [  ...  ]   (7)
[ 0        0       ...  K_p^II  K_p^IB ] [ d_p^I ]   [ f_p^I ]
[ K_1^BI   K_2^BI  ...  K_p^BI  K^BB   ] [ d^B   ]   [ f^B   ]

The superscript II denotes entries that capture the relationship between nodes inside a partition.