CAHSI 2011 ANNUAL MEETING PROCEEDINGS


CAHSI 2011 ANNUAL MEETING
COMPUTING ALLIANCE OF HISPANIC-SERVING INSTITUTIONS
PROCEEDINGS
Unifying Efforts to Advance Hispanics
San Juan, Puerto Rico, March 27-29, 2011
Sponsors
CAHSI is funded by NSF Grant #CNS
RECRUITING, RETAINING, ADVANCING HISPANICS IN COMPUTING

TABLE OF CONTENTS

WELCOME LETTER
AGENDA AT-A-GLANCE
PROGRAM DETAIL: SUNDAY
STUDENT POSTERS
REVIEWERS
PROGRAM DETAIL: MONDAY
PROGRAM DETAIL: TUESDAY
STUDENT PAPERS
1. Smart Phone Healthy Heart Application (Nathan Nikotan, CSU-DH)
2. Asserting Concepts in Medical Records using Classical and Statistical Techniques (Charles Cowart, CSU-SM)
3. Generating Facial Vasculature Signatures Using Thermal Infrared Images (Ana M. Guzman, et al., FIU)
4. A GPU Approach to Extract Key Parameters from ieeg Data (Gabriel Lizarraga, et al., FIU)
5. Assessing the Performance of Medical Image Segmentation on Virtualized Resources (Javier Delgado, et al., FIU)
6. CS0 - Computer Animation Course Taught with Alice (Daniel Jaramillo, et al., NMSU)
7. Genetic Algorithm with Permutation Encoding to Solve Cryptarithmetic Problems (Yesenia Díaz Millet, UPR-Politécnic)
8. Session Hijacking in Unsecured Wi-Fi Networks (Edgar Lorenzo Valentin, UPR-Politécnic)
9. Analysis of Different Robust Methods For Linear Array Beamforming (Angel G. Lebron Figueroa, et al., UPR-Politécnic)
10. An Improved 2D DFT Parallel Algorithm Using Kronecker Product (Karlos E. Collazo, et al., UPR-Politécnic)
11. Versatile Service-Oriented Wireless Mesh Sensor Network for Bioacoustics Signal Analysis (Marinés Chaparro-Acevedo, et al., UPR-M)
12. Numerical Modeling of Bending of Cosserat Elastic Plates (Roman Kvasov, et al., UPR-M)
13. WARD - Web-based Application for Accessing Remote Sensing Archived Data (Rejan Karmacharya, et al., UPR-M)
14. PDA Multidimensional Signal Representation for Environmental Surveillance Monitoring Applications (Nataira Pagán, et al., UPR-M)
15. Hyperspectral Unmixing using Probabilistic Positive Matrix Factorization Based on the Variability of the Calculated Endmembers (Miguel Goenaga-Jiménez, et al., UPR-M)
16. Evaluation of GPU Architecture for the Implementation of Target Detection Algorithms using Hyperspectral Imagery (Blas Trigueros-Espinosa, et al., UPR-M)
17. An Integrated Hardware/Software Environment for the Implementation of Signal Processing Algorithms on FPGA Architectures (María González-Gil, UPR-M)
18. A Computational Framework for Entropy Measures (Marisel Villafañe-Delgado, et al., UPR-M)
19. Continuous-time Markov Model Checking for Gene Networks Intervention (Marie Lluberes, et al., UPR-M)
20. Virtual Instrumentation for Bioacoustics Monitoring System (Hector O. Santiago, et al., UPR-M)
21. Coherence-Enhancing Diffusion for Image Processing (Maider Marin-McGee, et al., UPR-M)
22. Data Quality in a Light Sensor Network in Jornada Experimental Range (Gesuri Ramirez, et al., UTEP)
23. The use of low-cost webcams for monitoring canopy development for a better understanding of key phenological events (Libia Gonzalez, et al., UTEP)
24. A Model for Geochemical Data for Stream Sediments (Natalia I. Avila, et al., UTEP)
25. Model Fusion: Initial Results and Future Directions (Omar Ochoa, et al., UTEP)
26. Prospec 2.2: A Tool for Generating and Validating Formal Specifications (Jesus Nevarez, et al., UTEP)
27. Data Property Specification to Identify Anomalies in Scientific Sensor Data: An Overview (Irbis Gallegos, UTEP)
28. Fusion of Monte Carlo Seismic Tomography Models (Julio C. Olaya, UTEP)
29. Identifying the Impact of Content Provider Wide Area Networks on the Internet's Topology (Aleisa A. Drivere, et al., YSU)
SPEAKER BIOGRAPHIES
SPONSORS AND CONTRIBUTORS

Full proceedings will be available online at:

THE UNIVERSITY OF TEXAS AT EL PASO
Ann Q. Gates, Associate Vice President for Research

Dear 2011 CAHSI Participants:

Welcome to the 5th Annual CAHSI meeting, Unifying Efforts to Advance Hispanics. We are excited to be in San Juan, Puerto Rico. What a perfect setting for a meeting focused on the national agenda for the advancement of Hispanics. CAHSI was formed in 2002 and formalized in 2004 through a National Science Foundation (NSF) Broadening Participation in Computing (BPC) grant. Its focus is on recruiting, retaining, and advancing Hispanics in computing. We aim to increase the number of Hispanics who complete baccalaureate and graduate studies, in particular the number who receive Ph.D. degrees. To reflect our steady growth and desire to extend our impact, we have changed our name to Computing Alliance for Hispanics, maintaining our acronym.

In addition to NSF, CAHSI would like to acknowledge the National Center for Women & Information Technology (NCWIT) for sponsoring the student cultural tour of Old San Juan, Rock Solid Technologies for hosting the Monday luncheon, and Google for supporting student travel scholarships to the conference. Other contributors are listed on page 2 of the Agenda Book.

The CAHSI meeting highlights outstanding students, researchers, and executives through plenary talks, panels, and workshops; the poster session showcases the excellent research being conducted by CAHSI undergraduate and graduate students. We have a dynamic and motivating program thanks to distinguished participants from numerous organizations. We thank you for your attendance; your involvement and contributions are critical to meeting CAHSI's goal. Enjoy the meeting and thank you for making a difference by being involved with CAHSI.

Sincerely,
Ann Quiroz Gates, Ph.D.
CAHSI PI

500 W. University, Admin 209, El Paso, Texas (915) FAX: (915)

Computing Alliance of Hispanic-Serving Institutions
2011 Annual Meeting - Unifying Efforts to Advance Hispanics
At-A-Glance Agenda

SUNDAY, MARCH 27, 2011
8:00 am - 9:00 am: Registration (Foyer A); Breakfast (San Cristobal A,B,C,D)
STUDENT WORKSHOPS AND TALKS
9:00 am - 10:30 am: Workshop: Speed Mentoring. Audience: Undergraduate and graduate students. San Cristobal A,B,C,D
10:30 am - 10:45 am: Break. Foyer A
10:45 am - 11:30 am: Workshop: Setting Timelines to Apply for Graduate School. Audience: Undergraduate students. San Cristobal E,F
10:45 am - 11:30 am: Student Talks: Grad Student Research I. Audience: Graduate and undergraduate students. San Cristobal A,B,C,D
11:30 am - 12:30 pm: Workshop: Submitting a Competitive NSF GRFP. Audience: Undergraduate students and 1st-year Master's students. San Cristobal E,F
11:30 am - 12:30 pm: Student Talks: Grad Student Research II. Audience: Graduate and undergraduate students. San Cristobal A,B,C,D
11:30 am - 12:30 pm: Invited Students: Advocate Orientation. Audience: Advocates. San Cristobal G
12:30 pm - 1:30 pm: Lunch. Las Olas
1:30 pm - 4:30 pm: STUDENT TOUR OF OLD SAN JUAN. Meet in lobby
7:00 pm - 9:00 pm: RECEPTION AND POSTER PRESENTATION. San Cristobal Foyer A

MONDAY, MARCH 28, 2011
8:00 am - 9:00 am: Breakfast. San Cristobal A,B,C,D
OPENING REMARKS
9:00 am - 9:15 am: Welcome. Ann Q. Gates, Associate Vice President for Research, The University of Texas at El Paso. San Cristobal A,B,C,D
KEYNOTE ADDRESS
9:15 am - 10:15 am: Diversity, Identity, Inclusion, and Computing: One of These Things Is Not Like the Others, Or Is It? Manuel A. Perez Quiñones, Associate Professor, Department of Computer Science, Virginia Tech. San Cristobal A,B,C,D
10:15 am - 10:30 am: Break. Foyer A
SOCIAL SCIENCE PANEL
10:30 am - 12:00 pm: Panel: Social Science Research on Hispanics. Audience: General. San Cristobal A,B,C,D
12:00 pm - 1:00 pm: Lunch. Las Olas
POLICY PANEL AND UNDERGRADUATE STUDENT PANEL
1:15 pm - 3:00 pm: Panel: A National Agenda for Accelerating Hispanic/Latino Success. Audience: General. San Cristobal A,B,C,D
1:15 pm - 3:00 pm: Student Panel: Summer Research Internships. Audience: Undergraduate students. San Cristobal E,F,G
3:00 pm - 3:15 pm: Break. Foyer A
ROUNDTABLE DISCUSSION AND UNDERGRADUATE STUDENT PANELS
3:15 pm - 5:15 pm: Roundtable Discussion: Setting a Unified National Agenda for Hispanic Success in Higher Education. Audience: General. San Cristobal A,B,C,D
3:15 pm - 5:15 pm: Student Panel I: Challenging the Future Computer Stereotype: Attracting Students from Underrepresented Groups to Computer Science. Audience: Students. San Cristobal E,F,G
3:15 pm - 5:15 pm: Student Panel II: Creating a Student Research Pipeline. Audience: Students. San Cristobal E,F,G
Dinner on your own

TUESDAY, MARCH 29, 2011
8:00 am - 9:00 am: Breakfast. San Cristobal A,B,C,D
STUDENT AND FACULTY WORKSHOPS
9:00 am - 10:30 am: Workshop: Multicultural Workshop. Audience: General. San Cristobal A,B,C,D
9:00 am - 10:30 am: Workshop: 300+ Students Can't Be Wrong! GamesCrafters, a Computational Game Theory Undergraduate Research and Development Group at UC Berkeley. Audience: General. San Cristobal E,F,G
10:30 am - 12:00 pm: Workshop: Why Graduate School and the GEM Fellowship? Audience: Students. San Cristobal E,F,G
12:00 pm: Lunch. Las Olas

SESSION DESCRIPTIONS - SUNDAY, MARCH 27

STUDENT WORKSHOP: SPEED MENTORING FOR LATINOS IN COMPUTING
Time: 9:00 am - 10:30 am
Target audience: Undergraduate and graduate students
Presenters: Gilda Garreton, Ph.D., Principal Engineer, VLSI Research Group; Patricia Lopez, Ph.D., Component Designer, Intel Corporation
Speed mentoring is a networking exercise in which people get advice in a series of short, one-on-one conversations with mentors. In this session, the speed mentoring technique will be applied to create opportunities for Latinos to (1) identify possible mentors/protégés; (2) get quick feedback on situations and questions specific to cultural challenges, barriers, and opportunities; and (3) expand their network in the community.

STUDENT WORKSHOP: SETTING TIMELINES TO APPLY TO GRADUATE SCHOOL
Time: 10:45 am - 11:30 am
Target audience: Undergraduate students
Presenters: Nayda Santiago, Ph.D., Associate Professor ECE, University of Puerto Rico Mayagüez; Marisel Villafañe, Undergraduate student, University of Puerto Rico Mayagüez
The purpose of the workshop is to provide awareness of the graduate school application process. In this workshop, we will present and discuss a series of deadlines for applying to graduate school. One of our goals is to help undergraduate students plan activities during their careers that will lead to a strong curriculum vitae and a competitive graduate school application packet.

STUDENT TALKS: GRADUATE STUDENT RESEARCH PRESENTATIONS I
Time: 10:45 am - 11:30 am
Target audience: Undergraduate and graduate students
Presenters: CAHSI graduate students
CAHSI students provide short presentations on their research projects. This session is an excellent way to learn about the research being conducted at CAHSI institutions. Student presentation details are given below:
Identifying the Impact of Content Provider Wide Area Networks on the Internet's Topology - Aleisa A. Drivere, Antoine I. Ogbonna, and Sandeep R. Gundla, Youngstown State University
Genetic Algorithm with Permutation Encoding to Solve Cryptarithmetic Problems - Yesenia Díaz Millet, Polytechnic University of Puerto Rico
Smart Phone Healthy Heart Application - Nathan Nikotan, California State University Dominguez Hills
Asserting Concepts in Medical Records using Classical and Statistical Techniques - Charles Cowart, California State University, San Marcos

STUDENT TALKS: GRADUATE STUDENT RESEARCH PRESENTATIONS II
Time: 11:30 am - 12:30 pm
Target audience: Undergraduate and graduate students
Presenters: CAHSI graduate students
CAHSI students provide short presentations on their research projects. This session is an excellent way to learn about the research being conducted at CAHSI institutions. Student presentation details are given below:
CS0 - Computer Animation Course Taught with Alice - Daniel Jaramillo, New Mexico State University
Coherence-Enhancing Diffusion for Image Processing - Maider Marin-McGee and Miguel Velez-Reyes, University of Puerto Rico at Mayaguez
A GPU Approach to Extract Key Parameters from ieeg Data - Gabriel Lizarraga and Mercedes Cabrerizo, Florida International University
Using Software Engineering Techniques to Identify Anomalies in Sensor Data - Irbis Gallegos, University of Texas at El Paso

STUDENT WORKSHOP: SUBMITTING A SUCCESSFUL NSF GRADUATE RESEARCH FELLOWSHIP PROGRAM (GRFP) APPLICATION
Time: 11:30 am - 12:30 pm
Target audience: Undergraduate students and 1st-year Master's students
Presenter: Malek Adjouadi, Ph.D., Professor and Director, Center for Advanced Technology and Education, FIU
The NSF GRFP program has numerous application requirements that must be met to apply for the program. This workshop discusses these requirements with a focus on the essay sections of the application and the NSF review criteria. The presenter will provide advice on how to write competitive essays. Attendees will learn about the FellowNet initiative and the review process.

STUDENT ORIENTATION: PREPARING TO BE A STUDENT ADVOCATE
Time: 11:30 am - 12:30 pm
Target audience: CAHSI Student Advocates
Presenter: Claudia Casas, M.S., CAHSI Program Manager, University of Texas at El Paso
Make an impact by becoming a CAHSI advocate! This interactive workshop will prepare students and faculty to become advocates. Student advocates work with students at their home institutions by encouraging and facilitating student participation in REU opportunities, seminars, workshops, and internships. Working with faculty, advocates involve students in activities that assist them in preparing competitive applications to local and external programs, scholarships, and fellowships. The role of faculty advocates is to promote Hispanic faculty and young professionals into leadership roles. This includes award nominations and making recommendations for key committee positions, panels, and other positions that build leadership.

10 STUDENT POSTER SESSION DESCRIPTIONS SUNDAY MARCH 27 GRADUATE STUDENTS G1. Data Quality in a Light Sensor Network in Jornada Experimental Range Gesuri Ramirez, Olac Fuentes and Craig E. Tweedie University of Texas at El Paso Keywords: Data cleaning, assessing data quality, sensor networks, and machine learning. G2. Evaluation of GPU Architecture for the Implementation of Target Detection Algorithms using HyperspectralImagery Blas Trigueros-Espinosa, Miguel Velez-Reyes, NaydaG. Santiago-Santiago and Samuel Rosario-Torres University of Puerto Rico at Mayaguez Keywords: Hyperspectral imaging (HSI), target detection, and graphics. G3. Hyperspectral Unmixing using Positive Matrix Factorization Based on the Variability of the Calculated Endmembers Miguel Goenaga-Jimenez and Miguel Velez-Reyes University of Puerto Rico at Mayaguez Keywords: Hyperspectral Images, unmixing, endmembers variability, constrained positive matrix factorization, and probabilistic estimator. G4. Session Hijacking in Unsecured Wi-Fi Networks Edgar Lorenzo Valentín Polytechnic University of Puerto Rico Keywords: Network security, Wi-Fi, https, SSL, WEP, protection, cyber crimes, and session hijacking. G5. Coherence-Enhancing Diffusion for Image Processing Maider Marin-McGee and Miguel Velez-Reyes University of Puerto Rico at Mayaguez Keywords: Nonlinear diffusion, coherence-enhancement, and image processing. G6. Identifying the Impact of Content Provider Wide Area Networks on the Internet s Topology Aleisa A. Drivere, Antoine I. Ogbonna, Sandeep R. Gundla and Graciela Perera Youngstown State University Keywords: Internet topology, Internet backbone, Tier-1 network, traceroute, wide area network, content provider, content delivery network G7. A Model for Geochemical Data for Stream Sediments Natalia I. Avila, Rodrigo Romero and Phil Goodell University of Texas at El Paso Keywords: Geochemical data, stream sediments and elements. G8. Model Fusion: Initial Results and Future Directions Omar Ochoa, Vladik Kreinovich, Aaron A. Velasco and Ann Gates University of Texas at El Paso Keywords: Model fusion, earth tomography G9. CS0- Computer Animation Course Taught with Alice Daniel Jaramillo and Karen Villaverde New Mexico State University Keywords: alice, computer animation, CS0, java, and video game development 10

11 G10. Prospec 2.2: A Tool for Generating and Validating Formal Specifications Jesus Nevarez, Omar Ochoa, Ann Gates and Salamah Salamah University of Texas at El Paso Keywords: Software Engineering, Formal Specifications, and LTL G11. Assessing the Performance of Medical Image Segmentation on Virtualized Resources Javier Delgado and Malek Adjouadi Florida International University Keywords: Medical imaging, segmentation, and virtualization G12. Genetic Algorithm with Permutation Encoding to Solve Cryptarithmetic Problems Yesenia Díaz Millet Polytechnic University of Puerto Rico Keywords: Cryptarithmetic, genetic algorithm, and constraint satisfaction problem G13. Analysis of Different Robust Methods For Linear Array Beamforming Angel G. Lebron and Luis M Vicente Polytechnic University of Puerto Rico Keywords: Arrays SIgnal processing, beamforming, MVDR, DOA, and SOI G14. An Improved 2D DFT Parallel Algorithm Using Kronecker Product Karlos E. Collazo Ortiz and Luis Vicente Polytechnic University of Puerto Rico Keywords: Two dimensional DFT, 2D DFT, HPC, Parallel Computing, and Kronecker Product G15. Numerical Modeling of Bending of Cosserat Elastic Plates Roman Kvasov and Lev Steinberg University of Puerto Rico at Mayagüez Keywords: Cosserat elasticity, finite element method, differential equations G16. Smart Phone Healthy Heart Application Nathan Nikotan California State University Dominguez Hills Keywords: Cardiac, heart, blood pressure, HBP, medical device, iphone, Android, physician, patient monitoring, alert system G17. WARD Web-based Application for Accessing Remote Sensing Archived Data Rejan Karmacharya, Samuel Rosario-Torres, and Miguel Vélez-Reyes University of Puerto Rico at Mayaguez Keywords: Remote sensing, data archiving, distributed web application G18. Asserting Concepts in Medical Records using Classical and Statistical Techniques Charles Cowart California State University, San Marcos Keywords: Natural Language Processing, Machine Learning, Classification, Text Mining G19. Continuous-time Markov Model Checking for Gene Networks Intervention Marie Lluberes and Jaime Seguel University of Puerto Rico at Mayaguez 11

12 Keywords: Gene Regulatory Network, Probabilistic Boolean Networks, Markov-chain, intervention, model-checking algorithms G20. Generating Facial Vasculature Signatures Using Thermal Infrared Images Ana M. Guzman, Malek Adjouadi and Mohammed Goryawala Florida International University Keywords: Thermal imaging, anisotropic diffusion, vasculature signatures G21. The use of low-cost webcams for monitoring canopy development for a better understanding of key phenological events Libia Gonzalez, Geovany Ramirez and Craig E. Tweedie University of Texas at El Paso Keywords: Phenology, Digital web-cams, RGB image analysis, Jornada Basin experimental range, Digital imaging, Image processing G22. A GPU Approach to Extract Key Parameters from ieeg Data Gabriel Lizarraga, Mercedes Cabrerizo and Malek Adjouadi Florida International University Keywords: ieeg, CUDA, parallel processing, FFT, FFTW G23. Versatile Service-Oriented Wireless Mesh Network for Bioacoustic Signal Analysis Marinés Chaparro-Acevedo, Juan Valera, Abner Ayala-Acevedo, Kejie Lu and Domingo Rodriguez University of Puerto Rico at Mayaguez Keywords: Wireless mesh network, Remote sensing, Signal processing, Bioacoustics, NETSIG G24. An Integrated Hardware/Software Environment for the Implementation of Signal Processing Algorithms on FPGA Architectures María González-Gil and David Márquez University of Puerto Rico at Mayaguez Keywords: FGPA, Ambiguity Function, System Generator, Matlab, Signal-based Information Processing G25. A Computational Framework for Entropy Measures Marisel Villafañe-Delgado, Juan Pablo Soto and Domingo Rodríguez University of Puerto Rico at Mayaguez Keywords: Cyclic short-time Fourier transform, Entropy Measures, Time-Frequency Representations G26. Virtual Instrumentation for Bioacoustics Monitoring System Hector O. Santiago, David Márquez and Domingo Rodriguez University of Puerto Rico at Mayaguez Keywords: Virtual Instrumentation, Gumstix, Bioacoustics, Embedded Computing, Digital Signal Processing. G27. Data Property Specification to Identify Anomalies in Scientific Sensor Data: An Overview Irbis Gallegos and Ann Gates The University of Texas at El Paso Keywords: Software engineering, data quality, sensor networks, error detection, specification and pattern system, eco informatics G28. PDA Multidimensional Signal Representation for Environmental Surveillance Monitoring Applications Nataira Pagán, Angel Camelo and Domingo Rodríguez University of Puerto Rico at Mayaguez Keywords: PDA, iphone, Android, wireless mesh sensor network, bioacoustics 12

13 G29. Fusion of Monte Carlo Seismic Tomography Models Julio C. Olaya, Rodrigo Romero and Aaron Velasco University of Texas at El Paso Keywords: Seismic tomography, Monte Carlo simulation, Cauchy distribution, uncertainty, fusion, and visualization UNDERGRADUATE STUDENTS U1. Implementation of Target Enumeration Using Euler Characteristic Integrals John Rivera Youngstown State University Keywords: target counting, Euler characteristic integrals, DETER, testbed U2. Specialized Data Analysis, Aggregation & Visualization Tool Packages for R Tia Pilaroscia, Erin Hodgess, Lilian Antunes, Duber Gomez-Fonseca, Hooman Hemmati, Sarah Jennisca University of Houston-Downtown Keywords: Visual Analytics, Statistical Aggregation, Data Analysis, R-language U3. System Identification Using Particle Swarm Optimization Based Artificial Neural Networks Carlos J. Gómez-Méndez and Marcel J. Castro-Sitiriche University of Puerto Rico at Mayagüez Keywords: artificial neural network, ANN, particle swarm optimization, PSO, system identification, swarm intelligence U4. Automated Analysis of Galaxy Images Damaris Crespo-Rivera, Joel Quintana-Nuñez and Olac Fuentes University of Texas at El Paso Keywords: Image Processing, Automated Classification, Galaxy Image Classification, Pattern Recognition. U5. Procedural Generation for Virtual City Francisco Perez, Elio Lozano, Derek Díaz and Juan Sola University of Puerto Rico at Bayamon Keywords: Virtual city, procedural generation, suburban generation U6. Development of a library for Hyperspectral Image Analysis on the GPU Platform Yamil Asusta, David Bartolomei, Amílcar González and Rodolfo Ojeda University of Puerto Rico at Mayagüez Keywords: hyperspectral images, HSI, GPU, CUDA, OpenCL, GPGPU, Principal Component Analysis, PCA, K-means Clustering. U7. NOAA iphone Weather Application Craig Adams and Scott King Texas A&M University Corpus Christi Keywords: iphone, mobile, location, weather, NOAA U8. An Approach to Developing a Cognitive Radio O. García, J. Salomón, H. Tosado, J. Figueroa, G. Blas and Lizdabel Morales University of Puerto Rico at Mayagüez 13

14 Keywords: Cognitive Radio (CR), Software Defined Radio (SDR), USRP, GNU Radio U9. Detecting User Activities using the Accelerometer in Android Smartphones Sauvik Das, LaToya Green, Beatrice Perez, Michael Murphy and Adrian Perrig Keywords: Android, smartphone security, accelerometer U10. Development of INTPAVE, a Flexible Pavement Damage and Permit Fee Analyzer for Heavy Truck Traffic Cesar Tirado and Enrique Portillo University of Texas at El Paso Keywords: flexible pavements, permit cost, heavy vehicles, finite element analysis, rutting, intpave. U11. Easy Share Kemuel Cruz and Edwin Martinez University of Puerto Rico at Arecibo Keywords: Mobile application, Mobile phone, Education U12. Body Area Network for Biotelemetric aapplications Yolián Amaro-Rivera, Mariely Aponte-Bermúdez, Wilfredo Cartagena-Rivera, Damaris Crespo-Rivera and Carlos Ramos-García University of Puerto Rico at Mayagüez Keywords: Body Area Network (BAN), Universal Software Radio Peripheral (USRP), GNU Radio U13. SCIDS: Skin Cancer Identification System An Android Based Mobile Device Application Lisa Richardson, Miguel Alonso Jr.and Danmary Albertini Miami Dade College Keywords: Skin cancer, image processing, android, classifier, machine learning, Bayesian network, neural network, melanoma, basal cell carcinoma, squamous cell carcinoma, mobile application U14. Sensors Application in LEGO Rubik's Cube Problem Solver Gabriel Huertas, Raúl Huertas, Carlos Cardona, Albith J. Colón and Lizdabel Morales-Tirado University of Puerto Rico at Mayagüez Keywrods: Sensors, problem-solver, application U15. Automated Creation of Virtual Clusters on a Single Server for Parallel Programming Education J. Alejandro Medina-Cruz, José Ortiz and H. Ortiz-Zuazaga University of Puerto Rico at Rio Piedras Keywords: virtualization, parallel programming, distributed programming, operating systems, Oracle Solaris, Solaris Zones, Crossbow, scripting, python, network topology, MPI U16. Survey of Terrain Rendering Algorithms Jonathan M. Ortiz and Scott A. King Texas A&M University-Corpus Christi Keywords: Terrain Rendering, Polygonal Surface simplification, irregular meshes, binary triangle trees, btt, tile blocks, level-ofdetail, lod. U17. Top-k Queries in Wireless Sensor Networks Amber Faucet and Longzhuang Li Texas A&M University Corpus Christi Keywords: Wireless Sensor Network, Top-k Query, FILA, TAG, TinyOS, TOSSIM, Environmental, Monitoring, Network Simulator 14

15 U18. AudioAid: A Mobile Application for the Hearing Impaired William Gomez, Danmary Albertini and Miguel Alonso Jr. Miami Dade College Keywords: AudioAid, hearing loss, hearing aid, mobile technology, assistive technology U19. ResDec: A Mobile Application to Decode Resistors for the color Blind Ivette Carreras, Danmary Albertini and Miguel Alonso Jr. Miami Dade College Keywords: Resistor Decoder, Color Blindness, Object Detector, Android Platform U20. A New Genetic Sequence Editor Jacqueline Barreiro and Andres Figueroa University of Texas Pan American Keywords: Genetic Sequence Editor, Editor, DNA, protein, alignment U21. A Portable Expert Systems for Combat Casualty Care Brian Herrera California State University at Dominquez Hills Keywords: Expert systems, combat medicine, portable U22. Mathematical Modeling of the BMP-4 and FGF Signaling Pathways during Neural and Epidermal Development in Xenopus laevis Amie Albanese and Edwin Tecarro University of Houston-Downtown Keywords: BMP-4, MAP kinase, Xenopus laevis, embryonic development, FGF, neural, epidermal U23. Measurement of physical variables using the Universal Serial Bus Standard Victor O. Santos Uceta, Elio Lozano Inca and Marcel Rivera Ayuso University of Puerto Rico at Bayamón Keywords: USB devices, intelligent buildings, human interface devices, HID, physical variables sensors, USB programming U24. A Practical Approach to Developing Next Generation USB 4.0 Jesse Navas, Jr. California State University, Dominguez Hills Keywords: USB 4.0, USB Standard Type-A Plug, Data Transfer, Transfer Rate U25. Young Women in Computing: Programming Impact of Diversity Eclipse vs. App Inventor in Secondary Outreach Janie Chen, Stephanie Marquez, Natasha Nesiba and Nicole Ray New Mexico State University Keywords: Women, computing, diversity, high school, outreach, productivity, app U26. Relational Algebra Toolkit: Sharing and Querying Relational Algebra Expressions on the Web Nathan Arnold, Hussein Bakka, Artem Chebotko, and Jeremy Miller University of Texas Pan American Keywords: Relational algebra, XML, database, visualization, education U27. Active Sonar Simulation in Matlab and Ultrasonic Sonar with Arduino Chipset Board Interface Fernando J. Arroyo and Héctor L. Pacheco University of Puerto Rico at Mayaguez Keywords: Communication System, Active Sonar, Sonar Simulation, Matlab, Arduino, Ping U28. Apple Mobile Devices: A Developer s World Bretton Murphy and Karen Villaverde 15

16 New Mexico State University Keywords: Game Development, ipad/ipod/iphone Game Development U29. Web Server Benchmark Tools for httperf Samuel González Fonseca and Juan M. Sola-Sloan University of Puerto Rico at Bayamón Keywords: Web server benchmarks, httperf, Web traffic, performance analyzer U30. Auto Rigging Spore Creatures Sarah Spofford Texas A&M University Corpus Christi Keywords: Spore, Animation, Computer Graphics, MEL, Maya, Rigging, Inverse Kinematics U31. Educational CS RPG Games David Salas New Mexico State University Keywords: Role-playing game-world, Graphical User Interface, Multidimensional Arrays U32. Investigation of Point Generation and Shape Drawing: Implementation of a Binary Search Algorithm on a Two-Dimensional Plane Cory Edward Moody California State University Dominguez Hills Keywords: Point Generation; Two Dimensional; Binary Search U33. Managing Wireless Sensor Network Using Windows Mobile Devices Ashley N. Munoz and Ahmed Mahdy Texas A&M University-Corpus Christi Keywords: Mobile applications, smartphone, WSNs, Wireless Sensor Networks U34. Toward a New Model of Electronic Medical Record System for Homeless Patients Nelson Torres and Rafael Nieves Universidad del Turabo Keywords: Electronic Medical Records (EMR), e-health, medical records systems, patient-driven workflow 16

REVIEWERS

Edgar Acuña, University of Puerto Rico at Mayaguez
Malek Adjouadi, Florida International University
Richard Aló, University of Houston Downtown
Rafael Arce, University of Puerto Rico at Rio Piedras
Mohsen Beheshti, California State University Dominguez Hills
José A. Borges, University of Puerto Rico at Mayaguez
Marcel Castro, University of Puerto Rico at Mayaguez
Alfredo Cruz, Polytechnic University of Puerto Rico
Gladys O. Ducoudray, University of Puerto Rico at Mayaguez
John Fernandez, Texas A&M University at Corpus Christi
Edgar Ferrer, University of El Turabo
Andrés Figueroa, University of Texas-Pan American
Eric Freudenthal, University of Texas at El Paso
Mario Garcia, Texas A&M University at Corpus Christi
Ann Gates, University of Texas at El Paso
Rocío Guillén, California State University at San Marcos
Sarah Hug, University of Colorado at Boulder
Elio Lozano, University of Puerto Rico at Bayamón
Kejie Lu, University of Puerto Rico at Mayaguez
Yahya Masalma, University of El Turabo
Fernando Maymí, U.S. Military Academy, West Point, NY
Lizdabel Morales, University of Puerto Rico at Mayaguez
Edusmildo Orozco, University of Puerto Rico at Rio Piedras
Humberto Ortiz, University of Puerto Rico at Rio Piedras
José Ortiz, University of Puerto Rico at Rio Piedras
Graciela Perera, Youngstown State University
Manuel A. Pérez-Quiñones, Virginia Tech University
Enrico Pontelli, New Mexico State University
Steve Roach, University of Texas at El Paso
Domingo Rodríguez, University of Puerto Rico at Mayaguez
John Sanabria, Universidad del Valle, Colombia
Nayda G. Santiago, University of Puerto Rico at Mayaguez
Jaime Seguel, University of Puerto Rico at Mayaguez
Juan Solá, University of Puerto Rico at Bayamón
Juan Suris, University of Puerto Rico at Mayaguez
Eliana Valenzuela, University of Puerto Rico at Arecibo
Ramón E. Vásquez, University of Puerto Rico at Mayaguez
Miguel Vélez, University of Puerto Rico at Mayaguez
Luis Vicente, Polytechnic University of Puerto Rico
Karen Villaverde, New Mexico State University

PROGRAM DETAIL - MONDAY, MARCH 28

KEYNOTE: DIVERSITY, IDENTITY, INCLUSION AND COMPUTING: ONE OF THESE THINGS IS NOT LIKE THE OTHER, OR IS IT?
Time: 9:15 am - 10:15 am
Target audience: General
Presenter: Manuel Pérez-Quiñones, Ph.D., Associate Professor, Department of Computer Science, Virginia Tech
We think of computing as a heavily mathematical and technical field, where bits, bytes, megahertz, processes, and mathematical proofs consume our conversation. In this talk, Dr. Pérez-Quiñones will argue that this narrow view of computing is one that makes it difficult to grow the discipline and to attract more students to the field. Diversity, identity, and inclusion are important concepts for a computer scientist to understand. He posits that these concepts are as relevant to the technical growth of the field as they are to broadening participation in it.

PANEL: SOCIAL SCIENCE RESEARCH ON HISPANICS
Time: 10:30 am - 12:00 pm
Target audience: General
Moderator: Maricel Quintana Baker, Ph.D., Principal, MQB-Consulting and CAHSI BOA
Panelists: Clemencia Cosentino, Ph.D., Senior Researcher, Mathematica Policy Research; Jill Denner, Ph.D., Associate Director of Research, ETR Associates; Alicia C. Dowd, Ph.D., Associate Professor, USC's Rossier School of Education, and Co-director, Center for Urban Education; Carlos Rodriguez, Ph.D., Principal Research Scientist, American Institutes for Research
The goal of this panel is to bring together a team of experts in the social science arena to explore the latest research and practices on pertinent issues related to Hispanic participation in the STEM disciplines. Panel members will talk about the barriers and challenges to access, persistence, participation, and degree completion, and describe exemplary and recommended strategies which can lead us to increase Latino participation and success. The panel members will share their suggestions and ideas on how these practices and strategies can be adapted and incorporated into CAHSI's efforts and programs.

PANEL: A NATIONAL AGENDA FOR ACCELERATING HISPANIC/LATINO SUCCESS
Time: 1:15 pm - 3:00 pm
Target audience: General
Moderator: Ann Q. Gates, Ph.D., Associate VP of Research, University of Texas at El Paso
Panelists: Janice Cuny, Ph.D., National Science Foundation, Program Director for Computing Education; Judit Camacho, Executive Director, SACNAS; Lorelle L. Espinoza, Ph.D., Director of Policy and Strategic Initiatives, IHEP; Deborah Santiago, Vice President for Policy and Research, Excelencia in Education
This panel aims to present perspectives on the issues surrounding increasing Hispanic degree attainment in computing and STEM fields, describe how policy can bring about change, and discuss initiatives that NSF has supported to address underrepresentation. Representatives from national organizations will talk about their efforts on impacting policy and setting a national agenda for accelerating Hispanic success in higher education.

UNDERGRADUATE STUDENT PANEL: SUMMER RESEARCH INTERNSHIPS
Time: 1:15 pm - 3:00 pm
Target audience: Undergraduate students
Panelists: Yolián Amaro, Undergraduate Student, University of Puerto Rico at Mayaguez; La Toya Green, Undergraduate Student, University of Puerto Rico at Mayaguez; Ashley Muñoz, Undergraduate Student, Texas A&M University Corpus Christi; Marisel Villafane, Undergraduate Student, University of Puerto Rico at Mayaguez
The objective of this panel is to provide undergraduate students with information about the experience of attending a summer research internship, or REU, and its benefits for developing a research track and preparing for graduate school. The panelists will address the process of applying to REUs, share their REU experiences with the audience, and answer questions.

ROUNDTABLE DISCUSSION: SETTING A UNIFIED NATIONAL AGENDA FOR HISPANIC SUCCESS IN HIGHER EDUCATION
Time: 3:15 pm - 5:15 pm
Target audience: General
Moderator: Carlos Rodriguez, Ph.D., Principal Research Scientist, American Institutes for Research
Dr. Rodriguez will share his views and facilitate a discussion on how different communities, i.e., individuals, community-based efforts, not-for-profit organizations, industry, and institutions of higher education, can unify efforts to promote Hispanic student success in computing and STEM areas in general. The goal of the roundtable discussion is to stimulate dialog among the attendees and to outline an action plan for unifying efforts that can bring about change.

STUDENT PANEL I: CHALLENGING THE FUTURE COMPUTER STEREOTYPE: ATTRACTING STUDENTS FROM UNDERREPRESENTED GROUPS TO COMPUTER SCIENCE
Time: 3:15 pm - 4:15 pm
Target audience: Students
Moderator: Arely Mendez, Graduate Student, University of Texas at El Paso
Panelists: Amber Faucett, Undergraduate Student, Texas A&M University Corpus Christi; Daniel Jaramillo, Graduate Student, New Mexico State University; Alexandria N. Ogrey, Undergraduate Student, University of Texas at El Paso
Hispanics have the highest growth rate among all groups in the U.S., yet they remain considerably underrepresented in computing careers. In addition, a low number of women choose to enter baccalaureate programs in computing fields. Educating a diverse group who are qualified to contribute to technical areas is critical for the economic and intellectual growth of our nation. This panel is composed of students who will talk about the efforts at their universities to engage K-12 students in activities that encourage them to attend college and enter computing or STEM fields.

STUDENT PANEL II: CREATING A STUDENT RESEARCH PIPELINE
Time: 4:15 pm - 5:15 pm
Target audience: Students
Moderator: Nathan Nikotan, Graduate Student, California State University Dominguez Hills
Panelists: Jonathan Ortiz, Undergraduate Student, Texas A&M University Corpus Christi; Sarah Spofford, Undergraduate Student, Texas A&M University Corpus Christi; Moises Carrillo, Undergraduate Student, University of Texas Pan American
Student panelists provide insight into how undergraduates can get into effective research projects, improve communication skills, develop community awareness, and move into leadership positions. Other key topics that will be discussed are having a positive undergraduate research experience, selecting research topics, writing personal essays, and networking.

PROGRAM DETAIL - TUESDAY, MARCH 29

FACULTY AND GRADUATE STUDENT WORKSHOP: MULTICULTURAL WORKSHOP
Time: 9:00 am - 10:30 am
Target audience: General
Presenters: Manuel Pérez-Quiñones, Ph.D., Associate Professor, Department of Computer Science, Virginia Tech; Maricel Quintana Baker, Ph.D., Principal, MQB-Consulting and CAHSI BOA
This session will present a broader perspective on diversity and cultural issues by addressing some of the research around the topic of diversity and cultural differences. Diversity is often equated with affirmative action and other points of view rooted in social justice. Yet research shows that the benefits of diversity go beyond a social justice point of view. More diverse groups perform better than more homogeneous groups, and companies with a diverse employee base tend to be more financially successful. Overall, there are many discipline-specific strands of research that provide a strong case for why diversity matters. This breakout session will explore diversity from a multi-disciplinary perspective and examine popular misconceptions on the topic.

FACULTY AND STUDENT WORKSHOP: 300+ STUDENTS CAN'T BE WRONG! GAMESCRAFTERS, A COMPUTATIONAL GAME THEORY UNDERGRADUATE RESEARCH AND DEVELOPMENT GROUP AT UC BERKELEY
Time: 9:00 am - 10:30 am
Target audience: General
Presenter: Dan Garcia, Ph.D., Lecturer SOE, UC Berkeley
The UC Berkeley GamesCrafters undergraduate research and development group was formed in 2001 as a "watering hole" to gather and engage top students as they explore the fertile area of computational game theory. At the core of the project is GAMESMAN, a system developed for strongly solving, playing, and analyzing two-person, abstract strategy games (e.g., Tic-Tac-Toe or Connect 4) and puzzles (e.g., Rubik's Cube). Over the past ten years, more than seventy games and puzzles have been integrated into the system by over three hundred undergraduates. (A small illustration of what "strongly solving" a game means is sketched after these session descriptions.)

STUDENT WORKSHOP: WHY GRADUATE SCHOOL AND THE GEM FELLOWSHIP?
Time: 10:30 am - 12:00 pm
Target audience: General
Presenter: Marcus A. Huggans, Ph.D., Director of Recruitment and Programming, The National GEM Consortium
This workshop will prove the fundamental belief of the 21st century and beyond: all STEM professionals should hold an advanced STEM degree. In particular, participants will gather information about the career and financial implications of NOT obtaining a graduate degree. If you think all you need is a bachelor's degree to be competitive in this global society, or that you should work first and then go back to graduate school, YOU CAN'T MISS THIS WORKSHOP! Come find out why graduate school is not an option but a necessity.
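The GamesCrafters workshop above centers on "strongly solving" games, that is, computing a win/loss label for every reachable position rather than just finding one good move. GAMESMAN is a large system and none of its code appears in these proceedings; the sketch below is only a minimal Python illustration of the idea, using a toy subtraction game as a stand-in.

```python
from functools import lru_cache

# Toy subtraction game: players alternately remove 1, 2, or 3 tokens from a
# pile; whoever takes the last token wins. A "strong solution" labels every
# position as a WIN or LOSS for the player about to move.
MOVES = (1, 2, 3)

@lru_cache(maxsize=None)
def value(tokens: int) -> str:
    """Return 'WIN' or 'LOSS' for the player to move with `tokens` left."""
    if tokens == 0:
        return "LOSS"  # no tokens left: the previous player took the last one
    # A position is a WIN if some legal move leaves the opponent in a LOSS.
    if any(value(tokens - m) == "LOSS" for m in MOVES if m <= tokens):
        return "WIN"
    return "LOSS"

if __name__ == "__main__":
    # Tabulate the complete game-value table for small piles.
    for n in range(13):
        print(n, value(n))
```

For this toy game the table shows that every multiple of 4 is a loss for the player to move; a system like GAMESMAN computes the analogous complete table for far larger games and puzzles.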

22 STUDENT PAPERS 1. Smart Phone Healthy Heart Application (Nathan Nikotan, CSU-DH) Asserting Concepts in Medical Records using Classical and Statistical Techniques (Charles Cowart, CSU-SM) Generating Facial Vasculature Signatures Using Thermal Infrared Images (Ana M. Guzman, et al., FIU) A GPU Approach to Extract Key Parameters from ieeg Data (Gabriel Lizarraga, et al., FIU) Assessing the Performance of Medical Image Segmentation on Virtualized Resources (Javier Delgado, et al., FIU) CS0- Computer Animation Course Taught with Alice (Daniel Jaramillo, et al., NMSU) Genetic Algorithm with Permutation Encoding to Solve Cryptarithmetic Problems (Yesenia Díaz Millet, UPR- Politécnic) Session Hijacking in Unsecured Wi-Fi Networks (Edgar Lorenzo Valentin, UPR-Politécnic) Analysis of Different Robust Methods For Linear Array Beamforming (Angel G. Lebron Figueroa, et al., UPR- Politécnic) An Improved 2D DFT Parallel Algorithm Using Kronecker Product (Karlos E. Collazo, et al., UPR-Politécnic) Versatile Service-Oriented Wireless Mesh Sensor Network for Bioacoustics Signal Analysis (Marinés Chaparro- Acevedo, et al., UPR-M) Numerical Modeling of Bending of Cosserat Elastic Plates (Roman Kvasov, et al., UPR-M) WARD Web-based Application for Accessing Remote Sensing Archived Data (Rejan Karmacharya, et al., UPR- M) PDA Multidimensional Signal Representation for Environmental Surveillance Monitoring Applications (Nataira Pagán, et al., UPR-M) Hyperspectral Unmixing using Probabilistic Positive Matrix Factorization Based on the variability of the calculated endmembers (Miguel Goenaga-Jiménez, et al., UPR-M) Evaluation of GPU Architecture for the Implementation of Target Detection Algorithms using Hyperspectral Imagery (Blas Trigueros-Espinosa, et al., UPR-M) An Integrated Hardware/Software Environment for the Implementation of Signal Processing Algorithms on FPGA Architectures (María González-Gil, UPR-M) A Computational Framework for Entropy Measures (Marisel Villafañe-Delgado, et al., UPR-M) Continuous-time Markov Model Checking for Gene Networks Intervention (Marie Lluberes, et al., UPR-M) Virtual Instrumentation for Bioacoustics Monitoring System (Hector O. Santiago, et al., UPR-M) Coherence-Enhancing Diffusion for Image Processing (Maider Marin-McGee, et al., UPR-M) Data Quality in a Light Sensor Network in Jornada Experimental Range (Gesuri Ramirez, et al., UTEP) The use of low-cost webcams for monitoring canopy development for a better understanding of key phenological events ( Libia Gonzalez, et al., UTEP) A Model for Geochemical Data for Stream Sediments (Natalia I. Avila, et al., UTEP) Model Fusion: Initial Results and Future Directions (Omar Ochoa, et al., UTEP) Prospec 2.2: A Tool for Generating and Validating Formal Specifications (Jesus Nevarez, et al., UTEP) Data Property Specification to Identify Anomalies in Scientific Sensor Data: An Overview (Irbis Gallegos, UTEP) Fusion of Monte Carlo Seismic Tomography Models (Julio C. Olaya, UTEP) Identifying the Impact of Content Provider Wide Area Networks on the Internet s Topology (Aleisa A. Drivere, et al., YSU)

Smart Phone Healthy Heart Application
Nathan Nikotan
Department of Computer Science, California State University Dominguez Hills
1000 Victoria Street, Carson, CA

EXTENDED ABSTRACT
Patients with undiagnosed high blood pressure face an increased probability of damage to the heart and other internal organs if the condition is left unmonitored and untreated. High blood pressure often goes undiagnosed because there are no visible symptoms. A health care team consisting of a patient, a physician, and primary caregivers needs a mobile device that can provide a monitoring and alert system (e.g., iPhone, Android). It is important to know your blood pressure, even when you are feeling fine. If your blood pressure is normal, you can work with your health care team to keep it that way. If it is too high, you will need treatment to prevent greater damage to the affected internal organs. It is therefore important to develop an inexpensive heart monitoring system based on user-friendly mobile platforms. Using C++ Builder 2009, a Healthy Heart application was developed. Users were able to add information (diastolic pressure, systolic pressure, weight, height, age, etc.), display chart data, set up accounts (i.e., health care team members), and send alerts. However, this initial application will need to be ported to Objective-C for iPhone development and Java for Android development. Future work will include a physical device interface.

KEYWORDS
Cardiac, heart, blood pressure, HBP, medical device, iPhone, Android, physician, patient monitoring, alert system.

1. Introduction
Based on a mobile heart monitoring system [1], a Healthy Heart monitoring system can be developed as a tool linking physicians, patients, and primary caregivers. One-third of Americans may not even be aware of having high blood pressure (HBP). According to comscore.com, the smartphone market continues to garner profitable gains. Among 45.4 million US subscribers [2]:
- Research In Motion, in the lead, with 42.1 percent share
- Apple ranked second with 25.4 percent share
- Microsoft at 15.1 percent
- Google Android at 9.0 percent
- Palm at 5.4 percent
Based on market share, the biggest players are Apple (iPhone) and RIM (BlackBerry), with Google (Android) gaining momentum. The heart monitor application should therefore be developed against explicit evaluation criteria [5].
HBP itself usually has no symptoms. You can have it for years without knowing it. During this time, though, it can damage the heart, blood vessels, kidneys, and other parts of your body. This is why knowing your blood pressure numbers is important, even when you are feeling fine. If your blood pressure is normal, you can work with your health care team to keep it that way. If your blood pressure is too high, you need treatment to prevent damage to your body's organs. [6]

1.1 High Blood Pressure
Blood pressure is the force of blood pushing against the walls of the arteries as the heart pumps out blood. If this pressure rises and stays high over time, it can damage the body in many ways. High blood pressure (HBP) is a serious condition that can lead to coronary heart disease, heart failure, stroke, kidney failure, and other health problems. Blood pressure numbers include systolic and diastolic pressures. Systolic blood pressure is the pressure when the heart beats while pumping blood. Diastolic blood pressure is the pressure when the heart is at rest between beats. [6] Blood pressure tends to rise with age. Following a healthy lifestyle helps some people delay or prevent this rise in blood pressure.

1.2 Blood Pressure Ranges
There are several categories of blood pressure, including:
- Normal: less than 120/80
- Prehypertension: 120-139/80-89
- Stage 1 high blood pressure: 140-159/90-99
- Stage 2 high blood pressure: 160 and above/100 and above
The exact causes of high blood pressure are not known, but several factors and conditions may play a role in its development, including:
- Smoking
- Being overweight or obese
- Lack of physical activity
- Too much salt in the diet
- Family history of high blood pressure
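The paper does not show its classification code; as a minimal sketch that is not taken from the Healthy Heart application, the ranges in Section 1.2 can be mapped to categories as follows. The function name is hypothetical, and a reading is placed in the highest category reached by either number, the usual convention for charts like Figure 1.

```python
def classify_reading(systolic: int, diastolic: int) -> str:
    """Map a blood pressure reading to the categories listed in Section 1.2."""
    if systolic >= 160 or diastolic >= 100:
        return "Stage 2 high blood pressure"
    if systolic >= 140 or diastolic >= 90:
        return "Stage 1 high blood pressure"
    if systolic >= 120 or diastolic >= 80:
        return "Prehypertension"
    return "Normal"

# Example: 145/88 exceeds the systolic Stage 1 cutoff, so it is Stage 1.
print(classify_reading(145, 88))   # -> Stage 1 high blood pressure
print(classify_reading(118, 76))   # -> Normal
```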

Figure 1: Blood Pressure Chart
Figure 3: Mobile Architecture

2. Architecture Designs and Specifications
Given the initial specifications that patients (i.e., users) will be inputting data (e.g., heart beat and pulse readings) and receiving alerts, the Repository Architecture for a web-based application is the best selection for this project. The Repository Architecture Style [4] is a data-centered architecture that supports user interaction for data processing (in contrast to the Batch Sequential Architecture). In a repository style, there are two distinct kinds of components: a central data structure representing the current state, and a collection of independent components which operate on the central data store.

Figure 2: Repository Architecture

2.1 Mobile Application Platforms
Advances in mobile platforms have furthered the development of mobile applications for medical health monitoring (Figure 3). Data can be collected from patients and reported through mobile smartphones. This can reduce costs both in terms of health-care visits and expenses. [13]

2.2 Smartphone Platform Environment
Smartphone capabilities go beyond feature phones (e.g., the Motorola Razr) by including sending SMS, taking pictures, a large screen size, a QWERTY keyboard (virtual or physical), and high-speed wireless connectivity. [7] A mobile platform provides access to the phone's devices. To run software and services on each of these devices, a platform and its core programming language are needed to write mobile applications. Like all software platforms, there are three categories of interest: licensed, proprietary, and open source. Proprietary platforms are designed by device makers and are not available to competing device makers. Palm uses three different proprietary platforms: a C/C++-based platform, a Windows Mobile-based platform, and webOS. Research In Motion (RIM) maintains its own proprietary Java-based platform used exclusively by BlackBerry devices, and Apple uses the proprietary, Unix-based Mac OS X as the platform for its touch devices (iPhone).

3. Healthy Heart Application
Installing and configuring MySQL can be difficult, but it is important to create the healthyheart database (and sample data) so that system testing can be evaluated (i.e., debugging the web-based application). In order for the healthyheart application to run properly, users will need to install and configure MySQL 5.5.7, the MySQL-to-ODBC driver, and AppServer. Below are the installation icons on Windows XP/Vista/7 platforms. See Figure 4.

Figure 4: Installing MySQL Application
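The paper's data-access code is not shown. The sketch below illustrates the repository style described in Section 2 under stated assumptions: a central store holds the readings, while independent components write to it and scan it for threshold breaches (cf. Section 3.1 below). Table and column names are hypothetical, and sqlite3 stands in for the MySQL healthyheart database only so the example is self-contained.

```python
import sqlite3

# Central data store (the repository); a stand-in for the healthyheart database.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE readings (
                    patient_id TEXT, systolic INTEGER, diastolic INTEGER,
                    taken_at TEXT)""")

def record_reading(patient_id, systolic, diastolic, taken_at):
    """Independent component #1: writes a new reading into the central store."""
    conn.execute("INSERT INTO readings VALUES (?, ?, ?, ?)",
                 (patient_id, systolic, diastolic, taken_at))
    conn.commit()

def pending_alerts(systolic_limit=140, diastolic_limit=90):
    """Independent component #2: scans the central store for readings that
    breach the configured thresholds and composes alert messages."""
    rows = conn.execute(
        "SELECT patient_id, systolic, diastolic, taken_at FROM readings "
        "WHERE systolic >= ? OR diastolic >= ?",
        (systolic_limit, diastolic_limit)).fetchall()
    return [f"ALERT: {pid} read {sys}/{dia} at {ts}" for pid, sys, dia, ts in rows]

record_reading("patient-001", 118, 76, "2011-03-27 08:00")
record_reading("patient-001", 152, 95, "2011-03-28 08:00")
for message in pending_alerts():
    print(message)   # only the 152/95 reading triggers an alert
```

The point of the design is visible here: the two components never talk to each other, only to the shared data store, which is what the Repository Architecture Style prescribes.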

Figure 5: Healthy Heart Application
Figure 6: Database Configuration

3.1 Threshold Detection and Alert Notification
For the next iterative cycle, users will need to input data into the smart phone device. This information must be accurate. Assuming measurements are fairly free of error, the application will be able to detect whether a systolic or diastolic threshold has been breached (Figure 7). An immediate alert notification will then be sent to all health care team members (Figure 8).

Figure 7: High Blood Pressure Detected
Figure 8: Alert Notification

3.2 HBP Charts
An important design feature is the ability to see blood pressure readings on a chart (e.g., monthly, yearly). See Figure 9. We are hoping that iPhone development will have simple but effective graphics library support to send and receive images (*.png, *.jpg, etc.) to all health care team members.

Figure 9: HBP Chart

4. Design Implementation
For BlackBerry application development, an Eclipse development environment will need to be built, while touch phone application development can be achieved with J2ME. In Figure 2, J2ME has nested configurations such as the Connected Device Configuration (CDC) and the Connected Limited Device Configuration (CLDC). To build the BlackBerry/Eclipse development environment, three development applications are required:
- Java SE Development Kit 6 Update 20
- BlackBerry JDE
- BlackBerry Simulator
To build the J2ME development environment, two development applications are required:
- Java SE Development Kit 6 Update 20 with JavaFX
- Java ME SDK 3

Figure 10: Design Flow

The mobile phone will receive the several output graphs, which can be saved to the phone, so that the phone is only responsible for minimal computational processing. This allows users to spend more time on other user applications. Migration toward iPhone development would require Mac OS X (Snow Leopard); the development environment is Xcode. The iPhone software stack, from highest to lowest, would be [13]:
1. Heart monitor application
2. Cocoa Touch framework (UI elements, event dispatching, application life cycle)
3. Media (graphics, animation, etc.)
4. Core services (collections, SQLite, networking, etc.)
5. Core OS layer (Unix services, I/O, threads, power management, etc.)

5. Conclusion
For a web-based application, the Repository Architecture for the Healthy Heart Application works sufficiently well, processing several requests over a MySQL database connection. The second phase of this long-term project, porting the application to the iPhone and Android platforms, will require:
- Programming language (Objective-C and Java) preparation
- Development environment setup
The third phase will be to develop device extensions that automate data collection, such as a self-calibrating/self-validating aneroid manometer USB attachment interfaced to the smart phone. This will require electrical engineering principles and manufacturing experience.

ACKNOWLEDGMENTS
The author(s) acknowledge(s) the National Science Foundation (grant no. CNS ) for its support. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

REFERENCES
[1] Sadeghi, A., et al., Mobile Heart Monitoring, May
[2] Flosi, S., February 2010 U.S. Mobile Subscriber Market Share, April 5, Score_Reports_February_2010_U.S._Mobile_Subscriber_Market_Share
[3] Qian, K., et al., Software Architecture and Design Illuminated, Jones and Bartlett Publishers, 2010
[4] Tucker, A., Programming Languages: Principles and Paradigms, 2nd ed., McGraw-Hill Higher Education, 2008
[5] Vaughn Aubuchon, Low-High Normal Blood Pressure Chart, Blood Pressure, November
[6] US Department of Health & Human Services, What is Blood Pressure, November
[7] Rogers, R., et al., Android Application Development, O'Reilly Media, 2009
[8] Meier, R., Professional Android 2 Application Development, Wiley Publishing Inc.
[9] Kochan, S., Objective-C, Pearson Education Inc.
[10] Bucanek, J., Learn Objective-C for Java Developers, Apress Publications, 2009
[11] Fling, B., Mobile Design and Development, O'Reilly Media, 2009
[12] Craft, C., iPhone Game Development, Wiley Publishing, Inc.
[13] Dudney, B., iPhone SDK Development: Building iPhone Applications, The Pragmatic Programmers, 2009
[14] Wargo, J., BlackBerry Development Fundamentals, Addison-Wesley Professional, 2010
[15] Rizk, A., Beginning BlackBerry Development, Apress Publications, 2009
[16] King, C., Advanced BlackBerry Development, Apress Publications, 2010

Figure 11: Smart Simulation

27 Asserting Concepts in Medical Records using Classical and Statistical Techniques Charles Cowart California State University, San Marcos 333 S. Twin Oaks Valley Road San Marcos, CA (760) ABSTRACT Medical treatment records began as the notes of a patient s physician but in the information age are of increasing importance in a variety of situations. If medical text can be properly parsed, then computers can aid doctors in diagnosing previously unseen or misdiagnosed conditions in patients. Prior work in processing medical treatment records has broken the problem down into a number of smaller problems. These include the identification of medical concepts such as diseases and treatments, as well whether the patient has or does not have the disease, etc. Medical problem classification solutions often approach this problem from either the classical natural language processing perspective or with statistical machine-learning based techniques. We propose that a hybrid system incorporating techniques from both approaches can both leverage the strengths of each while mitigating many of their weaknesses. Our effort classified medical problems using a feature generation process based on classical natural language processing techniques and a Naïve Bayes classifier. Our initial results achieved an accuracy of 86.65% in recall, 86.71% in precision, and 86.68% F-Measure in competition using a relatively small set of only 169 features. This effort was supported in part by NSF grants CNS and SACI-LSA KEYWORDS Natural Language Processing, Machine Learning, Classification, Text Mining 1. INTRODUCTION The goal of the 2010 i2b2/va Challenge Evaluation was to work on three main tasks: 1) extraction of medical problems, tests and treatments; 2) classification of assertions made on medical problems and 3) relations between medical problems, tests and treatments. The data for this challenge included discharge summaries from Partners HealthCare and from Beth Israel Deaconess Medical Center, as well as discharge summaries and progress notes from University of Pittsburgh Medical Center. All records have been fully de-identified and manually annotated for concept, assertion, and relation information. A detailed description of the challenge can be found at the i2b2 website [1]. Our goal was to develop a solution that combined the strengths of both classical natural language processing (NLP) and machine-learning (ML) techniques. Classical NLP-based systems rely on hand-built parsers that are developed with knowledge of the target language s grammar. They often represent a top-down approach where the target sentence is broken down completely with the goal of extracting all possible meaning [4]. Several weaknesses exist with this approach. Unlike computer-centric languages, human-centric languages have been shown to have ambiguous grammar parsings; that is, a sentence can have multiple valid syntactic structures and the listener is left to determine which parsing or reading of the sentence is the correct one. Moreover, in contexts such as medical records, where it is assumed that the text is intended for a knowledgeable medical professional, sentences may be incomplete and much information may be implied or ungrammatical. This makes designing parsers that are flexible with regard to grammatical errors or implied knowledge more difficult. Moreover, the deterministic nature of such parsers makes them much less flexible when parsing new data than say a machine-learning parser. 
Machine learning approaches handle ambiguity better than their classical counterparts, in general by being statistical rather than deterministic in nature [5]. Text is characterized by measurable heuristic properties specific to the corpus the parser was trained on. These properties may be identified manually or with the aid of feature selection algorithms. In either case, however, to improve the performance of the parser, additional features of an increasingly discriminating nature are added. The addition of too many features can lead to overfitting, a condition where the feature set is mapped so closely to the training data that the parser becomes less accurate, rather than more accurate, when attempting to parse new data [6]. The tendency to add additional features also leads to models that are computationally more expensive to train and test. During the competition it was revealed that some teams used as many as 100,000 to 1,000,000 or more features. Given that the data from this problem domain is limited to patient records, where sentences are limited to communicating a relatively small number of concepts, it could be argued that such a high number of features is more complex than the data in the domain itself.

We posit that most of the sentences in the data are implicitly about the patient, if not explicitly stated. Moreover, the sentences are for the most part limited to tests and treatments the patient may or may not have been given, as well as problems the patient may or may not have in varying capacity. Also, if a problem, treatment, or test refers to another individual (most likely a relative of the patient), that individual is explicitly mentioned. Thus, the overall range of what is potentially being communicated is relatively limited when compared to language in general. We also posit that, with respect to asserting the correct semantic category for a previously identified problem, we do not need to fully understand a sentence; rather, we only need to understand enough to reliably assert the correct category for the targeted problem. With this in mind we propose the application of classical, deterministic NLP techniques to normalize and simplify sentences as well as identify potential features. In this fashion, we take advantage of what deterministic techniques can reliably offer. We use them to encapsulate as much dependence as possible within features, as opposed to between features, and thereby potentially simplify our feature model. Our simplified feature set, along with appropriate weighting, is then used to train a statistical model that is better suited to handling ambiguities than deterministic methods. In the following sections we describe our system, based on a hybrid approach that combines dictionary look-up and pattern matching with machine-learning based part-of-speech taggers, shallow parsers, and classifiers.

1.1 Task Description
The possible assertion values that could be assigned to an identified problem were predefined by the organizers of the task as:
1. present (patient has concept)
2. absent (patient does not have concept)
3. possible (patient may have concept)
4. conditional (patient has concept under specific conditions)
5. hypothetical (patient may develop concept in the future)
6. associated with someone else (problem exists but not for the patient in question)
We posit that it should be possible to perform the classification without having to fully ascertain the meaning of each sentence. Given a small fixed set of possibilities, a basic metric of success is whether or not our algorithm performs better than a purely random assignment of assertions. Given a set of six possibilities, a purely random assignment of assertion values would correspond to a 16.67% success rate. From the initial test data, however, we determined that not all assertion values occur with equal frequency; the present value in particular occurred with roughly 67% frequency. Thus, a simple process assigning present to each input could potentially be correct some 70% of the time. Our initial goal then was to demonstrate that our algorithm could achieve more than 70% accuracy.
2. METHODOLOGY
In this section we describe the different steps in feature generation using NLP methods.

2.1 Extract, Transform, and Load Process
Our first objective was to split the given training data, represented in various file formats, into a database schema. Data was validated for consistency during the extraction-transform-load (ETL) process using scripting methods. A minimal amount of normalization was performed to make processing of the data straightforward; the only normalization allowed at this stage consisted of transformations that could be safely reversed, such that the original data files could be rebuilt. PostgreSQL was selected as our relational database management system based on our prior knowledge of its administration.

2.2 Normalizing Data Post-Query
With our training and test data both managed by PostgreSQL, querying for data becomes more straightforward as our feature generation process evolves. Normalization of the data prior to feature generation is an important process in and of itself, and occupies several stages. Normalization is performed post-query because, as our feature generation process evolves, so does the nature of what is considered relevant normalization. For example, numbers appearing throughout the queried data are unique, and the value of their uniqueness can only be determined through experimentation. Thus, it is appropriate to perform normalization at this stage, rather than during the ETL phase; otherwise we may

find later in the process that we cannot access valuable data. Normalization can be classified into two categories: that which is context independent, and that which requires an understanding of the context in which the word or phrase is used in order to be normalized correctly. For example, since analysis of each problem is limited to the sentence it is found in, the "." character becomes optional, as it is only used to mark the end of a sentence and, optionally, within acronyms such as "p.i.d." Removing "." from the queried data eliminates variation and needless characters. Similarly, since capitalization of proper nouns may be infrequent and is at the discretion of the medical practitioner, the sentence is converted entirely into lower case to further eliminate variation. We take advantage of this by rendering recognized (typed) elements in upper-case text, making for text that is easy to understand for both software and the human reader. Simultaneously, we performed an N-gram generation test to determine if reliable features could be identified without this process. We took each sentence in turn and built from it a list of all single words, then all pairs of adjacent words, and so on, up to 10 words or until the end of the sentence was reached. We then generated a unique list from this data and scanned the entire dataset for occurrences of each N-gram in each subset of training data (subset by assertion type). We found that no N-gram occurred with greater than 3-10% frequency in any of the assertion types. An example of context-dependent normalization is the recognition (and upper-capitalization) of key words that can affect the assertion of a concept, such as the word "possible" or the phrase "negative for". In such cases we cannot process the keyword until we understand what it is properly associated with. For example, "Patient negative for X" implies that only X is not present, while "Patient negative for X, Y, and Z" implies that the "negative for" should be distributed to X, Y, and Z as well.

3. FEATURE GENERATION
At this point, each target sentence has been normalized to remove excess textual variation and typed as much as is possible, given the information supplied to us and what we were able to reliably determine thus far. Each sentence can now be thought of as words that are either part of a known problem, treatment, or test, or part of the untyped word phrases that can collectively be thought of as determiners: words or phrases, such as "not" or "negative for", which may affect the assertion of any concept they are associated with. A word, however, cannot be both part of a concept and part of a determiner. We also posit that, when determining the relevance these determiners have to a problem, treatment, or test, the specific nature of the problem becomes irrelevant. For example, when determining the value of the word "has", it does not matter whether the sentence is "patient has CANCER" or "patient has ACNE"; thus, such concepts can be replaced with a stock phrase which can still be interpreted correctly by a part-of-speech tagger and shallow parser. Our initial efforts examined each class of assertion training data (absents, possibles, etc.), manually identified keywords or key phrases that occurred with enough frequency to become readily apparent, and manually generated regular expressions that would convert these key phrases and their variations into a single keyword. This would produce a drastically simplified sentence that could in turn be reduced again by combining multiple keywords.
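To illustrate the kind of normalization and keyword substitution described above, the sketch below lower-cases a sentence, strips periods, replaces a previously typed concept with a stock token, and collapses a few key phrases into upper-case keywords. The specific patterns and keyword names are invented for illustration; the actual system relied on more than 400 hand-written regular expressions.

```python
import re

# Context-independent normalization: strip periods (analysis is per-sentence,
# and "." only ends sentences or appears inside acronyms) and lower-case
# everything to remove capitalization variation.
def normalize(sentence: str) -> str:
    return sentence.replace(".", "").lower()

# A few illustrative context-dependent rules: each regular expression maps a
# key phrase and its variations onto a single upper-case keyword.  These
# patterns are made up for this sketch.
KEYWORD_RULES = [
    (re.compile(r"\b(no evidence of|negative for|denies)\b"), "NEGATIVE_FOR"),
    (re.compile(r"\b(possible|possibly|likely|suspicious for)\b"), "POSSIBLE"),
    (re.compile(r"\b(history of|h/o)\b"), "HISTORY_OF"),
    (re.compile(r"\b(mother|father|brother|sister|family)\b"), "OTHER_PERSON"),
]

def to_keywords(sentence: str, concept: str) -> str:
    """Replace the typed concept with a stock token and collapse key phrases."""
    text = normalize(sentence).replace(concept.lower(), "CONCEPT")
    for pattern, keyword in KEYWORD_RULES:
        text = pattern.sub(keyword, text)
    return text

if __name__ == "__main__":
    # "pneumonia" plays the role of a previously identified problem concept.
    print(to_keywords("Patient negative for pneumonia.", "pneumonia"))
    # -> "patient NEGATIVE_FOR CONCEPT"
```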
Where a parsing was likely but perhaps not conclusive, the keywords were left alone, as it was expected that this would be accounted for by the machine-learning classifier. In our initial effort, we used over 400 regular expressions and identified 169 features for use in the machine-learning classifier.

4. CLASSIFICATION
Our classification during the i2b2 competition was carried out using the Naïve Bayes algorithm [2], because it is straightforward and well known for generating good results. A Naïve Bayesian classifier is an algorithm that uses Bayes' theorem to separate or classify a set of inputs based upon a prior set of pre-classified inputs. Given a set of examples for each of the six assertion types, each example containing the presence of one or more features, a complete list of observed features can be determined and each example can be represented as a row in a features matrix. Within this features matrix, the presence or absence of a feature in each example is represented as either yes or no. Relative to other probabilistic classifiers, the Naïve Bayesian classifier is straightforward; it assumes that each feature occurs independently of the others and contributes independently to the probability of the assertion type occurring. This proved to be of particular value to our classification process, as modeling feature dependence would otherwise require a dataset that scales exponentially with the number of features. Given that we did not know at the time how many features we might define, and that we ultimately declared 169 features, the example data needed to represent each and every possible combination of features would have far outstripped the data available to us. Moreover, it is worth noting that our iterative approach of merging features, where appropriate, into more unique and specific features may contribute to encapsulating some of the relatedness of the prior features.
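The classification step itself was performed with Weka's Naïve Bayes implementation via an ARFF file, as described below. Purely as an illustration of the same idea, the following sketch builds a toy yes/no feature matrix and trains a Bernoulli Naive Bayes classifier with scikit-learn; the feature names, example rows, and library choice are assumptions for demonstration and do not reproduce the authors' Weka pipeline.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy feature matrix in the spirit of the ARFF file: one row per training
# sentence, one yes/no (1/0) column per generated feature.  Feature names
# and rows are invented for illustration only.
FEATURES = ["NEGATIVE_FOR", "POSSIBLE", "HISTORY_OF", "OTHER_PERSON"]

X_train = np.array([
    [0, 0, 0, 0],   # "patient has CONCEPT"
    [1, 0, 0, 0],   # "patient NEGATIVE_FOR CONCEPT"
    [0, 1, 0, 0],   # "POSSIBLE CONCEPT"
    [0, 0, 0, 1],   # "OTHER_PERSON has CONCEPT"
])
y_train = ["present", "absent", "possible", "associated_with_someone_else"]

# BernoulliNB models each binary feature independently given the class,
# matching the independence assumption described above.
clf = BernoulliNB()
clf.fit(X_train, y_train)

x_test = np.array([[1, 0, 0, 0]])     # a sentence containing NEGATIVE_FOR
print(clf.predict(x_test)[0])          # -> "absent"
```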

We grouped our keyword-laden sentences by assertion type and reduced each into a group of keywords. We then scripted a process to take this data, along with the list of unique features found, and produce an ARFF-format file used by the Weka machine-learning software library [3]. The ARFF file specified each of the 169 features, with a possible value of either yes or no representing its presence or absence in a sentence. It also contained keywords from each of the roughly 2,800 sentences in the original training set. Training the Weka implementation of the Naïve Bayes classifier is a relatively straightforward process in which the previously mentioned ARFF file is used as input and a Weka model-formatted file is obtained as output. This model is subsequently used along with test input to produce output indicating the most probable assertion for each test input, as determined by Weka and the model. Scripting was used to map the results back to the original test input. The initial training set of 2,887 example concepts marked as problems and released by i2b2 was used as our primary training set. The subsequent release of the full assertion examples by i2b2 became our initial test set. The initial 2,887 example assertions proved to be 90% effective in correctly determining the assertion type of the full example assertions (of which the set of 2,887 examples was a subset). Repeating this process using the entire 11,968-sentence assertion training set, we correctly classified the i2b2 test set of 18,550 assertions 86.7% of the time, based on the ground truth set that was released shortly thereafter. More specifically, our initial results achieved 86.65% recall, 86.71% precision, and an 86.68% F-measure.

A closer examination of the incorrectly tagged sentences showed a dominance of failures where the nature of the concept had not been taken into account. For example, allergies by their very nature are conditional on the presence of an allergen. However, our algorithm would classify the sentence "Patient has allergies." as concept present rather than concept conditional. In several tests where subsets of our original feature set were identified using feature selection algorithms, we were able to show that the Naïve Bayes algorithm performed competitively with Support Vector Machines. Our work with the Naïve Bayes algorithm showed that, with proper encapsulation of dependence within features, the issue of feature independence in Naïve Bayes classifiers may become less of a factor when assessing the strength of the algorithm against more popular classifiers such as Support Vector Machines. Our goal is to apply our feature generation process to a wide variety of machine-learning classifiers and identify which algorithms and parameters lead to the most accurate and most reliable performance.

5. CONCLUSIONS
Our work with manually generated regular expressions proved to be effective at simplifying the types of sentences found in medical records and at generating from them small, focused sets of features for use in machine-learning classifiers that remain competitive in performance and require fewer resources to train and operate.

6. REFERENCES
1. i2b2 NLP Challenge website (2010, November 1).
2. T. Mitchell. Machine Learning. McGraw Hill.
3. Weka Toolkit (2011, January 5).
4. D. Jurafsky & J.M. Martin. Speech and Language Processing, 2nd edition. Pearson Education, 2009.
5. Introduction to Special Issue on Machine Learning Approaches to Shallow Parsing. Journal of Machine Learning Research 2 (2002).
6. Ian H. Witten, E. F. (2005). Data Mining: Practical Machine Learning Tools and Techniques. San Francisco: Morgan Kaufmann.
7. Shortliffe, E. H., & Cimino, J. J. (2006). Biomedical Informatics: Computer Applications in Health Care and Biomedicine. New York: Springer.

Generating Facial Vasculature Signatures Using Thermal Infrared Images
Ana M. Guzman, Florida International University, W. Flagler Street, Miami, FL
Mohammed Goryawala, Florida International University, W. Flagler Street, Miami, FL
Malek Adjouadi, Florida International University, W. Flagler Street, Miami, FL

ABSTRACT
This paper presents preliminary findings using thermal infrared imaging for the detection of the human face vasculature network at the skin surface and the generation of vasculature signatures. A thermal infrared camera with reasonable sensitivity provides the ability to image superficial blood vessels on the human skin. The experiment presented here consists of the image processing techniques applied to thermal infrared images captured using a midwave infrared camera from FLIR Systems. For the purpose of this experiment, thermal images were obtained from 10 volunteers; each was asked to sit straight in front of the thermal infrared camera and a snapshot was taken of their frontal view. The thermal infrared images were then analyzed using digital image processing techniques to enhance and detect the facial vasculature network of the volunteers and to generate a vasculature signature for each volunteer.

KEYWORDS
Thermal imaging, anisotropic diffusion, vasculature signatures

1. INTRODUCTION
Skin forms the largest organ of the human body, accounting for about 16 percent of a person's weight. It performs many vital roles as both a barrier and a regulating influence between the outside world and the controlled environment within our bodies. Internal body temperature is controlled through several processes, including the combined actions of sweat production and the rate of blood flowing through the network of blood vessels within the skin. Skin temperature can be measured and visualized using a thermal infrared camera with a reasonable sensitivity. Facial skin temperature is closely related to the underlying blood vessels; thus, by obtaining a thermal map of the human face we can also extract the pattern of the blood vessels just below the skin. Thermal infrared imaging has many applications in different scientific, engineering, research, and medical areas [3]. Different studies using thermal infrared imaging have been done to detect spontaneous emotional facial expressions [14], skin tumors [6], frustration [8], and temperature increase on the ear and cheek after using a cellular phone [12], as well as to recognize faces [10]. The studies presented in [1,10] are, to date, the only ones that provide an algorithm for the extraction of the human face vasculature network in the thermal infrared spectrum. Figure 1 illustrates the arteries and veins in the human face. In our previous work [4] we replicated the study in [10]. In this study, we present a modified approach to our previous work to detect vasculatures and introduce the generation of vasculature signatures, which will be used in future research on the creation of a robust biometric system for the purpose of human identification or recognition.

Figure 1. Superficial arteries and veins in the human face. Courtesy of Primal Pictures [7]

2. METHODS
2.1 Participants
For the purpose of this study we collected thermal infrared images from 10 different subjects. Each subject was asked to sit straight in front of the thermal infrared camera and a snapshot of their frontal view was taken. This process was repeated at least two more times on different days and at different times of the day to take into consideration subtle variations that may occur.
2.2 Equipment and Software
The primary equipment used for this study is a thermal infrared camera (Merlin InSb MWIR Camera, FLIR Systems) [15] and a Microsoft Windows based PC. The thermal infrared camera communicates with the PC through a standard Ethernet and iPort grabber connection. The camera consists of a Stirling-cooled Indium Antimonide (InSb) Focal Plane Array (FPA) built on an Indigo Systems ISC9705 Readout Integrated Circuit (ROIC) using indium bump technology. The FPA is a matrix of detectors that are sensitive in the 1.0 µm to 5.4 µm range. The standard camera configuration incorporates a cold filter that restricts the camera's spectral response to the 3.0 µm to 5.0 µm band. The camera has a 25 mm lens with a field of view.

The thermal sensitivity is °C at 30 °C ambient temperature. The absolute temperature measured depends on different factors such as the emissivity of the object, ambient temperature, and humidity. Relevant parameters can be changed in the software (ThermaCAM Researcher V2.8 SR-1) provided by FLIR Systems. The temperature accuracy is ±2 °C or ±2% of reading if all the variables (emissivity, temperature and humidity) are correctly set. In the ThermaCAM software the following values were used for the video recording: emissivity: 0.98; distance: 1.2 m; relative humidity: 50%; temperature: 23 °C. The thermal infrared images obtained are then processed using MATLAB and the FSLView tool in the FMRIB Software Library [11, 13] for image registration.

2.3 Design of the Experiment
The recording of the thermal infrared images was done in a room with a mean room temperature of 23 °C. The infrared camera was placed on a tripod. The subjects were asked to sit straight on a stationary chair with a headrest to avoid any head movement. The chair was placed 1.2 m in front of the thermal infrared camera. The subjects were asked to look straight at the lens and a snapshot of their frontal view was taken. An illustrative example of a thermal infrared image is given in Figure 2.

Figure 2. Thermal infrared image of a volunteer.

2.4 Image Processing
The feature extraction process consists of three main steps: image registration, vasculature extraction, and generation of the signature vasculature.

Image Registration: In this version of our work we first perform intra-subject image registration of the thermal infrared images. Intra-subject image registration was performed using FSLView. The image registration process was achieved using the Linear Image Registration Tool (FLIRT), assuming the rigid body model option for 2D image registration. We used four images from each subject; one was chosen as the reference image and the rest were registered to the reference image. The thermal images of a subject are taken at different times, and therefore there are slight lateral and vertical shifts in the position of the subject relative to the camera's position. Figure 3 shows an example where such shifts are apparent. Registering the thermal images to a reference thermal image of the subject facilitates the process of creating a vasculature signature.

Figure 3. Image registration process using FLIRT.

Vasculature Extraction: After registering the thermal images for each subject we proceeded to extract the vasculature in each image. The vasculature extraction process has four main sections: face segmentation, noise removal, image morphology, and blood vessel segmentation post-processing.

Face Segmentation: In this step the face of the subject was segmented from the rest of the image. The segmentation process was achieved by implementing the technique of localizing region-based active contours, in which typical region-based active contour energies are localized in order to handle images with non-homogeneous foregrounds and backgrounds [5].

Noise Removal: After the face was segmented from the rest of the thermal infrared image we proceeded to remove unwanted noise in order to enhance the image for further processing. Noise removal was achieved by using an anisotropic diffusion filter [9]. This processing step was designed to smooth certain regions while preserving and enhancing the contrast at sharp intensity gradients.

Image Morphology: Morphological operators are based in set theory, the Minkowski operators, and De Morgan's law. Image morphology is a way of analyzing images based on shapes. In this study we assume that the blood vessels are tubule-like structures running along the length of the face. The operators used in this experiment are opening and top-hat segmentation.
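To make these noise-removal and morphology steps concrete, the following sketch chains a Perona-Malik anisotropic diffusion pass with a white top-hat enhancement and the skeletonization used in the post-processing detailed in the next subsections, assuming the face region has already been segmented and loaded as a NumPy array. The diffusion parameters, the disk structuring element, the threshold, and the use of scikit-image are illustrative assumptions rather than the paper's exact MATLAB implementation.

```python
import numpy as np
from skimage.morphology import disk, white_tophat, skeletonize

def anisotropic_diffusion(img, n_iter=15, kappa=30.0, gamma=0.2):
    """Perona-Malik diffusion: smooths flat regions, preserves sharp gradients."""
    img = img.astype(float)
    for _ in range(n_iter):
        # Finite-difference gradients toward the four neighbours
        # (np.roll wraps at the borders, acceptable for this sketch).
        dn = np.roll(img, 1, axis=0) - img
        ds = np.roll(img, -1, axis=0) - img
        de = np.roll(img, 1, axis=1) - img
        dw = np.roll(img, -1, axis=1) - img
        # Edge-stopping conduction coefficients.
        cn, cs = np.exp(-(dn / kappa) ** 2), np.exp(-(ds / kappa) ** 2)
        ce, cw = np.exp(-(de / kappa) ** 2), np.exp(-(dw / kappa) ** 2)
        img += gamma * (cn * dn + cs * ds + ce * de + cw * dw)
    return img

def extract_vasculature(face_img, selem_radius=5):
    """Diffuse, enhance warm vessel-like maxima with a white top-hat
    (input minus its opening), then skeletonize the thresholded maxima."""
    smoothed = anisotropic_diffusion(face_img)
    tophat = white_tophat(smoothed, disk(selem_radius))
    maxima = tophat > tophat.mean() + tophat.std()   # simple illustrative threshold
    return skeletonize(maxima)                        # one-pixel-wide vessel network
```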

The basic effect of an opening operation is to remove some of the foreground (bright) pixels from the edges of regions of foreground pixels. The effect of this opening operator is to preserve foreground regions that have a similar shape to the structuring element, or that can completely contain the structuring element, while eliminating all other regions of foreground pixels. The top-hat segmentation has two versions; for our purpose we use the version known as white top-hat segmentation, as this process enhances the bright objects in the image. This operation is defined as the difference between the input image and its opening. The result of this step is to enhance the maxima in the image.

Blood Vessel Segmentation Post-Processing: After obtaining the maxima of the image we skeletonize the maxima. Skeletonization is a process for reducing foreground regions in an image to a skeletal remnant that largely preserves the extent and connectivity of the original region while throwing away most of the original foreground pixels. In Figure 4 we show the final result of the process outlined in the preceding subsections.

Figure 4. Result of skeletonizing a facial thermal image.

2.5 Generation of Signature Vasculature
The generation of a signature vasculature consists of taking the extracted vasculatures for each subject and adding them together. The resulting image is a composite of four vasculature extractions, each one slightly different from the others. By adding the vasculatures our goal is to keep the features that are present in all the images as the dominant features that define the individual signature. We then apply an anisotropic diffusion filter to the result of the added vasculatures in order to fuse the predominant features; the result of this step is then skeletonized in order to obtain a vasculature signature, as shown in Figure 5.

Figure 5. Process used to generate the signature vasculature.

3. RESULTS
The process outlined in Sections 2.4 and 2.5 of this paper is applied to every thermal image obtained from each subject. Figures 6-11 show the results of performing the described process on three subjects as illustrative examples, with a clearly distinct signature vasculature for each individual.

Figure 6. Overlay of signature vasculature on a subject's thermal image.

Figure 7. Overlay of signature vasculature (red) on individual vasculatures (white) of the subject in Figure 6.

Figure 8. Overlay of signature vasculature on a subject's thermal image.

Figure 9. Overlay of signature vasculature (red) on individual vasculatures (white) of the subject in Figure 8.

Figure 10. Overlay of signature vasculature on a subject's thermal image.

Figure 11. Overlay of signature vasculature (red) on individual vasculatures (white) of the subject in Figure 10.

4. CONCLUSIONS
The presented results show that thermal infrared images allow us to extract facial vasculature features through a set of integrated image processing techniques. The results also show that the facial vasculature among individuals is unique and that there is little change in its structure when the images are taken at different times; through the generation of a signature vasculature we can observe that there are vasculature features that remain constant. These constant features will allow us to match the vasculature signature to a specific individual. The implementation will then extend to what we believe will be a robust biometric system [2]. The authors continue to work on a similarity measure that will quantify the similarity of the signature vasculature among subjects; these results will be part of a future publication.

ACKNOWLEDGEMENTS
The authors appreciate the support provided by the National Science Foundation under grants CNS-BPC-AE , CNS-MRI-R , and HRD-CREST .

REFERENCES
1. Buddharaju, P., Pavlidis, I.T., Tsiamyrtzis, P., Bazakos, M., Physiology-Based Face Recognition in the Thermal Infrared Spectrum, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 29, No. 4, April
2. Chen, Y., Adjouadi, M., Han, C.A., Wang, J., Barreto, A., Rishe, N., A highly accurate and computationally efficient approach for unconstrained iris segmentation, Image and Vision Computing, 28(2), February
3. Diakides, N. & Bronzino, J., Medical Infrared Imaging (CRC Press, Taylor and Francis Group, FL, 2008).
4. Guzman, A., Adjouadi, M., Goryawala, M., Detecting the Human Face Vasculature Using Thermal Infrared Imaging, CAHSI 4th Annual Meeting, Redmond, WA, April
5. Lankton, S., Tannenbaum, A., Localizing Region-Based Active Contours, IEEE Trans. on Image Processing, Vol. 17, No. 11, Nov.
6. Mital, M. and Scott, E., Thermal Detection of Embedded Tumors Using Infrared Imaging, Journal of Biomechanical Engineering, Vol. 129, February 2007.
7. Moxham, B.J., Kirsh, C., Berkovitz, B., Alusi, G., and Cheeseman, T., Interactive Head and Neck (CD-ROM), Primal Pictures, Dec.
8. Pavlidis, I., Dowdall, J., et al., Interacting with Human Physiology, Computer Vision and Image Understanding, Vol. 108 (2007).
9. Perona, P., Malik, J., Scale-Space and Edge Detection Using Anisotropic Diffusion, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 12, No. 7, July
10. Socolinsky, D., Selinger, A., Neuheisel, J., Face Recognition with Visible and Thermal Infrared Imagery, Computer Vision and Image Understanding, 91 (2003).
11. Smith, S.M., M. Jenkinson, M.W. Woolrich, C.F. Beckmann, T.E.J. Behrens, H. Johansen-Berg, P.R. Bannister, M. De Luca, I. Drobnjak, D.E. Flitney, R. Niazy, J. Saunders, J. Vickers, Y. Zhang, N. De Stefano, J.M. Brady, and P.M. Matthews, Advances in functional and structural MR image analysis and implementation as FSL, NeuroImage, 23(S1).
12. Straume, A., Oftedal, G., Johnson, A., Skin Temperature Increase Caused by a Mobile Phone: A Methodological Infrared Camera Study, Bioelectromagnetics 26 (2005).
13. Woolrich, M.W., Jbabdi, S., Patenaude, B., Chappell, M., Makni, S., Behrens, T., Beckmann, C., Jenkinson, M., Smith, S.M., Bayesian analysis of neuroimaging data in FSL, NeuroImage, 45:S.
14. Zeng, Z., Spontaneous Emotional Facial Expression Detection, Journal of Multimedia, Vol. 1(5): 1-8.
15. Merlin Mid InSb MWIR Camera, User's Guide, Version 120.

35 A GPU Approach to Extract Key Parameters from ieeg Data Gabriel Lizarraga Florida International University West Flagler Street, EAS-2220 Miami, FL Mercedes Cabrerizo Florida International University West Flagler Street, EAS-2220 Miami, FL Malek Adjouadi Florida International University West Flagler Street, EAS-2220 Miami, FL ABSTRACT In this paper the possibility of parallel computing applied to intracranial electroencephalograph (ieeg) data processing is explored. It focuses in particular on Fast Fourier Transform (FFT) and its fastest version the so-called FFTW. Applying the FFT algorithm to the ieeg data is the most computational intensive step in the analysis of ieeg data. This study brings the power of GPUs (Graphics Processing Unit) into processing ieeg data. GPUs offer the possibility of parallelizing certain applications to up to 20x the CPU (Central Processing Unit) counterpart. Obtaining results faster can help medical experts make timely decisions and determinations as more detailed analyses and data mining can now be accomplished faster. KEYWORDS: ieeg, CUDA, parallel processing, FFT, FFTW 1. INTRODUCTION Seizures, caused by excessive synchronous neuronal activity in the brain, are unpredictable and occur at random intervals of time. Some patients who have recurrent seizures are able to tell when a seizure is about to happen, however, for most patients the seizure occurs without warning. Seizures can be very damaging and impose a state of fright in people who suffer them, as losing consciousness may occur while walking or driving, and sometimes with dire consequences. EEG and ieeg data have been previously analyzed at the Center for Advanced Technology and Education at Florida International University [1-4], but since the size of the EEG data is a critical issue when dealing with seizure detection and especially prediction paradigms, a reliable and computational efficient method needs to be implemented to speed up such critical processes. In this study we explore the possibility of parallel computing applied to EEG and ieeg data processing. The data was obtained from Miami Children s Hospital (MCH) through our joint Neuro Engineering program between FIU and MCH. We focus in particular on FFT (Fast Fourier Transform) applied to an ieeg dataset as means to explore the computational merits of the proposed approach. This study was conducted during the joint collaboration efforts with UFF (Universidade Federal Fluminense) in Rio de Janeiro, Brazil. UFF has strong research program and department for GPU development. The need to speed up current methods to process ieeg or EEG data comes from the current complexity of the algorithms that are being used, especially in inverse solutions for accurate 3-D source localization and most particularly in algorithms developed for seizure prediction, a problem that remains elusive to this day. Many of these algorithms are of O(N3) or higher order in computational complexity. However, several preprocessing steps have enough data independence to allow parallelization. CUDA offers a sound and cost effective way of processing many parallel threads; millions with current architectures and much more that are anticipated with the newly developed products [5]. 2. BACKGROUND The data used in this study was obtained sequentially from a significant sample of 8 patients who underwent two-stage epilepsy surgery with subdural recordings. A snapshot in time of these EEG data is shown in Figure 1. 
Figure 1: Typical EEG data, where an interictal spike is identified. The age of the subjects varied from 3 to 17 years. The number and configuration of the subdural electrodes differed between subjects, 35

36 and was determined by clinical judgment at the time of implantation. Grid, strip, and depth electrodes were used, with a total number of contacts varying between 20 and 88. The amount of data available for analysis was influenced by its recording duration, and by the degree to which the interictal EEG (EEG between seizures) was pruned prior to storage in the permanent medical record. The ieeg data was recorded at Miami s Children Hospital (MCH) using XLTEK Neuroworks Ver.3.0.5, equipment manufactured by Excel Tech Ltd. Ontario, Canada. The data was collected at 500 Hz sampling frequency and filtered to remove the DC component. All data sets for this particular study were ieeg segments up to 20 minutes in duration approximately with up to 200 Megabytes in storage requirements. The subdural EEG integration module with commercial 3-D software such as CURRY will allow us to scrutinize and validate ieeg results with EEG results that are recorded at the scalp level. In this way the risk at the time of any surgery or resection will be minimized. Furthermore, the co-registration of the MRI and CT using the powerful HERMES research platform will be extremely important in order to identify the boundaries and the exact location of the subdural grids implanted on a given patient. Once these two images are perfectly aligned, an accurate and reliable anatomical location will be derived from the CT, where all grids can be visibly distinguished from image background as illustrated in Figures 2and 3. It is noted that at this time only the ieeg processing is being parallelized, but the next step is to include in the parallelization process both the 3-D volume rendering of the MRI frames and the consequential analysis that will follow as well as the source localization process. 3. EXPERIMENTS A Kernel was programmed to be run on the GPU which also employed the FFTW implementation in CUDA. The reasoning was that such a process will provide a better basis for comparison than coding our own CUDA version. FFTW is a well-known efficient FFT implementation. The code generated for this study ran on a NVIDIA TESLA with 4GB of RAM machine. We compared our results against the CPU of the same machine, i GHz Quad Core, running the FFTW algorithm. CUDA code, as shown in Figures 4 and 5, runs on the GPU, from the GPU memory. Memory has to be allocated on the GPU, the data has to be copied, and then the kernel is executed with very little synchronization among the threads, except for a barrier [5]. Figure 2: Results of moving dipole based on EEG range selected in Figure 1 representing the reconstructed cortex with the electrodes implanted (two grids). Figure 4: CUDA architecture, from (a) Figure 3: Results of Current Density Reconstruction (CDR) and rotating dipole based on 40 electrodes. Figure (a) represents the reconstructed cortex with the electrodes implanted based on the location of the CT image in Figure (b). (b) Figure 5: CUDA Data Parallel Primitives courtesy of 36
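As a rough illustration of the CPU-versus-GPU comparison set up above, the sketch below times an FFT over a synthetic ieeg-sized array, averaging repeated runs as the authors do in their results, and includes the host-to-device copy in the measured GPU time. NumPy stands in for the CPU FFTW run and CuPy (if installed) stands in for the CUDA/CUFFT kernel; the array shape, channel count, and library choices are illustrative assumptions, not the original TESLA implementation.

```python
import time
import numpy as np

def avg_fft_time(data, fft_fn, n_runs=10):
    """Average wall-clock time (microseconds) of fft_fn over n_runs runs."""
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fft_fn(data)
        times.append((time.perf_counter() - start) * 1e6)
    return sum(times) / len(times)

def cpu_fft(segment):
    # CPU baseline; the paper used the FFTW library on the host's quad-core CPU.
    return np.fft.fft(segment, axis=0)

def gpu_fft(segment):
    # GPU path sketched with CuPy as a stand-in for the CUDA/CUFFT kernel;
    # the device copy (cp.asarray) is deliberately inside the timed region.
    import cupy as cp
    result = cp.fft.fft(cp.asarray(segment), axis=0)
    cp.cuda.Device().synchronize()   # wait for the GPU work to finish
    return result

if __name__ == "__main__":
    # Hypothetical 20-minute ieeg segment: 500 Hz sampling, 64 channels
    # (roughly 150 MB in float32, in line with the file sizes reported).
    segment = np.random.randn(500 * 60 * 20, 64).astype(np.float32)
    print("CPU avg (us):", avg_fft_time(segment, cpu_fft))
    try:
        print("GPU avg (us):", avg_fft_time(segment, gpu_fft))
    except ImportError:
        print("CuPy not available; skipping GPU run.")
```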

It should be noted that TESLA hardware only allows a single kernel to be executed at a time. Considerations were taken regarding the size of the data and the amount of RAM available [6]. CUDA has a memory hierarchy which allows many optimizations based on optimal cache management. The programmer has to manage his own cache based on the speed of the different kinds of memories available, such as registers, shared memory, and global memory. CUDA code, despite being C with an API, has the attractiveness of being simple and very easy to write and read. A single function (called a kernel) is created, which will be run by many threads. The first step is to copy the data to the GPU memory. This is achieved by first calling the cudaMalloc() function, which allocates memory on the GPU. Then, cudaMemcpy() is called, which copies the data from main memory to the GPU's memory. The library uses function names that are very familiar to any programmer with C or C++ experience. Once the data lies in the GPU memory, the kernel is executed. Threads are organized in 1D, 2D, or 3D grids, each containing blocks of threads, as illustrated in Figure 6. Each thread executes the kernel independently of the other threads. A scheduler algorithm executes blocks without context switching. This lack of context switching and the ability to spawn millions of threads are what differentiate the CUDA architecture from the CPU architecture. While CPUs have to context switch among processes, CUDA GPUs can switch among blocks without spending the time to unload and reload the state of the process. The programming of FFTW in CUDA uses an already existing library which implements FFTW. This library proved to be very efficient when applied to the ieeg data.

4. RESULTS
These preliminary results show that GPUs can be used to speed up the FFT step of the ieeg processing by up to 20 times. This allows results to be obtained in seconds/minutes versus minutes/hours. As GPUs become available at lower cost and develop into commodity hardware, a machine which employs their power can provide results in almost real time. Figure 7 and Table 1 illustrate how much faster the GPU fares against the CPU. In our time calculations, we included the time to copy the data to the GPU memory; the actual processing time is thus even smaller than what we show here.

Figure 7: Results of Time vs. Data Size Contrasting GPU and CPU Implementations

Figure 6: CUDA Threads architecture, courtesy of

Table 1 shows the results after applying the FFT algorithm to 7 EEG files of different sizes. The first column refers to the average running time in microseconds (average after running the FFT 10 times on one file) using the CPU, and the second column represents the same average, but using the GPU. As can be noticed, the running time using the GPU is significantly smaller when compared to the average running time using the CPU. The algorithm was applied 10 times in order to rule out the possibility of getting the results

by chance. The average time of 10 different runs represents a more realistic time frame for this type of EEG application.

Table 1: Results of the average FFT (10 runs) applied to ieeg files. Columns: CPU average time (microseconds), GPU average time (microseconds), file size (MB).

5. FUTURE WORK
We plan to extend our parallelization to other applications using ieeg as our main processing tool. ieeg data is comprised of independent components which allow for parallel processing of each one. Algorithms such as the one we implemented are very applicable to medical data. Data like MRI, fMRI, or PET can also be analyzed in parallel as they present the sort of data independence required for parallelization. Modern computers nowadays come with multicore CPUs and GPUs, and there seems to be a trend toward even more cores on single chips. As new CUDA architectures emerge (like the FERMI architecture), more kernels can be executed at the same time. This allows several steps of the process to operate at the same time on data which lies in the GPU memory. We can organize our kernels so that the next processing steps make use of the results of the previous calculation, which are already in GPU memory. Given that CPUs can keep processing while the GPUs work on their kernels, unless a blocking call to the GPU is given, other parts of the whole algorithm can make use of the CPU, which otherwise might be idle waiting for the GPU to finish. An algorithm which uses both the CPU and GPUs in an efficient way might suit perfectly as a solution for problems such as the one we are trying to solve.

6. CONCLUSION
It is evident that improving the performance of an algorithm by 20 times in computational time could bear significant impact on medical applications that require heavy computational loads and where large data sets are almost requirements for careful diagnosis. In an earlier research experience, while working with the Barcelona Supercomputing Center, a project was undertaken to run FFTW on Marenostrum, a supercomputer with 10,256 CPUs. Although computational improvements were achieved, they were not as substantial as these latest results obtained with GPUs. Back then we discovered that, due to the size of our data, it was not computationally feasible to break the data into segments smaller than a certain size to be sent to the different nodes. The expense of data breaking and transmission to the nodes was higher than the processing independence gained by having many nodes. So, after a number of nodes, which depended on the size of the input, the processing time was actually larger than the single node's time. CUDA does not have to contend with such an issue, at least in our single-node implementation, as everything is allocated in a single computer and there are no penalties for transmitting data to other nodes. Other researchers are working on distributed implementations of GPUs which work with a library similar to MPI (message passing interface), widely used for CPUs. A measure of caution should however be considered, as such implementations on GPUs might inherit some of the problems we observed with the use of Marenostrum at the Barcelona Supercomputing Center.

ACKNOWLEDGMENTS
The authors appreciate the support provided by the National Science Foundation under grants CNS-BPC-AE , CNS-MRI-R , HRD-CREST , CNS , and NSF-OISE-PIRE . The authors are also thankful for the clinical support provided through the Ware Foundation and the joint Neuro-Engineering Program with Miami Children's Hospital.

REFERENCES
1. X.
You, M. Adjouadi, M. Guillen, M. Ayala, A. Barreto, N. Rishe, J. Sullivan, D. Dlugos, J. VanMeter, D. Morris, E. Donner, B. Bjornson, M.L. Smith, B. Bernal, M. Berl, W.D. Gaillard, Sub-Patterns of Language Network Reorganization in Pediatric Localization-Related Epilepsy: a Multisite Study, DOI: , Human Brain Mapping, May
2. M. Ayala, M. Cabrerizo, P. Jayakar, and M. Adjouadi, Subdural EEG Classification into Seizure and Non-seizure Files Using Neural Networks in the Gamma Frequency Band, Journal of Clinical Neurophysiology, 28(1):20-29.
3. M. Tito, M. Cabrerizo, M. Ayala, P. Jayakar, and M. Adjouadi, A Comparative Study of Intracranial EEG Files Using Nonlinear Classification Methods, Annals of Biomedical Engineering, Vol. 38(1), January
4. M. Tito, M. Cabrerizo, M. Ayala, P. Jayakar, and M. Adjouadi, Seizure Detection: An Assessment of Time- and Frequency-Based Features in a Unified 2-D Decisional Space Using Nonlinear Decision Functions, Journal of Clinical Neurophysiology, Vol. 26(6), Dec.
5. CUDA Zone, NVIDIA Corporation.
6. Sanders, J. and Kandrot, E., CUDA by Example: An Introduction to General-Purpose GPU Programming, Addison-Wesley Professional.

39 Assessing the Performance of Medical Image Segmentation on Virtualized Resources Javier Delgado Florida International University W. Flagler Street, Room EC Malek Adjouadi Florida International University W. Flagler Street, Room EC ABSTRACT Advances in technologies for virtualization of computing resources has led to its wide spread use for several application domains. For example, cloud computing, which has the potential to transform the way users consume computing resources, depends heavily on virtualization. Given the possible impact of cloud computing, it is beneficial to know how applications of interest will perform in such an environment. In this paper, we analyze the effect of virtualization using Xen on the performance of a popular medical image segmentation program. We compare the performance of virtualized to nonvirtualized (bare metal) resources with identical hardware. In addition to overall run time, we investigate the execution profile of the runs to get an in-depth idea of the performance bottlenecks. The results we obtain, surprisingly, result in quicker executions in the virtualized environment. We perform system profiling in an attempt to determine why the execution takes longer on physical machines. We find that the physical machines suffer from a large number of interrupts, which partially explains the results obtained. In the end, we conclude that the impact of virtualization is minimal, and likely no higher than the impact of not having a specialized operating system. KEYWORDS Medical imaging, segmentation, virtualization 1. INTRODUCTION The compartmentalization of computing resources into resource containers or virtual machines has become very common. The benefits of virtualization, such as resource consolidation and easeof-deployment, have been particularly attractive for cloud computing. Cloud computing can provide an economical advantage for users who do not require constant utilization of resources. This includes several types of users, such as researchers who spend large amounts of time modifying their applications and researching alternative methods and algorithms. Although these types of users are not constantly using resources, they do require a fast response time when running their programs. The layer of abstraction created by virtualization results in a performance impact which in turn could lead to certain programs running too slowly on virtual machines. There are mixed results in the literature regarding the magnitude of this impact [1][3][4]. This is due to a difference in the performance impact for certain computations. For example, I/O performance has historically been impacted significantly by virtualization [1][4]. In this paper, we describe the performance impact of running one of the applications we frequently use in a virtualized environment. We also provide an analysis on the performance requirements of this application in order to get an idea of where the virtualization is causing the greatest performance impact. 2. BACKGROUND 2.1 Benefits of Virtualization There are multiple benefits to using virtualization. We briefly discuss the benefits and explain how they relate to medical imaging: Consolidation. Virtualization allows multiple virtual machines (i.e. guest machines) running in a single physical machine to replace multiple physical machines. This results in savings on initial operating costs. A traditional example is the consolidation of web application servers and database servers into a single physical machine. 
The virtual machine manager or hypervisor running on the physical machine provides fair and flexible resource partitioning among the virtual machines. Since the applications we deal with are computationally intensive, consolidation in the traditional example given above is not applicable. However, given that many imaging tasks are parallel in nature (i.e. "embarrassingly parallel"), virtualization may ease the distribution of sub-tasks to separate processing cores in a system. Given the ubiquity of multi-core systems, this could be a significant advantage. Since cloud resources are provisioned as virtual machines and subject to the overhead of consolidation, we consider knowing the performance impact to be important. Fault tolerance. Hypervisors provide facilities to suspend, resume, and migrate virtual machines if failures are suspected. We do not explore this feature in this work, but note that it is important for long-running jobs that can benefit from migrating in case of impending interruption in the system(s) on which they are run. Security. Security is a major concern for medical data and is one reason why medical professionals and researchers may be apprehensive about using cloud resources. Virtualization alone cannot ensure that others cannot see your data; the hypervisor has access to the data residing in guest machines at all times. If it were possible to guarantee that guests can secure data access from the hypervisor, users concerned with security would feel more comfortable using the resources. Such a study is beyond the scope of this paper. However, even for deployments such as private clouds, the other

benefits of virtualization are still applicable. This means that the security constraint does not imply that virtualization and/or cloud computing are not viable for medical applications. Ease of Deployment. The additional abstraction layer provided by the hypervisor means that it is not necessary to configure virtual machines for different hardware, i.e. a virtual machine can be created and deployed on any machine on which a compatible hypervisor is available. Scientific applications such as those used in medical imaging tend to be difficult to install, so being able to deploy a preconfigured image is beneficial.

2.2 Related Work
A number of studies exist in the literature comparing the performance of applications running on virtual machines to their bare metal counterparts. In [3] the authors report the performance impact of using Xen on the NASA Advanced Supercomputing (NAS) Parallel Benchmarks [6]. They find the impact of Xen is typically minimal, although it becomes significant with a lot of parallel I/O. In [1], the authors perform several HPC benchmarks on guest machines on top of Xen and on bare metal resources and then compare their performance. They find that the performance impact is minimal. However, they use a customized environment for enhanced I/O performance. In [4], on the other hand, the authors find that there is significant overhead when using either Xen or the Kernel Virtual Machine (KVM) included with the Linux kernel [2].

3. EXPERIMENTS
3.1 Application
The medical task we investigate for this work is image segmentation of brain data. Segmentation allows medical professionals to observe the volumes, intensities, and ratios of the three main tissue types found in the human brain: white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF). We used the FMRIB FSL [5][10] software package for the segmentation. This is a well-known software suite and is provided with source code, which allows us to perform a detailed analysis of the executions. Specifically, we use the Brain Extraction Tool (BET) to extract brain-specific image data and the FMRIB Automated Segmentation Tool (FAST) [12] to segment the images into WM, GM, and CSF. Since FAST takes up the majority of the time, we focus on it for the performance analysis.

3.2 Data
Four sets of 3D MRI data provided by Miami Children's Hospital are used. Each data set contains 124 images sized 256 x 256, with pixel size of x x .

3.3 Infrastructure
We use the Mind cluster located at the Center for Advanced Technology and Education (CATE). It consists of identical nodes containing dual Pentium Xeon 3.6 GHz processors (NetBurst architecture). Each node contains 2 GB of RAM. NPACI Rocks version 5.2 is used as the operating system. For the test, we configured half of the nodes as (bare metal) compute nodes and 4 nodes as vm-container (hypervisor) nodes using the Xen Roll [11], which uses version 3 of the Xen [7] hypervisor. The Xen VM images used contain the same version of the CentOS operating system used by Rocks.

3.4 Experimental Procedure
All experiments are executed three times on a bare metal (BM) node and on a virtual machine (VM) running inside a hypervisor node. All results presented reflect the averages of the three executions of FAST. The first tests simply measure the total execution times. To better understand where the bottlenecks of the executions reside, we repeat the experiments using the gprof [8] profiling tool.
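The experimental procedure just described (BET followed by FAST, executed three times per data set and averaged) can be scripted along the following lines. The file names are placeholders, the bare bet/fast invocations assume the FSL command-line tools are on the PATH with default options, and the script is a sketch of the measurement loop, not the authors' actual harness.

```python
import subprocess
import time

DATASETS = ["subject1_T1.nii.gz", "subject2_T1.nii.gz"]   # placeholder file names
N_RUNS = 3                                                # three executions, averaged

def timed_run(cmd):
    """Run one command and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

for image in DATASETS:
    brain = image.replace(".nii.gz", "_brain.nii.gz")
    subprocess.run(["bet", image, brain], check=True)      # brain extraction (BET)
    # Time only FAST, which dominates the total segmentation time.
    times = [timed_run(["fast", brain]) for _ in range(N_RUNS)]
    print(f"{image}: FAST average {sum(times) / N_RUNS:.1f} s over {N_RUNS} runs")
```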
For the gprof experiments, we re-compiled the entire FSL suite with debugging symbols and gprof support enabled. The default compiler optimization (O3) was used. The executions performed were the same as in the previous section.

4. RESULTS
4.1 Overall Execution Time
The overall execution times are shown in Figure 1. The surprising realization is that the executions running inside VMs are faster. The advantage is small but significant and exists for all four datasets, which implies that it is not just anecdotal. Before looking closer at the code, we performed some additional experiments to try to isolate this unexpected behavior. First, since the application was residing on a network file system, we copied the relevant files to local storage and re-ran the tests. We also executed the experiments on different BM and VM nodes. The results were the same, so we performed a more in-depth inspection of the execution.

Figure 1. Execution times of four data sets on bare metal (BM) and on virtualized (VM) resources.

We used amon, an in-house developed lightweight system profiling tool described in [9], to determine if there is a system-related culprit. Amon reports time (separated into wall, user, and system times), peak memory usage, and the number of context switches and interrupts for all processes executed in the system. We noted several things. First, the amount of time spent performing system calls is always 20-50% higher for the VM executions. This is expected, since these calls can result in overhead from the hypervisor. However, system time accounts for less than 5% of total execution time, which we do not consider significant for these executions. A more noteworthy observation is that many more context switches and interrupts occur on the

BM nodes. The BM nodes performed about twice as many context switches and about 10 times as many interrupts as the hypervisor nodes. The VMs themselves performed about 1/10 as many context switches as the hypervisor. No other programs were started during the three executions of the program during this time, which leads us to believe that there are OS services causing the interrupts.

4.2 In-Depth Execution Analysis
Looking at overall program completion time alone does not reveal the cause of the performance differences, so we used the gprof profiler to generate a functional breakdown of the executions and observe the difference in the performance of individual functions. Once all executions were completed, the Top 7 functions (i.e. the 7 in which the most time was spent executing) were obtained from the profile trace files and the results plotted. The basic patterns were the same, so due to space constraints we only show the results for datasets 2 (Figures 2-4) and 3 (Figures 5-7). Comparing the figures for dataset 2 to their counterparts for dataset 3 shows the similarity. In the figures, we use function identifiers for brevity. The function names are shown in Table 1. We will only refer to the figures for dataset 2 for the remainder of the analysis. Figure 2 compares the percentage of time taken by the Top 7 functions. As can be seen, the most heavily used function takes almost twice as much of the overall execution time in the BM case. The other functions exhibit comparatively negligible differences. In terms of time (Figure 3), the BM execution takes nearly 4 times as long for this function. Figure 4 shows the cumulative time taken by the Top 7 functions. It shows that, after the large difference in execution time of Function A, the execution time of each function remains about the same for both platforms. We therefore conclude that this function has been the most affected by the overhead on the BM nodes.

Table 1. Mapping of function ID to full function name
A: NewImage::convolve()
B: ZMRISegmentation::TanakaIterations()
C: ZMRISegmentation::MRFWeightsTotal()
D: ZMRISegmentation::PVClassificationStep()
E: ZMRISegmentation::UpdateMembers()
F: NewImage::calc_sums()
G: ZMRISegmentation::Initclass()

Figure 2. Percentage of execution time of the Top 7 functions of Dataset 2 on BM and VM resources.

Figure 3. Execution time of the Top 7 functions of Dataset 2 on BM and on VM resources.

Figure 4. Cumulative execution time of the Top 7 functions of Dataset 2 on BM and VM resources.

Figure 5. Percentage of execution time of the Top 7 functions of Dataset 3 on BM and VM resources.

Figure 6. Execution time of the Top 7 functions of Dataset 3 on BM and on VM resources.

Figure 7. Cumulative execution time of the Top 7 functions of Dataset 3 on BM and VM resources.
5. FUTURE WORK
These experiments are part of ongoing work with the larger goal of optimizing the throughput of medical jobs submitted to clouds by developing a sophisticated scheduling methodology. This will include the application of a performance prediction methodology [9], which will be refined after closer analysis of the performance of the FSL application used in this study.
6. CONCLUSION
We have presented our findings from executing an image segmentation application on bare metal (physical) and virtualized resources. We came to the unexpected realization that the executions on virtualized resources were actually faster. A more fine-grained look at the execution of the application revealed that the execution on the virtualized resources is consistently faster, i.e., the execution pattern in both cases is the same, but the virtualized runs are faster. We thus suspect that the slowdown is due to too many interrupts occurring on the physical nodes, as suggested by performing the executions with a system profiler running. As we move forward, we will perform more in-depth system profiling to reach more definite conclusions. However, we can conclude from these experiments that the overhead due to virtualization is not likely to exceed that of not having a specialized operating system designed to minimize interrupts.
7. ACKNOWLEDGMENTS
The authors appreciate the support provided by the National Science Foundation under grants CNS-BPC-AE, CNS-MRI-R, HRD-CREST, and NSF-OISE-PIRE. The authors are also thankful for the clinical support provided through the Ware Foundation and the joint Neuro-Engineering Program with Miami Children's Hospital.
8. REFERENCES
[1] Huang Wei, Jiuxing Liu, Bulent Abali, and Dhabaleswar K. Panda. A case for high performance computing with virtual machines. In Proceedings of the 20th Annual International Conference on Supercomputing (ICS '06).
[2] I. Habib. Virtualization with KVM. Linux Journal, 166, Feb.
[3] Lamia Youseff, Rich Wolski, Brent Gorda, and Chandra Krintz. Evaluating the Impact of Xen on the Performance of NAS Parallel Benchmarks (extended abstract). UCSB Second Graduate Student Research Conference (GSRC), University of California, Santa Barbara, October.
[4] Michael Fenn, Michael A. Murphy, and Sebastien Goasguen. A study of a KVM-based cluster for grid computing. In Proceedings of the 47th Annual Southeast Regional Conference (ACM-SE 47). ACM, New York, NY, USA, Article 34, 6 pages.
[5] M.W. Woolrich, S. Jbabdi, B. Patenaude, M. Chappell, S. Makni, T. Behrens, C. Beckmann, M. Jenkinson, and S.M. Smith. Bayesian analysis of neuroimaging data in FSL. NeuroImage, 45, 2009.
[6] NASA. NAS Parallel Benchmarks.
[7] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualization. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP '03). ACM, New York, NY, USA.
[8] Susan L. Graham, Peter B. Kessler, and Marshall K. McKusick. gprof: A call graph execution profiler. SIGPLAN Not. 17, 6 (June 1982).
[9] S. Masoud Sadjadi, Shu Shimizu, Javier Figueroa, Raju Rangaswami, Javier Delgado, Hector Duran, and Xabriel Collazo. A modeling approach for estimating execution time of long-running scientific applications.
In Proceedings of the 22nd IEEE International Parallel & Distributed Processing Symposium (IPDPS-2008), the Fifth High-Performance Grid Computing Workshop (HPGC-2008), pages 1-8, Miami, Florida, April [10] S.M. Smith, M. Jenkinson, M.W. Woolrich, C.F. Beckmann, T.E.J. Behrens, H. Johansen-Berg, P.R. Bannister, M. De Luca, I. Drobnjak, D.E. Flitney, R. Niazy, J. Saunders, J. Vickers, Y. Zhang, N. De Stefano, J.M. Brady, and P.M. Matthews. Advances in functional and structural MR image analysis and implementation as FSL. NeuroImage. 23(S1), 2004, [11] Xen Roll: User s Guide. Retrieved Feb. 2010: [12] Y. Zhang, M. Brady, and S. Smith. Segmentation of brain MR images through a hidden Markov random field model and the expectation maximization algorithm. IEEE Trans. On Medical Imaging, 20(1), 2001,

CS0- Computer Animation Course Taught with Alice
Daniel Jaramillo and Dr. Karen Villaverde
New Mexico State University, Department of Computer Science
Box 30001, MSC CS, Las Cruces, NM 88003, USA
kvillave@cs.nmsu.edu
ABSTRACT In this paper we describe our very positive experience in teaching a CS0 course called Computer Animation with Alice as our 3D development platform. We describe why Alice was chosen as our 3D development platform, what material was covered, how the course was conducted, the quality of the students' projects, features of Alice that students liked, problems that students had when using Alice, and the evaluation of the course. We also compare and contrast two offerings of the course: a mostly-male course and an all-female course. KEYWORDS alice, computer animation, CS0, java, video game development
1.INTRODUCTION The Alice computer animation course has been offered at New Mexico State University since Spring 2007 as a CS0 course. The course is an introduction to programming with a computer animation and game design focus that attracts students to enroll and keeps them motivated throughout the semester, without overwhelming them with all the details associated with computer animation and game design. The students develop three complete projects as the semester progresses, with each successive project containing more content than the last as the student learns more features of Alice and becomes more comfortable working on the platform. Learning to program for the first time, creating computer animations and computer games, and learning a 3D game development platform could be difficult for a student to do in one semester; therefore we decided to give Alice [1] an opportunity, as Alice has been proven to be very easy to pick up by CS0 students learning to program [1]. Since Summer 2006, Alice has also been used in the Young Women in Computing [6] camp at New Mexico State University, in which a class of all high school girls learns to program with Alice, completing the normal semester's worth of content in just five weeks. The regular semester class is mostly male attended. In both courses, we were very satisfied with the quality and the variety of projects produced by our students. Our students enjoyed the course very much, learned a lot, and felt a strong sense of accomplishment by having developed three complete animation projects, including a computer game, even though none of them had developed any animation or computer games before. Also, to keep the class interesting and engaging, we added a feature to assignments that was inspired by video games: the idea of hidden extra credit. In the remainder of the paper we describe why we chose Alice as our 3D development platform, what material was covered in our course, how we conducted the course, how we performed the grading, the quality of the students' animation projects, features of Alice that students liked, problems that students had when using Alice, and the evaluation of the course. Also, since we had the opportunity to teach an all-female class with Alice, we will compare and contrast the experiences of the all-female class and the mostly-male class.
2.REASONS FOR CHOOSING ALICE We continue to use Alice as our 3D development platform for the following four reasons: (1) Alice is very easy to pick up by students with absolutely no programming experience.
The drag and drop features of Alice make it difficult for students to make syntactical mistakes. (2) We wanted a 3D development platform that was freely available, multi-operating system, and easy to use. Alice is written in the Java programming language, and can be easily run on all major operating systems. (3) We wanted a 3D development platform that offered enough flexibility to develop different types of animations and computer games without overwhelming the students with options and features. (4) We wanted a 3D development platform with lots of 3D models already built-in. This was very important since we did not want our students spending a long time learning a complicated 3D modeling tool and then spending even more time drawing their 3D models. 3.MATERIAL COVERED The textbook used to teach our course is Learning to Program with Alice [4]. The following paragraphs describe how the book material was covered. The first few classes are dedicated to getting the students settled in and comfortable with accessing Alice. Since the majority of students during the semester are freshmen who do not have a strong programming background, time is spent to make sure they are familiar enough with the Linux operating system, accessing a Windows 7 Virtual Machine and loading Alice. Once they have been walked through the steps, they are assured they do not have to deal with that trouble again and accessing Alice is fairly easy. They are reminded and encouraged to download Alice at home and experiment on their own. Once students are comfortable with accessing Alice, they are then shown the Alice interface while introducing some terminology they will hear throughout the semester. Terms like Object and Method are defined as they are shown the Object Tree, Method Viewer, etc. We also introduce basic programming control constructs and terms, such as functions, methods, variables, and member properties, and students are shown how each object can contain these properties. We are aware that many if not all the 43

44 students might be overwhelmed with the information, so there are many opportunities for questions and the information is repeated throughout the course. Even more so, these topics are covered again in CS1, but in regular Java. At this point, students are encouraged to look through the built in examples, try the tutorial, or even create their own Alice world. From here, we progress through the book material. Beginning with basic movement in Alice and object manipulation we then move to more advanced topics like Event triggers, variables, computer animation, and game design. Some more advanced topics, like recursion, can be taught in Alice, but depending on the amount of time available for the course, they might not be taught. Since the course does not have exams, the final class in the course is a mini game tournament, where the students compete for high scores in each other s games. We have found that the competitive factor pushes the quality of the games a little bit further and helps the students think of ideas to include in their final game project. We also do not want to restrict the student s creativity in the class, so it is not required to make a game that has a competitive nature to it. 4.GRADING Several criteria and different grade percentages are used to grade the students in the course. Attendance and participation are at 30% of the final grade, project 1 at 10%, project 2 at 20%, project 3 at 30% and homework at 10%. The projects are arranged to progressively build up to their computer game project. Project 1 is a small, short trailer for their game, project 2 a longer trailer with sound and project 3, their final game. The requirements of the projects are designed to help with that progress and are graded in accordance with those requirements. Homework was included in the class curriculum for the first time in the Fall of This decision was made to keep the students practicing their skills in between projects. An unintended, but positive side effect of this addition is an increase in class attendance. Also new to the Spring 2010 semester was the inclusion of hidden extra credit. As part of the project requirements, hidden extra credit was mentioned, but the students were not told what exactly they needed to do to receive the extra credit. The result of this was students trying things not yet taught in class or going beyond the base requirements given. The hidden extra credit was not substantial enough to make the student rely on it for the difference between a letter grade but was just a fun addition for students who may have found the content easy or those who wanted to go beyond the requirements. For example, on the list of requirements for a Project, the student will see under extra credit a line of question marks. They are told in class that each question mark represents a letter in a complete sentence that when they complete that requirement, they will receive a small amount of extra credit. The students are not told what the hidden extra requirement is until after all projects are graded. This seemed quite enjoyable for students who wanted to get the most credit out of each project. The extra credit offered in the course can push the student well beyond 100% for their grade. This has encouraged students in the past to strive for very high grades, and to attempt things beyond the requirements given to them. Course attendance and participation is grouped together so that a student must not only attend class, but participate in the exercises of the day. 
If they decide to ignore the lecture or not participate in the exercises given, their attendance could be revoked. However, this action has never had to be taken with the students. Classes missed without prior notice given to the instructor ends up being worth 1% of the 30% available for attendance. For the animation projects, the students are free to select their own themes, story lines, characters, etc. The first project and second project are done individually and for the third one the students have the option of pairing up with another student. The main criteria used in the evaluation of the projects are listed on a requirements list, given to the student a few weeks prior to each project. All the projects are presented and played in class. This greatly facilitated the evaluation of the projects since the students showed all the features of their projects in class. While the final grades varied, a large majority were A s. Very few B s have been given and rarely have there been C s or lower. 5.PROJECT QUALITY The quality of the projects varied greatly across the semesters and the students. Project 1, a simple animation, usually had the most common results, as it was also the easiest project. Most of the projects fulfilled all of the requirements, while not straying too far into the students creative space, as they were still likely getting used the platform. Project 2, a more complicated animation, usually had projects that were either very impressive or not much different than project 1. We believe this difference came from the students who were more comfortable using Alice than the ones who still had the same skill set as project 1. Project 3, the game project, is where quality was the most diverse, with a few students turning in very high quality work, then a majority turning in passable, not outstanding work, then finally, a few students who did not understand the project requirements therefore turning in less than desirable work. We believe that the students who understood Alice completely were the ones able to turn in impressive games. As the semester became more demanding, the class fell as a priority to some students, so even though a majority could have had a greater understanding of the platform if they gave it more time, they were only able to complete the minimum requirements for the final project. The latter tier of quality usually came from students who did not attend class often and did not contact the instructor for assistance. Again, we are hoping the addition of homework will help close this skill gap. The feedback from students from the evaluations was used in figuring out where the problems were and what might be done to fix them, like adding homework. 6.CLASS FORMAT The class was taught to be an introductory level class covering a number of topics in animation, programming and game design. To help spawn creativity in animation, game design, and problem solving, weekly puzzles were given to the students to encourage thinking differently about solving problems. These puzzles came from the PerplexCity [4] line of games. Since Alice is an animation program, animation concepts were a large portion of the course. Students were taught to be very mindful of what the camera does and sees, and were shown different camera techniques in the Alice programming space. Students were also taught the idea of reference, that is if a student wanted to make a person object walk, it was best to look at videos of people walking, or even look at themselves in a mirror. From 44

45 there, they could use the built-in Alice movement commands to create realistic looking movements. Another large portion of the class was focused on programming techniques. From the beginning of the class, starting with storyboarding, students were told to plan ahead before diving into programming. If they knew what they wanted their scene to look like and how it would behave programming-wise, they would have a much easier time putting it all together. Before each example in the class, the instructor would write on the board a pseudo-code script, which the class would then flesh out into more Alice code related movements. Once it was decided that the script was complete, the class would then focus on the objects needed to complete the scene, and then in later classes, custom methods, functions and variables needed. Using those initial plans, basic programming constructs like While, If/Else, Methods and Functions were used to connect the plan to code. This gives the student some familiarity with these constructs if they decide to continue down a programming path in college. Since a large portion of the class is usually Freshmen, it was re-enforced that this type of planning was not only useful in programming and animation, but in other classes, such as writing or even in approaching a math problem. The final portion of the class was in game design, since the final project of the class was a game made in Alice. Here the focus was on the scale of the game, its playablity, and fun. Starting off with planning, like story-boarding or pseudo-code, the students were constantly reminded to keep their game small, as Alice does not save states or allow for dynamically creating and destroying objects. From there, it was important for them to concentrate on the core gameplay and how the player would interact with their game. Things like play testing and getting comments from others were highly suggested. Also, the students had to keep in mind what events needed to be made to allow the player to interact, and that those events made sense. A common control scheme in computer games is to use W,A,S,D for movement, so the students were reminded to use control schemes that made sense. For example, the students were told that it would not made sense to use U for up and D for down as the keys are not close to each other. Focusing on animation and game design during the course made the course more interesting, fun, and relevant to the students and shows the students that Computer Science is more than just writing code. 7.ENJOYABLE FEATURES OF ALICE Among the several features of Alice that students liked for the development of 3D animations and games, we have the following: (1) Alice comes with hundreds of already built-in low polygon 3D models, and all the parts of the 3D models can be individually manipulated before and during run-time. Therefore, there is no need for multiple versions of a 3D model to create an animation. Our students used only the Alice models for their projects. They did not have to spend time learning a modeling tool and then creating and exporting the models to Alice. (2) Alice provides automatic handling of computer graphics. When objects move in the virtual world, their perspective, lightning, and fog effects are modified automatically. Students can also easily adjust the ambient, directional, point, and spot lights, and fog effects during run time. 
Automatic handling of computer graphics allowed students to concentrate on their design, story and mechanics instead of studying and coding complicated computer graphics manipulations and dealing with graphics hardware nuances. (3) Alice provides easy camera manipulation. For different perspectives all that is required is to change a specific property of the camera. (4) Alice comes with a user-friendly interface especially in terms of 3D object manipulations. Re-arranging, resizing, rotating, copying, turning, and tumbling 3D objects in the virtual world is effortlessly achieved by using Alice s quad view of the virtual world, mouse controls and keyboard shortcuts which also allow zooming and scrolling. (5) Alice comes with a capturing pose feature which is very useful for object animation. This feature allows students to make an object look a certain way by manually moving its parts and capturing the new looks (poses) by just clicking a button. Alice can then cycle through these poses at run-time and fill in the movement details between the poses. (6) Alice provides a very easy-to-use event editor and automatic event handling. (7) Alice allows students to experience some basic concepts of parallel programming with its built-in DO TOGETHER instruction where blocks of code can be set to run simultaneously. (8) Alice s programs can be sped up and slowed down during runtime using a scroll bar. This feature was of great help for debugging and testing our students game projects. (9) Alice comes with a drag-and-drop code building feature. According to our students, this feature allowed them to pick up the language very quickly. 8.STUDENTS' PROBLEMS WITH ALICE Even though our students enjoyed working with Alice very much, they did experience several problems which we list below. (1) In Alice it is not possible to copy/paste code directly between Alice files. There is no merge feature between Alice files either. Our students had to rely only on the export/import of classes feature to share their work when they worked on a project as a team. (2) Alice does not allow the creation or destruction of objects during run-time. Therefore, our students had to improvise by placing all their objects at the beginning of their animation or game and making them visible or invisible during the game execution as it was needed. This required long load times and huge amounts of memory. Some animations or games required to be run on machines that had at least 4 GB of memory. (3) Alice does not have very good sound support features. It is difficult and not completely reliable to stop one audio file and start another one. Students had to use external audio editors like Audacity [3] to edit their sound files to adjust to the correct playing times. (4) In Alice it is not possible to enable or disable events during run-time. Our students had to devise clever ways to work around this problem. (5) Alice animations and games run differently on different hardware platforms and Alice versions. We had to perform the grading of projects in the machines the projects were developed on and in the version of Alice for which they were developed. Most students developed their games in Alice 2.0 rather than in Alice 2.2 since Alice 2.0 is more stable than Alice 2.2. For some animations, the font would also not match what was originally intended by the creator, which would leave some text looking glitched, or incomplete. (6) Alice does not come with automatic collision detection, it has to be programmed. 
Also, it is not possible in most cases to pro- 45

46 gram accurate collision detection since Alice only allows us to compute the distance between the centers of two objects. (7) It is not possible to create user-defined classes different from normal Alice classes. (8) It is not possible to create multidimensional arrays. (9) Alice as a program is beginning to look very aged, as the version we are using was written in 2005 the 3D models look outdated as well. This has a tendency to discourage students due to how their own projects look. 9.STUDENTS COURSE EVALUATION The evaluation information was taken from the 2009 semesters. The overall feedback was positive. Many students enjoyed the teaching style of the graduate student instructor, however most complaints were focused on the problems mentioned previously about Alice. Students gave office hour availability, organization of the course, enthusiasm and fair grading of the instructor, objectives of the course, and ability to convey the information of the course a 100% rating with A's and B's on the evaluations. Attitude of the instructor and syllabus usefulness was 93.75%, content of the course 87.5% and the text book's clarity 75%. 10.CLASS OBSERVATIONS From comparison of project quality of females to males, we found that females generally turned in projects of higher quality. This could be from the original design of Alice, which was to teach middle school girls programming, so the females may feel more connected to Alice than the males do. With the addition of homework, the overall quality of projects did go up, but we found we had to relax some of the requirements to ease the workload on the students. This could be that the majority of students were freshman and had not yet developed a more mature work ethic to handle the additional work. We also found that most of the students took the course because they thought it sounded interesting. Very few signed up because they saw an advertisement either posted or through . Since the class was advertised as a computer animation course, not many students were expecting a programming aspect. A few students took the course to increase their GPA. 11.PROJECT OBSERVATIONS Of the 70 final projects collected over the last 4 years, 18.57% of them were clicking based games, based on a wack-a-mole style game or similar % of the games were shooting or action-packed games; for example, the player would use a gun or their actions would cause explosions in the game.14.29% of the games were treasure hunting or object finding games % of the games were avoidance style games, where the player must not get hit by objects in the game % were sports games like soccer or football. 10% were card, music or quiz style games. 7.14% were adventure games and skill based games that were difficult to categorize, such as timing a click or key press. We believe that the clicking based games were the most popular, because it was based off an example given in the book, so students were likely sticking with what they felt comfortable. Next most popular were shooting and action games, which were mostly done by males in the course. Of the 70 games collected, only 28 were submitted by females. Which brings us to the treasure hunting games, which were most popular by females in the course. This is an interesting distinction between the genders, as males may have submitted more action oriented games, either because of a more aggressive nature, or video games they play, while females submitted games based on environments and searching. 
We believe that more time observing women in the course with appropriate surveys could tell us more about this. Avoidance games were likely popular because they were usually easier to program, requiring little in the way of events and functions for collision detection. With the other types of games, the results were quite mixed and likely based on the personal taste of the student. Quality of these projects was related greatly to the students attendance, with a majority of students turning in projects that met or exceeded requirements. There were only a few students, who while they had perfect attendance, did not turn in projects that met the requirements. Some exceptional projects, like Haunt, an adventure game with a few puzzles to solve and nice background music, or a working version of solitare can be found at 12.AFTERTHOUGHTS It is hard to say if the course was truly successful or not. It did not cause a surge of new students into the CS program, but the evaluations were overall very positive. If the class was heavily advertised, we may have had a larger number of students in the class, but that could have overwhelmed the instructors. If the class was specifically advertised as a CS recruitment class, perhaps some students would not have of enrolled. The instructors did, however, have an enjoyable time teaching the class, and believe that the experience from the class can be used to form a more successful CS0 program in the future, which is currently being examined. For this future CS0 program, Alice will not be used as we believe it is not as relevant and useful to today s students who are accustomed to beautiful and very fast 3D graphics for animation and games, which could potentially be affecting the CS1 enrollment after CS0. We are thinking of switching to a 2D game engine like Greenfoot [7] as it has very few bugs, is more robust, covers the entire Java language, and allows students to create a greater variety of animations and computer games. Another possibility is Windows Phone 7, which will be examined. 13.ACKNOWLEDGMENTS The authors would like to thank CAHSI, NSF, NMSU Computer Science Department and faculty, Carnegie Mellon for Alice and CAHSI grant NSF grant CNS REFERENCES [1] Alice. [2] Dann, W., Cooper, S., and Paush R Learning to Program with Alice. Second edition. Pearson Prentice Hall. [3] Audacity. [4] Perplex City [5] CAHSI [6] YWiC [7] Greenfoot 46

Genetic Algorithm with Permutation Encoding to Solve Cryptarithmetic Problems
Yesenia Díaz Millet
Polytechnic University of Puerto Rico, 377 Ponce de León Ave., Hato Rey, PR
ABSTRACT Computer programs and technology have been borrowing ideas from nature ever since they first surfaced. One such example is the Genetic Algorithm. Ever since their appearance, Genetic Algorithms have been used to solve many different kinds of problems requiring different types of analysis and encodings. The work presented in this paper is the result of research involving cryptography and Genetic Algorithms. While cryptarithmetic problems are a type of constraint satisfaction problem, they are considered a type of cryptography by some due to their origins and nature. Cryptarithmetic problems can be solved through a blind search, i.e., brute force, as well as by rule-based searching techniques. That is, they can be solved by implementing recursive backtracking techniques to verify all the possible solutions or by implementing a set of rules in order to reach the solution more quickly. For this research, Genetic Algorithms were used to solve cryptarithmetic problems for the purpose of using a search technique that is more efficient than a blind search and that could later be applied to other types of problems. KEYWORDS Cryptarithmetic, Genetic Algorithm, Constraint Satisfaction Problem (CSP).
1. INTRODUCTION Genetic Algorithms can be applied to any problem that involves searching through a large space of possible solutions. They can be more effective than a simple blind search depending on the type of problem and how they are implemented. Examples of such problems can be found in cryptanalysis and in constraint satisfaction problems such as cryptarithmetics. Cryptarithmetic problems involve the mapping of unique digits to individual letters in a word addition. This paper focuses on permutation encoding to solve cryptarithmetic problems. It shows the different implementations of crossover and mutation for this type of encoding. The development of this research provided a means to learn about Genetic Algorithms, how they work, and the different scenarios in which they can be implemented.
2. CRYPTARITHMETIC Cryptarithmetics are a type of constraint satisfaction problem which requires finding a unique digit assignment for each of the letters of a word addition so that the numbers represented by the words add up correctly [2]. Cryptarithmetics also have other constraints that must be satisfied in order to solve a cryptarithmetic puzzle: 1. The leftmost letter of a word cannot be zero. 2. There must be a one-to-one mapping between letters, or symbols, and digits. 3. When letters are replaced by their digits, the resulting arithmetical operation must be correct. 4. The number of letters contained in a puzzle cannot be greater than the number base. Typically, cryptarithmetic operations are in decimal, which means a digit from 0 to 9 will be assigned to each letter; therefore each puzzle can only contain ten unique letters. Cryptarithmetic problems could also be solved in different bases like base 8 or base 16, but this is not the norm, and thus this paper concentrates on base 10. This means that, when performing a blind search, the number of possible assignments for a base-10 problem is the permutation nPr = 10Pr = 10!/(10 - r)!,
where n is the number base and r is the number of letters contained in a particular puzzle [2,4]. In the case of the most common cryptarithmetic puzzle, SEND + MORE = MONEY, the number of possible assignments is 10P8 = 1,814,400. This makes it a good candidate for implementing a Genetic Algorithm.
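For concreteness, the count quoted above is simply the permutation formula evaluated for the eight distinct letters of SEND + MORE = MONEY; this restates the arithmetic already implied by the text and introduces no new assumptions:

\[
{}_{n}P_{r} = \frac{n!}{(n-r)!}, \qquad
{}_{10}P_{8} = \frac{10!}{(10-8)!} = \frac{3\,628\,800}{2} = 1\,814\,400 .
\]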

Figure 1. Classical cryptarithmetic puzzle [8].
3. GENETIC ALGORITHM Genetic Algorithms (GAs) are search algorithms that are inspired by Darwin's theory of evolution [6]. They start with a set of possible solutions, which is the population. Each possible solution is known as an individual or chromosome and will undergo different methods of evolution throughout many generations of populations until the best solution is found. In order to find the final solution to the given problem, each individual must be assessed for its fitness. The fitness is a value that varies depending on the problem to be solved. It determines how close an individual is to being the correct or final solution to the problem.
3.1 Individual Encoding There are various types of encoding. These include binary encoding, where the information within the individuals are strings of bits; permutation encoding, where the information of each individual is a string of numbers which do not repeat; value encoding, where the information is a string of problem-specific values; and tree encoding, which is regularly used for genetic programming [6]. This GA for solving cryptarithmetic problems will be implemented utilizing permutation encoding. A string will be created containing each individual letter within the complete problem. Each letter will be assigned a random unique digit in order to initialize an individual.
Figure 2. An individual based on the classical cryptarithmetic problem.
3.2 Fitness Calculation The fitness of an individual determines how close it is to the true solution. In order to assess the fitness of a cryptarithmetic problem, it is necessary to focus on the constraint that states that when the letters of a problem are assigned their digits, the resulting arithmetic operation must be correct. Each word added in the puzzle is called an operand and the word that they add to is the result. For this reason, the fitness can be calculated as shown in Figure 3.
Figure 3. Fitness calculation, where FOP is the first operand, SOP the second operand and RES is the result [1].
With this calculation, the lower the fitness value, the more fit the individual is. A perfect individual, that is, the correct solution, will have a fitness of 0, thus making the problem true.
3.3 Evolution As previously stated, each individual will undergo a different method of evolution in order to find the best solution. These evolutionary methods include mutation and crossover. The implementation of both mutation and crossover varies depending on the encoding used for each individual.
3.3.1 Crossover Performing a crossover seems simple enough. On a binary encoding, it is a matter of simply selecting one or two crossover points of two individuals and switching their contents in order to produce two children. This poses a problem when dealing with permutation encoding as it could result in duplicate values, which violates the constraints placed on the cryptarithmetic problems. In order to implement the crossover with permutation encoding, two parent individuals are selected from the population. One or two crossover points are chosen. In the case of single point crossover, the values between the crossover point and the final value in each individual are chosen. For a two point crossover, the values between the two crossover points are selected from each individual. The values from the first parent are appended to the second parent and all the duplicates from the second parent are removed, resulting in the first child, and vice versa [3].
Figure 4. Two point crossover between two individuals.
Figure 4 shows a two point permutation crossover. The bars on the parents represent the crossover points while the underlined numbers represent the values appended for each child. It is important to note that not all values from one parent will be appended to the other parent in order to produce a child. This is because it is necessary to keep the amount of numbers in the individual equal to the amount of letters in a problem.
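The exact expression shown in Figure 3 is not reproduced in this transcription, so the sketch below uses the formulation that matches the description and [1]: the fitness is the absolute difference |FOP + SOP - RES|, which is 0 for a correct solution. It also sketches the two-point permutation crossover just described. The names, types, and vector-of-digits representation are illustrative assumptions, not the author's actual C++ code.

#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <string>
#include <vector>

// Hypothetical letter-to-digit assignment for one individual.
using Individual = std::map<char, int>;

// Numeric value of a word under a given assignment.
long wordValue(const std::string& word, const Individual& ind) {
    long value = 0;
    for (char c : word) value = value * 10 + ind.at(c);
    return value;
}

// Fitness = |FOP + SOP - RES|: 0 means the puzzle is solved, lower is fitter.
long fitness(const std::string& fop, const std::string& sop,
             const std::string& res, const Individual& ind) {
    return std::labs(wordValue(fop, ind) + wordValue(sop, ind) - wordValue(res, ind));
}

// Two-point permutation crossover on the digit strings: values cut from parent 1
// are appended to parent 2, and their duplicates in the parent-2 part are removed,
// so the child keeps the same length and stays duplicate-free.  Values that
// parent 2 does not contain are simply skipped, matching the note above that not
// every value from one parent is appended to the other.
std::vector<int> crossover(const std::vector<int>& p1, const std::vector<int>& p2,
                           std::size_t cut1, std::size_t cut2) {
    std::vector<int> child(p2);
    for (std::size_t i = cut1; i < cut2 && i < p1.size(); ++i) {
        auto it = std::find(child.begin(), child.end(), p1[i]);
        if (it != child.end()) {
            child.erase(it);        // remove the duplicate inherited from parent 2
            child.push_back(p1[i]); // append the value taken from parent 1
        }
    }
    return child;
}

For SEND + MORE = MONEY, fitness("SEND", "MORE", "MONEY", ind) evaluates to 0 exactly when the assignment solves the puzzle.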
Crossovers are a good way to create new individuals; however, they are not very beneficial for cryptarithmetic problems. If the best individuals are chosen and crossed over, they might result in less fit individuals. Another problem that arises from crossovers is that sometimes the values needed to get closer to a solution are not contained within either parent, no matter how fit the parents are. For this reason, mutation was found to be better, as it allows reaching a solution much faster than with crossovers.
3.3.2 Mutation As with crossover, mutation varies depending on the encoding. To perform mutation on a binary encoding, it is a matter of simply selecting one or more random points in the individual and changing their values. This method will easily work with a cryptarithmetic problem that has few letters, since it is just a matter of choosing a random number that has not already been assigned and replacing a number already in the individual with it; however, it will not work in all cases.
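As the discussion that follows explains, the mutation actually used combines two moves: if a randomly drawn digit is not yet used by the individual it overwrites a random position, and otherwise two positions are swapped. A minimal sketch of that combined operator, under the same hypothetical vector-of-digits representation as the crossover sketch above (not the paper's code):

#include <algorithm>
#include <random>
#include <utility>
#include <vector>

// Combined mutation for a permutation-encoded individual (a vector of digits):
// draw a digit and a position at random; if the digit is not used anywhere in
// the individual, write it at that position, otherwise swap two positions.
// Either branch keeps the letter-to-digit mapping one-to-one.
void mutate(std::vector<int>& ind, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pos(0, ind.size() - 1);
    std::uniform_int_distribution<int> digit(0, 9);

    int d = digit(rng);
    std::size_t p = pos(rng);
    if (std::find(ind.begin(), ind.end(), d) == ind.end())
        ind[p] = d;                        // introduce a digit the individual lacks
    else
        std::swap(ind[p], ind[pos(rng)]);  // otherwise exchange two positions
}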

What would happen if a cryptarithmetic problem contains exactly ten letters? Each letter must be assigned exactly one number from 0 to 9; it would therefore be impossible to replace one digit with another digit, because this would result in a repetition. The best way to resolve this problem is to select two points at random and exchange their values. However, this raises another problem. What would happen if the solution to a problem contains a number that is not found in the individual? This would make it impossible to reach the correct solution in the event that the fittest individual is off by a number. In order to solve this problem, the mutation is implemented both ways. A random point and number are selected and, if the number is not contained in the individual, the value contained at the chosen point is replaced with the number. If the number is already contained in the individual, then the mutation is performed by selecting two points and switching their contents.
Figure 5. (a) Mutated individual from Figure 2, created by replacing the value of S with a random number. (b) Mutated individual from Figure 2, created by switching the values of D and O.
3.4 Implementation of the GA The GA for solving cryptarithmetic problems was implemented using the C++ language, utilizing singly linked lists to store the individuals' information as well as the populations. In the initial generation, a population of size popsize was generated randomly based on the problem. For each generation, the population's fitness was calculated and the fittest individual was selected. This individual was mutated and then a comparison between the original individual and the mutated individual was made. The fitter of the two would be added to the new population. The rest of the population would be generated by selecting random individuals from the current generation and mutating them. The mutated individuals are added to the new population and the process would begin again. The algorithm would stop when the solution was found or the population fitness converged, in which case the best possible answer would be returned.
popsize <- desired population size
stopped <- false
Initialize pop of size popsize
while (!stopped) do
    indv <- Select Best individual from pop
    indv2 <- Mutate Best individual from pop
    AssessFitness(indv)
    AssessFitness(indv2)
    if indv < indv2 do
        Add indv to newpop
    else do
        Add indv2 to newpop
    for newpopsize < popsize do
        indv <- Select Random individual from pop
        indv2 <- Mutate Best individual from pop
        Add indv2 to newpop
    pop <- newpop
    if solution found or convergence do
        stopped <- true
Best <- Select Best individual from pop
return Best
Figure 6. Pseudocode of the Genetic Algorithm.
The algorithm was successfully implemented for a few problems, but more consideration was given to the classical SEND + MORE = MONEY problem. This problem was implemented with different population sizes, which resulted in the algorithm reaching the correct solution or the best possible solution in different amounts of time. The algorithm was also initially implemented with different types of searches, and different modifications were performed, resulting in the pseudocode in Figure 6. This version has successfully found the solution, whereas earlier versions would alternate between convergence and the correct solution.
Figure 7. Run of the algorithm with the problem SEND+MORE=MONEY.
4. CONCLUSION The focus of this paper was the design of a Genetic Algorithm to solve cryptarithmetic problems. The way in which the individuals were encoded provided a means to perform both crossover and mutation; however, it was found to be more effective to perform mutations rather than crossovers.
It was also observed that the 49

50 population size played a role in how fast the algorithm could find a solution depending on the problem. Problems with more letters required larger populations of at least the same size as the amount of letters in order to be able to search through larger spaces at a shorter amount of time. It is worth noticing that while the algorithm might search through fewer generations with a larger population size; this does not imply that the algorithm will always find the solution in a shorter time than with a smaller population size. 5. FUTURE WORKS This paper served as an introductory work on Genetic Algorithms. Future research will concentrate on cryptanalysis of different cryptographic algorithms. The idea is to test whether genetic algorithms can be effective if used as a means to decode messages encrypted with different encryption algorithms. 6. ACKNOWLEDGEMENTS This research would not have been possible without the help of several important people who have encouraged me to continue the work and allowed me the opportunity to perform it. I would like to acknowledge Dr. Alfredo Cruz for giving me the opportunity to work on this research as well as for providing the opportunity to obtain the Nuclear Regulatory Commission (NRC) Grant Fellowship Award NRC I would also like to acknowledge Prof. Luis Ortiz who was one of my mentors during my Bachelor s degree as well as all the other professors who gave me their opinions about my work. Also, for encouraging me I would like to acknowledge the other students who have received the fellowship and are performing their own research at the university. These are Oscar Pérez, Patricia Becerra and Edgar Lorenzo who will be submitting their own work to CAHSI. Lastly, I would like to acknowledge the National Science Foundation (grant no. CNS ) for its support, as well as CAHSI for providing the opportunity to present this research. 7. REFERENCES [1] Abbasian, R., & Mazloom, M. (2009). Solving Cryptarithmetic Problems Using Parallel Genetic Algorithm. Computer and Electrical Engineering, ICCEE '09. Second International Conference on, 1(28-30 Dec ). Retrieved November 15, 2011, from er= &isnumber= [2] Clearwater, S. H., Hogg, T., & Huberman, B. A. (1992). Cooperative Problem Solving. Computation: The Micro and the Macro View, World Scientific,pp Retrieved December 18, 2010, from n/cryptarithmetic.pdf [3] Falkenauer, E. (1999). The Worth of the Uniform. Proceedings of the 1999 Congress on Evolutionary Compution, 1(July 6-9), [4] Md., A. S., Haider, M. B., Mahmud, M. A., Alaul, S. M., Hassan, M. K., Ahsan, T., et al. (2004). An Evolutionary Algorithm to Solve Cryptarithmetic Problem. In A. Okatan (Ed.). (pp ). International Computational Intelligence Society. [5] Menon, H. (n.d.). Cryptarithmetic. Scribd. Retrieved February 6, 2011, from [6] Obitko, M. M. (n.d.). Main page - Introduction to Genetic Algorithms - Tutorial with Interactive Java Applets Retrieved February 6, 2011, from [7] Sean Luke, 2009, Essentials of Metaheuristics, available at [8] Soares, J. A. (n.d.). A Primer on Cryptarithmetic. Cryptarithms Online. Retrieved February 6, 2011, from 50

Session Hijacking in Unsecured Wi-Fi Networks
Edgar Lorenzo Valentin
Polytechnic University of Puerto Rico, PO Box 2050, Aguada, PR
ABSTRACT The intention of this research is to create awareness of the threats of using unsecured networks. Unsecured networks can be found everywhere: at airports, coffee shops, libraries, universities, and small businesses. The problem is that most people are unaware of the risks of using unsecured networks, such as open Wi-Fi or networks protected by the deprecated Wired Equivalent Privacy (WEP) algorithm. They do not realize that anyone could be monitoring the network using a packet sniffer program like Wireshark, or the recently published Firesheep extension, to capture unencrypted session cookies. The researcher will first demonstrate how an attacker could connect to a Wi-Fi network protected by WEP. Once connected to the network, the researcher will demonstrate the threat of a session hijack by collecting the cookies that allow an attacker to clone the victim's session ID, allowing the attacker to impersonate the victim on a web site. This paper also provides some advice to minimize the user's exposure to this kind of attack and to privacy exposure. KEYWORDS Network Security, Wi-Fi, HTTPS, SSL, WEP, Protection, Cyber Crimes, Session Hijacking.
1. INTRODUCTION Most people just want to be able to connect to the Internet, without paying attention to the security the wireless LAN is using. People connect to open Wi-Fi at public places, or at cyber cafes protected by WEP. Unfortunately, the WEP keys can be easily obtained, allowing an attacker to connect to the wireless LAN. Once an attacker is connected to the network, he or she can start capturing the session cookies. This is possible because most web sites only encrypt the login process, but not the generated session cookie. This creates a false sense of security in the vast majority of people, because they think that if a site uses HTTPS during the login process then they are safe on that web site, but they do not know that an unencrypted session cookie will be generated and could be intercepted by an attacker sniffing the network. Once an attacker captures the session cookie, he or she could use the victim's account to impersonate him or her.
2. UNSECURED WI-FI NETWORKS For this paper, when we refer to unsecured Wi-Fi networks we refer to open and WEP-protected Wi-Fi networks. By default, an open Wi-Fi network does not provide any kind of encryption protection to the users connected to the network. Networks protected by the deprecated WEP algorithm use a layer-2 cipher based on the stream cipher RC4, also called ARC4, with 64- or 128-bit keys to protect the network. The problem with the WEP algorithm is that it uses predictable initialization vectors (IVs), so it is relatively easy to obtain the WEP key regardless of whether a 64- or 128-bit key is used.
2.1 Obtaining the key To obtain the WEP encryption key we will use the AirPcap Nx adapter [see Figure 1] to monitor the available wireless connections and capture the packets.
Figure 1. AirPcap Nx adapter from Riverbed Technologies. [1]
To manage and analyze the captured packets we use a tool called Cain & Abel [see Figure 2] [2]. This tool can capture the packets on a specific channel and provides cryptanalysis capabilities to crack the WEP key.
Figure 2. Cain & Abel user's interface. [2]
We start by scanning and monitoring the available wireless networks to detect the channels and encryption used by the various wireless networks in our range.
(For legal and security reasons we performed the following examples in our own WLAN, using a Windows 7 virtual machine and a Mac OS X machine as the host computer.) For this example we are interested in capturing the packets from the access point with Basic Service Set Identifier (BSSID) 0026F2C46E1E and Service Set Identifier (SSID) Private [see Figure 3].
Figure 3. Access points detected in range. [2]
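WEP's weakness comes from the way the 24-bit per-packet IV is prepended to the shared secret and fed into RC4; the Korek and PTW attacks referenced next exploit statistical biases in that construction. For reference, a minimal sketch of textbook RC4 key scheduling and keystream generation is shown below; it is standard RC4, not code from the paper or from Cain & Abel, and the example key bytes are illustrative only.

#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

// Textbook RC4 (ARC4): key-scheduling algorithm (KSA) followed by the
// pseudo-random generation algorithm (PRGA).  In WEP, the "key" handed to RC4
// is the per-packet IV concatenated with the shared secret, which is what the
// statistical WEP attacks exploit.
std::vector<std::uint8_t> rc4Keystream(const std::vector<std::uint8_t>& key, std::size_t n) {
    std::uint8_t S[256];
    for (int i = 0; i < 256; ++i) S[i] = static_cast<std::uint8_t>(i);
    for (int i = 0, j = 0; i < 256; ++i) {                 // KSA
        j = (j + S[i] + key[i % key.size()]) & 0xFF;
        std::swap(S[i], S[j]);
    }
    std::vector<std::uint8_t> out(n);
    for (std::size_t k = 0, i = 0, j = 0; k < n; ++k) {    // PRGA
        i = (i + 1) & 0xFF;
        j = (j + S[i]) & 0xFF;
        std::swap(S[i], S[j]);
        out[k] = S[(S[i] + S[j]) & 0xFF];
    }
    return out;
}

int main() {
    // Example: a 3-byte IV followed by a 5-byte (40-bit) WEP secret.
    std::vector<std::uint8_t> ivPlusKey = {0x01, 0x02, 0x03, 'S', 'E', 'C', 'R', 'T'};
    for (std::uint8_t b : rc4Keystream(ivPlusKey, 8)) std::printf("%02x ", b);
    std::printf("\n");
    return 0;
}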

52 The AirPcap Nx adapter allows packet injection to speed-up the packet capture process. [see Figure 4]. 3. HIJACKING THE SESSION Programs like Wireshark and the Firesheep extension use the pcap library to be able to sniff network traffic. Normally, interfaces ignores packets that are not intended to them, but if put into promiscuous mode they can accept any incoming network data. 3.1 With the Firesheep Extension In November 2010 the developer Eric Butler released a Firefox extension called Firesheep. The extension is a packet sniffer with some handlers (instructions) that allows to capture a victim s session cookie. Figure 4. Configuration settings. [2] Once we select what channel we want to monitor, then we start an active scan to capture the packets. A successful WEP key could be obtained using the decryption method called Korek s Attack if [2]: Capture a minimum of 250,000 WEP IVs for a 64-bit key. Capture a minimum of 1,000,000 WEP IVs for a 128-bit key. Figure 5. Korek s attack. If using the decryption method called PTW attack, then the WEP key could be obtained if ([2]: Capture a minimum of 70,000 WEP IVs for 64 or 128 bit keys. Figure 7. Firesheep extension. [7] If the attacker is connected in the same (unsecured) network as the victim, then he/she can capture the session cookies from some specificc web sites. In the example above we captured the session cookies from Bit.Ly, Amazon, Wordpress, MSN Live, and Twitter [see Figure 7]. Each time a user login or is already logged in a web site, the Firesheep extension is able to detect the cookie that contain the session and hijack it. In our own private network we were able to surf in Amazon.com site using another user s account [see Figure 8]. We were able to view the victim s recently viewed items, add items to the wish lists and see them, and even view items that he/shee had in the shopping cart and add some other items to it. Figure 6. PTW attack. [2] Once the attacker has the WEP key then he could connect in the network and start capturing the packets that contain the unencrypted session cookies, but by doing it he/she will be breaking the law. Figure 8. Hijacked session from Amazon.com. [7] In this case Amazon have several protections that won t allow an attacker to make purchase, view or change the victim s address, or view the order history. This is becausee Amazon asks for the user credentials before accessing this areas, so the damage that an attacker could do is limited to just view the items we were watching or our wish lists. In contrast, when we hijacked Twitter session we were able to post messages and make changes to the profile. 52

53 Figure 9. Hijacked session from Twitter.com. [7] We were able to hijack a session from MSN Live Hotmail, allowing us full control of the victim s account. This could be used to collect addresses from colleagues and friends and then be used for SPAM, read information that could be used for blackmailing, or even send an that could damage the victim s reputation. Figure 10. Hijacked session from Live..com. [7] Another site we tested and successfully hijacked the user s session was a blog based in the Wordpress platform. We got full administrator access to the Wordpress Dashboard, because the hijacked session was from someone logged in with administrator privileges. domains: [ 'elnuevodia.com' ], sessioncookienames: [ 'ASP.NET_SessionId', 'END:comentarios' ],}); Handler 1. El Nuevo Dia. // Author: Edgar Lorenzo register({ name: 'Gobierno-PR', url: ' domains: [ 'serviciosenlinea.gobierno.pr' ], sessioncookienames: [ 'SALUD' ], }); Handler 2. Government of Puerto Rico. // Author: Edgar Lorenzo register({ name: 'MyPoly', url: ' icon: ' domains: [ 'mypoly.pupr.edu' ], sessioncookienames: [ 'ASP.NET_SessionId' ], }); Handler 3. Polytechnic University student s records. 3.2 With Wireshark Wireshark is a network protocol analyzer, that lets us capture and interactively browse the traffic running on a computer network [see Figure 12]. Figure 11. Hijacked session from a blog based in Wordpress. [7] This is another example of how vulnerable we are when using unsecured networks Firesheep Handlers Here we show some custom Handlers (instructions) we created for Firesheep. The custom handlers are for El Nuevo Dia at elnuevodia.com, the Polytechnic University student s records at Mypoly.pupr.edu, and Government of Puerto Rico at serviciosenlinea.gobierno.pr. [ see Handler 1, 2, and 3]. // Author: Edgar Lorenzo register({ name: 'El Nuevo Dia', url: ' icon: ' Figure 12. Wireshark user s interface. [11] We start capturing packets in the network, and if necessary, we could proceed to filter the information by IP address and protocol. For this example our interest is in the computer with IP address , and the Hypertext Transfer Protocol (HTTP). In this case we applied the filter ip.addr== = and http [see Figure 13]. Figure 13. Filtering results in Wireshark. [11] 53
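Section 3 notes that both Wireshark and Firesheep rely on the pcap library and promiscuous-mode capture. The sketch below illustrates that mechanism with libpcap: it opens an interface in promiscuous mode, filters TCP port 80 traffic, and scans captured payloads for a Cookie: header. The interface name and filter string are assumptions for the example; this is neither Wireshark's nor Firesheep's code, it skips proper link/IP/TCP header parsing, and, as the paper stresses, it must only be run on networks you are authorized to monitor.

#include <pcap/pcap.h>
#include <algorithm>
#include <cstdio>

// Print a line whenever a captured HTTP packet contains a Cookie: header,
// which is enough to show why unencrypted session cookies are exposed.
static void onPacket(u_char*, const struct pcap_pkthdr* hdr, const u_char* bytes) {
    static const char needle[] = "Cookie:";
    const u_char* end = bytes + hdr->caplen;
    const u_char* hit = std::search(bytes, end,
                                    reinterpret_cast<const u_char*>(needle),
                                    reinterpret_cast<const u_char*>(needle) + 7);
    if (hit != end)
        std::printf("packet of %u bytes carries a Cookie header\n", hdr->caplen);
}

int main() {
    char errbuf[PCAP_ERRBUF_SIZE];
    // "en0" is an assumed interface name; pcap_findalldevs() lists the real ones.
    pcap_t* handle = pcap_open_live("en0", 65535, 1 /*promiscuous*/, 1000, errbuf);
    if (!handle) { std::fprintf(stderr, "pcap_open_live: %s\n", errbuf); return 1; }

    struct bpf_program prog;
    if (pcap_compile(handle, &prog, "tcp port 80", 1, PCAP_NETMASK_UNKNOWN) == 0)
        pcap_setfilter(handle, &prog);

    pcap_loop(handle, -1, onPacket, nullptr);   // capture until interrupted
    pcap_close(handle);
    return 0;
}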

Then we proceeded to identify the session cookie and copy its values into a cookie editor to duplicate it. As can be seen, the session ID inside the cookie is identified by the name ASP.NET_SessionId [see Figure 14 and Figure 15].

Figure 14. Cookie and session identification. [11]

Figure 15. Cookie Editor user interface. [12]

This allows our browser to use the duplicated cookie and navigate the web site with the victim's session.

4. RESULTS
As a result of the availability of tools like the ones presented in this document, some high-profile companies, such as Facebook, Gmail, MSN Live, and Dropbox, have implemented additional security on their web sites. They now provide HTTPS/SSL to protect their members' information and session cookies. The downside is that some sites, like Facebook and MSN Live, offer this additional security only as an option: users must manually activate it in their account settings, and most of them do not even know the feature is available. Additionally, there is increasing interest in teaching the web community about the risks of using unsecured networks, and this awareness is putting pressure on web sites to implement more security features.

New tools are also being developed to protect users from some session hijacking attacks. The Electronic Frontier Foundation developed HTTPS Everywhere [10], a Firefox extension that encrypts the communication by forcing a site to use HTTPS instead of HTTP, although it works only with specific web sites. Another useful tool is Blacksheep [9], which detects whether the Firesheep extension is being used on the network and alerts the user. Another way to be protected from session hijacking is to connect to Wi-Fi networks protected by Wi-Fi Protected Access (WPA/WPA2), because it encrypts the stream for each user connected to the network. The best protection against these attacks is to connect to the Internet through a Virtual Private Network (VPN), which encrypts all data transmission.

5. CONCLUSION
The purpose of this document was to create security awareness among people who use the Internet without knowing how exposed they are each time they connect to unsecured Wi-Fi networks, and to demonstrate how easily this kind of attack can be performed. Each year the threats increase and attackers become more sophisticated in their methods. The best recommendation for protecting our information and minimizing exposure is to use a VPN service and to apply common sense when surfing the Internet, such as always logging out of each web site as soon as we finish using it; that way we close the session and an attacker will be unable to use a hijacked session cookie.

6. ACKNOWLEDGMENTS
I would like to thank Riverbed Technologies for sponsoring this research by providing an AirPcap Nx adapter, which allowed me to wirelessly capture and analyze the packets used in this work. I would also like to thank Dr. Alfredo Cruz, who encouraged me to review Wireshark as a network security tool and is the reason for this work. I also thank NSF Grant CNS for sponsoring the presentation of this paper at the 2011 CAHSI Annual Meeting.

7. REFERENCES
[1] Metageek, LLC. AirPcap Nx. Retrieved January.
[2] Montoro, M. Cain & Abel. Retrieved December.
[3] Borisov, N., Goldberg, I., Wagner, D. (In)Security of the WEP Algorithm. Retrieved December.
[4] Chomsiri, T. Sniffing packets on LAN without ARP spoofing. In Third National Conference on Convergence and Hybrid Information Technology. DOI=/ICCIT.
[5] Chaabouni, R. Break WEP faster with statistical analysis. Retrieved December.
[6] Cache, J., Wright, J., Liu, V. Hacking Exposed: Wireless (2nd ed.).
[7] Github Inc. Firesheep. Retrieved December.
[8] Machlis, S. New protection against Firesheep attacks. Retrieved January. nst_firesheep_attacks
[9] Rapoza, J. Blacksheep sounds alarm against Firesheep. Retrieved February.
[10] Electronic Frontier Foundation. HTTPS Everywhere. Retrieved January.
[11] Wireshark Foundation. Wireshark. Retrieved February.
[12] Mozilla. Add N Edit Cookies. Retrieved February. US/firefox/addon/add-n-edit-cookies
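As a complement to the results above, the short Python sketch below (illustrative only, not taken from the paper) shows the cookie attributes a site can set so that a session cookie is harder to steal on an unsecured network: the Secure flag keeps it on HTTPS connections and HttpOnly hides it from page scripts. The cookie value is a placeholder.

from http.cookies import SimpleCookie

cookie = SimpleCookie()
cookie["ASP.NET_SessionId"] = "example-session-value"   # placeholder value
cookie["ASP.NET_SessionId"]["secure"] = True            # only sent over HTTPS
cookie["ASP.NET_SessionId"]["httponly"] = True          # not readable by JavaScript
cookie["ASP.NET_SessionId"]["path"] = "/"

# Prints the Set-Cookie header a server would emit with these protections
print(cookie.output())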

Analysis of Different Robust Methods For Linear Array Beamforming

Angel G. Lebron Figueroa                    Luis M. Vicente
Polytechnic University of Puerto Rico       Polytechnic University of Puerto Rico
377 Ponce de Leon Ave.                      377 Ponce de Leon Ave.
Hato Rey, PR                                Hato Rey, PR
angel_g_lebron@hotmail.com                  lvicente@pupr.edu

ABSTRACT
Array processing is a field that places sensors in particular geometries for the detection and processing of signals. Its most important feature is the ability to perform spatial discrimination in addition to frequency filtering. This feature is known as beamforming: it enhances a desired signal arriving from a particular spatial direction while canceling signals coming from other directions and, at the same time, suppressing uncorrelated noise. The objective of this article is to present a statistical method to measure and compare the performance of different beamforming methods in a consistent manner and under the same environmental conditions. This is important to prevent outstanding results based on particular scenarios that favor the beamformer under test to the detriment of other methods, thus avoiding the manipulation of data to arbitrarily diminish the success of other beamformers. The method is based on Monte Carlo simulations over the variability of the directions of arrival, mismatch in the signal of interest, and/or the location of the array elements. The results are consistent with the classical beamformer findings and validate the improved performance of newly proposed methods. The presented method could be adopted as a consistent tool to compare beamforming techniques.

KEYWORDS
Array Signal Processing, Beamforming, MVDR, DOA, SOI.

1. INTRODUCTION
The authors' motivation for carrying out this investigative work in the field of statistical signal processing is that the subarea of array signal processing is very competitive, with many new beamforming methods proposed every year. However, most researchers who propose these new methods do not follow a consistent procedure to measure the performance of their beamformers, and on some occasions they do not compare their methods with previously proposed and fully studied beamformers that have been implemented and have performed well in real-life applications. With the advent of robust beamforming methods [1-4], the performance differences between these methods have been reduced, and therefore a consistent methodology must be implemented to compare them fairly, avoiding any kind of favoritism towards a particular beamforming method.

The objective of this work is to propose a statistical method that could be used by any researcher who wishes to measure the performance of a beamformer in an unbiased manner. The method is composed of three parts. In the first part, the beamformer performance is measured using a predetermined quantity of snapshots and a predetermined number of interference ensembles with random DOAs. The purpose is to analyze the beamformer performance over a range that spans the low-sample-size to the large-sample-size situations. The sample size is crucial in the estimation of the correlated signals that arrive at the array via the spatial correlation matrix; at the same time, to improve the speed of the process we need to minimize the sample size. In the second part, the beamformer performance is measured under mismatch errors of the signal of interest (SOI) direction of arrival (DOA). This is done by adding a predetermined SOI DOA mismatch in addition to the measurements specified in the first part. The purpose is to measure the beamformer robustness in the presence of SOI DOA uncertainties while the beamformer is subject to variations both in the sample size and in the arrival of interferences. The uncertainty in the SOI DOA is a real problem that arises from the imperfect estimation of the arriving signals that must be done before the beamforming process. In the third and final part, the beamformer performance is measured against uncertainties in the array element positions. This problem is a consequence of manufacturing errors; in acoustic submarine arrays the element position errors are caused by the movement of the towing ship combined with the water currents.

Three performance parameters are included in this work. The first is the array gain (A_g), which measures the reduction of the interferences present in the field. The second is the array gain against white noise (A_w), which measures the minimization of the uncorrelated or thermal noise present both in the field and in the array sensors. The third is the directivity index (DI), which quantifies the selectivity or sharpness of the beampattern mainlobe in addition to the attenuation of the sidelobes. The DI is an indicator of the beamformer's ability to maintain unity gain at the SOI DOA while attenuating any new potential interference arriving from any other spatial location.

Figure 1. The Uniform Linear Array.

2. BACKGROUND

2.1 The Standard Linear Array
The array considered in this work is the linear array, which arranges K elements along the y axis as shown in Fig. 1. A linear array that maintains a constant distance between elements is called a uniform linear array (ULA). The standard linear array (SLA) particularizes the inter-element distance to half of the smallest wavelength, corresponding to the highest frequency signal component. This condition satisfies the Nyquist criterion that avoids spatial aliasing [5], i.e., multiple mainlobes at different spatial locations.

2.2 Signal Model
The acoustic field measured at a point p_k as a result of a narrowband plane wave is

    u(p_k, a, ω, t) = A(t) e^{jωt} e^{jβ a^T p_k},                              (1)

where A(t) is a slowly varying complex amplitude with respect to time, a is the vector of direction cosines representing the DOA of the arriving wave, ω is the wave angular frequency, and β = ω/c is the wavenumber, the ratio of ω to the speed c of the acoustic field propagating in a homogeneous space. The field measured at all K array points can be arranged in vector form, since for a plane wave the direction cosines and A(t) are constant:

    u(a, ω, t) = A(t) e^{jωt} s.                                                (2)

In (2) we dropped the explicit dependence of u on p_k for readability. The vector s is referred to as the steering vector and encompasses the phase differences resulting from the different locations of the K sensors p_1 to p_K:

    s = [ e^{jβ a^T p_1}, e^{jβ a^T p_2}, ..., e^{jβ a^T p_K} ]^T.              (3)

The steering vector is used to discriminate the SOI from the interferences. The steering vector particularized to the SLA lying on the y axis and centered at the origin is

    s = [ e^{-j((K-1)/2) π sinθ sinφ}, e^{-j((K-1)/2 - 1) π sinθ sinφ}, ..., e^{j((K-1)/2) π sinθ sinφ} ]^T,   (4)

where θ and φ are the polar and azimuth angles of the arriving signal. Because of the cylindrical symmetry of the linear array, there is a cone of ambiguity, and any signals arriving with the same value of sinθ sinφ are treated as if they were the same signal. Therefore the ULA usage is limited to signals that arrive from a plane. In the particular case of the xy plane, the polar angle becomes θ = π/2.

The signal model is composed of the SOI, the interferences, and uncorrelated noise:

    u(t) = s(t) s_0 + Σ_{l=1}^{L} i_l(t) i_l + n(t).                            (5)

The SOI is represented by the term s(t) s_0, which includes the first two factors in (1) and the SOI steering vector as per (3). The L interferences are similarly represented by i_l(t) i_l, where the first factor is the equivalent of s(t) and i_l is the l-th interference steering vector. The last term n(t) represents the uncorrelated noise amplitudes at each sensor.

2.3 The Beamforming Process
Narrowband beamforming processes each sensor signal by multiplying it by a complex number; the results are added to obtain the beamformer output. The process takes advantage of the steering vector information to reduce the interferences and white noise while keeping the SOI unchanged. The beamforming process is summarized as

    z(t) = v^H u(t) = s(t) v^H s_0 + Σ_{l=1}^{L} i_l(t) v^H i_l + v^H n(t),      (6)

where v^H is the weight vector transposed and conjugated. The objective is to obtain z(t) = s(t) by intelligently keeping v^H s_0 = 1, making v^H i_l = 0, and minimizing v^H n(t).

There are several beamforming methods to obtain the weight vector. The first, proposed by Bartlett [6], is the delay-and-sum (DAS), which results from minimizing the white noise while keeping the SOI undistorted (the distortionless constraint v^H s_0 = 1). This beamformer does not null the interferences present in the field, so we expect lower performance when interferences are present.
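The quantities in (3)-(6) are straightforward to prototype. The NumPy sketch below is only an illustration under the SLA assumptions stated above (it is not the authors' code): it builds the centered SLA steering vector of (4) for signals in the xy plane, forms the DAS weight vector, and evaluates the beampattern B(φ) = |v^H s(φ)|². K = 8 matches the experiments of Section 4, and the SOI azimuth π/3 matches the steering used for Fig. 2.

import numpy as np

K = 8  # number of sensors

def sla_steering(phi, K=K):
    """Steering vector of a standard linear array (lambda/2 spacing) in the xy plane."""
    k = np.arange(K) - (K - 1) / 2.0          # centered element indices
    return np.exp(1j * np.pi * k * np.sin(phi))

def das_weights(phi0, K=K):
    """Delay-and-sum (Bartlett) weights: distortionless with minimum weight norm."""
    s0 = sla_steering(phi0, K)
    return s0 / K                              # v^H s0 = 1

phi0 = np.pi / 3                               # assumed SOI azimuth
v = das_weights(phi0)
phis = np.linspace(-np.pi / 2, np.pi / 2, 721)
B = np.abs(np.array([v.conj() @ sla_steering(p) for p in phis])) ** 2
print("gain at the SOI DOA:", np.abs(v.conj() @ sla_steering(phi0)) ** 2)  # -> 1.0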
Another beamformer, proposed by Capon [7], minimizes the interferences and the white noise subject to the distortionless constraint. The resulting beamformer is termed the minimum variance distortionless response (MVDR) beamformer and is optimum in the presence of interferences. However, the MVDR beamformer suffers a dramatic performance drop under uncertainties in the SOI DOA. Therefore, robust beamformers were developed to maintain the performance under uncertainties [1-4].

2.4 Performance Parameters
The performance parameters used in this research are the following. The array gain is defined as the signal-to-interference-plus-noise ratio (SINR) at the beamformer output divided by the SINR at the input:

    A_g = SINR_o / SINR_i = [ (v^H R_s v) / (v^H R_in v) ] / [ σ_s² / ( Σ_{l=1}^{L} σ_{i_l}² + σ_n² ) ].   (7)

In (7), R_s is the correlation matrix of the SOI and R_in is the correlation matrix of the interferences plus noise; σ_s², σ_{i_l}², and σ_n² are the variances of the SOI, the l-th interference, and the noise, respectively, at the array sensors. The array gain is an indicator of the beamformer's ability to reduce the interference and the white noise. The second performance parameter (not included in this article) is the array gain against uncorrelated noise,

    A_w = SNR_o / SNR_i = (v^H R_s v) / ( σ_s² v^H v ).                          (8)
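Continuing the sketch above, the MVDR weights and the array gain of (7) can be computed directly from the true correlation matrices. The interference DOAs below are illustrative choices (one of them is the 48° sidelobe interference mentioned later), the SOI DOA is again π/3, and the 30 dB INR follows Section 4; none of this reproduces the paper's exact ensembles.

import numpy as np

def sla_steering(phi, K=8):
    k = np.arange(K) - (K - 1) / 2.0
    return np.exp(1j * np.pi * k * np.sin(phi))

K = 8
phi0 = np.pi / 3                                   # SOI DOA
phi_int = [np.pi / 3.75, -np.pi / 6]               # interference DOAs (illustrative)
sigma_s2, sigma_i2, sigma_n2 = 1.0, 1000.0, 1.0    # INR = 30 dB

s0 = sla_steering(phi0, K)
R_in = sigma_n2 * np.eye(K, dtype=complex)
for p in phi_int:
    i_l = sla_steering(p, K)
    R_in += sigma_i2 * np.outer(i_l, i_l.conj())
R_s = sigma_s2 * np.outer(s0, s0.conj())

# MVDR (Capon) weights: v = R_in^{-1} s0 / (s0^H R_in^{-1} s0), distortionless by construction
Ri = np.linalg.inv(R_in)
v = Ri @ s0 / (s0.conj() @ Ri @ s0)

sinr_out = np.real(v.conj() @ R_s @ v) / np.real(v.conj() @ R_in @ v)
sinr_in = sigma_s2 / (len(phi_int) * sigma_i2 + sigma_n2)
print("array gain A_g [dB]:", 10 * np.log10(sinr_out / sinr_in))   # eq. (7)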

Figure 2. 3-D DAS beampattern of a SLA.

In (8), v^H v = ‖v‖² is the squared norm of the weight vector. For a distortionless beamformer the numerator of (8) becomes σ_s², and the array gain against white noise is the inverse of ‖v‖². Since this norm is minimum for the DAS weights [5], the A_w under the distortionless constraint is the maximum attainable and its value is K, meaning that the uncorrelated noise is reduced by increasing the number of elements.

The last performance parameter we studied (not included in this article) is the directivity index, defined as [3]

    DI = 10 log10 [ B(θ_0, φ_0) / ( (1/4π) ∫_0^{2π} ∫_0^{π} B(θ, φ) sinθ dθ dφ ) ].   (9)

The DI represents the beamformer selectivity as the ratio of the SOI output gain (unity for a distortionless beamformer) with respect to the sum of the beamformer gain towards all possible spatial locations. The expression in (9) involves the beampattern B(θ, φ) = |v^H s(θ, φ)|², which represents the output power of a beamformer with weight vector v for a unity-gain directional signal (represented by the generic steering vector s(θ, φ)) arriving from the arbitrary location (θ, φ).

Fig. 2 shows the 3-D beampattern of a SLA with the DAS weight vector steered to θ_0 = π/2 and φ_0 = π/3. The cone of ambiguity is clearly visible. Therefore, for linear arrays the 2-D beampattern is used instead, since we assume all signals arrive from the xy plane. Fig. 3 shows the 2-D beampattern of the same DAS beamformer over an azimuth range of 180°. The dashed trace shows the beampattern of an MVDR beamformer designed to cancel a single interference arriving at φ = π/3.75 (or 48°), which corresponds to a sidelobe of the DAS beamformer.

Figure 3. 2-D DAS beampattern of a ULA. The dashed line represents the MVDR solution.

3. PROPOSED METHODOLOGY

3.1 Sample Size Scenarios
The methodology proposed in this article involves the sample size, which is important for estimating the correlation matrices in equations (7)-(8). The signal arriving at the array in (5) is read by the K sensors at each sampling period T_s; this is called a data snapshot. The snapshots are stored in a data matrix composed of N snapshots,

    X_N = [ u(0), u(T_s), u(2T_s), ..., u((N-1) T_s) ].                          (10)

The MVDRt (true MVDR) uses the true correlation matrices derived from the steering vectors and powers [4]. The rest of the beamformers estimate the correlation matrices in (7) and (8) from the data in (10) as

    R̂ = X_N X_N^H / N,                                                           (11)

where for the SOI correlation matrix R_s the snapshots are composed only of the SOI, i.e., u(t) = s(t) s_0, and for the interference-plus-noise correlation matrix R_in the snapshots are composed only of interferences and noise, i.e., u(t) = Σ_l i_l(t) i_l + n(t). The sample sizes considered are shown in Table 1.

Table 1. Sample size scenarios.
    s1: 3K/2    s2: 2K    s3: 4K    s4: 6.25K    s5: 12.5K    s6: 62.5K    s7: 125K

The scenario selection starts from the minimum requirement of K snapshots needed to avoid a rank-deficient correlation matrix. An adequate estimator starts with 2K snapshots [5], which is referred to as the low-sample-size case. On the other hand, the last two scenarios represent the large-sample-size behavior, where the correlation matrices are very well estimated for stationary situations, but the time required to obtain that large number of samples is not reasonable in practice due to speed constraints.
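A minimal sketch of the estimator in (10)-(11), using synthetic interference-plus-noise snapshots; the DOAs and powers are illustrative assumptions rather than the paper's exact ensembles, and N = 2K corresponds to the low-sample-size scenario s2.

import numpy as np

rng = np.random.default_rng(0)
K, N = 8, 16                      # N = 2K: the "low-sample-size" scenario s2

def sla_steering(phi, K=K):
    k = np.arange(K) - (K - 1) / 2.0
    return np.exp(1j * np.pi * k * np.sin(phi))

def snapshots(dirs, powers, noise_var, N):
    """Interference-plus-noise snapshots u(nTs) = sum_l i_l(n) i_l + n(n), as in eq. (5)."""
    X = np.sqrt(noise_var / 2) * (rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N)))
    for phi, p in zip(dirs, powers):
        amp = np.sqrt(p / 2) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
        X += np.outer(sla_steering(phi), amp)
    return X

X = snapshots(dirs=[0.3, -0.8, 1.1], powers=[1e3] * 3, noise_var=1.0, N=N)
R_hat = X @ X.conj().T / N        # sample correlation matrix, eq. (11)
print("condition number of R_hat:", np.linalg.cond(R_hat))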
3.2 Interference Random DOAs
In the proposed work we selected the number of interferences to be less than K/2, as happens in practical designs. Depending on the interference DOAs, the beamformer performance can yield very different results according to the relative location of the beampattern sidelobes. If the interferences arrive at the natural nulls of the beampattern, the performance of the beamformer increases dramatically; however, if the interferences arrive at the sidelobe maxima, the performance is reduced accordingly. To compensate for this difference in performance and to include as many different scenarios as possible, the proposed method obtains the beamformer performance using ensembles, each of which distributes the interferences with uniform random DOAs over all possible directions outside the beampattern mainlobe. The results are averaged to obtain a better statistical measure over all possible situations.

3.3 Mismatch in the SOI DOA
In the presence of SOI DOA mismatch, the assumed SOI steering vector used to design the weight vector is no longer true, and we expect a drop in performance. The beamforming performance is

measured under a range of true SOI DOAs. The robust methods adjust the weight vectors, modifying the mainlobe to account for the uncertainties and maintain the performance. This also modifies the locations and gains of the beampattern sidelobes with respect to those of the non-robust methods. The sample size scenario changes the performance as well, and the interference DOAs also affect the beamformer performance. This work therefore proposes to evaluate the DOA mismatch on top of the methods in 3.1 and 3.2, so the beamformer is evaluated under all possible situations.

3.4 Location Errors
Finally, the beamformer can be affected by location uncertainties. The location errors affect not only the steering vector of the SOI but also those of the interferences. The errors are uncorrelated from sensor to sensor; therefore, the performance is reduced more dramatically than in the case of mismatch in the SOI DOA. The proposed work includes the randomness of the location errors.

4. EXPERIMENTAL RESULTS
To show the validity of the proposed method we use an SLA with 8 acoustic sensors and 3 interferences with an INR of 30 dB. Classical beamforming methods such as the DAS and MVDR, and the following robust beamformers, were put under test: the Sparse Constraint with norm 1 (SC1) [2] by Zhang et al., the SC2 by Lebron, the Weighted Sparse Constraint (WSC) [3] by Liu et al., and Vicente's robust method (vrobust) [4].

Fig. 4 shows the A_g results for both the low and the large-sample-size scenarios on ensembles with random interference DOAs. As expected, the MVDR methods (both theoretical and practical) have the highest performance, followed by the Vicente method. The other robust methods reduce their performance in the absence of mismatch. This is expected, since they are based on Diagonal Loading, which trades performance in the absence of mismatch for robustness.

Figure 4. Array gain comparison in absence of mismatch.

Figure 5. Array gain comparison in presence of SOI DOA mismatch.

Figure 6. Array gain comparison in presence of location errors.

Fig. 5 shows the results when there is mismatch in the SOI DOA. The MVDR fails catastrophically since it is not robust; the robust methods achieve higher A_g with different levels of success. Finally, Fig. 6 shows the performance under location errors. The MVDR performs well since the location errors compensate on average. However, both SC1 and WSC perform poorly since they are designed against SOI DOA mismatch only. The Lebron and Vicente methods achieve behavior consistent with the SOI DOA mismatch case.

5. CONCLUSION
We presented a statistical and consistent method to measure the performance of different beamformers under different sample size scenarios, interferences, mismatch in the SOI DOA, and location errors. The results obtained agree with those expected for the classical beamformers. The method was used to compare the performance of newly developed robust beamforming methods.

6. ACKNOWLEDGMENTS
The authors thank the Polytechnic University of Puerto Rico for the facilities and equipment made available for this research. We also want to thank Dr. A. Cruz and Dr. N.J. Rodriguez. The authors acknowledge the NSF (grant no. CNS) for its support. Any opinions, findings, and conclusions expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

7. REFERENCES
[1] Cox, H., Zeskind, R., Owen, M. "Robust Adaptive Beamforming," IEEE Trans. Acoust., Speech, Signal Processing, vol. 35, no. 10, Oct.
[2] Zhang, Y., Ng, B. P., and Wan, Q. "Sidelobe Suppression for Adaptive Beamforming with Sparse Constraint on Beampattern," Electronics Letters, vol. 44, no. 10, May.
[3] Liu, Y. P., Wan, Q., and Chu, X. L. "A Robust Beamformer Based on Weighted Sparse Constraint," Progress in Electromagnetics Research Letters, vol. 16, pp. 53-60.
[4] Vicente, L. M. 2009. Adaptive Array Signal Processing Using the Concentric Ring Array and the Spherical Ring Array. Columbia, MO.
[5] Van Trees, H. L. 2002. Optimum Array Processing: Part IV of Detection, Estimation and Modulation Theory. New York, NY: Wiley.
[6] Bartlett, M. S. "Smoothing Periodograms from Time Series with Continuous Spectra," Nature, 161.
[7] Capon, J. "High resolution frequency-wavenumber spectrum analysis," Proc. IEEE, vol. 57, no. 8, Aug.

An Improved 2D DFT Parallel Algorithm Using Kronecker Product

Karlos E. Collazo                              Luis M. Vicente
Polytechnic University of Puerto Rico          Polytechnic University of Puerto Rico
377 Ponce de Leon Ave.                         377 Ponce de Leon Ave.
(787)                                          (787) ext. 340

ABSTRACT
The two-dimensional (2D) DFT is often used in digital signal processing for many digital image processing operations in the frequency domain. In a common practical situation, an image needs to be processed because of illumination equalization issues, and the specialized signal processing hardware has a multi-core architecture. The previously proposed 2D DFT is not able to take advantage of this specialized multi-core DSP architecture to reduce the computational time, since it is not designed to be broken up and distributed in segments. In this paper we propose an improved two-dimensional (2D) DFT parallel algorithm to improve the computational time and the management of specialized hardware resources. The proposed algorithm allows the computation to be broken up and distributed into as many segments as needed, and it is shown to improve the computational time depending on the hardware architecture. Both methods can be parallelized at the code level, but here we first improve the parallelization at the algorithm level so that, when the code is produced, the parallelization can be optimized further (this can be seen as a double parallelization).

KEYWORDS
Two dimensional DFT, 2D DFT, HPC, Parallel Computing, Kronecker Product.

1. INTRODUCTION
Digital image processing is one of the most important subfields of digital signal processing because of the image improvement it provides for analysis, inspection, storage, and network transport using currently available technology. Digital image processing began in the newspaper industry, when pictures were sent as data over cable between countries for military intelligence in the early 1920s [1]; since then, a variety of techniques have evolved to reach medical imaging, space exploration, system inspection, weather forecasting, and many other applications. One of the most important conventional signal processing algorithms used in digital image processing is the two-dimensional (2D) DFT, because of what it offers for frequency-domain processing. The 2D DFT is very effective for image processing; it is an extension of the one-dimensional discrete Fourier transform [1] applied to the 2D data category to which images belong. Nowadays we face a crucial consideration in every field of digital signal processing: how do we go parallel? The original 2D DFT cannot address this consideration before being parallelized at the code level. A 2D DFT algorithm using the Kronecker product is proposed here to remove this limitation and provide a frequency-domain design that enables parallel processing. The Kronecker product makes it possible to parallelize the code by applying decimation algorithms to the input digital signal (image), so that the computation performed after the pre-processing stage can be distributed over a multi-core architecture, reducing the computational time in proportion to the number of available cores, increasing multi-core performance, and approaching real-time processing.

Several methods have been proposed to improve performance when computing the 2D DFT. We first present the original method, briefly reviewing the Fourier transform and pointing out the problems that appear at parallelization time. The second method is faster than the original; it proposes a column-wise FFT for faster computation and for time and resource optimization, and we briefly discuss the multidimensional FFT and resource efficiency. The third method proposes row-column decomposition and operation reduction, reducing the number of complex multiplications. The fourth and last method is our proposed method: it applies a Kronecker product between two DFT kernels matched to the size of the digital input signal, followed by decimation, in order to split the computation across as many cores as are available in the DSP hardware architecture. Section 4 contains the experimental results and analysis, and the conclusion is given in Section 5.

2. PREVIOUS 2D DFT ALGORITHMS

2.1 Original 2D DFT Algorithm
Consider a digital image processing system for frequency-domain filtering such as the one described in Figure 2.1. The frequency-domain filter consists of three steps. The first step, and the one we will be parallelizing, applies the Fourier transformation described in (2.1) to the input image; the second step creates a filter function which we will

not present in this chapter; and the third step is the inverse Fourier transformation, which will be parallelized as well. The image is received through an input device (sensor, lens, etc.) and is then pre-processed; we call pre-processing the creation of the DFT kernels, as described in equation (2.2), together with the formation of the product described in equation (2.3). We then proceed with the Fourier transformation as part of the frequency-domain filter; this is where the mathematical computation that we are improving takes place. The 2D DFT of an M x N image f(x, y) is

    F(u, v) = Σ_{x=0}^{M-1} Σ_{y=0}^{N-1} f(x, y) e^{-j2π(ux/M + vy/N)},          (2.1)

which can be written in terms of the DFT kernel (DFT matrix)

    W_N(n, k) = e^{-j2π nk / N},                                                  (2.2)

so that the transform becomes the matrix product

    F = W_M f W_N.                                                                (2.3)

As described by equation (2.3), we must perform a straight multiplication without being able to parallelize the equation at the algorithm level, to reduce the computational cost, or to improve time or resource management.

Figure 2.1. Frequency Domain Filtering

2.2 Column-Wise FFT Algorithm
The column-wise FFT algorithm proposes transposing the matrix to exchange rows and columns [2]: it first performs an FFT on each row and then swaps rows and columns to perform another FFT computation. The FFT (Fast Fourier Transform) is used to compute the DFT of an N x N point discrete signal; to be specific, since in this chapter we use the 2D DFT, we assume we are working on digital image signal processing. The column-wise FFT algorithm allows vectorization of the computational process, although depending on the data size we may face memory latency at the hardware implementation level. Figure 2.2 [2] shows the computational scheme used by this method: it takes the original matrix, transposes all its rows, and after that performs the FFT computation column by column. Once the FFT is being computed the process is fast, but transposing the columns causes a memory overhead that depends on the data size; this is known as the matrix transpose penalty [2].

Figure 2.2. Column Wise FFT

The column-wise method is faster than the original 2D DFT algorithm presented in the previous section, but its time-efficiency problem lies in the matrix transpose penalty described in [2], which the algorithm presented in Section 3 avoids. Depending on the selected hardware architecture this can turn into a memory cache issue or memory latency, which must be avoided at the implementation level to reach high efficiency and real-time performance. We want to keep our algorithm from spending long loops transposing a large input signal matrix before computing the desired operation. The experimental results section (Section 4) shows the time penalty when this operation is performed, as well as a comparative table showing the performance gain when this kind of computation is avoided with the improved proposed algorithm.

3. IMPROVED 2D DFT PARALLEL ALGORITHMS
Figure 3.1 shows the proposed design block diagram. We propose an improved, parallel-optimized 2D DFT algorithm that achieves low memory overhead, no matrix transpose penalty, and maximum use of the multi-core hardware architecture. When we talk about 2D digital signal processing we almost always refer to digital image processing. Consider the following application: you need to extract the illumination pattern from a digital image. To do that, you create a low-pass filter and apply it to the image to obtain an illumination gradient. In the second step, filter creation, we need to apply a 2D DFT to a padded matrix derived from the input image in order to build the low-pass filter, and this is where we parallelize.

In the following sections we explain, stage by stage, how to perform the improved algorithm. Our improved 2D DFT algorithm uses the Kronecker product at the pre-processing stage. Once we receive the input image, we analyze its size and calculate

Figure 3.1. Proposed Algorithm Data Flow

the DFT kernel, or DFT matrix, whose entries are

    W_N(n, k) = e^{-j2π nk / N}.                                  (3.1)

After obtaining the DFT kernel, we calculate the Kronecker product of the kernel with itself,

    K = W_N ⊗ W_N,                                                (3.2)

so that the DFT equation ends up in the vectorized form

    vec(F) = (W_N ⊗ W_N) vec(f).                                  (3.3)

As the reader may notice, we do not need a matrix transpose, thereby avoiding the matrix transpose penalty. The next step is decimation in time (DIT) by 2 (radix-2) using the Cooley-Tukey algorithm, which means we use a radix-2 matrix. In this way the image, or input digital signal, is prepared to be spread over the multi-core hardware architecture for processing.

Now that we have pre-processed our input digital image, it is time to process the signal on a multi-core hardware architecture. In this stage we spread the complex operations over the available cores. Each processor performs exactly the same operation, avoiding communication between cores or waiting for results from other cores. We have parallelized the pre-processing stage; to parallelize this stage, it is necessary to go down to the code level. When the processing stage finishes the complex operations, it passes all the results from the different cores to the next stage, the post-processing stage, which arranges the results back into their logical order; equation (3.4) describes the algorithm needed to fix the results from the cores. Permutation is one of the most common operations in digital signal processing at the post-processing stage to roll back the result and obtain the logical, real result. In our algorithm we used this method at this stage, thereby avoiding a transposing rearrangement and, consequently, the matrix transposition penalty. In summary, recalling that we use radix-2 decimation: we first permute the matrix values in order to enable the possibility of parallelization, then we perform the computation of the DFT on multiple cores (as needed and available), and then we return the values to their original, real positions.

At the code level we have memory and CPU limitations when dealing with huge matrices. In this case we stored all the matrix values in a vector instead of a two-dimensional array because of these constraints. This way, C++ allows us to work with bigger matrices while improving memory performance (we do more computation in all stages with less memory).

4. EXPERIMENTAL RESULTS
We tested different matrix sizes on different architectures with different numbers of cores to compare results. To be fair in the comparison, we used parallelization at the code level for both algorithms; this means that the original algorithm is parallelized only at the code level, while the proposed algorithm is parallelized at both levels (algorithm and code). As shown in Figure 4.1, we started the test with a single-core, 32-bit architecture (AMD Turion); the first simulation, with size 64, shows a time consumption of 32 milliseconds when using the original algorithm and 15 milliseconds when using the proposed algorithm, thus increasing performance by 53.57%. The payload on the single processor is visible as green spikes and the memory consumption as blue spikes.

In Figure 4.2 we used a dual-core, 32-bit architecture (Intel Dual Core); the effectiveness of the parallelization at the code level is visible in the processor payload. In comparison with the single core, the green spikes are reduced because the payload is shared between processors, and the time is reduced as well: when solving the matrix of size 64 the time went from 32 ms to 18 ms using the original method and from 15 ms to 9 ms using the proposed one (a 50% performance improvement). In our last experiment we tried the same test on a 16-core, 64-bit architecture, which allowed us to test a matrix of size 128; the processor payload was divided between 6 cores, and the times were 0.70 seconds using the original method and 0.54 seconds using the proposed algorithm. The results clearly demonstrate the improvement in time obtained with the proposed algorithm. Figure 4.5 shows a table with all time comparisons for easier viewing, and Figure 4.4 shows two graphs comparing the time improvements of both methods on the different architectures mentioned above; M1 stands for method #1 (the original method) and M2 stands for method #2 (the proposed method).
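The Kronecker-product identity behind (3.2)-(3.3) can be checked numerically. The NumPy sketch below is only an illustration of that identity (it is not the authors' C++ implementation): for a small square image, applying the Kronecker product of the DFT kernel with itself to the vectorized image gives the same result as the usual row-column 2D FFT.

import numpy as np

N = 8                                              # small size so kron(W, W) stays tiny
rng = np.random.default_rng(1)
f = rng.standard_normal((N, N))                    # test "image"

n = np.arange(N)
W = np.exp(-2j * np.pi * np.outer(n, n) / N)       # DFT kernel W_N(n, k), eq. (3.1)

# Row-column form of the 2D DFT: F = W f W (W is symmetric, so no transpose needed)
F_rowcol = W @ f @ W

# Kronecker form, eq. (3.3): vec(F) = (W ⊗ W) vec(f), with column-major vectorization
vec_f = f.flatten(order="F")
vec_F = np.kron(W, W) @ vec_f
F_kron = vec_F.reshape((N, N), order="F")

print(np.allclose(F_rowcol, np.fft.fft2(f)))       # True: matches the library 2D FFT
print(np.allclose(F_kron, F_rowcol))               # True: Kronecker form gives the same result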

Figure 4.1. Proposed vs. Original Algorithm in single core

Figure 4.2. Proposed vs. Original Algorithm in dual core

Figure 4.3. Proposed vs. Original Algorithm in 16 core

Figure 4.4. Time Performance Graph

Figure 4.5. Time Performance Table

5. CONCLUSION
The presented work introduces a double-level parallelization algorithm to perform the 2D DFT for image and video processing. The algorithm takes advantage of the highly parallelizable Kronecker product used in the proposed DFT formulation. The parallelization therefore starts at the algorithm level, which is a desirable approach in High Performance Computing and one of the paradigm shifts that all programmers will face in the coming years. The second level of parallelization is achieved during the implementation of the code. The simulation results show the performance improvement of the presented method, achieving speed improvements of around 50%, i.e., half of the time, while at the same time making the parallelization effective at the processor payload level.

6. ACKNOWLEDGMENTS
The authors thank the Polytechnic University of Puerto Rico for the facilities and equipment made available for this research. We also want to thank Dr. A. Cruz and Dr. N.J. Rodriguez. The authors acknowledge the NSF (grant no. CNS) for its support. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

7. REFERENCES
[1] Gonzalez, R. C., and Woods, R. E. Digital Image Processing, 2nd edition.
[2] Kim, D., Lee, V. W., and Chen, Y.-K. "Image Processing on Multicore x86 Architectures," IEEE Signal Processing Magazine, vol. 27, no. 2, March 2010.
[3] Honec, J., Honec, P., Petyovsky, P., Valach, S., and Brambor, J. "Parallel 2D-FFT Implementation with DSPs," 13th Int. Conference on Process Control, June 2001.

Versatile Service-Oriented Wireless Mesh Sensor Network for Bioacoustics Signal Analysis

Marinés Chaparro-Acevedo, Juan Valera, Abner Ayala-Acevedo, Kejie Lu, Domingo Rodriguez
{marines.chaparro, juan.valera, abner.ayala, kejie.lu,
Department of Electrical and Computer Engineering
University of Puerto Rico at Mayaguez

ABSTRACT
Even though network speeds have increased dramatically in recent years, the need to separate the computational processes from the information routing introduces delays in the data exchange between processing devices and routing devices. As the amount of transmitted data increases, the delays in the transmission also increase. Hence, this article proposes the creation of a network whose nodes have both processing and routing capabilities in a single device. This eliminates the switching times between the tasks while creating an extremely fast and versatile network for processing large volumes of data, in this case bioacoustics signals. The computing module, called NETSIG, integrates the processing of the bioacoustics signals with the wireless network routing in the context of a Versatile Service-Oriented Mesh Network (VESO-Mesh). The idea is that the bioacoustics signals are acquired, transmitted, and interpreted using a single device, with the tools installed in each of the intelligent nodes of the network.

KEYWORDS
Wireless mesh network, Remote sensing, Signal processing, Bioacoustics, NETSIG.

1. INTRODUCTION
The amount of information transmitted over networks has grown considerably over the years. Networks are used to transmit large collections of data such as images, videos, and sound signals. This has been possible because higher packet-switching rates and lower latency in frame delivery have been achieved. Nevertheless, there are cases in which the available technology is not enough to reach the desired level of speed and reliability, and this is where different ways of connecting devices in a network arise. A Wireless Mesh Network (WMN) is a technology that facilitates the connectivity and intercommunication of wireless clients through multi-hop wireless paths [1]. WMNs are cost effective and very reliable: if one of the nodes in the network can no longer operate or loses its signal, the other nodes can still communicate with each other and send the data through a reliable path [2]. Furthermore, integrating WMN capabilities with the Versatile Service-Oriented architecture can achieve effective data processing and access inside the network, a high-throughput WMN backbone that emphasizes the transmission of large volumes of data, efficient provision of services to users inside the network, and effective use of the network resources in terms of energy, processing, storage, and so forth. The following section explains how to apply and integrate this technology in a real problem: the surveillance of bioacoustics signals in Puerto Rico.

2. RESEARCH PROCEDURE

2.1 Objective and Data Acquisition using NETSIG
The objective of this research is to identify the population of certain species of animals in one area and compare it with the species in other areas. In this manner, the diversity of the species and the population of each one of them can be better understood. However, this task can be very difficult because the amount of data that needs to be analyzed and transmitted is huge; for example, if 5 seconds of audio are sampled at 1 MHz with 16 bits per sample, the generated signal is around 10 MB long. Consequently, if one device does the routing and another device does the processing, both the operational time and the risk of losing information increase. A device termed NETSIG addresses this problem by providing signal processing and routing capabilities in the same device. Figure 1 shows an initial prototype of the device. Figure 10 depicts a typical scenario where a NETSIG node could be put into practice. This scenario presents a sensor array processing (SAP) system consisting of a master sensor node (MSN), a basic interface module (BIM), a set of sensor signal processing unit (SPU) modules, and a set of sensors, each sensor attached to an SPU. An SPU has the capability of performing basic signal acquisition and preprocessing as well as routing operations. An SPU may be implemented using a single-board computer such as the ALIX boards described below. A NETSIG node removes the MSN from the SAP and incorporates network routing operations.
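The data-volume figure quoted above is easy to verify; the short Python check below is just that back-of-the-envelope calculation (5 s of single-channel audio at a 1 MHz sampling rate, 16 bits per sample).

# Back-of-the-envelope size of one 5-second capture, as quoted in the text
seconds = 5
sample_rate_hz = 1_000_000      # 1 MHz
bits_per_sample = 16

size_bytes = seconds * sample_rate_hz * bits_per_sample // 8
print(size_bytes / 1e6, "MB")   # -> 10.0 MB, matching the ~10 MB figure in the paper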

Figure 1. NetSig Device

NETSIG stands for NETwork and SIGnal processing; each device comes with a dual-core processor and high-speed wireless cards that support the mesh network functionality. The NETSIG nodes that form part of the network give us the processing power needed to process the signals inside the network; see Figure 2.

Figure 2. Project Structure

These devices do, however, require more space and are not as mobile as usual routers. Using an Intel Core Duo processor, the power of a computer is available inside the network. This allows computing power to be distributed across the network, enabling real-time routing and processing of the bioacoustics signals. A bioacoustics signal is a time-frequency signal, that is, a signal whose frequency content changes over time. This raw signal data from birds, amphibians, and other animals in the environment is acquired by sensors connected to the NETSIG. The signal is then analyzed and processed by the NETSIG node, and the routing of the data is performed. Because of its fast computation, the system could quickly respond to a natural disaster or, in this case, serve as an application for environmental surveillance based on acoustic monitoring of different animals. Figure 3 shows a graphical description of the system.

Figure 3. Bioacoustics Signal Processing in VESO-Mesh

Figure 3 shows the test-bed implementation that will be developed at the Jobos Bay Estuary. Here we are developing a distributed sensor network environmental test-bed with the purpose of enhancing the monitoring, modeling, and management of animals in the environment. The points A, B, C, D, E, and F are animals emitting different kinds of signals. The sensors connected to the NETSIG devices capture those signals, which are then processed and routed by the NETSIG device. All the devices in the network (routers, computers, PDAs, NETSIG devices, etc.) are connected in the context of a Wireless Mesh Network with the VESO architecture. The following section describes the exact details of the network configuration.

2.2 VESO-Mesh
The versatile service-oriented network aims to distribute the computational capacity throughout the network, using special nodes that integrate both the ability to process and the ability to transmit data, as explained previously. The master sensor nodes, where the routing and processing are done, are ALIX single-board computers.

ALIX Single Board Computer
These single-board computers give us 500 MHz of processing power, enough for some basic signal processing operations; thus, we can more efficiently manage and reduce the transmission load on the network. In order for the single-board computers to behave as routers, we first created an open-source Linux environment for them. A 16 GB flash card was used for the ALIX devices, and an Atheros miniPCI card was used for wireless capabilities. After configuring the single-board computers with a Linux environment, we installed the MadWifi driver and the OLSR routing protocol to create the mesh network. A Compact Flash (CF) card reader was used to create the Linux environment on the 16 GB flash card that is later connected to the ALIX. This procedure has to be done inside a Debian environment; in our case, the client PC was running Debian Lenny 5.0. The command fdisk -l was used to determine the device name of the CF card reader; it was entered in a terminal window and its output was /dev/sdb.

Figure 4. Creating partition

Figure 4 shows how to create a new partition on the CF card. It lists the instructions followed to create a 4 GB partition for the Linux environment and other applications; the rest of the flash memory can be used for data storage or for installing other operating systems.

Figure 5. Downloading minimal Debian Lenny 5.0

In order to install the operating system, we download debootstrap and a minimal Debian Lenny 5.0 onto the flash card. At this point we are able to enter the Debian Lenny environment as administrator (with the chroot command). There we edit the file /etc/apt/sources.list by adding the following repositories:

Figure 6. Linux Voyage repositories

At this point Linux Voyage is installed on the flash card together with Debian Lenny. Afterwards the following programs were installed: grub, locales, the MadWifi driver, and the OLSR protocol. The steps described in Figure 7 enable the Local Area Network and Wireless Area Network capabilities:

Figure 7. Network Interface

3. TEST-BED SIMULATION
The testing of the VESO-Mesh network was performed with typical routers, a laptop computer, and a NETSIG device, as shown in Figure 8.

Figure 8. Testbed in the UPRM

The distance between each of the nodes in the network was approximately 60 feet. In order to prove the multi-hop capability of the network, a packet request was made from one edge of the network to the other. Figure 9 shows the sequence of hops the packet traversed in order to reach the destination node.
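A simple end-to-end check of this kind (and of the speed tests mentioned later as future work) can be scripted. The Python sketch below is purely illustrative and is not the test actually performed: it assumes a peer node at the hypothetical address 10.0.0.7 is running a trivial UDP echo service, and it measures the round-trip time of small probes sent across the mesh.

import socket, time

PEER = ("10.0.0.7", 9999)   # hypothetical address of a far mesh node running a UDP echo

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(2.0)

for i in range(5):
    payload = f"probe-{i}".encode()
    t0 = time.perf_counter()
    sock.sendto(payload, PEER)
    try:
        data, _ = sock.recvfrom(1024)                 # echoed back across the mesh
        rtt_ms = (time.perf_counter() - t0) * 1e3
        print(f"reply {data.decode()} rtt={rtt_ms:.1f} ms")
    except socket.timeout:
        print("no reply (node unreachable or echo service not running)")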

Figure 9. Multi-Hop Experimental Results

The results were as expected; the packet visited the two intermediate nodes in order to reach its destination node.

4. CONCLUSION AND FUTURE WORK
This paper presented a wireless mesh network whose nodes, called NETSIGs, have processing and routing capabilities. The test-bed was developed at the University of Puerto Rico at Mayaguez, and the multi-hop and self-healing capabilities of the network were proven. As future work, speed tests inside the local network will be performed and the routers will be replaced with NETSIG devices (see Figure 10).

Figure 10. NETSIG in a VESO WMSN Scenario

5. ACKNOWLEDGMENT
The authors would like to thank the graduate students David Marquez and Angel Camelo for the support provided for this work. The views presented in this paper are the opinions of the authors and do not represent any official position of the agencies that partially supported this work. The authors acknowledge the National Science Foundation (Grants CNS, CNS, and CNS) for its support. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

6. REFERENCES
[1] P. Mohapatra. "Wireless Mesh Networks," Department of Computer Science, University of California, Davis, California, 2006, 2-7.
[2] C. Edmonds, D. Joiner, S. Springer, and K. Stephen. "Wireless Mesh Network," Electrical and Computer Engineering, Oregon State University, Oregon, 2008, 2-3.
[3] C. Aceros, N. Pagán, K. Lu, J. Colucci, and D. Rodriguez, "SIRLAB: An Open-Source Computational Tool Framework for Time-Frequency Signal Visualization and Analysis," 2011 IEEE DSP/SPE Workshop, Sedona, Arizona, USA, January 2011.
[4] J. Sanabria, W. Rivera, D. Rodriguez, Y. Yunes, G. Ross, and C. Sandoval, "A Sensor Grid Framework for Acoustic Surveillance Applications," IEEE Fifth International Joint Conference on INC, IMS, and IDC, S. Korea.
[5] G. Vaca-Castano, D. Rodriguez, J. Castillo, K. Lu, A. Rios, and F. Bird, "A Framework For Bioacoustical Species Classification In A Versatile Service-oriented Wireless Mesh Network," Eusipco 2010, Aalborg, Denmark.
[6] K. Lu, Y. Qian, D. Rodriguez, W. Rivera, and M. Rodriguez, "A cross-layer design framework for wireless sensor networks with environmental monitoring applications," Journal of Communications Software and Systems, vol. 3, no. 3, Sept.
[7] H. Lugo-Cordero, K. Lu, D. Rodriguez, and S. Kota, "A novel service-oriented routing algorithm for wireless mesh network," Proc. IEEE MILCOM 2008, San Diego, CA, USA.

Numerical Modeling of Bending of Cosserat Elastic Plates

Roman Kvasov                                        Lev Steinberg
Computing and Information Sciences and              Department of Mathematics
Engineering PhD Program                             University of Puerto Rico at Mayagüez
University of Puerto Rico at Mayagüez               Mayagüez, PR 00681, USA
Mayagüez, PR 00681, USA

ABSTRACT
In this work, the numerical modeling of bending of Cosserat elastic plates is presented. The discretization is applied to a new theory of Cosserat elastic plates, which is based on a generalization of the well-known Reissner plate theory. The theory provides a governing bending system that consists of 6 second-order partial differential equations. The case of hard simply supported macro and free micro boundary conditions involves combinations of high-order derivatives, which are difficult to approximate for arbitrary domains. We show that for rectangular Cosserat plates the boundary conditions can be split into the simpler forms of Dirichlet, Neumann, and mixed type. The Galerkin finite element method is proposed for solving the governing system of equations. The MATLAB environment was used for the domain discretization and the development of the finite element software; Wolfram Mathematica was used for visualization purposes. The bending of rectangular Cosserat plates was successfully simulated for a plate made of syntactic foam, and the numerical results are shown to be consistent with the previously obtained analytical solution. The error estimation shows the optimal (quadratic) rate of convergence of the method. The numerical algorithm is also shown to work for some nonrectangular plates, and the results of those simulations are provided.

KEYWORDS
Cosserat elasticity, finite element method, differential equations.

1. INTRODUCTION
Cosserat materials are classical elastic materials with extra independent degrees of freedom for local rotations. The most widely accepted bending theory of classical elastic plates was developed by E. Reissner and is based on classical elasticity [10]. However, real Cosserat materials such as bones and foams often have a number of important length scales, such as grains, particles, fibers, and cellular structures, that make the geometry of the deformation much more complex [7]. Since these materials cannot be adequately described by classical elasticity, A. C. Eringen proposed a theory of Cosserat elastic plates in the framework of Cosserat elasticity [9]. However, the Eringen plate theory does not consider a transverse variation of the microrotation over the thickness and does not produce the exact classical Reissner plate equations for zero microrotations [4]. A theory of Cosserat elastic plates that takes into account the transverse variation of microrotation is presented in [1] and is based on the classical Reissner plate theory. The new approach, in addition to the traditional model, takes into account the second-order approximation of couple stresses and the variation of three components of microrotation in the thickness direction. The governing bending system of equations was obtained from the Generalized Hellinger-Prange-Reissner variational principle, the equilibrium system of equations, and the constitutive relations [1]. The proposed plate theory had never been numerically modeled before.

In this paper, we simulate the bending of Cosserat elastic plates made of syntactic foam (Poisson's ratio 0.34 and Young's modulus 2758) and show that the numerical results are consistent with the analytical solution previously obtained in [3]. Apart from the case of rectangular plates, considered in [3], we also consider some nonrectangular plates and provide the results of the simulation. The Galerkin finite element method is shown to be an efficient way to solve the governing bending system of equations, and the error analysis confirms the quadratic (optimal) rate of convergence.

2. GOVERNING SYSTEM OF EQUATIONS
The theory of Cosserat elastic plates presented in [1] assumes the approximation of couple stresses and the variation of three components of microrotation in the thickness direction. The author formulates the Generalized Hellinger-Prange-Reissner variational principle and shows that these assumptions result in a bending equilibrium system of equations and constitutive equations. The obtained plate bending system of equations (1) consists of equilibrium equations for the bending moments, the twisting moments, the shear forces, the transverse shear forces, the micropolar bending moments, the micropolar twisting moments, and the micropolar couple moments under the applied pressure; these quantities were obtained by integrating the stress and couple stress components along the thickness of the plate [1]. The constitutive equations express these moments and forces in terms of the kinematic variables (the deflection, the rotations Ψ, and the microrotations Ω) and their derivatives.

In these relations, the kinematic variables are the vertical deflection of the middle plate, the rotations Ψ of the middle plane around the in-plane axes, the microrotations Ω in the middle plate around those axes, and the rate of change of the microrotation along the thickness; the material parameters are the flexural rigidity of the plate, the shear modulus, the characteristic lengths for torsion and bending, the Poisson's ratio, the polar ratio Ψ, and constants that depend on the form of approximation.

After the constitutive equations are substituted into the plate bending equilibrium system of equations, the governing bending system of equations (2) is obtained. Its vector of unknowns consists of the two rotations Ψ, the vertical deflection, the two microrotations Ω, and the rate of change of the microrotation Ω; the right-hand side of the system is built from the pressure, with zeros in the remaining components. The differential operators of the system involve positive constants that depend on the material (see [1] for the exact expressions and details). The system (2) is a system of 6 second-order partial differential equations. The uniqueness of the solution of the system was shown in [1] and its ellipticity in the Petrovsky sense was proven in [3]. Main regularity results and techniques for the analysis of similar elliptic systems were recently published in [2].

Establishing the strong ellipticity of the operator is known to be important for the proof of coerciveness of the corresponding variational form. To show that the operator is strongly elliptic, it is sufficient to prove that the principal symbol of the operator is positive definite. Using the appropriate algebraic identities, the principal symbol can be rewritten as a quadratic form that is a sum of nonnegative terms, and this sum vanishes only when either the frequency variable or the test vector is zero. Therefore the quadratic form is positive for all nonzero arguments, and the operator is strongly elliptic.

Now we formulate the hard simply supported macro and free micro boundary conditions, which assume that the transversal displacements and the tangential rotations vanish on the boundary. Similarly to the hard simply supported boundary conditions of the boundary value problems for the Reissner plate formulated in [8], we impose two sets of three conditions each, (3) and (4), written in terms of the unit vector normal to the boundary and the unit vector tangent to the boundary (Figure 1).

3. NUMERICAL SOLUTION
In all further numerical experiments we consider a plate of thickness 0.1 made of syntactic foam (a lightweight engineered foam consisting of hollow glass spheres embedded in a resin matrix), with Poisson's ratio 0.34, Young's modulus 2758, and the remaining technical constants set to the values 0.1, Ψ = 0.1, and 0.065. The numerical simulation of the bending of the plate implies solving the system of equations (2) complemented with the hard simply supported macro and free micro boundary conditions (3) and (4). For simplicity, we propose to solve the system of equations using the classical Galerkin finite element method (FEM) with lowest-order basis functions and triangular elements. In order to implement the boundary conditions in the FEM, we need to find their expressions in terms of the unknown variables Ψ, the deflection, and Ω. This can be accomplished by substituting the constitutive equations into each of the 6 boundary conditions.

3.1 Bending of Rectangular Cosserat Plates
Let the Cosserat elastic plate occupy a rectangular domain, with the boundary consisting of a horizontal part and a vertical part (Figure 1). We further assume that the initial distribution of the pressure is given as a product of sine functions of the two in-plane coordinates.

Figure 1. Rectangular Cosserat elastic plate

69 3.1.1 Expressions for Boundary Conditions Let us write the explicit expressions for the boundary conditions (3) and (4) in terms of the unknown variables on. The normal and tangent vectors are given as 0, 1 and 1,0 respectively. The condition 0 stays without a change. The condition 0 simply implies that Ψ 0 on. The condition 0 implies that 0. After substituting the expression for the micropolar couple moment from the field equations, we obtain Ω, 0 or equivalently Ω 0 on. The condition 0 implies that 0. After substituting the expression for the bending moment from the field equations, taking into account that Ψ, 0, and that 0 on, we obtain Ψ, 0, which is equivalent to Ψ 0. The condition 0 implies that, 0. After substituting the expressions for the bending moment and the shear force, taking into account that 0, Ψ, 0, Ψ, 0 and 0 on, we obtain Ω 0. The condition 0 implies that 0. After substituting the expression for the micropolar twisting moment from the field equations, taking into account that Ω 0 on, we obtain that Ω, 0 or Ω 0. Therefore we have the following boundary conditions on the horizontal part of the boundary : 0, Ψ 0, Ω 0 Ψ 0, Ω 0, Ω 0 In a similar way we obtain the boundary conditions on the vertical part of the boundary : 0, Ψ 0, Ω 0 Ψ 0, Ω 0, Ω 0 Note that the expressions for the boundary conditions on both and are first order boundary conditions that are now given in terms of the unknown functions and thus can be implemented in the finite element method Implementation of the Boundary Conditions The vertical deflection is imposed a homogeneous Dirichlet boundary condition, while the rate of change of the microrotation Ω homogeneous Neumann boundary condition. The functions Ψ, Ψ, Ω and Ω, were imposed the boundary conditions of mixed type (Dirichlet on and Neumann on or vice versa). The boundary conditions of Dirichlet type are known as essential boundary conditions, and affect the Hilbert space of functions where the solution is sought. The boundary conditions of Neumann type are known as natural boundary conditions, and are implemented in the variational formulation of the problem [6]. The variational formulation of the problem was obtained using the standard procedure of integration by parts on each second order operator. The Hilbert space where the solution of the variational problem is sought is defined as: where, 0 on,, 0 on,, 0 on,, and is a standard Hilbert space of functions that are square-integrable together with their first partial derivatives Numerical Results For the further numerical experiments, let us consider a square plate of size ; the initial distribution of the pressure is given as, sin sin. The governing system of equations for this plate is known to have an analytical solution. The solution was found using the method of separation of variables and some additional differentiation techniques, leading to a system of Sturm-Liouville eigenvalue problems [3]. Figure 2. The distribution of the initial pressure (left) and the analytical solution for the vertical deflection (right) It was shown that in this case the solution is: Ψ cos sin, Ψ sin cos sin sin, Ω 0 Ω sin cos, Ω cos sin The finite element simulation of the bending of the given Cosserat elastic plate produces 6 functions that are the solutions of the governing system of equations (2) (Figure 3). Figure 3. The graph of the solution Ψ, Ψ,,Ω, Ω, Ω of the governing system of equations (2) One of the quantities of physical importance is the vertical deflection. 
The graphs of the results of the simulation of the vertical deflection are given in Figure 4. Figure 4. Numerical simulation of the vertical deflection calculated with 10 and 10 finite elements respectively. The numerical solution was verified for consistency with the analytical solution obtained in [3]. The estimation of the error of

70 the approximation in L 2 -norm is given in the Table 1. The slope of the decay of the error versus the diameter of the mesh in logarithmic coordinates shows optimal (quadratic) convergence of the method. Table 1. The estimation of the error in L 2 -norm, showing quadratic convergence for all six unknowns. Diameter Ψ Ψ Ω Ω E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E-07 Slope Nonrectangular Cosserat Elastic Plates The case of nonrectangular plates represents an area open for research. There is no known analytical solution of the system of equations (2) for the domains of arbitrary form. Furthermore, the expressions for boundary conditions (3) and (4) contain linear combinations of the unknown functions together with their first and second partial derivatives, and do not split into simpler forms. These expressions are not easy to implement, and to our knowledge, there is no efficient numerical algorithm for their approximation. Figure 5. The triangulation of nonrectangular plate (left) and the initial pressure distribution (right) However, it is still possible to perform the numerical simulation on some nonrectangular domains with orthogonal boundary edges, applying the ideas for rectangular plates discussed above (see Figures 5 and 6 for results). Figure 6. Numerical simulation of the vertical deflection: isometric view (left) and front view (right) A possible way to deal with more complicated boundaries would be to orthogonalize the domain and solve the system on the new approximated boundary. However, in this case, the regularity issues and the solution convergence should be addressed. Ω 4. CONCLUSION In this paper, we numerically investigate the bending of Cosserat elastic plates. Since the case of hard simply supported macro and free micro boundary condition includes the combinations of high order derivatives, which are difficult to approximate for arbitrary domains, we separate the cases of rectangular and nonrectangular domains. For Cosserat rectangular plates, the boundary conditions are proven to split into simpler forms of Dirichlet, Neumann and mixed type boundary conditions. The Galerkin finite element method is shown to be an efficient way to solve the governing bending system of equations; the error analysis shows the optimal (quadratic) rate of convergence. The bending of rectangular Cosserat plates was successfully simulated for the plate made of syntactic foam and the numerical results are shown to be consistent with the previously obtained analytical solution. We show that the method also works for some nonrectangular plates and provide the results of the simulation. 5. ACKNOWLEDGMENTS This work is supported by NSF Grant CNS We want to thank Dr. Paul Castillo of the University of Puerto Rico at Mayagüez for his collaboration and helpful discussions. The authors also acknowledge that the publication of this work has been possible in part by the support of CISE fellowship (2010). 6. REFERENCES [1] Steinberg L Deformation of Micropolar Plates of Moderate Thickness, Int. J. of Appl. Math. And Mech. 6(17): [2] Costabel M., Dauge M., Nicaise S Corner Singularities and Analytic Regularity for Linear Elliptic Systems. Prepublication IRMAR 10-09, HAL: hal , version 2. [3] Reyes R Comparison of Elastic Plate Theories for Micropolar Materials. MS Thesis, University of Puerto Rico Mayaguez Campus. [4] Steinberg L., Elastic Plates Motions with Transverse Variation of Microrotation. ArXiv:: v2. 
[5] Madrid P Reissner Plate Theory in the Framework of Asymmetric Elasticity. MS Thesis, University of Puerto Rico Mayaguez Campus. [6] Solin P Partial differential equations and the Finite Element Method. J. Wiley & Sons, MR (2006f:35004). [7] Forest. S. Cosserat Media In Buschow K., Cahn R., Flemings M., Ilschner B., Kramer E., Mahajan S., editors, Encyclopedia of Materials: Science and Technology. Elsevier. [8] Arnold D., Falk R Edge Effects in the Reissner- Mindlin Plate Theory. Analytic and Computational Models of Shells, A.S.M.E., New York. [9] Eringen A Theory of micropolar plates. Journal of Applied Mathematics and Physics, Vol. 18, [10] Reissner E The effect of transverse shear deformation on the bending of elastic plates. Journal of Applied Mechanics, A,

71 WARD Web-based Application for Accessing Remote Sensing Archived Data Rejan Karmacharya, Samuel Rosario-Torres, and Miguel Vélez-Reyes Laboratory for Applied Remote Sensing and Image Processing Electrical and Computer Engineering Department University of Puerto Rico, Mayaguez Ph. (718) , ABSTRACT This work proposes a distributed n-tier architecture for a web based application, WARD, providing ubiquitous and secured access to archived remote sensing data. Remote sensing data is usually accompanied by auxiliary data that either elaborate the scene in which data was acquired or provide added information. Remote sensing data can be in different forms and formats and it is essential to define a good structural storage organization atop which this scalable application aims to provide an easy interface for users to input data and access archived data. The intended application provides the end users ability to upload the data pertinent to a remote sensing campaign and group them accordingly and can have access to archived data as well. The application will be built using ASP.NET and the.net framework with the backend in the form of SQL Server and file system. This scalable distributed application allows users to not be concerned with the data archiving or retrieval process. Collected data can be uploaded to the storage system from any internet connected node and retrieved via a visual searching mechanism. The data would be distributed between database and file system in different storage media allowing easy ingestion of present archive data. KEYWORDS Remote sensing, data archiving, distributed web application. 1. INTRODUCTION The growth of remote sensing technology with its application in various fields such as weather forecasting, cartography, defense, and environment have generated vast amount of data. Data collected for each study differs in size, format, quantity, quality and context. Even when a satellite captures a swath image, besides the image itself, it contains its metadata that defines its properties and the receiving station would probably maintain a log regarding the reception of the image. Likewise, when there are various types of remote sensing data being acquired for a particular region of study it is bound to contain auxiliary data associating the collected data and could further provide information about the environment as well as having log information. The usefulness of the data being collected is not only limited for the work in consideration but will be useful for future reference and hence should be managed properly. This project addresses the need for an application which can manage the data and their auxiliary. It would also provide the end users with an ability to load all the data collected during missions or via satellite acquisitions and provides an easy-to-use interface to search and retrieve the required information. The use of Relational Database Management Systems (RDBMS) for storing and managing remote sensing data provides a great platform as opposed to the still immature Object Oriented databases [1]. In addition, where there are lots of archive data as backup in external storage devices the database system may only store the location and metadata related to those files avoiding the need to copy those files to the database and thus making use of the existing storage space. This would allow data to be stored in a distributed fashion but the query for each data would only be done in the main database providing a managed repository for the remote sensing data. 
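As a concrete illustration of the hybrid catalog idea described here and elaborated in the following paragraphs, a minimal sketch is given below; the schema, values, and file paths are hypothetical, and the actual system uses SQL Server rather than SQLite:

```python
import sqlite3

# Minimal sketch of the hybrid storage idea: the relational database holds only
# metadata and the network path of each archived file, while the large image
# files themselves remain on external or network storage.
con = sqlite3.connect("ward_catalog.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS scene (
        id INTEGER PRIMARY KEY,
        sensor TEXT,
        acquired_utc TEXT,
        lat REAL, lon REAL,
        file_path TEXT,      -- network path to the archived data file
        file_mtime REAL      -- modification time, used to check consistency
    )""")
con.execute(
    "INSERT INTO scene (sensor, acquired_utc, lat, lon, file_path, file_mtime) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ("AVIRIS", "2011-03-01T15:00:00Z", 18.21, -67.14,
     r"\\archive\hsi\scene_001.img", 1299000000.0))
con.commit()

# Query scenes near a point of interest; the real system uses a polygon drawn on Google Maps.
rows = con.execute(
    "SELECT file_path FROM scene WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?",
    (18.0, 18.5, -67.5, -67.0)).fetchall()
print(rows)
```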
Due to the size of remote sensing data, its archive size increases rapidly with the potential of exceeding the storage capacity in the database storage system and hence it is necessary to address this issue. Providing a distributed storage architecture which can also store data in any drive in the network can feasibly resolve this issue. This is where hybrid data storage structure comes into play where data is stored either in the database itself or is saved in a network with its metadata being stored in the master database. Web applications provides consistent and ubiquitous framework for information exchange. ASP.NET provides a rich set of serverside controls and a prominent framework to develop web applications. ASP.NET is based upon the.net framework which provides a consistent, platform-independent and type safe programming infrastructure. In addition, it provides a good support for security, simplified development tools and techniques, separation of presentation and application logic, good debugging and application deployment features. Leveraging the benefits of web applications atop a distributed scalable framework with the aide of WCF (Windows Communication Foundation) services, a robust application can be developed for managing and accessing remote sensing data along with its accompanying data. The next section presents the background with the technologies that are to be used in the project along with the applications, technologies and research works pertinent to the current system. The third section provides a technical overview of the architecture design followed by the conclusion section and possible future enhancement works. 2. BACKGROUND Technologies The following technologies are intended to be used in this project. N-Tier Architecture The best way to build an application using the.net framework is to separate all logical procedures into separate classes providing a scalable, flexible and maintainable structure [10]. An N-Tier application consists of at least three logically separated layers with each layer liable to accomplish a separate logical process and interacts with the layer residing directly below it. A multi-tier layout helps to reduce tight coupling between UI, business processes and database and a change in any underlying layer would not affect upper layers. Hence, it is easier to change the 71

72 code or extend the application without the need for recompilation of the code or breaking the existing code. The layered layout enables deployment of each layer in a different node with minimal changes offering scalability. This architecture also reaps benefits from the use of web services for communication between its layers providing a service oriented architecture (SOA). ASP.NET and Google Maps For the front end development ASP.NET with C# on top of.net framework 4.0 would be used. ASP.NET provides an easy and unified web development and programming model with impressive tool support. It is completely object oriented and contains inbuilt page and controls framework, compiler, security framework, debugging support, web services framework and state management and application life cycle management. ASP.NET Web Forms is chosen over the MVC framework since the layered architecture readily provides ample separation of concern between programming stacks. Scripting languages such as JavaScript and jquery are inherently supported by ASP.NET adding a rich client side programming support. The popular Google Maps, based upon JavaScript API, will be used as the web mapping application to provide an interface to initiate visual search and render results visually as well [3]. As Google Maps is coded almost entirely in JavaScript and XML it readily integrates with any web application. Besides, there is Google Maps API offering a number of functions to manipulate maps and to add contents to the map [3]. It also provides web services allowing access to directions, geocoding, elevation and other information regarding places. To initiate the search in the application a polygonal area may be defined on the map and the resultant data corresponding to such regional criterion along with other provided query parameters shall be displayed on the map itself. Windows Communication Foundation (WCF) WCF is a set of Application Programming Interfaces (API) provided in the.net Framework for building service-oriented applications. It is the de facto method of building and running services leveraging the.net platform, and allows developers to define CLR types as services and then consume the same with the definition of end points and contract definitions [4]. Services are defined using a WSDL (Web Services Definition Language) interface that can be consumed by a client. The WCF implements several advanced web service standards for addressing, messaging and security. The basis of distributed service oriented system in the application is WCF SOAP web services. Besides SOAP, it also supports RSS and JSON making it pliable for future modifications. The messages using the service-oriented architecture can be passed asynchronously. The content of the message is strongly typed as defined by the contracts and the type of message is limited to the ones that can be defined using the.net language. The SOA based WCF architecture comprises of Endpoints, Contracts, Messaging, Runtime and Hosting. The endpoint defines the communicating node and the contract defines a set of operations which specify the operations supported by the endpoints. Messaging part contains channels that process message heading and messages as well. The runtime composes of a set of objects that send and receive messages. Hosting defines how a service can be hosted using either of IIS, Windows Services,.EXE, +COM or Windows Activation Services. 
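For illustration only, the WSDL published by such a WCF service can be consumed from any SOAP-capable client. A minimal Python sketch follows; the endpoint URL, operation name, and fields are hypothetical, and the actual WARD clients are the .NET layers themselves:

```python
from zeep import Client  # third-party SOAP client (pip install zeep)

# Hypothetical WSDL endpoint and operation, shown only to illustrate consuming
# a WSDL-described service from an arbitrary client.
client = Client("http://ward.example.edu/DataAccessService.svc?wsdl")
result = client.service.SearchScenes(sensor="AVIRIS",
                                     startDate="2011-01-01",
                                     endDate="2011-03-01")
for scene in result:
    print(scene.Id, scene.FilePath)
```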
Figure 1 depicts the flow of SOAP message with XML serialization and deserialization between a client node and the remote WCF service in the application. Several configuration settings for the security, transactions and reliability can be defined in the configuration files for the WCF service. Figure 1. WCF SOAP message flow. ADO.NET Entity Framework (EF) and SQL Server EF helps to eliminate the object-relational impedance mismatch between data models which is a commonality in conventional database applications [5]. EF removes the impedance mismatch by abstracting the relational schema of the data in the database and provides a more conceptual model to program with. In essence it is an ORM (Object-Relational Mapping) tool for creating a logical strongly typed object oriented relational schema of the database including a direct and efficient access to the SQL Server. It also comes with a good Language-Integrated Query (LINQ) to Entity support. SQL Server, an enterprise level Relational Database Management System (RDBMS) from Microsoft, is used as a database engine for the backend data storage platform. SQL Server is used as a database engine or a backend data storage platform for applications supporting thousands of users and hence is highly scalable and robust. All the tools that come along with it make administration and SQL programming tasks easier. Related Works WWW Information Processing Environment (WIPE) [6] is a web-based information processing and geographical information system (IP/GIS) providing network-centric manipulation of static and dynamic geo-spatial/temporal data distributed in server systems over the World Wide Web. It is a commercial product developed by Applied Coherent Technology (ACT) Corporation. WIPE makes use of distributed storage to store large number of historical and/or dynamic geo-spatial data. It provides facility to load raw data and its ingestion engine processing the data into correct format and archive the file for future reference and processing. It provides rich environment to process the data and the interaction can be either in HTML interface or JAVA applet interface. With WIPE users can search for data and download the data products as well. WIPE can be used as a platform for custom applications as well where it could be queried for data and data would be provided to it for ingestion. This application is more generic and it can only ingest and/or process a limited set of supported data formats. Moreover, WIPE is very expensive. SeaWEB [7] was developed at UPRM with a view to present and retrieve the data collected over aquatic surfaces and surfaces in an organized fashion for the data collected for the SeaBED [8] project. It is a simple web portal developed using Macromedia Flash and used Terrascope Image Navigator [9] and Terrascope 72

73 Search and Retrieval Engine [10] at the backend. SeaWEB has limited functionality and only handles remote sensing image data and groups the data according to sensor type. Terrascope was also developed at UPRM. It is a web application for search and retrieval of earth science data stored in a distributed database management system. The architecture is based upon peer-to-peer system capable of interacting with each other. It has a client Terrascope Image Navigator as a user interface and a server side Terrascope Search and Retrieval Engine for searching and retrieving data from the distributed database. Like in the case of SeaWEB this product also focused only on image data. Global Visualization Viewer from USGS (GLOVIS) [11] and Earth Explorer [13] are products from USGS which allows input of a detailed query and also has more data sets to offer but as in above mentioned cases they are only limited to maintaining records of the imagery. Google Earth Engine is a new product from Google as a planetary-scale platform aimed for environmental data and analysis. The Quicklook Swath Browser from Canada Center for Remote Sensing provides an interface somewhat similar to GLOVIS allowing selection and downloading of remote sensing images in a swath format [12]. Even though these products and research works have contributed great value to remote sensing image storage, processing and distribution, they still cannot fulfill the exact need of maintaining the archived data along with its auxiliary data and providing secured access to them. 3. TECHNICAL DETAILS System Overview WARD is based upon N-Tier architecture where the separation of concern is well maintained between the presentation layer for user interface, business layer for application processing and implementation of business logic, data access for providing abstraction to the database layer, and the database layer for actual storage of the data. Each layer is developed as a separate module to each other and can be deployed independently in different hosting machines. However, there are common functionalities such as error handling, logging and security codes which are defined such that they are available across all the layers. For communication between the layers, WCF SOAP (Simple Object Access Protocol) web services are used. It provides good visualization support for displaying the images available in the system as a response to the user query using Google Maps. Each item in the database is associated to GPS coordinate enabling its pinpoint in the mapping application utilizing available APIs. Figure 2 shows the architectural design of the application. The membership and role providers from ASP.NET shall be used for authentication and authorization services. Membership provider maintains the list of users along with their information and credentials providing a way to authenticate them. The role provider provides a way to map the users created using the membership provider to roles allowing definition of the authorized modules for the users. Only two roles are to be provided as Administrative and Registered having system wide and limited authorizations respectively. Presentation Layer This top-most layer in the architecture stack provides the user interface such as forms for input query and Google Maps display for users to interact with the system. Google provides several JavaScript APIs for adding and manipulating the mapping application, Google Maps, and the ASP.NET web forms and HTML represents the presentation components. 
ASPX pages, user controls, server controls and security related objects and classes are also included in this tier along with the controls developed using jquery. This layer can only communicate directly with the underlying business logic layer by invoking services for all data access operations. The user can select any type of polygonal region in the Google Maps and post a query as a service call and when the service invocation completes for search queries, the results are displayed either visually using the mapping application or in tabular structure. This tier shall request data and post request to the business layer via web services call defined using the WCF. The IIS web server hosts the application and is responsible for executing all the server side logic in the ASP.NET forms. The IIS might as well be considered as a separate layer in the architecture stack. Figure 2. Architecture design for WARD. Business Logic Layer It harbors the business processing logic such as pluggable data processing logic using interface implementations. It also contains service routing agents which route the requests and responses to and from appropriate destination endpoint addresses. These routing agents redirect the requests from the invoked service to the underlying apposite web service. It contains business objects that are integral to the functionality of all the layers. This layer serves as an interface between the actual data and the client application providing an abstract representation of the data and its access. The business layer contains a pluggable architecture developed with the help of interface definition and.net reflection. In its configuration file users will be able to specify the codes or the sequence of codes to run along with the path of the assembly containing the code. When the application detects this setting it invokes the code from the given assembly. The code in the given assembly should implement an interface contract defining the output and input type and this consistency helps the definition of a pluggable interface. The user can later simply add the code and its configuration according to their need and the application will execute and use the result of its processing. The plugin codes can be added or removed as needed. This tier contains definitions of business entities which are data containers defined to encapsulate the data representation. These entities are the objects encapsulating the result set returned by the queries from the database using LINQ in the subsequent layer and are used to transfer data from one layer to the another layer. Also, service interface are present exposing the functionalities which can be used by the preceding presentation layer. 73
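The reflection-based pluggable architecture described above can be sketched as follows, purely as an illustration in Python rather than the actual .NET implementation; the module and class names are hypothetical:

```python
import importlib
from abc import ABC, abstractmethod

class ProcessingPlugin(ABC):
    """Contract that every pluggable processing step must implement (hypothetical interface)."""
    @abstractmethod
    def process(self, data: bytes) -> bytes: ...

def load_plugin(module_path: str, class_name: str) -> ProcessingPlugin:
    """Load a plugin class named in a configuration file, mirroring the
    reflection-based assembly loading described above."""
    module = importlib.import_module(module_path)
    plugin_cls = getattr(module, class_name)
    plugin = plugin_cls()
    if not isinstance(plugin, ProcessingPlugin):
        raise TypeError(f"{class_name} does not implement ProcessingPlugin")
    return plugin

# Hypothetical usage, with the module/class taken from a config entry:
# plugin = load_plugin("ward_plugins.denoise", "DenoiseStep")
# result = plugin.process(raw_bytes)
```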

74 Data Access Layer This layer abstracts the storage layer and provides the requested service data from the database or the file system. This layer shall define interface for interacting with the SQL server to read the configuration settings and data and an interface for interacting with the file system containing the remote sensing image files. It is only responsible in interacting with either the relational database system or the file system. The complexity of access to the files and database will be encapsulated by this layer and the business layer requesting data shall only be concerned about the methods to invoke and the type of objects returned by this layer. All the queries to the database are placed in stored procedures to help in consolidating and centralizing the data access logic and are saved as a part of the database itself. The design of this layer would be made flexible in order to accommodate usage of other relational database tools in the future such as MySQL. To query the entities, it will feature LINQ to Entities and LINQ to SQL returning references to the business objects defined in the business layer. Entity Framework with its object relational mapping provides the basis for this layer with various APIs to define the database context based upon the entities upon which CRUD (Create, retrieve, update and delete) operations can be invoked as user defined methods. One important issue is while downloading or uploading the hyperspectral or multispectral image files due to their sizes. The WCF web service by default doesn t support transfer of large files and hence they need to be configured for it. This layer will publish its web service and will be a separate assembly application in itself. Storage Layer It represents the actual physical device and SQL Server database engine which stores the data along with the file system where the remaining files are stored. The user authentication data, metadata of the image files, and all other auxiliary files are stored in the SQL Server. Any data which cannot be readily ingested into the database or current archive data in large volume will be placed in remote network drives and the network paths and/or its auxiliary data are stored in the database. The consistency will be maintained by using the file modification time and the time stored in the database. The SQL Server and file system form the hybrid storage system. Only data access layer can directly access the data available in this layer and it can only be done by authenticated users and hence providing a robust security measure. 4. CONCLUSION AND FUTURE WORK Here a solution for the management and access to remote sensing data in different formats, whether it is a point data or a hyperspectral image, along with its various possible supplemental data such as log files is proposed. A scalable distributed framework based upon service oriented architecture using WCF SOAP web services is presented to transfer remote sensing image data. Google maps provide a good interface for initiating a visual search and the resulting information can be depicted in the maps as well as provide a user friendly user interface. The application can be extended to contain various remote sensing image processing algorithms which can be configured to run for a given image type before uploading or downloading the content. 
For swath images, the region covered by the image may be depicted using the mapping application and development of an ingestion engine to import as many image types as possible would be the future goal. 5. ACKNOWLEDGMENTS This work was supported the NASA EPSCoR Program under award NNX09AV03A. The work used facilities of the Bernard M. Gordon Center for Subsurface Sensing and Imaging Systems sponsored by the Engineering Research Centers Program of the National Science Foundation under Award EEC Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or NASA. 6. REFERENCES [1] Mi Wang, Jianya Gong and Deren Li. Design and Implementation of Large-Scale Image Database Management System. Geomatics and Information Science of Wuhan University. Wuhan University [2] Paul D. Sheriff. Designing a.net Application. Internet: Apr 2002[Feb 2011] [3] Google. Google Maps API Family. Internet: [Feb 2011] [4] Microsoft. Windows Communication Foundation. Internet: Mar 2010[Feb 2011] [5] Microsoft. The ADO.NET Entity Framework Overview. Internet: Jun 2006[Feb 2011] [6] Applied Coherent Technology Corporation. Wipe Overview. Internet: wipe_overview.pdf. [Feb 2011] [7] Santiago. C. D. (2006). SeaWEB: A Web Protal for the CenSSIS SeaBED TESTBED. University of Puerto Rico, Mayaguez. [8] James Goodman, Miguel Vélez-Reyes, Fernando Gilbes, Shawn Hunt, Roy Armstrong. (2007). SeaBED: A Laboratory and Field Test Environment for the Validation of Coastal Hyperspectral Image Analysis Algorithms. University of Puerto Rico, Mayaguez. [9] Álvarez, A. C. (2003). TIN: TerraScope Image Navigator Providing Effective Ubiquitous Access to Distributed Multimedia Earth Science Data. University of Puerto Rico, Mayaguez. [10] M. Rodríguez, Enna Z. Coronado Pacora. (2003). SRE: Search and Retrieval Engine of TerraScope Earth Science Information System, University of Puerto Rico, Mayaguez. [11] USGS. Global Visualization Viewer. Internet: [Feb 2011] [12] Canada Center for Remote Sensing. Quicklook Swath Browser. Internet: [Feb 2011] [13] USGS. Earth Explorer. Internet: [Feb 2011] 74

PDA Multidimensional Signal Representation for Environmental Surveillance Monitoring Applications Nataira Pagán, Undergraduate Student, University of Puerto Rico Mayagüez Campus; Angel Camelo, Graduate Student, University of Puerto Rico Mayagüez Campus; Domingo Rodríguez, Advisor, University of Puerto Rico Mayagüez Campus, domingo@ece.uprm.edu

ABSTRACT This work presents a signal representation environment for personal digital assistant (PDA) units for surveillance monitoring applications. The signal representation modality is multidimensional, with the PDA unit able to display image and video information associated with signals acquired from the environment. The work concentrates on the multidimensional representation of bioacoustic signals after they have been processed by a signal processing application. The signal representation application used is SIRLAB, a computational modeling framework built for time-frequency signal analysis. The software environment presented here has been used as part of a wireless mesh sensor network equipped with specialized embedded computing nodes that integrate network protocol operations and signal processing computations.

KEYWORDS PDA, iPhone, Android, wireless mesh sensor network, bioacoustics.

1. INTRODUCTION There is a need to design and develop user interface instrumentation for real-time visualization of bioacoustic data while conducting analysis and classification tasks when monitoring birds and anurans (see Figure 1) in reserves and environmental observatories. The signals acquired from this environment must be priority data for scientists and experts in the research area. By priority data we mean that the acquired signals must contain significant records, because a graphical representation of them will be processed and displayed on a Personal Digital Assistant (PDA). As part of the system instrumentation, different computational units carry out the signal processing tasks. These computational units will also be the access points from which the PDA acquires the processed data. A PDA unit is a fundamental implementation entity in Digital Signal Processing (DSP) system instrumentation because it is used for addressing visualization issues for applications such as surveillance monitoring. The idea of the system instrumentation rests on the importance of this method of measurement: it enables quick action in an efficient way because the technique is basically to implement signal processing algorithms. Figure 1. PDA Interfaces for Monitoring Applications. We have defined the instrumentation as the integration of a DSP unit, an embedded computing unit, and a PDA unit into a single system. SIRLAB is an application implemented in the embedded computing unit that provides the data that we actually display on the PDA (see Figure 2). SIRLAB stands for SIgnal Representation LABoratory and is a computational tool for the visualization and analysis of signals with time-dependent spectral content. SIRLAB was developed at the Automated Information Processing Laboratory of the Electrical and Computer Engineering Department of the University of Puerto Rico.
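For illustration, the kind of time-frequency image that SIRLAB produces and that the PDA ultimately displays can be sketched as follows; this is not SIRLAB itself, and the file name is hypothetical:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram
import matplotlib.pyplot as plt

# Hypothetical field recording of anuran vocalizations.
fs, x = wavfile.read("anuran_recording.wav")
if x.ndim > 1:                # keep a single channel
    x = x[:, 0]
x = x.astype(np.float64)

# Short-time Fourier magnitude: the time-frequency image shown on the PDA.
f, t, Sxx = spectrogram(x, fs=fs, nperseg=1024, noverlap=512)
plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12), shading="auto")
plt.xlabel("Time (s)"); plt.ylabel("Frequency (Hz)")
plt.title("Time-frequency representation of the vocalization")
plt.savefig("spectrogram.png")  # an image a mobile client could fetch over the web
```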

Figure 2. SIRLAB Concept Used in a Web Environment

In order to improve access to information in the environment under study, we found that PDAs are practical tools and that it is possible to develop applications that contribute, in a friendlier way, to a user's data visualization needs. In order to display available data anytime, anywhere, we must design applications for mobile devices such as Android and iPhone. Now, focusing on providing a complete tool that offers all the specifications a scientist needs to study the data, such as a tool that processes and analyzes the signals acquired in the environment under study, a workstation is an option. However, in order to display the necessary information to complete research studies, we must provide a unit more portable than a workstation. Some precise specifications, and the ones we are expecting to provide along with raw data, include image, video, and audio representations of the processed signals.

2. BACKGROUND AND RELATED WORK This section describes background and research work related to the use of personal digital assistant (PDA) units as user interface monitoring devices in the environment under study. It is important to point out that the research work presented in this paper introduces a novel technique of visualizing vocalizations as time-frequency image representations to enhance signal processing operations such as the Fourier transform. Raymond R. Bond, et al., present a web-based tool for processing and visualizing body surface potential maps. The tool developed can be launched from an iPhone home screen. The authors developed a web-based application using the eXtensible Markup Language (XML) and Adobe Flash CS4 as Web development tools. The application generates the display of the information via web and it can work from anywhere within the environment under study. Ming-Chiao Chen, et al., developed an Android/OSGi-based vehicular network management system. One of the implementation interfaces developed by the authors, N. Pagán, et al., is also based on embedded OSGi frameworks, with applications for Android, with the ability of illustrating time-frequency images from a NetSig. As a matter of fact, the application designed interacts with the web to access the processed data from the bioacoustic signals recorded in the system under study. George Shih, et al., dispute whether the iPhone or Android is the platform for innovation in imaging informatics. The authors compare, among many things, the complexity of the equipment and the development of applications for it. We have chosen both PDAs to provide an application that displays the data. We performed a comparison between both devices and we found that the evaluation approach to be taken at the beginning was to explore the features of both of them. From others' work, we designed the interface of our project to provide a user-friendly application. We decided to access information via web and to evaluate the complexity at the programming level of the devices. Also, by reviewing different applications, we established the conceptual system architecture we approached in our work.

3. THEORETICAL FORMULATION In this section we present the foundational concepts associated with this work from the point of view of signal-based information processing. Information processing is defined as the treatment of information to effect an observable (measurable) change.
Signal-based information processing (SbIP) deals with the treatment of signals in order to extract information relevant to a user. A signal is defined as any observable entity able to carry information from one event to another in the space-time domain. We are interested in signals, which are discrete, and of finite duration. Any discrete finite signal can be represented as a finite numeric sequence stored in a register after a measurement processed has been completed. The signal at this point is defined as raw data. Raw data is converted into process data after the data is treated using signal-processing methods. This work deals principally with the management and representation of processed data in two specific formats: 1) image 76

77 representation format, and 2) video representation format. We proceed to discuss in few the details the representation of data in image format on specific PDA s. Video representation is discussed as a developing work. An image representation is obtained from raw data through the use of a signal processing method called time- frequency distribution computation. A time-frequency distribution computation centers on the representation of the spectral characteristic of an acoustic signal as a function of time. As a result, a two-dimensional representation is obtained where one of the dimensions is time and the other dimension is frequency. A spectrum is defined as the representation of a numeric sequence into its frequency components. A program paradigm informs the language ordered in code of computer programming. Object oriented programming paradigm employs objects that usually compile methods encapsulated in a particular instance of a class to design programmed applications. This informs how the programming language is structured in the code. Java supports an object oriented, structured, and imperative programming paradigm. Structured programming means that it can be seen as a GOTO statement. Imperative programming describes calculations in terms of statements. Though, Objective-C C is also object oriented both own the paradigm of reflection which can observe and modify its own structure. 4. IMPLEMENTATION RESULTS In this section we present the recent work we have developed and the approach we are taking in terms of the applications created for PDA s devices. The arrival of the iphone simulator and Software Development Kit (SDK) has expanded design opportunities for developers. This simulator runs on the platform Xcode, which uses Objective-C, derived from the standard C language, as its programming language. The advantage of Xcode is that it has all the functions of an Integrated Development Environment (IDE) as well as tested the applications for the iphone simulator, which ensure that the designed and developed interface works as close as expected (see Figure 3). A Software Developer Kit (SDK) is also provided for an Android platform using Java programming language. The Android Development Tools (ADT) is connected with Eclipse IDE development environment that strengthens any design and development effort. Thanks to this connection, a developer can create and debug Android applications quickly and rather easily (see Figure 4). Overall, the implementation results show a promise on the design and development of PDA user interfaces for visualization applications associated with high-bandwidth multimedia wireless sensor networks. In particular, the idea of visualizing time-frequency signal representations of vocalizations of neo-tropical anurans in an environmental surveillance monitoring setting appears to be fruitful. 5. CONCLUSION AND FUTURE WORK The work presented here is concerned with the access and gathering of raw signal data in an environmental observatory with a high bandwidth multimedia sensor network, by a scientist or information user having the capability of freely moving in this network, accessing any desired directory at a particular time. Bioacoustics signal may be acquired from this surveillance-monitoring environment and treated, in the computational signal processing sense to extract information relevant to a user. Figure 3.. iphone User s Interface Figure 4.. Android User s Interface Emulator 77

78 Android is an operating system and iphone is a mobile device running ios.. Our first approach was to develop an image display application, more like a gallery of images that were registered internally in the memory of the phone. Our second approach was to access the images via web because in this manner, of all the data acquired from the environment, the scientist would select the most relevant. The application build was able to visualize images and also hear audio files that were also recorded in the environment under study. At first, we designed a preliminary application for a PDA that works in a simple way accessing the web where a directory contains all the images, video, and audio frames that were organized after being analyzed with the SIRLAB tool framework. We did provide a user interface to visualize the images and monitor the audio with both devices but we are trying to simplify the study of the species recorded by computing the data acquired from the environment and providing video signals. Below, Table I presents a comparison of some PDA user interface attributes and specifications, including a Palm OS (see Fig. 4), identified as important while designing and developing user interface algorithm implementations. TABLE I USER INTERFACE FEATURES iphone Android Op. System ios Linux Audio Display Capability Multi-Touch Capability Response N/A Sensitivity Landscape Feature Organization Display Full-Size Display Feature Quick Access Display Feature Friendliness & Appearance Inside Scroll View Inside Zoom View Storage Capability N/A N/A PDA Palm OS N/A N/A N/A N/A N/A N/A N/A N/A This informal comparison between mobile devices shows some of the characteristics of current mobiles that may induce a developer to favor a particular platform over another. For the case of design and implementation of time- frequency acoustic signal representations, entations, features such as Multi-Touch, storage, and friendliness & appearance become important. Our future implementation will be to provide a new version of the application designed that can display video files. We are looking forward to implement video signals to our application so that way scientists will able to understand the behavior of species under study. 7. ACKNOWLEDGMENTS The author(s) acknowledge(s) the National Science Foundation (Grants CNS , CNS , and CNS ) for its support. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. 8. REFERENCES [1] C. Aceros, N. Pagán, K. Lu, J. Colucci, and D. Rodriguez, "SIRLAB: An Open-Source Computational Tool Framework for Time-Frequency Signal Visualization and Analysis," 2011 IEEE DSP/SPE Workshop, Sedona, Arizona, USA, January [2] R. R. Bond, D. D. Finlay, C.D. Nugent, and G. Moore, A Web-based based tool for processing and visualizing body surface potential maps, Journal of Electrocardiology, Elsevier, Belfast, UK, 12, Apr [3] M. Chen, J. Chen, and T. Chang, Android/OSGibased vehicular network management system, Computer Communications,, Elsevier, Taiwan, [4] H. Lugo-Cordero, K. Lu, D. Rodriguez, and S. Kota, A novel service-oriented routing algorithm for wireless mesh network,, Proc. IEEE MILCOM 2008, San Diego, CA, USA, [5] J. J. Rodrigues, O.R. Pereira, and P.A. Neves, Biofeedback data visualization for body sensor networks, Journal of Network and Computer Applications, Elsevier, Portugal, 9, Aug A [6] J. Sanabria, W. Rivera, D. Rodriguez, Y. Yunes, G. 
Ross, and C. Sandoval, A Sensor Grid Framework for Acoustic Surveillance Applications, IEEE Fifth International Joint Conference, on INC, IMS, and IDC, pp , 38, S. Korea, [7] G. Shih, P. Lakhani, ani, and P. Nagy, Is Android or iphone the Platform for Innovation in Imaging Informatics, Journal of Digital Imaging, vol. 23, no. 1, Feb [8] G. Vaca-Castano, D. Rodriguez, J. Castillo, K. Lu, A. Rios, and F. Bird, A Framework For Bioacoustical Species Classification In A Versatile Service-oriented Wireless Mesh Network,, Eusipco 2010, Aalborg, Denmark. 78

79 Hyperspectral Unmixing using Probabilistic Positive Matrix Factorization Based on the variability of the calculated endmembers Miguel Goenaga-Jiménez University of Puerto Rico at Mayagüez Campus Call Box 9000, Mayagüez, PR 00681, tel: ext ABSTRACT Unmixing is the hyperspectral image processing problem of extracting the spectral signatures (or endmembers) and the fractional coverage of the materials (or abundances) present in a single pixel. Most approaches for unmixing assume that a single endmember represents a material category. However, in natural environments different factors cause variability. The proposed approach is based on positive matrix factorization (PMF). This method uses a single initial endmember and then finds the abundances by a least square optimization process. Instead of using a single endmember our approach finds a probabilistic distribution function (pdf) from several endmembers of the same material and then, this pdf is used to find the abundance. Comparison between this approach and the standard PMF are presented. KEYWORDS Hyperspectral Images, Unmixing, Endmembers variability, Constrained Positive Matrix Factorization, Probabilistic estimator. 1. INTRODUCTION Hyperspectral imagers (HSI) can collect hundreds of images corresponding to narrow contiguous spectral bands. HSI technology is used in a wide range of remote sensing applications such as agriculture, geology, ecology, surveillance, etc. Processing of Hyperspectral data is not a simple task because it usually requires to deal with large volumes of data produced inherently by HSI systems. Furthermore, in many practical applications, such as threat detection in surveillance systems, it is crucial to perform the processing in real time, making this task even more challenging. 1.1 Hyperspectral Imaging (HSI). Materials reflect, absorb, and emit electromagnetic energy, at specific wavelengths, in distinctive patterns related to their composition. A HSI image is built from the energy radiated by the earth which is collected by the sensor. The smallest element in a HSI is called pixel. The reflectance associated with each pixel is the result of the interaction between different physical factors but in particular, between the constituent materials in the field of view of the sensor. Three parameters characterize a HSI: the spatial resolution that determines the spatial size of the pixel; the spectral resolution that is the wavelength width of the different frequency bands recorded by the sensor; and the radiometric resolution that is the number of different intensities of radiation that the sensor is able to discriminate. Typically, HSI have few 100s of bands. Each Hyperspectral pixel is a vector, where the number of components Miguel Vélez-Reyes University of Puerto Rico at Mayaguez Campus Call Box 9000, Mayagüez, PR 00681,, tel: ext miguel.velez7@upr.edu depend on the number of bands in the image and represents a spectral signature (see Figure 1). Figure 1. Endmembers for Hyperspectral Image [1] 1.2 Linear Mixing Problem. Any approach for effective unmixing of Hyperspectral data must begin with a model describing how constituent material substances in a pixel combine to yield the composite spectrum measured at the sensor. 
Mixing models attempt to represent the underlying physics that are the foundation of Hyperspectral phenomenology, and unmixing algorithms use these models to perform the inverse operation, attempting to recover the endmembers and their associated fractional abundances from the mixed-pixel spectrum. Figure 2 (from [2]). illustrates the two categories of mixing models, the linear mixing model and the nonlinear mixing models. In Hyperspectral imaging, the reflected or emitted radiation represented by a single pixel in the remotely sensed image rarely comes from the interaction with a single homogeneous material. A pixel would be pure if the spatial resolution was smaller than the size of the class portion in the image, but, in real data it is not common to find pure pixels. However, the high spectral resolution of imaging spectrometers enables the detection, identification, and classification of sub-pixel objects from their contribution to the measured spectral signal. The linear mixing model [1] presents a pixel as the linear combination of the spectral signatures of each material multiplied by its relative abundances. 79

Figure 2. The linear mixing model assumes a well-defined proportional checkerboard mixture of materials, with a single reflection of the illuminating solar radiation [2].

The spectral signature of each pure material is known as the endmember. The model is mathematically presented for each pixel by

x = Sa + w,   (1)

where x is the measured spectral signature at a pixel, S is an m × p matrix of endmembers, a is the p-dimensional vector of spectral abundances, and w is an m-dimensional vector of measurement noise; m is the number of spectral bands and p is the number of endmembers [1]. In HSI, m > p. Notice that all elements of S, a, and x are constrained to be positive, and the abundances of each pixel sum to one. For the entire HSI, the linear mixing model given above can be written in matrix form as

X = SA + W.   (2)

1.3 The Constrained Positive Matrix Factorization (cPMF). Masalmah and Vélez-Reyes [4] and Jia and Qian [3][5] have shown that the unmixing problem can be related to the computation of a constrained positive matrix factorization (cPMF) where, in addition to the positivity constraint, a sum-to-one constraint is added to the columns of A. The cPMF has been shown to be a powerful approach for hyperspectral unmixing [4]. Jia and Qian [3] also include sparseness and smoothness constraints based on the studies of Pascual-Montano et al. [7] and Hoyer [8]. The cPMF is the solution to the optimization problem [6]

min_{S ≥ 0, A ≥ 0} ||X − SA||_F²   subject to   Σ_i a_ij = 1 for every pixel j.   (3)

Initial endmembers are obtained using the SVDSS algorithm [6] in order to find the most independent spectral signatures in the image. This method has two principal iterative steps. In the first step, the endmember matrix is fixed and the corresponding abundances are estimated. In this step, we enforce Σ_i a_i = 1 for each pixel. This constraint has a physical significance because a material cannot have an associated abundance above 100%. The estimation of each matrix on each step is done by solving a nonnegative linear least-squares problem. The method iterates between the two steps until convergence is achieved. A key feature is the capability of the cPMF to extract endmembers from the image that are not present as pure pixels. This is an important feature since many objects of interest are only present at a sub-pixel level and most approaches for unmixing assume the presence of pure pixels. The sparseness concept refers to an image where only a few materials are present in each pixel, which implies that abundances in most pixels are zero or close to zero and only a few pixels contribute considerable information [8]. It also suggests that the number of basis components required to represent X is minimized. Actually, PMF produces a sparse representation of the data, with the disadvantage that we cannot control the degree to which the representation is sparse. The sparseness measure permits us to identify how much energy of a vector is crowded into only a few components. The sparseness criterion is introduced by including the minimization of the matrix of abundances using the ℓ1 norm [9], so the complete minimization problem becomes

min_{S ≥ 0, A ≥ 0} ||X − SA||_F² + λ||A||_1   subject to   Σ_i a_ij = 1 for every pixel j,   (4)

where λ weights the sparseness term.

1.4 Positive Matrix Factorization with Gaussian Process (PMF). In this section a general method for including prior knowledge in a positive matrix factorization (PMF), based on Gaussian process priors, is presented. It is assumed that the non-negative factors in the PMF are linked by a strictly increasing function to an underlying Gaussian process, specified by its covariance function.
This allows finding PMF decompositions that agree with the prior knowledge of the distribution of the factors, such as sparseness, smoothness, and symmetries. Next we derive a method for including prior information in a Non-negative Matrix Factorization (NMF) decomposition by assuming Gaussian process priors [10]. In this approach, the Gaussian process priors are linked to the non-negative factors in the NMF by a suitable link function. To define the notation, let's start by deriving the standard NMF method as a maximum likelihood (ML) estimator, then move on to the maximum a posteriori (MAP) estimator, and then discuss Gaussian process priors [11]. The NMF problem can be stated as

X = SA + W,   (5)

where X ∈ R^{M×N} is a data matrix that is factorized as the product of two element-wise non-negative matrices, S ∈ R^{M×L} and A ∈ R^{L×N}, and W ∈ R^{M×N} is the residual noise. There exist a number of different algorithms [3][4][7][8] for computing this factorization, some of which can be viewed as maximum likelihood methods under certain assumptions about the distribution of the data. For example, least-squares NMF corresponds to i.i.d. Gaussian noise [13] and NMF based on the KL divergence corresponds to a Poisson process [12]. The ML estimate of S and A is given by

(S, A)_ML = arg min_{S ≥ 0, A ≥ 0} [ −log p(X | S, A) ],   (6)

where the term to be optimized is the negative log-likelihood of the factors.

2. METHODOLOGY In this paper we consider the variability of the endmembers selected early in the process by modeling them as samples of a Gaussian pdf. The mean of the class is used to initialize the cPMF algorithm. The approach used here is shown in Figure 3.
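A minimal sketch of the two-step alternating procedure described in Section 1.3 is given below as an illustration rather than the authors' implementation; the sum-to-one constraint is approximated with a standard weighted-row augmentation, and the parameter names are hypothetical:

```python
import numpy as np
from scipy.optimize import nnls

def cpmf_unmix(X, S0, n_iter=50, delta=10.0):
    """Illustrative two-step constrained PMF (not the authors' code).
    X: bands x pixels data matrix; S0: bands x p initial endmembers (e.g., from SVDSS).
    The sum-to-one constraint on each abundance column is enforced approximately by
    appending a row of ones weighted by delta to the least-squares system."""
    S = S0.copy().astype(np.float64)
    m, n = X.shape
    p = S.shape[1]
    A = np.zeros((p, n))
    for _ in range(n_iter):
        # Step 1: fix S, estimate the abundances pixel by pixel (nonnegative LS + sum-to-one).
        S_aug = np.vstack([S, delta * np.ones((1, p))])
        for j in range(n):
            x_aug = np.append(X[:, j], delta)
            A[:, j], _ = nnls(S_aug, x_aug)
        # Step 2: fix A, update the endmembers band by band (nonnegative LS).
        for i in range(m):
            S[i, :], _ = nnls(A.T, X[i, :])
    return S, A
```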

Figure 4. Original Image APHill RGB Composite. Figure 3. Experimental Methodology used.

3. EXPERIMENT RESULTS For experimental purposes we use a portion of the AP Hill AVIRIS image depicted in Figure 4. A set of five signatures was used to test the results, but only results for the vegetation endmember are shown. In this section we show preliminary results of the cPMF method for hyperspectral unmixing when the initial endmembers are estimated using Maximum Likelihood. This was repeated six times to observe how endmember variability can affect the calculation of the abundances. In the AP Hill image we found vegetation, some buildings, roads, rooftops, and bare soil as endmembers. Figure 5 shows the variability obtained for the same spectral signature in different parts of the image. Using the probabilistic positive matrix factorization we obtained the estimated vegetation endmember (see Figure 6), and using only the positive matrix factorization (PMF) method we obtained the endmember shown in Figure 7. We can see that PMF with the probabilistic estimator works properly, and the results are quite similar to those obtained with the standard PMF. Figure 8 shows the abundance for the vegetation endmember using the proposed approach.

4. CONCLUSIONS In this work we studied how the variability in the calculation of endmembers and abundances affects the process of unmixing hyperspectral images. We suggest as future work an improved Positive Matrix Factorization method that models this variability as a probabilistic function Figure 5. Variability of vegetation endmember for APHill Image. Figure 6. Estimated Vegetation Endmember.

82 Figure 7. Output of cpmf of Vegetation Endmember. Figure 8. Output Abundance of cpmf for Vegetation Endmember. (Maximum Likelihood) for estimating the endmembers in the image. 5. ACKNOWLEDGMENTS This work is sponsored by the Center for Subsurface Imaging Systems (CenSSIS) under NSF Grant Number EEC Additional support came from the U.S. Department of Homeland Security under Award Number 2008-ST-061-ED0001. Presentation of this poster was supported in part by NSF Grant CNS REFERENCES [1] N. Keshava and J.F. Mustard. "Spectral Unmixing". Signal Processing Magazine, IEEE, 19(1):44-57, Jan [2] N. Keshava and J.F. Mustrad. "A Survey of Spectral Unmixing Algorithms". In Lincoln Laboratory Journal, Volume: 14. Number 1, Page(s): 55-78, April [3] D. Lee and H. Sebastian Seung. "Algorithms for Nonnegative Matrix Factorization". In Advances in Neural Information Processing, pp , [4] Masalmah Y. M. and Vélez-Reyes M. "A full algorithm to compute the constrained Positive Matrix Factorization and its application in unsupervised unmixing of Hyperspectral imagery". In Sylvia S. Shen and Paul E. Lewis, editors, Algorithms and Technologies for Multispectral, Hyperspectral, and Ultraspectral Imagery XIV, volume 6966, page 69661C. SPIE, [5] S. Jia and Y. Qian. " Constrained Nonnegative Matrix Factorization for Hyperspectral unmixing ". Geoscience and Remote Sensing, IEEE Transactions on, 47(1): , January [6] Masalmah Y, Vélez-Reyes, M. "Unsupervised Unmixing of Hyperspectral Imagery". Circuits and Systems, MWSCAS'06. 49th IEEE International Midwest Symposium on Volume: 2. Page(s): , October, [7] A Pascual-Montano, J.M. Carazo, K. Kochi, D. Lehmann, and R.D. Pascual-Marqui. "Nonsmooth Nonnegative Matrix Factorization (nsnmf)". Pattern Analysis and Machine Intelligence, IEEE Transactions on, 28(3), Page(s): , March [8] Patrik O. Hoyer. "Non-negative Matrix Factorization with Sparseness Constraints". Journal of Machine Learning Research 5, Aug [9] K. Seung-Jean, K. Koh, M. Lustig and D. Gorinevsky. " An Interiorpoint Method for Large-scale l1-regularized least squares". Selected Topics in Signal Processing, IEEE Journal of, 1(4), Pages: , Dec, [10] C. E. Rasmussen and C. K. I. Williams."Gaussian Processes for Machine Learning". MIT Press, [11] M.N. Schmidt, H. Laurberg. "Non-negative Matrix Factorization with Gaussian Process Priors". Technical University of Denmark and Aalborg University. 17 Pages, January 16,2008. [12] T. Virtanen, A.T. Cemgil."Mixtures of Gamma Priors for Non-Negative Matrix Factorization Based Speech Separation". Tampere University of Technology. Tampere, Finland. Bogazici University, Istanbul, Turkey [13] A. Cichocki, R. Zdunek, and S. ichi Amari, "Csiszar's Divergences for Nonnegative Matrix Factorization: Family of new algorithms".in lecture notes in Computer Science, vol. 3889, pp ,

Evaluation of GPU Architecture for the Implementation of Target Detection Algorithms using Hyperspectral Imagery

Blas Trigueros-Espinosa, Miguel Velez-Reyes, Nayda G. Santiago-Santiago
Laboratory for Applied Remote Sensing and Image Processing, University of Puerto Rico, Mayaguez Campus, P.O. Box 3535, Mayaguez, PR. Ph. (787), FAX (787). E-mails: blas.trigueros, miguel.velez

ABSTRACT
A hyperspectral target detection algorithm tries to identify the presence of target objects or materials in a scene by exploiting the spatial and high-resolution spectral information contained in a remotely sensed hyperspectral image. This process may be computationally intensive due to the large data volumes generated by hyperspectral sensors, which is an important limitation for applications where detection must be performed in real time (surveillance, explosive detection, etc.). In the field of high performance computing, the technology of graphics processing units (GPUs), driven by the demanding market of video games, has recently experienced a rapid development, reaching a computational power well above the fastest multicore CPUs. In this work, we evaluate the performance of the NVIDIA Tesla C1060 GPU as a computing platform for the implementation of three target detection algorithms (the RX algorithm, the matched filter, and the adaptive matched subspace detector). The detection accuracy of the implemented algorithms was evaluated using a set of phantom images simulating traces of different materials on clothing. We have achieved a maximum speedup in the GPU implementations of around 20x over a multicore CPU implementation. The resulting speedups and running times suggest that applications for real-time detection of targets in HSI can greatly benefit from the performance of GPUs as processing hardware.

KEYWORDS
Hyperspectral imaging (HSI), target detection, graphics processing units (GPUs), CUDA, parallel processing.

1. INTRODUCTION
Remote detection and identification of targets has become a desirable ability over the last few years, especially in military and public safety application domains [1, 2]. Multispectral and hyperspectral imaging (HSI) techniques have traditionally been used in remote sensing applications for target detection and classification in a wide range of areas such as defense and security, biomedical imaging, earth sciences, etc. [1-4]. Recent advances in hyperspectral imaging sensors allow the acquisition of images of a scene at hundreds of contiguous narrow spectral bands. Target detection algorithms try to exploit this high-resolution spectral information to detect target materials present in the imaged scene. A target detection problem in HSI can be described as a binary hypothesis test where two competing hypotheses are generated to differentiate the pixels containing the target of interest from the pixels containing only background spectra. Based on the measured pixel spectrum, a detection algorithm has to decide which hypothesis is true (target absent or present). The optimal detector is given by the likelihood ratio test defined as [5]:

Λ(x) = p(x | H_1) / p(x | H_0),

where p(x | H_1) and p(x | H_0) are the conditional probability density functions under the hypothesis of target present, H_1, and the hypothesis of target absent, H_0, respectively. If the ratio Λ(x) exceeds a given threshold η, then H_1 (target present) will be selected as true. Otherwise, H_0 will be selected.
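As an illustration of this decision rule (not the implementation evaluated in this paper), a minimal sketch assuming Gaussian class-conditional densities with a shared covariance could look as follows; the function and variable names are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

def likelihood_ratio_test(x, mu_target, mu_background, cov, eta):
    """Decide H1 (target present) when Lambda(x) = p(x|H1)/p(x|H0) > eta."""
    p1 = multivariate_normal.pdf(x, mean=mu_target, cov=cov)
    p0 = multivariate_normal.pdf(x, mean=mu_background, cov=cov)
    return (p1 / p0) > eta

# Toy 4-band pixel with equal covariance under both hypotheses
rng = np.random.default_rng(1)
pixel = rng.normal(loc=0.8, size=4)
print(likelihood_ratio_test(pixel,
                            mu_target=np.ones(4),
                            mu_background=np.zeros(4),
                            cov=np.eye(4),
                            eta=1.0))
```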
The different mathematical models chosen to describe the spectral variability of the measured pixels lead to different types of target detectors. Due to the structure of hyperspectral data sets and their potentially large volumes, the computational cost of the detection algorithms may become very expensive, limiting the use of these algorithms in real-time applications. The emerging technology of graphics processing units (GPUs) has experienced a rapid development in the last few years, driven by the demanding market of video games and multimedia. GPUs are characterized by a highly parallel structure and can achieve a computational power in the order of hundreds of gigaflops at a relatively low cost [6]. This makes the GPU an interesting alternative to multicore CPUs for addressing data-parallel, compute-intensive problems, where the same program can be executed on different data elements in parallel, and an appealing candidate as a hardware platform for hyperspectral data processing. In this work, we study the NVIDIA CUDA GPU architecture as a parallel computing architecture for the implementation of different state-of-the-art hyperspectral target detection algorithms using a Tesla C1060 card. In Section 2, we present a brief theoretical description of the implemented target detectors. In Section 3, we introduce a general description of the CUDA architecture of GPUs. Section 4 describes the different parallel implementations on the GPU and, finally, Sections 5 and 6 present the experimental results and the conclusions of the work, respectively.

2. TARGET DETECTION ALGORITHMS
The first two algorithms, the RX algorithm and the matched filter, are full-pixel detectors. These detectors assume that the pixels in the image contain information of only one class (target or background), i.e., they do not contain mixed spectra. The last algorithm, the adaptive matched subspace detector, is a subpixel detector, which assumes that the target may occupy only a portion of the pixel area while the remaining part is filled with background (i.e., a mixed pixel).

2.1 RX Algorithm
The RX algorithm is one of the most widely used anomaly detection algorithms in image processing. In this algorithm, the variability of the background is modeled using a multivariate normal distribution, but the statistical distribution of the target class is assumed unknown. The RX anomaly detector is given by [7]:

D(x) = (x − μ_0)^T Σ^{−1} (x − μ_0),

where x is the pixel vector, μ_0 is the mean of the background class, and Σ is the covariance matrix. This detector computes the Mahalanobis distance of every pixel vector of the image from the mean of the background clutter and compares the resulting value with a given threshold. If the distance is greater than the threshold, the pixel is considered a target.

2.2 Matched Filter Detector
The matched filter (MF) detector is derived by assuming that target and background follow multivariate normal distributions with different means but the same covariance matrix. Computing the natural logarithm of the likelihood ratio yields the following detector [5]:

D(x) = [(μ_1 − μ_0)^T Σ^{−1} (x − μ_0)] / [(μ_1 − μ_0)^T Σ^{−1} (μ_1 − μ_0)].

The idea of the matched filter is to project the pixel vector onto the direction that provides the best separability between the background and target classes.

2.3 Adaptive Matched Subspace Detector
The adaptive matched subspace detector (AMSD) models the spectral variability of the background using a linear subspace model (structured model). Therefore, every pixel vector x can be represented as a linear combination of M basis vectors. The two competing hypotheses are [8]:

H_0: x = Ba + w,
H_1: x = t + Ba + w,

where B is an L x M matrix whose columns represent the basis vectors of the background subspace, a represents the coefficients of the linear combination of the background basis vectors, t is the target spectral signature, and w is additive Gaussian noise. The AMSD is based on orthogonal projections of the test pixel onto the background subspace and the full linear space (target plus background). This detector is given by:

D(x) = [x^T (P_B − P_E) x] / [x^T P_E x],

where P_A = I − A(A^T A)^{−1} A^T is the orthogonal projection matrix onto the complement of the range of A, and E = [t B] is the matrix whose columns span the union of the target and background subspaces.

3. CUDA ARCHITECTURE
CUDA, which stands for Compute Unified Device Architecture, is a general-purpose parallel computing architecture introduced by NVIDIA in November 2006 [6]. CUDA provides easy access to the computing resources of NVIDIA GPUs using the C programming language with some extensions that allow the programmer to define functions, called kernels, that are executed in parallel by different GPU threads. In the CUDA programming model, the GPU is viewed as a co-processor within the host computer, which has its own memory space and is capable of executing a large number of threads in parallel. The Tesla C1060 belongs to the generation of NVIDIA GPUs with core architecture 1.3 (compute capability). This GPU card contains 30 streaming multiprocessors (SMs). Each multiprocessor consists of 8 scalar processors (SPs), 2 special function units (SFUs) for transcendental operations, a multithreaded instruction unit, and 4 types of on-chip memory: a set of 32-bit registers, 16 KB of shared memory, and a cached working set for constant and texture memory (Figure 1).

Figure 1. Tesla C1060 streaming multiprocessor.

When a kernel function is invoked, the threads are grouped in blocks and distributed to all the streaming multiprocessors.
The thread blocks are split in groups of 32 consecutive threads called warps. All the threads within the same warp are executed together in single-instruction multiple-data (SIMD) fashion. In GPU devices of compute capability 1.3, the maximum number of active threads that can reside on each multiprocessor is 1,024. Hence, the Tesla C1060 can run a maximum of 30,720 threads concurrently. Memory latency is a very limiting factor in GPU performance. Registers are the fastest memory space. Shared memory can be as fast as registers as long as the accesses are bank-conflict free. In contrast, the off-chip device memory can have a latency of hundreds of clock cycles. However, simultaneous memory accesses by all threads of a half-warp can be coalesced into one memory transaction if all the accesses fall on the same memory segment, in devices of compute capability 1.3. On the other hand, if the number of threads that can be active simultaneously on a multiprocessor is high enough, the thread scheduler can hide the memory latencies by selecting other threads for instruction execution.

4. GPU-BASED PARALLEL IMPLEMENTATION
All three target detection algorithms have a general structure that shows inherent parallelism. The output of the detectors is calculated independently for every pixel of the image. Therefore, if there are N pixels, the calculation of the detection output for the entire image can be decomposed into N parallel tasks without

communication between each other. This algorithm structure is known as an embarrassingly parallel problem [9]. In our GPU-based implementation, a CUDA kernel function was defined for every detector algorithm. The kernel execution was configured with as many threads as pixels in the image, so every thread executes the same code on a different pixel data vector. The image data was transferred to the GPU memory in band-interleaved-by-pixel format (contiguous words in memory correspond to contiguous pixels in the image for the same band). This storage scheme allows coalesced memory transactions for all the threads of the same half-warp. Furthermore, the CUDA function cudaMallocPitch() was used for the GPU memory allocation in order to guarantee the alignment conditions for coalescing.

4.1 GPU-based RX Algorithm
We have implemented on the GPU an adaptive version of the RX algorithm. Since the mean and covariance matrix of the background class are usually not known a priori, they can be estimated locally using all the pixels in the region surrounding the test pixel. This region is selected for each pixel using a 2D spatially moving window in combination with a guard window, as shown in Figure 2. The pixels contained in the guard window are excluded to avoid bias in the background parameter estimates.

Figure 2. Structure of the 2D spatially moving window (with test pixel and guard window) for background parameter estimation in the RX and MF detectors.

Since the product Σ^{−1}(x − μ_0) can be considered the solution of a linear system of equations, the inverse of the covariance matrix is not computed directly in our implementation. Instead, every thread solves the system of equations by performing the Cholesky decomposition of the covariance matrix [10]. In this implementation, every thread works on a different pixel, so the estimated mean vector and covariance matrix must be stored in the local memory space of the thread. This imposes a limitation on the maximum number of spectral bands that the algorithm can handle, because the maximum amount of local memory per thread that can be allocated in devices of compute capability 1.3 is 16 KB. Since the algorithm uses extra local memory space for storing temporary results, the maximum number of bands is limited.

4.2 GPU-based Matched Filter
The GPU implementation of the MF detector is very similar to the RX algorithm implementation. We used the same approach to estimate the background parameters, but in this case the MF detector needs an extra parameter: the target mean vector. This parameter is pre-computed from the image using a set of target sample pixels and is used as an input to the algorithm. Since the target mean vector is the same for all the pixels of the image, it is stored in the constant memory space of the GPU. To solve the products Σ^{−1}(x − μ_0) and Σ^{−1}(μ_1 − μ_0), we use the same algorithm for Cholesky decomposition as in the RX implementation. This decomposition is performed only once and the two solutions are obtained through back and forward substitutions.

4.3 GPU-based Adaptive Matched Subspace Detector
For the implementation of the AMSD, it was necessary to estimate the set of basis vectors B for the background subspace. We selected this basis as the eigenvectors of the image correlation matrix R = X^T X, as proposed in [8]. The correlation matrix was computed on the GPU using the CUBLAS [11] function cublasSgemm() for matrix-matrix multiplication.
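The CUDA kernels themselves are not reproduced here; as a CPU-side reference, the following NumPy/SciPy sketch mirrors the structure of Sections 4.1 and 4.2, computing the RX and MF statistics through a Cholesky solve rather than an explicit inverse of the covariance matrix. Array shapes, window handling, and names are illustrative assumptions, not the evaluated GPU code.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rx_statistic(pixel, background):
    """RX statistic for one test pixel.

    background : (n_samples, bands) spectra taken from the 2D moving
    window (guard-window pixels already excluded). The Mahalanobis
    distance is obtained through a Cholesky solve of the covariance.
    """
    mu0 = background.mean(axis=0)
    cov = np.cov(background, rowvar=False)
    d = pixel - mu0
    c, lower = cho_factor(cov)
    return float(d @ cho_solve((c, lower), d))

def mf_statistic(pixel, background, target_mean):
    """Matched-filter statistic reusing one Cholesky factorization
    for both Sigma^{-1}(x - mu0) and Sigma^{-1}(mu1 - mu0)."""
    mu0 = background.mean(axis=0)
    cov = np.cov(background, rowvar=False)
    c, lower = cho_factor(cov)
    s = target_mean - mu0
    num = s @ cho_solve((c, lower), pixel - mu0)
    den = s @ cho_solve((c, lower), s)
    return float(num / den)

rng = np.random.default_rng(0)
bg = rng.normal(size=(51 * 51, 30))        # window samples, 30 bands
px = rng.normal(size=30)
print(rx_statistic(px, bg), mf_statistic(px, bg, target_mean=np.full(30, 0.5)))
```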
Once the correlation matrix is calculated, the eigenvectors are obtained from the function culaDeviceSgeev() of CULAtools [12]. CULAtools is a commercial GPU implementation of the LAPACK numerical linear algebra library. Finally, the orthogonal projection matrices were computed using the functions cublasSgemm() and culaDeviceSgesv(), and stored in the constant memory space for use by the kernel function.

5. EXPERIMENTAL RESULTS
The detection algorithms were implemented on the GPU using version 3.0 of the CUDA Toolkit and 64-bit Ubuntu as the operating system. The performance of the algorithms was tested on a workstation equipped with a quad-core Intel Xeon CPU, 12 GB of RAM, and two NVIDIA Tesla C1060 GPU cards. For the assessment of the parallel performance and detection accuracy of the implemented algorithms, a set of phantom images of a scene simulating traces of explosives on clothing was generated using a SOC-700 visible hyperspectral imager from Surface Optics Corporation. The SOC-700 imager acquires a 512 by 512 pixel image in the visible-near-infrared region from 0.4 to 1.0 microns with a spectral resolution variable from 2.8 to 25 nm. For each detection algorithm, a CPU-based implementation was developed as a baseline to estimate the performance of the GPU implementations. The CPU implementation was built using OpenMP [13] and tested on a quad-core Intel Xeon multiprocessor with hyperthreading (8 threads). Table 1 shows the running times of the detection algorithms for each implementation (averaged over 10 benchmark executions). The GPU-based implementation performs faster than the CPU-based implementation for all three algorithms. In both implementations, the highest running time corresponds to the RX algorithm and the lowest running time to the AMSD. The speedup achieved for the AMSD over the CPU implementation was about 21x, whereas the speedups of the GPU implementations of the RX and MF detectors were only 3.3x and 3.1x, respectively. The decrease in GPU performance in these two implementations may be due to their high dependency on local data (GPU local memory has low bandwidth). Figure 3 shows the resulting detection maps for each detector algorithm. Table 2 shows the detection statistics (detection accuracy and percentage of false alarms). The detection accuracy of the adaptive RX algorithm is very limited by the size of the 2D moving window. For a window size of 51x51, only 5 small targets were detected (8% detection accuracy). On the other hand, the

MF algorithm was able to detect the big target but with a high false alarm rate (15.3%). The best detection performance was achieved by the AMSD (99.4% detection accuracy and 1.2% false alarms).

Table 1. Running time (ms) comparison between the GPU and CPU implementations of the RX, MF, and AMSD algorithms.

Figure 3. Resulting detection maps for RX, MF, and AMSD.

Table 2. Detection statistics (detection accuracy and false alarm percentage) for RX, MF, and AMSD.

6. CONCLUSIONS
We implemented on a Tesla C1060 GPU three different types of target detection algorithms for hyperspectral imaging: the RX algorithm, the matched filter detector, and the adaptive matched subspace detector. The GPU implementation of the AMSD shows the best performance in terms of the speedup achieved (21x) over a CPU parallel implementation using OpenMP. The speedups achieved for the RX and MF detectors were only around 3x due to memory limitations of the 1.3 architecture and the adaptive structure of the algorithms. However, these implementations still run faster than the corresponding CPU implementations, and the resulting speedups and running times suggest that applications for real-time detection of targets in HSI can greatly benefit from the performance of GPUs as processing hardware.

7. ACKNOWLEDGMENTS
This material is based upon work supported by the U.S. Department of Homeland Security under Award Number 2008-ST-061-ED0001 and used facilities of the Bernard M. Gordon Center for Subsurface Sensing and Imaging Systems sponsored by the Engineering Research Centers Program of the National Science Foundation under Award EEC. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Department of Homeland Security or the National Science Foundation.

8. REFERENCES
[1] Schau, H. C. and Jennette, B. D. (2006). Hyperspectral requirements for detection of trace explosives agents. In Proceedings of SPIE: Algorithms and Technologies for Multispectral, Hyperspectral, and Ultraspectral Imagery XII, 6233.
[2] Dombrowski, M. S., Willson, P. D., and LaBaw, C. C. (1997). Defeating camouflage and finding explosives through spectral matched filtering of hyperspectral imagery. In Proceedings of SPIE: Terrorism and Counter-Terrorism Methods and Technologies, 2933.
[3] Stein, D. W. J., Beaven, S. G., Hoff, L. E., Winter, E. M., Schaum, A. P., and Stocker, A. D. (2002). Anomaly detection from hyperspectral imagery. IEEE Signal Processing Magazine, 19(1).
[4] Messinger, D. W. (2004). Gaseous plume detection in hyperspectral images: a comparison of methods. In Proceedings of SPIE: Algorithms and Technologies for Multispectral, Hyperspectral, and Ultraspectral Imagery X, 5425.
[5] Manolakis, D., Marden, D., and Shaw, G. A. (2003). Hyperspectral image processing for automatic target detection applications. Lincoln Laboratory Journal, 14(1).
[6] NVIDIA Corp. (2010, February 20). NVIDIA CUDA Programming Guide Version 3.0. Retrieved May 3, 2010, from lkit/docs/nvidia_cuda_programmingguide.pdf.
[7] Reed, I. S. and Yu, X. (1990). Adaptive multiple-band CFAR detection of an optical pattern with unknown spectral distribution. IEEE Trans. Acoustics, Speech and Signal Processing, 38.
[8] Manolakis, D., Siracusa, C., and Shaw, G. (2001). Hyperspectral subpixel target detection using the linear mixing model. IEEE Transactions on Geoscience and Remote Sensing, 39(7).
[9] Wilkinson, B. and Allen, M. (1998).
Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers (1st ed.). New Jersey: Prentice Hall.
[10] Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1998). Numerical Recipes in C: The Art of Scientific Computing (2nd ed.). New York: Cambridge University Press.
[11] NVIDIA Corp. (2010, February 20). CUDA CUBLAS Library Version 3.0. Retrieved May 3, 2010, from lkit/docs/cublas_library_3.0.pdf.
[12] Humphrey, J. R., Price, D. K., Spagnoli, K. E., Paolini, A. L., and Kelmelis, E. J. (2010). CULA: Hybrid GPU accelerated linear algebra routines. SPIE Defense and Security Symposium (DSS).
[13] Chandra, R., Menon, R., Dagum, L., Kohr, D., Maydan, D., and McDonald, J. (2001). Parallel Programming in OpenMP. San Francisco: Morgan Kaufmann Publishers Inc.

An Integrated Hardware/Software Environment for the Implementation of Signal Processing Algorithms on FPGA Architectures

María González-Gil (Undergraduate Student), University of Puerto Rico, Mayaguez Campus
David Márquez (Graduate Student), University of Puerto Rico, Mayaguez Campus
Domingo Rodríguez (Advisor), University of Puerto Rico, Mayaguez Campus

ABSTRACT
This work presents an environment for the analysis, design, implementation, and modification of a certain class of signal processing algorithms using an integrated hardware/software approach. This approach consists of five fundamental stages: 1) signal processing algorithm development using the numeric computation software package Matlab; 2) Simulink formulation of signal processing algorithms; 3) System Generator algorithm implementation; 4) Field Programmable Gate Array (FPGA) algorithm simulation and emulation; and 5) signal processing algorithm validation through Matlab. The radar cross-ambiguity function was successfully implemented on a Virtex-5 field programmable gate array unit using this development and implementation environment, based on the work presented by Márquez et al. in [1].

KEYWORDS
FPGA, Ambiguity Function, System Generator, Matlab, Signal-based Information Processing.

1. INTRODUCTION
This paper presents an environment for the analysis, design, implementation, and modification of a certain class of signal processing algorithms using an integrated hardware/software approach. The approach consists of five fundamental stages: 1) signal processing algorithm development using the numeric computation software package Matlab; 2) Simulink formulation of signal processing algorithms; 3) System Generator algorithm implementation; 4) Field Programmable Gate Array (FPGA) algorithm simulation and emulation; and 5) signal processing algorithm validation through Matlab (see Stage 1). In the next section we provide a concrete example of a successful signal processing algorithm design and implementation effort using this FPGA hardware/software approach. The problem discussed there belongs to the class of signal processing algorithms known as time-frequency distributions. In particular, the program deals with the computation of the cross-ambiguity function of a transmitted signal and its return echo ([2], [3]).

1.1 Ambiguity Function Formulation
In this subsection we present a mathematical formulation of the cross-ambiguity function. In this work the FPGA is utilized to implement the ambiguity function for large-scale signals, considering that parallelizing the algorithm through Signal Algebra Operator techniques could be essential for large-scale FPGA implementations. To perform this work, a Xilinx Virtex-5 XUPV5-LX110T FPGA, with a 100 MHz clock on an ML505 evaluation board, was utilized as the computational architecture.

1.2 FPGA Definition
A Field Programmable Gate Array, or FPGA, is a semiconductor device that contains configurable logic blocks (CLBs) which can be arranged to perform sequential or combinational functions. Around the CLBs are input/output blocks (IOBs), which connect the CLB inputs and outputs to pins in the chip package. The CLBs can be connected to each other using programmable routing channels. The user defines the logic function of each CLB, the behavior of each IOB, and their interconnections.
Its configuration can be programmed using either a Hardware Description Language (HDL), such as VHDL (an acronym for Very High Speed Integrated Circuit Hardware Description Language) or Verilog, or a schematic. The FPGA is inherently parallel; therefore, several software tools have been created to assist the programmer with parallel programming, such as pMatlab [4], created by the MIT Lincoln Laboratory. It is a library for Matlab that enables parallel computations on non-scientific computers and allows users to parallelize their existing Matlab programs by changing or adding only a few lines of code. In order to work with large-scale numeric sequences, the sequence length should be greater than 2^12 after the zero-padding procedure. To achieve this, the pipeline method was utilized, integrating the use of shared memory to write the data during the process. This process set the maximum numeric sequence length obtained in this work. The computation of the ambiguity function is an extension of the work presented by Rodriguez et al. in ([5], [6], [7], [8]) and by Rodriguez in ([9], [10], [11]). The cross-ambiguity function of the signals f and g has the following formulation:

A_{f,g}[m, k] = Σ_{n=0}^{N−1} f[n] g*[⟨n − m⟩_N] e^{−j2πkn/N}.

The FPGA implementation architecture of the ambiguity function has the following structure. A universal counter synchronizes the computation process. The input signals f and g are complex; therefore, there are four large-scale shared memories of size N to store the real and imaginary parts of each signal. Since the signals f and g cannot change during the process, write-control blocks were used to disable writing on these memories and prevent the loss of data. A shifter block is included

when a circular shift of the signal g is required. This block has a counter in cascade that determines the memory address of the signal shifted by m. It is also required to compute the complex conjugate of the signal g; therefore, a block is incorporated at the output of the shared memory that holds the imaginary part of the signal. Then, the Hadamard product is calculated in a block based on a complex multiplier, and the result is loaded into the Fast Fourier Transform block to obtain an N-point Fourier transform. Finally, a memory of size N is employed to store each column of the computed absolute value of the ambiguity function surface. The data is updated each time a new column is computed.

2. HARDWARE/SOFTWARE APPROACH
2.1 MATLAB Algorithm Development
In this section we describe our Matlab algorithm development stage. Matlab is a powerful simulation tool for understanding the ambiguity function and refining the development of the algorithm. Its toolboxes and ease of plotting allow diverse tests, changes, and quick adjustments to reach a good version of the algorithm. Figure 1 shows the simulation results for a chirp pulse of 1024 samples, with zero padding included. The chirp signal has a sampling frequency of 500 Hz; the instantaneous frequency at time 0 seconds is 0 Hz, and an instantaneous frequency of 200 Hz is reached at time 1 second, so the chirp rate is 0.4 Hz per sample. Assuming that for this example the received signal had a delay of 520 samples, we proceeded to calculate the ambiguity function and plot it. At this stage, tools like pMatlab can be used to analyze the parallelization of the algorithm and thus move closer to the FPGA implementation. We can observe in more detail the data flow, the use of vectors and matrices, and concepts such as serial-parallel conversion, buffers, and partitioning.

Figure 1. Ambiguity function for a chirp pulse.

2.2 Simulink Algorithm Formulation
In this section we describe our Simulink algorithm formulation stage. After having understood and designed the algorithm for the ambiguity function in Matlab, the next step is the implementation of the algorithm in the Simulink environment. Simulink has a graphical block-diagramming interface that allows models to be developed from many predefined blocks.

Figure 2. Implementation of the ambiguity function in Simulink.

Figure 2 shows the implementation of our algorithm in Simulink. The Ambiguity Function subsystem contains the blocks necessary to perform the calculation, send control signals to the scope, and pass the resulting data to the To Workspace subsystem, making it available in Matlab for visualization and comparison with the algorithm described in Section 2.1.

2.3 System Generator Stage
In this section we describe our System Generator stage. We start by describing the development system used. Xilinx System Generator for DSP [12] is a powerful tool for the design, simulation, and hardware co-simulation of algorithms for FPGAs. It is not a replacement for VHDL programming, but it helps to achieve lower design times. This tool adds new, specialized Simulink blocks that can be converted to VHDL code and downloaded to the FPGA, or used for direct interaction with the hardware through co-simulation (see Figure 3). In/Out blocks allow communication with the exterior of the FPGA, while BlockRAM and FIFO blocks manage the data exchanged between the hardware and Simulink.
Many predesigned blocks are included, and their combination reduces the implementation effort and accelerates the testing process. Figure 4 shows the appearance of an implementation using the System Generator development system.
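Before moving to the simulation and validation stages, the following NumPy sketch mirrors the dataflow described in Sections 1.1 and 2.1: for each circular shift of g, the conjugated shifted copy is multiplied point-wise (Hadamard product) with f and passed through an N-point FFT, one column of the surface at a time. The chirp parameters follow Section 2.1; the code is an illustrative software reference model, not the Matlab or System Generator implementation used in this work.

```python
import numpy as np

def cyclic_cross_ambiguity(f, g):
    """Column-by-column cyclic cross-ambiguity surface |A[k, m]|.

    For each circular shift m, the shifted, conjugated copy of g is
    multiplied point-wise with f and passed through an N-point FFT,
    matching the shifter / conjugate / complex-multiplier / FFT chain
    described for the FPGA architecture.
    """
    N = len(f)
    surface = np.empty((N, N))
    for m in range(N):
        g_shift = np.roll(np.conj(g), m)          # circular shift + conjugate
        surface[:, m] = np.abs(np.fft.fft(f * g_shift))
    return surface

# Chirp test signal similar to Section 2.1: fs = 500 Hz, 0 -> ~200 Hz
fs = 500.0
t = np.arange(512) / fs                            # about 1 s of chirp
chirp = np.exp(1j * np.pi * 200.0 * t**2)          # linear frequency sweep
f = np.concatenate([chirp, np.zeros(512)])         # zero-padded to N = 1024
g = np.roll(f, 520)                                # echo delayed by 520 samples
amb = cyclic_cross_ambiguity(f, g)
print(amb.shape, np.unravel_index(np.argmax(amb), amb.shape))
```

The peak of the surface appears at the shift corresponding to the simulated 520-sample delay, which is the behavior the hardware stages are validated against.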

Figure 3. Some blocks used in the ambiguity function implementation in System Generator for DSP.

Figure 4. Appearance of an implementation in System Generator.

2.4 FPGA Simulation/Emulation Stage
In this section we describe our FPGA simulation and emulation stage. When our design is ready to be tested, we can perform either simulation or emulation. In the simulation case (see Figure 4), Simulink interprets all the Xilinx System Generator components and calculates the latency and computation time of each simulated component, in a similar way as for native Simulink blocks. We can see the results within the same tool, or export the data to Matlab, or to a .mat file, for later analysis. In the case of emulation, or hardware co-simulation, System Generator generates a new component (see Figure 5) from a specific configuration of the FPGA. Parameters such as card type, communication type, speed, physical clock location, and any other information about the hardware are needed. The FPGA must be powered on and connected, because this new component is responsible for running the binary program on the FPGA. The advantage of System Generator is its ability to communicate and share data with Simulink, and from there with Matlab, or to save the results in text files, making it easier to validate the results.

Figure 5. Hardware co-simulation.

2.5 MATLAB Algorithm Validation
In this section we describe our Matlab algorithm validation stage. The validation of the algorithm can be accomplished in a timely manner. Through a program written in Matlab it is possible to compare the results against what was done in Sections 2.1 and 2.2. Figure 6 shows the validation of the results by reconstructing the example for large-scale signals shown in [1].

Figure 6. Ambiguity function samples.

3. CONCLUSION
This work presented an environment for the analysis, design, implementation, and modification of a certain class of signal processing algorithms using an integrated hardware/software approach. The approach consists of five fundamental stages: 1) signal processing algorithm development using the numeric computation software package Matlab; 2) Simulink formulation of signal processing algorithms; 3) System Generator algorithm implementation; 4) Field Programmable Gate Array (FPGA) algorithm simulation and emulation; and 5) signal processing algorithm validation through Matlab (see Stage 1).

4. FUTURE WORK
It is expected that the signal processing algorithm development and implementation approach presented in this work can be used for the FPGA implementation of other time-frequency distributions. In particular, we are interested in implementing the short-time Fourier transform for near real-time analysis of bioacoustic signals for environmental surveillance monitoring applications.

5. ACKNOWLEDGMENTS
The authors acknowledge the National Science Foundation (Grants CNS, CNS, and CNS) for its support. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

6. REFERENCES
[1] David Márquez, Juan Valera, Angel Camelo, Cesar Aceros, Manuel Jiménez, and Domingo Rodriguez. Implementations of cyclic cross-ambiguity functions in FPGAs for large scale signals. IEEE Latin American Symposium on Circuits and Systems (LASCAS 2011).
[2] J. E. Gray. An interpretation of Woodward's ambiguity function and its generalization. 2010 IEEE Radar Conference, May.
[3] L. Auslander and R. Tolimieri. Radar ambiguity functions and group theory. SIAM J. Math. Anal., vol. 16, May.
[4] N. T. Bliss, J. Kepner, H. Kim, and A. Reuther. pMatlab: Parallel Matlab library for signal processing applications. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, pp. IV-1189 to IV-1192, April.
[5] Jeremy Johnson, Robert Johnson, Domingo Rodriguez, and Richard Tolimieri. A methodology for designing, modifying, and implementing Fourier transform algorithms on various architectures. Circuits, Systems, and Signal Processing, vol. 9, no. 4.
[6] D. Rodriguez, J. Seguel, and E. Cruz. Algebraic methods for the analysis and design of time-frequency signal processing algorithms. 1993 IEEE International Symposium on Circuits and Systems (ISCAS '93), May 1993.
[7] Domingo Rodriguez, Marlene Vargas Solleiro, and Yvonne Aviles. DFT beamforming algorithms for space-time-frequency applications. Digital Wireless Communication II, vol. 4045, no. 1.
[8] A. B. Ramirez and D. Rodriguez. Automated hardware-in-the-loop modeling and simulation in active sensor imaging using t16713 DSP units. IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), 2006, vol. 2.
[9] Domingo Rodriguez. SAR point spread signals and earth surface property characteristics. SPIE Conference on Subsurface Sensors and Applications, Denver, Colorado, vol. 3752, July.
[10] D. Rodriguez. A computational Kronecker-core array algebra SAR raw data generation modeling system. Conference Record of the Thirty-Fifth Asilomar Conference on Signals, Systems and Computers.
[11] Domingo Rodriguez. Tensor product algebra as a tool for VLSI implementation of the discrete Fourier transform. Proc. ICASSP, Toronto, ON, Canada.
[12] Xilinx. System Generator for DSP User Guide.

A Computational Framework for Entropy Measures

Marisel Villafañe-Delgado, University of Puerto Rico, Mayagüez Campus
Domingo Rodríguez, University of Puerto Rico, Mayagüez Campus

ABSTRACT
The entropy of one-dimensional signals is measured by considering their time-frequency representations as two-dimensional array images. A time-frequency distribution provides information that is not appreciated in the one-dimensional representation of the signal. This work concentrates on the study of the effects of window size and type on the entropy measures of representations given by the Cyclic Short-Time Fourier Transform. Four different entropies are studied: Shannon, Rènyi, Tsallis-Havrda-Charvát, and Wavelet.

KEYWORDS
Cyclic short-time Fourier transform, entropy measures, time-frequency representations.

1. INTRODUCTION
The entropy of a process or a system is defined as the amount of information in the process, or a measure of the uncertainty of information in a statistical description of a system [1]. Entropy gives information related to the changes in a signal; for a signal of constant values, the entropy measure is equal to zero. This work centers on formulating computational methods for calculating entropy measures of image data sets obtained from the time-frequency distribution of one-dimensional signals. Entropy measures based on time-frequency representations have been presented before, computing the Rènyi entropy [2]-[3]. The entropy measures used in this work are the Shannon, Rènyi, Tsallis-Havrda-Charvát, and Wavelet entropies. The first three are based directly on the probabilistic characteristics of the data; the latter is based on the wavelet decomposition of the data. Time-frequency analysis is well suited for studying a large variety of signals. Its applications include the analysis of audio signals, radar and sonar systems, wave physics, mechanics and vibrations, spirals in turbulence, and biology and medicine, among others [4]. Thus, this computational framework seeks to provide an alternative and reliable tool for the analysis of the information processes taking place in these applications. This article is organized as follows: Section 2 introduces the proposed computational framework. Section 3 presents the time-frequency distributions: the short-time Fourier transform (STFT) and the cyclic short-time Fourier transform (CSTFT). Section 4 presents the entropy measures. Section 5 presents the results. Finally, conclusions and future work are discussed.

2. PROPOSED FRAMEWORK
The proposed computational framework is a developing structural entity consisting of six (6) fundamental components: 1) an initial set of input signals; 2) an initial set of computational operators to act on the set of input signals; 3) a set of rules for the computational operators to act on the set of input signals; 4) a set of composition rules for the computational operators, such that the composition of a finite number of operators results in an operator not necessarily in the initial set; 5) a set of output signals, which, in turn, may be used as input signals; and 6) a user interface. An instantiation of this framework may be as follows. First, a signal is divided into several audio segments of equal length by sliding a fixed-length window along the signal. Both the audio signal and the associated window are considered input signals under the computational framework.
Then, for each segment, the CSTFT is computed, producing the time-frequency representation for that segment of the signal. The resultant frame is an image. Figure 1 provides a depiction of this procedure, where a signal is segmented using a sliding window and an ordered set of overlapping frames is generated. Each time-frequency frame corresponds to the CSTFT of a translated segment. Once the time-frequency frames are generated, the next step is the computation of the two-dimensional histogram for the Shannon, Rènyi, and Tsallis-Havrda-Charvát entropies. For the computation of the wavelet entropy, it is required to perform the wavelet decomposition of the time-frequency frames. Finally, the corresponding entropy measure is computed. This procedure produces entropy-measure waveforms whose horizontal axis is the time-translation index of the set of ordered frames and whose vertical axis is the value of the entropy measure.
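A compact sketch of this instantiation is given below, assuming NumPy, using plain STFT magnitude frames in place of the CSTFT, a one-dimensional value histogram in place of the two-dimensional histogram, and the Shannon entropy only; the segment, hop, and window lengths are illustrative.

```python
import numpy as np

def frame_entropy(frame, bins=64):
    """Shannon entropy of a time-frequency frame treated as an image,
    using the normalized histogram of its values as p(i)."""
    hist, _ = np.histogram(frame, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def entropy_waveform(signal, seg_len=4096, hop=1024, win_len=256):
    """Entropy-measure waveform: one entropy value per sliding segment.

    Each segment is turned into a small magnitude spectrogram (plain
    STFT frames standing in for the CSTFT) and its entropy is computed
    from the histogram of that frame.
    """
    window = np.hamming(win_len)
    values = []
    for start in range(0, len(signal) - seg_len + 1, hop):
        seg = signal[start:start + seg_len]
        cols = [np.abs(np.fft.rfft(seg[i:i + win_len] * window))
                for i in range(0, seg_len - win_len + 1, win_len // 2)]
        values.append(frame_entropy(np.array(cols)))
    return np.array(values)

rng = np.random.default_rng(0)
audio = rng.normal(size=32768)          # stand-in for a short audio recording
print(entropy_waveform(audio)[:5])
```

The horizontal index of the returned array plays the role of the time-translation index of the ordered frames described above.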

Figure 1: Time-frequency frame generation of audio signals.

3. TIME-FREQUENCY DISTRIBUTIONS
Time-frequency analysis is a powerful tool for the analysis of signals whose spectra vary with time. It allows the multiple frequencies contained in a signal at a certain time instant to be visualized. Important and well-known time-frequency signal analysis techniques are the STFT, the modified ambiguity function, and the modified Wigner distribution. Here we focus on the CSTFT, which is a modification of the STFT.

3.1 Short-Time Fourier Transform
The spectrogram has been the most widely used tool for the analysis of time-varying spectra [5]. The discrete STFT can be written as

X_{x,w}[m, k] = Σ_n x[n] w[n − m] e^{−j2πkn/N},   (1)

where w is a fixed-length sliding window, which usually slides through the signal with a defined overlap. The Fourier transform is computed for each selected frame. As a result, a complex value containing the magnitude and phase information of the segment is added to a matrix, which finally forms the time-frequency representation of the signal.

3.2 Cyclic Short-Time Fourier Transform
Let Z_N = {0, 1, ..., N − 1} denote the indexing set of non-negative integers modulo N, and denote by l²(Z_N) the set of all complex signals of order N; this set is isomorphic to C^N. Now, let x ∈ l²(Z_N) be an arbitrary signal to be processed and let w be an associated window function of length L ≤ N. Introduce w̃ as a zero-padded version of the window signal, circularly shifted by L/2. For convenience, N is always taken to be a power of 2. The CSTFT is given by

X_{x,w}[m, k] = Σ_{n ∈ Z_N} x[n] w̃[⟨n − mD⟩_N] e^{−j2πkn/N},   (2)

where D is known as the displacement period, τ = mD is the time lag, and k is the spectral shift. The CSTFT mitigates edge effects and improves the spectral resolution when compared with traditional spectrogram computations. This transform can be appreciated in Figure 2, where the time-frequency representation of the audio signal of an Eleutherodactylus coquí is illustrated.

Figure 2: (upper) Time-frequency representation of the Eleutherodactylus coquí call, computed with the CSTFT; (bottom) the audio signal.
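The defining difference from the ordinary STFT is the modular (circular) shift of the window, which the sketch below makes explicit; the displacement period D and the zero-padded window follow the reconstructed definition above and are otherwise illustrative assumptions.

```python
import numpy as np

def cstft(x, w, D):
    """Cyclic short-time Fourier transform sketch.

    x : length-N signal, w : length-N zero-padded window,
    D : displacement period. Column m uses the window circularly
    shifted by m*D, so every frame wraps around modulo N instead of
    running off the ends of the signal.
    """
    N = len(x)
    M = N // D
    out = np.empty((N, M), dtype=complex)
    for m in range(M):
        out[:, m] = np.fft.fft(x * np.roll(w, m * D))
    return out

N = 1024
x = np.exp(2j * np.pi * 0.1 * np.arange(N))     # toy complex tone
w = np.zeros(N)
w[:128] = np.hamming(128)                        # zero-padded window
X = cstft(x, w, D=64)
print(X.shape)                                   # (1024, 16)
```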

4. ENTROPY MEASURES

4.1 Shannon
Shannon's entropy has been one of the most popular and most frequently computed entropies. He suggested that there is a relationship between the information of a measure and its frequency of occurrence [6]. The two-dimensional Shannon entropy is defined as

H = − Σ_i Σ_j p(i, j) log p(i, j),   (3)

where H is the entropy of the set of probabilities p(i, j).

4.2 Rènyi
The entropy of order α of the distribution p(i, j) is known as the Rènyi entropy; the Shannon entropy is its limiting case when α tends to 1. The two-dimensional Rènyi entropy for an image is given by [7]

H_α = [1 / (1 − α)] log Σ_i Σ_j p(i, j)^α,   (4)

where α is real and positive.

4.3 Tsallis-Havrda-Charvát
The Tsallis-Havrda-Charvát entropy is given by [8]

H_q = [1 / (q − 1)] (1 − Σ_i Σ_j p(i, j)^q),   (5)

where q is real and positive. When q = 2, the Tsallis-Havrda-Charvát entropy becomes the Gini-Simpson index of diversity,

H_2 = 1 − Σ_i Σ_j p(i, j)^2.   (6)

4.4 Two-Dimensional Histogram
For the Shannon, Rènyi, and Tsallis-Havrda-Charvát entropies, the probabilities p(i, j) are obtained from the two-dimensional histogram of the image [8]. Let g(m, n) be the gray value of a pixel located at the point (m, n), and let ḡ(m, n) be the average gray value of the neighborhood of that pixel. The average gray value for the 3 x 3 neighborhood of each pixel is given by

ḡ(m, n) = (1/9) Σ_{k=−1}^{1} Σ_{l=−1}^{1} g(m + k, n + l).   (7)

Thus, the normalized two-dimensional histogram is approximated by

p(i, j) ≈ h(i, j) / Σ_i Σ_j h(i, j),   (8)

where h(i, j) counts the pixels whose gray value is i and whose neighborhood average gray value is j.

4.5 Wavelet
The wavelet transform is well suited for entropy analysis, with applications ranging from fault localization [9] to electroencephalogram analysis [10], among others. An important advantage of the wavelet transform is how the window size is distributed in the transform: it uses short windows for high frequencies and long windows for low frequencies. These variations in window size according to localization make the wavelet transform unique and advantageous over other traditional transforms, such as the STFT. Another advantage of the wavelet transform is that it does not make assumptions about the signal's stationarity [11]. The two-dimensional stationary wavelet transform decomposes the approximation coefficients at a level into the approximation and detail coefficients in the horizontal, vertical, and diagonal directions. Given wavelet coefficients w_j(k), k = 1, ..., N_j, the energy at resolution level j is defined as

E_j = Σ_k |w_j(k)|^2.   (9)

The total energy of the wavelet coefficients is given by

E_tot = Σ_j E_j.   (10)

The relative wavelet energy is defined as

p_j = E_j / E_tot.   (11)

Finally, the wavelet entropy at resolution level j is defined as

S_j = − p_j log p_j.   (12)

5. RESULTS
To assess the computational implementation of the previously presented framework, an audio signal of 0.74 seconds sampled at 44.1 kHz (32768 points) was utilized. The CSTFT was computed, resulting in frames of 8192x512 and 4096x256.

For both frame sizes, the following windows were computed: Blackman-Harris, Chebyshev, Flattop, Gaussian, Hamming, Kaiser, and Taylor. Figure 3 illustrates the Shannon entropy for time-frequency frames of 8192x256 for each window. From this figure it can be appreciated that the lowest entropy values correspond to the time-frequency frame computed with the Kaiser window; conversely, the highest entropy values are obtained with the Flattop window. It is also of interest to compare how the entropy is affected by the time-frequency frame size. Figure 4 shows the results for the Rènyi entropy for frames of 8192x512 and 4096x256, both computed with the Gaussian window.

Figure 3: Shannon entropy for a time-frequency frame of 8192x512. Windows: Blackman-Harris, Chebyshev, Flattop, Gaussian, Hamming, Kaiser, and Taylor.

Figure 4: Rènyi entropy for time-frequency frames of 8192x512 and 4096x256, with the Gaussian window.

6. CONCLUSIONS
A computational framework for entropy measures of one-dimensional signals was presented. This framework consists of six (6) fundamental components integrated as a developing structural entity. A novel concept of this framework is the use of concepts, tools, methods, and rules from operator signal algebra theory to exploit the algebraic and geometric properties of the signals and operators utilized throughout the framework.

7. ACKNOWLEDGMENTS
The authors acknowledge the National Science Foundation (Grants CNS, CNS, and CNS) for its support. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

8. REFERENCES
[1] R. M. Gray. Entropy and Information Theory. Springer, New York.
[2] R. G. Baraniuk, P. Flandrin, A. J. E. M. Janssen, and O. J. J. Michel. Measuring time-frequency information content using the Rènyi entropies. IEEE Transactions on Information Theory, vol. 47, May.
[3] S. Aviyente and W. J. Williams. Minimum entropy time-frequency distributions. IEEE Signal Processing Letters, vol. 12, pp. 37-40, Jan.
[4] P. Flandrin. Time-frequency and chirps. Wavelet Applications VIII, SPIE vol. 4391, Proc. of AeroSense '01, Orlando (FL).
[5] L. Cohen. Time-frequency representations. Proceedings of the IEEE, vol. 77, no. 7, July 1989.
[6] C. E. Shannon. A Mathematical Theory of Communication. The Bell System Technical Journal, vol. 27, July and October 1948.
[7] A. Rènyi. On Measures of Entropy and Information. Fourth Berkeley Symposium on Mathematical Statistics and Probability, University of California Press.
[8] P. K. Sahoo and G. Arora. Image thresholding using two-dimensional Tsallis-Havrda-Charvát entropy. Pattern Recognition Letters, vol. 27, 2006.
[9] S. El Safty and A. El-Zonkoly. Applying wavelet entropy principle in fault classification. World Academy of Science, Engineering and Technology, vol. 31, no. 10, 2009.
[10] H. A. Al-Nashash, J. S. Paul, and N. V. Thakor. Wavelet entropy method for EEG analysis: Application to global brain injury. Proceedings of the 1st International IEEE EMBS Conference on Neural Engineering, Capri Island, Italy, March 20-22.
[11] L. Zunino, D. G. Pérez, M. Garavaglia, and O. A. Rosso. Wavelet entropy of stochastic processes. Physica A: Statistical Mechanics and its Applications, vol. 379, no. 2, 15 June.

Continuous-time Markov Model Checking for Gene Networks Intervention

Marie Lluberes, University of Puerto Rico, CISE Ph.D. Program, Mayaguez, PR, x-5273
Dr. Jaime Seguel, Advisor, University of Puerto Rico, Electrical and Computer Engineering Department, Mayaguez, PR, x-3523

ABSTRACT
This research seeks to develop an algorithm for the automated study of the dynamics of gene networks modeled as probabilistic Boolean networks (PBNs). In particular, we are interested in addressing the problem of intervention. Intervention refers to a selective change in the parameters of a specific node or set of network nodes in such a way that the network's dynamic behavior moves away from, or toward, some selected state(s). To reach these goals, we want to explore the use of model-checking techniques, which are a class of algorithms used to automatically test whether a model of a system meets a given specification. Therefore, checking a PBN model would answer whether a specified state of the network is achievable or not under specific initial conditions. Given that PBNs are normally analyzed with Markov chain theory, the integration of model-checking algorithms for continuous-time Markov chains and PBNs will provide a powerful means for the in silico simulation of gene network dynamics.

KEYWORDS
Gene regulatory networks, probabilistic Boolean networks, Markov chains, intervention, model-checking algorithms.

1. INTRODUCTION
Gene network dynamics, as a consequence of both internal and external interactions, constitute an area of study of great interest in bioinformatics, mostly because such dynamics represent the state of a biological system. As such, knowing the underlying mechanisms that govern the network could provide the power to manipulate it in a desired way. Because of this, developing an automated system capable of effectively simulating the behavior of a gene network would also provide the knowledge to alter it, or intervene in it, guiding certain interactions in order to obtain a desired result or, on the contrary, preventing or stopping certain behaviors from occurring. There is no independence among genes, but mutual regulation. The interrelationships among genes constitute gene regulatory networks, which are a central focus of genomic research. To understand the underlying mechanisms of gene behavior, one possible approach is to model the genetic regulatory system and infer the model structure and parameters from real gene expression data. Then, the inferred model can be used to make useful predictions through mathematical analysis and computer simulations. This has tremendous potential clinical impact on the physiology of an organism and disease progression, and on accurate diagnosis, target identification, drug development, and treatment [2]. It is a fact that biological phenomena manifest in the continuous domain. It is also true that, in describing such behavior, we usually employ binary language: it is expressed or not expressed; it is on or off; it is regulated or down-regulated, for instance. Studies conducted quantizing genes to only two levels (0 or 1) suggest that the information retained by genes when binarized is meaningful to the extent that it is contained in the continuous domain [2]. This allows gene networks to be put in the context of Boolean networks. As a dynamical system, gene networks exhibit structural stability. This means that the system is robust in the presence of perturbations, maintaining its integrity and showing spontaneous emergence of ordered collective behavior.
This behavior is shared with Boolean networks and PBNs through the existence of attractors and absorbing states, which act as a kind of memory for the system. However, the regularity of genetic function and interaction is not due to hard-wired logical rules, but rather to the intrinsic self-organizing stability of the dynamical system. Also, we may want to model an open system with inputs (stimuli) that affect the dynamics of the network. The assumption of only one logical rule per gene, as the intrinsic determinism of Boolean networks requires, may lead to incorrect conclusions when inferring these rules from gene expression measurements, as the latter are typically noisy and the number of samples is small relative to the number of parameters to be inferred. PBNs, like Boolean networks, are rule-based, but they are robust in the face of uncertainty. Their dynamic behavior can be studied in the context of Markov chains, of which Boolean networks are just special cases. They explicitly represent probabilistic relationships between genes, allowing quantification of the influence of genes on other genes. Because of the above, PBNs are better suited for this study. Nevertheless, given their exponentially growing size (2^n states for n genes), answering questions such as the best way to reach or avoid particular state(s) may be cumbersome if performed by exhaustion. Model-checking algorithms have the ability to automatically check whether a certain condition is met under given specifications. Thus, they could answer questions such as those previously stated in an efficient manner. This would greatly facilitate the intervention, or deliberate perturbation, of the network in order to achieve a desired behavior from it.

2. PROBABILISTIC BOOLEAN NETWORK
2.1 Boolean Network Model
A Boolean network is a set of Boolean variables whose state is determined by other variables in the network. Formally, a Boolean network G(V, F) is defined by a set of nodes (genes) V = {x_1, ..., x_n} and a list of Boolean functions F = (f_1, ..., f_n). Each x_i ∈ {0, 1}, i = 1, ..., n, is a binary variable, and its value at time t + 1 is determined by the values of some other genes at time t by means of a Boolean function f_i ∈ F. That is, there are k_i genes

assigned to gene x_i, and the mapping j_k : {1, ..., n} → {1, ..., n}, k = 1, ..., k_i, determines the wiring of gene x_i. Thus, we can write

x_i(t + 1) = f_i(x_{j_1(i)}(t), x_{j_2(i)}(t), ..., x_{j_{k_i}(i)}(t)).

A network with n genes can be found in 2^n states. Each of these states forms a pattern containing the state of each gene individually, or GAP (gene activity profile). Some of these states are attractors, i.e., states to which the network tends to flow, capturing its long-term behavior. They represent the memory of the system. Attractors may be cycles of more than one state (see Figure 1) [2].

Figure 1. Example of a Boolean network: (a) a Boolean network with three nodes; (b) state transition diagram.

We need to identify the networks from real experimental data in order to understand genetic regulation. For this, we need to discover associations between variables using a coefficient of determination (COD), which measures the degree to which the expression levels of an observed gene set can be used to improve the prediction of the expression of a target gene relative to the best possible prediction in the absence of observations. Let x_i be a target gene that we wish to predict by observing some other genes x_{i1}, x_{i2}, ..., x_{ik}. Also, suppose f(x_{i1}, x_{i2}, ..., x_{ik}) is an optimal predictor of x_i relative to some error measure ε, and let ε_opt be the optimal error achieved by f. Then, the COD for x_i relative to x_{i1}, x_{i2}, ..., x_{ik} is defined as

θ = (ε_i − ε_opt) / ε_i,

where ε_i is the error of the best (constant) estimate of x_i in the absence of any conditional variables [2].

2.2 PBN Model
The open nature of biological systems and of the procedures used to study them makes it necessary for the model representing them to cope with uncertainty. One solution is to absorb the uncertainty into the predictor by synthesizing a number of predictors with good performance, so that each gets a chance to contribute its own modest prediction. Each predictor's contribution is proportional to its determinative potential, as measured by the COD. Given genes V = {x_1, ..., x_n}, we assign to each x_i a set F_i = {f_1^(i), ..., f_{l(i)}^(i)} of Boolean functions representing the top predictors for that target gene. Thus, a PBN can be defined as G(V, F), where F = (F_1, ..., F_n) [4] and each F_i in F is as previously described. The probabilistic predictor of each target gene can be thought of as a random switch: at each point in time, or step of the network, the function f_j^(i) is chosen with probability c_j^(i) to predict gene x_i. Using normalized CODs,

c_j^(i) = θ_j^(i) / Σ_{k=1}^{l(i)} θ_k^(i),

where θ_j^(i) is the COD for gene x_i relative to the genes used as inputs to predictor f_j^(i) (see Figure 2) [2].

Figure 2. A basic building block of a PBN [2].

At a given instant of time, the predictors selected for each gene determine the state of the PBN. These predictors are contained in a vector of Boolean functions, where the i-th element of that vector contains the predictor selected at that instant for gene x_i. This is known as a realization of the PBN. If there are N possible realizations, then there are N vector functions f_1, f_2, ..., f_N of the form f_k = (f_{k1}^(1), f_{k2}^(2), ..., f_{kn}^(n)), for k = 1, 2, ..., N, 1 ≤ k_i ≤ l(i), and where f_{ki}^(i) ∈ F_i (i = 1, ..., n). In other words, the vector function f_k : {0, 1}^n → {0, 1}^n acts as a transition function (mapping) representing a possible realization of the entire PBN.
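A toy sketch of the random-switch behavior just described is shown below; the three-gene predictor sets and COD values are made up for illustration, and the code only shows how a realization is drawn and applied, not any inference from data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-gene PBN: each gene has a set of candidate predictors F_i and
# selection probabilities obtained by normalizing its CODs.
predictors = [
    [lambda x: x[0] & x[1], lambda x: x[0] | x[2]],   # F_1
    [lambda x: x[0] ^ x[2]],                          # F_2
    [lambda x: 1 - x[1], lambda x: x[1] & x[2]],      # F_3
]
cods = [np.array([0.6, 0.2]), np.array([0.9]), np.array([0.5, 0.5])]
probs = [c / c.sum() for c in cods]       # c_j^(i) = theta_j / sum_k theta_k

def pbn_step(state):
    """One PBN transition: pick one predictor per gene (a realization
    f_k of the network) and apply it to the current gene activity profile."""
    realization = [rng.choice(len(F), p=p) for F, p in zip(predictors, probs)]
    return tuple(predictors[i][j](state) for i, j in enumerate(realization))

state = (1, 0, 1)
for _ in range(5):
    state = pbn_step(state)
    print(state)
```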
Assuming independence of the predictors, N = Π_{i=1}^{n} l(i), and each realization k can be selected with probability P_k = Π_{i=1}^{n} c_{ki}^{(i)}. The probability of transitioning from state (x_1, ..., x_n) to (x_1', ..., x_n') is given by [3]:

Pr{(x_1, ..., x_n) → (x_1', ..., x_n')} = Σ_{k=1}^{N} P_k Π_{i=1}^{n} (1 − |f_{ki}^{(i)}(x_1, ..., x_n) − x_i'|).

3. PERTURBATION AND INTERVENTION
The genome isn't a closed system, but one with inputs from the outside. Such stimuli can either activate or inhibit genes; therefore it is necessary for the model that represents the genome to reproduce this behavior. This is achieved by the inclusion of a random perturbation vector γ ∈ {0, 1}^n. Let us assume that each gene can be perturbed independently with probability p: if γ_i = 1, the i-th gene is flipped; otherwise it is not. For simplicity, we will assume that Pr{γ_i = 1} = E[γ_i] = p for all i = 1, ..., n (i.e., the γ_i are independent and identically distributed). Let x(t) ∈ {0, 1}^n be the state of the network at time t. Then, the next state x' is given by:

x' = x ⊕ γ, with probability 1 − (1 − p)^n,
x' = f_k(x_1, ..., x_n), with probability (1 − p)^n,

where ⊕ is component-wise addition modulo 2 and f_k is the transition function representing a possible realization of the entire PBN, k = 1, 2, ..., N [2]. In the presence of perturbation with probability p, the entries of the state transition matrix are computed by [4]:

A(x, x') = [ Σ_{k=1}^{N} P_k Π_{i=1}^{n} (1 − |f_{k_i}^{(i)}(x_1, ..., x_n) − x'_i|) ] (1 − p)^n + p^{η(x,x')} (1 − p)^{n − η(x,x')} 1_{[x ≠ x']},

where η(x, x') is the number of genes in which x and x' differ (the Hamming distance). Most relevant to our research is the fact that, performed in a deliberate way, a perturbation constitutes an intervention. The purpose of an intervention is to achieve a desired state or to move away from an undesirable one in the network. We want to do this by perturbing those genes with the greatest impact on the global behavior, by perturbing as few genes as possible, and by reaching the desired state as early as possible. In gene interaction, some genes used in the prediction of a target gene have more impact than others, making them more important. Thus, it is important to distinguish the genes that have a major impact from those with a minor impact. Similarly, we can determine the sensitivity of a particular gene, defined as the sum of all influences acting upon it. This is important because it indicates how stable and independent the gene is. In [2][4] a method to compute influences and sensitivities is given. One of the main benefits of determining the influences and sensitivities of genes is that they reveal the most vulnerable points of the network, that is, the genes most likely to affect its entire behavior if perturbed. Highly influential genes can exert control, making it possible to move to a different basin of attraction when they are perturbed. This kind of information may provide the potential targets when an intervention is needed in order to obtain a desired state of the system.

4. MODEL-CHECKING ALGORITHMS
In [3] it is shown that PBN dynamics can be modeled as Markov chains. Following this, when p > 0 the Markov chain is ergodic [4], so every state will eventually be visited. The first passage time gives us the probability F_k(x, y) of reaching state y for the first time from state x at time k. For k = 1, F_k(x, y) = A(x, y), the transition probability from x to y. For k ≥ 2 [4],

F_k(x, y) = Σ_{z ∈ {0,1}^n \ {y}} A(x, z) F_{k−1}(z, y).

While the first-passage-time method is a very useful tool for finding the best candidates for gene intervention, the exponential growth of the state space makes it impossible to capture the long-run behavior of large networks. Also, only steady-state and transient-state measures can be determined this way, whereas we may be interested in probabilistic properties over paths. Continuous Stochastic Logic (CSL), a probabilistic timed extension of Computation Tree Logic (CTL), provides a means of specifying both state-based and path-based measures for continuous-time Markov chains (CTMC). Numerical methods to model-check CSL over finite-state CTMCs are explored in [1].

4.1 Continuous-time Markov chains
Consider a CTMC as an ordinary finite transition system in which the edges are equipped with probabilistic timing information. Let AP be a fixed, finite set of atomic propositions [1]. A (labeled) CTMC M is a tuple (S, R, L) with S a finite set of states, R : S × S → R≥0 the rate matrix, and L : S → 2^AP the labeling function. Function L assigns to each state s ∈ S the set L(s) of atomic propositions a ∈ AP that are valid in s. Self-loops at state s are possible and are modeled by having R(s, s) > 0, allowing the system to occupy the same state before and after taking a transition. R(s, s') > 0 iff there is a transition from s to s'. The probability that the transition s → s' can be triggered within t time units is 1 − e^{−R(s,s')·t}.
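As a small numeric illustration of the first-passage recursion given above, the sketch below takes a state-transition probability matrix A (here an arbitrary toy matrix standing in for the perturbed PBN transition matrix, not values derived from a real gene network) and computes F_k(x, y) for increasing k.

```python
import numpy as np

# Toy transition matrix A over 4 states (rows sum to 1); a stand-in for the
# perturbed PBN transition matrix A(x, x') defined above.
A = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.80, 0.10, 0.05],
    [0.10, 0.10, 0.60, 0.20],
    [0.25, 0.25, 0.25, 0.25],
])

def first_passage(A, y, k_max):
    """F[k][x] = probability of reaching state y for the first time from x at step k.
    F_1(x, y) = A(x, y);  F_k(x, y) = sum over z != y of A(x, z) * F_{k-1}(z, y)."""
    n = A.shape[0]
    F = np.zeros((k_max + 1, n))
    F[1] = A[:, y]
    mask = np.ones(n, dtype=bool)
    mask[y] = False                      # exclude z = y from the sum
    for k in range(2, k_max + 1):
        F[k] = A[:, mask] @ F[k - 1, mask]
    return F

F = first_passage(A, y=0, k_max=10)
# Cumulative probability of having reached state 0 from each state within 10 steps:
print(F[1:].sum(axis=0))
```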
If R(s,s ) > 0 for more than one state s, a competition between the transitions originating in s exists, known as the race condition. The probability to move from a nonabsorbing state s to a particular state s within t time units is given by [1]: R(s, s') P(s, s',t) =! (1" e " E(s)!t ) E(s) where E(s) = Σ s S R(s,s ) denotes the total rate at which any transition outgoing from state s is taken. The probability of moving from a nonabsorbing state s to s by a single transition, denoted P(s,s ), is determined by the probability that the delay of going from s to s finishes before the delays of other outgoing edges from s: P(s,s ) = R(s,s )/E(s). For an absorbing state s, the total rate E(s) is 0. Then, P(s,s ) = 0 for any state s [1]. For a CTMC, two major types of state probabilities are distinguished: 1. Transient-state probabilities where the system is considered at a given time instant t. πm (α,s,t) = Pr α { σ PathM σ@t = s } 2. Steady-state probabilities where the system is considered on the long run, i.e., when equilibrium has been reached. πm (α,s ) = lim t πm (α,s,t) The above two types of measures are state-based. But we would also be interested in probability on paths through the CTMC obeying particular properties. Suitable mechanisms to express such measures have not been considered. 4.2 Continuous Stochastic Logic Continuous Stochastic Logic (CSL) provides means to specify state as well as path-based performance and dependability measures for CTMCs in a compact and unambiguous way. This logic is basically a probabilistic timed extension of CTL [1]. Besides the standard steady-state and transient measures, the logic allows for the specification of constraints over probabilistic measures over paths through CTMCs. For instance, the probability can be expressed as follows: starting from a particular state, within t time units, a set of goal-states is reached, thereby avoiding or deliberately visiting particular intermediate states before. Four types of measures can be identified (see Table 1): 1. Steady-state measures: The formula S p (Φ) imposes a constraint on the probability to be in some Φ state on the long run. 2. Transient measures: The combination of the probabilistic operator with the temporal operator [t,t] can be used to reason about transient probabilities. More specifically, [t,t] P p ( at s ) is valid in state s if the transient probability at time t to be in state s satisfies the bound p. 3. Path-based measures: By the fact that P-operator allows an arbitrary path formula as argument, much more general measures can be described. An example is the probability of reaching a certain set of states provided that all paths to these states obey certain properties. 97

98 4. Nested measures: By nesting the P and S operators, more complex properties can be specified. These are useful to obtain a more detailed insight into the system s behavior and allow to express probabilistic reachability that are conditioned on the system being in equilibrium. Table 1. Measures and Their Logical Specification [1] (a) steady-state availability S p (up) (b) instantaneous availability at time t P p ( [t,t ] up) (c) conditional instantaneous availability at time t P p (Φ U [t,t ] up) (d) interval availability P p ( [t,t ] up) (e) steady-state interval availability S p (P q ( [t,t ] up)) (f) conditional time-bounded steady-state availability P p (Φ U [t,t ] up S q (up)) There are two main benefits when using CSL for specifying constraints on measures-of-interest over CTMCs [1]: 1. The specification is entirely formal such that the interpretation is unambiguous. An important aspect of CSL is the possibility of stating performance and dependability requirements over a selective set of paths through a model, which was not possible previously. 2. The possibility of nesting steady-state and transient measures provides a means to specify complex, though important measures in a compact and flexible way. Once we have obtained the model (CTMC M) of the system under consideration and specified the constraint on the measure of interest in CSL by a formula Φ, the next step is to model check the formula. The model-checking algorithm for CTL to support the automated validation of Φ over a given state s in M is adapted to these purposes. The basic procedure is as for model checking CTL: in order to check whether state s satisfies the formula Φ, we recursively compute the set Sat(Φ) of states that satisfy Φ and, finally, check whether s is a member of that set. For the nonprobabilistic state operators, this procedure is the same as for CTL [1]. For the purposes of intervention, it would be necessary to know how likely are certain states to reach steady-state on the network of genes. This information, and with the use of the influences and sensitivities previously explained, would aid in determining the genes that represent the best candidates for reaching a desired condition. For instance, if we want to verify if a particular state reach a steady-state condition with a certain probability, a very high-level algorithm would look as follows: Input: PBN, state s, measure m, constraint c Do: 1. Determine Bottom Strongly Connected Components BSCC of PBN. 2. If s isn t in some BSCC Output State specified doesn t reach steady state. Stop. 3. Else continue. 4. Compute transition probabilities to state s. 5. Use constraint c to compare computed probabilities with m. 6. If constraint is met with some probability p Output The condition is met with probability p. Stop. 7. Else Output The system doesn t meet the desired condition. Stop. 5. FUTURE WORK So far we have gained enough knowledge about modeling and dynamics of PBNs. At the moment, we are in the process of adapting model-checking techniques to our model. Next, we have to develop algorithms for the particular cases of steady and transient states, and for path-based measurements. We need then to test them against unreal at first, and then real data. This, of course, belongs to a cycle where results will be used to improve the algorithms. 6.CONCLUSIONS Among all the models already in use, PBNs make an ideal model representation for genetic networks. CTMC are used to model and study the dynamics of PBN. 
Also, CTMC have been widely used to determine system performance and dependability. CSL can be used as a model-checking algorithm for CTMC, and expand the traditional state-based measures to the use of path-based measures. Based on this, it is our believe that a model-checking algorithm for CSL can be used to study the dynamics of CTMC representations of PBN used to model genetic networks in an effective way. Avoiding the matrix-based model, this algorithm not only would mitigate the impact of the exponential size of the network, but the information gathered thanks to its ability of answering questions about the transition system of the PBN, will make interventions feasible. 7. ACKNOWLEDGMENTS This research is conducted in part thanks to the support of a RISE- NIH scholarship (grant 1R25GM A1) granted to the first author. The authors acknowledge the National Science Foundation (grant no. CNS ) for its support. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. 8. REFERENCES [1] Baier, C., Haverkort, B., Hermanns, H. and Katoen, J-P Model-Checking Algorithms for Continuous-Time Markov Chains. IEEE Transactions on Software Engineering. 2003; 29(6):1-18. [2] Shmulevich, I., Dougherty, E.R. and Zhang, W From Boolean to Probabilistic Boolean Networks as Models of Genetic Regulatory Networks. IEEE. 2002; 90(10). [3] Shmulevich, I., Dougherty, E.R., Kim, S. and Zhang, W Probabilistic Boolean Networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics (Oxford, England). 2002; 18(2): [4] Shmulevich, I., Dougherty, E.R. and Zhang, W Gene perturbation and intervention in probabilistic Boolean networks. Bioinformatics (Oxford, England). 2002; 18(10):

Virtual Instrumentation for Bioacoustics Monitoring System
Hector O. Santiago (Underg. Student) University of Puerto Rico Mayaguez PO Box 9000 Mayaguez, P.R. David Márquez (Graduate Student) University of Puerto Rico Mayaguez PO Box 9000 Mayaguez, P.R. Domingo Rodriguez (Advisor) University of Puerto Rico Mayaguez PO Box 9000 Mayaguez, P.R.

ABSTRACT
This paper reports on the first stage of designing a basic virtual instrumentation testbed framework. The framework consists of a System to be monitored, a Computational Acceleration Unit, an Embedded Computer Unit, and Personal Data Assistants (PDAs). In this first stage we focus on the system to be monitored. The Bioacoustics Monitoring System developed was programmed to record audio at specific time intervals and was designed to be autonomous through the use of solar energy. For the recording process of the Bioacoustics Monitoring System we used single-board computers called Gumstixs, which are programmed via ssh using scripts under the Linux operating system. The files recorded in the laboratory were saved in WAV format and transferred wirelessly to a local server, where they can then be accessed through a webpage with PDAs or any computer for further data processing.

KEYWORDS
Virtual Instrumentation, Gumstix, Bioacoustics, Embedded Computing, Digital Signal Processing.

1. INTRODUCTION
Virtual instrumentation is the use of modular, customizable hardware and programmable software to monitor specific variables of a system's behavior. By using software for instrumentation we can replace limited hardware and achieve the desired monitoring of the variables with low-cost, effective software. Our framework for virtual instrumentation consists of a System to be monitored, a Computational Acceleration Unit, an Embedded Computer Unit, and Personal Data Assistants. Within these modules of the framework, communication between the System and the Embedded Computer is intended to be bidirectional. In between these two we introduce a Computational Acceleration Unit with bidirectional communication with both of them. The PDAs are not intended to be a nearby module; they form part of the framework through wireless internet access. Other configurations of the framework are also considered, obtained by integrating modules into the System, into the Embedded Computer Unit, or into an all-in-one System. The Bioacoustics Monitoring System developed will be discussed in this paper. The solar power supply of the Bioacoustics Monitoring System was previously designed to provide energy throughout the day; however, since we worked in a laboratory, solar power was simulated. The Bioacoustics Monitoring System consists of a configuration of single-board computers called Gumstixs, which were mounted and programmed for audio recording purposes. The three boards used were the Verdex XL6P, the main board with a 600 MHz processor and 128 MB of RAM; the netwifimicrosd, for wireless, Ethernet, and micro SD interfaces; and the Audiostix2, with audio input and output interfaces. The Audiostix2 board also comes with a mini USB input that was used as the audio input by means of a Turtle Beach USB Audio adapter. This adapter comes with its own microphone, but we used a Beyerdynamic microphone instead for better acquisition quality. All commands for programming and development of the System were written under the Linux operating system. The Gumstix is accessed through Ethernet using a secure shell connection with an embedded computer.
The Gumstix is programmed to have a unique IP address for network identification and user access. Executable scripts were installed for the recording process of the audio files and to run repeatedly with a Linux time based job scheduler called Cronjobs. Once the file is recorded in WAV format the process of transferring the files is made by the Embedded Computer Unit within the same network. In this embedded computer three different scripts automatically make the transfer process of the files to an internet server for files to be available in a web page. One script retrieve the file from the Gumstix, a second script transfer the file to the server while a third script deletes the already transferred audio that remains on the Gumstix. These files are intended to be downloaded through the webpage by any computer or PDA for additional signal processing. 99
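The retrieve, upload, and cleanup cycle described above can be orchestrated from the embedded computer with a few scp/ssh calls. The sketch below is a hypothetical Python version of that cycle (the paper itself uses separate shell scripts driven by cron); the Gumstix IP address and the server account and public folder are placeholders, and passwordless key-based authentication is assumed to be already configured with ssh-keygen as described later in the paper.

```python
#!/usr/bin/env python3
"""Hypothetical re-implementation of the localcopy/servercopy/cleanup scripts."""
import subprocess

GUMSTIX = "root@192.168.1.2"                 # placeholder: the Gumstix static IP
GUMSTIX_DIR = "/media/card/send"             # where the recording script stores WAV files
LOCAL_DIR = "/home/audiofiles"               # staging folder on the embedded computer
SERVER = "user@example.edu"                  # placeholder account on the web server
SERVER_DIR = "/home/www/public_html/wavs"    # placeholder public web folder

def run(cmd):
    """Run a shell command; returns True on success (relies on ssh key auth)."""
    return subprocess.run(cmd, shell=True).returncode == 0

def transfer_cycle():
    # 1) Retrieve new recordings from the Gumstix.
    pulled = run(f"scp {GUMSTIX}:{GUMSTIX_DIR}/*.wav {LOCAL_DIR}/")
    if not pulled:
        return
    # 2) Push them to the public folder on the internet server.
    pushed = run(f"scp {LOCAL_DIR}/*.wav {SERVER}:{SERVER_DIR}/")
    if pushed:
        # 3) Only delete from the Gumstix once the files are safely on the server.
        run(f"ssh {GUMSTIX} 'rm -f {GUMSTIX_DIR}/*.wav'")

if __name__ == "__main__":
    transfer_cycle()
```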

100 2. DEVELOPMENT 2.1 Framework The main framework (see Figure 1) for the System Instrumentation consists of a System to be monitored, a Computational Acceleration Unit, and Embedded Computer Unit and Personal Data Assistants. Figure 2: Gumstix Boards Connection and Specifications Figure 1: Frame Work for Virtual System Instrumentation This framework presents the communication between modules. The connections between the Embedded Computing Unit can be in one direction or bidirectional. This will allow us to manipulate the behavior of the System depending on the information received. The in-between Computation Acceleration Unit can work as a data processing unit with bidirectional data communication. The PDAs are minted to have wireless access to the Embedded Computer Unit and to access system data or further more to manipulate the modules.. The discussion on this paper will focus on the development of one of the Systems which is the Bioacoustics Monitoring System. 2.2 The System The System is mainly based on a single board computers called Gumstixs which are programmed to record audio files and automatically transfer them through the Embedded Computing Unit to an internet server. In order to capture audio on the Gumstix we assembled the audiostix2, netwifimicrosd FCC and Verdex XL6P motherboard (see Figure 2). The final assembly (see Figure 3) 3 is called the netmmc audio pack. Screws and Spacers are used as indicated verifying that there is no other contact within the boards. Plastic washers were used to avoid contact between the screws and the boards. Each netmmc has three holes for screws to make the joint. Figure 3: Gumstix netmmc Audio Pack Mounted Once mounted, we set a static IP on the embedded e computer by using Linux operating system terminal. For this first communication with the Gumstix we need to identify the Ethernet interface of the embedded computer named by default eth0 with a desired IP and a netmask using the ifconfig command on terminal to make a network between the embedded computer and the Gumstix. Once the embedded computer interface has a static IP we try the first communication with the Gumstix with the tcpdump command. The Gumstix comes with a predetermine IP that we will acquire when using the tcpdump command. Once we acquire the predetermined IP of the Gumstix we will change the embedded computers IP again to be able to access the Gumstix within the same network of the Gumstix Ethernet interface. Once we are within the same 100

101 network of the Gumstix we can enter the Gumstix by ssh command using the predetermined IP of the Gumstix. If make correctly the password asked is gumstix. To assign a static IP to the wireless and the Ethernet interface of the Gumstix, the interfaces file is edited. This file is located in folder /etc/network/ on the Gumstix. This step is very important because once the static IP is assigned to the Gumstix it will be used by the scripts to connect to it using an ad-hoc connection and used by the user to retrieve information, edit files on it or to configure other things on the Gumstix. In the interfaces file we edit lines to assign a static IP for the Gumstix. In our case we assign a static IP to the Gumstix s wireless interface wlan0 and eth0. We now have a static interface eth0 activated each time the Gumstix starts. The iface eth0 inet static indicates that the IP for the Gumstix is static and specifies the desired IP for the Gumstix in the address string. The netmask and the network IPs are also added. For the wireless setup we add the lines of wireless_mode ad-hoc and wireless_essid net. This name changes depending on the name of the essid name want to set to recognize the wireless signal of your Gumstix. Once the file is edited and saved you can disconnect the Gumstix from power and restart it. Now the Gumstix wireless interface wlan0 and eth0 has a static IP and can be accessed by wireless once the network configuration of the embedded computer is configured within the same network. One important thing is that every time you restart the Gumstix it is needed to update the Gumstixs date by using date command. arecord -d 60 -f S16_LE -c1 -r t wav $file echo $file >> /media/card/send/files.txt echo "recording" + $file >> /media/card/log It names a variable N which is later used to name each recorded file identifying the Gumstix and the year, month, date, hour and minute of the file to be recorded. The script also identifies the folder in which the audio is saved on the Gumstix and has the setups commands of the device to be ready to record. The arecord string triggers the recording process and it specify the duration, quality, sample rate and the format type of the archive to be recorded, which in our case is a WAV file. The script also creates a log file to have a reference of recorded files on the Gumstix. To auto trigger the script we used a Linux time based job scheduler called cronjobs. A cronjob is a process that is automatically run at whatever time you set it to run. This can be each day, each hour of every day, every 5 minutes of every hour of every day. Our crontab file for Gumstix contains the time intervals to record the WAV files and to be executed automatically. In our case we will be recording one file of 15 seconds every 15 minutes. These specifications of time intervals can be change by editing the crontab file as (see Figure 4). Figure 4: A cronjob file example 2.3 The Scripts In order to make the Gumstix record WAV files by themselves we need to make scripts that contain the commands necessary to start recording. A script is a set of executable instructions in a file that can be executed automatically. The executable files aplay, arecord and amixer are Linux commands for the audio card. These files are needed by the Gumstix because they are executed by the scripts and allow the Gumstix to record and to playback the files. The first script to be made is named scriptaudio and is in charge of triggering the recording process. 
An example of the used script has the following commands: #!/bin/sh N=`date +gum111-%y-%m-%d-%h-%m.wav` #String containing full route to saved file in Gumstix file=/media/card/send/$n #Recording cd quality for 60 seconds amixer -c 0 set Mic Capture 90% amixer -c 0 set Mic 90% This is a cronjob file example that triggers the script local copy in the tenth minute of every hour every date and the script server copy in the 17 th minute of every hour every date. Once we have a file recorded on the Gumstix we setup the embedded computer to be able to retrieve the WAV files from the Gumstix. This script was called localcopy and we programmed it to execute with 15 minutes intervals. Thus we need to do this automatically without the need of typing password when retrieving the WAV file from the Gumstix or when sending the WAV file to the server we use what is call a finger print generated by using the ssh-keygen command. This allows us to enter the Gumstix without password and to send the file to a server without using the password. The localcopy script contains a single string with the scp command specifying the Gumstix with its static IP. The string format is: scp root@ :/home/aud*.wav /home/audiofiles/. This string is also triggered by the cronjob and it copy all wav files from the Gumstix identify with IP address to the embedded computer /home/audiofiles folder. 101
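Because each file name encodes the Gumstix identifier and a timestamp, the archive can also be audited automatically. The sketch below is a hypothetical helper, not part of the deployed scripts: it parses names of the form gumNNN-YYYY-MM-DD-HH-MM.wav and reports which 15-minute recording slots of a given day have no file. The exact date format produced by the deployed script may differ slightly (for example, a two-digit year), and the day used in the demo call is illustrative.

```python
import re
from datetime import date, datetime, timedelta
from pathlib import Path

NAME_RE = re.compile(r"gum(\d+)-(\d{4})-(\d{2})-(\d{2})-(\d{2})-(\d{2})\.wav$")

def recorded_slots(folder):
    """Parse gumNNN-YYYY-MM-DD-HH-MM.wav names into (gumstix_id, datetime) pairs."""
    slots = set()
    for path in Path(folder).glob("*.wav"):
        m = NAME_RE.match(path.name)
        if m:
            gid, y, mo, d, h, mi = m.groups()
            slots.add((gid, datetime(int(y), int(mo), int(d), int(h), int(mi))))
    return slots

def missing_slots(folder, gumstix_id, day, period_minutes=15):
    """One recording is expected every 15 minutes; list the slots with no file."""
    have = {t for gid, t in recorded_slots(folder) if gid == gumstix_id}
    start = datetime(day.year, day.month, day.day)
    expected = [start + timedelta(minutes=i * period_minutes)
                for i in range(24 * 60 // period_minutes)]
    return [t for t in expected if t not in have]

if __name__ == "__main__":
    gaps = missing_slots("/home/audiofiles", "111", date(2011, 3, 1))
    print(f"{len(gaps)} missing 15-minute slots")
```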

102 2.4 File Transfer Having the recorded file on the embedded computer what is left is to send the file to a public folder on the server using another script with scp command named servercopy. The string on the script to copy files from the embedded computer to the server is: scp /home/audiofiles/audio*.wav aip@ece.uprm.edu/home/www/aip/public_html/wavs_hector This string copies all the wavs files from the embedded computer to the internet server. Once the files are deposit on the public folder at the server they can be access through this webpage that has a few examples of the audio recorded on the laboratory. These files can be then access and downloaded by any other computer or PDA with internet access for further data processing. 3. IMPLEMENTATION RESULTS This wired internet connection IP was automatically assigned by the server. The connection between the Gumstix and the Linux based computer was totally private. A fingerprint key generation was made for the wireless connection between the Gumstix and the Linux based computer and for the connection of the Linux based computer with the internet server. The scripts were successfully installed and programmed to repeatedly record and transfer an audio file every fifteen minutes to an internet server with a webpage. The files were access using a PDA with internet connection by accessing webpage. The files names are automatically assigned to identify the Gumstix and to include the date, hour and minute of the recorded file. These recorded files are intended to pass through signal processing for further analysis of the acquired data. 4. ACKNOWLEGMENTSS The author(s) acknowledge(s) the National Science Foundation (Grants CNS , CNS ,, and CNS ) for its support. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. 5. REFERENCES [1] C. Aceros, N. Pagán, K. Lu, J. Colucci, and D. Rodriguez, "SIRLAB: An Open-Source Computational Tool Framework for Time-Frequency Signal Visualization and Analysis," 2011 IEEE DSP/SPE Workshop, Sedona, Arizona, USA, January Figure 5: Implementation During the implementation of the Bioacoustics Monitoring System we focus on the programming of the System, the wireless communication and availability of the files for the PDA s. The solar power implementation for the System was not used because all the process was made on a laboratory. In the lab the Gumstix boards were assembled and programmed using a Linux based computer. A static IP was assigned to the wireless interface of the Gumstix ( ) and to the Linux based computer wireless interface ( ). We also maintained a wired internet connection of the Linux based computer interface to be able to transfer the files to an internet server (see Figure 5). [2] J. Sanabria, W. Rivera, D. Rodriguez, Y. Yunes, G. Ross, and C. Sandoval, A Sensor Grid Framework for Acoustic Surveillance Applications, IEEE Fifth International Joint Conference, on INC, IMS, and IDC, pp , S. Korea, [3] G. Vaca-Castano, D. Rodriguez, J. Castillo, K. Lu, A. Rios, and F. Bird, A Framework For Bioacoustical Species Classification In A Versatile Service-oriented Wireless Mesh Network,, Eusipco 2010, Aalborg, Denmark. [4] K. Lu, Y. Qian, D. Rodriguez, W. Rivera, and M. Rodriguez, A cross-layer design framework for wireless sensor networks with environmental monitoring applications, Journal of Communications Software and Systems,, vol. 3, no. 3, Sept H. [5] Lugo-Cordero, K. 
Lu, D. Rodriguez, and S. Kota, A novel service-oriented routing algorithm for wireless mesh network,, Proc. IEEE MILCOM 2008, San Diego, CA, USA,

103 Coherence-Enhancing Diffusion for Image Processing Maider Marin-McGee and Miguel Velez-Reyes Laboratory for Applied Remote Sensing and Image Processing University of Puerto Rico at Mayaguez P.O. Box 3535, Marina Station, Mayagüez, PR Ext. 5263, ABSTRACT Analyzing flow-like patterns in images for image understanding is an active research but there have been much less attention to the process of enhancement of those structures. The completion of interrupted lines or the enhancement of flow-like structures is known as Coherence-Enhancement (CE). In this work, we are studying nonlinear anisotropic diffusion filtering for coherence enhancement. Anisotropic diffusion is commonly used for edge enhancement by inhibit diffusion in the direction of highest fluctuation of gray level. For CE, diffusion is promoted along the direction of lowest fluctuation of gray level values in the neighborhood thereby taking into account how strongly the local gradient of the structures in the image is biased towards that direction. Results of CE applied to gray scale, color and hyperspectral images are presented. KEYWORDS Nolinear Diffusion, Coherence-Enhancement, Image Processing. 1. INTRODUCTION The most stable and reliable descriptor of the local structure of an image is the structure tensor [12]. This tensor is a symmetric positive semi definite matrix that at each pixel determines the orientation of minimum and maximum fluctuation of gray value in a neighborhood of a pixel. The structure tensor was presented by [8] and also by [2] in an equivalent formulation. The structure tensor is derived from the gradient; it provides the main directions of the gradient in a specified neighborhood of a point. Its eigendecomposition provides information of how strongly the gradient is biased towards a particular direction which is known as coherence. Oriented flow-like structures can be found by looking to the orientation of lowest fluctuation of gray value in an image, and it is determined by the eigenvector of the structure tensor with the smallest eigenvalue. This process has been studied in the context of gray and color images [17] by the pattern recognition and computer vision fields and used for automatic grading of fabrics or wood surfaces [15], segmenting two-photon laser scanning microscopy images [7], enhancing fingerprint images [12], enhancing corners [7] and 3-D medical imaging [16,3]. But few have been done with vector-valued data such as multispectral MSI) and hyperspectral images (HSI). This paper study a PDE-based coherence diffusion method proposed by [16] and uses a finite volume scheme proposed by [7]. The main contribution in this paper is to extend CED to MSI and HSI and show that coherence enhancement can be used for Remote Sensing problems. Results will be presented for gray, color, and hyperspectral images. This paper is organized as follows. Section 2 introduces CED its diffusion tensor and some properties. Section 3 presents the discretization schemes. Section 4 shows experimental results. 2. COHERENCE ENHANCING DIFFUSION Adaptive smoothing methods are based on the idea of applying a process which itself depends on local properties of the image. A regularized adaptive method can be extended to anisotropic processes which make use of an adapted diffusion tensor instead of scalar diffusivity. This is the case of the anisotropic diffusion model introduced by Weickert in [15,14]. Let us consider a scale image domain ( ) ( ) with number of rows and number of columns, and boundary. 
Let ( ) The following initial boundary value problem represent a class of anisotropic diffusion filters: ( ) ( ) ( ) ( ) (1.2) (1.3) Hereby, u, the filtered version of image f (x), is the solution of the initial boundary value problem for the diffusion equation with f as initial condition, see Equation (1.2). Equation (1.1) has a partial derivative with respect to time, in image processing there is no real evolution in time, variable t in this case is a dummy variable that represent an iterative process. The vector n denotes the outer normal unit vector to Ω and, is the Euclidean scalar product on Equation (1.3) is a Neumann boundary condition, which mean that the flux is zero outside the boundary. Equations (1.1) to (1.3) will be called P1. CED is basically a 1-D diffusion, where a minimal amount of isotropic smoothing is added for regularization purposes. In order to make diffusion tensor adaptable, in the sense that it is obtained a strong smoothing in one direction and low smoothing along the edges, it will depend on the structure tensor that in turn depends on the edge estimator, which is also known as the smoothed gradient. is obtained by finding the gradient of the smoothed image obtained by a Gaussian convolution of variance ( ) ( )( ) ( ) ( ) ( ) ( ) where denotes the extension of u from to that can be obtained by mirroring at [4] and the * is a convolution. 2.1 Diffusion Tensor Diffusion tensor D was designed to generate a strong smoothing in a preferred direction and a low smoothing perpendicular to it, e.g. for images with interrupted coherence of structures. D is built upon the structure tensor, The structure tensor provides the main directions of the gradient in a specified 103

104 neighborhood of a point. Its eigen-decomposition provides information of how strongly the gradient is biased towards a particular direction, for the oriented flow-like case, we want the direction of lowest fluctuation of gray value in the image. The structure tensor is defined by: where, u and are as in Equations (1.4) and (1.5). is a symmetric, positive semi-definite matrix and its eigenvectors are parallel and orthogonal to. with eigenvalues μ 1 μ 2 is the average of in a neighborhood. From, useful information about coherence can be obtained. is large for anisotropic structures and tent to zero in isotropic ones. Constant areas are characterized by, straight edges by and corners by [7]. Let ( ) be the orthogonal set of eigenvectors corresponding to eigenvalues ( ), which depends on solution u, is constructed using ( ) producing a filtering process such that diffusion is strong along the coherence direction w and increases with. In addition, satisfies smoothness, symmetry and uniform positive definiteness properties. D is defined as:, - [ ] [ ] Cottet et al [5] presented a reaction-diffusion model where the diffusion was along the eigenvector corresponding to the smallest eigenvalue, ( ) of its diffusion tensor. But this model did not produce a scale-space (see 2.2.2), and its diffusion tensor D had eigendirections adapted to which make them sensible to changes in. Note that by definition ( ) guarantee that the process never stop and keeps the diffusion tensor uniformly positive definite. C is a threshold parameter [14]: and. For a vector valued image with m bands the structure tensor is defined as and all the other definitions are as in the scalar value case [13]. 2.2 Scale-Space and Integration Parameter There are two concepts about CED that are worth elaborating to fully understand this particular kind of diffusion: (i) what is the roll of the integration parameter, and (ii) the fact that this particular PDE-based CED produces a scale space Integration Parameter The structure tensor is the average of the gradient orientations in a neighborhood of size ; with orthonormal eigenvectors and corresponding eigenvalues. These eigenvalues measure the average contrast, this is, the gray value fluctuation, for all bands in the eigen-directions within an integration scale. Therefore, is the direction of higher average contrast and gives the preferred local orientation or coherence orientation. (a) (b) (c) (d) Figure 1. Smoothed Gradient vs. Structure Tensor in a fingerprint image of size (a) Original image. (b) Gradient orientation σ=0.5. (c) Gradient orientation σ=2.5. (d) Structure Tensor σ=0.5,ρ=4. Figure 1(b)-(c) illustrate the gradient orientation using grey values. Vertical gradients are depicted in black and horizontal ones in white. It is observed that if is too small then high fluctuation of noise remains. As gets larger then it is useless, since neighboring gradients with the same orientation, but with opposite sign (direction) cancel one another. Therefore, gradient smoothing averages directions instead of orientations [14]. Figure 1(d) shows the coherence orientation for the fingerprint image and also how well the fingerprint singularity (minutiae) is described. Consequently, the matrix representation of the image gradient allows the integration of information from a local neighborhood without cancellation effects. The eigenvalues of D are chosen as follow [15]: Figure 2. Impact of the integration scale on CED. with image size is. Figure 2. 
shows the Van Gogh paint Road with Cypress and Star [11] in which all parameters are equal except the integration parameter to show the impact of this parameter. When, the filter will produce artifacts, while increasing it to make flow-like structures look like completed lines since the average of orientations in this bigger neighborhood captures the coherence orientation, but further increasing to ρ hardly make any difference [14] Scale- Space { ( ) ( ) Figure 3. Scale-space produced by coherence-enhancing diffusion in a thyroid cell with ( ) 104
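For readers who want to experiment with the quantities discussed above, the following NumPy/SciPy sketch computes the smoothed gradient at noise scale σ, the structure tensor averaged at integration scale ρ, the coherence (μ1 − μ2)², and diffusion-tensor eigenvalues of the form used in Weickert's CED [15] (λ1 = α; λ2 = α + (1 − α)·exp(−C/(μ1 − μ2)²) when μ1 ≠ μ2). It is a simplified scalar-image illustration with made-up parameter values and a random stand-in image, not the authors' implementation, which uses a semi-implicit finite volume scheme.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def structure_tensor(u, sigma=0.5, rho=4.0):
    """Gradient of the sigma-smoothed image, averaged component-wise at scale rho."""
    us = gaussian_filter(u, sigma)
    uy, ux = np.gradient(us)                 # row (y) and column (x) derivatives
    J11 = gaussian_filter(ux * ux, rho)
    J12 = gaussian_filter(ux * uy, rho)
    J22 = gaussian_filter(uy * uy, rho)
    return J11, J12, J22

def ced_eigenvalues(J11, J12, J22, alpha=0.001, C=1.0):
    """Eigenvalues mu1 >= mu2 of the 2x2 structure tensor (closed form), the
    coherence (mu1 - mu2)^2, and the CED diffusion-tensor eigenvalues."""
    trace = J11 + J22
    disc = np.sqrt((J11 - J22) ** 2 + 4.0 * J12 ** 2)
    mu1, mu2 = (trace + disc) / 2.0, (trace - disc) / 2.0
    coherence = (mu1 - mu2) ** 2
    lam1 = np.full_like(coherence, alpha)    # weak diffusion across the flow
    lam2 = np.where(coherence > 0,
                    alpha + (1.0 - alpha) * np.exp(-C / np.maximum(coherence, 1e-12)),
                    alpha)                   # strong diffusion along coherent structures
    # Orientation of the dominant eigenvector (highest gray-value fluctuation);
    # the coherence direction is orthogonal to it.
    theta = 0.5 * np.arctan2(2.0 * J12, J11 - J22)
    return mu1, mu2, coherence, lam1, lam2, theta

if __name__ == "__main__":
    u = np.random.rand(64, 64)               # stand-in image
    J11, J12, J22 = structure_tensor(u)
    mu1, mu2, coh, lam1, lam2, theta = ced_eigenvalues(J11, J12, J22)
    print("mean coherence:", coh.mean())
```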

105 Equation (1.1) has an evolution in time which simulates an iterative process instead of a real evolution in time. In this case, t is known as a scale. As t is gradually increased from t = 0 in which ( ) ( ) is the original image, some diffusion processes produce a family * + of gradually smoother versions of f which is known as scale-space. Images usually contain structures at a large variety of scales. An image representation at multiple scales is useful in cases where it is not clear in advance which the right scale for the depicted information is. Moreover, a hierarchy of image structures can be obtained by comparing the structures at different scales which eases a subsequent image interpretation. Figure 2 shows a scale-space produced by CED applied to thyroid cells. In that figure is evident a hierarchical collection of simpler images that characterize the original image. This concept is known from linear diffusion which is equivalent to Gaussian filtering [14], but this filtering dislocates edges when moving from finer to coarser scales. So an edge found in a coarse scale does not match the position of the same edge in the original image, this problem is known as the correspondence problem. Nonlinear diffusion was initially proposed to modify the linear process in order to better capture the geometry of the image itself. 3. DISCRETIZATIONS PDE methods assume that there is a continuous diffusion but images are discrete. Then it is necessary to discretize the PDE both in time (scale) and space and use discretization schemes to create the actual filter. The uniform distribution of the pixels constrains discretization meshes to be rectangular, structured, i.e., all elements and nodes have the same number of neighbors. Finite differences and finite volumes are two numerical methods suitable for meshes with those characteristics and to approximate conservative processes as in problem P1. Consider discrete times for and the time step size, h is the grid size. Let denote the approximation at pixel P of ( ). There are several discretization schemes depending on how the actual iteration depends on the previous one: explicit, semi-implicit and implicit. In the explicit scheme the next time step can be computed explicitly from the previous time step by a multiplication and without solving a system of equations. For the semi implicit and implicit, it is necessary to solve linear and nonlinear equation systems respectively [10]. The implicit method will not be considered in this study. After the discretization in scale and space and the boundary condition are applied to Equation (1.1) and assuming central-space and forward time the scheme equations are given by: Explicit: ( ), Semi-Implicit: ( ), and Implicit: ( ). Here ( ) is a functional notation to say that matrix A depend in the actual solution. 4. EXPERIMENTAL RESULTS CED was implemented using a semi implicit finite volume scheme for P1 introduced by [7]. Three problems are used to show its usefulness: Image restoration, enhancement of biological hyperspectral images (HSI) and enhancement of remote sensed data. 4.1 Image Restoration Figure 4 shows enhancement of a Claude Monet's painting Woman with a parasol, looking left. CED is used to help art curators to find the orientations of the artist s strokes. CED enhances the texture of clouds and completes the flow-like structures as line-like. In the fingerprint image the lines are completed and the restoration is evident. Figure 4. 
Images after CED of Woman with a parasol, looking left [9] and fingerprint image. Both images where processed with σ=.5 and t = 20 and t = 50 respectively. 4.2 Enhancement of Biological HSI Figure 5 shows thyroid cells, they were collected with a Citoviva hyperspectral microscope with range from nm with 16 bands each of 20 nm. Also, it shows that filtering this kind of cells with CED completed borders in the small round-shape cells. t = 10 iteration were done since it is desirable a small enhancement that preserves the main characteristic of the cells. Figure 5. Enhancement of Thyroid cells. First row: Original Images. Second row: CED with n = 10; ρ = 4 σ = Enhancement of Remote Sensed Data. Figure 6 shows a region of interest (ROI) of a quick view image (gray scale) taken from AVIRIS webpage [6] and Figure 7 are ROIs from MODIS sensor taken from North America between Washington state and Alaska, with spatial resolution of 500 m [12]. They show wakes, plums and ships which are enhanced by 105

106 closing interrupted lines and there is no enhancement of its surrounding as the case of the ocean and cloud images. a) c) e) b) d) f) Figure 6. Enhancement of a ship and its wake. ROI from AVIRIS flight over Deep Horizon Gulf Of Mexico Oil Spill [6]. a)original Image, b) After CED, c-d)zoomed portion of red box), e-f) Zoomed portion, green box. Figure 7. MODIS images [1]. First column: Plumes captured by altostratus clouds and its CED with t = 100, = 6, = 4. Second Column: Boat and its CED with t = 150, ρ = 10, σ = SUMMARY AND CONCUSIONS CED generates a scale-space evolution by means of a nonlinear anisotropic diffusion equation. Its diffusion tensor reflects the local image structure by using the same set of eigenvectors as the structure tensor. The latter is better suited to define the diffusion tensor than the smoothed gradient, since it does not produce cancellation effects and more information can be extracted from it. The eigenvalues of the diffusion tensor are chosen in such a way that diffusion acts mainly along the direction with the highest coherence, and becomes stronger when the coherence increases. CED usefulness was illustrated by applying it to scalar and vector valued images. In addition this method can be used for remote sensing applications as the enhancement of wakes, ships and plums from satellite data. 6. ACKNOWLEDGMENTS This material is based upon work supported by the U.S. Department of Homeland Security under Award Number 2008-ST ED0001. This work used facilities of Gordon-CenSSIS, The Bernard M. Gordon Center for Subsurface Sensing and Imaging Systems, under the Engineering Research Centers Program of the National Science Foundation (Award Number EEC ). The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied of the National Science Foundation or the U.S. Department of Homeland Security. The Authors thanks Professors Paul Castillo and Olga Drblìková for useful comments during the implementation process and Prof. Badrinath Roysan from the University of Houston for providing the Thyroid Tissue image data set. 7. REFERENCES [1] About MODIS. [2] Bigün, J. and Granlund, G.H Optimal Orientation Detection of Linear Symmetry. IEEE First International Conference on Computer Vision (London, Great Britain, June 1987), [3] Burgeth, B. et al A general structure tensor concept and coherence-enhancing diffusion filtering for matrix fields. Visualization and Processing of Tensor Fields. (2009), [4] Catte, F. et al Image selective smoothing and edge detection by nonlinear diffusion. SIAM J. Numer. Anal. 29, 1 (1992), [5] Cottet, G.H. and Germain, L Image processing through reaction combined with nonlinear diffusion. Mathematics of Computation. 61, 204 (1993), [6] Deep Horizon Gulf Of Mexico Oil Spill flight ID: f100517t01, run Id: $p00_r06$. [7] Drblíková, O. and Mikula, K Convergence Analysis of Finite Volume Scheme for Nonlinear Tensor Anisotropic Diffusion in Image Processing. SIAM J. Numer. Anal. 46, 1 (2007), [8] Fröstner, W. and Gülch, E A Fast Operator for Detection and Precise Location of Distinct Points, Corners and Centers of Circular Features. Proceedings of the ISPRS Intercommission Workshop on Fast Processing of Photogrammetric Data [9] Monet, C Woman with a parasol, looking left. [10] Strikwerda, J.C Finite difference schemes and partial differential equations. Society for Industrial Mathematics. 
[11] Van Gogh, V Road with cypress and star. [12] Weickert, J Multiscale texture enhancement. Computer Analysis of Images and Patterns. V. Hlavàc and R. Šára, eds. Springer Berlin / Heidelberg [13] Weickert, J A review of nonlinear diffusion filtering. Scale-space theory in computer vision. (1997), [14] Weickert, J Anisotropic Diffusion in Image Processing. Teubner-Verlag. [15] Weickert, J Coherence-Enhancing Diffusion Filtering. International Journal of Computer Vision. 31, 2-3 (1999), [16] Weickert, J Coherence-Enhancing Diffusion Filtering. International Journal of Computer Vision. 31, 2-3 (1999), [17] Weickert, J Coherence-enhancing diffusion of colour images. Image Vision Comput. 17, 3-4 (1999),

107 Data Quality in a Light Sensor Network in Jornada Experimental Range Gesuri Ramirez Computer Science University of Texas at El Paso 500 W. University Ave. El Paso Texas, Tel: (915) gramirez12@miners.utep.edu Olac Fuentes Computer Science University of Texas at El Paso 500 W. University Ave. El Paso Texas, Tel: (915) ofuentes@utep.edu Craig E. Tweedie Biology University of Texas at El Paso 500 W. University Ave. El Paso Texas, Tel: (915) ctweedie@utep.edu ABSTRACT Assessing the quality of sensor data in environmental monitoring applications is important, as erroneous readings produced by malfunctioning sensors, calibration drift, and problematic climatic conditions such as icing or dust, are common. Traditional data quality checking and correction is a painstaking manual process, so the development of automatic systems for this task is highly desirable. This study investigates machine learning methods to identify and clean incorrect data from a real-world environmental sensor network, the Jornada Experimental Range, located in Southern New Mexico. We analyze several learning algorithms and conclude that learning algorithms are an effective way of cleansing this type of datasets. KEYWORDS Data cleaning. Assessing data quality. Sensor networks. Machine learning. 1. INTRODUCTION Wireless sensor networks (WSN) allow for monitoring large areas that are remote, difficult to access, and/or dangerous to humans. Sensor networks are capable of covering large areas and producing measurements in near real time [1], [2]. A sensor network consists of autonomous devices, called sensor nodes, capable of measuring, processing and logging data for sensors connected to them. Sensor nodes are able to measure various environmental variables such as ambient light, temperature, pressure, humidity, and rain, among others. In this paper we focus only on light sensors, but the methods are general enough to be applied to other modalities. There are at least two limitations to the current hardware of sensor nodes. The first is the electric power needed to maintain a node, including the ability to store, process and transmit data. The second is the node's storing capacity. These limitations and methods to detect incorrect data values are further discussed in [2], [3], and [4]. These hardware limitations make the development of software for node self-monitoring a challenge. Sensor nodes usually lack the processing capacity to run complex data quality assessment programs to detect when a sensed value is likely to be incorrect, which can occur due to external noise, broken sensors, or other reasons. Error detection is usually not an easy task and can be approached in various ways. Some approaches involve the detection of incorrect values as soon as they are sensed [5], [6], and [7], however, this requires extra energy consumption from the senor nodes and is not feasible in remote areas. Thus approaches where nodes are limited to sensing and transmitting data and where data cleaning happens off-site as a post-processing stage are beginning to be more commonly used. In this paper we describe an automatic data cleaning system based on machine learning. The system takes advantage of the redundancy of data provided by sensors that monitor neighboring areas at similar wavelengths to detect inconsistent sensor readings that may indicate malfunctions or excessive noise. 
We present experimental results showing the application of our system to clean data from a real-world environmental sensor network, the Jornada Experimental Range (JER), located in Southern New Mexico, a United States Department of Agriculture - Agricultural Research Service station. We analyze several learning algorithms and data replacement schemes and conclude that learning algorithms are an effective way of cleansing this type of datasets. Figure 1. Sensor network at Jornada UTEP site. 8 towers measuring 6 difference species. Figure 2. Light sensors. PAR and SR. 107
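The following sketch illustrates the general idea on synthetic data: one regressor per sensor is trained to predict that sensor from all the others, test readings whose residual exceeds a threshold are flagged, and flagged values are replaced by the prediction. It uses a k-nearest-neighbors regressor from scikit-learn as a stand-in model, and the data, fault magnitude, and threshold are made up for illustration; the paper's actual models, thresholds, and replacement strategy may differ.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the light-sensor matrix: rows are time steps,
# columns are 28 sensors observing overlapping areas (hence strongly correlated).
base = rng.random((500, 1))
clean = base + 0.05 * rng.standard_normal((500, 28))

# Simulated fault on one sensor during part of the test period (readings shifted
# upward, loosely analogous to the dust scenario described in the paper).
noisy = clean.copy()
target = 5
noisy[400:450, target] += 0.5

train, test = slice(0, 350), slice(350, 500)
others = np.delete(np.arange(clean.shape[1]), target)

# One predictor per sensor: here, k-NN regression of the target sensor
# from the readings of the remaining 27 sensors.
model = KNeighborsRegressor(n_neighbors=3)
model.fit(clean[train][:, others], clean[train][:, target])

train_resid = clean[train][:, target] - model.predict(clean[train][:, others])
pred = model.predict(noisy[test][:, others])
resid = noisy[test][:, target] - pred

flags = np.abs(resid) > 3 * train_resid.std()   # flag suspicious readings
cleaned = noisy[test][:, target].copy()
cleaned[flags] = pred[flags]                    # simple replacement: use the prediction

print("flagged", int(flags.sum()), "of", flags.size, "test readings")
```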

108 Figure 3. (a) Hobo tower logging light 28 sensors for days. (b) Full data-set of the 28 sensors. (c) Percentage of dust in a light sensor. 2. PROBLEM Data cleaning using traditional methods is problematic and time consuming because it requires manual review of the data. Thus the design of methods to automate this process is becoming an active research area. One such method was proposed by Deresynski and Dietterich [8]. That method uses long-term (multiyear) historical records of a single sensor's readings in order to derive a probabilistic model of its behavior over time. Afterwards, the quality of future readings can be assessed by computing their likelihood given the model. Our approach is similar, but instead of probabilistic models we use neural and instance-based learning algorithms and derive predicted sensor readings from observations made by multiple sensors. The learning algorithms exploit the redundancy provided by sensors that monitor neighboring and overlapping areas in order to learn the interdependencies of the sensors' behaviors. This allows us to make accurate predictions without needing the long-term data used in [8]. 3. METHODS The general approach consists of learning to predict the value of each individual sensor given the values of a set of related sensors, thus a network of n sensors requires n predictors. This is computationally expensive and thus makes our method better suited to post-processing, rather than on-line monitoring. We compute the likelihood of a given reading to be erroneous by comparing it with the predicted value, and, if it is found to be inaccurate, we replace its value using a replacement strategy that takes into consideration the prediction made by the learning algorithm as well as the sensed value. We used three well-known learning algorithms to predict sensor readings: a feed-forward artificial neural network (ANN), k-nearest neighbors (KNN), and locally-weighted regression (LWR). For a detailed description of the algorithms, see [9]. 4. Data Source - WSN Set-up at the Jornada The data we used originates from a wireless sensor network that was installed at a UTEP (University of Texas at El Paso) site on the Jornada Experimental Range, located near Las Cruces, New Mexico, to collect pilot data for a given sensor type (light sensors). The data collected was filtered and corrected to be used as dataset for test the learning algorithms to test. Figure 4. Output vs. prediction from fit ANN with 8 units in the hidden layer. The light sensors are setup in the following manner. There is one PAR and one SR sensor facing upward to measure the incoming sun light and 15 PAR and 15 SR sensors facing downward to measure reflected sun light throughout the entire sensor network sample area. The sensors were placed in pairs to monitor each dominant land cover type representative of the sampled ecosystem, which consists of plant species and bare ground (see Figure [1]). These dominant plant species are tarbush, honey mesquite, creosote, fluff grass, and bush muhly grass. Figure 5. Output vs. prediction from 3NN for sensor PAR14. The sensor network was installed by the Systems Ecology Lab (SEL) at the Jornada Experimental Range in Las Cruces, NM. The wireless sensor network was installed along a transect where six different species are being studied (see Figure [1]). The sensor types include: 1 pressure, 6 rainfall, 8 leaf wetness, 16 soil moisture, 16 Photosynthetically Active Radiation (PAR), and 16 Solar Radiation (SR) sensors. 
The sensor network is comprised of eight sensor nodes placed in a 110 meter long tramline. Along the transect, there were 17 selected plants, 4 different soils, and 2 open areas. The open areas are used for control purposes. There is more information about the site in Herrera et al. [10]. The main purpose of this facility is to measure the attributes of the studied 108

109 species, compare data from the sensor network with data from a robotic tram system [11], [12], and to analyze these data for monitoring carbon, energy, and water balance in the Chihuahuan desert. The entire sensor network consists of 87 sensors in 8 sensor nodes but for the purpose of this study, we were only interested in light sensors. There are 32 installed light sensors in the WSN. 5. DATASET The data-set was collected at the UTEP site on the JER. The dataset comes from data recorded every 5 minutes with samples occurring every 30 seconds. The data-set contains 3095 measures over a hour period (see Figure 3 (b)). For our tests, there were 14 PAR & SR sensors looking up and 14 PAR & SR sensors looking down. The canopy of the sensors was creosote (see Figure 3 (a)). The entire dataset was normalized in order to have the same range values for all the sensors. A new generated dataset was constructed in order to simulate errors caused by dust, snow, and bad functionality. To simulate dust numbers were randomly generated to increases data readings up to 50%. In that range, dust starts to accumulate, but wind can increase or decrease that percentage. Dust can never reach 100% or 0% due to wind however (Figure 3 (c)). For snow the error is similar to dust but it is just for one day. Malfunction of sensors can occur when the sensor reports a lower value than expected. Another possible error is when a sensor is disconnected. For each case, we generated 28 datasets simulating the errors. 6. TESTING The test uses the dataset collected from the network without additional noise and the dataset with simulated noise in order to compare the performance of the three learning algorithms (knearest neighbors, locally-weighted regression, and feed-forward artificial neural network) at predicting the readings of a single sensor given the readings of the other 27 sensors on the network as input. The general conditions for the tests are the following. With respect to the algorithms used (ANN, KNN and LWR), each test uses the standardized dataset, where each attribute is re-scaled to have zero-mean and unit variance, and 31 inputs and only one output. The artificial neural network has 22, 22x18, 8, and 4 units in the hidden layer, the layered, the KNN algorithm utilizes k=5, k=3, and k=1 and the LWR algorithm utilizes 200 data points to construct a local approximation. All tests were done using a 10- fold cross-validation process with an error calculated from the square root of the mean square error (MSE). To measure if each value is correct, we calculate the distance between the noise-free data value and the predicted value for each instance. A relatively small distance value signifies a good quality value and quality decreases when the distance increases. 7. RESULTS Table 1 shows that the best ANN set-up is with 8 units in the hidden layer. Table 2 shows that LWR has the best accuracy predicting new values. The rest of the tests were made using the noisy dataset with the ANN, KNN, and LWR. For Figures 7, 8, and 9, the blue line is the bad values of a specific sensor on specific error. The black circle is the original and correct value of that sensor. The green dot is the predicted value. Finally, the red line is the MSE that is the difference between the predicted and incorrect value. While the red line is far from zero, the data quality could be worse. The Figure 9 shows the best accuracy with LWR. Table 1. Results of tests using ANN. 
Test 1 Test 2 Test 3 Test 4 Algorithm: ANN Goal: Epoch: Hidden 22 22x units: MSE: Time: 786.3s s 233.0s 162.9s Figure: 4 Table 2. Results of tests with KNN and LWR algorithms. Test 6 Test 7 Test 8 Test 9 KNN K: LWR MSE: Time: 7.6s 6.8s 7.6s 20.1s Figure: CONCLUSIONS AND FUTURE WORK As shown in Table 2, the best algorithm to predict correct values was locally-weighted regression with 200 data points to construct a local approximation. The ANN algorithm is the second best; its only drawback is that the training time is long. However once they are trained, a prediction of a new value or dataset is fast. For LWR the problem is when a new instance is processed, it needs to process the whole training dataset to predict the correct value. Figure 6. Output vs. prediction from LWR for sensor PAR

Figure 7. 3NN, sensor SR05 looking up; the error is dust that starts at 150 and finishes at

Figure 8. ANN, sensor SR05 looking up; the error is dust that starts at 150 and finishes at

Our next goal is to complete the development of a fully automated system for data cleansing. We plan to take the following steps:
- Test the system with datasets with more than one failing sensor.
- Add new sensor modalities (solar panel voltage, rain gage, pressure, leaf wetness, etc.).
- Extend the experiments to larger datasets.
- Test alternative methods to assess the data quality, particularly methods that allow multi-modal distributions to model expected sensor values.

9. ACKNOWLEDGEMENTS
This work is supported in part by NSF grants HRD and . We would like to thank Geovany Ramirez, Jose Herrera, Libia Gonzalez, Aline Jaimes, and Mark Lara for help in different phases of this project. Presentation of this poster was supported in part by NSF Grant CNS .

Figure 9. LWR, sensor SR10 looking down; the error is bad functionality that starts at 1000 and finishes at

REFERENCES
[1] D. Estrin, R. Govindan, J. Heidemann, and S. Kumar, Next century challenges: scalable coordination in sensor networks, in MobiCom 99: Proceedings of the 5th Annual ACM/IEEE International Conference on Mobile Computing and Networking, New York, NY, USA, ACM.
[2] K. Romer and F. Mattern, The design space of wireless sensor networks, IEEE Wireless Communications, vol. 11, Dec.
[3] I. F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, Wireless sensor networks: a survey, Computer Networks, vol. 38, no. 4.
[4] J. Gama and M. M. Gaber, eds., Learning from Data Streams. Springer.
[5] J.-Z. Chen, C. C. Yu, M. T. Hsieh, and Y. N. Chung, Employing CHNN to develop a data refining algorithm for wireless sensor networks, in 2009 WRI World Congress on Computer Science and Information Engineering, vol. 1, March.
[6] M. Shuai, K. Xie, G. Chen, X. Ma, and G. Song, A Kalman filter based approach for outlier detection in sensor networks, in 2008 International Conference on Computer Science and Software Engineering, vol. 4, December.
[7] J. Green, Bhattacharyya, and B. Panja, Real-time logic verification of a wireless sensor network, in Proceedings of the 2009 WRI World Congress on Computer Science and Information Engineering (CSIE 09), vol. 3, Los Angeles, CA, March.
[8] E. Dereszynski and T. Dietterich, Probabilistic models for anomaly detection in remote sensor data streams, 23rd Conference on Uncertainty in Artificial Intelligence (UAI-2007).

[9] T. M. Mitchell, Machine Learning. New York: McGraw-Hill.
[10] J. Herrera, A. Jaimes, J. Brady, G. Ramirez, C. E. Tweedie, and D. P. Peters, Utilizing novel technologies and cyberinfrastructure to understand global change impacts and feedbacks in an arid ecosystem. The 94th ESA Annual Meeting, August 2-7, 2009.
[11] J. A. Gamon, Y. Cheng, H. Claudio, L. MacKinney, and D. A. Sims, A mobile tram system for systematic sampling of ecosystem optical properties, Remote Sensing of Environment, vol. 103, no. 3, Spectral Network.
[12] J. Herrera, A. Jaimes, G. Ramirez, L. Gonzalez, and C. E. Tweedie, A robotic tram system used for understanding the controls of carbon, water and energy land and atmosphere exchange at the Jornada Basin Experimental Range. The 95th ESA Annual Meeting, August 1-6, 2010.

The use of low-cost webcams for monitoring canopy development for a better understanding of key phenological events

Libia Gonzalez, University of Texas at El Paso, 500 W. University Av., El Paso TX, (915), lgonzalez16@miners.utep.edu
Geovany Ramirez, University of Texas at El Paso, 500 W. University Av., El Paso TX, (915), garamirez@miners.utep.edu
Craig E. Tweedie, University of Texas at El Paso, 500 W. University Av., El Paso TX, (915), ctweedie@utep.edu

ABSTRACT
Long-term phenological observations offer a solid approach to understanding climate change and its implications on the timing of plant cycles. The understanding of these events can be further facilitated with the employment of near-remote sensing such as eddy covariance, hyperspectral measurements, and imaging sensors. This study utilizes and incorporates a range of technologies and cyberinfrastructure developed and located at the Jornada Basin Experimental Range, host of the Jornada Basin Long Term Ecological Research (LTER) program, with the overarching goal of understanding the controls and pathways of carbon, water, and energy uptake, storage, and release at multiple scales in arid ecosystems. The objective is to use autonomous time-series photography made at high frequency using low-cost digital webcams to record variation over seasonal phenological events of vegetation. Linking canopy development with flux monitoring networks can thus provide a better understanding of key phenophases and their role in ecosystem function. Four low-cost digital webcams are evaluated for consistency of their red, green, and blue (RGB) channels and for deriving the relative brightness of an area of interest within each image. The cameras were placed and positioned to capture the same image; from this image, the same area of interest was selected and analyzed using MATLAB. The RGB channels were extracted and the overall brightness was calculated. Preliminary results indicate that minimal differences in channel percentages exist between images, which makes the approach well suited to comparing images from multiple cameras through time. The use of low-cost digital webcams along with field observations of plant phenological events provides an adequate means of monitoring seasonal landscape development.

KEYWORDS
Phenology, digital webcams, RGB image analysis, Jornada Basin Experimental Range, digital imaging, image processing.

1. INTRODUCTION
Phenology is simply nature's calendar; this schedule is crucial for the plant life-cycle events that sustain life [3]. Long-term phenological observations offer a solid approach to understanding climate and climate change and their implications on the timing of plant cycles. Ecosystems and life-forms are highly influenced by phenology, including interactions among organisms, such as their seasonal behavior, and their biological and physical processes. Phenological observations improve understanding of plants' response to climate change. Near-remote sensing observations using repeat digital photography help us capture the exact dates of key phenological events, like summer green-up and autumn senescence, along with weekly field phenological observations of 215 individuals of the 5 dominant species.

2. OBJECTIVES
The overarching goal of this project is to develop and test a new method for documenting plant phenological change in the northern Chihuahuan Desert using webcams. To achieve this, this study links: (1) field-based phenophase monitoring, (2) acquisition and post-processing of webcam repeat digital photography, and (3) spectral measurements from a robotic tramline.
Specifically, the objectives and underlying questions of this study are:
(1) Monitor phenophase development of key plant species and their physical environment to determine the following:
a. Do key plant species differ in their seasonal and temporal patterns of phenological development?
b. How is plant phenophase development related to changes in the physical environment such as temperature, day length, and soil moisture?
(2) Develop a network of webcams and image processing software to automate the acquisition and post-processing of imagery suitable for documenting landscape-level phenological change by answering the following:
a. What types of cameras are suitable to detect phenological development by extracting RGB color bands?
b. Can the extraction of RGB color bands capture spatial and temporal phenological patterns?
c. What type of software can we develop to automate the acquisition, storage, and analysis of digital images?
d. Can we document variation over seasonal phenological events of vegetation in arid and semi-arid ecosystems using autonomous photo time-series made at high frequency with low-cost digital webcams?

(3) Cross-calibrate measurements of landscape-level phenological development made with webcams against field-based phenophase measurements and spectral indices derived from a robotic tram system. This activity will specifically address the following questions:
a. How can we link canopy development, repeat digital photography, and spectral measurements for a better understanding of key phenological events and their role in ecosystem function?
b. How are webcam RGB analysis and field-based phenophase monitoring related? How sensitive are webcams at detecting phenophases?
c. How can we detect phenotypic plasticity utilizing webcams, i.e., detect the ability of an organism to change its phenotype in response to changes in the environment using low-cost webcams?

3. METHODS
3.1 Study Site
Research for this study was conducted at the Jornada Basin Long Term Ecological Research site (JER-LTER-USDA), Las Cruces, NM, USA (see Figure 1), 1188 m above sea level. The desert composition at Jornada Basin is dominated by the shrubs Creosotebush (Larrea tridentata), Mesquite (Prosopis glandulosa), and Tarbush (Flourensia cernua), and the grasses Bush muhly (Muhlenbergia porteri) and Woollygrass (Dasyochloa pulchella). Historically, the site has undergone a broad-scale conversion from perennial grasslands to dominance by xerophytic woody plants. The climate is an arid mid-latitude desert zone (BWk), and the terrain has a gentle slope of ~2 degrees.

3.2 Phenophase Monitoring
In March 2010, three phenology transects were positioned to monitor noticeable stages in the annual phenological cycle of the predominant species (phenophases). Phenophase development is being monitored at sites every 50 m, along two 300 m transects and one 110 m transect, within the tower's footprint, following the plan used by the USDA Forest Service Forest Inventory and Analysis (FIA) program. The design is appropriate for regional scaling and periodic measurement cycles. Weekly phenophase monitoring is being conducted at the phenology transects, collecting phenophase measurements following the National Phenology Network protocols. The rationale of the long-term phenological study at the Systems Ecology Lab (SEL) research site is to monitor the phenological stages (phenophases) of shrubs and perennial grasses from week to week. This monitoring results in the detailed observation of selected plants in each of their phenological states, from which percentages can be calculated. These percentages can be used to answer several ecological questions, such as: What percentage of individuals of a particular species flowered, and with what timing? What percentage of individuals which flowered went on to produce fruit?

Field Methodology
In the North/West (290°) and South (180°) transects, we have monitoring sites every 50 m with three individuals (with some exceptions where individuals were not found) of each dominant species listed in Table 1:

Symbol  Scientific Name       Common Name     Family          Growth Habit          Duration
PRGL    Prosopis glandulosa   Honey mesquite  Fabaceae        Deciduous tree/shrub  Perennial
FLCE    Flourensia cernua     Tarbush         Asteraceae      Deciduous shrub       Perennial
LATR    Larrea tridentata     Creosote bush   Zygophyllaceae  Evergreen shrub       Perennial
MUPO    Muhlenbergia porteri  Bush muhly      Poaceae         Graminoid             Perennial
DAPU    Dasyochloa pulchella  Woollygrass     Poaceae         Graminoid             Perennial

Table 1.
List of dominant species; information taken from the USDA (United States Department of Agriculture) Plants Database.

Individuals Selection
When selecting the individual plants to observe, I considered the following issues (based on the USA National Phenology Network plan for phenology monitoring):
Plant health - Choose healthy plants, physically undamaged, and free of insects and disease.
Number of individuals - Monitoring 3 individuals of the same species at a site gives an idea of the variation in phenology among plants.

Figure 1. Systems Ecology Lab (SEL) research site; the red triangle corresponds to the Eddy Covariance Tower, and the green flags represent the three phenophase monitoring transects; the Western transect is located along a robotic tram line.

I

select individuals that grow in a similar environment, receiving the same amount of sun or shade, and that are not direct neighbors.
Which plants - The plants selected were within a 10 m perimeter of the 50 m mark, within the main transect. We selected three individuals from each species: a small, a medium, and a big one.
Individuals marking - Each individual (Table 2) is marked with a wood bar painted a different color for each species and with a metal identification tag that contains the code and number of the individual. Example: PRGL 1 means Prosopis glandulosa (mesquite) #1 from the specific site you are at; NW (Northwest Transect) Site 1 means the first 50 m, and so on.

Table 2. Total of individuals per site.
(Rows: plots per transect and sites along the South, North-West, and East transects; columns: individuals per site of MUPO, DAPU, LATR, PRGL, and FLCE, with sub totals and a TOTAL.)

East transect
The east transect is a 110 m transect positioned parallel to the robotic tramline system. For this transect we selected 10 individuals of the 5 dominant species. The selected individuals are plants that are being monitored both by a UniSpec (an instrument that obtains leaf/canopy reflectance measurements) and by direct phenophase field observations. These individuals are monitored from a boardwalk specially made to avoid disturbance in the areas of interest.

3.3 Web-cam Selection
We reviewed four different cameras from three different manufacturers. The test used the Image Acquisition and Image Processing toolboxes of MATLAB. The site for our study was the green roof of the Biology Building at The University of Texas at El Paso (UTEP); as scenery, the cameras were pointed at the same area of interest under the same lighting conditions. Figure 2 displays the cameras' viewing area. We selected similar regions-of-interest (ROI) from each camera so that the ROIs could be compared. The camera color channels Red, Green, and Blue (RGB) were extracted from the regions of interest of each camera. The overall brightness (Total RGB) of the ROI is calculated as in Eq. 1 and used to calculate the relative (or normalized) brightness of each channel as in Eq. 2:

Total RGB = Red + Green + Blue    (1)
Channel % = Channel / Total RGB   (2)

Figure 2. Test images of the webcams made on UTEP's green roof.

To differentiate changes in canopy state, we normalize the size of the area of interest to Y = 300, X = 500 for all the images and run the relative channel brightness formula (see Fig. 3).

Figure 3. Channel percentage: 1 Red, 2 Green, 3 Blue.

We found that the Microsoft Vx7000 webcam is the best candidate for this study, since its results were extremely similar to those from a higher-resolution (8 MP) camera; in addition, the Microsoft Vx7000 is a low-cost webcam.

Webcam specifications
We utilize a Microsoft webcam model Vx7000, with 1600 x 1200 pixel resolution (2 MP), a 58-degree horizontal view angle, and a manually fixed focus on the specific experimental area (the Eddy Covariance tower's footprint). Images were stored with a minimal compression factor in JPEG format. This offers a

visual record of phenological stages to quantify variation by RGB channel extraction.

Software development and testing
With the selected cameras we developed a program that facilitates image acquisition and processing. We created a user interface that displays the camera viewing area, allows selection of the regions of interest that we want to capture with each webcam, and defines the capturing schedule (see Fig. 4).

Figure 4. User interface that displays the camera viewing area.

Webcam Testing
We evaluated the image quality of the 4 Microsoft Vx7000 webcams; the sample photos from each webcam had a variety of light conditions (Fig. 5). Therefore we compared the images from the cameras against one another (Fig. 6). We selected the same region of interest from each image and ran the relative channel brightness formula (Fig. 7).

Figure 5. Webcam variety of light conditions.
Figure 6. Webcam selected regions of interest.
Figure 7. Relative Brightness channel.

Image Acquisition and Processing
1. Images are collected on a daily basis around midday, automatically, using a scheduler in MATLAB to capture digital images with the help of 4 Microsoft Vx7000 webcams; images are captured from 7:00 am to 7:00 pm.
2. Analysis is conducted on an area-of-interest (Figure 8), avoiding the mountains and sky above and the tower's fence below the area-of-interest.
3. Camera color channel information is extracted from the images (Red, Green, and Blue). The overall brightness (Total RGB) of the area-of-interest is calculated and used to calculate the relative brightness for each channel, as used in [1, 2]:
Total RGB = Red + Green + Blue
Channel % = Channel / Total RGB
4. Image analysis of the footprint was conducted on rectangular static regions of interest (ROIs) that remained constant for each

camera, as used by [1, 2]. We calculated the green index using the following formula:
Ig = (Green - Red) + (Green - Blue)
Ig = (2 x Green) - (Red + Blue)
5. We also calculated a near NDVI using the following formula:
nNDVI = (Green - Red) / (Green + Red)

Figure 8. Area of interest selection for each phenocam.

4. Robotic tram measurements
To document seasonal phenological dynamics and quantify changes in a Chihuahuan Desert shrubland, we integrated optical remote sensing with time-lapse repeat digital photography (phenocams) (Figure 9). The normalized difference vegetation index (NDVI) was captured using a UniSpec dual spectrometer at each quadrat and compared to the red, green, and blue (RGB) values from a Microsoft Vx7000 digital webcam for each date of imagery. The spectrometer is mounted on a 110-meter robotic tram system that carries a host of sensors for measuring atmospheric and ground-based optical reflectance.

Figure 9. Integration of optical remote sensing with time-lapse repeat digital photography (phenocams).

5. Results
We found that the Microsoft Vx7000 webcam is the best candidate to monitor phenophase development of key plant species and their physical environment, since its results were extremely similar to results from a higher-resolution (8 MP) camera; in addition, the Microsoft Vx7000 is a low-cost webcam. The cross-calibration between cameras shows a relatively low margin of error: minimal differences in channel percentages exist between images, which makes the approach well suited to comparing images from multiple cameras through time. Preliminary results indicate that, by extracting RGB color bands from repeat digital photography, we can capture spatial and temporal phenological patterns. This study demonstrates seasonal trends starting with a green-up response at the beginning of the study, followed by a decline on day 243 as a result of a senescing canopy. Green % had an evident response to the shift from summer to autumn. This study documents variation over seasonal phenological events of vegetation in arid and semi-arid ecosystems using autonomous photo time-series made at high frequency with low-cost digital webcams (Figure 10). Furthermore, these results were compared to canopy development and spectral measurements for a better understanding of key phenological events and their role in ecosystem function; we found that the results from the webcams were comparable to the results from the field phenological observations (Figure 11).

Figure 10. Variation over seasonal phenological events of vegetation.
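The channel-percentage, green index, and near-NDVI computations described above can be sketched in a few lines. This is an illustrative sketch assuming Python with NumPy and Pillow (the authors worked in MATLAB); the function name and the ROI coordinates are placeholders, not the project's actual code.

import numpy as np
from PIL import Image

def canopy_indices(image_path, roi=(0, 0, 500, 300)):
    # roi = (left, upper, right, lower) in pixels; the values here are placeholders.
    img = np.asarray(Image.open(image_path).convert("RGB").crop(roi), dtype=float)
    red, green, blue = (img[..., c].mean() for c in range(3))
    total_rgb = red + green + blue                    # Eq. 1: overall brightness
    channel_pct = {"red": red / total_rgb,            # Eq. 2: relative brightness
                   "green": green / total_rgb,
                   "blue": blue / total_rgb}
    green_index = 2 * green - (red + blue)            # Ig = (G - R) + (G - B)
    n_ndvi = (green - red) / (green + red)            # near NDVI
    return channel_pct, green_index, n_ndvi

Running the same function over one midday image per day yields the per-channel time series whose seasonal trends are discussed in the Results.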

Figure 11. Field phenological observations.

6. ACKNOWLEDGMENTS
This work is supported in part by NSF grant CNS . We would like to thank Gesuri Ramirez, Jose Herrera, Aline Jaimes, Debra Peters, John Anderson, Dawn Browning, and the Jornada Experimental Range staff for help in different phases of this project.

7. REFERENCES
[1] Richardson, A.D., et al. Use of digital webcam images to track spring green-up in a deciduous broadleaf forest. Oecologia 152 (2).
[2] Kurc, S.A. and L.M. Benton. Digital image-derived greenness links deep soil moisture to carbon uptake in a creosotebush-dominated shrubland. Journal of Arid Environments 74 (2010).
[3] Crimmins, Michael A. and Theresa M. Crimmins. Monitoring Plant Phenology Using Digital Repeat Photography. Springer Science+Business Media, LLC.

A Model for Geochemical Data for Stream Sediments

Dr. Rodrigo Romero, University of Texas at El Paso, 500 West University Ave., El Paso, Texas
Dr. Phil Goodell, University of Texas at El Paso, 500 West University Ave., El Paso, Texas
Natalia I. Avila, University of Texas at El Paso, 500 West University Ave., El Paso, Texas

ABSTRACT
There is a large amount of geochemical data being handled by geologists. Such data is organized in tables by location or element. Although this data is available online, it is difficult to obtain the desired information in a short time due to the absence of an efficient model that allows users to retrieve and compare results. In an effort to facilitate geochemical data manipulation, two disciplines, Computer Science and Geology, joined forces. The objective is to facilitate geochemical data manipulation for geologists of all levels of expertise. In order to create an efficient model of data manipulation, the suggested approach was to create a database model that allows users to query the information based on level of expertise, category, location, element, or available information on maps. The model created for this project has laid the base for expanding the project to a wider area; the possibility now is to expand from findings in one state to the whole country.

KEYWORDS
Geochemical data, stream sediments, elements.

1. INTRODUCTION
This project was based on the geochemical findings of the state of New Mexico. Since the beginning of the project, the only available tool was the thesis of geologist Fares Howari [1]. He processed the geographical information from the NURE [2] program, and the resulting information was used to create a model for efficient data manipulation. The information was being displayed as an online document with links to see different categories, images, or tables. All this information was organized in an efficient manner so that users were able to retrieve and compare information in the same query.

2. DESIGNING THE DATA MODEL
2.1 Understanding the Content
The main objective of this project is to organize data to facilitate its manipulation, and in order to organize data, such data has to be processed and understood. The conflict comes when two disciplines have to combine knowledge in order to create the desired product. In this case, the geographical information had to be understood in order to be organized in a manner that allowed information to be queried and combined according to users' preferences.

2.2 Simplifying Data
The first approach taken was to divide the geochemical data by level of expertise; only the advanced level of expertise was going to be considered at this stage of the project, because it was the one involving more data manipulation with tables and images. Once the advanced level was defined, it was divided into different categories and subcategories. Consequently, the information was broken down into smaller components so that it was easier to determine how tables should be organized, what their primary keys were, and whether or not tables should relate to other tables.

2.3 Table Creation
After further analysis, it was determined that the model to be implemented was a Relational Database Management System (RDBMS), where the primary key of most tables was the element (see Figures 1 and 2) being analyzed by each category, and each table would relate to other tables so that more elements and categories could be compared per query.

Figure 1. Element as the primary key and its data type.
Figure 2.
Element Percentiles table.

Once the data was simplified and organized into different categories, the remaining task was to create each table and define its keys. As seen in the database diagram (see Figure 3), all tables have dependent relationships to the Element table. Therefore, the primary key for all tables is the element ID, named Element. Only when comparing results is a composite element key used. The table creation helped the geologists involved in the project understand how the data could be manipulated and how to state specifically which results to retrieve from the desired tables. With this in mind, it was easier to develop the idea of a website that could manipulate the data in the tables.
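The element-keyed design can be sketched with a small relational example. This is a hypothetical illustration using Python's built-in sqlite3 module (the project itself used SQL through PHP); the table names, columns, and sample values are illustrative, not the project's actual schema or data.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Element (
    Element TEXT PRIMARY KEY,   -- element symbol, e.g. 'Al'
    Name    TEXT
);
CREATE TABLE Percentiles (
    Element TEXT PRIMARY KEY REFERENCES Element(Element),
    P25 REAL, P50 REAL, P75 REAL   -- category values keyed by the element
);
""")
conn.execute("INSERT INTO Element VALUES ('Al', 'Aluminum')")
conn.execute("INSERT INTO Percentiles VALUES ('Al', 1.0, 2.0, 3.0)")  # placeholder values

# One element, one category, retrieved in a single join query,
# in the spirit of the website's results page.
row = conn.execute(
    "SELECT e.Element, p.P25, p.P50, p.P75 "
    "FROM Element e JOIN Percentiles p ON p.Element = e.Element "
    "WHERE e.Element = ?", ("Al",)
).fetchone()
print(row)

Because every category table shares the Element key, adding another category to a query is a matter of joining one more table on that key.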

Figure 3. Database ER diagram and table examples.

3. CREATING THE WEBSITE
3.1 Website Navigation
There was a lot of support from the geologists' team to create a website that would ultimately display all the information that they had gathered. Unfortunately, the navigation suggested for the site was very poor, due to their low expectations of how the data could be retrieved. The suggested navigation consisted of long tree views that would display the information by categories. However, after a navigation template was redesigned to accommodate as much information as possible without creating confusion for future users, everyone in the project agreed on it. The next step was to define the interfaces for the users and the pages to query the database.

3.2 User Interfaces
After much discussion, several ideas were considered for the user interfaces. The idea was to prompt the user to select an element or elements before continuing to any other portion of the navigation. The idea behind this approach was to allow the user to query as many categories as needed for the selected element or elements. With this approach, the navigation was optimized because the user was not forced to select an element for each category. If the user decided to change the element to query, the navigation would allow the user to go back to the element selection page to select another element. Originally, the interface that prompts the user to select an element was intended to be a periodic table that would display all elements so that the user could select the element to query. Unfortunately, the element findings in the state of New Mexico do not include the majority of the chemical elements. Therefore, for simplicity, only the elements found in the state of New Mexico are available for selection, with simple text selection and a submit button.

3.3 Web Page to Query the Database
When designing the web site, a major concern was whether to open a session on the database for the entire time that the user was querying the database through the website or to open a session only to display results. The second option was selected, to keep all queries running in a single page instead of having multiple pages opening sessions and querying the database. Once this decision was taken, it was time to create all the pages that would request results from this page. Also, it was time to start creating all the queries that would make each of those pages post results.

3.4 Web Pages and Their Languages
All pages were created with HTML and PHP, and the database was queried using SQL. The only page that had a great amount of PHP code in it was the page in charge of displaying the results. All other pages had a minimal amount of PHP code; most of it was used to carry the element through all pages while the user is executing queries, without having to select the element every time a query is executed. All queries were created using SQL code and PHP, and all the content of the web site was developed using HTML code. All images and maps were stored locally at the server. Maps are accessed using PHP code; however, their locations are not stored in the database.

4. EDITING MAPS
4.1 Given Information
The geologists' team provided a wide amount of information regarding maps. Most maps were organized by categories, which defeated the purpose of the selected model.
As an example, all the maps for the category Univariate Analysis had to be divided into subcategories; as shown in Figure 4, the subcategory Histograms of Raw Data had to be broken down into elements.

Figure 4. Histograms of the raw data set.

4.2 The Objective
The intention was for the user to be able to retrieve a category in which he or she would see a brief introduction to what the table and the image being displayed are explaining. Following such an introduction, an image and a table for the element should be displayed. In case the user requested information on several elements, the order would remain the same: first a brief introduction of the category and then the graphs or maps that the user requested, followed by a table with the information for all the elements selected. However, at that moment the user was able to select an element and query different categories, but no maps or graphs were being displayed (see Figure 5 [3]).

Figure 5. Example of brief introduction and table.

4.3 Achievement
In order to achieve the desired goal, all maps and graphs had to be reorganized by element, category, and subcategories. Once all this work was completed, the images were added to the information that the user was able to request. As seen in the example (see Figure 6 [3]), the user gets a title with the element being queried, a brief description of the results being displayed, an image (in this case a map), and a table. The ideal case would be to have images and graphs for all elements; however, not all elements have this information.

Therefore, the webpage was configured to let the user know when the requested information cannot be provided due to the lack of such information. In case a user queries an element for which the requested information is not found, a message saying "Sorry no records found for the given criteria." is displayed.

Figure 6. Example of a given result for Al.

5. FUTURE WORK
Based on the work from this project, it is expected to expand the website to accommodate more than one state. Geochemical research is being done in Colorado; therefore the site is expected to offer that information in the near future. Also, as part of expanding the project, the creation of a more generalized model that would allow the allocation of geochemical data for the entire country is being considered. When all these improvements are done, more elements will be available for querying. Therefore, the site would have a wider amount of information to offer its users.

6. ACKNOWLEDGMENTS
The author(s) acknowledge(s) the National Science Foundation (under CREST Grant No. HRD and Grant No. ) for its support. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

7. REFERENCES
[1] Zumlot T., Goodell P., Howari F. Geochemical mapping of New Mexico, USA, using stream sediment data. Thesis. The University of Texas at Austin.
[2] Smith S. National Geochemical Database. Version .
[3] Romero R., Goodell P., Avila N. New Mexico Geochemistry of Stream Sediments. University of Texas at El Paso.

Model Fusion: Initial Results and Future Directions

Omar Ochoa (1); Mentors: Vladik Kreinovich (1), Aaron A. Velasco (2), and Ann Gates (1)
Center of Excellence for Sharing Resources for the Advancement of Research and Education through Cyber-infrastructure (Cyber-ShARE)
(1) Department of Computer Science, (2) Department of Geological Science
The University of Texas at El Paso, El Paso, TX 79968, USA
(915) omar@miners.utep.edu

ABSTRACT
Tomography models of the Earth developed from one data set may conflict with models developed using a different data set. For example, seismic refraction analysis and the analysis of gravity data can result in different density distributions. Ideally, we should generate a single model based on the joint analysis of all related datasets. However, algorithms for such joint processing are still being developed; in the meantime, we have developed general algorithms to fuse models coming from different datasets. We have numerically validated this model fusion approach and find it to be robust and stable. Preliminary results applied to seismic and gravity data from the Rio Grande Rift reveal that statistical model fusion formulas and algorithms assume that we know the accuracy and spatial resolution of the original models, while in practice these values are rarely known. To estimate these accuracies, we compare several different models of the same distribution; the models which are closer to the average are then assumed to be more accurate. Thus, in this paper, we present the results of our preliminary work and the challenges that we have discovered.

KEYWORDS
Model fusion, Earth tomography.

1. INTRODUCTION
One of the most important studies in Earth sciences is of the interior structure of the Earth. There are many sources of data for Earth tomography models: first-arrival passive seismic data (from actual earthquakes) [5], first-arrival active seismic data (from seismic experiments) [2, 4], gravity data, and surface waves [6]. Currently, each of these datasets is processed separately, resulting in several different Earth models that have specific coverage areas, different spatial resolutions, and varying degrees of accuracy. These models often provide complementary geophysical information on Earth structure (P-wave and S-wave velocity structures), where combining the information derived from each requires a joint inversion approach. Designing such joint inversion techniques presents important theoretical and practical challenges [3]. As a first step, the notion of model fusion [1] provides a practical solution: to fuse the Earth models coming from different datasets. Since these Earth models have different areas of coverage, model fusion is especially important because some of the resulting models provide better accuracy and/or spatial resolution in certain spatial areas and depths, while other models provide better accuracy and/or spatial resolution in other areas and depths. Preliminary results applied to seismic and gravity data from the Rio Grande Rift validate the previously proposed approach and reveal that it assumes that the accuracy and spatial resolution of the original models are known. However, in practice, these values are rarely known. In order to estimate these accuracies, several different models of the same distribution are compared, and those that are closer to the average are assumed to be more accurate. In this paper, the preliminary results and the challenges that have been encountered are presented.

2.
MOTIVATION
In geophysics, and in other areas of science, there are several descriptions of the same quantity coming from the analysis of different data sets. For example, both seismic analysis and the analysis of gravity data result in a density distribution. The seismic analysis leads to a more detailed density model, with higher spatial resolution. However, the coverage area of this method is limited: only density values in the areas through which the seismic rays pass, i.e., the areas in the vicinity of the shot points and at depths not exceeding the Moho layer, contribute to the modification of velocity values. A smashed 2-dimensional seismic model of the Rio Grande Rift zone, an area that includes El Paso, TX, is presented in Fig. 1.

Figure 1. Rio Grande Rift Velocity Model

In contrast, the gravity model provides much larger coverage, but its spatial resolution is much lower, since gravity is an average over all the bodies under the measurement point. A gravity model of the Rio Grande Rift zone, including El Paso, TX, is shown in Fig. 2.

Figure 2. Rio Grande Rift Gravity Model

Different data sets provide complementary information about the geophysics of a region. It is therefore desirable to create a single model that combines all the available data types in order to gain a deeper understanding of the interior structure of the Earth.

3. IDEAL AND PRACTICAL SOLUTIONS
3.1 Joint Inversion
For each data type, the corresponding model is obtained by the corresponding inversion procedure, i.e., starting with sensor data and transforming the data into values of density and/or velocity. Ideally, a single density model should be generated based on the joint analysis of all related datasets. However, algorithms for such joint processing (joint inversion) are still being developed.

3.2 Model Fusion
Since it is not yet practical to fuse different types of data into one joint inversion, it is desirable to combine (fuse) the models coming from different datasets. General algorithms have previously been developed for such model fusion [1].

3.3 Process
The first step in the process used to create the fused model is to convert the models to a common unit. In this case, since the models to be fused are gravity and velocity models, it is only necessary to convert the velocity values into density values, since the values in the gravity model are already density values. Then the models are fused by combining the density values. The models have to be aligned so as to have the same matching coordinates; this was done by matching latitude and longitude for both experiment settings. The resulting model was visualized using the Visual Toolkit software package.

Figure 3. Fused Model from Gravity and Seismic Models

RESULTS
Preliminary results of applying this model fusion to actual seismic and gravity data from the Rio Grande Rift zone can be seen in the fused model shown in Fig. 3.

4. CHALLENGES
The first challenge, easily visible from the merged model shown in Fig. 3, is the need for smoothing some transitions. The merged model has several areas of abrupt transition. Some of these transitions are actual transitions between different geophysical layers, so these transitions should be abrupt. However, other transitions are abrupt not because of any actual difference in geophysical layers, but simply because the usual gravity model is discrete, i.e., it only has a finite number of different density values. Even when the actual transition from, for example, 5K to 6K kg/m^3 is smooth, the gravity model represents it as a sequence of layers, e.g., with values 5.0, 5.2, ..., 5.8, and 6.0. When the two models are fused, these abrupt transitions lead to similar abrupt transitions in the merged model. Therefore, to make the model more geophysically meaningful, it is desirable to smooth such transitions while retaining the abrupt transitions between layers. The second challenge is that, to get a preliminary fused model, the fusion was assumed to have equal accuracies for the two models being combined. To get a more accurate model, it is desirable to know the relative accuracy of the different models. For example, if we have two values x1 and x2, with accuracies s1 and s2, then the most accurate fusion of these two values is the variance-weighted average:
x = (x1/s1^2 + x2/s2^2) / (1/s1^2 + 1/s2^2)
It is also desirable to know the spatial resolution of different maps.
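A minimal sketch of this accuracy-weighted fusion, assuming two co-registered grids stored as NumPy arrays; the function and variable names are illustrative and this is not the authors' implementation. With equal accuracies it reduces to the plain average used for the preliminary fused model.

import numpy as np

def fuse_models(x1, x2, sigma1, sigma2):
    # x1, x2: density grids already aligned to the same latitude/longitude grid;
    # sigma1, sigma2: their (assumed known) accuracies.
    w1, w2 = 1.0 / sigma1**2, 1.0 / sigma2**2    # weights = inverse variances
    fused = (w1 * x1 + w2 * x2) / (w1 + w2)      # the most accurate combination
    fused_sigma = np.sqrt(1.0 / (w1 + w2))       # accuracy of the fused values
    return fused, fused_sigma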
The problem is that the accuracy and spatial resolution of the original models are rarely known. One way to estimate these accuracies is by comparing several different models of the same distribution; the models which are closer to the average are then assumed to be more accurate. However, such estimates can be made only if at least three different models exist, since with only two models both are equally close to the mean, and for density only two models are available. In order to apply this technique, models describing density and different velocities must be jointly fused, using the known relations between different quantities, similarly to how we transform the seismic data

into a density model by using the known relation between density and velocity. A third challenge, which is specific to the density model fusion, is that the gravity model is usually based on data from the seismic model; therefore the models to be fused are not fully independent. How to correctly take this dependence into account is still an open problem.

5. CONCLUSIONS
In many cases, there are several types of data describing the same geophysical area; for example, both seismic data and gravity data describe density and velocity. Ideally, it is desirable to perform a joint inversion of the different types of data, but such joint inversion algorithms are still being developed. As a practical solution, it is proposed to fuse the models coming from different datasets. In previous work, a theoretical background was developed for such model fusion. In this paper, the preliminary result of applying model fusion to gravity and seismic models is shown. The results are promising, in the sense that the resulting model combines the wider coverage of the gravity model with the higher spatial resolution of the seismic model. However, there is still room for improvement: there is a need to smooth the unphysical abrupt transitions, and the fact that different models have different accuracies, which might be statistically dependent, has to be investigated. Future work will be to resolve the stated challenges and to investigate the use of these techniques outside of the geophysical sciences. Additionally, an improvement to the visualization package is under development at the Cyber-ShARE Center.

6. ACKNOWLEDGMENTS
This work is funded by the National Science Foundation grants: Cyber-ShARE Center of Excellence: A Center for Sharing Cyber-resources to Advance Science and Education (HRD ), and Computing Alliance of Hispanic-Serving Institutions CAHSI (CNS ). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

7. REFERENCES
1. Ochoa, O., A. A. Velasco, C. Servin, and V. Kreinovich. Model Fusion under Probabilistic and Interval Uncertainty, with Application to Earth Sciences. In Proceedings of the 4th International Workshop on Reliable Engineering Computing REC 2010, Singapore, March 3-5, 2010.
2. Averill, M. G. A Lithospheric Investigation of the Southern Rio Grande Rift. University of Texas at El Paso, Department of Geological Sciences, PhD Dissertation.
3. Averill, M. G., K. C. Miller, G. R. Keller, V. Kreinovich, R. Araiza, and S. A. Starks. Using Expert Knowledge in Solving the Seismic Inverse Problem. International Journal of Approximate Reasoning, 453.
4. Hole, J. A. Nonlinear High-Resolution Three-Dimensional Seismic Travel Time Tomography. Journal of Geophysical Research, 97.
5. Lees, J. M. and R. S. Crosson. Tomographic Inversion for Three-Dimensional Velocity Structure at Mount St. Helens Using Earthquake Data. Journal of Geophysical Research, 94.
6. Maceira, M., S. R. Taylor, C. J. Ammon, X. Yang, and A. A. Velasco. High-resolution Rayleigh Wave Slowness Tomography of Central Asia. Journal of Geophysical Research, Vol. 110, paper B06304.

Prospec 2.2: A Tool for Generating and Validating Formal Specifications

Jesus Nevarez, Omar Ochoa; Mentor: Ann Gates
Department of Computer Science, The University of Texas at El Paso, El Paso, TX 79968, USA
{jrnevarez2, agates@utep.edu
Mentor: Salamah Salamah
Department of Computer and Software Engineering, Embry Riddle Aeronautical University, Daytona Beach, FL 32114, USA
salamahs@erau.edu

ABSTRACT
Formal approaches to software assurance such as runtime monitoring, model checking, and theorem proving require the formal specification of behavioral properties to verify a software system. Creation of formal specifications is difficult and requires a strong mathematical background. Traditionally, there has been inadequate tool support for users in the creation of formal specifications. The Property Specification tool, Prospec, was developed to assist users in the creation of formal specifications. This paper describes the current features of Prospec 2.2 and the planned future direction of the tool.

KEYWORDS
Software engineering, formal specifications, LTL.

1. INTRODUCTION
Formal approaches to software assurance such as runtime monitoring, model checking, and theorem proving require the formal specification of behavioral properties to verify a software system. Creation of formal specifications is difficult and requires a strong mathematical background. For example, model checkers such as SPIN [1] and NuSMV [2] use formal specifications written in Linear Temporal Logic (LTL) [3][7], which can be difficult to read, write, and validate. This problem is compounded if properties must be specified in more than one formal language. These limitations and the lack of adequate tools for this task deter the adoption of formal methods in software development. Prospec 2.2 addresses some of the challenges of creating formal specifications with easy-to-use graphical user interfaces that guide the user throughout the process of creating formal specifications. This paper describes the current features of Prospec 2.2, improvements to the graphical user interface, and the planned future improvements for the tool.

2. BACKGROUND
The Prospec tool uses the Specification Pattern System (SPS) [4] and Composite Propositions (CP) [5] to assist developers in the elicitation and specification of system properties. The SPS [4] is a set of patterns used to assist in the formal specification of properties for finite-state verification tools. SPS patterns are high-level abstractions that provide descriptions of common properties holding on a sequence of conditions or events in a finite state model. The two main types of SPS patterns describe the occurrence and the order of events or conditions. Occurrence patterns are universality, absence, existence, and bounded existence. Order patterns are precedence, response, chain of precedence, and chain of response. In SPS, a pattern is bounded by the scope of computation over which the pattern applies. The beginning and end of the scope are specified by the conditions or events that define the left (L) and right (R) boundaries, respectively.
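To make the pattern and scope notions concrete, the sketch below instantiates a few of the standard SPS pattern-to-LTL mappings for the Global scope, written in SPIN-style syntax where [] is "always" and <> is "eventually". It is an illustrative Python sketch, not Prospec's implementation, and it covers only single (atomic) propositions.

# SPS pattern -> LTL templates for the Global scope; {P} and {S} are
# placeholders for the propositions supplied by the user.
LTL_GLOBAL = {
    "absence":      "[](!{P})",             # P never holds
    "universality": "[]({P})",              # P always holds
    "existence":    "<>({P})",              # P eventually holds
    "response":     "[]({P} -> <>({S}))",   # every P is eventually followed by S
}

def instantiate(pattern, **props):
    # Fill a template with concrete propositions, e.g. P="request", S="ack".
    return LTL_GLOBAL[pattern].format(**props)

print(instantiate("response", P="request", S="ack"))   # [](request -> <>(ack))

Other scopes (before R, after L, between L and R) wrap these templates in additional temporal conditions, which is where hand-written LTL becomes error-prone and tool support such as Prospec pays off.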
Composite Propositions were introduced by Mondragon et al. [5] to handle pattern and scope parameters that represent multiple conditions or events. The introduction of CPs supports the specification of concurrency, sequences, and non-consecutive sequential behavior in patterns and scopes. Mondragon proposes a taxonomy with 12 classes of CPs. In this taxonomy, each class defines a detailed structure or sequential behavior based on the types of relations that exist among the set of propositions.

3. MOTIVATION
The first version of Prospec, Prospec 1.0, provided users with guidance through the process of defining specification properties. However, studies performed on Prospec 1.0 provided valuable feedback on how to improve the tool: to provide the capability to access all the properties defined in a given project; to allow the capability to apply the negation operator to propositions; to indicate the properties that contain a recorded assumption; and to modify the physical position and labels of parameters S and P in the response and precedence patterns on the pattern screen. Additionally, Prospec 1.0 could only generate Future Interval Logic (FIL) [6] from SPS. Prospec 2.0 and Prospec 2.1 answered these and other concerns. Among improvements to the graphical user interface, the newer versions of Prospec provide the capability to generate LTL formulas. The current version of Prospec,

Prospec 2.2, features the ability to save and open projects in XML and allows the generation of properties combining atomic and non-atomic propositions.

4. TOOL DESCRIPTION
4.1 General Features
Prospec 2.2 is a tool that guides the user throughout the process of creating formal specifications. It guides the user via a graphical user interface that allows for the specification of propositions, scopes, and patterns. To help users decide on the desired pattern and scope, Prospec presents a decision tree that assists in the selection of the appropriate pattern and scope for a given property. The decision tree asks the user simple questions to distinguish between the different types of patterns and scopes and to distinguish relations among multiple conditions or events. By answering these questions, the practitioner is led to consider different aspects of the property. Currently Prospec 2.2 generates formulas in two types of logic: FIL and LTL. Prospec features XML writing and reading, which enables Prospec to save and open projects in this format. A validation feature checks for input completeness by ensuring that every property has a completely defined pattern and scope when generating a formula, and notifies the user when a proposition is incomplete and requires further input.

4.2 Graphical User Interface Description
Figure 1 depicts the Main Window. On the left side of the window, the Property Browsing Tree is shown. Prospec 2.2 incorporates the Property Browsing Tree, which allows practitioners to browse, traverse, and quickly preview the properties being specified. This tree also allows editing of property attributes such as scope, pattern, CPs, and propositions. Once the user selects a property attribute in the tree, the appropriate window is opened, allowing modification of the property attribute.

Figure 1. Prospec Main Window

Another interface improvement is the Status Window, depicted in Figure 2, located in the lower part of the Main Window. This window provides information depending on which item of the Property Browsing Tree was selected. If a property element was clicked on the tree, the Status Window will show the name, description, assumptions, scope, and pattern associated with the selected property. If a user selects a proposition, the Status Window provides information such as the proposition's name, the proposition's description, and the type of proposition (CP or Atomic).

Figure 2. Main Window with Status Window

The Property window describes the basic property information such as the property name, the informal property description as provided by the client, and any assumptions made about the property. Properties can be created either by right-clicking on the tree or by clicking on the Define menu and then clicking on the Property menu item, as shown in Figure 3. Once a property has been created, it can be modified or deleted by clicking the corresponding options on the tree.

Figure 3. Property Browsing Tree with Pop-Up Menu

Propositions can be conditions or events. Conditions are propositions that hold in one or more states. Events are propositions that change their Boolean values from one state to the other [2]. Propositions can be created by accessing the Define menu or by right-clicking the Property Browsing Tree. There are three modalities of propositions that can be created in Prospec: Atomic (single proposition),

Composite (two or more propositions), and Incomplete (a modality that allows a user to create a proposition without having to specify whether it is going to be atomic or composite). After the properties and propositions have been created, scopes and patterns need to be defined. To specify the pattern or scope, the user can click on the Define menu or simply select a property in the Property Browsing Tree. After the property is selected, the Status Window appears in the lower part of the Main Window. In order to select a scope, the user should click on the Scope text box; the Scope window will then be displayed. In this window the user can choose among the five different types of scopes and, after selecting one, must click on the Done button. In order to define a pattern, the user repeats these steps but clicks on the Pattern text box instead of the Scope text box. The Pattern window will appear on the screen so that the user can choose the pattern that best suits the specification the user is trying to define. After the proposition is complete and all the property attributes have been defined, the user can proceed to generate the formulas for the complete properties. To generate the formulas, the user can go to the Generate menu or select a property on the tree and click on the Generate Formula button located on the lower left side of the screen to display the Formula window. In this window the user will see the formula in LTL, which is the default logic. The formula can be generated in FIL by clicking on the FIL radio button located in the middle of the window. Multiple formulas can be generated at the same time.

5. FUTURE WORK
Future work planned for the tool includes the further enhancement of the validation mechanism. It is planned to add a consistency validator that will check whether properties within the same project are consistent with each other. Furthermore, additional generation of combined types of properties for LTL needs to be defined, that is, formulas that include both a CP and an atomic proposition. Currently the tool simulates an atomic proposition as a CP in order to generate such combined formulas, but the output LTL is redundant and repetitive.

6. CONCLUSIONS
The purpose of this tool is to aid software specifiers in the generation of formal properties without the users needing to have a strong mathematical background. This paper describes Prospec 2.2, a tool developed to assist users in the creation of formal specifications, its current features, and the planned future work for improving the tool, including the addition of specification consistency checking within a project.

7. ACKNOWLEDGEMENTS
This work is partially supported through the CREST Cyber-ShARE Center funded by NSF for CAHSI grant number CNS . Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

8. REFERENCES
[1] G. J. Holzmann, The model checker SPIN. IEEE Trans. on Softw. Eng., 23(5), May.
[2] A. Cimatti, E. Clarke, F. Giunchiglia, and M. Roveri, NuSMV: a new symbolic model verifier. In Proceedings of the International Conference on Computer-Aided Verification (CAV'99), Trento, Italy, July.
[3] Emerson, E. A., Temporal and modal logic. In Handbook of Theoretical Computer Science (Vol. B): Formal Models and Semantics, J. van Leeuwen, Ed. MIT Press, Cambridge, MA.
[4] Dwyer, M. B., Avrunin, G. S., and J. C.
Corbett, Property specification patterns for finite-state verification. In Proceedings of the Second Workshop on Formal Methods in Software Practice (Clearwater Beach, Florida, United States, March 04-05, 1998). FMSP '98. ACM Press, New York, NY.
[5] Mondragon, O., Gates, A., and S. Roach, Composite propositions: toward support for formal specification of system properties. In Proceedings of the 27th Annual IEEE/NASA Goddard Software Engineering Workshop, Greenbelt, MD, USA, December.
[6] Kutty, G., Moser, L. E., Melliar-Smith, P. M., Dillon, L. K., and Y. S. Ramakrishna, First-Order Future Interval Logic. In Proceedings of the First International Conference on Temporal Logic (July 11-14, 1994).
[7] Manna, Z. and A. Pnueli, Completing the Temporal Picture. Theoretical Computer Science, 83(1), 1991.

Data Property Specification to Identify Anomalies in Scientific Sensor Data: An Overview

Irbis Gallegos; Mentor: Ann Q. Gates
The University of Texas at El Paso, 500 W. University Av., El Paso, TX
irbisg@miners.utep.edu

ABSTRACT
The amount of sensor technology that collects environmental data at research sites is rapidly increasing. Scientists need to be assured that the collected data sets are correct. An approach based on software-engineering techniques was developed to support the scientist's ability to specify data properties (through guidance using property classifications), which can be used for near real-time monitoring of data streams. This work identified data properties that are relevant to sensor data collected by environmental scientists, including data properties that rely on temporal constraints and data relationships. The paper presents an overview of a data categorization scheme that resulted in the Data Specification and Pattern System (D-SPS). D-SPS is the foundation for the Data Property Specification (DaProS) tool, which assists scientists in the specification of sensor data properties. The Sensor Data Verification (SDVe) tool was developed to identify anomalies in data obtained from sensors using the properties specified through DaProS. In collaboration with experts, a series of experiments was conducted to determine the effectiveness of the approach in detecting anomalies in scientific sensor data.

KEYWORDS
Software engineering, data quality, sensor networks, error detection, specification and pattern system, eco-informatics.

1. INTRODUCTION
Environmental scientists use advanced sensor technology such as meteorological towers, wireless sensor networks, and robotic trams equipped with sensors to perform data collection at remote research sites.
2. DATA PROPERTY CATEGORIZATION
An initial case study in the Arctic and a literature survey were conducted to glean the types of data properties that are of interest to scientists and to identify the processes followed to detect anomalies in sensor data. The following subsections describe the results of the case study and the literature survey.

2.1 Initial Case Study
A case study was conducted in collaboration with the System Ecology Laboratory at The University of Texas at El Paso and the University of Alberta to evaluate the use of the Data Assessment Run-time (DART) [1] monitoring framework in assessing data collected by robotic tram systems located on the Barrow Environmental Observatory near Barrow, Alaska. In DART, environmental scientists using the tram system specify and verify data properties as data are streamed wirelessly from the tram system. For the DART framework, a data property is a logical statement about data values associated with hyperspectral sensor readings. DART is intended to detect deviations from an ideal set of data that is derived from expert knowledge and historical data, i.e., the representative data set. DART was used to assess the data in near-real time for three runs of a tram system on the same date to show the feasibility of the approach; however, due to limited wireless connectivity in the field, the remainder of the 2008 seasonal data was assessed using DART's post-processing mode.

The case study identified two types of properties: data properties and instrument properties. Data properties specify expected values and relationships related to field data readings, e.g., noise ranges and data values outside the specified thresholds for spectral ranges. Instrument properties specify expected instrument behavior and relationships by defining and examining attributes and instrument functions based on readings (e.g., low voltage, bad fiber optic, and loose connections). The case study also showed that software-engineering run-time verification techniques can be adapted for use as data assessment techniques.

2.2 Data Property Categories
Checking the quality of sensor data is an essential step in data processing and requires identifying and analyzing data anomalies. A literature review [2] was conducted to review and analyze current efforts in evaluating data quality documented by a total of 15 projects focused on environmental sensor data collection. The projects illustrate how data quality is incorporated into sensor data collection systems and processes at field sites, data centers, or both. The reviewed projects were in one or more of the following fields: atmospheric studies (6), oceanography (9), meteorology (6), hydrology (1), and land productivity (1). The data collected through the projects include CO2, carbon balance, energy balance, spectral data, bathythermography, water salinity, tide gauge, vessel data, and temperature and wind profiles. A total of 532 data properties were manually extracted and analyzed from the projects in the literature review, and the property categorization resulted from the analysis of these extracted properties.

The classification divides the properties into two major types: experimental readings and experimental conditions. Experimental readings properties specify expected values and relationships related to field data readings and can be used to identify anomalies in a dataset, as well as random data errors, i.e., those errors that can be detected, estimated, and minimized by examining the convergence of calculations with increasing size of data sets [3, 4]. Experimental conditions properties specify expected instrument behavior and relationships by defining and examining attributes (e.g., voltage) and instrument functions based on readings. This type of property can identify systematic errors, i.e., persistent offsets or multipliers that can affect the whole or a portion of the dataset [4]. The values being checked may be sensor readings, derived values based on one or more sensor readings, pre-defined values, and historical values.

Properties labeled experimental readings are divided into the following five subcategories:

Datum: A datum (D) property specifies the expected value of a single sensor reading. A sensor reading is compared against a pre-defined or historical value.

Time-Dependent Datum: A time-dependent datum (TDD) property specifies the expected value(s) of a single type of sensor, where the readings are filtered by date and time. The selected sensor readings are compared against a predefined value or a historic value.

Datum Relationship: A datum-relationship (DR) property specifies the relationship between two or more types of sensor readings. A DR property can be used to compare sensor readings against readings from other types of sensors, against a predefined constant value, or against an historic value.

Time-Dependent Datum Relationship: A time-dependent datum relationship (TDDR) property specifies the relationship between two or more related sensor readings that are filtered based on time. The selected readings may be compared against each other, against a predefined value, or against an historic value. TDDR properties capture relationships within time series data and dataset behaviors that depend on time.

Instrument-Dependent Datum: An instrument-dependent datum (IDD) property specifies a property about an instrument that influences the behavior of the sensor readings.

Experimental conditions properties are divided into the following five subcategories:

Instrument: An instrument (I) property specifies the expected behavior of an instrument by describing an attribute of the instrument. The attribute is compared against either a predefined value or an historic value.
Time-Dependent Instrument: A time-dependent instrument (TDI) property captures the expected behavior of a single instrument that is dependent on time. The instrument reading is compared against a predefined constant value, a historic value, or a time entity in a given time constraint.

Instrument Relationship: An instrument relationship (IR) property captures the relationship between one or more related instruments. An IR property can be used to compare the behavior of the instruments.

Time-Dependent Instrument Relationship: A time-dependent instrument relationship (TDIR) property captures the relationship between two or more related instruments and their expected behavior based on time. A TDIR property can be used to compare instrument behavior that depends on time.

Data-Dependent Instrument: A data-dependent instrument (DDI) property captures a known datum or datum relationship whose value influences instrument behavior or causes an instrument's action. DDI properties capture continuity problems.

3. DATA PROPERTY SPECIFICATION
The Specification and Pattern System (SPS) [29], a software engineering solution for specifying and refining properties about critical software systems, provides the foundation for the approach used to specify data quality properties using a categorization system. In SPS, a pattern describes the essential structure of some aspect of a system's behavior and provides expressions of this behavior in a range of formal specification languages and formalisms. Each pattern is associated with a scope, which is the extent of the program execution over which the pattern must hold. The SPS was adapted to create the Data Property Specification and Pattern System (D-SPS), which uses scopes, patterns, and Boolean statements to specify data properties.

3.1 Data Property Specification and Pattern System (D-SPS)
In D-SPS, Boolean statements express data properties, which are defined using mathematical relational operators that are applied to a datum, datum relationships, and Boolean methods that are available to the scientist. A property scope delimits the subset of data over which a property holds. The scope is defined by specifying the data reading occurrences in a dataset Δ over which a property will hold. Given data reading L ∈ Δ and data reading R ∈ Δ, a practitioner delimits the scope of a property by designating one of the following types:

Global: the property holds for all the data in dataset Δ;

Before R: the property holds over the sequence of data that begins with the first datum in Δ and ends with the data reading immediately preceding the first data reading in Δ that matches R;

After L: the property holds over the sequence of data starting with the first data reading in Δ that matches L and ending with the last data reading in Δ;

Between L and R: the property holds over the sequence of data starting with the first data reading in Δ that matches L and ending with the data reading immediately preceding the first data reading that matches R; and

After L until R: the property holds over the sequence of data starting with the first data reading that matches L and ending with the data reading immediately preceding either the first data reading that matches R, or the last element in Δ if data reading R does not occur.

After delimiting the scope, the practitioner selects a pattern. As described earlier, the patterns are grouped as experimental readings, which describe the expected behavior of the data, and experimental conditions, which describe external conditions such as those associated with the functioning of the instrument or weather conditions. Time-dependent patterns are interpreted over a discrete time domain, e.g., over the natural numbers N. Timed patterns assume a system clock, where the clock is treated as a local entity for each dataset value. For timed patterns, it is assumed that the independent value associated with each dataset value is a discrete time t. A time-constrained property specifies one of the following:

Minimum Duration(P, c): Boolean function P holds for a minimum of c units of time;

Maximum Duration(P, c): Boolean function P holds for a maximum of c units of time;

Bounded Recurrence(P, c): Boolean function P holds every c units of time;

Bounded Response(T, P, c): Boolean function T holds after Boolean function P holds, at no later than time t + c, where t is the time at which P holds; and

Bounded Invariance(T, P, c): Boolean function T holds for at least time t + c before Boolean function P holds, where t is the time at which T holds.

Patterns associated with categories that are not time dependent are specified as follows:

Universality(P): Boolean function P always holds over dataset Δ;

Absence(P): Boolean function P never holds over dataset Δ;

Existence(P): Boolean function P holds at least once over dataset Δ;

Precedence(T, P): Boolean function T holds before Boolean function P eventually holds; and

Response(T, P): Boolean function T holds immediately after Boolean function P holds.

The data property categorization can be used to help practitioners determine which property pattern best suits the data property to be specified. The data categories are related to a property pattern depending on whether the property to be specified is time-dependent or not and on the number of entities required to specify the property.

4. D-SPS TOOL SUPPORT
Software tools have been developed to support environmental scientists' ability to specify and verify data properties to identify anomalies in sensor data. The following subsections describe the data property specification tool DaProS and the sensor data verification tool SDVe.

4.1 Data Property Specification (DaProS) Tool
The Data Property Specification prototype tool (DaProS) [5] was developed to assist practitioners in specifying and refining data properties that capture scientific expert knowledge about their data processes. The tool uses the property categorization, together with data-property scopes and patterns, to guide the specification process. It guides practitioners through a decision tree and a series of questions that assist in the specification and refinement of data properties. To validate the intended meaning of specified properties, DaProS generates natural language descriptions of specified properties using a disciplined natural language.
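To make the scope and pattern semantics concrete, the following minimal Python sketch shows how a scoped property could be checked over a list of readings. It is an illustration only, not the DaProS or SDVe implementation; the helper names, the reading fields, and the threshold are hypothetical.

    # Illustrative sketch only (not DaProS/SDVe code): evaluating a property
    # with a "Between L and R" scope and an Absence(P) pattern over readings.
    def between_scopes(data, matches_l, matches_r):
        """Yield segments that start at a reading matching L and end just
        before the next reading matching R (approximating the 'Between L
        and R' scope described in Section 3.1)."""
        start = None
        for i, reading in enumerate(data):
            if start is None and matches_l(reading):
                start = i
            elif start is not None and matches_r(reading):
                yield data[start:i]
                start = None

    def universality(segment, p):
        """Universality(P): P holds for every reading in the scoped segment."""
        return all(p(r) for r in segment)

    def absence(segment, p):
        """Absence(P): P never holds in the scoped segment."""
        return not any(p(r) for r in segment)

    # Hypothetical example: between a reading that raises a rain flag (L) and
    # the reading that clears it (R), CO2 flux should never exceed a threshold
    # (the 5.0 value is purely illustrative).
    readings = [
        {"rain": False, "co2_flux": 2.1},
        {"rain": True,  "co2_flux": 9.7},
        {"rain": True,  "co2_flux": 8.4},
        {"rain": False, "co2_flux": 2.3},
    ]
    violations = [
        seg for seg in between_scopes(readings,
                                      matches_l=lambda r: r["rain"],
                                      matches_r=lambda r: not r["rain"])
        if not absence(seg, lambda r: r["co2_flux"] > 5.0)
    ]
    print(len(violations), "scoped segment(s) violate the property")

In the same spirit, a Datum property would correspond to a per-reading check under the Global scope, while the time-constrained patterns would additionally compare the timestamps attached to the readings.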
Analysis of the data property categorization and specification process garnered from the literature showed that scientists placed less attention on instrument malfunction than on data values, even though instrument malfunctions are a major source of anomalies. Scientists are also aware of sensor and data relationships, but these relationships are rarely used for anomaly detection; scientists typically perform data inspections to determine instrument malfunction rather than monitoring instrument performance.

During the specification process, several factors were identified that affect data property specification using our approach. Some data properties are described at such an abstract level that it was difficult to specify them formally without more detail. The advantage of using a tool such as DaProS is that a practitioner, while trying to specify the property, would realize the need to refine the property to capture the intended meaning. Other data properties are complex and need to be decomposed into several simpler properties. A number of specifications are a combination of data verification and data steering properties, and this required separating the two concerns. Combined property specifications require both verifying that the properties adhere to predefined behaviors (the verification aspect) and guaranteeing that a reaction occurs in response to a data or instrument stimulus (the steering aspect); such specifications can be decomposed into separate data verification properties and data steering properties. Finally, due to the inherent ambiguity of natural language, data property descriptions are sometimes too ambiguous, requiring involvement of the expert.

4.2 Sensor Data Verification (SDVe) Tool
DaProS's ability to generate properties in an exchangeable format and to mitigate ambiguity allows scientists to use the generated properties as input to data verification mechanisms that can interpret and verify such properties over scientific datasets. Toward this effort, a prototype Sensor Data Verification (SDVe) tool has been developed to identify anomalies in the data using the specifications generated by DaProS. SDVe takes as input a property specification file generated by DaProS and a sensor dataset file obtained from a data logger, and verifies that the data in the dataset adhere to the property specified in the specification file. SDVe raises alarms whenever a data property is not satisfied by the data. SDVe is implementation agnostic and data-type agnostic.

5. IMPACT ON DATA ANOMALY DETECTION
The use of a scientific data property categorization to specify properties encourages scientists to further analyze and refine properties for specific ecosystems, increases the scientists' ability to reuse properties and to document expert knowledge, fosters standardization of scientific processes related to data quality, and allows data properties to be interpreted and verified by data verification mechanisms.

In the environmental sciences, it is difficult for scientists to distinguish true errors from anomalies generated by environmental events. The approach presented in this paper allows scientists to fine-tune data properties that can distinguish errors from environmental events. For instance, in Eddy covariance data [2], data obtained during strong rain are considered bad data; yet, it is difficult for scientists to determine when a rain event occurred just by looking at the data. With the DaProS and SDVe tools, scientists can capture rain events by specifying properties that identify sudden changes in temperature or atmospheric pressure.

The data anomaly detection process can be improved in several ways using the approach. Capturing data properties formally allows the scientist to document knowledge about scientific domains, and this in turn facilitates knowledge sharing and reuse by other scientists. Changes to properties can be made at the specification level, eliminating the need to modify source code as is often the case in many monitoring systems. The data property specification process can also support standardization of data quality processes for similar scientific communities: a set of data properties can be specified to cover the needs of specific ecosystems and be shared by members of a community. Tool support will allow scientists to discuss and refine existing and new properties, and a common data property set will allow scientists to verify the data being collected using the same properties and tools, thus moving toward a unified way of verifying data. This approach could help scientists determine in near-real time whether anomalies in the data are true errors or environmental features with scientific implications for the ecosystem.

The DaProS and SDVe tools are being used to specify and verify, respectively, data properties to identify anomalies in Eddy covariance (EC) sensor data obtained from biomesonet towers that collect carbon dioxide (CO2), energy, and water balance measurements at the Jornada Basin Experimental Range (JER). As part of the validation of the SDVe tool, an EC dataset was randomly selected from the datasets collected during the 2010 winter season. The dataset was seeded with anomalies and was used as input to SDVe to determine whether the tool identifies false positives and false negatives in the data. The data properties of interest for this experiment and the seeded anomalies were based on a use-case scenario created in collaboration with environmental scientists.
Two statistical sample-size calculators were used to determine the number of anomalies to be seeded for each data property in order to obtain statistical significance. Preliminary results show that a combination of data properties specified by experts using DaProS, and used as input to SDVe, can be used to identify anomalies in collected sensor data. SDVe correctly identified the anomalies intentionally seeded by a scientist, and it also identified unexpected instances in which anomalies seeded to check one type of data property violated other data properties as well, mostly because the properties overlap. A second experiment using seeded files and non-overlapping properties is being conducted to further validate SDVe. A separate set of datasets obtained from an Eddy covariance tower located at the JER site, which sensed the environment during days in a season when environmental variability occurred, is also being processed and analyzed.

6. ACKNOWLEDGMENTS
The authors acknowledge the National Science Foundation (grant no. CNS) for its support. The authors also thank Dr. Craig Tweedie, Dr. John Gamon, Dr. Deana Pennington, Aline Jaimes, and Santonu Goswani for their efforts and support for this research effort. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

7. REFERENCES
[1] Gallegos, I., Gates, A. Q., Goswani, S., Tweedie, C., and Gamon, J. Towards Near-Real Time Data Property Specification and Verification for Arctic Hyperspectral Sensor Data. To be published in Proceedings of the North American Fuzzy Information Processing Society Conference (NAFIPS).
[2] Gallegos, I., Gates, A. Q., and Tweedie, C. Toward Improving Environmental Sensor Data Quality: A Preliminary Property Categorization. In Proceedings of the International Conference on Information Quality (ICIQ).
[3] Taylor, J. R. An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements, 2nd Ed., University Science Books, p. 94.
[4] Moncrief, J. B., Malhi, Y., and Leuning, R. The Propagation of Errors in Long-Term Measurements of Land-Atmosphere Fluxes of Carbon and Water. Global Change Biology, Vol. 2, Issue 3.
[5] Gallegos, I., Gates, A. Q., and Tweedie, C. DaProS: A Data Property Specification Tool to Capture Scientific Sensor Data Properties. In Proceedings of the Workshop on Domain Engineering DE@ER.

Fusion of Monte Carlo Seismic Tomography Models

Julio C. Olaya
Faculty Mentors: Rodrigo Romero and Aaron Velasco*
Department of Computer Science
Department of Geological Sciences*
The University of Texas at El Paso, El Paso, TX 79968, USA

1. Abstract
Velocity models of the Earth's crust created with seismic tomography algorithms are affected by uncertainty in first-arrival time readings. To assess the impact of model sensitivity to reading uncertainty, which is a typical measurement-error scenario with multiple possible sources of error, Monte Carlo techniques use arrival-time error distributions to perturb the inputs of seismic tomography algorithms in order to simulate reading errors. Since velocity models are used for further geological analysis, this study proposes the application of a model fusion technique as a method to construct the best velocity model out of a set of Monte Carlo sample models. The main issue is that none of the sample models can be readily identified as the best in terms of velocity structure information and sensitivity to arrival-time uncertainty. Our results indicate that the fused velocity model, which combines the structural information of all the computed samples, has better RMS residuals per shotpoint than any of the sample models.

Keywords: Seismic tomography, Monte Carlo simulation, Cauchy distribution, uncertainty, fusion, and visualization.

2. Introduction
Uncertainty in first-arrival time readings (referred to as pick times) alters the velocity models created with seismic tomography algorithms. Generally, the readings are performed manually by an expert analyst. Thus, model sensitivity to reading uncertainty, which is a typical measurement-error scenario with multiple possible sources of error, can be used to determine model robustness. Averill [1] used an algorithm for travel-time seismic tomography, referred to as the Vidale-Hole algorithm [2][3], and combined it with Monte Carlo techniques to find the upper bounds of model sensitivity to uncertainty [4]. Since the algorithm does not process uncertainty to compute velocity models, Averill et al. calculated the effect of uncertainty on the model by repeatedly applying Cauchy-distributed perturbations to the readings of first arrivals of one experiment. This procedure was used to generate 14 random variations of the experiment in order to obtain a 95% level of confidence on the results [1].

The Monte Carlo simulation algorithm, outlined in Figure 1, takes the same initial velocity model but uses stochastically perturbed pick times as input to compute each of the sample velocity models. Once all the samples are computed, the upper bounds of all the models are computed and output as the outcome of the Monte Carlo simulation.

    For N initial velocity models
        apply Cauchy errors to pick times
        While (residuals > pick error)
            For each source
                compute first arrival times
                For each receiver
                    compute ray coverage
                    compute velocity perturbations
                End For
            End For
            smooth velocity perturbations
            update velocity model
        End While
    End For
    compute upper bounds of N velocity models
    output upper bounds of velocity uncertainty

    Figure 1. Computing velocity model sensitivity to uncertainty of first-arrival readings

While computing model sensitivity to uncertainty of pick times provides a qualitative measure of velocity model robustness, this computation does not indicate whether any of the output models is the best representation of the velocity structure.
Since the justification for the procedure is the assumption that the pick-time input for the tomographic algorithm contains measurement errors, an important question to answer is: which is the best velocity model out of the N computed models?
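For concreteness, the sampling loop outlined in Figure 1 can be sketched in Python as follows. This is only an illustration: invert_velocity_model is a hypothetical stand-in for the Vidale-Hole forward-modeling/inversion loop, which is not reproduced here, and the random seed is illustrative.

    # Illustrative sketch of the Monte Carlo loop in Figure 1 (not the actual code).
    import numpy as np

    rng = np.random.default_rng(seed=0)

    def cauchy_perturb(pick_times, pick_error):
        """Add Cauchy-distributed errors to the first-arrival pick times,
        using the acceptable pick error as the scale parameter."""
        return pick_times + pick_error * rng.standard_cauchy(size=pick_times.shape)

    def invert_velocity_model(initial_model, pick_times):
        """Hypothetical stand-in for the Vidale-Hole tomography loop, which
        would iterate forward modeling and inversion until the residuals fall
        below the acceptable pick error."""
        return initial_model  # placeholder only

    def monte_carlo_samples(initial_model, pick_times, pick_error, n_samples=14):
        """Generate N velocity-model samples from stochastically perturbed picks."""
        samples = []
        for _ in range(n_samples):
            perturbed = cauchy_perturb(pick_times, pick_error)
            samples.append(invert_velocity_model(initial_model, perturbed))
        return np.stack(samples)

    # Upper and lower bounds of model sensitivity across the N samples:
    #   samples = monte_carlo_samples(model0, picks, pick_error)
    #   upper, lower = samples.max(axis=0), samples.min(axis=0)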

This study proposes the application of a model fusion technique [5] to compute the best velocity model out of a sample of models generated using a collection of similarly accurate sets of input pick times. The rest of this paper presents the method and the results obtained from applying it to models produced using the Potrillo Volcanic Field experiment [1].

3. Method
Monte Carlo computation of velocity model sensitivity to pick-time uncertainty implies that each of the N sets of input pick times, including the original unperturbed times, is as accurate as the rest of the sample inputs. The computation produces N velocity model samples that are all valid because they start from the same initial model, use equivalently accurate input pick times, are computed with the same tomographic algorithm, and have residuals below the same acceptable pick error. The issue is determining which one is the best output model of the velocity structure. Selecting one of the N models disregards the structural information contained in the other models and risks selecting a model which, while meeting the convergence criteria, may be highly sensitive to uncertainty. Instead of selecting one model, we propose constructing one model utilizing the N samples. In this case, constructing one model using only the computed upper bounds, which are all worst cases of sensitivity, would be extremely conservative and probably mostly incorrect. On the opposite end, constructing a model out of only lower bounds, which are all best cases of sensitivity, would be extremely optimistic and probably incorrect as well. Thus, we propose constructing a model that merges the velocity structure information contained in each of the models while reducing model sensitivity to pick-time uncertainty and avoiding the use of only best-case values. Our construction technique is based on a fusion method for heterogeneous models introduced by O. Ochoa et al. [5]. Details are presented in the following section.

3.1 Homogeneous Model Fusion
Model fusion techniques extract and combine velocity structure information contained in heterogeneous models built using, for instance, velocity data, gravity data, and magnetic data of a region [5]. The method presented here is a Monte Carlo technique to create a fused seismic tomography model in four major steps. First, N sets of input pick times are generated using one set of experimental pick times and a random perturbation algorithm that reflects the stochastic characteristics of the pick-time measurement procedure. For this work, a Cauchy distribution with the experiment's acceptable pick error as a parameter was utilized. Second, each set of pick times is input into the Vidale-Hole algorithm to generate a velocity model sample. Third, a technique for fusing homogeneous models is used to compute a combined velocity model out of the N samples. Fourth, the experimental unperturbed pick times and the fused model are input into the seismic tomography algorithm to verify model convergence. The fused model can then be used to perform other model robustness and resolution tests such as verifying quality of coverage and performing a checker-board test, a common test used to explore model resolution and robustness [1].

3.2 Uncertainty-Augmented Models
The Cauchy distribution, which is utilized for the Monte Carlo simulation, generates uncertainty-augmented velocity model samples that are more in agreement with expected uncertainty [4].
Each velocity model is computed using the Vidale-Hole algorithm, which is implemented by the While statement wrapped by the Monte Carlo simulation in Figure 1. Briefly described, the algorithm is based on iterative forward modeling and inversion. The forward step computes the first arrivals throughout the model using the eikonal equation and a 3D velocity model [2]. Next, the inversion step uses back-tracing to compute ray coverage and velocity perturbations to be applied to the velocity model. The velocity perturbations are then smoothed and used to update the velocity model until model convergence or other stopping criteria are detected [3].

3.3 Model Fusion
Taking the uncertainty-augmented velocity model samples generated by the Monte Carlo simulation, model fusion is applied to compute a single model of the velocity structure of the Earth's crust. The approach is based on the application of Least Squares as a Maximum Likelihood estimator of the velocity model parameters [5]. Each model comes from an independent set of Cauchy-perturbed measurements. The probability density function p(v) represents the ensemble of these estimated velocity models as a product over the individual models. By computing the maximum value of p(v), which is the maximum of the likelihood function [5], it is possible to derive the Least Squares expression as follows: (1)
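As a reading aid, the likelihood argument of this subsection can be summarized in LaTeX notation as follows. This is a reconstruction under the assumption of independent, normally distributed model errors for the N samples v_i (applied cell by cell over the velocity grid) with standard deviations sigma_i; it is intended only to sketch the steps behind equations (1)-(4).

    % Likelihood of the fused value v given the N sample models v_i  (cf. (1))
    p(v) \;\propto\; \prod_{i=1}^{N} \exp\!\left( -\frac{(v - v_i)^2}{2\sigma_i^2} \right)

    % Maximizing p(v) is equivalent to minimizing the sum of squares  (cf. (2))
    \hat{v} \;=\; \arg\min_{v} \sum_{i=1}^{N} \frac{(v - v_i)^2}{2\sigma_i^2}

    % Setting the derivative with respect to v to zero  (cf. (3))
    \hat{v} \;=\; \frac{\sum_{i=1}^{N} v_i / \sigma_i^2}{\sum_{i=1}^{N} 1/\sigma_i^2}

    % With a common standard deviation sigma_i = sigma, the fused model is the
    % plain average of the samples  (cf. (4))
    \hat{v} \;=\; \frac{1}{N} \sum_{i=1}^{N} v_i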

Maximizing a product of terms of the form A exp(-B(v)) is equivalent to minimizing B(v); what is needed, therefore, is the value of v that minimizes the following expression: (2)

Calculating the derivative of equation (2) with respect to v and setting it to zero produces the following result: (3)

Equation (3) is used to compute a fused model from several independent, heterogeneous models. This approach assumes that each model has independent errors and an independent standard deviation. Since the sample velocity models used for this work have the same standard deviation, it is possible to remove the standard deviation from the equation, as shown in equation (4). The fused velocity model can then be computed as follows: (4)

5. Results
The experimental data set was produced by the Potrillo Volcanic Field experiment of 2003 [1], which used 8 explosions and 793 seismometers spaced at intervals of 100 m, 200 m, and 600 m over a region with a length of 205 km. Based on calculations for a 95% level of confidence done by Averill [1], N was set to 14 in the Monte Carlo simulation and the pick error was set to s. The following sections present the results of this research visually and quantitatively.

5.1 Velocity Model Samples
Figure 2 shows one of the velocity model samples produced by the Monte Carlo simulation. The figure illustrates a shotpoint with a combined visualization of hit-count coverage (shown in green) and vectorized coverage with ray-path uncertainty (shown in white).

    Figure 2. Ray coverage with uncertainty at one shotpoint

Cauchy-perturbed pick times were used to generate N velocity model samples using the Vidale-Hole algorithm and a 1D velocity model with the following list of depth-velocity values: {0.0, 3.0; 7.12, ; 14.32, ; 40.72, ; 46.12, ; 58.12, }. Figure 3 shows the initial 3D velocity model computed; note that the velocity isosurfaces are flat before the seismic tomography algorithm is applied.

    Figure 3. Initial velocity with no uncertainty

Figure 4 shows the final velocity model computed without pick-time uncertainty. The red circle encloses the region shown as a close-up in Figures 5 through 8.

    Figure 4. Velocity model without uncertainty
    Figure 5. Close-up of the final velocity model without uncertainty

Close-ups of two Monte Carlo sample velocity models, numbers 4 and 13, are shown in Figures 6 and 7, respectively. Velocity isosurfaces in Figures 5 through 8 have similar shapes, but they show differences generated by the Cauchy perturbations of the input arrival times.

5.2 Fused Velocity Model
Figure 8 shows a close-up of the final velocity model constructed by applying model fusion to the 14 Monte Carlo simulation samples. Residuals for each shotpoint of the unperturbed model and the fused model are shown in Figure 9 for comparison.

    Figure 6. Velocity model sample number 4
    Figure 7. Velocity model sample number 13
    Figure 8. Fused velocity model

In most instances, the root-mean-square (RMS) residuals of the fused model's shotpoints are better than the corresponding residuals of the unperturbed model. Shotpoint 5 shows the biggest change between the unperturbed model residual and the fused model residual, while shotpoint 3 did not change. An important additional merit of the fused model, however, is the amalgamation of the velocity structure illuminated by all the Monte Carlo samples into a single model, which is expected to be a better approximation of the actual structure than any of the samples.

    Figure 9. Root mean square of the travel-time residuals (final velocity vs. fused velocity)

6. Conclusions
This paper presents a Monte Carlo method to construct a fused velocity model generated with seismic tomography. To produce each sample model, the pick times used as input for the tomographic algorithm are perturbed using a Cauchy distribution with the experimental pick error as a parameter. Visualizations of the velocity model samples show similarities and differences in the shapes of the sample velocity isosurfaces. A comparison of RMS residuals for each shotpoint in the unperturbed model and the fused model shows that the fused model has better convergence than the unperturbed model. Future work may include exploration of fusion results for the Gaussian distribution and fusion based on the application of genetic algorithms.

7. Acknowledgements
This material is based upon work supported in part by the National Science Foundation under Grants CREST HRD, CNS, and CNS.

8. References
1. M. G. Averill, A lithospheric investigation of the Southern Rio Grande: Ph.D. Dissertation, The University of Texas at El Paso.
2. J. E. Vidale, Finite-difference calculation of traveltimes in three dimensions, Geophysics, 55(5), 1990.
3. J. A. Hole, Nonlinear high-resolution three-dimensional seismic travel time tomography, Journal of Geophysical Research, 97(B95), 1992.
4. M. G. Averill, K. C. Miller, V. Kreinovich, and A. A. Velasco, Viability of travel-time sensitivity testing for estimating uncertainty of tomography velocity models: a case study, Geophysics.
5. O. Ochoa, A. Velasco, C. Servin, and V. Kreinovich, Model Fusion under Probabilistic and Interval Uncertainty with Application to Earth Sciences, 4th International Workshop on Reliable Engineering Computation (2010). Copyright 2010 Professional Activities Centre, National University of Singapore. Published by Research Publishing Services, 2010.

Identifying the Impact of Content Provider Wide Area Networks on the Internet's Topology

Aleisa A. Drivere, Antoine I. Ogbonna, Sandeep R. Gundla, Graciela Perera
Department of Computer Science and Information Systems
Youngstown State University, Youngstown, OH, USA
E-mails: aadrivere@my.ysu.edu, aiogbonna@ysu.edu, srgundla@my.ysu.edu, gcperera@ysu.edu

ABSTRACT
In this paper, we study the trend of large content providers like Google, Yahoo!, and Microsoft deploying their own wide area networks (WANs) for end-user content delivery, and we attempt to show the impact such deployments may be having on the topology of the Internet. We base our study on work done in 2008 by Gill, Arlitt, Li, and Mahanti, and we investigate whether the preliminary trends identified in that study have continued. We would like to see if there are any significant differences in our data as compared to theirs, which was collected over two and a half years ago.

Keywords: Internet topology, Internet backbone, Tier-1 network, traceroute, wide area network, content provider, content delivery network.

1. Introduction
As the Internet developed from its creation in 1969 as an academic-only network of four connected computers to the beginning of its commercial availability in 1995, a hierarchical system of Internet Service Providers (ISPs) evolved that enabled communication between Internet users in geographically diverse locations. The largest of these ISPs, such as AT&T, Sprint, and Qwest, formed the backbone of the Internet and became known as Tier-1 ISPs. Somewhat smaller ISPs and even smaller regional ISPs became known as Tier-2 and Tier-3, respectively. Consumers as well as content providers have typically accessed the Internet via this Tier-1/Tier-2/Tier-3 hierarchical system, but this may be changing somewhat.

A 2008 study done by Gill, Arlitt, Li, and Mahanti showed that large content providers like Microsoft, Yahoo!, and Google are deploying their own wide area networks [5]. One of the effects of such deployments is that traffic to and from these content providers tends to bypass major Tier-1 networks. As per Gill et al., the reasons for this may include: 1) the vulnerability of smaller ISPs to de-peering and transit disputes involving Tier-1 and Tier-2 ISPs, and hence an interruption of customer service; 2) limitations and/or uncertainty about video-on-demand delivery due to issues regarding IP multicast on the Internet [4] (with the emergence of applications that require real-time content delivery, such as video-on-demand, large content providers may feel that it is more beneficial for their business model to avoid such uncertainty); and 3) plans by larger content providers to offer cloud computing services such as software-as-a-service (SaaS) to subscribing customers. Such services are already being offered by Microsoft, Google, and Yahoo!, and the trend is likely to continue. Along with major content providers, content delivery networks such as Akamai and Limelight have been deploying their own WANs and offering services to smaller content providers who may not have the resources to deploy their own WANs.

2. Background
Internet Protocol (IP) transit is where an ISP advertises routes reaching its customers to other ISPs and advertises a default route to its customers, which allows them to access the entire Internet. Peering is when only traffic between the two peers and their customers is exchanged, and neither peer has any information about the other's available routes.
A transit-free network uses only peering [8]. An ISP is identified as Tier-1 if it is transit-free and does not pay any other network for peering. A Tier-2 network may peer with other networks but still pays transit fees and/or peering settlements, although some networks classified as Tier-2 may be transit-free [9]. It is sometimes difficult to determine whether a network pays for peering, due to confidential business agreements; because of this, some large networks often identified as Tier-1 may actually be Tier-2 [9]. A Tier-2 ISP will usually connect to one or more Tier-1 networks in order to reach every part of the Internet. Tier-3 networks, often called access networks, can only reach the Internet by paying for IP transit. It is through these Tier-3 ISPs that most end-users connect to the Internet.

3. Methodology

3.1 Tools
For this study, we used two main tools: traceroute, and a list of the top 20 content providers in the United States provided by the Web traffic analytics company Alexa. Traceroute is an Internet test and measurement tool that can run on any computer (host) that is connected to the Internet. The host issuing a traceroute query is known as the source. The source user issues a query by specifying a destination: either a hostname or an IP address. Traceroute then sends multiple, uniform-size packets of data toward that destination. On the way to the destination, the packets pass through a series of routers. When a router receives one of these packets, it returns to the source a short message giving its IP address and hostname [4]. The distance between two directly connected routers is often called a hop.

Alexa has developed a toolbar that gathers information about Web traffic and site visits from toolbar users, and has devised a ranking system (TrafficRank) based on the results gathered. Alexa's list of the top 20 content providers in the United States as of April 2010 appears in Table 1.

3.2 Data Collection
For our data collection, we used the same basic method used by Gill et al. We issued traceroute queries from five geographically distributed public traceroute servers located in the United States to the content providers' servers. The public traceroute servers were found on a public listing of traceroute servers; Figure 1 lists the servers we used and indicates their approximate geographical locations. We resolved the hostnames of the providers (translated hostnames to IP addresses) only once, at Youngstown State University; this approach was used to prevent our queries from being re-directed to local content-caching servers [5]. A single traceroute query was issued to each content provider from each traceroute server. Our data was collected on April 16. For the purposes of this study, we included only one site for each company, since some companies own multiple sites (YouTube and Blogger are owned by Google, Live and Bing are owned by Microsoft, ESPN.go and Go are owned by Disney).

Table 1: Alexa Top 20 sites in the United States, April 2010
1. google.com        11. myspace.com
2. facebook.com      12. twitter.com
3. yahoo.com         13. msn.com
4. youtube.com       14. aol.com
5. wikipedia.org     15. go.com
6. blogger.com       16. bing.com
7. craigslist.org    17. linkedin.com
8. ebay.com          18. cnn.com
9. amazon.com        19. wordpress.com
10. live.com         20. espn.go.com

Figure 1: List of public traceroute servers used, and their approximate geographic locations
1. Internet Partners Inc., Portland, OR
2. University of Southern California, Los Angeles, CA
3. Steadfast Networks, Chicago, IL
4. TowardEX, Acton, MA
5. InternetBiz, South Beach, FL

3.3 Data Analysis
To analyze our data, we used three of the four metrics that were used by Gill et al.:

1. For each content provider, the average number of hops on routers belonging to Tier-1 networks. The lower this number, the more likely it is that traffic to and from the content provider is bypassing Tier-1 networks in favor of its own WAN or the WAN of a content delivery network.

2. For each content provider, the number of paths that involve Tier-1 networks. The analysis of this metric is the same as that of Metric 1.

3. For each content provider, the number of unique ISPs directly connected to the content provider's network. We determined this by identifying the IP addresses immediately preceding the first IP addresses belonging to the content providers. The higher this number, the more widely deployed a content provider's WAN may be.

To determine ownership of the routers in the traceroute query output, we used the American Registry for Internet Numbers (ARIN) Whois database. This allowed us to determine whether a router was owned by an ISP, a content provider, or a content delivery network. We should note that Gill et al. used one more metric: the number of geographic locations where a content provider's routers are located.
We attempted to analyze our data using this metric as well, but we did not produce any conclusive results.

4. Limitations
In their study, Gill et al. issued traceroute queries from 50 different traceroute servers located worldwide. Due to time constraints, we were only able to locate and use five traceroute servers, so our data sample is smaller. The traceroute output did not always show a complete path to the destination server; the reasons for this could be the following: 1) as a security precaution, some routers do not respond to traceroute queries and return no information; 2) the maximum number of pre-specified hops that a traceroute query will return (usually 30) was reached [6]. When a router in one of our traceroute queries returned no response, we used the last available IP address returned in our analysis. Also, for the purposes of this study, we counted the transit-free major networks Cogent Communications, XO Communications, and AboveNet as Tier-1 networks, as well as the AOL Transit Data Network, even though these networks have or may have settlement-based or paid peering with one or more networks [9].
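As a rough illustration of how the three metrics of Section 3.3 can be computed from a single traceroute path, consider the following Python sketch. It is not the scripts used in this study: the whois lookup is stubbed with a static mapping, the Tier-1 owner list is an illustrative subset, and the addresses are made up.

    # Illustrative sketch of the per-path analysis for Metrics 1-3 (Section 3.3).
    TIER1_OWNERS = {"AT&T", "Sprint", "Qwest"}   # illustrative subset only

    def owner_of(ip, whois_db):
        """Return the organization that owns an IP address (stubbed lookup;
        the study used the ARIN Whois database)."""
        return whois_db.get(ip, "unknown")

    def analyze_path(hops, provider, whois_db):
        """For one traceroute path, return: the number of hops on Tier-1
        networks (Metric 1), whether the path avoids Tier-1 networks entirely
        (used for Metric 2), and the ISP immediately preceding the first hop
        inside the content provider's own network (used to aggregate Metric 3)."""
        owners = [owner_of(ip, whois_db) for ip in hops]
        tier1_hops = sum(o in TIER1_OWNERS for o in owners)
        preceding_isp = None
        for prev, curr in zip(owners, owners[1:]):
            if curr == provider and prev != provider:
                preceding_isp = prev
                break
        return tier1_hops, tier1_hops == 0, preceding_isp

    # Made-up example path from a traceroute server to a content provider.
    whois_db = {"10.0.0.1": "LocalISP", "10.0.0.2": "AT&T",
                "10.0.0.3": "Google", "10.0.0.4": "Google"}
    print(analyze_path(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"],
                       provider="Google", whois_db=whois_db))
    # -> (1, False, 'AT&T')

Averaging the first value over the five paths per provider gives Metric 1, counting the paths where the second value is True gives Metric 2, and collecting the distinct third values across paths gives Metric 3.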

5. Results

Table 2: Analysis of traceroute results (columns: average number of hops on a Tier-1 network; number of paths with no Tier-1 networks; number of different connected ISPs) for Google, Facebook, Yahoo!, Wikipedia, Craigslist, eBay, Amazon, MySpace, Twitter, MSN, AOL, Go.com, LinkedIn, CNN, and Wordpress.

Metric 1: Average number of hops on Tier-1 networks.
Lowest: LinkedIn, with 1.2; LinkedIn appears to be paired with the content delivery network Limelight Networks. Next lowest: Yahoo!, MSN, and Google, with 1.6, 2, and 2.2, respectively. Highest: Twitter at 8.2 and AOL at 7.4; Twitter appears to be hosted on NTT America's network, and AOL uses the AOL Transit Data Network.

Metric 2: Number of paths that involve no Tier-1 networks.
Highest: Google, Yahoo!, MSN, and MySpace, with 3 out of 5 paths. MySpace is an Akamai customer; at the time Gill et al. collected their data, MySpace was a Limelight customer. Next highest: Amazon and LinkedIn, with 2 out of 5 paths. Lowest: eBay, Twitter, AOL, Go, CNN, and Wordpress, with 0 paths.

Metric 3: Number of different ISPs to which a content provider is connected.
Highest: Facebook and LinkedIn, 5 each. Next highest: Google, Yahoo!, Amazon, MSN, and MySpace, 4 each. Lowest: Twitter and AOL, 1 each.

One further note should be added: at the time of this study, we considered Bing as part of the Microsoft network and thus did not include it in our data collection. As our data collection proceeded, we discovered that Bing was using the content delivery network Akamai.

6. Conclusions and Future Work
The results we obtained in our version of the study done by Gill et al. confirmed the results of their study: the largest content providers (Microsoft, Google, and Yahoo!) have begun to deploy their own wide area networks. Our results also revealed other interesting things, such as content providers' affiliations with content delivery networks (MySpace, Bing) and content providers that are affiliated with large Tier-1 networks (Twitter, AOL). Working on this paper caused us to ask some questions, such as: 1) Will major content providers like Google continue to build their networks and offer content delivery to smaller content providers, essentially becoming a content delivery network like Akamai? 2) Will any changes occur to the current state of IP multicast on the Internet? These questions could perhaps be answered by future research.

Our next step is to collect new data using more geographic locations worldwide. We are considering making the following changes to the data collection/analysis process:

1. Using the Internet testing and measurement tool DipZoom to issue our traceroutes [11]. Since DipZoom employs volunteer servers, we would like to see if this mitigates any bias introduced by the use of public traceroute servers, many of which are maintained by ISPs.

2. Gathering similar traceroute data using a control group of random sites that are not major content providers, in order to study the differences in the logical topologies.

3. Using some additional metrics in our analysis. Some suggestions for possible metrics:
   a. Average number of total hops per content provider across all traceroute queries.
   b. Average number of hops inside a content provider's network, across all traceroute queries.
   c. Average total delay per content provider, across all traceroute queries.
   d. Relationships between the metrics.

References
1. Alexa Top Sites in the United States, 2010.
2. Amini, Lisa, with Anees Shaikh and Henning Schulzrinne, Issues with Inferring Internet Topological Attributes, 2003, Columbia University, New York, NY.
3. Andersen, David G., with Nick Feamster, Steve Bauer, and Hari Balakrishnan, Topology Inference from BGP Routing Dynamics, 2002, DARPA, San Diego, CA.
4. Kurose, James F., with Keith W. Ross, Computer Networking: A Top-Down Approach, Fifth Edition, 2010, Addison-Wesley: New York.
5. Gill, Phillipa, with Martin Arlitt, Zongpeng Li, and Anirban Mahanti, The Flattening Internet Topology:

Natural Evolution, Unsightly Barnacles or Contrived Collapse?, 2008, University of Calgary, Calgary, AB, Canada.
6. Luckie, Matthew, with Young Hyun and Bradley Huffaker, Traceroute Probe Method and Forward IP Path Inference, 2008.

SPEAKER BIOS

Malek Adjouadi, Ph.D.
Professor and Director, Center for Advanced Technology and Education, Florida International University
Malek is currently a Professor of Electrical and Computer Engineering at Florida International University and the Director of the NSF-funded Center for Advanced Technology and Education. He is leading the efforts on the joint Neuro-Engineering program between FIU and Miami Children's Hospital. His research interests are in image and signal processing, human-computer interfaces as assistive technology tools, and neuroscience applications in pediatric epilepsy.

Judit Camacho
Executive Director, SACNAS
Judit has been engaged with SACNAS for the last 17 years and has served two terms as executive director. She has a math degree from the University of California, Santa Cruz, and graduate coursework in public health from Johns Hopkins University. In between her terms at SACNAS, she worked at the National Institutes of Health (NIH) in the Division for Minority Opportunities in Research at the National Institute of General Medical Sciences, and subsequently in the Office of Workforce Development at the National Cancer Institute. Among the projects that she helped craft at NIH were the Summit on Latino Research, Outreach and Employment at the NIH and the Introduction to Cancer Research Careers program. Throughout Camacho's terms at SACNAS, the organization has seen unprecedented growth and has built a national reputation as a foremost scientific society representing minority communities in science.

Clemencia Cosentino de Cohen, Ph.D.
Senior Researcher, Mathematica Policy Research
Clemencia recently joined Mathematica Policy Research. She came from the Urban Institute, where she was the Director of the Program for Evaluation and Equity Research. Most of her work centers on studying efforts to increase student achievement and improve the participation of underrepresented groups (ethnic and language minorities, women) in STEM (science, technology, engineering, and mathematics) education and the STEM workforce. Examples include the ongoing evaluations of the National Science Foundation (NSF) Bridge to the Doctorate initiative to increase the number of minority students completing graduate degrees in STEM and of the NSF ADVANCE program to foster the careers of women (and minority women) in academia in STEM; the completed evaluation of the Louis Stokes Alliance for Minority Participation Program, designed to educate and retain minority students in undergraduate programs in STEM; and the soon-to-be-released evaluation of the Historically Black Colleges and Universities Program to increase institutional capacity to educate minority students in STEM. In addition, she recently completed ambitious and, in education, pioneering work in designing a retrospective evaluation of NASA's entire portfolio of higher education projects in STEM, many of which are focused on minority students or minority-serving institutions. In 2010, her work on women in engineering was selected as the feature article of the American Society for Engineering Education flagship magazine, PRISM, and received the Urban Institute President's award for outstanding research publication. She recently joined the board of the Evaluation of CISE Pathways to Revitalized Undergraduate Computer Education (NSF CPATH).

Jan Cuny, Ph.D.
Program Officer, NSF
Since 2004, Jan Cuny has been a Program Officer at the National Science Foundation. Before coming to NSF, she was a faculty member in Computer Science at Purdue University, the University of Massachusetts, and the University of Oregon. At NSF, Jan leads the Education Workforce Cluster and its two programs: Computing Education for the 21st Century (CE21) and the Broadening Participation in Computing Alliance (BPC-A) program. Together these programs aim to increase the number and diversity of students majoring in computing. Jan has had a particular focus on the inclusion of students from those groups that have been traditionally underrepresented in computing: women, African Americans, Hispanics, Native Americans, and persons with disabilities. For her efforts with underserved populations, Jan is a recipient of one of the 2006 ACM President's Awards, the 2007 CRA A. Nico Habermann Award, and the 2009 Anita Borg Institute Woman of Vision Award for Social Impact.

Jill Denner, Ph.D.
Associate Director, ETR Associates
Jill Denner is Associate Director of Research at Education, Training, Research (ETR) Associates, a non-profit organization in California. She does applied research, with a focus on increasing the number of women and Hispanics in computing. She has developed several after-school programs, and her research on these programs has contributed to an understanding of effective strategies for promoting youth leadership, building youth-adult partnerships, increasing students' confidence and capacity to produce technology, and engaging girls and Hispanics in information technology. As part of a long-standing commitment to bridge research and practice, her research is designed and conducted in collaboration with schools and community-based agencies. Dr. Denner has been PI on several NSF grants, written numerous peer-reviewed articles, and co-edited two books: Beyond Barbie and Mortal Kombat: New Perspectives on Gender and Gaming, published by MIT Press in 2008, and Latina Girls: Voices of Adolescent Strength in the US, published by NYU Press. Dr. Denner has a PhD in Developmental Psychology from Teachers College, Columbia University, and a B.A. in Psychology from the University of California, Santa Cruz.

Alicia C. Dowd, Ph.D.
Associate Professor, University of Southern California's Rossier School of Education
Co-director, The Center for Urban Education (CUE)
Dr. Dowd's research focuses on political-economic issues of racial-ethnic equity in postsecondary outcomes, organizational learning and effectiveness, accountability, and the factors affecting student attainment in higher education. Dr. Dowd is the principal investigator of a National Science Foundation funded study of Pathways to STEM Bachelor's and Graduate Degrees for Hispanic Students and the Role of Hispanic Serving Institutions. Through this study, CUE is examining the features of exemplary STEM policies and programs to identify ways for institutions to increase the number of Latino STEM graduates. As a research methodologist, Dr. Dowd has also served on numerous federal evaluation and review panels, including the Education Systems and Broad Reform Panel and the National Education Research and Development Center panels of the Institute for Education Sciences (IES) and NSF's Science, Technology, Engineering, and Mathematics Talent Expansion Program (STEP-Type 2) review panel. She was also a member of the technical working group consulting on the evaluation design for the Academic Competitiveness and SMART (science, mathematics, technology) grants awarded by the U.S. Department of Education. Currently she is a member of the advisory group for the Congressional Advisory Committee on Student Financial Aid (ACSFA). Dr. Dowd was awarded her doctorate by Cornell University, where she studied the economics and social foundations of education, labor economics, and curriculum and instruction. Her undergraduate studies were also at Cornell, where she was awarded a Bachelor of Arts degree in English literature.

Lorelle L. Espinosa, Ph.D.
Director of Policy and Strategic Initiatives, Institute for Higher Education Policy
Lorelle L. Espinosa, Ph.D., is the director of policy and strategic initiatives at the Institute for Higher Education Policy (IHEP). She provides leadership in aligning IHEP research, programs, policy initiatives, and other services with the organization's strategic direction. An expert on various higher education topics, Espinosa is well versed, as both a practitioner and researcher of higher education, on issues of postsecondary access and persistence of underrepresented groups. She has published on the transition and advancement of underrepresented minority students in science, technology, engineering, and mathematics (STEM) postsecondary education, with a current emphasis on women of color in STEM. Espinosa is a featured blogger ("STEM Watch") for Diverse: Issues in Higher Education, where she writes about the national imperative of building and sustaining a diverse STEM pipeline. She holds an M.A. and Ph.D. in Education from the University of California, Los Angeles. She received her B.A. from the University of California, Davis, and her A.A. from Santa Barbara City College. Prior to her graduate work and arrival at IHEP, Espinosa worked in the areas of student affairs and undergraduate education at the University of California, Davis, Stanford University, and the Massachusetts Institute of Technology.

Dan Garcia, Ph.D.
Lecturer SOE, UC Berkeley
Dan Garcia is a Lecturer with Security of Employment (SOE, i.e., "tenured" teaching faculty) in the Computer Science Division of the EECS Department at the University of California, Berkeley, and joined the Cal faculty in the fall. Dan received his PhD and MS in Computer Science from UC Berkeley in 2000 and 1995, respectively, and dual BS degrees in Computer Science and Electrical Engineering from the Massachusetts Institute of Technology. His research interests are computer science education and computational game theory.

Gilda Garretón, Ph.D.
Principal Engineer, Oracle Labs/Oracle
Gilda Garretón is a Principal Engineer at Oracle Labs/Oracle, and her main research focuses on VLSI CAD and parallel programming. She is an Open Source advocate and a Java/C++ developer. Gilda received her B.A. and Engineering degree from the Catholic University of Chile (PUC) and her Ph.D. from the Swiss Institute of Technology, Zurich (ETHZ). She is the cofounder of the community Latinas in Computing (LiC), whose goal is to promote leadership and professional development among Latinas in the engineering field.

Ann Gates, Ph.D.
Associate Vice President of Research and Sponsored Projects, The University of Texas at El Paso

Ann Quiroz Gates is the Associate Vice President of Research and Sponsored Projects at the University of Texas at El Paso and past chair of the Computer Science Department. Her research areas are software property elicitation and specification, and workflow-driven ontologies. Gates directs the NSF-funded Cyber-ShARE Center, which focuses on developing and sharing resources through cyber-infrastructure to advance research and education in science. She was a founding member of the NSF Advisory Committee for Cyberinfrastructure, and she served on the Board of Governors of the IEEE Computer Society. Gates leads the Computing Alliance of Hispanic-Serving Institutions (CAHSI), an NSF-funded consortium focused on the recruitment, retention, and advancement of Hispanics in computing, and is a founding member of the National Center for Women in Information Technology (NCWIT), a national network to advance participation of women in IT. Gates received the 2010 Anita Borg Institute Social Impact Award and the 2009 Richard A. Tapia Achievement Award for Scientific Scholarship, Civic Science, and Diversifying Computing; she was named to Hispanic Business magazine's 100 Influential Hispanics in 2006 for her work on the Affinity Research Group model, which focuses on the development of undergraduate students involved in research.

Marcus A. Huggans, Ph.D.
Director of Recruitment and Programming, The National GEM Consortium

Dr. Huggans completed his engineering studies at the University of Missouri-Rolla (now known as Missouri S&T). He received a B.S. degree in Electrical Engineering and M.S. and Ph.D. degrees in Engineering Management. He was one of the first African-American males to earn a Ph.D. in this discipline from the university. Dr. Huggans has extensive experience in the STEM field, with more than seventeen years working in industry. He has worked for 3M, AT&T Bell Labs, the Department of Justice-Federal Bureau of Investigation (FBI), and Texas Instruments Inc. Dr. Huggans ran his own real estate company while teaching Marketing, Management, and Mathematics at the university and community college levels. He also worked at Missouri S&T as the Director of the Student Diversity and Academic Support Program (SDP). Under his leadership, Missouri S&T experienced unprecedented growth in the recruitment of underrepresented minority students in the areas of science and engineering. He began working at the National GEM Consortium in 2006 as a Senior Recruiter and Programs Specialist, and he is now the Senior Director of External Relations at GEM. At GEM, Dr. Huggans recruits and conducts graduate programming to encourage underrepresented minority students to pursue graduate degrees in science, technology, engineering, and applied mathematics (STEM) fields. His motto: "If there is light in the soul, there will be beauty in the person. If there is beauty in the person, there will be harmony in the house. If there is harmony in the house, there will be order in the nation. If there is order in the nation, there will be peace in the world."

Patty Lopez, Ph.D.
Component Design Engineer, Intel Corporation

Patty Lopez spent 19 years as an Imaging Scientist for Hewlett Packard, creating and transferring technology in imaging and color algorithms into scanner, camera, and all-in-one products.
Patty joined Intel in Fort Collins, Colorado, in August 2008 and now works on microprocessor logic validation for manufacturability. Patty graduated with high honors from New Mexico State University with a B.S. in Computer Science and earned her M.S. and Ph.D. in Computer Science while working at NMSU's Computing Research Laboratory, a state-funded center of technical excellence. Patty chairs the Governance Committee of the NMSU Foundation, has established the Ross & Lydia Lopez Minority Scholarship, and serves as an advisory board member for several STEM departments and programs at NMSU. She joined the CAHSI Board and CRA-W Board in 2010, and represents Intel on the Anita Borg Institute Advisory Board. She served as co-chair of the GHC Birds of a Feather Committee in 2009 and co-chair of the GHC Technical Poster Session Committee in 2010, and is serving as co-chair of the GHC Panels and Workshops Committee. She is a founding member and co-chair of Latinas in Computing, a grassroots organization whose mission is to promote and develop Latinas in technology, and she received the HENAAC/Great Minds in STEM Community Service Award. Patty is a MentorNet mentor and a member of the NCWIT Workforce Alliance. Her current passions are computer science education, building the STEM pipeline for K-16, and creating an inclusive organizational culture in the workplace.

Manuel A. Pérez-Quiñones, Ph.D.
Associate Professor, Department of Computer Science, Virginia Tech

Manuel A. Pérez-Quiñones is Associate Professor of Computer Science and a member of the Center for Human-Computer Interaction at Virginia Tech. Pérez-Quiñones holds a D.Sc. in Computer Science from The George Washington University. His research interests include human-computer interaction, personal information management, user interface software, digital government, and educational/cultural issues in computing. He is Chair of the Coalition to Diversify Computing, a member of the editorial board for ACM's Transactions on Computing Education journal, and a founding member of the board of the non-profit Virginia Latino Higher Education Network. At Virginia Tech, he has been chair of the Hispanic/Latino Faculty and Staff Caucus, Associate Dean and Director of the Office for Graduate Recruiting and Diversity Initiatives of the Graduate School, and a Multicultural Fellow.

Dr. Carlos Rodriguez
Principal Research Scientist, American Institutes for Research, Washington, DC

Dr. Rodriguez serves as a Principal Research Scientist at the American Institutes for Research (AIR). He is nationally recognized for his expertise on issues of equity, access, and educational attainment for minority populations in K-12 and higher education, especially in STEM in higher education. Over the past 18 years at AIR, he has served as project director for national evaluations of STEM-related high school, undergraduate, and graduate initiatives. He has ongoing work with the National Science Foundation (NSF) on evaluation issues related to underrepresented minorities and cultural contexts, and has chaired NSF proposal review committees. Among his accomplishments, Dr. Rodriguez authored the national report America on the Fault Line: Hispanic American Education, which informed the Hispanic Education Action Plan (HEAP) guiding Hispanic educational initiatives in federal agencies.

Deborah A. Santiago
Co-founder and Vice President for Policy and Research, Excelencia in Education

Deborah A. Santiago is co-founder and Vice President for Policy and Research at Excelencia in Education and has spent more than 15 years leading research and policy efforts from the community to national levels to improve educational opportunities and success for all students. She has been widely cited by national media, such as the Economist, New York Times, Washington Post, and Chronicle of Higher Education, on issues related to Latinos in higher education. Deborah serves on the board of the Latin American Youth Center (DC) and the National Association for College Admission Counseling.
She also serves on the advisory boards of Univision's Education Campaign and the Pathways to College Network.

Nayda G. Santiago
Associate Professor, University of Puerto Rico, Mayagüez

Nayda G. Santiago received the B.S.E.E. degree from the University of Puerto Rico, Mayagüez Campus, in 1989, the M.Eng.E.E. degree from Cornell University in 1990, and the Ph.D. degree in Electrical Engineering from Michigan State University. Since 2003, she has been a faculty member of the University of Puerto Rico, Mayagüez Campus, in the Electrical and Computer Engineering Department, where she holds a position as Associate Professor. Nayda has been the recipient of the 2008 Outstanding Professor of Electrical and Computer Engineering Award, the 2008 Distinguished Computer Engineer Award of the Puerto Rico Society of Professional Engineers and Land Surveyors, the 2008 HENAAC (Hispanic Engineer National Achievement Awards Conference) Education Award, and the 2009 Distinguished Alumni Award of the University of Puerto Rico, Mayagüez Campus. She is a member of the IEEE and the ACM. She is one of the founding members of CAHSI and Femprof.

Maricel Quintana-Baker, Ph.D.
Principal, MQB-Consulting

Maricel is currently a principal at MQB-Consulting, where she specializes in research, writing, and training on higher education policy and Latino and women's issues. Previously she served as Associate Director for Academic Affairs and Planning at the State Council of Higher Education for Virginia (SCHEV). She is the founder and president of the Virginia Latino Higher Education Network, a non-profit dedicated to helping Virginia Latinos achieve a post-high school education, and a member of the National Advisory Board for the Computing Alliance of Hispanic-Serving Institutions. She currently serves on the Virginia Council on the Status of Women, served on the Virginia Latino Advisory Board for five years, helped found the Central Virginia Chapter of LULAC, and served a term on the Executive Board of The Women's Network, the state affiliate of ACE's Office of Women in Higher Education. She held an Oak Ridge Institute for Science and Education Post-Doctoral Fellowship at NSF's Division of Education and Human Resource Development. She earned her Ph.D. from American University and is a graduate of the Harvard University Graduate School of Education's MDP Program.

SPONSORS

CAHSI is funded by NSF Grant #CNS

CONTRIBUTORS
