What is the Common Problem that Makes most Biological Databases Hard to Work With, if not Useless to most Biologists?




RUNI VILHELM
MRAG Americas, Inc.
110 South Hoover Blvd., Suite 212
Tampa, Florida 33609-2458 USA

ABSTRACT

The manner in which a biologist collects, stores, and understands the relations between data is often completely different from the way a computer analyst views data. Even though databases are supposed to be relationally designed (i.e., built on relations between the data entities), this rarely happens. In real life, the biologist starts collecting data for a scientific purpose and, only after collection has commenced, contacts a computer analyst (CA) in the IT department to help develop a database for the data collection schema. From a brief discussion with the biologist, the CA designs and implements a table structure (database). The product is then turned over to the biologist, who must work out the relations that the CA had in mind when developing the database. The database will not be the helpful tool it was intended to be; instead, it is often a troublesome data storage system that the biologist is stuck with in the future. To make matters worse, the database is rarely documented, and if or when the CA leaves or retires from the organization, no one will completely understand the database structure.

Thus, what is a relational database design? Is it (1) a design that promotes the relations and business rules of the collected data, or (2) a normalized relational design that the computer analyst creates from theories he or she has read about in a computer book? Most databases are constructed based on (2). However, this makes most databases hard to work with, if not useless, to most biologists. So how can this be changed? To construct a database that will be used successfully, concepts from both (1) and (2) must be included when designing the database structure. This presentation will focus on how this can be achieved.
KEY WORDS: Databases, design

¿Cuál es el Problema Común que Ocasiona que la Mayoría de los Bancos de Data Biológica sean Difíciles de Utilizar o sean Inefectivos para la Mayoría de los Biólogos?

A pesar de que los bancos de data están supuestos a tener un diseño de relación (es decir, elaborados en relaciones de data entre las entidades), usualmente no es así. En la realidad el biólogo comienza a recopilar la data con un propósito científico, y luego se pone en contacto con un analista de data (CA, por sus siglas en inglés) que lo ayude a desarrollar un banco de data para el esquema de la data recogida después de haber comenzado ésta. Después de tener una breve discusión con el biólogo, el CA diseña e implementa una tabla de estructura (banco de data). El producto es entonces devuelto al biólogo, y le corresponde a éste entender las relaciones que el CA tiene en mente al desarrollar el banco de data. El banco de data no será entonces el instrumento de ayuda productivo que se tenía pensado, más bien se convierte en un problemático sistema para guardar la data que no es productivo. Peor aún, el banco de data es muy pocas veces documentado y si en algún momento el CA se retira o abandona la entidad, no habrá nadie que entienda completamente la estructura del banco de data.

Por lo tanto, ¿qué es un diseño de banco de relación de data? ¿(1) Es un diseño que promueve las relaciones y reglas de trabajo de la data recogida? ¿O (2) un diseño de relación normalizado que el analista crea basado en teorías que él o ella han leído en un libro de computadora? La mayoría de los bancos de data están construidos basados en la 2. Sin embargo, esto hace que la mayoría de los bancos de data sean difíciles o imposibles de utilizar por la mayoría de los biólogos. ¿Cómo se puede cambiar esto? Para construir un banco de data que se pueda utilizar efectivamente se deben incluir ambos conceptos, el 1 y 2, al diseñar la estructura del banco de data. El enfoque de esta presentación va dirigido a lograr este propósito.

PALABRAS CLAVES: Bancos de data, diseñar

Page 440 56th Gulf and Caribbean Fisheries Institute

INTRODUCTION

Twenty years ago, computers were simply not powerful enough to handle the processing of data, especially under the relational database model (which we will explore shortly). As computers advanced, database use snowballed into what we see now. We are only beginning to appreciate the power of in-depth database systems and what these systems make possible.
Before you can begin to design or use a database, you must understand the underlying concepts and theories of why databases are used and how they are created. I will explain what a database is, and then introduce the relational database model and Structured Query Language (SQL). Databases are the primary form of storage in both today's online and offline worlds. They are used to store millions of different types and combinations of information, including product details, employee records, personal address books, and news.

Vilhelm, R. GCFI:57 (2006) Page 441

OVERVIEW OF THE RELATIONAL MODEL

The relational model was formally introduced by Dr. E. F. Codd in 1970 and has evolved since then through a series of writings. The model provides a simple, yet rigorously defined, concept of how users perceive data. It represents data in the form of two-dimensional tables, each of which represents some real-world person, place, thing, or event about which information is collected. A relational database is a collection of such tables.

A relational table is a flat file composed of a set of named columns and an arbitrary number of unnamed rows. The columns hold the attributes of the "thing" represented by the table, and the rows represent occurrences of that thing. A data value is stored at the intersection of a row and a column.

PROPERTIES OF THE RELATIONAL TABLES

Values are Atomic

This property implies that the columns in a relational table are not repeating groups or arrays: each row-and-column intersection holds a single value. Such tables are said to be in "first normal form" (1NF). The atomic value property is important because it is one of the cornerstones of the relational model.

Column Values are of the Same Kind

In relational terms this means that all values in a column come from the same domain. A domain is the set of values that a column may have. For example, a Monthly Salary column contains only monthly salaries. It never contains other information such as comments, status flags, or even weekly salaries.

Each Row is Unique

This property ensures that no two rows in a relational table are identical; there is at least one column, or set of columns, whose values uniquely identify each row in the table. Such columns are called primary keys and are discussed in more detail under Relationships and Keys. This property guarantees that every row in a relational table is meaningful and that a specific row can be identified by specifying its primary key value.

The Sequence of Columns is Insignificant

This property states that the ordering of the columns in a relational table has no meaning. Columns can be retrieved in any order and in various sequences. The benefit of this property is that it enables many users to share the same table without concern for how the table is organized. It also permits the physical structure of the database to change without affecting the relational tables.
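The properties described so far can be tried out in a few lines. The following is a minimal sketch using Python's built-in sqlite3 module; the table `fish_sample`, its columns, and the sample values are hypothetical and chosen only for illustration. It shows a column restricted to one domain, a primary key that keeps every row unique, and columns being addressed by name rather than by position.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table: one atomic value per column, one row per sample.
conn.execute("""
    CREATE TABLE fish_sample (
        sample_id INTEGER PRIMARY KEY,                 -- uniquely identifies each row
        species   TEXT NOT NULL,
        length_mm REAL NOT NULL CHECK (length_mm > 0)  -- domain: positive lengths only
    )
""")

# Columns are referenced by name, so the order in which they are listed is irrelevant.
conn.execute(
    "INSERT INTO fish_sample (sample_id, species, length_mm) "
    "VALUES (1, 'Lutjanus analis', 412.0)"
)
conn.execute(
    "INSERT INTO fish_sample (length_mm, species, sample_id) "
    "VALUES (388.5, 'Lutjanus synagris', 2)"
)


def is_rejected(statement: str) -> bool:
    """Return True if the database refuses the statement on integrity grounds."""
    try:
        conn.execute(statement)
    except sqlite3.IntegrityError:
        return True
    return False


# A second row with primary key 1 would make two rows indistinguishable by key.
duplicate_key_rejected = is_rejected(
    "INSERT INTO fish_sample (sample_id, species, length_mm) "
    "VALUES (1, 'Epinephelus striatus', 530.0)"
)

# A negative length falls outside the column's domain.
bad_domain_rejected = is_rejected(
    "INSERT INTO fish_sample (sample_id, species, length_mm) "
    "VALUES (3, 'Lutjanus analis', -5.0)"
)

# Retrieve the columns in a different sequence from the one used at creation.
by_name = conn.execute(
    "SELECT species, sample_id FROM fish_sample WHERE sample_id = 2"
).fetchone()
row_count = conn.execute("SELECT COUNT(*) FROM fish_sample").fetchone()[0]
```

Only the two valid rows are stored; both rejected statements leave the table unchanged.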

The Sequence of Rows is Insignificant

This property is analogous to the one above but applies to rows instead of columns. The main benefit is that the rows of a relational table can be retrieved in different orders and sequences. Adding information to a relational table is simplified and does not affect existing queries.

Each Column has a Unique Name

Because the sequence of columns is insignificant, columns must be referenced by name and not by position. In general, a column name need not be unique within an entire database but only within the table to which it belongs.

Relationships and Keys

A relationship is an association between two or more tables. Relationships are expressed in the data values of the primary and foreign keys. A primary key is a column or set of columns in a table whose values uniquely identify each row in that table. A foreign key is a column or set of columns whose values are the same as the primary key of another table. You can think of a foreign key as a copy of the primary key from another relational table. The relationship between two relational tables is made by matching the values of the foreign key in one table with the values of the primary key in the other.

Keys are fundamental to the concept of relational databases because they enable tables in the database to be related to each other. Navigation around a relational database depends on the ability of the primary key to unambiguously identify specific rows of a table. Navigating between tables requires that the foreign key correctly and consistently reference the values of the primary key of the related table.

Data Integrity

Data integrity means, in part, that you can correctly and consistently navigate and manipulate the tables in the database. There are two basic rules to ensure data integrity: entity integrity and referential integrity.

The entity integrity rule states that the value of the primary key can never be a null value (a null marks the absence of a value and is not the same as a blank). Because a primary key is used to identify a unique row in a relational table, its value must always be specified and should never be unknown. The entity integrity rule requires that insert, update, and delete operations maintain the uniqueness and existence of all primary keys.
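As an illustration of keys and the entity integrity rule, here is a minimal sketch using Python's built-in sqlite3 module. The tables `station` and `catch_record` and all values are hypothetical. One detail is specific to SQLite and is an assumption of this sketch rather than part of the relational model: SQLite accepts NULL in a non-integer primary key unless the column is also declared NOT NULL, so the declaration below states both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Parent table: station_code is the primary key.
# NOT NULL is spelled out because SQLite would otherwise accept a NULL key here.
conn.execute("""
    CREATE TABLE station (
        station_code TEXT PRIMARY KEY NOT NULL,
        station_name TEXT NOT NULL
    )
""")

# Child table: station_code is a foreign key, a copy of the parent's primary key.
conn.execute("""
    CREATE TABLE catch_record (
        catch_id     INTEGER PRIMARY KEY,
        station_code TEXT REFERENCES station (station_code),
        species      TEXT NOT NULL
    )
""")

conn.execute("INSERT INTO station VALUES ('TB01', 'Tampa Bay North')")
conn.executemany(
    "INSERT INTO catch_record VALUES (?, ?, ?)",
    [
        (1, "TB01", "Lutjanus griseus"),
        (2, "TB01", "Centropomus undecimalis"),
    ],
)

# Entity integrity: a row whose primary key is unknown cannot be stored.
try:
    conn.execute("INSERT INTO station VALUES (NULL, 'Unnamed site')")
    null_key_rejected = False
except sqlite3.IntegrityError:
    null_key_rejected = True

# Navigate between the tables by matching foreign key values to primary key values.
joined = conn.execute("""
    SELECT c.catch_id, s.station_name
    FROM catch_record AS c
    JOIN station AS s ON s.station_code = c.station_code
    ORDER BY c.catch_id
""").fetchall()
```

Each catch record stores only the station code; the station name is looked up through the key match, so it is recorded once no matter how many catches refer to it.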

The referential integrity rule states that if a relational table has a foreign key, then every value of the foreign key must either be null or match a value in the relational table in which that foreign key is the primary key.

Normalization

Normalization is a design technique that is widely used as a guide in designing relational databases. It is essentially a two-step process that first puts data into tabular form by removing repeating groups and then removes duplicated data from the relational tables.

Normalization theory is based on the concept of normal forms. A relational table is said to be in a particular normal form if it satisfies a certain set of constraints. Five normal forms are commonly defined. In this section, we will cover the first three, which were defined by E. F. Codd.

The goal of normalization is to create a set of relational tables that are free of redundant data and that can be consistently and correctly modified. This means that all tables in a relational database should be in the third normal form (3NF). A relational table is in 3NF if and only if all non-key columns are (a) mutually independent and (b) fully dependent upon the primary key. Mutual independence means that no non-key column is dependent upon any combination of the other columns. The first two normal forms are intermediate steps toward the goal of having all tables in 3NF. To better understand 2NF and the higher forms, it is necessary to understand the concepts of functional dependencies and lossless decomposition.

Simply stated, normalization is the process of removing redundant data from relational tables by decomposing (splitting) a relational table into smaller tables by projection. The goal is to have only primary keys on the left-hand side of a functional dependency. To be correct, the decomposition must be lossless. That is, the new tables can be recombined by a natural join to recreate the original table without creating any spurious or redundant data.

SQL (STRUCTURED QUERY LANGUAGE)

A discussion of databases would not be complete without touching on SQL, the database language of choice for most relational database systems. Often pronounced "see-kwel", SQL was first conceived at IBM's laboratories in the early 1970s, where it was named SEQUEL, not SQL. The name was later shortened to SQL, now read as Structured Query Language, and ANSI published the first SQL standard in 1986.

Databases are accessed through queries, which extract, update, insert, and delete records, or otherwise work with the database's data.
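The decomposition and the query operations described above can be sketched together. The example below uses Python's built-in sqlite3 module; the table `landing_flat`, its columns, and the vessel data are hypothetical. It starts from a table in which the vessel name and home port depend on `vessel_id` rather than on the primary key, splits it by projection, checks that a natural join recreates the original rows, and then runs an update and a delete against the smaller tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Unnormalized table: vessel_name and home_port depend on vessel_id,
# not on the primary key landing_id, so vessel details are repeated.
conn.execute("""
    CREATE TABLE landing_flat (
        landing_id  INTEGER PRIMARY KEY,
        vessel_id   INTEGER NOT NULL,
        vessel_name TEXT NOT NULL,
        home_port   TEXT NOT NULL,
        species     TEXT NOT NULL,
        weight_kg   REAL NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO landing_flat VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 10, "Marlin", "Tampa", "red snapper", 120.5),
        (2, 10, "Marlin", "Tampa", "gag grouper", 80.0),
        (3, 20, "Pelican", "Key West", "red snapper", 95.25),
    ],
)

# Decompose by projection. CREATE TABLE ... AS copies data only; in a real
# schema the key constraints would be declared explicitly on each new table.
conn.execute("""
    CREATE TABLE vessel AS
    SELECT DISTINCT vessel_id, vessel_name, home_port FROM landing_flat
""")
conn.execute("""
    CREATE TABLE landing AS
    SELECT landing_id, vessel_id, species, weight_kg FROM landing_flat
""")

# Lossless check: the natural join (on the shared column vessel_id)
# must reproduce exactly the original rows.
columns = "landing_id, vessel_id, vessel_name, home_port, species, weight_kg"
original = conn.execute(
    f"SELECT {columns} FROM landing_flat ORDER BY landing_id"
).fetchall()
rejoined = conn.execute(
    f"SELECT {columns} FROM landing NATURAL JOIN vessel ORDER BY landing_id"
).fetchall()
is_lossless = original == rejoined

# Update: the home port is now stored once, so one row changes
# and every landing by that vessel sees the new value.
conn.execute("UPDATE vessel SET home_port = 'St. Petersburg' WHERE vessel_id = 10")
ports = conn.execute(
    "SELECT DISTINCT home_port FROM landing NATURAL JOIN vessel WHERE vessel_id = 10"
).fetchall()

# Delete: remove one landing without touching the vessel record.
conn.execute("DELETE FROM landing WHERE landing_id = 3")
remaining_landings = conn.execute("SELECT COUNT(*) FROM landing").fetchone()[0]
vessel_count = conn.execute("SELECT COUNT(*) FROM vessel").fetchone()[0]
```

In the flat table, changing a vessel's home port would mean editing every landing row for that vessel, and deleting a vessel's only landing would also erase what is known about the vessel. After the split, neither problem arises.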

COMMON DATABASE DESIGN FLAWS

Sampling Process

In real life the biologist starts collecting data for a scientific purpose and, only after collection has commenced, contacts a computer analyst (CA) in the IT department to help develop a database for the data collection schema. From a brief discussion with the biologist, the CA designs and implements a table structure (database). The product is then handed over to the biologist, who must work out the relations that the CA had in mind when developing the database. The database will not be the helpful tool it was intended to be; rather, it will be a troublesome data storage system that the biologist is stuck with in the future. To make things even worse, the database is rarely documented, and if or when the CA leaves or retires from the organization, no one will completely understand the database structure.