Database Development. Richard Bruskiewich, Ph.D. Bioversity International. Principal Scientist Agricultural Biodiversity Informatics Theme

Global Timber Tracking Network: Database Development. Richard Bruskiewich, Ph.D., Principal Scientist, Agricultural Biodiversity Informatics Theme, Conservation & Availability, Bioversity International. 26 March 2012

Overview: Database objectives; Design requirements; Current implementation strategy (initial design decisions; domain model semantics, first iteration); Progress to date (status of implementation & deployment; brief tour of early prototype); Next steps

Database Objectives: Will aggregate reference DNA and isotope datasets. Will provide scientifically (statistically) sound analytical tools to support or refute claims of timber provenance. Target users/actors: DNA and isotope testing laboratories; DNA and isotope reference standard providers; national legal enforcement agencies (e.g. customs, courts); international multilateral agencies (e.g. FAO?). Other interested parties: Forestry companies? Conservation NGOs? Any others?
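As an illustration only of what a tool to "support or refute claims of timber provenance" might compute, here is a minimal sketch of a log-likelihood ratio test between a claimed and an alternative region of origin. The region names, marker loci, allele frequencies and function names are all hypothetical; the slides do not specify which statistical method the database will use.

```python
import math

# Hypothetical allele frequencies at two marker loci for two reference
# regions; real reference data would come from the contributing laboratories.
REFERENCE_FREQS = {
    "region_A": {"locus1": {"a": 0.8, "b": 0.2}, "locus2": {"c": 0.7, "d": 0.3}},
    "region_B": {"locus1": {"a": 0.3, "b": 0.7}, "locus2": {"c": 0.2, "d": 0.8}},
}


def log_likelihood(sample, freqs, floor=1e-3):
    """Log-likelihood of the observed alleles under one region's frequencies."""
    total = 0.0
    for locus, allele in sample.items():
        # Floor unseen alleles so a single unobserved allele does not give log(0).
        p = freqs.get(locus, {}).get(allele, 0.0)
        total += math.log(max(p, floor))
    return total


def log_likelihood_ratio(sample, claimed, alternative, reference=REFERENCE_FREQS):
    """Positive values favour the claimed origin, negative the alternative."""
    return (log_likelihood(sample, reference[claimed])
            - log_likelihood(sample, reference[alternative]))


# One allele observed per locus (a simplified, haploid-style sample).
sample = {"locus1": "a", "locus2": "c"}
llr = log_likelihood_ratio(sample, claimed="region_A", alternative="region_B")
print(f"log-likelihood ratio (A vs B): {llr:.3f}")
```

A real forensic tool would also need calibrated decision thresholds and uncertainty estimates, which this sketch leaves out.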

Design Requirements: Credible forensic resource: quality assurance workflows for reference data; embodies and accurately expresses objective science; provides an audit trail for forensic purposes. Globally accessible system: Internet (web) based application online 24 x 7; support for international languages by locale. Secure system: reference data treated as confidential; web access as secure as a bank. Scalable: expands with available reference data; extendable to embody new science/algorithms. Highly sustainable informatics implementation.
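One possible way to meet the audit-trail requirement is a hash-chained log, in which each entry records the hash of the previous one so that later tampering becomes detectable. The following is a minimal sketch under that assumption; the field names, actors and functions are hypothetical and not taken from the GTTN prototype.

```python
import hashlib
import json

FIELDS = ("actor", "action", "timestamp", "prev")


def _digest(body):
    """SHA-256 over a canonical JSON encoding of the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log, actor, action, timestamp):
    """Append an entry whose hash covers its content and the previous hash."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = {"actor": actor, "action": action, "timestamp": timestamp, "prev": prev_hash}
    log.append({**body, "hash": _digest(body)})
    return log


def verify(log):
    """Return True only if every entry still matches the recorded hash chain."""
    prev_hash = "0" * 64
    for entry in log:
        body = {key: entry[key] for key in FIELDS}
        if entry["prev"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


audit_log = []
append_entry(audit_log, "lab_01", "uploaded reference dataset", "2012-03-26T10:00:00Z")
append_entry(audit_log, "curator_07", "approved reference dataset", "2012-03-26T11:30:00Z")
print(verify(audit_log))
```

Editing any earlier entry changes its hash and breaks the chain, so `verify` would then return `False`. A production system would additionally store the log where ordinary users cannot rewrite it.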

Current Implementation Strategy Initial design decisions Domain model semantics (first iteration)

Initial Design Decision 1a: After reviewing programming language options for web development, we decided to use Python since it: Is a mature, dynamically typed language with a coherent, clean syntax and powerful programming constructs. Is seen as more robust, less confusing and more versatile than PHP, i.e. Python supports back-end systems computing, not just web interfaces (PHP is very web-centric). Is generally less tedious than Java(*) for prototyping, but still has native object-oriented support. (*) Actually, a Java implementation of Python (Jython) is available, if we change our minds.

Initial Design Decision 1b: Python has broad, excellent software library support for scientific computing and data visualization, e.g. NumPy/SciPy for scientific computations; RPy2 library for interfacing with R statistical software; PyMC for Monte Carlo analysis; PyMix for mixture models; Pybel interface to Open Babel cheminformatics; Python for ArcGIS; MayaVi for scientific data visualization; and many others.

Initial Design Decision 2: Given Python, it was decided to embrace Django, one of the most well-recognized industry-standard frameworks for web site development, which: Has a mature, solid design well supported by a global community. Has a very loosely coupled, modular architectural design, which encourages very rapid, iterative development. Has a very flexible HTML template-driven presentation layer. Web page formatting of the HTML will be done with Cascading Style Sheets, following best practices. JavaScript will be used for dynamic content.

Initial Design Decisions 3: Using PostgreSQL in the back end (but Django supports other back ends, so this could be changed). We are targeting CentOS Linux for server deployment (although the prototype does run well under Windows). Using NGINX + uWSGI as the web application container (faster and less onerous than Apache + FastCGI). The staging web server is deployed as an HTTPS/SSL secured web site.
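For orientation, a Django settings fragment for a PostgreSQL back end served over HTTPS might look roughly like the sketch below. The database name, user and environment variable names are placeholders, not the project's actual configuration. The setting names follow recent Django releases; Django versions current in 2012 used the engine string `django.db.backends.postgresql_psycopg2` and did not yet have `SECURE_SSL_REDIRECT`.

```python
import os

# Fragment of a hypothetical settings.py; all names and defaults are placeholders.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ.get("GTTN_DB_NAME", "gttn"),
        "USER": os.environ.get("GTTN_DB_USER", "gttn"),
        # Read the password from the environment rather than hard-coding it.
        "PASSWORD": os.environ.get("GTTN_DB_PASSWORD", ""),
        "HOST": os.environ.get("GTTN_DB_HOST", "localhost"),
        "PORT": os.environ.get("GTTN_DB_PORT", "5432"),
    }
}

# Redirect plain HTTP to HTTPS and send cookies over secure connections only.
SECURE_SSL_REDIRECT = True
SESSION_COOKIE_SECURE = True
CSRF_COOKIE_SECURE = True
```

Switching to another supported back end would mainly mean changing the `ENGINE` entry and the connection details, which is what makes the database choice reversible.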

Domain model semantics (first iteration)

Progress to Date: Project meeting at beginning of November 2012. Status of implementation & deployment: HTTPS/SSL protected staging web server commissioned on a Rackspace-hosted server. Baseline project development environment established: e.g. source code repository, project wiki, etc. Core informatics implementation technologies were reviewed, are partially selected and are being applied. A preliminary data model was specified and implemented. Iterative software development work started on prototype.

Brief Demonstration of Prototype

Next Steps: Towards a leaner and more agile development strategy? A rapid, iterative lean/agile design/implementation cycle was promised at the November 2012 meeting but has not yet been put into place. Perhaps this process can start now, given initial progress on putting the prototype in place. Resourcing and collaboration? It's a 15 billion dollar a year problem. If the GTTN informatics system is to be properly implemented soon, would additional targeted resources (and dedicated personnel) help? Except for this presenter, the Bioversity team lacks the technical expertise, and certainly the human resourcing, to meet the challenge. What kind of community involvement is desired?

Thank you.