Design and Implementation of a Publication Database for the Vienna University of Technology

Karl Riedling
Institute of Industrial Electronics and Material Science, TU Wien, A-1040 Vienna
karl.riedling@tuwien.ac.at

Abstract: Initially for the internal use of the EE faculty of the Vienna University of Technology and as a tool for the evaluation of the scientific output of the faculty's institutes, the author has created a database for publications. The first prototype, based on Microsoft Access, was in due course replaced by a Web-based solution with a LAMP server concept. Because the Web software met the expectations of the university authorities, it was implemented university-wide in mid-2002, and has since then provided all publication-related evaluation data of the university. This presentation will describe the design of the Publication Database and will give some information on technical and organisational problems encountered during its implementation at the university, particularly aspects of how to improve user acceptance.

Introduction

The scientific community generally measures the quality of scientific work by the resulting published output. However, a simple count of the publications of researchers or institutes is hardly an accurate representation of this output due to the wide range of what different people consider a scientific publication, and because a simple publication count supplied by the researchers often cannot be fully verified. There are official databases of recognized publications for some areas but not for all, notably Electrical Engineering. With the aim of providing reliable publication data for the EE faculty of the Vienna University of Technology that are suitable for resolving all conceivable pertinent queries, the author has devised the Publication Database. Since a quick solution was required, the author chose Microsoft Access for the prototype version of this database.
The Access prototype consisted of two modules, a GUI front-end with a number of VB script modules as a user interface that could easily be upgraded, and a back-end that held the publication data. This database became operational after only a few months of development; after a few more months of test operation at the author's institute it was introduced faculty-wide in late 1999, first with separate copies of the Database for each of the about fifteen institutes of the faculty, later with one common server-based installation. The severe drawbacks of the Access application soon showed: bad acceptance by institutes that found themselves forced to set up a computer under Windows and install Access; compatibility problems, since it was not possible to simultaneously run the Database on different versions of Access or upgrade it automatically to a newer version; frequently damaged data sets due to computers crashing with the Access Publication Database open; consequently, an excessive need for database maintenance; no reasonable possibility for real-time queries by Web servers; and, above all, the lack of any serious security, because even looking up data on an Access database requires write permission to the database file.

These considerations and the prospect of a much more powerful system led to the development of a Web-based database solution with a LAMP (Linux Apache MySQL PHP) approach. Based on the concept of and the experience gathered with the Access prototype, and under the author's supervision, a group of four students developed the code of the Web Database, which took more than one year due to the complexity of the task involved. Hence, the Web version became available only in mid-2001, almost two years after the Access prototype had been ready for use. After not less than 3 version releases of the Access Database, the Web Database took over the data and the tasks of the prototype Publication Database. Since mid-2001, the author has been the (practically) sole person in charge of the Web Publication Database; he has added a wealth of additional functions and improvements in the meantime. The Database has grown from 25 tables in its Access version to 30 in the early Web releases, and comprises 47 tables in its current 9th release. According to a recognised tool for software assessment, David A. Wheeler's SLOCCount, the Web database consists of 8,260 lines of PHP program code today, which corresponds to more than 4.5 person-years of development effort (and development costs of 570,000 US-$). (SLOCCount cannot count the lines of HTML code in the Web database, whose number lies in a similar order of magnitude.) Although the Access prototype version already provided much more functionality than required for evaluation purposes, we could significantly increase the added value of the Database in the Web version.
Because the software met the expectations of the university authorities, it was finally implemented university-wide in mid-2002, and has since then provided all publication-related evaluation data of the university.

The Concept of the Publication Database

The design of a system like the Publication Database has to take into account two possibly conflicting requirements. Information in the Database has to be as complete and detailed as possible to allow for all conceivable queries, which may not only result in a simple count of publications but should also take into account the quality of the publications (or, easier to maintain, of the media where the publications appeared). On the other hand, entries into the Database are often made by persons not familiar with the detailed aspects of bibliography (e.g., by an institute secretary). This precludes a full-blown bibliographic system and demands a flexible approach where only the fields essential for identifying and verifying a publication need filling in, while optional fields are available for additional information such as abstracts, identification numbers in internationally recognised publication collections for the particular field, or a link to an electronic version of the publication.

SLOCCount: http://www.dwheeler.com/sloccount/
In general, instruments that only serve the purpose of collecting statistical data are not well accepted. In order to improve the acceptance of the Publication Database, it has to provide sufficient added value to its users. This entails that, e.g., everybody must be able to extract their own publication lists or have them created dynamically in any conceivable format for use on a website, and that external users can freely search for information in the Database. In fact, the Publication Database must serve as a knowledge base as well.

Both operations, the collection of evaluation data and the operation as a knowledge base with the possibility to search for information, imply a genuine database structure, where each item of a publication entry is located in a separate field of a database table. Obviously, there are relations between the data fields; hence, a relational database is required. Several authors affiliated to different institutes of the faculty may jointly have written a publication, which is supposed to appear in the publication lists of each of its authors, and of each of the groups and institutes to which its authors belong. This implies that the names of persons must reside in a separate table, linked to the table of publications, with references to the groups and institutes they belong to; and that the names of the authors must be selected from a list when the publication entry is being created. (For reasons of uniformity, the same applies to the editors of books or conference proceedings, and to the reviewers or supervisors of doctoral or diploma theses.) Obviously, it must be possible to add new names to the name table in the course of entering a publication.

Fig. 1: Hierarchic organisation of publications in the Publication Database (Media Class, e.g., "Journals"; Media Type, e.g., "SCI Journals"; Publication Medium, e.g., "Applied Physics Letters"; Publication)

Weighting publications should also be as easy as possible: It simply would not do to have information such as the SCI status of a publication or the impact factor of the journal in which it appeared entered separately for each publication. These are properties of the publication medium (e.g., the journal), which properly belong in a publication medium record (see Fig. 1). Similar to the names of authors, publication media have to be selected from a list, and only added to this list if they are not yet in the database. It should also be possible to tie together publication media of comparable quality and regard them as belonging to one media type (which, in turn, determines their weight in an evaluation). For example, journals listed in the SCI with an impact factor greater than 1 may constitute a particular media type. Since journals and, e.g., conferences obviously cannot share media types, they constitute different media classes. The media classes recognised in the Publication Database are journals, publishing houses (for books and contributions to books or proceedings volumes), events (for oral or poster presentations at conferences or other scientific meetings), and patents. The publication media concept is not used for simple publications like diploma or doctoral theses or reports, which also should be represented in the Database.

The publication media concept greatly facilitates a sanity check of the data entered: Instead of looking at the classification in hundreds of publication entries, only the classification of the publication media needs checking. Particularly in the case of journals, the number of publication media grows only slowly after an initial phase, and it is easy to look up these newly added journals in the proper databases. (Of course, this process could be automated altogether.)

Publication references should be standardised but require a different structure for different types of publications. (In fact, a standardised reference format is one of the greatest benefits in creating publication lists from a database. Whoever has had to create a decent-looking publication list for a department report from lists supplied by various colleagues will probably agree.) It therefore makes sense to define publication types: A publication type determines not only the format of the reference output; it also determines the media class to which the publication media offered for selection must belong.

Fig. 2: Simplified ER diagram of the Publication Database (entities: Access Rights, Names, Groups, Institutes, Publications, Publication Types, Publication Media, Media Types, Media Classes; relations between Publications and Names: Authors (m:n), Editors etc. (m:n), Owner)

This structure results in the ER diagram shown in Fig. 2, which is a greatly simplified representation of the actual table structure of the Publication Database.
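The structure described above can be illustrated with a small relational sketch. This is a minimal illustration only, not the actual schema of the Publication Database: it uses SQLite in place of MySQL, all table and column names are hypothetical, and groups, editors, access rights and publication types are left out for brevity.

```python
import sqlite3

# Hypothetical, heavily reduced schema: media class -> media type -> medium,
# and an m:n link table between publications and persons.
SCHEMA = """
CREATE TABLE media_classes (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE media_types (
    id INTEGER PRIMARY KEY,
    class_id INTEGER NOT NULL REFERENCES media_classes(id),
    name TEXT NOT NULL
);
CREATE TABLE media (
    id INTEGER PRIMARY KEY,
    type_id INTEGER NOT NULL REFERENCES media_types(id),
    name TEXT NOT NULL
);
CREATE TABLE institutes (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE persons (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    institute_id INTEGER REFERENCES institutes(id)
);
CREATE TABLE publications (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    year INTEGER NOT NULL,
    medium_id INTEGER REFERENCES media(id)  -- NULL for theses and reports
);
CREATE TABLE authors (
    publication_id INTEGER NOT NULL REFERENCES publications(id),
    person_id INTEGER NOT NULL REFERENCES persons(id),
    position INTEGER NOT NULL,
    PRIMARY KEY (publication_id, person_id)
);
"""


def build_demo():
    """Create an in-memory database with one jointly written publication."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.execute("INSERT INTO media_classes VALUES (1, 'Journals')")
    db.execute("INSERT INTO media_types VALUES (1, 1, 'SCI Journals')")
    db.execute("INSERT INTO media VALUES (1, 1, 'Applied Physics Letters')")
    db.executemany("INSERT INTO institutes VALUES (?, ?)",
                   [(1, "Institute A"), (2, "Institute B")])
    db.executemany("INSERT INTO persons VALUES (?, ?, ?)",
                   [(1, "Author One", 1), (2, "Author Two", 2)])
    db.execute("INSERT INTO publications VALUES (1, 'A joint paper', 2002, 1)")
    db.executemany("INSERT INTO authors VALUES (?, ?, ?)",
                   [(1, 1, 1), (1, 2, 2)])
    return db


def titles_for_institute(db, institute_id):
    """Return the titles of all publications with an author at the institute."""
    rows = db.execute(
        """SELECT DISTINCT p.title
           FROM publications p
           JOIN authors a ON a.publication_id = p.id
           JOIN persons n ON n.id = a.person_id
           WHERE n.institute_id = ?
           ORDER BY p.title""",
        (institute_id,),
    )
    return [title for (title,) in rows]
```

Because authorship lives in a link table, the jointly written paper is stored once but appears in the list of each author's institute, which is the behaviour the text asks for.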
Figure 2 does not show the tables that hold auxiliary information such as the formatting of the reference output, the grouping of publication types in publication lists, or the evaluation queries and results, and it also shows only one relation that determines the owner of a publication entry (i.e., the person that made the entry). All tables regular users can modify have similar fields that make it possible to determine who the last person to change the entry was.
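The role of publication types and of the formatting information held in tables can be sketched as follows. This is an illustration under assumptions: a Python dictionary stands in for the publication type and formatting tables, and the type names, template syntax and field names are all hypothetical.

```python
# Stand-in for the publication type and formatting tables. Every name and
# template below is hypothetical; in a table-driven design these would be
# rows in the database rather than program code.
PUBLICATION_TYPES = {
    "journal_article": {
        "media_class": "Journals",
        "format": '{authors}: "{title}"; {medium}, {volume} ({year}), {pages}.',
    },
    "doctoral_thesis": {
        "media_class": None,  # theses do not use the publication media concept
        "format": '{authors}: "{title}"; doctoral thesis, {institution}, {year}.',
    },
}


def selectable_media(pub_type, media):
    """Return the media names offered for selection for this publication type.

    `media` is a list of (name, media_class) pairs.
    """
    wanted = PUBLICATION_TYPES[pub_type]["media_class"]
    if wanted is None:
        return []
    return [name for name, media_class in media if media_class == wanted]


def format_reference(pub_type, fields):
    """Render a reference using the template stored for the publication type."""
    return PUBLICATION_TYPES[pub_type]["format"].format(**fields)
```

In this sketch, supporting a further publication type means adding one more record to `PUBLICATION_TYPES`; neither function needs to change.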
The concept, already introduced in the Access prototype, of keeping as much configuration information as possible in database tables proved to be exceedingly successful: No changes of the program code proper are necessary to introduce new publication types; this requires only adding records to the publication type and the formatting tables. In fact, the core table structure as shown in Fig. 2 has remained unchanged through the life of the Database; however, many new fields have been added to these tables.

The Implementation of the Publication Database

While the already mentioned shortcomings of Access dictated a different solution in any case, other boundary conditions favoured a Web solution over any other client-server concept: We were looking for a sustainable solution that should exceed the lifetime of common client software applications. In a university environment, there is a wide range of hardware and operating system platforms. This precludes conventional LAN-based clients. Using the Database as a knowledge base and providing external access to the publication information requires a Web interface anyway. This also applies to the added value of the Database as an on-demand generator of publication lists for the Web servers of the faculty's institutes. In general, using conventional Web browsers as clients and the HTTP (or secure HTTP) protocol for transport makes the Database platform-independent and accessible worldwide. Since university members tend to use a variety of browsers, including some exotic species, browser-independent programming is mandatory.

For various (primarily financial, but also technical) reasons, we chose a LAMP structure for the database server, with client-based JavaScript for local pre-processing.
There are three major access points to the Publication Database: an authenticated access for data entry and maintenance (the "administration module"); an interactive public interface that allows searching for publications and/or creating tailored publication lists of persons, groups, or institutes; and a function that dynamically creates pages with publication lists in a custom design for inclusion on other websites. In the two interactive interfaces, a variety of query functions permits restricting a search to entries meeting certain conditions. Both interactive interfaces provide a full-text search, which may comprise the entire entry including abstracts etc., or only certain fields of the entry. The administration module is in German only but permits creating publication lists in English; the other two interfaces are available in German and English. Although the administration module allows anonymous access, the recommended access for the outside world is the (much simpler) interactive interface.

A five-level access privilege scheme has been implemented in the administration module of the Database: Anonymous users have strictly read-only access. At the lowest authenticated level, users may create publication entries and edit their entries (where "their" means those entries they made themselves, plus all entries in which they appear in the list of authors). The third level extends the editing rights to all publication entries created by members of the group the user belongs to, or with authors belonging to this group. The fourth level analogously extends these rights to the user's institute. The highest level is the administrator, who can edit any publication entry in the Database. A separate privilege attribute allows a user to change evaluation-specific parameters; this attribute is not technically but in practice tied to the administrator rights level. Since access to a publication also depends on the relation of the user to at least one of the authors, the table in which the access rights are stored is closely linked to the table that holds the names of authors and other persons shown in publication entries (see Fig. 2).

Two different statistics functions provide evaluation data, (a) by the official algorithm, which is more or less a simple count of publications, however, with a very peculiar grouping, and (b) with a proprietary function that also accounts for the quality of the publication media and the sizes of the publications. Since the official algorithms are not only rather convoluted but also likely to change unexpectedly, we chose a very flexible approach that permits an administrator to construct queries interactively based on publication and media types plus additional information from the publication entry. These queries are stored in a set of database tables for later use. Every user of the administration module may inspect the queries (which can be altered by an administrator only) and interactively run them. For the official algorithm, a publication counts for an institute if at least one of its authors belongs to the institute. The proprietary function uses freely definable weights for each media type, which allows distinguishing between publication media with, e.g., different impact factors. In addition to the weight per publication, a weight per page can be specified (with a freely definable limit), and other facts enter as well via weight factors, e.g., whether a publication was invited or not.
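The weighting just described can be made concrete with a short sketch. The text does not say how the individual weights are combined, so the formula below is an assumption: the per-page points are added to the media-type weight, and the factor for invited publications multiplies the sum. All names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Publication:
    media_type: str
    pages: int
    invited: bool = False


@dataclass
class Weights:
    per_publication: dict        # media type -> points per publication
    per_page: float = 0.0        # points per page
    page_limit: int = 0          # pages beyond this limit earn nothing
    invited_factor: float = 1.0  # multiplier for invited publications


def score(pub, weights):
    """Points earned by one publication (assumed combination rule)."""
    base = weights.per_publication.get(pub.media_type, 0.0)
    page_points = weights.per_page * min(pub.pages, weights.page_limit)
    total = base + page_points
    if pub.invited:
        total *= weights.invited_factor
    return total
```

With hypothetical weights of 3 points for an SCI journal article, 0.25 points per page up to 8 pages, and a factor of 1.5 for invited work, a 12-page journal article earns 3 + 0.25 * 8 = 5 points.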
Depending on the configuration chosen, the points thus earned by a publication may be given to each group or institute to which at least one author belongs, they may be given in full to each author of the publication, or be split into equal parts between all authors. Since the latter approach tends to discriminate against authors who work in groups, a simple algorithm depending on the number of authors can enhance the number of points before they are divided between the authors. While only administrators with the special privilege level may define the configuration, all users of the administration program can inspect it and run queries.

Additional functions of the Publication Database comprise a tool to create URLs for inclusion on other websites that request a certain selection of publication data from the Database, various database maintenance and integrity testing functions, and functions for extracting export files for the official evaluation procedure. While the URL generator is available to all users of the administration module, only administrators may access the latter functions.

The program structure chosen keeps most of the processing in the server-based PHP code. This facilitates software management and provides a secure and reliable processing environment. Most of the JavaScript code in the Publication Database is there to enhance the usability of the user interface. One example is presetting certain form elements if other elements were edited (e.g., to set a radio button to limit the database search to entries in a certain time range if the start or end times of the time range were changed). Other important features are a quick search through long lists of person or media names, or checking the completeness of an input form. Due to problems encountered with some browsers that had difficulties handling large amounts of data as JavaScript variables, some parts of the initially rather extensive client-side JavaScript data pre-processing code have meanwhile either been rewritten or moved into the server-side PHP code. Likewise, all potentially security-related JavaScript functionality has been converted to server-side PHP. This applies, in particular, to the pre-processing of data that is to be stored in the MySQL database; having this done on the server side reliably protects the system from data intentionally designed to disturb or damage the database. Although we have been careful to avoid all but the most established JavaScript functions, there are still occasional problems on newly introduced browsers. (For example, Opera 7 chokes on JavaScript code about which no other browsers, including its own predecessors, complained.) This makes it advisable to convert JavaScript functions into PHP code wherever possible.

With one exception, the Publication Database intentionally does not distinguish between browsers: There is no approach to represent Greek characters that all browsers honour. Netscape 4 requires and Internet Explorer accepts that the corresponding ASCII character is output with the font set to "Symbol", while the browsers with the Gecko engine (Netscape 6+ and Mozilla) exclusively require (and Internet Explorer accepts) the HTML entity representation. (Greek characters were necessary for the users in chemistry and mathematics.)

Unfortunately, the Web version of the Publication Database has been developed under PHP3, which precluded object-oriented programming and introduced some security hazards that could be resolved only recently. A consequence of the procedural programming technique used is a rather convoluted program code that is difficult to maintain.

The Publication Database was originally designed for use by one faculty only. When the university authorities chose to introduce it university-wide, we decided to implement one separate copy of the Database for each of the faculties and the large sub-groups of the Faculty of Science and Informatics.
The resulting ten databases reside on the same physical server; they are individually accessed via the virtual Web server concept of Apache. Although the maintenance of ten separate databases requires more effort compared to one database for the entire university, several reasons favoured the current solution: It does not make a difference for people entering publication data whether they log into a university or a faculty Publication Database. Evaluation data are primarily gathered faculty-wise. Splitting the Database in the way chosen does not constitute a problem there. Faculties may want to use individual configurations of the Database. This is much easier to implement in separate copies of the Database (although we have tried to prevent this so far to avoid incompatibilities between the databases). The lists of already registered authors and of the publication media with a suitable media class must be sent to the client browser each time the publication editing form is opened. These lists grow rapidly; in the EE Database, which holds the faculty's publications from 1996 on, there are currently about 3,800 name and 2,100 media entries (for 6,700 publications). This amounts to page sizes of several hundred kilobytes; extending this concept to the entire university would increase this figure by about an order of magnitude, which is obviously unacceptable. Furthermore, finding suitable name or media entries in lists of that size is impractical.

Currently, external visitors have to search in several databases where the publications they are looking for might appear. This drawback can easily be resolved, though, by introducing a portal that automatically searches all databases in turn.
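Such a portal could be as simple as a loop over the faculty databases. The sketch below is an assumption about how it might work, not a description of an existing component: it uses in-memory SQLite databases in place of the MySQL databases, a single hypothetical `publications` table with only a title column, and a plain substring match on the title.

```python
import sqlite3


def make_faculty_db(titles):
    """Create a stand-in faculty database holding only publication titles."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE publications (title TEXT NOT NULL)")
    db.executemany("INSERT INTO publications VALUES (?)",
                   [(title,) for title in titles])
    return db


def search_all(databases, term):
    """Search every faculty database in turn; return (faculty, title) pairs.

    `databases` maps a faculty name to an open connection. The term is passed
    as a bound parameter; note that '%' and '_' inside it still act as LIKE
    wildcards in this simplified version.
    """
    pattern = f"%{term}%"
    hits = []
    for faculty, db in databases.items():
        rows = db.execute(
            "SELECT title FROM publications WHERE title LIKE ? ORDER BY title",
            (pattern,),
        )
        hits.extend((faculty, title) for (title,) in rows)
    return hits
```

Tagging each hit with its faculty lets the portal link back to the database the entry came from, so the separate copies stay untouched.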
Apart from a few configuration files (which define, e.g., the MySQL database to be used and the name of the faculty to be shown in the output pages) all copies of the Database use the same set of files. Unfortunately, the Linux environment does not permit installing the common tree of PHP, HTML, and image files only once and accessing it via symbolic links from each faculty's base directory where the configuration files reside; therefore, new software releases must be copied to ten faculty directories (which has been automated, though).

Experience Gathered with the Publication Database

As could be expected, people at the institutes initially met the Publication Database with (at least) suspicion, as an instrument designed to further increase their workload. We could alleviate their objections by pointing out the added value of on-line publication lists and queries, and by the promise that all publication-related evaluation data would come from the Database, without bothering them with these surveys in the future.

While the structure of the Database itself was quite well accepted by the EE faculty (with whose requirements in mind it had been designed), there were some objections by other faculties, where different approaches exist. In a rather odd coincidence, two different researchers claimed on the same day that the Database requested too many data fields, and that there were too few fields for bibliographic information, respectively. Due to the flexibility of the Database design (and of people working at a university), it was possible to provide solutions everybody could live with.

Some important extensions to the Publication Database were mandated by its university-wide introduction: Particularly users in chemistry and mathematics wanted Greek characters, superscripts, and subscripts in title (and abstract) texts. Greek characters are a problem because both PHP and MySQL only support eight-bit characters; the only feasible solution was to use a proprietary encoding scheme. (We are aware that proprietary solutions should be avoided if possible. However, neither the HTML entity notation nor the TeX notation for Greek characters was suitable: the first, because browsers understanding Greek HTML entities convert them into their Unicode representation if an entry is opened again for editing, and the latter, because the backslash character used in the TeX notation simply does not make it through a form submission and the MySQL database.)

The already existing electronic collections of publications at other institutes constituted a particular problem: Around the time when we introduced the prototype Database at the EE faculty (1999), several institutes and even faculties had started similar activities. Obviously, these institutes wanted to continue using their data collections, or at least to import their data into the Publication Database. These data collections ranged from simple Word documents over BibTeX files to more or less well-structured databases. Some of the requests were surrealistic indeed: the Informatics people (who should have known better) demanded that all existing publication collections regardless of format and structure must be imported fully automatically. The university hired a programmer to create an import tool for publication collections; in the end, it turned out that, except for well-structured database or BibTeX sources, the effort for post-processing the import data to make them suitable for a fully structured database came close to newly typing in these entries. In particular, most problems we encountered originated from non-uniform publication entry formats in many of the collections; in some cases, each reference entry differed from its neighbours in punctuation, order of first and last names of the authors, and available data items.
Apart from the proper fault-free operation of the Publication Database, the usability of its user interface has been the most important design issue, beginning with the Access prototype. Data entry forms should be as clear and self-explanatory as possible, and the sequence of operation steps transparent. The design of the Access prototype and the experience gathered with it, in turn, influenced the design of the Web version. Some enhancements became necessary only after a prolonged operation of the Database, particularly due to the growing sizes of name and publication media lists. The problems with these lists were addressed in two ways: either by reducing the amount of data in the list where possible (e.g., by showing only the names of people belonging to the currently selected institute or group), or by providing a quick search for entries in these lists. Some seemingly insignificant features greatly facilitate work for the users, e.g., the possibility to sort entries by age (with the latest on top of the selection list), or to limit searches to entries that still require some action.

Some problems originated from the wide variety of hardware and software platforms and browser configurations used university-wide. Although we had spent much effort in making the Publication Database as platform- and browser-independent as possible, we could not test it with every conceivable browser configuration and network structure. A relatively high degree of browser-independence is possible by using very basic HTML and JavaScript programming, avoiding sophisticated style attributes and JavaScript functions. Although the resulting pages may not look as uniform on those browsers that support high-level style attributes, they at least work on all reasonable browsers. While the administration module, which requires client-side JavaScript, needs at least Netscape 4 or Internet Explorer 4, the public interface can also operate without JavaScript (although it has a smoother user interface on JavaScript-enabled browsers); in fact, you can even use lynx to work with the public interface!
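The two measures for long name and media lists mentioned earlier in this section, restricting the list to the currently selected institute and offering a quick search, can be sketched in a few lines. The data layout and the prefix-matching rule are assumptions made for illustration; the text does not specify how the quick search matches entries.

```python
def names_for_institute(persons, institute):
    """Keep only the names of people belonging to the selected institute.

    `persons` is a list of dicts with hypothetical keys 'name' and 'institute'.
    """
    return [p["name"] for p in persons if p["institute"] == institute]


def quick_search(names, typed):
    """Return the names starting with what the user has typed so far.

    Matching is case-insensitive; prefix matching is an assumed rule.
    """
    needle = typed.casefold()
    return [name for name in names if name.casefold().startswith(needle)]
```

Applying the institute filter first keeps the list sent to the browser short, and the quick search then narrows whatever remains.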
The combination of peculiarities of a particular browser with its usage by particular persons created some odd effects that were very difficult to trace and reproduce. It turned out that it was generally the same users that caused fatal errors in the Database, while other users with the same browsers remained inconspicuous. For example, Netscape 4 allows submitting a form that is not yet completely loaded (which wreaks havoc if essential "on submit" JavaScript code is not yet there), while Internet Explorer permits multiple submits of the same form. This is usually not an issue, but made itself felt with the huge pages for publication entry editing, particularly if there was a slow network link, and with very impatient users. (Fortunately, we could identify and resolve this problem before the Publication Database went into university-wide operation.)

Another issue that may severely reduce the usability of a Web application is a different interpretation of pressing the Enter key, especially if there are several Submit buttons in a form (which is a common situation in the Publication Database if different functions may be invoked with the same data). While some browsers (e.g., Netscape 4) ignore the Enter key altogether in this case, others (e.g., Mozilla) interpret it as a click on the first Submit button in the form that has the focus. Since it is likely that the first Submit button was not the button the user wanted to operate (at least at the time she inadvertently pressed the Enter key at the end of a text field), we have added dummy Submit buttons on all frequently used forms that are likely to get the focus because they contain text fields. These dummy Submit buttons simply re-draw the form with its current contents, which is probably the most benign reaction in this context.

We found that a thorough checking of data entered by the users was essential: Regardless of academic degree and scientific orientation, users tend to make all conceivable mistakes. Several enhancements of the data checking mechanisms became necessary after the Database had been introduced university-wide; even when it comes to making mistakes, the creativity potential of academics should not be underestimated.

The Future of the Publication Database

Although the Publication Database has by now reached a very high level of functionality and completeness, there are several reasons that make a complete redesign desirable: The Database has been designed as a stand-alone application. Major modifications would be necessary to embed it into an integrated university information system, which should happen in the near future. With its procedural PHP3-based code and with the wealth of functionality meanwhile packed into it, the Publication Database is extremely complex and difficult to maintain. The university IT administration, which should conceptually be in charge of the Database, has to date not been able to take over the Database (and even flatly refused to do so). Therefore, the author is still faced with its maintenance, which is not a desirable situation at all.

A re-write of the Database must, of course, duplicate its current functionality and add some more, among others: integration with other university-internal and external databases; smart reduction of the amount of data to be sent to the Web clients; and a few features in the handling of publications and the creation of publication lists that appear desirable but cannot be implemented in the current structure.

Conclusions

The experience gained with the Publication Database indicates that a few simple but important design rules greatly improve the usability and adaptability of a Web application, even if it has to operate in an environment it was not originally designed for: Use generic code on the server, and put all configuration information into external data, either into a database table, or into configuration files; send minimalist HTML and JavaScript code to the client browsers; support a reasonable range of standard browsers; and do not assume that all of your users read pop-up messages, let alone help files and software documentation. Provide syntactic and, if possible, heuristic checking on all data entered by the users!
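The difference between syntactic and heuristic checking can be illustrated with a short sketch. The field names, the plausible year range, the page-range pattern and the similarity threshold are all assumptions chosen for the example; they are not the checks actually implemented in the Publication Database.

```python
import re
from difflib import SequenceMatcher

# Accepts page ranges such as "12-17" (optional spaces around the hyphen).
PAGES = re.compile(r"^(\d+)\s*-\s*(\d+)$")


def syntactic_errors(entry, current_year):
    """Return a list of formal problems found in one publication entry."""
    errors = []
    if not entry.get("title", "").strip():
        errors.append("title is missing")
    if not entry.get("authors"):
        errors.append("at least one author is required")
    year = entry.get("year")
    if not isinstance(year, int) or not 1900 <= year <= current_year + 1:
        errors.append("year is implausible")
    pages = entry.get("pages")
    if pages:
        match = PAGES.match(pages)
        if not match:
            errors.append("page range must look like '12-17'")
        elif int(match.group(1)) > int(match.group(2)):
            errors.append("first page is after last page")
    return errors


def likely_duplicates(title, existing_titles, threshold=0.9):
    """Heuristic check: existing titles that closely resemble the new one."""
    def normalise(text):
        # Ignore case and collapse runs of whitespace before comparing.
        return " ".join(text.casefold().split())

    candidate = normalise(title)
    return [
        existing for existing in existing_titles
        if SequenceMatcher(None, candidate, normalise(existing)).ratio() >= threshold
    ]
```

The syntactic check can reject an entry outright, whereas the heuristic check is better used to warn: a near-identical title may be a duplicate entry, but it may also be a legitimate second publication.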