Database Technologies
Bachelor and Master Projects
XML Databases
Database & Information Systems Group
Christian Grün

Introduction

XML: just small files? Why databases?
  library of U (800 MB)
  genetic data (Swissprot, 3 GB)
  Wikipedia (8 GB)
  Medline (38 GB)

Challenges
  support new standards
  find relevant query optimizations
  visualize results
  tree-structured data structure

<XML/> vs.

<root>
<Entry id="100k_rat" class="standard" mtype="prt" seqlen="889">
  <AC>Q62671</AC>
  <Mod date="01-nov-1997" Rel="35" type="created"></Mod>
  <Mod date="01-nov-1997" Rel="35" type="last sequence update"></Mod>
  <Mod date="15-jul-1999" Rel="38" type="last annotation update"></Mod>
  <Descr>100 KDA PROTEIN (EC 6.3.2.-)</Descr>
  <Species>Rattus norvegicus (Rat)</Species>
  <Org>Eukaryota</Org> <Org>Metazoa</Org> <Org>Chordata</Org> <Org>Craniata</Org>
  <Org>Vertebrata</Org> <Org>Euteleostomi</Org> <Org>Mammalia</Org> <Org>Eutheria</Org>
  <Org>Rodentia</Org> <Org>Sciurognathi</Org> <Org>Muridae</Org> <Org>Murinae</Org>
  <Org>Rattus</Org>
  <Ref num="1" pos="sequence FROM N.A">
    <Comment>STRAIN=WISTAR</Comment>
    <Comment>TISSUE=TESTIS</Comment>
    <DB>MEDLINE</DB>
    <MedlineID>92253337</MedlineID>
    <Author>Mueller D</Author> <Author>Rehbein M</Author>
    <Author>Baumeister H</Author> <Author>Richter D</Author>
    <Cite>Nucleic Acids Res. 20:1471-1475(1992)</Cite>
  </Ref>
  <Ref num="2" pos="erratum">
    <Author>Mueller D</Author> <Author>Rehbein M</Author>
    <Author>Baumeister H</Author> <Author>Richter D</Author>
    <Cite>Nucleic Acids Res. 20:2624-2624(1992)</Cite>
  </Ref>
  <EMBL prim_id="x64411" sec_id="CAA45756"></EMBL>
  <INTERPRO prim_id="ipr000569" sec_id="-"></INTERPRO>
  <INTERPRO prim_id="ipr002004" sec_id="-"></INTERPRO>
  <PFAM prim_id="pf00632" sec_id="hect" status="1"></PFAM>
  <PFAM prim_id="pf00658" sec_id="pabp" status="1"></PFAM>
  <Keyword>Ubiquitin conjugation</Keyword>
  <Keyword>Ligase</Keyword>
  <Features>
    <DOMAIN from="77" to="88"> <Descr>ASP/GLU-RICH (ACIDIC)</Descr> </DOMAIN>
    <DOMAIN from="127" to="150"> <Descr>PRO-RICH</Descr> </DOMAIN>
    <DOMAIN from="420" to="439"> <Descr>ARG/GLU-RICH (MIXED CHARGE)</Descr> </DOMAIN>
    <DOMAIN from="448" to="457"> <Descr>ARG/ASP-RICH (MIXED CHARGE)</Descr> </DOMAIN>
    <DOMAIN from="485" to="514"> <Descr>PABP-LIKE</Descr> </DOMAIN>
    <DOMAIN from="579" to="590"> <Descr>ASP/GLU-RICH (ACIDIC)</Descr> </DOMAIN>
    <DOMAIN from="786" to="889"> <Descr>HECT DOMAIN</Descr> </DOMAIN>
    <DOMAIN from="827" to="847"> <Descr>PRO-RICH</Descr> </DOMAIN>
    <BINDING from="858" to="858"> <Descr>UBIQUITIN (BY SIMILARITY)</Descr> </BINDING>
  </Features>
</Entry>
<Entry id="104k_thepa" class="standard" mtype="prt" seqlen="924">
  <AC>P15711</AC>
  <Mod date="01-apr-1990" Rel="14" type="created"></Mod>
  <Mod date="01-apr-1990" Rel="14" type="last sequence update"></Mod>
  <Mod date="01-aug-1992" Rel="23" type="last annotation update"></Mod>
  <Descr>104 KDA MICRONEME-RHOPTRY ANTIGEN</Descr>
  <Species>Theileria parva</Species>
  <Org>Eukaryota</Org> <Org>Alveolata</Org> <Org>Apicomplexa</Org>
  <Org>Piroplasmida</Org> <Org>Theileriidae</Org> <Org>Theileria</Org>
  <Ref num="1" pos="SEQUENCE FROM N.A">
    <Comment>STRAIN=MUGUGA</Comment>
    <DB>MEDLINE</DB>
    <MedlineID>90158697</MedlineID>
    <Author>Iams K.P</Author> <Author>Young J.R</Author> <Author>Nene V</Author>
    <Author>Desai J</Author> <Author>Webster P</Author>
    <Author>Ole-Moiyoi O.K</Author> <Author>Musoke A.J</Author>
    <Cite>Mol. Biochem. Parasitol. 39:47-60(1990)</Cite>
  </Ref>
  <EMBL prim_id="m29954" sec_id="aaa18217"></EMBL>
  <PIR prim_id="A44945" sec_id="a44945"></PIR>
  <Keyword>Antigen</Keyword>
  <Keyword>Sporozoite</Keyword>

BaseX

XML database, developed in the DBIS workgroup
open source: www.basex.org
query languages: W3C standards XPath & XQuery
extensions: XQuery Update, Full-Text
indexes: attributes, texts, full-text
special focus: tight coupling between frontend and backend

Topics (Backend): Namespace Support

what are namespaces?

<Address>
  <FirstName>John</FirstName>
  <FamilyName>McHilton</FamilyName>
  <Street>12 Donovan Road</Street>
  <Town>Chicago, 31072</Town>
</Address>

XPath: //FirstName  //FamilyName

<Address xmlns:name="names">
  <name:first>john</name:first>
  <name:family>mchilton</name:family>
  <Street>12 Donovan Road</Street>
  <Town>Chicago, 31072</Town>
</Address>

XPath: //name:*

design of an elegant solution for namespace access
extension of the internal BaseX storage
understanding of the specification

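To make the effect of the namespace declaration concrete, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`, not BaseX. It reuses the second address example (written with matching closing tags so that it parses). The query prefix `n` is an arbitrary choice for this illustration; only the namespace URI `names` comes from the slide.

```python
import xml.etree.ElementTree as ET

DOC = """
<Address xmlns:name="names">
  <name:first>john</name:first>
  <name:family>mchilton</name:family>
  <Street>12 Donovan Road</Street>
  <Town>Chicago, 31072</Town>
</Address>
"""

root = ET.fromstring(DOC)

# ElementTree stores a qualified name as "{namespace-uri}local-name".
# Rough equivalent of the XPath //name:* : all elements in namespace "names".
in_namespace = [el.tag for el in root.iter() if el.tag.startswith("{names}")]
print(in_namespace)

# The prefix used in a query is only a local alias for the URI, so it does
# not have to match the prefix used in the document ("n" here, "name" there).
first = root.find(".//n:first", {"n": "names"})
print(first.text)

# Elements without a prefix stay outside the namespace.
print(root.find("Street").text)
```

The point of the sketch is that a store has to keep the URI, not just the prefix, with each tag name: two documents may bind different prefixes to the same namespace.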
Topics (Backend): DTD Parsing

what is a DTD?
  defines the document structure and entities
  allows document validation

<mondial>
  <country id="f0_136">
    <name>germany</name>
    <city>münchen</city>
  </country>
</mondial>

<!ELEMENT mondial (country*) >
<!ELEMENT country (name, city*) >
<!ELEMENT name (#PCDATA) >
<!ELEMENT city (#PCDATA) >
<!ATTLIST country id ID #REQUIRED >
<!ENTITY uuml "ü" >

extension of the XML parser
integration of validate commands
understanding of the specification

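What "validation" means for the DTD above can be sketched in a few lines of Python. This is an illustrative toy, not how BaseX or any real validating parser works: the content models are translated by hand into regular expressions over child element names, and the names `CONTENT_MODELS`, `REQUIRED_ATTRIBUTES` and `validate` are assumptions made for this example. Uniqueness of `ID` values is not checked. The entity declaration is placed in an internal DTD subset, which the standard-library parser expands on its own.

```python
import re
import xml.etree.ElementTree as ET

# Content models of the DTD, hand-translated into regular expressions over
# the space-terminated sequence of child element names (hypothetical encoding).
CONTENT_MODELS = {
    "mondial": r"(country )*",    # <!ELEMENT mondial (country*)>
    "country": r"name (city )*",  # <!ELEMENT country (name, city*)>
    "name": r"",                  # <!ELEMENT name (#PCDATA)>: no child elements
    "city": r"",                  # <!ELEMENT city (#PCDATA)>: no child elements
}
REQUIRED_ATTRIBUTES = {"country": ["id"]}  # <!ATTLIST country id ID #REQUIRED>


def validate(element):
    """Return a list of violations found in element and its descendants."""
    errors = []
    children = "".join(child.tag + " " for child in element)
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        errors.append(f"<{element.tag}> is not declared")
    elif re.fullmatch(model, children) is None:
        errors.append(f"<{element.tag}> has invalid content: [{children.strip()}]")
    for attribute in REQUIRED_ATTRIBUTES.get(element.tag, []):
        if attribute not in element.attrib:
            errors.append(f"<{element.tag}> lacks required attribute '{attribute}'")
    for child in element:
        errors.extend(validate(child))
    return errors


# The internal subset declares the entity; the parser expands &uuml; itself.
VALID = """<!DOCTYPE mondial [<!ENTITY uuml "&#252;">]>
<mondial>
  <country id="f0_136">
    <name>germany</name>
    <city>m&uuml;nchen</city>
  </country>
</mondial>"""

# Invalid on purpose: <city> precedes <name>, and the id attribute is missing.
INVALID = "<mondial><country><city>a</city><name>b</name></country></mondial>"

valid_root = ET.fromstring(VALID)
city_text = valid_root.find("country/city").text
print(city_text)
print(validate(valid_root))
invalid_errors = validate(ET.fromstring(INVALID))
for problem in invalid_errors:
    print(problem)
```

A real implementation would derive the content models from the parsed `<!ELEMENT ...>` declarations instead of hard-coding them, which is the parser extension the topic asks for.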
Topics (Backend): XQuery Optimizations

sample (returns all media with the title "Casablanca"):

  for $i in doc("library.xml")//medium
  where $i/title = "Casablanca"
  return $i

possible query plans:
  parse all Medium and Title tags (sequential scan): very slow
  access the index and check results: much faster!

implementation of existing XPath optimizations for XQuery
learning much about XQuery and tree-structured optimizations!

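The two query plans can be contrasted with a small Python sketch. It is a simplified model built on `xml.etree.ElementTree`, not the BaseX implementation; the sample library and the helper names `sequential_scan`, `build_text_index` and `index_lookup` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical stand-in for library.xml.
LIBRARY = ET.fromstring("""
<library>
  <medium><title>Casablanca</title><type>DVD</type></medium>
  <medium><title>Matrix</title><type>DVD</type></medium>
  <medium><title>Casablanca</title><type>VHS</type></medium>
</library>
""")


def sequential_scan(root, wanted):
    """Plan 1: visit every medium and compare each of its titles."""
    return [medium for medium in root.iter("medium")
            if any(title.text == wanted for title in medium.findall("title"))]


def build_text_index(root):
    """Map each text value to (element, parent) pairs, built once per document."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    index = defaultdict(list)
    for element in root.iter():
        if element.text and element.text.strip():
            index[element.text].append((element, parent_of.get(element)))
    return index


def index_lookup(index, wanted):
    """Plan 2: fetch candidates from the index, then check the path title -> medium."""
    hits = []
    for element, parent in index.get(wanted, []):
        if (element.tag == "title" and parent is not None
                and parent.tag == "medium" and parent not in hits):
            hits.append(parent)
    return hits


INDEX = build_text_index(LIBRARY)
scan_result = sequential_scan(LIBRARY, "Casablanca")
index_result = index_lookup(INDEX, "Casablanca")
print(len(scan_result), len(index_result))
```

Both plans return the same nodes; the difference is that the scan touches every `medium`, while the lookup only touches elements whose text already matches. An optimizer's job is to recognize when the `where` clause can be rewritten into such a lookup.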
Topics (Backend): Index Management

current state: one index for all texts & attribute values
desirable: special-purpose indexes:
  indexes for single tags/attributes
  indexes on numeric values (range queries)
  index for approximate text search

<Medium>
  <Title>Matrix</Title>
  <Year>1999</Year>
  <Type>DVD</Type>
</Medium>
<Medium>
  <Title>Matrix Reloaded</Title>
  <Year>2003</Year>
  <Type>DVD</Type>
</Medium>

XPath: //Medium[Year > 2000]?

extension of the existing indexes
adaptation of the query optimizations
thoughts on new index structures

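One possible shape for an index on numeric values is a sorted array searched with binary search, sketched below in Python. This is only one candidate structure among many (B-trees would be the classical alternative) and not a description of BaseX; the wrapping `Library` element and the function names are assumptions for the example.

```python
import bisect
import xml.etree.ElementTree as ET

# The two Medium entries from the slide, wrapped in a hypothetical root element.
LIBRARY = ET.fromstring("""
<Library>
  <Medium><Title>Matrix</Title><Year>1999</Year><Type>DVD</Type></Medium>
  <Medium><Title>Matrix Reloaded</Title><Year>2003</Year><Type>DVD</Type></Medium>
</Library>
""")


def build_numeric_index(root, parent_tag, value_tag):
    """Index parent elements by the numeric value of one child tag."""
    entries = []
    for position, parent in enumerate(root.iter(parent_tag)):
        for value in parent.findall(value_tag):
            try:
                entries.append((float(value.text), position, parent))
            except (TypeError, ValueError):
                pass  # empty or non-numeric values are simply not indexed
    entries.sort(key=lambda entry: (entry[0], entry[1]))
    keys = [entry[0] for entry in entries]
    elements = [entry[2] for entry in entries]
    return keys, elements


def greater_than(index, bound):
    """Answer [value > bound] with one binary search instead of a full scan."""
    keys, elements = index
    start = bisect.bisect_right(keys, bound)
    return elements[start:]


YEAR_INDEX = build_numeric_index(LIBRARY, "Medium", "Year")

# Counterpart of the XPath //Medium[Year > 2000]
after_2000 = [m.findtext("Title") for m in greater_than(YEAR_INDEX, 2000)]
print(after_2000)
```

Note that the hits come back ordered by value, not in document order, so a query processor would still have to re-sort them before returning an XPath result.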
Topics (Frontend): View Schemas

XML structure and contents can be very diverse:

  attribute-based storage
    <item id="0" firstname="hans" lastname="gruber" title="b.sc." />
    <item id="1" firstname="thomas" lastname="schmid" title="prof." />

  text-based storage
    <item><id>0</id><first>hans</first><last>gruber</last><title>...

  flat vs. hierarchic data

desirable: view definitions to optimize visualization output

analysis of existing XML documents
design of a view schema
implementation of schema parsing and interpretation

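As a rough idea of what a view definition could look like, the Python sketch below maps both storage styles onto the same table rows. The schema format (a column label mapped to `@attribute` or a child element name) and the function `apply_view` are invented for this illustration; designing the actual schema is the open task. The text-based sample only uses the fields that appear in full in the example above.

```python
import xml.etree.ElementTree as ET

ATTRIBUTE_BASED = ET.fromstring(
    '<items><item id="0" firstname="hans" lastname="gruber" title="b.sc."/></items>'
)
TEXT_BASED = ET.fromstring(
    "<items><item><id>0</id><first>hans</first><last>gruber</last></item></items>"
)

# Hypothetical view schemas: column label -> where the value is stored.
# "@x" reads attribute x, anything else reads the text of child element x.
ATTRIBUTE_VIEW = {"ID": "@id", "First name": "@firstname", "Last name": "@lastname"}
TEXT_VIEW = {"ID": "id", "First name": "first", "Last name": "last"}


def apply_view(root, row_tag, view):
    """Turn every row_tag element into a dict of column label -> value."""
    rows = []
    for item in root.iter(row_tag):
        row = {}
        for column, source in view.items():
            if source.startswith("@"):
                row[column] = item.get(source[1:])
            else:
                row[column] = item.findtext(source)
        rows.append(row)
    return rows


attribute_rows = apply_view(ATTRIBUTE_BASED, "item", ATTRIBUTE_VIEW)
text_rows = apply_view(TEXT_BASED, "item", TEXT_VIEW)
print(attribute_rows)
print(text_rows)
```

With such a mapping, a visualization only needs to understand rows and columns, regardless of how the source document encodes them.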
Topics (Frontend): TreeMap

space-filling visualization for hierarchic data
diversity of layout algorithms available
numerous attributes unexploited: color, intensity, ...
popular example: size-based file system visualization

visualization of tree-structured data
implementation of efficient Java visualizations

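The simplest of the available layout algorithms, slice-and-dice, fits in a few lines. The Python sketch below shows the idea on a hypothetical file-size tree; the node format `(name, size)` for leaves and `(name, [children])` for inner nodes, as well as the function names, are assumptions for this example. It is a sketch of the layout only, without any drawing code.

```python
def size(node):
    """A leaf carries its own size; an inner node sums up its children."""
    _, payload = node
    if isinstance(payload, list):
        return sum(size(child) for child in payload)
    return payload


def layout(node, x, y, width, height, depth=0):
    """Slice-and-dice treemap layout.

    Children split the parent's rectangle in proportion to their size,
    side by side on even depths and stacked on odd depths.
    Returns {leaf name: (x, y, width, height)}; leaf names must be unique.
    """
    name, payload = node
    if not isinstance(payload, list):
        return {name: (x, y, width, height)}
    rectangles = {}
    total = size(node)
    offset = 0.0  # fraction of the parent already handed out
    for child in payload:
        share = size(child) / total if total else 0.0
        if depth % 2 == 0:
            rectangles.update(
                layout(child, x + offset * width, y, share * width, height, depth + 1))
        else:
            rectangles.update(
                layout(child, x, y + offset * height, width, share * height, depth + 1))
        offset += share
    return rectangles


# Hypothetical directory tree with file sizes.
TREE = ("root", [("a.xml", 50), ("docs", [("b.xml", 30), ("c.xml", 20)])])
RECTS = layout(TREE, 0.0, 0.0, 100.0, 100.0)
for leaf, rect in RECTS.items():
    print(leaf, rect)
```

Each leaf's area is proportional to its size, which is what makes the size-based file system view readable at a glance. Slice-and-dice tends to produce thin strips in deep trees, which is one reason other layouts such as squarified treemaps exist.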
Topics (Frontend): Visualization

numerous visualizations exist for tree-structured data:
  conventional tree view
  hyperbolic view
  InterRing, ...

visualization of tree-structured data
implementation of efficient Java visualizations

Organization

First
  take some time for your decision
  feel free to suggest your own topics

Events
  the project is accompanied by a weekly project seminar
  the seminar includes regular updates between all members and one talk on your project

Room: E217
Phone: 88-4449
E-mail: christian.gruen@uni-konstanz.de