Four FILE STRUCTURES Introduction This chpter is minly concerned with the wy in which file structures re used in document retrievl. Most surveys of file structures ddress themselves to pplictions in dt mngement which is reflected in the terminology used to describe the bsic concepts. I shll (on the whole) follow Hsio nd Hrry 1 whose terminology is perhps slightly nonstndrd but emphsises the logicl nture of file structures. A further dvntge is tht it enbles me to bridge the gp between dt mngement nd document retrievl esily. A few other good references on file structures re Roberts 2, Bertziss 3, Dodd 4, nd Climenson 5. Logicl or physicl orgnistion nd dt independence There is one importnt distinction tht must be mde t the outset when discussing file structures. And tht is the difference between the logicl nd physicl orgnistion of the dt. On the whole file structure will specify the logicl structure of the dt, tht is the reltionships tht will exist between dt items independently of the wy in which these reltionships my ctully be relised within ny computer. It is this logicl spect tht we will concentrte on. The physicl orgnistion is much more concerned with optimising the use of the storge medium when prticulr logicl structure is stored on, or in it. Typiclly for every unit of physicl store there will be number of units of the logicl structure (probbly records) to be stored in it. For exmple, if we were to store tree structure on mgnetic disk, the physicl orgnistion would be concerned with the best wy of pcking the nodes of the tree on the disk given the ccess chrcteristics of the disk. The work on dt bses hs been very much concerned with concept clled dt independence. The im of this work is to enble progrms to be written independently of the logicl structure of the dt they would interct with. The independence tkes the following form, should the file structure overnight be chnged from n inverted to seril file the progrm should remin unffected. This independence is chieved by interposing dt model between the user nd the dt bse. The user sees the dt model rther thn the dt bse, nd ll his progrms communicte with the model. The user therefore hs no interest in the structure of the file. There is school of thought tht sys tht sys tht pplictions in librry utomtion nd informtion retrievl should follow this pth s well 6,7. And so it should. Unfortuntely, there is still much debte bout wht good dt model should look like. Furthermore, opertionl implementtions of some of the more dvnced theoreticl systems do not exist yet. So ny suggestion tht n IR system might be implemented through dt bse pckge should still seem premture. Also, the scle of the problems in IR is such tht efficient implementtion of the ppliction still demnds close scrutiny of the file structure to be used. Nevertheless, it is worth tking seriously the trend wy from user knowledge of file structures, trend tht hs been stimulted considerbly by ttempts to construct theory of dt 8,9. There re number of proposls for deling with dt t n bstrct level. The best known of these by now is the one put forwrd by Codd 8, which hs become known s the reltionl model. In it dt re described by n-tuples of ttribute vlues. More formlly if the dt is described by reltions, reltion on set of domins D 1,..., D n cn be represented by set of ordered n-tuples ech of the form (d 1,..., d n ) where d i D i. As
50 File structures it is rther difficult to cope with generl reltions, vrious levels (three in fct) of normlistion hve been introduced restricting the kind of reltions llowed. A second pproch is the hierrchicl pproch. It is used in mny existing dt bse systems. This pproch works s one might expect: dt is represented in the form of hierrchies. Although it is more restrictive thn the reltionl pproch it often seems to be the nturl wy to proceed. It cn be rgued tht in mny pplictions hierrchic structure is good pproximtion to the nturl structure in the dt, nd tht the resulting loss in precision of representtion is worth the gin in efficiency nd simplicity of representtion. The third pproch is the network pproch ssocited with the proposls by the Dt Bse Tsk Group of CODASYL. Here dt items re linked into network in which ny given link between two items exists becuse it stisfies some condition on the ttributes of those items, for exmple, they shre n ttribute. It is more generl thn the hierrchic pproch in the sense tht node cn hve ny number of immedite superiors. It is lso equivlent to the reltionl pproch in descriptive power. The whole field of dt bse structures is still very much in stte of flux. The dvntges nd disdvntges of ech pproch re discussed very thoroughly in Dte 10, who lso gives excellent nnotted cittions to the current literture. There is lso recent Computing Survey ll which reviews the current stte of the rt. There hve been some very erly proponents of the reltionl pproch in IR, s erly s 1967 Mron 12 nd Levien 13 discussed the design nd implementtion of n IR system vi reltions, be it binry ones. Also Prywes nd Smith in their review chpter in the Annul Review of Informtion Science nd Technology more recently recommended the DBTG proposls s wys of implementing IR systems 7. Lurking in the bckground of ny discussion of file structures nowdys is lwys the question whether dt bse technology will overtke ll. Thus it my be tht ny ppliction in the field of librry utomtion nd informtion retrievl will be implemented through the use of some pproprite dt bse pckge. This is certinly possibility but not likely to hppen in the ner future. There re severl resons. One is tht dt bse systems re generl purpose systems wheres utomted librry nd retrievl systems re specil purpose. Normlly one pys price for generlity nd in this cse it is still too gret. Secondly, there now is considerble investment in providing specil purpose systems (for exmple, MARC) 14 nd this is not written off very esily. Nevertheless trend towrds incresing use of dt-bse technology exists nd is well illustrted by the incresed prominence given to it in the Annul Review of Informtion Science nd Technology. A lnguge for describing file structures Like ll subjects in computer science the terminology of file structures hs evolved higgledypiggledy without much concern for consistency, mbiguity, or whether it ws possible to mke the kind of distinctions tht were importnt. It ws only much lter tht the need for well-defined, unmbiguous lnguge to describe file structures becme pprent. In prticulr, there rose need to communicte ides bout file structures without getting bogged down by hrdwre considertions. This section will present forml description of file structures. The frmework described is importnt for the understnding of ny file structure. The terminology is bsed on tht introduced by Hsio nd Hrry (but lso see Hsio 15 nd Mnol nd Hsio 16 ). Their terminology hs been modified nd extended by Severnce 17, summry of this cn be found in vn Rijsbergen 18. Jonkers 19 hs formlised different frmework which provides n interesting contrst to the one described here. Bsic terminology Given set of 'ttributes' A nd set of 'vlues' V, then record R is subset of the crtesin product A x V in which ech ttribute hs one nd only one vlue. Thus R is set of
Introduction 51 ordered pirs of the form (n ttribute, its vlue). For exmple, the record for document which hs been processed by n utomtic content nlysis lgorithm would be R = {(K 1, x 1 ), (K 2, x 2 ),... (K m, x m )} The K i 's re keywords functioning s ttributes nd the vlue x i cn be thought of s numericl weight. Frequently documents re simply chrcterised by the bsence or presence of keywords, in which cse we write R = {K t1, K t2,..., K ti } where K ti is present if x ti = 1 nd is bsent otherwise. Records re collected into logicl units clled files. They enble one to refer to set of records by nme, the file nme. The records within file re often orgnised ccording to reltionships between the records. This logicl orgnistion hs become known s file structure (or dt structure). It is difficult in describing file structures to keep the logicl fetures seprte from the physicl ones. The ltter re chrcteristics forced upon us by the recording medi (e.g. tpe, disk). Some fetures cn be defined bstrctly (with little gin) but re more esily understood when illustrted concretely. One such feture is field. In ny implementtion of record, the ttribute vlues re usully positionl, tht is the identity of n ttribute is given by the position of its ttribute vlue within the record. Therefore the dt within record is registered sequentilly nd hs definite beginning nd end. The record is sid to be divided into fields nd the nth field crries the nth ttribute vlue. Pictorilly we hve n exmple of record with ssocited fields in Figure 4.1. R = K 1 K 2 K 3 K 4 P 1 P 2 P 3 P 4 Figure 4.1. An exmple of record with ssocited fields The fields re not necessrily constnt in length. To find the vlue of the ttribute K 4, we first find the ddress of the record R (which is ctully the ddress of the strt of the record) nd red the dt in the 4th field. In the sme picture I hve lso shown some fields lbelled P i. They re ddresses of other records, nd re commonly clled pointers. Now we hve extended the definition of record to set of ttribute-vlue pirs nd pointers. Ech pointer is usully ssocited with prticulr ttribute-vlue pir. For exmple, (see Figure 4.2) pointers could be used to link ll records for which the vlue x 1 (of ttribute K 1 ) is, similrly for x 2 equl to b, etc.
52 File structures R 1 R b R R b 2 R 3 4 5 c c c Figure 4.2. A demonstrtion of the use of pointers to link records To indicte tht record is the lst record pointed to in list of records we use the null pointer. The pointer ssocited with ttribute K in record R will be clled K-pointer. An ttribute (keyword) tht is used in this wy to orgnise file is clled key. The unify the discussion of file structures we need some further concepts. Following Hsio nd Hrry gin, we define list L of records with respect to keyword K, or more briefly K-list s set of records contining K such tht: it is clled of the list. (1) the K-pointers re distinct; (2) ech non-null K-pointer in L gives the ddress of record within L; (3) there is unique record in L not pointed to by ny record contining K; the beginning of the list; nd (4) there is unique record in L contining the null K-pointer; it is the end (Hsio nd Hrry stte condition (2) slightly differently so tht no two K-lists hve record in common; this only ppers to complicte things.) From our previous exmple: K 1 -list : R 1, R 2, R 5 K 2 -list : R 2, R 4 K 4 -list : R 1, R 2, R 3 Finlly, we need the definition of directory of file. Let F be file whose records contin just m different keywords K 1, K 2,..., K m. Let n i be the number of records contining the keyword K i, nd h i be the number of K i -lists in F. Furthermore, we denote by ij the beginning ddress of the jth K i -list. Then the directory is the set of sequences (K i, n i, h i, i1, i2,... ihi ) i = 1, 2,... m We re now in position to give unified tretment of sequentil files, inverted files, index-sequentil files nd multi-list files. Sequentil files A sequentil file is the most primitive of ll file structures. It hs no directory nd no linking pointers. The records re generlly orgnised in lexicogrphic order on the vlue of some key. In other words, prticulr ttribute is chosen whose vlue will determine the order of the records. Sometimes when the ttribute vlue is constnt for lrge number of records second key is chosen to give n order when the first key fils to discriminte. The implementtion of this file structure requires the use of sorting routine.
Introduction 53 Its min dvntges re: (1) it is esy to implement; (2) it provides fst ccess to the next record using lexicogrphic order. Its disdvntges: (1) it is difficult to updte - inserting new record my require moving lrge proportion of the file; (2) rndom ccess is extremely slow. Sometimes file is considered to be sequentilly orgnised despite the fct tht it is not ordered ccording to ny key. Perhps the dte of cquisition is considered to be the key vlue, the newest entries re dded to the end of the file nd therefore pose no difficulty to updting. Inverted files The importnce of this file structure will become more pprent when Boolen Serches re discussed in the next chpter. For the moment we limit ourselves to describing its structure. An inverted file is file structure in which every list contins only one record. Remember tht list is defined with respect to keyword K, so every K-list contins only one record. This implies tht the directory will be such tht n i = h i for ll i, tht is, the number of records contining K i will equl the number of K i -lists. So the directory will hve n ddress for ech record contining K i. For document retrievl this mens tht given keyword we cn immeditely locte the ddresses of ll the documents contining tht keyword. For the previous exmple let us ssume tht non-blck entry in the field corresponding to n ttribute indictes the presence of keyword nd blck entry its bsence. Then the directory will point to the file in the wy shown in Figure 4.3. The definition of n inverted files does not require tht the ddresses in the directory re in ny order. However, to fcilitte opertions such s conjunction ('nd') nd disjunction ('or') on ny two inverted lists, the ddresses re normlly kept in record number order. This mens tht 'nd' nd 'or' opertions cn be performed with one pss through both lists. The penlty we py is of course tht the inverted file becomes slower to updte.
54 File structures File K, 3, 3,,, 1 K, 2, 2,, 2 Directory 11 21 12 22 13 11 = = 41 21 42 b K 3 c c K 4, 3,3, 41 42 43 = 12 43 22 b c 13 Figure 4.3. An inverted file Index-sequentil files An index-sequentil file is n inverted file in which for every keyword K i, we hve n i = h i = 1 nd 11 < 21... < m1. This sitution cn only rise if ech record hs just one unique keyword, or one unique ttribute-vlue. In prctice therefore, this set of records my be order sequentilly by key. Ech key vlue ppers in the directory with the ssocited ddress of its record. An obvious interprettion of key of this kind would be the record number. In our exmple none of the ttributes would do the job except the record number. Digrmmticlly the index-sequentil file would therefore pper s shown in Figure 4.4. I hve delibertely written R i insted of K i to emphsise the nture of the key.
Introduction 55 Directory R 1, 1, 1, 11 R, 1, 1, 2 21 File c R 3, 1, 1, 31 b R 4, 1, 1, 41 R 5, 1, 1, 51 c c b Figure 4.4. An index-sequentil file In the literture n index-sequentil file is usully thought of s sequentil file with hierrchy of indices. This does not contrdict the previous definition, it merely describes the wy in which the directory is implemented. It is not surprising therefore tht the indexes ('index' = 'directory' here) re often oriented to the chrcteristics of the storge medium. For exmple (see Figure 4.5) there might be three levels of indexing: trck, cylinder nd mster. Ech entry in the trck index will contin enough informtion to locte the strt of the trck, nd the key of the lst record in the trck which is lso normlly the highest vlue on tht trck. There is trck index for ech cylinder. Ech entry in the cylinder index gives the lst record on ech cylinder nd the ddress of the trck index for tht cylinder. If the cylinder index itself is stored on trcks, then the mster index will give the highest key referenced for ech trck of the cylinder index nd the strting ddress of tht trck.
56 File structures Mster index [Core store] Cylinder nd trck contining index Highest key referenced for corresponding index 1/010 0150 2/020 0250 3/015 0350 Cylinder index [Cylinder 1, trck 010] Highest key referenced for corresponding index Trck ddress of record index 0050 015 0150 030 Trck index [Cylinder 1, trck 030] Highest key in trck Strt of trck 0080 090 00100 091 Dt re Trck Key Dt 090 091 0060 0065 0070 0080 0085 0090 0095 0100 Figure 4.5. An exmple of n implementtion of n index-sequentil file (Adpted from D. R. Judd, The Use of Files, Mcdonld nd Elsevier, London nd New York 1973, pge 46)
Introduction 57 No mention hs been mde of the possibility of overflow during n updting process. Normlly provision is mde in the directory to dminister n overflow re. This of course increses the number of book-keeping entries in ech entry of the index. Multi-lists A multi-list is relly only slightly modified inverted file. There is one list per keyword, i.e. h i = 1. The records contining prticulr keyword K i re chined together to form the K i - list nd the strt of the K i -list is given in the directory, s illustrted in Figure 4.6. Since there is no K 3 -list, the field reserved for its pointer could well hve been omitted. So could ny blnk pointer field, so long s no mbiguity rises s to which pointer belongs to which keyword. One wy of ensuring this, prticulrly if the dt vlues (ttribute-vlues) re fixed formt, is to hve the pointer not pointing to the beginning of the record but pointing to the loction of the next pointer in the chin. The multi-list is designed to overcome the difficulties of updting n inverted file. The ddresses in the directory of n inverted file re normlly kept in record-number order. But, when the time comes to dd new record to the file, this sequence must be mintined, nd inserting the new ddress cn be expensive. No such problem rises with the multi-list, we updte the pproprite K-lists by simply chining in the new record. The penlty we py for this is of course the increse in serch time. This is in fct typicl of mny of the file structures. Inherent in their design is trde-off between serch time nd updte time. Cellulr multi-lists A further modifiction of the multi-list is inspired by the fct tht mny storge medi re divided into pges, which cn be retrieved one t time. A K-list my cross severl pge boundries which mens tht severl pges my hve to be ccessed to retrieve one record. A modified multi-list structure which voids this is clled cellulr multi-list. The K-lists re limited so tht they will not cross the pge (cell) boundries. At this point the full power of the nottion introduced before comes into ply. directory for cellulr multi-list will be the set of sequences (K i, n i, h i, i1,... ihi ) i = 1, 2,..., m where the h i hve been picked to ensure tht K i -list does not cross pge boundry. In n implementtion, just s in the implementtion of n index-sequentil file, further informtion will be stored with ech ddress to enble the right pge to be locted for ech key vlue. Ring structures A ring is simply liner list tht closes upon itself. In terms of the definition of K-list, the beginning nd end of the list re the sme record. This dt-structure is prticulrly useful to show clssifiction of dt. Let us suppose tht set of documents {D l, D 2, D 3, D 4, D 5, D 6, D 7, D 8 } hs been clssified into four groups, tht is {(D l, D 2 ), (D 3, D 4 ), (D 5, D 6 ), (D 7, D 8 )} Furthermore these hve themselves been clssified into two groups, {((D l, D 2 ), (D 3, D 4 )), ((D 5, D 6 ), (D 7, D 8 ))} The dendrogrm for this structure would be tht given in Figure 4.7. To represent this in storge by mens of ring structures is now simple mtter (see Figure 4.8). The
58 File structures Directory R, 2, 1, 2 11 R, 2, 1, 2 21 11 = 41 File c R 3 21 b R 4, 3, 1, 41 c c Λ b Λ Λ Figure 4.6. A multi-list D 1 D 2 D 3 D 4 D 5 D 6 D 7 D 8 Figure 4.7. A dendrogrm
Introduction 59 C 7 C 5 6 C C C C C 1 2 3 4 D D D D 1 2 3 4 D D 5 6 D 7 D 8 Figure 4.8. An implementtion of dendrogrm vi ring structures The D i indictes description (representtion) of document. Notice how the rings t lower level re contined in those t higher level. The field mrked C i normlly contins some identifying informtion with respect to the ring it subsumes. For exmple, C 1 in some wy identifies the clss of documents {D 1, D 2 }. Were we to group documents ccording to the keywords they shred, then for ech keyword we would hve group of documents, nmely, those which hd tht keyword in common. C i would then be the field contining the keyword uniting tht prticulr group. The rings would of course overlp (Figure 4.9), s in this exmple: D 1 = {K 1, K 2 } D 2 = {K 2, K 3 } D 3 = {K 1, K 4 } The usefulness of this kind of structure will become more pprent when we discuss serching of clssifictions. If ech ring hs ssocited with it record which contins identifying informtion for its members, then, serch strtegy serching structure such s this will first look t C i (or K i in the second exmple) to determine whether to proceed or bndon the serch. Threded lists
60 File structures In this section n elementry knowledge of list processing will be ssumed. Reders who re unfmilir with this topic should consult the little book by Foster 20. K K 1 2 D D D 3 1 2 Figure 4.9. Two overlpping rings A simple list representtion of the clssifiction ((D 1, D 2 ), (D 3, D 4 )), ((D 5, D 6 ), (D 7, D 8 )) is given in Figure 4.10. Ech sublist in this structure hs ssocited with it record contining only two pointers. (We cn ssume tht D i is relly pointer to document D i.) The function of the pointers should be cler from the digrm. The min thing to note, however, is tht the record ssocited with list does not contin ny identifying informtion. A modifiction of the implementtion of list structure like this which mkes it resemble set of ring structures is to mke the right hnd pointer of the lst element of sublist point bck to the hed of the sublist. Ech sublist hs become effectively ring structure. We now hve wht is commonly clled threded list (see Figure 4.11). The representtion I hve given is slight oversimplifiction in tht we need to flg which elements re dt elements (giving ccess to the documents D i ) nd which elements re just pointer elements. The mjor dvntge ssocited with threded list is tht it cn be trversed without the id of stck. Normlly when trversing conventionl list structure the return ddresses re stcked, wheres in the threded list they hve been incorported in the dt structure.
Introduction 61 D 1 D D 2 3 D D 4 5 D 6 D 7 D 8 Figure 4.10. A list structure implementtion of hierrchic clssifiction One disdvntge ssocited with the use of list nd ring structures for representing clssifictions is tht they cn only be entered t the 'top'. An dditionl index giving entry to the structure t ech of the dt elements increses the updte speed considerbly. D 1 2 D D 3 D D 4 5 D 6 D 7 D 8 Figure 4.11. A threded list implementtion of hierrchic clssifiction Another modifiction of the simple list representtion hs been studied extensively by Stnfel 21,22 nd Ptt 23. The individul elements (or cells) of the list structure re modified to incorporte one extr field, so tht insted of ech element looking like this
62 File structures it now looks like this P P S P P 1 2 1 2 where the P i s re pointers nd S is symbol. Otherwise no essentil chnge hs been mde to the simple representtion. This structure hs become known s the Doubly Chined Tree. Its properties hve minly been investigted for storing vrible length keys, where ech key is mde up by selecting symbols from finite (usully smll) lphbet. For exmple, let {A,B,C} be the set of key symbols nd let R 1, R 2, R 3, R 4, R 5 be five records to be stored. Let us ssign keys mde of the 3 symbols, to the record s follows: AAA R 1 AB R 2 AC R 3 BB R 4 BC R 5 An exmple of doubly chined tree contining the keys nd giving ccess to the records is given in Figure 4.12. The topmost element contins no symbol, it merely functions s the strt of the structure. Given n rbitrry key its presence or bsence is detected by mtching it ginst keys in the structure. Mtching proceeds level by level, once mtching symbol hs been found t one level, the P 1 pointer is followed to the set of lterntive symbols t the next level down. The mtching will terminte either: mtch; or (1) when the key is exhusted, tht is, no more key symbols re left to (2) when no mtching symbol is found t the current level. Level 0 A B Level 1 B R C R A B R C R 2 3 4 5 Level 2 A R 1 Level 3 Figure 4.12. An exmple of doubly chined tree For cse (1) we hve:
Introduction 63 () (b) the key is present if the P1 pointer in the sme cell s the lst mtching symbol now points to record; P1 points to further symbol, tht is, the key 'flls short' nd is therefore not in the structure. For cse (2), we lso hve tht the key is not in the structure, but now there is mismtch. Stnfel nd Ptt hve concentrted on generting serch trees with minimum expected serch time, nd preserving this property despite updting. For the detiled mthemtics demonstrting tht this is possible the reder is referred to their cited work. Trees Although computer scientists hve dopted trees s file structures, their properties were originlly investigted by mthemticins. In fct, substntil prt of the Theory of Grphs is devoted to the study of trees. Excellent books on the mthemticl spects of trees (nd grphs) hve been written by Berge 24, Hrry et l., 25 nd Ore 26. Hrry's book lso contins useful glossry of concepts in grph theory. In ddition Bertziss 3 nd Knuth 27 discuss topics in grph theory with pplictions in informtion processing. There re numerous definitions of trees. I hve chosen prticulrly simple one from Berge. If we think of grph s set of nodes (or points or vertices) nd set of lines (or edges) such tht ech line connects exctly two nodes, then tree is defined to be finite connected grph with no cycles, nd possessing t lest two nodes. To define cycle we first define chin. We represent the line u k joining two nodes x nd y by u k = [x,y]. A chin is sequence of lines, in which ech line u k hs one node in common with the preceding line u k-1, nd the other vertex in common with the succeeding line u k+1. An exmple of chin is [,x 1 ], [x 1,x 2 ], [x 2,x 3 ], [x 3,b]. A cycle is finite chin which begins t node nd termintes t the sme node (i.e. in the exmple = b). Berge gives the following theorem showing mny equivlent chrcteristions of trees. Theorem. Let H be grph with t lest n nodes, where n > 1; ny one of the following equivlent properties chrcterises tree. (1) H is connected nd does not possess ny cycles. (2) H contins no cycles nd hs n - 1 lines. (3) H is connected nd hs n - l lines. (4) H is connected but loses this property if ny line is deleted. (5) Every pir of nodes is connected by one nd only one chin. One thing to be noticed in the discussion so fr is tht no mention hs been mde of direction ssocited with line. In most pplictions in computer science (nd IR) one node is singled out s specil. This node is normlly clled the root of the tree, nd every other node in the tree cn only be reched by strting t the root nd proceeding long chin of lines until the node sought is reched. Implicitly therefore, direction is ssocited with ech line. In fct, when one comes to represent tree inside computer by list structure, often the ddresses re stored in wy which llows movement in only one direction. It is convenient to think of tree s directed grph with reserved node s the root of the tree. Of course, if one hs root then ech pth (directed chin) strting t the root will eventully terminte t prticulr node from which no further brnches will emerge. These nodes re clled the terminl nodes of the tree. By now it is perhps pprent tht when we were tlking bout ring structures nd threded lists in some of our exmples we were relly demonstrting how to implement tree structure. The dendrogrm in Figure 4.7 cn esily be represented s tree (Figure
64 File structures 4.13). The documents re stored t the terminl nodes nd ech node represents clss (cluster) of documents. A serch for prticulr set of documents would be initited t the root nd would proceed long the rrows until the required clss ws found. Another exmple of tree structure is the directory ssocited with n index-sequentil file. It ws described s hierrchy of indexes, but could eqully well hve been described s tree structure. The use of tree structures in computer science dtes bck to the erly 1950s when it ws relised tht the so-clled binry serch could redily be represented by binry tree. A binry tree is one in which ech node (except the terminl nodes) hs exctly two brnches leving it. A binry serch is n efficient method for detecting the presence or bsence of key vlue mong set of keys. It presupposes tht the keys hve been sorted. It proceeds by successive division of the set, t ech division discrding hlf the current set s not contining the sought key. When the set contins N sorted keys the serch time is of order log 2 N. Furthermore, fter some thought one cn see how this process cn be simply represented by binry tree. Unfortuntely, in mny pplictions one wnts the bility to insert key which hs been found to be bsent. If the keys re stored sequentilly then the time tken by the insertion opertion my be of order N. If one, however, stores the keys in binry tree this lengthy insert time my be overcome, both serch nd insert time will be of order log 2 N. The keys re stored t the nodes, t ech node left brnch will led to 'smller' keys, right brnch will led to 'greter' keys. A serch terminting on terminl node will indicte tht the key is not present nd will need to be inserted. The structure of the tree s it grows is lrgely dependent on the order in which new keys re presented. Serch time my become unnecessrily long becuse of the lop-sidedness of the tree. Fortuntely, it cn be shown (Knuth 28 ) tht rndom insertions do not chnge the expected log 2 N time dependence of the tree serch. Nevertheless, methods re vilble to prevent the possibility of degenerte trees. These re trees in which the keys re stored in such wy tht the expected serch time is fr from optiml. For exmple, if the keys were to rrive for insertion lredy ordered then the tree to be built would simply be s shown in Figure 4.14.
Introduction 65 Root Top of tree D D D D D D D D 1 2 3 4 5 6 7 8 Terminl nodes Bottom of tree Figure 4.13. A tree representtion of dendrogrm It would tke us too fr field for me to explin the techniques for voiding degenerte trees. Essentilly, the binry tree is mintined in such wy tht t ny node the subtree on the left brnch hs pproximtely s mny levels s the subtree on the right brnch. Hence the nme blnced tree for such tree. The serch pths in blnced tree will never be more thn 45 per cent longer thn the optimum. The expected serch nd insert times re still of order log N. For further detils the reder is recommended to consult Knuth28. K 1 K 2 K 3 Figure 4.14. An exmple of degenerte tree So fr we hve ssumed tht ech key ws eqully likely s serch rgument. If one hs dt giving the probbility tht the serch rgument is K i ( key lredy in the tree), nd the probbility tht the serch rgument lies between K i nd K i+1, then gin techniques re known for reordering the tree to optimise the expected serch time. Essentilly one mkes sure tht the more frequently ccessed keys hve the shortest serch pths from the root.
66 File structures One well-known technique used when only the second set of probbilities is known, nd the others ssigned the vlue zero, is the Hu-Tucker lgorithm. Agin the interested reder my consult Knuth. At this point it is probbly good ide to point out tht these efficiency considertions re lrgely irrelevnt when it comes to representing document clssifiction by tree structure. The sitution in document retrievl is different in the following spects: (1) we do not hve useful liner ordering on the documents; (2) serch request normlly does not seek the bsence or presence of document. In fct, wht we do hve is tht documents re more or less similr to ech other, nd request seeks documents which in some wy best mtch the request. A tree structure representing document clssifiction is therefore chosen so tht similr documents my be close together. Therefore to rerrnge tree structure to stisfy some 'blncedness' criterion is out of the question. The serch efficiency is chieved by bringing together documents which re likely to be required together. This is not to sy tht the bove efficiency considertions re unimportnt in the generl context of IR. Mny opertions, such s the serching of dictionry, nd using suffix stripping lgorithm cn be mde very efficient by ppropritely structuring the binry tree. The discussion so fr hs been limited to binry trees. In mny pplictions this twowy split is inpproprite. The nturl wy to represent document clssifictions is by generl tree structure, where there is no restriction on the number of brnches leving node. Another exmple is the directory of n index sequentil file which is normlly represented by n m-wy tree, where m is the number of brnches leving node. Finlly, more comments re in order bout the mnipultion of tree structures in mss storge devices. Up to now we hve ssumed tht to follow set of pointers poses no prticulr problems with regrd to retrievl speed. Unfortuntely, present rndom ccess devices re sufficiently slow for it to be impossible to llow n ccess for, sy, ech node in tree. There re wys of prtitioning trees in such wy tht the number of disk ccesses during tree serch cn be reduced. Essentilly, it involves storing number of nodes together in one 'pge' of disk storge. During disk ccess this pge is brought into fst memory, is then serched, nd the next pge to be ccessed is determined. Sctter storge or hsh ddressing One file structure which does not relte very well to the ones mentioned before is known s Sctter Storge. The technique by which the file structure is implemented is often clled Hsh Addressing. Its underlying principle is ppelingly simple. Given tht we my ccess the dt through number of keys K i, then the ddress of the dt in store is locted through key trnsformtion function f which when pplied to K i evlutes to give the ddress of the ssocited dt. We re ssuming here tht with ech key is ssocited only one dt item. Also for convenience we will ssume tht ech record (dt nd key) fits into one loction, whose ddress is in the imge spce of f. The ddresses given by the ppliction of f to the keys K i re clled the hsh ddresses nd f is clled hshing function. Idelly f should be such tht it spreds the hsh ddresses uniformly over the vilble storge. Of course this would be chieved if the function were one-to-one. Unfortuntely this cnnot be so becuse the rnge of possible key vlues is usully considerbly lrger thn the rnge of the vilble storge ddresses. Therefore, given ny hshing function we hve to contend with the fct tht two distinct keys K i nd K j re likely to mp to the sme ddress f(k i ) (=f(k j )). Before I explin some of the wys of deling with this I shll give few exmples of hshing functions. Let us ssume tht the vilble storge is of size 2 m then three simple trnsformtions re s follows:
Introduction 67 (1) if K i is the key, then tke the squre of its binry representtion nd select m bits from the middle of the result; (2) cut the binry representtion of K i into pieces ech of m bits nd dd these together. Now select the m lest significnt bits of the sum s the hsh ddress; (3) divide the integer corresponding to K i by the length of the vilble store 2 m nd use the reminder s the hsh ddress. Ech of these methods hs disdvntges. For exmple, the lst one my given the sme ddress rther frequently if there re ptterns in the keys. Before using prticulr method, the reder is dvised to consult the now extensive literture on the subject, e.g. Morris 29, or Lum et l. 30. As mentioned before there is the problem of collisions, tht is, when two distinct keys hsh to the sme ddress. The first point to be mde bout this problem is tht it destroys some of the simplicity of hshing. Initilly it my hve been thought tht the key need not be stored with the dt t the hsh ddress. Unfortuntely this is not so. No mtter wht method we use to resolve collisions we still need to store the key with the dt so tht t serch time when key is hshed we cn distinguish its dt from the dt ssocited with keys which hve hshed to the sme ddress. There re number of strtegies for deling with collisions. Essentilly they fll into two clsses, those which use pointers to link together collided keys nd those which do not. Let us first look t the ones which do not use pointers. These hve mechnism for serching the store, strting t the ddress where the collision occurred, for n empty storge loction if record needs to be inserted, or, for mtching key vlue t retrievl time. The simplest of these dvnces from the hsh ddress ech time moving long fixed number of loctions, sy s, until n empty loction or the mtching key vlue is found. The collision strtegy thus trces out well defined sequence of loctions. This method of deling with collisions is clled the liner method. The tendency with this method is to store collided records s closely to the initil hsh ddress s possible. This leds to n undesirble effect clled primry clustering. In this context ll this mens is tht the records tend to concentrte in groups or bunch-up. It destroys the uniform nture of the hshing function. To be more precise, it is desirble tht hsh ddresses re eqully likely, however, the first empty loction t the end of collision sequence increses in likelihood in proportion to the number of records in the collision sequence. To see this one needs only to relise tht key hshed to ny loction in the sequence will hve its record stored t the end of the sequence. Therefore big groups of records tend to grow even bigger. This phenomenon is ggrvted by smll step size s when seeking n empty loction. Sometimes s = 1 is used in which cse the collision strtegy is known s the open ddressing technique. Primry clustering is lso worse when the hsh tble (vilble storge) is reltively full. Vritions in the liner method which void primry clustering involve mking the step size vrible. One wy is to set s equl to i + bi 2 on the ith step. Another is to invoke rndom number genertor which clcultes the step size fresh ech time. These lst two collision hndling methods re clled the qudrtic nd rndom method respectively. Although they void primry clustering they re nevertheless subject to secondry clustering, which is cused by keys hshing to the sme ddress nd following the sme sequence in serch of n empty loction. Even this cn be voided, see for exmple Bell nd Kmn31. The second clss of collision hndling methods involves extr storge spce which is used to chin together collided records. When collision occurs t hsh ddress it my be becuse it is the hed of chin of records which hve ll hshed to tht ddress, or it my be tht record is stored there which belongs to chin strting t some other ddress. In both cses free loction is needed which in the first cse is simply linked in nd stores the new record, in the second cse the intermedite chin element is moved to the free loction nd the new record is stored t its own hsh ddress thus strting new chin ( oneelement chin so fr). A vrition on this method is to use two-level store. At the first level we hve hsh tble, t the second level we hve bump tble which contins ll the
68 File structures collided records. At hsh ddress in the hsh tble we will find either, record if no collisions hve tken plce t tht ddress, or, pointer to chin of records which collided t tht ddress. This ltter chining method hs the dvntge tht records need never be moved once they hve been entered in the bump tble. The storge overhed is lrger since records re put in the bump tble before the hsh tble is full. For both clsses of collision strtegies one needs to be creful bout deletions. For the liner, qudrtic etc. collision hndling strtegies we must ensure tht when we delete record t n ddress we do not mke records which collided t tht ddress unrechble. Similrly with the chining method we must ensure tht deleted record does not leve gp in the chin, tht is, fter deletion the chin must be reconnected. The dvntges of hshing re severl. Firstly it is simple. Secondly its insertion nd serch strtegies re identicl. Insertion is merely filed serch. If K i is the hshed key, then if serch of the collision sequence fils to turn up mtch in K i, its record is simply inserted t the end of the sequence t the next free loction. Thirdly, the serch time is independent of the number of keys to be inserted. The ppliction of hshing in IR hs tended to be in the re of tble construction nd look-up procedures. An obvious ppliction is when constructing the set of confltion clsses during text processing. In Chpter 2, I gve n exmple of document representtive s simply list of clss nmes, ech nme stnding for set of equivlent words. During retrievl opertion, query will first be converted into list of clss nmes. To do this ech significnt word needs to be looked up in dictionry which gives the nme of the clss to which it belongs. Clerly there is cse for hshing. We simply pply the hshing function to the word nd find the nme of the confltion clss to which it belongs t the hsh ddress. A similr exmple is given in gret detil by Murry 32. Finlly, let me recommend two very redble discussions on hshing, one is in Pge nd Wilson 33, the other is in Knuth's third volume 28. Clustered files It is now common prctice to refer to file processed by clustering lgorithm s clustered file, nd to refer to the resulting structure s file structure. For exmple Slton 34 (p. 288) lists clustered file s n lterntive orgnistion to inverted, seril, chined files, etc. Although it my be convenient terminologiclly, it does disguise the rel sttus of cluster methods. Cluster methods (or utomtic clssifiction methods) re more profitbly discussed t the level of bstrction t which reltions re discussed in connection with dt bses, tht is, in thoroughly dt independent wy. In other words, selecting n pproprite cluster method nd implementing it re two seprte problems. Unfortuntely not ll users of clustering techniques see it this wy, nd so the current scene is rther confused. One fctor contributing to the confusion is tht clustering techniques hve been used t very low level of implementtion of system softwre, for exmple, to reduce the number of pge exceptions in virtul memory. Therefore, those who use clustering merely to increse retrievl efficiency (in terms of storge nd speed) will tend to see clssifiction structure s file structure, wheres those who see clustering s mens of discovering (or summrising) some inherent structure in the dt will look upon the sme structure s description of the dt. Of course, this description my be used to chieve more efficient retrievl (nd in IR more effective retrievl in terms of sy precision nd recll). Furthermore, if one looks crefully t some of the implementtions of cluster methods one discovers tht the clssifictory system is represented inside the computer by one of the more conventionl file structures. Bibliogrphic remrks There is now vst literture on file structures lthough there re very few survey rticles. Where possible I shll point to some of the more detiled discussions which emphsise n ppliction in IR. Of course the chpter on file orgnistion in the Annul Review is good
Introduction 69 source of references s well. Chpter 7 of Slton's ltest book contins useful introduction to file orgnistion techniques 34. A generl rticle on dt structures of more philosophicl nture well worth reding is Meley 32. A description of the use of sequentil file in n on-line environment my be found in Negus nd Hll 36. The effectiveness nd efficiency of n inverted file hs been extensively compred with file structure bsed on clustering by Murry 37. Ein-Dor 38 hs done comprehensive comprison between n inverted file nd tree structured file. It is hrd to find discussion of n index-sequentil file which mkes specil reference to the needs of document retrievl. Index-sequentil orgnistion is now considered to be bsic softwre which cn be used to implement vriety of other file orgnistions. Nevertheless it is worth studying some of the spects of its implementtion. For this I recommend the pper by McDonell nd Montgomery 39 who give detiled description of n implementtion for mini-computer. Multi-lists nd cellulr multi-lists re firly well covered by Lefkovitz 40. Ring structures hve been very populr in CAD nd hve been written up by Gry 4l. Extensive use ws mde of modified threded list by vn Rijsbergen 42 in his cluster-bsed retrievl experiments. The doubly chined tree hs been dequtely delt with by Stnfel 21,22 nd Ptt 23. Work on tree structures in IR goes bck long wy s illustrted by the erly ppers by Slton 43 nd Sussenguth 44. Trees hve lwys ttrcted much ttention in computer science, minly for the bility to reduce expected serch times in dt retrievl. One of the erliest ppers on this topic is by Windley 45 but the most extensive discussion is still to be found in Knuth 28 where not only methods of construction re discussed but lso techniques of reorgnistion. More recently specil kind of tree, clled trie, hs ttrcted ttention. This is tree structure which hs records stored t its terminl nodes, nd discrimintors t the internl nodes. A discrimintor t node is mde up from the ttributes of the records dominted by tht node. Or s Knuth puts it: 'A trie is essentilly M-ry tree whose nodes re M- plce vectors with components corresponding to digits or chrcters. Ech node on level l represents the set of ll keys tht begin with certin sequence of l chrcters; the node specifies n M-wy brnch depending on the (l + 1)st chrcter.' Tries were invented by Fredkins 46, further considered by Sussenguth 44, nd more recently studied by Burkhrd 47, Rivest 48, nd Bentley 49. The use of tries in dt retrievl where one is interested in either mtch or mismtch is very similr to the construction of hierrchic document clssifiction, where ech node of the tree representing the hierrchy is lso ssocited with 'discrimintor' used to direct the serch for relevnt documents (see for exmple Figure 5.3 in Chpter 5). The use of hshing in document retrievl is delt with in Higgins nd Smith 50, nd Chous 51. It hs become fshionble to refer to document collections which hve been clustered s clustered files. I hve gone to some pins to void the use of this terminology becuse of the conceptul difference tht exists between structure which is inherent in the dt nd cn be discovered by clustering, nd n orgnistion of the dt to fcilitte its mnipultion inside computer. Unfortuntely this distinction becomes somewht blurred when clustering techniques re used to generte physicl orgnistion of dt. For exmple, the work by Bell et l. 52 is of this nture. Furthermore, it hs recently become populr to cluster records simply to improve the efficiency of retrievl. Buckhrd nd Keller 53 bse the design of file structure on mximl complete subgrphs (or cliques). Htfield nd Gerld 54 hve designed pging lgorithm for virtul memory store bsed on clustering. Simon nd Guiho 55 look t methods for preserving 'clusters' in the dt when it is mpped onto physicl storge device. Some of the work tht hs been lrgely ignored in this chpter, but which is nevertheless of importnce when considering the implementtion of file structure, is concerned directly
70 File structures with the physicl orgnistion of storge device in terms of block sizes, etc. Unfortuntely, generl sttements bout this re rther hrd to mke becuse the orgnistion tends to depend on the hrdwre chrcteristics of the device nd computer. Representtive of work in this re is the pper by Lum et l. 56. References 1. HSIAO, D. nd HARARY, F., 'A forml system for informtion retrievl from files', Communictions of the ACM, 13, 67-7 3 ( 1970) 2. ROBERTS, D. C., 'File orgniztion techniques', Advnces in Computers, 12, 115-174 (1972) 3. BERTZISS, A. T., Dt Structures: Theory nd Prctice, Acdemic Press, London nd New York (1971) 4. DODD, G. G., 'Elements of dt mngement systems', Computing Surveys, 1, 117-133 (1969) 5. CLIMENSON, W. D., 'File orgniztion nd serch techniques', Annul Review of Informtion Science nd Technology, 1, 107-135 ( 1966) 6. WARHEIT, 1. A., 'File orgniztion of librry records', Journl of Librry Automtion, 2, 20-30 (1969) 7. PRYWES, N. S. nd SMITH, D. P., 'Orgniztion of informtion', Annul Review of Informtion Science nd Technology, 7, 103-158 (1972) Technology, 7,103 158 (1972) 8. CODD, E. E., 'A reltionl model of dt for lrge shred dt bnks', Communictions of the A CM, 13, 377-387 ( 1970) 9. SENKO, M. E., 'Informtion systems: records, reltions, sets, entities, nd things', Informtion Systems, 1, 3-13 (1975) 10. DATE, C. J., An Introduction to Dt Bse Systems, Addison-Wesley, Reding, Mss. (1975) 11. SIBLEY, E. H., 'Specil Issue: Dt bse mngement systems', Computing Surveys, 8, No. I (1976) 12. MARON, M. E., 'Reltionl dt file I: Design philosophy', In: Informtion Retrievl (Edited by Schecter 6), 211-223 (1967) 13. LEVIEN, R., 'Reltionl dt file Il: Implementtion', In: Informtion Retrievl (Edited by Schecter 6), 225-241(1967) 14. HAYES, R. M. nd BECKER, J., Hndbook of Dt Processing for Librries, Melville Publishing Co., Los Angeles, Cliforni (1974) 15. HSIAO, D., 'A generlized record orgniztion', IEEE Trnsctions on Computers, C-20, 1490-1495 (1971) 16. MANOLA, F. nd HSIAO, D. K., 'A model for keyword bsed file structures nd ccess', NRL Memorndum Report 2544, Nvl Reserch Lbortory, Wshington D.C. (1973) 17. SEVERANCE, D. G., 'A prmetric model of lterntive file structures', Informtion Systems, 1, 51-55 (1975) 18. vn RIJSBERGEN, C. J., 'File orgniztion in librry utomtion nd informtion retrievl', Journl of Documenttion, 32, 294-317 ( 1976) 19. JONKERS, H. L., 'A strightforwrd nd flexible design method for complex dt bse mngement systems', Informtion Storge nd Retrievl, 9, 401-415 (1973) 20. FOSTER, J. M., List Processing, Mcdonld, London; nd Americn Elsevier Inc., New York ( 1967 )
Introduction 71 21. STANFEL, L. E., 'Prcticl spect of doubly chined trees for retrievl', Journl of the ACM, 19, 425~36 (1972) 22. STANFEL, L. E., 'Optiml trees for clss of informtion retrievl problems', Informtion Storge nd Retrievl, 9, 43-59 (1973) 23. PATT, Y. N., Minimum serch tree structure for dt prtitioned into pges', IEEE Trnstions on Computers, C-21, 961-967 (1972) 24. BERGE, C., The Theory of Grphs nd its Applictions, Methuen, London (1966) 25. HARARY, F., NORMAN, R. Z. nd CARTWRIGHT, D., Structurl Models: An Introduction to the Theory of Directed Grphs, Wiley, New York (1966) 26. ORE, 0., Grphs nd their Uses, Rndom House, New York (1963) 27. KNUTH, D. E., The Art of Computer Progrmming, Vol. 1, Fundmentl Algorithms, Addison-Wesley, Reding, Msschusetts (1968) 28. KNUTH, D. E., The Art of Computer Progrmming, Vol 3, Sorting nd Serching, Addison- Wesley, Reding, Msschusetts (1973) 29. MORRIS, R., 'Sctter storge techniques', Communictions of the ACM, 11, 35-38 (1968) 30. LUM, V. Y., YUEN, P. S. T. nd DODD, M., 'Key-to-ddress trnsform techniques: fundmentl performnce study on lrge existing formtted files', Communictions of the ACM, 14, 228-239 (1971) 31. BELL, J. K. nd KAMAN, C. H., 'The liner quotient hsh code', Communictions of the ACM, 13, 675-677 ( 1970) 32. MURRAY, D. M., 'A sctter storge scheme for dictionry lookups' In: Report ISR-16 to the Ntionl Science Foundtion, Section 11, Cornell University, Deprtment of Computer Science (1969) 33. PAGE, E. S. nd WILSON, L. B., Informtion Representtion nd Mnipultion in Computer, Cmbridge University Press, Cmbridge (1973) 34. SALTON, G., Dynmic Informtion nd Librry Processing, Prentice-Hll, Englewood Cliffs, N.J. (1975) 35. MEALEY, G.H., 'Another look t dt', Proceedings AFIP Fll Joint Computer Conference, 525-534 (1967) 36. NEGUS, A.E. nd HALL, J.L., 'Towrds n effective on-line reference retrievl system', Librry Memo CLM-LM2/71, U.K. Atomic Energy Authority, Reserch Group (1971) 37. MURRAY, D.M., 'Document retrievl bsed on clustered files', Ph.D. Thesis, Cornell University Report ISR-20 to Ntionl Science Foundtion nd to the Ntionl Librry of Medicine (1972) 38. EIN-DOR, P., 'The comprtive efficiency of two dictionry structures for document retrievl', Infor Journl, 12, 87-108 (1974) 39. McDONELL, K.J. nd MONTGOMERY, A.Y., 'The design of indexed sequentil files', The Austrlin Computer Journl, 5, 115-126 (1973) 40. LEFKOVITZ, D., File Structures for On-Line Systems, Sprtn Books, New York (1969) 41. GRAY, J.C., 'Compound dt structure for computer ided design: survey', Proceedings ACM Ntionl Meeting, 355-365 (1967) 42. vn RIJSBERGEN, C.J., 'An lgorithm for informtion structuring nd retrievl', Computer Journl, 14, 407-412 (1971)
72 File structures 43. SALTON, G., Mnipultion of trees in informtion retrievl', Communictions of the ACM, 5, 103-114 (1962) 44. SUSSENGUTH, E.H., 'Use of tree structures for processing files', Communictions of the ACM, 6, 272-279 (1963) 45. WINDLEY, P.F., 'Trees, forests nd rerrnging', Computer Journl, 3, 84-88 (1960) 46. FREDKIN, E., 'Trie memory', Communictions of the ACM, 3, 490-499 (1960) 47. BURKHARD, W.A., 'Prtil mtch queries nd file designs', Proceedings of the Interntionl Conference on Very Lrge Dt Bses, 523-525 (1975) 48. RIVEST, R., Anlysis of Associtive Retrievl Algorithms, Ph.D. Thesis, Stnford University, Computer Science (1974) 49. BENTLEY, J.L., 'Multidimensionl binry serch trees used for ssocitive serching', Communictions of the ACM, 13, 675-677 (1975) 50. HIGGINS, L.D. nd SMITH, F.J., 'Disc ccess lgorithms', Computer Journl, 14, 249-253 (1971) 51. CHOU, C.K., 'Algorithms for hsh coding nd document clssifiction', Ph.D. Thesis, University of Illinois (1972) 52. BELL, C.J., ALDRED, B.K. nd ROGERS, T.W., 'Adptbility to chnge in lrge dt bse informtion retrievl systems', Report No. UKSC-0027, UK Scientific Centre, IBM United Kingdom Limited, Neville Rod, Peterlee, County Durhm, U.K. (1972) 53. BURKHARD, W.A. nd KELLER, R.M., 'Some pproches to best-mtch file serching', Communictions of the ACM, 16, 230-236 (1973) 54. HATFIELD, D.J. nd GERALD, J., 'Progrm restructuring for virtul memory', IBM Systems Journl, 10, 168-192 (1971) 55. SIMON, J.C. nd GUIHO, G., 'On lgorithms preserving neighbourhood to file nd retrieve informtion in memory', Interntionl Journl Computer Informtion Sciences, 1, 3-15 (CR 23923) (1972) 56. LUM, V.Y., LING, H. nd SENKO, M.E., 'Anlysis of complex dt mngement ccess method by simultion modelling', Proceedings AFIP Fll Joint Computer Conference, 211-222 (1970)