On the Use of Bayesian Networks to Analyze Survey Data

P. Sebastiani^1 (1) and M. Ramoni (2)
(1) Department of Mathematics and Statistics, University of Massachusetts.
(2) Children's Hospital Informatics Program, Harvard University Medical School.

Keywords: Automated modeling, Bayesian networks, graphical models, Bayesian model selection.

Abstract. This paper uses Bayesian modeling techniques to analyze a data set extracted from the British General Household Survey. The models used are Bayesian networks, which provide a compact and easy to interpret knowledge representation formalism. An issue considered is the need for automated Bayesian modeling.

1. Introduction

The General Household Survey is a yearly survey, based on a sample of the general population resident in private households in Great Britain. The General Household Survey began in 1971 and data is available from 1973 onwards. It is widely regarded as a gold standard because of its survey design and data collection, and it has been copied by several countries. The goal of this survey is to provide continuous information about the major social fields of population, housing, education, employment, health and income. Since the survey covers all these topics, it provides users with the opportunity to examine not only each topic separately, but also their mutual interplay. Summaries of the statistical findings are published by the British Office for National Statistics, and are typically presented via contingency tables relating two or three variables at a time (see Thomas et al., 1998). We believe that this communication style fails one of the primary objectives of the survey, which is to offer, to a non-technical audience, an up-to-date picture of living in Great Britain. To avoid the fragmentation of the overall information, one should try to build a model that associates a large number of variables. To be a communication tool, however, such a model needs to be easily understandable and easy to use.
^1 Address for correspondence: Department of Mathematics and Statistics, Lederle Graduate Research Tower, University of Massachusetts, Amherst MA 01003. Email: sebas@math.umass.edu, Telephone: 413 545 06.

Understandability and usability being the requirements, we focus on Bayesian networks, which are known for providing a compact and easy-to-use representation of probabilistic information (see Lauritzen, 1996, and Cowell et al., 1999). A Bayesian network has two components: a directed acyclic graph and a probability distribution. Nodes in the directed acyclic graph represent stochastic variables and arcs represent directed stochastic dependencies among these variables. Thus, the graph provides a simple summary of the dependency structure relating the variables. The probability distribution for the network variables decomposes according to the conditional independencies represented by the directed acyclic graph,
and each component, a conditional probability table, quantifies the remaining directed dependencies. The graph is an effective way to describe the overall dependency structure of a large number of variables, thus removing the limitation of examining only the pair-wise associations of variables. Furthermore, one can easily investigate undirected relationships between the variables, as well as make predictions and explanations, by querying the network. This last task consists of computing the conditional probability distribution of one variable, given that values of some variables in the network are observed. Nowadays there are several efficient algorithms for probabilistic reasoning, which take advantage of the network decomposability (Castillo et al., 1997), and commercial programs such as Bayesware Discoverer (available at http://www.bayesware.com) or Hugin (available at http://www.hugin.com) implement these algorithms.

The problem to be addressed, and we believe one of the reasons for the slow gain in popularity of these models in the statistical community, is how to practically build a Bayesian network from a large data set using Bayesian methods. This is considered in the next section. In Section 3 we analyze a data set extracted from the 1996 General Household Survey. The model selected is a network that displays a global picture of living in Britain and discovers interesting associations among variables describing the household wealth, the socio-economic status and the ethnic group of the head of the household.

2. Overview of automated learning

A Bayesian network is a directed acyclic graph and a probability distribution. Nodes in the directed acyclic graph represent stochastic variables $X = (X_1, \ldots, X_v)$, and directed arcs from parent nodes to a child node represent conditional dependencies. Any conditional dependence is quantified by the set of conditional distributions of the child variable given the configurations of the parent variables. Marginal and conditional independencies encoded by the directed acyclic graph (Lauritzen, 1996) provide the following factorization of the joint probability distribution:

$$p(x_{1k}, x_{2k}, \ldots, x_{vk}) = \prod_{i=1}^{v} p(x_{ik} \mid \pi_{ij}).$$

Here, $(x_{1k}, x_{2k}, \ldots, x_{vk})$ is a combination of values of the variables in $X$.
For each $i$, the variable $\Pi_i$ denotes the parents of $X_i$, while $x_{ik}$ and $\pi_{ij}$ denote the events $X_i = x_{ik}$ and $\Pi_i = \pi_{ij}$. Particularly, $\pi_{ij}$ is the combination of values of the parent variable $\Pi_i$ in the event $(x_{1k}, x_{2k}, \ldots, x_{vk})$.

The problem we consider next is learning a Bayesian network from data. We can describe this as a hypothesis testing problem. Suppose we have a set $\mathcal{M} = \{M_1, \ldots, M_g\}$ of Bayesian networks for the discrete random variables $X = (X_1, \ldots, X_v)$. Each Bayesian network represents a hypothesis on the dependency structure relating the variables. We wish to choose one Bayesian network after observing a sample of data $D = \{(x_{1k}, x_{2k}, \ldots, x_{vk})\}$, for $k = 1, \ldots, n$. With $p(M_h)$
denoting the prior probability of $M_h$, for each $h = 1, \ldots, g$, the typical Bayesian solution to the model selection problem consists of choosing the network with maximum posterior probability

$$p(M_h \mid D) \propto p(M_h)\, p(D \mid M_h).$$

The quantity $p(D \mid M_h)$ is the marginal likelihood, and it is computed as follows. Given the Bayesian network $M_h$, let $\theta$ denote the vector parameterizing the joint distribution of the variables $X = (X_1, \ldots, X_v)$. We denote by $p(\theta)$ the prior density of $\theta$. The likelihood function is $p(D \mid \theta)$ and the marginal likelihood is computed by averaging out $\theta$ from the likelihood function $p(D \mid \theta)$. Hence

$$p(D \mid M_h) = \int p(D \mid \theta)\, p(\theta)\, d\theta.$$

The computation of the marginal likelihood requires the specification of a parameterization of each model, and the elicitation of a prior density for $\theta$. In this paper we suppose that the variables $X = (X_1, \ldots, X_v)$ are all discrete, so that the parameter vector $\theta$ consists of the conditional probabilities $\theta_{ijk} = p(X_i = x_{ik} \mid \Pi_i = \pi_{ij}, \theta)$. In this framework, it is easy to show that, under the assumption of multinomial sampling with complete data, the likelihood function becomes

$$p(D \mid \theta) = \prod_{ijk} \theta_{ijk}^{n_{ijk}},$$

where $n_{ijk}$ is the sample frequency of pairs $(x_{ik}, \pi_{ij})$ in the database $D$. The Hyper-Dirichlet distribution, which is defined as a set of independent Dirichlet distributions $D(\alpha_{ij1}, \ldots, \alpha_{ijc_i})$, one for each set of parameters $\{\theta_{ijk}\}_k$ associated with the conditional distribution of $X_i$ given $\pi_{ij}$, is a numerically convenient choice. It is well known (see Cowell et al., 1999) that this choice for the prior distribution provides the following formula for the marginal likelihood of the data:

$$p(D \mid M_h) = \prod_{ij} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} \prod_{k} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})}.$$

Here, $n_{ij} = \sum_k n_{ijk}$ is the marginal frequency of $\pi_{ij}$ in the database, and $\alpha_{ij} = \sum_k \alpha_{ijk}$. For consistent model comparisons, we adopt symmetric Hyper-Dirichlet distributions, which depend on one hyper-parameter $\alpha$, called global precision. Each hyper-parameter $\alpha_{ijk}$ is computed
from $\alpha$ as $\alpha_{ijk} = \alpha / (q_i c_i)$, where $c_i$ is the number of categories of the variable $X_i$, and $q_i$ is the number of categories of the parent variable $\Pi_i$. The rationale behind this choice is to distribute the overall prior precision $\alpha$ in a uniform way among the parameters associated with different conditional probability tables. In this way, the prior probabilities quantifying each network are uniform, and all the prior marginal distributions of the network variables are uniform and have the same prior precision.

In principle, given a set of Bayesian networks, with prior probabilities, and a complete data set, one can compute their posterior probability distribution and select the network with maximum posterior probability. However, as the number of variables in the data set increases, the size of the search space makes the task infeasible. Thus some heuristic is required to reduce the dimension of the search space. Fortunately, under some particular model prior probabilities, the posterior probability of each model factorizes, thus allowing local computations. This property can be fully exploited by imposing an order over the variables, which transforms model selection into a sequence of locally exhaustive searches. We will also describe a greedy search algorithm to reduce the complexity of each locally exhaustive search when the model space is still too large.

The marginal likelihood $p(D \mid M_h)$ above has a multiplicative form. This fact, together with the assumption that the network prior probabilities are decomposable (Heckerman et al., 1995), provides a factorization of each model posterior probability. A prior probability for a network $M_h$ is termed decomposable if it admits the factorization

$$p(M_h) = \prod_{i=1}^{v} p(M_{hi}),$$

where $p(M_{hi})$ is the prior probability of the local network structure $M_{hi}$ that specifies the parent set $\Pi_i$ for the variable $X_i$. Thus, decomposable priors are elicited by exploiting the modularity of a Bayesian network, and are based on the assumption that the prior probability of a local structure $M_{hi}$ of a Bayesian network is independent of the other parts.
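To make the local computation concrete, the following Python sketch evaluates the logarithm of one variable's factor of the marginal likelihood under a symmetric Hyper-Dirichlet prior, using $\alpha_{ijk} = \alpha/(q_i c_i)$ and $\alpha_{ij} = \alpha/q_i$. It is only an illustration under stated assumptions: the function name, the layout of the count table and the example counts are hypothetical, and it is not the implementation used by any of the programs mentioned in this paper. The ratio of two such factors compares two candidate parent sets for the same variable.

```python
from math import exp, lgamma


def local_log_marginal_likelihood(counts, alpha=1.0):
    """Log of one variable's factor of the marginal likelihood.

    counts[j][k] holds n_ijk, the number of cases in which the variable
    takes its k-th state while its parents take their j-th configuration.
    A variable without parents has a single row (q = 1).
    """
    q = len(counts)               # number of parent configurations
    c = len(counts[0])            # number of states of the variable
    alpha_jk = alpha / (q * c)    # symmetric hyper-parameter alpha_ijk
    alpha_j = alpha / q           # alpha_ij, the sum of alpha_ijk over k
    log_ml = 0.0
    for row in counts:
        n_j = sum(row)            # n_ij, marginal frequency of the configuration
        log_ml += lgamma(alpha_j) - lgamma(alpha_j + n_j)
        for n_jk in row:
            log_ml += lgamma(alpha_jk + n_jk) - lgamma(alpha_jk)
    return log_ml


# Hypothetical counts for a binary variable with one binary candidate parent.
counts_with_parent = [[8, 2], [1, 9]]
# The same 20 cases with the parent ignored.
counts_no_parent = [[9, 11]]

log_ml_with_parent = local_log_marginal_likelihood(counts_with_parent, alpha=1.0)
log_ml_no_parent = local_log_marginal_likelihood(counts_no_parent, alpha=1.0)

# Ratio of the two local marginal likelihoods; values above 1 favor the parent.
local_bayes_factor = exp(log_ml_with_parent - log_ml_no_parent)
print(local_bayes_factor)
```

Working on the log scale with `lgamma` avoids overflow of the Gamma function when the counts are as large as those of a survey data set.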
This factorization of each model prior probability, together with the factorization of the marginal likelihood, ensures that the posterior probability of the Bayesian network $M_h$ can be written as

$$p(M_h \mid D) = \prod_{i=1}^{v} p(M_{hi} \mid D).$$

Thus, the network posterior probabilities are decomposable and, in the comparison of models that differ only in the parent sets of a variable $X_i$, only the quantity $p(M_{hi} \mid D)$
matters. Thus, for fixed $i$, the comparison of two local network structures $M_{hi}$ and $\tilde{M}_{hi}$ specifying different parent sets for $X_i$ can be done by simply evaluating the product of the local Bayes factor $BF = p(D \mid M_{hi}) / p(D \mid \tilde{M}_{hi})$ and the prior odds $p(M_{hi}) / p(\tilde{M}_{hi})$ to compute the posterior odds of $M_{hi}$ versus $\tilde{M}_{hi}$. This comparison is independent of any other associations among the other $v - 1$ variables.

Now, the problem is how to exploit this posterior probability decomposability. One approach, proposed by Cooper and Herskovits (see Cooper and Herskovits, 1992), is to restrict the model search to a subset of all possible networks, which are consistent with an order relation $\succ$ on the variables $X = (X_1, \ldots, X_v)$. The order relation $\succ$ is defined by $X_j \succ X_i$ if $X_i$ cannot be a parent of $X_j$ in any network in $\mathcal{M}$. In other words, rather than exploring networks with arcs having all possible directions, this order limits the search to a subset of networks in which there are interesting directed associations. At first glance, the requirement for an order among the variables appears to be a serious restriction on the applicability of this search strategy, but we have successfully implemented it in other applications (see Sebastiani et al., 2000). From a modeling point of view, specifying this order is equivalent to specifying the hypotheses to be tested, and some careful screening of the variables in the data set may avoid the surprise of selecting a not very sensible model or exploring uninteresting associations. In the next section, we will consider the problem of selecting an order among the variables in a real application.

This order imposed on the variables induces a set of $k$ possible parents for each variable $X_i$, say $P_i = \{X_{i1}, \ldots, X_{ik}\}$. One way to proceed, which produces the sequence of locally exhaustive searches, is to implement an independent model selection for each variable $X_i$ as follows. For each variable $X_i$, we define $\mathcal{M}_i$ to be the set of networks given by the possible combinations of parents for $X_i$. The set of networks $\mathcal{M}_i$ can be displayed on a lattice, each level of which contains the models in which the associated directed acyclic graph specifies the same number of parents for $X_i$, from none up to $k$. The first level of the lattice contains the model $M_{i0}$ in which $X_i$ does not have parents.
The second level contains the $k$ models $M_{ij}$ in which $X_{ij}$ alone is parent of $X_i$, and so on. For each variable $X_i$, the exhaustive search consists of evaluating the posterior probability of each model in the lattice so that the model
with maximum posterior probability can be identified. The global model is then found by linking together the local models for each variable.

Although the order among the variables greatly reduces the dimension of the search space, this locally exhaustive search should explore a lattice of $2^k$ models for each variable and, for large $k$, this may be infeasible. A further reduction is obtained via a greedy search strategy, also known as the K2 algorithm (see Cooper and Herskovits, 1992). The K2 algorithm is a bottom-up strategy, so that simpler models are evaluated first. For each variable $X_i$, rather than computing the posterior probability of all networks in the set $\mathcal{M}_i$, the search moves up in the lattice as long as in the level just explored there is at least one network with posterior probability higher than the posterior probabilities of the networks in the preceding level. The search starts by evaluating the marginal likelihood $p(D \mid M_{i0})$ of the local network structure $M_{i0}$ encoding independence of $X_i$ and the variables in the set $P_i$. The next step is the computation of the marginal likelihood $p(D \mid M_{ij})$ of the $k$ Bayesian networks $M_{ij}$, each of which describes the dependence of $X_i$ on the variable $X_{ij}$. If the maximal marginal likelihood $p(D \mid M_{iJ})$, for some $J$, is greater than $p(D \mid M_{i0})$, the parent $X_{iJ}$ is accepted and the search proceeds in the same manner by trying to add one of the parents from the set $P_i \setminus \{X_{iJ}\}$ to the Bayesian network selected. If none of the $k$ Bayesian networks has a marginal likelihood greater than $p(D \mid M_{i0})$, the model $M_{i0}$ is accepted and the search moves to some other variable. Clearly, this heuristic search can end up in a local maximum, and one should be aware of this risk when interpreting the model eventually selected. Other search strategies have been proposed to address this problem (see Cowell et al., 1999, and references therein).

3. Analysis

In this section we analyze a data set extracted from the British General Household Survey, which was conducted between April 1996 and March 1997 by the Social Survey Division of the Office for National Statistics in the United Kingdom. This annual, multipurpose survey is based on a sample of around 10,000 private households in Great Britain.
Interviews are conducted with everyone aged over 16 in the household (around 18,000 adults). The data set we consider comprises 9033 British households, which, following the definition introduced in 1981, consist of a single person or of a group of people who have the address as their only or main residence and who share either one meal a day or the living accommodation.

Crown Copyright 1996. Used by permission of the British Office for National Statistics.

In order to show the potential usefulness of our methodology, we selected thirteen variables describing the British households in terms of composition (variables Ad_fems, Ad_males, Children, Hoh_age, Hoh_gend), regions of the United Kingdom (variable Region), one ethnicity indicator (variable Hoh_origin), one mobility indicator (variable Hoh_reslen) and economic indicators of the household (variables Accom, Bedrms, Ncars, Hoh_status, Tenure). A complete description of these
variables and their states is summarized in Table 1. This group of variables was fully observed in the data set extracted from the survey. The modeling of the data was carried out with the program Bayesware Discoverer, which implements the model search approach described in the previous section.

Variable      Description                   State description
Region        Region of birth of the Hoh    England, Scotland and Wales
Ad_fems       Number of adult females       0, 1, 2
Ad_males      Number of adult males         0, 1, 2
Children      Number of children            0, 1, 2, 3, 4
Hoh_age       Age of the Hoh                17-36; 36-50; 50-66; 66-98 (years)
Hoh_gend      Gender of the Hoh             Male, Female
Hoh_origin    Ethnic group of the Hoh       Cauc., Black, Chin., Indian, Other
Hoh_reslen    Length of residence           0-3; 3-9; 9-19; 19 (months)
Hoh_status    Status of Hoh                 Active, Inactive, Retired
Accom         Type of accommodation         Room, Flat, House, Other
Bedrms        Number of bedrooms            1, 2, 3, 4
Tenure        House status                  Rent, Owned, Social-Sector
Ncars         Number of cars                1, 2, 3, 4

Table 1. Description of the variables extracted from the 1996 General Household Survey. Hoh denotes the Head of the Household. Numbers of adult males, females and children refer to the household.

The approach to model selection described in the previous section requires the variables to be discrete. Therefore, the first step of the analysis was to discretize the continuous variables into four bins of approximately equal proportions. Before this step, variables having a skewed distribution were transformed to a logarithmic scale. Many integer-valued variables, such as those indicating the number of adult males or females in the household, were appropriately recoded, and states observed with low frequency were grouped into a single state. We then chose the following order among the variables to limit the space of models to be explored:

Region $\succ$ Hoh_origin $\succ$ Hoh_gend $\succ$ Ad_fems $\succ$ Ad_males $\succ$ Hoh_age $\succ$ Hoh_status $\succ$ Children $\succ$ Tenure $\succ$ Hoh_reslen $\succ$ Accom $\succ$ Bedrms $\succ$ Ncars.

The choice was based on the following considerations. Geographic variables precede household variables and thus we are interested in conditioning on them first (e.g., see Thomas et al., 1998).
For some of the household demographic variables (e.g., Hoh_origin, Hoh_gend, Ad_fems, Ad_males, Hoh_age), we chose the particular ordering for convenience. These variables are commonly thought of as explaining household wealth, which is described by the variables Hoh_status, Children, Tenure, Hoh_reslen, Accom, Bedrms, Ncars, while dependencies in which the age of the household head is directly influenced by any of these variables do not seem to be
interesting. The remaining order was chosen in a similar way, on the basis of possible cause-effect relationships between the remaining variables.

We used this order to build 4 models, using the K2 algorithm, uniform prior probabilities on the possible networks, and symmetric Hyper-Dirichlet prior distributions for the model parameters. We chose four values for the global precision, $\alpha = 1, 5, 10, 20$, to evaluate the effect of changing the global prior precision on the model selected. The evaluation was carried out by comparing the network topologies and their different predictive capabilities. This last aspect was evaluated by computing the classification accuracy of the four networks. Full details of the analysis are in Sebastiani and Ramoni (2001) and led us to select the network learned with $\alpha = 5$. This network is depicted in Figure 1 and is described in the next section.

Figure 1. The Bayesian network selected from the data when the global prior precision $\alpha$ is 5.

4. Results and discussion

The network in Figure 1 shows important directed dependencies and conditional independencies. The dependency of the ethnic group of heads of the households on the variable Region reveals a more cosmopolitan society in England than in Wales and Scotland, with a larger proportion of Blacks and Indians as head of households. The variables describing the ethnic group of the head of the
household, the gender of the head of the household, and the number of adult females in the household separate Region from most of the variables describing household wealth. The working status of the head of the household (variable Hoh_status) is independent of the ethnic group given the gender and age of the head of the household. The estimated conditional probability table shows that when a young female is head of a household, she is much more likely to be inactive than a young male (40% compared to 6% when the age group is 17-36). This difference attenuates as the age of the head of the household increases. The conditional distribution quantifying the dependency of the gender of the head of the household on the ethnic group reveals that Blacks have the smallest probability of having a male head of the household (64%) while Indians have the largest probability (89%). The age of the head of the household depends directly on the number of adult males and females, and shows that households with no females and two or more males are more likely to be headed by a young male while, on the other hand, households with no males and two or more females are headed by a mid-age female. There appear to be more single households headed by an elder female than an elder male. Furthermore, the composition of the household changes across the ethnic groups: the most interesting fact is that Indians have the smallest probability of living in a household with no adult males (10%), while Blacks have the largest probability (3%).

The tenure status of the accommodation depends directly on the age, gender and status of the household head. On average, the largest proportion of British households is established in owned accommodations (75%), when the age of the head of the household is between 36 and 66 years. Younger heads of household have a larger chance of living in rented accommodations (0%), while senior heads of household have a larger chance of living in accommodations provided by the social service (3%). These figures, however, change dramatically when the gender of the head of the household is taken into account.
When the head of the household is a young female, the probability that the household is in an owned accommodation is 7%, against 65% when the household head is a young male. This probability rises up to 5% when the household head is an elder female compared to 69% for elder males. Households are more likely to be in an accommodation provided by the social service when the head is an inactive female rather than an inactive male.

The number of bedrooms is directly affected by the number of children in the household, the type of accommodation and its tenure status. Households with two or more children are more likely to be in three-bedroom flats or houses, but the accommodations provided by the social service are slightly smaller than those rented or owned by the head of the household. Houses are more likely to have a larger number of bedrooms than flats: the most likely number of bedrooms of an owned house is three, compared to one in a flat. Interestingly, flats provided by the social sector are more likely to be one-bed flats, while rented and owned flats are most likely to be two-bed flats. The length of residence is directly dependent on the age of the head of the household and the tenure status of the accommodation, and shows that the length of residence in rented accommodations or those provided by the social service is shorter than that in owned accommodations.
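Querying a Bayesian network, as introduced in Section 1, amounts to computing the conditional distribution of one variable given observed values of others. The following Python sketch illustrates the idea by brute-force enumeration over the joint factorization of a toy three-variable chain. The structure, the state names and every probability in it are hypothetical and are not estimates from the survey, and the programs mentioned in Section 1 rely on more efficient algorithms than full enumeration.

```python
from itertools import product

# Toy chain gender -> status -> tenure. All numbers are hypothetical.
states = {
    "gender": ["male", "female"],
    "status": ["active", "inactive"],
    "tenure": ["owned", "rented"],
}
parents = {"gender": (), "status": ("gender",), "tenure": ("status",)}
# cpt[variable][parent values][value] = conditional probability
cpt = {
    "gender": {(): {"male": 0.7, "female": 0.3}},
    "status": {
        ("male",): {"active": 0.9, "inactive": 0.1},
        ("female",): {"active": 0.6, "inactive": 0.4},
    },
    "tenure": {
        ("active",): {"owned": 0.8, "rented": 0.2},
        ("inactive",): {"owned": 0.5, "rented": 0.5},
    },
}


def joint_probability(assignment):
    """Joint probability as the product of the conditional table entries."""
    p = 1.0
    for var in states:
        parent_values = tuple(assignment[u] for u in parents[var])
        p *= cpt[var][parent_values][assignment[var]]
    return p


def query(target, evidence):
    """Conditional distribution of `target` given `evidence`, by enumeration."""
    names = list(states)
    scores = {value: 0.0 for value in states[target]}
    for combo in product(*(states[n] for n in names)):
        assignment = dict(zip(names, combo))
        # Keep only the configurations that agree with the observed values.
        if all(assignment[var] == val for var, val in evidence.items()):
            scores[assignment[target]] += joint_probability(assignment)
    total = sum(scores.values())
    return {value: s / total for value, s in scores.items()}


posterior = query("gender", {"tenure": "rented"})
print(posterior)
```

Enumeration visits every joint configuration, so its cost grows exponentially with the number of variables; it is used here only because it makes the definition of a query explicit.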
Figure 2. An example of query with the Bayesian network induced from the data.

By querying the network, one may investigate other undirected associations and discover that, for example, the typical Caucasian mid family with two children has a 77% chance of being headed by a male who, with probability .57, is aged between 36 and 50 years. The probability that the head of the household is active is .84, and the probability that the household is in an owned house is .66. Results of these queries are displayed in Figure 2. These figures are slightly different if the head of the household is, for example, Black. In this case, the probability that the head of the household is male (given that there are two children in the household) is only .6 and the probability that he is active is .79. If the head of the household is Indian, then the probability that he is male is .90 and the probability that he is active is .88. On average, the ethnic group changes slightly the probability of the household being in an accommodation provided by the social service (6% for Blacks, 3% for Chinese, 0% Indians and 4% Caucasians). Similarly, Black heads of household are more likely to be inactive than heads of household from different ethnic groups (16% Blacks, 10% Indians, 14% Caucasians and Chinese), and to be living in a less wealthy household, as shown by the larger probability of living in accommodations with a smaller number of bedrooms and of having a smaller number of cars.

Households headed by Blacks are less affluent than others, if the gender of the head of the household is not taken into account. However, the dependency structure shows that the gender of the head of the household and the number of adult females make all the other variables independent of the ethnic group. Thus, the model extracted suggests that differences in the household wealth are
more likely caused by the different household composition, and in particular by the gender of the head of the household, rather than racial issues. The robustness of many of these interpretations can be examined by careful alteration of the ordering of the variables and the structuring of the greedy search algorithm.

5. Conclusions

In this analysis, we focused on networks learned by using uniform model priors and sets of independent, symmetric Dirichlet distributions as prior distributions for the parameters of each model. The advantage of using these prior distributions is that they can be elicited simply by assigning the global prior precision, and this choice produces consistent model comparisons. However, symmetric Dirichlet distributions are known to be too invariant (see Forster and Smith, 1998), so that they model different dependency structures in the same way. One may wish to use a class of model parameter priors which encodes different prior information. An interesting challenge is to devise a class of prior distributions which maintains the consistency of model comparisons and the feasibility of computations, and provides the user with more modeling freedom.

The analysis here was carried out by discretizing all continuous variables, thus raising the issue of the effect of the discretization. We are currently working on the implementation of a more general learning algorithm, which selects networks from data sets with both continuous and discrete variables.

One further issue is related to the publication of the results found with the method described here. A Bayesian network is not just the directed acyclic graph displaying the dependency structure selected, conditional on the data. It is also a probability distribution and, as such, the best way to publish the results is to give the entire network, and to let users make their own queries. Given the increasing importance that the World Wide Web is assuming nowadays as a communication system, publication of the network over the WWW offers a simple way to display results without giving direct access to the original data, thus preserving data confidentiality.

7. Acknowledgements

This research was supported by Eurostat, under contract EP9105.
Material from the General Household Survey 1996 is Crown Copyright and has been made available by the Office for National Statistics through The Data Archive and has been used by permission. Neither the ONS nor The Data Archive bear any responsibility for the analysis or interpretation of the data reported here.
6. References

Castillo, E., Gutierrez, J. M., and Hadi, A. S. (1997), Expert Systems and Probabilistic Network Models, Springer, New York, NY.

Cooper, G. F., and Herskovits, E. (1992), A Bayesian method for the induction of probabilistic networks from data, Machine Learning, Vol. 9, pp. 309-347.

Cowell, R. G., Dawid, A. P., Lauritzen, S. L., and Spiegelhalter, D. J. (1999), Probabilistic Networks and Expert Systems, Springer, New York, NY.

Forster, J. J., and Smith, P. W. F. (1998), Model-based inference for categorical survey data subject to non-ignorable non-response (with discussion), Journal of the Royal Statistical Society, B, Vol. 60, pp. 57-70.

Heckerman, D., Geiger, D., and Chickering, D. M. (1995), Learning Bayesian networks: the combination of knowledge and statistical data, Machine Learning, Vol. 20, pp. 197-243.

Lauritzen, S. L. (1996), Graphical Models, Oxford University Press, Oxford, UK.

Sebastiani, P., and Ramoni, M. (2001), Data analysis with Bayesian networks, Under revision.

Sebastiani, P., Ramoni, M., and Crea, A. (2000), Profiling customers from in-house data, ACM SIGKDD Explorations, Vol. 1, pp. 91-96.

Thomas, M., Walker, A., Wilmot, A., and Bennet, N. (1998), Living in Britain: Results from the 1996 General Household Survey, The Stationery Office, London, UK.