A Graph-based Proactive Fault Identification Approach in Computer Networks




Yijiao Yu, Qin Liu and Liansheng Tan*
Department of Computer Science, Central China Normal University, Wuhan 430079, PR China
E-mail: yjyu, liuqin, l.tan@mail.ccnu.edu.cn

Abstract. In large-scale computer networks, isolating the primary failure source is a challenging task. This paper presents a proactive network fault diagnosis approach based on graph theory. Unlike other approaches, the manager of the network management system actively checks the status of the managed devices rather than passively receiving messages from them. The salient feature of this approach is that the possible failure sources, including the real one, can be computed precisely and quickly without any historical alarm information or strict assumptions. Because it relies on matrix and Boolean operations, the approach adds little processing complexity. To test and evaluate the proposed algorithm, we implemented it in Java and tested it in a real large network environment. The experimental results show that the approach is both efficient and scalable for fault identification in large-scale computer networks.

Keywords: connectivity; connectivity structure; graph theory; network management; fault identification; Simple Network Management Protocol; Internet Control Message Protocol.

I. INTRODUCTION

With the emergence of various network applications, users demand better quality of service (QoS). One of the most crucial issues is maintaining network availability and reliability. Unfortunately, current networks are not robust enough to meet these requirements when exceptional events occur, for example when power fails in a network center or fibers break because of mechanical failure. Because networks are complex, network operators alone cannot easily deal with network faults in a short time. Furthermore, the Internet is growing so quickly that the difficulty of network fault management is increasing sharply. To shorten the fault response time and relieve network operators, automated fault management is a desirable goal of Network Management System (NMS) implementation for large-scale networks.
The process of network fault management is divided into three stages: alarm correlation, fault identification and repair. In the first stage, the manager should detect and know the status of all the managed objects. When exceptions happen in the network, the manager should report and fix them as soon as possible. Fault identification has two key issues: where is the first failure source, and how many kinds of possible faults can lead to the observed fault results? The goal of an automated fault identification algorithm is to resolve these issues without human intervention.

Many fault identification algorithms have been proposed in recent research, such as coding-based schemes [1]-[2], proactive networks [3], Petri nets [4]-[5], alarm correlation [6]-[7], expert systems and artificial neural networks [8]-[9], active networks and mobile agents [10]-[12] and network dependencies [13]. They work well in some special network environments, but their generality could be improved. Expert systems and artificial neural networks need a great deal of historical alarm records, and their reasoning process is time-consuming; experience from artificial intelligence shows that the reasoning results are hard to guarantee. Active networks and mobile agents are future network technologies, but security is the main obstacle to their application in current networks, and most routers and switches do not support mobile code at present. Alarm correlation and other methods rest on the premise that the network manager can receive trap messages from faulty agents. In fact, if some links are broken or devices are shut down, traps cannot be sent or transmitted to the manager, so those approaches do not work well when this basic premise is not satisfied. Moreover, some methods are sensitive to the network topology, yet computer network environments change frequently.

* Corresponding author. E-mail address: l.tan@mail.ccnu.edu.cn. Post address: Department of Computer Science, Central China Normal University, Wuhan, 430079, PR China.
It is necessary to relax strict assumptions, improve the accuracy of fault identification and speed up the decision process. We proposed a proactive fault identification approach based on graph theory in [14]; this paper focuses on its implementation. Our major concerns are summarized in four questions:
1. How does the manager acquire fault information as soon as possible?
2. Where are the possible failure sources located, and how can they be identified?
3. How many faults can cause the observed fault indications, and which one is the most likely?
4. Given the status of specific fault elements, how can the resulting fault indications be derived and checked for consistency with the observed ones?

The rest of this paper is organized as follows. Network modeling is presented in Section II. Section III describes the network fault identification approach in detail. Section IV simulates four classical network fault cases. The automated fault indication analysis algorithm is presented in Section V. The approach has been implemented in Java and tested in a real large network environment; its performance is reported in Section VI. Finally, some conclusions and possible evolutions of the approach are given.

II. NETWORK MODELING

A computer network is such a large and complicated system, comprising both hardware and software, that NMSs are continually developed to satisfy user requirements. In recent years, self-management and self-organization of large-scale networks have become increasingly necessary [16]-[18]. From the network service provider's point of view, hardware maintenance is a vital task. A common NMS focuses on switching elements, channels and some key servers, such as WWW and FTP servers. Both switching devices and servers have at least one IP address or domain name, and they support many network management protocols. The manager running in the NMS can communicate with these managed objects in different ways, for example by sending an Internet Control Message Protocol (ICMP) request or a Simple Network Management Protocol (SNMP) request to a specified destination. Each managed device has at least one unique IP address, and the manager cares about whether it is active. In our view, network fault management can be treated as a connectivity problem, so these devices correspond to vertices in a graph. Amplifiers and repeaters are placed along channels to keep the signal from attenuating. These elements are not considered, because they do not support network management protocols and cannot be discovered by the manager.
Physical channels in networks are diverse: optical fiber, cable and wireless links are commonly used. Since each channel is a medium joining a pair of managed devices, channels can be represented as edges in a graph. Almost all networks are full duplex in terms of connectivity, so it is natural to treat edges as bidirectional and the network graph as undirected.

SNMP and ICMP are two elementary and popular network management protocols. With the ping command, the manager can test the connectivity and active status of every vertex in the network. However, ping provides no information about the path from the manager to the node, and even the traceroute command yields only one path between a given pair of vertices. We assume that a software fault occurs when the destination vertex responds normally to ping but abnormally to the SNMP Get primitive, and that a fatal fault occurs when the ping command times out.

The manager has two ways to detect the status of a vertex. The first is proactive: the manager automatically sends special packets to the managed vertex and waits for acknowledgments. The second is simply to wait for traps from the agents in the managed devices. However, traps are not forwarded to the manager when serious faults happen. Furthermore, an agent sends a trap only when one of seven special events occurs, and most trap messages describe only the isolated state of a single node. Hence, the manager should check the status of the managed vertices proactively.

When links are broken, the switching devices connected to them still exist, but the pair of vertices can no longer communicate through that link. When switches are abnormal, the channels incident with them become unavailable, but the devices still exist in the network. Because most switching devices have multiple interfaces, switch faults can be subdivided into global faults and local faults: a global fault means that all interfaces are abnormal, and a local fault means that only some of them fail. In this paper, a global fault maps to removing a vertex, and a local fault maps to removing edges. The removal of a vertex and the removal of an edge are therefore the elementary operations on the matrices that represent the computer network.
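The mapping from switch faults to graph operations can be sketched as follows. This is a minimal illustration, not code from the paper: it assumes a hypothetical representation in which each interface of a switch carries exactly one edge index, and it reports a vertex removal when every interface has failed (global fault) and one edge removal per failed interface otherwise (local fault). The class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical mapping of a switch fault onto the graph removal operations. */
final class SwitchFaultMapper {

    /** A removal of a whole vertex (global fault) or of one edge (local fault). */
    static final class Removal {
        final boolean vertex;
        final int index;

        Removal(boolean vertex, int index) {
            this.vertex = vertex;
            this.index = index;
        }

        @Override
        public String toString() {
            return (vertex ? "remove vertex v" : "remove edge e") + index;
        }
    }

    /**
     * @param switchVertex     index of the switch in the graph
     * @param interfaceToEdge  interfaceToEdge[k] is the edge attached to interface k
     * @param failedInterfaces indices of the interfaces observed as failed
     */
    static List<Removal> map(int switchVertex, int[] interfaceToEdge, Set<Integer> failedInterfaces) {
        List<Removal> ops = new ArrayList<>();
        if (failedInterfaces.isEmpty()) {
            return ops; // the switch is healthy
        }
        if (failedInterfaces.size() == interfaceToEdge.length) {
            ops.add(new Removal(true, switchVertex)); // global fault: remove the vertex
            return ops;
        }
        for (int k = 0; k < interfaceToEdge.length; k++) { // local fault: remove edges
            if (failedInterfaces.contains(k)) {
                ops.add(new Removal(false, interfaceToEdge[k]));
            }
        }
        return ops;
    }
}
```

For a hypothetical switch v4 whose three interfaces carry e3, e6 and e7, `map(4, new int[]{3, 6, 7}, Set.of(0, 1, 2))` returns `[remove vertex v4]`, whereas `Set.of(1)` returns `[remove edge e6]`.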
The removal operations are defined as follows; the detailed matrix operations are presented in Section III.
Def. 1. The collection of all managed vertices in the graph is the managed vertices set (MVS).
Def. 2. The removal of a vertex erases all edges incident with the vertex, but the vertex itself still exists.
Def. 3. The removal of an edge simply erases the edge.

There are many indicators of network faults, such as no acknowledgment from the destination vertex, a high packet loss ratio and a large transmission delay. From the viewpoint of graph theory, the obvious and direct consequence of a serious network fault is that the whole network splits from one connected component into multiple connected components. Connectivity and the reachability matrix are basic notions and tools of graph theory, and we bring them into network fault management. By connectivity analysis, the vertex on which the NMS manager runs lies in exactly one connected component of the graph, so vertices in other connected components are not accessible to the manager.

As this analysis shows, the connectivity structure of the managed network is necessary input for our graph-based approach. Dynamic, real-time discovery of network topology is itself a complicated research topic, and we do not address how to obtain it in detail. A basic and important assumption of our approach is that the fault identification algorithm can obtain the physical connectivity structure from other modules of the NMS, such as the configuration module. In a business network management environment, the network operator knows clearly how many devices and links have been installed and where they are. When the NMS starts, the static connectivity information is imported into the NMS through the configuration module. Furthermore, most configuration modules have an auto-topology discovery function that can provide dynamic connectivity information.

III. THE FAULT IDENTIFICATION APPROACH

The graph-based approach mainly analyzes the connectivity between pairs of vertices, where one vertex of every pair is always the node running the NMS. Before introducing the approach, some necessary definitions are given.
Def. 4. A vertex is reachable only when the manager communicates with it successfully within a defined number of attempts; otherwise it is unreachable.
Def. 5. The set of all reachable vertices in the network is the reachable vertices set (RVS). Correspondingly, the set of unreachable vertices is the unreachable vertices set (UVS).
Def. 6. Suppose that G = <V, E> is an undirected simple graph where |V| = n and v1, v2, ..., vn ∈ V.
The adjacency matrix A of G, with respect to this listing of the vertices, is the n × n zero-one matrix with 1 as its (i, j)th entry when vi and vj are adjacent, and 0 as its (i, j)th entry when they are not. In other words, if its adjacency matrix is A = [a_ij], then
$$a_{ij} = \begin{cases} 1 & \text{if } \{v_i, v_j\} \text{ is an edge of } G,\\ 0 & \text{otherwise.} \end{cases}$$
Def. 7. Suppose that G = <V, E> is an undirected simple graph where |V| = n, |E| = m, v1, v2, ..., vn ∈ V and e1, e2, ..., em ∈ E. The incidence matrix M of G, with respect to this listing of the vertices and edges, is the n × m zero-one matrix with 1 as its (i, j)th entry when edge ej is incident with vi, and 0 otherwise. In other words, if its incidence matrix is M = [m_ij], then
$$m_{ij} = \begin{cases} 1 & \text{when edge } e_j \text{ is incident with } v_i,\\ 0 & \text{otherwise.} \end{cases}$$

[Figure 1. The topology of a network.]

Figure 1 is an undirected graph of a network with vertices v0, ..., v9 and edges e0, ..., e13.
[Adjacency matrix A of Figure 1: rows and columns v0, ..., v9.]
[Incidence matrix I of Figure 1: rows v0, ..., v9; columns e0, ..., e13.]
The adjacency matrix and incidence matrix are used frequently in Section IV.
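To make the two definitions and the removal operations of Def. 2 and Def. 3 concrete, the following minimal Java sketch builds both matrices from an edge list. The 10-vertex, 14-edge topology in the code is hypothetical: it is chosen only so that e3 and e6 join v4 to the rest of the network, e7 joins v4 and v5, and e8 and e9 are the only edges of v7, as in the fault cases discussed later. It is not the exact graph of Figure 1.

```java
/** Hypothetical network model: adjacency and incidence matrices built from an edge list. */
final class NetworkMatrices {
    // Hypothetical topology: EDGES[j] = {u, w} means edge e_j joins v_u and v_w.
    static final int[][] EDGES = {
        {0, 1}, {0, 2}, {1, 2}, {1, 4}, {2, 3}, {3, 6}, {6, 4},
        {4, 5}, {7, 8}, {7, 6}, {8, 9}, {2, 8}, {9, 0}, {3, 8}
    };
    static final int N = 10; // vertices v0..v9; v0 runs the manager

    final int[][] adjacency = new int[N][N];
    final int[][] incidence = new int[N][EDGES.length];

    NetworkMatrices() {
        for (int j = 0; j < EDGES.length; j++) {
            int u = EDGES[j][0], w = EDGES[j][1];
            adjacency[u][w] = 1;
            adjacency[w][u] = 1; // undirected graph: A is symmetric
            incidence[u][j] = 1;
            incidence[w][j] = 1;
        }
    }

    /** Def. 3: erase edge e_j from both matrices. */
    void removeEdge(int j) {
        int u = EDGES[j][0], w = EDGES[j][1];
        incidence[u][j] = 0;
        incidence[w][j] = 0;
        adjacency[u][w] = 0; // simple graph: at most one edge per vertex pair
        adjacency[w][u] = 0;
    }

    /** Def. 2: erase every edge incident with v_i; the vertex itself stays. */
    void removeVertex(int i) {
        for (int j = 0; j < EDGES.length; j++) {
            if (incidence[i][j] == 1) {
                removeEdge(j);
            }
        }
    }

    /** Degree of v_i = number of ones in its row of the incidence matrix. */
    int degree(int i) {
        int d = 0;
        for (int j = 0; j < EDGES.length; j++) {
            d += incidence[i][j];
        }
        return d;
    }
}
```

Removing v7 zeroes row and column 7 of A and the columns of e8 and e9 in the incidence matrix, while v7 itself stays in both matrices, as Def. 2 requires.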

Def. 8. Suppose that G = <V, E> is an undirected simple graph where |V| = n and v1, v2, ..., vn ∈ V. The reachability matrix R of G, with respect to this listing of the vertices, is the n × n zero-one matrix with 1 as its (i, j)th entry when vi can reach vj, and 0 otherwise. In other words, if its reachability matrix is R = [r_ij], then
$$r_{ij} = \begin{cases} 1 & \text{if } v_i \text{ can reach } v_j,\\ 0 & \text{otherwise.} \end{cases}$$
All the elements of the reachability matrix for Figure 1 are one when there is no fault in the computer network.
Def. 9. An edge is a possible fault edge (PFE) if one end vertex of the edge is in RVS and the other end vertex is in UVS.
Def. 10. The collection of all possible fault edges is the possible fault edges set (PFES).

[Figure 2. A fault case of the computer network.]

Figure 2 shows a case of network failure, in which a cross marks a fault. Throughout this paper, we suppose that the manager process always runs on v0. The current status of all vertices is represented by a status vector, denoted SV. Every element of SV is either one or zero: when SV[i] (0 ≤ i ≤ n − 1) equals 1, vi can be accessed by v0. With the zero-one one-dimensional array SV, both RVS and UVS are represented in the algorithm. Because of the edge faults, Figure 2 clearly has two connected components. The SV of Figure 2 (entries v0, ..., v9) is
SV = [1 1 1 1 0 0 1 1 1 1]
According to Def. 9, e3 and e6 are members of PFES. Like SV, PFES is realized as a zero-one vector, in which 0 marks a possible fault edge. Hence the PFES of Figure 2 (entries e0, ..., e13) is
PFES = [1 1 1 0 1 1 0 1 1 1 1 1 1 1]
Def. 11. A vertex in UVS is a possible fault vertex (PFV) when one of its incident edges is an element of PFES. In Figure 2, the unreachable vertex v4 is a PFV because e3 and e6 are members of PFES.
Def. 12. The collection of all possible fault vertices is the possible fault vertices set (PFVS).
The PFVS of Figure 2 (entries v0, ..., v9, with 0 marking a PFV) is
PFVS = [1 1 1 1 0 1 1 1 1 1]
Comparing PFVS with SV, it is easy to see that the number of PFVs is no more than the number of unreachable vertices. With these definitions, the novel fault identification approach can be described informally.
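The status vector of Figure 2 can be reproduced by simulation: the sketch below runs a breadth-first search from the manager vertex v0 over the edges that are not broken, which yields row v0 of the reachability matrix. It is an illustration under assumptions: the 14-edge topology is hypothetical (chosen so that cutting e3 and e6 isolates v4 and v5, as in Figure 2), and in the real system SV comes from measurement rather than from a model.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

/** Hypothetical simulation: which vertices can the manager v0 still reach? */
final class ManagerReachability {
    // Hypothetical topology: EDGES[j] = {u, w} means edge e_j joins v_u and v_w.
    static final int[][] EDGES = {
        {0, 1}, {0, 2}, {1, 2}, {1, 4}, {2, 3}, {3, 6}, {6, 4},
        {4, 5}, {7, 8}, {7, 6}, {8, 9}, {2, 8}, {9, 0}, {3, 8}
    };
    static final int N = 10;
    static final int MANAGER = 0;

    /**
     * Breadth-first search from the manager that skips broken edges and never
     * enters failed vertices. Returns SV with 1 = reachable, 0 = unreachable.
     */
    static int[] statusVector(boolean[] brokenEdge, boolean[] failedVertex) {
        int[] sv = new int[N];
        Deque<Integer> queue = new ArrayDeque<>();
        sv[MANAGER] = 1;
        queue.add(MANAGER);
        while (!queue.isEmpty()) {
            int u = queue.poll();
            for (int j = 0; j < EDGES.length; j++) {
                int a = EDGES[j][0], b = EDGES[j][1];
                if (brokenEdge[j] || (a != u && b != u)) continue; // not a usable edge of u
                int w = (a == u) ? b : a;
                if (!failedVertex[w] && sv[w] == 0) {
                    sv[w] = 1;
                    queue.add(w);
                }
            }
        }
        return sv;
    }

    public static void main(String[] args) {
        boolean[] broken = new boolean[EDGES.length];
        broken[3] = true; // Figure 2-like fault: e3 and e6 are cut
        broken[6] = true;
        System.out.println("SV = " + Arrays.toString(statusVector(broken, new boolean[N])));
    }
}
```

With e3 and e6 cut, `statusVector` returns [1, 1, 1, 1, 0, 0, 1, 1, 1, 1], the SV of Figure 2; marking v7 as failed instead leaves every vertex except v7 reachable.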
In general, the approach is subdivided into nine steps, as described in Figure 3. Several theorems about the correctness of this approach are proved in reference [15]. In this paper we focus on the detailed algorithms that realize every step.

Step1: RVS is set to null and UVS is set to MVS;
Step2: Both PFES and PFVS are set to null;
Step3: The manager proactively tests the status of all managed objects, and all vertices are divided into RVS and UVS according to the test results;
Step4: Scan the incidence matrix and compute the PFES of the network;
Step5: Scan PFES and UVS and compute PFVS;
Step6: Identify possible fault locations;
Step7: Reason about possible faults;
Step8: Repair the fault and test the effects;
Step9: Decide whether the algorithm should stop or go back to Step3.
Figure 3. The approach based on graph theory.

The fault identification process is executed on three occasions. First, the network manager executes it periodically. Second, it is called when the network status becomes worse. Third, a network operator starts it through the human-machine interface of the NMS. When the NMS starts, the fault identification module obtains the physical connectivity structure of the managed network from the configuration module, and from it creates the adjacency matrix and incidence matrix of the managed network. The NMS usually runs for a very long time; if the connectivity structure changes, the configuration module sends the latest structure to the fault identification module to refresh the two matrices.

A basic question of our approach is how the manager measures the current status of every vertex. Because SNMP and ICMP are so widely deployed, they were selected to measure vertex status in our experiments; Get and ping are, respectively, the frequently used primitive and command of these protocols. The ping command is executed only to test whether a node is active. The active status of all managed nodes is vital to our approach. If the test communication between the manager node and the managed node receives the expected reply in time, the current status of the node is regarded as "reachable". Because the network may be unstable and the load imposed by network management should be kept as low as possible, the manager should check the status of a managed vertex several times: if the previous checks failed, it tries again until either the number of checks exceeds the given limit or it receives a valid acknowledgment. If the manager gets an acknowledgment in time, the node is marked as active and its entry in SV is set to 1. Otherwise the measured vertex is marked as unreachable and its entry in SV is set to 0.

Keeping the network element status information up to date is an engineering problem. For every measurement, the manager unavoidably spends some time waiting for the acknowledgment. If the measurements are executed in succession, the total waiting time is the sum of the waiting times of all tests. In a large-scale network this cannot be ignored, because the number of managed nodes is so large, and the transmission delay of very long-distance connections aggravates the situation. A new problem then emerges: the status of the first node may have changed by the time the last one is measured. To avoid long measurement times and keep the status table up to date, parallel measurement is recommended in practice. Both C++ and Java support multithreading, and it is widely employed. Every thread consumes computer resources, and both creating and destroying a thread take some time. To reduce this overhead, a thread pool is used in our emulation system. Each measurement task runs on its own thread, so tasks execute concurrently: a task obtains a thread instance from the pool before it starts and returns the instance after the measurement. Experimental results show that the thread pool saves the time of creating threads during algorithm execution.

The next step is to compute the possible fault edges. This can be done precisely with the algorithm illustrated in Figure 4.
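A minimal sketch of this measurement stage follows, assuming the probe is supplied as a function so that the real ICMP or SNMP call can be plugged in. Each vertex is probed on a fixed-size pool from `java.util.concurrent`, each probe is retried up to a hypothetical `maxTries` limit, and the results fill SV. The `icmpProbe` helper uses `InetAddress.isReachable`, which may fall back to a TCP echo attempt when the JVM lacks the privilege to send ICMP; the pool size, retry count and timeout are illustrative parameters, not values from the paper.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntPredicate;

/** Hypothetical parallel status measurement that fills the status vector SV. */
final class ParallelProber {

    /** Probe vertex i up to maxTries times; it is reachable as soon as one reply arrives. */
    static boolean probeWithRetries(IntPredicate probe, int i, int maxTries) {
        for (int t = 0; t < maxTries; t++) {
            if (probe.test(i)) return true;
        }
        return false;
    }

    /** Measure all n vertices concurrently on a reusable thread pool and return SV. */
    static int[] measure(int n, IntPredicate probe, int maxTries, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                final int vertex = i;
                results.add(pool.submit(() -> probeWithRetries(probe, vertex, maxTries)));
            }
            int[] sv = new int[n];
            for (int i = 0; i < n; i++) {
                sv[i] = results.get(i).get() ? 1 : 0; // 1 = reachable, 0 = unreachable
            }
            return sv;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("measurement interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("probe failed", e.getCause());
        } finally {
            pool.shutdown(); // threads are released once all tasks have finished
        }
    }

    /** Example probe for real hosts: ICMP echo if permitted, otherwise a TCP echo attempt. */
    static IntPredicate icmpProbe(String[] hosts, int timeoutMs) {
        return i -> {
            try {
                return InetAddress.getByName(hosts[i]).isReachable(timeoutMs);
            } catch (IOException e) {
                return false; // unknown host or I/O error counts as no reply
            }
        };
    }
}
```

In a test, a stub such as `i -> i != 4 && i != 5` can stand in for the network, and `measure(10, stub, 3, 4)` then returns the Figure 2 status vector [1, 1, 1, 1, 0, 0, 1, 1, 1, 1]. A fixed pool bounds the number of concurrent probes, which keeps the measurement load on the network predictable.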
Most of the elementary operations in this algorithm query the status vector, denoted SV in the code, and the incidence matrix. The results for a given input are deterministic, and the computing speed is acceptable in practice.

    // Figure 4: compute PFES (0 marks a possible fault edge).
    void identifyPFES() {
        int[] v = new int[2];                 // the two end vertices of edge e_i
        for (int i = 0; i < numOfEdges; i++) {
            int k = 0;
            for (int j = 0; j < numOfVertices; j++) {
                if (incidenceMatrix[j][i] == 1) {
                    v[k] = j;
                    k++;
                }
            }
            // An edge whose ends have different status joins RVS and UVS (Def. 9).
            if (SV[v[0]] != SV[v[1]]) {
                PFES[i] = 0;
            } else {
                PFES[i] = 1;
            }
        }
    }
Figure 4. Algorithm of computing PFES.

Why do we define PFES, and what is it used for? An important goal is to identify the fault sources as soon as possible. For instance, when the network state is as in Figure 2, v4 and v5 are inaccessible and three edges are incident with v4 and v5; why do only e3 and e6 belong to PFES? From the manager's point of view, the connectivity between v4 and v5 is unknown. If the fault comes from edges, the manager can affirm that v4 and v5 are unreachable if and only if e3 and e6 are both broken. On the other hand, without adequate evidence, it is impossible to judge whether e7 is normal or not.

In Step5, PFVS is computed from PFES; the detailed algorithm is listed in Figure 5. As with possible fault edges, there are fewer possible fault vertices than inaccessible vertices. The reason is that if v4 shuts down, both v4 and v5 become unreachable, whereas when v5 fails, v4 can still run well. For example, a faulty hub in an office does not mean that all the computers in the office are shut down. All possible fault edges and vertices are called possible fault elements in this paper.

    // Figure 5: compute PFVS (0 marks a possible fault vertex); PFVS starts as all ones.
    void identifyPFVS() {
        for (int i = 0; i < numOfEdges; i++) {
            if (PFES[i] == 0) {
                // Find the unreachable vertex incident with this PFE.
                for (int j = 0; j < numOfVertices; j++) {
                    if (incidenceMatrix[j][i] == 1 && SV[j] == 0) {
                        PFVS[j] = 0;
                    }
                }
            }
        }
    }
Figure 5. Algorithm of computing PFVS.

As the example of Figure 2 shows, PFES and PFVS help to isolate the possible fault elements and cut down the time spent reasoning about fault sources. This feature is discussed in the remaining sections.
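The two figures can be exercised end to end with the self-contained sketch below. It keeps the encoding used in the text (0 marks a possible fault element) but, as a design variant rather than the paper's code, extracts the two end vertices of every edge from the incidence matrix once, so that computing PFES and PFVS each becomes a single pass over the edges instead of a scan of all n × m entries. The edge list in `main` is a hypothetical topology in which e3 and e6 join v4 to the reachable part and e7 joins v4 and v5, as in Figure 2.

```java
import java.util.Arrays;

/** Self-contained variant of Figures 4 and 5 (0 marks a possible fault element). */
final class PossibleFaultElements {

    /** ends[j] = {u, w}: the two rows holding a 1 in column j of the incidence matrix. */
    static int[][] endpoints(int[][] incidence) {
        int n = incidence.length, m = incidence[0].length;
        int[][] ends = new int[m][2];
        for (int j = 0; j < m; j++) {
            int k = 0;
            for (int i = 0; i < n && k < 2; i++) {
                if (incidence[i][j] == 1) ends[j][k++] = i;
            }
        }
        return ends;
    }

    /** Figure 4: e_j is a PFE when its two end vertices have different status. */
    static int[] computePfes(int[][] ends, int[] sv) {
        int[] pfes = new int[ends.length];
        for (int j = 0; j < ends.length; j++) {
            pfes[j] = (sv[ends[j][0]] != sv[ends[j][1]]) ? 0 : 1;
        }
        return pfes;
    }

    /** Figure 5: the unreachable end of every PFE is a PFV. */
    static int[] computePfvs(int[][] ends, int[] sv, int[] pfes) {
        int[] pfvs = new int[sv.length];
        Arrays.fill(pfvs, 1);
        for (int j = 0; j < ends.length; j++) {
            if (pfes[j] != 0) continue;
            for (int v : ends[j]) {
                if (sv[v] == 0) pfvs[v] = 0;
            }
        }
        return pfvs;
    }

    public static void main(String[] args) {
        // Hypothetical topology: edge e_j joins edges[j][0] and edges[j][1].
        int[][] edges = {
            {0, 1}, {0, 2}, {1, 2}, {1, 4}, {2, 3}, {3, 6}, {6, 4},
            {4, 5}, {7, 8}, {7, 6}, {8, 9}, {2, 8}, {9, 0}, {3, 8}
        };
        int[][] incidence = new int[10][edges.length];
        for (int j = 0; j < edges.length; j++) {
            incidence[edges[j][0]][j] = 1;
            incidence[edges[j][1]][j] = 1;
        }
        int[] sv = {1, 1, 1, 1, 0, 0, 1, 1, 1, 1}; // v4 and v5 unreachable
        int[][] ends = endpoints(incidence);
        int[] pfes = computePfes(ends, sv);
        System.out.println("PFES = " + Arrays.toString(pfes));
        System.out.println("PFVS = " + Arrays.toString(computePfvs(ends, sv, pfes)));
    }
}
```

For the Figure 2 status vector, `computePfes` yields zeros exactly at e3 and e6, and `computePfvs` yields a zero exactly at v4, matching the PFES and PFVS of Figure 2.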
Because Step6 is rather complicated, it is discussed separately in Section V.
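As a rough illustration of what Step6 must decide, the decision can be sketched by brute force: every on/off combination of the k possible fault elements (2^k in total) is applied to the graph, the manager's reachability is recomputed by breadth-first search, and a combination is kept when the predicted status vector equals the measured SV. This is an illustrative sketch, not the algorithm of Section V. The 14-edge topology in the code is hypothetical, chosen so that e3 and e6 join v4 to the rest of the network, e7 joins v4 and v5, and e8 and e9 are the only edges of v7.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Brute-force sketch: which combinations of possible fault elements explain SV? */
final class ConsistencyCheck {
    // Hypothetical topology: EDGES[j] = {u, w} means edge e_j joins v_u and v_w.
    static final int[][] EDGES = {
        {0, 1}, {0, 2}, {1, 2}, {1, 4}, {2, 3}, {3, 6}, {6, 4},
        {4, 5}, {7, 8}, {7, 6}, {8, 9}, {2, 8}, {9, 0}, {3, 8}
    };
    static final int N = 10; // v0 runs the manager

    /** Predicted SV: breadth-first search from v0 that skips broken edges and failed vertices. */
    static int[] predict(boolean[] brokenEdge, boolean[] failedVertex) {
        int[] sv = new int[N];
        Deque<Integer> queue = new ArrayDeque<>();
        sv[0] = 1;
        queue.add(0);
        while (!queue.isEmpty()) {
            int u = queue.poll();
            for (int j = 0; j < EDGES.length; j++) {
                int a = EDGES[j][0], b = EDGES[j][1];
                if (brokenEdge[j] || (a != u && b != u)) continue;
                int w = (a == u) ? b : a;
                if (!failedVertex[w] && sv[w] == 0) {
                    sv[w] = 1;
                    queue.add(w);
                }
            }
        }
        return sv;
    }

    /**
     * Tries all 2^k combinations. Bit b of a mask set means suspect b is faulty
     * (suspect vertices first, then suspect edges). Returns the consistent masks.
     */
    static List<Integer> consistent(int[] observedSv, int[] suspectVertices, int[] suspectEdges) {
        int nv = suspectVertices.length;
        int k = nv + suspectEdges.length;
        List<Integer> result = new ArrayList<>();
        for (int mask = 0; mask < (1 << k); mask++) {
            boolean[] failed = new boolean[N];
            boolean[] broken = new boolean[EDGES.length];
            for (int b = 0; b < k; b++) {
                if ((mask & (1 << b)) == 0) continue;
                if (b < nv) failed[suspectVertices[b]] = true;
                else broken[suspectEdges[b - nv]] = true;
            }
            if (Arrays.equals(predict(broken, failed), observedSv)) result.add(mask);
        }
        return result;
    }

    public static void main(String[] args) {
        // Figure 2-like observation: v4 and v5 unreachable; suspects v4, e3, e6.
        int[] sv = {1, 1, 1, 1, 0, 0, 1, 1, 1, 1};
        System.out.println(consistent(sv, new int[]{4}, new int[]{3, 6}));
    }
}
```

For the Figure 2 observation, `consistent` returns [1, 3, 5, 6, 7]: the four masks in which v4 is faulty plus mask 6, in which e3 and e6 are both broken. The same call for the first fault case of Section IV (suspects v7, e8, e9) also yields five consistent combinations. Enumeration is exponential in k, which is tolerable only because PFES and PFVS keep the number of suspects small.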

IV. FAULT CASE STUDIES

Although fault cases differ across network management environments, in our view they can be classified into four types. The fault cases are closely related to two notions of graph theory: the number of connected components and the degree of a vertex. Different fault cases call for different identification procedures, which are illustrated as follows.

A. An Inaccessible Vertex Whose Degree is Larger Than 1

In Figure 6, the vertex v7, whose degree is 2, is inaccessible. Routers and switches connect different sub-networks, so more than one fiber or cable is connected to them. This fault case therefore represents a single inaccessible switch or router in a computer network.

[Figure 6. The first fault case.]

Step3: SV = [1 1 1 1 1 1 1 0 1 1].
Step4: There is only one element in UVS. Scanning the row of v7 in the incidence matrix shows that e8 and e9 are possible fault edges:
PFES = [1 1 1 1 1 1 1 1 0 0 1 1 1 1].
Step5: Scanning the incidence matrix gives PFVS = [1 1 1 1 1 1 1 0 1 1].
Step6: There are two elements in PFES and one element in PFVS, so there are three possible fault elements and eight combinations of them in total.

TABLE I. FAULT ELEMENT COMBINATIONS AND EFFECTS
v7  e8  e9  consistency
0   0   0   T
0   0   1   T
0   1   0   T
0   1   1   T
1   0   0   T
1   0   1   F
1   1   0   F
1   1   1   F

Table I shows the three possible fault elements, their combinations and their effects. In the first three columns, 1 means that the network element is normal and 0 that it is abnormal. For the combination given in the first three columns, the value of "consistency" is T (true) if the node is isolated according to connectivity analysis in graph theory, and F (false) if the node is reachable from the manager node according to connectivity analysis. For example, the second row of Table I represents the combination in which v7 and e8 are both broken while e9 is normal. Obviously, v7 is unreachable from the viewpoint of connectivity analysis, and the reachability test by the NMS also shows that v7 is unreachable.
The connectivity analysis is therefore consistent with the engineering test, and the value of consistency is set to T. An example of F is the following. When only one of the two edges is broken, for example e8, v7 is still reachable through the other edge in connectivity analysis. However, v7 is not reachable in the test, so the two results are inconsistent and the value is F. Only the combinations of fault elements whose consistency column equals T are possible fault reasons and need further consideration; for each such row, the values of the first three columns are interpreted as a possible fault source. Because the computation of consistency is rather complicated, its detailed procedure is introduced in Sections V and VI; in this section, we use the results directly.

Step7: The fourth column of Table I shows that five possible fault reasons can result in the unreachable v7. The current task is to output all the possible reasons and judge which one is the most likely. In our experiment, Bayesian decision is used to sort the possible fault reasons, which requires the fault probabilities of vertices and edges. In practice, the failure probability of switch vertices is not easy to define, because there is no unified failure probability for all switches and links across different areas and environments. Even in a shared environment, the probabilities of different devices are not equal. Accurate probabilities can be derived from historical fault statistics. For these reasons, we do not focus on defining the probabilities accurately; we are more interested in the method of computing and sorting fault probabilities. The exact probability of a device does not affect our fault identification approach. The assumptions about fault probabilities are:
a) the probability of a fault event in a switch is Ps;
b) the probability of a fault event in a link is Pc;
c) Pc < Ps;
d) both Ps and Pc are less than 0.5.
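Under assumptions a) to d), the ranking of Step7 can be sketched as follows for the first fault case: each consistent combination is scored by multiplying Ps for a failed switch, Pc for a broken link, and the complements for normal elements (assuming independent failures), and the reasons are sorted by decreasing score. The values Ps = 0.05 and Pc = 0.01 are hypothetical and merely satisfy c) and d).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical ranking of consistent fault combinations by their probability. */
final class FaultRanking {

    /**
     * Probability of one combination, assuming independent element failures.
     * isSwitch[b]: element b is a switch (uses ps) rather than a link (uses pc).
     * faulty[b]:   element b is abnormal in this combination.
     */
    static double probability(boolean[] isSwitch, boolean[] faulty, double ps, double pc) {
        double p = 1.0;
        for (int b = 0; b < isSwitch.length; b++) {
            double q = isSwitch[b] ? ps : pc;
            p *= faulty[b] ? q : 1.0 - q;
        }
        return p;
    }

    /** Ranks the five consistent rows of Table I (elements v7, e8, e9), most likely first. */
    static List<String> rankFirstCase(double ps, double pc) {
        boolean[] isSwitch = {true, false, false};
        boolean[][] reasons = {
            {true, true, true},   // Reason1: v7, e8 and e9 all abnormal
            {true, true, false},  // Reason2
            {true, false, true},  // Reason3
            {true, false, false}, // Reason4: only v7 abnormal
            {false, true, true}   // Reason5: only the two edges abnormal
        };
        double[] p = new double[reasons.length];
        List<Integer> order = new ArrayList<>();
        for (int r = 0; r < reasons.length; r++) {
            p[r] = probability(isSwitch, reasons[r], ps, pc);
            order.add(r);
        }
        order.sort(Comparator.comparingDouble((Integer r) -> p[r]).reversed());
        List<String> labels = new ArrayList<>();
        for (int r : order) {
            labels.add("Reason" + (r + 1));
        }
        return labels;
    }

    public static void main(String[] args) {
        // Hypothetical values that satisfy c) pc < ps and d) both below 0.5.
        System.out.println(rankFirstCase(0.05, 0.01));
    }
}
```

With these values, Reason4 (only v7 faulty) comes first, followed by Reason2 and Reason3 (equal up to rounding), then Reason5, with Reason1 (all three faulty) last.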

The four simple rules used in this paper are based on our past network management experience in China. They hold in our network management system, but we cannot assure that they fit other networks well.

Compute the probability of the five possible fault reasons (elements v7, e8, e9 in that order):
Reason1: $P_0 = P_s P_c P_c$
Reason2: $P_1 = P_s P_c (1 - P_c)$
Reason3: $P_2 = P_s (1 - P_c) P_c$
Reason4: $P_3 = P_s (1 - P_c)(1 - P_c)$
Reason5: $P_4 = (1 - P_s) P_c P_c$
Because P3 is the largest, failure of v7 is the most likely fault source and is output first.
Step8: Send repair commands to the network operator to check the power or other conditions of v7.
Step9: If the repair feedback shows that the judgment is not correct, select the most probable of the remaining reasons (here Reason5, the only consistent combination in which v7 is normal), and so on. The algorithm runs repeatedly until all vertices are reachable.

B. An Unreachable Vertex of Degree 1

Figure 7 shows the second fault case, where v5, whose degree is just one, is inaccessible. In computer networks, most servers sit at the edge of a network, i.e., only one cable or optical fiber is connected to them. This case can therefore represent an inaccessible WWW, FTP or e-mail server.

[Figure 7. The second fault case.]

Step3: SV = [1 1 1 1 1 0 1 1 1 1];
Step4: PFES = [1 1 1 1 1 1 1 0 1 1 1 1 1 1];
Step5: PFVS = [1 1 1 1 1 0 1 1 1 1];
Step6: All the possible fault elements in this fault case and the possible fault reasons are listed in Table II.

TABLE II. FAULT ELEMENT COMBINATIONS AND FAULT EFFECTS
v5  e7  consistency
0   0   T
0   1   T
1   0   T
1   1   F

Step7: Compute the probability of the three possible fault reasons.
Reason1: $P_0 = P_s P_c$
Reason2: $P_1 = P_s (1 - P_c)$
Reason3: $P_2 = (1 - P_s) P_c$
Based on the above assumptions, P1 and P2 are obviously larger than P0. Comparing P1 and P2:
$$P_1 - P_2 = P_s (1 - P_c) - (1 - P_s) P_c = P_s - P_s P_c - P_c + P_s P_c = P_s - P_c > 0.$$
Since P1 is the largest, the vertex v5 is the first output of the fault identification algorithm.
Step8: Send the repair command to fix the network.

C. Multiple Inaccessible Vertices in One Connected Component

Figure 8 shows the third case, in which many managed devices become inaccessible at once and all of them lie in the same connected component; this case is frequent in real network environments. If boundary routers fail or the fibers connected to them are broken, all computers in a sub-network become inaccessible.

[Figure 8. The third fault case.]

Step3: SV = [1 1 1 1 0 0 1 1 1 1];
Step4: PFES = [1 1 1 0 1 1 0 1 1 1 1 1 1 1];
Step5: PFVS = [1 1 1 1 0 1 1 1 1 1];
Step6: There are three possible fault elements in Figure 8, namely e3, e6 and v4;
Step7: As in the fault source sorting of the first fault case, the fault of v4 is listed first;
Step8: The operation is similar to the first two cases;
Step9: Test the reachable status of the vertices in UVS. When all vertices are accessible, the algorithm stops.

In this case, the number of possible fault sources does not increase quickly even though more than one vertex has failed. If e3 and e6 are normal, the network is fully recovered after the fault in v4 is repaired. Otherwise the fault fixing process can be subdivided into two steps. This decreases the complexity of each decision and improves both the fixing rate and the correctness.

D. Multiple Inaccessible Vertices in Multiple Connected Components

Figure 9 shows that v4, v5 and v7 are abnormal, which is an example of the fourth fault case. In computer networks, this case corresponds to several isolated sub-networks becoming inaccessible at the same time.

Figure 9. The fourth fault case.

Step 3: SV = [1 1 1 1 1 1 1];
Step 4: PFES = [1 1 1 1 1 1 1 1 1 1];
Step 5: PFVS = [1 1 1 1 1 1 1 1].

The remaining steps are the same as in the first and second cases.

V. AUTOMATED FAULT EFFECT ANALYSIS ALGORITHM

In Section III, the possible fault edges and vertices of a fault effect were identified. Removing either all the possible fault edges or all the possible fault vertices changes the connectivity of the graph. Different combinations of possible fault elements can also lead to the same fault indication. For example, in Figure 2, if e3 and e4 fail at the same time, the fault indication is again that v4 and v5 are unreachable. Identifying the possible fault elements is therefore only the first step of fault reason analysis. The next step is to list the possible fault reasons, as in Tables I and II of Section IV. Some algorithms compare probabilities, and others measure the similarity between the current fault case and historical fault records. We instead focus on the connectivity of the graph, and with this method all the possible fault reasons are listed precisely.

The discussion of the algorithm is divided into three parts. First, algorithms for computing the reachability matrix are analyzed and compared, and the combinations of fault elements and their indications are computed. Second, we discuss how to judge whether the indication of a combination of possible fault elements matches the test results. Finally, we describe how to generate all the combinations in software.

A. Possible Fault Elements Combination and Its Effect

In graph theory, the reachability matrix is computed as

$$R = A + A^2 + A^3 + \cdots + A^n,$$

where A is the adjacency matrix of the graph and the sums and products are Boolean. The time complexity of this algorithm is O(n^4) [14]. The Warshall algorithm is an efficient method for computing the transitive closure of a relation [15]. Its worst-case complexity is O(n^3), where n is the number of vertices of the graph.
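The two options can be compared directly in code. The Java sketch below assumes a 0/1 adjacency matrix of an undirected graph and uses hypothetical names. `powerSum` follows $R = A + A^2 + \cdots + A^n$ with Boolean products, which takes n - 1 products of O(n^3) each. `warshall` computes the same transitive closure in O(n^3) by letting every row that reaches an intermediate vertex k absorb row k.

```java
// Hypothetical sketch comparing two ways of computing the reachability matrix R.
class Reachability {

    // Boolean matrix product: c[i][j] = OR over k of (a[i][k] AND b[k][j]).
    static int[][] multiply(int[][] a, int[][] b) {
        int n = a.length;
        int[][] c = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                if (a[i][k] == 1)
                    for (int j = 0; j < n; j++)
                        if (b[k][j] == 1) c[i][j] = 1;
        return c;
    }

    // R = A + A^2 + ... + A^n (Boolean): n - 1 products of O(n^3), O(n^4) overall.
    static int[][] powerSum(int[][] a) {
        int n = a.length;
        int[][] r = new int[n][];
        int[][] power = new int[n][];
        for (int i = 0; i < n; i++) {
            r[i] = a[i].clone();
            power[i] = a[i].clone();
        }
        for (int step = 2; step <= n; step++) {
            power = multiply(power, a);               // power = A^step
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    if (power[i][j] == 1) r[i][j] = 1;
        }
        return r;
    }

    // Warshall: for each intermediate vertex k, every row i that reaches k
    // also reaches everything k reaches (row i := row i OR row k). O(n^3).
    static int[][] warshall(int[][] a) {
        int n = a.length;
        int[][] r = new int[n][];
        for (int i = 0; i < n; i++) r[i] = a[i].clone();
        for (int k = 0; k < n; k++)
            for (int i = 0; i < n; i++)
                if (r[i][k] == 1)
                    for (int j = 0; j < n; j++)
                        if (r[k][j] == 1) r[i][j] = 1;
        return r;
    }

    public static void main(String[] args) {
        // A path 0 - 1 - 2 plus an isolated vertex 3.
        int[][] a = {
            {0, 1, 0, 0},
            {1, 0, 1, 0},
            {0, 1, 0, 0},
            {0, 0, 0, 0}
        };
        System.out.println(java.util.Arrays.deepToString(warshall(a)));
        System.out.println(java.util.Arrays.deepEquals(warshall(a), powerSum(a)));
    }
}
```

In the reachability matrix of an undirected graph, the diagonal entry of a vertex is 1 whenever the vertex has at least one neighbor, because a path of length 2 leads back to it. An isolated vertex, such as vertex 3 above, keeps a zero row.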
Since the reachability matrix is the transitive closure of the adjacency relation, we propose using the Warshall algorithm to compute it [14]. The Warshall algorithm is clearly faster than the first method. Computing the reachability matrix is a frequent operation in our fault identification approach, so choosing the Warshall algorithm speeds up the whole fault reasoning process.

The removal of a vertex and the removal of an edge were defined in Section II. In the software implementation, all the information about the network is stored in the adjacency matrix and the incidence matrix, and the detailed removal operations are given in Figure 10. Querying PFES and PFVS and removing vertices and edges as in Figure 10 produces a new adjacency matrix, denoted by A'. The reachability matrix of the current network is then computed from A' with the Warshall algorithm.

```java
// numofvertices, adjacencymatrix and incidencematrix are fields of the enclosing class.

// Removing a vertex deletes every adjacency entry in its row and column.
void removalofavertex(int vertexid) {
    for (int i = 0; i < numofvertices; i++) {
        adjacencymatrix[i][vertexid] = 0;
        adjacencymatrix[vertexid][i] = 0;
    }
}

// Removing an edge: find its two end vertices in the incidence matrix,
// then delete the adjacency entries between them.
void removalofanedge(int edgeid) {
    int j = 0;
    int[] v = new int[2];
    for (int i = 0; i < numofvertices; i++) {
        if (incidencematrix[i][edgeid] == 1) {
            v[j] = i;
            j++;
        }
    }
    adjacencymatrix[v[0]][v[1]] = 0;
    adjacencymatrix[v[1]][v[0]] = 0;
}
```

Figure 10. The removal algorithms of a vertex and an edge.

The example in Figure 11 illustrates the computing process.

Figure 11. The vertex and edge have been removed.

Figure 11 (a) is the complete computer network, which is a sub-graph of Figure 1. Figure 11 (b) is the current network state, in which v8 and v2 are inaccessible to the manager located in v0. Suppose that this fault is a combined one in which v8 and v2 broke down at the same time. Suppose also that the vertices of Figure 11 are listed in the arbitrary order v0, v1, v2, v8 and v9. The reachability matrix is then computed as follows.

Let A be the adjacency matrix of Figure 11 (a) with the vertices in this order. Deleting v8 and then v2 from A gives the new adjacency matrix A'. The reachability matrix of Figure 11 (b) is then computed from A' with the Warshall algorithm, where rows are added with Boolean OR:
(1) add the 1st row to the 2nd row;
(2) add the 1st row to the 5th row;
(3) add the 2nd row to the 1st row;
(4) the result is the reachability matrix

$$R = \begin{pmatrix} 1&1&0&0&1\\ 1&1&0&0&1\\ 0&0&0&0&0\\ 0&0&0&0&0\\ 1&1&0&0&1 \end{pmatrix}.$$

However, the reachability matrix R alone does not directly show what the fault effect is.

B. Judging Whether Fault Effects Are Consistent

The reachability matrix shows whether at least one path exists between two given vertices, with an entry of 1 meaning yes. The key idea of our fault identification approach is the connectivity analysis of the graph, and the reachability matrix carries this information only implicitly. We therefore look for a way to make it explicit, so that it can be judged by simple rules. Matrix transformation is a common tool in mathematics, and our algorithm uses it. After a suitable transformation, special sub-matrices appear, and we have found transformation rules for the reachability matrix that expose them. To perform automated fault effect analysis, Figure 12 introduces two temporary matrices: RTM (Row Transformation Matrix) and CTM (Column Transformation Matrix). In graph theory, two different vertices are reachable from each other if and only if they are in the same connected component. The automated fault effect judgment algorithm in Figure 12 is based on this principle.

```java
// Builds RTM and CTM (both n x n) from the reachability matrix r.
// accessible[i] is the test result for vertex i, e.g. taken from SV.
void transform(int[][] r, boolean[] accessible, int[][] rtm, int[][] ctm) {
    int numofvertices = r.length;
    for (int i = 0, j = 0, k = numofvertices - 1; i < numofvertices; i++) {
        if (accessible[i]) { rtm[j] = r[i].clone(); j++; }   // i-th row of R -> j-th row of RTM
        else               { rtm[k] = r[i].clone(); k--; }   // i-th row of R -> k-th row of RTM
    }
    for (int i = 0, j = 0, k = numofvertices - 1; i < numofvertices; i++) {
        int target;
        if (accessible[i]) { target = j; j++; }              // i-th column of RTM -> j-th column of CTM
        else               { target = k; k--; }              // i-th column of RTM -> k-th column of CTM
        for (int row = 0; row < numofvertices; row++) {
            ctm[row][target] = rtm[row][i];
        }
    }
}
```

Figure 12. The matrix transform algorithm.
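The transform of Figure 12 and the block rules for CTM stated later in this subsection can be combined into one check. The Java sketch below illustrates this, with a hypothetical class name and the test result passed in as `accessible`. It reorders R into CTM and accepts a combination only if the top-left block is all 1 and both off-diagonal blocks are all 0, while ignoring the bottom-right block. The two matrices in `main` use the vertex order v0, v1, v2, v8, v9 of the Figure 11 example. They were reconstructed from its description, once with v8 and v2 removed and once with only v8 removed, and depend only on which vertices remain connected to the manager.

```java
// Hypothetical sketch: Figure 12's row/column reordering plus the CTM block test.
class ConsistencyCheck {

    // pos[t] = original vertex placed at position t: accessible vertices first in
    // their original order, inaccessible ones at the back in reverse order.
    static int[] order(boolean[] accessible) {
        int n = accessible.length;
        int[] pos = new int[n];
        for (int i = 0, j = 0, k = n - 1; i < n; i++) {
            if (accessible[i]) pos[j++] = i;
            else pos[k--] = i;
        }
        return pos;
    }

    // Equivalent to copying rows (RTM) and then columns (CTM) as in Figure 12.
    static int[][] ctm(int[][] r, boolean[] accessible) {
        int n = r.length;
        int[] pos = order(accessible);
        int[][] c = new int[n][n];
        for (int row = 0; row < n; row++)
            for (int col = 0; col < n; col++)
                c[row][col] = r[pos[row]][pos[col]];
        return c;
    }

    // V1 = number of accessible vertices. Consistent when the top-left V1 x V1
    // block is all 1 and the off-diagonal blocks are all 0; the bottom-right
    // block (links among inaccessible vertices) is ignored.
    static boolean isConsistent(int[][] r, boolean[] accessible) {
        int n = r.length;
        int v1 = 0;
        for (boolean a : accessible) if (a) v1++;
        int[][] c = ctm(r, accessible);
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                boolean topLeft = i < v1 && j < v1;
                boolean bottomRight = i >= v1 && j >= v1;
                if (topLeft && c[i][j] != 1) return false;
                if (!topLeft && !bottomRight && c[i][j] != 0) return false;
            }
        return true;
    }

    public static void main(String[] args) {
        // Vertex order v0, v1, v2, v8, v9; the test finds v2 and v8 inaccessible.
        boolean[] accessible = {true, true, false, false, true};
        int[][] bothFaulty = {          // R after removing v8 and v2
            {1, 1, 0, 0, 1}, {1, 1, 0, 0, 1}, {0, 0, 0, 0, 0},
            {0, 0, 0, 0, 0}, {1, 1, 0, 0, 1}
        };
        int[][] onlyV8 = {              // R after removing only v8
            {1, 1, 1, 0, 1}, {1, 1, 1, 0, 1}, {1, 1, 1, 0, 1},
            {0, 0, 0, 0, 0}, {1, 1, 1, 0, 1}
        };
        System.out.println(isConsistent(bothFaulty, accessible));  // true
        System.out.println(isConsistent(onlyV8, accessible));      // false
    }
}
```

Indexing R through the permutation `pos` avoids allocating RTM separately. It yields the same CTM as the two copy loops of Figure 12.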
Transforming the reachability matrix R with the algorithm in Figure 12 yields CTM. In CTM, the vertices are grouped into the reachable vertex set (RVS) followed by the unreachable vertex set (UVS). For the example of Figure 11 (b), the procedure is:
(1) copy the 1st, 2nd and 5th rows of R to the 1st, 2nd and 3rd rows of RTM, and the 3rd and 4th rows of R to the 5th and 4th rows, respectively;
(2) copy the 1st, 2nd and 5th columns of RTM to the 1st, 2nd and 3rd columns of CTM, and the 3rd and 4th columns to the 5th and 4th columns, respectively.

$$\mathrm{RTM} = \begin{pmatrix} 1&1&0&0&1\\ 1&1&0&0&1\\ 1&1&0&0&1\\ 0&0&0&0&0\\ 0&0&0&0&0 \end{pmatrix}, \qquad \mathrm{CTM} = \begin{pmatrix} 1&1&1&0&0\\ 1&1&1&0&0\\ 1&1&1&0&0\\ 0&0&0&0&0\\ 0&0&0&0&0 \end{pmatrix}$$

The resulting vertex order of CTM is v0, v1, v9, followed by v8, v2. Let V1 be the number of elements of RVS (V1 ≤ n). CTM can be divided into four sub-matrices, each with its own physical meaning, which helps in judging the consistency of a fault effect. If the fault effect obtained from connectivity analysis is consistent with the measurement results, CTM must satisfy all of the following rules.

$$\mathrm{CTM} = \begin{pmatrix} \mathbf{1}_{V_1 \times V_1} & \mathbf{0}_{V_1 \times (n-V_1)} \\ \mathbf{0}_{(n-V_1) \times V_1} & \ast \end{pmatrix}$$

The top-left sub-matrix (V1 × V1) represents the reachability between every pair of accessible vertices, so all its elements must be 1. The bottom-left and top-right sub-matrices have the same meaning: they represent the reachability between RVS and UVS, so their elements must be 0. Otherwise some vertex would belong to both UVS and RVS, which is impossible. The example above is therefore a possible fault reason for v2 and v8 being inaccessible to the manager, with v8 and v2 as simultaneous fault sources.

Next, we test another combination of possible fault elements, in which only v8 is faulty. The computing process is:
(1) remove v8 from A to obtain A';
(2) compute the reachability matrix with the Warshall algorithm;
(3) transform R to get RTM;
(4) transform RTM to get CTM.

With the vertex order v0, v1, v2, v8, v9, the reordering in steps (3) and (4) happens to leave the entries unchanged, so

$$R = \mathrm{RTM} = \mathrm{CTM} = \begin{pmatrix} 1&1&1&0&1\\ 1&1&1&0&1\\ 1&1&1&0&1\\ 0&0&0&0&0\\ 1&1&1&0&1 \end{pmatrix}.$$

The fault effect of this combination is clearly inconsistent with the measurement result. The reachability matrix shows v2 connected with v0, so the last row of CTM, which belongs to v2, has 1s in the bottom-left block. The inaccessible vertices may also be connected to one another in ways the manager cannot observe, so the best choice is to ignore the information in the bottom-right sub-matrix (marked ∗ above). If such connections exist, they become apparent in practice when repairing a single vertex makes several vertices reachable.

C. Algorithm for Computing Fault Elements Combination

The number of possible fault elements is variable and changes over time, even in the same network. Each possible fault element has two states, normal and abnormal. There are therefore 2^x combinations of possible fault elements, where x is the number of possible fault elements in the network, although the number of possible fault reasons does not necessarily equal this. A flexible algorithm that quickly produces the 2^x 0-1 sequences is therefore the key to computing fault reasons. To design it, we first consider a similar problem: obtaining the binary representation of an integer.
DIV and MOD are two useful operations in data processing. Consider converting the decimal number (11)d into binary step by step. The DIV and MOD steps are shown in Figure 13.

number = 11: number DIV 8 = 1, number MOD 8 = 3;
number = 3: number DIV 4 = 0, number MOD 4 = 3;
number = 3: number DIV 2 = 1, number MOD 2 = 1;
number = 1: number DIV 1 = 1, number MOD 1 = 0.

Figure 13. The process of converting (11)d to (1011)b.

Figure 13 suggests a solution for our task. The number of possible fault elements is known, so the total number of combinations can be computed and each combination can be assigned a number that identifies it. The remaining task is to encode that number in a way that has a physical meaning and helps further processing. Each combination can be coded as a binary sequence of length x, where each bit indicates whether the corresponding possible fault element is normal, and this coding represents every combination. Figure 14 gives the algorithm that codes the fault element combinations, which is similar to the integer conversion above.

```java
// One row per combination; bit k of row `number` is 1 if element k is normal, 0 if faulty.
// The extra column (index x) is left free, presumably for the consistency value.
int[][] listfaultreasonstable(int[] elements) {
    int x = elements.length;
    int max = 1 << x;                        // 2^x combinations
    int number, i, j, k;
    int[][] list = new int[max][x + 1];
    for (number = 0; number < max; number++) {
        i = number;
        j = x - 1;
        k = 0;
        while (j >= 0) {
            list[number][k] = i / (1 << j);  // i DIV 2^j
            i = i % (1 << j);                // i MOD 2^j
            j--;
            k++;
        }
    }
    return list;
}
```

Figure 14. The algorithm to get possible fault lists.

With these algorithms of Section V, the two tables can be computed automatically.
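The sketch below shows how a table like Table II could be filled automatically. It is a compact Java illustration under several assumptions and uses hypothetical names (`FaultTable`, `toBits`, `consistent`, `table`). It enumerates the 2^x combinations with the DIV/MOD coding of Figure 14, removes the elements coded 0, and closes the graph with Warshall. The consistency test then compares the manager's row of the reachability matrix directly with the test result, which simplifies the CTM block test of Section V.B. The 4-vertex graph in `main` is assumed and is not the network of Figure 7. It was chosen so that one vertex has degree 1, as in the second fault case.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical end-to-end sketch: enumerate combinations, remove faulty elements,
// and mark each row consistent (T) or inconsistent (F) with the test result.
class FaultTable {

    // DIV/MOD coding as in Figure 14, most significant bit first; 1 = normal.
    static int[] toBits(int number, int width) {
        int[] bits = new int[width];
        for (int j = width - 1, k = 0; j >= 0; j--, k++) {
            bits[k] = number / (1 << j);
            number = number % (1 << j);
        }
        return bits;
    }

    // Element ids below n are vertices; id n + e denotes edge e (edges[e] = {u, w}).
    static boolean consistent(int n, int[][] edges, int manager, int[] elements,
                              int[] combo, boolean[] accessible) {
        boolean[] vertexDown = new boolean[n];
        boolean[] edgeDown = new boolean[edges.length];
        for (int k = 0; k < elements.length; k++) {
            if (combo[k] == 1) continue;                  // element is normal
            if (elements[k] < n) vertexDown[elements[k]] = true;
            else edgeDown[elements[k] - n] = true;
        }
        int[][] r = new int[n][n];                        // adjacency after removal
        for (int e = 0; e < edges.length; e++) {
            int u = edges[e][0], w = edges[e][1];
            if (!edgeDown[e] && !vertexDown[u] && !vertexDown[w]) {
                r[u][w] = 1;
                r[w][u] = 1;
            }
        }
        for (int k = 0; k < n; k++)                       // Warshall closure
            for (int i = 0; i < n; i++)
                if (r[i][k] == 1)
                    for (int j = 0; j < n; j++)
                        if (r[k][j] == 1) r[i][j] = 1;
        for (int v = 0; v < n; v++) {
            boolean reachable = v == manager || r[manager][v] == 1;
            if (reachable != accessible[v]) return false;
        }
        return true;
    }

    // One row per combination: the element bits followed by T or F.
    static List<String> table(int n, int[][] edges, int manager, int[] elements,
                              boolean[] accessible) {
        List<String> rows = new ArrayList<>();
        for (int number = 0; number < (1 << elements.length); number++) {
            int[] combo = toBits(number, elements.length);
            StringBuilder row = new StringBuilder();
            for (int bit : combo) row.append(bit).append(' ');
            row.append(consistent(n, edges, manager, elements, combo, accessible) ? 'T' : 'F');
            rows.add(row.toString());
        }
        return rows;
    }

    public static void main(String[] args) {
        // Assumed graph: manager 0, triangle 0-1-2, and vertex 3 of degree 1
        // attached to vertex 1 by edge 3. The test finds only vertex 3 inaccessible.
        int n = 4;
        int[][] edges = {{0, 1}, {1, 2}, {0, 2}, {1, 3}};
        int[] elements = {3, n + 3};                      // vertex 3 and edge 3
        boolean[] accessible = {true, true, true, false};
        table(n, edges, 0, elements, accessible).forEach(System.out::println);
    }
}
```

For this assumed graph, the output rows are `0 0 T`, `0 1 T`, `1 0 T` and `1 1 F`, the same pattern of consistency values as Table II. Only the combination in which both elements are normal contradicts the test.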

VI. EMULATIONS IN A LARGE-SCALE NETWORK

The algorithms proposed in this paper are fairly complicated and should be coded and applied in a real network management system. We are also eager to evaluate this fault identification approach with actual experimental results. In our view, the performance of fault identification includes at least two factors: the reasoning speed, and whether the approach gives the actual fault reason. We do not consider the total recovery time of the network, because different fault sources need different repair times. Our aim is a short reasoning time.

We carried out an experiment with this fault identification approach in a real large-scale network management environment. The experimental setup is as follows.
Step 1: Code these algorithms and embed the novel fault identification module into a network management system (NMS);
Step 2: After the NMS starts, the fault identification module obtains the complete, static topology information from the configuration modules of the NMS;
Step 3: The status testing process creates the current (dynamic) status table of all managed nodes;
Step 4: With the three necessary structures (adjacency matrix, incidence matrix and status vector), the fault identification module is triggered, and all the possible fault reasons are listed and sorted;
Step 5: Fix the network according to the list of possible fault reasons.

In the emulation tests, all the classical types of fault effects listed in Section IV were tested except the second case.

Figure 15. Map of an actual large-scale network.

Figure 15 illustrates the connectivity structure of all the managed network equipment, which forms the backbone of a telecommunication company in Hubei province, China. There are 53 managed devices in total, denoted by triangles in the map, which are routers or switches. The black lines represent the links, most of which are optical fibers. The network topology is so complicated that it cannot easily be classified as a regular topology. Every vertex lies on at least one ring and has degree larger than two, so no vertex has degree 1 and the second fault case cannot be tested in our emulation experiments. Because the topology of this network is irregular, event correlation technology and reasoning are not conveniently applicable. In contrast, the graph-based approach is not sensitive to the topology and can be employed more easily.

Java is one of the most portable programming languages. Its code runs on Windows, Linux, Unix and other operating systems, and it has been popular in network management system design and implementation in recent years [18]-[19]. Although Windows is the most popular operating system, Unix and Linux are more frequently used on telecommunication servers. To improve software reusability, the Java Development Kit (JDK 1.4.1) was adopted as the development environment. The Runtime class in Java makes it easy to integrate the Ping command into the NMS and obtain the state measurement results.

Figure 16. The time consumption of reasoning.

Two issues of the algorithms were considered carefully, namely the fault isolation time and the correctness of the fault source. Figure 16 shows how the reasoning time (step 3 to step 6; from step 4 to step 7) changes with the number of inaccessible vertices in a connected component. The manager runs on a personal computer with a Pentium II 500 and 128M of memory. The time consumption is nearly constant, at about 4 ms, even as the number of faults increases linearly. The time cost thus remains stable when the approach is employed in a larger-scale network. This feature meets the real-time response requirement of an NMS, and scalability across different kinds of networks is also acceptable. Compared with other fault isolation technologies, the approach based on graph theory is both fast and easy to implement. Its reasoning results are also more reliable and more helpful for repair. An objective evaluation of the novel approach would require comparisons with other fault identification methods. We do not have the same test cases as other papers, so to avoid drawing a subjective conclusion we do not compare our work directly with other research results. The reasoning speed in Figure 16 is acceptable in both research and engineering settings. In addition, all the possible fault reasons are listed and sorted by probability after reasoning, so the actual fault reason is among the listed candidates. We therefore consider the emulation results to demonstrate the fault identification performance.

VII. CONCLUSIONS

The fault identification and isolation approach based on graph theory shows several advantages over other methods. First, it works in networks of any topology, whereas reasoning methods that depend on event correlation have to examine the network topology to derive relationships among the vertices and traps. Second, reasoning about fault reasons becomes simpler and can be executed conveniently by computer. Case-based, rule-based and artificial-neural-network-based approaches find fault reasons by comparing the new fault case with historical fault records. These methods face a dilemma between correctness and reasoning speed, which our novel method resolves. Third, the greatest improvement is that the algorithm always finds the most likely fault reason in a short time. Finally, most of the operations in this approach are matrix and Boolean operations, which a computer executes quickly. The evaluation example in this paper shows that the time complexity of the approach is acceptable.

This paper has assumed that all switches share one fault probability and all links share another, but we do not believe this to be entirely desirable. Future research should replace these assumptions with actual data, because the probabilities affect the order in which fault reasons are sorted. It has also been indicated that Bayesian decision does not respond quickly to small-probability events. Future research should therefore focus on integrating Bayesian decision with case-based reasoning in the fault source reasoning system. In addition, if abundant historical fault records are available, they should be studied further to mine helpful rules.
Furthermore, the basic assumption that the connectivity structure of the network can be obtained from the topology model should be relaxed in the future, because network topology changes dramatically [20]. Current connectivity discovery techniques should be integrated into the fault identification work. Automated fault identification and isolation remain a challenge, because the manager collects little information while the possible fault reasons are many. How to find the actual fault reason is still an open problem in network management. This paper finds a set of the most likely reasons that includes the actual one, which is an improvement but not a complete solution. We believe that fault isolation based on graph theory will lead to simple, scalable network management solutions.

ACKNOWLEDGEMENTS

This research has been partially supported by the National Natural Science Foundation of China under Grant No. 617443, by the Key Project of the Natural Science Foundation of Hubei Province, China, under Grant No. 22AB25, and by the Natural Science Foundation of Central China Normal University under Grant No. 552.

REFERENCES

[1] Chi-Chun Lo, Shing-Hong Chen, and Bon-Yeh Lin, "Coding-based Schemes for Fault Identification in Communication Networks," International Journal of Network Management, pp. 157-164, May-June 2000.
[2] Masum Hasan, Binay Sugla, and Ramesh Viswanathan, "A Conceptual Framework for Network Management Event Correlation and Filtering Systems," Proceedings of the Sixth IFIP/IEEE International Symposium on Integrated Network Management (IM), 1999.
[3] Cynthia Hood and Chuanyi Ji, "Proactive Network Fault Detection," IEEE Transactions on Reliability, vol. 46, no. 3, pp. 333-341, September 1997.
[4] A. Aghasaryan, E. Fabre, A. Benveniste, R. Boubour, and C. Jard, "A Petri Net Approach to Fault Detection and Diagnosis in Distributed Systems," 36th IEEE Conference on Decision and Control (CDC), San Diego, IEEE Control Systems Society, pp. 726-731, December 1997.
[5] René Boubour and Claude Jard, "Fault Detection in Telecommunication Networks Based on Petri Net Representation of Alarm Propagation," Lecture Notes in Computer Science, vol. 1248: 18th International Conference on Application and Theory of Petri Nets, Toulouse, France, pp. 367-386, June 1997.
[6] A. Bouloutas, S. Calo, and A. Finkel, "Alarm Correlation and Fault Identification in Communication Networks," IEEE Transactions on Communications, vol. 42, pp. 523-533, 1994.
[7] C. S. Chao, D. L. Yang, and A. C. Liu, "An Automated Fault Diagnosis System Using Hierarchical Reasoning and Alarm Correlation," Journal of Network and Systems Management, vol. 9, no. 2, pp. 183-202, June 2001.
[8] E. A. Mohamed and N. D. Rao, "Artificial Neural Network Based Fault Diagnostic System for Electric Power Distribution Feeders," Electric Power Systems Research, vol. 35, pp. 1-10, 1995.
[9] Hongjun Li, John S. Baras, and George Mykoniatis, "An Automated, Distributed, Intelligent Fault Management System for Communication Networks," Technical Report TR 99-57, University of Maryland, 1999.
[10] Beverly Schwartz, Alden W. Jackson, W. Timothy Strayer, Wenyi Zhou, R. Dennis Rockwell, and Craig Partridge, "Smart Packets: Applying Active Networks to Network Management," ACM Transactions on Computer Systems, vol. 18, no. 1, pp. 67-88, February 2000.
[11] T. White, A. Bieszczad, and B. Pagurek, "Distributed Fault Location in Networks Using Mobile Agents," in Proceedings of the 3rd International Workshop on Agents in Telecommunication Applications (IATA'98), Paris, France, July 1998.
[12] Andrzej Bieszczad, Bernard Pagurek, and Tony White, "Mobile Agents for Network Management," IEEE Communications Surveys, September 1998.
[13] Irene Katzela and Mischa Schwartz, "Schemes for Fault Identification in Communication Networks," IEEE/ACM Transactions on Networking, vol. 3, no. 6, pp. 753-764, December 1995.
[14] Yijiao Yu, Qin Liu, Liansheng Tan, and Debao Xiao, "A Novel Automated Fault Identification Approach in Computer Networks Based on Graph Theory," in Proceedings of the 2003 International Conference on Communication Technology, Beijing, pp. 167-173, April 9-11, 2003.
[15] Stephen Warshall, "A Theorem on Boolean Matrices," Journal of the ACM, vol. 9, no. 1, pp. 11-12, January 1962.
[16] C. S. Chao, D. L. Yang, and A. C. Liu, "A LAN Fault Diagnosis System," Computer Communications, vol. 24, no. 14, pp. 1439-1451, September 2001.
[17] Anoop Reddy, Deborah Estrin, and Ramesh Govindan, "Large-Scale Fault Isolation," IEEE Journal on Selected Areas in Communications, vol. 18, no. 5, pp. 733-743, May 2000.
[18] Y. Yemini, A. V. Konstantinou, and D. Florissi, "NESTOR: An Architecture for Network Self-Management and Organization," IEEE Journal on Selected Areas in Communications, vol. 18, no. 5, pp. 758-768, May 2000.
[19] L. Andrey, O. Festor, E. Nataf, and R. State, "JTMN: A Java-Based TMN Development and Experimentation Environment," IEEE Journal on Selected Areas in Communications, vol. 18, no. 5, pp. 66-67, May 2000.
[20] V. Paxson, "End-to-End Routing Behavior in the Internet," IEEE/ACM Transactions on Networking, vol. 5, no. 5, pp. 601-615, October 1997.

Yijiao Yu is now a Teaching Assistant at Central China Normal University, PR China. Mr. Yu received his MSc and Bachelor degrees in computer science from the Department of Computer Science, Central China Normal University, in 2002 and 1999, respectively. His research interests focus on computer networks and artificial intelligence.

Qin Liu received her MSc and Bachelor degrees in Computer Science from the Department of Computer Science, Central China Normal University, PR China. She is interested in computer network congestion control and traffic modeling.

Liansheng Tan is now a Full Professor and Head of the Department of Computer Science, Central China Normal University, PR China. Professor Tan received his Ph.D. degree from Loughborough University in the UK in 1999. In 2001 he did research on computer communication networks in the School of Information Technology and Engineering at the University of Ottawa, Ontario, Canada, as a postdoctoral research fellow and a visiting research scientist. He has published over fifty refereed papers.
His research interests are in the modeling, analysis and performance evaluation of computer communication networks, their protocols, services and interconnection architectures. These include multimedia networks, local area networks, metropolitan area networks, broadband networks and switching architectures for congestion control. Professor Tan is also interested in queuing theory, simulations, computational algorithms and their applications in high-speed computer communication networks.