RequIn, a tool for fast web traffic inference

Olivier Paul, Jean Etienne Kiba
GET/INT, LOR Department
9 rue Charles Fourier, 91011 Evry, France
Olivier.Paul@int-evry.fr, Jean-Etienne.Kiba@int-evry.fr

Abstract

As networked attacks grow in complexity and more and more Internet users get broadband Internet access, application level traffic analysis in operator networks becomes more difficult. In this paper, we describe a tool allowing web communications to be analyzed in such an environment. Instead of relying on the extraction of application level parameters and on pattern matching algorithms, which are usually considered bottlenecks for such an activity, we look at simple network and transport level parameters to infer what happens at the application level. Our approach provides the ability to trade analysis precision for analysis speed, which in our opinion could be useful for some traffic analysis applications such as the detection of denial of service attacks.

Keywords: Monitoring, HTTP, performance, DDoS.

I. INTRODUCTION

Over the last ten years, a part of the security functions that were previously implemented within companies has been delegated or outsourced to external organizations. The appearance of new threats (e.g. worms, DDoS attacks) has led network operators to provide intrusion detection services to their customers. In this paper we consider one of the challenges implied by this new activity: the ability to monitor user communications within operator networks. This task is challenging for several reasons:

- Operator network internal devices usually have basic traffic analysis abilities. Most devices are currently limited to operations applied to packet headers through capture, aggregation, filtering, sampling and counting operations.
- Operator network internal links usually carry large amounts of traffic. As a result the time that a monitor can devote to each user request is usually very short (a few hundreds of nanoseconds). Consequently, complex operations such as the algorithms employed in end-host monitoring systems are usually unusable.
For example the snort intrusion detection tool uses a pattern matching algorithm for which the best known solution in terms of time complexity [2] is in O(n+m), where n is the size of the string to be searched and m the size of the pattern. Such algorithms would clearly be unable to handle strings longer than a few words in very high speed environments.
- Operator networks are usually constrained in terms of introducing new mechanisms or tools by two parameters, one being the reliability of their network, the other one being the management cost. As a result new techniques should as far as possible take advantage of existing monitoring mechanisms in order to limit the modifications to existing elements.

In this paper we focus on HTTP communications. According to ISPs [3], HTTP traffic constitutes between 35 and 50% of the Internet traffic. The goal of this paper is to present a tool that allows such communications to be analyzed in the middle of a network while complying with the aforementioned limitations. We first introduce the measurement information our analysis is based on. Section IV shows how such information can later be used to deduce users' requests. Section V presents RequIn, an implementation of our techniques as an extension to IPFilter on FreeBSD. We then test our analysis technique and implementation by considering models and traffic originating from our web site.

II. MEASUREMENT INFORMATION

Our goal is to permit the monitoring of web communications when application level information cannot be used. To do so, we plan to use network and transport level information to infer application level behaviors. In this section we first introduce network and transport level measurement capabilities.

A. The HTTP Protocol

HTTP exchanges can be viewed at several levels. At the lowest level, the HTTP protocol is based on a request-response protocol where each request attempts to perform an HTTP operation on an object at the server. We later call this level the micro-session level. Information in HTTP 1.1 messages [4] is organized into information elements called headers. Although HTTP 1.1
defines more than 40 different headers, requests and responses usually only use a few of them. Requests usually include some of the following headers: the version of the protocol, a method indicating the action to be performed, a URI indicating the object the action is to be performed on, a destination identifying the targeted web server, the date at which the request was performed...

This work is funded through European Commission IST FP6 DIADEM FIREWALL and GET DDOS projects.
Similarly, a response usually includes similar headers (version, encoding, date, server) and some specific headers like a status indicating the result of the request, the content length or information targeting caches (expiration date, cache directives).

B. Measurement Information Selection

As mentioned earlier, measurement operations aim at understanding application level operations through the analysis of network and transport level information. Although the application level protocol has an impact on this information, this impact also depends on intermediate protocols. Additionally some application level parameters might be more difficult to infer than others. As a result a first step is to try and map application level parameters to transport and network level parameters. The strength of this mapping is examined in the following sections.

TABLE I. PARAMETERS MAPPING.

Parameter      Network/Transport level parameter
Method         Request data size, Response data size
URI            Response data size
Source         Source IP address, Port
Destination    Destination IP address, Port
Time/Date      External: Time/Date
Compress.      External: Server configuration
Caching        Response data size
Status         Response data size

As indicated in table I, sizes are expected to be a significant source of information in order to infer several application level parameters. More specifically our assumption is that object sizes and object identifiers are closely connected and that object sizes and transport/network level measured sizes are also connected. While this last relation is obviously true with HTTP 1.0, where a connection is used for each object, HTTP 1.1 uses several improvements that can render this relation weaker.

C. HTTP/TCP Relationship

HTTP 1.1 [4] provides the ability for web clients and servers to multiplex several HTTP request-response exchanges over a single TCP connection. Among persistent connections we can also distinguish connections using pipelined requests from regular connections. Pipelined connections are used by the client to perform several requests without waiting for an answer from the server.
This ability is however usually limited by the structure of html objects, where imported objects can only be requested after the html document linking to them is received by the client. Therefore in connections, request-response sessions can be distinguished at the network level by either looking at:

- Connection set-up and ending in the case of non persistent connections (whether pipelined or not).
- Request-response session patterns [6] in the case of non-pipelined persistent connections. These patterns can be found at the network level by considering the evolution of TCP sequence numbers. As the sequence number from the client only increases when a new request is sent to the server, we can set the beginning of each new session when a client sequence number increase occurs. Since several requests cannot be served simultaneously over the same connection this also represents the end of the previous session.
- Request-response session patterns in the case of persistent pipelined connections. In the case of pipelined requests, only the first request-response session can be distinguished from other exchanges. The next session may include one or several request-response sessions.

As a result pipelined, persistent connections can make the relation between network/transport level information and application level information so weak that it can hardly be used. An interesting question is therefore whether pipelined connections are used in real life. Reference [5] shows that most browsers (MS IE) are not able to use pipelined connections. Moreover browsers (Firefox, Netscape) that do support pipelining are usually configured to avoid using it.

Being able to distinguish micro-sessions allows us to measure the amount of data transported by TCP for a request or a response by looking at the evolution of TCP sequence numbers during a micro-session.

III. METHOD AND OBJECTS SIZE INFERENCE

As mentioned earlier, our assumption is that object sizes can be inferred from network or transport level measurements. As a result being able to perform that operation as correctly as possible is critical to our scheme.
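Since the whole scheme depends on these measured sizes, the following minimal sketch illustrates how micro-sessions could be delimited on a non-pipelined persistent connection and how request and response byte counts could be derived from TCP sequence numbers, as described in section II. It is an illustration only and not the code used in RequIn: the `Segment` tuple, the `MicroSession` class and the function name are hypothetical, segments are assumed to be observed in order, and sequence number wrap-around is ignored. We also assume that a request spanning several segments stays in the same micro-session until the server starts answering.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

# Hypothetical segment description: (direction, sequence number, payload length).
# direction is "c2s" (client to server) or "s2c" (server to client).
Segment = Tuple[str, int, int]


@dataclass
class MicroSession:
    request_bytes: int = 0   # new TCP payload bytes sent by the client
    response_bytes: int = 0  # new TCP payload bytes sent by the server


def fresh_bytes(seq: int, length: int, highest: Optional[int]) -> int:
    """Number of payload bytes lying beyond the highest sequence number seen."""
    end = seq + length
    return end - (seq if highest is None else max(seq, highest))


def delimit_micro_sessions(segments: Iterable[Segment]) -> List[MicroSession]:
    sessions: List[MicroSession] = []
    client_high: Optional[int] = None  # end of the client data seen so far
    server_high: Optional[int] = None  # end of the server data seen so far
    response_started = False
    for direction, seq, length in segments:
        if length <= 0:
            continue  # pure ACK, SYN or FIN: no payload to account for
        if direction == "c2s":
            new = fresh_bytes(seq, length, client_high)
            if new <= 0:
                continue  # retransmission: the client sequence did not increase
            client_high = seq + length
            # A client sequence increase after server data starts a new session.
            if not sessions or response_started:
                sessions.append(MicroSession())
                response_started = False
            sessions[-1].request_bytes += new
        else:
            new = fresh_bytes(seq, length, server_high)
            if new <= 0:
                continue  # retransmitted server data
            server_high = seq + length
            if not sessions:
                continue  # server data seen before any request: ignored here
            sessions[-1].response_bytes += new
            response_started = True
    return sessions


example = [
    ("c2s", 1000, 300),   # first request
    ("s2c", 5000, 1460),  # first response, segment 1
    ("s2c", 6460, 540),   # first response, segment 2
    ("s2c", 6460, 540),   # retransmission, not counted twice
    ("c2s", 1300, 280),   # second request: closes the first micro-session
    ("s2c", 7000, 230),   # second response
]
for session in delimit_micro_sessions(example):
    print(session)
```

The byte counts obtained this way include HTTP headers, which is why the header size analysis of this section is needed before object sizes can be estimated.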
Several factors like HTTP headers can play a role in making this process more difficult. Our assumption is that the size of HTTP headers can take a limited number of values. For a given server these values depend on the server configuration.

A. Response type and method inference

Fig. 1 provides the relation between header sizes, types of response and total sizes in the case of our web server. These values were obtained by capturing response packets from the server over 24 hours. Six types of responses (identified by code numbers) were captured. Fig. 1 shows that 200 ("OK") responses can be distinguished from other responses by looking at the total size (total size > 570 bytes). As shown in fig. 1, some 200 responses have a size that collides with other types of responses. However objects carried by these responses constitute less than 1% of existing objects. 304 ("Not modified") responses can also be distinguished from other responses by looking at the total size (total size < 250 bytes). Other responses cannot be distinguished as they carry objects whose size can vary widely. Additionally, our tests showed that two types of HTTP headers (and thus two header sizes) were found in transactions with our server: headers for non persistent connections, and headers for persistent connections, which included additional headers with a fixed size.
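As a rough illustration, the size-based classification discussed above, with the ranges that table II gives for persistent connections, could be written as the following sketch. The function name and the "unknown" class are our own assumptions for the example; the thresholds are specific to our server configuration and would have to be measured again for any other server.

```python
def classify_response(response_size: int) -> str:
    """Map a response size in bytes to a response code class (table II ranges)."""
    if response_size > 570:
        return "200"
    if 460 < response_size < 570:
        return "301/400/403/404"
    if 250 < response_size < 460:
        return "200"
    if 240 < response_size < 250:
        return "304"
    # Sizes on a boundary or below 240 bytes are not covered by table II.
    return "unknown"


for size in (1200, 500, 300, 245, 100):
    print(size, classify_response(size))
```

A size that falls outside every range is reported as unknown rather than forced into a class, so that unexpected server behavior remains visible.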
Figure 1. Response Type/Size Relationship (header size and total size in bytes for response types 200, 301, 304, 400, 403 and 404).

As a result knowing whether a connection is persistent is sufficient to deduce the influence of the persistence on the HTTP header size. This knowledge can be obtained using the session delimitation scheme described in section II. Non persistent connections are distinguished by looking for multiple connection establishments and teardowns over short periods of time. Table II provides the relation between response size and response codes for persistent connections.

TABLE II. RESPONSE CLASSIFICATION USING RESPONSE SIZE (RS).

Result                Response Size (bytes)
200                   RS > 570, 250 < RS < 460
304                   240 < RS < 250
301, 400, 403, 404    460 < RS < 570

Using a similar methodology, we define a set of classification criteria in order to infer the method used in HTTP requests. However, we found that the method type could not be determined efficiently using the response size alone. We therefore use the combination of request and response sizes.

B. Object Size Inference

Fig. 1 shows that 200 responses can carry HTTP headers whose sizes are not fixed. As a result using an average HTTP header size value to estimate object sizes in the case of GET requests can lead to some errors. By looking more closely at header fields we can classify them according to their behavior:

- Some headers never change (e.g. response code, server identifier, accept range, ...).
- Some header values change but have a fixed size (e.g. last modified, date and Etag).
- Some header values change depending on the associated object (e.g. content type and length).

As a result for a given object, the response size should remain constant. This means that by keeping the relation between response sizes and object sizes, we can get an exact estimate of object sizes.

Popular HTTP servers support object compression prior to sending objects to the client. This can cause a difference between the number of bytes measured in the network and the size of the object.
The compression option is used when an appropriate configuration is performed on the server side and when the client supports compression. However as most clients support compression, knowing if compression is used is only a matter of knowing if the server is configured to use it. In this case HTTP servers provide the ability to log both compressed and original sizes for each requested object. The inference process in the case of compressed objects therefore remains the same.

IV. URI INFERENCE

Our assumption for URI inference is that network and transport level measurement parameters can be used to infer object identifiers for GET requests:

- Each given object has a single size. As a result knowing a URI can help us explain object sizes and reciprocally.
- Users originating from different locations have different interests. For example local students are usually more interested in schedules and course related information while users connecting from remote research institutions are more interested in research related objects.
- For the same reasons people residing in different timezones use different parts of the server.

In order to understand the relations between measurement parameters, we use access logs available on web servers. These logs are usually made of a set of entries, each of them describing an action performed on the server. Because each entry links IP addresses, time and date information, object sizes and object identifiers, we can use log entries in order to build a model that will later be used to infer object identifiers when provided with other parameter values.

A. Inference Model

The model we selected to perform inference operations is a Bayesian network. Bayesian networks are graphical models that can be used to represent causal relationships between variables. A Bayesian network is usually defined as:

- An acyclic directed graph $G = (V, E)$, where V is a set of nodes and E a set of edges.
- A finite probability space $(\Omega, Z, P)$.
- A set of variables defined on $(\Omega, Z, P)$, such that:

$$P(V_1, V_2, \ldots, V_n) = \prod_{i=1}^{n} P(V_i \mid C(V_i))$$

where $C(V_i)$ is the set of causes for $V_i$ in the graph.
Inference in a causal network consists in propagating one or more pieces of certain information (evidence) within the network, in order to deduce how beliefs concerning the other nodes are modified.
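To illustrate what such an inference looks like on a very small network, the sketch below estimates conditional probability tables from a few access log entries and computes the belief in each object identifier once a size and a country code have been observed. Everything in it is an assumption made for the example: the log entries and URIs are invented, the identifier is taken as the only cause of both the object size and the country code, probabilities are plain frequencies without smoothing, and inference is done by direct enumeration rather than by the propagation scheme used in our implementation.

```python
from collections import Counter, defaultdict


def build_model(entries):
    """Count occurrences to estimate P(URI), P(size | URI) and P(country | URI)."""
    uri_count = Counter()
    size_given_uri = defaultdict(Counter)
    country_given_uri = defaultdict(Counter)
    for uri, size, country in entries:
        uri_count[uri] += 1
        size_given_uri[uri][size] += 1
        country_given_uri[uri][country] += 1
    return uri_count, size_given_uri, country_given_uri


def posterior(model, size, country):
    """Return P(URI | size, country) for every URI with a non-zero belief."""
    uri_count, size_given_uri, country_given_uri = model
    total = sum(uri_count.values())
    scores = {}
    for uri, n in uri_count.items():
        p_uri = n / total
        p_size = size_given_uri[uri][size] / n
        p_country = country_given_uri[uri][country] / n
        # Joint probability as the product of each node given its causes.
        scores[uri] = p_uri * p_size * p_country
    norm = sum(scores.values())
    if norm == 0:
        return {}  # the evidence was never observed in the log
    return {uri: s / norm for uri, s in scores.items() if s > 0}


# Hypothetical access log entries: (URI, object size in bytes, country code).
log = [
    ("/index.html", 4200, "FR"),
    ("/index.html", 4200, "FR"),
    ("/index.html", 4200, "FR"),
    ("/index.html", 4200, "US"),
    ("/schedule.html", 4200, "US"),
    ("/schedule.html", 4200, "US"),
    ("/research.html", 9100, "US"),
]
model = build_model(log)
print(posterior(model, 4200, "US"))
print(posterior(model, 4200, "FR"))
```

In this toy log two objects share the same size, so the size alone cannot identify the request; the country code is what shifts the belief towards one identifier or the other.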
If node $X_j$ is located downstream from node $X_i$, we can write:

$$P(X_j \mid X_i) = \sum_{X_{j-1}} P(X_j \mid X_{j-1})\, P(X_{j-1} \mid X_i)$$

If node $X_j$ is a direct descendant of $X_i$ the computation is over. In the other case we can break up the computation until we reach a direct descendant of $X_i$.

If node $X_j$ is located upstream from node $X_i$, it is necessary to propagate the information starting from the beginning of the chain, to know the unconditional probability $P(X_k)$ for each node $X_k$. In order to do so, we can use the property of inversion of the conditional probability:

$$P(X_k \mid X_{k+1}) = \frac{P(X_{k+1} \mid X_k)\, P(X_k)}{P(X_{k+1})}$$

As with the downward propagation, if $X_j$ is a direct ascendant of $X_i$ the computation stops here. In the other case we can perform the same operation on ascendants.

B. Variables Selection

In order to obtain an efficient model, we first performed some aggregation on variables. IP addresses were aggregated into country codes. Seconds, minutes and hours information was aggregated into a single hour variable. Day of the week, month, year information was aggregated into a single day of the week variable.

As the cost of inference in a Bayesian network increases exponentially with the number of variables in the network, it is essential to limit that number. In order to do so, we evaluate the ability of each parameter (size, country, time and date) to explain object identifiers. For each couple of variables (URI, X), we do so by computing $P(URI \cap X)$ and comparing it with $P(URI) \cdot P(X)$ by computing:

$$I(URI, X) = \sum_{i=1}^{N} \left[ P(URI_i \cap X_i) - P(URI_i)\, P(X_i) \right]$$

The ranking between I(URI, X) values led us to the simple Bayesian network presented in fig. 2.

Figure 2. Resulting Bayesian Network (nodes: Identifier, Object Size, Country Code).

V. IMPLEMENTATION AND TESTS

A traffic analyzer was implemented as an extension to IPFilter [7]. The HTTP session handling function is implemented as a part of the TCP state maintenance function. This function extends the TCP connection data structures by allowing multiple HTTP sessions to coexist within a TCP connection. Sessions are delimited as described in section II and specified using the IPFilter filtering policy. When a session ends, the corresponding information (source IP address and port, destination IP address and port, timestamps, number of TCP bytes transported in both directions, number of packets and bytes transported, type of connection) is handed to the kernel syslog part. This information is later exported to the user space and retrieved by RequIn. RequIn is first used to transform timestamps into time and date values as well as IP addresses into country codes. To do so we use a static IP address database for performance reasons. When started, RequIn first uses logs from the server to monitor in order to build the corresponding Bayesian network and method-response code classes. Once these models are built, they can be used to classify requests and infer users' actions.

A. Validation Tests

Our validation tests were performed using our departmental web server. This server runs Apache 1.3, includes roughly 5k objects, most of them being static pages, and receives 7k requests a day. In order to perform consistent tests over a long period of time, a copy of this server was made on a similar computer. This copy was later used for the tests.

In order to check that our server did not have a structure that would have tainted our tests, we performed a comparison between request sizes to our server and the ones usually found on the internet [6]. Fig. 3 shows both cumulative distribution functions. Sizes smaller than 500 bytes have been ignored since the header size distribution is unknown in [6]. Overall there is little difference between the two distributions except in the [10^5; 10^6] range, where the difference should not have a large impact on our scheme.

Figure 3. Responses sizes cumulative distribution.

Using the model defined in section IV, we built a Bayesian network for this web server using a 309k entries log file gathered over 43 days from the original server. In order to test
the ability of the model to predict future requests, we first investigated the influence of time on the estimation accuracy. Fig. 4 provides the evolution of the correct estimate rate over three weeks when using a three weeks log to build the model. As shown in fig. 4, the percentage of correct estimates remains around 75% during roughly 10 days (records 1 to 70k). It then slowly falls to 71% over the next 12 days as new objects are stored in the server.

Figure 4. % of correct estimates over time (prediction accuracy, from 0.69 to 0.76, against entry number in thousands, from 1 to 161).

The validation of the method and response code inference methods was performed using a similar process. Estimation results are provided in table III.

TABLE III. RESULT AND METHOD INFERENCE.

Estimated parameter    % correct estimation
Method                 95
Operation result       96

This first estimation does not take into account the perturbation that might be introduced by the measurement part of RequIn. In order to validate the whole software we generated sequential requests for each object identifier found in the full log file. Requests were analyzed by RequIn, which produced the inferred user actions. These actions were later compared to the original requests. Results are provided in table IV. This test is however biased by two parameters:

- The inference part is unable to take advantage of the country code information. This should decrease the accuracy of the inference.
- The inference process is not affected by aging as the server configuration is static. This should increase the accuracy of the inference.

Given the variation of accuracy over time (fig. 4) we however believe that this last parameter should have a small effect over the first ten days. Consequently we expect to get slightly better results with real life traffic.

TABLE IV. VALIDATION OF WHOLE SOFTWARE.

Scenario            % correct estimation
URI                 74
Method              90
Operation result    94

B. Performance Tests

RequIn was tested on FreeBSD 5.2 on a 2.4 GHz Pentium Xeon processor with a 512 KBytes cache.
During our tests we benchmarked several aspects of the inference process, including the time required to build the models, the size of the models and the time required to infer a request once models are built. For the test we used an access log file including 77k entries to build the inference model. We then used (IP address, object size) couples from a 232k entries log file to perform the performance test. We performed 50 series of tests and averaged the results.

TABLE V. PERFORMANCE RESULTS.

Parameter                   Value
Time to build the models    4 s
Size of the model           1.5 Mbytes
Time per request            0.9 µs

These results (table V) show that our inference process, when used independently from the request-response measurement mechanism, should be able to analyze roughly 1.1M requests per second. Assuming an average Internet HTTP traffic this would allow us to treat a 20 Gb/s full duplex link.

VI. CONCLUSION

In this paper, we introduce a new technique to analyze traffic between clients and web servers. Unlike existing analysis techniques, this proposal provides the ability to trade some accuracy (in terms of what information can be retrieved and the precision of such information) against an increased analysis speed. We think that such analysis speed might be useful against some threats like denial of service attacks where speed is the major concern. This would allow the usage of application level resources to be controlled at the network level. Although our technique is not applicable to every web server (HTTP servers that are large or contain mostly dynamic content), our feeling is that it would work for a large proportion of existing servers, making it useful in practice. We believe our technique could be further improved by looking at HTTP communications at levels other than the micro-session level. We are currently working on DDoS detection methods based on the information inferred by RequIn.

REFERENCES

[1] H. Nielsen et al. Network Performance Effects of HTTP/1.1, CSS1, and PNG. In Proceedings of SIGCOMM 1997, August-September 1997.
[2] G. Navarro and M. Raffinot. Flexible Pattern Matching in Strings. Cambridge Univ.
Press, 2002.
[3] Sprint IP Monitoring project, available at: ipmon.sprint.com/, 2004.
[4] R. Fielding et al. HTTP 1.1, RFC 2616. Internet Engineering Task Force, June 1999.
[5] Balachander Krishnamurthy, Martin Arlitt. PRO-COW: Protocol Compliance on the Web, A Longitudinal Study. USITS '01, March 26-28, 2001.
[6] F. Donelson Smith, F. Hernandez, K. Jeffay, and D. Ott. What TCP/IP Protocol Headers Can Tell Us About the Web. In Proceedings of ACM SIGMETRICS 2001, June 2001.
[7] Darren Reed. IPFilter, available at coombs.anu.edu.au/~avalon/, 2004.