Bachelor Thesis
Summer Semester 2013
at Fachhochschule Frankfurt am Main, University of Applied Sciences
Department of Computer Science and Engineering
towards Bachelor of Science, Computer Science

submitted by Jens Kühnel

Centralized and structured log file analysis with Open Source and Free Software tools

1. Supervisor: Prof. Dr. Jörg Schäfer
2. Supervisor: Prof. Dr. Matthias Schubert

topic received: 11.07.2013
thesis delivered: 30.08.2013
Abstract

This thesis gives an overview of the Open Source and Free Software tools available for centralized and structured log file analysis. This includes the tools that convert unstructured logs into structured logs and the different possibilities to transport these logs to a central analysis and storage station. The different storage and analysis tools are introduced, as well as the different web front ends to be used by the system administrator. At the end, different tool chains that are well tested in this field are introduced.

Revisions

Rev. 269: Official Bachelor thesis sent to FH
Rev. 273: Removal of Affidavit, fix of page number left/right
Table of Contents

1 Introduction
  1.1 Selection criteria
  1.2 Programs that are included in this thesis
  1.3 What this thesis is not covering
    1.3.1 Hadoop
    1.3.2 Programs that are not included in this thesis
  1.4 Structure of this thesis
  1.5 History of log files
2 Definitions
  2.1 Log file
  2.2 Centralized log file
  2.3 Definition structured log files
  2.4 Definition Open Source and Free Software
  2.5 Definition Log File Analysis
3 Components and Functions
  3.1 Formats
    3.1.1 Semi structured logs
      3.1.1.1 BSD syslog (RFC3164)
      3.1.1.2 Modern syslog (RFC 5424)
    3.1.2 Structured logs
      3.1.2.1 CEE
      3.1.2.2 GELF
      3.1.2.3 JSON-logstash
      3.1.2.4 Systemd journal
      3.1.2.5 Windows Event Log
      3.1.2.6 Auditlog
      3.1.2.7 Intrusion Detection Message Exchange Format (IDMEF)
    3.1.3 Other formats
  3.2 Collector/Shipper
    3.2.1 File
    3.2.2 Sockets, named pipes and STDIN
    3.2.3 Local Windows Eventlog
    3.2.4 Compare collector / shipper
  3.3 Transport
    3.3.1 Syslog
    3.3.2 AMQP
    3.3.3 STOMP
    3.3.4 Ømq/ZMTP
    3.3.5 Redis
    3.3.6 Lumberjack
    3.3.7 Remote Windows Eventlog
    3.3.8 Compare Transports
  3.4 Transformation/Normalization
    3.4.1 Pattern-DB
    3.4.2 Liblognorm
    3.4.3 Octopussy
    3.4.4 Grok
    3.4.5 Heka
    3.4.6 Filter_regex
    3.4.7 nxlog
  3.5 Storage
    3.5.1 Log files
    3.5.2 SQL
    3.5.3 NoSQL
    3.5.4 Compare Storage
  3.6 Analysis
    3.6.1 nxlog
    3.6.2 SEC
    3.6.3 Sagan
    3.6.4 Logstash and metrics
    3.6.5 Graylog2
  3.7 Visual output
4 Tools
  4.1 Multi purpose tools
    4.1.1 Syslog-ng
    4.1.2 Rsyslog
    4.1.3 Graylog2
    4.1.4 Logstash
    4.1.5 Node-Logstash
    4.1.6 ELSA
    4.1.7 octopussy
    4.1.8 nxlog
    4.1.9 Heka
  4.2 Output
    4.2.1 Webpage
      4.2.1.1 LogAnalyzer
      4.2.1.2 Kibana 2
      4.2.1.3 Kibana 3
    4.2.2 Graphs
      4.2.2.1 StatsD
      4.2.2.2 Graphite
      4.2.2.3 Fnordmetric
        4.2.2.3.1 Fnordmetric Classic
        4.2.2.3.2 Fnordmetric Enterprise
        4.2.2.3.3 Fnordmetric UI
  4.3 Storage
      4.3.1.1 mysql
      4.3.1.2 MongoDB
      4.3.1.3 ElasticSearch
  4.4 Transports
    4.4.1 redis
    4.4.2 rabbitmq
    4.4.3 ActiveMQ
    4.4.4 Ømq
  4.5 Collector/Shipper
    4.5.1 Fluentd
    4.5.2 flume
    4.5.3 awesant
    4.5.4 beaver
    4.5.5 lumberjack
    4.5.6 eventlog-to-syslog
    4.5.7 woodchuck
    4.5.8 ncode/logix
    4.5.9 syslog-shipper
    4.5.10 remote_syslog
    4.5.11 systemd/journal2gelf
  4.6 Analysis
5 Toolchains
  5.1 Possible toolchains
  5.2 Toolchain Features
    5.2.1 Accepting structured log files
    5.2.2 Reliable transport
    5.2.3 High availability
    5.2.4 User separation and LDAP
    5.2.5 Size of rule base
    5.2.6 Log Analysis
    5.2.7 Install
    5.2.8 Speed
  5.3 Summary
6 Conclusion
  6.1 Short summary about every major tool
  6.2 Future
  6.3 Optimal toolchain
Bibliography

ActiveMQ-Cluster: The Apache Software Foundation, Features > Clustering, 2011, https://activemq.apache.org/clustering.html, retrieved: 20.08.2013
ActiveMQ-Features: The Apache Software Foundation, Connectivity > Cross Language Clients, 2011, https://activemq.apache.org/cross-language-clients.html, retrieved: 20.08.2013
ActiveMQ-SSL: The Apache Software Foundation, How do I use SSL, 2011, https://activemq.apache.org/how-do-i-use-ssl.html, retrieved: 20.08.2013
AMQP: OASIS, AMQP A General-Purpose Middleware Standard, 2011, http://www.amqp.org/specification/0-10/amqp-org-download, retrieved: 09.08.2013
CBE: IBM, Understanding Common Base Events Specification V1.0.1, 2004, retrieved: 08.08.2013
CEEFields: MITRE Corporation, CEE Core Field Dictionary, 2012, retrieved: 13.08.2013
Churilin2013: Artyom Churilin, Choosing an Open-Source Log Management System for Small Business, 2013, http://lab.cs.ttu.ee/dl135, retrieved: 20.08.2013
Chuvakin2008: Dr. Anton A. Chuvakin, CEE Logging Standard, 2008, http://de.slideshare.net/anton_chuvakin/cee-logging-standard-today-and-tomorrow, retrieved: 09.08.2013
Chuvakin2013: Dr. Anton A. Chuvakin, Logging and Log Management, 2013, ISBN: 978-159749-635-3
Czanik2013: Peter Czanik, PatternDB git moved and updated, 2013, https://czanik.blogs.balabit.com/2013/05/patterndb-git-moved-and-updated/, retrieved: 08.08.2013
ELSA-UserGuide: User Guide for ELSA, 2013, http://code.google.com/p/enterprise-log-search-and-archive/wiki/documentation, retrieved: 24.08.2013
ELSAQuickstart: Martin Holste, ELSA Quickstart, 2011, http://code.google.com/p/enterprise-log-search-and-archive/wiki/quickstart
FLUENTD-FAQ: unknown, FAQ, 2013, http://docs.fluentd.org/articles/faq, retrieved: 21.08.2013
FlumeUserGuide: The Apache Software Foundation, Flume 1.4.0 User Guide, unknown, https://flume.apache.org/flumeuserguide.html, retrieved: 21.08.2013
FreeSoftware: Free Software Foundation, The Free Software Definition, 2013, https://www.gnu.org/philosophy/free-sw.html, retrieved: 10.08.2013
GELF: Lennart Koopmann, Graylog Extended Log Format, 2011, https://github.com/graylog2/graylog2-docs/wiki/gelf, retrieved: 09.08.2013
Gerhards2007: Rainer Gerhards, why does the world need another syslogd? (aka rsyslog vs. syslog-ng), 2007, http://blog.gerhards.net/2007/08/why-does-world-need-another-syslogd.html, retrieved: 08.08.2013
Gerhards2008: Rainer Gerhards, why you can't build a reliable TCP protocol without app-level acks..., 2008, http://blog.gerhards.net/2008/05/why-you-cant-build-reliable-tcp.html, retrieved: 08.08.2013
Gerhards2011: Rainer Gerhards, Log Normalization Systems and CEE Profiles, 2011, retrieved: 07.08.2013
Gerhards2011-2: Rainer Gerhards, Using rsyslog mmnormalize module effectively with Adiscon LogAnalyzer, 2011, http://www.rsyslog.com/using-rsyslog-mmnormalize-module-effectively-with-adiscon-loganalyzer/, retrieved: 08.08.2013
Gerhards2013: Rainer Gerhards, How to sign log messages through signature provider Guardtime, 2013, http://www.rsyslog.com/how-to-sign-log-messages-through-signature-provider-guardtime/, retrieved: 08.08.2013
Gerhards2013-2: Rainer Gerhards, rsyslog's first signature provider: why Guardtime?, 2013, http://blog.gerhards.net/2013/05/rsyslogs-first-signature-provider-why.html, retrieved: 08.08.2013
Gheorghe2012: Radu Gheorghe, Using Elasticsearch for logs, 2012, http://www.elasticsearch.org/tutorials/using-elasticsearch-for-logs/, retrieved: 08.08.2013
Gilcher2012: Florian Gilcher, ElasticSearch pre-flight checklist, 2012, http://asquera.de/opensource/2012/11/25/elasticsearch-pre-flight-checklist/, retrieved: 20.08.2013
Guzdial1993: Mark Guzdial, Deriving Software Usage Patterns from Log Files, 1993
HekaIntro: Mozilla Foundation, Introducing Heka, 2013, https://blog.mozilla.org/services/2013/04/30/introducing-heka/, retrieved: 22.08.2013
Hintjens2013: Pieter Hintjens, Code Connected Volume 1 - Learning ZeroMQ, 2013, ISBN: 1481262653
Holste2011: Martin Holste, Fighting APT with Open-source Software, Part 1: Logging, 2011, http://ossectools.blogspot.de/2011/03/fighting-apt-with-open-source-software.html, retrieved: 19.06.2013
Hwang2011: Eric Hwang, Sam Rash, Data Freeway: Scaling out to Realtime, 2011, http://www.slideshare.net/slash/2011-0630hadoopsummit-v5-8469751#btnnext, retrieved: 18.08.2013
JOURNALFIELDS: Lennart Poettering, systemd.journal-fields Special journal fields, 2012, http://www.freedesktop.org/software/systemd/man/systemd.journal-fields.html, retrieved: 13.08.2013
JOURNALJSON: Joe Rayhawk, Nis Martensen, Journal JSON Format, 2013, http://www.freedesktop.org/wiki/software/systemd/json/, retrieved: 07.08.2013
Köbschall: Dr. Gerd Köbschall, personal interview on 22.08.2013, 2013
Malpass2011: Ian Malpass, Measure Anything, Measure Everything, 2011, http://codeascraft.com/2011/02/15/measure-anything-measure-everything/, retrieved: 09.08.2013
MSEventLog: Microsoft, MSDN Event Logging, 2013, http://msdn.microsoft.com/en-us/library/aa363652.aspx, retrieved: 10.08.2013
MSEVENTSCHEMA: unknown / Microsoft, Windows Event Schema, 2013, http://msdn.microsoft.com/en-us/library/windows/desktop/aa385201%28v=vs.85%29.aspx, retrieved: 14.08.2013
nxlog: Botond Botyanszki, NXLOG Community Edition Reference Manual for v2.5.1089, 2009, http://nxlog-ce.sourceforge.net/nxlog-docs/en/nxlog-reference-manual.html, retrieved: 20.08.2013
nxlog-var-warning: Botond Botyanszki, NXLOG Community Edition Reference Manual for v2.5.1089, 2009, http://nxlog-ce.sourceforge.net/nxlog-docs/en/nxlog-reference-manual.html#lang_variable_example_co_note, retrieved: 20.08.2013
OctopussyInstallation: unknown, Octopussy Installation, http://8pussy.org/documentation/guides/administrator_guide/01_installation, retrieved: 20.08.2013
Ømq: Pieter Hintjens, ØMQ - The Guide, 2013, http://zguide.zeromq.org/page:all, retrieved: 16.08.2013
OpenSource: Open Source Initiative, The Open Source Definition, unknown, http://opensource.org/osd, retrieved: 05.08.2013
OSArch2012: Amy Brown, Greg Wilson, The Architecture Of Open Source Applications, 2012, ISBN: 978-1257638017
Poettering2012: Lennart Poettering, Forward Secure Sealing (FSS) is finally coming to +systemd's journal., 2012, https://plus.google.com/115547683951727699051/posts/g1e6axvktyc, retrieved: 08.08.2013
RabbitMQ: GoPivotal, Inc., What can RabbitMQ do for you?, unknown, http://www.rabbitmq.com/features.html, retrieved: 20.08.2013
RabbitMQ-SSL: GoPivotal, Inc., SSL Support, unknown, http://www.rabbitmq.com/ssl.html, retrieved: 20.08.2013
Redis: unknown, Introduction to Redis, unknown, http://redis.io/topics/introduction, retrieved: 20.08.2013
redis-security: unknown, Redis Security, unknown, http://redis.io/topics/security, retrieved: 20.08.2013
RELP: Rainer Gerhards, RELP - The Reliable Event Logging Protocol, 2008, http://www.rsyslog.com/doc/relp.html, retrieved: 08.08.2013
RFC3164: C. Lonvick, RFC3164: The BSD syslog Protocol, 2001, retrieved: 08.08.2013
RFC3339: G. Klyne, C. Newman, Date and Time on the Internet: Timestamps, 2002, retrieved: 08.08.2013
RFC4627: D. Crockford, The application/json Media Type for JavaScript Object Notation (JSON), 2006, retrieved: 07.08.2013
RFC4765: H. Debar, D. Curry, B. Feinstein, The Intrusion Detection Message Exchange Format (IDMEF), 2007, retrieved: 07.08.2013
RFC5424: R. Gerhards, RFC5424: The Syslog Protocol, 2009, retrieved: 08.08.2013
RFC5426: A. Okmianski, Transmission of Syslog Messages over UDP, 2009, retrieved: 09.08.2013
SEC: Dr. Risto Vaarandi, SEC - simple event correlator, 2013, http://simple-evcorr.sourceforge.net/, retrieved: 08.08.2013
Seguin2013: Karl Seguin, The Little Redis Book, 2013, http://openmymind.net/2012/1/23/the-little-redis-book/, retrieved: 14.08.2013
Shao2011: Zheng Shao, Real-time Analytics at Facebook, 2011, http://conf.slac.stanford.edu/xldb2011/talks/xldb2011_tue_0940_facebookrealtimeanalytics.pdf, retrieved: 18.08.2013
Sissel2012: Jordan Sissel, Proposal: new logstash event schema, 2012, https://logstash.jira.com/browse/logstash-675, retrieved: 16.08.2013
Sissel2013: Jordan Sissel, lumberjack, https://github.com/jordansissel/lumberjack, retrieved: 10.08.2013
Sissel2013-2: Jordan Sissel, MITRE's CEE is a failure for profit., 2013, http://www.semicomplete.com/blog/geekery/cee-logging-for-profit.html, retrieved: 07.08.2013
SLES2013: SuSE Enterprise Team, Release Notes for SUSE Linux Enterprise Server 11 Service Pack 2, 2013, https://www.suse.com/releasenotes/x86_64/suse-sles/11sp2/#deprecated.future, retrieved: 17.08.2013
STOMP: unknown, STOMP Protocol Specification, Version 1.2, 2012, retrieved: 08.08.2013
Turnbull2013: James Turnbull, The Logstash Book, 2013, http://www.logstashbook.com/, retrieved: 17.08.2013
ULM: J. Abela, T. Debeaupuis, Universal Format for Logger Messages, 1999, https://tools.ietf.org/html/draft-abela-ulm-05, retrieved: 12.08.2013
Vaarandi2012: Dr. Risto Vaarandi, Dr. Michael R. Grimaila, Security Event Processing with Simple Event Correlator, 2012, http://ristov.users.sourceforge.net/publications/sec-issa2012.pdf, retrieved: 14.08.2013
Valdman2001: Jan Valdman, Log File Analysis, 2001
XML Format: Extensible Markup Language (XML), 2013
ZMTP: imatix, 15/ZMTP - ZeroMQ Message Transport Protocol, 2012, http://rfc.zeromq.org/spec:15, retrieved: 16.08.2013
ZMTP-CURVE: Pieter Hintjens, 26/CurveZMQ Authentication and Encryption Protocol, 2013, http://rfc.zeromq.org/spec:26, retrieved: 17.08.2013
Illustration Index

Illustration 1: Log Infrastructure
Illustration 2: Octopussy rule creation
Illustration 3: rsyslog in/out plugins
Illustration 4: Graylog2 web page
Illustration 5: Logstash web page
Illustration 6: ELSA web page
Illustration 7: Octopussy home page
Illustration 8: LogAnalyzer web page
Illustration 9: Kibana 2 web page
Illustration 10: Kibana 3 web page
Illustration 11: Elasticsearch with HEAD plugin
Illustration 12: Possible toolchains (red=storage, yellow=normalize, white=webpages, blue=shipper)

Index of Tables

Table 1: Tools used in this thesis
Table 2: Tools not used in this thesis
Table 3: collector/shipper Overview
Table 4: Transport Overview
Table 5: Feature: accepting structured log files
Table 6: Feature: reliable transport
Table 7: Feature: high availability
Table 8: Feature: User separation and LDAP
Table 9: Feature: size of rule base
Table 10: Feature: log analysis
Table 11: Feature: easy to install
Table 12: Feature: overview
Table 13: Total Overview: Part 1
Table 14: Total Overview: Part 2
1 Introduction

Log files are a central part of the work of a system administrator. Whenever something goes wrong, the first look is normally into one log file or another. For something so fundamental to the proper working and management of a network of computers, it is fascinating how few tools are available and how unstructured log files really are. For a long time the only practically useful "de facto standard" was [RFC3164], which was written to describe the syslog protocol that was and is used in different Unix systems. It offers UDP transmission to a central log file server. Syslog only transmits unstructured or semi-structured messages.

In the last couple of years a strong movement has come up to put log files onto a more structured path. It also has a strong emphasis on modern tools like NoSQL and JSON. This thesis gives an overview of the current state of structured log files. It shows the different formats that are used today, which tools can help with the creation of a structured log infrastructure, and what is still missing or can be improved.

1.1 Selection criteria

This thesis aims to give an overview of the available Open Source and Free Software tools, using the following criteria for selecting the right tools. The criteria are based on the requirements of multiple companies the author has worked for, past and present. The selection is aimed at medium-sized and large companies.

- Combines data from many sources and different formats. The enormous number of log file analyzers that can only analyze one log file format is ignored in this thesis. This includes tools like AWStats, analog etc. that only analyze Apache logs.
- Easy to use for average system administrators (Windows and Linux). This necessitates usable documentation.
- Has to be structured to ease analysis.
- Generates structured log files out of unstructured text log files, for an easy transition from unstructured to structured logs.
- Secure and reliable transport (lost messages are to be avoided, but message delivery does not need to be guaranteed).
- Data should be stored and processed redundantly to avoid a single point of failure.
- Fast enough to get the data from thousands of machines on a "normal" server (8-16 cores, 16-64 GB RAM).
- Contains only Open Source and Free Software that can run under direct control of the user / administrator. All parts of the system must conform to this rule. Only with Open Source and Free Software is it possible to audit the software in use and check it for compliance with existing regulations. Log files can contain personal information and should not be stored in the cloud for privacy reasons.
This thesis does not include OpenCore software. OpenCore software is offered as Free and Open Source Software, with special features only available in a closed source version, often called "Pro" or "Enterprise". Withholding these features can make the open version impossible to work with. Encryption in particular is not an optional feature. An OpenCore product will not accept a new feature like encryption into the open version, because it would hurt the sales of the closed product. A real Open Source project will gladly accept code donations to support a new feature like encryption.

- The tools have to analyze the data "on the fly". An analysis in a nightly batch job is too slow in a modern IT world.
- Active development is necessary for all parts of the system. There are a lot of dead Open Source programs available, but without active development no new features and bug fixes will become available. Of course anyone could take the code and continue to develop it, but the sheer number of such projects does not allow including them in this thesis. In this thesis a project is considered dead if it has had no release in two years, or no commit to its code management system in one year.
1.2 Programs that are included in this thesis

| Program | Language | Stable version | URL | License | Function |
|---|---|---|---|---|---|
| logstash | Java/Ruby | 1.1.13 | http://logstash.net/ | Apache 2.0 | TCNASO |
| graylog2 | Java/Ruby | 0.12.0 | http://graylog2.org | GPLv3 | ASO |
| rsyslog | C | 7.4.0 | http://rsyslog.com/ | GPLv3/LGPL | TCN |
| syslog-ng | C | 3.4 | http://www.balabit.com/network-security/syslog-ng/opensource-logging-system/ | LGPLv2.1/GPLv2 | TCN |
| node-logstash | Javascript | 0.0.2 | https://github.com/bpaquet/node-logstash | Apache 2.0 | TCNO |
| octopussy | perl | 1.0.10 | http://8pussy.org/ | GPLv2/GPLv3 | NSAO |
| nxlog | C | 2.5.1089 | http://nxlog-ce.sourceforge.net/ | GPLv2/LGPLv2 | CAN |
| Heka | Go | 0.3.0 | https://github.com/mozilla-services/heka | MPL v2.0 | CN |
| woodchuck | Ruby | 0.0.1 | https://github.com/danryan/woodchuck | MIT | C |
| awesant | perl | 0.10 | https://github.com/bloonix/awesant | GPL | C |
| beaver | Python | 30 | https://github.com/josegonzalez/beaver/releases | MIT | C |
| lumberjack | Ruby/ (C go) | 0.2.0 | https://github.com/jordansissel/lumberjack/releases | Apache 2.0 | C |
| syslog-shipper | Ruby | | https://github.com/jordansissel/syslog-shipper | BSD | C |
| remote_syslog | Ruby | 1.6.4 | https://github.com/papertrail/remote_syslog | BSD | C |
| fluentd | Ruby | 1.1.15 | http://fluentd.org/ | Apache 2.0 | C |
| flume | Java | 1.4.0 | https://flume.apache.org/ | Apache 2.0 | C |
| ncode/logix | Python | 0.1.4-6 | https://github.com/ncode/logix | Apache 2.0 | C |
| systemd/journal2gelf | Python | | https://github.com/systemd/journal2gelf | BSD | C |
| eventlog-to-syslog | C++ | 4.4.3 | http://code.google.com/p/eventlog-to-syslog/ | BSD | C |
| elasticsearch | Java | 0.90.3 | http://www.elasticsearch.org/ | Apache 2.0 | S |
| mongodb | C++ | 2.4.6 | http://www.mongodb.org/ | AGPLv3 / Apache 2.0 | S |
| redis | C | 2.6.15 | http://redis.io/ | BSD | T |
| rabbitmq | Erlang | 3.1.5 | http://www.rabbitmq.com/ | MPL v1.1 | T |
| activemq | Java | 5.8.0 | https://activemq.apache.org/ | Apache 2.0 | T |
| 0MQ | C++ | 3.2.3 | http://zeromq.org/ | LGPLv3+ | T |
| SEC | perl | 2.7.4 | http://simple-evcorr.sourceforge.net/ | GPLv2 | A |
| Sagan | C | 0.3.0 | http://sagan.quadrantsec.com/ | GPLv2 | A |
| StatsD | Javascript | 0.6.0 | https://github.com/etsy/statsd/ | MIT | O |
| Graphite | Python | 0.9.10 | http://graphite.wikidot.com/ | Apache 2.0 | O |
| Fnordmetric | Java/Ruby | 0.5.1 | http://fnordmetric.io/ | MIT | O |
| Kibana2 | Ruby | 0.2.0 | http://kibana.org/ | BSD | O |
| Kibana3 | Javascript | 3.0.0m3 | http://three.kibana.org/ | Apache 2.0 | O |
| LogAnalyzer | PHP | 3.6.4 | http://loganalyzer.adiscon.com/ | GPLv3 | O |

Table 1: Tools used in this thesis

About column "Function": T=Transport, C=Collector, N=Normalization, A=Analyze, S=Storage, O=Output
About column "License": The author is not a lawyer, and the licenses shown here are the ones stated on the website or in the README or LICENSE file. I did not check every file and this is not a license analysis.

1.3 What this thesis is not covering

This thesis does not cover monitoring for availability or performance. The chance that a process dies without an error message that could be analyzed is much too high. For that kind of monitoring I suggest tools like the Open Monitoring Distribution (OMD), Nagios or Zabbix.

Also not included in this thesis is compliance with different laws and regulations like PCI DSS, FISMA and HIPAA, or with best practice frameworks such as ISO 27000 and COBIT. For more information see Chapter 19 of [Chuvakin2013]. Compliance with data protection laws is not covered either, even though tools for some data anonymization are occasionally shown.

Tools that are designed only to work with Intrusion Detection Systems and use log file analysis only as a small part of a larger design are not described here. For this reason the tools OSSIM and ossec.net are not included.

1.3.1 Hadoop

The Hadoop ecosystem with the Hadoop Distributed File System (HDFS) is the reference for Open Source big data management. HDFS is a distributed and scalable filesystem based on the Hadoop infrastructure. It allows the MapReduce mechanism to bring computation and data storage close together, often on the same machine. To analyze, query and summarize the data, the Hive data warehouse and the HBase non-relational database can be used. Hadoop is an Apache project and uses the Java platform. Access to the data is available from different programming languages. Setup and programming are quite complicated compared to other solutions covered in this thesis, but Hadoop can handle much bigger datasets of multiple petabytes.

It is possible to use Hadoop to create a log analysis infrastructure, but building a Hadoop infrastructure simply for log file analysis is oversized.
There is a Hadoop subproject named Chukwa that creates a log analysis platform on top of Hadoop, but this project is almost dead or dying, with no mail on the mailing list for 6 months and only a few one- and two-line bug fixes from one developer in 2013. This does not meet the definition of a dead project, but it is nevertheless not included in this thesis because it requires Hadoop.

Scribe, a tool run on top of Hadoop, was used by Facebook to analyze log files, but this project was apparently abandoned by Facebook and replaced by a closed source tool called Calligraphus [Hwang2011] [Shao2011].

Therefore Hadoop based solutions are not covered in this thesis, including Hive, HBase, Hadoop Distributed Filesystem, Thrift, Avro, OpenTSDB and Pig. The Apache project Flume is included because it can write not only to Hadoop, but also to other data storages.

1.3.2 Programs that are not included in this thesis

The following programs are not included in this thesis.
| Program | URL | Reason for exclusion |
|---|---|---|
| splunk | http://www.splunk.com/ | Not Open Source/Free Software |
| loggly | http://loggly.com | cloud based, not Open Source/Free Software |
| ntsyslog | http://ntsyslog.sourceforge.net/ | dead project, last release 2007 |
| OSSIM | http://www.alienvault.com/open-threat-exchange/projects | log management only in closed source version, open core |
| Sguil | http://sguil.sourceforge.net/downloads.html | dead project, last release 2011 |
| logsurfer | http://www.crypt.gen.nz/logsurfer/ | dead project, last release 2011 |
| scribe | https://github.com/facebook/scribe/ | dead project, abandoned by Facebook |
| loghound | http://ristov.users.sourceforge.net/loghound/ | dead project, last release 2004 |
| Snare | http://www.intersectalliance.com/snareagents/index.html | OpenCore, encryption only in closed source version |
| Bevis | https://github.com/bkjones/bevis | dead project, last commit Feb. 2012 |
| Project Lasso | http://sourceforge.net/projects/lassolog/ | dead project, last release 2008, creator under new ownership |
| Riemann | http://riemann.io/ | only taking logs created by own log library |
| gelfino | https://github.com/narkisr/gelfino | dead project, last commit 2012 |
| ossec.net | http://www.ossec.net/ | log management only a very small part, no structured log files |
| Hadoop Tools | https://hadoop.apache.org/ | See chapter 1.3.1 Hadoop |
| chukwa | https://wiki.apache.org/hadoop/chukwa | See chapter 1.3.1 Hadoop |
| scribe | https://github.com/facebook/scribe | See chapter 1.3.1 Hadoop |
| Thrift | https://thrift.apache.org/ | See chapter 1.3.1 Hadoop |
| Avro | https://avro.apache.org/ | See chapter 1.3.1 Hadoop |
| OpenTSDB | http://opentsdb.net/ | See chapter 1.3.1 Hadoop |
| dendrite | https://github.com/onemorecloud/dendrite/graphs/commit-activity | No usable documentation, this project was only active for 3 weeks, with 8 commits altogether |
| loges | https://github.com/aaddon/loges | No usable documentation |
| logtail | https://github.com/shtouff/logtail | No license attached |
| Graphtastic | https://github.com/nickpadilla/graphtastic | No license, no commit in over a year |

Table 2: Tools not used in this thesis

1.4 Structure of this thesis

This thesis is divided into 6 main chapters. The first chapter, Introduction, is the current chapter. It contains the selection criteria that have been used to select the programs and a short history of log files. The chapter Definitions defines the basics necessary to understand the rest of the thesis. The chapter Components and Functions shows the different parts that are needed for a centralized and structured log file analysis. The chapter Tools shows the different tools that are available in the Free Software and Open Source world and the functionalities they offer. In the chapter Toolchains the different tools are put together to create some examples of a centralized and structured log analysis system. The chapter Conclusion closes the thesis with a conclusion and an attempt to get a glimpse into the future of structured log file analysis.

1.5 History of log files

Long before computers, parish registers, in which all baptisms, marriages and burials are recorded, could be considered one of the first log files. A lot of similarities can be found here: one line per entry, append only, and a semi-structured format.

The name log comes from the nautical log, a device that was used to measure the speed of a boat. The measurements were written into a log book to get an overview of the progress of the journey.

In computer science the first computers used small light bulbs to show the status of the machine, and a good operator could look at the "blinkenlights" and know the problem. When hard copy terminals became widespread, the first terminal, called the console, was used to print out the status messages of the system. When hard discs were introduced, log files came into existence.

According to Dr. Gerd [Köbschall], Head of Department for Cash & Derivatives IT Operations at Deutsche Börse AG, Eschborn: "I was able to detect that the machine crashed on the changed blinking rhythm on a HP 3000.
When I was working on a Control Data Corporation 1700 (created 1966) it used a teletype as a console and normally it would ring a bell when a crash occurred. Later in 1977 when OpenVMS was developed, it wrote the status messages not only to the hard copy terminal, but to the OpenVMS operator log and still does that on the current OpenVMS machines that run the Xetra systems, powering the Frankfurt Stock Exchange".

One of the first standardizations in computer log files was the creation of syslog in 1980 by Eric Allman [OSArch2012]. Initially created as a log mechanism for sendmail, it was later introduced into the BSD distribution and became the de facto standard for logging on all Unix systems. But not all Unix programs write their log messages with syslog. On a modern Linux system apache, exim and samba are three examples of programs that write their own files by default, primarily for performance reasons. A lot of services still use syslog, including most mail servers, cron, pam, inetd and ntp.

The first successful large scale introduction of structured log files was the development of the Windows Eventlog with Windows NT 3.1. The Windows Eventlog is based on a binary format, but can be queried with a .net-based interface [MSEventLog]. In Windows Vista and Windows Server 2008 the Windows Event logging API was replaced by the Windows Event log API, extending the possibilities of the API.
2 Definitions

2.1 Log file

The definition of a log file from Mark Guzdial [Guzdial1993], "discrete recordings of user actions during software use", is not universal enough, because it does not fit most log files on Unix or Windows based systems, where hardware and system messages are also stored in log files.

Jan Valdman in [Valdman2001] uses a much wider definition: "Current software application often produce (or can be configured to produce) some auxiliary text files known as log files". This is better suited to what an average system administrator understands by a log file, but is very vague about what is really stored in log files.

Dr. Chuvakin in [Chuvakin2013] on page 2 defines it as: "A log messages is what a... device... generates in response to some sort of stimuli". The same author used a different definition in a company presentation for LogLogic in 2008 [Chuvakin2008]: "Log = message generated by an IT system to record whatever event happening".

The logstash developer Jordan Sissel defines a log message as "timestamp plus data" in [Sissel2012]. This may be a simplified view, but for the development of a log file analysis tool it is the only thing that is consistent across all log files.

My definition is more developer-oriented, because very often the reason for writing a log message is not understandable from the outside: "A log file contains the information the developer of an application thought to be helpful and interesting in the current state of the software, together with the timestamp when this state occurred."

Log data, log entries and log messages are all different names for the content of a log file. Most log entries are contained in one line, but that does not hold for all log entries, for example Python or Java stack traces.

2.2 Centralized log file

The definition of log file contains the word "file". For most computer users this suggests a normal file on the filesystem. A program that writes to a normal log file should only append data to it and never change it after it is written.
Security extensions like SELinux or Windows file permissions make it possible to enforce the append-only limitation on a log file.

A long standing tradition in Unix networks is syslog. Syslog was written in 1980 by Eric Allman as a log mechanism for the famous sendmail program [OSArch2012]. An early addition was the ability to send log information to a central syslog server in order to have a centralized view of all information.

"Centralized" in this thesis is defined as a way to see and query all log files of a defined group of machines (normally all machines managed by one group of people) in one webpage, database or filesystem. The log files do not need to be, and should not be, stored on a single machine. Instead the data should be stored on multiple machines for redundancy, but the data should be the same on all machines.

2.3 Definition structured log files

Eric Allman in [OSArch2012], chapter 17.8.2, also wrote that he thinks syslog was very well designed, but specified the one thing he should have changed: "I would pay more attention to making the syntax of logged messages machine parseable - essentially, I failed to predict the existence of log monitoring."
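What "machine parseable" means in practice can be illustrated with a minimal sketch. The two sample messages below are invented for illustration and are not taken from any of the tools discussed in this thesis; the pattern is only one assumed way to split a BSD-syslog-style line. The sketch shows that a semi-structured line needs a hand-written pattern and still leaves details such as the IP address buried in the free text, while a JSON record exposes every field by name.

```python
import json
import re

# Hypothetical sample messages, invented for illustration only.
SYSLOG_LINE = "<34>May 23 21:22:23 host1 sshd[4711]: Failed password for root from 10.0.0.5"
JSON_LINE = (
    '{"timestamp": "2013-05-23T21:22:23+02:00", "host": "host1", '
    '"program": "sshd", "message": "Failed password for root", "ip": "10.0.0.5"}'
)

# A hand-written pattern is needed to split the semi-structured line into fields.
SYSLOG_PATTERN = re.compile(
    r"^<(?P<pri>\d+)>"
    r"(?P<timestamp>[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<program>[^\[:]+)(?:\[(?P<pid>\d+)\])?: "
    r"(?P<message>.*)$"
)


def parse_syslog(line):
    """Split a BSD-syslog-style line into fields, or return None if it does not match."""
    match = SYSLOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # The priority value encodes facility * 8 + severity.
    pri = int(fields.pop("pri"))
    fields["facility"], fields["severity"] = divmod(pri, 8)
    return fields


def parse_json(line):
    """A structured record needs no format-specific pattern at all."""
    return json.loads(line)


if __name__ == "__main__":
    print(parse_syslog(SYSLOG_LINE))
    print(parse_json(JSON_LINE))
```

In the syslog case the source address is still part of the free-text message and would need a second, application-specific pattern; in the JSON case it is directly available as the field "ip".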
The log messages are sent and stored in different log formats. As Dr. Chuvakin wrote in a presentation in 2008 [Chuvakin2008]: "log format=layout of the log messages in the form of fields, separators, delimiters, tag etc."

The log file formats can be sorted into four categories as defined by R. Gerhards in [Gerhards2011]:

- "Semi-structured": In this form the log entries are still largely free form. Some structure is already included, but there is "no clear distinction between field values, field delimiters and noise data".
- "weakly structured": In this form the log entries are "like CSV-based formats, needs external information to understand fields".
- "strongly structured or full structured": In a strongly structured log file all information is stored in a structure of the format and "structured data, field names and values are provided". Possible formats are XML and JSON, among others.
- "Strongly structured based on a validating profile": Even a strongly structured file format can be refined when the "field names, values and semantics are provided" and can be checked.

Typical problems with structured logs without validation profiles are, for example:

- undefined field names: "ip" or "ip-address" or "ipaddress"
- format: date as defined in rfc3164 (May 23 21:22:23) or in rfc3339/iso8601 (2013-05-23T21:22:23.24+02:00)
- datatype: host containing only IP addresses, or hostnames, or both

There should be a fifth category that was not defined by R. Gerhards: "unstructured" log files. Such unstructured log files are written by the Linux and the Darwin kernel, for example. Here no structure whatsoever is found.

The different formats that are in use in the field are described later in this thesis.

2.4 Definition Open Source and Free Software

This thesis only looks at tools that are both Open Source Software as defined by the Open Source Initiative [OpenSource] and Free Software as defined by the Free Software Foundation [FreeSoftware]. The reason for that is not only price, as is often perceived by managers and accountants, but the possibility to extend the software as the user sees fit.
It also avoids the pitfalls of proprietary software, like the removal of software products by the manufacturer, which forces a replacement of the software because of missing support or licenses, or license restrictions based on log sizes.
2.5 Definition Log File Analysis

Log file analysis is the process of extracting usable information from the log files. This can be statistical analysis, like the percentage of error messages per host, how many mails were sent or how often someone tried to guess a password via ssh. Also some dependency or correlation analysis could be done, like detecting a user who tried to log in two times unsuccessfully and succeeded on the third attempt. A third possibility is a Bayesian analysis to detect unusual error messages.
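The statistical analysis mentioned above can be sketched in a few lines of Python. This is only an illustration: the sample lines, the regular expression and the function names are assumptions made for this example and are not taken from any of the tools discussed in this thesis.

```python
import re
from collections import Counter

# Hypothetical sample lines in the style of OpenSSH log messages.
SAMPLE_LOG = [
    "sshd[1738]: Failed password for root from 172.16.242.1 port 50447 ssh2",
    "sshd[1739]: Failed password for root from 172.16.242.1 port 50448 ssh2",
    "sshd[1740]: Accepted password for root from 172.16.242.1 port 50449 ssh2",
    "sshd[1741]: Failed password for admin from 10.0.0.5 port 40000 ssh2",
]

# Assumed message layout; real sshd messages may differ between versions.
FAILED = re.compile(r"Failed password for (?:invalid user )?(\S+) from (\S+) port \d+")


def failed_logins_per_source(lines):
    """Count failed password attempts per source address."""
    counts = Counter()
    for line in lines:
        match = FAILED.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


def failure_percentage(lines):
    """Return the share of lines that report a failed password, in percent."""
    if not lines:
        return 0.0
    failed = sum(1 for line in lines if FAILED.search(line))
    return 100.0 * failed / len(lines)
```

Calling failed_logins_per_source(SAMPLE_LOG) gives one counter entry per source address, which is the kind of aggregate a statistical analysis step would report.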
3 Components and Functions

To create a "centralized and structured log file analysis with Free Software and Open Source tools" a lot of programs have to interact. In this chapter the different components that are necessary will be explained.

Illustration 1: Log Infrastructure

Illustration 1 shows the different ways a log message can go from the program that created the log file to the Storage and Visual Output. The first step is the creation of a log message by the program, also known as the log source. The log message can be created in a lot of different formats; in the chapter Formats the most common ones will be introduced.

In the chapter Transport the different mechanisms to send structured and unstructured messages are introduced. The most commonly used tool for storing and transporting log messages on Unix based systems is syslog. This tool is both a log format and a transport format in one.

Not all programs can send the log data using syslog. Some are only able to write into a log file on the filesystem. The Collector/Shipper will read the data from disk and forward it to the transporter or directly to the central hub.

The Transformation/Normalization step will take the log data that is not in a predefined structure and convert it into the requested structure. This can be done centrally, or on the same machine where the log source is running.

The Analysis phase tries to find problems, attacks and other abnormalities inside the log data.

The Storage will save the log data and normally index it to speed up searches. This can be done by traditional SQL systems as well as NoSQL systems.

The Visual output includes graphs, but also webpages to simplify the searching and analyzing of the data.
3.1 Formats

The main problem of log files, beside the huge amount of log data that is created by all the different programs in a company, is the large number of different log formats. The huge amount of unstructured log files can only be handled each by itself, but the semi structured and structured log files make it possible to create rules and transformations to convert them all into a unified format.

Most of the structured log files are based on one of the existing file formats JSON or XML. The [XML Format] was created by the World Wide Web Consortium, but is often considered to be too complex and hard to read for humans. The JSON format was created by Douglas Crockford and is standardized in [RFC4627]; it is a very simple format that is easily read.

3.1.1 Semi structured logs

From the huge amount of semi structured logs that exist in the field only two are really defined and used. They will be shown in the next section.

3.1.1.1 BSD syslog (RFC3164)

The BSD (sometimes called traditional) syslog format was documented in [RFC3164] after being used for several years. It already showed some very simple semi structure. It contained the following fields in the header:

- PRI, which contains the facility (0-23) and the severity (0 emergency to 7 debug)
- Timestamp in the format Mmm dd hh:mm:ss (Aug 12 23:12:14)
- hostname or IP address

RFC3164 has some very big limitations: the log entry is limited to 1024 characters and the transport is defined only for UDP and is therefore not reliable.

3.1.1.2 Modern syslog (RFC 5424)

These limitations of the traditional syslog were removed during the standardization of a new syslog format in [RFC5424], written in 2009 by R. Gerhards, developer of rsyslog. The new format is now used by all current syslog implementations like rsyslog and syslog-ng. The changes include removal of the 1024 byte limitation and a switch in the time format to the [RFC3339] standard, itself a subset of ISO 8601.
The RFC5424 time format looks like this:

2013-07-30T21:23:20.43Z or
2013-07-30T23:23:20.43+02:00

The first is written in UTC, the second in the local timezone, but both define the same time. Also notice the use of the year and of sub-second precision, whose absence was another nuisance of the old syslog format.

RFC5424 also added support for IPv6 addresses, for UTF-8, a required support for TLS and support for additional fields like the following:

- Version=1 (RFC5424)
- App-Name
- PROCID
- MSGID
- other structured data, for example:
  - origin: ip address, enterpriseid (similar to SNMP), software, software version
  - meta: sequenceid, sysuptime

With RFC5424 the possibility of a structure inside syslog was added. Since its creation in 2009 a lot of implementations have become available that support RFC5424. But it is still widely used to transport unstructured log data, because it does not require structured log data.

3.1.2 Structured logs

The semi-structured log files from RFC5424 were only considered to be the first step. There are some other log file structures available. Most of the structured logs also contain the field definitions.

3.1.2.1 CEE

The Common Event Expression (CEE) is a standardization effort started in 2007 and led by the MITRE Corporation, a not-for-profit organization that is well known in the IT industry for the Common Vulnerabilities and Exposures (CVE) numbers.

The project created not only a standardization document that defines a log format (called CEE Profile) which is encoding neutral (called CEE Log syntax), but also a CEE Log Transport. Version 1.0-beta1 specified an event format that could be encoded in JSON or XML files. The encoding format is interchangeable, because the field names together with file types have been standardized in [CEEFields]. Three fields are required to be present: host (hostname or IP address), pname (process name) and time.

To integrate CEE into an existing syslog infrastructure a special header "@cee:" was created that is used as a prefix in front of a valid JSON message. With this prefix it is possible to send CEE messages via BSD (RFC3164) and modern syslog (RFC5425) infrastructures.

Project Lumberjack is an Open Source project to implement CEE in Open Source products. There is also a program with the same name. To differentiate them, the names "Project Lumberjack" and "Lumberjack" are used. Project Lumberjack is supported by both modern syslog implementations, rsyslog and syslog-ng. Together with Red Hat they have spawned the projects "ceelog utils" and "libumberlog" to ease the implementation in Open Source projects.
Both rsyslog and syslog-ng are ready for CEE in their current versions, but beyond that there is almost no usage of CEE in the Open Source world.
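How a receiver might recognize the "@cee:" prefix and check the three required fields can be sketched as follows. This is a minimal illustration and not code from rsyslog, syslog-ng or the CEE specification; the function names and the decision to return None for non-CEE messages are assumptions of this sketch.

```python
import json

CEE_COOKIE = "@cee:"
# Required fields as listed in the CEE description of this thesis.
REQUIRED_FIELDS = ("host", "pname", "time")


def parse_cee(message):
    """Return the event as a dict if the message carries a CEE payload, else None."""
    start = message.find(CEE_COOKIE)
    if start == -1:
        return None  # ordinary, unstructured syslog message
    payload = message[start + len(CEE_COOKIE):].strip()
    try:
        event = json.loads(payload)
    except ValueError:
        return None  # the prefix was present, but no valid JSON followed
    if not isinstance(event, dict):
        return None
    return event


def missing_fields(event):
    """List the required CEE fields that are absent from the event."""
    return [name for name in REQUIRED_FIELDS if name not in event]
```

A message without the prefix is simply passed on as unstructured text, which mirrors how the prefix allows structured and unstructured messages to share one syslog infrastructure.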
{
  "host":"system.example.com",
  "pid":123,
  "time":"2011-12-20T12:38:05.123456-05:00",
  "msgid":"abc",
  "msg":"my event message",
  "app":"application",
  "pname":"auth",
  "sev":10,
  "action":"login",
  "status":"success"
}

Text 1: CEE Example log entry

All this effort looked really promising until May 2013, when MITRE lost funding from the US government and stopped the standardization project and the mailing list. But even before this, there were already some strong opposing voices. Among others the logstash developer Jordan Sissel wrote in [Sissel2013-2] that the specifications left too many options to choose from and that it is possible to create tools that are CEE compliant but cannot talk with each other, when one is enforcing JSON and the other is enforcing XML.

3.1.2.2 GELF

[GELF] is the Graylog Extended Log Format that was created in 2010 as the native format of the graylog2 server. GELF is not a real standard, but simply the format that graylog2 specified. GELF is not only a log format, but also a transport protocol. The definition of the GELF standard is stored in the Git repository of graylog2 and can therefore be changed by the graylog2 developer without notice. The format also uses JSON as a file format like CEE, but also specifies that the JSON must be compressed with zlib or gzip.

The following fields are specified in [GELF]:

- version: GELF spec version "1.0" (string); MUST be set by client library.
- host: the name of the host or application that sent this message (string); MUST be set by client library.
- short_message: a short descriptive message (string); MUST be set by client library.
- full_message: a long message that can i.e. contain a backtrace and environment variables (string); optional.
- timestamp: UNIX microsecond timestamp (decimal); SHOULD be set by client library.
- level: the level equal to the standard syslog levels (decimal); optional, default is 1 (ALERT).
- facility: (string or decimal) optional, MUST be set by server to GELF if empty.
- line: the line in a file that caused the error (decimal); optional.
- file: the file (with path if you want) that caused the error (string); optional.
- _[additional field]: every other field you send and prefix with a _ (underscore) will be treated as an additional field.

{
  "version": "1.0",
  "host": "www1",
  "short_message": "Short message",
  "full_message": "Backtrace here\n\nmore stuff",
  "timestamp": 1291899928.412,
  "level": 1,
  "facility": "payment-backend",
  "file": "/var/www/somefile.rb",
  "line": 356,
  "_user_id": 42,
  "_something_else": "foo"
}

Text 2: GELF message

The GELF format is quite widely used, with support not only in graylog, but also in logstash, nxlog and others.

3.1.2.3 JSON-logstash

The logstash project with lead programmer Jordan Sissel created its own log format. This format is not really specified, but is also used by other programs like fluentd and flume. In this format there are six fields that are all required, and specific extensions by the different applications are added into the "fields" field.
{
  "@source" => "pork.example.com",
  "@type" => "apache",
  "@tags" => [],
  "@fields" => {
    "client" => "127.0.0.1",
    "duration_usec" => 240,
    "status" => 404,
    "request" => "/favicon.ico",
    "method" => "GET",
    "referrer" => "-"
  },
  "@timestamp" => "2012-08-22T14:53:47-0700"
}

Text 3: logstash JSON format

3.1.2.4 Systemd journal

Systemd is a new init system that is widely used on new Linux distributions. It is the default init system for Fedora, Mandriva, OpenSuSE and many others. One part of this system is a new log service named journal, which was started in 2011. The journal was again an attempt to create a new default structured log format for Linux.

The fields in the journal are separated into 3 different kinds [JOURNALFIELDS]:

- Address fields (__ prefix, double underline): Address fields are only usable inside the journal and should not be used outside.
- Trusted fields (_ prefix, single underline): Trusted fields are implicitly added by the journal and cannot be set by the log client. This includes the _PID, _UID and _EXE fields.
- User fields (no prefix): All other fields are user fields and can be specified by every log client itself. There are some predefined fields: MESSAGE, PRIORITY, ERRNO, CODE_FILE, CODE_LINE, CODE_FUNC, SYSLOG_FACILITY, SYSLOG_IDENTIFIER and SYSLOG_PID.

A very important field is the MESSAGE_ID. This is a UUID field that makes it possible to assign a unique identifier to every kind of log message. All messages that are generated at the same state by the same program should have the same message id.

The journal internally uses a binary format, but the journalctl command can export the log entries into different formats, including JSON [JOURNALJSON] and an export format. The export format can be used to send journal entries across the network and looks like a list of environment variables. The journal format allows the same field to be used multiple times inside one entry; this is mapped into a JSON array.
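The mapping of repeated fields to an array can be illustrated with a small parser for the text part of the export format. This is a simplified sketch under two assumptions: entries are separated by an empty line, and only plain KEY=value lines are handled (the binary-safe field encoding of the real export format is ignored). The function name parse_export is made up for this example.

```python
def parse_export(text):
    """Parse text-only journal export data into a list of dictionaries."""
    entries = []
    current = {}
    for line in text.splitlines():
        if not line:
            # An empty line terminates the current entry.
            if current:
                entries.append(current)
                current = {}
            continue
        key, separator, value = line.partition("=")
        if not separator:
            continue  # not a plain KEY=value line, skipped in this sketch
        if key in current:
            # A repeated field becomes a list, like the JSON array in the journal.
            existing = current[key]
            if isinstance(existing, list):
                existing.append(value)
            else:
                current[key] = [existing, value]
        else:
            current[key] = value
    if current:
        entries.append(current)
    return entries
```

Each returned dictionary can then be serialized with json.dumps, which yields a single value for normal fields and an array for repeated ones.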
The systemd journal is very deeply integrated within systemd and is only available on Linux. To support other operating systems, rsyslog author Rainer Gerhards created the library liblogging as a journal replacement library that is available on all operating systems.

{
  "_SERVICE":"systemd-logind.service",
  "MESSAGE":"User harald logged in",
  "MESSAGE_ID":"422bc3d271414bc8bc9570f222f24a9",
  "_EXE":"/lib/systemd/systemd-logind",
  "_COMM":"systemd-logind",
  "_CMDLINE":"/lib/systemd/systemd-logind",
  "_PID":"4711",
  "_UID":"0",
  "_GID":"0",
  "_SYSTEMD_CGROUP":"/system/systemd-logind.service",
  "_CGROUPS":"cpu:/system/systemd-logind.service",
  "PRIORITY":"6",
  "_BOOT_ID":"422bc3d271414bc8bc95870f222f24a9",
  "_MACHINE_ID":"c686f3b205dd48e0b43ceb6eda479721",
  "_HOSTNAME":"waldi",
  "slogin_user":"500"
}

Text 4: systemd journal log entry in JSON-pretty

3.1.2.5 Windows Event Log

From NT 3.5 until Windows XP and Windows Server 2003, Windows used the Windows Event Log, internally called Event Tracing for Windows. This was replaced by a new version called Windows Eventing with Windows Vista and Windows Server 2008.

Windows Eventing entries can be exported as XML and displayed with the Windows Event logs. The Event Schema XML Schema Definition is only available in the Windows SDK, but a textual description is available in [MSEVENTSCHEMA].
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Security-Auditing" Guid="{54849625-5478-4994-A5BA-3E3B0328C30D}" />
    <EventID>4672</EventID>
    <Version>0</Version>
    <Level>0</Level>
    <Task>12548</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8020000000000000</Keywords>
    <TimeCreated SystemTime="2013-03-26T06:51:49.973366400Z" />
    <EventRecordID>2341</EventRecordID>
    <Correlation />
    <Execution ProcessID="516" ThreadID="480" />
    <Channel>Security</Channel>
    <Computer>WIN-M5PCUTLMBMT</Computer>
    <Security />
  </System>
  <EventData>
    <Data Name="SubjectUserSid">S-1-5-18</Data>
    <Data Name="SubjectUserName">SYSTEM</Data>
    <Data Name="SubjectDomainName">NT AUTHORITY</Data>
    <Data Name="SubjectLogonId">0x3e7</Data>
    <Data Name="PrivilegeList">SeAssignPrimaryTokenPrivilege SeTcbPrivilege SeSecurityPrivilege SeTakeOwnershipPrivilege SeLoadDriverPrivilege SeBackupPrivilege SeRestorePrivilege SeDebugPrivilege SeAuditPrivilege SeSystemEnvironmentPrivilege SeImpersonatePrivilege</Data>
  </EventData>
</Event>

Text 5: Windows Eventlog XML file

3.1.2.6 Auditlog

The audit log is the log file of the Linux auditing system. With the help of the auditing system a Linux system administrator can monitor file changes, logins, logouts, successful and unsuccessful authentications and SELinux violations, and can even trace every available syscall and the result of this syscall.

Normally the auditing system writes its logs to /var/log/audit/audit.log and uses this format:
type=USER_AUTH msg=audit(1375462342.487:133952): user pid=17126 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:local_login_t:s0-s0:c0.c1023 msg='op=PAM:authentication acct="root" exe="/bin/login" hostname=? addr=? terminal=tty2 res=success'

Text 6: Auditlog

3.1.2.7 Intrusion Detection Message Exchange Format (IDMEF)

Intrusion Detection Message Exchange Format (IDMEF) is an XML based file format that is defined in [RFC4765], but is still in the experimental phase. This format can be used by snort and suricata.

3.1.3 Other formats

There are a lot of other formats defined; almost every company has created its own log format. Here is a list of other log formats that are known, but deemed too unimportant to be described here in detail.

- Common Event Format: A definition created by the company ArcSight that uses a pipe (|) to separate fields.
- Security Device Event Exchange (SDEE): created by Cisco as an XML based log format for their intrusion prevention system.
- Universal Format for Logger Messages ([ULM]): This IETF format uses a space separated list of key=value fields and was created in 1999 as an internet draft, but was deprecated in the same year.
- Common Base Event ([CBE]): a log format created by IBM and also based on XML.

Other log formats:

- Apache: The format of the access log can be specified very freely, including writing a JSON-based format into the access log. The format of the error log cannot be changed.
- log4j (tomcat, JBoss): log4j is an Open Source Java library that is used by a lot of Java programs, incl. tomcat and JBoss. With the help of the plugin architecture it is very easy to add support for different structured log files.
- mod_security: mod_security is a web application firewall based on apache, nginx or IIS. It uses its own log format with multiple footers, headers, trailers and a body. There are special tools to manage this, like AuditConsole from jwall.org.
- Python: Python comes with its own built-in logging class. Because Python also ships with JSON since 2.6, it is quite easy to write JSON log files with any Python program.
- Ruby: Ruby also comes with its own built-in logging class, and changing the log format to JSON is possible with the extension (extensions are also known as gems) named logging by TwP.

3.2 Collector/Shipper

A lot of programs support syslog, which directly allows them to send log data to a central log server, but not all do. To support these kinds of programs a collector or shipper is needed.

3.2.1 File

Most shipper or collector tools read existing log files line by line and send those lines to a central log server via a predefined transport. To speed up the detection of new lines in the log file, most tools use a mechanism like inode notification or size-based change detection to avoid rereading every log file entry again.

3.2.2 Sockets, named pipes and STDIN

Another possibility to get log entries into the shipper are sockets and named pipes. The advantage is that no disk I/O is necessary to get the log entries into the shipper, but when the shipper is not running the log messages cannot be handled and messages could be lost.

Another possibility is reading the messages from STDIN. With this mechanism the log source starts the shipper as a subprocess and sends the messages via STDIN (file descriptor 0). The log source should check if the shipper is still running and restart the shipper when necessary.

Whatever mechanism is used, the log shipper sends the collected log messages via one or more of several transport mechanisms to the next step of the log analyzing toolset.

3.2.3 Local Windows Eventlog

Some shippers, when run on a Windows system, can read the Windows Event log and send its entries to a central server. Because the Event log is already structured, sending the entries in an unstructured log format should be avoided. But sadly enough that is what most tools do.
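The size-based change detection described in 3.2.1 can be sketched as follows. This is an illustrative polling loop and not the implementation of any of the shippers compared in this thesis; the class name FileFollower and the handling of truncated files are assumptions of this sketch.

```python
import os


class FileFollower:
    """Remember the read offset of one log file between polls."""

    def __init__(self, path):
        self.path = path
        self.offset = 0  # number of bytes already shipped

    def poll(self):
        """Return the complete lines that were appended since the last poll."""
        size = os.path.getsize(self.path)
        if size < self.offset:
            # The file shrank, so it was probably truncated or rotated.
            self.offset = 0
        if size == self.offset:
            return []  # nothing new, no need to reread the file
        with open(self.path, "rb") as handle:
            handle.seek(self.offset)
            data = handle.read(size - self.offset)
        # Only ship complete lines; a trailing partial line waits for the next poll.
        end = data.rfind(b"\n") + 1
        self.offset += end
        return data[:end].decode("utf-8", errors="replace").splitlines()
```

A shipper would call poll() periodically (or after an inode notification) and hand the returned lines to its transport. Persisting the offset across restarts is left out of this sketch.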
3.2.4 Compare collector / shipper

Table 3: collector/shipper Overview
Compared programs: rsyslog, syslog-ng, logstash, woodchuck, awesant, beaver, lumberjack, syslog-shipper, remote_syslog, fluentd, flume, Heka, nxlog, node-logstash, systemd/journal2gelf, eventlog-to-syslog.
Compared features: flat file, directory structure, multiple files with *, STDIN/STDOUT, unix domain socket, named pipe, eventlog local Windows, systemd journal, spool message during downtime.

3.3 Transport

There are different transport mechanisms to bring the log messages to a central server. The transport defines the wire protocol that is used to send the messages. Because of the sensitive nature of log files they should be encrypted when sent between machines. Not all transport mechanisms support that. A transport should also be reliable, making sure that no message is lost on the way.

3.3.1 Syslog

The BSD syslog standard [RFC3164] also includes a transport protocol based on UDP that uses port 514. The replacement [RFC5424] does not include a transport protocol itself, but requires all implementations to support Transport Layer Security (TLS) with TCP port 6514. A UDP based transport using port 514 is described in [RFC5426].

These syslog protocols are the most used log transports in the field. With the help of TCP sessions the transport of the log messages is quite reliable, but there are cases in which the loss of data sent via TCP cannot be avoided. The rsyslog author Rainer Gerhards describes this in his blogpost [Gerhards2008].

This problem led to the development of [RELP], which implements app level acknowledgment. This creates a much more reliable transport, but is missing encryption. RELP with encryption is on the TODO list of Gerhards, but is not yet stable. Until then it would be possible to use stunnel to add TLS encryption on top. The problem with RELP is that it is not a standard and is not used in any other tool, like syslog-ng.
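The UDP based BSD syslog transport can be sketched in Python as follows. The sketch assumes the PRI calculation from RFC3164 (facility multiplied by 8 plus severity) and the header layout described in 3.1.1.1; the function names, the default server address and the choice to cut the datagram at 1024 bytes are assumptions of this example, not part of any tool described here.

```python
import socket
import time

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def syslog_priority(facility, severity):
    """Combine facility (0-23) and severity (0-7) into the PRI value."""
    if not 0 <= facility <= 23 or not 0 <= severity <= 7:
        raise ValueError("facility or severity out of range")
    return facility * 8 + severity


def format_bsd_syslog(facility, severity, hostname, tag, text, when=None):
    """Build a line of the form '<PRI>Mmm dd hh:mm:ss host tag: text'."""
    moment = time.localtime(when)  # current local time if when is None
    stamp = "%s %2d %02d:%02d:%02d" % (
        MONTHS[moment.tm_mon - 1], moment.tm_mday,
        moment.tm_hour, moment.tm_min, moment.tm_sec)
    pri = syslog_priority(facility, severity)
    return "<%d>%s %s %s: %s" % (pri, stamp, hostname, tag, text)


def send_udp(message, server="127.0.0.1", port=514):
    """Send one message as a single UDP datagram (unreliable, unencrypted)."""
    data = message.encode("utf-8")[:1024]  # RFC3164 size limit
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(data, (server, port))
```

Because UDP gives no acknowledgment, send_udp() cannot tell whether the central server received the message, which is exactly the reliability problem described above.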
3.3.2 AMQP

The Advanced Message Queuing Protocol [AMQP] is an open, standardized application layer protocol for message middleware. AMQP was developed for the financial industry, but is now used for a lot of different purposes. One main advantage is interoperability: AMQP is an application layer protocol and it is possible to let multiple AMQP servers (aka AMQP brokers) from different vendors talk with each other, similar to HTTP or SMTP. The other main advantage is its reliability, because it can be very tightly controlled that no message that has entered an AMQP system can be lost.

AMQP uses special terms to describe its components:

- An Exchange is where the producers send ("produce") the log messages.
- The Queue is where the log messages are read from.
- The Bindings connect Exchanges and Queues with one another.
- The Broker is the AMQP server.

AMQP supports both username and password authentication as well as SASL authentication. It also supports TLS encryption, see Part 5 of the [AMQP] standard. AMQP is used by the following message servers: Apache Qpid, Apache ActiveMQ, RabbitMQ as well as others.

3.3.3 STOMP

[STOMP], the Simple (or Streaming) Text Oriented Messaging Protocol, is a protocol similar to AMQP, but instead of being a binary format it uses a text based format very similar to HTTP. It is so simple that a telnet session is enough for some basic usages. Because of the text based format it is very verbose and takes much more bandwidth than necessary. It also lacks some features that are available in AMQP.

STOMP in the current version 1.2 supports username and password authentication, but encryption is not available. It is possible to use stunnel to put an encryption layer around STOMP. STOMP is used by two servers that also speak AMQP: Apache ActiveMQ and RabbitMQ (with plugin).

3.3.4 Ømq/ZMTP

[Ømq], also known as ZeroMQ or 0MQ, is another message queue system. Unlike STOMP and AMQP it is not built for interoperability, but there are multiple implementations available.
Ømq is a library that does not need a dedicated broker and is designed to be very easy to use and very fast. This simplifies setup enormously and is the main reason why it is used quite often for log transport. The transport protocol is called [ZMTP], but is not widely used outside the Ømq project itself.

A new version of Ømq called CurveZMQ was created in 2013 to bring encryption support to Ømq. The new transport protocol is named [ZMTP-CURVE], but it is very new and no stable release has been created yet.

3.3.5 Redis

Redis is a key-value store and belongs to the so-called NoSQL databases. For the use as a log transport the built-in feature called "channels" is used to create a publish-subscribe messaging infrastructure.

The redis transport does not support encryption. Only a password authentication scheme without usernames is available. More in the redis chapter on page 43.
3.3.6 Lumberjack

Lumberjack is the transport protocol used by the shipping tool with the same name. This is not to be confused with Project Lumberjack, which belongs to the CEE initiative. It was developed by Jordan Sissel, who also created logstash, because he needed a transport protocol that supported "encrypted, trusted, compressed, latency-resilient, and reliable transport of events" [Sissel2013]. More about lumberjack on page 45.

3.3.7 Remote Windows Eventlog

Microsoft Windows also has its own transport mechanism. This mechanism is primarily used by Microsoft itself, but access to the client component is available via an API on a Windows Server. It is therefore possible to collect all Eventlogs from all machines in a complete Windows domain and send them to the central machine without having to install the shipper on all machines.

Sadly no Open Source and Free Software tool supports this at the moment. It was part of Project Lasso, but this project is dead now and does not support the new log format used since Windows Vista and Server 2008.

3.3.8 Compare Transports

Table 4: Transport Overview
Compared programs: logstash, graylog2, rsyslog, syslog-ng, node-logstash, octopussy, nxlog, Heka, woodchuck, awesant, beaver, syslog-shipper, lumberjack, systemd/journal2gelf, eventlog-to-syslog, ncode/logix, flume, fluentd, remote_syslog.
Compared transports: BSD syslog udp (RFC3164), IETF syslog udp (RFC5424), IETF syslog tcp (RFC5424), IETF syslog tcp tls (RFC5425), TLS encrypted channel, RELP, http, websocket, redis, amqp (QPID, ActiveMQ, RabbitMQ), Stomp (ActiveMQ, RabbitMQ), 0MQ, gelf, lumberjack, SNMP, statsd, graphite, graphtastic, varnishlog, kafka.

3.4 Transformation/Normalization

Most of the log messages that are sent today are not yet in a structured log format. To bring structure into these log messages a transformation or normalization is necessary. This step detects the different kinds of messages and creates key-value pairs out of the unstructured log messages. There are different ways to do this.

A regular expression (regex) based system could be used to achieve this, but managing and writing large regular expressions can be cumbersome and error prone. Most of the tools work on a different basis. This approach is called sample based or pattern based. Here the parsing is done based on fixed strings, and matching is done with predefined field types.

The regular expression to parse a line like this:

sshd[1738]: Accepted password for root from 172.16.242.1 port 50447 ssh2

would be:

sshd\[[0-9]+\]: Accepted (gssapi(-with-mic|-keyex)?|rsa|dsa|password|publickey|keyboard-interactive/pam) for [^[:space:]]+ from [^[:space:]]+ port [0-9]+ ssh2

It is easier to maintain a ruleset like this:

sshd[!PID!]: Accepted !AUTHMETHOD! for !USERNAME! from !IPADDRESS! port !PORTNUMBER! ssh2

The transformation can be done on the central server or on every client itself. The central transformation has the advantage that the rules are stored in one place and can be changed quite easily, but the CPU usage can be a problem in large setups. To avoid that, the normalization can be
spread out to multiple nodes, or the work can be done on the client side before the transport. The CPU load on every client is quite small, but the distribution of the rule set can be problematic if no configuration management like puppet, chef or ansible is already in place.

3.4.1 Pattern-DB

The Pattern-DB is part of syslog-ng and is normally compiled into the syslog-ng binary. The documentation of Pattern-DB is very complete and syslog-ng has a Git repository where it collects rules for different services. The patterns themselves are stored inside an XML structure and include test messages and examples. The Pattern-DB is very actively maintained and also allows messages to be correlated. This allows a mail server to save sender and recipient of a mail in one log entry, or to put together the logon and logoff times to save the duration of a login.

To parse the example from above the rule should look like this:

sshd [@NUMBER:PID:@]: Accepted @QSTRING:auth_method:@ for @QSTRING:username:@ from\
@QSTRING:client_addr:@ port @NUMBER:port:@ ssh2

3.4.2 Liblognorm

The Liblognorm tool is developed by the rsyslog creator Rainer Gerhards and is created as a library so other tools can use this normalization tool as well. Liblognorm includes a small tool called "normalize" that checks the rulesets and creates JSON messages out of normal log files. This makes rule writing much easier. The documentation is somewhat limited, but sufficient to create rules. There is no adequate rule library, so all rules have to be created by oneself. Liblognorm is not only used by rsyslog but also by the Sagan project. They have created a rule library, the only one available.

rules=sshd [%pid:number%:] Accepted %auth_method:word% for %username:word% from %src-ip:ipv4% port %src-port:number% ssh2

3.4.3 Octopussy

Octopussy is a log management system that uses its own log normalization. The rule base is quite extensive, but it can only be used by the Octopussy system, because it is an integrated part.
The patterns are stored in an XML file and can be edited and created with the help of the Octopussy webpage. This makes it very easy for a system administrator to create new patterns.

The example line from above would be found by this rule:

<@REGEXP("ssh\S+"):daemon@>[<@PID:pid@>]: <@REGEXP("Accepted password for.+"):msg@>
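The general idea behind the pattern based tools of this chapter, fixed strings combined with typed fields, can be approximated in plain Python by translating a rule into a regular expression with named groups. This is only a sketch: the !NAME! placeholder notation, the field types in FIELD_TYPES and the function names are assumptions for illustration and do not reproduce the rule syntax or matching algorithm of Pattern-DB, Liblognorm or Octopussy.

```python
import re

# Hypothetical field types; each placeholder name maps to a small regex.
FIELD_TYPES = {
    "PID": r"\d+",
    "AUTHMETHOD": r"\S+",
    "USERNAME": r"\S+",
    "IPADDRESS": r"\d{1,3}(?:\.\d{1,3}){3}",
    "PORTNUMBER": r"\d+",
}

PLACEHOLDER = re.compile(r"!([A-Z]+)!")

RULE = ("sshd[!PID!]: Accepted !AUTHMETHOD! for !USERNAME! "
        "from !IPADDRESS! port !PORTNUMBER! ssh2")


def compile_rule(rule):
    """Turn a rule with !NAME! placeholders into a compiled regular expression."""
    parts = []
    position = 0
    for match in PLACEHOLDER.finditer(rule):
        # Fixed text between placeholders is matched literally.
        parts.append(re.escape(rule[position:match.start()]))
        name = match.group(1)
        parts.append("(?P<%s>%s)" % (name, FIELD_TYPES[name]))
        position = match.end()
    parts.append(re.escape(rule[position:]))
    return re.compile("^" + "".join(parts) + "$")


COMPILED = compile_rule(RULE)


def normalize(line):
    """Return the key-value pairs of a matching line, or None if it does not match."""
    match = COMPILED.match(line)
    return match.groupdict() if match else None
```

The rule author only edits the readable RULE string, while the error prone regular expression is generated. This is the maintenance advantage the pattern based approach aims for.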
Illustration 2: Octopussy rule creation

3.4.4 Grok

The grok library was created by logstash developer Jordan Sissel and is available for other tools to use. Grok itself is based on regex, but makes it easier to write rules because it allows regex patterns to be given names, which are then used instead of the patterns themselves. To match our example from above the pattern could look like this:

sshd [%{NUMBER:pid}:] Accepted %{WORD:auth_method} for %{WORD:username} from %{IPORHOST:src-ip} port %{NUMBER:src-port} ssh2

A nice additional tool available for grok is grokdiscovery. This tool takes a sample log message and tries to predict the pattern that could be used to normalize this message. Of course the result is not always directly usable, but it speeds up the creation of rules with grok.

3.4.5 Heka

Mozilla's Heka includes its own transformation. It is based on regex, but it is easier to read, because it includes the variable name inside the regex. The following is an example for parsing the Apache combined log file format. Some lines that would have defined the type of the fields were deleted here.

match_regex = '/^(?P<RemoteIP>\S+) \S+ \S+ \[(?P<Timestamp>[^\]]+)\] "(?P<Method>[A-Z]+) (?P<Url>[^\s]+)[^"]*" (?P<StatusCode>\d+) (?P<RequestSize>\d+) "(?P<Referer>[^"]*)" "(?P<Browser>[^"]*)"/'
timestamplayout = "02/Jan/2006:15:04:05-0700"

3.4.6 Filter_regex

The Node-Logstash tool uses a pure regex based normalization. The configuration is a lot more error prone, as is visible in this example:

{
  "regex": "^<(\\S+)>(\\S+\\s+\\S+\\s+\\d+:\\d+:\\d+) (\\S+) ([^:\\[]+)\\[?(\\d*)\\]?:\\s+$accepted \
    (gssapi(-with-mic|keyex)?|rsa|dsa|password|publickey|keyboard-interactive/pam) \
    for [^[:space:]]+ from [^[:space:]]+ port [0-9]+( (ssh|ssh2))$",
  "fields":"syslog_priority,timestamp,@source_host,syslog_program,syslog_pid,auth_method,username,src-ip,src-port",
  "numerical_fields": "syslog_pid","src-port"
  "date_format": "MMM DD HH:mm:ss Z"
}

3.4.7 nxlog

Nxlog also offers some limited transformation between formats. It can for example convert a Windows Eventlog entry to a JSON or GELF message, but cannot convert an unstructured log format into a structured one.

3.5 Storage

The traditional way to store log messages is a log file. This may be bad for searches, but there are some advantages to it. In most cases some kind of database system should be used.

3.5.1 Log files

Traditional log files may feel antiquated, but they have the big advantage that they remain readable in the future. Log files that are 10 or even 30 years old can be read today, if the physical medium is still readable.

New features make log files even more interesting. Since rsyslog version 7.4 it is possible to create signed log messages with the help of Guardtime [Gerhards2013]. This uses a Keyless Signature Infrastructure and a hash tree or Merkle tree to put multiple small log messages together, and then uses linked timestamps to make the log tamperproof. The cryptographic information is shown in [Gerhards2013-2] and at www.openksi.org.

Rsyslog's approach is targeted at writing to a log file. It cannot be used before a message is sent to the central log server. This is a design decision that comes from the idea that not all log messages that are sent to a central server will be saved.

Systemd's journal also has a signing feature. It uses Forward Secure Sealing (FSS) to achieve a similar objective. Instead of the Keyless Signature Infrastructure it uses a cryptographic key that is displayed as ASCII and QR code during creation. This can be scanned and used to check if the log file has been altered.
The log files are stored locally and can be deleted by an attacker. This problem is acknowledged by the author in [Poettering2012].

3.5.2 SQL

The idea to use SQL to save syslog data is not new. Both rsyslog and syslog-ng have been supporting SQL databases for a long time. The problem is that the unstructured log message cannot really be split up, so the table structure of such a database is quite simple. Only the syslog structure (log host, date, facility and priority) can be stored in separate columns, and the message is a long string field. There are some possibilities to speed up searches via full text search extensions like Sphinx.

Storing structured log data in an SQL database is not easier, because there are too many different fields possible. The fixed schema of SQL is not flexible enough to be used for that.
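The simple table structure described above can be illustrated with SQLite from the Python standard library. The table and column names are assumptions for this sketch and are not the schema used by rsyslog, syslog-ng or any other tool mentioned here.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a database with one flat table for syslog messages."""
    connection = sqlite3.connect(path)
    connection.execute(
        "CREATE TABLE IF NOT EXISTS syslog ("
        " host TEXT, logged_at TEXT, facility INTEGER,"
        " priority INTEGER, message TEXT)"
    )
    return connection


def store(connection, host, logged_at, facility, priority, message):
    """Insert one message; the unstructured text stays in a single column."""
    connection.execute(
        "INSERT INTO syslog VALUES (?, ?, ?, ?, ?)",
        (host, logged_at, facility, priority, message),
    )


def search(connection, needle):
    """Find messages containing needle with a substring scan of the text column."""
    cursor = connection.execute(
        "SELECT host, message FROM syslog WHERE message LIKE ?",
        ("%" + needle + "%",),
    )
    return cursor.fetchall()
```

The search has to scan the message column with LIKE because the user name or IP address inside the message is not available as a column of its own. This is the limitation that full text search extensions try to soften.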
The tools ELSA (see page 35) and rsyslog's LogAnalyzer (see page 38) are using MySQL (see page 41) to store the log messages in an SQL database. Rsyslog and syslog-ng, as well as other tools, are supporting other SQL dialects as well. Some with the help of the DBI library, some with native support.

3.5.3 NoSQL

The problem with the schema and structured log files was one of the reasons to move to a NoSQL database, primarily a document store. The document store is a NoSQL database like MongoDB and Elasticsearch and is storing documents in JSON, XML or other data formats. These documents can be indexed and replicated to speed up searches and make the system more reliable. There are primarily two NoSQL databases used with log management, MongoDB (see page 41) and Elasticsearch (see page 42).

3.5.4 Compare Storage

Program mongodb logstash hadoop elasticsearch (logstash format) graylog2 (graylog format) rsyslog syslog-ng (logstash format) N/DBI DBI node-logstash (logstash format) nxlog DBI Heka fluentd SQL (DBI or native) flume (logstash format) (logstash format) DBI

Table 5: Storage Overview

3.6 Analysis

Simply storing the normalized log data is not enough; to get more use out of the log files, the data in them has to be analyzed. The main reason to analyze the log data is to detect problems and attacks and to correlate events. Some events only need to be noticed if a lot of them occur. One logon error is nothing to worry about, 1,000 logon errors are not normal and should be checked. A 404 error on a webpage is ok, 1,000 per second is not ok. This kind of analysis should be done automatically, based on a written rule set.

3.6.1 nxlog

Nxlog has an analyzing functionality. It has a special module called event correlator (pm_evcorr), but it also supports simple statistical counters like RATE, COUNT, AVG or the change rate of the RATE called GRAD. This makes it possible to create some simple analysis, but there are some problems with this as written in the nxlog documentation [nxlog-var-warning].
With the event correlator module it is possible to create rules to ignore messages that arrive too often, to avoid being flooded by warnings. It offers the command "pairs", that looks for events that have a matching pair; the login and logout message of a user is a good example of such a pair. The command "absent" will search for broken pairs, without the second part arriving inside a certain timeframe. The following example will send a warning if the field "Message" containing "login failure" is detected 3 times in 60 seconds.

    <Thresholded>
        Condition $Message =~ /^login failure/
        Threshold 3
        Interval  60
        Exec      $raw_event = "login guessing in progress";
    </Thresholded>

3.6.2 SEC

The Simple Event Correlator (SEC) is a universal event processing tool, that cannot only be used for log files but for fraud detection and other event correlation as well. SEC is written in Perl and uses regex to correlate the messages. As written on the [SEC] webpage: "SEC reads lines from files, named pipes, or standard input, matches the lines with patterns (like regular expressions or Perl subroutines) for recognizing input events, and correlates events according to the rules in its configuration file(s). SEC can produce output by executing external programs (e.g., snmptrap or mail), by writing to files, by sending data to TCP and UDP based servers, by calling precompiled Perl subroutines, etc." The following example from [Vaarandi2012] shows a rule that checks ssh, apache and iptables/netfilter for attacks and sends a mail when an attack is detected:

    type=EventGroup3
    ptype=RegExp
    pattern=sshd\[\d+\]: Failed \S+ for (?:invalid user )?\S+ from ([\d.]+) port \d+ ssh2
    thresh=3
    ptype2=RegExp
    pattern2=^([\d.]+) \S+ \S+ \[[^]]+\] [^ ]+ HTTP\/[\d.]+ 4\d+ \d+
    thresh2=1
    ptype3=RegExp
    pattern3=kernel: IN=\S+ OUT= MAC=\S+ SRC=([\d.]+)
    thresh3=5
    desc=Repeated probing from $1
    action=pipe Repeated probing from host $1 /bin/mail root@localhost
    window=120
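The idea shared by the two rules above, counting matching events inside a time window and reacting once a threshold is reached, can be sketched in a few lines of Python. This is only an illustration of the counting logic and not how nxlog or SEC implement it; the class name ThresholdDetector and its parameters are hypothetical.

```python
import re
from collections import deque


class ThresholdDetector:
    """Report when `threshold` matching events fall inside `interval` seconds.

    Hypothetical helper for illustration; not part of nxlog or SEC.
    """

    def __init__(self, pattern, threshold, interval):
        self.pattern = re.compile(pattern)
        self.threshold = threshold
        self.interval = interval
        self.hits = deque()  # timestamps of recent matching events

    def feed(self, timestamp, message):
        """Return True if this message raises the count to the threshold."""
        if not self.pattern.search(message):
            return False
        self.hits.append(timestamp)
        # Drop matches that have fallen out of the sliding window.
        while self.hits and timestamp - self.hits[0] >= self.interval:
            self.hits.popleft()
        if len(self.hits) >= self.threshold:
            self.hits.clear()  # start counting again after an alert
            return True
        return False


# Usage: three "login failure" messages within 60 seconds raise one alert.
detector = ThresholdDetector(r"^login failure", threshold=3, interval=60)
events = [
    (0, "login failure for bob"),
    (10, "session opened for alice"),
    (20, "login failure for bob"),
    (30, "login failure for bob"),
]
alerts = [t for t, msg in events if detector.feed(t, msg)]
print(alerts)  # [30]
```

Clearing the counter after an alert is a design choice made here to avoid repeated alerts for the same burst; real correlators offer their own suppression options.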
3.6.3 Sagan

Sagan is a real-time log analysis & correlation tool and is written in multithreaded C. Sagan rules look similar to the rules of the Snort Intrusion Detection System (IDS) to simplify rule management with oinkmaster and similar tools. The log messages have to be delivered in a special pipe (|) separated format via a FIFO socket. As an output it can write directly to a log file or use barnyard2 to write to a SQL database. This is the same mechanism that is used by snort. It uses liblognorm for normalization and its own rules that are similar to snort rules. The following rule will create a warning if more than 5 authentication failures are detected inside a 300 second timeframe:

    drop tcp $EXTERNAL_NET any -> $HOME_NET $SSH_PORT (msg:"[openssh] PAM Authentication failure - Brute force [5/5]"; content: "Authentication failure"; classtype: unsuccessful-user; reference: url,wiki.quadrantsec.com/bin/view/main/5000015; normalize: openssh; program: sshd; after: track by_src, count 5, seconds 300; threshold: type limit, track by_src, count 5, seconds 300; fwsam: src, 1 day; sid: 5000015; rev:5;)

3.6.4 Logstash and metrics

Logstash can be used for some analysis jobs. There is a metric plugin that can create rate calculations for 1, 5, and 15 minutes, as well as min, max, stddev and avg. The problem is that there is no way to use it directly; you can only forward it via JSON to another tool, like a grapher as explained in the next chapter. Also missing is the possibility to check if a specific user has been mistyping his password a certain number of times in the last couple of minutes.

3.6.5 Graylog2

Graylog2 has the possibility to put messages which are selected by a search query into a message stream. When a certain number of messages arrive in a stream, it can trigger an alarm and can send mails, use jabber or call an external plugin. It can also forward all messages from a stream to an output plugin like an external paging service, but it also misses checks against things like guessing passwords.
Streams only work with new messages that arrive, not with messages that are already stored in the elasticsearch database.

3.7 Visual output

All the collected, normalized and analyzed log files can be stored, but without a visual output no one will notice. There are multiple web applications that can show different aspects of the log messages; most are integrated into a log tool, like octopussy, graylog2 or ELSA (see chapter Multi purpose tools on page 31). There are two kibana projects that are working with logstash, but are developed separately. Kibana 3 is even usable with other tools like graylog, as long as it uses timestamps and elasticsearch as storage. More about the different web front ends in chapter Webpage on page 33. If you only want some graphs to be added to an existing web site, special graphing tools are available. These graphing tools can be found in chapter Graphs on page 40.
4 Tools

Most of the tools used for structured log file analysis offer multiple components in one program. In the last chapter the different parts that are necessary for the creation were introduced. This chapter shows the different tools with all parts that are built into the tools. This chapter begins with the multi purpose tools, then the different outputs, then storage, transport, shipper/collector and finally the analysis tools.

4.1 Multi purpose tools

Some tools can be used for a wide range of purposes, some others are only created for one specific purpose. This section begins with the multi purpose tools.

4.1.1 Syslog-ng

Syslog-ng is a syslog server created by Balabit.com, a Hungary based company. Syslog-ng exists in two versions: an Open Source Edition (OSE) and a Premium Edition. The latter is only available for paying customers with support and it is not Open Source. Because of this, syslog-ng was almost removed from this thesis, but it is used quite extensively and the missing features are not big enough to warrant the removal. The features missing in the OSE version include: handling of multiline messages, encrypted log files, reliable log transfer, client-side failover and buffering log messages persistently to hard disc in case the destination becomes unreachable. In this thesis whenever syslog-ng is written, it is about the OSE version. Syslog-ng was the default syslog in SuSE Enterprise Linux (SLES) and OpenSuSE, but it is being replaced by rsyslog [SLES2013]. It is unknown if that was because of the OpenCore nature of the development, or to be in sync with other distributions like Debian and Red Hat. Syslog-ng cannot only be used to collect syslog messages, but also as a shipper reading the files directly. Reading multiple files with wildcards is only supported by the closed source Premium Edition. The same goes for the handling of missing syslog servers. When the target server is not available, the log messages are not stored, but lost. There is a huge number of plugins available inside syslog-ng including writing to SQL databases, MongoDB and AMQP.
With the help of an AMQP cluster it is possible to make sure that syslog-ng does not lose messages when a server is down. Syslog-ng does support encryption out of the box. Syslog-ng has its own normalization tool called Pattern-DB with a huge number of predefined rules. This ruleset is very actively maintained [Czanik2013]. See chapter 3.4.1 on page 25. The documentation of syslog-ng is very good, well structured and extensive. The OpenCore nature of syslog-ng is a big problem, but the huge and actively maintained PatternDB is something that is not available anywhere else.

4.1.2 Rsyslog

Rsyslog started as a replacement for the traditional syslog and as an opponent for the existing syslog-ng. As developer Rainer Gerhards wrote in [Gerhards2007] it was developed to be a real Free Software and Open Source alternative, because syslog-ng has become a dual-licensed open core product. Rsyslog is completely Open Source and Free Software and support is available from the German company of the original author named Adiscon.com.
Rsyslog is the default syslog server for Fedora, OpenSUSE, Debian and RedHat Enterprise Linux and available in every major linux distribution. It supports a large number of input and output plugins as shown in Illustration 3.

Illustration 3: rsyslog in/out plugins

The possibility to write to mysql makes it possible to use the Log Analyzer tool from page 38. One of the unique features of rsyslog is the RELP plugin, that makes it possible to make sure that syslog messages are really received by the server. This only works with rsyslog, because it is not a standard. Together with the disc based queue it is very easy to make sure no message gets lost. This can also be done via the Ømq plugin. Both these reliable mechanisms suffer from a lack of encryption. Only the encryption of the normal syslog traffic is available. The development of encrypted RELP is under way but not available yet. Rsyslog offers a normalization library named "Liblognorm" that is described on page 25. The documentation of rsyslog is strange, because a lot of interesting features are only explained in blog posts from the main author.

4.1.3 Graylog2

Graylog2 is a complete, Free Software and Open Source log management solution, created by Lennart Koopmann and supported by torch.sh, a German based company. It stores the data in an elasticsearch cluster and the statistics and graph data in a MongoDB. The log messages can be sent via syslog in an unstructured way or in its own structured format named GELF, see page 14.
Graylog2 supports the creation of multiple graylog2 instances, writing to the same elasticsearch cluster. This allows the creation of a failover setup. Together with the AMQP support in graylog2 it is possible to have different graylog2 nodes connecting to the same AMQP broker infrastructure. In this setup graylog2 nodes will automatically distribute the messages to share the load. The biggest problem of graylog2 is the fixed requirement for a specific elasticsearch version. This happens because graylog2 adds its own elasticsearch node into the cluster. This cluster node can be configured to store data itself or to solely forward the data to other data nodes. This can create some problems, because the elasticsearch development is quite fast, and you have to use an old version to run graylog2. Another problem with elasticsearch is that it handles deletion of old log files not based on date, but only on the size of the log files. This makes it easy to manage the disc space, but it is not known how many days of log files are available. The webpage from graylog2 is more than a simple dashboard; it allows sorting messages into streams. Streams are query results that can be used for monitoring and alerting other programs, as written on page 30. Streams can very easily be created from the webpage and be put into categories to make handling of a huge number of streams easier. The web interface also supports adding admin messages to log entries, if a special problem is known. It is possible to write a regular expression and for every log entry that fits this expression, an automatic message is added to the web page. Graylog2 offers to remove sensitive information like passwords. Streams can also trigger alarms, send mails, jabber messages or call an external plugin. All messages of a stream can also be forwarded to an output plugin. To round it up graylog2 allows putting machines into host groups.

Illustration 4: Graylog2 web page

4.1.4 Logstash

Logstash is the "swiss army knife" of log management.
It contains everything from transports, to collecting local sources, to normalizing log messages, to storing data in elasticsearch, up to a webpage to query the data from elasticsearch.
Logstash development was started in 2009 by Pete Fritchman and Jordan Sissel and has a huge number of input and output plugins as seen on page 57, as well as filter plugins. The filter plugins include plugins that can be used for: anonymization, conversion to GELF, JSON, KV, XML and other formats, resolving IP addresses into geo coordinates, grok as described in chapter Grok on page 26, merging multilines (like stack traces) into one message, splitting one message into two, translating numbers into text (like error codes into error messages) or resolving IP addresses into hostnames. The documentation is very good, and includes a very helpful introduction. If more information is required a logstash book written by James Turnbull is also available [Turnbull2013]. Logstash's own webpage is very limited in its usage. It can be used to query elasticsearch, but it is missing a lot of other features offered by the competitors. The big advantage is the simple installation. When logstash is already running, a simple command line with the parameter "web" starts the web server. Several reliable transports are available, like AMQP and syslog with RELP. It also supports encrypted syslog, but only without RELP. With tools like lumberjack it supports both reliable and encrypted transports as well. As the logstash webpage is very limited, the tools Kibana 2 or Kibana 3 are often used instead.

Illustration 5: Logstash web page
4.1.5 Node-Logstash

Node-logstash is a reimplementation of logstash written in Javascript, based on node.js and developed by Bertrand Paquet. It also uses elasticsearch, but it is not limited to a specific version. Node-logstash does not yet have the same number of plugins as the original. The grok plugin is missing, and a replacement plugin called filter_regex is working with a regex rule base. This is shown in chapter Node-Logstash on page 26. The filter plugins are: add_source_host, add_timestamp, compute_date_field, compute_field, grep, json_field, multiline, mutate_replace, reverse_dns, split, syslog_pri and regex. So a lot of the functionality is missing. A reliable transport is available with redis, but there is no encryption available whatsoever. The project started very recently in July 2012, but is very actively developed.

4.1.6 ELSA

The Enterprise Log Search and Archive (ELSA) is a combination of syslog-ng, mysql and sphinx. It is written primarily in Perl by Martin Holste. He started the project in 2011. As written in [Holste2011] the development was driven by the need to create a logging server that could be queried very fast. According to the [ELSA-UserGuide], it is using the Pattern-DB from syslog-ng for normalization, and forwards the normalized log messages to the Perl programs. These are sending the messages via bulk load to the mysql database. The sphinx server is indexing the new data every few hours to gain speed and to work with large chunks of data. After a defined amount of time the data is moved from the MyISAM table to a table of type ARCHIVE. The query language is based on the google query language, which makes it very easy to use, but it does not support wildcard searches, which most other web front ends offer. Google also provides a lot of images, Javascript and css files, which makes it impossible to use ELSA without internet access. For a tool with such security and privacy related data, it is surprising that it gets most of the files from google.
The installation is a little unusual: where most tools tell you to install a list of requirements and then the package, ELSA only offers an install script that does the installation. Separated into "node" and "web" it installs a lot of programs like mysql, apache, gcc, header files for different development packages and a huge number of cpan modules. It also downloads syslog-ng and sphinx and compiles them. It also downloads its own source and cpanm from the web. The nice thing is that at the end it runs a self-check that puts some messages into syslog and tests if these are correctly stored and indexed. It is possible to create multiple ELSA nodes, but these nodes are not replicated; instead every node runs independently with its own messages, databases and index server. The webpage will send the necessary queries to all ELSA nodes in the cluster, so it looks like all information is stored in one datasource. This has the advantage that every node is robust: if a node is missing, the information of this node is missing too, but everything else is unaffected. ELSA also offers email alerting, a plugin architecture and host checks that inform about hosts that are not sending messages anymore. A nice feature is the possibility of defining log classes and defining which user can access which message. This makes it possible to define that a web developer can access the web logs, but not the ssh or audit logs. It also offers dashboards for a better overview over different kinds of log messages. The documentation is quite extensive, but to ask questions it is necessary to have a google account.
Illustration 6: ELSA web page

4.1.7 octopussy

Octopussy or 8pussy is a quite old project and was started in 2005 by Sebastien Thebert. It is written in Perl and is available as a Debian archive or as source code. It brings its own normalization library as shown on page 25. It uses rsyslog to accept the messages and sends them via a fifo into the octopussy dispatcher. The dispatcher sends the messages to the parser (for the normalization). For every host that sends messages to octopussy a new parser is started. This can be a problem with large setups, because every octo_parser has a resident set size of 24 Mebibytes. With thousands of hosts this can be a problem. On the webpage it can be defined which services are running on the host and automatically the ruleset for this service is added. A huge number of predefined log formats or services are available, not only Linux and Windows services, but also MacOS, Netscreen, Ironport, F5 and Cisco. It is also possible to add one's own services and rules. There is even a wizard that shows all unidentified log messages and helps with the creation of rules for these log messages. There is LDAP authentication available, as well as alerts, reporting and rrd-based graphics. The big problem is the data storage. There is a mysql database required, but the logs are stored based on the detected service in a compressed cleartext file. These files are stored in a date based file hierarchy with one log file for every minute of the day, up to 1440 files per directory. This can limit the search speed. A fast search is possible, but only if the search is limited to a specific service, because only these logs have to be uncompressed and searched.
Illustration 7: Octopussy home page

4.1.8 nxlog

Nxlog is a very universal log collector and shipper, together with some analyzing and normalization capabilities. It is written in multithreaded C and is created and supported by the Hungary based company Nxsec. The code is only released on SourceForge as a tar.gz archive and no source code repository like SVN or GIT is available. Nxlog is OpenCore software; some features are only available with the "Enterprise" version. This includes better event correlation, an http REST api, snmp input and remote windows event collection. But the important features are available in the Open Source version. The architecture is based on plugins, but these are called modules here. The modules are separated into extension modules that add support for message formats like syslog, gelf, JSON or a multi line message parser to handle Java stack traces. There are input and output modules as well as process modules with support for memory and disk buffers for bridging server gaps. It also offers event correlation and message de-duplication. Encryption is available as an input and output module. The analyzing and event correlation is described on page 28. The [nxlog] documentation is quite complete and includes a lot of examples.
4.1.9 Heka

In April 2013 the Service Team of the Mozilla Foundation announced the first public release of Heka on their page [HekaIntro]. Heka is primarily designed as a shipper, but has some support for normalization. It is written in Go, but can be extended in the language Lua as well. It uses RabbitMQ as the primary transport, but does not support the TLS encryption of AMQP. It can write directly to elasticsearch since version 0.3, released in July 2013. Heka is a very young product, but with the support of the Mozilla Foundation it could become very interesting in the coming months. Heka has some log normalization features as shown on page 26. The documentation is surprisingly thorough for such a young project.

4.2 Output

Some webpages and graphics generators are tool independent. These are shown here.

4.2.1 Webpage

The webpages shown here do not belong to a special tool or framework, but are developed independently.

4.2.1.1 LogAnalyzer

The LogAnalyzer is created by the same developer as rsyslog. It uses rsyslog to write syslog data into a MySQL database and shows this data via a webpage. Strictly speaking this tool should not be shown here, because it does not use structured log data; instead it only uses the semi-structured log data from syslog. But there is a possibility to extend LogAnalyzer to handle structured data as described on the rsyslog webpage [Gerhards2011-2].
4.2.1.2 Kibana 2

Kibana 2 is a web front end for accessing the data written to elasticsearch by logstash. Kibana 2 is based on Ruby and needs a lot of Ruby gems. It is suggested that these are installed with the help of the Ruby gem bundler. Kibana 2 is offering interactive graphs and can show trends and distribution of fields in the log data. It even can create dashboards or rss feeds based on lucene queries. Kibana 1 was based on php, but is long abandoned.

Illustration 9: Kibana 2 web page

4.2.1.3 Kibana 3

Kibana 3 is a new version of Kibana written completely in HTML5 and Javascript by the elasticsearch team. The code running in the web browser talks directly with the elasticsearch server. It is not yet considered stable, but is working quite well already. The difference between kibana 2 and 3 is not only the programming language, but the way kibana 3 works without a fixed format in elasticsearch. It is possible to use it on other data as well, as long as there is a time field. It is even possible to use graylog2 format with kibana 3. Because of the use of HTML5 it does not need anything special on the server side, but it needs a modern web browser on the client side and therefore it does not work with older versions of "Internet Explorer", which are used at too many companies. Kibana 3 is more like a web based dashboard creation tool than a simple dashboard. It ships with an example for logstash, but it can be very easily extended and rewritten without a single line of code.
Illustration 10: Kibana 3 web page

4.2.2 Graphs

The different web front ends can create some simple graphs, but if the graphs should be used in some other website like monitoring, a special graph tool is needed.

4.2.2.1 StatsD

StatsD was created by Etsy to follow their religion of the "Church of Graphs. If it moves, we track it." as stated on [Malpass2011]. StatsD is a simple event tracking system written in Javascript based on Node.js. It receives the status changes via a UDP socket. A new counter does not need to be created; simply start adding data to a new counter and graphs will automatically be created. For very frequently hit counters it is possible to send only every tenth or every 100th event to StatsD and it will be correctly stored in the counter. StatsD does not create graphs itself, but normally sends the data on to graphite to generate the graphs.

4.2.2.2 Graphite

Graphite is used quite often together with StatsD, but can be used without it. Graphite is written in Python and uses the Twisted and Django frameworks. Internally it uses whisper as a database for time-series data (similar to rrd), carbon is the data point receiver and a web application is available
to display the graphs, also called metrics. The internal architecture is explained in chapter 7 of [OSArch2012]. Every graph or metric has a path that is used to specify the graph and can help organize it as well, like company.websites.logging.auth.user.error. The graphite messages are sent in a format like this:

    path_to_graph value unixtimestamp
    company.websites.logging.auth.user.error 1 1375946427

The graphite server takes this information and stores it aggregated in the whisper database, based on the configuration for this tree of graphs, and generates the graphs to be used by a webpage.

4.2.2.3 Fnordmetric

Fnordmetric is a collection and visualization framework for time series data. There are two backends to choose from plus a web GUI.

4.2.2.3.1 Fnordmetric Classic

Fnordmetric Classic is written in Ruby and uses a redis NoSQL database for storage. This is a Ruby framework to write webpages with graphs, using Ruby as a Domain Specific Language with pre-built widgets to ease development. It receives data as JSON via a TCP/UDP port or via HTTP POST.

4.2.2.3.2 Fnordmetric Enterprise

Fnordmetric Enterprise is using Scala that is run on a JVM. Fnordmetric can be used as a replacement for statsd+graphite, but the API to receive data is different. Fnordmetric Enterprise can receive data via TCP/UDP or via a http websocket. The big difference is that the name of the metric contains the metric type like mean, sum, min/max etc.

4.2.2.3.3 Fnordmetric UI

The fnordmetric UI is a HTML5 application framework that connects to one of the two fnordmetric backends. It is possible to create a new html page or integrate it into an existing one, with only some Javascript addons including fnordmetric itself and jquery.

4.3 Storage

4.3.1.1 mysql

Mysql is the world's most used Open Source and Free Software relational database. Created by the MySQL AB company in 1995, it was bought by Sun and later by Oracle. Since Oracle acquired MySQL, the Open Source developers including the original authors were not happy with the way Oracle handled the project. They therefore created a fork named MariaDB. MariaDB and MySQL are quite compatible.
In this thesis the term mysql refers to both databases MariaDB and MySQL. Because mysql is so famous, it will not be introduced here further.

4.3.1.2 MongoDB

MongoDB was one of the earliest NoSQL databases and one of the most used ones of its kind. MongoDB uses a binary representation of JSON called BSON. MongoDB is very old for a NoSQL system with a production ready release in 2010. MongoDB is developed and supported by the US based MongoDB, Inc. MongoDB is a document database, that can store data schema free. To create high availability setups it is supporting a master-slave configuration. This normally runs in an
asynchronous mode, so the databases are not always in sync. To split huge databases it uses sharding, where the data is distributed to different machines based on a shard key. It also supports map-reduce to distribute data and aggregation operations.

4.3.1.3 ElasticSearch

Elasticsearch (ES) is also document oriented like MongoDB, but it is designed as a pure search engine based on the Apache lucene search library instead. It provides realtime data analytics and can be distributed over multiple machines, both for load reasons as well as to improve availability. It has full text search capabilities and can store queries inside elasticsearch and execute them faster when needed. As a query language it uses the lucene syntax that includes searches in fields like this:

    host_source:testmachine AND error

Elasticsearch is very easy to set up because it only needs a Java runtime, the Java jar file and a config file with the cluster name in it. A cluster is a collection of multiple elasticsearch instances that can talk with each other. To create a cluster with elasticsearch machines, simply make sure multicast is available and all nodes use the same cluster name. There is a special tutorial for using elasticsearch to store logs [Gheorghe2012]. Elasticsearch uses JSON as the document format. The JSON documents that should be stored inside ES can be put there with a http PUT request. To improve performance multiple JSON documents can be stored in one request using the bulk API. To improve performance even further a so called river plugin can be used to push documents into the ES instance. Other types of plugins include management web front ends called "site plugins". The site plugin "elasticsearch-head" is used very often and allows a fast and easy overview of the cluster and can make changes as well. There are several other site plugins available, that can give an insight into the performance and resource usage of elasticsearch.

Illustration 11: Elasticsearch with HEAD plugin
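To make the bulk API mentioned in the Elasticsearch description more concrete: a bulk request body is newline-delimited JSON in which every document is preceded by an action line. The following Python sketch only builds such a body and does not talk to a server. The function name build_bulk_body, the index name and the sample fields are assumptions for illustration; Elasticsearch versions of the time additionally expected a document type, either in the URL or in the action line, which is left out here.

```python
import json


def build_bulk_body(index, documents):
    """Return a newline-delimited JSON body for an Elasticsearch _bulk request.

    Hypothetical helper for illustration; sending the body is not shown.
    """
    lines = []
    for doc in documents:
        # Action line: index the following document into `index`.
        lines.append(json.dumps({"index": {"_index": index}}))
        # Source line: the log event itself.
        lines.append(json.dumps(doc))
    # The bulk format requires a newline after the last line as well.
    return "\n".join(lines) + "\n"


# Usage with two made-up log events and a made-up index name.
body = build_bulk_body(
    "logstash-2013.08.30",
    [
        {"host_source": "testmachine", "message": "error: disk full"},
        {"host_source": "testmachine", "message": "login failure"},
    ],
)
print(body)
```

Batching several events into one request like this saves one HTTP round trip per event, which is where the performance gain of the bulk API comes from.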
Elasticsearch uses sharding and replicas to speed up access and distribute the load. Sharding is used to split up the data to be distributed to multiple machines inside a cluster. The default number of shards is five. In this configuration, if a cluster has more than 5 machines, there are some machines that do not store any data. So it is helpful to create at least as many shards as there are nodes in the cluster. These shards can also be replicated to other machines inside the cluster to work as a failover in case of a lost node. Changing the shards in an index can be very complicated, so it is easier to create the correct number of shards at creation time. Adding additional replicas on the other hand is very easy. As Florian Gilcher wrote in [Gilcher2012] it is quite easy to create a split brain situation. This can happen when the elasticsearch nodes are distributed in two data centers and the connection is severed. In case of a split brain situation, there is no possibility to reintegrate the two sides; it is necessary to delete one side and use only the other. This will lead to loss of data, if it is not handled correctly up front.

4.4 Transports

Syslog is nice to send log messages via a network, but when it comes to reliability and security the following dedicated transport systems can offer some nice alternatives.

4.4.1 redis

[Redis] is an in-memory database that supports replication with a master-slave setup, as well as persistence with the help of snapshot and journal files. For the use as a log transport the built-in feature called channels is used to create a publish-subscribe messaging infrastructure. Together with the replication support it is possible to create a highly available setup, but this is still in development and not supported yet. Redis does not support data encryption and offers only a password authorization as described in [redis-security]. The documentation is very good and if this is not enough there is a free book written by Karl Seguin called "a little introduction book about redis" at [Seguin2013].
4.4.2 rabbitmq

[RabbitMQ] is an Open Source message broker that is developed by RabbitMQ Inc, a London based company, now owned by VMWare. It is written in Erlang and is based on the Open Telecom Platform. Erlang is a functional programming language and famous for its possibility to restart the program in parts, without a complete restart or losing connectivity or function. RabbitMQ also contains access libraries for a lot of different programming languages. RabbitMQ includes gateways to talk with AMQP, STOMP and MQTT. Transport encryption with TLS is available built-in, using the openssl library as written in [RabbitMQ-SSL]. RabbitMQ can be set up to cluster multiple machines in a local network into a single logical broker. This makes it possible to restart single machines without any service interruption. It is also possible to mirror queues over several machines to ensure that in case of a hardware failure no messages are lost. All this high availability is always paid for with a performance penalty. The documentation is very extensive and contains examples for all use cases.

4.4.3 ActiveMQ

ActiveMQ is also an Open Source and Free Software message broker. It is written in Java and released under the supervision of the Apache foundation. The client is available in many languages and it supports not only AMQP and STOMP, but also XMPP (formerly Jabber) and a RESTful web API as written in [ActiveMQ-Features]. According to [ActiveMQ-Cluster] ActiveMQ also supports
clusteing in diffeent flavos. Fom failove cluste hich make sue that the clients can send messages to the boke even hen a node is don, to Maste-Slave setup hee the messages of one node ae stoed on a second machine to be send, if the maste node goes don. ActiveMQ is suppoting encyption as itten in [ActiveMQ-SSL]. The ActiveMQ documentation is complete and seveal books ae available. 4.4.4 Ømq [Ømq] o 0MQ o zeomq ae thee diffeent spellings fo the same pogam. Ømq is a socket libay, that can send messages to anothe pocess. This can happen inside the same pocess via inpoc, to othe pocesses on the same machine via IPC, o to pocesses on othe machines ith the help of TCP o Mulitcast connections. The advantage is that it does not need a special boke seve unning. O as Piete Hintjens ote in [Ømq] - The Guide: "Ømq... looks like an embeddable netoking libay but acts like a concuency fameok." As itten on page 22 Ømq does not suppot encyption, but the development has stated to suppot this. Because it does not have a specialized boke it does not suppot clusteing. It is designed fo speed, not fo eliability. Ømq is ceated by imatix, a Belgium based company hich povides commecial suppot offeings. The documentation is vey good and can also be obtained as a book [Hintjens2013]. 4.5 Collecto/Shippe The possibilities of the diffeent shippe/collecto can be seen on the ovevie on page 20. 4.5.1 Fluentd Fluentd is a vey univesal shippe/collecto and is developed by Sadayuki Fuuhashi and itten in C fo the pefomance elevant pats and the est in Ruby as itten in the [FLUENTD-FAQ]. It is sponsoed by the Company Teasue Data in Califonia hich offes a cloud based log analysis platfom and uses fluentd to send the data to thei cloud. It uses JSON as the native log fomat and can be consideed a syslog of stuctued logs. It uses a plugin achitectue, ith suppot of output and input plugins. Thee ae ove 150 plugins available fo fluentd and developing a plugin is vey easy. 
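As a small illustration of what "JSON as the native log format" means in practice, the sketch below models a fluentd-style event as a tag, a timestamp and a record of key-value pairs. The function name make_event, the tag "app.auth" and the record fields are assumptions chosen for this example and are not taken from the fluentd source code.

```python
import json
import time


def make_event(tag, record, timestamp=None):
    """Build a fluentd-style event: [tag, unix time, record]."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return [tag, ts, record]


# Hypothetical tag and fields, for illustration only.
event = make_event("app.auth", {"user": "alice", "action": "login"}, 1377849600)
serialized = json.dumps(event)
```

Because the record is already a set of named fields, a downstream tool can filter or route on the tag and on single fields without parsing free text again.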
The elasticsearch output plugin writes in the same format as logstash. Fluentd does not support any encrypted transports, but it supports reliable transport and persistent disk and memory buffers in case a target server is down. It also supports a highly available setup, where multiple fluentd instances send the messages to a log aggregator running on two machines. The sender will switch over to the backup aggregator when the primary is down.

Fluentd can be installed as the fluentd Ruby gem or as a td-agent build packaged as deb or rpm. The td-agent version has a slower release cycle and is more thoroughly tested, but of course the gem version gets new features faster.

4.5.2 flume

The Apache flume project is "a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data" according to the [FlumeUserGuide]. Flume is written in Java and supports multiple Hadoop mechanisms. It is included in this thesis because it also supports other inputs and outputs, such as an elasticsearch output that writes like logstash. To write to elasticsearch it is necessary to add the elasticsearch and lucene-core jars to the lib directory of the flume installation, because it uses the same mechanism, an elasticsearch node, to add the data to the cluster.
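What "writes like logstash" means for an elasticsearch output can be sketched as follows: events go into one index per day named logstash-YYYY.MM.DD, and each document carries an @timestamp field. The helper names and the flat field layout are assumptions for this illustration; the exact document layout depends on the logstash version, so this is not the definitive format of any particular release.

```python
from datetime import datetime, timezone


def logstash_index_name(ts, prefix="logstash"):
    """Daily index name in the logstash convention, e.g. logstash-2013.08.30."""
    return "%s-%s" % (prefix, ts.strftime("%Y.%m.%d"))


def logstash_document(message, ts, **fields):
    """Build a document with an ISO 8601 @timestamp (ts is assumed to be UTC)."""
    stamp = ts.strftime("%Y-%m-%dT%H:%M:%S.") + "%03dZ" % (ts.microsecond // 1000)
    doc = {"@timestamp": stamp, "message": message}
    doc.update(fields)  # additional structured fields (assumed flat layout)
    return doc


when = datetime(2013, 8, 30, 12, 0, 0, 123000, tzinfo=timezone.utc)
index = logstash_index_name(when)
doc = logstash_document("session opened", when, host="web01", program="sshd")
```

Keeping to this naming convention is what allows a front end written for logstash data to read events that were shipped by a different tool.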
The input mechanism is called source in the flume documentation and the output is called sink. Sources and sinks are connected with channels. Flume does not support encryption outside of Hadoop, but memory and disk based persistence is available in case of server downtime. The [FlumeUserGuide] is very extensive with a lot of examples.

4.5.3 awesant

Awesant is a Perl based shipper that supports redis and the rabbitmq client library. With the help of a second awesant instance running on the redis machine, it is possible to use encrypted redis.

4.5.4 beaver

Beaver is a Python based shipper that supports redis and 0mq and uses the rabbitmq client library to access AMQP and STOMP based servers. A nice feature of beaver is the support for an ssh tunnel created at startup. This also makes it possible to create an encrypted connection to 0mq and redis.

4.5.5 lumberjack

The logstash developer needed a shipper that was not Java based and had very low memory and CPU requirements. To fulfill this need he created lumberjack, because there was no other system supporting that. There are two implementations of lumberjack, a Ruby and Go based one and a Ruby and C based one. Both are still receiving patches, but the Go based system is much more actively developed and is stored in the master branch of the Git repository. According to its creator [Sissel2013] it is: "encrypted, trusted, compressed, latency-resilient, and reliable transport of events". It uses OpenSSL as a base library and uses X.509 certificates to check the server certificate. It is possible to use client certificates as well. Lumberjack does not support caching in case of a connection problem, but it is possible to configure multiple logstash servers that are used in case of a connection problem.

4.5.6 eventlog-to-syslog

Eventlog-to-syslog is an Open Source and Free Software Windows based Eventlog shipper. It is based on the source code from Curtis Smith (Purdue University) and is now developed by Sherwin Faria (Rochester Institute of Technology). This tool supports both Windows Eventlog formats, before and after Windows 2008 and Windows Vista.
It is written in C++ and sends the Eventlogs to a syslog server. It supports the traditional syslog format via UDP and TCP and also supports server failover in case of an unreachable syslog server.

4.5.7 woodchuck

Woodchuck is a very simple Ruby based shipper that only supports redis as output and therefore has no crypto support.

4.5.8 ncode/logix

Logix is a very simple Python based log shipper that has been developed since 2011. It accepts UDP syslog messages and sends them in GELF format to an AMQP server to be read by graylog.

4.5.9 syslog-shipper

Syslog-shipper is a very simple shipping tool written in Ruby. It can only read multiple files and send them to a syslog server. It will add the syslog header if requested, uses TCP and supports TLS encryption.
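Adding a syslog header to a plain file line, as a file shipper does when requested, can be sketched in a few lines. The example follows the traditional BSD syslog layout (RFC 3164): a priority value computed as facility * 8 + severity, a timestamp, the host name and a tag. The function name and the sample values are assumptions for this illustration and do not reproduce the code of syslog-shipper.

```python
from datetime import datetime

# Month names are spelled out to stay independent of the system locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def bsd_syslog_line(message, hostname, tag, when, facility=1, severity=6):
    """Prefix a raw log line with a BSD syslog (RFC 3164) header.

    The defaults are facility 1 (user-level) and severity 6 (informational).
    """
    pri = facility * 8 + severity
    # RFC 3164 timestamps pad single-digit days with a space, not a zero.
    timestamp = "%s %2d %02d:%02d:%02d" % (
        MONTHS[when.month - 1], when.day, when.hour, when.minute, when.second)
    return "<%d>%s %s %s: %s" % (pri, timestamp, hostname, tag, message)


# Hypothetical host name, tag and message.
line = bsd_syslog_line("hello", "web01", "app", datetime(2013, 8, 5, 9, 3, 7))
```

With the defaults the priority value is 14, so the sample line starts with "<14>".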
4.5.10 remote_syslog

remote_syslog is a Ruby tool that also reads files and sends them to a remote server via syslog. It supports TLS encryption with client certificates and can detect new log files when using globs: if it is configured to look for files like /somewhere/*.log and a new log file appears, it will be collected. It can even do some basic parsing.

4.5.11 systemd/journal2gelf

Journal2gelf is a very simple Python based shipper created in 2012 that takes systemd's journal messages as JSON from STDIN and sends them to a graylog2 server via GELF.

4.6 Analysis

Most analysis tools are part of a multi-purpose tool. There are only two independent analysis tools: Sagan is described on page 30 and SEC is described on page 29.
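The kind of conversion a shipper such as journal2gelf (section 4.5.11) performs can be sketched as a mapping from one journal entry, exported as JSON, to a GELF message. The journal field names used here (MESSAGE, PRIORITY, _HOSTNAME, __REALTIME_TIMESTAMP) are standard journal fields; the function name, the GELF version string and the handling of the remaining fields are assumptions for this illustration and not the tool's actual code.

```python
CORE_FIELDS = ("MESSAGE", "PRIORITY", "_HOSTNAME", "__REALTIME_TIMESTAMP")


def journal_to_gelf(entry):
    """Map one systemd journal entry (a dict parsed from JSON) to a GELF dict."""
    gelf = {
        "version": "1.1",  # assumed GELF version
        "host": entry.get("_HOSTNAME", "unknown"),
        "short_message": entry.get("MESSAGE", ""),
        "level": int(entry.get("PRIORITY", 6)),  # syslog severity, default info
    }
    if "__REALTIME_TIMESTAMP" in entry:
        # The journal stores microseconds since the epoch; GELF uses seconds.
        gelf["timestamp"] = int(entry["__REALTIME_TIMESTAMP"]) / 1e6
    for key, value in entry.items():
        if key in CORE_FIELDS:
            continue
        # GELF additional fields are prefixed with one underscore; "_id" is reserved.
        name = "_" + key.lstrip("_").lower()
        if name != "_id":
            gelf[name] = value
    return gelf


# Hypothetical journal entry.
sample = {"MESSAGE": "started", "PRIORITY": "5", "_HOSTNAME": "web01",
          "__REALTIME_TIMESTAMP": "1377849600000000", "_PID": "42",
          "SYSLOG_IDENTIFIER": "sshd"}
gelf_message = journal_to_gelf(sample)
```

A complete shipper would read one JSON object per line from STDIN, apply this mapping and send the serialized result to the graylog2 server.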
5 Toolchains

There are multiple ways to build a structured log analysis solution based on the tools described in this thesis. Selecting the correct solution is not an easy task. Chapter 4 tries to give an overview of the different advantages and disadvantages of these tools.

5.1 Possible toolchains

Because of the large number of possibilities to combine the different tools, it is necessary to get an overview of the toolchains which can be built. Because there are so many collectors/shippers available, this thesis will start from the webpage side to show an overview of the possible tool stacks.

The tool LogAnalyzer does not really support structured log file analysis, even though some additional fields can be added. This is not a real structured solution and will not be considered in this part of the thesis.

Illustration 12: Possible toolchains (red=storage, yellow=normalize, white=webpages, blue=shipper)

As seen in Illustration 12, ELSA and Octopussy are two special cases: they are not modular like the other tools, but combine different tools in a predefined way. This makes it harder to build things on top of the tools.

The tools fluentd and flume could be used to write the log data directly to elasticsearch, but then no normalization would be possible. Both are therefore not included in the list of toolchains.

Kibana 3 can access elasticsearch in graylog2 format, but only by creating the dashboard completely from scratch, hence the dashed line.
The webpages can access the data storage in parallel, as they do not make changes to the data. Because of that they will be ignored in the collection of toolchains.

This results in the following toolchains:

Elasticsearch (graylog format) - graylog2 - logstash
Elasticsearch (graylog format) - graylog2 - nxlog
Elasticsearch (logstash format) - logstash (grok)
Elasticsearch (logstash format) - rsyslog (liblognorm)
ELSA
Octopussy

These six toolchains will be discussed in this chapter.

5.2 Toolchain Features

To get a better overview of the different toolchains, some features will be compared. This includes features like high availability, size of rule base and ease of installation.

5.2.1 Accepting structured log files

To select a solution it is necessary to know what kind of log files are going to be processed. This analysis should not only be done for the current log files, but should also include thoughts about what kind of log files need to be processed in the future. If programs start to write structured log files in the future, it would be bad if these messages could not be used by the selected toolkit.

If the selected solution needs to accept already structured log files, both ELSA and Octopussy can be removed from the selection. Their only input is semi-structured syslog messages. Both can normalize different kinds of syslog messages, but structured log messages cannot be sent in. The idea of denormalization and re-normalization will be ignored here because of possible parsing errors and performance reasons.

Program                                                 accepts structured log files
Elasticsearch (graylog format) - graylog2 - logstash
Elasticsearch (graylog format) - graylog2 - nxlog
Elasticsearch (logstash format) - logstash (grok)
Elasticsearch (logstash format) - rsyslog (liblognorm)
ELSA                                                    no
Octopussy                                               no

Table 6: Feature: accepting structured log files

5.2.2 Reliable transport

The log files should not be lost on the way to the central log server.
This can be avoided by storing the log data locally whenever the log server is unavailable.
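The idea of storing log data locally while the server is unavailable can be sketched as a small spooling sender. The class name SpoolingSender and the in-memory queue are assumptions for this illustration; real shippers usually spool to disk so that messages also survive a restart of the shipper itself.

```python
from collections import deque


class SpoolingSender:
    """Keep messages in a local spool until the log server accepts them."""

    def __init__(self, send, max_spool=10000):
        # send: callable that delivers one message and raises OSError
        # (for example ConnectionError) when the server is unreachable.
        self._send = send
        # A bounded deque silently drops the oldest entry once it is full.
        self._spool = deque(maxlen=max_spool)

    def submit(self, message):
        """Queue a message and try to deliver everything that is pending."""
        self._spool.append(message)
        return self.flush()

    def flush(self):
        """Deliver spooled messages in order; stop at the first failure."""
        while self._spool:
            try:
                self._send(self._spool[0])
            except OSError:
                return False  # server still down, keep the remaining messages
            self._spool.popleft()  # remove only after a successful send
        return True
```

A message is removed from the spool only after the send call has returned without an error, so an outage delays delivery but does not reorder or lose messages as long as the spool does not overflow.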
Both graylog2 and logstash support AMQP for reliable message transfer. Logstash supports a huge number of other reliable inputs. ELSA is based on syslog-ng and only accepts syslog messages. Syslog-ng does not support reliable syslog transfer. Octopussy uses rsyslog to receive log messages; because of the RELP transport of rsyslog it is possible to get reliable syslog transport with Octopussy.

Program                                                 Reliable transport
Elasticsearch (graylog format) - graylog2 - logstash
Elasticsearch (graylog format) - graylog2 - nxlog
Elasticsearch (logstash format) - logstash (grok)
Elasticsearch (logstash format) - rsyslog (liblognorm)
ELSA                                                    no
Octopussy

Table 7: Feature: reliable transport

5.2.3 High availability

Log servers are an important part of the server infrastructure. To make sure the server is always running, it is advisable to create a high availability setup. This is normally done in the form of multiple servers running in parallel and writing to the storage in parallel. In case of a disaster it is helpful to distribute the data to different servers. Cluster software (like Red Hat, Veritas, VMware vMotion) is not considered here, as this is a general solution that can be used with every software.

With graylog2, multiple graylog2 nodes can be created which access the same elasticsearch cluster. Only one of the graylog2 servers needs to be configured as master, because it has to run some cleanup jobs, but that function can be moved easily.

Logstash also supports writing to the same elasticsearch cluster, as explained in chapter seven of the logstash book by [Turnbull2013].

ELSA also supports multiple nodes, but these nodes always write into their own local database. If a node goes down, it is possible to write to a different node and the data is stored there. A query will be sent to all nodes, and therefore all log messages from before and after the problem are integrated into one view.

Octopussy does not support a failover solution.
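The ELSA behaviour described above, where a query is sent to every node and the answers are combined into one view, amounts to a scatter-gather merge. The sketch below illustrates that idea with in-memory lists standing in for the per-node databases; the function name, the "ts" field and the sample events are assumptions and do not reflect how ELSA is implemented.

```python
import heapq


def query_all_nodes(nodes, predicate):
    """Run the same filter on every node and merge the hits by timestamp."""
    per_node = [
        sorted((event for event in node if predicate(event)),
               key=lambda event: event["ts"])
        for node in nodes
    ]
    # heapq.merge combines already sorted inputs into one sorted stream.
    return list(heapq.merge(*per_node, key=lambda event: event["ts"]))


# Hypothetical data: node_b took over while node_a was down.
node_a = [{"ts": 1, "msg": "login"}, {"ts": 4, "msg": "logout"}]
node_b = [{"ts": 2, "msg": "sudo"}, {"ts": 3, "msg": "cron"}]
view = query_all_nodes([node_a, node_b], lambda event: True)
```

Since every node only holds the messages it received itself, the merged view is complete only while all nodes that hold relevant data are reachable at query time.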
Program                                                 high availability   distributed store
Elasticsearch (graylog format) - graylog2 - logstash
Elasticsearch (graylog format) - graylog2 - nxlog
Elasticsearch (logstash format) - logstash (grok)
Elasticsearch (logstash format) - rsyslog (liblognorm)
ELSA                                                                        no
Octopussy                                               no                  no

Table 8: Feature: high availability
5.2.4 User separation and LDAP

A central log system can be very helpful with the correlated and centralized access to all log files, but sooner or later other people outside of the operators will want access to some of the log files. Having some kind of user separation is helpful here, making it possible to give certain users access to only certain kinds of log messages. Defining which user gets access to which log files should be an easy task. To handle user creation and central password management the solution should support LDAP or Active Directory.

Graylog2 makes it possible to create log streams and give users the permission to access the streams as readers.

Logstash does not have users and therefore cannot give a limited view of the log messages.

ELSA has the possibility to activate user logons. In the default setting everybody that can access the webpage can read all log files. With dashboards it is possible to store queries and give users access to the log messages that a query returns.

In Octopussy the user management is very detailed. It is possible to create read-only users that can read all log files, but cannot make any changes to the configuration. There are also restricted users that can be limited to the log data of certain devices, services, alerts or special reports.

Program                                                 User separation     LDAP
Elasticsearch (graylog format) - graylog2 - logstash
Elasticsearch (graylog format) - graylog2 - nxlog
Elasticsearch (logstash format) - logstash (grok)       no                  no
Elasticsearch (logstash format) - rsyslog (liblognorm)  no                  no
ELSA
Octopussy

Table 9: Feature: User separation and LDAP

5.2.5 Size of rule base

The creation of the rules for log normalization can be a very tedious job. It is nice to be able to access a large number of prepared rules without having to write them oneself.
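To show why writing such rules is tedious, here is what a single normalization rule could look like when expressed as a regular expression with named groups. It turns one kind of OpenSSH login message into structured fields. The field names, the event name and the function name are assumptions chosen for this illustration; they do not follow the rule syntax of Pattern-DB, liblognorm, grok or Octopussy.

```python
import re

# One rule covers exactly one message type of one program.
SSHD_ACCEPTED = re.compile(
    r"Accepted (?P<method>\S+) for (?P<user>\S+) "
    r"from (?P<src_ip>\d{1,3}(?:\.\d{1,3}){3}) port (?P<src_port>\d+)"
)


def normalize(message):
    """Return structured fields for a known message, or None if no rule matches."""
    match = SSHD_ACCEPTED.search(message)
    if match is None:
        return None
    fields = match.groupdict()
    fields["src_port"] = int(fields["src_port"])
    fields["event"] = "ssh_login"  # hypothetical normalized event name
    return fields


fields = normalize("Accepted password for alice from 192.0.2.10 port 50022 ssh2")
```

Every program and every message type needs a rule of its own, which is why a large prepared rule base saves so much work.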
Program                                                 # of prepared rules
Elasticsearch (graylog format) - graylog2 - logstash    0
Elasticsearch (graylog format) - graylog2 - nxlog       4
Elasticsearch (logstash format) - logstash (grok)       0
Elasticsearch (logstash format) - rsyslog (liblognorm)  0
ELSA                                                    ~130
Octopussy                                               ~1500

Table 10: Feature: size of rule base

5.2.6 Log Analysis

The centralized logs should be analyzed to detect attacks and other anomalies, but not all tools offer this required feature.
Program                                                 log analysis
Elasticsearch (graylog format) - graylog2 - logstash    limited
Elasticsearch (graylog format) - graylog2 - nxlog
Elasticsearch (logstash format) - logstash (grok)       limited
Elasticsearch (logstash format) - rsyslog (liblognorm)  limited
ELSA                                                    no
Octopussy

Table 11: Feature: log analysis

5.2.7 Install

Even though the installation is only done once, problems during installation discourage the use of a program. The installation should run on different Linux distributions; for this thesis three distributions were selected. The toolchains were installed on Red Hat Enterprise Linux (RHEL) 6.4, Debian 7 and Ubuntu 12.04 to test the installation.

Elasticsearch, graylog2 and logstash are all Java based tools. Thus the installation is very easy, as it only needs a Java runtime environment and some basic startup scripts. The difficulty of the installation of graylog2 and logstash is the dependency on a specific version of elasticsearch.

Rsyslog is prepackaged for all three distributions and is also available in the current version directly from the author.

Nxlog is not prepackaged by any distribution, but prepared packages for all three distributions are available from the project website.

ELSA uses an installation script. The script did not work during the test for this thesis with Debian 7 and RHEL 6.4, even though they are listed as supported platforms on the webpage [ELSAQuickstart].

Octopussy offers a tar.gz archive and a prebuilt Debian package. The Debian package works with Debian as well as with Ubuntu. The tar.gz archive was installable on RHEL 6.4. All three installations are described on the [OctopussyInstallation] web page.

Program                                                 Debian 7      Ubuntu 12.04  RHEL 6.4
Elasticsearch (graylog format) - graylog2 - logstash
Elasticsearch (graylog format) - graylog2 - nxlog
Elasticsearch (logstash format) - logstash (grok)
Elasticsearch (logstash format) - rsyslog (liblognorm)
ELSA                                                    no                          no
Octopussy

Table 12: Feature: easy to install

5.2.8 Speed

The performance of the log system can have a very large impact on the decision for a log system.
Measuring the performance can be very complicated because of the different architectures of the log toolchains. For example, ELSA uses a batch job based system that accepts messages and stores them in a queue; this queue is imported once a minute, as written in the chapter Capabilities on the webpage [ELSA-UserGuide]. The indexing is done "every few hours" for performance reasons. This cannot be fairly tested against an elasticsearch storage, which indexes the data on arrival.
Also the way to distribute load is very different between the toolchains. Whereas Octopussy can only run on one machine, ELSA can run on many machines, each with its local database. The elasticsearch based tools can distribute the data to many data nodes and share the data between data centers.

The master thesis of [Churilin2013] tries to handle this kind of problem. In that thesis the performance was tested with four different toolchains: graylog2, logstash and rsyslog writing to elasticsearch, and ELSA writing to MySQL. The result of the performance test is that ELSA is almost ten times faster than graylog2 or logstash and still five times faster than rsyslog. Octopussy was not covered in that thesis. The use of a virtual machine on top of a Windows workstation could lead to a lot of noise in the data. Because no raw data was published, it is not possible to check the standard deviation of the tests. The setup used suggests that the standard deviation was quite high. The very limited amount of RAM (only 2 GB) per machine could be a big disadvantage for the Java based systems with their large memory requirements.

A fair test of all toolchains would also need to include the normalization phase. To create a fair test it has to be made sure that the number of rules is the same everywhere, otherwise the larger pattern size of syslog-ng would count as a disadvantage.

One possibility to handle all these problems would be to test every tool separately, as far as this is possible. Tools that can perform multiple tasks, such as logstash, would be tested separately for every task. In the case of logstash this would mean a test for using it as a shipper, for writing data to elasticsearch and for using it for normalization with grok. With the help of the Forced Flow Law it would be possible to calculate the speed of the whole system, but the number of tests required for such an endeavor would be too large for this thesis. Because of all these problems no performance data is published in this thesis.

5.3 Summary

The six possible toolchains all have their advantages and disadvantages.
Table 13 is a summary of all the features compared in this chapter. This overview will not be used to declare a winner, because what is important always depends on the use case of the company. All six toolchains were running in a test environment for several weeks without any major incidents. The functionalities differ greatly between the tools, but all are stable and can be used on a daily basis.
Program                                                 Reliable      High          Distributed   User          LDAP          # of prepared Log           Debian 7      Ubuntu 12.04  RHEL 6.4
                                                        transport     availability  store         separation                  rules         analysis
Elasticsearch (graylog format) - graylog2 - logstash                                                                          0             limited
Elasticsearch (graylog format) - graylog2 - nxlog                                                                             4
Elasticsearch (logstash format) - logstash (grok)                                                 no            no            0             limited
Elasticsearch (logstash format) - rsyslog (liblognorm)                                            no            no            0             limited
ELSA                                                    no                          no                                        ~130          no            no                          no
Octopussy                                                             no            no                                        ~1500

Table 13: Feature: overview
6 Conclusion

It is already possible to create a centralized structured log file analysis infrastructure, and there are a lot of different tools available that can help create such a log infrastructure. Which tools or toolchains are selected is up to the user to decide. This thesis only shows the advantages and disadvantages of the different tools and the functionalities that they offer. Defining a winner for every use case would be unprofessional.

6.1 Short summary about every major tool

As a short overview, here are all major tools with a very short summary of the advantages and disadvantages as experienced by the author:

Syslog-ng: The huge normalization ruleset is very nice, the missing reliable syslog transport because of OpenCore is not.
Rsyslog: Very flexible syslog server, the usability of the normalization tool is limited by the missing rules.
Graylog2: It wants structure, but it will work without it. Can handle permissions of log messages.
Logstash: Swiss army knife of log management. "Everybody sees everything" makes it unusable for some companies.
Node-logstash: Can be a nice Java free alternative for logstash, but it is not there yet.
ELSA: Structured log files cannot be used. The fixation on google and google tools is disconcerting for some users. Designed for great speed.
Octopussy: Everything will be analyzed as a structured log file, but then stored in a normal file, which slows searches.
Nxlog: It transports logs and can do a lot of changes and analysis on the way. The OpenCore nature is not as bad as with others.
Heka: Very young, but already very usable, could become a great transporter.
kibana 2: Nice webpage, but will be overshadowed by its newer cousin.
kibana 3: Great flexible web front end for elasticsearch, as soon as it is finished.
statsd + graphite: Often used grapher tool chain, not really covered in this thesis.
mysql: Everybody knows it, MySQL will be replaced by MariaDB.
Elasticsearch: A very easy tool that eases a lot of problems. The split brain situation can be very dangerous in large setups if not handled correctly.
fluentd: Very universal transporter written in Ruby.
flume: Very universal transporter written in Java, destined for Hadoop.

6.2 Future

The future of structured log files is very uncertain. The CEE standardization process is dead and no revival is in sight because of uncertain funding. Project Lumberjack could become the new standard, if the creators push it hard enough. Systemd's Journal could also be the new standard, but will only be available for Linux. GELF and Logstash are already in use, but neither is a real standard. JSON appears to be the only unifying base format everybody is working with, but no decision has been reached yet on which fields should be used and how they should be named.

6.3 Optimal toolchain

If the author were to dream up a perfect log solution, it would be this:

A webpage for the creation of missing normalization rules like in octopussy.
A prepared normalization ruleset like in pattern-db or octopussy.
Logstash format for elasticsearch with graylog2 functionalities.
Both kibana 3 and graylog2 possible as web front ends.
Easy correlation and analysis based on the normalized data.
Statsd+graphite graphs that are prepared and automatically filled with data.
Logstash input/output capabilities without the memory overhead of Java.
A universal structured log format that is used by developers and tools alike.
Appendix: Abbreviations

Abbreviation  Meaning
AMQP          Advanced Message Queuing Protocol
ELSA          Enterprise Log Search and Archive
FIFO          First in - first out
GELF          Graylog Extended Log Format
GIT           Open Source code management tool
GNU           GNU's Not Unix
GPL           GNU General Public License
GUI           Graphical User Interface
JSON          JavaScript Object Notation
KV            Key Value
LDAP          Lightweight Directory Access Protocol
RFC           Request for Comments
RHEL          Red Hat Enterprise Linux
SVN           Subversion - an Open Source code management tool
STDIN         Standard Input
TCP           Transmission Control Protocol
TLS           Transport Layer Security
UDP           User Datagram Protocol
URL           Uniform Resource Locator
XML           Extensible Markup Language
Table 14: Total Overview: Part 1

Programs: syslog-ng, node-logstash, Heka, woodchuck, awesant, beaver, lumberjack, syslog-shipper, remote_syslog, flume, nxlog, fluentd, octopussy, graylog2, rsyslog, logstash, ncode/logix, systemd/journal2gelf, eventlog-to-syslog
Compared inputs and transports: lumberjack, gelf, 0MQ, Stomp (ActiveMQ, RabbitMQ), amqp (QPID, ActiveMQ, RabbitMQ), http, redis, RELP, websocket, TLS encrypted channel, IETF syslog tcp tls (RFC5425), IETF syslog tcp (RFC5424), IETF syslog udp (RFC5424), BSD syslog udp (RFC3164), spool message during downtime, systemd journal, local Windows eventlog, named pipe, unix domain socket, flat file, STDIN/STDOUT, directory structure / multiple files with *
Table 15: Total Overview: Part 2

Programs: node-logstash, nxlog, Heka, flume, syslog-ng, fluentd, graylog2, rsyslog, logstash
Compared outputs: SQL (DBI or native), elasticsearch, hadoop, mongodb, graphite, kafka, statsd, varnishlog, SNMP, graphtastic, linux accounting log, other
Entries listed under "other": log4j, irc, twitter, xmpp, email, nagios, amazon; nagios, DBI, fnordmetric, couchdb, own log libs; avro, JMS, HBase, solr, JDBC, irc