}w!"#$%&'()+,-./012345<ya

Size: px
Start display at page:

Download "}w!"#$%&'()+,-./012345<ya"

Transcription

1 }w!"#$%&'()+,-./012345<ya Masarykova univerzita Fakulta informatiky Application Log Analysis Master s thesis Júlia Murínová Brno, 2015


Declaration

Hereby I declare that this thesis is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during the elaboration of this work are properly cited and listed in complete reference to the due source.

Júlia Murínová

Advisor: doc. RNDr. Vlastislav Dohnal, Ph.D.


Acknowledgement

I would like to express my gratitude to doc. RNDr. Vlastislav Dohnal, Ph.D. for his guidance and help during the work on this thesis. Furthermore, I would like to thank my parents, friends and family for their continuous support. My thanks also belong to my boyfriend for all his assistance and help.


Abstract

The goal of this thesis is to introduce the area of log analysis in general, compare available systems for web log analysis, choose an appropriate solution for the sample data and implement the proposed solution. The thesis contains an overview of monitoring and log analysis, the specifics of application log analysis and definitions of log file formats. Various available systems for log analysis, both proprietary and open-source, are compared and categorized, with overview tables comparing the supported functionality. Based on the comparison and requirements analysis, an appropriate solution for the sample data is chosen. The ELK stack (Elasticsearch, Logstash and Kibana) and the ElastAlert framework are deployed and configured for the analysis of sample application log data. The Logstash configuration is adjusted for collecting, parsing and processing the sample data input, supporting reading from a file as well as online collection of logs over a socket. Additional information for anomaly detection is computed and added to the log records during Logstash processing. Elasticsearch is deployed as the indexing and storage system for the sample logs. Various Kibana dashboards for overall statistics, metrics and anomaly detection are created and provided. ElastAlert rules are set up for real-time alerting based on sudden changes observed in event monitoring. The system supports two types of input, server logs and client logs, which can be reviewed in the same UI.


Keywords

log analysis, threat detection, application log, machine learning, knowledge discovery, anomaly detection, real-time monitoring, web analytics, log file format, Elasticsearch, Kibana, Logstash, ElastAlert, dashboarding, alerting


Contents

1 Introduction
2 Monitoring & Data analysis
   2.1 Monitoring in IT
   2.2 Online service/application monitoring
   2.3 Data analysis
      Big data analysis
      Data science
      Data analysis in statistics
   2.4 Data mining
   2.5 Machine learning
   2.6 Business intelligence
3 Log analysis
   3.1 Web log analysis
   3.2 Analytic tests
   3.3 Data anomaly detection
   3.4 Security domain
   3.5 Software application troubleshooting
   3.6 Log file contents
      Basic types of log files
      Common Log File contents
      Log4j files contents
   3.7 Analysis of log files contents
4 Comparison of systems for log analysis
   4.1 Comparison measures
      Tracking method
      Data processing location
   4.2 Client-side information processing software
   4.3 Web server log analysis
   4.4 Custom application log analysis
   4.5 Software supporting multiple log file types analysis with advanced functionality
   4.6 Custom log file analysis using multiple software solutions integration
5 Requirements analysis
   Task description
   Requirements and their analysis
   System selection
   Proposed solution
   Deployment
6 Application log data
   Server log file
   Client log file
   Data contents issues
Logstash configuration
   Input
      File input
      Multiline
      Socket based input collection
   Filter
      Filter plugins used in configuration
      Additional computed fields
      Adjusting and adding fields
      Other Logstash filters
   Output
      Elasticsearch output
      File output
      output
   Running Logstash
Elasticsearch
   Query syntax
   Mapping
   Accessing Elasticsearch
Kibana configuration
   General dashboard
   Anomaly dashboard
   Client dashboard
   Encountered issues and summary
ElastAlert
   Types of alert rules
   Created alert rules
Conclusion
   Future work
      Nested queries
      Alignment of client/server logs
Appendix 1: Electronic version
Appendix 2: User Guide
   Discover tab
   Settings tab
   Dashboard tab
   Visualization tab
Appendix 3: Installation and setup
   Logstash setup
   Elasticsearch setup
   Kibana setup
   ElastAlert setup
Appendix 4: List of compared log analysis software
Literature


1 Introduction

Millions of online accesses and transactions per day create great amounts of data that are a significant source of valuable information. Analysis of such high volumes of data requires appropriate and sophisticated methods to process them promptly, efficiently and precisely. Data logging is an important asset in web application monitoring and reporting, as it captures massive amounts of data about the application behavior. Analysis of logged data can be of great help in reporting malicious use, detecting intruders, assuring compliance and spotting anomalies that might lead to actual damage.

In my master's thesis I will be looking into the main benefits of monitoring, web application service log analysis and log records processing. I will compare a number of available systems for collecting and processing log records, considering both existing commercial and open-source solutions. With regard to the sample data collected from a chosen web application, the most fitting solution will be chosen and proposed for the required data processing. This solution will then be implemented, deployed and tested on the sample application log records.

The goals of this thesis are:

- Get familiar with the terms of monitoring, data mining and log records analysis;
- Investigate the possibilities and benefits of collecting and analyzing log records data;
- Look into different types of log formats and the information they contain;
- Compare and categorize commercial and open-source systems available for log analysis;
- Propose an appropriate solution for the analysis of the sample log records, based on the previous comparison and requisites;
- Implement the proposed solution, deploy and test it on the sample data;
- Summarize the results of the implementation and list possible future improvements.


2 Monitoring & Data analysis

Monitoring as a verb means "to watch and check a situation carefully for a period of time in order to discover something about it" (definition from the Cambridge dictionary). The fundamental challenge in the IT monitoring process is to adapt quickly to continuous changes and to make sure that cost-effective and appropriate software tools are used. The strength of the controlling process is based on both preventive and detective controls, which are also crucial parts of change monitoring. There might be some bottlenecks with regard to the different types of data that need to be monitored, as not all types of monitoring systems allow the logging of records. Also, automated data logging processes might not be cost-effective, since they slow down the processing of the data itself. Basically, the strategies for automated monitoring include IT-inherent, IT-configurable, IT-dependent manual or manual guidelines, and these need to be evaluated carefully considering the requisites and available resources. [1]

2.1 Monitoring in IT

For information technologies in particular, there are a few types of monitoring that are distinguished according to their purpose rather than the contents of the monitoring processes themselves, as those often overlap. Some of the types are listed below and briefly described:

- System monitoring: a system monitor (SM) is a basic process of collecting and storing system state data;
- Network monitoring: a monitoring system set up for reporting network issues (slow processing, connection discrepancies);
- Error monitoring: focuses on error detection, catching and handling potential issues within the code;
- Website monitoring: specific monitoring of website contents and access, reporting broken functionality or other issues related to the monitored website;
- APM (Application performance management) [2]: based on end-user experience and other IT metrics, APM is a fundamental software application monitoring and reporting discipline that ensures a certain level of service. It consists of four elements (see Figure 2.1):

- Top Down Monitoring (Real-time Application Monitoring): focuses on end-user experience and can be active or passive;
- Bottom Up Monitoring (Infrastructure Monitoring): monitoring of operations and a central collection point for events within processes;
- Incident Management Process (as defined in ITIL): a foundation pillar of APM, focuses on improvement of the application;
- Reporting (Metrics): monitoring that collects raw data for the analysis of application performance.

Figure 2.1: Anatomy of APM [2]

Online service/application monitoring using log analysis is often compared to APM or error monitoring and contains many overlapping processes. The main difference between them lies in the core purpose of the monitoring. For APM the emphasis is put more on the end-user perspective and on enabling the best application performance possible. Error monitoring focuses on catching potential code errors by implementing an adequate level of error-controlling mechanisms in the code.

2.2 Online service/application monitoring

Near real-time monitoring of logged data with automatic reporting is needed to obtain the expected levels of security and quality that have to be maintained 24 hours a day. A certain level of uniformity in logging patterns is important for more possibilities in standardizing the log analysis process. The specified event levels and categories should simplify detecting and handling suspicious activity or system failures. [3] The crucial part of the monitoring and reporting process is identifying the problematic data in log records and evaluating the appropriate response: automatic, semi-automatic or manual. To decide on the rules to be run for recognizing these malicious patterns, the speed of detection and the processing capacity need to be considered.

2.3 Data analysis

What is usually understood by the term data analysis is a process of preparing, transforming, evaluating and modeling data to discover useful information helpful in subsequent conclusion finding and data-driven decision making. The process itself includes obtaining raw data, converting it to a format appropriate for analysis (cleaning the dataset), applying the required algorithms to the collected data and visualizing the output for evaluation.

Big data analysis

The term Big data (as defined in the Cambridge dictionary) is mostly used for much larger and more complicated data sets than usual. Huge amounts of records cause great challenges in their treatment and processing, as the traditional approaches are often not effective enough. Advanced techniques are needed to extract and analyze Big data, and new promising approaches are being developed specifically for their treatment. Big data processing focuses on the collection and management of large amounts of various data to serve large-scale web applications and sensor networks. A field called data science focuses on discovering underlying patterns in complex data and modeling them into the required output. [32]

Data science

A basic data science process consists of a few phases (see Figure 2.2 for a visualization). The process is iterative due to the possible introduction of new characteristics during execution.

The phases of the data science process are listed below: [4]

- Data requirements: clear understanding of the data specifics that need to be analyzed;
- Data collection: collection of data from specific sources (sensors in the environment, recording, online monitoring etc.);
- Data processing: organization and processing of the obtained data into a suitable form;
- Data cleaning: process of detecting and correcting errors in the data (missing, duplicate and incorrect values);
- Exploratory data analysis: summarizing the main characteristics of the data and its properties;
- Models and algorithms: data modeling using specific algorithms based on the type of problem;
- Data product: the result of the analysis based on the required output;
- Communication: visualization and evaluation of the data product, with modifications based on feedback.

Figure 2.2: The data science process [4]

Data analysis in statistics

Statistical methods are essential in data analysis, as they can derive the most important characteristics from the data set and use this information directly for visualization via basic information graphics (line charts, histograms, plots and charts). In statistics, data analysis can be divided into three different areas: [5]

- Descriptive statistics: mostly used for quantitative description. It contains basic functions (sum, median, mean) as characteristics of the data set.
- Confirmatory data analysis (CDA), also referred to as hypothesis testing: based on probability theory (significance level). It is used to confirm or reject a hypothesis.
- Exploratory data analysis (EDA): in comparison to confirmatory data analysis, EDA does not have a pre-specified hypothesis. It is mostly used for summarizing the main characteristics and exploring the data without formal modeling or testing of content assumptions.

2.4 Data mining

Data mining is, in a sense, a deeper step inside the analyzed data. It is a computational process of discovering patterns in the full set of data records to gain knowledge about its contents. Data mining combines the areas of artificial intelligence, machine learning, statistics and database systems to achieve the extraction of significant information and its transformation into a simplified format for future use. A basic task of data mining is the mostly automatic analysis of large amounts of data to detect outstanding patterns, which might consequently be used for further analysis by machine learning or other analytics. There are six basic tasks in data mining:

- Anomaly detection (outlier/change/deviation detection): detection of outstanding records in a data set;
- Association rule learning (dependency modeling): detection of relationships between variables and attributes;
- Clustering: detection of similar properties of the analyzed data and creation of groups based on this information;
- Classification: generalization of the type of structure and classification of input data based on the learnt information;
- Regression: detection of a function to model the data with the least error;
- Summarization: detection of a compact structure representing the data set (often using visualization and reports).

Data mining is also considered to be the analytics part of the Knowledge Discovery in Databases (KDD) process, used for processing data stored in database systems. The placement of data mining in the KDD process is also shown in Figure 2.3. The additional parts, such as data collection and preparation or the evaluation of results, do not belong to data mining but rather to the KDD process as a whole. [7]

Figure 2.3: Data mining placed in the KDD process [7]

2.5 Machine learning

Machine learning is a specific field exploring the possibilities of using algorithms that are capable of learning from data. These algorithms are based on finding structural foundations to build a model from the training data and derive rules and predictions. Based on the given input, machine learning is divided into the main categories listed below:

- Supervised learning: example input and the corresponding output are presented in the training data.
- Unsupervised learning: no upfront information is given about the data, leaving the pattern recognition to the algorithm itself.
- Semi-supervised learning: incomplete input information is provided; it is a mixture of known and unknown desired output information.
- Reinforcement learning: based on interaction with a dynamic environment to reach a certain goal (e.g. winning a game and developing a strategy based on previous success).

Machine learning and data mining contain similar methods and often overlap. However, they can be distinguished by the properties they process. While machine learning works with known properties learnt from the training data, data mining focuses on unknown properties and pattern recognition. [6]

2.6 Business intelligence

Business intelligence (BI) is a set of tools and technologies used for processing raw data and other relevant information into business analysis. There are numerous definitions of what exactly BI consists of. In this thesis, the definition where internal data analysis is considered a part of BI is used: "business intelligence is the process of collecting business data and turning it into information that is meaningful and actionable towards a strategic goal." BI is based on the transformation of available data into a presentable form enabling easy-to-use visualization. This information might be crucial for strategic business decisions, detection of threats and opportunities and better business insight. [9] The basic elements of business intelligence are:

- Reporting: accessing and processing raw data into a usable form;
- Analysis: identifying patterns in the reported data and initial analysis;
- Data mining: extraction of relevant information from the collected data;
- Data quality and interpretation: quality assurance and comparison between the obtained data and the real objects they represent;
- Predictive analysis: using the output information to predict probabilities and trends.


3 Log analysis

In the context of monitoring systems, data logs are records of the events detected by the sensors, and these data logs are further processed in the log analysis. Log analysis consists of the subsequent research, interpretation and processing of the generated records obtained by data logging. The most usual reasons for log analysis are security, regulation, troubleshooting, research, and automatic incident response. The semantics of the specific log records are designed by the developers of the software and might therefore differ for some specific areas of usage; sometimes these differences are not fully documented. A significant amount of time might therefore be needed for pre-processing the log records and modifying them into a form usable for the subsequent data analysis. In this thesis I will mainly focus on web log analysis, the analysis of logs generated in web communication and interaction. The following sections include general information about web log analysis, its possible uses and the common formats of these logs.

3.1 Web log analysis

The web log is basically an electronic record of the interaction between the system and its user. Therefore there may be additional user actions that trigger the creation of a record (not only requests for connection or data transmission, but also the overall behavior on the webpage, link/button clicking and similar). The area that targets measurement, collection, analysis and reporting of web data is called web analytics. Web analytics have been studied and improved significantly over the past years, mainly because of their significance in increasing the usability of web applications and gaining more users/customers from the marketing point of view. [11]

In comparison, Business Intelligence is focused more on marketing-based analysis of internal data from multiple sources. Even though various approaches and software solutions are available, it is still considered freer in terms of implementation and depends highly on the organization's needs, structure and tools. Web analytics, on the other hand, is specialized for the analysis of web traffic and web usage trends. As a whole it offers a solution for one area and is separated from the rest of the data. However, the borders are now more blurred, and web analytics can sometimes be perceived as one specific data flow from one source among others used as part of Business Intelligence.

The purpose of web log analysis also lies in monitoring the system-user communication. The actions of this communication are stored in electronic records and are subsequently analyzed for behavior patterns. These patterns are important for research of both user and system behavior and their reactions to various actions. The users' actions can include useful information about their usage of web applications and can be analyzed for system improvements, security defect detection and compliance records. The system replies and actions can reveal malfunctions on the server side, unusual behavior in the treatment of specific actions and erroneous responses. As a result, there are specific areas for the analytic tests performed on the data logs, which are discussed in the following section.

3.2 Analytic tests

From the statistical analysis of the data, there are two main kinds of approaches, or branches of classification of the communication information. The quantitative approach focuses on the numbers of accesses, transmissions and requests/actions, their distribution over time and the numbers of clients/ports/sessions. The qualitative approach, on the other hand, detects the parts of the communication which are out of the ordinary. Either according to the expectation of the web application usage or according to the analysis of test data, there is a certain basic behavioral pattern expected to be seen in the log records output. The records that follow the expected values are considered the normal dataset, and most of the overall analyzed dataset usually belongs to this group.

3.3 Data anomaly detection

There are often records that indicate different results than expected and might be significantly different from the other records in the dataset. These are considered anomalies in the data, and one of the most important goals of log analysis is their detection and treatment. Detecting anomalies, also called outliers, is one of the primary steps in data-mining applications. The first steps of an analysis include the detection of the outlying observations, which may be considered error or noise but also carry significant information, as such observations might lead to incorrect specifications and results. Some definitions of outliers are more general than others, depending on the context, the data structure and the method of detection used. The most basic view is that an outlier is an observation in the data set which appears inconsistent with the rest of the data. There are multiple methods for outlier detection, differing according to the specifics of the data set, and they are often based on distance measures, clustering and spatial methods. Outlier/anomaly detection is used in various applications, such as credit card fraud, data cleansing, network intrusion, weather prediction and other data-mining tasks. [10]

The subsequent anomaly analysis is essential for investigating the root cause of the detected anomaly, and it helps greatly in both inside and outside threat prevention. The inside kind of defects might include malfunctions in the system code or erroneous request processing. The outside threats are often web-based attacks and intrusion attempts. Anomaly detection plays a significant role in the detection of web-based attacks, in so-called anomaly-based intrusion detection systems (IDSs). A basic intrusion detection system monitors the web communication against a directory of known types of intrusion attacks and takes action once suspicious behavior is detected. However, to ensure a certain level of security against unknown types of attacks, potentially anomalous communication should also be monitored for possible threats. In this area, monitoring anomalies of web traffic is essential for finding new types of attack attempts that can be detected from the behavior records stored in the data logs. [13]
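To make the distance-based view mentioned in this section concrete, the following standalone Java sketch flags values lying unusually far from the mean of a series, for example per-minute request counts derived from a log file. The class name, threshold and sample values are illustrative assumptions only and are not part of the thesis implementation.

import java.util.Arrays;

public class SimpleOutlierDetector {

    // Flags values lying more than k standard deviations away from the mean.
    static boolean[] flagOutliers(double[] values, double k) {
        double mean = Arrays.stream(values).average().orElse(0.0);
        double variance = Arrays.stream(values)
                .map(v -> (v - mean) * (v - mean))
                .average().orElse(0.0);
        double stdDev = Math.sqrt(variance);
        boolean[] flags = new boolean[values.length];
        for (int i = 0; i < values.length; i++) {
            flags[i] = stdDev > 0 && Math.abs(values[i] - mean) > k * stdDev;
        }
        return flags;
    }

    public static void main(String[] args) {
        // Hypothetical per-minute request counts extracted from a web log.
        double[] requestsPerMinute = {12, 14, 11, 13, 12, 97, 13, 12};
        boolean[] outliers = flagOutliers(requestsPerMinute, 2.0);
        for (int i = 0; i < outliers.length; i++) {
            if (outliers[i]) {
                System.out.println("Possible anomaly in minute " + i
                        + ": " + requestsPerMinute[i] + " requests");
            }
        }
    }
}

Such a global threshold is only a starting point; the implementation part of this thesis instead relies on additional fields computed during Logstash processing to support anomaly detection on the sample data.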

3.4 Security domain

Frequent attack attempts are based on finding applications with flawed functionality. Taking advantage of vulnerabilities, the attacker inserts code which is executed by the web application, causing the transfer of malicious code into the backend or the reading of unauthorized data from the database. These types of attacks can be detected in the log files, as the injected code is recorded when sent to the server. Such after-the-fact detection is important for avoiding future attacks, but because it runs late, pro-active monitoring is essential. Basic regular expressions or more complicated methods can be used to create rules for the detection of known attacks. Communication containing injected harmful code is then rejected as a result.

The application runs on the 7th layer of the ISO/OSI model, and for detection to be efficient it has to see the relevant traffic. There are multiple parts of the communication that can be subject to attack. Figure 3.1 illustrates high-level attack detection in a network. On the lower layers (network and transport layer) there is a firewall working on traffic analysis based on common protocols. It can detect anomalies in protocols. However, it cannot detect attacks on the application, as it does not see the additional data from higher layers. A web application firewall, on the other hand, processes the higher-layer protocols and can analyze them more precisely. It contains enough information for filtering and detection and is therefore a good place for defining allowance rules for specific requests and for attack detection.

Figure 3.1: Illustration of communication zones for attack detection [18]

Web servers such as Apache and IIS usually create log files in the Common Log Format (CLF), described further in the following section. However, this kind of format does not contain the data sent in the HTTP header (e.g. POST parameters); since this header information can contain important data about possible attacks, this is a great deficiency of web server logs. As a part of the application logic there should also be a certain degree of validation of input and output data, with security information logging integrated. The application log files should contain full information about the actions of the user and therefore allow wide possibilities for misuse and threat detection mechanisms. A network intrusion detection system (NIDS) analyzes the whole traffic to and from the application. However, it has some disadvantages, such as difficulties with decrypting SSL communication and with real-time processing under high traffic load. Also, working on ISO/OSI layers 3 and 4 makes it unable to detect attacks targeting information on higher layers.

For attack detection, there are two possibilities: log file analysis and full traffic analysis. Even though log files do not contain all the data about the communication, they are easily available and collected. Thanks to default server-side logging to standard formats, and to applications usually containing a basic logging process for the traceability of user actions, log files provide an easily set-up process for security monitoring. Attacks can be detected using two strategies, static and dynamic rules. The differences between them lie in how they are created. A recommended attack monitoring system should consist of both types of rules. [18]

- Rule-based detection (static rules): this strategy defines static rules based on known attack patterns that need to be rejected in order to avoid attacks. These rules are prepared manually beforehand, based on pre-known information, and stay the same during detection (a minimal sketch follows after this list). Static rules can be divided into two models:
  - Negative security model: the blacklist approach allows everything by default, all is considered normal, and the policy defines what is not allowed (listed on a blacklist). The biggest disadvantage lies in the quality of the policy and its need to be updated regularly.
  - Positive security model: the positive model is the opposite of the negative one. It denies all traffic except what is allowed by the policy (listed on a whitelist). The whitelist contents can be learnt in a training phase by a machine learning algorithm or defined manually.
- Anomaly-based detection (dynamic rules): dynamic rules are not prepared beforehand from known information. They are obtained in a learning phase on a training dataset using machine learning algorithms. It is essential to make sure the dataset is free of attacks and anomalies to ensure that the generated rules are correct. Afterwards, traffic considered different from the normal dataset is flagged as anomalous.

Anomalous patterns may also be helpful in other application monitoring areas, such as system troubleshooting. While security monitoring targets the detection of suspicious behavior coming from the outside, system performance analysis and troubleshooting are focused on the inside behavior. Internal behavior patterns might reveal errors in the code or even in the design of the application or the system setup.
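The rule-based strategy with a negative security model can be sketched in Java as a few regular expressions applied to decoded request lines taken from a log. The patterns, class name and sample request below are hypothetical and far from a complete rule set; they only illustrate the idea of a static blacklist.

import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.regex.Pattern;

public class StaticRuleMatcher {

    // Illustrative blacklist of known attack fragments; a production rule set
    // would be much larger and regularly maintained.
    private static final List<Pattern> RULES = List.of(
            Pattern.compile("(?i)(union\\s+select|or\\s+1=1)"), // SQL injection fragments
            Pattern.compile("(?i)<script[^>]*>"),               // reflected XSS attempt
            Pattern.compile("\\.\\./")                          // path traversal
    );

    static boolean isSuspicious(String requestLine) {
        return RULES.stream().anyMatch(rule -> rule.matcher(requestLine).find());
    }

    public static void main(String[] args) {
        String logged = "GET /search?q=1%27%20OR%201=1-- HTTP/1.1";
        // Decode URL escapes first so that encoded payloads cannot evade the rules.
        String decoded = URLDecoder.decode(logged, StandardCharsets.UTF_8);
        System.out.println(isSuspicious(decoded)); // prints true (matches the SQL injection rule)
    }
}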

3.5 Software application troubleshooting

Log files can be used in multiple stages of software development, mainly debugging and functionality testing. It is possible to check the logic of a program without the need to run it in a debug mode, using log files for information extraction. Another advantage is that this type of testing is not affected by the probe effect (time-based issues introduced when testing in a specific run-time environment) or by the environment and system setting generation required by currently used testing and debugging customs, and it offers important insight into the overall functionality and performance of a system. With a sufficient background implementation for automatic log file analysis in software testing, making use of language and specification capabilities, log file analysis can be considered a useful methodology for software verification, somewhere between current testing practice and formal verification methodologies. [26]

From the software development, testing and monitoring perspective, there is valuable information that can be extracted from the log files. This information can be divided into several main classes: [23]

- Generic statistics (e.g. peak and average values, median, mode, deviations): mostly used for setting hardware requirements, accounting and a general view of the system functionality.
- Program or system warnings (e.g. power failure, low memory): mostly used in system maintenance and performance analysis.
- Security related warnings: used in the security monitoring discussed in the previous section.
- Validation of program runs: used as a type of software testing, included in the development cycle.
- Time related characteristics: important for software profiling and benchmarking; they can also reveal system performance issues.
- Causality and trends: contain essential information about the processed transactions and are used mostly in data mining.
- Behavioral patterns: mostly used in system troubleshooting, performance and reliability monitoring.

For system troubleshooting, various types of valuable information are logged, and their extraction can provide essential knowledge about the system behavior and detect performance issues that are not easily found otherwise. Some of the most basic ways to use log analysis for system performance analysis are: [24]

- Slow response: detection of slow response times can directly point out the functionality area that should be optimized and checked for potential code errors.
- Memory issues and garbage collection: basic error message analysis can provide indications of malformed behavior in specific scenarios, and out-of-memory issues are among the most common ones. These might also often be caused by a slow or long-lasting garbage collection implementation, which can result in overall slow application behavior.
- Deadlocks and threading issues: with more users accessing the application resources simultaneously, the potential for creating deadlock situations grows (a deadlock is a situation in which two or more competing actions are each waiting for the other to finish, and thus neither ever does). Preventing as well as dealing with these occurrences is therefore an important part of the application logic, and their detection can significantly improve performance optimization.
- High resource usage (CPU/disk/network): high resource usage might result in slowing down the performance or even halting the system. These irregularities can therefore help to detect the busiest times of system usage or even the need to allocate additional resources due to increased user demands.
- Database issues: once the applications communicate directly with the database, the query results as well as response times and potential multithreaded access issues are significant for the overall functionality and application responsiveness.

However, not only what occurs in the system is worth detecting. Inactivity, which can be easily found by log file analysis, also provides important insight for system monitoring. If an important action that was scheduled to run did not happen, it would not generate any error message, but it would still have a significant impact. As a result, it is important not only to monitor and search the logged data for error messages and behavior patterns that did happen, but also to detect those actions and situations where nothing happened even though it should have. It is therefore worth looking into the possibilities of their detection and compilation in order to maintain a certain quality of service, as sketched below. [25]

However, the contents of log files can differ greatly from system to system. Depending on the desired information, the format of the log files often needs to be adjusted to contain the specific information. Basic server logging usually contains standard information used for server-side monitoring and troubleshooting; for the analysis of specific application logic, additional log files may need to be generated with more descriptive information.
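A minimal sketch of such an absence check in Java: it scans collected log lines for the marker of an action that was expected to run and reports when the marker is missing. The file name and marker string are illustrative assumptions only; a real check would also restrict the search to the scheduled time window.

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class InactivityCheck {

    // Returns true when none of the log lines contains the expected marker,
    // so the absence of a scheduled action can itself be reported.
    static boolean expectedEventMissing(List<String> logLines, String expectedMarker) {
        return logLines.stream().noneMatch(line -> line.contains(expectedMarker));
    }

    public static void main(String[] args) throws Exception {
        List<String> lines = Files.readAllLines(Path.of("application.log"));
        if (expectedEventMissing(lines, "Nightly cleanup finished")) {
            System.out.println("ALERT: the scheduled cleanup did not log its completion");
        }
    }
}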

3.6 Log file contents

Web log analysis software (sometimes also called a web log analyzer) is a tool that processes a log file from a server and, according to its values, obtains knowledge about who, when and from where accessed the system and what actions took place during a session. There are various approaches to log file generation and processing. Log files may be parsed and analyzed in real time, or they may be collected and stored in databases to be examined later on. The subsequent analysis then depends on the required metrics and the types of data the analysis focuses on.

The basic information contained in the web log format tends to be similar across different systems. This, however, depends on the type of software application. As a result, a different log file output is generated by intrusion detection systems, antivirus software, the operating system or a web server creating access logs. These differences need to be taken into account when storing and processing data from multiple sources. There are also various recommendations for log management security published by the National Institute of Standards and Technology that should be followed when processing log records internally within organizations. [14]

There are some default types of variables and values that are generated for the web logs by the specific web server software solutions. However, even for web server solutions like the Apache web server software (one of the most used open-source solutions worldwide), it is possible to alter and configure the generated web log format according to specific needs. [67]

Basic types of log files

There are basic log file types that are used by web server logging services. These may differ according to the type of server as well as its version, and an important part of the preparation for log file analysis is getting familiar with their contents and requirements. There are also multiple different logs generated based on their triggering event, contents and logic, such as error logs, access logs, security logs and piped logs. Selected web server log formats are: [15]

- NCSA Log Formats: the NCSA log formats are based on NCSA httpd and are mostly used as a standard for HTTP server logging contents. There are specific types of NCSA formats:
  - NCSA Common (also referred to as access log): contains only basic HTTP access information. Its specific contents are listed in the following section.
  - NCSA Combined Log Format: an extension of the NCSA Common format, as it contains the same information with additional fields (referrer, user agent and cookie field).
  - NCSA Separate (three-log format): in this case the information is stored in three separate logs: an access log, a referrer log and an agent log.
- W3C Extended Log Format: this type of log format is used by Microsoft IIS (Internet Information Services) versions. It contains a set of lines that might consist of directives or entries. Entries are made of fields corresponding to HTTP transactions, separated by spaces and using a dash for fields with missing values. Directives contain information about the rules for the logging process.

Apart from the main server log file types, there are multiple specific ones that might be generated by FTP servers, supplemental servers or application servers (for example the server log types of the Tomcat server). Application logging functionality is also important to set up in order to simplify troubleshooting and maintenance as well as to increase protection from outside threats. A lot of systems contain server and database logging, but the application event logging is missing, disabled or poorly configured. Yet application logging provides valuable insight into the application specifics and has the potential of bringing much more information than the basic server data compilation. Application log formats might differ greatly, as they are highly dependent on the application specifics, its development and its needs. Nevertheless, within an application, organization or infrastructure the log file format should be consistent and as close to standards as possible. [27]

There are also logging utilities created for the simplified definition of consistent application logging and tracking APIs. Once a standardized logging file format is used, its subsequent pre-processing and analysis become much simpler. An example of a widely used API is the open-source log4j API for Java, which offers a whole package of logging capabilities and is often used for log generation in applications written in Java. [28] However, for basic logging functionality or a simple web application, the default utilities might generate sufficient records. To decide whether the default log file contents are enough for the users' needs, basic insight and knowledge about the common log file formats is required.

Common Log File contents

The Common Log Format, or the NCSA Common log format, is based on logging information about the client accessing the server. Due to its standardization, it can be easily used in multiple web log analysis software tools. It contains the requested resource and some additional information, but no referrer, user agent or cookie information. All the log contents are stored in a single file.

An example of the log file format is:

host id username date:time request status bytes

- host: the IP address of the HTTP client that made the request;
- id: the identifier used for client identification;
- username: the username or user ID for the authentication of the client;
- date:time: the date and time stamp of the HTTP request;
- request: the HTTP request, containing three pieces of information: the resource (e.g. URL), the HTTP method (e.g. GET/POST) and the HTTP protocol version;
- status: the numeric code indicating the success or failure of the request;
- bytes: the number of bytes of data transferred as part of the request, without the HTTP header.

The described type of common log file format contains only the most essential information. Usually more items obtained throughout the session are added into the log, depending on the type of data that needs to be received from the web server visit logs. Often information is included about the browser type and its version, the operating system, or other actions of the user during the session.
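To make the listed fields concrete, the following Java sketch parses one record, assuming the usual Apache/NCSA textual form of this format in which the timestamp is enclosed in brackets and the request is quoted. The sample line and class name are illustrative only.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommonLogParser {

    // One capturing group per field: host, id, username, date:time, request, status, bytes.
    private static final Pattern CLF = Pattern.compile(
            "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)$");

    public static void main(String[] args) {
        // Illustrative record; a dash marks a missing value.
        String line = "192.168.1.10 - alice [10/Oct/2015:13:55:36 +0200] "
                + "\"GET /index.html HTTP/1.1\" 200 2326";
        Matcher m = CLF.matcher(line);
        if (m.matches()) {
            System.out.println("host=" + m.group(1) + ", request=" + m.group(5)
                    + ", status=" + m.group(6) + ", bytes=" + m.group(7));
        }
    }
}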

Log4j files contents

The Log4j Java logging utility is developed under the Apache Software Foundation and is platform independent. Log file contents are labeled with defined standard levels of severity of the generated message. The basic Log4j log message levels are listed below: [30]

- OFF: the OFF level has the highest possible rank and is intended to turn logging off.
- FATAL: the FATAL level designates very severe error events that will presumably lead the application to abort.
- ERROR: the ERROR level designates error events that might still allow the application to continue running.
- WARN: the WARN level designates potentially harmful situations.
- INFO: the INFO level designates informational messages that highlight the progress of the application at a coarse-grained level.
- DEBUG: the DEBUG level designates fine-grained informational events that are most useful for application debugging.
- TRACE: the TRACE level designates finer-grained informational events than the DEBUG level.
- ALL: the ALL level has the lowest possible rank and is intended to turn all logging on.

Log4j file contents can be adjusted using a properties file, XML or Java code itself. The log4j logging utility is based on three main components which can be configured:

- Loggers: loggers are logical log file names which can be independently configured according to their level of logging, and they are used in application code to log a message.
- Appenders: appenders are responsible for sending a log message to an output, e.g. a file or a remote computer. Multiple appenders can be assigned to a logger to enable sending its information to more outputs.
- Layouts: layouts are used by appenders for output formatting. The most used format, with every log input on one line containing the defined information, is PatternLayout, which can also be specified using the ConversionPattern parameter.

The PatternLayout is a flexible layout type defined by a conversion pattern string describing the requested output format. The goal is to format the logging event information into a suitable form and return it as a string. Each conversion specifier starts with a percent sign (%) and is followed by optional format modifiers and a conversion character. The conversion character specifies the type of data, e.g. category, priority, date or thread name. Any literal text can be inserted into the pattern. [31] The conversion characters are listed in Table 3.1. As a result, the ConversionPattern can be used to define the specific logger output format using the listed characters in its definition.

Conversion character   Type of data
c    Category of the logging event
C    Class name of the caller issuing the logging request
d    Date of the logging event
F    File name where the logging request was issued
l    Location information of the caller which generated the logging event
L    Line number from where the logging request was issued
m    Application supplied message associated with the event
M    Method name where the logging request was issued
n    Platform dependent line separator character or characters
p    Priority of the logging event
r    Number of milliseconds elapsed from the construction of the layout until the creation of the logging event
t    Name of the thread that generated the logging event
x    NDC (nested diagnostic context) associated with the thread that generated the logging event
X    MDC (mapped diagnostic context) associated with the thread that generated the logging event
%    The sequence %% outputs a single percent sign

Table 3.1: List of conversion characters used in ConversionPattern

For example, the desired pattern can be defined by the string sequence:

%d [%t] %-5p %c - %m%n

A possible output might then display like:

<timestamp> [main] INFO  log4j.SortAlgo - Start sort

The meanings of the items separated by spaces in the example are:

- %d (<timestamp>): date of the logging event;
- %t ([main]): name of the thread that generated the logging event (in brackets according to the pattern definition);
- %-5p (INFO): priority of the logging event (the conversion specifier %-5p means the priority of the logging event should be left justified to a width of five characters);
- %c (log4j.SortAlgo): category of the logging event;
- %m (Start sort): application supplied message associated with the logging event (the dash is a literal character placed between the category and the message text according to the pattern definition);
- %n: adds a line separator after the logging event record.
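Since the text notes that Log4j can also be configured through Java code itself, a minimal programmatic sketch using the log4j 1.x API and the conversion pattern from the example above is shown below. The class name and logger name are taken from the illustrative output and are not a prescribed setup.

import org.apache.log4j.ConsoleAppender;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;

public class PatternLayoutDemo {

    public static void main(String[] args) {
        // Layout: formats each event using the conversion pattern discussed above.
        PatternLayout layout = new PatternLayout("%d [%t] %-5p %c - %m%n");
        // Appender: sends the formatted events to an output, here the console.
        ConsoleAppender appender = new ConsoleAppender(layout);
        // Logger: the named component used by application code to log messages.
        Logger logger = Logger.getLogger("log4j.SortAlgo");
        logger.addAppender(appender);
        logger.setLevel(Level.INFO);
        // Produces a line of the form: <timestamp> [main] INFO  log4j.SortAlgo - Start sort
        logger.info("Start sort");
    }
}

In practice the same pattern would more typically be placed in a properties or XML configuration file rather than in code, as the section above notes.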

The Log4j logging utility provides wide possibilities for adjusting the format, contents and functionality of application logging, which can ease the subsequent analysis and management of log files. There is also a variety of possibilities for filtering message contents in the generated log file records. Full-text searches and filtering of results based on specific message strings that might reveal potential threats or system malfunctioning can be configured and automated. Contextual patterns that are potentially important to review can also often be easily defined by e.g. regular expressions and searched for.

3.7 Analysis of log files contents

To gain the desired knowledge from the log file contents, the relevant parts of the records need to be collected, extracted, pre-processed and analyzed as a dataset. The subsequent visual representation serves easier behavior and pattern recognition from the development or marketing point of view. Some of the basic metrics learnt from web log analysis are:

- the number of users and their visits;
- the number of visits and their duration;
- the amount and size of the accessed/transferred data;
- the days/hours with the highest numbers of visits;
- additional information about the users (e.g. domain, country, OS).

The goal of web log analysis software is therefore to obtain, among others, the listed information from the generated log records. In the following chapter there is an overview of selected available software systems designed for this task, together with their comparison.
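Before moving on to that comparison, the Java sketch below illustrates one of the listed metrics: it counts records per hour of day from timestamps already extracted from parsed log entries. The sample values and class name are illustrative only.

import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class VisitMetrics {

    // Counts log records per hour of day, a first step towards the busiest-hours metric.
    static Map<Integer, Long> visitsPerHour(List<Integer> hourOfEachRecord) {
        return hourOfEachRecord.stream()
                .collect(Collectors.groupingBy(h -> h, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Hours taken from the date:time field of already parsed records.
        List<Integer> hours = List.of(9, 9, 10, 10, 10, 14, 14, 23);
        visitsPerHour(hours).forEach((hour, count) ->
                System.out.println(String.format("%02d:00 - %d visits", hour, count)));
    }
}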


4 Comparison of systems for log analysis

When choosing the most appropriate analytics software, there are a couple of things that need to be taken into account. These include the required or expected functionality of the analysis software, the specifics of the web application and data storage, and the size and amount of data for analysis. Support and competency on premises as well as financial options should also be evaluated when making the decision. There are various possibilities for categorizing the available systems for log analysis. In this thesis, I first describe multiple different approaches and categorize them according to their main focus. Then I choose and compare some existing systems that belong to the specified categories according to the capabilities they offer. This comparison is based on the information offered publicly by the selected systems and is meant primarily as a high-level overview of the available functionality.

4.1 Comparison measures

Considering web analytics not only as a tool for web traffic measurement but also as a source of business research information, the offerings of some types of web analytics software might contain functionality closer to web page optimization and performance increase with on-page action monitoring. These are divided into off-site web analytics, which analyze the web page visibility on the Internet as a whole, and on-site web analytics, which track user actions while visiting the page. To ease the use of web analytics with no on-premises demands and also to allow client-side monitoring, a different method apart from log file analysis came up: page tagging. As a result, software can be divided into categories according to the tracking method it uses: client-side tracking (page tagging), physical log file tracking and analysis, or eventually full network traffic monitoring.

Tracking method

The two main approaches, considered mainly in the web log analysis area, are tracking client-side and server-side information. Page tagging is a tracking method based on adding a third-party script to the webpage code, enabling the recording of user actions on the client side using JavaScript and cookies and sending the information to an outside server. These types of solutions are also often based on a hosted software approach, or Software as a Service (SaaS). On the other hand, log files are generated on the server side and therefore contain server-side information. However, log files can also be transferred outside for processing, and there are hosted software solutions available for log file analysis done on third-party premises.

Some of the differences in contents between client-side and server-side information processing are listed below: [19]

- Visits: due to tracking based on JavaScript and cookies, hosted software might not be able to output completely accurate information, because of users with disabled JavaScript, regularly deleted cookies or blocked access to analytics. It also does not track robots and spiders, while all the interaction information, including the above mentioned, is recorded in the web logs.
- Page views: while the log file tracks only communication going through the server, it does not include page reloads, as pages are usually cached in the browser. Client-side software, on the other hand, records the re-visit.
- Visitors: there is a difference in visitor recognition, as the tagging script identifies the user by cookies (which might be deleted), while the log file records the Internet address and browser.
- Privacy: specifically for the SaaS tagging-based systems, where a third party collects and processes the obtained information, there are some privacy concerns which are not present for local log file analysis.

To sum up, there are advantages and disadvantages to both approaches, and the decision should be based on the specific requirements for the software. Log file analysis does not need changes to webpages and contains the basic required information by default (as the default setting for logging can be easily enabled and tracked for web servers). Also, data is stored and processed on premises, and more inside information can be extracted from the records. Page tagging, on the other hand, contains information from the client side that is not recorded in log files (e.g. on-click events, cached re-visits etc.) and is available to web page owners who do not have local web servers and support for on-premises analysis. Often both approaches are combined and used for in-depth analytics.

However, even though the page tagging term is mostly used for client-side information tracking, there can also be PHP server-based tags used for generating additional information. As already mentioned, a number of valuable data sources are omitted when using only client-side tagging, while physical log files contain too much redundant information. PHP tagging enables both acquiring server-side information and choosing the information that needs to be collected. There are also other tracking methods that can be used in (mostly) web-based applications/systems, such as full network traffic monitoring.

Network traffic monitoring might include much more information about the overall system behavior than log files or page-tag information output. But it is also more complicated to implement, and the whole monitoring process needs to be set up carefully and manually, while logging is generally a built-in capability that is easy to set up, adjust and process.

Data processing location

As partly noted in the previous section, the systems for log analysis can also be categorized according to how (or where) the obtained data is collected or processed. From this point of view, the basic distinction is between the hosted (SaaS) type, which processes data on centrally hosted servers (also used as on-demand software), and the self-hosted (on-premise) type, which runs on the local user's server. The gradually increasing interest in cloud-based and outsourced services shows that it is often the easiest solution for standalone non-complex applications and small businesses without a sufficient hardware and software foundation.

Software as a Service, or the hosted type of software solution, is based on a delivery model where data is processed (and sometimes also collected) on the premises of the software provider. The main advantages of this approach are that the user does not need to own hardware and software equipment with the desired capacity and performance, and does not need to cover the need for maintenance, support and additional technical services. The basic idea of hosted software is that the service is managed entirely by the software provider and the user only gets the desired results of the process. The understandable disadvantage is that data (often containing sensitive information) is transferred to and processed by a third party, so security and privacy come into question. Even though cloud and SaaS providers are legally required to commit to certain data protection, transparency and security, users might consider processing their data on premise a safer and more convenient approach.

The second type of data processing location is the traditional self-hosted, or on-premise, deployment. This approach includes installation and setup of the software solution on the user's server, allowing it to process data locally. In conclusion, the requirements on the location of data processing might differ according to the type of organization, the on-premise hardware and software support or the sensitivity of the data contents. Apart from the data location and the type of tracking used, there is one more important thing to consider when choosing an appropriate solution: the price and license of the software.

4.2 Client-side information processing software

Client-side information is usually obtained using page tagging, even though some of the software solutions listed in this section also include log analysis as an additional source of input data. The common feature of these types of software is a priority on tracking user actions and activity, as well as basic statistics containing information about the background of the user. The aim of client-side tracking software is to optimize the performance of a web-based application/page so that it is appealing for current customers/users as well as attractive for new ones. Selected client-based software solutions:

- Google Analytics [33]: one of the most used web analytics software solutions worldwide, it contains a wide variety of features. It includes anomaly detection [21], is easy to use and is free for basic use (with the possibility to upgrade to a paid premium version).
- Clicky web analytics [34]: hosted analytics software that offers real-time results processing, basic customer interaction monitoring functionality and ease of use. Pricing depends on the daily page views and the number of tracked web pages.
- KISSmetrics [35]: a tool offering funnels (visitors' progression through specified flows), A/B tests and behavior change reports. It offers a 14-day trial and the starter price begins at $200 per month.
- ClickTale [36]: software based on customer interaction monitoring, providing heat map analytics, session playback and conversion funnels along with basic web analytics reports. It offers a trial demo and pricing depends on the bought solution.
- CardioLog [37]: software designed for the Windows platform, intended for use on on-premises SharePoint servers, Yammer and hybrid deployments including Active Directory integration. It contains basic analytic reporting with a UI built directly into the SharePoint site and is easy to deploy. A 30-day trial is available; full-functionality pricing depends on the chosen solution (on premise/on demand/hybrid) and the chosen features.
- WebTrends [38]: a solution offering rich functionality covering mobile, web, social and SharePoint monitoring. Apart from reports, there is a possibility to integrate internal data into the statistics and to use performance monitoring for anomaly detection.

43 4. Comparison of systems for log analysis Mint [39] On premises solution for JavaScript tagging based tool, offering basic reports for visits, page views, referrers etc. Requirements are Apache with a MySQL and PHP scripting, payment is $30 per site. Open Web Analytics [40] Open source web analytics software written in PHP working with MySQL database that is deployed on premise but also using tagging for analytics processing. There is also built-in support for content management frameworks like WordPress and MediaWiki. Piwik [41] Open analytics platform offers apart from default JavaScript tracking and PHP server-side tagging also option to import log files to the Piwik server for analysis and reporting. There are more possibilities to adjust the reporting according to needs, however as a result the solution is not as easy to use. Piwik PRO contains also on premises solutions for Enterprise and SharePoint with pricing depending on the scale. CrawlTrack 2 [42] Open source analytics tools that is based on PHP tagging, enabling a wider range of obtained information including spiders hits and other server-side information. W3Perl [43] CGI-based open source web analytics tool that works with both page tracking tags and reporting from log files. Some chosen features of client-side tracking software types are compared in the Table 4.1. First compared feature Tracking traffic sources & visitors is a fundamental functionality of client-side analysis software types as it is based on information of client-side log source and unique IDs of visitors. Tracking robot visits feature is less often supported, as it is not usually detected using client script only (however it can be detected by php tagging). Custom dashboard feature compares capability of adjusting dashboard or statistics report output contents. Real-time analysis is based on continuity of information being processed/received thanks to script present on pages, simplifying this functionality support in contrast with log files analysis. Keyword analysis can be very helpful feature mainly for SEO optimization work while it does not always belong to basic features of client-side analyzers. Mobile geo-location is a nice feature for increased tracking ability, but supported by only limited number of reviewed solutions. 2. CrawlTrack uses PHP tagging which enables also server-side information, however due to its main focus on basic client-side statistics with only spiders hit included it is listed among the client-side tracking type of software 29

Table 4.1: Comparison of selected client-based software features. Compared solutions: Google Analytics, Clicky, KISSmetrics, ClickTale, CardioLog, WebTrends, Mint, Open Web Analytics, Piwik, CrawlTrack and W3Perl. Compared features: Tracking traffic sources & visitors, Tracking robot visits, Custom dashboard, Real-time analysis, Keyword analysis and Mobile geo-location.

4.3 Web server log analysis

Web server log analyzers work with the log file in its standard format (IIS or Apache generated) and are optimized for its processing. Even though they might also support analysis of customized log file formats, the output is mostly made for basic server connectivity statistics and monitoring, with no additional features that might be required for application log analysis.

AWStats [44] - Free open source tool that works as a CGI script on the web server or can be launched from the command line. It evaluates the log file records and creates basic reports for visits, page views, referrers etc. It can also be used for FTP and mail logs.

Analog [45] - Open source web log analysis program running on all major operating systems; it is provided in multiple languages and processes configurable log file formats as well as the standard ones for Apache, IIS and iPlanet.

Webalizer [46] - Portable, free, platform-independent solution with advantages in scalability and speed. However, it does not support as wide a range of reporting mechanisms as other alternatives.

GoAccess [47] - Open source real-time web log analyzer for Unix-like systems with an interactive view running in the terminal. It provides mostly general statistics in a server report on the fly for system administrators.

45 4. Comparison of systems for log analysis Angelfish [48] Proprietary possibility for on premise analysis, often accompanying page tagging solutions. Contains also traffic and bandwidth analysis and also include client-side information in the reports, which was gained from web analytics tagging software. Pricing starts at $1 295 per year. Some chosen features of server log files analysis software types are compared in the Table 4.2. First is the Custom log format capability, which might not be always available but is often crucial in requirements when slightly modified log files are to be analyzed. The Unique human visitors feature is quite easy to be accomplished for client-side tracking, however from log file analysis standpoint it is not always a priority along with the Session duration property. On the other hand the log files offer easy-to-get capability of Report countries tracking based on domain and IP address. There are often supported detailed Daily statistics, but Weekly statistics might not be supported in types of analyzers with basic functionality due to high numbers of records computation. Solution Custom log format Unique human visitors Session duration Report countries Daily statistics Weekly statistics AWStats IP & Domain Analog Domain name Webalizer Domain name GoAccess IP & Domain Angelfish IP & Domain Table 4.2: Comparison of selected server log file analysis software features 4.4 Custom application log analysis Fundamental functionality expected from the application log analysis consists of: parsing custom fields in log records, view the records in a consolidated form, search for specific data using custom queries and highlighting results that might be of interest. For a simple application, the log file viewers with searching capabilities might offer sufficient functionality for basic application monitoring as they can be set up for searching in high numbers of log files records for specific issues, working with custom log files field data that differ across different platforms and application types. Searching and filtering is often based on regular expression input and configurable queries filtering contents. Some of the application log files view and analysis tools are: 31

Log Expert [49] - Free open source tool for Windows; contains search, filtering, highlighting and timestamp features.

Chainsaw [50] - Open source project under Apache logging services, focused on GUI-based viewing, monitoring and processing of Log4j files. It offers searching, filtering and highlighting features.

BareTail [51] - A free real-time log file monitoring tool with built-in filtering, searching and highlighting capabilities, supporting multiple platforms and configurable user preferences.

GamutLogViewer [52] - Free Windows log file viewer that works with Log4j, Log4Net, NLog and user-defined formats including ColdFusion. It supports filtering, searching, highlighting and other useful features.

OtrosLogViewer [53] - Open source software for logs and traces analysis. Contains searching and filtering with automatic highlighting based on filters, and multiple additional options using plugins.

LogMX [54] - Universal log analyzer for multiple types of log files; includes a built-in customizable parser, filtering & searching options for large files, and real-time monitoring with alerts and auto-response options. Pricing for a basic single-user license starts at $99.

Retrospective [55] - Commercial solution for managing log file data, working on multiple platforms and offering wide search, monitoring, security and analytic capabilities with a friendly UI design. Pricing for personal use starts at $92.

These types can differ according to the supported Log files (even though a custom log file format is often configurable) and also according to the Platform. In the following table, All listed for platform stands for Windows, OS X and Unix-like systems, while Win stands for Windows only. While client-side analyzers are often based on statistics, visitors and source-referrer tracking and their diagramming in dashboards, application log analyzers might not even support statistics generation, as these types of tools are mainly used for intermediate processing after logging and before visualization. Their capabilities are based on filtering & highlighting tools to make better sense of multiple types of data. Log files can be designed to be straightforward, in which case only specific types of log file data are of interest. These can be easily retrieved with configurable searching and automatic highlighting and filtering. Regex or regular expressions

3. A regular expression (regex or regexp for short) is a special text string for describing a search pattern.

47 4. Comparison of systems for log analysis functionality is priceless when searching custom data sources as they are powerful tools for retrieving valuable information in specified format. As for the Real time support, for locally gathering multiple format types this capability can be a plus (mainly for monitoring), however often it is not treated as a priority. Solution Platform Statistics Log files Filter & Regex Real Highlight search time Log Expert Win Custom Chainsaw All Log4j BareTail Win IIS/Unix/ custom GamutLog Log4j/ Win Viewer custom OtrosLog Log4j/ All Viewer Java logs LogMX All Log4j/ custom Retrospective All Server/Java/ custom Table 4.3: Comparison of selected application log file analysis software features Solutions listed up till now contain mostly basic functionality for visitors/page views/referrer stats extraction and visualization while working with either client-side tracked information (obtained by page tagging) or standard web server file format analysis. For the custom log application viewers, they contain basic searching and highlighting capabilities based on custom search rules setup. Even though some offer also additional functionality for bandwidth/anomaly detection/performance monitoring they are mostly recommended for small to midsize businesses with webpages or simple web application monitoring. Once the application log files are needed to be processed more in-depth, specific statistics for security and compliance are required, standard reporting mechanisms might not be sufficient for the web log file analysis. 4.5 Software supporting multiple log files types analysis with advanced functionality Software solutions that include additional deeper analytics capabilities as well as processing of distinct log file formats can be used for both the basic log files and client side tagging output analysis. They also often offer a fully functional platform for log file analysis that can factor in also additional data input streams. According to specific needs the contents (input and/or output) can be highly customized and prepared to fit in the user s requirements. 33

48 4. Comparison of systems for log analysis To remind the basic steps of data analysis, it consists of: data collection, pre-processing, data cleaning, analysis, results overview and communication. It is possible to get a full solution including all the required steps. On the other hand, it is possible also to compile the output from separate software tools according to the systems for data management already used in the organization. Some of the tools used also for application log files analytics include: Logentries [56] Hosted SaaS cloud-based alternative for log file collection and analysis. It collects and analyzes log data in real time using a pre-processing layer to filter, correlate and visualize. Software offers rich functionality including security alerts, anomaly detection and both log file and on-page analytics. A free trial is available, limited functionality option with sending less than 5 GB/month is free. Starter pack for up to 30 GB/month costs $29 per month. Sawmill [57] Mostly universal solution using both log file entries analysis and on-page script tagging, can be deployed locally or hosted. Covers also web, media, mail, security, network and application logs, supports most platforms and databases. Pricing depends on the chosen solution, lite pack with limited functionality starts at $99. According to the needs of the analysis, multiple possibilities are present for acquiring data from local/hosted log files. The system used for data collection as well as the overall data management is significant in choosing the appropriate tool [17]. Some of the richer functionality solutions for web log monitoring and analysis are: Splunk [58] Splunk is a solution based on working with machine data from the whole environment devices, apps, logs, traffic and cloud. Therefore it offers powerful tools for data management, analysis and results visualization. There is a possibility of cloud-based solution for data management and storage, or it can be deployed for on premise databases, offers data stream processing, mobile devices data insight and the Big data solution Splunk analytics for Hadoop 4 and NoSQL data stores. 60-day trial is available, pricing depends on the data volume per day, $675 per month is for Splunk Cloud version. There are also additional analysis tools available for Splunk solution such as anomaly detection from Prelert called Anomaly Detective Application for Splunk Enterprise [60]. Prelert offers a REST API 4. Hadoop [69] framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models 34

49 4. Comparison of systems for log analysis which can process basically any feed also offers 6-months trial for developers. The application is mostly used in its Splunk plugin form that adds the easy-to-use anomaly detection capabilities to the machine data analysis and monitoring process. Sumo Logic [59] Sumo Logic is a cloud-based solution for a native Log Analytics service and machine learning algorithms developed to efficiently analyze and visualize the information from the data processing. It includes incident management with pattern recognition, anomaly detection and other monitoring and management tools. It also has a capability called LogReduce at its disposal, which consolidates the log lines using recurring pattern detection. On top of this there is also a possibility of anomaly detection. Sumo Logic scans historical data for patterns and thanks to LogReduce works also on lines that are not identical. Also the tool allows annotating and naming anomalies so when it occurs again it can be considered as known. There is a 30-day trial option and pricing depends on the data volume per day, at 1 GB/day the cost is $90 per month. Grok [61] Numenta is a developer of data-analysis solutions and released a data-prediction and anomaly detection library modeled after the human memory. Grok for IT analytics is an anomaly detection tool for AWS 5. Basically it works with most of Amazon s web services and has an API that analyzes system metrics. Therefore this solution is processing generated metrics rather than log file lines, covers most monitoring capabilities and comes with a friendly adjustable UI. XpoLog Analytic Search [62] XpoLog Log Management and Analysis Platform offers full solution on almost any log file data analysis area including monitoring, scanning for errors, anomalies, rule-based detection. It offers collection, management, search, analytics and visualization of virtually any data format including server, application log files and built-in cloud integration possibilities for data hosted on Amazon or Google clouds. It can be used also for Big data analytics thanks to the option for integration with Hadoop and can be deployed on premise or in the cloud. Pricing depends on the daily logs volume. Skyline by Etsy [63] Skyline is an open source solution for anomaly detection based on operation metrics. It consists of several components: python-based daemon Horizon accepts data from TCP and UDP inputs, 5. Amazon Web Services [73] offers a broad set of cloud-based global compute, storage, database, analytics, application, and deployment services 35

50 4. Comparison of systems for log analysis uploads data to a redis where they are processed by an Analyzer which utilizes statistical algorithms for abnormal patterns detection. The results are then displayed in a minimalist UI. Additionally, anomaly investigation is also implemented Oculus [64] is a search engine for graphs, useful when detecting similar graphs to the anomaly detected by Skyline. Basic comparison and general information about the software solutions listed in this chapter are in Table 15 in Appendix 4 chapter. 4.6 Custom log file analysis using multiple software solutions integration Another possibility is to use multiple software solutions for specific parts of the log file management tasks. Some development platforms also offer specific tools that can be used as a standalone unit or can be integrated into an existing infrastructure. 36 Graylog [65] Graylog is a fully integrated open source log management platform used for collection, indexing and analyzing multiple types of data streams. It uses a few key open source technologies: Elasticsearch, MongoDB and Apache Kafka, which allows streamed data to be partitioned across cluster and has multiple functions suitable for big data analysis. Elastic platform [66] Platform offering both commercial and open source products aimed for search, analysis and visualization of data insights in real time. The well-known combination of three open source projects Logstash, Elasticsearch and Kibana is also referred to as the ELK stack. Logstash Collection, parsing and enrichment pipeline designed for easy integration. It is adjusted for processing of streams of logs, events and other unstructured data sources to be further processed. Elasticsearch Distributed easy to use search and analytics engine offering quick searching and analytics via a query language. Kibana Visualization platform for interaction with data analysis output including a variety of histograms, diagrams and dashboard possibilities. Apache family and integration [67] Open source software integration possibilities may also offer some efficient combinations for log files data collection, processing and output software tools.

51 4. Comparison of systems for log analysis Apache Flume [68] Distributed service built for collecting, aggregating and moving high amounts of data. It is based on joining the multiple distinct data streams and collecting. It has robust fault tolerant structure containing recovery mechanisms and optionally defined rule-based pre-processing or alerting possibilities. Hadoop HDFS and HBase [69] Flume provides a pipeline to Hadoop and the ecosystem of the distributed file systems with possible opensource additions offers various analytics capabilities and might be the most efficient possibility for large amounts of data logs analysis from multiple sources. Solr [70] Offers similar functionality as Elasticsearch based on search, analytics and monitoring capabilities. It is easily integrated with Hadoop and can also bring interesting outputs. Spark [71] Engine for large-scale data processing from Hadoop that supports also machine learning, stream processing and graphs generation. Apache Storm [72] Distributed real-time computation system for streams of data with built-in capabilities for analytics, online machine learning and others. [22] The ELK stack abbreviation is used in general, even though Logstash is the first tool used in processing data as the data collector and parser. This is explained as: Because LEK is an unpleasant acronym, people refer to this trinity as the ELK stack. [75] The ELK stack with the wide user and developer community is often used for customized input and multiple data stream processing for middle sized data as well as for higher amounts. It can be easily integrated into the open-source built solutions and is often also used as a part of Big data processing point. On the other hand Apache Flume and Hadoop are mostly used with Big data processing thanks to built-in distributed capabilities and native support for Big data storage, processing and analysis. There are various possibilities for integration and cooperation for Apache and similar open-source project solutions and they can create powerful data processing tools. Due to increasing demand for Big data processing including multiple data streams input in real-time and efficient data storage and analytics performance the capabilities of Big data processing solutions are rapidly improving. For commercial uses the combination of Splunk analytics designed to operate with Hadoop and NoSQL databases called Hunk might be also a powerful tool in data management. For high level comparison of presented software solutions divided into categories, see Table 15 in Appendix 4. 37


53 5 Requirements analysis Once a deeper insight in the purpose and processing of logging and possibilities in its subsequent analysis is gained, requirements for provided sample data can be analyzed and an adequate solution can be proposed. Specifics of sample data along with requirements analysis, solution proposal, implementation and evaluation are discussed in the following chapters. To propose a solution for specific sample data, requirements for desired processing input and mainly output need to be analyzed. First, the basic idea of what the processing should be working with and what the expected output should consist of is given. Next, specific requirements are listed for selecting the right log analysis system. 5.1 Task description The basic task is to collect, process and analyze logs of an application for a car tracking service to gain insight into the application functionality and behavior. Input consists of application log files in log4j format with no additional storage system setup. Expected results should contain tracking users functionality, malfunctions and suspicious behavior patterns based on anomaly and known issues detection in real time with possible alerting and automatic response functions built-in. 5.2 Requirements and their analysis Summarization of the most important requirements: Open-source solution considered as a priority; Analysis of custom application log file format using log4j logging API (online collection of records necessity); Focus on analysis in system and application troubleshooting and unknown behavior pattern detection; Tracking user operations for compliance reports; Creating rules on logs for anomaly detection; Getting alerts on errors and suspicious behavior to avoid losses; Possibility to retrieve the specific log files contents information from detected anomalies or alerts; 39

54 5. Requirements analysis Integration with logs from client devices. Besides functional requirements, there are two main technical aspects: Custom fields Log records consist of application specific information and therefore cannot be left unprocessed. So a processing system should enable custom file format parsing or configurable parsing rules. Focus on internal behavior of the application Chosen system should be able to process physical application log files and include server-side information in the output as behavior patterns of client-server communication are the essential area of interest. 5.3 System selection Considering the systems comparison from the previous chapter, there are types of software that can be considered inappropriate or insufficient for the listed requirements: 40 Client-side tagging This type of tool is not suitable for application internal behavior analysis as it provides only client-side information. Server log file parsers and analyzers These types of software do not support advanced functionality for custom fields parsing configuration and analysis. Application log file viewers Viewers are considerable option for the task, with the following pros and cons: Pros Viewers often offer decent capabilities for processing large files. The main features of the viewers consist of searching, filtering and highlighting of results, which can actually be sufficient for most of application behavior monitoring. Cons They often come as a standalone application with difficult addition of custom analytics functionality. They have limited parsing and storage adjusting capabilities, which makes them more suitable for processing of already structured or simple enough messages. Open source systems for multiple log files processing - They often consist of a compilation of multiple software tools. It is the most suitable solution for the task, considering the integration abilities of standalone open-source applications and possibilities to adjust the functionality at the code level. One of these merged solutions is the ELK stack.

55 5. Requirements analysis Some of the biggest advantages of ELK stack (Elasticsearch, Logstash and Kibana) is that it can be used for relatively big data flows as well as for basic application log files analysis with wide possibilities of add-on functionality and adjustments. Also thanks to active user community and easy to get support its setup is not complicated. Also, there is a wide spectrum of options for advanced functionality enhancements and if required even commercial possibilities can be used upon ELK deployment. One of the examples might be Prelert Anomaly detective Proposed solution According to requirements analysis outcome deployment of ELK stack for sample data processing is recommended. Logstash for online records collection can be setup easily. It is possible to configure the default parsing template for custom log file format pre-processing. Also Elasticsearch with its rich filtering and search capabilities including options for rules creation (plugins) and anomaly detection is appropriate for the task. Kibana would be a benefit for analysis output generation as a friendly GUI tool used for visualization with optional contents adjustments. Following chapters consist of the description of the provided application sample log records data as well as implementation and deployment of proposed solution. Also issues and limitations encountered while implementation are listed according to part of system they were detected in. Overall summary of implementation as well as general issues and possible future work improvements are listed in the Conclusion chapter 12. Contents of upcoming chapters: Application log data specifics of sample data contents and encountered issues; Logstash configuration input data collection, parsing and processing tool; Elasticsearch full-text search engine and data storage using JSON documents; Kibana configuration visualizations as well as description of default dashboarding; ElastAlert Elasticsearch framework for alert creation based on rules and the default rules contents. 1. Further information about the integration of Prelert and ELK can be found on page 41

5.5 Deployment

The topology of the ELK stack components is represented in Figure 5.1. Logstash can collect data from various sources, and these can then be joined together using a broker. Afterwards, the collected unstructured data can be processed by the Logstash configuration for data parsing and enrichment. From Logstash, the processed data is usually indexed directly into Elasticsearch. Once sitting in Elasticsearch, documents can be easily queried and visualized using the Kibana browser-based GUI.

Figure 5.1: ELK stack topology [74]
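For illustration, the flow in Figure 5.1 can be captured by a minimal Logstash pipeline skeleton. The file path, host and index name below are only illustrative placeholders and do not correspond to the configuration deployed later in this thesis; the grok pattern is a simplified stand-in for the real parsing rules described in the Logstash configuration chapter.

input {
    file {
        # illustrative source file; the real inputs are defined in the Logstash configuration chapter
        path => "/var/log/sample/application.log"
    }
}
filter {
    grok {
        # simplified parsing rule: split the line into a timestamp and the remaining text
        match => { "message" => "%{TIMESTAMP_ISO8601:logdate} %{GREEDYDATA:message_text}" }
    }
}
output {
    elasticsearch {
        # assumed local Elasticsearch instance and default daily index naming
        hosts => ["localhost:9200"]
        index => "logstash-%{+YYYY.MM.dd}"
    }
}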

6 Application log data

The proposed solution for log analysis is supposed to process log records from a car tracking system. I was provided with sample data of two origins on which the solution was tested. The first and more important for the ongoing analysis are the application server log records, containing records generated in the server-side communication of the application. The second source of data are the client log records from the client device, which can be packaged and sent to the server on request if further information about previous communication of a specific client is needed. Client logs need to be treated separately on input due to their slightly different contents. However, it is required for them to be searchable and displayed alongside the server records for efficient troubleshooting.

For gaining structured information about log records, two main data adjustments and additions are used:

Fields - By default, the log record consists of only one field: the overall record text. To gain structured information from the log text, it is parsed using Logstash parsing filters. The parsed parts of log records are stored as contents/values of specified fields. For example, from the record text CarKey 3052, only the number is stored in a field called CarKey.

Tags - Tags specifying the type of message can be added to the parsed messages as strings appended to the list of values of the tags field. They are useful for tracking a specific type of message so it can be searched on more easily. For example, all records for a finished connection are tagged with Connection_finished.

6.1 Server log file

Server log files are created by the system using the log4j framework. [28] The pattern of records defined by the log4j PatternLayout is described earlier in the thesis. The leading section of the log file record contains the same information for all types, which is:

logdate [thread_name-thread_id (optional) session_id (optional)] message_type module_name - message_text

Example of a server sample data log line:

<logdate> [WorkerUDP-4246 a88615be-7d07-49f4-8b3f-fe6a7d594c21] INFO ConnectionManager.UDP.Worker - Text of log record
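As a brief illustration of how such fields and tags are produced (the full parsing rules are described in the Logstash configuration chapter), a filter fragment along the following lines would store the number following CarKey in its own field and tag finished connections; the exact expressions used for the sample data differ.

filter {
    if "Connection finished" in [message] {
        grok {
            # store only the number after "CarKey" in a field called CarKey
            match => { "message" => "CarKey %{INT:CarKey}" }
        }
        mutate {
            # tag the record so all finished connections can be found by a single tag query
            add_tag => [ "Connection_finished" ]
        }
    }
}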

As the messages for connection treatment are of most interest, these are the base records for analytics and dashboards. Three types of connections can be logged in the files: TCP, UDP from the client, and UDP initiated by the server (PUSH). When the connection is initiated, a record is logged containing the IP address and port from which the client connects. These messages differ according to the type of connection and also do not seem to be consistent across various log records. Connection initiation messages are helpful for tracking the overall connection time and provide a possibility to aggregate messages of a specific connection session together. Nevertheless, there are connections where the initiation message is missing. More information about misalignments in the provided sample data is listed in the following section.

For any connection there is a Connection finished type of record. This message is crucial for the analysis output, as it should be present for all server-client connections and contains all the information about the connection duration and transferred records/files. Example of a Connection finished type of message (message text only):

Connection finished: WorkerUDP: client from /<ip>:<port>, TM A24, IMSI <imsi>, SN null, CarKey 3052, Session 8502f266-aad3-489e-b839-8b0d25f26f9a, Status RUNNING, Failure OK, Created 49 msecs ago, records:0,1,0, connection: type=u, bytes=278, pos=0, srv=1, msg=0

There are three types of records transferred in a client-server connection in the sample data contents:

Position record;
Service record;
Instant message record.

The types of records transferred in a specific connection can be tracked according to the records:0,1,0 part of the message, where the first number specifies the number of positions, the second the number of services and the third the number of transferred instant message records. In the Connection finished record there is also information about these numbers in the connection: type=u, bytes=278, pos=0, srv=1, msg=0 section. According to the provided information about the sample data, the number of transferred records should be the same in both parts of the message. Nevertheless, there happen to be some misalignments, which are flagged as part of anomaly detection as well. The type of connection in this message can be U, T or P, corresponding to the connection type UDP, TCP or PUSH.
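A sketch of how these two parts of the Connection finished message can be parsed into separate fields is shown below. The field names follow the ones used later in this thesis, while the exact expressions in the deployed configuration may differ.

if "Connection_finished" in [tags] {
    grok {
        # counts from the "records:0,1,0" part: positions, services, instant messages
        match => { "message_text" => "records:\s*%{INT:record_pos},\s*%{INT:record_srv},\s*%{INT:record_msg}" }
    }
    grok {
        # counts and byte size from the "connection: type=u, bytes=278, ..." part
        match => { "message_text" => "connection: type=%{WORD:conn_type}, bytes=%{INT:conn_bytes}, pos=%{INT:conn_pos}, srv=%{INT:conn_srv}, msg=%{INT:conn_msg}" }
    }
}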

Depending on the connection outcome, there can be either a Connection succeeded or a Connection failed message. Both contain information about the records transferred and the status of communication; in the case of a failed connection, the reason of the failure is also listed. Example of a basic connection that ended with SUCCESS and contains no error in the inner protocol (message text only):

Connection succeeded: WorkerUDP: client from /<ip>:<port>, TM A24, IMSI <imsi>, SN null, CarKey 2494, Session a88615be-7d07-49f4-8b3f-fe6a7d594c21, Status SUCCESS, Failure OK, Created 40 msecs ago, records:0,1,0

Example of a basic connection that ended with FAILURE and contains information about the error that occurred (message text only):

Connection failed: WorkerUDP: client from /<ip>:<port>, TM A16, IMSI null, SN null, CarKey 0, Session b41803e8-c39b-4a05-a765-d43123dff8a2, Status FAILURE, Failure CLIENT_UNREGISTERED, Created 2 msecs ago, records:0,7,0 FailureReason: CLIENT_UNREGISTERED

Apart from these, various other types of server log messages are collected. Some of the significant parts of messages parsed from the server logs are:

logdate - timestamp from the log record in the format YYYY-MM-dd HH:mm:ss.SSS;

thread_name-thread_id (optional) - type of thread that generated the message and its number (e.g. WorkerUDP-4246);

session_id (optional) - generated ID of the session (e.g. 5dbe1a66-0ee0-4f0d-bd3a-44afe4c852fa);

message_type - type of message (e.g. ERROR);

module_name - module that generated the message (e.g. ClientRegistry);

message_text - text of the logged message, to be parsed for specific information;

ip - IP address of the client;

port - communication port of the client;

TM - machine type number/null (the number after the dash is for internal use only and does not need to be included);

IMSI - client number/null;

SN - serial number/null;

CarKey - identifier of a car/null;

Status - status of connection: success or failure;

Failure - if OK, no failure;

FailureReason - reason of the failure;

Created - number of msecs since the connection started;

files - file names, if associated with the connection;

Connection information from the Connection finished message:

conn_type - U/P/T;
conn_bytes - number of bytes transferred in the communication;
conn_pos - records of positions;
conn_srv - records of services;
conn_msg - records of messages.

Records from the Connection finished and Connection failed or Connection succeeded types of messages:

record_pos - records of positions;
record_srv - records of services;
record_msg - records of messages.

DK - driver key.

Furthermore, these parsed parts of records are stored in Elasticsearch and can be used for filtering and querying. Information about the parsing filters used in the Logstash configuration to obtain this information is listed in the Logstash configuration section.

6.2 Client log file

Client log files are logs generated by the client communication device. They are in a different format compared to the server log messages. Even though the basic log analysis is run on the server log files, the system also supports adding client log messages. Client logs can be requested by the server and parsed using a specific Logstash configuration file. As a result, log lines from both server and client can be reviewed in the same UI and checked for possible communication issues.

Information about the client identification number and the TM number can be parsed from the log filename. For example, an uploaded client log file can be named:

sample-client-tld.domain.mod-tmlog-imsi day a _3a contents.txt

Client log messages also comply with the log4j format. The leading section of the log file record contains the same information for all types, which is:

logdate - service_name: message_text

Example of a client sample data log line:

<logdate> - PowerControl: keepwokenup called

While the client log messages are considered only an additional information source, they are still parsed to gain information from the message text contents. In combination with the server log files, they can provide valuable insight into the client-server communications that occurred. Some of the information parsed from client log files:

logdate - timestamp of the log record in the format YYYY-MM-dd HH:mm:ss;

service_name - service that generated the log message (can be omitted);

message_text - text of the logged message, to be parsed for specific information;

IMSI - client number/null (from the log filename);

TM - machine type number/null (from the log filename);

ip - connecting IP address;

port - connecting port;

SN - serial number/null;

simstate - status of the SIM;
DK - driver keys;
globalstatus - global status of the device;
statusmessage - global status message;
datastatus - status of data upload;
networkstatus - status of the network;
GPSStatus - status of GPS;
satellitestatus - number of satellites;
filename - filename of an uploaded/downloaded file;
uploadstatus - status of the file upload/download.

Due to the less precise format of the log time information, misalignments in time may occur when reviewing messages from server and client logs. There might also be a transfer delay between these messages, so better alignment of client and server messages for simpler review and troubleshooting would require consistent time format usage. The possibility to match the information from client log connection messages with the corresponding sessions in server log connection messages would be a plus. Possibilities of this alignment are drawn in the future work section (11.1).

6.3 Data contents issues

Multiple issues were found in the sample data contents, causing additional effort in their processing. Some of these are listed below:

Inconsistent sample data contents - Additional sample data are missing previously present types of generated messages, resulting in different processing needs.

Inconsistent sample data format - Additional sample data use a different format of some messages, resulting in parsing failures.

Missing sessions for some logs, no sessions for client logs - It is hard to aggregate events belonging to the same connection when the unique identifier is missing.

Messages with identical message text - Messages with the same contents are sent right after one another (a few milliseconds apart). An example of messages with identical message text from the sample data is listed below:

<logdate> [ClientRegistry] INFO ClientRegistry - Refreshed information for carkey 2280: IMSI=<imsi>, serialnumber=null, compkey=6564
<logdate> [ClientRegistry] INFO ClientRegistry - Refreshed information for carkey 2280: IMSI=<imsi>, serialnumber=null, compkey=6564

Inconsistent property names - There are changes in property names such as CarKey/carKey/car_key, or they need to be derived from context (e.g. IMSI/client).

Different empty field value - An empty field can be parsed from messages as null or 0 (e.g. the CarKey field value).

Different values and field names for properties in the same session - The field value changes within the same session to null/0. An example of the inconsistent messaging format, field names and changes in values is listed below (the SerialNumber/SN value changes):

<logdate> [WorkerUDP-<id> <session_id>] DEBUG ClientRegistry - Loaded Client: CarKey=3111, IMSI=<imsi>, SerialNumber=null, phone=<phone>
<logdate> [WorkerUDP-<id> <session_id>] DEBUG ClientRegistry - Getting client information for IMSI=<imsi>, SerialNumber=<sn>
<logdate> [WorkerUDP-<id> <session_id>] INFO ClientRegistry.Client.Abstract - Connection finished: WorkerUDP: client from /<ip>:<port>, TM A22, IMSI <imsi>, SN null ...

Noise values of field names - Values of properties set to unusual values such as %2$d. Parsing of these is omitted in the Logstash configuration file. An example is listed below (message text only):

%s information for carkey %2$d: IMSI=%1$s, serialnumber=%4$s, compkey=%3$d, driverkey=%5$d, eco=%6$s

Malformed invalidated data - Examples of possibly invalidated data are: phone=<number>. (redundant full stop), IMSI=23002 (too short).

Suspicious messaging for changes in values - There are occurrences of messages where the new and old values listed are the same. Example of such a record:

Run: carkey=2802 has the same SIM card, oldimsi=<imsi>, newimsi=<imsi>

Unclear messaging - Log record contents are often hard to understand and process (inconsistently formatted).

Different messaging for the same events - An example of different messaging is the new connection record, which differs between connection types as well as between sample data sets.

Unexpected message contents - There are unexpected log record contents such as unparsed packets, whole SQL queries and possibly unhandled Java exceptions.

Apart from checks for these misalignments, it is highly recommended to also investigate and fix the application code to avoid these issues, mainly the possibly unhandled Java exceptions and direct SQL code on output. Also, adjusting the messaging to be consistent and computable may significantly increase the efficiency of log data analysis.

Sample data were provided in their original format in files containing the records logged per day. Nevertheless, the input system supports also online log record collection. Both input formats are supported by the Logstash input configuration and are further described in the following chapter.

7 Logstash configuration

Logstash is a log management tool for centralized logging, log enrichment and parsing. The overall purpose of Logstash is to collect unstructured data from input data streams, parse it according to a set of filter rules, optionally add some computed information, and output the processed data for additional processing or storage. All information for the Logstash processing execution is set in a *.conf configuration file, so multiple configuration files can be created for distinct data inputs. Configuration of Logstash is divided into three sections:

Input - setting input data streams;

Filter - setting parsing filters for computing structured information from the often unstructured input data;

Output - setting the output for data processed by the filters.

There are also multiple possibilities for log processing enrichment using Logstash plugins and ruby code within the configuration file. The contents of the Logstash configuration file used for processing the sample data are described in the following sections.

7.1 Input

The input section of a Logstash configuration file contains the definition of input data streams. There are various input possibilities that can be used in Logstash, and they can be combined. [76] For the sample server data there are two main possibilities for processing input log data streams: reading updates from a file and online socket listening. For the client logs, the whole file should be processed from the beginning.

File input

For the File input plugin, a path to the file needs to be defined. Log records are read from the file by tailing from the last update to the file (similar to tail -0f), but they can also be read from the beginning if this is set in the configuration file. By default, every log line is considered one event. The file input plugin keeps track of the current position in each file by recording it in a separate file named sincedb. This makes it possible to stop and restart Logstash and have it pick up where it left off, without missing the lines that were added to the file while Logstash was stopped. The path of this file can also be set in the input settings.

#Reading whole file from specified location
file {
    path => "c:/testdata/filename.log"
    start_position => "beginning"
    type => "log_server"
    sincedb_path => "C:/testdata/sincedb"
}

The following properties for file reading are set in this section:

path - location of the file (it can include wildcard characters and it can also be a directory name);

start_position - setting reading of the file from the beginning; by default it is read by tailing;

type - setting the type of messages read by the specified input;

sincedb_path - tracking the position in the watched file.

The same settings are used also for client log records, as they are read from a file at a specified location from the beginning.

Multiline

To handle log messages that occupy more than one line, a multiline codec needs to be defined. In this case, it needs to be added for both server and client logs. The multiline codec is defined as below: [77]

#Including information from records on multiple lines
codec => multiline {
    pattern => "^%{TIMESTAMP_ISO8601}"
    negate => true
    what => "previous"
}

In this section, the following properties for the multiline codec are set:

pattern - indicator that the field is part of a multi-line event;

negate - can be true or false; if true, a message not matching the pattern will constitute a match of the multiline filter and the what will be applied;

1. For file locations in Windows OS, slashes in the path need to be changed to unix-style, because the backslash (\) is treated as an escape character.

what - previous or next; indicates the relation to the multi-line event.

The above definition of a multiline input codec therefore means that every line that does not start with a timestamp is considered part of a multi-line log record and should be added to the contents of the previous line. By default, a multiline tag is added to every record that was processed by the multiline codec. This tag can however be removed from the processed records if it is not necessary.

Socket based input collection

A Log4j input type is used for reading events over a TCP socket from a Log4j SocketAppender. This input option is currently commented out in the Logstash configuration file but can be enabled for online socket listening. This input type is defined as follows:

#Read events over a TCP socket from a Log4j SocketAppender
log4j {
    mode => "server"
    host => "<host>"
    port => [log4j_port]
    type => "log4j"
}

Alternatively, direct collection from UDP and TCP port listening can be used for online data collection, which is also commented out in the Logstash configuration file. [79] An example definition is listed below:

#Setting listeners for both TCP and UDP
tcp {
    port => 514
    type => "server_tcp"
}
udp {
    port => 514
    type => "server_udp"
}

The settings contain a listening port, host information and an optional data type setting according to the collection method used. Additional properties can be adjusted in case the log4j socket listener is used.

2. These can be found in the Logstash documentation: guide/en/logstash/current/plugins-inputs-log4j.html
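Putting the input pieces together, the multiline codec is nested inside the file input plugin; a combined input section for the server log file could therefore look as follows (the paths are illustrative, and the commented socket inputs from the examples above would be placed alongside the file block when online collection is enabled).

input {
    file {
        path => "c:/testdata/filename.log"
        start_position => "beginning"
        type => "log_server"
        sincedb_path => "C:/testdata/sincedb"
        # join lines that do not start with a timestamp to the previous record
        codec => multiline {
            pattern => "^%{TIMESTAMP_ISO8601}"
            negate => true
            what => "previous"
        }
    }
}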

7.2 Filter

Parsing and data enrichment rules are defined in the filter section of the Logstash configuration file. They mostly consist of regular expressions that parse the input log lines into specific fields according to their contents. A grok filter plugin is essential for parsing the unstructured input information; however, there are multiple other useful filter plugins that can be used in the Logstash filter section. [80]

Filter plugins used in configuration

Grok filter

A Grok filter is the essential Logstash filter plugin used for parsing unstructured data into a structured and queryable format. It supports a lot of already defined default patterns; however, custom regular expression patterns can be defined as well. The grok filter definition consists of the existing field and the provided regular expression that matches it. This regular expression then parses the information of the input string into additional fields. In the configuration created for sample data processing, the grok filter is used first for parsing the overall message contents. Then the message text contents are parsed according to the type of message, to acquire specific data from the unstructured text. Custom tags are added depending on the type of message contents to allow simplified searching. [80]

In the Logstash configuration, two kinds of field definitions are used: pre-defined pattern strings such as %{IP:ip_address} and custom regular expression definitions using the Oniguruma syntax such as (?<client_id>[0-9]{14,15}). In the first case, the parsed IP address in the pre-defined format is stored in the ip_address field. The custom regular expression match from the second example will be stored as client_id.

The overall message contents are parsed first, while the message text contents (the message_text field) are parsed later using additional grok expressions. The grok filter used for parsing all messages is defined in the Logstash configuration as:

#Overall regex pattern for logged records
grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:logdate}\s\[(?<thread_name>[a-zA-Z]+)(\-(?<thread_id>[0-9]+))?\s((?<session_id>[0-9a-z\-]+))?\s\]\s%{LOGLEVEL:message_type} ((?<module_name>[a-zA-Z\.]+))?\- %{GREEDYDATA:message_text}" }
}

Essential information about log records is parsed from the default field message, which contains the whole log line. The parsed fields are:

logdate - date and time of the log record;

thread_name and thread_id - name and ID of the thread that generated the message;

session_id - identification of a specific client-server communication;

message_type - type of the logged record, or debugging level name;

module_name - name of the module that generated the message;

message_text - unstructured message text.

Some of these fields, such as thread_id and session_id, are not present for all log records and thus are marked as optional in the regular expression. For processing of message_text, the Logstash filter section is divided into sections according to the module name that generated the message. Afterwards, specific information is parsed from the message text using grok filters and tags are added accordingly using the mutate filter. [80] An example of the mutate filter is given below. It adds the Listening_ip tag if the parsed record contains ConnectionManager in the parsed field module_name and Listening in the message field.

#ConnectionManager
if "ConnectionManager" in [module_name] {
    ...
    #IP & port for connections listening
    if "Listening" in [message] {
        grok {
            match => { "message_text" => "[a-zA-Z\s]+:\s\/%{IP:ip}:%{INT:port}" }
        }
        mutate { add_tag => "Listening_ip" }
    }
    ...
}

Most of the log record message texts are parsed in a similar fashion, adding the tag and parsing the required information from the message text. Apart from these, however, there are some additional Logstash filter plugins used for computing additional information. As noted before, the messages with the most important information for the analysis output are those logged for the connections treatment. To enable simpler processing of the additional metrics, computed fields are added using Logstash plugin filtering capabilities.

Date filter

When the log records are parsed using the Logstash dynamic mapping, a timestamp field used for querying in Kibana is added according to the time of

log record processing. To enable using the logdate field as the timestamp, it needs to be set in the Logstash configuration using the date operation:

#Use date from record as timestamp in Kibana
date {
    match => [ "logdate", "YYYY-MM-dd HH:mm:ss.SSS" ]
    target => "@timestamp"
}

Ruby filter

The aim of a Ruby filter plugin is to add direct ruby code for the computation of additional field values. It references the already added/parsed fields of a log record by using event['fieldname'] in the code. The whole ruby code section is enclosed in quotes and supports full ruby code syntax including local variables and functions.

The ruby filter is used in the Logstash configuration file for the computation of the transferred records sum in the Connection_finished log events. There are two instances of the connection records count information in the Connection_finished event that are computed: the sum of records listed in records:0,1,0 and in connection: type=u, bytes=278, pos=0, srv=1, msg=0. These fields can then be queried and checked for high counts of transferred records in total. Additionally, as the records listed in these two parts of the Connection_finished message should be the same, their difference is considered an anomaly. The comparison result of these two computed fields is therefore stored in an additional boolean field called records_mismatch. Example of the ruby code used for the sum of records computation:

ruby {
    code => "
        event['records_total'] = event['record_msg'] + event['record_pos'] + event['record_srv']
        event['conn_records_total'] = event['conn_msg'] + event['conn_pos'] + event['conn_srv']
    "
}

Aggregate filter

The aim of an Aggregate filter is to aggregate information from several log records that belong together. This can be used to aggregate records of the same session, tracking events for a specific client or car. The overall idea is to store some value present in the events of the task and then add the computed field to the last

event. The use of this filter can be easily adjusted thanks to the ruby code used for its computation.

For example, the duplicate records detection is implemented using the aggregate filter, executing a ruby procedure for all events of all clients (aggregation of tasks using the IMSI/client field). During its first run, it sets the init and records_same local variables to 0, but only if they were not initiated before. Once the initialization variable init is set to 0, the current record types and counts are saved in the local variables and init is increased, as the initialization is complete. For every following log line processed by Logstash containing the same client number, the procedure runs with init already increased, so the first section of the code is executed (being run only if init is greater than 0). The records transferred in the current log event are compared to those saved in the local variables, and if all three of them have the same value, the count of duplicate events is increased (saved in the records_same field). This procedure counts only duplicate record attempts occurring immediately one after another, so once an event with different record values is processed, all local variables are re-initialized and the procedure then compares against the updated set of saved record field values. This procedure is marked in the Logstash configuration file by the comment #Handling duplicate events - comparing records sent by client.

Elapsed filter

The Elapsed filter is a useful Logstash plugin that tracks the time difference between two log records. Both start and end records are chosen (identified according to their tag) and a unique field is used for the aggregation of these two events for the specific session. This filter is used for the computation of the time between the Connection_new record type and the Connection_finished record type of a specific session:

#Duration from starting connection to end
elapsed {
    start_tag => "Connection_new"
    end_tag => "Connection_finished"
    unique_id_field => "session_id"
}

Some sessions in the sample data were found to be missing the Connection_new tagged record. If the start tag is not found, the elapsed filter adds an elapsed_end_without_start tag and the elapsed time is not computed. For one type of session missing the starting tag, an aggregate code section was added to check whether there was a starting event, as sketched below.
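A minimal sketch of this check is given below, assuming the affected record type is marked with a hypothetical First_session_record tag (standing in for the real tag used in the configuration) and using the 2015-era event API (event['field']); the exact code in the provided configuration file may differ.

if "Connection_new" in [tags] {
    aggregate {
        task_id => "%{session_id}"
        # remember that a starting event was seen for this session
        code => "map['started'] = true"
    }
}
if "First_session_record" in [tags] {
    aggregate {
        task_id => "%{session_id}"
        # if no starting event was seen yet, treat this record as the session start
        code => "
            unless map['started']
                tags = event['tags'] || []
                tags << 'Connection_new'
                event['tags'] = tags
                map['started'] = true
            end
        "
    }
}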

This aggregate procedure creates the boolean value started and saves it as true for the Connection_new type of message. For the specific record type that was observed as the first record in sessions where the first message was often missing, the stored value of started is then checked. If the value is false, no beginning message has been received for the session yet and the Connection_new tag is added. If there are more types of messages where the starting event is sometimes missing, this part of the code can be added for them as well.

Mutate filter

A Mutate filter is a basic Logstash plugin used for making any changes in the document fields. It can be used for field addition/removal as well as for manipulation with tags added to the log event. Additionally, field contents can be updated using this filter, e.g. for replacing a specific string in a field.

Additional computed fields

Some additional computed fields and tags were also defined in the Logstash configuration filter section. Where the computed information needed a value to be stored, a field was added; if no value was required to be stored, only a tag was added to the records of interest. The fields and tags added by the Logstash filters are listed below:

files_total - Similarly to the total records computation, the sum of files transferred in a session is also computed.

time_difference and time_mismatch fields - Using the previously described elapsed filter and the Created field in the message, the overall time of the session can be checked. However, to compare these two values, they need to be adjusted first, as the value of the Created field is in milliseconds while the value of elapsed_time is listed in seconds. Afterwards, the difference between these two fields is computed and added to the field time_difference. If this difference is greater than 0.1 seconds (this can be adjusted accordingly), an additional boolean field time_mismatch is added.

Empty_connection tag - For tracking of empty connections, a tag Empty_connection is added to all connection log records where neither records nor files were transferred.

Too_many_bytes tag - In case there is an unusually high amount of bytes while neither records nor files were transferred, a tag Too_many_bytes is added. The number of bytes for an empty connection is

- Too_many_bytes tag: If there is an unusually high number of bytes while neither records nor files were transferred, a tag Too_many_bytes is added. The number of bytes for an empty connection is usually below 200 bytes. If more bytes are transferred in an empty connection, the Too_many_bytes tag is added to the Connection_finished event.

- SQL_code and Exception_code tags: These tags are added if SQL code or a Java exception is present in the message. They were added as a result of the content anomalies detected in the original log files and can be removed if this behavior is expected. If needed, custom tags for any other type of event contents that should be tracked can be added.

- Changed_same_value tag: There are multiple log lines where a property is being updated. As noticed, there are occurrences of these updates where the field is being updated to the same value as before, which is also considered an anomaly and is tracked by a tag Changed_same_value. These change messages with no actual change in their contents are checked for all from/to fields.

- fieldname_check tag: The grok filters were set up to process all kinds of input strings, even though there should be stricter validation for the fields that have a specific input format. Once the misalignments in the data are taken care of, the regular expressions for these can also be used in the main parsing section. Currently the misalignments in some of the field formats are only flagged by adding a fieldname_check tag. These checks are set for the IMSI, CarKey and Session fields. An example of the IMSI format check is listed below:

#Most of IMSI values should be 14 to 15-digit -- Check if not
if [IMSI] {
  grok {
    match => { "IMSI" => "[0-9]{14,15}" }
    tag_on_failure => ["IMSI_check"]
  }
}

The default tag added when the grok filter cannot parse the input string using the provided regular expression is _grokparsefailure. However, this tag can be customized, as shown in the example above.

7.2.3 Adjusting and adding fields

The connection types are parsed from the input data as first letters only (T/U/P). For better usability, these fields are updated to contain the full connection type format (TCP/UDP/PUSH). This field update can be specified in the Logstash configuration using a mutate filter:

#Update connection type fields
if [conn_type] {
  if [conn_type] == "U" {
    mutate {
      update => { "conn_type" => "UDP" }
    }
  }
  ...

As noted in the previous section, there were some inconsistencies in the provided sample data logs. In the first logs provided, there were DEBUG messages that provided information about the client and the car at the beginning of each session. As these two fields are useful to track for all messages of a specific connection session, they were added to the other messages using the aggregate filter. This approach can be used only when the messages containing this information are present as the first log records of the session, because the Logstash code can work only with the information in lines already read. The sample data contents provided later did not contain these DEBUG messages, so this trick cannot be used for all types of sample data; for such data, all records of all communication of a client cannot be retrieved simply by querying Elasticsearch for a specific IMSI number.

The aggregate filter procedure that adds the IMSI and CarKey fields to all messages generated in a specific session is triggered by the Registry_getting_client and Registry_getting_car tagged messages. The section of the configuration that adds this information is marked with the comment #Adding IMSI & CarKey (if Debug messages are enabled) and works as listed below (a condensed sketch follows the list):

- For all Registry_getting_client tagged messages, the aggregate filter saves the information about the client number for the specific session;

- For all Registry_getting_car tagged messages, the aggregate filter saves the information about the car number for the specific session;

- Then, for all messages that contain the session_id field, the aggregate filter adds the saved client and car number fields to the message;

- For Confirmation_failed or Confirmation_success events: the messages tagged with Confirmation are usually the last session messages logged, therefore for these messages the aggregate filter finishes the task and removes the saved information.
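The condensed sketch below only illustrates the approach, assuming the Logstash 2.x aggregate filter with its task_id, code and end_of_task options and the event['field'] API; it is not the exact code from the provided configuration:

#Sketch: propagate IMSI and CarKey to all messages of a session
if "Registry_getting_client" in [tags] {
  aggregate {
    task_id => "%{session_id}"
    code => "map['IMSI'] = event['IMSI']"
  }
}
if "Registry_getting_car" in [tags] {
  aggregate {
    task_id => "%{session_id}"
    code => "map['CarKey'] = event['CarKey']"
  }
}
if [session_id] {
  aggregate {
    task_id => "%{session_id}"
    # copy the saved values to the current event if they are not set yet
    code => "event['IMSI'] ||= map['IMSI']; event['CarKey'] ||= map['CarKey']"
  }
}
if "Confirmation_failed" in [tags] or "Confirmation_success" in [tags] {
  aggregate {
    task_id => "%{session_id}"
    code => ""
    # Confirmation events close the session, so the stored map can be dropped
    end_of_task => true
  }
}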

7.2.4 Other Logstash filters

There are many more possibilities for Logstash filters apart from those used in the sample data Logstash configuration. Some filters with interesting functionality are the elasticsearch filter and the metrics filter plugin.

Elasticsearch filter

The elasticsearch filter enables sending queries to the Elasticsearch instance and processing the query results. It provides interesting functionality: even though it is limited to setups where Elasticsearch is used as the Logstash output, it is useful for getting information from other Elasticsearch indexes. For some of the aggregate parts of code used in the Logstash configuration, the elasticsearch filter could have been used as well.

Metrics filter

The metrics filter aims at storing the frequency of specified types of messages over time. It creates and refreshes the numbers of occurrences of specified fields in 1, 5 and 15 minute sliding windows. This type of filter can also be used as a monitoring tool, generating alerts in case of high numbers of some record occurrences. The downside of this filter is that it creates its own instances of events with the processing timestamp, generating a lot of new fields for all types of rates. Processing these in Kibana as a visualization tool is therefore problematic, and due to the short time rates (the longest is 15 minutes), this functionality was replaced in the implementation by ElastAlert monitoring rules.

7.3 Output

The output section of the Logstash configuration file specifies where the parsed log records are sent. Since the full ELK stack is used for processing the sample data, the output is set to Elasticsearch. There is also a definition of a file output for parsing failures and an e-mail output for alerting.

7.3.1 Elasticsearch output

The Elasticsearch output defines that processed data should be indexed to the Elasticsearch instance running on a specified host.

This definition is in the Logstash configuration:

#Indexing output data to elasticsearch
elasticsearch {
  action => "index"
  hosts => ["localhost"]
  workers => 2
}

By default, dynamic mapping is used for the created Logstash index and data are sent to the Elasticsearch instance running on the specified hosts. The default mapping creates one document per event for every parsed log line, adding fields and tags as defined in the Logstash filter section. Every document is saved as a JSON object with the defined fields as properties under the logstash-yyyy.mm.dd index.

7.3.2 File output

Apart from the basic output to Elasticsearch, an output to a file in the original message format is added to the configuration for handling parsing failures that require modifications in the Logstash filter section. This section is specified accordingly:

#Setting folder of messages failed in parsing
if "_grokparsefailure" in [tags] {
  file {
    message_format => "%{[message]}"
    path => "C:/grokparsefailure-%{+YYYY.MM.dd}.log"
  }
}

According to this setting, all messages that contain the _grokparsefailure tag are written to a separate output file. They are listed in their original format, so they can be used as an input file with no changes in the set filters and processed by the configuration again once the problematic filter section is adjusted accordingly.

7.3.3 E-mail output

An additional possibility of the Logstash output setting is the triggering of alert e-mails 3 when specified conditions are met. Logstash e-mail alerts are suitable for basic issues that can be checked right after an event is processed.

3. Functionality of e-mail alerts was tested using an e-mail testing and debugging software tool. This tool collects the e-mails sent to the specified port of localhost without actually delivering them. It is downloadable from:

There are also additional possibilities for adding counts of events, setting complex aggregate filters or adding computed fields using ruby code for frequency checks. However, if an Elasticsearch instance is used for storing the data parsed by Logstash, querying Elasticsearch for frequency-based events using the ElastAlert framework is easier to implement.

Three e-mail alerts were added to the Logstash output configuration:

- if a message of type ERROR occurs;
- if there are no records but more than 200 bytes transferred in a connection;
- if a client with an IP from outside Europe is connecting.

These are set as below:

if "ERROR" in [message_type] {
  email {
    from => "logstash_alert@company.local"
    subject => "logstash alert"
    to => "email@example.com"
    via => "smtp"
    port =>
    body => "ERROR message. Here is the event line that occurred: %{message}"
  }
}

In this section, the following properties of the e-mail output are set: [78]

- from: whom the generated message is sent from;
- subject: subject of the generated message;
- to: addresses to which the generated message is sent;
- via: smtp should be used for sending;
- port: port used for sending, default is 25;
- address: address used to connect to the mail server, default is localhost;
- body: body of the generated message.

7.4 Running Logstash

The Logstash configuration is run from the /bin folder using the command logstash -f *.conf. For editing or adding new filters, the provided configuration needs to be adjusted. Editing existing filters can be done by finding the corresponding filter in the configuration file, searching either for the added tag or for the name of the module that generated the message. Grok debugger 4 is a helpful tool for the verification of grok parsing patterns.

4. Can be accessed from here:
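For reference, a typical invocation might look as follows. This is only a sketch: the configuration file name logstash.conf is an assumption, and the --configtest and -l flags of Logstash 1.x/2.x respectively check the configuration syntax without processing data and redirect Logstash's own log to a file:

# verify the configuration syntax first (assumed file name logstash.conf)
bin/logstash -f logstash.conf --configtest

# then start processing, writing Logstash's own log to a file
bin/logstash -f logstash.conf -l logstash.log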

8 Elasticsearch

Elasticsearch is a real-time distributed search and analytics tool. It is built on top of the Apache Lucene full-text search engine and offers quick querying capabilities. Documents are stored in JSON format and all fields are natively indexed for search. When Logstash is used as the input to Elasticsearch, dynamic mapping is used, thus creating one JSON document per parsed log record. Elasticsearch by default runs on port 9200 of localhost and is accessible through a REST API. [81]

8.1 Query syntax

A rich set of possibilities exists for querying Elasticsearch, e.g. using the Lucene query syntax. Complex queries including aggregation capabilities can be built and processed quickly thanks to the flat and easily searchable structure. Some of the querying possibilities using the Elasticsearch Query DSL are: [81]

- Full text queries: match or multimatch queries;
- Term level queries: including missing/exists and range queries;
- Compound queries: including boolean logic in queries, filtering and limiting results;
- Joining queries: used for nested fields and queries (the mapping needs to be adjusted);
- Specialized queries: comparing and scripted queries.

Example of an Elasticsearch query using the REST API:

GET localhost:9200/logstash-*/log_server/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "IMSI": " " } },
        { "match": { "message_type": "INFO" } }
      ],
      "filter": [
        { "term": { "tags": "Connection_finished" } },
        { "range": { "Created": { "gte": "300" } } }
      ]
    }
  }
}

The query searches in documents of the specified index and type, as used in the GET instruction /index_name/type_name/_search. The bool section of the query uses the keyword must as the AND operator. As a result, all documents where both conditions are met (in this case the IMSI and message_type fields correspond to the queried values) are returned. On top of the query results, a term filter is run, so that only the queried records with the specified tags and Created field values are returned. 1

1. Differences between a query and a filter can be found here: guide/en/elasticsearch/guide/current/_queries_and_filters.html

The index name is specified when storing data in Elasticsearch and it basically works as a package containing a specific type of data. Wildcard characters can also be used; in this case logstash-* would search in all logstash-%date indices. Different types of data are usually processed and saved using different indices. One index may contain multiple types of data (e.g. log_server) that are also defined on data input. For example, bookstore contents could be stored in an Elasticsearch index called bookstore and contain multiple types of documents, such as book and customer. If no index is specified in the Logstash output configuration, the logstash-%date index is used (the date is set according to the log message timestamp).

Kibana as a visualization tool for Elasticsearch has the same expression capabilities as direct querying of the Elasticsearch instance. For more complex aggregations and compound filters and queries, visualizations on top of searches are used. These queries can also be shown in their raw format, i.e. how they look when sent to Elasticsearch. Filters in Kibana can also be edited by updating the query source directly.

8.2 Mapping

Mapping in Elasticsearch defines how the indexed documents and their fields will be stored. It defines types of fields such as full-text search strings, numbers and geo-location strings, date formats and also other custom rules for the stored contents:

- String fields are analyzed by default (enabling search also in sections of the string), but a not-analyzed version of the field is stored as well as field.raw;
- Numeric fields: if the type is set as a number (integer/float), it is also stored as a number in Elasticsearch and numeric operations such as SUM or AVG can be applied;
- IP and geolocation: using the geo-location plugin, fields marked as IP are processed for geo-location information detection;

- timestamp: a timestamp is generated for all processed documents using the current time if no field is set;
- _source field: contains the original JSON document body;
- every event/log line is processed as a separate document: in case any nested properties should be set, these need to be adjusted in the configuration and also updated in the Elasticsearch mapping file.

Due to the default requirement to process log lines as documents and no special needs for the Elasticsearch mapping, this setting was not edited for the sample data processing.

One thing to consider for Elasticsearch usage is that, together with better speed, performance and scalability, its structure is much different from basic relational databases. For relational databases, relations between tables and contents are crucial. For Elasticsearch, on the other hand, there is much less support for data relationship modeling and it is often very complicated to get complex dependent information. In the flat world of JSON documents, scaling is more horizontal instead of relying on joins, and the data scheme needs to be adjusted accordingly by tweaking the mapping definition.

8.3 Accessing Elasticsearch

Elasticsearch can be easily accessed using REST API curl requests. An example of a curl request searching for records with a specified client number is listed below:

curl -XPOST localhost:9200/logstash-*/log_client/_search -d '{
  "query": { "term": { "IMSI": " " } }
}'

A nice, easy-to-use option for accessing the Elasticsearch instance directly from a browser is the Chrome browser Sense plugin. 2 It is a JSON-aware developer console for Elasticsearch and it enables sending curl-style requests to the Elasticsearch instance directly through the browser and reviewing the request results. The assumption for the overall implementation of the ELK stack is that querying and working with Elasticsearch directly is minimized, as querying and visualizations are handled by Kibana.

2. Downloadable in the Chrome web store; the github project for this add-on is at github.com/cheics/sense
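As a further illustration of the Lucene-style searches mentioned above, the same kind of search that one would type into Kibana can also be sent directly to Elasticsearch with a query_string query. This is only a sketch; the tag and field names follow the conventions used earlier in this chapter:

curl -XPOST 'localhost:9200/logstash-*/log_server/_search' -d '{
  "query": {
    "query_string": {
      "query": "tags:Connection_finished AND message_type:ERROR"
    }
  }
}'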


9 Kibana configuration

Kibana is a browser-based analytics and dashboarding tool for Elasticsearch. The GUI provided by Kibana is easy to use and enables searching and creating visualizations on top of the data stored in Elasticsearch indexes. It is best aligned for working with time-specific data, however it can work with all kinds of Elasticsearch mappings.

The assumption for the implementation part is that Kibana would be used regularly for the visualizations and for the detection of the anomalous properties added in Logstash parsing. In this section, the visualizations in the three dashboards created as part of this thesis are described, and a short user guide for the output processing in Kibana is provided. The time-based visualizations are displayed according to the time constraint set in Kibana (defaults to the last 15 minutes). Additional information about working with Kibana and adjusting the display is listed in Appendix 2: User Guide. Note that due to the dynamic mapping set in Elasticsearch, the string fields are analyzed by default. The string fields containing the original contents can be used in the form fieldname.raw.

9.1 General dashboard

The general dashboard contains the overall statistics and visualizations of the processed log data. The visualizations are mostly time-based and their purpose is to provide general information about the system performance and the specifics of processed connections. The following visualizations were created and added to the general dashboard:

General_all_type is a visualization for the overview of all logged messages in a specified time with added information about the type of the generated message (e.g. INFO/DEBUG). This visualization can be helpful in monitoring the overall count of log messages for possible spikes or drops. Additionally, it provides information about the types of messages that are logged most often, which may be useful in case of unusual changes of the most used message type that might indicate a problem with a logging module or discrepancies in the data.

General_CarKey_TM is a visualization for the overview of unique CarKeys processed in the specified time with added information about the client machine type (TM) used. This visualization is for monitoring of unusual changes in the count of specific car connections, which might indicate connection or server issues.

It also includes information about the used client device software, for two reasons:

- a general overview of what software is mostly used by active clients;
- detection of a possible relation between the software type and changes of car connections (e.g. a sudden increase of messages from clients using the same device software).

General_IMSI_country is a visualization for the overview of the count of unique client numbers (IMSI) processed in the specified time with added information about the country they were connecting from (using geodetection capabilities). The purpose of this graph is monitoring of the total count of distinct connecting clients and listing the countries where most clients are connecting from. An example of this visualization is shown in Figure 9.1 below. The overall count of clients generating messages hourly during one day is visible in the screenshot. The number of logging clients increased noticeably around 2 and 3 pm, which might be worth investigating. Apart from that, apparently most clients come from the Czech Republic and Slovakia.

Figure 9.1: General_IMSI_country visualization screenshot

General_sessions_type is a visualization for the overview of the count of unique sessions processed in the specified time with added information about the connection type (UDP/TCP/PUSH). This graph can be used for detection of connection overload (in case of high numbers of sessions) as well as for an overview of the most used connection types.

General_failed_reason is a visualization for the overview of all failed connections that occurred in the specified time with added information about the failure reason. The purpose of this visualization is a general overview of the failed connections and their causes for the monitoring of erroneous communications. An example of this visualization is shown in Figure 9.2 below. There are multiple peaks of the failed connections count that are presumably worth investigation. These are mostly caused by timeout and concurrent connection errors. Apart from these, however, there are also multiple unregistered client connection attempts in a short timeframe, which might also indicate an attack attempt.

Figure 9.2: General_failed_reason visualization screenshot

General_map is a visualization based on the geolocation information gained from the connecting IP addresses, shown on the world map. As most of the clients are connecting from Europe, the assumption is that all connections from other parts of the world are considered an anomaly (e.g. from Asia or Africa).

General_carkey_type is a pie chart visualization of the top 10 CarKeys occurring in the logged messages, including information about their connection types (UDP/TCP/PUSH).

General_module_type is a pie chart visualization of the top 5 modules that generated the most messages, including information about the logged message types. An example of this visualization is shown in Figure 9.3 below. According to the visualization, the largest part of the messages was logged by WorkerUDP (so generated in UDP connections). This might also be helpful information for better insight into the distribution of the incoming connection types. Additionally, there is information about the type of messages generated by the modules, in which the INFO type is taking the lead. However, the blue section marking the WARN type of message might be worth looking into, as the warning messages often contain valuable information about processing issues.

Figure 9.3: General_module_type visualization screenshot

General_client_TM is a pie chart visualization of the top 10 clients that generated the most messages. It also includes information about the clients' machine type (TM).

General_records_sum is a visualization of trends and changes in the sum of transferred records over time. There are separate lines comparing the sums of processed service, message and position record types for better insight into what types of records are processed the most. An example of this visualization is shown in Figure 9.4. There are visible peaks in the counts of transferred records according to their type. Apart from that, it is also visible that service type records are transferred much more frequently than the other two record types.

Figure 9.4: General_records_sum visualization screenshot

General_files_sum is a visualization of changes and trends in the sum of transferred files. There are separate lines comparing the sums of forwarded and received types of files, which might be helpful for gaining knowledge about what types of files are processed the most and how frequently.

General_max_same is a visualization of changes and trends in the maximum of the same transferred records counter (the records_same field).

This visualization can be helpful in the detection of unusually high counts of same records transfer attempts. From the log records only the type of the transferred record can be learnt, so low numbers might refer to transfers of different records of the same type. However, once the records_same value is significantly high, it is worth looking into.

General_bytes_max contains a visualization of the maximum numbers of transferred bytes in a specified time for better insight into peaks (if any) of bytes transferred in connections. The graphic also includes information about the connection type (TCP/UDP/PUSH) for monitoring of the connections generating the maximum numbers of bytes. An example of this visualization is shown in Figure 9.5. There are some visible peaks in the maximum of transferred bytes in the screenshot. An additional piece of information about the application behavior is that even though most messages are logged by the UDP connections (as shown in the previous visualization General_module_type), TCP connections are responsible for the highest bytes transfers.

Figure 9.5: General_bytes_max visualization screenshot

General_bytes_sum is a visualization of changes in the sum of transferred bytes including the information about the module that generated the message. The purpose of this graphic is to gain knowledge about what modules are responsible for peaks in the sum of transferred bytes, as well as an overview of the transferred bytes in a specific timeframe.

This information can be helpful when monitoring sudden changes in transferred bytes or unusually high/low numbers.

The General_tags_overview table lists the counts of messages for all tags, types of messages and module names. This table is meant as an overview list of tagged log messages for quick detection of sudden and suspicious changes in the counts of a specific type of message.

9.2 Anomaly dashboard

The Anomaly dashboard contains visualizations of detected anomalies in the processed log records. They are mostly based on the computed fields added to documents as part of the Logstash parsing and processing. Most of the computed fields are listed and described in the Logstash configuration section. The following visualizations were created as part of the Anomaly dashboard:

Anomaly_empty is a visualization of trending changes in the count of empty connections over time. The purpose of this visualization is possible tracking of similar patterns of empty connection counts over time. An example of this visualization is shown in Figure 9.6.

Figure 9.6: Anomaly_empty visualization screenshot

Anomaly_max_time is a visualization for comparing the maximum values of the Created field, parsed from the finished connection event message (containing information about the connection duration), and the elapsed_time field, computed as the time between the new connection message and the finished connection event. Differences between these two fields can also be caused by missing start events for the connection, resulting in the elapsed filter not functioning properly. The overall goal should, however, be to align these two metrics.

Anomaly_time_created is a visualization of the sum of all the Created fields in a specific time, for monitoring of possible increases in processing delay. It also includes the information about clients. An example of this visualization is shown in Figure 9.7.

Figure 9.7: Anomaly_time_created visualization screenshot

Anomaly_max_bytes is a visualization for comparing trends in the changes of the maximum values of bytes transferred in the connections and for possible spike detection. There might be some specific time of day that is usually overloaded with a high number of byte transfers, and if so, the system can be adjusted accordingly.

Anomaly_duplicated_records is a visualization created for monitoring of the sum of duplicated transfers (the records_same field), including the information about client numbers. The records_same value might not always mark a truly duplicate records transfer, as the exact record contents cannot be learnt from the log messages. However, this visualization helps with the detection of clients with the highest number of same records transfer attempts. These clients can then be filtered on.

Anomaly_exception & Anomaly_SQL are visualizations tracking the messages that contain suspicious contents on output, such as Java exceptions and SQL code. An example of the SQL code anomaly detection visualization is shown in Figure 9.8. Tracking of Java exceptions is displayed using the same type of visualization.

Figure 9.8: Anomaly_SQL visualization screenshot

Anomaly_bytes_client is a visualization of the empty connections that are transferring more bytes than usual, i.e. the messages tagged with the Too_many_bytes tag. The expected number of bytes for an empty connection (considering protocol management needs) is below 200 bytes. This visualization also includes information about clients, so they can be easily filtered for further investigation.

Anomaly_time_mismatch is a visualization tracking the records where the Created time value differs from the computed elapsed_time. The difference is stored in the computed field time_difference and the time_mismatch flag is set only if the difference is greater than the chosen threshold (currently set to 0.1 second). This threshold can be adjusted if greater differences are expected.

Anomaly_empty_clients is a visualization of all the empty connections in a specified time with added information about the clients. The overall purpose of this visualization is to investigate the clients that are often connecting to the server without any records or files being transferred. The clients occurring the most can then be filtered using the table display of the visualization (it can be accessed by clicking the arrow icon at the bottom of the graph). An example of this visualization is shown in Figure 9.9.

Figure 9.9: Anomaly_empty_clients visualization screenshot

Anomaly_records_mismatch is a visualization of the Connection_finished messages where the sum of the record number values differs within the message contents (a comparison of the records_total computed field and the conn_records_total field). This anomaly has already been discussed in the server log data section 6.1.

Anomaly_change_same is a visualization of all instances of messages containing information about a change where both values (before and after) were the same. This graph is based on the computed tag Changed_same_value that is added as part of the Logstash parsing filters by comparing the previous (_from field) and new (_to field) values in all messages containing field value changes.

Anomaly_field_check is a visualization added for monitoring of misalignments in the field formats (e.g. client/IMSI numbers that are not 14 or 15-digit numbers as usual). These formats are checked as part of the Logstash parsing filters (a possible future usage is that these fields are strictly defined in the overall parsing filters).

9.3 Client dashboard

There are only a few general visualizations created for the client log records, as these are expected to be used only on request. They are mostly targeted at status distribution graphics and an overall tags overview. Additional visualizations can be created and added to this dashboard for tracking of other elements parsed from the client log files. The Client dashboard visualizations are listed below:

Client_runtime_exception is a visualization monitoring the messages containing a Java exception in their contents. An example of this visualization is shown in Figure 9.10.

Figure 9.10: Client_runtime_exception visualization screenshot

Client_all_service is a visualization created for listing all the log records with the name of the service (or module) that generated them.

Client_status is a visualization of the distribution of global statuses in the log messages overview. These statuses can have three distinct values: GREEN/RED/YELLOW.

Client_tags_overview is a table with the counts of all message types listed based on their tags.

Client_dataStatus is a visualization used for an overview of data statuses in the client message contents.

9.4 Encountered issues and summary

There are multiple functionality downsides of Kibana 4 that were found during the implementation. Active issues of Kibana 4 can also be found on the github project page, where the developers react to the flagged issues and provide information on possible feature inclusion in future releases 1. Some of the encountered limitations of Kibana 4 are:

- the Time field cannot be hidden on the Discover tab, so the listing of raw messages is not very neat;
- results on the Discover tab are wrapped to fit the page instead of being listed one record per line (with possible horizontal scrolling out of the page);
- the default Discover tab contents cannot be pre-configured and default filtering cannot be set;
- results on the Discover tab are limited by the number of results, instead of allowing all results to be listed on more pages;
- nested aggregations are not supported, so it is not possible to e.g. compute the sum of the number of records;
- it is hard to locate points when using the geodetection map visualization;
- visualizations on the Dashboard tab are not always aligned correctly and do not display the same after a re-load.

All dashboards, visualizations and searches discussed in this chapter are included in the electronic version. They can be easily imported into a running Kibana instance using the Import option on the Settings -> Objects tab of the Kibana GUI. See the location of the Import button in the screenshot in Figure 9.11.

1. Kibana issues tracked on the github page can be reviewed here: github.com/elastic/kibana/issues

Figure 9.11: Kibana Objects import

A short User Guide for using Kibana for working with searches and visualizations is added as part of the Appendices chapter.


10 ElastAlert

As Kibana, due to its browser-based nature, is not suitable for real-time alerting, the ElastAlert plugin is used for frequency monitoring and alert generation. ElastAlert is a simple framework for alerting on anomalies, spikes, or other patterns of interest in data stored in Elasticsearch. The overall functionality of ElastAlert lies in a default configuration of the host and port of the Elasticsearch instance and a set of rules created for the detection of anomaly patterns. The rules are situated in a specified folder and contain an Elasticsearch query that is triggered according to a set interval.

10.1 Types of alert rules

There are multiple types of rules that are supported by ElastAlert: [83]

- Any: every hit that the query returns will generate an alert;
- Blacklist: the rule checks if a specified field is on a blacklist and matches if it is present;
- Whitelist: the rule checks if a specified field is on a whitelist and matches if it is not present;
- Change: monitors a certain field value and alerts if it changes;
- Frequency: triggered when a certain number of specified events happen in a given timeframe;
- Spike: matched when the volume of events during a given time period is spike_height times larger or smaller than the volume during the previous time period;
- Flatline: matches when the total number of events is under a given threshold for a time period;
- New Term: matches when a new value that was not seen before appears in a field;
- Cardinality: matches when the total number of unique specified events/values in a given timeframe is higher or lower than a preset threshold.

There are various other types of alerting output settings listed in the ElastAlert documentation. For the created rules, however, only alerting by e-mail was used.
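For orientation, a minimal global ElastAlert configuration (config.yaml) might look roughly as follows. This is a sketch with illustrative values, not the exact configuration used for the sample data:

# Sketch of config.yaml: where ElastAlert finds Elasticsearch and the rule files
es_host: localhost
es_port: 9200
rules_folder: rules            # folder containing the rule files described below
run_every:
  minutes: 1                   # how often all rules are queried against Elasticsearch
buffer_time:
  minutes: 15                  # size of the query window used on each run
writeback_index: elastalert_status
alert_time_limit:
  days: 2                      # how long ElastAlert retries failed alerts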

ElastAlert is written in python and needs the python libraries installed and set up in order to run 1.

1. All requirements for ElastAlert to be run are listed in its documentation on page http://elastalert.readthedocs.org/en/latest/running_elastalert.html.

10.2 Created alert rules

A set of default ElastAlert rules was created for the sample data monitoring. The thresholds and frequency settings are based on the sample data output but can be adjusted accordingly. The following alert rules are provided:

- rule_connections_spike sends an alert once there is a 3 times difference in the count of Connection_finished events in comparison to the previous time window (1 hour);
- rule_empty_spike sends an alert once there is a 3 times difference in the count of Empty_connection events in comparison to the previous time window (1 hour);
- rule_exception_spike sends an alert once there is a 2 times difference in the count of Exception_code events in comparison to the previous time window (1 hour);
- rule_finished_frequency sends an alert once there are more than 3000 connections in the last 15 minutes;
- rule_failed_frequency sends an alert once there are more than 15 failed connections in the last 15 minutes;
- rule_empty_frequency sends an alert once there are more than 400 empty connections in the last 15 minutes;
- rule_duplicate_transfer_frequency sends an alert once there are more than 700 connections attempting to transfer the same records as before in the last 15 minutes (for all clients);
- rule_sessionid_cardinality sends an alert if there are more than 3000 unique sessions in the last 15 minutes, for overload monitoring;
- rule_imsi_cardinality sends an alert if there are more than 1000 distinct clients connecting in the last hour;
- rule_carkey_cardinality sends an alert if there are more than 1000 distinct car connections in the last hour;

- rule_duplicate_imsicardinality sends an alert if more than a configured number 2 of distinct clients attempt to transfer the same records as before in the last 15 minutes.

2. This number might appear too high in comparison to the overall number of connecting clients. As only the type of record is available in the logs, transfers flagged as the same will also include regular repeated transfers of different records of the same type. The purpose of monitoring these is to flag unusually high numbers of the same records transfers for possible system issue detection.

These rules were created for monitoring of overall changes in the numbers of connections and clients, for early alerting before actual damage can occur. The spikes of both connection events and empty connections are monitored for potential system issues. Unusually high numbers of transfers within the timeframe are flagged as well, for uncovering possible server processing issues. The cardinality rules tracking the unique count of clients are set for revealing device or communication issues with a specific client. Similarly, both the overall and the client-specific duplicate transfer attempts are monitored for potential issues with the data transfer and the client-server communication. All these rules are commented and can be adjusted, and similar additional rules can be created using the ElastAlert framework [83]. A sketch of one such rule is shown below.
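The following sketch of a frequency rule is in the spirit of rule_failed_frequency. The structure (name, type, index, num_events, timeframe, filter, alert) follows the ElastAlert documentation; the tag name Connection_failed and the e-mail address are illustrative assumptions, not values taken from the provided rule files:

# Sketch of an ElastAlert frequency rule (rule_failed_frequency-like)
name: rule_failed_frequency
type: frequency                # alert when num_events matches occur within timeframe
index: logstash-*
num_events: 15
timeframe:
  minutes: 15
filter:                        # standard Elasticsearch query DSL fragments
- term:
    tags: "Connection_failed"  # assumed tag marking failed connections
alert:
- email
email:
- "admin@example.com"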


11 Conclusion

Every source of information about the overall behavior and patterns of a web-based application is important for gaining knowledge and improving the service. Logs, as a valuable source of information, are often underestimated. However, their processing and analysis may significantly improve troubleshooting efforts and uncover issues not visible in everyday use. Anomaly detection and monitoring of the communication flow can reveal important information about the processing flow and help catch issues before they cause actual damage.

Multiple log analysis systems were compared and categorized. The categorization was based on the information available about their functionality, in an attempt to get an overview of possible solutions varying by requirements. Real-life data from a car tracking service were used to propose an open-source solution for log record processing and analytics. It is based on the ELK stack. The proposed solution was implemented and the sample data were processed and analyzed using it. The following results were accomplished:

- Sample data contents were successfully processed: Logstash configuration files were created for parsing the information of interest from the original, mostly unstructured data;
- various Kibana visualizations were created and exported for overall statistics and monitoring and for anomalous behavior detection;
- Logstash alerts were set for event-dependent issues and for error alerting at the time of processing;
- ElastAlert rules were created for real-time alerting capabilities based on sudden changes (spikes) and event frequency monitoring.

Various issues were encountered during the writing and implementation of this thesis:

- Non-standard categorization of log analysis: Overall log analysis varies in requirements and execution. During the investigation, multiple distinct sources of information were found, differing greatly in their understanding of the concept, meaning and goal of log analysis.
- Sample data contents issues: Multiple discrepancies and inconsistencies were present in the sample data set. Issues such as duplicate records, different messaging for the same actions and inconsistent property names and values were found. There were also differences between records logged on different days (e.g. logs for one day contained additional types of messages that were not present in logs from a different day). These issues made the implementation, testing and maintenance of the Logstash parsing and filter section much more complicated and time-consuming.

- Limitations of the chosen implementation: One piece of preferred functionality for the sample data log analysis is to list all the messages contained in the sessions of a specific client. However, due to the flat structure of Elasticsearch, JOIN functionality is not natively supported. There are also a number of issues tracked for Kibana 4, and some functionality that was supported in Kibana 3 is not yet available in the current version.

11.1 Future work

There are multiple possibilities for future improvements of the solution provided in this thesis.

11.1.1 Nested queries

One possibility to improve the functionality would be to enable the usage of nested queries. These would allow listing all messages received in the sessions of a specific client. This would require a JOIN operation, which is not supported by Elasticsearch. The required result can be accomplished in multiple ways:

- Add information about the client to the following session messages: this information should ideally be present in the first event of a specific session and would then be added using the aggregate filter in Logstash (implemented and working if DEBUG messages are enabled);
- adjust the Logstash configuration and the Elasticsearch mapping to use nested properties: in this case the overall data model would need to be adjusted so that the messages are stored as a list inside a specific session (then a parent-child relationship could be defined and the messages of a specific session could be searched on using nested queries); a sketch of such a mapping is shown after this list;
- create a custom application which will communicate with Elasticsearch and use the saved result from the first query (get all sessions for a specific client) to run a second query (get all messages for the sessions from the first query result).
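A rough sketch of the second option is shown below, using Elasticsearch 2.x mapping syntax. The index and field names are illustrative assumptions; the actual data model would have to be designed around the sample data:

# Sketch only: a session document embedding its messages as nested objects
PUT /sessions
{
  "mappings": {
    "session": {
      "properties": {
        "IMSI":    { "type": "string", "index": "not_analyzed" },
        "CarKey":  { "type": "string", "index": "not_analyzed" },
        "messages": {
          "type": "nested",
          "properties": {
            "message_type": { "type": "string" },
            "message":      { "type": "string" }
          }
        }
      }
    }
  }
}

Messages of one session would then be indexed inside the parent session document and searched with a nested query, at the cost of restructuring the Logstash output.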

11.1.2 Alignment of client/server logs

The client and server logs contain a number of misalignments. To enable better comparison and troubleshooting capabilities, the improvement ideas below need to be considered:

- Standardize the time format: currently it is YYYY-MM-dd HH:mm:ss.SSS for the server and YYYY-MM-dd HH:mm:ss for the client (a sketch of handling both formats in Logstash is shown after this list);
- get information about the session also in the client log file, for better matching;
- correct the possible time delay between the server and client logs so that they are shown at the corresponding time.
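As long as the formats differ, both of them can at least be parsed consistently on the Logstash side. The following date filter fragment is only a sketch; the source field name log_timestamp is an assumption, not a field name taken from the provided configuration:

#Sketch: accept both the server and the client timestamp format
date {
  match => ["log_timestamp", "YYYY-MM-dd HH:mm:ss.SSS", "YYYY-MM-dd HH:mm:ss"]
  target => "@timestamp"
}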


12 Appendix 1: Electronic version

The electronic version of this thesis includes:

- Logstash configuration files for both server and client logs;
- searches/visualizations/dashboards exported from Kibana;
- ElastAlert rules and configuration file;
- an example of the sample data;
- text and images of the thesis document in TeX.


13 Appendix 2: User Guide

As the Kibana GUI is assumed to be used the most in the logged data investigation, a short user guide is included to get started. Kibana is accessible at

When Kibana is accessed for the first time, the index template to be used for data querying needs to be set. For the sample data processed by Logstash, time-based events should be checked, and the index pattern logstash-* and Timestamp as the time-field name should be chosen. See the screenshot in Figure 13.1 for reference. This step can be done only when some indexes exist in Elasticsearch, i.e. some log files have been processed by Logstash.

Figure 13.1: Setting index in Kibana GUI

Once the pattern is set, one can proceed with investigating the data by selecting the Discover tab. This is detailed in the next section. The default time range set when opening Kibana is the last 15 minutes. In case there are no records logged in this timeframe, an error page saying that no results were found is displayed. The date for which all records are shown (including the visualizations on dashboards) can be adjusted in the time-picker widget in the top right corner of the Kibana GUI. You can see the time picker placement in Figure 13.2.

Figure 13.2: Changing date and time in Kibana GUI

The chosen date can also be adjusted by selecting individual parts of the time-based visualization graphs. Choosing a section or columns of a visualization creates a time filter for the selected area.

13.1 Discover tab

The Discover tab of the Kibana GUI is designed for basic searching and querying of the data present in Elasticsearch. The default GUI sections present on the Discover tab are described below, based on the screenshot with emphasized sections in Figure 13.3.

Figure 13.3: Kibana Discover tab GUI

- The search window situated at the top of the page is used for entering a query (marked yellow in the screenshot); input queries are based on the Lucene search syntax;
- searches can be saved, loaded and exported (marked orange in the screenshot); saved searches can be used in visualizations;

- below the search toolbar, on the right side, there is the number of hits returned by the query (marked pink in the screenshot);
- directly below the search bar there is a small visualization of the overall count of records logged in the timeframe selected in the time picker widget (marked brown in the screenshot); by selecting a part of this section, the timeframe gets updated to show only the records in the timeframe chosen on the visualization;
- below the visualization there is a list of records returned by the query (marked red in the screenshot); records are wrapped to show the first few lines by default, and all contents of a record can be shown by clicking the arrow on the left;
- on the left side of the Discover tab contents there is a list of fields present in the records returned by the query (marked grey in the screenshot); this field section is updated automatically according to the contents returned by the query and can be hidden by clicking the arrow on the right side of the panel;
- the fields from the left panel can be added as columns to the results list (marked purple in the screenshot), replacing the default _source field; e.g. message can be added to enable reviewing only the message text of the listed records;
- the fields listed on the left panel can be added to and removed from the main results table as needed, moved left/right and sorted; e.g. for reviewing client and server logs it is suggested to add the type column, distinguishing between the log_server and log_client types of records.

All field values listed in the log record description can be used as filters directly (see the bottom part of the Discover tab screenshot). These filters can also be edited on the source level; the manually updated query is sent to Elasticsearch directly. See the screenshot in Figure 13.4 for reference.

Figure 13.4: Kibana Discover tab filter

Some of the basic queries used in the Kibana Discover tab:

- Searching for records that are missing a specific field is done by querying for: _missing_: fieldname.
- Searching for records in which a specific field is present is done by querying for: _exists_: fieldname.
- A field with a specific value can be searched on using the query fieldname: value; in the case of a string, "value" needs to be enclosed in quotes for an exact match (otherwise all parts of the string will be searched and matched). For numeric fields, basic comparisons can also be done, e.g. records_same:>10.

- A query may contain the boolean operators AND, NOT, OR and brackets for more complex queries, such as tags: Connection_finished AND NOT (TM: A29 OR TM: A23).
- Scripted fields can be added to records, but they cannot be searched on in the Discover tab, only used in visualizations.

13.2 Settings tab

On the Settings -> Advanced tab the default configuration can be adjusted. For example, the number of lines shown by default is 500, but this can be increased. The provided dashboards, visualizations and searches can also be imported from here, as instructed in the Kibana chapter.

13.3 Dashboard tab

Dashboards can be opened on the Dashboard tab using the same Save/Open/New icons as in the search toolbar of the Discover tab. Visualizations can be added to the currently opened dashboard by using the (+) icon. All visualizations present on dashboards can be accessed and edited by choosing the pencil icon in the upper right corner of every visualization. Visualizations can also be rearranged in the dashboard display using drag and drop, or removed by clicking the cross icon in the upper right corner. All visualizations on a dashboard are aligned to the date and time chosen in the time picker widget. See the dashboard screenshot in Figure 13.5.

Figure 13.5: Kibana Dashboard visualizations

13.4 Visualization tab

When a visualization is clicked on (the pencil icon), it is opened in the Visualize tab, where it can be further explored and adjusted. All sections in the display can be filtered on by clicking them directly. The results of the visualization in table format can be accessed by clicking the arrow at the bottom of the visualization. The fields and values listed in this table can also be filtered on by clicking them. All applied filters can be pinned (using the pin option on the filter) and then applied on all Kibana tabs. As a result, the value and time of interest can be set/filtered on the Visualization tab and pinned, and the records themselves can be investigated on the Discover tab. Looking into the visualization results in more detail, using the table display and filtering, can be very helpful in the overall investigation of anomalous behavior and in troubleshooting. For the filtering addition example see the screenshot in Figure 13.6.

Figure 13.6: Kibana visualization editing

There are multiple options for adjusting the display of a visualization:

- adjusting metrics (marked yellow in the screenshot), e.g. the timeframe can be adjusted to show counts hourly/by minutes;
- opening the table list by clicking the arrow at the bottom of the graph (marked pink in the screenshot);
- setting a filter on a value of interest (marked orange in the screenshot);
- applying filters at the top of the page (marked red in the screenshot).

Complete information about Kibana usage and much more is listed in the original Kibana 4 documentation pages. [82]


Test Run Analysis Interpretation (AI) Made Easy with OpenLoad Test Run Analysis Interpretation (AI) Made Easy with OpenLoad OpenDemand Systems, Inc. Abstract / Executive Summary As Web applications and services become more complex, it becomes increasingly difficult

More information

Niara Security Intelligence. Overview. Threat Discovery and Incident Investigation Reimagined

Niara Security Intelligence. Overview. Threat Discovery and Incident Investigation Reimagined Niara Security Intelligence Threat Discovery and Incident Investigation Reimagined Niara enables Compromised user discovery Malicious insider discovery Threat hunting Incident investigation Overview In

More information

Desktop Activity Intelligence

Desktop Activity Intelligence Desktop Activity Intelligence Table of Contents Cicero Discovery Delivers Activity Intelligence... 1 Cicero Discovery Modules... 1 System Monitor... 2 Session Monitor... 3 Activity Monitor... 3 Business

More information

Security+ Guide to Network Security Fundamentals, Fourth Edition. Chapter 6 Network Security

Security+ Guide to Network Security Fundamentals, Fourth Edition. Chapter 6 Network Security Security+ Guide to Network Security Fundamentals, Fourth Edition Chapter 6 Network Security Objectives List the different types of network security devices and explain how they can be used Define network

More information

Security Event Management. February 7, 2007 (Revision 5)

Security Event Management. February 7, 2007 (Revision 5) Security Event Management February 7, 2007 (Revision 5) Table of Contents TABLE OF CONTENTS... 2 INTRODUCTION... 3 CRITICAL EVENT DETECTION... 3 LOG ANALYSIS, REPORTING AND STORAGE... 7 LOWER TOTAL COST

More information

CMP3002 Advanced Web Technology

CMP3002 Advanced Web Technology CMP3002 Advanced Web Technology Assignment 1: Web Security Audit A web security audit on a proposed eshop website By Adam Wright Table of Contents Table of Contents... 2 Table of Tables... 2 Introduction...

More information

End-user Security Analytics Strengthens Protection with ArcSight

End-user Security Analytics Strengthens Protection with ArcSight Case Study for XY Bank End-user Security Analytics Strengthens Protection with ArcSight INTRODUCTION Detect and respond to advanced persistent threats (APT) in real-time with Nexthink End-user Security

More information

Detecting rogue systems

Detecting rogue systems Product Guide Revision A McAfee Rogue System Detection 4.7.1 For use with epolicy Orchestrator 4.6.3-5.0.0 Software Detecting rogue systems Unprotected systems, referred to as rogue systems, are often

More information

DEPLOYMENT GUIDE Version 2.1. Deploying F5 with Microsoft SharePoint 2010

DEPLOYMENT GUIDE Version 2.1. Deploying F5 with Microsoft SharePoint 2010 DEPLOYMENT GUIDE Version 2.1 Deploying F5 with Microsoft SharePoint 2010 Table of Contents Table of Contents Introducing the F5 Deployment Guide for Microsoft SharePoint 2010 Prerequisites and configuration

More information

PALANTIR CYBER An End-to-End Cyber Intelligence Platform for Analysis & Knowledge Management

PALANTIR CYBER An End-to-End Cyber Intelligence Platform for Analysis & Knowledge Management PALANTIR CYBER An End-to-End Cyber Intelligence Platform for Analysis & Knowledge Management INTRODUCTION Traditional perimeter defense solutions fail against sophisticated adversaries who target their

More information

McAfee Web Gateway Administration Intel Security Education Services Administration Course Training

McAfee Web Gateway Administration Intel Security Education Services Administration Course Training McAfee Web Gateway Administration Intel Security Education Services Administration Course Training The McAfee Web Gateway Administration course from Education Services provides an in-depth introduction

More information

APPLICATION PROGRAMMING INTERFACE

APPLICATION PROGRAMMING INTERFACE DATA SHEET Advanced Threat Protection INTRODUCTION Customers can use Seculert s Application Programming Interface (API) to integrate their existing security devices and applications with Seculert. With

More information

Identifying the Number of Visitors to improve Website Usability from Educational Institution Web Log Data

Identifying the Number of Visitors to improve Website Usability from Educational Institution Web Log Data Identifying the Number of to improve Website Usability from Educational Institution Web Log Data Arvind K. Sharma Dept. of CSE Jaipur National University, Jaipur, Rajasthan,India P.C. Gupta Dept. of CSI

More information

LOG INTELLIGENCE FOR SECURITY AND COMPLIANCE

LOG INTELLIGENCE FOR SECURITY AND COMPLIANCE PRODUCT BRIEF uugiven today s environment of sophisticated security threats, big data security intelligence solutions and regulatory compliance demands, the need for a log intelligence solution has become

More information

Test Case 3 Active Directory Integration

Test Case 3 Active Directory Integration April 12, 2010 Author: Audience: Joe Lowry and SWAT Team Evaluator Test Case 3 Active Directory Integration The following steps will guide you through the process of directory integration. The goal of

More information

DEPLOYMENT GUIDE Version 1.2. Deploying F5 with Oracle E-Business Suite 12

DEPLOYMENT GUIDE Version 1.2. Deploying F5 with Oracle E-Business Suite 12 DEPLOYMENT GUIDE Version 1.2 Deploying F5 with Oracle E-Business Suite 12 Table of Contents Table of Contents Introducing the BIG-IP LTM Oracle E-Business Suite 12 configuration Prerequisites and configuration

More information

Transaction Monitoring Version 8.1.3 for AIX, Linux, and Windows. Reference IBM

Transaction Monitoring Version 8.1.3 for AIX, Linux, and Windows. Reference IBM Transaction Monitoring Version 8.1.3 for AIX, Linux, and Windows Reference IBM Note Before using this information and the product it supports, read the information in Notices. This edition applies to V8.1.3

More information

Monitoring Sonic Firewall

Monitoring Sonic Firewall Monitoring Sonic Firewall eg Enterprise v6.0 Restricted Rights Legend The information contained in this document is confidential and subject to change without notice. No part of this document may be reproduced

More information

Using Application Insights to Monitor your Applications

Using Application Insights to Monitor your Applications Using Application Insights to Monitor your Applications Overview In this lab, you will learn how to add Application Insights to a web application in order to better detect issues, solve problems, and continuously

More information

Understanding Slow Start

Understanding Slow Start Chapter 1 Load Balancing 57 Understanding Slow Start When you configure a NetScaler to use a metric-based LB method such as Least Connections, Least Response Time, Least Bandwidth, Least Packets, or Custom

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Access Control Rules: URL Filtering

Access Control Rules: URL Filtering The following topics describe how to configure URL filtering for your Firepower System: URL Filtering and Access Control, page 1 Reputation-Based URL Filtering, page 2 Manual URL Filtering, page 5 Limitations

More information

Intrusion Detection in AlienVault

Intrusion Detection in AlienVault Complete. Simple. Affordable Copyright 2014 AlienVault. All rights reserved. AlienVault, AlienVault Unified Security Management, AlienVault USM, AlienVault Open Threat Exchange, AlienVault OTX, Open Threat

More information

Introduction to Directory Services

Introduction to Directory Services Introduction to Directory Services Overview This document explains how AirWatch integrates with your organization's existing directory service such as Active Directory, Lotus Domino and Novell e-directory

More information

How To Install An Aneka Cloud On A Windows 7 Computer (For Free)

How To Install An Aneka Cloud On A Windows 7 Computer (For Free) MANJRASOFT PTY LTD Aneka 3.0 Manjrasoft 5/13/2013 This document describes in detail the steps involved in installing and configuring an Aneka Cloud. It covers the prerequisites for the installation, the

More information

Transformation of honeypot raw data into structured data

Transformation of honeypot raw data into structured data Transformation of honeypot raw data into structured data 1 Majed SANAN, Mahmoud RAMMAL 2,Wassim RAMMAL 3 1 Lebanese University, Faculty of Sciences. 2 Lebanese University, Director of center of Research

More information

Hillstone T-Series Intelligent Next-Generation Firewall Whitepaper: Abnormal Behavior Analysis

Hillstone T-Series Intelligent Next-Generation Firewall Whitepaper: Abnormal Behavior Analysis Hillstone T-Series Intelligent Next-Generation Firewall Whitepaper: Abnormal Behavior Analysis Keywords: Intelligent Next-Generation Firewall (ingfw), Unknown Threat, Abnormal Parameter, Abnormal Behavior,

More information

Paper 064-2014. Robert Bonham, Gregory A. Smith, SAS Institute Inc., Cary NC

Paper 064-2014. Robert Bonham, Gregory A. Smith, SAS Institute Inc., Cary NC Paper 064-2014 Log entries, Events, Performance Measures, and SLAs: Understanding and Managing your SAS Deployment by Leveraging the SAS Environment Manager Data Mart ABSTRACT Robert Bonham, Gregory A.

More information

Monitoring Traffic manager

Monitoring Traffic manager Monitoring Traffic manager eg Enterprise v6 Restricted Rights Legend The information contained in this document is confidential and subject to change without notice. No part of this document may be reproduced

More information

Web applications today are part of every IT operation within an organization.

Web applications today are part of every IT operation within an organization. 1 Introduction Web applications today are part of every IT operation within an organization. Independent software vendors (ISV) as well as enterprises create web applications to support their customers,

More information

How To Fix A Web Application Security Vulnerability

How To Fix A Web Application Security Vulnerability Proposal of Improving Web Application Security in Context of Latest Hacking Trends RADEK VALA, ROMAN JASEK Department of Informatics and Artificial Intelligence Tomas Bata University in Zlin, Faculty of

More information

Application Discovery Manager User s Guide vcenter Application Discovery Manager 6.2.1

Application Discovery Manager User s Guide vcenter Application Discovery Manager 6.2.1 Application Discovery Manager User s Guide vcenter Application Discovery Manager 6.2.1 This document supports the version of each product listed and supports all subsequent versions until the document

More information

McAfee Web Gateway 7.4.1

McAfee Web Gateway 7.4.1 Release Notes Revision B McAfee Web Gateway 7.4.1 Contents About this release New features and enhancements Resolved issues Installation instructions Known issues Find product documentation About this

More information

NETWORK AND SERVER MANAGEMENT

NETWORK AND SERVER MANAGEMENT E-SPIN PROFESSIONAL BOOK NETWORK MANAGEMENT NETWORK AND SERVER MANAGEMENT ALL THE PRACTICAL KNOW HOW AND HOW TO RELATED TO THE SUBJECT MATTERS. COMPREHENSIVE MONITORING FOR NETWORKS, SYSTEMS APPLICATIONS,

More information

Decryption. Palo Alto Networks. PAN-OS Administrator s Guide Version 6.0. Copyright 2007-2015 Palo Alto Networks

Decryption. Palo Alto Networks. PAN-OS Administrator s Guide Version 6.0. Copyright 2007-2015 Palo Alto Networks Decryption Palo Alto Networks PAN-OS Administrator s Guide Version 6.0 Contact Information Corporate Headquarters: Palo Alto Networks 4401 Great America Parkway Santa Clara, CA 95054 www.paloaltonetworks.com/company/contact-us

More information

Advanced Web Security, Lab

Advanced Web Security, Lab Advanced Web Security, Lab Web Server Security: Attacking and Defending November 13, 2013 Read this earlier than one day before the lab! Note that you will not have any internet access during the lab,

More information

VX Search File Search Solution. VX Search FILE SEARCH SOLUTION. User Manual. Version 8.2. Jan 2016. www.vxsearch.com info@flexense.com. Flexense Ltd.

VX Search File Search Solution. VX Search FILE SEARCH SOLUTION. User Manual. Version 8.2. Jan 2016. www.vxsearch.com info@flexense.com. Flexense Ltd. VX Search FILE SEARCH SOLUTION User Manual Version 8.2 Jan 2016 www.vxsearch.com info@flexense.com 1 1 Product Overview...4 2 VX Search Product Versions...8 3 Using Desktop Product Versions...9 3.1 Product

More information

Load and Performance Load Testing. RadView Software October 2015 www.radview.com

Load and Performance Load Testing. RadView Software October 2015 www.radview.com Load and Performance Load Testing RadView Software October 2015 www.radview.com Contents Introduction... 3 Key Components and Architecture... 4 Creating Load Tests... 5 Mobile Load Testing... 9 Test Execution...

More information

XpoLog Center Suite Data Sheet

XpoLog Center Suite Data Sheet XpoLog Center Suite Data Sheet General XpoLog is a data analysis and management platform for Applications IT data. Business applications rely on a dynamic heterogeneous applications infrastructure, such

More information

Secure Web Application Coding Team Introductory Meeting December 1, 2005 1:00 2:00PM Bits & Pieces Room, Sansom West Room 306 Agenda

Secure Web Application Coding Team Introductory Meeting December 1, 2005 1:00 2:00PM Bits & Pieces Room, Sansom West Room 306 Agenda Secure Web Application Coding Team Introductory Meeting December 1, 2005 1:00 2:00PM Bits & Pieces Room, Sansom West Room 306 Agenda 1. Introductions for new members (5 minutes) 2. Name of group 3. Current

More information

DEPLOYMENT GUIDE Version 1.0. Deploying the BIG-IP LTM with Apache Tomcat and Apache HTTP Server

DEPLOYMENT GUIDE Version 1.0. Deploying the BIG-IP LTM with Apache Tomcat and Apache HTTP Server DEPLOYMENT GUIDE Version 1.0 Deploying the BIG-IP LTM with Apache Tomcat and Apache HTTP Server Table of Contents Table of Contents Deploying the BIG-IP LTM with Tomcat application servers and Apache web

More information

Logentries Insights: The State of Log Management & Analytics for AWS

Logentries Insights: The State of Log Management & Analytics for AWS Logentries Insights: The State of Log Management & Analytics for AWS Trevor Parsons Ph.D Co-founder & Chief Scientist Logentries 1 1. Introduction The Log Management industry was traditionally driven by

More information

Product Review: James F. Koopmann Pine Horse, Inc. Quest Software s Foglight Performance Analysis for Oracle

Product Review: James F. Koopmann Pine Horse, Inc. Quest Software s Foglight Performance Analysis for Oracle Product Review: James F. Koopmann Pine Horse, Inc. Quest Software s Foglight Performance Analysis for Oracle Introduction I ve always been interested and intrigued by the processes DBAs use to monitor

More information

Intrusion Detection Systems

Intrusion Detection Systems Intrusion Detection Systems Assessment of the operation and usefulness of informatics tools for the detection of on-going computer attacks André Matos Luís Machado Work Topics 1. Definition 2. Characteristics

More information

Enhance Preprocessing Technique Distinct User Identification using Web Log Usage data

Enhance Preprocessing Technique Distinct User Identification using Web Log Usage data Enhance Preprocessing Technique Distinct User Identification using Web Log Usage data Sheetal A. Raiyani 1, Shailendra Jain 2 Dept. of CSE(SS),TIT,Bhopal 1, Dept. of CSE,TIT,Bhopal 2 sheetal.raiyani@gmail.com

More information

SANS Top 20 Critical Controls for Effective Cyber Defense

SANS Top 20 Critical Controls for Effective Cyber Defense WHITEPAPER SANS Top 20 Critical Controls for Cyber Defense SANS Top 20 Critical Controls for Effective Cyber Defense JANUARY 2014 SANS Top 20 Critical Controls for Effective Cyber Defense Summary In a

More information

Blacklist Example Configuration for StoneGate

Blacklist Example Configuration for StoneGate Blacklist Example Configuration for StoneGate 4.1 1 (8) Blacklist Example Configuration for StoneGate StoneGate versions: SMC 4.1.2, IPS 4.1.2, FW 3.0.8 Blacklist Example Configuration for StoneGate 4.1

More information

Database Marketing, Business Intelligence and Knowledge Discovery

Database Marketing, Business Intelligence and Knowledge Discovery Database Marketing, Business Intelligence and Knowledge Discovery Note: Using material from Tan / Steinbach / Kumar (2005) Introduction to Data Mining,, Addison Wesley; and Cios / Pedrycz / Swiniarski

More information

Guidelines for Web applications protection with dedicated Web Application Firewall

Guidelines for Web applications protection with dedicated Web Application Firewall Guidelines for Web applications protection with dedicated Web Application Firewall Prepared by: dr inŝ. Mariusz Stawowski, CISSP Bartosz Kryński, Imperva Certified Security Engineer INTRODUCTION Security

More information

Quick Start 5: Introducing and configuring Websense Cloud Web Security solution

Quick Start 5: Introducing and configuring Websense Cloud Web Security solution Quick Start 5: Introducing and configuring Websense Cloud Web Security solution Websense Support Webinar April 2013 TRITON STOPS MORE THREATS. WE CAN PROVE IT. 2013 Websense, Inc. Page 1 Presenter Greg

More information

Network Based Intrusion Detection Using Honey pot Deception

Network Based Intrusion Detection Using Honey pot Deception Network Based Intrusion Detection Using Honey pot Deception Dr.K.V.Kulhalli, S.R.Khot Department of Electronics and Communication Engineering D.Y.Patil College of Engg.& technology, Kolhapur,Maharashtra,India.

More information

DISCOVERY OF WEB-APPLICATION VULNERABILITIES USING FUZZING TECHNIQUES

DISCOVERY OF WEB-APPLICATION VULNERABILITIES USING FUZZING TECHNIQUES DISCOVERY OF WEB-APPLICATION VULNERABILITIES USING FUZZING TECHNIQUES By Michael Crouse Dr. Errin W. Fulp, Ph.D., Advisor Abstract The increasingly high volume of users on the web and their use of web

More information

Testing Web Applications for SQL Injection Sam Shober SamShober@Hotmail.com

Testing Web Applications for SQL Injection Sam Shober SamShober@Hotmail.com Testing Web Applications for SQL Injection Sam Shober SamShober@Hotmail.com Abstract: This paper discusses the SQL injection vulnerability, its impact on web applications, methods for pre-deployment and

More information

Intrusion Detection System (IDS)

Intrusion Detection System (IDS) Intrusion Detection System (IDS) Characteristics Systems User, Process predictable actions describing process under that actions what pattern subvert actions attack of correspond the systems processes

More information

SAS 9.3 Logging: Configuration and Programming Reference

SAS 9.3 Logging: Configuration and Programming Reference SAS 9.3 Logging: Configuration and Programming Reference SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2011. SAS 9.3 Logging: Configuration and

More information

Secure Authentication and Session. State Management for Web Services

Secure Authentication and Session. State Management for Web Services Lehman 0 Secure Authentication and Session State Management for Web Services Clay Lehman CSC 499: Honors Thesis Supervised by: Dr. R. Michael Young Lehman 1 1. Introduction Web services are a relatively

More information

locuz.com Big Data Services

locuz.com Big Data Services locuz.com Big Data Services Big Data At Locuz, we help the enterprise move from being a data-limited to a data-driven one, thereby enabling smarter, faster decisions that result in better business outcome.

More information

Features Overview Guide About new features in WhatsUp Gold v14

Features Overview Guide About new features in WhatsUp Gold v14 Features Overview Guide About new features in WhatsUp Gold v14 Contents New Features in Ipswitch WhatsUp Gold v14 Welcome to WhatsUp Gold v14!... 1 About the Welcome Center About the Quick Setup Assistant...

More information

IBM Unica emessage Version 8 Release 6 February 13, 2015. User's Guide

IBM Unica emessage Version 8 Release 6 February 13, 2015. User's Guide IBM Unica emessage Version 8 Release 6 February 13, 2015 User's Guide Note Before using this information and the product it supports, read the information in Notices on page 403. This edition applies to

More information

E-Commerce Security. The Client-Side Vulnerabilities. Securing the Data Transaction LECTURE 7 (SECURITY)

E-Commerce Security. The Client-Side Vulnerabilities. Securing the Data Transaction LECTURE 7 (SECURITY) E-Commerce Security An e-commerce security system has four fronts: LECTURE 7 (SECURITY) Web Client Security Data Transport Security Web Server Security Operating System Security A safe e-commerce system

More information

Fuzzy Network Profiling for Intrusion Detection

Fuzzy Network Profiling for Intrusion Detection Fuzzy Network Profiling for Intrusion Detection John E. Dickerson (jedicker@iastate.edu) and Julie A. Dickerson (julied@iastate.edu) Electrical and Computer Engineering Department Iowa State University

More information

Blackboard Open Source Monitoring

Blackboard Open Source Monitoring Blackboard Open Source Monitoring By Greg Lloyd Submitted to the Faculty of the School of Information Technology in Partial Fulfillment of the Requirements for the Degree of Bachelor of Science in Information

More information

Intrusion Detections Systems

Intrusion Detections Systems Intrusion Detections Systems 2009-03-04 Secure Computer Systems Poia Samoudi Asli Davor Sutic Contents Intrusion Detections Systems... 1 Contents... 2 Abstract... 2 Introduction... 3 IDS importance...

More information

Application Performance Testing Basics

Application Performance Testing Basics Application Performance Testing Basics ABSTRACT Todays the web is playing a critical role in all the business domains such as entertainment, finance, healthcare etc. It is much important to ensure hassle-free

More information

CTS2134 Introduction to Networking. Module 8.4 8.7 Network Security

CTS2134 Introduction to Networking. Module 8.4 8.7 Network Security CTS2134 Introduction to Networking Module 8.4 8.7 Network Security Switch Security: VLANs A virtual LAN (VLAN) is a logical grouping of computers based on a switch port. VLAN membership is configured by

More information

Portals and Hosted Files

Portals and Hosted Files 12 Portals and Hosted Files This chapter introduces Progress Rollbase Portals, portal pages, portal visitors setup and management, portal access control and login/authentication and recommended guidelines

More information

Tableau Server Security. Version 8.0

Tableau Server Security. Version 8.0 Version 8.0 Author: Marc Rueter Senior Director, Strategic Solutions, Tableau Software June 2013 p2 Today s enterprise class systems need to provide robust security in order to meet the varied and dynamic

More information

Thanks to SECNOLOGY s wide range and easy to use technology, it doesn t take long for clients to benefit from the vast range of functionality.

Thanks to SECNOLOGY s wide range and easy to use technology, it doesn t take long for clients to benefit from the vast range of functionality. The Big Data Mining Company BETTER VISILITY FOR BETTER CONTROL AND BETTER MANAGEMENT 100 Examples on customer use cases Thanks to SECNOLOGY s wide range and easy to use technology, it doesn t take long

More information

SAS 9.4 Logging. Configuration and Programming Reference Second Edition. SAS Documentation

SAS 9.4 Logging. Configuration and Programming Reference Second Edition. SAS Documentation SAS 9.4 Logging Configuration and Programming Reference Second Edition SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2014. SAS 9.4 Logging: Configuration

More information

THE ROLE OF IDS & ADS IN NETWORK SECURITY

THE ROLE OF IDS & ADS IN NETWORK SECURITY THE ROLE OF IDS & ADS IN NETWORK SECURITY The Role of IDS & ADS in Network Security When it comes to security, most networks today are like an egg: hard on the outside, gooey in the middle. Once a hacker

More information

Taxonomy of Intrusion Detection System

Taxonomy of Intrusion Detection System Taxonomy of Intrusion Detection System Monika Sharma, Sumit Sharma Abstract During the past years, security of computer networks has become main stream in most of everyone's lives. Nowadays as the use

More information