Novel Data Extraction Language for Structured Log Analysis

P.W.D.C. Jayathilake
99X Technology, Sri Lanka

ABSTRACT

This paper presents the implementation of a new log data extraction language. The theoretical formulation of the language schema was presented in a previous work of ours (Jayathilake, 2011). In the design of the new language we focus on specific problems encountered in automating log analysis, with emphasis on the structured nature of log files. A brief review of existing data format description mechanisms is also provided. After describing the implementation of the new language, we compare it with another popular data description language to highlight its unique capabilities.

KEYWORDS

Log data extraction, Data format description language, Log analysis, Declarative language

INTRODUCTION

Software log files contain information pertaining to most user and system actions within an organization. Regulations such as PCI DSS, FISMA and HIPAA, and frameworks like ISO 27001 and COBIT, emphasize standards on logging. If utilized properly, log data can generate huge value in various facets of a business. Log analysis has proven its potential in intrusion detection, unintended user activity identification, system compliance testing, software troubleshooting, software monitoring, performance benchmarking and functional testing. Despite its benefits, log analysis is a process that incurs a huge cost if the entire range of its phases is performed manually. The reasons are twofold: log analysis requires expertise, and significant time is consumed digging deep into loads of data and making inferences. Commercial tools in the market deliver a range of functionalities for automating certain stages of the analysis process.
These include log data collection from different sources, data indexing, searching, automatic identification of common log file constructs such as timestamps and IP addresses, customizable dashboards for data visualization, highlighting anomalies, automatic compliance checks, etc. All existing tools treat log data as unstructured information. Though the high entropy of log information justifies this practice, it imposes numerous limitations on automating log analysis. Lack of contextual correctness, for example, poses many challenges in creating semantics for inferring results automatically. Jayathilake (2011) published an initial version of a framework that creates a platform for structured log analysis. Its core constituent was a new procedural language designed to be used in every phase of automated log analysis. Though the language proved to be powerful in processing log data, we soon realized its inappropriateness for log data extraction. Jayathilake (2011) later published a specification for a declarative language for describing the format of any log file. The intention was to make the log file format declaration more readable and to pick information of interest from log files more easily. We proved the flexibility of the specification in expressing formats of different log file types such as line logs, highly structured logs and tabular logs. Furthermore, we verified that the specification is resilient to log file corruptions, which are a prominent problem in the domain.
This paper presents an implementation of that specification based on the Simple Declarative Language (SDL), an easy representation mechanism for data structures (Leuck, 2012). Java and .NET implementations of the language already exist, so a syntax expressed in a compliant format can be parsed easily. We formulate a syntax that facilitates easy expression of all language constructs. Since the syntax is compliant with SDL, we could use an existing SDL parser for lexical analysis. The interpretation stage is implemented according to a new algorithm, which uses recursion extensively. Inconsistent log data are handled through a hierarchical fault tolerance mechanism that allows users to select the level of recovery after a log file corruption is detected. Selective data extraction is supported to enable users to cherry-pick data from huge log files for further analysis. Supportive routines are added to reduce the effort in dealing with common log file constructs such as timestamps, IP addresses, port numbers and error codes. The output of the data extraction process is a tree that encodes semantic relationships between log entries. High expressiveness, simplicity, a short learning curve, readability and immunity to log file corruptions are the strengths that we identify in the language.

EXISTING DATA DESCRIPTION LANGUAGES

This section provides an overview of existing data description languages.

EAST - EAST is an ISO-standard data description language developed by the Consultative Committee for Space Data Systems (CCSDS, 2007). It provides a rich mechanism to express data formats completely and unambiguously. Data are regarded as a collection of data entities, and the EAST description is used to interpret and gain access to those entities. Its main design goals are strong data description capabilities, human readability, and computer interpretability.
One prominent problem with EAST is the lack of support for describing file structures where the position of one data entity needs to be determined at run time by examining fields in other entities.

DRB - The Data Request Broker is an open source Java application programming interface (GAEL, 2009). It is an expansion on EAST and can be used for reading, writing and processing heterogeneous data. DRB is a software abstraction layer that developers can utilize to program applications independently of the way data are encoded within files. It is also possible to perform calculations using XQuery from within the data description, allowing full description of files where the locations of data fields within a file must be calculated from other data fields. However, calculations must be described in XQuery, which can increase complexity and hence reduce human readability.

PADS/ML - This is a domain-specific language designed to improve the productivity of data analysts. It is a functional language to formally specify the logical and physical structure of data (Mandelbaum et al., 2007). In contrast to other data description languages, PADS/ML provides a platform where the description can also stand as sound documentation of the data. However, it does not offer a satisfactory level of support for describing semantic information.

DFDL - The Data Format Description Language is an open standard that arose from the need to represent text and binary data in various formats in a common XML paradigm (OGF, 2010). It also allows data to be taken from an XML model and written out to its native format. By having a data format described with a DFDL description, which is accessible to
multiple applications, one can provide a common interface to the data, thereby facilitating data interchange. DFDL does not inherently support semantic information but can be used in conjunction with ontologies for this purpose. One drawback is the verbose nature of DFDL, caused by XML metadata, which affects human readability.

HAWK - This is a powerful, flexible language for log file analysis. Its basis in pattern-action pairs allows flexible combination of programs. It provides support for a range of log file analysis functionality such as filtering, recoding, and counting, and supplies the processing power for analysing log files (HAWK, 2009).

BFD - The Binary Format Description language is an XML-based language for expressing binary data formats (National Collaboratories, 2003). It is an extension of the eXtensible Scientific Interchange Language (XSIL). A BFD template can be used to extract data from a set of files and put it into an XML document for further processing.

REQUIREMENT FOR A NEW LOG DATA EXTRACTION LANGUAGE

The above-mentioned languages are mostly generic data format description languages that target a wide range of applications. Log analysis is one niche where those tools can be utilized. However, log analysis, as a separate domain, exhibits unique characteristics and poses specific problems. For example, corrupted data is a prominent challenge facing any attempt to automate the analysis process. Huge amounts of data, inconsistent formats and frequent format changes further add to this. It is also vital to have a data description scheme that results in more human-readable templates compared to highly verbose XML solutions.

THE NEW LANGUAGE

In order to address these unique needs we designed a new log data extraction language based on a simple schema. Jayathilake (2011) discussed the theoretical formulation of this language along with case studies on its applications.
In summary, it is based on interpreting a log file as a hierarchy of units termed log entities. Three types of log entities are identified.

1. Type A - A sequence of other log entries defined by the pair ([LE_1, LE_2, ..., LE_N], ERROR_RECOVERY), where the LE_i are log entries. The sequence should be built with the log entries in the same order as specified inside the square brackets in the first element of the pair. ERROR_RECOVERY is a flag that indicates whether the system should try to recover from parse errors for this type of log entry.

2. Type B - A sequence of other log entries defined by the 4-tuple ({LE_1, LE_2, ..., LE_N}, MAX, MIN, ERROR_RECOVERY), where the LE_i are log entries. The sequence can be built from those log entries in any order, and each LE_i can appear in the sequence zero or more times. The list containing the LE_i is termed the candidate list for the sequence. MAX is the maximum number of log entries permitted in the sequence; if its value is -1, there is no limit on the length of the sequence. Similarly, MIN is the minimum number of log entries that should
be present in the sequence; -1 indicates that there is no lower bound on the length of the sequence. ERROR_RECOVERY is a flag with the same semantics as in the definition of Type A.

3. Type C - A singleton (k) where k is a fixed sequence of bytes.

The language also provides a mechanism to recover from corruptions in a log file. When a piece of text that does not follow the format in the description is detected, the interpreter has the ability to fall back to the next log entry and continue execution without premature termination.

IMPLEMENTATION

We implemented the language syntax in Simple Declarative Language (SDL), which provides infrastructure for describing arbitrary data formats. Below we explain the syntax for each of the three log entry types through examples.

1. Type A

Line typea Timestamp Process TID Area Category ER=true

This syntax defines a log entry named Line, which is built by a sequence of other log entries Timestamp, Process, TID, Area, and Category. Error recovery (ER) is set to true.

2. Type B

Gap typeb Space Tab Max=-1 Min=2 ER=false

This is a definition of a log entry named Gap, which stands for an empty space created by two or more spaces and tabs. Spaces and tabs can occur in any order and quantity. Error recovery (ER) is set to false.

3. Type C

Char typec a

The Type C log entry Char defined here stands for the character a.

The implementation of the language is shown in Fig. 1. It comprises two main components: a lexical analysis module (parser) and an interpreter module. The parser module processes the given format specification using SDL. This is possible since the new language syntax is compliant with SDL. Log file content is lexically analysed with respect to the pre-processed format specification. After that, the interpreter extracts the log file content and converts it to a proprietary binary format. This data format is ready to be processed by the log data analysis framework presented in Jayathilake (2011).
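To make the three entry types concrete, the definitions above can be modelled roughly as follows. This is a minimal Python sketch of our own; the class and field names (TypeA, TypeB, TypeC, error_recovery, etc.) are illustrative and not part of the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class TypeC:
    literal: str                   # a fixed character/byte sequence (k)

@dataclass
class TypeA:
    name: str
    sequence: List["Entry"]        # ordered sub-entries [LE_1, ..., LE_N]
    error_recovery: bool = True    # the ERROR_RECOVERY flag

@dataclass
class TypeB:
    name: str
    candidates: List["Entry"]      # unordered candidate list {LE_1, ..., LE_N}
    max_len: int = -1              # MAX; -1 means no upper bound
    min_len: int = -1              # MIN; -1 means no lower bound
    error_recovery: bool = False

Entry = Union[TypeA, TypeB, TypeC]

# The Gap example from the text: two or more spaces/tabs in any order.
gap = TypeB("Gap", candidates=[TypeC(" "), TypeC("\t")], min_len=2)
```

The Line and Char declarations from the syntax examples map onto TypeA and TypeC instances in the same way.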
A recursive algorithm is used to implement the interpreter module.
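The recursion can be sketched as follows. This is an illustrative Python sketch under our own naming (the match function and the type/literal/sequence/candidates/max/min dictionary keys are assumptions); the paper's actual implementation, including its hierarchical error recovery, is not reproduced here.

```python
def match(entry, text, pos):
    """Try to match `entry` against `text` starting at `pos`.
    Returns (tree_node, new_pos) on success, or (None, pos) on failure."""
    kind = entry["type"]
    if kind == "C":                       # fixed byte sequence
        lit = entry["literal"]
        if text.startswith(lit, pos):
            return {"entry": entry.get("name", lit), "text": lit}, pos + len(lit)
        return None, pos
    if kind == "A":                       # ordered sequence of sub-entries
        children, cur = [], pos
        for sub in entry["sequence"]:
            node, cur = match(sub, text, cur)
            if node is None:
                return None, pos          # the whole sequence fails together
            children.append(node)
        return {"entry": entry["name"], "children": children}, cur
    if kind == "B":                       # unordered candidates with MIN/MAX bounds
        children, cur = [], pos
        while entry["max"] == -1 or len(children) < entry["max"]:
            for cand in entry["candidates"]:
                node, nxt = match(cand, text, cur)
                if node is not None:
                    children.append(node)
                    cur = nxt
                    break
            else:
                break                     # no candidate matched at this position
        if entry["min"] != -1 and len(children) < entry["min"]:
            return None, pos
        return {"entry": entry["name"], "children": children}, cur
    raise ValueError("unknown entry type: " + kind)

# The Gap example (Type B) nested inside an ordered sequence (Type A).
space = {"type": "C", "literal": " "}
tab   = {"type": "C", "literal": "\t"}
word  = {"type": "C", "literal": "word"}
gap   = {"type": "B", "name": "Gap", "candidates": [space, tab],
         "max": -1, "min": 2}
line  = {"type": "A", "name": "Line", "sequence": [gap, word]}

node, end = match(gap, " \t word", 0)     # Gap consumes the leading whitespace
node2, end2 = match(line, " \t word", 0)  # Gap, then the literal "word"
```

The output of a successful match is a tree of nested nodes, mirroring the paper's description of the extraction result as a tree that encodes relationships between log entries.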
Figure 1: Implementation of the new language

COMPARISON WITH DFDL

In this section we compare the new language with DFDL, another promising technology for expressing file formats. Like any other XML-based schema, DFDL incurs significant metadata overhead. Log entry formats expressed in DFDL are much more verbose than their expressions in our schema.

Figure 2: Comparison between our schema and DFDL. Our schema (e.g. "Line typea Timestamp Process TID Area Category ER=true"): less verbose; resilient to log corruptions; optimized for log file formats. DFDL: a lot of metadata; unable to handle log corruptions; offers a powerful type system.
Fig. 2 provides a comparison between the expressions of one log entry in our schema and in DFDL. The new schema results in more compact and readable format expressions. Since the new language is specifically designed for log file formats, in contrast to DFDL, which is a generic format expression mechanism, the new language offers a few other benefits too. One prominent advantage is its ability to deal with log corruptions. On the other hand, DFDL provides a rich type system in which most common data types are natively identified.

CONCLUSION

The new log data extraction language has the capability to express a wide range of log file formats while offering a simple, human-readable syntax. Its hierarchical interpretation of log entries enables it to capture difficult log formats containing many peculiarities on which many other existing data format description languages fail. The schema has been proven to work with many industrial log file types such as line logs, message logs, XML logs and tabular logs. A prominent feature of the new language is its ability to deal with inconsistencies and corruptions in log files. It strengthens the automated log analysis mechanism with the ability to use as much correct data as possible. The Simple Declarative Language provided a useful platform when implementing the language syntax. The current implementation of the language supports only text log files, which constitutes a limitation; it can be enhanced to handle binary logs too. A further improvement would be adding the capability to handle log file formats where the location of one log entry must be read dynamically from another log entry.

REFERENCES

Jayathilake, D. (2011) A mind map based framework for automated software log file analysis, Proceedings of the International Conference on Software and Computer Applications (ICSCA 2011), pp. 1-6.

Jayathilake, D.
(2011) A novel mind map based approach for log data extraction, Proceedings of the 6th IEEE International Conference on Industrial and Information Systems (ICIIS 2011), pp. 130-135.

Andrews, J. H. (1998) Testing using log file analysis: tools, methods and issues, Proceedings of the 13th IEEE International Conference on Automated Software Engineering, pp. 157-166.

Valdman, J. (2001) Log file analysis, Department of Computer Science and Engineering (FAV UWB), Tech. Rep. DCSE/TR-2001-04.

Consultative Committee for Space Data Systems (2007) The Data Description Language EAST Specification. [pdf] Available at: <http://standards.gsfc.nasa.gov/reviews/ccsds/ccsds-644.0-p-2.1/ccsds-644.0-p-2.1.pdf> [Accessed 05 May 2012].

GAEL Consultant (2009) Data Request Broker. [online] Available at: <http://www.gael.fr/drb> [Accessed 05 May 2012].

Mandelbaum, Y., Fisher, K., Walker, D., Fernandez, M. and Gleyzer, A. (2007) PADS/ML: A Functional Data Description Language, Proceedings of the 34th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL 07), pp. 77-83.
OGF Data Format Description Language Working Group (2010) Data Format Description Language (DFDL) v1.0 Core Specification. [pdf] Available at: <http://www.ogf.org/public_comment_docs/documents/2010-03/draft-gwdrp-dfdl-corev1.0.pdf> [Accessed 05 May 2012].

HAWK Network Defense (2009) The Future: Dynamic Log Analysis. [pdf] Available at: <http://www.cleartechnologies.net/wp-content/uploads/2011/08/dynamic-log-analysis-Whitepaper4.pdf> [Accessed 05 May 2012].

National Collaboratories (2003) Binary Format Description (BFD) Language. [online] Available at: <http://collaboratory.emsl.pnl.gov/sam/bfd> [Accessed 05 May 2012].

Leuck, D. (2012) Simple Declarative Language. [online] Available at: <http://107.20.201.134/display/sdl/home> [Accessed 05 May 2012].