Novel Data Extraction Language for Structured Log Analysis



P.W.D.C. Jayathilake
99X Technology, Sri Lanka

ABSTRACT

This paper presents the implementation of a new log data extraction language. The theoretical formulation of the language schema was presented in a previous work of ours (Jayathilake, 2011). The design of the language focuses on specific problems encountered in automating log analysis, with emphasis on the structured nature of log files. A brief review of existing data format description mechanisms is also provided. After describing the implementation of the new language, we compare it with another popular data description language to highlight its unique capabilities.

KEYWORDS

Log data extraction, Data format description language, Log analysis, Declarative language

INTRODUCTION

Software log files contain information pertaining to most user and system actions within an organization. Regulations such as PCI DSS, FISMA and HIPAA, and frameworks such as ISO 27001 and COBIT, prescribe standards for logging. If utilized properly, log data can generate enormous value in various facets of a business. Log analysis has proven its potential in intrusion detection, identification of unintended user activity, system compliance testing, software troubleshooting, software monitoring, performance benchmarking and functional testing.

Despite its benefits, log analysis incurs a huge cost if the entire range of its phases is performed manually. The reasons are twofold: log analysis requires expertise, and significant time is consumed digging through large volumes of data and drawing inferences. Commercial tools on the market deliver a range of functionality for automating certain stages of the analysis process, including log data collection from different sources, data indexing, searching, automatic identification of common log file constructs such as timestamps and IP addresses, customizable dashboards for data visualization, highlighting of anomalies, and automatic compliance checks.

All existing tools, however, treat log data as unstructured information. Though the high entropy of log information justifies this practice, it imposes numerous limitations on automating log analysis. Lack of contextual correctness, for example, poses many challenges in creating semantics for inferring results automatically. Jayathilake (2011) published an initial version of a framework that creates a platform for structured log analysis. Its core constituent was a new procedural language designed to be used in every phase of automated log analysis. Though the language proved to be powerful in processing log data, we soon realized that it was ill-suited to log data extraction. Jayathilake (2011) later published a specification for a declarative language for describing the format of any log file. The intention was to make log file format declarations more readable and to make it easier to pick information of interest from log files. We demonstrated the flexibility of the specification in expressing formats of different log file types such as line logs, highly structured logs and tabular logs. Furthermore, we verified that the specification is resilient to log file corruption, which is a prominent problem in the domain.

This paper presents an implementation of that specification based on the Simple Declarative Language. The Simple Declarative Language (SDL) is a lightweight representation mechanism for data structures (Leuck, 2012). Java and .NET implementations of the language already exist, so a description expressed in a compliant format can be parsed easily. We formulate a syntax that facilitates easy expression of all language constructs. Since the syntax is compliant with SDL, we could use an existing SDL parser for lexical analysis. The interpretation stage is implemented with a new algorithm that uses recursion extensively. Inconsistent log data are handled through a hierarchical fault tolerance mechanism that allows users to select the level of recovery after a log file corruption is detected. Selective data extraction is supported so that users can cherry-pick data from huge log files for further analysis. Supportive routines are provided to reduce the effort of dealing with common log file constructs such as timestamps, IP addresses, port numbers and error codes. The output of the data extraction process is a tree that captures semantic relationships between log entries. High expressiveness, simplicity, a short learning curve, readability and immunity to log file corruption are the strengths we identify in the language.
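As background for what follows, SDL models data as named tags that can carry values, key=value attributes and nested child tags. The fragment below is our own illustration of that general shape (the tag names are invented for this example), not an excerpt from the paper or the SDL documentation:

    source "app-server" port=514 active=true {
        format "line"
        retention 30
    }

An SDL-compliant parser turns such text into a tree of tags, which is precisely the infrastructure our format descriptions reuse.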

EXISTING DATA DESCRIPTION LANGUAGES

This section provides an overview of existing data description languages.

EAST - EAST is an ISO-standard data description language developed by the Consultative Committee for Space Data Systems (CCSDS, 2007). It provides a rich mechanism for expressing a data format completely and unambiguously. Data are regarded as a collection of data entities, and the EAST description is used to interpret and gain access to those entities. Its main design goals are strong data description capabilities, human readability and computer interpretability. One prominent problem with EAST is its lack of support for describing file structures in which the position of one data entity must be determined at run time by examining fields in other entities.

DRB - The Data Request Broker is an open source Java application programming interface (GAEL, 2009) that expands on EAST. It can be used for reading, writing and processing heterogeneous data. DRB is a software abstraction layer that developers can use to program applications independently of the way data are encoded within files. It is also possible to perform calculations with XQuery from within the data description, allowing full description of files in which the locations of data fields must be calculated from other data fields. However, the calculations must be written in XQuery, which can increase complexity and hence reduce human readability.

PADS/ML - This is a domain-specific language designed to improve the productivity of data analysts. It is a functional language for formally specifying the logical and physical structure of data (Mandelbaum et al., 2007). In contrast to other data description languages, PADS/ML provides a platform where the description can also stand as sound documentation of the data. However, it does not offer a satisfactory level of support for describing semantic information.

DFDL - The Data Format Description Language is an open standard that arose from the need to represent text and binary data of various formats in a common XML paradigm (OGF, 2010). It also allows data to be taken from an XML model and written out in its native format. By having a data format described with a DFDL description that is accessible to multiple applications, one can provide a common interface to the data, thereby facilitating data interchange. DFDL does not inherently support semantic information but can be used in conjunction with ontologies for this purpose. One drawback is the verbosity of DFDL: the XML metadata it requires affects human readability.

HAWK - This is a powerful, flexible language for log file analysis that relies on simple analysis methods. Its basis in pattern-action pairs allows programs to be combined flexibly. It supports a range of log file analysis functionality such as filtering, recoding and counting, and provides the processing power for analysing log files (HAWK, 2009).

BFD - The Binary Format Description language is an XML-based language for expressing binary data formats (National Collaboratories, 2003). It is an extension of the eXtensible Scientific Interchange Language (XSIL). A BFD template can be used to extract data from a set of files and put it into XML for further processing.

REQUIREMENT FOR A NEW LOG DATA EXTRACTION LANGUAGE

The languages above are mostly generic data format description languages designed for a wide range of applications, and log analysis is one niche where they can be utilized. However, log analysis as a domain of its own exhibits unique characteristics and poses specific problems. For example, corrupted data is a prominent challenge for any attempt to automate the analysis process; huge volumes of data, inconsistent formats and frequent format changes add to it. It is also vital to have a data description scheme that yields more human-readable templates than highly verbose XML solutions.

THE NEW LANGUAGE

To address these unique needs we designed a new log data extraction language based on a simple schema. Jayathilake (2011) discussed the theoretical formulation of this language along with case studies of its applications. In summary, the language interprets a log file as a hierarchy of units termed log entities. Three types of log entities are identified.

1. Type A - A sequence of other log entries, defined by the pair ([LE_1, LE_2, ..., LE_N], ERROR_RECOVERY), where the LE_i are log entries. The sequence must contain the log entries in the same order as they are specified inside the square brackets. ERROR_RECOVERY is a flag that indicates whether the system should try to recover from parse errors for this type of log entry.

2. Type B - A sequence of other log entries, defined by the 4-tuple ({LE_1, LE_2, ..., LE_N}, MAX, MIN, ERROR_RECOVERY), where the LE_i are log entries. The sequence may contain these log entries in any order, and each LE_i can appear zero or more times. The list containing the LE_i is termed the candidate list for the sequence. MAX is the maximum number of log entries permitted in the sequence; a value of -1 means there is no upper bound on the length of the sequence. Similarly, MIN is the minimum number of log entries that must be present in the sequence; -1 indicates that there is no lower bound. ERROR_RECOVERY is a flag with the same semantics as in the definition of Type A.

3. Type C - A singleton (k), where k is a fixed sequence of bytes.
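As a concrete illustration of the model (ours, with invented entry names), consider a log line consisting of a timestamp, a stretch of whitespace and a message. Space and Tab are Type C singletons for the corresponding characters; Gap is the Type B entry ({Space, Tab}, -1, 1, false), i.e. one or more spaces and tabs in any mixture; and Line is the Type A entry ([Timestamp, Gap, Message], true), which requires the three parts in exactly that order and attempts error recovery when a part fails to parse.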

The language also provides a mechanism for recovering from corruptions in a log file. When a stretch of text that does not follow the format in the description is detected, the interpreter can fall back to the next log entry and continue execution without premature termination.

IMPLEMENTATION

We implemented the language syntax in the Simple Declarative Language (SDL), which provides infrastructure for describing arbitrary data formats. Below we explain the syntax for each of the three log entry types through examples.

1. Type A

Line typea Timestamp Process TID Area Category ER=true

This defines a log entry named Line, which is built from a sequence of the log entries Timestamp, Process, TID, Area and Category. Error recovery (ER) is set to true.

2. Type B

Gap typeb Space Tab Max=-1 Min=2 ER=false

This defines a log entry named Gap, which stands for the empty space created by two or more spaces and tabs. Spaces and tabs can occur in any order and quantity. Error recovery (ER) is set to false.

3. Type C

Char typec a

The Type C log entry Char defined here stands for the character a.

The implementation of the language is shown in Fig. 1. It consists of two main components: a lexical analysis module (parser) and an interpreter module. The parser module processes the given format specification using SDL; this is possible because the new language syntax is compliant with SDL. The log file content is then lexically analysed against the pre-processed format specification. Finally, the interpreter extracts the log file content and converts it to a proprietary binary format, ready to be processed by the log data analysis framework presented in Jayathilake (2011). A recursive algorithm is used to implement the interpreter module.
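To show how the three constructs compose, here is a sketch of a complete description for a hypothetical line-oriented log. All entry names are invented for this illustration; Timestamp, Severity and Message stand for entries that would be defined elsewhere (the language's supportive routines cover common constructs such as timestamps):

Log typeb Line Max=-1 Min=-1 ER=true
Line typea Timestamp Gap Severity Gap Message ER=true
Gap typeb Space Tab Max=-1 Min=1 ER=false

Here the top-level Log entry accepts any number of Line entries, each Line is an ordered sequence, and Gap reuses the Type B form to absorb runs of whitespace (Space and Tab being Type C singletons for the respective characters).

The paper does not publish the recursive interpretation algorithm itself, but its shape can be inferred from the three entry types. The following Java sketch (all class and method names invented; the error-recovery step is a crude stand-in for the hierarchical recovery described above) shows one way such a recursive matcher could be organized; it is an illustration of the idea, not the actual implementation:

import java.util.ArrayList;
import java.util.List;

abstract class LogEntry {
    final boolean errorRecovery;   // the ER flag from the format description
    LogEntry(boolean er) { this.errorRecovery = er; }
    // Try to match this entry at 'pos'; return null on failure.
    abstract Match match(byte[] in, int pos);
}

final class Match {
    final int end;                 // index just past the matched region
    final List<Match> children;    // sub-matches form the output tree
    Match(int end, List<Match> children) { this.end = end; this.children = children; }
}

// Type C: a fixed byte sequence.
final class TypeC extends LogEntry {
    private final byte[] lit;
    TypeC(byte[] lit) { super(false); this.lit = lit; }
    Match match(byte[] in, int pos) {
        if (pos + lit.length > in.length) return null;
        for (int i = 0; i < lit.length; i++)
            if (in[pos + i] != lit[i]) return null;
        return new Match(pos + lit.length, List.of());
    }
}

// Type A: an ordered sequence of entries.
final class TypeA extends LogEntry {
    private final List<LogEntry> parts;
    TypeA(List<LogEntry> parts, boolean er) { super(er); this.parts = parts; }
    Match match(byte[] in, int pos) {
        List<Match> kids = new ArrayList<>();
        for (LogEntry part : parts) {
            Match m = part.match(in, pos);
            // Stand-in for hierarchical recovery: on a parse error,
            // skip one byte and retry the same part once.
            if (m == null && errorRecovery && pos + 1 < in.length)
                m = part.match(in, pos + 1);
            if (m == null) return null;
            kids.add(m);
            pos = m.end;
        }
        return new Match(pos, kids);
    }
}

// Type B: candidates in any order, between MIN and MAX occurrences.
final class TypeB extends LogEntry {
    private final List<LogEntry> candidates;
    private final int max, min;    // -1 means unbounded
    TypeB(List<LogEntry> candidates, int max, int min, boolean er) {
        super(er); this.candidates = candidates; this.max = max; this.min = min;
    }
    Match match(byte[] in, int pos) {
        List<Match> kids = new ArrayList<>();
        while (max == -1 || kids.size() < max) {
            Match m = null;
            for (LogEntry c : candidates) {          // first candidate that fits
                m = c.match(in, pos);
                if (m != null) break;
            }
            if (m == null || m.end == pos) break;    // no progress: stop
            kids.add(m);
            pos = m.end;
        }
        return (min != -1 && kids.size() < min) ? null : new Match(pos, kids);
    }
}

Under these assumptions, the Gap example above would correspond to new TypeB(List.of(space, tab), -1, 2, false), and the resulting Match tree is the analogue of the semantic tree that the extraction process outputs.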

Figure 1: Implementation of the new language

COMPARISON WITH DFDL

In this section we compare the new language with DFDL, another promising technology for expressing file formats. Like any other XML-based schema, DFDL incurs significant metadata overhead, and log entry formats expressed in DFDL are far more verbose than their expressions in our schema.

Our schema                        DFDL
Less verbose                      Significant metadata overhead
Resilient to log corruption       Unable to handle log corruption
Optimized for log file formats    Offers a powerful type system

Sample expression in our schema: Line typea Timestamp Process TID Area Category ER=true

Figure 2: Comparison between our schema and DFDL

Fig. 2 compares the expression of one log entry in our schema and in DFDL. The new schema yields more compact and readable format expressions. Because the new language is designed specifically for log file formats, in contrast to DFDL, which is a generic format description mechanism, it offers a few other benefits too. One prominent advantage is its ability to deal with log corruption. On the other hand, DFDL provides a rich type system in which most common data types are natively identified.

CONCLUSION

The new log data extraction language can express a wide range of log file formats while offering a simple, human-readable syntax. Its hierarchical interpretation of log entries enables it to capture difficult log formats, full of peculiarities, that many other existing data format description languages fail on. The schema has been proven to work with many industrial log file types such as line logs, message logs, XML logs and tabular logs. A prominent feature of the new language is its ability to deal with inconsistencies and corruptions in log files, which strengthens automated log analysis by allowing it to use as much correct data as possible. The Simple Declarative Language provided a useful platform for implementing the language syntax. The current implementation supports only text log files, which is a limitation; it can be enhanced to handle binary logs too. A further improvement would be the capability to handle log file formats where the location of one log entry must be read dynamically from another log entry.

REFERENCES

Jayathilake, D. (2011) A mind map based framework for automated software log file analysis, Proceedings of the International Conference on Software and Computer Applications (ICSCA 2011), pp. 1-6.

Jayathilake, D. (2011) A novel mind map based approach for log data extraction, Proceedings of the 6th IEEE International Conference on Industrial and Information Systems (ICIIS 2011), pp. 130-135.

Andrews, J. H. (1998) Testing using log file analysis: tools, methods and issues, Proceedings of the 13th IEEE International Conference on Automated Software Engineering, pp. 157-166.

Valdman, J. (2001) Log file analysis, Department of Computer Science and Engineering (FAV UWB), Tech. Rep. DCSE/TR-2001-04.

Consultative Committee for Space Data Systems, 2007. The Data Description Language EAST Specification. [pdf] Available at: <http://standards.gsfc.nasa.gov/reviews/ccsds/ccsds-644.0-p-2.1/ccsds-644.0-p-2.1.pdf> [Accessed 05 May 2012].

GAEL Consultant, 2009. Data Request Broker. [online] Available at: <http://www.gael.fr/drb> [Accessed 05 May 2012].

Mandelbaum, Y., Fisher, K., Walker, D., Fernandez, M. and Gleyzer, A. (2007) PADS/ML: A Functional Data Description Language, Proceedings of the 34th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL 07), pp. 77-83.

OGF Data Format Description Language Working Group, 2010. Data Format Description Language (DFDL) v1.0 Core Specification. [pdf] Available at: <http://www.ogf.org/public_comment_docs/documents/2010-03/draft-gwdrp-dfdl-corev1.0.pdf> [Accessed 05 May 2012].

HAWK Network Defense, 2009. The Future: Dynamic Log Analysis. [pdf] Available at: <http://www.cleartechnologies.net/wp-content/uploads/2011/08/dynamic-log-analysis-Whitepaper4.pdf> [Accessed 05 May 2012].

National Collaboratories, 2003. Binary Format Description (BFD) Language. [online] Available at: <http://collaboratory.emsl.pnl.gov/sam/bfd> [Accessed 05 May 2012].

Leuck, D., 2012. Simple Declarative Language. [online] Available at: <http://107.20.201.134/display/sdl/home> [Accessed 05 May 2012].