Geo-Identification of Web Users through Logs using ELK Stack

Tarun Prakash, Amity University, Uttar Pradesh. Email: tarun_prakash@outlook.com
Ms. Misha Kakkar, Assistant Professor, Amity University, Uttar Pradesh. Email: mkakkar@amity.edu
Kritika Patel, Amity University, Uttar Pradesh. Email: kritika.abes@gmail.com

Abstract

As Internet penetration rises, huge volumes of log files are generated, and these logs contain hidden information of enormous business value. A log management system helps unlock these hidden returns and supports business decision making. Although many log management systems exist, they either fail to scale or are costly. Here efforts have been made to address the shortcomings of prevailing log analyzer tools, and this paper demonstrates the working of the ELK ecosystem, i.e. Elasticsearch, Logstash and Kibana, clubbed together to analyze log files efficiently and provide interactive, easily understandable insights. Log management systems built on the ELK stack are designed to analyze large log data sets while making the whole computation process easy to monitor through an interactive interface. Coming from the open source community, the ELK stack has many useful features for log analysis: Elasticsearch serves as the indexing, storage and retrieval engine; Logstash acts as the log input slicer, dicer and output writer; and Kibana performs data visualization using dashboards. By implementing the ELK ecosystem we efficiently geo-identify website user traffic using logs.

Keywords: Elasticsearch, Logstash, Kibana, Geo-identification, log analysis

I. INTRODUCTION

A log management system deals with large volumes of computer-generated log records (also called audit records, audit trails, event logs, etc.). The process involves log collection, centralized aggregation, long-term retention, log analysis, log searching and reporting. A log typically consists of a stream of messages in time sequence.
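Such time-sequenced log records are plain text lines that must be decoded into structured fields before analysis. As a minimal sketch, the following Python snippet parses a web-server access-log line of the kind examined later in this paper; the regular expression and field names are illustrative, not part of the system described here:

```python
import re

# Illustrative pattern for a Common Log Format access-log line.
LOG_PATTERN = re.compile(
    r'(?P<clientip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_log_line(line):
    """Decode one access-log line into a dict of named fields."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

line = ('74.125.44.145 - - [11/Mar/2015:00:07:32 +0100] '
        '"GET /feed HTTP/1.1" 304 0')
record = parse_log_line(line)
print(record["clientip"], record["method"], record["status"])
```

Once decoded this way, each line becomes a record with comparable fields (client IP, method, status, and so on) rather than an opaque string.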
Logs are either stored on disk or supplied as a data stream to a log collector. Logs are generated in order to help debug the operations of an application. The semantics and syntax of the data in log messages are usually application- or system-specific, so log analysis systems must decode messages within the context of an application or system in order to make meaningful comparisons between messages in different log files. Every computing device generates logs, and even if only web server logs are considered, they amount to a very large data set. The analysis of these logs helps companies not only in their decision making but also in improving their products and services. However, most small-scale companies cannot afford these log management systems. Another problem is the usability of log management products, as most of the available products require technical expertise in order to process and interpret the log data. In a nutshell, the three critical problem components for log analysis are volume, cost and usability.

Elasticsearch provides indexing and searching capability. It makes search and list information accessible RESTfully as JSON over HTTP and can easily scale. It is under the Apache 2.0 license and is built on top of Apache's Lucene project. Elasticsearch is a content-indexing search engine. The best illustration of Elasticsearch is the index pages of a book: we flip to the back of the book, look up a word, and then find the referenced page. This means that instead of scanning content strings directly, Elasticsearch builds an index of the content and searches the index rather than the content; this is why it is fast. Logstash is a tool for managing events and logs. It is used to collect logs, parse them, and store them for later use (for example, for searching). If we store them in Elasticsearch, we can view and analyze them with Kibana.
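The book-index analogy above can be made concrete: an inverted index maps each term to the documents containing it, so a lookup touches only the index instead of scanning every document. The following is a toy sketch of the idea, not Elasticsearch's actual implementation (which builds on Apache Lucene):

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, term):
    """Look up a term in the index instead of scanning every document."""
    return sorted(index.get(term.lower(), set()))

docs = {
    1: "error connecting to database",
    2: "user logged in from new device",
    3: "database connection restored",
}
index = build_index(docs)
print(search(index, "database"))   # prints [1, 3]
```

The index is built once, at write time; every subsequent query is a dictionary lookup, which is why index-based search stays fast as the document set grows.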
The Kibana web interface is a customizable dashboard that we can extend and adjust to suit our environment. It permits the creation of tables and charts as well as complex visualizations. The Kibana web interface uses the Apache Lucene query syntax to let us build queries. We can search any of the fields contained in a Logstash event, for instance message, syslog_program, and so on, and we can use Boolean logic with AND, OR, and NOT.

Fig 1: ELK Ecosystem

This paper demonstrates the application of the ELK stack to geo-identify web users through logs. As the ELK stack is an open source technology, it does not involve paying high licensing fees.

978-1-4673-8203-8/16/$31.00 © 2016 IEEE
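The Lucene query syntax mentioned above can be illustrated with a few hypothetical Kibana search-bar queries; the field names (syslog_program, status, clientip, message) and values here are illustrative examples, not taken from the experiment in this paper:

```
syslog_program:sshd
status:404 AND NOT clientip:10.0.0.1
message:"connection refused" OR status:500
```

The first query matches a single field value; the other two combine field conditions with the Boolean operators AND, OR and NOT.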
II. RELATED WORK

Various development efforts have been under way over the last decade to build log analytics systems. Artyom [1] did a comprehensive analysis of popular open source log management tools such as Graylog2, ELK and ELSA. His study was based on performance testing, which showed that ELSA was the fastest, able to handle 14000 logs per second, but that ELK scored best on the usability factor. Christopher [2] in his blog analyzed Yahoo's historical stock data [3] using the ELK stack and built a dashboard showing the volume of stocks and the average high and average low. Aarvik [4] built a real-time data analytics tool and tested it on logs from his Debian operating system. Tanmay [5] integrated Apache log data with GeoCity [6] data, which transforms IP addresses into exact locations with attributes like longitude, latitude, city, state and country. William [7] in his blog post used a Docker [8] container to deploy the ELK stack for infrastructure monitoring. Jettro [9] in his project post scanned a directory structure of jpg files and extracted metadata from the images; utilizing the ELK stack, he built a dashboard showing details such as aperture, focal length and ISO. Bogdan [10] built a video card monitoring solution using the ELK stack to obtain performance information while gaming. In a realpython [11] blog, a real-time Twitter sentiment analysis dashboard utilizing Elasticsearch and Kibana was built. Table 1 below shows a comparative study of Graylog2, ELK and ELSA.

Table 1: Feature Overview of Graylog2, ELK, ELSA [1]

III. METHODS AND METHODOLOGIES

ELK Ecosystem Working Overview:
Step 1: The Logstash-forwarder reads the user's local log files and sends them, encrypted, to the Logstash server (port 5000), as specified in the Logstash configuration file.
Step 2: The Logstash server receives the logs and processes the different queues to store the data in local folders.
Step 3: Elasticsearch performs different operations (e.g. optimizing the data structures, creating search indexes, grouping information).
Step 4: Kibana reads the Logstash data structures and presents them to the users using custom layouts, dashboards and filters.

Fig 2: ELK Overview

In this study a centralized ELK server was set up, and Logstash-forwarder is used to send the logs to the ELK server. There are two classes of Logstash hosts: hosts running Logstash as an event forwarder, which send application, service and host logs to a central Logstash server; and central Logstash hosts running some blend of storage, indexer, archiver, search and web interface software, which receive, process and store logs.

A basic configuration file for Logstash has three sections:
1. Input
2. Filter
3. Output

Through inputs we pass log information to Logstash. Some useful inputs:
a. File: reads from a file on the file system, much like the UNIX command "tail -f".
b. Syslog: listens on port 514 for syslog messages and parses them according to the RFC3164 format.
c. Lumberjack: receives events sent in the Lumberjack protocol, generally known as logstash-forwarder.

Filters are the workhorses for processing inputs in the Logstash chain. They are regularly combined with conditionals in order to perform a certain action on an event if it matches particular criteria. Some useful filters:
a. Grok: parses arbitrary text and structures it. Grok is currently the best way in Logstash to parse unstructured log data into something structured and queryable. With over 120 patterns shipped built-in to Logstash, it is more than likely we will find one that meets our needs.
b. Mutate: performs general mutations on fields; we can rename, remove, replace and modify fields as required.
c. Geoip: adds information about the geographical location of IP addresses (and enables striking maps in Kibana).

Outputs are the last stage of the Logstash pipeline. Some useful outputs:
a. Elasticsearch: the best way to store data in an efficient and easily queryable manner.
b. File: writes event data to a file on disk.
c. Statsd: a service that "listens for statistics, like counters and timers, sent over UDP and sends aggregates to one or more pluggable backend services".

2016 6th International Conference - Cloud System and Big Data Engineering (Confluence)

IV. EXPERIMENTAL SET UP

For the purpose of this study, Elasticsearch 1.6.0, Logstash 1.5.1 and Kibana 4.1.0 were used.

Log File Description:
74.125.44.145 - - [11/Mar/2015:00:07:32 +0100] "GET /feed HTTP/1.1" 304 0 "" "FeedBurner/1.0 (http://www.feedburner.com)"
HSI-KBW-134-3-54-231.hsi14.kabel-badenwuerttemberg.de - - [11/Mar/2015:00:13:12 +0100] "GET /favicon.ico HTTP/1.1" 200 894 "" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0"

Username: identifies the visitor, through the IP address.
Visiting path: the path traversed, or URL hit directly or indirectly through links, by the user.
Path traversed: the path traversed within the website.
Time stamp: the amount of time spent on the web site, also termed a session.
Page last visited: the page last visited by the user.
Success rate: the number of downloads made by the user; any purchase made, of software or anything else, adds to the success.
User agent: the browser used on the client side.
URL: the requested resource accessed by the user, such as an HTML page.
Request type: the type of method used, such as GET or POST.

V. EXPERIMENTATION

Access logs were loaded into Elasticsearch using Logstash, and reports were then created in Kibana using the same index.

Loading log data into Elasticsearch using Logstash:

Step 1. Load the log data into a text file. Each log line looks similar to Fig 3, storing information like the caller IP address, the target resource, etc.

Fig 3: Sample log file

Logstash configuration: to make Logstash work we need three things: an input definition, from where the data comes; a filter definition, describing how our input data will be processed; and an output, where our processed data will be sent.

Fig 4: Logstash Configuration

Here, in the input section, the path specifies the location of the data file, the type is our name for the data, and the start position is set to the beginning, meaning the whole file is processed from the start. In the filter section we process the access_log type with two filters, grok and geoip. The grok filter is responsible for parsing the access log: it takes the unstructured log from the defined input and gives each log line a format. The second filter is geoip, which adds geo information to our data. For the geoip filter we are required to specify the source property, which names the field holding the IP address we are interested in; in our case it is the clientip field generated by the grok filter. In the output section, the host, port and protocol are specified. Next, the index template is defined using the index property, which in our case results in indices like logs_2015.03.21, logs_2015.03.22.
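A minimal Logstash configuration file consistent with the description above might look as follows; this is an illustrative sketch rather than the exact file used in the study, and the log path, host and index prefix are assumed values (the option names follow the Logstash 1.5.x syntax):

```conf
input {
  file {
    path => "/var/log/apache2/access.log"   # location of the log data file
    type => "access_log"                    # our name for this data
    start_position => "beginning"           # process the whole file from the start
  }
}

filter {
  if [type] == "access_log" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }  # structure each log line
    }
    geoip {
      source => "clientip"   # field (added by grok) holding the IP address
    }
  }
}

output {
  elasticsearch {
    host => "localhost"
    port => 9200
    protocol => "http"
    index => "logs_%{+YYYY.MM.dd}"   # daily indices such as logs_2015.03.21
  }
}
```

The date expression in the index option is what produces one index per day, matching the [logs_]YYYY.MM.DD pattern configured later in Kibana.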
Step 2. Start the Elasticsearch and Logstash cluster from the command prompt and start the Kibana service. To configure Kibana, select "Use event times to create index names", set the index pattern interval to match our indices (daily, in our case), and set the index name or pattern to the format [logs_]YYYY.MM.DD. We also have to choose the time-based field @timestamp and then hit Create. Next we have to select the aggregation type: as we intend to find the geo location, we choose the Geo coordinates aggregation, which is based on the IP address information we configured Logstash to extract and add. The resulting geo map shows red and yellow dots based on the number of hits from the IP addresses of the coloured regions:

Fig 5: GeoMap with Users

VI. RESULTS AND DISCUSSIONS

ELK as a log management system is very useful in terms of usability, as anyone can interpret the results and draw insights from them. Cluster health can also be monitored through a GUI.

Fig 6: Tabular Data of Hit Counts

Figure 5 shows the hit counts of users and generates a heat map which is easily understandable and offers other useful visualization schemes. Here the number of hits from the various regions is shown in different colours on the map. We can also identify the geo grids with the highest and lowest hit counts and, based on that, interpret our user base.

Fig 7: Hit Counts of Users in Geo-Map

In this paper we performed the geo-identification of users based on access logs using the open source ELK stack. We used a small log data set just to demonstrate the usability of the ELK stack, but its real usefulness lies with big data sets: it requires few installation steps and, being a product of the open source community, it is cost effective and has a huge, active contributor base, which makes it competitive compared with other products for log management activities. Moreover, the ELK stack can be used for tracking fraudulent activities and security breaches, and it looks promising for Internet of Things monitoring. ELK is also effective for building real-time dashboards for analyzing tweets and other analytics.

REFERENCES

[1] Artyom Churilin, "Choosing an open-source log management system for small business," Master's thesis, Faculty of Information Technology, Tallinn University of Technology, Tallinn, Estonia.
[2] Christopher (2015, April 15). Visualizing data with Elasticsearch, Logstash and Kibana [Online]. Available: http://blog.webkid.io/visualize-datasets-with-elk/
[3] Yahoo stock history data [Online]. Available: http://finance.yahoo.com
[4] Anders Aarvik (2014, April 04). A bit on ElasticSearch + Logstash + Kibana (the ELK stack) [Online]. Available: http://aarvik.dk/a-bit-on-elasticsearch-logstash-kibana-the-elk-stack/
[5] Tanmay Deshpande (2015, May 10). Log Analytics using Elasticsearch, Logstash and Kibana [Online]. Available: http://hadooptutorials.co.in/tutorials/elasticsearch/log-analytics-using-elasticsearch-logstash-kibana.html
[6] GeoLite Legacy downloadable databases [Online]. Available: http://dev.maxmind.com/geoip/legacy/geolite/
[7] William Durand (2014, December 17). Elasticsearch, Logstash & Kibana with Docker [Online]. Available: http://williamdurand.fr/2014/12/17/elasticsearch-logstash-kibana-with-docker/
[8] Docker [Online]. Available: https://www.docker.com/
[9] Jettro Coenradie (2013, November 28). Use Kibana to analyze your images [Online]. Available: https://blog.trifork.com/2013/11/28/use-kibana-to-analyze-your-images/
[10] Bogdan Dumitresc (2014, January 28). Using Logstash, Elasticsearch and Kibana to monitor your video card [Online]. Available: http://blog.trifork.com/2014/01/28/using-logstash-elasticsearch-and-kibana-to-monitor-your-video-card-a-tutorial/
[11] Real Python (2014, November 13). Twitter Sentiment - Python, Docker, Elasticsearch, Kibana [Online]. Available: https://realpython.com/blog/python/twitter-sentiment-python-docker-elasticsearch-kibana/
[12] L. K. Joshila Grace, V. Maheswari, and Dhinaharan Nagamalai, "Web Log Data Analysis and Mining," in Proc. CCSIT-2011, Springer CCIS, vol. 133, pp. 459-469, 2011.