Using Amazon EMR and Hunk to explore, analyze and visualize machine data

Machine data can take many forms and comes from a variety of sources: system logs, application logs, service and system metrics, sensor data, and so on. In this step-by-step document you will learn how to build a Big Data solution for fast, interactive analysis of data stored in Hadoop or S3. This hands-on guide is highly technical and is aimed at solution architects, data analysts and developers. The focus of this solution is Amazon CloudFront logs, but the design can easily accommodate and scale to other sources such as ELB logs, S3 access logs, website logs, Hadoop logs, or just about any machine data stored in S3.

You will need:
1. An Amazon EMR cluster
2. A Hunk instance
3. An S3 bucket with your data (the data can also be in HDFS)

Step 1: Launch EMR

Using CLI

To launch an EMR cluster using the AWS Command Line Interface you must have AWS CLI version 1.4.4 or later. To use Hunk hourly, specify it in the applications attribute. For example:

    aws emr create-cluster \
      --applications Name=Hunk \
      --ami-version 3.3.1 \
      --instance-groups \
        InstanceGroupType=MASTER,InstanceCount=1,InstanceType=c3.xlarge \
        InstanceGroupType=CORE,InstanceCount=10,InstanceType=c3.xlarge \
      --no-auto-terminate \
      --region us-west-2 \
      --use-default-roles \
      --ec2-attributes KeyName=my-key-pair

Notes:
- The EMR AMI version should be 3.3.1 or above.
- Standard Hunk hourly rates apply.
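If you launch clusters repeatedly, it can help to generate the CLI call above from a few parameters instead of editing it by hand. The following is a minimal Python sketch, not part of the original guide: the function name and its parameters are hypothetical, and it only assembles the same flags shown above without calling AWS.

```python
import shlex


def emr_create_cluster_argv(key_name, region="us-west-2", ami_version="3.3.1",
                            core_count=10, instance_type="c3.xlarge"):
    """Build the argument list for the `aws emr create-cluster` call shown above.

    Hypothetical helper: it only assembles arguments; nothing is executed.
    """
    return [
        "aws", "emr", "create-cluster",
        "--applications", "Name=Hunk",
        "--ami-version", ami_version,
        "--instance-groups",
        f"InstanceGroupType=MASTER,InstanceCount=1,InstanceType={instance_type}",
        f"InstanceGroupType=CORE,InstanceCount={core_count},InstanceType={instance_type}",
        "--no-auto-terminate",
        "--region", region,
        "--use-default-roles",
        "--ec2-attributes", f"KeyName={key_name}",
    ]


argv = emr_create_cluster_argv("my-key-pair")
# Print a copy-pasteable command line; pass argv to subprocess.run to execute it.
print(shlex.join(argv))
```

Keeping the arguments as a list avoids shell quoting problems if a key pair name ever contains spaces.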

Using EMR Console

1. Log in to your AWS account and open your EMR console.
2. Click Create Cluster.
3. Under the Cluster Configuration section, enter a Cluster Name and select the appropriate Termination, Logging and Debugging options depending on your requirements.
4. Under the Tags section, enter suitable tags to describe your EMR nodes.
5. Under the Software Configuration section:
   a. Select Amazon as the Hadoop Distribution.
   b. Select AMI version >3.2.1 (Hadoop 2.4.0).
   c. Under Additional applications, select Hunk and follow the directions in the pop-up window.
   Note: Hive and Pig are optional applications and are not required by Hunk.
6. Under the File System Configuration section, select suitable settings.
7. Under the Hardware Configuration section:
   a. Select VPC and the appropriate EC2 availability zone.
   b. Select the number and instance type of nodes (Master, Core and Task):
      Master: 1 x c3.xlarge
      Core:   10 x c3.xlarge
8. Under the Security and Access section:
   a. Select your appropriate EC2 key pair and IAM user access settings.
   b. IAM Roles: leave at default.
   c. EC2 Security Groups: leave at default.
9. Click Create Cluster.

Step 2: Launch Hunk

There are a few ways to provision an hourly Hunk instance that will auto-discover and auto-connect to the above EMR cluster.

CloudFormation CLI

    aws cloudformation create-stack \
      --region us-west-2 \
      --stack-name MyHunkInstance \
      --template-body https://s3.amazonaws.com/splunk-emr-public/cfn/hunk_cf_template.txt \
      --parameters \
        ParameterKey=InstanceType,ParameterValue=c3.xlarge \
        ParameterKey=KeyName,ParameterValue=myKeyName \
        ParameterKey=StorageSize,ParameterValue=128 \
      --capabilities CAPABILITY_IAM

CloudFormation Template

Use the CloudFormation launch link for the region of your choice to launch a Hunk instance. You need to provide an instance type and a key pair to access the instance. Launch links are provided for the following regions (recent as of 2015-06-12):

    EC2 Region Name             EC2 Region Id
    US East, N. Virginia        us-east-1
    US West, N. California      us-west-1
    US West, Oregon             us-west-2
    Europe, Ireland             eu-west-1
    Europe, Frankfurt           eu-central-1
    Asia Pacific, Tokyo         ap-northeast-1
    Asia Pacific, Singapore     ap-southeast-1
    Asia Pacific, Sydney        ap-southeast-2
    South America, São Paulo    sa-east-1

Links to the CloudFormation templates are also available on the web:
http://docs.splunk.com/documentation/hunk/latest/hunk/installhunkawswithemr#provision_a_hunk_instance_using_cloudformation_template

After you go through the steps, you can watch the stack creation progress in the CloudFormation Console.
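The `--parameters` syntax in the CloudFormation CLI call above is easy to mistype. As a rough illustration, the Python sketch below builds the same call from keyword arguments and rejects regions that are not in the table above. The helper name and its signature are hypothetical, and treating the table as the full set of valid regions is an assumption; nothing here contacts AWS.

```python
TEMPLATE_URL = "https://s3.amazonaws.com/splunk-emr-public/cfn/hunk_cf_template.txt"

# Region IDs copied from the table above (recent as of 2015-06-12).
LISTED_REGIONS = {
    "us-east-1", "us-west-1", "us-west-2", "eu-west-1", "eu-central-1",
    "ap-northeast-1", "ap-southeast-1", "ap-southeast-2", "sa-east-1",
}


def hunk_stack_argv(stack_name, key_name, region="us-west-2",
                    instance_type="c3.xlarge", storage_size=128):
    """Build the `aws cloudformation create-stack` arguments (hypothetical helper)."""
    if region not in LISTED_REGIONS:
        raise ValueError(f"region {region!r} is not in the table of listed regions")
    params = {
        "InstanceType": instance_type,
        "KeyName": key_name,
        "StorageSize": storage_size,
    }
    return [
        "aws", "cloudformation", "create-stack",
        "--region", region,
        "--stack-name", stack_name,
        "--template-body", TEMPLATE_URL,
        "--parameters",
        *[f"ParameterKey={k},ParameterValue={v}" for k, v in params.items()],
        "--capabilities", "CAPABILITY_IAM",
    ]


stack_argv = hunk_stack_argv("MyHunkInstance", "myKeyName")
print(" ".join(stack_argv))
```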

AWS EC2 Console

1. Select an Amazon Machine Image (AMI). Search for one of the AMI IDs listed in the table below under Public Images, or under My AMIs with Shared with me selected.
2. Alternatively, click the AMI ID for your region in the table below; you are redirected to the appropriate page with the correct AMI ID selected.

    EC2 Region Name             EC2 Region Id     AMI ID**
    US East, N. Virginia        us-east-1         ami-c891d2a0
    US West, N. California      us-west-1         ami-9a6f74df
    US West, Oregon             us-west-2         ami-ebc49fdb
    Europe, Ireland             eu-west-1         ami-0dc44d7a
    Europe, Frankfurt           eu-central-1      ami-acb586b1
    Asia Pacific, Tokyo         ap-northeast-1    ami-b2958fb3
    Asia Pacific, Singapore     ap-southeast-1    ami-2a9cb678
    Asia Pacific, Sydney        ap-southeast-2    ami-712c5b4b
    South America, São Paulo    sa-east-1         ami-13b50a0e

    ** Recent as of: 2015-06-12

3. Select an instance type. Use "Compute optimized" instances. Instances with 8 vCPUs (for example, c3.2xlarge) provide a good starting point for optimal performance.
4. Enter information under Configure Instance Details:
   a. Type "1" for Number of instances.
   b. Select the appropriate Network setting. Use that of the EMR nodes.
   c. Select the appropriate Availability Zone. Use that of the EMR nodes.
   d. Select an IAM role. This is required and needs to be the same IAM role as the EC2 Instance Profile role in your EMR cluster.
   e. Select the preferred settings for the rest of the fields: Shutdown behavior, Enable termination protection, Monitoring, EBS-optimized instance and Advanced Details.
5. Under Add Storage, ensure that there is sufficient storage for the instance. The instance will not be part of the cluster and will not contain any raw data. It will need storage space to host users' and apps' search results and other related artifacts. 100 GB should provide enough room for normal workloads.
6. Under Tag Instance, optionally enter tags to describe your Hunk instance.
7. Under Configure Security Group, provide the information Hunk needs to communicate with the EMR cluster nodes and with users through the Splunk web port. Configure the instance with two Security Groups as follows:
   a. Group Name: ElasticMapReduce-master, which gets instantiated when the EMR cluster runs.
   b. Create a new group with port 8000 open inbound for user access, and optionally port 22 for admin tasks.
8. Review your instance details, and click Launch to assign a key pair and complete the process.

Links to the above AMIs are also available on the web:
http://docs.splunk.com/documentation/hunk/latest/hunk/installhunkawswithemr#step_3:_Provision_a_Hunk_Instance

Login To Hunk Instance

1. Wait a minute or two until the Hunk instance is up and running. When it is available, note its user-facing Public DNS address on the EC2 Console.
2. Use your browser to navigate to: http://<instance address>:8000
3. Log in using the instructions on the screen. The default username is admin and the password is the instance ID.

Take the tour that guides you through the Hunk, EMR, and S3 experience, or if you're familiar with Splunk/Hunk, feel free to skip it and move directly to Step 3 below.

Step 3: Point Hunk to CloudFront Data (S3 Bucket)

First, let's verify that Hunk has auto-detected and auto-connected to the EMR cluster. Click the EMR Connector App on the upper left and verify connectivity.

Next, we'll point Hunk to some CloudFront log data.

1. Click Settings on the upper right and then Virtual Indexes.
2. On the Virtual Indexes tab, click New Virtual Index.
3. Give it a Name and a Description.
4. Select your EMR cluster as the Provider. If it is not listed yet, allow sufficient time until the EMR cluster is fully up and available.
5. Under Paths, in Path to data in HDFS, enter either of the S3 bucket locations for CloudFront logs:
    s3n://aws-bigdata-bootcamp-usstandard/data/logs-full/e123abcdef/
    s3n://aws-bigdata-bootcamp/data/logs-full/e123abcdef/
6. Enter \.gz$ in Whitelist and click Save.
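The whitelist value \.gz$ is a regular expression that keeps only paths ending in .gz. The short Python sketch below shows which object names such a pattern would accept. The file names are made up for illustration, and it assumes the whitelist is applied as a regex search against each path, which may differ in detail from how Hunk evaluates it.

```python
import re

# Same pattern as entered in the Whitelist field.
WHITELIST = re.compile(r"\.gz$")


def whitelisted(paths):
    """Return only the paths the whitelist pattern would accept."""
    return [p for p in paths if WHITELIST.search(p)]


# Hypothetical object names under the CloudFront log prefix.
sample_paths = [
    "data/logs-full/e123abcdef/part-0001.gz",
    "data/logs-full/e123abcdef/README.txt",
    "data/logs-full/e123abcdef/part-0002.gz.tmp",
]

accepted = whitelisted(sample_paths)
print(accepted)
```

Only the first name is accepted: the backslash makes the dot literal and the `$` anchor rejects names that merely contain `.gz` somewhere in the middle.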

Step 4: Analyze and Visualize CloudFront Data

Let's investigate our newly configured data source. Return to the Search and Reporting app and enter the following in the search bar:

    index=myindexname | head 10

This will return the 10 most recent events from our dataset. Note the timeline, the fields on the left, and the events on the right. Let's extract a couple of fields from the data and build a quick report. On the second event from the top (the first one contains commented data), click Extract Fields under Event Actions.

A new tab opens and invites you to extract fields. Next, let's extract these fields from the data: x_edge_location, sc_bytes and c_ip (feel free to extract more fields). Double-click the third field (space/tab separated), enter x_edge_location and add it as an extraction. Repeat for the sc_bytes and c_ip fields. Click Next at the top to validate.
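To make the extraction concrete, here is a small Python sketch of the same idea outside Hunk: skip the commented header lines, split the event on tabs, and pick fields by position. The sample line is invented, and the positions of sc_bytes and c_ip (fourth and fifth) are an assumption based on the standard CloudFront access log layout; only the third position for x_edge_location comes from the steps above.

```python
# Hypothetical CloudFront log lines: header lines start with '#', events are tab separated.
HEADER = "#Version: 1.0"
SAMPLE = "2014-05-23\t01:13:11\tFRA2\t182\t192.0.2.10\tGET\texample.cloudfront.net\t/index.html\t200"


def parse_line(line):
    """Return the three extracted fields, or None for commented header lines."""
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    return {
        "x_edge_location": fields[2],  # third field, as in the extraction step above
        "sc_bytes": int(fields[3]),    # assumed fourth field
        "c_ip": fields[4],             # assumed fifth field
    }


print(parse_line(HEADER))
print(parse_line(SAMPLE))
```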

Change permissions to App and click Finish. Next, rerun the same search and observe the extracted fields on the left.

Now, let's run some searches and build a few reports:

Top 10 Client IPs

    index=myindexname | top c_ip

Sum of Bytes by x_edge_location

    index=myindexname | stats sum(sc_bytes) as sumBytes by x_edge_location | eval MB=sumBytes/1024/1024 | fields - sumBytes
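For readers less familiar with the search language, the two reports above amount to a frequency count and a grouped sum followed by a unit conversion. The Python sketch below reproduces that logic on a few made-up events. It illustrates what the searches compute, not how Hunk executes them, and all names and values are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical events, shaped like the fields extracted earlier.
events = [
    {"x_edge_location": "FRA2", "sc_bytes": 1048576, "c_ip": "192.0.2.10"},
    {"x_edge_location": "FRA2", "sc_bytes": 2097152, "c_ip": "192.0.2.11"},
    {"x_edge_location": "SEA4", "sc_bytes": 524288, "c_ip": "192.0.2.10"},
]


def top_client_ips(events, limit=10):
    """Count events per client IP and keep the most frequent ones."""
    return Counter(e["c_ip"] for e in events).most_common(limit)


def mb_by_edge_location(events):
    """Sum sc_bytes per edge location, then convert bytes to MB."""
    totals = defaultdict(int)
    for e in events:
        totals[e["x_edge_location"]] += e["sc_bytes"]
    return {loc: total / 1024 / 1024 for loc, total in totals.items()}


print(top_client_ips(events))
print(mb_by_edge_location(events))
```

Dividing by 1024 twice converts bytes to megabytes, which is the same conversion the eval step in the search performs.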

Geographic Distribution of Clients by x_edge_location

    index=myindexname | iplocation c_ip | geostats count by x_edge_location

Step 5: Finish Up

When you're done visualizing and analyzing the logs, terminate the EMR cluster and stop the Hunk instance.

References:
- EMR Documentation
- Using Hunk on Amazon Web Services
- Using Amazon EMR and Hunk for Rapid Response Log Analysis and Review
- Splunk Search Command Reference