David Moses
January 2014
Paper on Cloud Computing

I Background on Tools and Technologies in Amazon Web Services (AWS)

In this paper I will highlight the technologies from the AWS cloud that enable you to create a data warehouse. I will also discuss tools that can be used with AWS technologies for building a data warehouse. The AWS technologies that I will discuss in this paper include Redshift, EC2, S3, EMR, and EBS. The tools that can be used with AWS that I will discuss include Tableau, D3.js, SQL Workbench, and Pentaho's Kettle software. I will also discuss the future directions of cloud computing.

Background on Elastic Compute Cloud (EC2)

The Elastic Compute Cloud (EC2) provides computer servers hosted in Amazon's AWS cloud. An EC2 instance is started by choosing an Amazon Machine Image (AMI) from Amazon's AWS website (http://aws.amazon.com). There are thousands of AMIs to choose from, including servers hosted on different Linux distros (e.g. Ubuntu Linux) and on different versions of the Windows operating system. Most of the AMIs for EC2 have been configured for special-purpose use. Amazon's EC2 is the core of AWS's cloud. Before you can use many of the features of EC2, you must set up security for your EC2 instance and choose an AMI for it.
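As a sketch of the choices described above, a launch request boils down to an AMI, an instance type, a key pair, and a security group. The identifiers below are invented placeholders (real values come from the AWS console), shown here as the parameter names the EC2 RunInstances API expects:

```python
# Invented placeholder values -- real IDs come from the AWS console.
launch_request = {
    "ImageId": "ami-0abcd1234example",  # the chosen AMI
    "InstanceType": "m1.small",         # hardware profile for the instance
    "KeyName": "my-keypair",            # key pair used to access the instance
    "SecurityGroups": ["default"],      # firewall rules applied to the instance
    "MinCount": 1,                      # launch exactly one instance
    "MaxCount": 1,
}
```

These same parameters appear whether you launch from the console, the command line, or an SDK.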

Key pairs are one of the ways that Amazon handles security. A key pair, which secures your EC2 instance, is the only way that you can access that instance. You can create a key pair in the AWS console. Once you create a key pair you need to associate it with your EC2 instance. The private key must be downloaded and stored safely on your local computer. After selecting the EC2 AMI and associating it with your key pair, you can launch the EC2 instance.

Background on Simple Storage Service (S3)

S3 can be used in building a data warehouse by storing files in CSV format in an S3 bucket. An S3 bucket stores the files that you will load into a database (e.g. Redshift). A bucket name must be unique across all existing bucket names in S3. S3 exposes a REST service, and you can send requests to Amazon S3 using its REST API. Every interaction with Amazon S3 is either authenticated or anonymous. Authentication is the process of verifying the identity of the requester trying to access the AWS product. Authenticated requests must include a signature value that authenticates the request sender. The signature value is, in part, generated from the requester's AWS access keys (access key ID and secret access key). If you are using the AWS SDK, the libraries compute the signature from the keys you provide. However, if you make direct REST API calls in your application, you must write the code to compute the signature and add it to the request. (Referenced: http://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html)
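As a rough sketch of the signature computation mentioned above, the legacy AWS Signature Version 2 scheme used by S3's REST API signs a newline-joined string of request elements with the secret access key using HMAC-SHA1. The credentials, date, and resource below are invented placeholders, and canonicalized x-amz headers are omitted for brevity:

```python
import base64
import hmac
from hashlib import sha1

def sign_s3_request_v2(secret_key, verb, content_md5, content_type, date, resource):
    """Compute a legacy AWS Signature Version 2 value for an S3 REST request.

    The string to sign joins the request elements with newlines; the result
    is the base64-encoded HMAC-SHA1 of that string keyed with the secret
    access key. (Canonicalized x-amz headers, when present, would precede
    the resource.)
    """
    string_to_sign = "\n".join([verb, content_md5, content_type, date, resource])
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"), sha1).digest()
    return base64.b64encode(digest).decode("ascii")

# Invented placeholder credentials and bucket -- not real keys.
signature = sign_s3_request_v2(
    secret_key="EXAMPLEsecretKEY",
    verb="GET",
    content_md5="",
    content_type="",
    date="Tue, 27 Mar 2007 19:36:42 +0000",
    resource="/examplebucket/photos/puppy.jpg",
)
auth_header = "AWS EXAMPLEACCESSKEYID:" + signature
```

The resulting value travels in the Authorization header as "AWS AccessKeyId:Signature"; the AWS SDKs perform this computation for you (newer services use the Signature Version 4 scheme).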

Background on Elastic MapReduce (EMR)

Amazon's EMR simplifies running Hadoop and related big data applications on AWS. You can use it to manage and analyze vast amounts of data; a cluster can be configured to process terabytes or even petabytes of data. To develop and deploy custom Hadoop applications, you once needed access to a lot of hardware for your Hadoop programs. EMR makes it easy to spin up a set of EC2 instances as virtual servers to run your Hadoop cluster, and it allows you to easily deploy production clusters. Once you are done with the development and testing phases of your project, you can easily terminate unused testing clusters. By running your cluster on Amazon's EMR, you only pay for the server resources that you use. For example, if the amount of data you process in a daily cluster peaks on Monday, you can increase the number of servers to 25 in the cluster that day, and then scale back to 5 servers in the clusters that run on the other days of the week. You won't have to pay to maintain the additional servers during the rest of the week as you would with physical servers. Amazon's EMR is integrated with other Amazon Web Services such as Amazon EC2, Amazon S3, Amazon DynamoDB, Amazon RDS, CloudWatch, and AWS Data Pipeline, which means that you can easily access data stored in AWS from your cluster and make use of the functionality offered by other AWS services to manage your cluster and store its output. (Referenced: http://awsdocs.s3.amazonaws.com/ElasticMapReduce/latest/emr-dg.pdf)
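The pay-for-what-you-use point in the example above can be made concrete with a quick back-of-the-envelope calculation. The hourly rate below is a hypothetical flat price for illustration only, not an actual EMR or EC2 price:

```python
# Instance-hours for the schedule in the example above: 25 servers during
# Monday's peak run, 5 servers on each of the other six days.
elastic_hours = 25 * 24 + 5 * 24 * 6    # 600 + 720 = 1320 instance-hours
fixed_hours = 25 * 24 * 7               # 4200 instance-hours at peak size all week

# Hypothetical flat price of 10 cents per instance-hour -- illustrative only.
rate_cents = 10
elastic_cost_cents = elastic_hours * rate_cents   # 13200 cents = $132
fixed_cost_cents = fixed_hours * rate_cents       # 42000 cents = $420

# The elastic schedule uses less than a third of the fixed fleet's hours.
savings_pct = round(100 * (1 - elastic_hours / fixed_hours))  # 69
```

Under this assumed rate, scaling the cluster down for six days of the week cuts the weekly bill by roughly 69% compared with keeping the peak-sized fleet running.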

Background on Code Deployment using Elastic Beanstalk (EBS)

Amazon's Elastic Beanstalk allows you to deploy and scale services and applications developed with Java, .NET, and other programming languages. To deploy a Java application using EBS you first create your application and package it as a web application archive (WAR) file. The next step is to upload the WAR file to Elastic Beanstalk using a web service API, the AWS management console, the AWS Toolkit for Eclipse, or the command line interface. Once the WAR file has been uploaded it can then be deployed by EBS. EBS handles the deployment itself and will automatically provision server capacity, load balancing, auto-scaling, and application health monitoring. (Referenced: http://aws.amazon.com/elasticbeanstalk/) Functionality of EBS includes: (a) the ability to quickly and easily deploy new applications to a running environment; (b) access to CloudWatch for monitoring average CPU utilization, average latency, and request count; (c) email notifications through Simple Notification Service (SNS) about changes to the health of an application or when application servers have been added or removed; (d) the ability to quickly restart application servers on EC2 instances with a single command; and (e) access to server log files without needing to first log in to the application servers. (Referenced: http://aws.amazon.com/elasticbeanstalk/)

Background on ETL Tools Used in AWS

To demonstrate an ETL tool that can be used with AWS, I will briefly describe the different methods by which Pentaho's Kettle ETL tool can be used to extract data from different data sources, and then I will describe how that data can be transformed and loaded into an AWS database (e.g. Redshift). Kettle is packaged with several software tools including Spoon, Kitchen, and Pan. Spoon provides an IDE for doing transformations from one data type to another. For example, using Spoon you can create a reusable process for transforming a CSV file into an XML file. Kettle's Spoon IDE provides an interface where ETL developers can create ETL jobs for their project. An ETL transformation, which is the primary ETL job, handles the manipulation of the rows or data of a dataset (p. 25 of Pentaho Kettle Solutions). In Spoon the transformation steps are connected by transformation hops. A hop can best be described as a one-way channel that allows data to flow between the steps that it connects. Once a transformation has been created in Spoon you can then launch a Kettle job from the command line in Linux using Kettle tools such as Kitchen and Pan. Kitchen and Pan work by interpreting command-line parameters and invoking the Kettle engine to launch a job or transformation (p. 322 of Pentaho Kettle Solutions). Kitchen can be started by using a shell script from the Kettle home directory in Linux; the scripts must be made executable before they can be run. Scheduling a Kettle job with Kitchen can be done with the cron utility, by simply adding entries for cron jobs to the crontab file (p. 326 of Pentaho Kettle Solutions).

Background on the Redshift Database

Redshift is a petabyte-scale database cluster from AWS, which can be launched from within the Amazon Redshift management console. Before launching the Redshift database cluster you must select your parameter group, encryption option, a VPC if you choose, as well as the availability zone. A virtual private cloud (VPC) is a networking configuration that enables network isolation within the public portion of the cloud. After setting the configuration settings and completing the initialization for the cluster, you can launch the cluster. (Referenced: Getting Started with Amazon Redshift) The Redshift database uses PostgreSQL drivers and the core of the database is PostgreSQL, although many features have been removed for performance reasons. As with the design of other data warehouses, it is not necessary to assign primary keys in Redshift; if you do assign a primary key, however, the optimizer will use it to make good decisions about how to access the data. (Getting Started with Amazon Redshift) You can access Redshift from EC2 by running the connection string for Redshift, as well as from the SQL Workbench IDE. To access Redshift from SQL Workbench you need the connection string that you used to connect to Redshift from EC2 and the Redshift credentials that you set when launching the cluster. You will also need to provide SQL Workbench with the PostgreSQL JDBC driver that is needed to connect to Redshift; it can be downloaded from http://jdbc.postgresql.org/.
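As a small illustration of the connection string mentioned above, the JDBC URL that SQL Workbench needs follows the standard PostgreSQL form, pointed at the cluster's endpoint and Redshift's default port 5439. The endpoint and database name below are made-up placeholders; yours come from the Redshift console:

```python
def redshift_jdbc_url(endpoint, port, database):
    """Build the PostgreSQL-style JDBC URL used to reach a Redshift cluster."""
    return "jdbc:postgresql://{}:{}/{}".format(endpoint, port, database)

# Hypothetical cluster endpoint -- the real one is shown in the Redshift console.
url = redshift_jdbc_url(
    "examplecluster.abc123def456.us-east-1.redshift.amazonaws.com", 5439, "dev")
```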

Redshift is an example of a column-oriented database, or C-Store. A C-Store database differs most notably from row-oriented relational databases in that it stores data in columns rather than in rows, optimizing the database for reading data instead of writing data. (Referenced: http://en.wikipedia.org/wiki/C-Store) The C-Store research prototype, developed by Professor Stonebraker and his colleagues at MIT in 2005, led to the commercialization of the Vertica database by Stonebraker.

Background on Business Intelligence (BI) Tools Used in AWS

(A) Tableau BI Reports

Business intelligence (BI) reporting software can be used with cloud-based databases such as AWS's Redshift database. Using Tableau with Redshift you can create most of the types of BI reports included with the Tableau software. Tableau reports allow one to visualize trends and spot anomalies in the data; with big data sets it is very hard to do either without BI reports that can visualize them. Tableau reports have commonly been used in the marketing and finance industries. An example of a Tableau report used in sales and marketing is a dashboard for viewing the Salesforce data of a sales team. In the finance industry Tableau has commonly been used for creating wealth management dashboards and dashboards that analyze publicly traded company stock prices over a given time horizon. (Referenced: http://www.tableausoftware.com/learn/gallery)

(B) Custom BI Reports

Custom BI reports can be created with JavaScript frameworks including D3.js and ExtJS. Creating a custom BI report can be done by coding the design of the report with D3.js and consuming JSON data from a RESTful API. A web server such as Tomcat needs to be installed on the EC2 instance so that the report can be viewed on the internet.

III Future Directions and Opportunities for AWS and Cloud Computing

The future of AWS and of cloud computing is quite bright. The interest in hosting applications in the cloud has been, and continues to be, high for three reasons. First, the cost of hosting an application in the cloud is substantially less than the cost of buying the hardware and hosting the application yourself. Second, the technologies available in the cloud allow you to quickly and easily deploy a website, data warehouse, or other application. Third, the cloud computing space is evolving quickly, and companies including Amazon are constantly developing new tools and technologies in the space, which is good for companies that wish to host an application with a cloud provider. As evidence that AWS is constantly evolving, this month alone AWS released AppStream, Amazon WorkSpaces, and Amazon CloudTrail. AppStream, a media and interactive content streaming service, is oriented toward streaming content for applications that require high-definition video and features like interactivity and user authentication.

Amazon WorkSpaces is Amazon's entry into the virtual desktop market. The WorkSpaces pricing structure is similar to that of other AWS products and services: you pay as you use it. The AWS virtual desktops are Windows 7-like instances and they can be accessed from a variety of devices: Mac OS and Windows desktops, as well as iPad, Kindle Fire, and Android tablets. Amazon CloudTrail is a tracking service that logs all of the Amazon API traffic associated with a given account. Every single command line call, SDK call, and even calls initiated from within the Amazon management console are logged.
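To illustrate the kind of record CloudTrail produces for each logged API call, the sketch below parses a simplified, hand-written event in CloudTrail's JSON shape. The field values are invented; real log files contain a "Records" array whose entries carry many more fields:

```python
import json

# Simplified, invented CloudTrail-style event -- not captured from a real account.
log_document = """
{
  "Records": [
    {
      "eventTime": "2014-01-15T12:00:00Z",
      "eventSource": "ec2.amazonaws.com",
      "eventName": "RunInstances",
      "awsRegion": "us-east-1",
      "userIdentity": {"type": "IAMUser", "userName": "example-admin"}
    }
  ]
}
"""

records = json.loads(log_document)["Records"]
for record in records:
    # Each record identifies who called which API operation, where, and when.
    print(record["userIdentity"]["userName"],
          record["eventName"],
          record["awsRegion"])
```

Because every command line call, SDK call, and console action produces such a record, these logs can be filtered by event name or user to audit account activity.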