David Moses
January 2014
Paper on Cloud Computing

I Background on Tools and Technologies in Amazon Web Services (AWS)

In this paper I will highlight the technologies from the AWS cloud that enable you to create a data warehouse. I will also discuss tools that can be used with AWS technologies for building a data warehouse. The AWS technologies that I will discuss in this paper include Redshift, EC2, S3, EMR, and EBS. The tools that can be used with AWS that I will discuss include Tableau, D3.js, SQL Workbench, and Pentaho's Kettle software. I will also discuss the future directions of cloud computing.

Background on Elastic Compute Cloud (EC2)

The Elastic Compute Cloud (EC2) provides computer servers hosted in Amazon's AWS cloud. An EC2 instance is started by choosing an Amazon Machine Image (AMI) from Amazon's AWS website (http://aws.amazon.com). There are thousands of AMIs to choose from, including servers hosted on different Linux distros (e.g. Ubuntu Linux) and on different versions of the Windows operating system. Most of the AMIs for EC2 have been configured for special-purpose use. Amazon's EC2 is the core of AWS's cloud. Before you can use many of the features of EC2, you must set up security for your EC2 instance and choose an AMI for it.
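As a sketch of the choices described above, a launch request boils down to an AMI, an instance type, a key pair, and a security group. The identifiers below are invented placeholders (real values come from the AWS console), shown here as the parameter names the EC2 RunInstances API expects:

```python
# Invented placeholder values -- real IDs come from the AWS console.
launch_request = {
    "ImageId": "ami-0abcd1234example",  # the chosen AMI
    "InstanceType": "m1.small",         # hardware profile for the instance
    "KeyName": "my-keypair",            # key pair used to access the instance
    "SecurityGroups": ["default"],      # firewall rules applied to the instance
    "MinCount": 1,                      # launch exactly one instance
    "MaxCount": 1,
}
```

These same parameters appear whether you launch from the console, the command line, or an SDK.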

Key pairs are one of the ways that Amazon handles security. A key pair, which secures your EC2 instance, is the only way that you can access that instance. You can create a key pair in the AWS console. Once you create a key pair you need to associate it with your EC2 instance. The private key must be downloaded and stored safely on your local computer. After selecting the EC2 AMI and associating it with your key pair, you can launch the EC2 instance.

Background on Simple Storage Service (S3)

S3 can be used in building a data warehouse by storing files in CSV format in an S3 bucket. An S3 bucket stores the files that you will load into a database (e.g. Redshift). A bucket name must be unique across all existing bucket names in S3. S3 exposes a REST service, and you can send requests to Amazon S3 using its REST API. Every interaction with Amazon S3 is either authenticated or anonymous. Authentication is the process of verifying the identity of the requester trying to access the AWS product. Authenticated requests must include a signature value that authenticates the request sender. The signature value is, in part, generated from the requester's AWS access keys (access key ID and secret access key). If you are using the AWS SDK, the libraries compute the signature from the keys you provide. However, if you make direct REST API calls in your application, you must write the code to compute the signature and add it to the request. (Referenced: http://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html)
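As a rough sketch of the signature computation mentioned above, the legacy AWS Signature Version 2 scheme used by S3's REST API signs a newline-joined string of request elements with the secret access key using HMAC-SHA1. The credentials, date, and resource below are invented placeholders, and canonicalized x-amz headers are omitted for brevity:

```python
import base64
import hmac
from hashlib import sha1

def sign_s3_request_v2(secret_key, verb, content_md5, content_type, date, resource):
    """Compute a legacy AWS Signature Version 2 value for an S3 REST request.

    The string to sign joins the request elements with newlines; the result
    is the base64-encoded HMAC-SHA1 of that string keyed with the secret
    access key. (Canonicalized x-amz headers, when present, would precede
    the resource.)
    """
    string_to_sign = "\n".join([verb, content_md5, content_type, date, resource])
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"), sha1).digest()
    return base64.b64encode(digest).decode("ascii")

# Invented placeholder credentials and bucket -- not real keys.
signature = sign_s3_request_v2(
    secret_key="EXAMPLEsecretKEY",
    verb="GET",
    content_md5="",
    content_type="",
    date="Tue, 27 Mar 2007 19:36:42 +0000",
    resource="/examplebucket/photos/puppy.jpg",
)
auth_header = "AWS EXAMPLEACCESSKEYID:" + signature
```

The resulting value travels in the Authorization header as "AWS AccessKeyId:Signature"; the AWS SDKs perform this computation for you (newer services use the Signature Version 4 scheme).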

Background on Elastic MapReduce (EMR)

Amazon's EMR simplifies running Hadoop and related big data applications on AWS. You can use it to manage and analyze vast amounts of data; a cluster can be configured to process terabytes or even petabytes of data. To develop and deploy custom Hadoop applications, you once needed access to a lot of hardware for your Hadoop programs. EMR makes it easy to spin up a set of EC2 instances as virtual servers to run your Hadoop cluster, and it allows you to easily deploy production clusters. Once you are done with the development and testing phases of your project, you can easily terminate unused testing clusters. By running your cluster on Amazon's EMR, you only pay for the server resources that you use. For example, if the amount of data you process in a daily cluster peaks on Monday, you can increase the number of servers to 25 in the cluster that day, and then scale back to 5 servers in the clusters that run on the other days of the week. You won't have to pay to maintain the additional servers during the rest of the week as you would with physical servers. Amazon's EMR is integrated with other Amazon Web Services such as Amazon EC2, Amazon S3, Amazon DynamoDB, Amazon RDS, CloudWatch, and AWS Data Pipeline, which means that you can easily access data stored in AWS from your cluster and make use of the functionality offered by other AWS services to manage your cluster and store its output. (Referenced: http://awsdocs.s3.amazonaws.com/ElasticMapReduce/latest/emr-dg.pdf)
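The pay-for-what-you-use point in the example above can be made concrete with a quick back-of-the-envelope calculation. The hourly rate below is a hypothetical flat price for illustration only, not an actual EMR or EC2 price:

```python
# Instance-hours for the schedule in the example above: 25 servers during
# Monday's peak run, 5 servers on each of the other six days.
elastic_hours = 25 * 24 + 5 * 24 * 6    # 600 + 720 = 1320 instance-hours
fixed_hours = 25 * 24 * 7               # 4200 instance-hours at peak size all week

# Hypothetical flat price of 10 cents per instance-hour -- illustrative only.
rate_cents = 10
elastic_cost_cents = elastic_hours * rate_cents   # 13200 cents = $132
fixed_cost_cents = fixed_hours * rate_cents       # 42000 cents = $420

# The elastic schedule uses less than a third of the fixed fleet's hours.
savings_pct = round(100 * (1 - elastic_hours / fixed_hours))  # 69
```

Under this assumed rate, scaling the cluster down for six days of the week cuts the weekly bill by roughly 69% compared with keeping the peak-sized fleet running.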

Background on Code Deployment using Elastic Beanstalk (EBS)

Amazon's Elastic Beanstalk allows you to deploy and scale services and applications developed with Java, .NET, and other programming languages. To deploy a Java application using EBS you first create your application and package it as a web application archive (WAR) file. The next step is to upload the WAR file to Elastic Beanstalk using a web service API, the AWS management console, the AWS Toolkit for Eclipse, or the command line interface. Once the WAR file has been uploaded it can then be deployed by EBS. EBS handles the deployment itself and will automatically provision server capacity, load balancing, auto-scaling, and application health monitoring. (Referenced: http://aws.amazon.com/elasticbeanstalk/) Functionality of EBS includes: (a) the ability to quickly and easily deploy new applications to a running environment; (b) access to CloudWatch for monitoring average CPU utilization, average latency, and request count; (c) email notifications through Simple Notification Service (SNS) about changes to the health of an application or when application servers have been added or removed; (d) the ability to quickly restart application servers on EC2 instances with a single command; and (e) access to server log files without needing to first log in to the application servers. (Referenced: http://aws.amazon.com/elasticbeanstalk/)

Background on ETL Tools Used in AWS

To demonstrate an ETL tool that can be used with AWS, I will briefly describe the different methods by which Pentaho's Kettle ETL tool can be used to extract data from different data sources, and then I will describe how that data can be transformed and loaded into an AWS database (e.g. Redshift). Kettle is packaged with several software tools including Spoon, Kitchen, and Pan. Spoon provides an IDE for doing transformations from one data type to another. For example, using Spoon you can create a reusable process for transforming a CSV file into an XML file. Kettle's Spoon IDE provides an interface where ETL developers can create ETL jobs for their project. An ETL transformation, which is the primary ETL job, handles the manipulation of the rows or data of a dataset (p. 25 of Pentaho Kettle Solutions). In Spoon the transformation steps are connected by transformation hops. A hop can best be described as a one-way channel that allows data to flow between the steps that it connects. Once a transformation has been created in Spoon you can then launch a Kettle job from the command line in Linux using Kettle tools such as Kitchen and Pan. Kitchen and Pan work by interpreting command-line parameters and invoking the Kettle engine to launch a job or transformation (p. 322 of Pentaho Kettle Solutions). Kitchen can be started by using a shell script from the Kettle home directory in Linux; the scripts must be made executable before they can be run. Scheduling a Kettle job with Kitchen can be done with the cron utility, by simply adding entries for cron jobs to the crontab file (p. 326 of Pentaho Kettle Solutions).

Background on the Redshift Database

Redshift is a petabyte-scale database cluster from AWS, which can be launched from within the Amazon Redshift management console. Before launching the Redshift database cluster you must select your parameter group, encryption option, a VPC if you choose, as well as the availability zone. A virtual private cloud (VPC) is a networking configuration that enables network isolation within the public portion of the cloud. After setting the configuration settings and completing the initialization for the cluster, you can launch the cluster. (Referenced: Getting Started with Amazon Redshift) The Redshift database uses PostgreSQL drivers and the core of the database is PostgreSQL, although many features have been removed for performance reasons. As with the design of other data warehouses, it is not necessary to assign primary keys in Redshift; if you do assign a primary key, however, the optimizer will use it to make good decisions about how to access the data. (Getting Started with Amazon Redshift) You can access Redshift from EC2 by running the connection string for Redshift, as well as from the SQL Workbench IDE. To access Redshift from SQL Workbench you need the connection string that you used to connect to Redshift from EC2 and the Redshift credentials that you set when launching the cluster. You will also need to provide SQL Workbench with the PostgreSQL JDBC driver that is needed to connect to Redshift; it can be downloaded from http://jdbc.postgresql.org/.
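As a small illustration of the connection string mentioned above, the JDBC URL that SQL Workbench needs follows the standard PostgreSQL form, pointed at the cluster's endpoint and Redshift's default port 5439. The endpoint and database name below are made-up placeholders; yours come from the Redshift console:

```python
def redshift_jdbc_url(endpoint, port, database):
    """Build the PostgreSQL-style JDBC URL used to reach a Redshift cluster."""
    return "jdbc:postgresql://{}:{}/{}".format(endpoint, port, database)

# Hypothetical cluster endpoint -- the real one is shown in the Redshift console.
url = redshift_jdbc_url(
    "examplecluster.abc123def456.us-east-1.redshift.amazonaws.com", 5439, "dev")
```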

Redshift is an example of a column-oriented database, or C-Store. A C-Store database differs most notably from row-oriented relational databases in that it stores data in columns rather than in rows, optimizing the database for reading data instead of writing data. (Referenced: http://en.wikipedia.org/wiki/C-Store) The C-Store research prototype, developed by Professor Stonebraker and his colleagues at MIT in 2005, led to the commercialization of the Vertica database by Stonebraker.

Background on Business Intelligence (BI) Tools Used in AWS

(A) Tableau BI Reports

Business intelligence (BI) reporting software can be used with cloud-based databases such as AWS's Redshift database. Using Tableau with Redshift you can create most of the types of BI reports included with the Tableau software. Tableau reports allow one to visualize trends and spot anomalies in the data; with big data sets it is very hard to do either without BI reports that can visualize them. Tableau reports have commonly been used in the marketing and finance industries. An example of a Tableau report used in sales and marketing is a dashboard for viewing the Salesforce data of a sales team. In the finance industry Tableau has commonly been used for creating wealth management dashboards and dashboards that analyze publicly traded company stock prices over a given time horizon. (Referenced: http://www.tableausoftware.com/learn/gallery)

(B) Custom BI Reports

Custom BI reports can be created with JavaScript frameworks including D3.js and ExtJS. Creating a custom BI report can be done by coding the design of the report with D3.js and consuming JSON data from a RESTful API. A web server such as Tomcat needs to be installed on the EC2 instance so that the report can be viewed on the internet.

III Future Directions and Opportunities for AWS and Cloud Computing

The future of AWS and of cloud computing is quite bright. The interest in hosting applications in the cloud has been, and continues to be, high for three reasons. First, the cost of hosting an application in the cloud is substantially less than the cost of buying the hardware and hosting the application yourself. Second, the technologies available in the cloud allow you to quickly and easily deploy a website, data warehouse, or other application. Third, the cloud computing space is evolving quickly, and companies including Amazon are constantly developing new tools and technologies in the space, which is good for companies that wish to host an application with a cloud provider. As evidence that AWS is constantly evolving, this month alone AWS released AppStream, Amazon WorkSpaces, and Amazon CloudTrail. AppStream, a media and interactive content streaming service, is oriented toward streaming content for applications that require high-definition video and features like interactivity and user authentication.

Amazon WorkSpaces is Amazon's entry into the virtual desktop market. The WorkSpaces pricing structure is similar to that of other AWS products and services: you pay as you use it. The AWS virtual desktops are Windows 7-like instances and they can be accessed from a variety of devices: Mac OS and Windows desktops, as well as iPad, Kindle Fire, and Android tablets. Amazon CloudTrail is a tracking service that logs all of the Amazon API traffic associated with a given account. Every single command line call, SDK call, and even calls initiated from within the Amazon management console are logged.
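To illustrate the kind of record CloudTrail produces for each logged API call, the sketch below parses a simplified, hand-written event in CloudTrail's JSON shape. The field values are invented; real log files contain a "Records" array whose entries carry many more fields:

```python
import json

# Simplified, invented CloudTrail-style event -- not captured from a real account.
log_document = """
{
  "Records": [
    {
      "eventTime": "2014-01-15T12:00:00Z",
      "eventSource": "ec2.amazonaws.com",
      "eventName": "RunInstances",
      "awsRegion": "us-east-1",
      "userIdentity": {"type": "IAMUser", "userName": "example-admin"}
    }
  ]
}
"""

records = json.loads(log_document)["Records"]
for record in records:
    # Each record identifies who called which API operation, where, and when.
    print(record["userIdentity"]["userName"],
          record["eventName"],
          record["awsRegion"])
```

Because every command line call, SDK call, and console action produces such a record, these logs can be filtered by event name or user to audit account activity.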