
AWS Data Pipeline: Developer Guide
API Version 2012-10-29
Amazon Web Services

What is AWS Data Pipeline?
  How Does AWS Data Pipeline Work?
    Pipeline Definition
    Lifecycle of a Pipeline
    Task Runners
    Pipeline Components, Instances, and Attempts
    Lifecycle of a Pipeline Task
Get Set Up
  Access the Console
  Install the Command Line Interface
  Deploy and Configure Task Runner
  Install the AWS SDK
  Granting Permissions to Pipelines with IAM
  Grant Amazon RDS Permissions to Task Runner
Tutorial: Copy CSV Data from Amazon S3 to Amazon S3
  Using the AWS Data Pipeline Console
  Using the Command Line Interface
Tutorial: Copy Data From a MySQL Table to Amazon S3
  Using the AWS Data Pipeline Console
  Using the Command Line Interface
Tutorial: Launch an Amazon EMR Job Flow
  Using the AWS Data Pipeline Console
  Using the Command Line Interface
Tutorial: Import/Export Data in Amazon DynamoDB With Amazon EMR and Hive
  Part One: Import Data into Amazon DynamoDB
    Using the AWS Data Pipeline Console
    Using the Command Line Interface
  Part Two: Export Data from Amazon DynamoDB
    Using the AWS Data Pipeline Console
    Using the Command Line Interface
Tutorial: Run a Shell Command to Process MySQL Table
  Using the AWS Data Pipeline Console
Manage Pipelines
  Using the AWS Data Pipeline Console
  Using the Command Line Interface
Troubleshoot AWS Data Pipeline
Pipeline Definition Files
  Creating Pipeline Definition Files
  Example Pipeline Definitions
    Copy SQL Data to a CSV File in Amazon S3
    Launch an Amazon EMR Job Flow
    Run a Script on a Schedule
    Chain Multiple Activities and Roll Up Data
    Copy Data from Amazon S3 to MySQL
    Extract Apache Web Log Data from Amazon S3 using Hive
    Extract Amazon S3 Data (CSV/TSV) to Amazon S3 using Hive
    Extract Amazon S3 Data (Custom Format) to Amazon S3 using Hive
  Simple Data Types
  Expression Evaluation
Objects
  Schedule
  S3DataNode
  MySqlDataNode
  DynamoDBDataNode
  ShellCommandActivity
  CopyActivity
  EmrActivity
  HiveActivity

  ShellCommandPrecondition
  Exists
  S3KeyExists
  S3PrefixNotEmpty
  RdsSqlPrecondition
  DynamoDBTableExists
  DynamoDBDataExists
  Ec2Resource
  EmrCluster
  SnsAlarm
Command Line Reference
Program AWS Data Pipeline
  Make an HTTP Request to AWS Data Pipeline
  Actions in AWS Data Pipeline
Task Runner Reference
Web Service Limits
AWS Data Pipeline Resources
Document History

What is AWS Data Pipeline?

AWS Data Pipeline is a web service that you can use to automate the movement and transformation of data. With AWS Data Pipeline, you can define data-driven workflows, so that tasks can be dependent on the successful completion of previous tasks.

For example, you can use AWS Data Pipeline to archive your web server's logs to Amazon Simple Storage Service (Amazon S3) each day and then run a weekly Amazon Elastic MapReduce (Amazon EMR) job flow over those logs to generate traffic reports. In this example, AWS Data Pipeline would schedule the daily tasks to copy data and the weekly task to launch the Amazon EMR job flow. AWS Data Pipeline would also ensure that Amazon EMR waits for the final day's data to be uploaded to Amazon S3 before it begins its analysis, even if there is an unforeseen delay in uploading the logs.

AWS Data Pipeline handles the ambiguities of real-world data management. You define the parameters of your data transformations and AWS Data Pipeline enforces the logic that you've set up.

How Does AWS Data Pipeline Work?

Three main components of AWS Data Pipeline work together to manage your data:

- Pipeline definition: specifies the business logic of your data management. For more information, see Pipeline Definition Files (p. 135).
- AWS Data Pipeline web service: interprets the pipeline definition and assigns tasks to workers to move and transform data.
- Task runners: poll the AWS Data Pipeline web service for tasks and then perform those tasks. In the previous example, Task Runner would copy log files to Amazon S3 and launch Amazon EMR job flows. Task Runner is installed and runs automatically on resources created by your pipeline definitions. You can write a custom task runner application, or you can use the Task Runner application that is provided by AWS Data Pipeline. For more information, see Task Runner (p. 5) and Custom Task Runner (p. 8).

The following illustration shows how these components work together. If the pipeline definition supports nonserialized tasks, AWS Data Pipeline can manage tasks for multiple task runners working in parallel.

Pipeline Definition

A pipeline definition is how you communicate your business logic to AWS Data Pipeline. It contains the following information:
- Names, locations, and formats of your data sources.
- Activities that transform the data.
- The schedule for those activities.
- Resources that run your activities and preconditions.
- Preconditions that must be satisfied before the activities can be scheduled.
- Ways to alert you with status updates as pipeline execution proceeds.

From your pipeline definition, AWS Data Pipeline determines the tasks that will occur, schedules them, and assigns them to task runners. If a task is not completed successfully, AWS Data Pipeline retries the task according to your instructions and, if necessary, reassigns it to another task runner. If the task fails repeatedly, you can configure the pipeline to notify you.

For example, in your pipeline definition, you might specify that in 2013, log files generated by your application will be archived each month to an Amazon S3 bucket. AWS Data Pipeline would then create 12 tasks, each copying over a month's worth of data, regardless of whether the month contained 28, 29, 30, or 31 days.

You can create a pipeline definition in the following ways:
- Graphically, by using the AWS Data Pipeline console.
- Textually, by writing a JSON file in the format used by the command line interface.
- Programmatically, by calling the web service with either one of the AWS SDKs or the AWS Data Pipeline API.

A pipeline definition can contain the following types of components:

Data node: The location of input data for a task or the location where output data is to be stored. The following data locations are currently supported: an Amazon S3 bucket, a MySQL database, Amazon DynamoDB, and a local data node.

Activity: An interaction with the data. The following activities are currently supported: copy data to a new location, launch an Amazon EMR job flow, run a Bash script from the command line (requires a UNIX environment to run the script), run a database query, and run a Hive activity.

Precondition: A conditional statement that must be true before an action can run. The following preconditions are currently supported: a command-line Bash script completed successfully (requires a UNIX environment to run the script), data exists, a specific time or a time interval relative to another event has been reached, an Amazon S3 location contains data, and an Amazon RDS or Amazon DynamoDB table exists.

Schedule: Any or all of the following: the time that an action should start, the time that an action should stop, and how often the action should run.

Resource: A resource that can analyze or modify data. The following computational resources are currently supported: an Amazon EMR job flow and an Amazon EC2 instance.

Action: A behavior that is triggered when specified conditions are met, such as the failure of an activity. The following actions are currently supported: an Amazon SNS notification and a terminate action.

For more information, see Pipeline Definition Files (p. 135).

Lifecycle of a Pipeline

After you create a pipeline definition, you create a pipeline and then add your pipeline definition to it. Your pipeline definition must be validated. After you have a valid pipeline definition, you can activate it. At that point, the pipeline runs and schedules tasks. When you are done with your pipeline, you can delete it. The complete lifecycle of a pipeline is shown in the following illustration.

Task Runners

A task runner is an application that polls AWS Data Pipeline for tasks and then performs those tasks. You can either use Task Runner as provided by AWS Data Pipeline, or create a custom task runner application.

Task Runner

Task Runner is a default implementation of a task runner that is provided by AWS Data Pipeline. When Task Runner is installed and configured, it polls AWS Data Pipeline for tasks associated with pipelines that you have activated. When a task is assigned to Task Runner, it performs that task and reports its status back to AWS Data Pipeline. If your workflow requires non-default behavior, you'll need to implement that functionality in a custom task runner.

There are three ways you can use Task Runner to process your pipeline:
- AWS Data Pipeline installs Task Runner for you on resources that are launched and managed by the web service.
- You install Task Runner on a computational resource that you manage, such as a long-running Amazon EC2 instance or an on-premises server.
- You modify the Task Runner code to create a custom task runner, which you then install on a computational resource that you manage.

Task Runner on AWS Data Pipeline-Managed Resources

When a resource is launched and managed by AWS Data Pipeline, the web service automatically installs Task Runner on that resource to process tasks in the pipeline. You specify a computational resource (either an Amazon EC2 instance or an Amazon EMR job flow) for the runsOn field of an activity object. When AWS Data Pipeline launches this resource, it installs Task Runner on that resource and configures it to process all activity objects that have their runsOn field set to that resource. When AWS Data Pipeline terminates the resource, the Task Runner logs are published to an Amazon S3 location before the resource shuts down.

For example, if you use the EmrActivity action in a pipeline and specify an EmrCluster object in the runsOn field, then when AWS Data Pipeline processes that activity, it launches an Amazon EMR job flow and uses a bootstrap step to install Task Runner onto the master node. This Task Runner then processes the tasks for activities that have their runsOn field set to that EmrCluster object. The following excerpt from a pipeline definition shows this relationship between the two objects.

{
  "id" : "MyEmrActivity",
  "name" : "Work to perform on my data",
  "type" : "EmrActivity",
  "runsOn" : { "ref" : "MyEmrCluster" },
  "preStepCommand" : "scp remoteFiles localFiles",
  "step" : "s3://myBucket/myPath/myStep.jar,firstArg,secondArg",
  "step" : "s3://myBucket/myPath/myOtherStep.jar,anotherArg",
  "postStepCommand" : "scp localFiles remoteFiles",
  "input" : { "ref" : "MyS3Input" },
  "output" : { "ref" : "MyS3Output" }
}

{
  "id" : "MyEmrCluster",
  "name" : "EMR cluster to perform the work",
  "type" : "EmrCluster",
  "hadoopVersion" : "0.20",
  "keyPair" : "myKeyPair",
  "masterInstanceType" : "m1.xlarge",
  "coreInstanceType" : "m1.small",
  "coreInstanceCount" : "10",
  "instanceTaskType" : "m1.small",
  "instanceTaskCount" : "10",
  "bootstrapAction" : "s3://elasticmapreduce/libs/ba/configure-hadoop,arg1,arg2,arg3",
  "bootstrapAction" : "s3://elasticmapreduce/libs/ba/configure-otherstuff,arg1,arg2"
}

If you have multiple AWS Data Pipeline-managed resources in a pipeline, Task Runner is installed on each of them, and they all poll AWS Data Pipeline for tasks to process.

Task Runner on User-Managed Resources

You can install Task Runner on computational resources that you manage, such as a long-running Amazon EC2 instance or a physical server. This approach can be useful when, for example, you want to use AWS Data Pipeline to process data that is stored inside your organization's firewall. By installing Task Runner on a server in the local network, you can access the local database securely and then poll AWS Data Pipeline for the next task to run. When AWS Data Pipeline ends processing or deletes the pipeline, the Task Runner instance remains running on your computational resource until you manually shut it down. Similarly, the Task Runner logs persist after pipeline execution is complete.

You download Task Runner, which is in Java Archive (JAR) format, and install it on your computational resource. For more information about downloading and installing Task Runner, see Deploy and Configure Task Runner (p. 19). To connect a Task Runner that you've installed to the pipeline activities it should process, add a workerGroup field to the object, and configure Task Runner to poll for that worker group value. You do this by passing the worker group string as a parameter (for example, --workerGroup=wg-12345) when you run the Task Runner JAR file.

{
  "id" : "MyStoredProcedureActivity",
  "type" : "StoredProcedureActivity",
  "workerGroup" : "wg-12345",
  "command" : "mkdir new-directory"
}

Custom Task Runner

If your data management requires behavior other than the default behavior provided by Task Runner, you need to create a custom task runner. Because Task Runner is an open-source application, you can use it as the basis for creating your custom implementation. After you write the custom task runner, you install it on a computational resource that you own, such as a long-running EC2 instance or a physical server inside your organization's firewall. To connect your custom task runner to the pipeline activities it should process, add a workerGroup field to the object, and configure your custom task runner to poll for that worker group value.

For example, if you use the ShellCommandActivity action in a pipeline and specify a value for the workerGroup field, when AWS Data Pipeline processes that activity, it passes the task to a task runner that polls the web service for work and specifies that worker group. The following excerpt from a pipeline definition shows how to configure the workerGroup field.

{
  "id" : "CreateDirectory",
  "type" : "ShellCommandActivity",
  "workerGroup" : "wg-67890",
  "command" : "mkdir new-directory"
}

When you create a custom task runner, you have complete control over how your pipeline activities are processed. The only requirement is that you communicate with AWS Data Pipeline as follows:

- Poll for tasks: Your task runner should poll AWS Data Pipeline for tasks to process by calling the PollForTask API. If tasks are ready in the work queue, PollForTask returns a response immediately. If no tasks are available in the queue, PollForTask uses long polling and holds the connection open for up to 90 seconds, during which time any newly scheduled tasks are handed to the task runner. Your task runner should not call PollForTask again on the same worker group until it receives a response, and this may take up to 90 seconds.
- Report progress: Your task runner should report its progress to AWS Data Pipeline by calling the ReportTaskProgress API each minute. If a task runner does not report its status after 5 minutes, then every 20 minutes afterwards (configurable), AWS Data Pipeline assumes the task runner is unable to process the task and assigns it in a subsequent response to PollForTask.
- Signal completion of a task: Your task runner should inform AWS Data Pipeline of the outcome when it completes a task by calling the SetTaskStatus API. The task runner calls this action regardless of whether the task was successful. The task runner does not need to call SetTaskStatus for tasks that were canceled by AWS Data Pipeline.
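The following is a minimal sketch of this poll, report, and complete loop using the AWS SDK for Java. It is illustrative only: the worker group value reuses the wg-12345 example above, the doWork helper is hypothetical, and the exact client class name and status strings may differ by SDK version.

import com.amazonaws.services.datapipeline.DataPipelineClient;
import com.amazonaws.services.datapipeline.model.*;

public class MinimalTaskRunner {
    public static void main(String[] args) {
        // The client picks up credentials from the default provider chain
        // (the client class name is an assumption; check your SDK version).
        DataPipelineClient client = new DataPipelineClient();

        while (true) {
            // Long poll for work assigned to our worker group; the call can
            // hold the connection open for up to 90 seconds.
            PollForTaskResult poll = client.pollForTask(
                new PollForTaskRequest().withWorkerGroup("wg-12345"));
            TaskObject task = poll.getTaskObject();
            if (task == null || task.getTaskId() == null) {
                continue; // no task was handed out; poll again
            }

            String taskId = task.getTaskId();
            try {
                // Heartbeat so the service knows the task is still being worked on;
                // a real runner would repeat this call periodically while working.
                client.reportTaskProgress(
                    new ReportTaskProgressRequest().withTaskId(taskId));

                doWork(task); // hypothetical: your activity-specific processing

                client.setTaskStatus(new SetTaskStatusRequest()
                    .withTaskId(taskId).withTaskStatus("FINISHED"));
            } catch (Exception e) {
                client.setTaskStatus(new SetTaskStatusRequest()
                    .withTaskId(taskId).withTaskStatus("FAILED")
                    .withErrorMessage(e.getMessage()));
            }
        }
    }

    private static void doWork(TaskObject task) {
        // Placeholder for the task-specific logic of your custom task runner.
    }
}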

Pipeline Components, Instances, and Attempts

There are three types of items associated with a scheduled pipeline:

- Pipeline components: Pipeline components represent the business logic of the pipeline and are represented by the different sections of a pipeline definition. Pipeline components specify the data sources, activities, schedule, and preconditions of the workflow. They can inherit properties from parent components. Relationships among components are defined by reference. Pipeline components define the rules of data management; they are not a to-do list.
- Instances: When AWS Data Pipeline runs a pipeline, it compiles the pipeline components to create a set of actionable instances. Each instance contains all the information needed to perform a specific task. The complete set of instances is the to-do list of the pipeline. AWS Data Pipeline hands the instances out to task runners to process.
- Attempts: To provide robust data management, AWS Data Pipeline retries a failed operation. It continues to do so until the task reaches the maximum number of allowed retry attempts. Attempt objects track the various attempts, results, and failure reasons if applicable. Essentially, an attempt is an instance with a counter.

Note: Retrying failed tasks is an important part of a fault tolerance strategy, and AWS Data Pipeline pipeline definitions provide conditions and thresholds to control retries. However, too many retries can delay detection of an unrecoverable failure, because AWS Data Pipeline does not report failure until it has exhausted all the retries that you specify. The extra retries may accrue additional charges if they are running on AWS resources. As a result, carefully consider when it is appropriate to exceed the AWS Data Pipeline default settings that you use to control retries and related settings.

Lifecycle of a Pipeline Task

The following diagram illustrates how AWS Data Pipeline and a task runner interact to process a scheduled task.

Get Set Up for AWS Data Pipeline

There are several ways you can interact with AWS Data Pipeline:

- Console: a graphical interface that you can use to create and manage pipelines. With it, you fill out web forms to specify the configuration details of your pipeline components. The AWS Data Pipeline console provides several templates, which are pre-configured pipelines for common scenarios. As you build your pipeline, a graphical representation of the components appears on the design pane. The arrows between the components indicate the connections between them. Using the console is the easiest way to get started with AWS Data Pipeline; it creates the pipeline definition for you, and no JSON or programming knowledge is required. For more information about accessing the console, see Access the Console (p. 12).
- Command Line Interface (CLI): an application you run on your local machine to connect to AWS Data Pipeline and create and manage pipelines. With it, you issue commands in a terminal window and pass in JSON files that specify the pipeline definition. Using the CLI is the best option if you prefer working from a command line. For more information, see Install the Command Line Interface (p. 15).
- Software Development Kit (SDK): AWS provides an SDK with functions that call AWS Data Pipeline to create and manage pipelines. With it, you can write applications that automate the process of creating and managing pipelines. Using the SDK is the best option if you want to extend or customize the functionality of AWS Data Pipeline. You can download the AWS SDK for Java from the AWS website.
- Web Service API: AWS provides a low-level interface that you can use to call the web service directly using JSON. Using the API is the best option if you want to create a custom SDK that calls AWS Data Pipeline. For more information, see the AWS Data Pipeline API Reference.

In addition, there is the Task Runner application, which is a default implementation of a task runner. Depending on the requirements of your data management, you may need to install Task Runner on a computational resource such as a long-running Amazon EC2 instance or a physical server. For more information about when to install Task Runner, see Task Runner (p. 5). For more information about how to install Task Runner, see Deploy and Configure Task Runner (p. 19).

Access the Console

Topics
- Where Do I Go Now? (p. 15)

The AWS Data Pipeline console enables you to do the following:
- Create, save, and activate your pipeline
- View the details of all the pipelines associated with your account
- Modify your pipeline
- Delete your pipeline

You must have an Amazon Web Services (AWS) account to access the AWS Data Pipeline console. When you create an AWS account, AWS automatically signs up the account for all AWS services, including AWS Data Pipeline. With AWS Data Pipeline, you pay only for what you use. For more information about AWS Data Pipeline usage rates, see the AWS Data Pipeline pricing page. If you already have an AWS account, skip to the next step. If you don't have an AWS account, use the following procedure to create one.

To create an AWS account
1. Go to the AWS website and click Sign Up Now.
2. Follow the on-screen instructions. Part of the sign-up process involves receiving a phone call and entering a PIN using the phone keypad.

To access the console
1. Sign in to the AWS Management Console and open the AWS Data Pipeline console.
2. If your account doesn't already have data pipelines, the console displays an introductory screen that prompts you to create your first pipeline. This screen also provides an overview of the process for creating a pipeline, and links to relevant documentation and resources. Click Create Pipeline to create your pipeline.

If you already have pipelines associated with your account, the console displays a page listing all of them. Click Create New Pipeline to create your pipeline.

Where Do I Go Now?

You are now ready to start creating your pipelines. For more information about creating a pipeline, see the following tutorials:
- Tutorial: Copy CSV Data from Amazon S3 to Amazon S3 (p. 25)
- Tutorial: Copy Data From a MySQL Table to Amazon S3 (p. 40)
- Tutorial: Launch an Amazon EMR Job Flow (p. 56)
- Tutorial: Run a Shell Command to Process MySQL Table (p. 107)

Install the Command Line Interface

The AWS Data Pipeline command line interface (CLI) is a tool you can use to create and manage pipelines from a terminal window. It is written in Ruby and makes calls to the web service on your behalf.

Topics
- Install Ruby (p. 15)
- Install the RubyGems package management framework (p. 15)
- Install Prerequisite Ruby Gems (p. 16)
- Install the AWS Data Pipeline CLI (p. 17)
- Locate your AWS Credentials (p. 17)
- Create a Credentials File (p. 18)
- Verify the CLI (p. 18)

Install Ruby

The AWS Data Pipeline CLI requires Ruby 1.8.7. Some operating systems, such as Mac OS, come with Ruby pre-installed.

To verify the Ruby installation and version

To check whether Ruby is installed, and which version, run the following command in a terminal window. If Ruby is installed, this command displays its version information.

ruby -v

If you don't have Ruby installed, use the following procedure to install it.

To install Ruby on Linux/Unix/Mac OS

Download Ruby and follow the installation instructions for your version of Linux/Unix/Mac OS.

Install the RubyGems package management framework

The AWS Data Pipeline CLI requires a version of RubyGems that is compatible with Ruby 1.8.7.

To verify the RubyGems installation and version

To check whether RubyGems is installed, run the following command from a terminal window. If RubyGems is installed, this command displays its version information.

gem -v

If you don't have RubyGems installed, or have a version that is not compatible with Ruby 1.8.7, you need to download and install RubyGems before you can install the AWS Data Pipeline CLI.

To install RubyGems on Linux/Unix/Mac OS
1. Download RubyGems.
2. Install RubyGems using the following command.

sudo ruby setup.rb

Install Prerequisite Ruby Gems

The AWS Data Pipeline CLI requires Ruby 1.8.7 or greater, a compatible version of RubyGems, and the following Ruby gems:
- json (version 1.4 or greater)
- uuidtools (version 2.1 or greater)
- httparty (version 0.7 or greater)
- bigdecimal (version 1.0 or greater)
- nokogiri

The following topics describe how to install the AWS Data Pipeline CLI and the Ruby environment it requires. Use the following procedures to ensure that each of the gems listed above is installed.

To verify whether a gem is installed

To check whether a gem is installed, run the following command from a terminal window. For example, if 'uuidtools' is installed, this command displays the name and version of the 'uuidtools' RubyGem.

gem search 'uuidtools'

If you don't have 'uuidtools' installed, you need to install it before you can install the AWS Data Pipeline CLI.

To install 'uuidtools' on Windows/Linux/Unix/Mac OS

Install 'uuidtools' using the following command.

sudo gem install uuidtools

Install the AWS Data Pipeline CLI

After you have verified the installation of your Ruby environment, you're ready to install the AWS Data Pipeline CLI.

To install the AWS Data Pipeline CLI on Windows/Linux/Unix/Mac OS
1. Download datapipeline-cli.zip.
2. Unzip the compressed file. For example, on Linux/Unix/Mac OS use the following command:

unzip datapipeline-cli.zip

This uncompresses the CLI and supporting code into a new directory called dp-cli.
3. If you add the new directory, dp-cli, to your PATH variable, you can use the CLI without specifying the complete path. In this guide, we assume that you've updated your PATH variable, or that you run the CLI from the directory where it is installed.

Locate your AWS Credentials

When you create an AWS account, AWS assigns you an access key ID and a secret access key. AWS uses these credentials to identify you when you interact with a web service. You need these keys for the next step of the CLI installation process.

Note: Your secret access key is a shared secret between you and AWS. Keep this key secret; we use it to bill you for the AWS services that you use. Never include the key in your requests to AWS, and never email it to anyone, even if a request appears to originate from AWS or Amazon.com. No one who legitimately represents Amazon will ever ask you for your secret access key.

The following procedure explains how to locate your access key ID and secret access key in the AWS Management Console.

To view your AWS access credentials
1. Go to the Amazon Web Services website.
2. Click My Account/Console, and then click Security Credentials.
3. Under Your Account, click Security Credentials.
4. In the spaces provided, type your user name and password, and then click Sign in using our secure server.
5. Under Access Credentials, on the Access Keys tab, your access key ID is displayed. To view your secret key, under Secret Access Key, click Show.

Make a note of your access key ID and your secret access key; you will use them in the next section.

Create a Credentials File

When you request services from AWS Data Pipeline, you must pass your credentials with the request so that AWS can properly authenticate and eventually bill you. The command line interface obtains your credentials from a JSON document called a credentials file, which is stored in your home directory (~/). Using a credentials file is the simplest way to make your AWS credentials available to the AWS Data Pipeline CLI.

The credentials file contains the following name-value pairs:
- comment: An optional comment within the credentials file.
- access-id: The access key ID for your AWS account.
- private-key: The secret access key for your AWS account.
- endpoint: The endpoint for AWS Data Pipeline in the region where you are making requests.
- log-uri: The location of the Amazon S3 bucket where AWS Data Pipeline writes log files.

In the following example credentials file, AKIAIOSFODNN7EXAMPLE represents an access key ID, and wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY represents the corresponding secret access key. The value of log-uri specifies the location of your Amazon S3 bucket and the path to the log files for actions performed by the AWS Data Pipeline web service on behalf of your pipeline.

{
  "access-id": "AKIAIOSFODNN7EXAMPLE",
  "private-key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
  "endpoint": "datapipeline.us-east-1.amazonaws.com",
  "port": "443",
  "use-ssl": "true",
  "region": "us-east-1",
  "log-uri": "s3://myawsbucket/logfiles"
}

After you replace the values for the access-id, private-key, and log-uri fields with the appropriate information, save the file as credentials.json in your home directory (~/).

Verify the CLI

To verify that the command line interface (CLI) is installed, use the following command.

datapipeline --help

If the CLI is installed correctly, this command displays the list of commands for the CLI.

Deploy and Configure Task Runner

Task Runner is a task runner application that polls AWS Data Pipeline for scheduled tasks and processes the tasks assigned to it by the web service, reporting status as it does so. Depending on your application, you may choose to:
- Have AWS Data Pipeline install and manage one or more Task Runner applications for you on computational resources managed by the web service. In this case, you do not need to install or configure Task Runner.
- Manually install and configure Task Runner on a computational resource such as a long-running Amazon EC2 instance or a physical server. To do so, use the following procedures.
- Manually install and configure a custom task runner instead of Task Runner. The procedures for doing so depend on the implementation of the custom task runner.

For more information about Task Runner and when and where it should be configured, see Task Runner (p. 5).

Note: You can only install Task Runner on Linux, UNIX, or Mac OS. Task Runner is not supported on the Windows operating system.

Topics
- Install Java (p. 19)
- Install Task Runner (p. 235)
- Start Task Runner (p. 235)
- Verify Task Runner (p. 236)

Install Java

Task Runner requires Java version 1.6 or later. To determine whether Java is installed, and the version that is running, use the following command:

java -version

If you do not have Java 1.6 or later installed on your computer, you can download the latest version.

Install Task Runner

To install Task Runner, download TaskRunner-1.0.jar and copy it into a folder. Additionally, download the mysql-connector-java bin JAR and copy it into the same folder where you install Task Runner.

Start Task Runner

In a new command prompt window that is set to the directory where you installed Task Runner, start Task Runner with the following command. The --config option points to your credentials file. The --workerGroup option specifies the name of your worker group, which must be the same value as specified in your pipeline for tasks to be processed.

java -jar TaskRunner-1.0.jar --config ~/credentials.json --workerGroup=myworkergroup

When Task Runner is active, it prints the path where log files are written in the terminal window. The following is an example.

Logging to /mycomputer/.../dist/output/logs

Warning: If you close the terminal window, or interrupt the command with CTRL+C, Task Runner stops, which halts the pipeline runs.

Verify Task Runner

The easiest way to verify that Task Runner is working is to check whether it is writing log files. The log files are stored in the directory where you started Task Runner. When you check the logs, make sure that you are checking logs for the current date and time. Task Runner creates a new log file each hour, where the hour from midnight to 1 AM is 00. The format of the log file name is TaskRunner.log.YYYY-MM-DD-HH, where HH runs from 00 to 23, in UTC. To save storage space, any log files older than eight hours are compressed with GZip.

Install the AWS SDK

The easiest way to write applications that interact with AWS Data Pipeline or to implement a custom task runner is to use one of the AWS SDKs. The AWS SDKs provide functionality that simplifies calling the web service APIs from your preferred programming environment. For more information about the programming languages and development environments that have AWS SDK support, see the AWS SDK listings. If you are not writing programs that interact with AWS Data Pipeline, you do not need to install any of the AWS SDKs; you can create and run pipelines using the console or command line interface. This guide provides examples of programming AWS Data Pipeline using Java. The following are examples of how to download and install the AWS SDK for Java.

To install the AWS SDK for Java using Eclipse

Install the AWS Toolkit for Eclipse. Eclipse is a popular Java development environment. The AWS Toolkit for Eclipse installs the latest version of the AWS SDK for Java. From Eclipse, you can easily modify, build, and run any of the samples included in the SDK.

To install the AWS SDK for Java

If you are using a Java development environment other than Eclipse, download and install the AWS SDK for Java.
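After the SDK is installed, a quick way to confirm that your credentials and classpath are set up is to list the pipelines in your account. The following is a minimal sketch, assuming the AWS SDK for Java Data Pipeline client; the exact client class name, credential setup, and endpoint shown here are assumptions and may differ by SDK version and region.

import com.amazonaws.services.datapipeline.DataPipelineClient;
import com.amazonaws.services.datapipeline.model.ListPipelinesRequest;
import com.amazonaws.services.datapipeline.model.ListPipelinesResult;
import com.amazonaws.services.datapipeline.model.PipelineIdName;

public class ListMyPipelines {
    public static void main(String[] args) {
        // Uses the default credential provider chain (environment variables,
        // Java system properties, or a shared credentials file).
        DataPipelineClient client = new DataPipelineClient();
        client.setEndpoint("https://datapipeline.us-east-1.amazonaws.com");

        // List the pipeline IDs and names associated with the account.
        ListPipelinesResult result = client.listPipelines(new ListPipelinesRequest());
        for (PipelineIdName pipeline : result.getPipelineIdList()) {
            System.out.println(pipeline.getId() + " : " + pipeline.getName());
        }
    }
}

If this prints the pipelines you see in the console (or nothing, for a new account) without throwing an authentication error, the SDK and your credentials are configured correctly.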

Granting Permissions to Pipelines with IAM

In AWS Data Pipeline, IAM roles determine what your pipeline can access and the actions it can perform. Additionally, when your pipeline creates a resource, such as an Amazon EC2 instance, IAM roles determine the EC2 instance's permitted resources and actions. When you create a pipeline, you specify one IAM role that governs your pipeline and another IAM role to govern your pipeline's resources (referred to as a "resource role"); these can be the same role. Carefully consider the minimum permissions necessary for your pipeline to perform work and define the IAM roles accordingly.

It is important to note that even a modest pipeline might need access to resources and actions in various areas of AWS, for example:
- Accessing files in Amazon S3
- Creating and managing Amazon EMR clusters
- Creating and managing Amazon EC2 instances
- Accessing data in Amazon RDS or Amazon DynamoDB
- Sending notifications using Amazon SNS

When you use the AWS Data Pipeline console, you can choose a pre-defined, default IAM role and resource role or create new ones to suit your needs. However, when using the AWS Data Pipeline CLI, you must create a new IAM role and apply a policy to it yourself, for which you can use the following example policy. For more information about how to create a new IAM role and apply a policy to it, see Managing IAM Policies in the Using IAM guide.

Warning: Carefully review and restrict permissions in the following example policy to only the resources that your pipeline requires.

{
  "Statement": [
    {
      "Action": [ "s3:*" ],
      "Effect": "Allow",
      "Resource": [ "*" ]
    },
    {
      "Action": [
        "ec2:DescribeInstances",
        "ec2:RunInstances",
        "ec2:StartInstances",
        "ec2:StopInstances",
        "ec2:TerminateInstances"
      ],
      "Effect": "Allow",
      "Resource": [ "*" ]
    },
    {
      "Action": [ "elasticmapreduce:*" ],
      "Effect": "Allow",
      "Resource": [ "*" ]
    },
    {
      "Action": [ "dynamodb:*" ],
      "Effect": "Allow",
      "Resource": [ "*" ]
    },
    {
      "Action": [
        "rds:DescribeDBInstances",
        "rds:DescribeDBSecurityGroups"
      ],
      "Effect": "Allow",
      "Resource": [ "*" ]
    },
    {
      "Action": [
        "sns:GetTopicAttributes",
        "sns:ListTopics",
        "sns:Publish",
        "sns:Subscribe",
        "sns:Unsubscribe"
      ],
      "Effect": "Allow",
      "Resource": [ "*" ]
    },
    {
      "Action": [ "iam:PassRole" ],
      "Effect": "Allow",
      "Resource": [ "*" ]
    },
    {
      "Action": [ "datapipeline:*" ],
      "Effect": "Allow",
      "Resource": [ "*" ]
    }
  ]
}

After you define a role and apply its policy, you define a trusted entities list, which indicates the entities or services that are permitted to use your new role. You can use the following IAM trust relationship definition to allow AWS Data Pipeline and Amazon EC2 to use your new pipeline and resource roles. For more information about editing IAM trust relationships, see Modifying a Role in the Using IAM guide.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "ec2.amazonaws.com",
          "datapipeline.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Grant Amazon RDS Permissions to Task Runner

Amazon RDS allows you to control access to your DB Instances using database security groups (DB Security Groups). A DB Security Group acts like a firewall controlling network access to your DB Instance. By default, network access is turned off to your DB Instances. You must modify your DB Security Groups to let Task Runner access your Amazon RDS instances. Task Runner gains Amazon RDS access from the instance on which it runs, so the accounts and security groups that you add to your Amazon RDS instance depend on where you install Task Runner.

To grant permissions to Task Runner
1. Sign in to the AWS Management Console and open the Amazon RDS console.
2. In the Amazon RDS: My DB Security Groups pane, click your Amazon RDS instance. In the DB Security Group pane, under Connection Type, select EC2 Security Group. Configure the fields in the EC2 Security Group pane as described below.

For a Task Runner running on an EC2 resource:
- AWS Account Id: Your AccountId
- EC2 Security Group: Your Security Group

For a Task Runner running on an EMR resource:
- AWS Account Id: Your AccountId
- EC2 Security Group: ElasticMapReduce-master
- AWS Account Id: Your AccountId
- EC2 Security Group: ElasticMapReduce-slave

For a Task Runner running in your local environment (on-premises):
- CIDR: The IP address range of your on-premises machine, or of your firewall if your on-premises computer is behind a firewall.

To allow connections from an RdsSqlPrecondition:
- AWS Account Id:
- EC2 Security Group: DataPipeline

Tutorial: Copy CSV Data from Amazon S3 to Amazon S3

After you read What is AWS Data Pipeline? (p. 1) and decide you want to use AWS Data Pipeline to automate the movement and transformation of your data, it is time to get started creating data pipelines. To help you make sense of how AWS Data Pipeline works, let's walk through a simple task.

This tutorial walks you through the process of creating a data pipeline to copy data from one Amazon S3 bucket to another and then send an Amazon SNS notification after the copy activity completes successfully. You use an Amazon EC2 instance resource managed by AWS Data Pipeline for this copy activity.

Important: This tutorial does not employ the Amazon S3 API for high-speed data transfer between Amazon S3 buckets. It is intended only for demonstration purposes to help new customers understand how to create a simple pipeline and the related concepts. For advanced information about data transfer using Amazon S3, see Working with Buckets in the Amazon S3 Developer Guide.

The first step in the pipeline creation process is to select the pipeline objects that make up your pipeline definition. After you select the pipeline objects, you add fields for each object. For more information, see Pipeline Definition (p. 2). This tutorial uses the following objects to create a pipeline definition:

Activity: The activity that AWS Data Pipeline performs for this pipeline. This tutorial uses the CopyActivity object to copy CSV data from one Amazon S3 bucket to another. Important: There are distinct limitations regarding the CSV file format with CopyActivity and S3DataNode. For more information, see CopyActivity (p. 180).

Schedule: The start date, time, and the recurrence for this activity. You can optionally specify the end date and time.

Resource: The resource AWS Data Pipeline must use to perform this activity. This tutorial uses Ec2Resource, an Amazon EC2 instance provided by AWS Data Pipeline, to copy data. AWS Data Pipeline automatically launches the Amazon EC2 instance and then terminates the instance after the task finishes.

DataNodes: The input and output nodes for this pipeline. This tutorial uses S3DataNode for both the input and output nodes.

Action: The action AWS Data Pipeline must take when the specified conditions are met. This tutorial uses the SnsAlarm action to send Amazon SNS notifications to the address you specify after the task finishes successfully. You must subscribe to the Amazon SNS topic ARN to receive the notifications.

The following steps outline how to create a data pipeline to copy data from one Amazon S3 bucket to another Amazon S3 bucket.
1. Create your pipeline definition
2. Validate and save your pipeline definition
3. Activate your pipeline
4. Monitor the progress of your pipeline
5. [Optional] Delete your pipeline

Before You Begin...

Be sure you've completed the following steps.
- Set up an Amazon Web Services (AWS) account to access the AWS Data Pipeline console. For more information, see Access the Console (p. 12).
- Set up the AWS Data Pipeline tools and interface you plan on using. For more information about the interfaces and tools you can use to interact with AWS Data Pipeline, see Get Set Up for AWS Data Pipeline (p. 12).
- Create an Amazon S3 bucket as a data source. For more information, see Create a Bucket in the Amazon Simple Storage Service Getting Started Guide.
- Upload your data to your Amazon S3 bucket. For more information, see Add an Object to a Bucket in the Amazon Simple Storage Service Getting Started Guide.
- Create another Amazon S3 bucket as a data target.
- Create an Amazon SNS topic for sending notifications and make a note of the topic Amazon Resource Name (ARN). For more information, see Create a Topic in the Amazon Simple Notification Service Getting Started Guide.
- [Optional] This tutorial uses the default IAM role policies created by AWS Data Pipeline. If you would rather create and configure your own IAM role policy and trust relationships, follow the instructions described in Granting Permissions to Pipelines with IAM (p. 21).

Note: Some of the actions described in this tutorial can generate AWS usage charges, depending on whether you are using the AWS Free Usage Tier.

Using the AWS Data Pipeline Console

Topics
- Create and Configure the Pipeline Definition Objects (p. 27)
- Validate and Save Your Pipeline (p. 30)
- Verify your Pipeline Definition (p. 30)
- Activate your Pipeline (p. 31)
- Monitor the Progress of Your Pipeline Runs (p. 31)
- [Optional] Delete your Pipeline (p. 33)

The following sections include the instructions for creating the pipeline using the AWS Data Pipeline console.

To create your pipeline definition
1. Sign in to the AWS Management Console and open the AWS Data Pipeline console.
2. Click Create Pipeline.
3. On the Create a New Pipeline page:
   a. In the Pipeline Name box, enter a name (for example, CopyMyS3Data).
   b. In Pipeline Description, enter a description.
   c. Leave the Select Schedule Type button set to the default type for this tutorial.
      Note: Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval, and Cron Style Scheduling means instances are scheduled at the beginning of each interval.
   d. Leave the Role boxes set to their default values for this tutorial.
      Note: If you have created your own IAM roles and would like to use them in this tutorial, you can select them now.
   e. Click Create a new pipeline.

Create and Configure the Pipeline Definition Objects

Next, you define the Activity object in your pipeline definition. When you define the Activity object, you also define the objects that AWS Data Pipeline must use to perform this activity.
1. On the Pipeline: name of your pipeline page, select Add activity.
2. In the Activities pane:
   a. Enter the name of the activity; for example, copy-mys3-data.
   b. In the Type box, select CopyActivity.
   c. In the Input box, select Create new: DataNode.
   d. In the Output box, select Create new: DataNode.
   e. In the Schedule box, select Create new: Schedule.

   f. In the Add an optional field... box, select RunsOn.
   g. In the Runs On box, select Create new: Resource.
   h. In the Add an optional field... box, select On Success.
   i. In the On Success box, select Create new: Action.
   j. In the left pane, separate the icons by dragging them apart.

You've completed defining your pipeline definition by specifying the objects AWS Data Pipeline uses to perform the copy activity. The Pipeline: name of your pipeline pane shows the graphical representation of the pipeline you just created. The arrows indicate the connections between the various objects.

Next, configure the run date and time for your pipeline.

To configure the run date and time for your pipeline
1. On the Pipeline: name of your pipeline page, in the right pane, click Schedules.
2. In the Schedules pane:
   a. Enter a schedule name for this activity (for example, copy-mys3-data-schedule).
   b. In the Type box, select Schedule.
   c. In the Start Date Time box, select the date from the calendar, and then enter the time to start the activity.
      Note: AWS Data Pipeline supports the date and time expressed in "YYYY-MM-DDTHH:MM:SS" format in UTC/GMT only.
   d. In the Period box, enter the duration for the activity (for example, 1), and then select the period category (for example, Days).
   e. [Optional] To specify the date and time to end the activity, in the Add an optional field box, select endDateTime, and enter the date and time.

To get your pipeline to launch immediately, set Start Date Time to a date one day in the past. AWS Data Pipeline then starts launching the "past due" runs immediately in an attempt to address what it perceives as a backlog of work. This backfilling means you don't have to wait an hour to see AWS Data Pipeline launch its first job flow.

Next, configure the input and output data nodes for your pipeline.

To configure the input and output data nodes of your pipeline
1. On the Pipeline: name of your pipeline page, in the right pane, click DataNodes.
2. In the DataNodes pane:
   a. In the DefaultDataNode1 box, enter the name for your input node (for example, MyS3Input). In this tutorial, your input node is the Amazon S3 data source bucket.
   b. In the Type box, select S3DataNode.
   c. In the Schedule box, select copy-mys3-data-schedule.
   d. In the Add an optional field... box, select File Path.
   e. In the File Path box, enter the path to your Amazon S3 bucket (for example, s3://my-data-pipeline-input/name of your data file).
   f. In the DefaultDataNode2 box, enter the name for your output node (for example, MyS3Output). In this tutorial, your output node is the Amazon S3 data target bucket.
   g. In the Type box, select S3DataNode.
   h. In the Schedule box, select copy-mys3-data-schedule.
   i. In the Add an optional field... box, select File Path.
   j. In the File Path box, enter the path to your Amazon S3 bucket (for example, s3://my-data-pipeline-output/name of your data file).

Next, configure the resource AWS Data Pipeline must use to perform the copy activity.

To configure the resource
1. On the Pipeline: name of your pipeline page, in the right pane, click Resources.
2. In the Resources pane:
   a. In the Name box, enter the name for your resource (for example, CopyDataInstance).
   b. In the Type box, select Ec2Resource.
   c. In the Schedule box, select copy-mys3-data-schedule.
   d. Leave the Role and Resource Role boxes set to their default values for this tutorial.
      Note: If you have created your own IAM roles, you can select them now.

Next, configure the SNS notification action that AWS Data Pipeline must perform after the copy activity finishes successfully.

To configure the SNS notification action
1. On the Pipeline: name of your pipeline page, in the right pane, click Others.
2. In the Others pane:
   a. In the DefaultAction1 box, enter the name for your Amazon SNS notification (for example, CopyDataNotice).
   b. In the Type box, select SnsAlarm.

   c. In the Topic Arn box, enter the ARN of your Amazon SNS topic.
   d. In the Message box, enter the message content.
   e. In the Subject box, enter the subject line for your notification.
   f. Leave the Role box set to the default value for this tutorial.

You have now completed all the steps required for creating your pipeline definition. Next, validate and save your pipeline.

Validate and Save Your Pipeline

You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition. If your pipeline is incomplete or incorrect, AWS Data Pipeline throws a validation error. If you plan to continue the creation process later, you can ignore the error message. If your pipeline definition is complete and you are getting the validation error message, you have to fix the errors in the pipeline definition before activating your pipeline.

To validate and save your pipeline
1. On the Pipeline: name of your pipeline page, click Save Pipeline.
2. AWS Data Pipeline validates your pipeline definition and returns either a success or an error message. If you get an error message, click Close and then, in the right pane, click Errors.
3. The Errors pane lists the objects failing validation. Click the plus (+) sign next to the object names and look for an error message in red.
4. When you see the error message, click the specific object pane where you see the error and fix it. For example, if you see an error message in the DataNodes object, click the DataNodes pane to fix the error.
5. After you've fixed the errors listed in the Errors pane, click Save Pipeline.
6. Repeat the process until your pipeline is validated.

Next, verify that your pipeline definition has been saved.

Verify your Pipeline Definition

It is important to verify that your pipeline was correctly initialized from your definitions before you activate it.

To verify your pipeline definition
1. On the Pipeline: name of your pipeline page, click Back to list of pipelines.
2. On the List Pipelines page, check whether your newly-created pipeline is listed. AWS Data Pipeline has created a unique Pipeline ID for your pipeline definition. The Status column in the row listing your pipeline should show PENDING.
3. Click the triangle icon next to your pipeline. The Pipeline Summary panel below shows the details of your pipeline runs. Because your pipeline is not yet activated, you should see only 0s at this point.
4. In the Pipeline summary panel, click View all fields to see the configuration of your pipeline definition.

5. Click Close.

Next, activate your pipeline.

Activate your Pipeline

You must activate your pipeline to start creating and processing runs based on the specifications in your pipeline definition.

To activate your pipeline
1. On the List Pipelines page, in the Details column of your pipeline, click View pipeline.
2. On the Pipeline: name of your pipeline page, click Activate. A confirmation dialog box opens, confirming the activation.
3. Click Close.

Next, verify that your pipeline is running.

Monitor the Progress of Your Pipeline Runs

To monitor the progress of your pipeline
1. On the List Pipelines page, in the Details column of your pipeline, click View instance details.

2. The Instance details: name of your pipeline page lists the status of each instance.
   Note: If you do not see runs listed, depending on when your pipeline was scheduled, either click the End (in UTC) date box and change it to a later date or click the Start (in UTC) date box and change it to an earlier date. Then click Update.
3. If the Status column of all the instances in your pipeline indicates FINISHED, your pipeline has successfully completed the copy activity. You should receive an email about the successful completion of this task at the address you specified for receiving your Amazon SNS notification. You can also check your Amazon S3 data target bucket to verify that the data was copied.
4. If the Status column of any of your instances indicates a status other than FINISHED, either your pipeline is waiting for some precondition to be met or it has failed.
   a. To troubleshoot failed or incomplete instance runs, click the triangle next to an instance; the Instance summary panel opens to show the details of the selected instance.
   b. In the Instance summary pane, click View instance fields to see details of the fields associated with the selected instance. If the status of your selected instance is FAILED, the details box has an entry indicating the reason for failure (for example, "Resource not healthy terminated").
   c. In the Instance summary pane, in the Select attempt for this instance box, select the attempt number.
   d. In the Instance summary pane, click View attempt fields to see details of the fields associated with the selected attempt.
5. To take an action on your incomplete or failed instance, select an action (Rerun, Cancel, or Mark Finished) from the Action column of the instance.

You can use the information in the Instance summary pane and the View instance fields box to troubleshoot issues with your failed pipeline. For more information about instance status, see Interpret Pipeline Status Details (p. 129). For more information about troubleshooting failed or incomplete instance runs of your pipeline, see AWS Data Pipeline Problems and Solutions (p. 131).
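If you prefer to check run status programmatically rather than in the console, the SDK exposes the same instance information. The following is a minimal sketch, assuming the AWS SDK for Java Data Pipeline client; the pipeline ID is hypothetical, and the "@status" field key and the client class name are assumptions that may differ by SDK version.

import com.amazonaws.services.datapipeline.DataPipelineClient;
import com.amazonaws.services.datapipeline.model.*;

public class CheckInstanceStatus {
    public static void main(String[] args) {
        DataPipelineClient client = new DataPipelineClient();
        String pipelineId = "df-00000000EXAMPLE"; // hypothetical pipeline ID

        // Ask for the instance objects that the service compiled from the pipeline.
        QueryObjectsResult query = client.queryObjects(new QueryObjectsRequest()
            .withPipelineId(pipelineId)
            .withSphere("INSTANCE"));
        if (query.getIds() == null || query.getIds().isEmpty()) {
            System.out.println("No instances scheduled yet.");
            return;
        }

        // Fetch the runtime fields (including status) for each instance.
        DescribeObjectsResult described = client.describeObjects(new DescribeObjectsRequest()
            .withPipelineId(pipelineId)
            .withObjectIds(query.getIds()));

        for (PipelineObject obj : described.getPipelineObjects()) {
            for (Field field : obj.getFields()) {
                if ("@status".equals(field.getKey())) {
                    System.out.println(obj.getId() + " : " + field.getStringValue());
                }
            }
        }
    }
}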

Important: Your pipeline is running and is incurring charges. For more information, see AWS Data Pipeline pricing. If you would like to stop incurring AWS Data Pipeline usage charges, delete your pipeline.

[Optional] Delete your Pipeline

Deleting your pipeline deletes the pipeline definition, including all the associated objects. You stop incurring charges as soon as your pipeline is deleted.

To delete your pipeline
1. On the List Pipelines page, click the check box next to your pipeline.
2. Click Delete.
3. In the confirmation dialog box, click Delete to confirm the delete request.

Using the Command Line Interface

Topics
- Define a Pipeline in JSON Format (p. 33)
- Upload the Pipeline Definition (p. 38)
- Activate the Pipeline (p. 39)
- Verify the Pipeline Status (p. 39)

The following topics explain how to use the AWS Data Pipeline CLI to create and use pipelines to copy data from one Amazon S3 bucket to another. In this example, we perform the following steps:
- Create a pipeline definition using the CLI in JSON format
- Create the necessary IAM roles and define a policy and trust relationships
- Upload the pipeline definition using the AWS Data Pipeline CLI tools
- Monitor the progress of the pipeline

Define a Pipeline in JSON Format

This example scenario shows how to use JSON pipeline definitions and the AWS Data Pipeline CLI to schedule copying data between two Amazon S3 buckets at a specific time interval. This is the full pipeline definition JSON file, followed by an explanation for each of its sections.

Note: We recommend that you use a text editor that can help you verify the syntax of JSON-formatted files, and name the file using the .json file extension.

{
  "objects": [
    {
      "id": "MySchedule",
      "type": "Schedule",
      "startDateTime": "2012-11-22T00:00:00",
      "endDateTime": "2012-11-23T00:00:00",
      "period": "1 day"
    },
    {
      "id": "S3Input",
      "type": "S3DataNode",
      "schedule": { "ref": "MySchedule" },
      "filePath": "s3://testbucket/file.txt"
    },
    {
      "id": "S3Output",
      "type": "S3DataNode",
      "schedule": { "ref": "MySchedule" },
      "filePath": "s3://testbucket/file-copy.txt"
    },
    {
      "id": "MyEC2Resource",
      "type": "Ec2Resource",
      "schedule": { "ref": "MySchedule" },
      "actionOnTaskFailure": "terminate",
      "actionOnResourceFailure": "retryAll",
      "maximumRetries": "1",
      "role": "test-role",
      "resourceRole": "test-role",
      "instanceType": "m1.medium",
      "instanceCount": "1",
      "securityGroups": [ "test-group", "default" ],
      "keyPair": "test-pair"
    },
    {
      "id": "MyCopyActivity",
      "type": "CopyActivity",
      "runsOn": { "ref": "MyEC2Resource" },
      "input": { "ref": "S3Input" },
      "output": { "ref": "S3Output" },
      "schedule": { "ref": "MySchedule" }
    }
  ]
}

Schedule

The example AWS Data Pipeline JSON file begins with a section to define the schedule by which to copy the data. Many pipeline components have a reference to a schedule, and you may have more than one. The Schedule component is defined by the following fields:

{
  "id": "MySchedule",
  "type": "Schedule",
  "startDateTime": "2012-11-22T00:00:00",
  "endDateTime": "2012-11-23T00:00:00",
  "period": "1 day"
}

Note: In the JSON file, you can define the pipeline components in any order you prefer. In this example, we chose the order that best illustrates the pipeline component dependencies.

- Name: The user-defined name for the pipeline schedule, which is a label for your reference only.
- Type: The pipeline component type, which is Schedule.
- startDateTime: The date/time (in UTC format) that you want the task to begin.
- endDateTime: The date/time (in UTC format) that you want the task to stop.
- period: The time period that you want to pass between task attempts, even if the task occurs only one time. The period must evenly divide the time between startDateTime and endDateTime. In this example, we set the period to 1 day so that the pipeline copy operation can run only one time.

Amazon S3 Data Nodes

Next, the input S3DataNode pipeline component defines a location for the input files; in this case, an Amazon S3 bucket location. The input S3DataNode component is defined by the following fields:

{
  "id" : "S3Input",
  "type" : "S3DataNode",
  "schedule" : { "ref" : "MySchedule" },
  "filePath" : "s3://testbucket/file.txt"
}

- Name: The user-defined name for the input location (a label for your reference only).
- Type: The pipeline component type, which is "S3DataNode", to match the location where the data resides, in an Amazon S3 bucket.

Schedule
A reference to the schedule component that we created in the preceding lines of the JSON file, labeled MySchedule.
Path
The path to the data associated with the data node. The syntax for a data node is determined by its type. For example, a data node for a database follows a different syntax that is appropriate for a database table.

Next, the output S3DataNode component defines the output destination location for the data. It follows the same format as the input S3DataNode component, except for the name of the component and a different path to indicate the target file.

{
  "id": "S3Output",
  "type": "S3DataNode",
  "schedule": { "ref": "MySchedule" },
  "filepath": "s3://testbucket/file-copy.txt"
}

Resource

This is a definition of the computational resource that performs the copy operation. In this example, AWS Data Pipeline should automatically create an EC2 instance to perform the copy task and terminate the resource after the task completes. The fields defined here control the creation and function of the Amazon EC2 instance that does the work. The Ec2Resource is defined by the following fields:

{
  "id": "MyEC2Resource",
  "type": "Ec2Resource",
  "schedule": { "ref": "MySchedule" },
  "actionontaskfailure": "terminate",
  "actiononresourcefailure": "retryall",
  "maximumretries": "1",
  "role": "test-role",
  "resourcerole": "test-role",
  "instancetype": "m1.medium",
  "instancecount": "1",
  "securitygroups": [ "test-group", "default" ],
  "keypair": "test-pair"
}

Name
The user-defined name for the resource, which is a label for your reference only.
Type
The type of computational resource to perform work; in this case, an Amazon EC2 instance. There are other resource types available, such as an EmrCluster type.
Schedule
The schedule on which to create this computational resource.
actionontaskfailure
The action to take on the Amazon EC2 instance if the task fails. In this case, the instance should terminate so that each failed attempt to copy data does not leave behind idle, abandoned Amazon EC2 instances with no work to perform. These instances require manual termination by an administrator.

actiononresourcefailure
The action to perform if the resource is not created successfully. In this case, retry the creation of an Amazon EC2 instance until it is successful.
maximumretries
The number of times to retry the creation of this computational resource before marking the resource as a failure and stopping any further creation attempts. This setting works in conjunction with the actiononresourcefailure field.
Role
The IAM role of the account that accesses resources, such as accessing an Amazon S3 bucket to retrieve data.
resourcerole
The IAM role of the account that creates resources, such as creating and configuring an Amazon EC2 instance on your behalf. Role and ResourceRole can be the same role, but specifying them separately provides greater granularity in your security configuration.
instancetype
The size of the Amazon EC2 instance to create. Ensure that you set the size of EC2 instance that best matches the load of the work that you want to perform with AWS Data Pipeline. In this case, we set an m1.medium EC2 instance. For more information about the different instance types and when to use each one, see the Amazon EC2 Instance Types topic.
instancecount
The number of Amazon EC2 instances in the computational resource pool to service any pipeline components depending on this resource.
securitygroups
The Amazon EC2 security groups to grant access to this EC2 instance. As the example shows, you can provide more than one group at a time (test-group and default).
keypair
The name of the SSH public/private key pair to log in to the Amazon EC2 instance. For more information, see Amazon EC2 Key Pairs.

Activity

The last section in the JSON file is the definition of the activity that represents the work to perform. This example uses CopyActivity to copy data from a file in an Amazon S3 bucket to another file. The CopyActivity component is defined by the following fields:

{
  "id": "MyCopyActivity",
  "type": "CopyActivity",
  "runson": { "ref": "MyEC2Resource" },
  "input": { "ref": "S3Input" },
  "output": { "ref": "S3Output" },
  "schedule": { "ref": "MySchedule" }
}

Name
The user-defined name for the activity, which is a label for your reference only.
Type
The type of activity to perform; in this case, CopyActivity.
runson
The computational resource that performs the work that this activity defines. In this example, we provide a reference to the Amazon EC2 instance defined previously. Using the runson field causes AWS Data Pipeline to create the EC2 instance for you. The runson field indicates that the resource exists in the AWS infrastructure, while the workergroup value indicates that you want to use your own on-premises resources to perform the work.
Schedule
The schedule on which to run this activity.
Input
The location of the data to copy.
Output
The target location for the data.

Upload the Pipeline Definition

You can upload a pipeline definition file using the AWS Data Pipeline CLI tools. For more information, see Install the Command Line Interface (p. 15).

To upload your pipeline definition, use the following command.

On Linux/Unix/Mac OS:

./datapipeline --create pipeline_name --put pipeline_file

On Windows:

ruby datapipeline --create pipeline_name --put pipeline_file

Where pipeline_name is the label for your pipeline and pipeline_file is the full path and file name for the file with the .json file extension that defines your pipeline.

If your pipeline validates successfully, you receive the following message:

Pipeline with name pipeline_name and id df-akiaiosfodnn7example created.
Pipeline definition pipeline_file.json uploaded.

Note
For more information about any errors returned by the create command or other commands, see Troubleshoot AWS Data Pipeline (p. 128).

Ensure that your pipeline appears in the pipeline list by using the following command.

On Linux/Unix/Mac OS:

./datapipeline --list-pipelines

On Windows:

ruby datapipeline --list-pipelines

The list of pipelines includes details such as Id, State, and UserId. Take note of your pipeline ID, because you use this value for most AWS Data Pipeline CLI commands. The pipeline ID is a unique identifier using the format df-akiaiosfodnn7example.
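Putting the upload and listing steps together, a hypothetical session on Linux/Unix/Mac OS might look like the following sketch. The pipeline name (MyS3CopyPipeline) and the definition file path (my-s3-copy-pipeline.json) are placeholders for your own values.

./datapipeline --create MyS3CopyPipeline --put my-s3-copy-pipeline.json
./datapipeline --list-pipelines

The first command both registers the pipeline name and uploads the JSON definition; the second confirms that the new pipeline appears in the list and shows the pipeline ID that you use in later commands.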

Activate the Pipeline

You must activate the pipeline, by using the --activate command-line parameter, before it will begin performing work. Use the following command.

On Linux/Unix/Mac OS:

./datapipeline --activate --id df-akiaiosfodnn7example

On Windows:

ruby datapipeline --activate --id df-akiaiosfodnn7example

Where df-akiaiosfodnn7example is the identifier for your pipeline.

Verify the Pipeline Status

View the status of your pipeline and its components, along with its activity attempts and retries, with the following command.

On Linux/Unix/Mac OS:

./datapipeline --list-runs --id df-akiaiosfodnn7example

On Windows:

ruby datapipeline --list-runs --id df-akiaiosfodnn7example

Where df-akiaiosfodnn7example is the identifier for your pipeline. The --list-runs command displays a list of pipeline components and details such as Scheduled Start, Status, ID, Started, and Ended.

Note
It is important to note the difference between the Scheduled Start date/time and the Started time. It is possible to schedule a pipeline component to run at a certain time (Scheduled Start), but the actual start time (Started) could be later due to problems or delays with preconditions, dependencies, failures, or retries.

Note
AWS Data Pipeline may backfill a pipeline, which happens when you define a Scheduled Start date/time for a date in the past. In that situation, AWS Data Pipeline immediately runs the pipeline components the number of times the activity should have run if it had started on the Scheduled Start time. When this happens, you see pipeline components run back-to-back at a greater frequency than the period value that you specified when you created the pipeline. AWS Data Pipeline returns your pipeline to the defined frequency only when it catches up to the number of past runs.

Successful pipeline runs are indicated by all the activities in your pipeline reporting the FINISHED status. Your pipeline frequency determines how many times the pipeline runs, and each run has its own success or failure, as indicated by the --list-runs command. Resources that you define in your pipeline, such as Amazon EC2 instances, may show the SHUTTING_DOWN status until they are finally terminated after a successful run. Depending on how you configured your pipeline, you may have multiple Amazon EC2 resources, each with their own final status.
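The backfill behavior described in the notes above is driven entirely by the Schedule component. The following is a minimal sketch of a schedule whose startdatetime lies in the past at activation time; the identifier and dates are hypothetical placeholders, and you would substitute your own. When a pipeline with a schedule like this is activated, AWS Data Pipeline immediately runs one "past due" instance for each elapsed period before settling into the normal daily cadence.

{
  "id": "MyBackfillSchedule",
  "type": "Schedule",
  "startdatetime": "2012-11-20T00:00:00",
  "enddatetime": "2012-11-27T00:00:00",
  "period": "1 day"
}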

45 Tutorial: Copy Data From a MySQL Table to Amazon S3 Topics Before You Begin... (p. 41) Using the AWS Data Pipeline Console (p. 42) Using the Command Line Interface (p. 48) This tutorial walks you through the process of creating a data pipeline to copy data (rows) from a table in MySQL database to a CSV (comma-separated values) file in Amazon S3 bucket and then send an Amazon SNS notification after the copy activity completes successfully. You will use the Amazon EC2 instance resource provided by AWS Data Pipeline for this copy activity. The first step in pipeline creation process is to select the pipeline objects that make up your pipeline definition. After you select the pipeline objects, you add fields for each pipeline object. For more information on pipeline definition, see Pipeline Definition (p. 2). This tutorial uses the following objects to create a pipeline definition: Activity Activity the AWS Data Pipeline must perform for this pipeline. This tutorial uses the CopyActivity to copy data from a MySQL table to an Amazon S3 bucket. Schedule The start date, time, and the duration for this activity. You can optionally specify the end date and time. Resource Resource AWS Data Pipeline must use to perform this activity. This tutorial uses Ec2Resource, an Amazon EC2 instance provided by AWS Data Pipeline, to copy data. AWS Data Pipeline automatically launches the Amazon EC2 instance and then terminates the instance after the task finishes. Datades Input and output nodes for this pipeline. This tutorial uses MySQLDatade for source data and S3Datade for target data. 40

46 Before You Begin... Action Action AWS Data Pipeline must take when the specified conditions are met. This tutorial uses SnsAlarm action to send Amazon SNS notification to the address you specify, after the task finishes successfully. For information about the additional objects and fields supported by the copy activity, see CopyActivity (p. 180). The following steps outline how to create a data pipeline to copy data from MySQL table to Amazon S3 bucket. 1. Create your pipeline definition 2. Create and configure the pipeline definition objects 3. Validate and save your pipeline definition 4. Verify that your pipeline definition is saved 5. Activate your pipeline 6. Monitor the progress of your pipeline 7. [Optional] Delete your pipeline Before You Begin... Be sure you've completed the following steps. Set up an Amazon Web Services (AWS) account to access the AWS Data Pipeline console. For more information, see Access the Console (p. 12). Set up the AWS Data Pipeline tools and interface you plan on using. For more information on interfaces and tools you can use to interact with AWS Data Pipeline, see Get Set Up for AWS Data Pipeline (p. 12). Create an Amazon S3 bucket as a data source. For more information, see Create a Bucket in Amazon Simple Storage Service Getting Started Guide. Create and launch a MySQL database instance as a data source. For more information, see Launch a DB Instance in the Amazon Relational Database Service(RDS) Getting Started Guide. te Make a note of the user name and the password you used for creating the MySQL instance. After you've launched your MySQL database instance, make a note of the instance's endpoint. You will need all this information in this tutorial. Connect to your MySQL database instance, create a table, and then add test data values to the newly created table. For more information, go to Create a Table in the MySQL documentation. Create an Amazon SNS topic for sending notification and make a note of the topic Amazon Resource (ARN). For more information, go to Create a Topic in Amazon Simple tification Service Getting Started Guide. [Optional] This tutorial uses the default IAM role policies created by AWS Data Pipeline. If you would rather create and configure your IAM role policy and trust relationships, follow the instructions described in Granting Permissions to Pipelines with IAM (p. 21). 41
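Keep the values from these prerequisites handy: the MySQL user name, password, table name, and instance endpoint, plus the Amazon SNS topic ARN, are all entered later in this tutorial. For reference, the following sketch shows roughly where the database values end up in the pipeline definition that the command-line section of this tutorial builds; every value shown here is a placeholder, and the endpoint is combined with the port and database name to form the JDBC connection string.

{
  "id": "MySQLInput",
  "type": "MySqlDataNode",
  "table": "table_name",
  "username": "user_name",
  "*password": "my_password",
  "connection": "jdbc:mysql://your-instance-endpoint:3306/database_name"
}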

47 Using the AWS Data Pipeline Console te Some of the actions described in this tutorial can generate AWS usage charges, depending on whether you are using the AWS Free Usage Tier. Using the AWS Data Pipeline Console Topics Create and Configure the Pipeline Definition Objects (p. 42) Validate and Save Your Pipeline (p. 45) Verify Your Pipeline Definition (p. 45) Activate your Pipeline (p. 46) Monitor the Progress of Your Pipeline Runs (p. 47) [Optional] Delete your Pipeline (p. 48) The following sections include the instructions for creating the pipeline using the AWS Data Pipeline console. To create your pipeline definition 1. Sign in to the AWS Management Console and open the AWS Data Pipeline Console. 2. Click Create Pipeline. 3. On the Create a New Pipeline page: a. In the Pipeline box, enter a name (for example, CopyMySQLData). b. In Pipeline, enter a description. c. Leave the Select Schedule Type: button set to the default type for this tutorial. te Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of interval or end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval. d. Leave the Role boxes set to their default values for this tutorial. te If you have created your own IAM roles and would like to use them in this tutorial, you can select them now. e. Click Create a new pipeline. Create and Configure the Pipeline Definition Objects Next, you define the Activity object in your pipeline definition. When you define the Activity object, you also define the objects that AWS Data Pipeline must use to perform this activity. 1. On the Pipeline: name of your pipeline page, click Add activity. 2. In the Activities pane a. Enter the name of the activity; for example, copy-mysql-data 42

48 Create and Configure the Pipeline Definition Objects b. In the Type box, select CopyActivity. c. In the Input box, select Create new: Datade. d. In the Schedule box, select Create new: Schedule. e. In the Output box, select Create new: Datade. f. In the Add an optional field.. box, select RunsOn. g. In the Runs On box, select Create new: Resource. h. In the Add an optional field.. box, select On Success. i. In the On Success box, select Create new: Action. j. In the left pane, separate the icons by dragging them apart. You've completed defining your pipeline definition by specifying the objects AWS Data Pipeline will use to perform the copy activity. The Pipeline: name of your pipeline pane shows the graphical representation of the pipeline you just created. The arrows indicate the connection between the various objects. Next step, configure run date and time for your pipeline. To configure run date and time for your pipeline, 1. On the Pipeline: name of your pipeline page, in the right pane, click Schedules. 2. In the Schedules pane: a. Enter a schedule name for this activity (for example, copy-mysql-data-schedule). b. In the Type box, select Schedule. c. In the Start Date Time box, select the date from the calendar, and then enter the time to start the activity. te AWS Data Pipeline supports the date and time expressed in "YYYY-MM-DDTHH:MM:SS" format in UTC/GMT only d. In the Period box, enter the duration for the activity (for example, 1), and then select the period category (for example, Days). e. [Optional] To specify the date and time to end the activity, in the Add an optional field box, select enddatetime, and enter the date and time. 43

49 Create and Configure the Pipeline Definition Objects To get your pipeline to launch immediately, set Start Date Time to a date one day in the past. AWS Data Pipeline will then starting launching the "past due" runs immediately in an attempt to address what it percieves as a backlog of work. This backfilling means you don't have to wait an hour to see AWS Data Pipeline launch its first job flow. Next step, configure the input and the output data nodes for your pipeline. To configure the input and output data nodes of your pipeline, 1. On the Pipeline: name of your pipeline page, in the right pane, click Datades. 2. In the Datades pane: a. In the DefaultDatade1 box, enter the name for your input node (for example, MySQLInput). In this tutorial, your input node is the Amazon RDS MySQL instance you just created. b. In the Type box, select MySQLDatade. c. In the Username box, enter the user name you used when you created your MySQL database instance. d. In the Connection box, enter the end point of your MySQL database instance (for example, mydbinstance.c3frkexample.us-east-1.rds.amazonaws.com). e. In the *Password box, enter the password you used when you created your MySQL database instance. f. In the Table box, enter the name of the source MySQL database table (for example, input-table g. In the Schedule box, select copy-mysql-data-schedule. h. In the DefaultDatade2 box, enter the name for your output node (for example, MyS3Output). In this tutorial, your output node is the Amazon S3 data target bucket. i. In the Type box, select S3Datade. j. In the Schedule box, select copy-mysql-data-schedule. k. In the Add an optional field.. box, select File Path. l. In the File Path box, enter the path to your Amazon S3 bucket (for example, s3://my-data-pipeline-output/name of your csv file). Next step, configure the the resource AWS Data Pipeline must use to perform the copy activity. To configure the resource, 1. On the Pipeline: name of your pipeline page, in the right pane, click Resources. 2. In the Resources pane: a. In the box, enter the name for your resource (for example, CopyDataInstance). b. In the Type box, select Ec2Resource. c. In the Schedule box, select copy-mysql-data-schedule. Next step, configure the SNS notification action AWS Data Pipeline must perform after the copy activity finishes successfully. 44

50 Validate and Save Your Pipeline To configure the SNS notification action, 1. On the Pipeline: name of your pipeline page, in the right pane, click Others. 2. In the Others pane: a. In the DefaultAction1 box, enter the name for your Amazon SNS notification (for example, CopyDatatice). b. In the Type box, select SnsAlarm. c. In the Message box, enter the message content. d. Leave the entry in the Role box set to default value. e. In the Topic Arn box, enter the ARN of your Amazon SNS topic. f. In the Subject box, enter the subject line for your notification. You have now completed all the steps required for creating your pipeline definition. Next step, validate and save your pipeline. Validate and Save Your Pipeline You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition. If your pipeline is incomplete or is incorrect, AWS Data Pipeline throws a validation error. If you plan to continue the creation process later, you can ignore the error message. If your pipeline definition is complete, and you are getting the validation error message, you'll have to fix the errors in the pipeline definition before activating your pipeline. To validate and save your pipeline, 1. On the Pipeline: name of your pipeline page, click Save Pipeline. 2. AWS Data Pipeline validates your pipeline definition and returns either success or the error message. 3. If you get an error message, click Close and then, in the right pane, click Errors. 4. The Errors pane lists the objects failing validation. Click the plus (+) sign next to the object names and look for an error message in red. 5. When you see the error message, click the specific object pane where you see the error and fix it. For example, if you see an error message in the Datades object, click the Datades pane to fix the error. 6. After you've fixed the errors listed in the Errors pane, click Save Pipeline. 7. Repeat the process until your pipeline is validated. Next step, verify that your pipeline definition has been saved. Verify Your Pipeline Definition It is important to verify that your pipeline was correctly initialized from your definitions before you activate it. To verify your pipeline definition, 1. On the Pipeline: name of your pipeline page, click Back to list of pipelines. 2. On the List Pipelines page, check if your newly-created pipeline is listed. AWS Data Pipeline has created a unique Pipeline ID for your pipeline definition. 45

51 Activate your Pipeline The Status column in the row listing your pipeine should show PENDING. 3. Click on the triangle icon next to your pipeline. The Pipeline Summary panel below shows the details of your pipeline runs. Because your pipeline is not yet activated, you should see only 0's at this point. 4. In the Pipeline summary panel, click View all fields to see the configuration of your pipeline definition. 5. Click Close. Next step, activate your pipeline. Activate your Pipeline You must activate your pipeline to start creating and processing runs based on the specifications in your pipeline definition. To activate your pipeline, 1. On the List Pipelines page, in the Details column of your pipeline, click View pipeline. 2. In the Pipeline: name of your pipeline page, click Activate. A confirmation dialog box opens up confirming the activation. 3. Click Close. Next step, verify if your pipeline is running. 46

52 Monitor the Progress of Your Pipeline Runs Monitor the Progress of Your Pipeline Runs To monitor the progress of your pipeline, 1. On the List Pipelines page, in the Details column of your pipeline, click View instance details. 2. The Instance details: name of your pipeline page lists the status of each instance. te If you do not see instances listed, depending on when your pipeline was scheduled, either click End (in UTC) date box and change it to a later date or click Start (in UTC) date box and change it to an earlier date. And then click Update. 3. If the Status column of all the objects in your pipeline indicates FINISHED, your pipeline has successfully completed the copy activity.you should receive an about the successful completion of this task, to the account you specified for receiving your Amazon SNS notification. You can also check your Amazon S3 data target bucket to verify if the data was copied. 4. If the Status column of any of your instance indicates a status other than FINISHED, either your pipeline is waiting for some precondition to be met or it has failed. a. To troubleshoot the failed or the incomplete instance runs, Click the triangle next to an instance, Instance summary panel opens to show the details of the selected instance. b. Click View instance fields to see additional details of the instance. If the status of your selected instance is FAILED the additional details box will have an entry indicating the reason for failure. For = Resource not healthy terminated. c. In the Instance summary pane, in the Select attempt for this instance box, select the attempt number. d. In the Instance summary pane, click View attempt fields to see details of fields associated with the selected attempt. 5. To take an action on your incomplete or failed instance, select an action (Rerun Cancel Mark Finished) from the Action column of the instance. You can use the information in the Instance summary pane and the View instance fields box to troubleshoot issues with your failed pipeline. 47

53 [Optional] Delete your Pipeline For more information about instance status, see Interpret Pipeline Status Details (p. 129). For more information about troubleshooting the failed or incomplete instance runs of your pipeline, see AWS Data Pipeline Problems and Solutions (p. 131). Important Your pipeline is running and is incurring charges. For more information, see AWS Data Pipeline pricing. If you would like to stop incurring the AWS Data Pipeline usage charges, delete your pipeline. [Optional] Delete your Pipeline Deleting your pipeline will delete the pipeline definition including all the associated objects. You will stop incurring charges as soon as your pipeline is deleted. To delete your pipeline, 1. In the List Pipelines page, click the check box next to your pipeline. 2. Click Delete. 3. In the confirmation dialog box, click Delete to confirm the delete request. Using the Command Line Interface Topics Define a Pipeline in JSON Format (p. 49) Schedule (p. 50) MySQL Data de (p. 51) Amazon S3 Data de (p. 51) Resource (p. 52) Activity (p. 53) Upload the Pipeline Definition (p. 54) Activate the Pipeline (p. 54) Verify the Pipeline Status (p. 55) The following topics explain how to use the AWS Data Pipeline CLI to create a pipeline to copy data from a MySQL table to a file in an Amazon S3 bucket. In this example, we perform the following steps: Create a pipeline definition using the CLI in JSON format Create the necessary IAM roles and define a policy and trust relationships Upload the pipeline definition using the AWS Data Pipeline CLI tools Monitor the progress of the pipeline To complete the steps in this example, you need a MySQL database instance with a table that contains data. To create a MySQL database using Amazon RDS, see Get Started with Amazon RDS - After you have an 48

Amazon RDS instance, see the MySQL documentation to Create a Table.

Define a Pipeline in JSON Format

This example scenario shows how to use JSON pipeline definitions and the AWS Data Pipeline CLI to copy data (rows) from a table in a MySQL database to a CSV (comma-separated values) file in an Amazon S3 bucket at a specified time interval. This is the full pipeline definition JSON file, followed by an explanation for each of its sections.

Note
We recommend that you use a text editor that can help you verify the syntax of JSON-formatted files, and name the file using the .json file extension.

{
  "objects": [
    {
      "id": "MySchedule",
      "type": "Schedule",
      "startdatetime": " T00:00:00",
      "enddatetime": " T00:00:00",
      "period": "1 day"
    },
    {
      "id": "MySQLInput",
      "type": "MySqlDataNode",
      "schedule": { "ref": "MySchedule" },
      "table": "table_name",
      "username": "user_name",
      "*password": "my_password",
      "connection": "jdbc:mysql://mysqlinstance-rds.example.us-east-1.rds.amazonaws.com:3306/database_name",
      "selectquery": "select * from #{table}"
    },
    {
      "id": "S3Output",
      "type": "S3DataNode",
      "filepath": "s3://testbucket/output/output_file.csv",
      "schedule": { "ref": "MySchedule" }
    },
    {
      "id": "MyEC2Resource",
      "type": "Ec2Resource",
      "schedule": { "ref": "MySchedule" },
      "actionontaskfailure": "terminate",
      "actiononresourcefailure": "retryall",
      "maximumretries": "1",
      "role": "test-role",
      "resourcerole": "test-role",
      "instancetype": "m1.medium",
      "instancecount": "1",
      "securitygroups": [ "test-group", "default" ],
      "keypair": "test-pair"
    },
    {
      "id": "MyCopyActivity",
      "type": "CopyActivity",
      "runson": { "ref": "MyEC2Resource" },
      "input": { "ref": "MySQLInput" },
      "output": { "ref": "S3Output" },
      "schedule": { "ref": "MySchedule" }
    }
  ]
}

Schedule

The example AWS Data Pipeline JSON file begins with a section to define the schedule by which to copy the data. Many pipeline components have a reference to a schedule, and you may have more than one. The Schedule component is defined by the following fields:

{
  "id": "MySchedule",
  "type": "Schedule",
  "startdatetime": " T00:00:00",
  "enddatetime": " T00:00:00",
  "period": "1 day"
}

Note
In the JSON file, you can define the pipeline components in any order you prefer. In this example, we chose the order that best illustrates the pipeline component dependencies.

Name
The user-defined name for the pipeline schedule, which is a label for your reference only.
Type
The pipeline component type, which is Schedule.
startdatetime
The date/time (in UTC format) that you want the task to begin.
enddatetime
The date/time (in UTC format) that you want the task to stop.
period
The time period that you want to pass between task attempts, even if the task occurs only one time. The period must evenly divide the time between startdatetime and enddatetime. In this example, we set the period to 1 day so that the pipeline copy operation runs only one time.
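For comparison, the following is a minimal sketch of a schedule that produces multiple runs rather than a single one: when startdatetime and enddatetime are one day apart and the period is 1 hour, the period divides the interval evenly and the copy would be scheduled once per hour across that day. The identifier and dates are hypothetical placeholders.

{
  "id": "MyHourlySchedule",
  "type": "Schedule",
  "startdatetime": "2012-11-22T00:00:00",
  "enddatetime": "2012-11-23T00:00:00",
  "period": "1 hours"
}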

MySQL Data Node

Next, the input MySqlDataNode pipeline component defines a location for the input data; in this case, an Amazon RDS instance. The input MySqlDataNode component is defined by the following fields:

{
  "id": "MySQLInput",
  "type": "MySqlDataNode",
  "schedule": { "ref": "MySchedule" },
  "table": "table_name",
  "username": "user_name",
  "*password": "my_password",
  "connection": "jdbc:mysql://mysqlinstance-rds.example.us-east-1.rds.amazonaws.com:3306/database_name",
  "selectquery": "select * from #{table}"
}

Name
The user-defined name for the MySQL database, which is a label for your reference only.
Type
The MySqlDataNode type, which is an Amazon RDS instance using MySQL in this example.
Schedule
A reference to the schedule component that we created in the preceding lines of the JSON file, labeled MySchedule.
Table
The name of the database table that contains the data to copy. Replace table_name with the name of your database table.
Username
The user name of the database account that has sufficient permission to retrieve data from the database table. Replace user_name with the name of your user account.
Password
The password for the database account, with the asterisk prefix to indicate that AWS Data Pipeline must encrypt the password value. Replace my_password with the correct password for your user account.
connection
The JDBC connection string for the CopyActivity object to connect to the database.
selectquery
A valid SQL SELECT query that specifies which data to copy from the database table. Note that #{table} is an expression that re-uses the table name provided by the "table" field in the preceding lines of the JSON file.

Amazon S3 Data Node

Next, the S3Output pipeline component defines a location for the output file; in this case, a CSV file in an S3 bucket location. The output S3DataNode component is defined by the following fields:

{
  "id": "S3Output",
  "type": "S3DataNode",
  "filepath": "s3://testbucket/output/output_file.csv",
  "schedule": { "ref": "MySchedule" }
}

Name
The user-defined name for the output location (a label for your reference only).
Type
The pipeline component type, which is "S3DataNode" to match the location where the data resides, in an Amazon S3 bucket.
Path
The path to the data associated with the data node. The syntax for a data node is determined by its type. For example, a data node for a database follows a different syntax that is appropriate for a database table.
Schedule
A reference to the schedule component that we created in the preceding lines of the JSON file, labeled MySchedule.

Resource

This is a definition of the computational resource that performs the copy operation. In this example, AWS Data Pipeline should automatically create an EC2 instance to perform the copy task and terminate the resource after the task completes. The fields defined here control the creation and function of the Amazon EC2 instance that does the work. The Ec2Resource is defined by the following fields:

{
  "id": "MyEC2Resource",
  "type": "Ec2Resource",
  "schedule": { "ref": "MySchedule" },
  "actionontaskfailure": "terminate",
  "actiononresourcefailure": "retryall",
  "maximumretries": "1",
  "role": "test-role",
  "resourcerole": "test-role",
  "instancetype": "m1.medium",
  "instancecount": "1",
  "securitygroups": [ "test-group", "default" ],
  "keypair": "test-pair"
}

Name
The user-defined name for the resource, which is a label for your reference only.
Type
The type of computational resource to perform work; in this case, an Amazon EC2 instance. There are other resource types available, such as an EmrCluster type.
Schedule
The schedule on which to create this computational resource.
actionontaskfailure
The action to take on the Amazon EC2 instance if the task fails. In this case, the instance should terminate so that each failed attempt to copy data does not leave behind idle, abandoned Amazon EC2 instances with no work to perform. These instances require manual termination by an administrator.
actiononresourcefailure
The action to perform if the resource is not created successfully. In this case, retry the creation of an Amazon EC2 instance until it is successful.
maximumretries
The number of times to retry the creation of this computational resource before marking the resource as a failure and stopping any further creation attempts. This setting works in conjunction with the actiononresourcefailure field.

Role
The IAM role of the account that accesses resources, such as accessing an Amazon S3 bucket to retrieve data.
resourcerole
The IAM role of the account that creates resources, such as creating and configuring an Amazon EC2 instance on your behalf. Role and ResourceRole can be the same role, but specifying them separately provides greater granularity in your security configuration.
instancetype
The size of the Amazon EC2 instance to create. Ensure that you set the size of EC2 instance that best matches the load of the work that you want to perform with AWS Data Pipeline. In this case, we set an m1.medium EC2 instance. For more information about the different instance types and when to use each one, see the Amazon EC2 Instance Types topic.
instancecount
The number of Amazon EC2 instances in the computational resource pool to service any pipeline components depending on this resource.
securitygroups
The Amazon EC2 security groups to grant access to this EC2 instance. As the example shows, you can provide more than one group at a time (test-group and default).
keypair
The name of the SSH public/private key pair to log in to the Amazon EC2 instance. For more information, see Amazon EC2 Key Pairs.

Activity

The last section in the JSON file is the definition of the activity that represents the work to perform. In this case, we use a CopyActivity component to copy data from the MySQL table to a file in an Amazon S3 bucket. The CopyActivity component is defined by the following fields:

{
  "id": "MyCopyActivity",
  "type": "CopyActivity",
  "runson": { "ref": "MyEC2Resource" },
  "input": { "ref": "MySQLInput" },
  "output": { "ref": "S3Output" },
  "schedule": { "ref": "MySchedule" }
}

Name
The user-defined name for the activity, which is a label for your reference only.
Type
The type of activity to perform; in this case, CopyActivity.
runson
The computational resource that performs the work that this activity defines. In this example, we provide a reference to the EC2 instance defined previously. Using the runson field causes AWS Data Pipeline to create the EC2 instance for you. The runson field indicates that the resource exists in the AWS infrastructure, while the workergroup value indicates that you want to use your own on-premises resources to perform the work.
Schedule
The schedule on which to run this activity.
Input
The location of the data to copy.

Output
The target location for the data.

Upload the Pipeline Definition

You can upload a pipeline definition file using the AWS Data Pipeline CLI tools. For more information, see Install the Command Line Interface (p. 15).

To upload your pipeline definition, use the following command.

On Linux/Unix/Mac OS:

./datapipeline --create pipeline_name --put pipeline_file

On Windows:

ruby datapipeline --create pipeline_name --put pipeline_file

Where pipeline_name is the label for your pipeline and pipeline_file is the full path and file name for the file with the .json file extension that defines your pipeline.

If your pipeline validates successfully, you receive the following message:

Pipeline with name pipeline_name and id df-akiaiosfodnn7example created.
Pipeline definition pipeline_file.json uploaded.

Note
For more information about any errors returned by the create command or other commands, see Troubleshoot AWS Data Pipeline (p. 128).

Ensure that your pipeline appears in the pipeline list by using the following command.

On Linux/Unix/Mac OS:

./datapipeline --list-pipelines

On Windows:

ruby datapipeline --list-pipelines

The list of pipelines includes details such as Id, State, and UserId. Take note of your pipeline ID, because you use this value for most AWS Data Pipeline CLI commands. The pipeline ID is a unique identifier using the format df-akiaiosfodnn7example.

Activate the Pipeline

You must activate the pipeline, by using the --activate command-line parameter, before it will begin performing work. Use the following command.

On Linux/Unix/Mac OS:

./datapipeline --activate --id df-akiaiosfodnn7example

On Windows:

ruby datapipeline --activate --id df-akiaiosfodnn7example

Where df-akiaiosfodnn7example is the identifier for your pipeline.

Verify the Pipeline Status

View the status of your pipeline and its components, along with its activity attempts and retries, with the following command.

On Linux/Unix/Mac OS:

./datapipeline --list-runs --id df-akiaiosfodnn7example

On Windows:

ruby datapipeline --list-runs --id df-akiaiosfodnn7example

Where df-akiaiosfodnn7example is the identifier for your pipeline. The --list-runs command displays a list of pipeline components and details such as Scheduled Start, Status, ID, Started, and Ended.

Note
It is important to note the difference between the Scheduled Start date/time and the Started time. It is possible to schedule a pipeline component to run at a certain time (Scheduled Start), but the actual start time (Started) could be later due to problems or delays with preconditions, dependencies, failures, or retries.

Note
AWS Data Pipeline may backfill a pipeline, which happens when you define a Scheduled Start date/time for a date in the past. In that situation, AWS Data Pipeline immediately runs the pipeline components the number of times the activity should have run if it had started on the Scheduled Start time. When this happens, you see pipeline components run back-to-back at a greater frequency than the period value that you specified when you created the pipeline. AWS Data Pipeline returns your pipeline to the defined frequency only when it catches up to the number of past runs.

Successful pipeline runs are indicated by all the activities in your pipeline reporting the FINISHED status. Your pipeline frequency determines how many times the pipeline runs, and each run has its own success or failure, as indicated by the --list-runs command. Resources that you define in your pipeline, such as Amazon EC2 instances, may show the SHUTTING_DOWN status until they are finally terminated after a successful run. Depending on how you configured your pipeline, you may have multiple Amazon EC2 resources, each with their own final status.
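Before leaving this example: the selectquery field on the MySqlDataNode described earlier accepts any valid SQL SELECT statement, so you can copy a subset of the table rather than every row. The following is a minimal sketch that reuses the #{table} expression and adds a column list and WHERE clause; the column names and the 'ACTIVE' value are hypothetical, and you would substitute a query that matches your own schema.

{
  "id": "MySQLInput",
  "type": "MySqlDataNode",
  "schedule": { "ref": "MySchedule" },
  "table": "table_name",
  "username": "user_name",
  "*password": "my_password",
  "connection": "jdbc:mysql://mysqlinstance-rds.example.us-east-1.rds.amazonaws.com:3306/database_name",
  "selectquery": "select id, status, created_at from #{table} where status = 'ACTIVE'"
}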

Tutorial: Launch an Amazon EMR Job Flow

If you regularly run an Amazon EMR job flow, such as to analyze web logs or perform analysis of scientific data, you can use AWS Data Pipeline to manage your Amazon EMR job flows. With AWS Data Pipeline you can specify preconditions that must be met before the job flow is launched (for example, ensuring that today's data has been uploaded to Amazon S3), a schedule for repeatedly running the job flow, and the cluster configuration to use for the job flow. The following tutorial walks you through launching a simple job flow as an example. It can be used as a model for a simple Amazon EMR-based pipeline, or as part of a more involved pipeline.

This tutorial walks you through the process of creating a data pipeline for a simple Amazon EMR job flow to run a pre-existing Hadoop Streaming job provided by Amazon EMR, and then send an Amazon SNS notification after the task completes successfully. You will use the Amazon EMR cluster resource provided by AWS Data Pipeline for this task. This sample application is called WordCount, and can also be run manually from the Amazon EMR console.

The first step in the pipeline creation process is to select the pipeline objects that make up your pipeline definition. After you select the pipeline objects, you add fields for each pipeline object. For more information on pipeline definitions, see Pipeline Definition (p. 2).

This tutorial uses the following objects to create a pipeline definition:

Activity
The activity that AWS Data Pipeline must perform for this pipeline. This tutorial uses EmrActivity to run a pre-existing Hadoop Streaming job provided by Amazon EMR.
Schedule
The start date, time, and the duration for this activity. You can optionally specify the end date and time.
Resource
The resource AWS Data Pipeline must use to perform this activity. This tutorial uses EmrCluster, a set of Amazon EC2 instances provided by AWS Data Pipeline to run the job flow. AWS Data Pipeline automatically launches the Amazon EMR cluster and then terminates the cluster after the task finishes.
Action
The action AWS Data Pipeline must take when the specified conditions are met.

62 Before You Begin... This tutorial uses SnsAlarm action to send Amazon SNS notification to the address you specify, after the task finishes successfully. For more information about the additional objects and fields supported by Amazon EMR activity, see EmrCluster (p. 209). The following steps outline how to create a data pipeline to launch an Amazon EMR job flow. 1. Create your pipeline definition 2. Create and configure the pipeline definition objects 3. Validate and save your pipeline definition 4. Verify that your pipeline definition is saved 5. Activate your pipeline 6. Monitor the progress of your pipeline 7. [Optional] Delete your pipeline Before You Begin... Be sure you've completed the following steps. Set up an Amazon Web Services (AWS) account to access the AWS Data Pipeline console. For more information, see Access the Console (p. 12). Set up the AWS Data Pipeline tools and interface you plan on using. For more information about interfaces and tools you can use to interact with AWS Data Pipeline, see Get Set Up for AWS Data Pipeline (p. 12). Create an Amazon SNS topic for sending notification and make a note of the topic Amazon Resource (ARN). For more information, see Create a Topic in the Amazon Simple tification Service Getting Started Guide. te Some of the actions described in this tutorial can generate AWS usage charges, depending on whether you are using the AWS Free Usage Tier. Using the AWS Data Pipeline Console Topics Create and Configure the Pipeline Definition Objects (p. 58) Validate and Save Your Pipeline (p. 60) Verify Your Pipeline Definition (p. 60) Activate your Pipeline (p. 61) Monitor the Progress of Your Pipeline Runs (p. 61) [Optional] Delete your Pipeline (p. 63) The following sections include the instructions for creating the pipeline using the AWS Data Pipeline console. 57

To create your pipeline definition
1. Sign in to the AWS Management Console and open the AWS Data Pipeline console.
2. Click Create Pipeline.
3. On the Create a New Pipeline page:
a. In the Pipeline Name box, enter a name (for example, MyEmrJob).
b. In Pipeline Description, enter a description.
c. Leave the Select Schedule Type: button set to the default type for this tutorial.
Note
Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval.
d. Leave the Role boxes set to their default values for this tutorial.
e. Click Create a new pipeline.

Create and Configure the Pipeline Definition Objects

Next, you define the Activity object in your pipeline definition. When you define the Activity object, you also define the objects that AWS Data Pipeline must use to perform this activity.
1. On the Pipeline: name of your pipeline page, select Add activity.
2. In the Activities pane:
a. Enter the name of the activity; for example, my-emr-job.
b. In the Type box, select EmrActivity.
c. In the Step box, enter /home/hadoop/contrib/streaming/hadoop-streaming.jar,-input,s3n://elasticmapreduce/samples/wordcount/input,-output,s3://myawsbucket/wordcount/output/#{@scheduledStartTime},-mapper,s3n://elasticmapreduce/samples/wordcount/wordsplitter.py,-reducer,aggregate.
d. In the Schedule box, select Create new: Schedule.
e. In the Add an optional field.. box, select Runs On.
f. In the Runs On box, select Create new: EmrCluster.
g. In the Add an optional field.. box, select On Success.
h. In the On Success box, select Create new: Action.

You've completed defining your pipeline definition by specifying the objects AWS Data Pipeline will use to launch an Amazon EMR job flow. The Pipeline: name of your pipeline pane shows a single activity icon for this pipeline. Next, configure run date and time for your pipeline; for reference, the sketch below shows roughly how the Step value you entered is represented in a pipeline definition.
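This is a minimal sketch of the EmrActivity object as it might appear in a pipeline definition, with the step string written out on one line. The bucket name (myawsbucket), the schedule name, and the cluster name are placeholders from this tutorial, and #{@scheduledStartTime} is an expression that AWS Data Pipeline replaces with the scheduled start time of each run, which keeps each run's output in its own time-stamped folder.

{
  "id": "MyEmrActivity",
  "type": "EmrActivity",
  "schedule": { "ref": "my-emr-job-schedule" },
  "runson": { "ref": "MyEmrCluster" },
  "step": "/home/hadoop/contrib/streaming/hadoop-streaming.jar,-input,s3n://elasticmapreduce/samples/wordcount/input,-output,s3://myawsbucket/wordcount/output/#{@scheduledStartTime},-mapper,s3n://elasticmapreduce/samples/wordcount/wordsplitter.py,-reducer,aggregate"
}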

64 Create and Configure the Pipeline Definition Objects To configure run date and time for your pipeline 1. On the Pipeline: name of your pipeline page, in the right pane, click Schedules. 2. In the Schedules pane: a. Enter a schedule name for this activity (for example, my-emr-job-schedule). b. In the Type box, select Schedule. c. In the Start Date Time box, select the date from the calendar, and then enter the time to start the activity. te AWS Data Pipeline supports the date and time expressed in "YYYY-MM-DDTHH:MM:SS" format in UTC/GMT only. d. In the Period box, enter the duration for the activity (for example, 1), and then select the period category (for example, Days). e. [Optional] To specify the date and time to end the activity, in the Add an optional field box, select enddatetime, and enter the date and time. To get your pipeline to launch immediately, set Start Date Time to a date one day in the past. AWS Data Pipeline will then starting launching the "past due" runs immediately in an attempt to address what it percieves as a backlog of work. This backfilling means you don't have to wait an hour to see AWS Data Pipeline launch its first job flow. Next, configure the resource AWS Data Pipeline must use to perform the Amazon EMR job. To configure the resource 1. On the Pipeline: name of your pipeline page, in the right pane, click Resources. 2. In the Resources pane: a. In the box, enter the name for your EMR cluster (for example, MyEmrCluster). b. Leave the Type box set to the default value. c. In the Schedule box, select my-emr-job-schedule. Next, configure the SNS notification action AWS Data Pipeline must perform after the Amazon EMR job finishes successfully. To configure the SNS notification action 1. On the Pipeline: name of your pipeline page, in the right pane, click Others. 2. In the Others pane: a. In the DefaultAction1 box, enter the name for your Amazon SNS notification (for example, EmrJobtice). b. In the Type box, select SnsAlarm. c. In the Message box, enter the message content. d. Leave the entry in the Role box set to default. e. In the Subject box, enter the subject line for your notification. f. In the Topic Arn box, enter the ARN of your Amazon SNS topic. 59
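The console fields in this procedure map onto an SnsAlarm object in the pipeline definition. The following is a minimal sketch; the identifier, topic ARN, subject, and message are hypothetical placeholders, and the exact field spellings should be confirmed against the SnsAlarm reference section of this guide.

{
  "id": "EmrJobNotice",
  "type": "SnsAlarm",
  "topicarn": "arn:aws:sns:us-east-1:123456789012:my-example-topic",
  "subject": "EMR job flow finished",
  "message": "The WordCount job flow completed successfully.",
  "role": "test-role"
}

The activity you created earlier would then reference this object through its On Success field (for example, "onsuccess": { "ref": "EmrJobNotice" }), so the notification is sent only after the job flow finishes successfully.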

65 Validate and Save Your Pipeline You have now completed all the steps required for creating your pipeline definition. Next, validate and save your pipeline. Validate and Save Your Pipeline You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition. If your pipeline is incomplete or is incorrect, AWS Data Pipeline throws a validation error. If you plan to continue the creation process later, you can ignore the error message. If your pipeline definition is complete, and you are getting the validation error message, you'll have to fix the errors in the pipeline definition before activating your pipeline. To validate and save your pipeline 1. On the Pipeline: name of your pipeline page, click Save Pipeline. 2. AWS Data Pipeline validates your pipeline definition and returns either success or the error message. 3. If you get an error message, click Close and then, in the right pane, click Errors. 4. The Errors pane lists the objects failing validation. Click the plus (+) sign next to the object names and look for an error message in red. 5. When you see the error message, click the specific object pane where you see the error and fix it. For example, if you see an error message in the Schedules object, click the Schedules pane to fix the error. 6. After you've fixed the errors listed in the Errors pane, click Save Pipeline. 7. Repeat the process until your pipeline is validated. Next, verify that your pipeline definition has been saved. Verify Your Pipeline Definition It is important to verify that your pipeline was correctly initialized from your definitions before you activate it. To verify your pipeline definition 1. On the Pipeline: name of your pipeline page, click Back to list of pipelines. 2. On the List Pipelines page, check if your newly-created pipeline is listed. AWS Data Pipeline has created a unique Pipeline ID for your pipeline definition. The Status column in the row listing your pipeine should show PENDING. 3. Click on the triangle icon next to your pipeline. The Pipeline Summary panel below shows the details of your pipeline runs. Because your pipeline is not yet activated, you should see only 0s at this point. 4. In the Pipeline summary panel, click View all fields to see the configuration of your pipeline definition. 60

66 Activate your Pipeline 5. Click Close. Next, activate your pipeline. Activate your Pipeline You must activate your pipeline to start creating and processing runs based on the specifications in your pipeline definition. To activate your pipeline 1. On the List Pipelines page, in the Details column of your pipeline, click View pipeline. 2. In the Pipeline: name of your pipeline page, click Activate. A confirmation dialog box opens up confirming the activation. 3. Click Close. Next, verify if your pipeline is running. Monitor the Progress of Your Pipeline Runs To monitor the progress of your pipeline 1. On the List Pipelines page, in the Details column of your pipeline, click View instance details. 61

67 Monitor the Progress of Your Pipeline Runs te You can also view the job flows in the Amazon EMR console. The job flows spawned by AWS Data Pipeline on your behalf are displayed in the Amazon EMR console and billed to your AWS Account in the same manner as job flows that you launch manually. You can tell which job flows were spawned by AWS Data Pipeline by looking at the name of the job flow. Those spawned by AWS Data Pipeline have a name formatted as follows: job-flow-identifier_@emr-cluster-name_launch-time. For more information, see View Job Flow Details in the Amazon Elastic MapReduce Developer Guide. 2. The Instance details: name of your pipeline page lists the status of each instance in your pipeline definition. te If you do not see instances listed, depending on when your pipeline was scheduled, either click End (in UTC) date box and change it to a later date or click Start (in UTC) date box and change it to an earlier date. Then click Update. 3. If the Status column of all the instances in your pipeline indicates FINISHED, your pipeline has successfully completed the copy activity.you should receive an about the successful completion of this task, to the account you specified for receiving your Amazon SNS notification. 4. If the Status column of any of your instances indicates a status other than FINISHED, either your pipeline is waiting for some precondition to be met or it has failed. a. To troubleshoot the failed or the incomplete runs Click the triangle next to an instance; the Instance summary panel opens to show the details of the selected instance. b. Click View instance fields to see additional details of the instance. If the status of your selected instance is FAILED, the details box has an entry indicating the reason for failure. For = Resource not healthy terminated. c. In the Instance summary pane, in the Select attempt for this instance box, select the attempt number. d. In the Instance summary pane, click View attempt fields to see details of fields associated with the selected attempt. 5. To take an action on your incomplete or failed instance, select an action (Rerun Cancel Mark Finished) from the Action column of the instance. You can use the information in the Instance summary pane and the View instance fields box to troubleshoot issues with your failed pipeline. 62

For more information about instance status, see Interpret Pipeline Status Details (p. 129). For more information about troubleshooting the failed or incomplete instance runs of your pipeline, see AWS Data Pipeline Problems and Solutions (p. 131).

Important
Your pipeline is running and is incurring charges. For more information, see AWS Data Pipeline pricing. If you would like to stop incurring the AWS Data Pipeline usage charges, delete your pipeline.

[Optional] Delete your Pipeline

Deleting your pipeline deletes the pipeline definition including all the associated objects. You stop incurring charges as soon as your pipeline is deleted.

To delete your pipeline
1. In the List Pipelines page, click the check box next to your pipeline.
2. Click Delete.
3. In the confirmation dialog box, click Delete to confirm the delete request.

Using the Command Line Interface

Topics

If you regularly run an Amazon EMR job flow to analyze web logs or perform analysis of scientific data, you can use AWS Data Pipeline to manage your Amazon EMR job flows. With AWS Data Pipeline, you can specify preconditions that must be met before the job flow is launched (for example, ensuring that today's data has been uploaded to Amazon S3). The following tutorial walks you through launching a job flow that can serve as a model for a simple Amazon EMR-based pipeline, or as part of a more involved pipeline.

The following code is the pipeline definition file for a simple Amazon EMR job flow that runs a pre-existing Hadoop Streaming job provided by Amazon EMR. This sample application is called WordCount, and can also be run manually from the Amazon EMR console.

In the following code, you should replace the Amazon S3 bucket location with the name of an Amazon S3 bucket that you own. You should also replace the start and end dates. To get job flows launching immediately, set startdatetime to a date one day in the past and enddatetime to one day in the future. AWS Data Pipeline then starts launching the "past due" job flows immediately in an attempt to address what it perceives as a backlog of work. This backfilling means you don't have to wait an hour to see AWS Data Pipeline launch its first job flow.

{
  "objects": [
    {
      "id": "Hourly",
      "type": "Schedule",
      "startdatetime": " T07:48:00",
      "enddatetime": " T07:48:00",
      "period": "1 hours"
    },
    {
      "id": "MyCluster",
      "type": "EmrCluster",
      "masterinstancetype": "m1.small",
      "schedule": { "ref": "Hourly" }
    },
    {
      "id": "MyEmrActivity",
      "type": "EmrActivity",
      "schedule": { "ref": "Hourly" },
      "runson": { "ref": "MyCluster" },
      "step": "/home/hadoop/contrib/streaming/hadoop-streaming.jar,-input,s3n://elasticmapreduce/samples/wordcount/input,-output,s3://myawsbucket/wordcount/output/#{@scheduledStartTime},-mapper,s3n://elasticmapreduce/samples/wordcount/wordsplitter.py,-reducer,aggregate"
    }
  ]
}

This pipeline has three objects:
Hourly, which represents the schedule of the work. You can set a schedule as one of the fields on a transform. When you do, the transform runs according to that schedule, or in this case, hourly.
MyCluster, which represents the set of Amazon EC2 instances used to run the job flow. You can specify the size and number of EC2 instances to run as the cluster. If you do not specify the number of instances, the job flow launches with two, a master node and a task node. You can add additional configurations to the cluster, such as bootstrap actions to load additional software onto the Amazon EMR-provided AMI.
MyEmrActivity, which represents the computation to process with the job flow. Amazon EMR supports several types of job flows, including streaming, Cascading, and Scripted Hive. The runson field refers back to MyCluster, using that as the specification for the underpinnings of the job flow.

To create a pipeline that launches an Amazon EMR job flow
1. Open a terminal window in the directory where you've installed the AWS Data Pipeline CLI. For more information about how to install the CLI, see Install the Command Line Interface (p. 15).
2. Create a new pipeline.

./datapipeline --credentials ./credentials.json --create MyEmrPipeline

When the pipeline is created, AWS Data Pipeline returns a success message and an identifier for the pipeline.

Pipeline with name 'MyEmrPipeline' and id 'df y0grtud0sp0' created.

3. Add the JSON definition to the pipeline. This gives AWS Data Pipeline the business logic it needs to manage your data.
./datapipeline --credentials ./credentials.json --put MyEmrPipelineDefinition.df --id df-Y0GRTUD0SP0
The following message is an example of a successfully uploaded pipeline.
State of pipeline id 'df-Y0GRTUD0SP0' is currently 'PENDING'
4. Activate the pipeline.
./datapipeline --credentials ./credentials.json --activate --id df-Y0GRTUD0SP0
If the pipeline definition that you uploaded with the --put command is valid, this command activates the pipeline. If the pipeline is invalid, AWS Data Pipeline returns an error code indicating what the problems are.
5. Wait until the pipeline has had time to start running, then verify the pipeline's operation.
./datapipeline --credentials ./credentials.json --list-runs --id df-Y0GRTUD0SP0
This returns information about the runs initiated by the pipeline, such as the following.
State of pipeline id 'df-Y0GRTUD0SP0' is currently 'SCHEDULED'
The --list-runs command is fetching the last 4 days of pipeline runs. If this takes too long, use --help for how to specify a different interval with --start-interval or --schedule-interval.
Scheduled Start     Status     ID     Started     Ended
1.  MyCluster

2.  MyEmrActivity
3.  MyCluster
4.  MyEmrActivity
5.  MyCluster
6.  MyEmrActivity
All times are listed in UTC and all command line input is treated as UTC.
Total of 6 pipeline runs shown from pipeline named 'MyEmrPipeline'.
You can view job flows launched by AWS Data Pipeline in the Amazon EMR console. The job flows spawned by AWS Data Pipeline on your behalf are displayed in the Amazon EMR console and billed to your AWS Account in the same manner as job flows that you launch manually.
To check the progress of job flows launched by AWS Data Pipeline
1. Look at the name of the job flow to tell which job flows were spawned by AWS Data Pipeline. Those spawned by AWS Data Pipeline have a name formatted as follows: <job-flow-identifier>_@<emr-cluster-name>_<launch-time>. This is shown in the following screen.

2. Click on the Bootstrap Actions tab to display the bootstrap action that AWS Data Pipeline uses to install AWS Data Pipeline Task Agent on the Amazon EMR clusters that it launches.

3. After one of the runs is complete, navigate to the Amazon S3 console and check that the time-stamped output folder exists and contains the expected results of the job flow.
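The MyCluster object in this pipeline relies on the default cluster size of two instances. If you want to size the cluster explicitly, you can use the EmrCluster fields that appear in the import and export tutorials later in this guide; the following is a minimal sketch, and the instance types and count shown are illustrative values rather than recommendations.
{
  "id": "MyCluster",
  "type": "EmrCluster",
  "masterInstanceType": "m1.small",
  "instanceCoreType": "m1.xlarge",
  "instanceCoreCount": "2",
  "schedule": { "ref": "Hourly" }
}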

Tutorial: Import/Export Data in Amazon DynamoDB With Amazon EMR and Hive
This is the first of a two-part tutorial that demonstrates how to bring together multiple AWS features to solve real-world problems in a scalable way through a common scenario: moving schema-less data in and out of Amazon DynamoDB using Amazon EMR and Hive. Complete part one before you move on to part two.
This tutorial involves the following concepts and procedures:
Using the AWS Data Pipeline console and command-line interface (CLI) to create and configure pipelines
Creating and configuring Amazon DynamoDB tables
Creating and allocating work to Amazon EMR clusters
Querying and processing data with Hive scripts
Storing and accessing data using Amazon S3
Part One: Import Data into Amazon DynamoDB
Topics
Before You Begin... (p. 70)
Create an Amazon SNS Topic (p. 73)
Create an Amazon S3 Bucket (p. 74)
Using the AWS Data Pipeline Console (p. 74)
Using the Command Line Interface (p. 81)
The first part of this tutorial explains how to define an AWS Data Pipeline pipeline to retrieve data from a tab-delimited file in Amazon S3 to populate an Amazon DynamoDB table, use a Hive script to define the necessary data transformation steps, and automatically create an Amazon EMR cluster to perform the work.
The first part of the tutorial involves the following steps:
1. Create an Amazon DynamoDB table to store the data
2. Create and configure the pipeline definition objects

3. Upload your pipeline definition
4. Verify your results
Before You Begin...
Be sure you've completed the following steps.
Set up an Amazon Web Services (AWS) account to access the AWS Data Pipeline console. For more information, see Access the Console (p. 12).
Set up the AWS Data Pipeline tools and interface you plan on using. For more information on interfaces and tools you can use to interact with AWS Data Pipeline, see Get Set Up for AWS Data Pipeline (p. 12).
Create an Amazon S3 bucket as a data source. For more information, see Create a Bucket in the Amazon Simple Storage Service Getting Started Guide.
Create an Amazon DynamoDB table to store data as defined by the following procedure.
Be aware of the following:
Imports may overwrite data in your Amazon DynamoDB table. When you import data from Amazon S3, the import may overwrite items in your Amazon DynamoDB table. Make sure that you are importing the right data and into the right table. Be careful not to accidentally set up a recurring import pipeline that will import the same data multiple times.
Exports may overwrite data in your Amazon S3 bucket. When you export data to Amazon S3, you may overwrite previous exports if you write to the same bucket path. The default behavior of the Export DynamoDB to S3 template will append the job's scheduled time to the Amazon S3 bucket path, which will help you avoid this problem.
Import and Export jobs will consume some of your Amazon DynamoDB table's provisioned throughput capacity. This section explains how to schedule an import or export job using Amazon EMR. The Amazon EMR cluster will consume some read capacity during exports or write capacity during imports. You can control the percentage of the provisioned capacity that the import/export jobs consume with the settings MyImportJob.myDynamoDBWriteThroughputRatio and MyExportJob.myDynamoDBReadThroughputRatio. Be aware that these settings determine how much capacity to consume at the beginning of the import/export process and will not adapt in real time if you change your table's provisioned capacity in the middle of the process.
Be aware of the costs. AWS Data Pipeline manages the import/export process for you, but you still pay for the underlying AWS services that are being used. The import and export pipelines will create Amazon EMR clusters to read and write data and there are per-instance charges for each node in the cluster. You can read more about the details of Amazon EMR Pricing. The default cluster configuration is one m1.small instance master node and one m1.xlarge instance task node, though you can change this configuration in the pipeline definition. There are also charges for AWS Data Pipeline. For more information, see AWS Data Pipeline Pricing and Amazon S3 Pricing.
Create an Amazon DynamoDB Table
This section explains how to create an Amazon DynamoDB table that is a prerequisite for this tutorial. For more information, see Working with Tables in Amazon DynamoDB in the Amazon DynamoDB Developer Guide.
Note
If you already have an Amazon DynamoDB table, you can skip this procedure to create one.

To create an Amazon DynamoDB table
1. Sign in to the AWS Management Console and open the Amazon DynamoDB console.
2. Click Create Table.
3. On the Create Table / Primary Key page, enter a name (for example, MyTable) in the Table box.
Note
Your table name must be unique.
4. In the Primary Key section, for the Primary Key Type radio button, select Hash.
5. In the Hash Attribute field, select Number and enter Id in the text box as shown:
6. Click Continue.
7. On the Create Table / Provisioned Throughput Capacity page, in the Read Capacity Units box, enter 5.
8. In the Write Capacity Units box, enter 5 as shown:

Note
In this example, we use read and write capacity unit values of five because the sample input data is small. You may need a larger value depending on the size of your actual input data set. For more information, see Provisioned Throughput in Amazon DynamoDB in the Amazon DynamoDB Developer Guide.
9. Click Continue.
10. On the Create Table / Throughput Alarms page, in the Send notification to box, enter your email address as shown:

Create an Amazon SNS Topic
This section explains how to create an Amazon SNS topic and subscribe to receive notifications from AWS Data Pipeline regarding the status of your pipeline components. For more information, see Create a Topic in the Amazon SNS Getting Started Guide.
Note
If you already have an Amazon SNS topic ARN to which you have subscribed, you can skip this procedure to create one.
To create an Amazon SNS topic
1. Sign in to the AWS Management Console and open the Amazon SNS console.
2. Click Create New Topic.
3. In the Topic field, type your topic name, such as my-example-topic, and select Create Topic.
4. Note the value from the Topic ARN field, which should be similar in format to this example: arn:aws:sns:us-east-1:403example:my-example-topic.
To create an Amazon SNS subscription
1. Sign in to the AWS Management Console and open the Amazon SNS console.
2. In the navigation pane, select your Amazon SNS topic and click Create New Subscription.
3. In the Protocol field, choose Email.
4. In the Endpoint field, type your email address and select Subscribe.
Note
You must accept the subscription confirmation to begin receiving Amazon SNS notifications at the email address you specify.
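Later in this tutorial, the topic ARN that you note here is what the pipeline's notification actions reference. As a preview, the following is a minimal sketch of one such SnsAlarm object; the ARN, role, and message text are placeholders, and the complete alarms appear in the CLI section of this tutorial.
{
  "id": "SuccessSnsAlarm",
  "type": "SnsAlarm",
  "topicArn": "arn:aws:sns:us-east-1:403example:my-example-topic",
  "role": "test-role",
  "subject": "Pipeline activity succeeded",
  "message": "The import activity completed successfully."
}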

Create an Amazon S3 Bucket
This section explains how to create an Amazon S3 bucket as a storage location for your input and output files related to this tutorial. For more information, see Create a Bucket in the Amazon Simple Storage Service Getting Started Guide.
Note
If you already have an Amazon S3 bucket configured with write permissions, you can skip this procedure to create one.
To create an Amazon S3 bucket
1. Sign in to the AWS Management Console and open the Amazon S3 console.
2. Click Create Bucket.
3. In the Bucket field, type your bucket name, such as my-example-bucket, and select Create.
4. In the Buckets pane, select your new bucket and select Permissions.
5. Ensure that all user accounts that you want to access these files appear in the Grantee list.
Using the AWS Data Pipeline Console
Topics
Start Import from the Amazon DynamoDB Console (p. 74)
Create the Pipeline Definition using the AWS Data Pipeline Console (p. 75)
Create and Configure the Pipeline from a Template (p. 76)
Complete the Data Nodes (p. 76)
Complete the Resources (p. 77)
Complete the Activity (p. 78)
Complete the Notifications (p. 78)
Validate and Save Your Pipeline (p. 78)
Verify your Pipeline Definition (p. 79)
Activate your Pipeline (p. 79)
Monitor the Progress of Your Pipeline Runs (p. 80)
[Optional] Delete your Pipeline (p. 81)
The following topics explain how to define an AWS Data Pipeline pipeline to retrieve data from a tab-delimited file using the AWS Data Pipeline console.
Start Import from the Amazon DynamoDB Console
You can begin the Amazon DynamoDB import operation from within the Amazon DynamoDB console.
To start the data import
1. Sign in to the AWS Management Console and open the Amazon DynamoDB console.
2. On the Tables screen, click your Amazon DynamoDB table and click the Import Table button.
3. On the Import Table screen, read the walkthrough and check the I have read the walkthrough box, then select Build a Pipeline. This opens the AWS Data Pipeline console so that you can choose a template to import the Amazon DynamoDB table data.

Create the Pipeline Definition using the AWS Data Pipeline Console
To create the new pipeline
1. Sign in to the AWS Management Console and open the AWS Data Pipeline console, or arrive at the AWS Data Pipeline console through the Build a Pipeline button in the Amazon DynamoDB console.
2. Click Create new pipeline.
3. On the Create a New Pipeline page:
a. In the Pipeline Name box, enter a name (for example, CopyMyS3Data).
b. In Pipeline Description, enter a description.
c. Leave the Select Schedule Type: button set to the default type, Time Series Style Scheduling, for this tutorial. Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval.
d. Leave the Role boxes set to their default values for this tutorial, which is DataPipelineDefaultRole.
Note
If you have created your own IAM roles and would like to use them in this tutorial, you can select them now.
e. Leave the Role boxes set to their default values for this tutorial, which are DataPipelineDefaultRole for the role and DataPipelineDefaultResourceRole for the resource role.
Note
If you have created your own IAM roles and would like to use them in this tutorial, you can select them now.
f. Click Create a new Pipeline.

Create and Configure the Pipeline from a Template
On the Pipeline screen, click Templates and select Export S3 to DynamoDB. The AWS Data Pipeline console pre-populates a pipeline definition template with the base objects necessary to import data from Amazon S3, as shown in the following screen.
Review the template and complete the missing fields. You start by choosing the schedule and frequency by which you want your data import operation to run.
To complete the schedule
On the Pipeline screen, click Schedules.
a. In the ImportSchedule section, set Period to 1 Hours.
b. Set Start Date Time using the calendar to the current date, and set the time to 00:00:00 UTC.
c. In the Add an optional field... box, select End Date Time.
d. Set End Date Time using the calendar to the following day, and set the time to 00:00:00 UTC.
Complete the Data Nodes
Next, you complete the data node objects in your pipeline definition template.
To complete the Amazon DynamoDB data node
1. On the Pipeline: name of your pipeline page, select DataNodes.
2. In the DataNodes pane:

a. Enter a name for the data node; for example: DynamoDB.
b. In the MyDynamoDBData section, in the Table box, type the name of the Amazon DynamoDB table where you want to store the output data; for example: MyTable.
To complete the Amazon S3 data node
In the DataNodes pane:
In the MyS3Data section, in the Directory Path field, type a valid Amazon S3 directory path for the location of your source data, for example, s3://elasticmapreduce/samples/store/productcatalog. This sample file is a fictional product catalog that is pre-populated with delimited data for demonstration purposes.
Complete the Resources
Next, you complete the resources that will run the data import activities. Many of the fields are auto-populated by the template, as shown in the following screen. You only need to complete the empty fields.
To complete the resources
On the Pipeline page, select Resources. In the Emr Log Uri box, type the path where you want to store Amazon EMR debugging logs, using the Amazon S3 bucket that you created earlier in this tutorial; for example: s3://my-test-bucket/emr_debug_logs.
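The Emr Log Uri box corresponds to the emrLogUri field on the EmrCluster object in the underlying pipeline definition, which is shown in full in the CLI section of this tutorial. A minimal fragment under that assumption, with an example bucket name, might look like the following.
{
  "id": "ImportCluster",
  "type": "EmrCluster",
  "enableDebugging": "true",
  "emrLogUri": "s3://my-test-bucket/emr_debug_logs"
}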

Complete the Activity
Next, you complete the activity that represents the steps to perform in your data import operation.
To complete the activity
1. On the Pipeline: name of your pipeline page, select Activities.
2. In the MyImportJob section, review the default options already provided. You are not required to manually configure any options in this section.
Complete the Notifications
Next, configure the SNS notification action AWS Data Pipeline must perform depending on the outcome of the activity.
To configure the SNS success, failure, and late notification action
1. On the Pipeline: name of your pipeline page, in the right pane, click Others.
2. In the Others pane:
a. In the LateSnsAlarm section, in the Topic Arn box, enter the ARN of your Amazon SNS topic that you created earlier in the tutorial; for example: arn:aws:sns:us-east-1:403example:my-example-topic.
b. In the FailureSnsAlarm section, in the Topic Arn box, enter the ARN of your Amazon SNS topic that you created earlier in the tutorial; for example: arn:aws:sns:us-east-1:403example:my-example-topic.
c. In the SuccessSnsAlarm section, in the Topic Arn box, enter the ARN of your Amazon SNS topic that you created earlier in the tutorial; for example: arn:aws:sns:us-east-1:403example:my-example-topic.
You have now completed all the steps required for creating your pipeline definition. Next, validate and save your pipeline.
Validate and Save Your Pipeline
You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition. If your pipeline is incomplete or incorrect, AWS Data Pipeline throws a validation error. If you plan to continue the creation process later, you can ignore the error message. If your pipeline definition is complete and you are getting the validation error message, you'll have to fix the errors in the pipeline definition before activating your pipeline.
To validate and save your pipeline
1. On the Pipeline: name of your pipeline page, click Save Pipeline.
2. AWS Data Pipeline validates your pipeline definition and returns either success or the error message.
3. If you get an error message, click Close and then, in the right pane, click Errors.
4. The Errors pane lists the objects failing validation. Click the plus (+) sign next to the object names and look for an error message in red.
5. When you see the error message, click the specific object pane where you see the error and fix it. For example, if you see an error message in the DataNodes object, click the DataNodes pane to fix the error.

6. After you've fixed the errors listed in the Errors pane, click Save Pipeline.
7. Repeat the process until your pipeline is validated.
Next, verify that your pipeline definition has been saved.
Verify your Pipeline Definition
It is important to verify that your pipeline was correctly initialized from your definitions before you activate it.
To verify your pipeline definition
1. On the Pipeline: name of your pipeline page, click Back to list of pipelines.
2. On the List Pipelines page, check if your newly-created pipeline is listed. AWS Data Pipeline has created a unique Pipeline ID for your pipeline definition. The Status column in the row listing your pipeline should show PENDING.
3. Click on the triangle icon next to your pipeline. The Pipeline Summary panel below shows the details of your pipeline runs. Because your pipeline is not yet activated, you should see only 0s at this point.
4. In the Pipeline summary panel, click View all fields to see the configuration of your pipeline definition.
5. Click Close.
Next, activate your pipeline.
Activate your Pipeline
You must activate your pipeline to start creating and processing runs based on the specifications in your pipeline definition.

To activate your pipeline
1. On the List Pipelines page, in the Details column of your pipeline, click View pipeline.
2. In the Pipeline: name of your pipeline page, click Activate. A confirmation dialog box opens up confirming the activation.
3. Click Close.
Next, verify if your pipeline is running.
Monitor the Progress of Your Pipeline Runs
To monitor the progress of your pipeline
1. On the List Pipelines page, in the Details column of your pipeline, click View instance details.
2. The Instance details: name of your pipeline page lists the status of each object in your pipeline definition.
Note
If you do not see instances listed, depending on when your pipeline was scheduled, either click the End (in UTC) date box and change it to a later date or click the Start (in UTC) date box and change it to an earlier date. Then click Update.
3. If the Status column of all the objects in your pipeline indicates FINISHED, your pipeline has successfully completed the copy activity. You should receive an email about the successful completion of this task at the account you specified for receiving your Amazon SNS notification. You can also check your Amazon S3 data target bucket to verify if the data was copied.
4. If the Status column of any of your instances indicates a status other than FINISHED, either your pipeline is waiting for some precondition to be met or it has failed.
a. To troubleshoot the failed or the incomplete instance runs, click the triangle next to an instance; the Instance summary panel opens to show the details of the selected instance.
b. Click View instance fields to see additional details of the instance. If the status of your selected instance is FAILED, the details box has an entry indicating the reason for failure; for example: Resource not healthy terminated.
c. In the Instance summary pane, in the Select attempt for this instance box, select the attempt number.
d. In the Instance summary pane, click View attempt fields to see details of fields associated with the selected attempt.

5. To take an action on your incomplete or failed instance, select an action (Rerun, Cancel, or Mark Finished) from the Action column of the instance.
You can use the information in the Instance summary pane and the View instance fields box to troubleshoot issues with your failed pipeline.
For more information about instance status, see Interpret Pipeline Status Details (p. 129). For more information about troubleshooting the failed or incomplete instance runs of your pipeline, see AWS Data Pipeline Problems and Solutions (p. 131).
Important
Your pipeline is running and is incurring charges. For more information, see AWS Data Pipeline pricing. If you would like to stop incurring the AWS Data Pipeline usage charges, delete your pipeline.
[Optional] Delete your Pipeline
Deleting your pipeline deletes the pipeline definition including all the associated objects. You stop incurring charges as soon as your pipeline is deleted.
To delete your pipeline
1. In the List Pipelines page, click the check box next to your pipeline.
2. Click Delete.
3. In the confirmation dialog box, click Delete to confirm the delete request.
Using the Command Line Interface
Topics
Define the Import Pipeline in JSON Format (p. 82)
Schedule (p. 84)

Amazon S3 Data Node (p. 84)
Precondition (p. 85)
Amazon EMR Cluster (p. 86)
Amazon EMR Activity (p. 86)
Upload the Pipeline Definition (p. 88)
Activate the Pipeline (p. 89)
Verify the Pipeline Status (p. 89)
Verify Data Import (p. 90)
The following topics explain how to perform the steps in this tutorial using the AWS Data Pipeline CLI.
Define the Import Pipeline in JSON Format
This example pipeline definition shows how to use AWS Data Pipeline to retrieve data from a file in Amazon S3 to populate an Amazon DynamoDB table, use a Hive script to define the necessary data transformation steps, and automatically create an Amazon EMR cluster to perform the work. Additionally, this pipeline sends Amazon SNS notifications if the pipeline succeeds, fails, or runs late. This is the full pipeline definition JSON file followed by an explanation for each of its sections.
Note
We recommend that you use a text editor that can help you verify the syntax of JSON-formatted files, and name the file using the .json file extension.
{
  "objects": [
    {
      "id": "MySchedule",
      "type": "Schedule",
      "startDateTime": " T00:00:00",
      "endDateTime": " T00:00:00",
      "period": "1 day"
    },
    {
      "id": "MyS3Data",
      "type": "S3DataNode",
      "schedule": { "ref": "MySchedule" },
      "filePath": "s3://input_bucket/productcatalog",
      "precondition": { "ref": "InputReady" }
    },
    {
      "id": "InputReady",
      "type": "S3PrefixNotEmpty",
      "role": "test-role",
      "s3Prefix": "#{node.filePath}"
    },
    {
      "id": "ImportCluster",
      "type": "EmrCluster",
      "masterInstanceType": "m1.small",
      "instanceCoreType": "m1.xlarge",
      "instanceCoreCount": "1",

      "schedule": { "ref": "MySchedule" },
      "enableDebugging": "true",
      "emrLogUri": "s3://test_bucket/emr_logs"
    },
    {
      "id": "MyImportJob",
      "type": "EmrActivity",
      "dynamoDBOutputTable": "MyTable",
      "dynamoDBWritePercent": "1.00",
      "s3MyS3Data": "#{input.path}",
      "lateAfterTimeout": "12 hours",
      "attemptTimeout": "24 hours",
      "maximumRetries": "0",
      "input": { "ref": "MyS3Data" },
      "runsOn": { "ref": "ImportCluster" },
      "schedule": { "ref": "MySchedule" },
      "onSuccess": { "ref": "SuccessSnsAlarm" },
      "onFail": { "ref": "FailureSnsAlarm" },
      "onLateAction": { "ref": "LateSnsAlarm" },
      "step": "s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://elasticmapreduce/libs/hive/dynamodb/importDynamoDBTableFromS3,-d,DYNAMODB_OUTPUT_TABLE=#{dynamoDBOutputTable},-d,S3_INPUT_BUCKET=#{s3MyS3Data},-d,DYNAMODB_WRITE_PERCENT=#{dynamoDBWritePercent},-d,DYNAMODB_ENDPOINT=dynamodb.us-east-1.amazonaws.com"
    },
    {
      "id": "SuccessSnsAlarm",
      "type": "SnsAlarm",
      "topicArn": "arn:aws:sns:us-east-1: :mysnsnotify",
      "role": "test-role",
      "subject": "DynamoDB table '#{node.dynamoDBOutputTable}' import succeeded",
      "message": "DynamoDB table '#{node.dynamoDBOutputTable}' import from S3 bucket '#{node.s3MyS3Data}' succeeded at JobId: #{node.id}"
    },
    {
      "id": "LateSnsAlarm",
      "type": "SnsAlarm",
      "topicArn": "arn:aws:sns:us-east-1: :mysnsnotify",
      "role": "test-role",
      "subject": "DynamoDB table '#{node.dynamoDBOutputTable}' import is taking a long time!",

      "message": "DynamoDB table '#{node.dynamoDBOutputTable}' import from S3 bucket '#{node.s3MyS3Data}' has exceeded the late warning period '#{node.lateAfterTimeout}'. JobId: #{node.id}"
    },
    {
      "id": "FailureSnsAlarm",
      "type": "SnsAlarm",
      "topicArn": "arn:aws:sns:us-east-1: :mysnsnotify",
      "role": "test-role",
      "subject": "DynamoDB table '#{node.dynamoDBOutputTable}' import failed!",
      "message": "DynamoDB table '#{node.dynamoDBOutputTable}' import from S3 bucket '#{node.s3MyS3Data}' failed. JobId: #{node.id}. Error: #{node.errorMessage}."
    }
  ]
}
Schedule
The example AWS Data Pipeline JSON file begins with a section to define the schedule by which to copy the data. Many pipeline components have a reference to a schedule and you may have more than one. The Schedule component is defined by the following fields:
{
  "id": "MySchedule",
  "type": "Schedule",
  "startDateTime": " T00:00:00",
  "endDateTime": " T00:00:00",
  "period": "1 day"
}
Note
In the JSON file, you can define the pipeline components in any order you prefer. In this example, we chose the order that best illustrates the pipeline component dependencies.
Id
The user-defined name for the pipeline schedule, which is a label for your reference only.
Type
The pipeline component type, which is Schedule.
startDateTime
The date/time (in UTC format) that you want the task to begin.
endDateTime
The date/time (in UTC format) that you want the task to stop.
period
The time period that you want to pass between task attempts, even if the task occurs only one time. The period must evenly divide the time between startDateTime and endDateTime. In this example, we set the period to be 1 day so that the pipeline copy operation can only run one time.
Amazon S3 Data Node
Next, the S3DataNode pipeline component defines a location for the input file; in this case a tab-delimited file in an Amazon S3 bucket location. The input S3DataNode component is defined by the following fields:

{
  "id": "MyS3Data",
  "type": "S3DataNode",
  "schedule": { "ref": "MySchedule" },
  "filePath": "s3://input_bucket/productcatalog",
  "precondition": { "ref": "InputReady" }
}
Id
The user-defined name for the input location (a label for your reference only).
Type
The pipeline component type, which is "S3DataNode" to match the location where the data resides, in an Amazon S3 bucket.
Schedule
A reference to the schedule component that we created in the preceding lines of the JSON file labeled MySchedule.
Path
The path to the data associated with the data node. This path contains a sample product catalog input file that we use for this scenario. The syntax for a data node is determined by its type. For example, a data node for a file in Amazon S3 follows a different syntax than is appropriate for a database table.
Precondition
A reference to a precondition that must evaluate as true for the pipeline to consider the data node to be valid. The precondition itself is defined later in the pipeline definition file.
Precondition
Next, the precondition defines a condition that must be true for the pipeline to use the S3DataNode associated with this precondition. The precondition is defined by the following fields:
{
  "id": "InputReady",
  "type": "S3PrefixNotEmpty",
  "role": "test-role",
  "s3Prefix": "#{node.filePath}"
}
Id
The user-defined name for the precondition (a label for your reference only).
Type
The type of the precondition is S3PrefixNotEmpty, which checks an Amazon S3 prefix to ensure that it is not empty.
Role
The IAM role that provides the permissions necessary to access the S3DataNode.
S3Prefix
The Amazon S3 prefix to check for emptiness. This field uses an expression, #{node.filePath}, populated from the referring component, which in this example is the S3DataNode that refers to this precondition.
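The #{node.filePath} expression lets the same precondition be reused by whatever data node references it. A precondition can also name a literal prefix instead; the following is a minimal sketch of that alternative, where the bucket and prefix are example values rather than part of the tutorial.
{
  "id": "InputReady",
  "type": "S3PrefixNotEmpty",
  "role": "test-role",
  "s3Prefix": "s3://input_bucket/productcatalog"
}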

Amazon EMR Cluster
Next, the EmrCluster pipeline component defines an Amazon EMR cluster that processes and moves the data in this tutorial. The EmrCluster component is defined by the following fields:
{
  "id": "ImportCluster",
  "type": "EmrCluster",
  "masterInstanceType": "m1.small",
  "instanceCoreType": "m1.xlarge",
  "instanceCoreCount": "1",
  "schedule": { "ref": "MySchedule" },
  "enableDebugging": "true",
  "emrLogUri": "s3://test_bucket/emr_logs"
}
Id
The user-defined name for the Amazon EMR cluster (a label for your reference only).
Type
The computational resource type, which is an Amazon EMR cluster. For more information, see Overview of Amazon EMR in the Amazon EMR Developer Guide.
masterInstanceType
The type of Amazon EC2 instance to use as the master node of the Amazon EMR cluster. For more information, see Amazon EC2 Instance Types in the Amazon EC2 Documentation.
instanceCoreType
The type of Amazon EC2 instance to use as the core node of the Amazon EMR cluster. For more information, see Amazon EC2 Instance Types in the Amazon EC2 Documentation.
instanceCoreCount
The number of core Amazon EC2 instances to use in the Amazon EMR cluster.
Schedule
A reference to the schedule component that we created in the preceding lines of the JSON file labeled MySchedule.
enableDebugging
Indicates whether to create detailed debug logs for the Amazon EMR job flow.
emrLogUri
Specifies an Amazon S3 location to store the Amazon EMR job flow debug logs if you enabled debugging with the previously mentioned enableDebugging field.
Amazon EMR Activity
Next, the EmrActivity pipeline component brings together the schedule, resources, and data nodes to define the work to perform, the conditions under which to do the work, and the actions to perform when certain events occur. The EmrActivity component is defined by the following fields:
{
  "id": "MyImportJob",
  "type": "EmrActivity",
  "dynamoDBOutputTable": "MyTable",
  "dynamoDBWritePercent": "1.00",
  "s3MyS3Data": "#{input.path}",
  "lateAfterTimeout": "12 hours",

  "attemptTimeout": "24 hours",
  "maximumRetries": "0",
  "input": { "ref": "MyS3Data" },
  "runsOn": { "ref": "ImportCluster" },
  "schedule": { "ref": "MySchedule" },
  "onSuccess": { "ref": "SuccessSnsAlarm" },
  "onFail": { "ref": "FailureSnsAlarm" },
  "onLateAction": { "ref": "LateSnsAlarm" },
  "step": "s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://elasticmapreduce/libs/hive/dynamodb/importDynamoDBTableFromS3,-d,DYNAMODB_OUTPUT_TABLE=#{dynamoDBOutputTable},-d,S3_INPUT_BUCKET=#{s3MyS3Data},-d,DYNAMODB_WRITE_PERCENT=#{dynamoDBWritePercent},-d,DYNAMODB_ENDPOINT=dynamodb.us-east-1.amazonaws.com"
}
Id
The user-defined name for the Amazon EMR activity (a label for your reference only).
Type
The EmrActivity pipeline component type, which creates an Amazon EMR job flow to perform the defined work. For more information, see Overview of Amazon EMR in the Amazon EMR Developer Guide.
dynamoDBOutputTable
The Amazon DynamoDB table where the Amazon EMR job flow writes the output of the Hive script.
dynamoDBWritePercent
Sets the rate of write operations to keep your Amazon DynamoDB provisioned throughput rate in the allocated range for your table. The value is between 0.1 and 1.5, inclusive. For more information, see Hive Options in the Amazon EMR Developer Guide.
s3MyS3Data
An expression that refers to the Amazon S3 location path of the input data defined by the S3DataNode labeled "MyS3Data".
lateAfterTimeout
The amount of time, after the schedule start time, that the activity can wait to start before AWS Data Pipeline considers it late.
attemptTimeout
The amount of time, after the schedule start time, that the activity has to complete before AWS Data Pipeline considers it failed.
maximumRetries
The maximum number of times that AWS Data Pipeline retries the activity.
input
The Amazon S3 location path of the input data defined by the S3DataNode labeled "MyS3Data".

runsOn
A reference to the computational resource that will run the activity; in this case, an EmrCluster labeled "ImportCluster".
schedule
A reference to the schedule component that we created in the preceding lines of the JSON file labeled MySchedule.
onSuccess
A reference to the action to perform when the activity is successful. In this case, it is to send an Amazon SNS notification.
onFail
A reference to the action to perform when the activity fails. In this case, it is to send an Amazon SNS notification.
onLateAction
A reference to the action to perform when the activity is late. In this case, it is to send an Amazon SNS notification.
step
Defines the steps for the EMR job flow to perform. This step calls a Hive script named importDynamoDBTableFromS3 that is provided by Amazon EMR and is specifically designed to move data from Amazon S3 into Amazon DynamoDB. To perform more complex data transformation tasks, you would customize this Hive script and provide its name and path here. For more information about sample Hive scripts that show how to perform data transformation tasks, see Contextual Advertising using Apache Hive and Amazon EMR in AWS Articles and Tutorials.
Upload the Pipeline Definition
You can upload a pipeline definition file using the AWS Data Pipeline CLI tools. For more information, see Install the Command Line Interface (p. 15).
To upload your pipeline definition, use the following command.
On Linux/Unix/Mac OS:
./datapipeline --create pipeline_name --put pipeline_file
On Windows:
ruby datapipeline --create pipeline_name --put pipeline_file
Where pipeline_name is the label for your pipeline and pipeline_file is the full path and file name for the file with the .json file extension that defines your pipeline.
If your pipeline validates successfully, you receive the following message:
Pipeline with name pipeline_name and id df-akiaiosfodnn7example created.
Pipeline definition pipeline_file.json uploaded.
Note
For more information about any errors returned by the create command or other commands, see Troubleshoot AWS Data Pipeline (p. 128).
Ensure that your pipeline appears in the pipeline list by using the following command.
On Linux/Unix/Mac OS:

./datapipeline --list-pipelines
On Windows:
ruby datapipeline --list-pipelines
The list of pipelines includes details such as Id, State, and UserId. Take note of your pipeline ID, because you use this value for most AWS Data Pipeline CLI commands. The pipeline ID is a unique identifier using the format df-akiaiosfodnn7example.
Activate the Pipeline
You must activate the pipeline by using the --activate command-line parameter before it will begin performing work. Use the following command.
On Linux/Unix/Mac OS:
./datapipeline --activate --id df-akiaiosfodnn7example
On Windows:
ruby datapipeline --activate --id df-akiaiosfodnn7example
Where df-akiaiosfodnn7example is the identifier for your pipeline.
Verify the Pipeline Status
View the status of your pipeline and its components, along with its activity attempts and retries, with the following command.
On Linux/Unix/Mac OS:
./datapipeline --list-runs --id df-akiaiosfodnn7example
On Windows:
ruby datapipeline --list-runs --id df-akiaiosfodnn7example
Where df-akiaiosfodnn7example is the identifier for your pipeline.
The --list-runs command displays a list of pipeline components and details such as Scheduled Start, Status, ID, Started, and Ended.
Note
It is important to note the difference between the Scheduled Start date/time vs. the Started time. It is possible to schedule a pipeline component to run at a certain time (Scheduled Start), but the actual start time (Started) could be later due to problems or delays with preconditions, dependencies, failures, or retries.
Note
AWS Data Pipeline may backfill a pipeline, which happens when you define a Scheduled Start date/time for a date in the past. In that situation, AWS Data Pipeline immediately runs the pipeline components the number of times the activity should have run if it had started on the Scheduled

Start time. When this happens, you see pipeline components run back-to-back at a greater frequency than the period value that you specified when you created the pipeline. AWS Data Pipeline returns your pipeline to the defined frequency only when it catches up to the number of past runs.
Successful pipeline runs are indicated by all the activities in your pipeline reporting the FINISHED status. Your pipeline frequency determines how many times the pipeline runs, and each run has its own success or failure, as indicated by the --list-runs command. Resources that you define in your pipeline, such as Amazon EC2 instances, may show the SHUTTING_DOWN status until they are finally terminated after a successful run. Depending on how you configured your pipeline, you may have multiple Amazon EC2 resources, each with their own final status.
Verify Data Import
Next, verify that the data import occurred successfully using the Amazon DynamoDB console to inspect the data in the table.
To verify the data import
1. Sign in to the AWS Management Console and open the Amazon DynamoDB console.
2. On the Tables screen, click your Amazon DynamoDB table and click the Explore Table button.
3. On the Browse Items tab, columns that correspond to the data input file should display, such as Id, Price, and ProductCategory, as shown in the following screen. This indicates that the import operation from the file to the Amazon DynamoDB table occurred successfully.
Part Two: Export Data from Amazon DynamoDB
Topics
Before You Begin... (p. 91)
Using the AWS Data Pipeline Console (p. 92)
Using the Command Line Interface (p. 98)
This is the second of a two-part tutorial that demonstrates how to bring together multiple AWS features to solve real-world problems in a scalable way through a common scenario: moving schema-less data in and out of Amazon DynamoDB using Amazon EMR and Hive.
This tutorial involves the following concepts and procedures:

96 Before You Begin... Using the AWS Data Pipeline console and command-line interface (CLI) to create and configure pipelines Creating and configuring Amazon DynamoDB tables Creating and allocating work to Amazon EMR clusters Querying and processing data with Hive scripts Storing and accessing data using Amazon S3 Before You Begin... You must complete part one of this tutorial to ensure that your Amazon DynamoDB table contains the necessary data to perform the steps in this section. For more information, see Part One: Import Data into Amazon DynamoDB (p. 69). Additionally, be sure you've completed the following steps: Set up an Amazon Web Services (AWS) account to access the AWS Data Pipeline console. For more information, see Access the Console (p. 12). Set up the AWS Data Pipeline tools and interface you plan on using. For more information about interfaces and tools you can use to interact with AWS Data Pipeline, see Get Set Up for AWS Data Pipeline (p. 12). Create an Amazon S3 bucket as a data output location. For more information, see Create a Bucket in the Amazon Simple Storage Service Getting Started Guide. Ensure that you have the Amazon DynamoDB table that was created and populated with data in part one of this tutorial. This table will be your data source for part two of the tutorial. For more information, see Part One: Import Data into Amazon DynamoDB (p. 69). Be aware of the following: Imports may overwrite data in your Amazon DynamoDB table. When you import data from Amazon S3, the import may overwrite items in your Amazon DynamoDB table. Make sure that you are importing the right data and into the right table. Be careful not to accidentally set up a recurring import pipeline that will import the same data multiple times. Exports may overwrite data in your Amazon S3 bucket. When you export data to Amazon S3, you may overwrite previous exports if you write to the same bucket path. The default behavior of the Export DynamoDB to S3 template will append the job s scheduled time to the Amazon S3 bucket path, which will help you avoid this problem. Import and Export jobs will consume some of your Amazon DynamoDB table s provisioned throughput capacity. This section explains how to schedule an import or export job using Amazon EMR. The Amazon EMR cluster will consume some read capacity during exports or write capacity during imports. You can control the percentage of the provisioned capacity that the import/export jobs consume by with the settings MyImportJob.myDynamoDBWriteThroughputRatio and MyExportJob.myDynamoDBReadThroughputRatio. Be aware that these settings determine how much capacity to consume at the beginning of the import/export process and will not adapt in real time if you change your table s provisioned capacity in the middle of the process. Be aware of the costs. AWS Data Pipeline manages the import/export process for you, but you still pay for the underlying AWS services that are being used. The import and export pipelines will create Amazon EMR clusters to read and write data and there are per-instance charges for each node in the cluster. You can read more about the details of Amazon EMR Pricing. The default cluster configuration is one m1.small instance master node and one m1.xlarge instance task node, though you can change this configuration in the pipeline definition. There are also charges for AWS Data Pipeline. For more information, see AWS Data Pipeline Pricing and Amazon S3 Pricing. 91
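As noted above, writing each export under a path that includes the job's scheduled time is one way to avoid overwriting earlier exports. The following is a minimal sketch of an S3DataNode that follows that pattern; the bucket name and the placement of the expression are illustrative assumptions rather than part of the template.
{
  "id": "MyS3Data",
  "type": "S3DataNode",
  "schedule": { "ref": "MySchedule" },
  "filePath": "s3://output_bucket/productcatalog/#{@scheduledStartTime}"
}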

Using the AWS Data Pipeline Console
Topics
Start Export from the Amazon DynamoDB Console (p. 92)
Create the Pipeline Definition using the AWS Data Pipeline Console (p. 93)
Create and Configure the Pipeline from a Template (p. 93)
Complete the Data Nodes (p. 94)
Complete the Resources (p. 95)
Complete the Activity (p. 95)
Complete the Notifications (p. 96)
Validate and Save Your Pipeline (p. 96)
Verify your Pipeline Definition (p. 96)
Activate your Pipeline (p. 97)
Monitor the Progress of Your Pipeline Runs (p. 97)
[Optional] Delete your Pipeline (p. 98)
The following topics explain how to perform the steps in part two of this tutorial using the AWS Data Pipeline console.
Start Export from the Amazon DynamoDB Console
You can begin the Amazon DynamoDB export operation from within the Amazon DynamoDB console.
To start the data export
1. Sign in to the AWS Management Console and open the Amazon DynamoDB console.
2. On the Tables screen, click your Amazon DynamoDB table and click the Export Table button.
3. On the Import / Export Table screen, select Build a Pipeline. This opens the AWS Data Pipeline console so that you can choose a template to export the Amazon DynamoDB table data.

98 Using the AWS Data Pipeline Console Create the Pipeline Definition using the AWS Data Pipeline Console To create the new pipeline 1. Sign in to the AWS Management Console and open the AWS Data Pipeline console or arrive at the AWS Data Pipeline console through the Build a Pipeline button in the Amazon DynamoDB console. 2. Click Create new pipeline. 3. On the Create a New Pipeline page: a. In the Pipeline box, enter a name (for example, CopyMyS3Data). b. In Pipeline, enter a description. c. Leave the Select Schedule Type: button set to the default type Time Series Time Scheduling for this tutorial. Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of interval or end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval. d. Leave the Role boxes set to their default values for this tutorial, which is DataPipelineDefaultRole. te If you have created your own IAM roles and would like to use them in this tutorial, you can select them now. e. Leave the Role boxes set to their default values for this tutorial, which are DataPipelineDefaultRole for the role and DataPipelineDefaultResourceRole for the resource role. te If you have created your own IAM roles and would like to use them in this tutorial, you can select them now. f. Click Create a new Pipeline. Create and Configure the Pipeline from a Template On the Pipeline screen, click Templates and select Export DynamoDB to S3. The AWS Data Pipeline console pre-populates a pipeline definition template with the base objects necessary to export data from Amazon DynamoDB, as shown in the following screen. 93

Review the template and complete the missing fields. You start by choosing the schedule and frequency by which you want your data export operation to run.
To complete the schedule
On the Pipeline screen, click Schedules.
a. In the DefaultSchedule1 section, set the name to ExportSchedule.
b. Set Period to 1 Hours.
c. Set Start Date Time using the calendar to the current date, and set the time to 00:00:00 UTC.
d. In the Add an optional field... box, select End Date Time.
e. Set End Date Time using the calendar to the following day, and set the time to 00:00:00 UTC.
Complete the Data Nodes
Next, you complete the data node objects in your pipeline definition template.
To complete the Amazon DynamoDB data node
1. On the Pipeline: name of your pipeline page, select DataNodes.
2. In the DataNodes pane, in the Table box, type the name of the Amazon DynamoDB table that you created in part one of this tutorial; for example: MyTable.

100 Using the AWS Data Pipeline Console To complete the Amazon S3 data node In the MyS3Data section, in the Directory Path field, type the path to the files where you want the Amazon DynamoDB table data to be written, which is the Amazon S3 bucket that you configured in part one of this tutorial. For example: s3://mybucket/output/mytable. Complete the Resources Next, you complete the resources that will run the data import activities. Many of the fields are auto-populated by the template, as shown in the following screen. You only need to complete the empty fields. To complete the resources On the Pipeline page, select Resources. In the Emr Log Uri box, type the path where to store EMR debugging logs, using the Amazon S3 bucket that you configured in part one of this tutorial; for example: s3://mybucket/emr_debug_logs. Complete the Activity Next, you complete the activity that represents the steps to perform in your data export operation. To complete the activity 1. On the Pipeline: name of your pipeline page, select Activities. 2. In the MyExportJob section, review the default options already provided. You are not required to manually configure any options in this section. 95

101 Using the AWS Data Pipeline Console Complete the tifications Next, configure the SNS notification action AWS Data Pipeline must perform depending on the outcome of the activity. To configure the SNS success, failure, and late notification action 1. On the Pipeline: name of your pipeline page, in the right pane, click Others. 2. In the Others pane: a. In the LateSnsAlarmsection, in the Topic Arn box, enter the ARN of your Amazon SNS topic that you created earlier in the tutorial; for example: arn:aws:sns:us-east-1:403example:my-example-topic. b. In the FailureSnsAlarmsection, in the Topic Arn box, enter the ARN of your Amazon SNS topic that you created earlier in the tutorial; for example: arn:aws:sns:us-east-1:403example:my-example-topic. c. In the SuccessSnsAlarmsection, in the Topic Arn box, enter the ARN of your Amazon SNS topic that you created earlier in the tutorial; for example: arn:aws:sns:us-east-1:403example:my-example-topic. You have now completed all the steps required for creating your pipeline definition. Next, validate and save your pipeline. Validate and Save Your Pipeline You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition. If your pipeline is incomplete or is incorrect, AWS Data Pipeline throws a validation error. If you plan to continue the creation process later, you can ignore the error message. If your pipeline definition is complete, and you are getting the validation error message, you'll have to fix the errors in the pipeline definition before activating your pipeline. To validate and save your pipeline 1. On the Pipeline: name of your pipeline page, click Save Pipeline. 2. AWS Data Pipeline validates your pipeline definition and returns either success or the error message. 3. If you get an error message, click Close and then, in the right pane, click Errors. 4. The Errors pane lists the objects failing validation. Click the plus (+) sign next to the object names and look for an error message in red. 5. When you see the error message, click the specific object pane where you see the error and fix it. For example, if you see an error message in the Datades object, click the Datades pane to fix the error. 6. After you've fixed the errors listed in the Errors pane, click Save Pipeline. 7. Repeat the process until your pipeline is validated. Next, verify that your pipeline definition has been saved. Verify your Pipeline Definition It is important to verify that your pipeline was correctly initialized from your definitions before you activate it. 96

102 Using the AWS Data Pipeline Console To verify your pipeline definition 1. On the Pipeline: name of your pipeline page, click Back to list of pipelines. 2. On the List Pipelines page, check if your newly-created pipeline is listed. AWS Data Pipeline has created a unique Pipeline ID for your pipeline definition. The Status column in the row listing your pipeine should show PENDING. 3. Click on the triangle icon next to your pipeline. The Pipeline Summary panel below shows the details of your pipeline runs. Because your pipeline is not yet activated, you should see only 0s at this point. 4. In the Pipeline summary panel, click View all fields to see the configuration of your pipeline definition. 5. Click Close. Next, activate your pipeline. Activate your Pipeline You must activate your pipeline to start creating and processing runs based on the specifications in your pipeline definition. To activate your pipeline 1. On the List Pipelines page, in the Details column of your pipeline, click View pipeline. 2. In the Pipeline: name of your pipeline page, click Activate. Next, verify if your pipeline is running. Monitor the Progress of Your Pipeline Runs To monitor the progress of your pipeline 1. On the List Pipelines page, in the Details column of your pipeline, click View instance details. 2. The Instance details: name of your pipeline page lists the status of each object in your pipeline definition. te If you do not see runs listed, depending on when your pipeline was scheduled, either click End (in UTC) date box and change it to a later date or click Start (in UTC) date box and change it to an earlier date. Then click Update. 3. If the Status column of all the objects in your pipeline indicates FINISHED, your pipeline has successfully completed the copy activity.you should receive an about the successful completion of this task, to the account you specified for receiving your Amazon SNS notification. You can also check your Amazon S3 data target bucket to verify if the data was copied. 4. If the Status column of any of your objects indicates a status other than FINISHED, either your pipeline is waiting for some precondition to be met or it has failed. a. To troubleshoot the failed or the incomplete runs Click the triangle next to a run; the Instance summary panel opens to show the details of the selected run. b. Click View instance fields to see additional details of the run. If the status of your selected run is FAILED, the additional details box has an entry indicating the reason for failure; for = Resource not healthy terminated. 97

c. You can use the information in the Instance summary panel and the View instance fields box to troubleshoot issues with your failed pipeline.
For more information about the status listed for the runs, see Interpret Pipeline Status Details (p. 129). For more information about troubleshooting the failed or incomplete runs of your pipeline, see AWS Data Pipeline Problems and Solutions (p. 131).
Important
Your pipeline is running and is incurring charges. For more information, see AWS Data Pipeline pricing. If you would like to stop incurring the AWS Data Pipeline usage charges, delete your pipeline.
[Optional] Delete your Pipeline
Deleting your pipeline deletes the pipeline definition including all the associated objects. You stop incurring charges as soon as your pipeline is deleted.
To delete your pipeline
1. In the List Pipelines page, click the check box next to your pipeline.
2. Click Delete.
3. In the confirmation dialog box, click Delete to confirm the delete request.
Using the Command Line Interface
Topics
Define the Export Pipeline in JSON Format (p. 98)
Schedule (p. 100)
Amazon S3 Data Node (p. 101)
Amazon EMR Cluster (p. 102)
Amazon EMR Activity (p. 102)
Upload the Pipeline Definition (p. 104)
Activate the Pipeline (p. 105)
Verify the Pipeline Status (p. 105)
Verify Data Export (p. 106)
The following topics explain how to perform the steps in this tutorial using the AWS Data Pipeline CLI.
Define the Export Pipeline in JSON Format
This example pipeline definition shows how to use AWS Data Pipeline to retrieve data from an Amazon DynamoDB table to populate a tab-delimited file in Amazon S3, use a Hive script to define the necessary data transformation steps, and automatically create an Amazon EMR cluster to perform the work.

Additionally, this pipeline will send Amazon SNS notifications if the pipeline succeeds, fails, or runs late. This is the full pipeline definition JSON file followed by an explanation for each of its sections.
Note
We recommend that you use a text editor that can help you verify the syntax of JSON-formatted files, and name the file using the .json file extension.
{
  "objects": [
    {
      "id": "MySchedule",
      "type": "Schedule",
      "startDateTime": " T00:00:00",
      "endDateTime": " T00:00:00",
      "period": "1 day"
    },
    {
      "id": "MyS3Data",
      "type": "S3DataNode",
      "schedule": { "ref": "MySchedule" },
      "filePath": "s3://output_bucket/productcatalog"
    },
    {
      "id": "ExportCluster",
      "type": "EmrCluster",
      "masterInstanceType": "m1.small",
      "instanceCoreType": "m1.xlarge",
      "instanceCoreCount": "1",
      "schedule": { "ref": "MySchedule" },
      "enableDebugging": "true",
      "emrLogUri": "s3://test_bucket/emr_logs"
    },
    {
      "id": "MyExportJob",
      "type": "EmrActivity",
      "dynamoDBInputTable": "MyTable",
      "dynamoDBReadPercent": "0.25",
      "s3OutputBucket": "#{output.path}",
      "lateAfterTimeout": "12 hours",
      "attemptTimeout": "24 hours",
      "maximumRetries": "0",
      "output": { "ref": "MyS3Data" },
      "runsOn": { "ref": "ExportCluster" },
      "schedule": { "ref": "MySchedule" },
      "onSuccess": { "ref": "SuccessSnsAlarm" },
      "onFail":

        { "ref": "FailureSnsAlarm" },
      "onLateAction": { "ref": "LateSnsAlarm" },
      "step": "s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://elasticmapreduce/libs/hive/dynamodb/exportDynamoDBTableToS3,-d,DYNAMODB_INPUT_TABLE=#{dynamoDBInputTable},-d,S3_OUTPUT_BUCKET=#{s3OutputBucket},-d,DYNAMODB_READ_PERCENT=#{dynamoDBReadPercent},-d,DYNAMODB_ENDPOINT=dynamodb.us-east-1.amazonaws.com"
    },
    {
      "id": "SuccessSnsAlarm",
      "type": "SnsAlarm",
      "topicArn": "arn:aws:sns:us-east-1: :mysnsnotify",
      "role": "test-role",
      "subject": "DynamoDB table '#{node.dynamoDBInputTable}' export succeeded",
      "message": "DynamoDB table '#{node.dynamoDBInputTable}' export to S3 bucket '#{node.s3OutputBucket}' succeeded at JobId: #{node.id}"
    },
    {
      "id": "LateSnsAlarm",
      "type": "SnsAlarm",
      "topicArn": "arn:aws:sns:us-east-1: :mysnsnotify",
      "role": "test-role",
      "subject": "DynamoDB table '#{node.dynamoDBInputTable}' export is taking a long time!",
      "message": "DynamoDB table '#{node.dynamoDBInputTable}' export to S3 bucket '#{node.s3OutputBucket}' has exceeded the late warning period '#{node.lateAfterTimeout}'. JobId: #{node.id}"
    },
    {
      "id": "FailureSnsAlarm",
      "type": "SnsAlarm",
      "topicArn": "arn:aws:sns:us-east-1: :mysnsnotify",
      "role": "test-role",
      "subject": "DynamoDB table '#{node.dynamoDBInputTable}' export failed!",
      "message": "DynamoDB table '#{node.dynamoDBInputTable}' export to S3 bucket '#{node.s3OutputBucket}' failed. JobId: #{node.id}. Error: #{node.errorMessage}."
    }
  ]
}
Schedule
The example AWS Data Pipeline JSON file begins with a section to define the schedule by which to copy the data. Many pipeline components have a reference to a schedule and you may have more than one. The Schedule component is defined by the following fields:

{
  "id": "MySchedule",
  "type": "Schedule",
  "startDateTime": "T00:00:00",
  "endDateTime": "T00:00:00",
  "period": "1 day"
}

Note
In the JSON file, you can define the pipeline components in any order you prefer. In this example, we chose the order that best illustrates the pipeline component dependencies.

Name
The user-defined name for the pipeline schedule, which is a label for your reference only.
Type
The pipeline component type, which is Schedule.
startDateTime
The date/time (in UTC format) that you want the task to begin.
endDateTime
The date/time (in UTC format) that you want the task to stop.
period
The time period that you want to pass between task attempts, even if the task occurs only one time. The period must evenly divide the time between startDateTime and endDateTime. In this example, we set the period to 1 day so that the pipeline copy operation can run only one time.

Amazon S3 DataNode

Next, the S3DataNode pipeline component defines a location for the output file; in this case, a tab-delimited file in an Amazon S3 bucket location. The output S3DataNode component is defined by the following fields:

{
  "id": "MyS3Data",
  "type": "S3DataNode",
  "schedule": { "ref": "MySchedule" },
  "filePath": "s3://output_bucket/productcatalog"
}

Name
The user-defined name for the output location (a label for your reference only).
Type
The pipeline component type, which is "S3DataNode" to match the data output location, in an Amazon S3 bucket.
Schedule
A reference to the schedule component that we created in the preceding lines of the JSON file, labeled MySchedule.
Path
The path to the data associated with the data node. This path is an empty Amazon S3 location where a tab-delimited output file is written that has the contents of a sample product catalog in an Amazon DynamoDB table. The syntax for a data node is determined by its type. For example, a data node for a file in Amazon S3 follows a different syntax than a data node for a database table.

Amazon EMR Cluster

Next, the EmrCluster pipeline component defines an Amazon EMR cluster that processes and moves the data in this tutorial. The EmrCluster component is defined by the following fields:

{
  "id": "ExportCluster",
  "type": "EmrCluster",
  "masterInstanceType": "m1.small",
  "instanceCoreType": "m1.xlarge",
  "instanceCoreCount": "1",
  "schedule": { "ref": "MySchedule" },
  "enableDebugging": "true",
  "emrLogUri": "s3://test_bucket/emr_logs"
}

Name
The user-defined name for the Amazon EMR cluster (a label for your reference only).
Type
The computational resource type, which is an Amazon EMR cluster. For more information, see Overview of Amazon EMR in the Amazon EMR Developer Guide.
masterInstanceType
The type of Amazon EC2 instance to use as the master node of the Amazon EMR cluster. For more information, see Amazon EC2 Instance Types in the Amazon EC2 documentation.
instanceCoreType
The type of Amazon EC2 instance to use as the core nodes of the Amazon EMR cluster. For more information, see Amazon EC2 Instance Types in the Amazon EC2 documentation.
instanceCoreCount
The number of core Amazon EC2 instances to use in the Amazon EMR cluster.
Schedule
A reference to the schedule component that we created in the preceding lines of the JSON file, labeled MySchedule.
enableDebugging
Indicates whether to create detailed debug logs for the Amazon EMR job flow.
emrLogUri
Specifies an Amazon S3 location to store the Amazon EMR job flow debug logs if you enabled debugging with the previously mentioned enableDebugging field.

Amazon EMR Activity

Next, the EmrActivity pipeline component brings together the schedule, resources, and data nodes to define the work to perform, the conditions under which to do the work, and the actions to perform when certain events occur. The EmrActivity component is defined by the following fields:

{
  "id": "MyExportJob",
  "type": "EmrActivity",
  "dynamoDBInputTable": "MyTable",
  "dynamoDBReadPercent": "0.25",
  "s3OutputBucket": "#{output.path}",
  "lateAfterTimeout": "12 hours",
  "attemptTimeout": "24 hours",
  "maximumRetries": "0",
  "output": { "ref": "MyS3Data" },
  "runsOn": { "ref": "ExportCluster" },
  "schedule": { "ref": "MySchedule" },
  "onSuccess": { "ref": "SuccessSnsAlarm" },
  "onFail": { "ref": "FailureSnsAlarm" },
  "onLateAction": { "ref": "LateSnsAlarm" },
  "step": "s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://elasticmapreduce/libs/hive/dynamodb/exportDynamoDBTableToS3,-d,DYNAMODB_INPUT_TABLE=#{dynamoDBInputTable},-d,S3_OUTPUT_BUCKET=#{s3OutputBucket},-d,DYNAMODB_READ_PERCENT=#{dynamoDBReadPercent},-d,DYNAMODB_ENDPOINT=dynamodb.us-east-1.amazonaws.com"
}

Name
The user-defined name for the Amazon EMR activity (a label for your reference only).
Type
The EmrActivity pipeline component type, which creates an Amazon EMR job flow to perform the defined work. For more information, see Overview of Amazon EMR in the Amazon EMR Developer Guide.
dynamoDBInputTable
The Amazon DynamoDB table that the Amazon EMR job flow reads as the input for the Hive script.
dynamoDBReadPercent
Sets the rate of read operations to keep your Amazon DynamoDB provisioned throughput rate in the allocated range for your table. The value is between 0.1 and 1.5, inclusive. For more information, see Hive Options in the Amazon EMR Developer Guide.
s3OutputBucket
An expression that refers to the Amazon S3 location path for the output file defined by the S3DataNode labeled "MyS3Data".
lateAfterTimeout
The amount of time, after the scheduled start time, that the activity can wait to start before AWS Data Pipeline considers it late.
attemptTimeout
The amount of time, after the scheduled start time, that the activity has to complete before AWS Data Pipeline considers it failed.
maximumRetries
The maximum number of times that AWS Data Pipeline retries the activity.
output
The Amazon S3 location of the output data, defined by the S3DataNode labeled "MyS3Data".
runsOn
A reference to the computational resource that runs the activity; in this case, the EmrCluster labeled "ExportCluster".
schedule
A reference to the schedule component that we created in the preceding lines of the JSON file, labeled MySchedule.
onSuccess
A reference to the action to perform when the activity succeeds. In this case, it is to send an Amazon SNS notification.
onFail
A reference to the action to perform when the activity fails. In this case, it is to send an Amazon SNS notification.
onLateAction
A reference to the action to perform when the activity is late. In this case, it is to send an Amazon SNS notification.
step
Defines the steps for the Amazon EMR job flow to perform. This step calls a Hive script named exportDynamoDBTableToS3 that is provided by Amazon EMR and is specifically designed to move data from Amazon DynamoDB to Amazon S3. To perform more complex data transformation tasks, you would customize this Hive script and provide its name and path here. For more information about sample Hive scripts that show how to perform data transformation tasks, see Contextual Advertising using Apache Hive and Amazon EMR in AWS Articles and Tutorials.

Upload the Pipeline Definition

You can upload a pipeline definition file using the AWS Data Pipeline CLI tools. For more information, see Install the Command Line Interface (p. 15).

To upload your pipeline definition, use the following command.

On Linux/Unix/Mac OS:

./datapipeline --create pipeline_name --put pipeline_file

On Windows:

ruby datapipeline --create pipeline_name --put pipeline_file

Where pipeline_name is the label for your pipeline and pipeline_file is the full path and file name for the file, with the .json file extension, that defines your pipeline.

If your pipeline validates successfully, you receive the following message:

Pipeline with name pipeline_name and id df-akiaiosfodnn7example created.
Pipeline definition pipeline_file.json uploaded.

Note
For more information about any errors returned by the --create command or other commands, see Troubleshoot AWS Data Pipeline (p. 128).

Ensure that your pipeline appears in the pipeline list by using the following command.

On Linux/Unix/Mac OS:

./datapipeline --list-pipelines

On Windows:

ruby datapipeline --list-pipelines

The list of pipelines includes details such as Id, State, and UserId. Take note of your pipeline ID, because you use this value for most AWS Data Pipeline CLI commands. The pipeline ID is a unique identifier using the format df-akiaiosfodnn7example.

Activate the Pipeline

You must activate the pipeline, by using the --activate command-line parameter, before it begins performing work. Use the following command.

On Linux/Unix/Mac OS:

./datapipeline --activate --id df-akiaiosfodnn7example

On Windows:

ruby datapipeline --activate --id df-akiaiosfodnn7example

Where df-akiaiosfodnn7example is the identifier for your pipeline.

Verify the Pipeline Status

View the status of your pipeline and its components, along with its activity attempts and retries, with the following command.

On Linux/Unix/Mac OS:

./datapipeline --list-runs --id df-akiaiosfodnn7example

On Windows:

ruby datapipeline --list-runs --id df-akiaiosfodnn7example

Where df-akiaiosfodnn7example is the identifier for your pipeline.

The --list-runs command displays a list of pipeline components and details such as Scheduled Start, Status, ID, Started, and Ended.

Note
It is important to note the difference between the Scheduled Start date/time and the Started time. It is possible to schedule a pipeline component to run at a certain time (Scheduled Start), but the actual start time (Started) could be later due to problems or delays with preconditions, dependencies, failures, or retries.

Note
AWS Data Pipeline may backfill a pipeline, which happens when you define a Scheduled Start date/time for a date in the past. In that situation, AWS Data Pipeline immediately runs the pipeline components the number of times the activity should have run if it had started on the Scheduled Start time. When this happens, you see pipeline components run back-to-back at a greater frequency than the period value that you specified when you created the pipeline. AWS Data Pipeline returns your pipeline to the defined frequency only when it catches up to the number of past runs.

Successful pipeline runs are indicated by all the activities in your pipeline reporting the FINISHED status. Your pipeline frequency determines how many times the pipeline runs, and each run has its own success or failure, as indicated by the --list-runs command. Resources that you define in your pipeline, such as Amazon EC2 instances, may show the SHUTTING_DOWN status until they are finally terminated after a successful run. Depending on how you configured your pipeline, you may have multiple Amazon EC2 resources, each with its own final status.

Verify Data Export

Next, verify that the data export occurred successfully by viewing the output file contents.

To view the export file contents

1. Sign in to the AWS Management Console and open the Amazon S3 console.
2. On the Buckets pane, click the Amazon S3 bucket that contains your file output (the example pipeline uses the output path s3://output_bucket/productcatalog) and open the output file with your preferred text editor. The output file name is an identifier value with no extension, such as this example: ae10f955-fb2f b11-fbfea01a871e_
3. Using your preferred text editor, view the contents of the output file and ensure that there is delimited data that corresponds to the Amazon DynamoDB source table, such as Id, Price, and ProductCategory. This indicates that the export operation from Amazon DynamoDB to the output file occurred successfully.
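If you prefer to check the export from a command line instead of the Amazon S3 console, and you have the separate AWS CLI installed (it is not part of the AWS Data Pipeline CLI used in this tutorial), a quick check might look like the following sketch. The bucket and prefix match the example pipeline's filePath, and <output-file-name> is a placeholder for a name returned by the first command.

# List the objects written under the example output path
aws s3 ls s3://output_bucket/productcatalog/

# Download one of the listed output files for inspection
aws s3 cp s3://output_bucket/productcatalog/<output-file-name> ./export-check.txt

# The downloaded file should contain tab-delimited rows from the DynamoDB source table
head export-check.txt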

Tutorial: Run a Shell Command to Process MySQL Table

This tutorial walks you through the process of creating a data pipeline to use a script stored in an Amazon S3 bucket to process a MySQL table, write the output to a comma-separated values (CSV) file in an Amazon S3 bucket, and then send an Amazon SNS notification after the task completes successfully. You will use the Amazon EC2 instance resource provided by AWS Data Pipeline for this shell command activity.

The first step in the pipeline creation process is to select the pipeline objects that make up your pipeline definition. After you select the pipeline objects, you add fields for each pipeline object. For more information on pipeline definitions, see Pipeline Definition (p. 2).

This tutorial uses the following objects to create a pipeline definition:

Activity
The activity that AWS Data Pipeline must perform for this pipeline. This tutorial uses ShellCommandActivity to process the data in the MySQL table and write the output to a CSV file.
Schedule
The start date, time, and the duration for this activity. You can optionally specify the end date and time.
Resource
The resource AWS Data Pipeline must use to perform this activity. This tutorial uses Ec2Resource, an Amazon EC2 instance provided by AWS Data Pipeline, to run a command for processing the data. AWS Data Pipeline automatically launches the Amazon EC2 instance and then terminates the instance after the task finishes.
DataNodes
The input and output nodes for this pipeline. This tutorial uses two input nodes and one output node. The first input node is the MySqlDataNode that contains the MySQL table. The second input node is the S3DataNode that contains the script. The output node is the S3DataNode for storing the CSV file.
Action
The action AWS Data Pipeline must take when the specified conditions are met. This tutorial uses the SnsAlarm action to send an Amazon SNS notification to the address you specify, after the task finishes successfully.

For more information about the additional objects and fields supported by the shell command activity, see ShellCommandActivity (p. 176).

The following steps outline how to create a data pipeline to run a script stored in an Amazon S3 bucket.

1. Create your pipeline definition
2. Create and configure the pipeline definition objects
3. Validate and save your pipeline definition
4. Verify that your pipeline definition is saved
5. Activate your pipeline
6. Monitor the progress of your pipeline
7. [Optional] Delete your pipeline

Before you begin...

Be sure you've completed the following steps.

Set up an Amazon Web Services (AWS) account to access the AWS Data Pipeline console. For more information, see Access the Console (p. 12).
Set up the AWS Data Pipeline tools and interface you plan on using. For more information on the interfaces and tools you can use to interact with AWS Data Pipeline, see Get Set Up for AWS Data Pipeline (p. 12).
Create and launch a MySQL database instance as a data source. For more information, see Launch a DB Instance in the Amazon Relational Database Service (RDS) Getting Started Guide.
Note
Make a note of the user name and the password you used for creating the MySQL instance. After you've launched your MySQL database instance, make a note of the instance's endpoint. You will need all of this information in this tutorial.
Connect to your MySQL database instance, create a table, and then add test data values to the newly created table. For more information, see Create a Table in the MySQL documentation.
Create an Amazon S3 bucket as a source for the script. For more information, see Create a Bucket in the Amazon Simple Storage Service Getting Started Guide.
Create a script to read the data in the MySQL table, process the data, and then write the results to a CSV file. The script must run on an Amazon EC2 Linux instance.
Note
The AWS Data Pipeline computational resources (Amazon EMR job flow and Amazon EC2 instance) are not supported on Windows in this release.
Upload your script to your Amazon S3 bucket. For more information, see Add an Object to a Bucket in the Amazon Simple Storage Service Getting Started Guide.
Create another Amazon S3 bucket as a data target.
Create an Amazon SNS topic for sending notifications and make a note of the topic's Amazon Resource Name (ARN). For more information on creating an Amazon SNS topic, see Create a Topic in the Amazon Simple Notification Service Getting Started Guide.
[Optional] This tutorial uses the default IAM role policies created by AWS Data Pipeline. If you would rather create and configure your own IAM role policy and trust relationships, follow the instructions described in Granting Permissions to Pipelines with IAM (p. 21).

Note
Some of the actions described in this tutorial can generate AWS usage charges, depending on whether you are using the AWS Free Usage Tier.

Using the AWS Data Pipeline Console

Topics
Create and Configure the Pipeline Definition Objects (p. 109)
Validate and Save Your Pipeline (p. 112)
Verify your Pipeline Definition (p. 113)
Activate your Pipeline (p. 113)
Monitor the Progress of Your Pipeline Runs (p. 114)
[Optional] Delete your Pipeline (p. 115)

The following sections include the instructions for creating the pipeline using the AWS Data Pipeline console.

To create your pipeline definition

1. Sign in to the AWS Management Console and open the AWS Data Pipeline console.
2. Click Create Pipeline.
3. On the Create a New Pipeline page:
   a. In the Pipeline Name box, enter a name (for example, RunDailyScript).
   b. In the Pipeline Description box, enter a description.
   c. Leave the Select Schedule Type button set to the default type for this tutorial.
   Note
   The schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval, and Cron Style Scheduling means instances are scheduled at the beginning of each interval.
   d. Leave the Role boxes set to their default values for this tutorial.
   Note
   If you have created your own IAM roles and would like to use them in this tutorial, you can select them now.
   e. Click Create a new pipeline.

Create and Configure the Pipeline Definition Objects

Next, you define the Activity object in your pipeline definition. When you define the Activity object, you also define the objects that AWS Data Pipeline must use to perform this activity.

1. On the Pipeline: name of your pipeline page, click Add activity.
2. In the Activities pane:
   a. Enter the name of the activity; for example, run-my-script.
   b. In the Type box, select ShellCommandActivity.
   c. In the Schedule box, select Create new: Schedule.
   d. In the Add an optional field... box, select Script Uri.
   e. In the Script Uri box, enter the path to your uploaded script; for example, s3://my-script/myscript.txt.
   f. In the Add an optional field... box, select Input.
   g. In the Input box, select Create new: DataNode.
   h. In the Add an optional field... box, select Output.
   i. In the Output box, select Create new: DataNode.
   j. In the Add an optional field... box, select Runs On.
   k. In the Runs On box, select Create new: Resource.
   l. In the Add an optional field... box, select On Success.
   m. In the On Success box, select Create new: Action.
   n. In the left pane, separate the icons by dragging them apart.

You've completed defining your pipeline definition by specifying the objects AWS Data Pipeline will use to perform the shell command activity. The Pipeline: name of your pipeline pane shows the graphical representation of the pipeline you just created. The arrows indicate the connections between the various objects.

Next, configure the run date and time for your pipeline.

To configure the run date and time for your pipeline

1. On the Pipeline: name of your pipeline page, in the right pane, click Schedules.
2. In the Schedules pane:
   a. Enter a schedule name for this activity (for example, run-mysql-script-schedule).
   b. In the Type box, select Schedule.
   c. In the Start Date Time box, select the date from the calendar, and then enter the time to start the activity.
   Note
   AWS Data Pipeline supports the date and time expressed in "YYYY-MM-DDTHH:MM:SS" format in UTC/GMT only.
   d. In the Period box, enter the duration for the activity (for example, 1), and then select the period category (for example, Days).
   e. [Optional] To specify the date and time to end the activity, in the Add an optional field box, select endDateTime, and enter the date and time.

To get your pipeline to launch immediately, set Start Date Time to a date one day in the past. AWS Data Pipeline will then start launching the "past due" runs immediately in an attempt to address what it perceives as a backlog of work. This backfilling means you don't have to wait an hour to see AWS Data Pipeline launch its first run.

Next, configure the input and output data nodes for your pipeline.

To configure the input and output data nodes of your pipeline

1. On the Pipeline: name of your pipeline page, in the right pane, click DataNodes.
2. In the DataNodes pane:
   a. In the DefaultDataNode1 box, enter the name for your MySQL data source node (for example, MySQLTableInput).
   b. In the Type box, select MySqlDataNode.
   c. In the Connection box, enter the endpoint of your MySQL database instance (for example, mydbinstance.c3frkexample.us-east-1.rds.amazonaws.com).
   d. In the Table box, enter the name of the source database table (for example, mysql-input-table).
   e. In the Schedule box, select run-mysql-script-schedule.
   f. In the *Password box, enter the password you used when you created your MySQL database instance.
   g. In the Username box, enter the user name you used when you created your MySQL database instance.
   h. In the DefaultDataNode2 box, enter the name for the data target node for your CSV file (for example, MySQLScriptOutput).
   i. In the Type box, select S3DataNode.
   j. In the Schedule box, select run-mysql-script-schedule.
   k. In the Add an optional field... box, select File Path.
   l. In the File Path box, enter the path to your Amazon S3 bucket (for example, s3://my-data-pipeline-output/name of your csv file).

Next, configure the resource AWS Data Pipeline must use to run your script.

To configure the resource

1. On the Pipeline: name of your pipeline page, in the right pane, click Resources.
2. In the Resources pane:
   a. In the Name box, enter the name for your resource (for example, RunScriptInstance).
   b. In the Type box, select Ec2Resource.
   c. Leave the Resource Role and Role boxes set to the default values for this tutorial.
   d. In the Schedule box, select run-mysql-script-schedule.

Next, configure the Amazon SNS notification action AWS Data Pipeline must perform after your script runs successfully.

To configure the SNS notification action

1. On the Pipeline: name of your pipeline page, in the right pane, click Others.
2. In the Others pane:
   a. In the DefaultAction1 box, enter the name for your Amazon SNS notification (for example, RunDailyScriptNotice).
   b. In the Type box, select SnsAlarm.
   c. In the Topic Arn box, enter the ARN of your Amazon SNS topic.
   d. In the Subject box, enter the subject line for your notification.
   e. In the Message box, enter the message content.
   f. Leave the entry in the Role box set to the default.

You have now completed all the steps required for creating your pipeline definition. Next, validate and save your pipeline.

Validate and Save Your Pipeline

You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition. If your pipeline is incomplete or incorrect, AWS Data Pipeline reports a validation error. If you plan to continue the creation process later, you can ignore the error message. If your pipeline definition is complete but you are getting a validation error message, you'll have to fix the errors in the pipeline definition before activating your pipeline.

To validate and save your pipeline

1. On the Pipeline: name of your pipeline page, click Save Pipeline.
2. AWS Data Pipeline validates your pipeline definition and returns either a success message or an error message.
3. If you get an error message, click Close and then, in the right pane, click Errors.
4. The Errors pane lists the objects that failed validation. Click the plus (+) sign next to the object names and look for an error message in red.
5. When you see an error message, click the specific object pane where the error occurs and fix it. For example, if you see an error message in the DataNodes object, click the DataNodes pane to fix the error.
6. After you've fixed the errors listed in the Errors pane, click Save Pipeline.
7. Repeat the process until your pipeline is validated.

Next, verify that your pipeline definition has been saved.

Verify your Pipeline Definition

It is important to verify that your pipeline was correctly initialized from your definitions before you activate it.

To verify your pipeline definition

1. On the Pipeline: name of your pipeline page, click Back to list of pipelines.
2. On the List Pipelines page, check that your newly created pipeline is listed. AWS Data Pipeline has created a unique Pipeline ID for your pipeline definition. The Status column in the row listing your pipeline should show PENDING.
3. Click the triangle icon next to your pipeline. The Pipeline summary panel below shows the details of your pipeline runs. Because your pipeline is not yet activated, you should see only 0s at this point.
4. In the Pipeline summary panel, click View all fields to see the configuration of your pipeline definition.
5. Click Close.

Next, activate your pipeline.

Activate your Pipeline

You must activate your pipeline to start creating and processing runs based on the specifications in your pipeline definition.

To activate your pipeline

1. On the List Pipelines page, in the Details column of your pipeline, click View pipeline.
2. On the Pipeline: name of your pipeline page, click Activate. A confirmation dialog box opens, confirming the activation.
3. Click Close.

Next, verify that your pipeline is running.

Monitor the Progress of Your Pipeline Runs

To monitor the progress of your pipeline

1. On the List Pipelines page, in the Details column of your pipeline, click View instance details.
2. The Instance details: name of your pipeline page lists the status of each instance in your pipeline definition.
   Note
   If you do not see instances listed, then depending on when your pipeline was scheduled, either click the End (in UTC) date box and change it to a later date or click the Start (in UTC) date box and change it to an earlier date. Then click Update.
3. If the Status column of all the instances in your pipeline indicates FINISHED, your pipeline has successfully completed the activity. You should receive a notification about the successful completion of this task at the account you specified for receiving your Amazon SNS notification. You can also check your Amazon S3 data target bucket to verify that the data was processed.
4. If the Status column of any of your instances indicates a status other than FINISHED, either your pipeline is waiting for some precondition to be met or it has failed.
   a. To troubleshoot the failed or incomplete instance runs, click the triangle next to an instance; the Instance summary panel opens to show the details of the selected instance.
   b. Click View instance fields to see additional details of the instance. If the status of your selected instance is FAILED, the details box has an entry indicating the reason for failure; for example: Resource not healthy terminated.
   c. In the Instance summary pane, in the Select attempt for this instance box, select the attempt number.
   d. In the Instance summary pane, click View attempt fields to see details of the fields associated with the selected attempt.
5. To take an action on your incomplete or failed instance, select an action (Rerun, Cancel, or Mark Finished) from the Action column of the instance.

You can use the information in the Instance summary pane and the View instance fields box to troubleshoot issues with your failed pipeline. For more information about instance status, see Interpret Pipeline Status Details (p. 129). For more information about troubleshooting the failed or incomplete instance runs of your pipeline, see AWS Data Pipeline Problems and Solutions (p. 131).

Important
Your pipeline is running and incurring charges. For more information, see AWS Data Pipeline pricing. If you would like to stop incurring AWS Data Pipeline usage charges, delete your pipeline.

[Optional] Delete your Pipeline

Deleting your pipeline deletes the pipeline definition, including all the associated objects. You stop incurring charges as soon as your pipeline is deleted.

To delete your pipeline

1. In the List Pipelines page, click the check box next to your pipeline.
2. Click Delete.
3. In the confirmation dialog box, click Delete to confirm the delete request.
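This tutorial does not have a command line counterpart in this guide, but for reference, a minimal JSON sketch of a pipeline definition similar to the one built in the console above might look like the following. This is only an illustration, not a definition taken from this guide: the object names, endpoint, table, paths, topic ARN, and date value are placeholders, the role fields are omitted (the tutorial leaves them at their defaults), and field names such as connectionString and scriptUri are approximations of the console fields shown above, so check them against ShellCommandActivity (p. 176) and the other object reference topics before relying on them.

{
  "objects": [
    {
      "id": "run-mysql-script-schedule",
      "type": "Schedule",
      "startDateTime": "2012-12-01T00:00:00",
      "period": "1 day"
    },
    {
      "id": "MySQLTableInput",
      "type": "MySqlDataNode",
      "schedule": { "ref": "run-mysql-script-schedule" },
      "connectionString": "mydbinstance.c3frkexample.us-east-1.rds.amazonaws.com",
      "table": "mysql-input-table",
      "username": "my-db-user",
      "*password": "my-db-password"
    },
    {
      "id": "MySQLScriptOutput",
      "type": "S3DataNode",
      "schedule": { "ref": "run-mysql-script-schedule" },
      "filePath": "s3://my-data-pipeline-output/my-output.csv"
    },
    {
      "id": "RunScriptInstance",
      "type": "Ec2Resource",
      "schedule": { "ref": "run-mysql-script-schedule" }
    },
    {
      "id": "RunDailyScriptNotice",
      "type": "SnsAlarm",
      "topicArn": "arn:aws:sns:us-east-1:my-account-id:my-topic",
      "subject": "Daily script succeeded",
      "message": "The MySQL table was processed and the CSV file was written to Amazon S3."
    },
    {
      "id": "run-my-script",
      "type": "ShellCommandActivity",
      "scriptUri": "s3://my-script/myscript.txt",
      "input": { "ref": "MySQLTableInput" },
      "output": { "ref": "MySQLScriptOutput" },
      "runsOn": { "ref": "RunScriptInstance" },
      "onSuccess": { "ref": "RunDailyScriptNotice" },
      "schedule": { "ref": "run-mysql-script-schedule" }
    }
  ]
}

If you experiment with a definition like this, you can upload and activate it with the same --create, --put, --activate, and --list-runs commands shown in the earlier tutorials.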

Manage Pipelines

Topics
Using AWS Data Pipeline Console (p. 116)
Using the Command Line Interface (p. 121)

You can use either the AWS Data Pipeline console or the AWS Data Pipeline command line interface (CLI) to view the details of your pipeline or to delete your pipeline.

Using AWS Data Pipeline Console

Topics
View pipeline definition (p. 116)
View details of each instance in an active pipeline (p. 117)
Modify pipeline definition (p. 119)
Delete a Pipeline (p. 121)

With the AWS Data Pipeline console, you can:

View the pipeline definition of any pipeline associated with your account
View the details of each instance in your pipeline and use the information to troubleshoot a failed instance run
Modify a pipeline definition
Delete a pipeline

The following sections walk you through the steps for managing your pipeline. Before you begin, be sure that you have at least one pipeline associated with your account, have access to the AWS Management Console, and have opened the AWS Data Pipeline console.

View pipeline definition

If you are signed in and have opened the AWS Data Pipeline console, your screen shows a list of pipelines associated with your account. The Status column in the pipeline listing displays the current state of your pipelines. A pipeline is SCHEDULED if the pipeline definition has passed validation and is activated, is currently running, or has completed its run. A pipeline is PENDING if the pipeline definition is incomplete or has failed the validation step that all pipelines go through before being saved. If you want to modify or complete your pipeline definition, see Modify pipeline definition (p. 119).

To view the pipeline definition of your pipeline

1. On the List Pipelines page, in the Details column of your pipeline, click View instance details (or click View pipeline, if the pipeline is PENDING).
2. If your pipeline is SCHEDULED:
   a. On the Instance details: name of your pipeline page, click View pipeline.
   b. The Pipeline: name of your pipeline [The pipeline is active.] page opens. This is your pipeline definition page. As indicated in the title of the page, this pipeline is active.
3. To view the object definitions in your pipeline definition, on the Pipeline: name of your pipeline page, click the object icons in the design pane. The corresponding object pane on the right panel opens.
4. You can also click the object panes on the right panel to view the objects and their associated fields.
5. If your pipeline definition graph does not fit in the design pane, use the pan buttons on the right side of the design pane to slide the canvas.
6. Click Back to list of pipelines to get back to the List Pipelines page.

View details of each instance in an active pipeline

If you are signed in and have opened the AWS Data Pipeline console, your screen shows the list of pipelines associated with your account.

The Status column in the pipeline listing displays the current state of your pipelines. Your pipeline is active if the status is SCHEDULED. A pipeline is in the SCHEDULED state if the pipeline definition has passed validation and is activated, is currently running, or has completed its run. You can view the pipeline definition, the runs list, and the details of each run of an active pipeline. For information on modifying an active pipeline, see Modify pipeline definition (p. 119).

To retrieve the details of your active pipeline

1. On the List Pipelines page, identify your active pipeline, and then click the small triangle that is next to the pipeline ID.
2. In the Pipeline summary pane, click View fields to see additional information on your pipeline definition.
3. Click Close to close the View fields box, and then click the triangle of your active pipeline again to close the Pipeline summary pane.
4. In the row that lists your active pipeline, click View instance details.
5. The Instance details: name of your pipeline page lists all the instances of your active pipeline.
   Note
   If you do not see the list of instances, click the End (in UTC) date box, change it to a later date, and then click Update.
6. You can also use the Filter Object, Start, or End date-time fields to filter the number of instances returned based on either their current status or the date range in which they were launched. Filtering the results is useful because, depending on the pipeline age and scheduling, the instance run history can be very large.
7. If the Status column of all the runs in your pipeline displays the FINISHED state, your pipeline has successfully completed running. If the Status column of any one of your runs indicates a status other than FINISHED, your pipeline is either running, waiting for some precondition to be met, or has failed.
8. Click the triangle next to an instance to show the details of the selected instance.
9. In the Instance summary pane, click View instance fields to see details of the fields associated with the selected instance. If the status of your selected instance is FAILED, the additional details box has an entry indicating the reason for failure; for example: Resource not healthy terminated.
10. In the Instance summary pane, in the Select attempt for this instance box, select the attempt number.
11. In the Instance summary pane, click View attempt fields to see details of the fields associated with the selected attempt.
12. To take an action on your incomplete or failed instance, select an action (Rerun, Cancel, or Mark Finished) from the Action column of the instance.
13. You can use the information in the Instance summary pane and the View instance fields box to troubleshoot issues with your failed pipeline. For more information about instance status, see Interpret Pipeline Status Details (p. 129). For more information about troubleshooting the failed or incomplete instance runs of your pipeline, see AWS Data Pipeline Problems and Solutions (p. 131).
14. Click Back to list of pipelines to get back to the List Pipelines page.

Modify pipeline definition

If your pipeline is in a PENDING state, either your pipeline definition is incomplete or your pipeline might have failed the validation step that all pipelines go through before saving. If your pipeline is active, you may need to change some aspect of it. However, if you are modifying the pipeline definition of an active pipeline, you must keep in mind the following rules (a brief example follows this list):

You cannot change the Default objects.
You cannot change the schedule of an object.
You cannot change the dependencies between objects.
You cannot add, delete, or modify reference fields for existing objects; only non-reference fields can be changed.
New objects cannot reference a previously existing object for the output field; only the input field is allowed.

Follow the steps in this section to either complete or modify your pipeline definition.
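To make the reference-field rule concrete, here is a small hedged illustration based on the MyExportJob activity from the export tutorial earlier in this guide; the new values are arbitrary placeholders, and each fragment shows only the field being changed.

Allowed (a non-reference field on an existing object):

  { "id": "MyExportJob", "maximumRetries": "2" }

Not allowed (a reference field on an existing object):

  { "id": "MyExportJob", "runsOn": { "ref": "SomeOtherCluster" } }

The second change touches a reference field on an existing object, which the rules above do not permit for an active pipeline.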

To modify your pipeline definition

1. On the List Pipelines page, in the Details column of your pipeline, click View instance details (or click View pipeline, if the pipeline is PENDING).
2. If your pipeline is SCHEDULED:
   a. On the Instance details: name of your pipeline page, click View pipeline.
   b. The Pipeline: name of your pipeline [The pipeline is active.] page opens. This is your pipeline definition page. As indicated in the title of the page, this pipeline is active.
3. To complete or modify your pipeline definition:
   a. On the Pipeline: name of your pipeline page, click the object panes in the right side panel and complete defining the objects and fields of your pipeline definition.
   Note
   If you are modifying an active pipeline, you will see that some fields are grayed out and inactive. You cannot modify those fields.
   b. Skip the next step and follow the steps to validate and save your pipeline definition.
4. To edit your pipeline definition:
   a. On the Pipeline: name of your pipeline page, click the Errors pane. The Errors pane lists the objects of your pipeline that failed validation.
   b. Click the plus (+) sign next to the object names and look for an error message in red.
   c. Click the object pane where you see the error and fix it. For example, if you see an error message in the DataNodes object, click the DataNodes pane to fix the error.

To validate and save your pipeline definition

1. Click Save Pipeline. AWS Data Pipeline validates your pipeline definition and returns either a success message or an error message.
2. If you get an Error! message, click Close and then, on the right side panel, click Errors to see the objects that did not pass validation. Fix the errors and save. Repeat this step until your pipeline definition passes validation.

Activate and verify your pipeline

1. After you've saved your pipeline definition with no validation errors, click Activate.
2. To verify that your pipeline definition has been activated, click Back to list of pipelines.
3. On the List Pipelines page, check that your newly created pipeline is listed and that the Status column displays SCHEDULED.

Delete a Pipeline

When you no longer require a pipeline, such as a pipeline created during application testing, you should delete it to remove it from active use. Deleting a pipeline puts it into a deleting state. When the pipeline is in the deleted state, its pipeline definition and run history are gone. Therefore, you can no longer perform operations on the pipeline, including describing it. You can't restore a pipeline after you delete it, so be sure that you won't need the pipeline in the future before you delete it.

To delete your pipeline

1. In the List Pipelines page, click the check box next to your pipeline.
2. Click Delete.
3. In the confirmation dialog box, click Delete to confirm the delete request.

Using the Command Line Interface

Topics
Install the AWS Data Pipeline Command-Line Client (p. 122)
Command-Line Syntax (p. 122)
Setting Credentials for the AWS Data Pipeline Command Line Interface (p. 122)
List Pipelines (p. 124)
Create a New Pipeline (p. 124)
Retrieve Pipeline Details (p. 124)
View Pipeline Versions (p. 125)
Modify a Pipeline (p. 126)
Delete a Pipeline (p. 126)

The AWS Data Pipeline Command-Line Client (CLI) is one of three ways to interact with AWS Data Pipeline. The other two are using the AWS Data Pipeline console (a graphical user interface) and calling the APIs (calling functions in the AWS Data Pipeline SDK). For more information, see What is AWS Data Pipeline? (p. 1).
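As a quick reference, a typical management session with the CLI, using only the parameters already shown in the tutorials in this guide, might look like the following sketch on Linux/Unix/Mac OS (the pipeline ID is a placeholder; on Windows, prefix each command with ruby). The parameters for retrieving, modifying, and deleting a pipeline are covered in the topics listed above.

# Find the ID of the pipeline you want to manage
./datapipeline --list-pipelines

# Review the status of its runs
./datapipeline --list-runs --id df-akiaiosfodnn7example

# Activate it if it is not yet performing work
./datapipeline --activate --id df-akiaiosfodnn7example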


More information

Continuous Delivery on AWS. Version 1.0 DO NOT DISTRIBUTE

Continuous Delivery on AWS. Version 1.0 DO NOT DISTRIBUTE Continuous Version 1.0 Copyright 2013, 2014 Amazon Web Services, Inc. and its affiliates. All rights reserved. This work may not be reproduced or redistributed, in whole or in part, without prior written

More information

Healthstone Monitoring System

Healthstone Monitoring System Healthstone Monitoring System Patrick Lambert v1.1.0 Healthstone Monitoring System 1 Contents 1 Introduction 2 2 Windows client 2 2.1 Installation.............................................. 2 2.2 Troubleshooting...........................................

More information

AWS Command Line Interface. User Guide

AWS Command Line Interface. User Guide AWS Command Line Interface User Guide AWS Command Line Interface: User Guide Copyright 2015 Amazon Web Services, Inc. and/or its affiliates. All rights reserved. Table of Contents What Is the AWS CLI?...

More information

Integrating CoroSoft Datacenter Automation Suite with F5 Networks BIG-IP

Integrating CoroSoft Datacenter Automation Suite with F5 Networks BIG-IP Integrating CoroSoft Datacenter Automation Suite with F5 Networks BIG-IP Introducing the CoroSoft BIG-IP Solution Configuring the CoroSoft BIG-IP Solution Optimizing the BIG-IP configuration Introducing

More information

Monitoring Oracle Enterprise Performance Management System Release 11.1.2.3 Deployments from Oracle Enterprise Manager 12c

Monitoring Oracle Enterprise Performance Management System Release 11.1.2.3 Deployments from Oracle Enterprise Manager 12c Monitoring Oracle Enterprise Performance Management System Release 11.1.2.3 Deployments from Oracle Enterprise Manager 12c This document describes how to set up Oracle Enterprise Manager 12c to monitor

More information

Kaseya 2. Quick Start Guide. for VSA 6.1

Kaseya 2. Quick Start Guide. for VSA 6.1 Kaseya 2 Monitoring Configuration Quick Start Guide for VSA 6.1 January 17, 2011 About Kaseya Kaseya is a global provider of IT automation software for IT Solution Providers and Public and Private Sector

More information

Getting Started Guide: Getting the most out of your Windows Intune cloud

Getting Started Guide: Getting the most out of your Windows Intune cloud Getting Started Guide: Getting the most out of your Windows Intune cloud service Contents Overview... 3 Which Configuration is Right for You?... 3 To Sign up or Sign in?... 4 Getting Started with the Windows

More information

Getting Started using the SQuirreL SQL Client

Getting Started using the SQuirreL SQL Client Getting Started using the SQuirreL SQL Client The SQuirreL SQL Client is a graphical program written in the Java programming language that will allow you to view the structure of a JDBC-compliant database,

More information

Spector 360 Deployment Guide. Version 7

Spector 360 Deployment Guide. Version 7 Spector 360 Deployment Guide Version 7 December 11, 2009 Table of Contents Deployment Guide...1 Spector 360 DeploymentGuide... 1 Installing Spector 360... 3 Installing Spector 360 Servers (Details)...

More information

Amazon Relational Database Service. User Guide API Version 2013-09-09

Amazon Relational Database Service. User Guide API Version 2013-09-09 Amazon Relational Database Service User Guide Amazon Relational Database Service: User Guide Copyright 2014 Amazon Web Services, Inc. and/or its affiliates. All rights reserved. The following are trademarks

More information

Source Code Management for Continuous Integration and Deployment. Version 1.0 DO NOT DISTRIBUTE

Source Code Management for Continuous Integration and Deployment. Version 1.0 DO NOT DISTRIBUTE Source Code Management for Continuous Integration and Deployment Version 1.0 Copyright 2013, 2014 Amazon Web Services, Inc. and its affiliates. All rights reserved. This work may not be reproduced or redistributed,

More information

safend a w a v e s y s t e m s c o m p a n y

safend a w a v e s y s t e m s c o m p a n y safend a w a v e s y s t e m s c o m p a n y SAFEND Data Protection Suite Installation Guide Version 3.4.5 Important Notice This guide is delivered subject to the following conditions and restrictions:

More information

How To Install An Aneka Cloud On A Windows 7 Computer (For Free)

How To Install An Aneka Cloud On A Windows 7 Computer (For Free) MANJRASOFT PTY LTD Aneka 3.0 Manjrasoft 5/13/2013 This document describes in detail the steps involved in installing and configuring an Aneka Cloud. It covers the prerequisites for the installation, the

More information

Managing Multi-Tiered Applications with AWS OpsWorks

Managing Multi-Tiered Applications with AWS OpsWorks Managing Multi-Tiered Applications with AWS OpsWorks Daniele Stroppa January 2015 Contents Contents Abstract Introduction Key Concepts Design Micro-services Architecture Provisioning and Deployment Managing

More information

IaaS Configuration for Cloud Platforms

IaaS Configuration for Cloud Platforms vcloud Automation Center 6.0 This document supports the version of each product listed and supports all subsequent versions until the document is replaced by a new edition. To check for more recent editions

More information

Amazon Web Services Primer. William Strickland COP 6938 Fall 2012 University of Central Florida

Amazon Web Services Primer. William Strickland COP 6938 Fall 2012 University of Central Florida Amazon Web Services Primer William Strickland COP 6938 Fall 2012 University of Central Florida AWS Overview Amazon Web Services (AWS) is a collection of varying remote computing provided by Amazon.com.

More information

FileMaker Server 15. Getting Started Guide

FileMaker Server 15. Getting Started Guide FileMaker Server 15 Getting Started Guide 2007 2016 FileMaker, Inc. All Rights Reserved. FileMaker, Inc. 5201 Patrick Henry Drive Santa Clara, California 95054 FileMaker and FileMaker Go are trademarks

More information

Every Silver Lining Has a Vault in the Cloud

Every Silver Lining Has a Vault in the Cloud Irvin Hayes Jr. Autodesk, Inc. PL6015-P Don t worry about acquiring hardware and additional personnel in order to manage your Vault software installation. Learn how to spin up a hosted server instance

More information

Installation Overview

Installation Overview Contents Installation Overview... 2 How to Install Ad-Aware Management Server... 3 How to Deploy the Ad-Aware Security Solutions... 5 General Deployment Conditions... 5 Deploying Ad-Aware Management Agent...

More information

Laptop Backup - Administrator Guide (Windows)

Laptop Backup - Administrator Guide (Windows) Laptop Backup - Administrator Guide (Windows) Page 1 of 86 Page 2 of 86 Laptop Backup - Administrator Guide (Windows) TABLE OF CONTENTS OVERVIEW PREPARE COMMCELL SETUP FIREWALL USING PROXY SETUP FIREWALL

More information

Tutorial: Mobile Business Object Development. SAP Mobile Platform 2.3 SP02

Tutorial: Mobile Business Object Development. SAP Mobile Platform 2.3 SP02 Tutorial: Mobile Business Object Development SAP Mobile Platform 2.3 SP02 DOCUMENT ID: DC01927-01-0232-01 LAST REVISED: May 2013 Copyright 2013 by Sybase, Inc. All rights reserved. This publication pertains

More information

Monitoring System Status

Monitoring System Status CHAPTER 14 This chapter describes how to monitor the health and activities of the system. It covers these topics: About Logged Information, page 14-121 Event Logging, page 14-122 Monitoring Performance,

More information

SERVER CLOUD RECOVERY. User Guide

SERVER CLOUD RECOVERY. User Guide SERVER CLOUD RECOVERY User Guide Table of Contents 1. INTRODUCTION... 4 2. PRODUCT OVERVIEW... 4 3. GETTING STARTED... 5 3.1 Sign up... 5 4. ACCOUNT SETUP... 8 4.1 Overview... 8 4.2 Steps to create a new

More information

Data processing goes big

Data processing goes big Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,

More information

Comsol Multiphysics. Running COMSOL on the Amazon Cloud. VERSION 4.3a

Comsol Multiphysics. Running COMSOL on the Amazon Cloud. VERSION 4.3a Comsol Multiphysics Running COMSOL on the Amazon Cloud VERSION 4.3a Running COMSOL on the Amazon Cloud 1998 2012 COMSOL Protected by U.S. Patents 7,519,518; 7,596,474; and 7,623,991. Patents pending. This

More information

Automate Your BI Administration to Save Millions with Command Manager and System Manager

Automate Your BI Administration to Save Millions with Command Manager and System Manager Automate Your BI Administration to Save Millions with Command Manager and System Manager Presented by: Dennis Liao Sr. Sales Engineer Date: 27 th January, 2015 Session 2 This Session is Part of MicroStrategy

More information

AWS Management Portal for vcenter. User Guide

AWS Management Portal for vcenter. User Guide AWS Management Portal for vcenter User Guide AWS Management Portal for vcenter: User Guide Copyright 2015 Amazon Web Services, Inc. and/or its affiliates. All rights reserved. Amazon's trademarks and trade

More information

Using The Hortonworks Virtual Sandbox

Using The Hortonworks Virtual Sandbox Using The Hortonworks Virtual Sandbox Powered By Apache Hadoop This work by Hortonworks, Inc. is licensed under a Creative Commons Attribution- ShareAlike3.0 Unported License. Legal Notice Copyright 2012

More information

Kaseya 2. User Guide. Version 1.1

Kaseya 2. User Guide. Version 1.1 Kaseya 2 Directory Services User Guide Version 1.1 September 10, 2011 About Kaseya Kaseya is a global provider of IT automation software for IT Solution Providers and Public and Private Sector IT organizations.

More information

Creating a DUO MFA Service in AWS

Creating a DUO MFA Service in AWS Amazon AWS is a cloud based development environment with a goal to provide many options to companies wishing to leverage the power and convenience of cloud computing within their organisation. In 2013

More information

Copyright 2012 Trend Micro Incorporated. All rights reserved.

Copyright 2012 Trend Micro Incorporated. All rights reserved. Trend Micro Incorporated reserves the right to make changes to this document and to the products described herein without notice. Before installing and using the software, please review the readme files,

More information

JAMF Software Server Installation Guide for Linux. Version 8.6

JAMF Software Server Installation Guide for Linux. Version 8.6 JAMF Software Server Installation Guide for Linux Version 8.6 JAMF Software, LLC 2012 JAMF Software, LLC. All rights reserved. JAMF Software has made all efforts to ensure that this guide is accurate.

More information

MATLAB on EC2 Instructions Guide

MATLAB on EC2 Instructions Guide MATLAB on EC2 Instructions Guide Contents Welcome to MATLAB on EC2...3 What You Need to Do...3 Requirements...3 1. MathWorks Account...4 1.1. Create a MathWorks Account...4 1.2. Associate License...4 2.

More information

Oracle Service Bus Examples and Tutorials

Oracle Service Bus Examples and Tutorials March 2011 Contents 1 Oracle Service Bus Examples... 2 2 Introduction to the Oracle Service Bus Tutorials... 5 3 Getting Started with the Oracle Service Bus Tutorials... 12 4 Tutorial 1. Routing a Loan

More information

AWS Command Line Interface. User Guide

AWS Command Line Interface. User Guide AWS Command Line Interface User Guide AWS Command Line Interface: User Guide Copyright 2015 Amazon Web Services, Inc. and/or its affiliates. All rights reserved. Amazon's trademarks and trade dress may

More information

A Quick Guide to use Cloud Computing through Amazon s Platform

A Quick Guide to use Cloud Computing through Amazon s Platform A Quick Guide to use Cloud Computing through Amazon s Platform Farhat N. Memon, Anne M. Owen and Andrew P. Harrison Departments of Mathematical Sciences and Biological Sciences, University of Essex, Wivenhoe

More information

HelpSystems Web Server User Guide

HelpSystems Web Server User Guide HelpSystems Web Server User Guide Copyright Copyright HelpSystems, LLC. Robot is a division of HelpSystems. HelpSystems Web Server, OPAL, OPerator Assistance Language, Robot ALERT, Robot AUTOTUNE, Robot

More information