Platfora Deployment Planning Guide Version 5.3 Copyright Platfora 2016 Last Updated: 5:30 p.m. June 27, 2016
Contents

Document Conventions
Contact Platfora Support
Copyright Notices
Chapter 1: About Platfora Deployments
    Platfora Deployment Architectures
        On-Premise Hadoop Deployments
        Amazon AWS Cloud Deployments
        Google Cloud Platform Deployments
    Platfora Server Architecture
    FAQs Platfora Deployments
Chapter 2: Supported Environments and Versions
Chapter 3: System Requirements (On-Premise)
    Platfora Server Requirements
    Hadoop Resource Requirements
Chapter 4: System Requirements (AWS Cloud)
    Platfora EC2 Instance Requirements
    Amazon EMR Instance Requirements
    AWS Security Settings for Platfora
        Amazon AWS Virtual Private Cloud (VPC)
        IAM User and IAM Roles for Platfora
        EC2 Security Group Settings
Chapter 5: System Requirements (GCP Cloud)
    Platfora Compute Engine Machine Requirements
    Google Dataproc Machine Requirements
    GCP Security Settings for Platfora
Chapter 6: Port Configuration Requirements
    Ports to Open on Platfora Nodes
    Ports to Open on Hadoop Nodes
Chapter 7: Browser Requirements
Appendix A: Hardware Specifications for Platfora Nodes
Appendix B: EC2 Considerations for Platfora Instances
Preface

This guide provides information about what you need to consider when deploying a new Platfora cluster. It is intended for system and Hadoop administrators who are responsible for procuring and managing server resources. Knowledge of Linux system administration, network administration, and Hadoop administration is recommended.

Document Conventions

This documentation uses certain text conventions for language syntax and code examples.

$
    Command-line prompt; precedes a command to be entered in a command-line
    terminal session.
    Example: $ ls

$ sudo
    Command-line prompt for a command that requires root permissions
    (commands will be prefixed with sudo).
    Example: $ sudo yum install open-jdk-1.7

UPPERCASE
    Function names and keywords are shown in all uppercase for readability,
    but keywords are case-insensitive (can be written in upper or lower case).
    Example: SUM(page_views)

italics
    Italics indicate a user-supplied argument or variable.
    Example: SUM(field_name)

[ ] (square brackets)
    Square brackets denote optional syntax items.
    Example: CONCAT(string_expression[,...])

... (ellipsis)
    An ellipsis denotes a syntax item that can be repeated any number of times.
    Example: CONCAT(string_expression[,...])
Contact Platfora Support

For technical support, you can send an email to support@platfora.com, or visit the Platfora support site for the most up-to-date product news, knowledge base articles, and product tips: http://support.platfora.com

To access the support portal, you must have a valid support agreement with Platfora. Please contact your Platfora sales representative for details about obtaining a valid support agreement or with questions about your account.

Copyright Notices

Copyright 2012-16 Platfora Corporation. All rights reserved.

Platfora believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED AS IS. PLATFORA CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any Platfora software described in this publication requires an applicable software license.

Platfora, You Should Know, Interest Driven Pipeline, Fractal Cache, and Adaptive Job Synthesis are trademarks of the Platfora Corporation. Apache Hadoop and Apache Hive are trademarks of the Apache Software Foundation. All other trademarks used herein are the property of their respective owners.

Embedded Software Copyrights and License Agreements

Platfora contains the following open source and third-party proprietary software, subject to their respective copyrights and license agreements:

Apache Hive PDK
dom4j
freemarker
GeoNames
Google Maps API
Apache Jandex
Apache POI
javassist
javax.servlet
Mortbay Jetty 6.1.26
OWASP CSRFGuard 3
PostgreSQL JDBC 9.1-901
Scala
sjsxp 1.0.1
Unboundid
Tableau
jbcrypt
SimpleSlider
Chapter 1

About Platfora Deployments

Platfora runs on dedicated servers in the same network as your Hadoop deployment, which can be in an on-premise data center or in the cloud. Platfora uses the data processing services of Hadoop to process and prepare data for analysis, and the data storage services of Hadoop to access the raw data and to store the output of the optimized data it prepares. This section explains how Platfora is deployed and the basics of the Platfora/Hadoop server architecture.

Topics:
Platfora Deployment Architectures
Platfora Server Architecture
FAQs Platfora Deployments

Platfora Deployment Architectures

The Platfora software runs on a scale-out cluster of servers. These servers can be physical servers in an on-premise data center or virtual server instances in the cloud. Platfora uses native Hadoop protocols to connect to the distributed file system and data processing services of Hadoop. Platfora should be deployed on dedicated machines with low-latency connections to these Hadoop cluster services. This section explains how Platfora is deployed in your network environment, using either an on-premise, Google Dataproc cloud, or AWS cloud deployment of Hadoop.

On-Premise Hadoop Deployments

An on-premise Hadoop deployment means that you already have an existing Hadoop installation in your data center (either a physical data center or a virtual private cloud).
Platfora connects to the Hadoop cluster managed by your organization, and the majority of your organization's data is stored in the distributed file system of this primary Hadoop cluster.

For on-premise Hadoop deployments, the Platfora servers should be on their own dedicated hardware, co-located in the same data center as your Hadoop cluster. A data center can be a physical location with actual hardware resources, or a virtual private cloud environment with virtual server instances (such as Rackspace or Amazon EC2). Platfora recommends putting the Platfora servers on a network with at least 1 Gbps connectivity to the Hadoop nodes.

Platfora users access the Platfora master node using an HTML5-compliant web browser. The Platfora master node accesses the HDFS NameNode and the MapReduce JobTracker or YARN Resource Manager using native Hadoop protocols. The Platfora worker nodes access the HDFS DataNodes directly. If using a firewall, Platfora recommends placing the Platfora servers on the same side of the firewall as your Hadoop cluster.

Platfora software can run on a wide variety of server configurations, on as little as one server or scaled across multiple servers. Since Platfora runs best with all of the active lenses readily available in RAM, Platfora recommends obtaining servers optimized for higher RAM capacity and a minimum of 8 CPUs.
Amazon AWS Cloud Deployments

An Amazon Web Services (AWS) cloud deployment means that you do not have a persistent Hadoop cluster. Instead, your organization uses Amazon S3 for raw data storage and Amazon EMR for on-demand Hadoop data processing.

In an Amazon AWS cloud deployment, the Platfora server instances are deployed on dedicated, high-memory EC2 instances. Your organization's raw data is managed in Amazon's Simple Storage Service (S3). Platfora uses Amazon Elastic MapReduce (EMR) to run its data processing jobs (lens builds). The results of the lens build jobs are then written back to S3.
Google Cloud Platform Deployments

A Google Cloud Platform (GCP) cloud deployment means that you do not have a persistent Hadoop cluster. Instead, your organization uses Google Cloud Storage for raw data storage and Google Cloud Dataproc for on-demand Hadoop data processing.

In a Google Cloud Platform deployment, the Platfora server instances are deployed on dedicated, high-memory Google Compute Engine instances. Your organization's raw data is managed in Google's Cloud Storage. Platfora uses Google Cloud Dataproc to run its data processing jobs (lens builds). The results of the lens build jobs are then written back to Google Cloud Storage.

Platfora Server Architecture

Platfora connects to an existing Hadoop implementation, and makes the raw data residing in Hadoop accessible to users. The Platfora server has a number of services that work together with Hadoop's
services to access the raw data, prepare it for analysis, and present the results to users. This topic helps you understand the main components of the Platfora server architecture.

The Platfora Master Node

You can have a fully-functioning Platfora installation with just one node: the master node. The master node manages the following Platfora services:

Metadata Catalog - Platfora's metadata catalog holds all of the information about the data managed by Platfora (the datasets, lenses, vizboards, and so on). The metadata catalog is a relational database that runs on the Platfora master node, but is accessed by all nodes in the Platfora cluster.

Lens Builder - The lens builder interfaces with the data processing services of Hadoop. It translates data requests from the Platfora application into a series of custom MapReduce jobs, which it then submits to the Hadoop Job Tracker or Resource Manager for execution. After the requested data has been extracted and transformed in Hadoop, the job results are written back to the Hadoop file system in Platfora's proprietary file format, called a lens.

On-Disk Storage - Finished lenses are immediately copied from the Hadoop file system to on-disk storage on the Platfora nodes. The data of a lens is distributed across all of the available worker nodes in a Platfora cluster.

In-Memory Query Engine - When users explore and analyze data in Platfora, they are actually generating queries that run against a lens. The result of a lens query is rendered as a visualization in Platfora. When users construct visualizations, they choose a lens to work with. Choosing a lens
loads its data into Platfora's in-memory query engine. The in-memory query engine has two kinds of processes that work on a query:

1. Query Coordinator - The query coordinator process runs on the master node only, and translates actions made in the Platfora application into queries. The coordinator sends the query to the workers for processing, then consolidates the partial results from each worker into a final result.

2. Query Worker - The query worker process typically runs on the worker nodes, but the master may also serve as a worker in some cases. A query worker process works on its portion of lens data for a given query.

Web Application Server - Platfora's user interface runs as a web application in your network. Users connect to Platfora using any HTML5-compliant browser. Through the browser, users interact with data in Hadoop as easily as browsing a web site.

The Platfora Worker Nodes

The Platfora worker nodes are used to distribute lens storage capacity and query processing workload. As users work with more and bigger lenses in Platfora, more memory and processing power is needed to render visualizations quickly. Administrators can add additional worker nodes to scale up lens storage capacity and performance. By using the resources of multiple machines to store and process lens data, Platfora can handle true 'big data' query workloads.
FAQs Platfora Deployments

Got questions about what you need to get Platfora up and running? Want to know how Platfora is deployed in your data center environment and how it works with Hadoop? This topic answers the most frequently asked questions (FAQs) about Platfora installation and deployment.

What do I need before I can install Platfora?

Before you can install Platfora, you will need:

Hadoop - Platfora needs access to an installed and running Hadoop cluster, or to a Google Cloud Platform account with Google Cloud Storage and Google Cloud Dataproc enabled, or to an Amazon Web Services (AWS) account with Amazon S3 (Simple Storage Service) and EMR (Elastic MapReduce) enabled.

Linux Server(s) - You will need one or more dedicated servers running a supported Linux operating system on which to install Platfora. The Platfora server(s) should be in the same data center (or region) as your Hadoop distribution, but not on the same machines.

Platfora Binaries - A Platfora customer support representative can give you the download link to the Platfora installation package for your chosen Hadoop distribution. Platfora provides both rpm and tar installer packages.

Platfora License - A Platfora customer support representative must issue you a license file. Trial period licenses are available upon request for pilot installations.

Platfora Installation Guide - You will need the Platfora installation guide that covers your specific Hadoop distribution. The setup steps vary slightly depending on the version of Hadoop you are using.

What are the high-level steps involved in installing Platfora?
Every Platfora installation involves these basic steps, although the details will vary slightly depending on the Hadoop distribution you are using:

Configure Hadoop for Platfora Access - Make sure that the Platfora server(s) can access your Hadoop services over the network and that Platfora has write access to a designated directory in the Hadoop file system. Obtain the required connection details for your Hadoop services (Platfora connects to Hadoop during setup).

Install Prerequisites on all Platfora Nodes - Make sure the Platfora servers have the required dependencies before installing Platfora. If using the rpm installer, Platfora provides a base package that includes the dependencies. If using the tar installer, you will need to manually install the dependent software yourself.

Install the Platfora Software on the Master - Install the Platfora binaries on the master node.

Set Up the Platfora Master - Run the setup utility to configure the Platfora master server and connect it to your Hadoop services.

Start Platfora - After setup completes, start the Platfora server. You should now have a fully-functioning single-node Platfora installation.
Run Tests and Load the Tutorial Data - After setup completes, you may want to run some tests to make sure that Platfora is properly configured and can access your Hadoop cluster. One way to test everything is to load the tutorial data that comes with your Platfora installation. This will put some data in Hadoop and build a small lens to make sure everything is working.

Add Platfora Worker Nodes - Once you have the Platfora master node up and running, you can use it to add Platfora worker nodes to the cluster. The master node is always used to install and manage the worker nodes.

Is there a trial version of Platfora?

Platfora does not currently have a trial version available for download. You can contact Platfora Customer Support to arrange for a pilot or trial installation.

Why would I need multiple Platfora nodes?

When users work with lens data in Platfora, that data is loaded into memory so that queries (vizzes) are fast and responsive. If there is more lens data than can fit into memory, then some queries may be slow or not able to run at all. Adding more nodes to your Platfora cluster makes more disk, memory, and CPU available to store and process lens data.

How many Platfora nodes would I need?

Platfora is intended for big data query workloads, and performs best when using the resources of multiple machines. Although you can have a fully-functioning Platfora installation with just one node, a multi-node installation is necessary for optimal performance and bigger lens sizes. The ideal number of Platfora nodes depends on many factors: lens size, lens quantity, data variety, and number of concurrent users (to name a few). Your Platfora account representative will help you determine the number of nodes that best fits your unique data requirements. You can also scale up your Platfora cluster as your data and usage grows.

How does Platfora interact with Hadoop?
Platfora uses the powerful distributed storage and processing features of Hadoop, but masks the complexity of working with HDFS and MapReduce by providing an easy-to-use web interface.

Platfora uses Hadoop to access the raw data stored in its distributed file system (DFS) and makes the data visible to Platfora users. It uses the data processing services of Hadoop (MapReduce) to pull requested data and prepare it for analysis. The result of these processing jobs is the Platfora lens. Platfora lenses are stored in the Hadoop distributed file system, as well as copied over to the Platfora servers.

Can Platfora connect to more than one source system?

When you install Platfora, you connect it to one Hadoop distribution. This is the primary source system that Platfora uses to access the source data and process its lens builds. You can create data sources that point to external sources (such as a cloud storage service or a relational database). However, this external data must be pulled over to the primary Hadoop source system during
lens build processing. To avoid moving large amounts of data over the network, Platfora recommends using external data sources for smaller, supplemental datasets only.

What does Platfora do to the data in Hadoop?

Platfora reads the raw data, but does not edit, update, or delete it in place. It makes a copy of the requested portion of the data when it builds a lens, and does its lens processing on the copied data. Your original data remains intact and unaltered.

How does Platfora keep my data secure?

Platfora's role-based security allows you to control who can authenticate to the Platfora application and what actions they can perform. You can maintain user credentials within the Platfora application, or configure Platfora to use an external LDAP directory service to authenticate users. To authorize access to the raw data, you can either manage data access permissions within the Platfora application itself, or you can configure Platfora to use Kerberos authorization to check the HDFS file system permissions.

How does Platfora handle redundancy and high availability?

Platfora relies on Hadoop for redundancy and high availability of the raw data itself.

The Platfora worker nodes are fully redundant and highly available. The worker nodes process the lens queries submitted to the Platfora application. Lens data is distributed and replicated across all of the worker nodes in the Platfora cluster. Depending on the number of worker nodes you have, you can lose a node and still continue processing queries without interruption of service.

Making the Platfora master node redundant involves taking routine backups of the metadata catalog database so that you can restore the master node if needed.
Chapter 2

Supported Environments and Versions

This section lists the environments and versions that Platfora supports.

Hadoop and Hive Versions

This section lists the Hadoop distributions and versions that are compatible with the Platfora installation packages. If using Hive as a data source for Platfora, the version of Hive must be compatible with the version of Hadoop you are using.

Hadoop Distro            Version         Hive Version   M/R Version   Platfora Package
Cloudera 5               CDH 5.3.1+      0.13.1         YARN          cdh52
                         CDH 5.4         1.1            YARN          cdh54
                         CDH 5.5         1.1            YARN          cdh54
                         CDH 5.7         1.1            YARN          cdh54
Hortonworks              HDP 2.2.x       0.14.0         YARN          hadoop_2_6_0_hive_0_14_0
                         HDP 2.3.x       1.2.1          YARN          hadoop_2_7_1_hive_1_2_1
                         HDP 2.4.x       1.2.1          YARN          hadoop_2_7_1_hive_1_2_1
MapR                     MapR 4.0.2      0.13.0         YARN          mapr402
                         MapR 4.1.0      0.13.0         YARN          mapr41
                         MapR 5.0.0      1.1            YARN          mapr5
                         MapR 5.1.0      1.1            YARN          mapr51
Pivotal Labs             PivotalHD 3.0   0.14.0         YARN          hadoop_2_6_0_hive_0_14_0
Amazon EMR (AMI 3.10.x)  Hadoop 2.4.0    0.13.1         YARN          hadoop_2_4_0_hive_0_13_0
Google Dataproc (1.0)    Hadoop 2.7.2    1.2.1          YARN          hadoop_2_7_2_hive_1_2_1
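For scripted provisioning, the table above can be treated as a simple lookup. The sketch below is illustrative only (the function name is an assumption, and the mappings are copied from the table, so verify them against your Platfora release notes):

```shell
#!/bin/sh
# Look up the Platfora installer package for a Hadoop distro version,
# per the compatibility table above. Function name is illustrative.
platfora_package() {
    case "$1" in
        "CDH 5.3"*)                    echo cdh52 ;;
        "CDH 5.4"|"CDH 5.5"|"CDH 5.7") echo cdh54 ;;
        "HDP 2.2"*)                    echo hadoop_2_6_0_hive_0_14_0 ;;
        "HDP 2.3"*|"HDP 2.4"*)         echo hadoop_2_7_1_hive_1_2_1 ;;
        "MapR 4.0.2")                  echo mapr402 ;;
        "MapR 4.1.0")                  echo mapr41 ;;
        "MapR 5.0.0")                  echo mapr5 ;;
        "MapR 5.1.0")                  echo mapr51 ;;
        "PivotalHD 3.0")               echo hadoop_2_6_0_hive_0_14_0 ;;
        *)                             echo "unsupported"; return 1 ;;
    esac
}

platfora_package "CDH 5.4"
```

A wrapper like this is only a convenience for automation; the table remains the authoritative source.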
Operating Systems

Red Hat Enterprise Linux: 6.2, 6.3, 6.4, 6.5, and 6.6
CentOS: 6.2, 6.3, 6.4, 6.5, and 6.6
Scientific Linux: 6.2
Amazon Linux AMI: AMI 2014.03 and AMI 2015.03
Ubuntu: 12.04.1 LTS
Oracle Linux: 6.x

Web Browsers

Chrome: Latest version (Evergreen) and three previous releases
Firefox: 25.0.x or higher
Safari: 6.1+ and 7.x
Internet Explorer (with the Compatibility View feature disabled): IE 11 (Windows 7, Windows 8, Windows 10); IE 10 (Windows 7 and Windows 8)

Platfora supports these web browsers on desktop machines only. Platfora recommends using a screen resolution width of 1400 pixels or greater for viewing some pages in the Platfora web application.

Java

java-1.7.0-openjdk (recommended)
Java 1.7.0 Sun/Oracle

Python

Python 2.6.8, 2.7.1, 2.7.3, 2.7.4, 2.7.5, 2.7.6, 2.7.7, 2.7.8 only

Postgres Database

PostgreSQL 9.2.1-1, 9.2.1-1.28 (on Amazon AMI), 9.2.5, 9.2.7
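When provisioning nodes, the Python version list above can be enforced with a quick gate in a preflight script. A minimal sketch (the function name is illustrative; the version list is the one quoted above):

```shell
#!/bin/sh
# Check a Python version string against the supported list above:
# 2.6.8, 2.7.1, and 2.7.3 through 2.7.8 only.
supported_python() {
    case "$1" in
        2.6.8|2.7.1|2.7.[3-8]) echo supported ;;
        *)                     echo unsupported ;;
    esac
}

# Check the Python on this host, if any.
supported_python "$(python -c 'import platform; print(platform.python_version())' 2>/dev/null)"
```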
Chapter 3

System Requirements (On-Premise)

The Platfora software runs on a scale-out cluster of servers. You can install Platfora on a single node to start, and then scale up storage and processing capacity by adding additional nodes. Platfora requires access to an existing, compatible Hadoop implementation in order to start. Users then access the Platfora application using a compatible web browser client.

This section describes the system requirements for on-premise deployments of the Platfora servers, Hadoop source systems, network connectivity, and web browser clients.

Topics:
Platfora Server Requirements
Hadoop Resource Requirements

Platfora Server Requirements

Platfora recommends the following minimum system requirements for Platfora servers. For multi-node installations, the master server and all worker servers must have the same operating system (OS) and system configuration (same amount of memory, CPU, etc.).

64-bit Operating System or Amazon Machine Image (AMI):
    CentOS 6.2-6.5 (7.0 is not supported)
    RHEL 6.2-6.5 (7.0 is not supported)
    Scientific Linux 6.2
    Amazon Linux AMI 2014.03+
    Oracle Enterprise Linux 6.x
    Ubuntu 12.04.1 LTS or higher
    Security-Enhanced Linux 6.2 (1)

Software:
    Java 1.7
    Python 2.6.8, 2.7.1, 2.7.3 through 2.7.6 (3.0 not supported)
    PostgreSQL 9.2.1-1, 9.2.5, 9.2.7 or 9.3 (master only)
    OpenSSL 1.0.1 or higher (2)

Unix Utilities:
    rsync, ssh, scp, cp, tar, tail, sysctl, ntp, wget

(1) If you wish to install Security-Enhanced Linux, refer to Platfora's Support site for installation instructions.
(2) Only required if you want to enable SSL for secure communications between Platfora servers.
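The required Unix utilities listed above can be verified before installation with a short preflight check. A minimal sketch (the function name is illustrative):

```shell
#!/bin/sh
# Preflight check: verify that required Unix utilities are on the PATH.
check_utils() {
    missing=""
    for util in "$@"; do
        command -v "$util" >/dev/null 2>&1 || missing="$missing $util"
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "ok"
}

# Utility list taken from the requirements above.
check_utils rsync ssh scp cp tar tail sysctl ntp wget || true
```

Running this on each candidate node before installation catches missing dependencies early, which matters most with the tar installer, where dependencies are installed manually.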
Memory:
    64 GB minimum, 256 GB recommended. The server needs enough memory to
    accommodate actively used lens data, plus 1-2 GB reserved for normal
    operations and the lens query engine workspace.

CPU:
    8 cores minimum, 16 recommended.

Disk:
    All Platfora nodes (master or worker) require 300 MB for the Platfora
    installation. Every node requires high-speed local storage and a local
    disk cache configured as a single logical volume; hardware RAID is
    recommended for the best performance. All nodes combined require
    appropriate free space for aggregated data structures (Platfora lenses);
    at a minimum, you will need twice the amount of disk space as the amount
    of system memory. The Platfora master node requires approximately 850 MB
    of additional space for the metadata catalog (dataset definitions,
    vizboard and visualization definitions, lens definitions, etc.).

Network:
    1 Gbps reliable network connectivity between the Platfora master server
    and query processing servers.
    1 Gbps reliable network connectivity between the Platfora master server
    and the Hadoop NameNode and JobTracker/ResourceManager node.
    Network bandwidth should be comparable to the amount of memory on the
    Platfora master server.

Hadoop Resource Requirements

Platfora must be able to connect to an existing Hadoop installation. Platfora also requires permissions and resources in the Hadoop source system. This section describes the Hadoop resource requirements for Platfora.

Platfora uses the remote Distributed File System (DFS) of the Hadoop cluster for persistent storage and as the primary data source. Optionally, you can also configure Platfora to use a Hive metastore server as a data source.
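The disk-to-memory rule above is easy to sanity-check during capacity planning. A minimal sketch (the 2x multiplier is the minimum quoted above; the function name is illustrative):

```shell
#!/bin/sh
# Minimum lens-storage disk for a Platfora node, per the rule above:
# at least twice the node's system memory.
min_disk_gb() {
    echo $(( $1 * 2 ))
}

# Example: a 64 GB node needs at least 128 GB of free disk for lenses;
# a 256 GB node needs at least 512 GB.
min_disk_gb 64
min_disk_gb 256
```

Remember this is a floor, not a target: the master also needs the ~850 MB metadata catalog space noted above, and growth in lens count and size consumes disk over time.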
Platfora uses the Hadoop MapReduce services to process data and build lenses. For larger lens builds to succeed, Platfora requires minimum resources on the Hadoop cluster for MapReduce tasks.

DFS Disk Space:
    Platfora requires a designated persistent storage directory in the remote
    distributed file system (DFS) with appropriate free space for Platfora
    system files and data structures (lenses). The location is configurable.

DFS Permissions:
    The platfora system user needs read permissions to source data
    directories and files, and write permissions to Platfora's persistent
    storage directory on DFS.

MapReduce Permissions:
    The platfora system user needs to be added to the submit-jobs and
    administer-jobs access control lists (or added to a group that has these
    permissions).

DFS Resources:
    Minimum Open File Limit = 5000

MapReduce Resources:
    Minimum Memory for Task Processes = 1 GB
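The open-file minimum above can be checked against a node's current limit using the shell's ulimit. A minimal sketch (the function name is illustrative):

```shell
#!/bin/sh
# Compare an open-file limit (as reported by `ulimit -n`) against
# Platfora's DFS minimum of 5000.
check_open_files() {
    limit=$1
    min=5000
    if [ "$limit" = "unlimited" ] || [ "$limit" -ge "$min" ]; then
        echo "ok ($limit)"
    else
        echo "too low ($limit < $min)"
    fi
}

check_open_files "$(ulimit -n)"
```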
Chapter 4

System Requirements (AWS Cloud)

This section describes the system requirements for customers who plan to use Amazon Web Services (AWS) as their installation environment for Platfora, with Simple Storage Service (S3) and Elastic MapReduce (EMR) as their Hadoop distributed data storage and processing services.

Topics:
Platfora EC2 Instance Requirements
Amazon EMR Instance Requirements
AWS Security Settings for Platfora

Platfora EC2 Instance Requirements

Platfora recommends the following system requirements for Amazon EC2 instances that will serve as Platfora server nodes. For multi-node installations, the master server instance and all worker server instances must have the same configuration (same EC2 instance type, storage configuration, network configuration, etc.).

Amazon Machine Images (AMIs):
    Amazon Linux AMI 2014.03.x or higher
    Red Hat Enterprise Linux 6.2-6.5
    Ubuntu Server 12.04.1 LTS or higher

EC2 Instance Type:
    Small to Medium Lens Sizes: c3.8xlarge
    Medium to Large Lens Sizes, 10+ Platfora nodes: r3.8xlarge
    Medium to Large Lens Sizes, 1-9 Platfora nodes: i2.8xlarge

Root Device Volume (EBS):
    Recommended Size = 1 TB
    Type = General Purpose (SSD)

Additional EBS Volumes:
    Optional. Additional EBS volumes can be attached to an EC2 instance after
    launch time, and can be used to increase lens cache storage capacity if
    needed. EBS volumes are less expensive than Instance Store volumes, and
    the data is persistent between shutdowns.
Instance Store Volume (Ephemeral):
    Optional. You may choose to add instance store volumes for the Platfora
    lens cache instead of using EBS volumes. This costs more, but offers
    slightly faster performance. Instance store volumes can only be attached
    to an EC2 instance at launch time, and the data is not saved when the
    instance shuts down. The size of an instance store volume depends on the
    instance type:
        c3.8xlarge: 2 x 320 GB SSD (640 GB)
        r3.8xlarge: 2 x 320 GB SSD (640 GB)
        i2.8xlarge: 8 x 800 GB SSD (6400 GB)

Enhanced Networking:
    Yes (requires use of VPC instead of EC2-Classic)

EBS Optimized Instance:
    Yes (the 8xlarge instance types are EBS-optimized instances by default)

Availability Zone:
    Yes (use the same zone for all nodes in the Platfora cluster)

Placement Group:
    Yes (use the same placement group for all nodes in the Platfora cluster)

IAM User:
    Yes (create a dedicated Platfora IAM User in your AWS account)

Other Required Software:
    Java 1.7
    Python 2.7.8 through 2.7.9 (3.0 not supported)
    PostgreSQL 9.2.1-1.28 (AMZN), 9.2.5, 9.2.7 or 9.3 (master node only)
    OpenSSL 1.0.1 or higher (3)

Required Unix Utilities:
    rsync, ssh, scp, cp, tar, tail, sysctl, ntp, wget

(3) Only required if you want to enable SSL for secure communications between Platfora servers.

Amazon EMR Instance Requirements

Platfora launches an Elastic MapReduce (EMR) cluster when it builds a lens. This section describes the recommended requirements for the EMR instances that are launched by Platfora.

Amazon EMR is Hadoop as a web service. Platfora uses the EMR Hadoop cluster to process its lens builds. Since the EMR Hadoop cluster is only instantiated as needed, the source data does not reside in the Hadoop Distributed File System (HDFS) of the EMR Hadoop cluster. The source data is instead stored on Amazon S3. Data is copied from S3 to EMR for data processing, then the results are written back to S3 when the job completes.
At the start of a lens build job, the raw source data is copied from S3 to the local HDFS file system on the EMR nodes. The EMR instances must have enough local instance storage to support the input source dataset and the temporary workspace for intermediate lens build job results. Also consider that the local HDFS of the EMR cluster replicates the data to ensure redundancy and high availability during lens build processing.

Platfora recommends the i2.4xlarge instance type for EMR data nodes and the m3.xlarge for the EMR name node. The i2.4xlarge offers a good balance between total local disk space, CPU power, and per-node memory size.

Hadoop Version: 2.4.0
AMI Version: Amazon EMR 3 (AMI 3.10)
EMR NameNode Instance Type: m3.xlarge
EMR DataNode Instance Type: i2.4xlarge
Number of EMR DataNodes: The number of nodes you will need to complete a lens build depends on the following factors:
    The size of the raw dataset in S3 that is considered as input to the lens build.
    The replication factor of HDFS. EMR clusters of 1-4 nodes have a replication factor of 1, 5-9 nodes have a replication factor of 2, and 10 or more nodes have a replication factor of 3.
    Temporary work space for intermediate lens build results - about 20-30% of total disk space.

AWS Security Settings for Platfora

Amazon Web Services (AWS) has a number of security features that you can use to protect your AWS account and cloud server instances. This section contains security setting recommendations if you plan to use Amazon Elastic MapReduce (EMR) as the Hadoop implementation for your Platfora cluster.

Amazon AWS Virtual Private Cloud (VPC)

To use Amazon EMR for Hadoop data processing, Platfora must be able to launch an EMR cluster in a public subnet. Administrators do this by provisioning an Amazon VPC with a public subnet, and then specifying the subnet identifier in Platfora.
Platfora must create the EMR cluster on an Internet-facing subnet to allow the AWS EMR Provisioning Service to reach the EMR cluster.

Additionally, you must ensure the Platfora server can communicate with the Amazon EMR cluster. If the Platfora server is on the same subnet as the Amazon EMR cluster, this happens automatically. If
the Platfora server and the EMR cluster are on different VPC subnets, then a route between the subnets needs to be added to the route table(s) so that communication can occur between the two subnets. Also, if the VPC uses Access Control Lists (ACLs), then those ACLs must be modified to allow traffic from Platfora to Hadoop.

The subnet identifier cannot exceed 255 characters in length. After the Amazon VPC has been provisioned, specify its subnet identifier in the platfora.emr.subnet.id Platfora configuration property.

For more information on setting up and using an Amazon VPC with Amazon EMR, see http://docs.aws.amazon.com/elasticmapreduce/latest/developerguide/emr-plan-vpc-subnet.html.

IAM User and IAM Roles for Platfora

AWS Identity and Access Management (IAM) allows you to create users, groups, and roles to control access to AWS services and resources. Platfora recommends creating an IAM User account and two IAM Roles specifically for use by Platfora.

Platfora uses a combination of an IAM User and IAM Roles to communicate with Amazon AWS and to create an EMR cluster. An Amazon AWS administrator needs to create a platfora IAM User and two IAM Roles specifically for use by Platfora. Then a Platfora system administrator needs to enter some information about that user and those roles in Platfora.

The Platfora server uses the security credentials of the platfora IAM User to request that Amazon AWS create an Amazon EMR cluster. Once that request is approved, the platfora IAM User passes one IAM Role to launch the EMR cluster, and then uses another IAM Role to start EC2 instances in the EMR cluster. You must specify these roles in Platfora. For more details on creating the user and roles, see Create IAM User for Platfora and Create IAM Roles for Platfora.

Create IAM User for Platfora

The Amazon AWS administrator can create a new platfora user in the IAM Management Console of your AWS account.
After creating the user, download the AWS credentials for this user. The Platfora system administrator will need the Access Key Id and Secret Access Key when initializing Platfora for use with Amazon EMR.

The security policy for the platfora IAM User must have (at a minimum) the permissions listed in the following sample policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "iam:listroles",
        "iam:passrole",
        "elasticmapreduce:*",
        "s3:getbucketlocation",
        "s3:listallmybuckets"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:listbucket"
      ],
      "Resource": [
        "arn:aws:s3:::bucket_defined_in_core-site.xml",
        "arn:aws:s3:::datasource_bucket_1",
        "arn:aws:s3:::datasource_bucket_n"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:putobject",
        "s3:get*",
        "s3:deleteobject"
      ],
      "Resource": [
        "arn:aws:s3:::bucket_defined_in_core-site.xml/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:get*"
      ],
      "Resource": [
        "arn:aws:s3:::datasource_bucket_1/path/to/files/*",
        "arn:aws:s3:::datasource_bucket_n/*"
      ]
    }
  ]
}

Under Permissions for this user, attach a security policy that contains the permissions listed above. These permissions allow the platfora IAM User to pass an IAM Role to launch the EMR cluster, start an EMR cluster, and access S3 for source data during data ingest.

Create IAM Roles for Platfora

Amazon requires all AWS users to use IAM Roles to launch EMR clusters. One IAM Role is used to start the Amazon EMR service, and the other role is used by the EC2 instances in the EMR cluster. Amazon AWS offers default IAM Roles for these services. However, Platfora recommends creating custom IAM Roles specifically for use by Platfora instead.
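If you prefer the AWS CLI to the IAM Management Console, the user and roles described in this section can also be created from the command line. The following is a hedged sketch, not part of the Platfora product: the role, profile, and policy file names are placeholders, and the policy files are assumed to contain the JSON policy documents shown in this section (plus standard trust policies for the roles).

```shell
# Create the platfora IAM user, attach its policy, and generate the
# access keys the Platfora administrator enters during EMR setup.
aws iam create-user --user-name platfora
aws iam put-user-policy --user-name platfora \
    --policy-name platfora-emr-access \
    --policy-document file://platfora-user-policy.json
aws iam create-access-key --user-name platfora

# Create the EMR service role (its trust policy must allow
# elasticmapreduce.amazonaws.com to assume the role).
aws iam create-role --role-name platfora-emr-service \
    --assume-role-policy-document file://emr-trust-policy.json
aws iam put-role-policy --role-name platfora-emr-service \
    --policy-name platfora-emr-service-policy \
    --policy-document file://platfora-service-role-policy.json

# Create the EC2 instances role and wrap it in an instance profile.
aws iam create-role --role-name platfora-emr-ec2 \
    --assume-role-policy-document file://ec2-trust-policy.json
aws iam put-role-policy --role-name platfora-emr-ec2 \
    --policy-name platfora-emr-ec2-policy \
    --policy-document file://platfora-instance-role-policy.json
aws iam create-instance-profile --instance-profile-name platfora-emr-ec2
aws iam add-role-to-instance-profile \
    --instance-profile-name platfora-emr-ec2 --role-name platfora-emr-ec2
```

The role names created here are the values you would then enter in the platfora.emr.service.role and platfora.emr.jobflow.role configuration properties.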
The Amazon AWS administrator can create the IAM Roles in the IAM Management Console of your AWS account. Create a role for each of the following EMR cluster services, and specify them in Platfora using the specified configuration properties:

Amazon EMR service (service role). In Amazon AWS, create an IAM Role and attach a security policy that contains, at a minimum, the permissions specified below. Enter this IAM Role name in the platfora.emr.service.role Platfora configuration property. The custom role you define corresponds to the default IAM Role Amazon offers, called EMR_DefaultRole.

EC2 instances (instance profile) in the Amazon EMR cluster. In Amazon AWS, create an IAM Role and attach a security policy that contains, at a minimum, the permissions specified below. Enter this IAM Role name in the platfora.emr.jobflow.role Platfora configuration property. The custom role you define corresponds to the default IAM Role Amazon offers, called EMR_EC2_DefaultRole.

The security policy for the Amazon EMR service (service role) IAM Role must have (at a minimum) the permissions listed in the following sample policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "ec2:authorizesecuritygroupingress",
        "ec2:cancelspotinstancerequests",
        "ec2:createsecuritygroup",
        "ec2:createtags",
        "ec2:deletetags",
        "ec2:describe*",
        "ec2:modifyimageattribute",
        "ec2:modifyinstanceattribute",
        "ec2:requestspotinstances",
        "ec2:runinstances",
        "ec2:terminateinstances"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Action": [
        "iam:passrole",
        "iam:listrolepolicies",
        "iam:getrole",
        "iam:getrolepolicy",
        "iam:listinstanceprofiles"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:get*"
      ],
      "Resource": "arn:aws:s3:::bucket_defined_in_core-site.xml/*"
    }
  ]
}

The security policy for the EC2 instances (instance profile) IAM Role must have (at a minimum) the permissions listed in the following sample policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Resource": "*",
      "Action": [
        "ec2:describe*",
        "elasticmapreduce:describe*",
        "elasticmapreduce:listbootstrapactions",
        "elasticmapreduce:listclusters",
        "elasticmapreduce:listinstancegroups",
        "elasticmapreduce:listinstances",
        "elasticmapreduce:liststeps",
        "s3:listallmybuckets"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:listbucket"
      ],
      "Resource": [
        "arn:aws:s3:::bucket_defined_in_core-site.xml",
        "arn:aws:s3:::datasource_bucket_1",
        "arn:aws:s3:::datasource_bucket_n"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:putobject",
        "s3:get*",
        "s3:deleteobject"
      ],
      "Resource": [
        "arn:aws:s3:::bucket_defined_in_core-site.xml/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:get*",
        "s3:list*"
      ],
      "Resource": [
        "arn:aws:s3:::datasource_bucket_1/path/to/files/*",
        "arn:aws:s3:::datasource_bucket_n/*",
        "arn:aws:s3:::*elasticmapreduce/*"
      ]
    }
  ]
}

Verify that the EC2 instances role's permissions for, and access to, Amazon resources (especially S3) are the same as or greater than the permissions and access assigned to the platfora IAM User. For example, if the platfora IAM User can access an Amazon S3 bucket but the EC2 instances role cannot, then lens builds that rely on that S3 bucket will fail.

For more information on using IAM Roles for EMR, see http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles.html.

EC2 Security Group Settings

EC2 security groups allow you to specify firewall rules for your Amazon Elastic Compute Cloud (EC2) server instances. EC2 security group rules are independent of, and in addition to, the software firewall provided by the instance's operating system. Security groups must be defined before you create an EC2 instance.

The security group configured for the Platfora server instance must permit connections from your user network to the Platfora web application server port (8001 by default). You may also want to open the EMR Hadoop ResourceManager and JobHistory web ports so that you can monitor and troubleshoot YARN jobs executed by Platfora. For example, the security group might allow inbound TCP 8001 from the user network, plus inbound TCP 8088 (ResourceManager web UI) and 19888 (JobHistory web UI) for troubleshooting.
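A hedged sketch of such security group rules using the AWS CLI (the security group ID and CIDR range are placeholders for your environment):

```shell
# Open the Platfora web port and, optionally, the EMR Hadoop web UIs.
# sg-0123456789abcdef0 and 203.0.113.0/24 are placeholders.
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 8001 --cidr 203.0.113.0/24   # Platfora web app
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 8088 --cidr 203.0.113.0/24   # YARN RM web UI
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 19888 --cidr 203.0.113.0/24  # JobHistory web UI
```

The 8088 and 19888 rules are optional and only needed for troubleshooting YARN jobs, as noted above.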
Chapter 5
System Requirements (GCP Cloud)

This section describes the system requirements for customers who plan to use Google Cloud Platform (GCP) as their installation environment for Platfora, with Google Cloud Storage (GCS) and Cloud Dataproc as their Hadoop distributed data storage and processing services.

Topics:
Platfora Compute Engine Machine Requirements
Google Dataproc Machine Requirements
GCP Security Settings for Platfora

Platfora Compute Engine Machine Requirements

Platfora recommends the following system requirements for Google Compute Engine machines that will serve as Platfora server nodes. For multi-node installations, the master machine and all worker machines must have the same configuration (same Compute Engine machine type, storage configuration, network configuration, etc.).

Machine Boot Disk Operating System:
  Debian GNU/Linux 8 (jessie), Debian GNU/Linux 7 (wheezy), CentOS 6,
  Ubuntu 14.04 LTS, or Red Hat Enterprise Linux 6

Compute Engine Machine Type:
  Small to Medium Lens Sizes: Custom: 32 vCPUs and 64 GB of Memory (RAM)
  Medium to Large Lens Sizes, 1+ Platfora nodes: n1-highmem-32

Boot Disk Drive:
  Recommended Size = 1 TB; Type = SSD Persistent Disk

Additional Disks:
  Optional. Additional disks can be attached to a Compute Engine machine after
  launch time, and can be used to increase lens cache storage capacity if
  needed. Standard Persistent Disks are less expensive than SSD Persistent
  Disks, and the data is persistent between shutdowns.

Zone:
  Yes (use the same zone for all nodes in the Platfora cluster)

Google Service Account:
  Yes (create a dedicated Service Account for Platfora in your Google Cloud
  Platform account)

Other Required Software:
  Java 1.7
  Python 2.7.8 through 2.7.9 (3.0 not supported)
  PostgreSQL 9.2.5, 9.2.7, or 9.3 (master node only)
  OpenSSL 1.0.1 or higher (only required if you want to enable SSL for secure
  communications between Platfora servers)

Required Unix Utilities:
  rsync, ssh, scp, cp, tar, tail, sysctl, ntp, wget

Google Dataproc Machine Requirements

Platfora launches a Google Cloud Dataproc cluster when it builds a lens. This section describes the recommended requirements for the Dataproc machines that are launched by Platfora.

Google Cloud Dataproc is Hadoop as a web service. Platfora uses the Dataproc Hadoop cluster to process its lens builds. Since the Dataproc Hadoop cluster is only instantiated as needed, the source data does not reside in the Hadoop Distributed File System (HDFS) of the Dataproc Hadoop cluster. The source data is instead stored on Google Cloud Storage (GCS). Data is copied from GCS to Dataproc for data processing, then the results are written back to GCS when the job completes.

At the start of a lens build job, the raw source data is copied from GCS to the local HDFS file system on the Dataproc nodes. The Dataproc machines must have enough local machine storage to support the input source dataset and the temporary workspace for intermediate lens build job results. Also consider that the local HDFS of the Dataproc cluster replicates the data to ensure redundancy and high availability during lens build processing.

Platfora recommends the n1-highmem-16 machine type for Dataproc data nodes and n1-standard-4 for the Dataproc name node. The n1-highmem-16 machine type offers a good balance between total local disk space, CPU power, and per-node memory size.

Hadoop Version: 2.7.2
Dataproc Software Version: Dataproc 1.0
Dataproc NameNode Machine Type: n1-standard-4
Dataproc DataNode Machine Type: n1-highmem-16
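For illustration, the machine types above can be expressed as a Dataproc cluster definition with the gcloud CLI. This is a hedged sketch only: the cluster name and worker count are placeholders, and Platfora normally launches the Dataproc cluster itself at lens build time rather than requiring you to create one manually.

```shell
# Illustrative only: a Dataproc 1.0 cluster using the machine types
# recommended above. Platfora provisions this cluster automatically.
gcloud dataproc clusters create platfora-lens-build \
    --image-version 1.0 \
    --master-machine-type n1-standard-4 \
    --worker-machine-type n1-highmem-16 \
    --num-workers 2
```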
Number of Dataproc DataNodes

The number of nodes you need to complete a lens build depends on the following factors:

The size of the raw dataset in GCS that is used as input to the lens build.

The replication factor of HDFS. Dataproc clusters of 1-4 nodes have a replication factor of 1, clusters of 5-9 nodes have a replication factor of 2, and clusters of 10 or more nodes have a replication factor of 3.

Temporary work space for intermediate lens build results, which is about 20-30% of total disk space.

The number of worker nodes in a Dataproc cluster must be two or higher.

GCP Security Settings for Platfora

Google Cloud Platform has a number of security features that you can use to protect your Google Cloud Platform account and cloud server machines. This section contains security setting recommendations if you plan to use Google Cloud Dataproc as the Hadoop implementation for your Platfora cluster.

Google Cloud Service Account for Platfora

A service account is a special Google account that applications can use to access Google services programmatically. To use any of the Google services (Dataproc, Storage, or BigQuery), you must create a Google service account for Platfora in your Google Cloud Platform account. You will specify this service account for the Compute Engine machines used for the Platfora cluster. Platfora uses the service account when it accesses other Google services.

At a minimum, the service account must meet the following requirements:

Read access for every Google Cloud Storage bucket that Platfora needs to access.

Write access to the Google Cloud Storage bucket where Platfora writes lens build files.

Additionally, Google Cloud Platform creates all Dataproc clusters under the default service account. If you use Dataproc as your Hadoop environment, the default service account must have Edit permission on the Google Project. (This is required by Google Cloud Dataproc. Contact Google Support for any questions about this requirement.)

Make sure that no Google Cloud Storage bucket access control lists (ACLs) prevent the Platfora service account from accessing the Storage bucket folders it needs.

For more information on Google service accounts, see https://cloud.google.com/iam/docs/service-accounts.
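The sizing factors listed under Number of Dataproc DataNodes can be combined into a rough capacity estimate. The following is a minimal sketch; every input value (dataset size, replication factor, temp-space percentage, and usable per-node disk) is an illustrative assumption, not a Platfora recommendation:

```shell
# Rough HDFS capacity estimate for a Dataproc-based lens build.
# All inputs below are illustrative assumptions.
raw_gb=500            # raw source dataset copied in from GCS
replication=2         # HDFS replication for a 5-9 node Dataproc cluster
temp_pct=30           # temp workspace, ~20-30% of total disk space
disk_per_node_gb=1500 # assumed usable local disk per worker node

total_gb=$(( raw_gb * replication * (100 + temp_pct) / 100 ))
nodes=$(( (total_gb + disk_per_node_gb - 1) / disk_per_node_gb ))
if [ "$nodes" -lt 2 ]; then
  nodes=2   # a Dataproc cluster requires two or more worker nodes
fi
echo "estimated HDFS need: ${total_gb} GB across ${nodes} workers"
```

With these example inputs, the estimate is 1300 GB, which still fits the two-worker minimum.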
Google Cloud Subnetwork for Platfora

Google Cloud Platform allows you to define a network in which all machine instances are located. You can segment the IP addresses in a GCP network into subnets, which GCP calls subnetworks. To use any of the Google services (Dataproc, Storage, or BigQuery), you must create a Google Cloud Platform subnetwork and use that subnetwork name when configuring Platfora.

You must ensure the following are true:

All nodes of the Platfora cluster are in the same subnetwork.

The Dataproc cluster is configured to launch in the same subnetwork as the Platfora cluster (platfora.gcp.dataproc.subnet.name configuration property).

The firewall rules in the subnetwork allow each node of the Platfora cluster to communicate with the other Platfora nodes and with the nodes in the Dataproc cluster.

For more information on Google networks, see https://cloud.google.com/compute/docs/networking#before-you-begin.
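A hedged sketch of such a firewall rule using the gcloud CLI (the rule name, network name, and source range are placeholders for your environment):

```shell
# Allow Platfora and Dataproc machines in the subnetwork to reach each
# other. The network name and source range are placeholders.
gcloud compute firewall-rules create platfora-internal \
    --network my-platfora-network \
    --allow tcp,udp,icmp \
    --source-ranges 10.128.0.0/20
```

In practice you may prefer a narrower rule limited to the specific ports listed in the Port Configuration Requirements chapter.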
Chapter 6
Port Configuration Requirements

You must open ports in the firewall of your Platfora nodes to allow client access and intra-cluster communications. You must also open ports within your Hadoop cluster to allow access from Platfora. This section lists the default ports required.

Topics:
Ports to Open on Platfora Nodes
Ports to Open on Hadoop Nodes

Ports to Open on Platfora Nodes

Your Platfora master node must allow HTTP connections from your user network. All nodes must allow connections from the other Platfora nodes in a multi-node cluster. On Amazon EC2 instances, you must configure the port firewall rules on the Platfora server instances in addition to the EC2 Security Group Settings.

Platfora Service                         Default Port    Allow connections from
Master Web Services Port (HTTP)          8001            External user network; Platfora worker servers; localhost
Secure Master Web Services Port (HTTPS)  8443            External user network; Platfora worker servers; localhost
Master Server Management Port            8002            Platfora worker servers; localhost
Worker Server Management Port            8002            Platfora master server; other Platfora worker servers; localhost
Master Data Port                         8003            Platfora worker servers; localhost
Spark UI                                 4040            External user network (optional, for troubleshooting Spark jobs)
Worker Data Port                         8003            Platfora master server; other Platfora worker servers; localhost
Master PostgreSQL Database Port          5432            Platfora worker servers; localhost
Spark Ephemeral Port Range               Depends on the  All nodes in the Hadoop cluster, Dataproc cluster, or EMR cluster
                                         OS; for CentOS
                                         and Ubuntu it is
                                         32768 to 61000

Ports to Open on Hadoop Nodes

Platfora must be able to access certain services of your Hadoop cluster. This section lists the Hadoop services Platfora needs to access and the default ports for those services. Note that this only applies to on-premise Hadoop deployments or to self-managed Hadoop deployments in a virtual private cloud, not to Google Cloud Dataproc or Amazon Elastic MapReduce (EMR).

Hadoop Service                  CDH, HDP, MapR        Pivotal               Allow connections from
HDFS NameNode                   8020                  N/A                   Platfora master and worker servers
HDFS DataNodes                  50010                 N/A                   Platfora master and worker servers
MapRFS CLDB                     N/A                   7222                  Platfora master and worker servers
MapRFS DataNodes                N/A                   5660                  Platfora master and worker servers
YARN ResourceManager            8032                  8032                  Platfora master server
YARN ResourceManager Web UI     8088                  8088                  External user network (optional, for troubleshooting)
YARN Job History Server         10020                 10020                 Platfora master server
YARN Job History Server Web UI  19888                 19888                 External user network (optional, for troubleshooting)
YARN Application Master         Depends on            Depends on            Platfora master server
                                mapred-site.xml [1]   mapred-site.xml [1]
HiveServer Thrift Port          9083                  9083                  Platfora master server
Hive Metastore DB Port [2]      Depends on the        Depends on the        Platfora master server
                                database used [3]     database used [3]
Spark Server                    ephemeral port range  ephemeral port range  Platfora master server

[1] See the yarn.app.mapreduce.am.job.client.port-range property in mapred-site.xml.
[2] If connecting to Hive directly using JDBC.
[3] For example, MySQL is 3306, and Postgres is 7432.

To limit the ephemeral port range, see your Linux operating system documentation about changing the net.ipv4.ip_local_port_range OS setting.
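On Linux, the ephemeral port range mentioned above can be inspected directly, and narrowed if needed. The 32768-40999 range in the comments below is illustrative only, not a recommendation:

```shell
# Inspect the current ephemeral (local) port range; the kernel exposes
# it as two whitespace-separated numbers.
read low high < /proc/sys/net/ipv4/ip_local_port_range
echo "ephemeral ports: ${low}-${high} ($(( high - low + 1 )) ports)"

# To narrow the range (illustrative values), set it with sysctl and
# persist the change in /etc/sysctl.conf:
#   sudo sysctl -w net.ipv4.ip_local_port_range="32768 40999"
#   echo "net.ipv4.ip_local_port_range = 32768 40999" | sudo tee -a /etc/sysctl.conf
```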
Chapter 7
Browser Requirements

Users can connect to the Platfora web application using the latest HTML5-compliant web browsers. Platfora supports the following releases of the following web browsers:

Web Browser                                Supported Versions
Chrome                                     Latest version (Evergreen) and three previous releases
Firefox                                    25.0.x or higher
Safari                                     6.1+ and 7.x
Internet Explorer (with the Compatibility  IE 11 (Windows 7, Windows 8, Windows 10);
View feature disabled)                     IE 10 (Windows 7 and Windows 8)

Platfora supports these web browsers on desktop machines only. Platfora recommends using a screen resolution width of 1400 pixels or greater for viewing some pages in the Platfora web application.
Appendix A
Hardware Specifications for Platfora Nodes

This section shows some example hardware configurations that have worked well in other Platfora deployments.

To achieve the best performance and lowest operating cost, Platfora recommends that all servers in the Platfora cluster have the same configuration. At a minimum, all servers in the Platfora cluster should have identical RAM capacity and the same number of CPU cores. Platfora software can be deployed on either rack or blade servers. Typical Platfora server configurations have specifications similar to:

Rack Server Specs
  CPU: 2x E5-2440 2.40GHz, 6 cores each
  RAM: 12x 16GB (192GB total)
  Disk: 8x 300GB 10K SAS 2.5" HDDs

Blade Server Specs
  CPU: 2x E5-2470 2.30GHz, 8 cores each
  RAM: 12x 16GB (192GB total)
  Disk: 2x 900GB 10K SATA 2.5" HDDs
  Network: 1x Gbps NIC
Appendix B
EC2 Considerations for Platfora Instances

This section explains what to consider when using Amazon Elastic Compute Cloud (EC2) instances to deploy a production Platfora cluster.

EC2 Storage Considerations

When you launch an Amazon EC2 instance, you have several choices with regard to the storage that you can attach to the instance. There are two main types of storage available: Elastic Block Store (EBS) and Instance Store (ephemeral). The type and capacity of storage available depend on the instance type you choose.

The Root Device Volume - All instances have a root device volume, which is backed by either EBS or Instance storage. Platfora recommends EBS-backed instance types; they launch faster and use persistent storage. Root device volumes for Platfora nodes should always be increased to the maximum size (1 TB). This ensures adequate space for the Platfora installation and logs. When using the Platfora-recommended 8xlarge instance types, General Purpose (SSD) EBS volumes also guarantee 3,000 IOPS.

EBS Volumes - Amazon EBS volumes are highly available and reliable storage volumes that can be attached to any running instance in the same Availability Zone. Amazon EBS volumes attached to an Amazon EC2 instance are exposed as storage volumes that persist independently from the life of the instance. Also, with Amazon EBS you pay only for what you use, making it a cost-effective choice. Platfora recommends General Purpose (SSD) EBS volumes. For maximum performance, you can choose Provisioned IOPS EBS volumes instead. If you choose an instance type that is not EBS-optimized by default, make sure to choose EBS Optimized Instance at launch time. This ensures that the instance has a dedicated connection to the EBS volume, which reduces overall latency and maximizes throughput. The Platfora-recommended 8xlarge instance types are already EBS-optimized instances.
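As a sketch, launching an EBS-optimized instance with an enlarged General Purpose (SSD) root volume from the AWS CLI might look like the following. All identifiers (AMI, key name, subnet) and the instance type are placeholders; verify that your chosen instance type supports (or defaults to) EBS optimization before launch.

```shell
# Hypothetical launch of a Platfora node with a 1 TB gp2 root volume.
# Every identifier below is a placeholder for your environment.
aws ec2 run-instances \
    --image-id ami-0123456789abcdef0 \
    --instance-type c4.8xlarge \
    --ebs-optimized \
    --block-device-mappings \
      '[{"DeviceName":"/dev/xvda","Ebs":{"VolumeSize":1024,"VolumeType":"gp2"}}]' \
    --key-name my-key \
    --subnet-id subnet-0123456789abcdef0
```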
Instance Store Volumes - Ephemeral storage is ideal for temporary storage of information that changes frequently, such as caches, or for data that is replicated across multiple instances. Instances that use EBS for the root device do not, by default, have instance store volumes available at boot time. Also, you can't attach instance store volumes after you've launched an instance. Therefore, if you want your Amazon EBS-backed instance to use instance store volumes, you must specify them when you first launch your instance.
The choice to add instance store volumes to Platfora nodes depends on price, performance, and persistence of the data. Ephemeral storage allows data to be read faster from disk, but is also more expensive. Also, the data stored on these volumes is not persistent: it is lost if the instance is shut down or terminated.

If you decide to use ephemeral drives for the Platfora cache directories, use RAID 0 (stripe). This ensures Platfora has access to the maximum possible disk space and also yields the highest performance. Remember, ephemeral drives are temporary storage, so there is no need to use RAID 1; when the instance is stopped, the data is not saved.

In Platfora, the PLATFORA_DATA/dfscache and PLATFORA_DATA/fsCache directories can be mapped to instance store volumes (if you decide to use them). These are the only directories of a Platfora installation that should use ephemeral storage. Lens data is backed up in S3, so the loss of any cached data is temporary.

EC2 Network Considerations

Placement Groups - All Platfora server instances should be launched within the same Amazon EC2 Placement Group. A placement group is a logical grouping of instances within a single Availability Zone. Using placement groups gives applications access to low-latency, 10 Gbps network connectivity. Placement groups are recommended for applications that benefit from low network latency, high network throughput, or both. See the Amazon EC2 Documentation on Placement Groups.

Enhanced Networking - To enable enhanced networking, you must launch each instance in the same Amazon EC2 virtual private cloud (VPC). You can't enable enhanced networking if the instance is in EC2-Classic. For more information, see the Amazon VPC User Guide and the Amazon EC2 Documentation on Enhanced Networking.
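The RAID 0 striping described above for ephemeral cache drives can be sketched with mdadm. This is a hedged example: the device names and mount point are placeholders and vary by instance type, and it assumes two instance store volumes were attached at launch.

```shell
# Stripe two instance store volumes into one RAID 0 array and mount it
# for the Platfora cache directories. Device names are placeholders.
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdb /dev/xvdc
sudo mkfs.ext4 /dev/md0
sudo mkdir -p /data/platfora-cache
sudo mount /dev/md0 /data/platfora-cache
```

Because the array holds only rebuildable cache data, there is no need to persist the mdadm configuration for recovery; the array can simply be recreated if the instance is relaunched.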