
SAS 9.4 SPD Engine: Storing Data in the Hadoop Distributed File System

Third Edition

SAS Documentation

The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2015. SAS 9.4 SPD Engine: Storing Data in the Hadoop Distributed File System, Third Edition. Cary, NC: SAS Institute Inc.

SAS 9.4 SPD Engine: Storing Data in the Hadoop Distributed File System, Third Edition

Copyright 2015, SAS Institute Inc., Cary, NC, USA. All rights reserved. Produced in the United States of America.

For a hard-copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

For a web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time you acquire this publication. The scanning, uploading, and distribution of this book via the Internet or any other means without the permission of the publisher is illegal and punishable by law. Please purchase only authorized electronic editions and do not participate in or encourage electronic piracy of copyrighted materials. Your support of others' rights is appreciated.

U.S. Government License Rights; Restricted Rights: The Software and its documentation is commercial computer software developed at private expense and is provided with RESTRICTED RIGHTS to the United States Government. Use, duplication or disclosure of the Software by the United States Government is subject to the license terms of this Agreement pursuant to, as applicable, FAR , DFAR (a), DFAR (a) and DFAR and, to the extent required under U.S. federal law, the minimum restricted rights as set out in FAR (DEC 2007). If FAR is applicable, this provision serves as notice under clause (c) thereof and no other notice is required to be affixed to the Software or documentation. The Government's rights in Software and documentation shall be only those set forth in this Agreement.

SAS Institute Inc., SAS Campus Drive, Cary, North Carolina. July 2015.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.

Contents

What's New in the SAS 9.4 SPD Engine to Store Data in HDFS

Chapter 1: Introduction to Storing Data in HDFS
- Deciding to Store Data in HDFS
- Using the SPD Engine to Store Data in HDFS

Chapter 2: Storing Data in HDFS
- Overview: Storing Data in HDFS
- SAS and Hadoop Requirements
- Supported SAS File Features Using the SPD Engine
- Security

Chapter 3: Using the SPD Engine
- Overview: Using the SPD Engine
- How the SPD Engine Supports Data Distribution
- I/O Operation Performance
- Creating SAS Indexes
- Parallel Processing for Data in HDFS
- WHERE Processing Optimization with MapReduce
- SPD Engine File System Locking
- SPD Engine Distributed Locking
- Updating Data in HDFS
- Using SAS High-Performance Analytics Procedures

Chapter 4: SPD Engine Reference
- Overview: SPD Engine Reference
- Dictionary

Chapter 5: How to Use Hadoop Data Storage
- Overview: How to Use Hadoop Data Storage
- Example 1: Loading Existing SAS Data Using the COPY Procedure
- Example 2: Creating a Data Set Using the DATA Step
- Example 3: Adding to an Existing Data Set Using the APPEND Procedure
- Example 4: Loading Oracle Data Using the COPY Procedure
- Example 5: Analyzing Data Using the FREQ Procedure
- Example 6: Managing SAS Files Using the DATASETS Procedure
- Example 7: Setting the SPD Engine I/O Block Size
- Example 8: Optimizing WHERE Processing with MapReduce

Appendix 1: Hive SerDe for SPD Engine Data
- Accessing SPD Engine Data Using Hive
- Troubleshooting

Recommended Reading

Index

What's New in the SAS 9.4 SPD Engine to Store Data in HDFS

Overview

In the second maintenance release for SAS 9.4, the SPD Engine has improved performance. The SPD Engine creates a SAS index much faster, sets a larger I/O block size and expands the scope of the block size, expands parallel processing support for Read operations, performs data filtering in the Hadoop cluster, and enables you to control the number of MapReduce tasks when writing data in HDFS.

In the third maintenance release for SAS 9.4, the SPD Engine expands the supported Hadoop distributions, enables parallel processing for Write operations, expands WHERE processing optimization with more WHERE expression syntax, enhances file system locking by enabling you to specify a pathname for the SPD Engine lock directory, supports distributed locking, and provides a custom Hive SerDe so that SPD Engine data stored in HDFS can be accessed using Hive.

Hadoop Distribution Support

In the third maintenance release for SAS 9.4, the SPD Engine has expanded the supported Hadoop distributions. For the list of supported Hadoop distributions, see Hadoop Distribution Support on page 6.

Improved Performance When Creating a SAS Index

In the second maintenance release for SAS 9.4, when you create a SAS index for a data set in HDFS, the performance of creating a large index is significantly improved because the index is partitioned. For more information, see Creating SAS Indexes on page 11.

Improved Performance By Setting the SPD Engine I/O Block Size

In the second maintenance release for SAS 9.4, the scope of the SPD Engine I/O block size is expanded. The default block size is larger at 1,048,576 bytes (1 megabyte). The block size affects compressed, uncompressed, and encrypted data sets. The block size influences the size of I/O operations when reading all data sets and writing compressed data sets. For more information, see I/O Operation Performance on page 11. To specify an I/O block size, use the IOBLOCKSIZE= data set option on page 40 or the new IOBLOCKSIZE= LIBNAME statement option on page 33.
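For example, here is a minimal sketch (the libref and pathname are hypothetical) that sets a 2-megabyte I/O block size on the LIBNAME statement:

   libname myspde spde '/user/abcdef' hdfshost=default
      ioblocksize=2097152;   /* 2 megabytes; must not exceed the Hadoop cluster block size */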

Improved Performance of Reading Data in HDFS

In the second maintenance release for SAS 9.4, to improve the performance of reading data stored in HDFS, the SPD Engine has expanded its support of parallel processing. You can request parallel processing for all Read operations of data stored in HDFS. For more information, see Parallel Processing for Data in HDFS on page 12. To request parallel processing for all Read operations of data stored in HDFS, use the SPDEPARALLELREAD= system option on page 45, the PARALLELREAD= LIBNAME statement option on page 36, or the PARALLELREAD= data set option on page 42.

Improved Performance of Writing Data to HDFS

In the third maintenance release for SAS 9.4, you can now request parallel processing for all Write operations in HDFS. For more information, see Parallel Processing for Data in HDFS on page 12. To request parallel processing for Write operations, use the PARALLELWRITE= LIBNAME statement option on page 36 or the PARALLELWRITE= data set option on page 43.

Optimized WHERE Processing

To optimize the performance of WHERE processing, you can request that data subsetting be performed in the Hadoop cluster, which takes advantage of the filtering and ordering capabilities of the MapReduce framework. For more information, see WHERE Processing Optimization with MapReduce on page 15. To request that data subsetting be performed in the Hadoop cluster, use the ACCELWHERE= LIBNAME statement option on page 31 or the ACCELWHERE= data set option on page 39.

In the third maintenance release for SAS 9.4, optimized WHERE processing is expanded to include more operators and compound expressions. For more information, see WHERE Expression Syntax Support on page 16.

Controlling Tasks When Writing Data in HDFS

In the second maintenance release for SAS 9.4, to specify the number of MapReduce tasks when writing data in HDFS, you can use the NUMTASKS= LIBNAME statement option. This option controls parallel processing on the Hadoop cluster when writing output from a SAS High-Performance Analytics procedure. For more information, see the NUMTASKS= LIBNAME statement option on page 35.

SPD Engine File System Locking

In the second maintenance release for SAS 9.4, the SPD Engine implements a locking strategy that honors the HDFS concurrent access model and provides additional levels of concurrent access to ensure the integrity of the data stored in HDFS. For more information, see SPD Engine File System Locking on page 18.

In the third maintenance release for SAS 9.4, to store the lock files, the SPD Engine creates a lock directory in the /tmp directory. You can specify a pathname for the SPD Engine lock directory by defining the new SAS environment variable SPDELOCKPATH. For more information, see SPDELOCKPATH SAS Environment Variable on page 51.

SPD Engine Distributed Locking

In the third maintenance release for SAS 9.4, the SPD Engine supports distributed locking for data stored in HDFS.

Distributed locking provides synchronization and group coordination services to clients over a network connection. For more information, see SPD Engine Distributed Locking on page 20.

To request SPD Engine distributed locking, you must first create an XML configuration file, and then define the SAS environment variable SPDE_CONFIG_FILE to specify the location of the user-defined XML file that is available to the SAS client machine. For more information, see SPDE_CONFIG_FILE SAS Environment Variable on page 46.

Configuring the SPD Engine to Store Data in HDFS

To store data in HDFS using the SPD Engine, required Hadoop JAR files and Hadoop cluster configuration files must be available to the SAS client machine. For information about configuring the SPD Engine, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.

Accessing SPD Engine Data Using Hive

In the third maintenance release for SAS 9.4, SAS provides a custom Hive SerDe for SPD Engine data that is stored in HDFS. The SerDe makes the data available for applications outside of SAS to query using HiveQL. For more information, see Appendix 1, Hive SerDe for SPD Engine Data, on page 73.


Chapter 1: Introduction to Storing Data in HDFS

- Deciding to Store Data in HDFS
- Using the SPD Engine to Store Data in HDFS
  - What Is the SPD Engine?
  - Understanding the SPD Engine File Format
  - How to Use the SPD Engine

Deciding to Store Data in HDFS

Storing data in the Hadoop Distributed File System (HDFS) is a good strategy for very large data sets. HDFS is a component of Apache Hadoop, which is an open-source software framework of tools that are written in Java. HDFS provides distributed data storage and processing of large amounts of data.

Reasons for storing SAS data in HDFS include the following:

- HDFS is a low-cost alternative for data storage. Organizations are exploring it as an alternative to commercial relational database solutions.
- HDFS is well suited for distributed storage and processing using commodity hardware. It is fault tolerant, scalable, and simple to expand. HDFS manages files as blocks of equal size, which are replicated across the machines in a Hadoop cluster to provide fault tolerance.
- SAS provides support within the current SAS product offering and product roadmap. SAS provides the ability to manage, process, and analyze data in HDFS.

Hadoop storage is for big data. If standard SAS optimization techniques such as indexes no longer meet your performance needs, then storing the data in HDFS could improve performance.

Using the SPD Engine to Store Data in HDFS

What Is the SPD Engine?

The SAS Scalable Performance Data (SPD) Engine is a scalable engine delivered to SAS customers as part of Base SAS. The SPD Engine is designed for high-performance data delivery, reading data sets that contain billions of observations. The engine uses threads to read data very rapidly and in parallel. The SPD Engine reads, writes, and updates data in HDFS. You can use the SPD Engine with standard SAS applications to retrieve data for analysis, perform administrative functions, and update the data.

Understanding the SPD Engine File Format

The SPD Engine organizes data into a streamlined file format that has advantages for a distributed file system like HDFS. The advantages of the SPD Engine file format include the following:

- Data is separate from the metadata. The file format consists of separate files: one for data, one for metadata, and two for indexes. Each type of file has an identifying file extension. The extensions are .dpf for data, .mdf for metadata, and .hbx and .idx for indexes.
- The SPD Engine file format partitions the data by spreading it across multiple files based on a partition size. Each partition is stored as a separate physical file with the extension .dpf. Depending on the amount of data and the partition size, the data can consist of one or more physical files, but is referenced as one logical file.

The default partition size is 128 megabytes. You can specify a different partition size with the PARTSIZE= LIBNAME statement option on page 38 or the PARTSIZE= data set option on page 41.

How to Use the SPD Engine

The SPD Engine works like other SAS data access engines. That is, you execute a LIBNAME statement to assign a libref, specify the engine, and connect to the Hadoop cluster. You then use that libref throughout the SAS session where a libref is valid. The libref is associated with a specific directory in the Hadoop cluster. Arguments in the LIBNAME statement specify a libref, the engine name, the pathname to a directory in the Hadoop cluster, and the HDFSHOST=DEFAULT argument to indicate that you want to connect to a Hadoop cluster. Here is an example of a LIBNAME statement to connect to a Hadoop cluster:

   libname myspde spde '/user/abcdef' hdfshost=default;

To interface with Hadoop and connect to a specific Hadoop cluster, required Hadoop JAR files and Hadoop cluster configuration files must be available to the SAS client machine. To make the required files available, you must define two SAS environment variables to set the location of the required files. For more information about the SAS environment variables, see SAS and Hadoop Requirements on page 6.

Any data source that can be accessed with a SAS engine can be loaded into a Hadoop cluster using the SPD Engine. For example:

- You can use the default Base SAS engine to access an existing SAS data set and the SPD Engine to connect to the Hadoop cluster. You can then use SAS code to load the data to the Hadoop cluster (see the sketch after this list). See Example 1: Loading Existing SAS Data Using the COPY Procedure on page 57.
- You can use a SAS/ACCESS engine such as the SAS/ACCESS to Oracle engine to access an Oracle table and the SPD Engine to connect to the Hadoop cluster. You can then use SAS code to load the data to the Hadoop cluster. See Example 4: Loading Oracle Data Using the COPY Procedure on page 61.
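Here is a minimal sketch of the first pattern, assuming a hypothetical local library path and data set name (Example 1 on page 57 shows the documented version):

   libname local 'c:\mysasdata';                 /* hypothetical Base SAS library */
   libname myspde spde '/user/abcdef' hdfshost=default;

   proc copy in=local out=myspde;
      select bigfile;                            /* hypothetical data set name */
   run;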

Note: Most existing SAS programs can run with the SPD Engine with little modification other than to the LIBNAME statement. However, some limitations apply. For example, if your default Base SAS engine data has integrity constraints, then the integrity constraints are dropped when the data is converted for the SPD Engine. For more information about supported SAS file features, see Supported SAS File Features Using the SPD Engine on page 7.

Chapter 2: Storing Data in HDFS

- Overview: Storing Data in HDFS
- SAS and Hadoop Requirements
  - SAS Version
  - Hadoop Distribution Support
  - Configuring Hadoop JAR Files
  - Making Required Hadoop Cluster Configuration Files Available to Your Machine
- Supported SAS File Features Using the SPD Engine
- Security

Overview: Storing Data in HDFS

To store data in HDFS using the SPD Engine, you must do the following:

- Ensure that all version and configuration requirements are met. See SAS and Hadoop Requirements on page 6.
- Understand which SAS file features are supported and not supported when using the SPD Engine. See Supported SAS File Features Using the SPD Engine on page 7.
- Use the LIBNAME statement for the SPD Engine to establish the connection to the Hadoop cluster. See LIBNAME Statement for HDFS on page 28.

SAS and Hadoop Requirements

SAS Version

To store data in HDFS using the SPD Engine, you must have the first maintenance release for SAS 9.4 or later.

Note: Access to data in HDFS using the SPD Engine is not supported from a SAS session in the z/OS operating environment.

Hadoop Distribution Support

In the third maintenance release for SAS 9.4, the SPD Engine supports the following Hadoop distributions, with or without Kerberos:

- Cloudera CDH 4.x
- Cloudera CDH 5.x
- Hortonworks HDP 2.x
- IBM InfoSphere BigInsights 3.x
- MapR 4.x (for Microsoft Windows and Linux operating environments only)
- Pivotal HD 2.x

Configuring Hadoop JAR Files

To store data in HDFS using the SPD Engine, you must use a supported Hadoop distribution and configure a required set of Hadoop JAR files. The JAR files must be available to the SAS client machine. The SAS environment variable SAS_HADOOP_JAR_PATH must be defined and set to the location of the Hadoop JAR files. For complete instructions, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.
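For example, one way to define this variable (and the related SAS_HADOOP_CONFIG_PATH variable described in the next topic) is with OPTIONS SET= statements; the pathnames here are hypothetical, and the configuration guide describes where to set the variables for your deployment:

   options set=SAS_HADOOP_JAR_PATH="/opt/sas/hadoopjars";        /* hypothetical JAR file location */
   options set=SAS_HADOOP_CONFIG_PATH="/opt/sas/hadoopconfig";   /* hypothetical configuration file location */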

Making Required Hadoop Cluster Configuration Files Available to Your Machine

Hadoop cluster configuration files contain information such as the name of the computer that hosts the Hadoop cluster and the TCP port. To connect to the Hadoop cluster, Hadoop configuration files must be copied from the specific Hadoop cluster to a physical location that the SAS client machine can access. The SAS environment variable SAS_HADOOP_CONFIG_PATH must be defined and set to the location of the Hadoop configuration files. For complete instructions, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.

Supported SAS File Features Using the SPD Engine

The following SAS file features are supported for data sets using the SPD Engine:

- Encryption
- File compression
- Member-level locking
- SAS indexes
- SAS passwords
- Special missing values
- Physical ordering of returned observations
- User-defined formats and informats

Note: When you create a data set, you cannot request both encryption and file compression.

The following SAS file features are not supported for data sets using the SPD Engine:

- Audit trails

- Cross-Environment Data Access (CEDA)
- Extended attributes
- Generation data sets
- Integrity constraints
- NLS support (such as to specify encoding for the data)
- Record-level locking
- SAS catalogs, SAS views, and MDDB files

The following SAS software does not support SPD Engine data sets:

- SAS/CONNECT
- SAS/SHARE

Security

HDFS supports defined levels of permissions at both the directory and file levels. The SPD Engine honors those permissions. For example, if the file is available as Read only, you cannot modify it.

If the Hadoop cluster supports Kerberos, the SPD Engine honors Kerberos authentication and authorization as long as the Hadoop cluster configuration files are accessed. For more information about accessing the Hadoop cluster configuration files, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.

Restricting access to members of SAS libraries by assigning SAS passwords to the members is supported when a data set is stored in HDFS. You can specify three levels of permission: Read, Write, and Alter. For more information about SAS passwords, see SAS Language Reference: Concepts.
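For example, here is a minimal sketch that assigns all three password levels when creating a data set (the library, data set, and password values are hypothetical):

   libname myspde spde '/user/abcdef' hdfshost=default;

   data myspde.bigfile (read=green write=blue alter=red);   /* hypothetical passwords */
      set work.bigfile;                                     /* hypothetical input data set */
   run;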

Chapter 3: Using the SPD Engine

- Overview: Using the SPD Engine
- How the SPD Engine Supports Data Distribution
- I/O Operation Performance
- Creating SAS Indexes
- Parallel Processing for Data in HDFS
  - Overview: Parallel Processing for Data in HDFS
  - Parallel Processing Considerations
  - Tuning Parallel Processing Performance
- WHERE Processing Optimization with MapReduce
  - Overview: WHERE Processing Optimization with MapReduce
  - WHERE Expression Syntax Support
  - Data Set and SAS Code Requirements
  - Hadoop Requirements
- SPD Engine File System Locking
  - Overview: SPD Engine File System Locking
  - Requesting Read Access Lock Files
  - Specifying a Pathname for the SPD Engine Lock Directory
- SPD Engine Distributed Locking
  - Overview: SPD Engine Distributed Locking
  - Understanding the Service Provider
  - Requirements for SPD Engine Distributed Locking
  - Requesting Distributed Locking
- Updating Data in HDFS
- Using SAS High-Performance Analytics Procedures

Overview: Using the SPD Engine

The SPD Engine reads, writes, and updates data in HDFS. Specific SPD Engine features are supported for Hadoop storage and are explained in this document. For more information about the SPD Engine and its features that are not specific to Hadoop storage, see SAS Scalable Performance Data Engine: Reference.

How the SPD Engine Supports Data Distribution

When loading data into a Hadoop cluster, the SPD Engine ensures that the data is distributed appropriately. The SPD Engine uses the SPD Engine partition size and the HDFS block size to compute the maximum number of observations that can fit within both parameters. That is, observations never span multiple partitions or multiple blocks. After a data set is loaded into a Hadoop cluster, the actual block size of the loaded data might be less than the block size that was defined by the Hadoop administrator. The size difference can result from the SPD Engine calculations regarding the partition size, block size, and observation length.

Note: Defragmenting the Hadoop cluster is not recommended. Changing the block size and re-creating the files could result in the data becoming inaccessible by SAS.

I/O Operation Performance

To improve I/O operation performance, consider setting a different SPD Engine I/O block size. The larger the block size, the less I/O. For example, when reading a data set, the block size can significantly affect performance. When retrieving a large percentage of the data, a larger block size improves performance. However, when retrieving a subset of the data, such as with WHERE processing, a smaller block size performs better. You can specify a different block size with the IOBLOCKSIZE= LIBNAME statement option and the IOBLOCKSIZE= data set option. For more information, see the IOBLOCKSIZE= LIBNAME statement option on page 33 and the IOBLOCKSIZE= data set option on page 40.

Creating SAS Indexes

When you create a SAS index for a data set that is stored in HDFS, a large index could require a long time to create. To provide efficient index creation, the SPD Engine partitions the two index files (.hbx and .idx). The index files are spread across multiple files based on the index partition size, which is 2 megabytes. Even though the index files are partitioned, the PARTSIZE= option, which specifies a size for the SPD Engine data partition file, does not affect the index partition size. You cannot increase or decrease the index partition size.

To improve the performance of creating an index, consider these options:

- Request that indexes be created in parallel, asynchronously. To enable asynchronous parallel index creation, use the ASYNCINDEX= data set option.
- Request more temporary utility file space for sorting the data. To allocate an adequate amount of space for processing, use the SPDEUTILLOC= system option. Specify the utility file location on the SAS client machine, not on the Hadoop cluster.
- Request larger memory space for the sorting utility to use when sorting values for creating an index. To specify the amount of memory, use the SPDEINDEXSORTSIZE= system option.

For more information about these options, see SAS Scalable Performance Data Engine: Reference.
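For example, here is a minimal sketch (the data set and index variable names are hypothetical) that requests asynchronous parallel index creation when the data set is written:

   libname myspde spde '/user/abcdef' hdfshost=default;

   data myspde.bigfile (asyncindex=yes index=(empnum custid));   /* hypothetical indexes */
      set work.bigfile;                                          /* hypothetical input data set */
   run;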

Parallel Processing for Data in HDFS

Overview: Parallel Processing for Data in HDFS

Parallel processing uses multiple threads that run in parallel so that a large operation is divided into multiple smaller ones that are executed simultaneously. The SPD Engine supports parallel processing to improve the performance of reading and writing data stored in HDFS. By default, the SPD Engine performs parallel processing only if a Read operation includes WHERE processing. If the Read operation does not include WHERE processing, the Read operation is performed by a single thread.

To request parallel processing for all Read operations (in all SAS releases) and for Write operations (in the third maintenance release for SAS 9.4 only), use these options:

- SPDEPARALLELREAD= system option on page 45 to request parallel read processing for the SAS session.
- PARALLELREAD= LIBNAME statement option on page 36 to request parallel read processing when using the assigned libref.
- PARALLELREAD= data set option on page 42 to request parallel read processing for the specific data set.
- In the third maintenance release for SAS 9.4, PARALLELWRITE= LIBNAME statement option on page 36 to request parallel write processing when using the assigned libref.
- In the third maintenance release for SAS 9.4, PARALLELWRITE= data set option on page 43 to request parallel write processing for the specific data set.

Here is an example of the SPDEPARALLELREAD= system option to request parallel processing for all Read operations for the SAS session:

   options spdeparallelread=yes;

In this example, the LIBNAME statement requests parallel processing for all Read operations using the assigned libref. By specifying the PARALLELREAD= LIBNAME statement option, parallel processing is performed for all Read operations using the Class libref:

   libname class spde '/user/abcdef' hdfshost=default parallelread=yes;

   proc freq data=class.studentid;
      tables age;
   run;

In this example, the PARALLELREAD= data set option requests parallel processing for all Read operations for the Class.StudentID data set:

   libname class spde '/user/abcdef' hdfshost=default;

   proc freq data=class.studentid (parallelread=yes);
      tables age;
   run;

Here is an example of the PARALLELWRITE= LIBNAME statement option, which requests parallel processing for all Write operations using the Class libref:

   libname class spde '/user/abcdef' hdfshost=default parallelwrite=yes;

TIP: To display information in the SAS log about parallel processing, set the MSGLEVEL= system option to I. When you set options msglevel=i;, the SAS log reports whether parallel processing is in effect.

Parallel Processing Considerations

The following are considerations for requesting parallel processing:

- For some environments, parallel processing might not improve performance. The available network bandwidth and the number of CPUs on the SAS client machine determine the performance improvement. It is recommended that you set up a test in your environment to measure performance with and without parallel processing.
- When parallel read processing occurs, the order in which the observations are returned might not be the physical order of the observations in the data set. Some applications require that observations be returned in physical order. For example, the COMPARE procedure expects that observations are read from the data set in the same order that they were written to the data set. Also, legacy code that uses the DATA step or the OBS= data set option might rely on physical order to produce the expected results.

Tuning Parallel Processing Performance

To tune the performance of parallel processing, consider these SPD Engine options:

- The SPDEMAXTHREADS= system option specifies the maximum number of threads that the SPD Engine uses for processing.
- The THREADNUM= data set option specifies the maximum number of threads to use for the processing.

For more information about these options, see SAS Scalable Performance Data Engine: Reference.

Note: The Base SAS NOTHREADS and CPUCOUNT= system options have no effect on SPD Engine parallel processing.
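For example, here is a minimal sketch (the libref, data set, and thread count are hypothetical) that caps a single step at eight threads with the THREADNUM= data set option:

   libname class spde '/user/abcdef' hdfshost=default parallelread=yes;

   proc means data=class.studentid (threadnum=8);   /* hypothetical cap of 8 threads */
      var age;
   run;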

WHERE Processing Optimization with MapReduce

Overview: WHERE Processing Optimization with MapReduce

WHERE processing enables you to conditionally select a subset of observations so that SAS processes only the observations that meet specified conditions. To optimize the performance of WHERE processing, you can request that data subsetting be performed in the Hadoop cluster. Then, when you submit SAS code that includes a WHERE expression (which defines the condition that selected observations must satisfy), the SPD Engine instantiates the WHERE expression as a Java class. The SPD Engine submits the Java class to the Hadoop cluster as a component in a MapReduce program. By requesting that data subsetting be performed in the Hadoop cluster, performance might be improved by taking advantage of the filtering and ordering capabilities of the MapReduce framework. As a result, only the subset of the data is returned to the SAS client. Performance is often improved with large data sets when the WHERE expression qualifies only a relatively small subset.

By default, data subsetting is performed by the SPD Engine on the SAS client. To request that data subsetting be performed in the Hadoop cluster, you must specify the ACCELWHERE= LIBNAME statement option on page 31 or the ACCELWHERE= data set option on page 39.

Here is an example of a LIBNAME statement that connects to a Hadoop cluster and requests that data subsetting be performed in the Hadoop cluster. By specifying the ACCELWHERE= LIBNAME statement option, subsequent WHERE processing for all data sets accessed with the Class libref is performed in the Hadoop cluster:

   libname class spde '/user/abcdef' hdfshost=default accelwhere=yes;

   proc freq data=class.studentid;
      tables age;
      where age gt 14;
   run;

In this example, the ACCELWHERE= data set option requests that data subsetting be performed in the Hadoop cluster. The WHERE processing for the Class.StudentID data set is performed in the Hadoop cluster. WHERE processing for any other data set with the Class libref is performed by the SPD Engine on the SAS client machine:

   libname class spde '/user/abcdef' hdfshost=default;

   proc freq data=class.studentid (accelwhere=yes);
      tables age;
      where age gt 14;
   run;

WHERE Expression Syntax Support

In the third maintenance release for SAS 9.4, WHERE processing optimization is expanded. WHERE processing optimization supports the following syntax:

- comparison operators such as EQ (=), NE (^=), GT (>), LT (<), GE (>=), LE (<=)
- IN operator
- fully bounded range condition, such as where 500 <= empnum <= 1000;
- BETWEEN-AND operator, such as where empnum between 500 and 1000;
- compound expressions using the logical operators AND, OR, and NOT, such as where skill = 'java' or years = 4;
- parentheses to control the order of evaluation, such as where (product='graph' or product='stat') and country='canada';

Data Set and SAS Code Requirements

To perform the data subsetting in the Hadoop cluster, the following data set and SAS code requirements must be met. If any of these requirements are not met, the subsetting of the data is performed by the SPD Engine, not by a MapReduce program in the Hadoop cluster.

- The data set cannot be encrypted.
- The data set cannot be compressed.

- The data set must be larger than the HDFS block size.
- The submitted SAS code cannot request BY-group processing.
- The submitted SAS code cannot include the STARTOBS= or ENDOBS= options.
- The LIBNAME statement cannot include the HDFSUSER= option.
- The submitted WHERE expression cannot include any of the following syntax:
  - a variable as an operand, such as where lastname;
  - variable-to-variable comparison
  - SAS functions, such as SUBSTR, TODAY, UPCASE, and PUT
  - arithmetic operators *, /, +, -, and **
  - IS NULL or IS MISSING and IS NOT NULL or IS NOT MISSING operators
  - concatenation operator, such as || or !!
  - negative prefix operator, such as where z = -(x + y);
  - pattern-matching operators LIKE and CONTAINS
  - sounds-like operator SOUNDEX (=*)
  - truncated comparison operator using the colon (:) modifier, such as where lastname=: 'S';

TIP: To display information in the SAS log regarding WHERE processing optimization, set the MSGLEVEL= system option to I. When you issue options msglevel=i;, the SAS log reports whether the data filtering occurred in the Hadoop cluster. If the optimization occurred, the Hadoop Job ID is displayed in the SAS log. If the optimization did not occur, additional messages explain why.

Hadoop Requirements

To perform the data subsetting in the Hadoop cluster, the following Hadoop requirements must be met.

- The Hadoop configuration file must include the properties to run MapReduce (MR1) or MapReduce 2 (MR2) and YARN.
- The JRE version for the Hadoop cluster must be either 1.6, which is the default, or 1.7. If the JRE version is 1.7, use the ACCELJAVAVERSION= LIBNAME statement option on page 31 to specify the version.

SPD Engine File System Locking

Overview: SPD Engine File System Locking

The HDFS concurrent access model allows multiple readers and a single writer. If an application accesses a file to write to it, no other application can write to the file, but multiple applications can read the file. The SPD Engine supports a file system locking strategy that honors the HDFS concurrent access model and provides additional levels of concurrent access to ensure the integrity of the data stored in HDFS.

By default, the SPD Engine creates a Write access lock file when a data set stored in HDFS is opened for Write access. With the Write access lock file, no other SAS session can write to the file, but multiple SAS sessions can read the file if the readers accessed the data set before the Write access lock file was created. During concurrent access, the following describes the results of the default SPD Engine locking mechanism:

- Once a SAS session opens a data set for Write access, any previous readers can continue to access the data set. However, the readers could experience unexpected data results. For example, the writer could delete the data set while the readers are accessing it.
- Once a SAS session opens a data set for Write access, any subsequent reader is not allowed to access the data set.

With the Write access locking mechanism, a lock error message occurs in these situations:

- When a SAS session requests Write access to a data set that another SAS session has open for Write access.
- When a SAS session requests Read access to a data set that another SAS session has open for Write access.
- When a SAS session requests to delete a data set that another SAS session has open for Write access.

In the third maintenance release for SAS 9.4, to store the lock files, the SPD Engine creates a lock directory in the /tmp directory. The lock directory name includes the name of the data set, an eight-character hexadecimal value (which is the checksum of the Hadoop cluster directory that contains the data set), and the suffix _spdslock9, such as BigFile_<hexadecimal-value>_spdslock9. In most situations, you will not see the lock directory because lock files are deleted when the process completes.

TIP: In some situations, such as an abnormal termination of a SAS session, lock files might not be properly deleted. The leftover lock files could prohibit access to a data set. If this occurs, the leftover lock files must be manually deleted by submitting HDFS commands.

Requesting Read Access Lock Files

In some situations, you might want to control the level of concurrent access to guarantee the integrity of the data by requesting that a Read access lock file be created. To request a Read access lock file, define the SAS environment variable SPDEREADLOCK and set it to YES. Then, when a SAS session opens a data set for Read access, a Read access lock file is created in addition to any Write access lock files. For more information, see SPDEREADLOCK SAS Environment Variable on page 52.

With the Read and Write access locking mechanism, a lock error message occurs in these situations:

- When a SAS session requests Write access to a data set that another SAS session has open for either Read or Write access.

- When a SAS session requests Read access to a data set that another SAS session has open for Write access.
- When a SAS session requests to delete a data set that another SAS session has open for either Read or Write access.

Note: When you request a Read access lock file, all data access, even for Read access, requires Write permission to the Hadoop cluster.

TIP: By creating both Read and Write access lock files, the possibility of leftover lock files is increased. If you experience situations such as an abnormal termination of a SAS session, lock files that were not properly deleted must be manually deleted by submitting HDFS commands.

Specifying a Pathname for the SPD Engine Lock Directory

By default, for HDFS concurrent access, the SPD Engine creates a lock directory in the /tmp directory. The lock directory name includes the name of the data set, an eight-character hexadecimal value (which is the checksum of the Hadoop cluster directory that contains the data set), and the suffix _spdslock9, such as BigFile_<hexadecimal-value>_spdslock9. In the third maintenance release for SAS 9.4, you can specify a pathname for the SPD Engine lock directory by defining the SAS environment variable SPDELOCKPATH to specify a directory in the Hadoop cluster. For more information, see SPDELOCKPATH SAS Environment Variable on page 51.
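For example, a minimal sketch, assuming that OPTIONS SET= is used to define the environment variable and that the Hadoop cluster directory shown is hypothetical:

   options set=SPDELOCKPATH="/user/abcdef/spdelocks";   /* hypothetical lock directory in the Hadoop cluster */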

SPD Engine Distributed Locking

Overview: SPD Engine Distributed Locking

In the third maintenance release for SAS 9.4, the SPD Engine supports distributed locking for data stored in HDFS. Distributed locking provides synchronization and group coordination services to clients over a network connection. For the service provider, the SPD Engine uses the Apache ZooKeeper coordination service, specifically the implementation of the recipe for Shared Lock that is provided by Apache Curator.

Distributed locking provides the following benefits:

- The lock server maintains the lock state information in memory and does not require Write permission to any client or data library disk storage locations.
- A process requesting a lock on a data set that is not available (because the data set is already locked) can choose to wait for the data set to become available, rather than having the lock request fail immediately.
- If a process abnormally terminates while holding locks on data sets, the lock server automatically drops all locks that the client was holding, which eliminates the possibility of leftover lock files.

Understanding the Service Provider

Apache ZooKeeper is an open-source distributed server that enables reliable distributed coordination for distributed client applications over a network. ZooKeeper safely coordinates access to shared resources with other applications or processes. At its core, ZooKeeper is a fault-tolerant multi-machine server that maintains a virtual hierarchy of data nodes that store coordination data. For more information about ZooKeeper and the ZooKeeper data nodes, see Apache ZooKeeper.

Apache Curator is a high-level API that simplifies using ZooKeeper. Curator adds many features that build on ZooKeeper and handles the complexity of managing connections to the ZooKeeper cluster. For more information about Curator, see Curator. The SPD Engine accesses the Curator API to provide the locking services.

Requirements for SPD Engine Distributed Locking

SPD Engine distributed locking has the following requirements:

- ZooKeeper or later must be downloaded, installed, and running on the Hadoop cluster. The zookeeper JAR file is required.

- Curator or later must be downloaded on the Hadoop cluster. The following Curator JAR files are required:
  - curator-client
  - curator-framework
  - curator-recipes
- The following Hadoop distribution JAR files are required on the client side:
  - guava
  - log4j
  - slf4j
- The JAR files must be available to the SAS client machine. The SAS environment variable SAS_HADOOP_JAR_PATH must be defined and set to the location of the Hadoop JAR files. For complete instructions, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.

TIP: To be effective, all access to SPD data sets must use the same locking method. If some processes or instances use distributed locking and others do not, proper coordination of access to the data sets cannot be guaranteed, and at a minimum, lock failures will be encountered.

Requesting Distributed Locking

To request distributed locking, you must first create an XML configuration file that contains information so that the SPD Engine can communicate with ZooKeeper. The format of the XML is similar to Hadoop configuration files in that the XML contains properties and attributes as name-value pairs. For an example of an XML configuration file, see XML Configuration File on page 46. In addition, you must define the SAS environment variable SPDE_CONFIG_FILE to specify the location of the user-defined XML configuration file. The location must be available to the SAS client machine. For more information, see SPDE_CONFIG_FILE SAS Environment Variable on page 46.
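For example, a minimal sketch, assuming a hypothetical file location (the contents of the XML file follow the documented example on page 46):

   options set=SPDE_CONFIG_FILE="/u/abcdef/spde_zookeeper.xml";   /* hypothetical location of the XML configuration file */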

Updating Data in HDFS

HDFS does not support updating data. However, because traditional SAS processing involves updating data, the SPD Engine supports SAS Update operations for data stored in HDFS. To update data in HDFS, the SPD Engine uses an approach that replaces the data set's data partition file for each observation that is updated. When an update is requested, the SPD Engine re-creates the data partition file in its entirety (including all replications), and then inserts the updated data into the new data partition file. Because the data partition file is replaced for each observation that is updated, the greater the number of observations to be updated, the longer the process. For a general-purpose data storage engine like the SPD Engine, the ability to perform small, infrequent updates can be beneficial. However, updating data in HDFS is intended for situations in which the benefit of the update outweighs the time it takes to complete.

The following are best practices for Update operations using the SPD Engine:

- It is recommended that you set up a test in your environment to measure Update operation performance. For example, update a small number of observations to gauge how long updates take in your environment. Then, project the test results to a larger number of observations to determine whether updating is realistic.
- It is recommended that you do not use the SQL procedure to update data in HDFS because of how PROC SQL opens, updates, and closes a file. Other SAS methods provide better performance, such as the DATA step UPDATE statement and MODIFY statement (see the sketch after this list).
- The performance of appending a data set can be slower if the data set has a unique index. Test case results show that appending a data set to another data set without a unique index was significantly faster than appending the same data set to another data set with a unique index.
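Here is a minimal sketch of an in-place update with the MODIFY statement; the library, data set, variable names, and condition are hypothetical:

   libname myspde spde '/user/abcdef' hdfshost=default;

   data myspde.bigfile;
      modify myspde.bigfile;
      if empnum = 500 then do;      /* hypothetical condition */
         salary = salary * 1.1;     /* hypothetical update */
         replace;                   /* rewrite the current observation */
      end;
   run;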

Using SAS High-Performance Analytics Procedures

You can use the SPD Engine with SAS High-Performance Analytics procedures to read and write the SPD Engine file format in HDFS. In many cases, the SPD Engine data used by the procedures can be read and written in parallel using the SAS Embedded Process.

The following are requirements for a SAS Embedded Process parallel read:

- Access to the machines in the cluster where a SAS High-Performance Analytics deployment of Hadoop is installed and running.
- The data set cannot be encrypted or compressed.
- The STARTOBS= and ENDOBS= data set options cannot be specified.

The following are requirements for a SAS Embedded Process parallel write:

- The ALIGN=, COMPRESS=, ENCRYPT=, and PADCOMPRESS= data set options cannot be specified.
- The SAS client machine must have a data representation that is compatible with the data representation of the Hadoop cluster. The SAS client machine must be either Linux x64 or Solaris x64.

The following are best practices when using the SPD Engine with SAS High-Performance Analytics procedures:

- With SAS Enterprise Miner, a SAS process can be terminated in such a way that the SPD Engine does not follow normal shutdown procedures, which can result in a lock file not being deleted. The orphan lock file could prevent a subsequent open of the data set. If this occurs, the orphan lock file must be manually deleted by submitting Hadoop commands. To delete the orphan lock file, you can use the HADOOP procedure to submit Hadoop commands.
- For SAS High-Performance Analytics Work files, the SPD Engine uses the standard UNIX temporary directory /tmp.

To override the default Work directory, you can define the SAS environment variable SPDE_HADOOP_WORK_PATH to specify a directory in the Hadoop cluster. The directory must exist, and you must have Write access. For example, the following OPTIONS statement sets the Work directory:

   options set=spde_hadoop_work_path="/sasdata/cluster1/hpawork";

For more information, see SPDE_HADOOP_WORK_PATH SAS Environment Variable on page 50.


Chapter 4: SPD Engine Reference

- Overview: SPD Engine Reference
- Dictionary
  - LIBNAME Statement for HDFS
  - ACCELWHERE= Data Set Option for HDFS
  - IOBLOCKSIZE= Data Set Option for HDFS
  - PARTSIZE= Data Set Option for HDFS
  - PARALLELREAD= Data Set Option for HDFS
  - PARALLELWRITE= Data Set Option for HDFS
  - SPDEPARALLELREAD= System Option for HDFS
  - SPDE_CONFIG_FILE SAS Environment Variable
  - SPDE_HADOOP_WORK_PATH SAS Environment Variable
  - SPDELOCKPATH SAS Environment Variable
  - SPDEREADLOCK SAS Environment Variable

Overview: SPD Engine Reference

The SPD Engine reads, writes, and updates data in HDFS. A specific SPD Engine LIBNAME statement and options are provided for Hadoop storage and are explained in this document. For more information about the SPD Engine LIBNAME statement and options that are not specific to Hadoop storage, see SAS Scalable Performance Data Engine: Reference.

Dictionary

LIBNAME Statement for HDFS

Associates a libref with a Hadoop cluster to read, write, and update a data set in HDFS.

Restrictions:
- The SPD Engine LIBNAME statement arguments that are specific to HDFS are not supported in the z/OS operating environment.
- You can connect to only one Hadoop cluster at a time per SAS session. You can submit multiple LIBNAME statements to different directories in the Hadoop cluster, but there can be only one Hadoop cluster connection per SAS session.

Requirements:
- To associate a libref with a Hadoop cluster, you must have the first maintenance release for SAS 9.4 or later.
- Supported Hadoop distributions: Cloudera CDH 4.x, Cloudera CDH 5.x, Hortonworks HDP 2.x, IBM InfoSphere BigInsights 3.x, MapR 4.x (Microsoft Windows and Linux only), Pivotal HD 2.x, with or without Kerberos.
- To store data in HDFS using the SPD Engine, you must use a supported Hadoop distribution and configure a required set of Hadoop JAR files. The JAR files must be available to the SAS client machine. The SAS environment variable SAS_HADOOP_JAR_PATH must be defined and set to the location of the Hadoop JAR files. For complete instructions, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.
- To connect to the Hadoop cluster, Hadoop configuration files must be copied from the specific Hadoop cluster to a physical location that the SAS client machine can access. The SAS environment variable SAS_HADOOP_CONFIG_PATH must be defined and set to the location of the Hadoop configuration files. For complete instructions, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.

Example: Chapter 5, How to Use Hadoop Data Storage, on page 55

Syntax

   LIBNAME libref SPDE 'primary-pathname' HDFSHOST=DEFAULT
      <ACCELJAVAVERSION=version>
      <ACCELWHERE=NO|YES>
      <DATAPATH=('pathname')>
      <HDFSUSER=ID>
      <IOBLOCKSIZE=n>
      <NUMTASKS=n>
      <PARALLELREAD=NO|YES>
      <PARALLELWRITE=NO|YES|threads>
      <PARTSIZE=n|nM|nG|nT>;
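As a minimal sketch (the pathnames and values are hypothetical), several of these options can be combined on one statement:

   libname myspde spde '/user/abcdef' hdfshost=default
      datapath=('/sasdata')        /* hypothetical data partition location */
      accelwhere=yes
      partsize=256m;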

Summary of Optional Arguments

ACCELJAVAVERSION=version
   When requesting that WHERE processing be optimized by being performed in the Hadoop cluster, specifies the Java Runtime Environment (JRE) version for the Hadoop cluster.

ACCELWHERE=NO|YES
   Determines whether WHERE processing is optimized by data subsetting being performed in the Hadoop cluster.

DATAPATH=('pathname')
   When creating a data set, specifies the fully qualified pathname to a directory in the Hadoop cluster to store data partition files.

HDFSUSER=ID
   Is an authorized user ID on the Hadoop cluster.

IOBLOCKSIZE=n
   Specifies a size in bytes of a block of observations to be used in an I/O operation.

NUMTASKS=n
   Specifies the number of MapReduce tasks when writing data in HDFS.

PARALLELREAD=NO|YES
   Determines when the SPD Engine uses parallel processing to read data stored in HDFS.

PARALLELWRITE=NO|YES|threads
   Determines whether the SPD Engine uses parallel processing to write data in HDFS.

PARTSIZE=n|nM|nG|nT
   Specifies a size for the SPD Engine data partition file.

Required Arguments

libref
   is a valid SAS library name that serves as a shortcut name to associate with a data set in a Hadoop cluster. The name can be up to eight characters long and must conform to the rules for SAS names.

SPDE
   is the engine name for the SAS Scalable Performance Data (SPD) Engine.

'primary-pathname'
   specifies the fully qualified pathname to a directory in the Hadoop cluster, which is typically UNIX based. Enclose the primary pathname in single or double quotation marks. An example is '/user/abcdef/'. When data is loaded into a Hadoop cluster directory, the SPD Engine automatically creates a subdirectory with the specified data set name and the suffix _spde. The SPD Engine data partition files are contained in that subdirectory. For example, if you load a data set named BigFile into the directory /user/abcdef/, the data partition files are located at /user/abcdef/bigfile_spde/. The SPD Engine metadata and index files are located at /user/abcdef/.

   Restrictions: Maximum length is 260 characters for Windows and 1024 characters for UNIX. The primary pathname must be unique for each assigned libref. Assigned librefs that are different but reference the same primary pathname can result in lost data.

   Requirement: You must use valid directory syntax for the host. The pathname must be recognized by the operating environment.

   Interaction: You can specify a different location to store the data partition files with the DATAPATH= option on page 32.

HDFSHOST=DEFAULT
   specifies that you want to connect to the Hadoop cluster that is defined in the Hadoop cluster configuration files.

   The SPD Engine locates the Hadoop cluster configuration files using the SAS_HADOOP_CONFIG_PATH environment variable. The environment variable sets the location of the configuration files for a specific cluster. For more information about the SAS_HADOOP_CONFIG_PATH environment variable, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.

   Requirement: You must specify the HDFSHOST=DEFAULT argument.

Optional Arguments

ACCELJAVAVERSION=version
   When requesting that WHERE processing be optimized by being performed in the Hadoop cluster, specifies the Java Runtime Environment (JRE) version for the Hadoop cluster. The value must be either 1.6 or 1.7.

   Default: 1.6

   Interaction: To request that data subsetting be performed in the Hadoop cluster, use the ACCELWHERE= LIBNAME statement option on page 31. By default, data subsetting is performed by the SPD Engine on the SAS client.

   Example: Example 8: Optimizing WHERE Processing with MapReduce on page 69

ACCELWHERE=NO|YES
   Determines whether WHERE processing is optimized by data subsetting being performed in the Hadoop cluster.

   NO
      specifies that data subsetting is performed by the SPD Engine on the SAS client. This is the default setting.

   YES
      specifies that data subsetting is performed by a MapReduce program in the Hadoop cluster.

   Default: NO

   Requirements: To perform data subsetting in the Hadoop cluster, there are data set and SAS code requirements. See WHERE Processing Optimization with MapReduce on page 15. To submit the MapReduce program to the Hadoop cluster, the Hadoop configuration file must include the properties to run MapReduce (MR1) or MapReduce 2 (MR2) and YARN.

   Interactions: If the JRE version for the Hadoop cluster is 1.7 instead of the default version 1.6, use the ACCELJAVAVERSION= LIBNAME statement option on page 31 to specify the version. The ACCELWHERE= data set option overrides the ACCELWHERE= LIBNAME statement option. For more information, see ACCELWHERE= data set option on page 39.

   Example: Example 8: Optimizing WHERE Processing with MapReduce on page 69

DATAPATH=('pathname')
   When creating a data set, specifies the fully qualified pathname to a directory in the Hadoop cluster to store data partition files. Enclose the pathname in single or double quotation marks within parentheses. An example is datapath=('/sasdata'). When data is loaded into the directory, a subdirectory is automatically created with the specified data set name and the suffix _spde. The SPD Engine data partition files are contained in that subdirectory. For example, if you load a data set named BigFile into the directory /user/abcdef/ and specify datapath=('/sasdata/'), the data partition files are located at /sasdata/bigfile_spde/. The SPD Engine metadata and index files are located at /user/abcdef/.

   Restrictions: You can specify only one pathname to store data partition files. Maximum length is 260 characters for Windows and 1024 characters for UNIX. The pathname must be unique for each assigned libref. Assigned librefs that are different but reference the same pathname can result in lost data.

   Requirement: You must use valid directory syntax for the host. The pathname must be recognized by the operating environment.

   Interaction: Specifying the DATAPATH= option overrides the primary pathname for storing the data partition files only. The SPD Engine metadata and index files are always stored in the primary pathname.

HDFSUSER=ID
   Is an authorized user ID on the Hadoop cluster. You can specify a user ID to connect to the Hadoop cluster with a different ID than your current logon ID.

   Restrictions: If the HDFSUSER= option is specified, Kerberos authentication is bypassed, which prevents access to a secure Hadoop cluster. If the HDFSUSER= option is specified, WHERE processing optimization with the ACCELWHERE= option cannot be performed in the Hadoop cluster. HDFSUSER= is not supported by a MapR Apache Hadoop distribution.

IOBLOCKSIZE=n
   Specifies a size in bytes of a block of observations to be used in an I/O operation. The I/O block size determines the amount of data that is physically transferred together in an I/O operation. The larger the block size, the less I/O.

   The SPD Engine uses blocks in memory to collect the observations to be written to or read from a data component file. The IOBLOCKSIZE= option specifies the size of the block. (The actual size is computed to accommodate the largest number of observations that fit in the specified size of n bytes. Therefore, the actual size is a multiple of the observation length.)

   The block size affects I/O operations for compressed, uncompressed, and encrypted data sets. However, the effects are different and depend on the I/O operation.

   - For a compressed data set, the block size determines how many observations are compressed together, which determines the amount of data that is physically transferred for both Read and Write operations. The block size is a permanent attribute of the file. To specify a different block size, you must copy the data set to a new data set, and then specify a new block size for the output file. For a compressed data set, a larger block size can improve performance for both Read and Write operations.
   - For an encrypted data set, the block size is a permanent attribute of the file.
   - For an uncompressed data set, the block size determines the size of the blocks that are used to read the data from disk to memory. The block size has no effect when writing data to disk. For an uncompressed data set, the block size is not a permanent attribute of the file. That is, you can specify a different block size based on the Read operation that you are performing. For example, reading data that is randomly distributed or reading a subset of the data calls for a smaller block size because accessing smaller blocks is faster than accessing larger blocks. In contrast, reading data that is uniformly or sequentially distributed or that requires a full data set scan works better with a larger block size.

   Default: 1,048,576 bytes (1 megabyte)

   Range: The minimum block size is 32,768 bytes. The maximum block size is half the size of the SPD Engine data partition file.

   Restriction: The SPD Engine I/O block size must be smaller than or equal to the Hadoop cluster block size.

Interaction
  The IOBLOCKSIZE= data set option overrides the IOBLOCKSIZE= LIBNAME statement option. For more information, see IOBLOCKSIZE= Data Set Option for HDFS on page 40.

Tip
  When reading a data set, the block size can significantly affect performance. If retrieving a large percentage of the data, a larger block size improves performance. However, if retrieving a subset of the data (such as with WHERE processing), a smaller block size performs better.

Example
  Example 7: Setting the SPD Engine I/O Block Size on page 68

NUMTASKS=n
  Specifies the number of MapReduce tasks when writing data in HDFS. This option controls parallel processing on the Hadoop cluster when writing output from a SAS High-Performance Analytics procedure using the SAS Embedded Process. When a high-performance procedure reads and writes Hadoop data, and the amount of output data is similar to the amount of input data, the same number of output tasks as input tasks should be a good default. However, if the amount of output data differs significantly from the amount of input data, you should use this option to tune the number of tasks proportionally to the output data.

Default
  The number of MapReduce tasks is the number of SAS High-Performance Analytics nodes. Or, if the high-performance procedure reads a Hadoop file as input, it is the number of tasks that were used to read the input file.

Restriction
  This option affects writing data in HDFS only when a high-performance procedure writes output to HDFS using the SAS Embedded Process.

Interaction
  If the specified number of MapReduce tasks is less than the number of SAS High-Performance Analytics nodes on which the procedure runs, the setting is ignored.
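As an illustrative sketch (the libref, path, and task count are hypothetical), the following LIBNAME statement raises the number of MapReduce output tasks for a library whose high-performance procedure output is much larger than its input:

libname hpout spde '/data/spde' hdfshost=default
   numtasks=32;   /* tune the output task count toward the larger output volume */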

PARALLELREAD=NO | YES
  Determines when the SPD Engine uses parallel processing to read data stored in HDFS.
  NO specifies that parallel processing occurs only if a Read operation includes WHERE processing. This is the default behavior for the SPD Engine.
  YES specifies parallel processing for all Read operations using the assigned libref.

Default
  NO

Interactions
  The SET statement POINT= option is inconsistent with parallel processing.
  When parallel processing occurs, the order in which the observations are returned might not be in the physical order of the observations in the data set. Some applications require that observations be returned in the physical order.
  The PARALLELREAD= LIBNAME statement option overrides the SPDEPARALLELREAD= system option. For more information, see SPDEPARALLELREAD= System Option for HDFS on page 45.
  The PARALLELREAD= LIBNAME statement option can be overridden by the PARALLELREAD= data set option. For more information, see PARALLELREAD= Data Set Option for HDFS on page 42.

See
  Parallel Processing for Data in HDFS on page 12
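A brief sketch (the libref and path are hypothetical) that turns on parallel Read processing for every Read operation through the libref, not just those with WHERE processing:

libname myspde spde '/data/spde' hdfshost=default
   parallelread=yes;   /* all Read operations through this libref use parallel processing */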

PARALLELWRITE=NO | YES | threads
  Determines whether the SPD Engine uses parallel processing to write data in HDFS.
  NO specifies that parallel processing for a Write operation does not occur. This is the default behavior for the SPD Engine.
  YES specifies parallel processing for all Write operations using the assigned libref. A thread is used for each CPU on the SAS client machine. For example, if eight CPUs exist on the SAS client machine, then eight threads are used to write data.
  threads specifies parallel processing for all Write operations using the assigned libref and specifies the number of threads to use for the Write operations. The default is 1, which specifies that parallel processing for a Write operation does not occur. The range is 2 to 512.

Default
  NO

Restrictions
  You cannot use parallel processing for a Write operation and also request to create a SAS index.
  You cannot use parallel processing for a Write operation and also request BY-group processing or sorting.

Interactions
  When parallel Write processing occurs, the order in which the observations are written is unpredictable. The order in which the observations are returned cannot be determined unless the application imposes ordering criteria.
  The SPD Engine SPDEMAXTHREADS= system option specifies the maximum number of threads that the SPD Engine uses for processing. For more information, see SAS Scalable Performance Data Engine: Reference.
  The PARALLELWRITE= LIBNAME statement option can be overridden by the PARALLELWRITE= data set option. For more information, see PARALLELWRITE= Data Set Option for HDFS on page 43.

Note
  The PARALLELWRITE= LIBNAME statement option is available in the third maintenance release for SAS 9.4.

See
  Parallel Processing for Data in HDFS on page 12
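As a minimal sketch (the libref, path, and thread count are hypothetical), either form below enables parallel Write processing; remember that it cannot be combined with index creation or BY-group processing:

libname myspde spde '/data/spde' hdfshost=default
   parallelwrite=yes;   /* one write thread per CPU on the SAS client */

libname myspde spde '/data/spde' hdfshost=default
   parallelwrite=8;     /* explicitly request eight write threads */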

PARTSIZE=n | nM | nG | nT
  Specifies a size for the SPD Engine data partition file. Each partition is stored as a separate file with the file extension .dpf. Depending on the amount of data and the partition size, the data can consist of one or more physical files, but is referenced as one logical file. The value is specified in megabytes, gigabytes, or terabytes. If n is specified without M, G, or T, the default is megabytes. That is, partsize=64 is the same as partsize=64m.

Default
  128 megabytes

Restrictions
  The minimum value is 16 megabytes. The maximum value is 8,796,093,022,207 megabytes.

Interaction
  The PARTSIZE= data set option overrides the PARTSIZE= LIBNAME statement option. For more information, see PARTSIZE= Data Set Option for HDFS on page 41.

Tip
  To update data, a smaller partition size provides the best performance. For example, when you update a value, the SPD Engine locates the appropriate partition, modifies the value, and rewrites all replications of the partition. Because each update requires that the partition be rewritten, it is recommended that you perform updates only occasionally or set a small partition size if you are planning to update the data frequently.
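A minimal sketch (the libref, path, and size are hypothetical) that sets a smaller partition size for a library whose data sets will be updated frequently:

libname myspde spde '/data/spde' hdfshost=default
   partsize=32m;   /* 32 MB partitions keep the cost of rewriting a partition low */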

ACCELWHERE= Data Set Option for HDFS
Determines whether WHERE processing is optimized by data subsetting being performed in the Hadoop cluster.

Valid in: DATA step and PROC step
Category: Data Set Control
Default: NO
Requirements: To perform data subsetting in the Hadoop cluster, there are data set and SAS code requirements. For more information, see WHERE Processing Optimization with MapReduce on page 15.
  To submit the MapReduce program to the Hadoop cluster, the Hadoop configuration file must include the properties to run MapReduce (MR1) or MapReduce 2 (MR2) and YARN.
Interaction: If the JRE version for the Hadoop cluster is 1.7 instead of the default 1.6 version, use the ACCELJAVAVERSION= LIBNAME statement option on page 31 to specify the version.

Syntax
ACCELWHERE=NO | YES

Syntax Description
NO
  specifies that data subsetting is performed by the SPD Engine on the SAS client. This is the default setting.
YES
  specifies that data subsetting is performed by a MapReduce program in the Hadoop cluster.

Comparisons
The ACCELWHERE= data set option overrides the ACCELWHERE= LIBNAME statement option.

See Also
ACCELWHERE= LIBNAME statement option on page 31
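A short sketch (the libref, data set, and variable names are hypothetical) that requests in-cluster subsetting for a single step, overriding whatever the LIBNAME statement specifies:

proc freq data=myspde.bigfile (accelwhere=yes);   /* push the WHERE filter to MapReduce for this step only */
   tables region;
   where year > 2012;
run;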

IOBLOCKSIZE= Data Set Option for HDFS
Specifies a size in bytes of a block of observations to be used in an I/O operation.

Valid in: DATA step and PROC step
Category: Data Set Control
Default: 1,048,576 bytes (1 megabyte)
Ranges: The minimum block size is 32,768 bytes. The maximum block size is half the size of the SPD Engine data partition file.
Restriction: The SPD Engine I/O block size must be smaller than or equal to the Hadoop cluster block size.
Tip: When reading a data set, the block size can significantly affect performance. If retrieving a large percentage of the data, a larger block size improves performance. However, if retrieving a subset of the data (such as with WHERE processing), a smaller block size performs better.
Example: Example 7: Setting the SPD Engine I/O Block Size on page 68

Syntax
IOBLOCKSIZE=n

Syntax Description
n
  is the size in bytes of a block of observations.

Details
The I/O block size determines the amount of data that is physically transferred together in an I/O operation. The larger the block size, the less I/O. The SPD Engine uses blocks in memory to collect the observations to be written to or read from a data component file. The IOBLOCKSIZE= data set option specifies the size of the block. (The actual size is computed to accommodate the largest number of observations that fit in the specified size of n bytes. Therefore, the actual size is a multiple of the observation length.)
The block size affects I/O operations for compressed, uncompressed, and encrypted data sets. However, the effects are different and depend on the I/O operation.

For a compressed data set, the block size determines how many observations are compressed together, which determines the amount of data that is physically transferred for both Read and Write operations. The block size is a permanent attribute of the file. To specify a different block size, you must copy the data set to a new data set, and then specify a new block size for the output file. For a compressed data set, a larger block size can improve performance for both Read and Write operations.
For an encrypted data set, the block size is a permanent attribute of the file.
For an uncompressed data set, the block size determines the size of the blocks that are used to read the data from disk to memory. The block size has no effect when writing data to disk. For an uncompressed data set, the block size is not a permanent attribute of the file. That is, you can specify a different block size based on the Read operation that you are performing. For example, reading data that is randomly distributed or reading a subset of the data calls for a smaller block size because accessing smaller blocks is faster than accessing larger blocks. In contrast, reading data that is uniformly or sequentially distributed or that requires a full data set scan works better with a larger block size.

Comparisons
The IOBLOCKSIZE= data set option overrides the IOBLOCKSIZE= LIBNAME statement option.

See Also
IOBLOCKSIZE= LIBNAME statement option on page 33

PARTSIZE= Data Set Option for HDFS
Specifies a size for the SPD Engine data partition file.

Valid in: DATA step and PROC step
Category: Data Set Control
Default: 128 megabytes
Restrictions: The minimum value is 16 megabytes. The maximum value is 8,796,093,022,207 megabytes.
  Specify a data partition file size only when creating a new data set.
Tip: To update data, a smaller partition size provides the best performance. For example, when you update a value, the SPD Engine locates the appropriate partition, modifies the value, and rewrites all replications of the partition. Because each update requires that the partition be rewritten, it is recommended that you perform updates only occasionally or set a small partition size if you are planning to update the data frequently.

Syntax
PARTSIZE=n | nM | nG | nT

Syntax Description
n | nM | nG | nT
  is the size of the data partition file in megabytes, gigabytes, or terabytes. If n is specified without M, G, or T, the default is megabytes. That is, partsize=64 is the same as partsize=64m.

Details
Each partition is stored as a separate file with the file extension .dpf. Depending on the amount of data and the partition size, the data can consist of one or more physical files, but is referenced as one logical file.

Comparisons
The PARTSIZE= data set option overrides the PARTSIZE= LIBNAME statement option.

See Also
PARTSIZE= LIBNAME statement option on page 38
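A minimal sketch (the librefs and data set names are hypothetical) that sets the partition size on the output data set as it is created, overriding the LIBNAME statement setting:

data myspde.sales (partsize=256m);   /* 256 MB partitions for this new data set only */
   set work.sales_staging;
run;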

PARALLELREAD= Data Set Option for HDFS
Determines when the SPD Engine uses parallel processing to read data stored in HDFS.

Valid in: DATA step and PROC step
Category: Data Set Control
Default: NO
Interactions: The SET statement POINT= option is inconsistent with parallel processing.
  When parallel processing occurs, the order in which the observations are returned might not be in the physical order of the observations in the data set. Some applications require that observations be returned in the physical order.
See: Parallel Processing for Data in HDFS on page 12

Syntax
PARALLELREAD=NO | YES

Required Arguments
NO
  specifies that parallel processing occurs only if a Read operation includes WHERE processing. This is the default behavior for the SPD Engine.
YES
  requests parallel processing for all Read operations for the specific data set.

Comparisons
The PARALLELREAD= data set option overrides the SPDEPARALLELREAD= system option and the PARALLELREAD= LIBNAME statement option.

See Also
PARALLELREAD= LIBNAME Statement Option on page 36
SPDEPARALLELREAD= System Option for HDFS on page 45

PARALLELWRITE= Data Set Option for HDFS
Determines whether the SPD Engine uses parallel processing to write data in HDFS.

Valid in: DATA step and PROC step
Category: Data Set Control
Default: NO
Restrictions: You cannot use parallel processing for a Write operation and also request to create a SAS index.
  You cannot use parallel processing for a Write operation and also request BY-group processing or sorting.

Interactions: When parallel Write processing occurs, the order in which the observations are written is unpredictable. The order in which the observations are returned cannot be determined unless the application imposes ordering criteria.
  The SPD Engine SPDEMAXTHREADS= system option specifies the maximum number of threads that the SPD Engine uses for processing. For more information, see SAS Scalable Performance Data Engine: Reference.
Note: The PARALLELWRITE= data set option is available in the third maintenance release for SAS 9.4.
See: Parallel Processing for Data in HDFS on page 12

Syntax
PARALLELWRITE=NO | YES | threads

Required Arguments
NO
  specifies that parallel processing for a Write operation does not occur. This is the default behavior for the SPD Engine.
YES
  specifies parallel processing for all Write operations for the specific data set. A thread is used for each CPU on the SAS client machine. For example, if eight CPUs exist on the SAS client machine, then eight threads are used to write data.
threads
  specifies parallel processing for all Write operations for the specific data set and specifies the number of threads to use for the Write operations. The default is 1, which specifies that parallel processing for a Write operation does not occur. The range is 2 to 512.

Comparisons
The PARALLELWRITE= data set option overrides the PARALLELWRITE= LIBNAME statement option.

See Also
PARALLELWRITE= LIBNAME Statement Option on page 36
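A brief sketch (the librefs and data set names are hypothetical) that enables parallel Write processing for one output data set; note that the data set cannot also be indexed or written with BY-group processing:

data myspde.weblog (parallelwrite=yes);   /* parallel Write for this data set only */
   set work.weblog_staging;
run;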

SPDEPARALLELREAD= System Option for HDFS
Determines when the SPD Engine uses parallel processing to read data stored in HDFS.

Valid in: SAS configuration file, SAS invocation, OPTIONS statement, SAS System Options window
Category: SASFILES: SAS Files
PROC OPTIONS GROUP=: SASFILES
Default: NO
Interactions: The SET statement POINT= option is inconsistent with parallel processing.
  When parallel processing occurs, the order in which the observations are returned might not be in the physical order of the observations in the data set. Some applications require that observations be returned in the physical order.
See: Parallel Processing for Data in HDFS on page 12

Syntax
SPDEPARALLELREAD=NO | YES

Required Arguments
NO
  specifies that parallel processing occurs only if a Read operation includes WHERE processing. This is the default behavior for the SPD Engine.
YES
  requests parallel processing for all Read operations for the SAS session.

Comparisons
The SPDEPARALLELREAD= system option can be overridden by the PARALLELREAD= LIBNAME statement option and the PARALLELREAD= data set option.

See Also
PARALLELREAD= LIBNAME Statement Option on page 36
PARALLELREAD= Data Set Option for HDFS on page 42
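A minimal sketch showing the system option set for the whole session in an OPTIONS statement:

options spdeparallelread=yes;   /* session-wide parallel Read; LIBNAME and data set options can still override */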

SPDE_CONFIG_FILE SAS Environment Variable
Requests SPD Engine distributed locking by specifying the location of the user-defined XML configuration file.

Valid in: SAS configuration file, SAS invocation, OPTIONS statement, SAS System Options window
Default: The SPD Engine uses HDFS distributed locking.
Note: The SPDE_CONFIG_FILE SAS environment variable is available in the third maintenance release for SAS 9.4.
See: SPD Engine Distributed Locking on page 20

Syntax
SPDE_CONFIG_FILE='pathname'

Required Argument
'pathname'
  specifies the fully qualified pathname to the user-defined XML configuration file. The location must be available to the SAS client machine. Enclose the pathname in single or double quotation marks. You can name the file whatever you want. An example is '/user/abcdef/hadoop/spde-site.xml'.

Details

XML Configuration File
The XML configuration file contains the information so that the SPD Engine can communicate with ZooKeeper. The format of the XML configuration file is similar to a Hadoop configuration file in that the XML contains properties and attributes as name and value pairs. You must create an XML configuration file. The following is an example XML configuration file:

<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <property>
    <!-- Comma-separated list of Hadoop cluster machines running a ZooKeeper server. -->
    <name>spde.zookeeper.quorum</name>
    <value>abcdef07.unx.sas.com,abcdef08.unx.sas.com,abcdef06.unx.sas.com</value>
  </property>
  <property>
    <!-- Port number used to connect to the ZooKeeper ensemble. -->
    <name>spde.zookeeper.port</name>
    <value>2181</value>
  </property>
  <property>
    <!-- Number of times to attempt to connect to ZooKeeper before failing. -->
    <name>spde.zookeeper.connect.maxretries</name>
    <value>3</value>
  </property>
  <property>
    <!-- Number of milliseconds to sleep between connection attempts. -->
    <name>spde.zookeeper.connect.retrysleep</name>
    <value>1000</value>
  </property>
  <property>
    <!-- Number of milliseconds to wait before connection considered expired. -->
    <name>spde.zookeeper.connect.timeout</name>
    <value>30000</value>
  </property>
  <property>
    <!-- Number of milliseconds to wait before session considered expired. -->
    <name>spde.zookeeper.session.timeout</name>
    <value>180000</value>
  </property>
  <property>
    <!-- Number of milliseconds to wait before lock request considered failed. -->
    <name>spde.zookeeper.lockwait.timeout</name>
    <value>10000</value>
  </property>
  <property>
    <!-- Number of milliseconds to wait before deleting an empty ZooKeeper data node. -->
    <name>spde.zookeeper.reaper.threshold</name>
    <value>3000</value>
  </property>
</configuration>

Creating the XML Configuration File
The following are XML configuration file properties. The first two properties, spde.zookeeper.quorum and spde.zookeeper.port, are required. The other properties have default values if they are not included in the XML configuration file.

spde.zookeeper.quorum
  a comma-separated list of quorum machines that are configured to work together as a single server. The listed machines must be running a ZooKeeper server and servicing requests on the port that is specified in the spde.zookeeper.port property. This property is required.

spde.zookeeper.port
  the I/O port on which the quorum machines that are listed in the spde.zookeeper.quorum property are configured to service requests. This property is required.

spde.zookeeper.connect.maxretries
  the maximum number of times that Curator attempts to connect to ZooKeeper before failing. Values less than or equal to zero are ignored. The default is 3.

spde.zookeeper.connect.retrysleep
  the milliseconds that Curator sleeps between attempts to connect to ZooKeeper. The sleep time starts with this setting, but increases between each attempt. Values less than or equal to zero are ignored. The default is 1,000.

spde.zookeeper.connect.timeout
  the milliseconds that Curator and the ZooKeeper client wait for a communication from the ZooKeeper server before considering the server connection to be expired. When operating normally, the client establishes a connection to the server and communicates with it over that connection. If the connection is non-responsive for more than the specified value, it is considered expired and is dropped, followed by an attempt to establish a new connection. Values less than or equal to zero are ignored. The default is 30,000.

spde.zookeeper.session.timeout
  the milliseconds that Curator and the ZooKeeper client wait for a communication from the ZooKeeper server before considering the client session to be expired. When operating normally, the client establishes a connection to the server and communicates with it over that connection. The connection might be dropped and reestablished as the network or server nodes experience faults, but the client session continues to exist for the duration of these interruptions. If an interruption persists for more than the specified value, the client session is considered expired and is terminated. No reconnection is possible after that. Values less than or equal to zero are ignored. The default is 180,000.

spde.zookeeper.lockwait.timeout
  the milliseconds that the ZooKeeper server waits for a lock to become available before declaring that a lock request has failed and returning control to the client. Values less than zero are ignored. A value of zero is valid.

spde.zookeeper.reaper.threshold
  the milliseconds that the ZooKeeper server waits before deleting an empty ZooKeeper server node. The default is 3,000.

Defining the SPDE_CONFIG_FILE Environment Variable
The following table includes examples of defining the SPDE_CONFIG_FILE environment variable:

Table 4.1 Defining the SPDE_CONFIG_FILE Environment Variable

Method                  Example
SAS configuration file  -set SPDE_CONFIG_FILE /user/abcdef/hadoop/spde-site.xml
SAS invocation          -set SPDE_CONFIG_FILE /user/abcdef/hadoop/spde-site.xml
OPTIONS statement       options set=spde_config_file='/user/abcdef/hadoop/spde-site.xml';

SPDE_HADOOP_WORK_PATH SAS Environment Variable
Specifies a pathname for SAS High-Performance Analytics work files.

Valid in: SAS configuration file, SAS invocation, OPTIONS statement, SAS System Options window
Default: The SPD Engine uses the standard UNIX temporary directory /tmp.
See: Using SAS High-Performance Analytics Procedures on page 24

Syntax
SPDE_HADOOP_WORK_PATH='pathname'

Required Argument
'pathname'
  specifies the fully qualified pathname to a directory in the Hadoop cluster, which is typically UNIX based. Enclose the pathname in single or double quotation marks. An example is '/sasdata/cluster1/hpawork'.

  Requirement: The directory must exist, and you must have Write access.

Details
The following table includes examples of defining the SPDE_HADOOP_WORK_PATH environment variable:

Table 4.2 Defining the SPDE_HADOOP_WORK_PATH Environment Variable

Method                  Example
SAS configuration file  -set SPDE_HADOOP_WORK_PATH /sasdata/cluster1/hpawork
SAS invocation          -set SPDE_HADOOP_WORK_PATH /sasdata/cluster1/hpawork
OPTIONS statement       options set=spde_hadoop_work_path='/sasdata/cluster1/hpawork';

SPDELOCKPATH SAS Environment Variable
Specifies a pathname for the SPD Engine lock directory for HDFS concurrent access.

Valid in: SAS configuration file, SAS invocation, OPTIONS statement, SAS System Options window
Default: The SPD Engine creates a lock directory in the /tmp directory. The lock directory name includes the name of the data set, an eight-character hexadecimal value (which is the checksum of the Hadoop cluster that contains the data set), and the suffix _spdslock9.
Note: The SPDELOCKPATH SAS environment variable is available in the third maintenance release for SAS 9.4.
See: SPD Engine File System Locking on page 18

Syntax
SPDELOCKPATH='pathname'

Required Argument
'pathname'
  specifies the fully qualified pathname to a directory in the Hadoop cluster, which is typically UNIX based. Enclose the pathname in single or double quotation marks. An example is '/user/abcdef/'.

  Tip: Specify only one lock directory pathname for each Hadoop cluster so that the same data set is not using different lock directories.

Details
The following table includes examples of defining the SPDELOCKPATH environment variable:

Table 4.3 Defining the SPDELOCKPATH Environment Variable

Method                  Example
SAS configuration file  -set SPDELOCKPATH /user/abcdef
SAS invocation          -set SPDELOCKPATH /user/abcdef
OPTIONS statement       options set=spdelockpath='/user/abcdef';

SPDEREADLOCK SAS Environment Variable
Determines whether a Read access lock file is created.

Valid in: SAS configuration file, SAS invocation, OPTIONS statement, SAS System Options window
Default: NO
See: SPD Engine File System Locking on page 18

Syntax
SPDEREADLOCK=NO | YES

Required Arguments
NO
  specifies that a Read access lock file is not created when a data set stored in HDFS is opened for Read access. This is the default behavior for the SPD Engine. Only Write access lock files are created.
YES
  specifies that a Read access lock file is created when a data set stored in HDFS is opened for Read access. Once the lock file is created, no other SAS process can open the data set for Write access.

Details
To control the level of concurrent access, you can request a Read access lock file by defining the SAS environment variable SPDEREADLOCK and setting it to YES. Then, when a SAS session opens a data set for Read access, a lock file is created in addition to any Write access lock files. The following table includes examples of defining the SPDEREADLOCK environment variable:

Table 4.4 Defining the SPDEREADLOCK Environment Variable

Method                  Example
SAS configuration file  -set SPDEREADLOCK YES
SAS invocation          -set SPDEREADLOCK YES
OPTIONS statement       options set=spdereadlock YES;


5  How to Use Hadoop Data Storage

Overview: How to Use Hadoop Data Storage
Example 1: Loading Existing SAS Data Using the COPY Procedure
Example 2: Creating a Data Set Using the DATA Step
Example 3: Adding to Existing Data Set Using the APPEND Procedure
Example 4: Loading Oracle Data Using the COPY Procedure
Example 5: Analyzing Data Using the FREQ Procedure
Example 6: Managing SAS Files Using the DATASETS Procedure
Example 7: Setting the SPD Engine I/O Block Size
Example 8: Optimizing WHERE Processing with MapReduce

Overview: How to Use Hadoop Data Storage

These examples illustrate how to use Hadoop data storage. The examples show you how to load existing data into a Hadoop cluster, how to create a new data set in a Hadoop cluster, and how to append data to an existing data set in a Hadoop cluster. Other examples show you how to load Oracle data into a Hadoop cluster and how to access data sets stored in a Hadoop cluster for data management and analysis.

Note: The example data was created to illustrate SPD Engine functionality to read, write, and update data sets in a Hadoop cluster. The example data does not reflect the type of data or file size that might typically be loaded into a Hadoop cluster.

Example 1: Loading Existing SAS Data Using the COPY Procedure

Details
This example loads existing SAS data into a Hadoop cluster. The example uses the default Base SAS engine, the SPD Engine, and the COPY procedure. The data set named MyBase.BigFile is copied, converted to the SPD Engine format, and then written to the Hadoop cluster as an SPD Engine data set named MySpde.BigFile.

Program
options set=sas_hadoop_config_path="\\sashq\root\u\abcdef\cdh45p1"; /* 1 */
options set=sas_hadoop_jar_path="\\sashq\root\u\abcdef\cdh45";

libname mybase 'C:\SASFiles'; /* 2 */

libname myspde spde '/data/spde' hdfshost=default; /* 3 */

proc copy in=mybase out=myspde; /* 4 */
   select bigfile;
run;

Program Description
1  The OPTIONS statements include the SET system option to define the SAS_HADOOP_CONFIG_PATH environment variable and the SAS_HADOOP_JAR_PATH environment variable. The environment variables set the location of the configuration files and the JAR files for a specific Hadoop cluster.
2  The first LIBNAME statement assigns the libref MyBase to the physical location of the SAS library that stores the data set. (The default Base SAS engine for a LIBNAME statement is the V9 (or BASE) engine.)

3  The second LIBNAME statement assigns the libref MySpde to a directory in the Hadoop cluster. The SPD Engine is specified. The HDFSHOST=DEFAULT argument specifies to connect to the Hadoop cluster that is defined in the Hadoop cluster configuration files.
4  The COPY procedure copies the data set named BigFile. The SPD Engine creates a subdirectory with the specified data set name and the suffix _spde, converts the data to the SPD Engine format, and writes the data to the directory in the Hadoop cluster. HDFS distributes the data on the Hadoop cluster. The SPD Engine data partition files for the data set BigFile are located at /data/spde/bigfile_spde/. The first partition file is named bigfile.dpf.080e0a8f.0.1.spds9.

Example 2: Creating a Data Set Using the DATA Step

Details
This example creates a data set named MySpde.Fitness in a Hadoop cluster. The example uses the default Base SAS engine, the SPD Engine, and the DATA step SET statement to concatenate several data sets. The data sets are converted to the SPD Engine format and then written to a directory in the Hadoop cluster.

Program
options set=sas_hadoop_config_path="\\sashq\root\u\abcdef\cdh45p1"; /* 1 */
options set=sas_hadoop_jar_path="\\sashq\root\u\abcdef\cdh45dl";

libname mybase 'C:\SASFiles'; /* 2 */

libname myspde spde '/data/spde' hdfshost=default; /* 3 */

data myspde.fitness; /* 4 */
   set mybase.fitness_2010 mybase.fitness_2011 mybase.fitness_2012;
run;

Program Description
1  The OPTIONS statements include the SET system option to define the SAS_HADOOP_CONFIG_PATH environment variable and the SAS_HADOOP_JAR_PATH environment variable. The environment variables set the location of the configuration files and the JAR files for a specific Hadoop cluster.
2  The first LIBNAME statement assigns the libref MyBase to the physical location of the SAS library that stores the data sets. (The default Base SAS engine for a LIBNAME statement is the V9 (or BASE) engine.)
3  The second LIBNAME statement assigns the libref MySpde to a directory in the Hadoop cluster. The SPD Engine is specified. The HDFSHOST=DEFAULT argument specifies to connect to the Hadoop cluster that is defined in the Hadoop cluster configuration files.
4  The DATA statement assigns the name Fitness to the new data set. The SET statement lists the names of existing data sets to be read. The SPD Engine copies the three input data sets, concatenates them into one output data set named Fitness, converts the data to the SPD Engine format, and then writes the data to the directory in the Hadoop cluster. HDFS distributes the data on the Hadoop cluster.

Example 3: Adding to Existing Data Set Using the APPEND Procedure

Details
This example adds data to an existing data set that is stored in a Hadoop cluster. The example uses the default Base SAS engine, the SPD Engine, and the APPEND procedure. The data sets named MyBase.September and MyBase.October are converted to the SPD Engine format and then written to the existing data set named Sales.YearToDate.

Program
options set=sas_hadoop_config_path="\\sashq\root\u\abcdef\cdh45p1"; /* 1 */
options set=sas_hadoop_jar_path="\\sashq\root\u\abcdef\cdh45";

libname mybase 'C:\SASFiles'; /* 2 */

libname sales spde '/data/spde' hdfshost=default; /* 3 */

proc append base=sales.yeartodate data=mybase.september; /* 4 */
run;

proc append base=sales.yeartodate data=mybase.october; /* 5 */
run;

Program Description
1  The OPTIONS statements include the SET system option to define the SAS_HADOOP_CONFIG_PATH environment variable and the SAS_HADOOP_JAR_PATH environment variable. The environment variables set the location of the configuration files and the JAR files for a specific Hadoop cluster.
2  The first LIBNAME statement assigns the libref MyBase to the physical location of the SAS library that stores the data sets. (The default Base SAS engine for a LIBNAME statement is the V9 (or BASE) engine.)
3  The second LIBNAME statement assigns the libref Sales to a directory in the Hadoop cluster. The SPD Engine is specified. The HDFSHOST=DEFAULT argument specifies to connect to the Hadoop cluster that is defined in the Hadoop cluster configuration files.
4  The first PROC APPEND copies the data from MyBase.September to Sales.YearToDate. The SPD Engine converts the data to the SPD Engine format and then writes the data to the directory in the Hadoop cluster. HDFS distributes the data on the Hadoop cluster.
5  The second PROC APPEND copies the data from MyBase.October to Sales.YearToDate. The SPD Engine converts the data to the SPD Engine format and then writes the data to the directory in the Hadoop cluster. HDFS distributes the data on the Hadoop cluster.

Example 4: Loading Oracle Data Using the COPY Procedure

Details
This example loads Oracle data into a Hadoop cluster. The example uses the SAS/ACCESS to Oracle engine, the SPD Engine, and the COPY procedure. The table named MyOracle.Oracle1 is written to the Hadoop cluster as an SPD Engine data set named MySpde.Oracle1.

Program
options set=sas_hadoop_config_path="\\sashq\root\u\abcdef\cdh45p1"; /* 1 */
options set=sas_hadoop_jar_path="\\sashq\root\u\abcdef\cdh45";

libname myoracle oracle user=myusr1 password=mypwd1 path=mysrv1; /* 2 */

libname myspde spde '/data/spde' hdfshost=default; /* 3 */

proc copy in=myoracle out=myspde; /* 4 */
   select oracle1;
run;

Program Description
1  The OPTIONS statements include the SET system option to define the SAS_HADOOP_CONFIG_PATH environment variable and the SAS_HADOOP_JAR_PATH environment variable. The environment variables set the location of the configuration files and the JAR files for a specific Hadoop cluster.

2  The first LIBNAME statement assigns the libref MyOracle, specifies the Oracle engine, and specifies the connection information to the Oracle database that contains the Oracle table.
3  The second LIBNAME statement assigns the libref MySpde to a directory in the Hadoop cluster. The SPD Engine is specified. The HDFSHOST=DEFAULT argument specifies to connect to the Hadoop cluster that is defined in the Hadoop cluster configuration files.
4  The COPY procedure copies the table named Oracle1. The SPD Engine creates a subdirectory with the specified data set name and suffix _spde, converts the data to the SPD Engine format, and writes the data to the directory in the Hadoop cluster as an SPD Engine data set. HDFS distributes the data on the Hadoop cluster. The SPD Engine data partition files for the data set Oracle1 are located at /data/spde/oracle1_spde/.

Example 5: Analyzing Data Using the FREQ Procedure

Details
This example analyzes the data set StudentID that is stored in a Hadoop cluster. The data set contains 3,231,765 observations and three variables: ID, Age, and Name. The example uses the SPD Engine and the FREQ procedure to produce a one-way frequency table for the students' ages.

Program
options set=sas_hadoop_config_path="\\sashq\root\u\abcdef\cdh45p1"; /* 1 */
options set=sas_hadoop_jar_path="\\sashq\root\u\abcdef\cdh45";

libname class spde '/data/spde' hdfshost=default; /* 2 */

proc freq data=class.studentid; /* 3 */

   tables age;
run;

Program Description
1  The two OPTIONS statements include the SET system option to define the SAS_HADOOP_CONFIG_PATH environment variable and the SAS_HADOOP_JAR_PATH environment variable. The environment variables set the location of the configuration files and the JAR files for a specific Hadoop cluster.
2  To read a data set that is stored in a Hadoop cluster, simply connect to the cluster with the LIBNAME statement for the SPD Engine. The LIBNAME statement assigns the libref Class to a directory in the Hadoop cluster and specifies the SPD Engine. The HDFSHOST=DEFAULT argument specifies to connect to the Hadoop cluster that is defined in the Hadoop cluster configuration files.
3  PROC FREQ produces a one-way frequency table for the students' ages.

Figure 5.1 PROC FREQ One-Way Frequency Table

Example 6: Managing SAS Files Using the DATASETS Procedure

Details
This example illustrates how to manage SAS files that are stored in a Hadoop cluster. The example uses the DATASETS procedure to list the SAS files, describe the contents of a specific data set, and delete a data set from HDFS.

Program
options set=sas_hadoop_config_path="\\sashq\root\u\abcdef\cdh45p1"; /* 1 */
options set=sas_hadoop_jar_path="\\sashq\root\u\abcdef\cdh45";

libname myspde spde '/data/spde' hdfshost=default; /* 2 */

proc datasets library=myspde; /* 3 */
   contents data=studentid (listfiles=yes); /* 4 */
run;
   delete bigfile; /* 5 */
run;
quit;

Program Description
1  The two OPTIONS statements include the SET system option to define the SAS_HADOOP_CONFIG_PATH environment variable and the SAS_HADOOP_JAR_PATH environment variable. The environment variables set the location of the configuration files and the JAR files for a specific Hadoop cluster.
2  To manage your SAS files that are stored in a Hadoop cluster, simply connect to the cluster with the LIBNAME statement for the SPD Engine.

The LIBNAME statement assigns the libref MySpde to a directory in the Hadoop cluster and specifies the SPD Engine. The HDFSHOST=DEFAULT argument specifies to connect to the Hadoop cluster that is defined in the Hadoop cluster configuration files.
3  PROC DATASETS lists the SAS files that are stored in the directory in the Hadoop cluster.
4  The CONTENTS statement describes the contents of the data set named StudentID, which includes the number of observations, whether the data set has an index, and the observation length. The LISTFILES= data set option lists the complete pathnames of the SPD Engine files such as the data partition files and the metadata file.
5  The DELETE statement removes the data set named BigFile. The SPD Engine data partition, metadata, and index files are removed. The data set name subdirectory is also removed unless the subdirectory contains files other than the data partition files.

Figure 5.2 MySpde Directory Listing

Figure 5.3 Contents of StudentID Data Set

Example 7: Setting the SPD Engine I/O Block Size

Details
This example illustrates how to set the SPD Engine I/O block size to improve performance. The example uses the SPD Engine, an uncompressed data set, and SAS procedures to analyze the data.

Program
options set=sas_hadoop_config_path="\\sashq\root\u\abcdef\cdh45p1"; /* 1 */
options set=sas_hadoop_jar_path="\\sashq\root\u\abcdef\cdh45";

libname class spde '/data/spde' hdfshost=default; /* 2 */

proc means data=class.studentid; /* 3 */
   var age;
run;

proc print data=class.studentid (ioblocksize=32768); /* 4 */
   where age > 18;
run;

Program Description
1  The two OPTIONS statements include the SET system option to define the SAS_HADOOP_CONFIG_PATH environment variable and the SAS_HADOOP_JAR_PATH environment variable. The environment variables set the location of the configuration files and the JAR files for a specific Hadoop cluster.
2  The LIBNAME statement assigns the libref Class to a directory in the Hadoop cluster. The SPD Engine is specified. The HDFSHOST=DEFAULT argument specifies to connect to the Hadoop cluster that is defined in the Hadoop cluster configuration files.

The LIBNAME statement does not include the IOBLOCKSIZE= option, so the default I/O block size is 1,048,576 bytes (1 megabyte).
3  The MEANS procedure calculates statistics on the Age variable. Because the Read operation requires a full data set scan, the procedure uses the default I/O block size, which was set from the LIBNAME statement. For this Read operation, including the IOBLOCKSIZE= data set option to specify a larger I/O block size could improve performance. When retrieving a large percentage of the data, a larger block size provides a performance benefit.
4  The PRINT procedure requests output where the value of the Age variable is greater than 18. Because the Read operation requests a subset of the data, the procedure includes the IOBLOCKSIZE= data set option to specify a smaller I/O block size. A smaller I/O block size provides better performance because the SPD Engine does not read large blocks of observations when it only needs a few observations from the block.

Example 8: Optimizing WHERE Processing with MapReduce

Details
This example illustrates how to optimize WHERE processing by requesting that data subsetting be performed in the Hadoop cluster. This example analyzes the data set StudentID that is stored in a Hadoop cluster and submits the WHERE expression to the Hadoop cluster as a MapReduce program. By requesting that data subsetting be performed in the Hadoop cluster, performance is improved by taking advantage of the filtering and ordering capabilities of the MapReduce framework. As a result, only the subset of the data is returned to the SAS client.

Program
options msglevel=i; /* 1 */

options set=sas_hadoop_config_path="\\sashq\root\u\abcdef\cdh45p1"; /* 2 */
options set=sas_hadoop_jar_path="\\sashq\root\u\abcdef\cdh45";

libname class spde '/data/spde' hdfshost=default accelwhere=yes; /* 3 */

proc freq data=class.studentid;
   tables age;
   where age gt 14; /* 4 */
run;

Program Description
1  The first OPTIONS statement specifies the MSGLEVEL=I SAS system option to request that informative messages be written to the SAS log. For WHERE processing optimization, the SAS log reports whether the data filtering occurred in the Hadoop cluster.
2  The next two OPTIONS statements include the SET system option to define the SAS_HADOOP_CONFIG_PATH environment variable and the SAS_HADOOP_JAR_PATH environment variable. The environment variables set the location of the configuration files and the JAR files for a specific Hadoop cluster.
3  The LIBNAME statement assigns the libref Class to a directory in the Hadoop cluster and specifies the SPD Engine. The HDFSHOST=DEFAULT argument specifies to connect to the Hadoop cluster that is defined in the Hadoop cluster configuration files. The ACCELWHERE=YES argument requests that data subsetting be performed by a MapReduce program in the Hadoop cluster.
4  PROC FREQ produces a one-way frequency table for the students' ages that are greater than 14. The WHERE expression, which defines the condition that selected observations must satisfy, is instantiated as a Java class. The SPD Engine submits the Java class to the Hadoop cluster as a component in a MapReduce program. As a result, only a subset of the data is returned to the SAS client.

Figure 5.4 PROC FREQ One-Way Frequency Table, Optimized WHERE Processing

Note: The SAS log reports that there were 2,371,486 observations read from the data set. That number of observations is a subset of the data set stored in the Hadoop cluster, which contains 3,231,765 observations.

Log 5.1 SAS Log Reporting WHERE Optimization

1   options msglevel=i;
2   options set=sas_hadoop_config_path="\\sashq\root\u\abcdef\cdh45p1";
3   options set=sas_hadoop_jar_path="\\sashq\root\u\abcdef\cdh45";
4   libname class spde '/data/spde' hdfshost=default accelwhere=yes;
NOTE: Libref CLASS was successfully assigned as follows:
      Engine:        SPDE
      Physical Name: /data/spde/
5   proc freq data=class.studentid;
6      tables age;
7      where age gt 14;
whinit: WHERE (Age>14)
whinit returns: ALL EVAL2
8   run;
NOTE: Writing HTML Body file: sashtml.htm
NOTE: There were 2371486 observations read from the data set CLASS.STUDENTID.
      WHERE age>14;
      WHERE processing is optimized on the Hadoop cluster.
      Hadoop Job ID: job_ _14972
NOTE: PROCEDURE FREQ used (Total process time):
      real time  2:31.74
      cpu time   1.70 seconds


Appendix 1: Hive SerDe for SPD Engine Data

Accessing SPD Engine Data Using Hive
  Introduction
  Requirements for Accessing SPD Engine Tables with Hive
  Deploying the SPD Engine SerDe
  Registering the SPD Engine Table Metadata in the Hive Metastore
  Reading SPD Engine Tables from Hive
  Logging Support
  How the SPD Engine SerDe Reads the Data
Troubleshooting

Accessing SPD Engine Data Using Hive

Introduction
Hive uses an interface called SerDe to translate data that is stored in HDFS in proprietary formats such as JSON and Parquet. A SerDe deserializes data into a Java object that HiveQL and other languages that are supported by HiveServer2 can manipulate. Hive provides a variety of built-in SerDes and supports custom SerDes. For more information about Hive SerDes, see your Hive documentation.

In the third maintenance release for SAS 9.4, SAS provides a custom Hive SerDe for SPD Engine data that is stored in HDFS. The SerDe makes the data available for applications outside of SAS to query. The SPD Engine SerDe does not support creating, altering, or updating SPD Engine data in HDFS using HiveQL or other languages. That is, the SerDe is Read-only and cannot serialize data for storage in HDFS. If you want to process SPD Engine data stored in HDFS using SAS applications, you should access it directly with the SPD Engine.
In addition, if the SPD Engine table in HDFS has any of the following features, it cannot be registered in Hive or use the SerDe. You must access it by going through SAS and the SPD Engine. The following table features are not supported:
o compressed or encrypted tables
o tables with SAS informats
o tables that have user-defined formats
o password-protected tables
o tables owned by the SAS Scalable Performance Data Server
In addition, the following processing functionality is not supported by the SerDe and requires processing by the SPD Engine:
o Write, Update, and Append operations
o if preserving observation order is required

Requirements for Accessing SPD Engine Tables with Hive
The following are required to access SPD Engine tables using the SPD Engine SerDe:
o You must deploy SAS Foundation using the SAS Deployment Wizard. Select SAS Hive SerDe for SPDE Data.

Figure A1.1 SAS Deployment Wizard Product Selection Page

o You must be running a supported Hadoop distribution that includes Hive 0.13:
  o Cloudera CDH 5.2
  o Hortonworks HDP 2.1 or later
  o MapR or later
o The SPD Engine table stored in HDFS must have been created using the SPD Engine.
o The SerDe is delivered as two JAR files, which must be deployed to all nodes in the Hadoop cluster.


SAS 9.4 Intelligence Platform SAS 9.4 Intelligence Platform Installation and Configuration Guide Second Edition SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2015. SAS 9.4 Intelligence

More information

OnDemand for Academics

OnDemand for Academics SAS OnDemand for Academics User s Guide SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2015. SAS OnDemand for Academics: User's Guide. Cary, NC:

More information

SAS 9.3 Intelligence Platform

SAS 9.3 Intelligence Platform SAS 9.3 Intelligence Platform Application Server Administration Guide SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc 2011. SAS SAS 9.3 Intelligence

More information

SAS Task Manager 2.2. User s Guide. SAS Documentation

SAS Task Manager 2.2. User s Guide. SAS Documentation SAS Task Manager 2.2 User s Guide SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2015. SAS Task Manager 2.2: User's Guide. Cary, NC: SAS Institute

More information

SAS 9.4 Intelligence Platform: Migration Guide, Second Edition

SAS 9.4 Intelligence Platform: Migration Guide, Second Edition SAS 9.4 Intelligence Platform: Migration Guide, Second Edition SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2015. SAS 9.4 Intelligence Platform:

More information

SAS University Edition: Installation Guide for Linux

SAS University Edition: Installation Guide for Linux SAS University Edition: Installation Guide for Linux i 17 June 2014 The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2014. SAS University Edition: Installation Guide

More information

SAS. Cloud. Account Administrator s Guide. SAS Documentation

SAS. Cloud. Account Administrator s Guide. SAS Documentation SAS Cloud Account Administrator s Guide SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2014. SAS Cloud: Account Administrator's Guide. Cary, NC:

More information

SAS 9.3 Scalable Performance Data Engine: Reference

SAS 9.3 Scalable Performance Data Engine: Reference SAS 9.3 Scalable Performance Data Engine: Reference SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2011. SAS 9.3 Scalable Performance Data Engine:

More information

SAS University Edition: Installation Guide for Windows

SAS University Edition: Installation Guide for Windows SAS University Edition: Installation Guide for Windows i 17 June 2014 The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2015. SAS University Edition: Installation Guide

More information

When to Move a SAS File between Hosts

When to Move a SAS File between Hosts 3 CHAPTER Moving and Accessing SAS Files between Hosts When to Move a SAS File between Hosts 3 When to Access a SAS File on a Remote Host 3 Host Types Supported According to SAS Release 4 Avoiding and

More information

SAS 9.3 Logging: Configuration and Programming Reference
