SAS 9.4 SPD Engine: Storing Data in the Hadoop Distributed File System




SAS 9.4 SPD Engine: Storing Data in the Hadoop Distributed File System, Third Edition. SAS Documentation.

The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2015. SAS 9.4 SPD Engine: Storing Data in the Hadoop Distributed File System, Third Edition. Cary, NC: SAS Institute Inc.

SAS 9.4 SPD Engine: Storing Data in the Hadoop Distributed File System, Third Edition

Copyright © 2015, SAS Institute Inc., Cary, NC, USA. All rights reserved. Produced in the United States of America.

For a hard-copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

For a web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time you acquire this publication. The scanning, uploading, and distribution of this book via the Internet or any other means without the permission of the publisher is illegal and punishable by law. Please purchase only authorized electronic editions and do not participate in or encourage electronic piracy of copyrighted materials. Your support of others' rights is appreciated.

U.S. Government License Rights; Restricted Rights: The Software and its documentation is commercial computer software developed at private expense and is provided with RESTRICTED RIGHTS to the United States Government. Use, duplication or disclosure of the Software by the United States Government is subject to the license terms of this Agreement pursuant to, as applicable, FAR 12.212, DFAR 227.7202-1(a), DFAR 227.7202-3(a) and DFAR 227.7202-4 and, to the extent required under U.S. federal law, the minimum restricted rights as set out in FAR 52.227-19 (DEC 2007). If FAR 52.227-19 is applicable, this provision serves as notice under clause (c) thereof and no other notice is required to be affixed to the Software or documentation. The Government's rights in Software and documentation shall be only those set forth in this Agreement.

SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513-2414. July 2015.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.

Contents

What's New in the SAS 9.4 SPD Engine to Store Data in HDFS . . . v

Chapter 1: Introduction to Storing Data in HDFS . . . 1
    Deciding to Store Data in HDFS . . . 1
    Using the SPD Engine to Store Data in HDFS . . . 2

Chapter 2: Storing Data in HDFS . . . 5
    Overview: Storing Data in HDFS . . . 5
    SAS and Hadoop Requirements . . . 6
    Supported SAS File Features Using the SPD Engine . . . 7
    Security . . . 8

Chapter 3: Using the SPD Engine . . . 9
    Overview: Using the SPD Engine . . . 10
    How the SPD Engine Supports Data Distribution . . . 10
    I/O Operation Performance . . . 11
    Creating SAS Indexes . . . 11
    Parallel Processing for Data in HDFS . . . 12
    WHERE Processing Optimization with MapReduce . . . 15
    SPD Engine File System Locking . . . 18
    SPD Engine Distributed Locking . . . 20
    Updating Data in HDFS . . . 23
    Using SAS High-Performance Analytics Procedures . . . 24

Chapter 4: SPD Engine Reference . . . 27
    Overview: SPD Engine Reference . . . 27
    Dictionary . . . 28

Chapter 5: How to Use Hadoop Data Storage . . . 55
    Overview: How to Use Hadoop Data Storage . . . 56
    Example 1: Loading Existing SAS Data Using the COPY Procedure . . . 57
    Example 2: Creating a Data Set Using the DATA Step . . . 58
    Example 3: Adding to an Existing Data Set Using the APPEND Procedure . . . 59
    Example 4: Loading Oracle Data Using the COPY Procedure . . . 61
    Example 5: Analyzing Data Using the FREQ Procedure . . . 62
    Example 6: Managing SAS Files Using the DATASETS Procedure . . . 64
    Example 7: Setting the SPD Engine I/O Block Size . . . 68
    Example 8: Optimizing WHERE Processing with MapReduce . . . 69

Appendix 1: Hive SerDe for SPD Engine Data . . . 73
    Accessing SPD Engine Data Using Hive . . . 73
    Troubleshooting . . . 82

Recommended Reading . . . 85

Index . . . 87

What's New in the SAS 9.4 SPD Engine to Store Data in HDFS

Overview

In the second maintenance release for SAS 9.4, the SPD Engine has improved performance. The SPD Engine creates a SAS index much faster, sets a larger I/O block size and expands the scope of the block size, expands parallel processing support for Read operations, performs data filtering in the Hadoop cluster, and enables you to control the number of MapReduce tasks when writing data in HDFS.

In the third maintenance release for SAS 9.4, the SPD Engine expands the supported Hadoop distributions, enables parallel processing for Write operations, expands WHERE processing optimization with more WHERE expression syntax, enhances file system locking by enabling you to specify a pathname for the SPD Engine lock directory, supports distributed locking, and provides a custom Hive SerDe so that SPD Engine data stored in HDFS can be accessed using Hive.

Hadoop Distribution Support

In the third maintenance release for SAS 9.4, the SPD Engine has expanded the supported Hadoop distributions. For the list of supported Hadoop distributions, see Hadoop Distribution Support on page 6.

Improved Performance When Creating a SAS Index

In the second maintenance release for SAS 9.4, when you create a SAS index for a data set in HDFS, the performance of creating a large index is significantly improved because the index is partitioned. For more information, see Creating SAS Indexes on page 11.

Improved Performance by Setting the SPD Engine I/O Block Size

In the second maintenance release for SAS 9.4, the scope of the SPD Engine I/O block size is expanded. The default block size is larger at 1,048,576 bytes (1 megabyte). The block size affects compressed, uncompressed, and encrypted data sets. The block size influences the size of I/O operations when reading all data sets and writing compressed data sets. For more information, see I/O Operation Performance on page 11. To specify an I/O block size, use the IOBLOCKSIZE= data set option on page 40 or the new IOBLOCKSIZE= LIBNAME statement option on page 33.

Improved Performance of Reading Data in HDFS

In the second maintenance release for SAS 9.4, to improve the performance of reading data stored in HDFS, the SPD Engine has expanded its support of parallel processing. You can request parallel processing for all Read operations of data stored in HDFS. For more information, see Parallel Processing for Data in HDFS on page 12. To request parallel processing for all Read operations of data stored in HDFS, use the SPDEPARALLELREAD= system option on page 45, the PARALLELREAD= LIBNAME statement option on page 36, or the PARALLELREAD= data set option on page 42.

Improved Performance of Writing Data to HDFS

In the third maintenance release for SAS 9.4, you can now request parallel processing for all Write operations in HDFS. For more information, see Parallel Processing for Data in HDFS on page 12. To request parallel processing for Write operations, use the PARALLELWRITE= LIBNAME statement option on page 36 or the PARALLELWRITE= data set option on page 43.

Optimized WHERE Processing

To optimize the performance of WHERE processing, you can request that data subsetting be performed in the Hadoop cluster, which takes advantage of the filtering and ordering capabilities of the MapReduce framework. For more information, see WHERE Processing Optimization with MapReduce on page 15. To request that data subsetting be performed in the Hadoop cluster, use the ACCELWHERE= LIBNAME statement option on page 31 or the ACCELWHERE= data set option on page 39.

In the third maintenance release for SAS 9.4, optimized WHERE processing is expanded to include more operators and compound expressions. For more information, see WHERE Expression Syntax Support on page 16.

Controlling Tasks When Writing Data in HDFS

In the second maintenance release for SAS 9.4, to specify the number of MapReduce tasks when writing data in HDFS, you can use the NUMTASKS= LIBNAME statement option. This option controls parallel processing on the Hadoop cluster when writing output from a SAS High-Performance Analytics procedure. For more information, see the NUMTASKS= LIBNAME statement option on page 35.

SPD Engine File System Locking

In the second maintenance release for SAS 9.4, the SPD Engine implements a locking strategy that honors the HDFS concurrent access model and provides additional levels of concurrent access to ensure the integrity of the data stored in HDFS. For more information, see SPD Engine File System Locking on page 18.

In the third maintenance release for SAS 9.4, to store the lock files, the SPD Engine creates a lock directory in the /tmp directory. You can specify a pathname for the SPD Engine lock directory by defining the new SAS environment variable SPDELOCKPATH. For more information, see SPDELOCKPATH SAS Environment Variable on page 51.

SPD Engine Distributed Locking

In the third maintenance release for SAS 9.4, the SPD Engine supports distributed locking for data stored in HDFS. Distributed locking provides synchronization and group coordination services to clients over a network connection. For more information, see SPD Engine Distributed Locking on page 20.

To request SPD Engine distributed locking, you must first create an XML configuration file, and then define the SAS environment variable SPDE_CONFIG_FILE to specify the location of the user-defined XML file that is available to the SAS client machine. For more information, see SPDE_CONFIG_FILE SAS Environment Variable on page 46.

Configuring the SPD Engine to Store Data in HDFS

To store data in HDFS using the SPD Engine, required Hadoop JAR files and Hadoop cluster configuration files must be available to the SAS client machine. For information about configuring the SPD Engine, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.

Accessing SPD Engine Data Using Hive

In the third maintenance release for SAS 9.4, SAS provides a custom Hive SerDe for SPD Engine data that is stored in HDFS. The SerDe makes the data available for applications outside of SAS to query using HiveQL. For more information, see Appendix 1, Hive SerDe for SPD Engine Data, on page 73.


Chapter 1: Introduction to Storing Data in HDFS

Deciding to Store Data in HDFS . . . 1
Using the SPD Engine to Store Data in HDFS . . . 2
    What Is the SPD Engine? . . . 2
    Understanding the SPD Engine File Format . . . 2
    How to Use the SPD Engine . . . 3

Deciding to Store Data in HDFS

Storing data in the Hadoop Distributed File System (HDFS) is a good strategy for very large data sets. HDFS is a component of Apache Hadoop, which is an open-source software framework of tools that are written in Java. HDFS provides distributed data storage and processing of large amounts of data. Reasons for storing SAS data in HDFS include the following:

HDFS is a low-cost alternative for data storage. Organizations are exploring it as an alternative to commercial relational database solutions.

HDFS is well suited for distributed storage and processing using commodity hardware. It is fault tolerant, scalable, and simple to expand. HDFS manages files as blocks of equal size, which are replicated across the machines in a Hadoop cluster to provide fault tolerance.

SAS provides support within the current SAS product offering and product roadmap. SAS provides the ability to manage, process, and analyze data in HDFS.

Hadoop storage is for big data. If standard SAS optimization techniques such as indexes no longer meet your performance needs, then storing the data in HDFS could improve performance.

Using the SPD Engine to Store Data in HDFS

What Is the SPD Engine?

The SAS Scalable Performance Data (SPD) Engine is a scalable engine delivered to SAS customers as part of Base SAS. The SPD Engine is designed for high-performance data delivery, reading data sets that contain billions of observations. The engine uses threads to read data very rapidly and in parallel. The SPD Engine reads, writes, and updates data in HDFS. You can use the SPD Engine with standard SAS applications to retrieve data for analysis, perform administrative functions, and update the data.

Understanding the SPD Engine File Format

The SPD Engine organizes data into a streamlined file format that has advantages for a distributed file system like HDFS. The advantages of the SPD Engine file format include the following:

Data is separate from the metadata. The file format consists of separate files: one for data, one for metadata, and two for indexes. Each type of file has an identifying file extension. The extensions are .dpf for data, .mdf for metadata, and .hbx and .idx for indexes.

The SPD Engine file format partitions the data by spreading it across multiple files based on a partition size. Each partition is stored as a separate physical file with the extension .dpf. Depending on the amount of data and the partition size, the data can consist of one or more physical files, but is referenced as one logical file.

The default partition size is 128 megabytes. You can specify a different partition size with the PARTSIZE= LIBNAME statement option on page 38 or the PARTSIZE= data set option on page 41.

How to Use the SPD Engine

The SPD Engine works like other SAS data access engines. That is, you execute a LIBNAME statement to assign a libref, specify the engine, and connect to the Hadoop cluster. You then use that libref throughout the SAS session where a libref is valid. The libref is associated with a specific directory in the Hadoop cluster. Arguments in the LIBNAME statement specify a libref, the engine name, the pathname to a directory in the Hadoop cluster, and the HDFSHOST=DEFAULT argument to indicate that you want to connect to a Hadoop cluster. Here is an example of a LIBNAME statement to connect to a Hadoop cluster:

libname myspde spde '/user/abcdef' hdfshost=default;

To interface with Hadoop and connect to a specific Hadoop cluster, required Hadoop JAR files and Hadoop cluster configuration files must be available to the SAS client machine. To make the required files available, you must define two SAS environment variables to set the location of the required files. For more information about the SAS environment variables, see SAS and Hadoop Requirements on page 6.

Any data source that can be accessed with a SAS engine can be loaded into a Hadoop cluster using the SPD Engine. For example:

You can use the default Base SAS engine to access an existing SAS data set and the SPD Engine to connect to the Hadoop cluster. You can then use SAS code to load the data to the Hadoop cluster. See Example 1: Loading Existing SAS Data Using the COPY Procedure on page 57 and the brief sketch below.

You can use a SAS/ACCESS engine such as the SAS/ACCESS to Oracle engine to access an Oracle table and the SPD Engine to connect to the Hadoop cluster. You can then use SAS code to load the data to the Hadoop cluster. See Example 4: Loading Oracle Data Using the COPY Procedure on page 61.

Note: Most existing SAS programs can run with the SPD Engine with little modification other than to the LIBNAME statement. However, some limitations apply. For example, if your default Base SAS engine data has integrity constraints, then the integrity constraints are dropped when the data is converted for the SPD Engine. For more information about supported SAS file features, see Supported SAS File Features Using the SPD Engine on page 7.
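As a minimal sketch of the first approach, assuming an existing Base SAS library at /data/saslib that contains a data set named BigFile and a Hadoop directory /user/abcdef that you have Write access to:

/* Assumed locations: adjust the local library and the HDFS directory for your site */
libname local '/data/saslib';                          /* existing Base SAS library  */
libname myspde spde '/user/abcdef' hdfshost=default;   /* SPD Engine library in HDFS */

proc copy in=local out=myspde;
   select bigfile;   /* loads BigFile into /user/abcdef/bigfile_spde/ */
run;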

Chapter 2: Storing Data in HDFS

Overview: Storing Data in HDFS . . . 5
SAS and Hadoop Requirements . . . 6
    SAS Version . . . 6
    Hadoop Distribution Support . . . 6
    Configuring Hadoop JAR Files . . . 6
    Making Required Hadoop Cluster Configuration Files Available to Your Machine . . . 7
Supported SAS File Features Using the SPD Engine . . . 7
Security . . . 8

Overview: Storing Data in HDFS

To store data in HDFS using the SPD Engine, you must do the following:

Ensure that all version and configuration requirements are met. See SAS and Hadoop Requirements on page 6.

Understand which SAS file features are and are not supported when using the SPD Engine. See Supported SAS File Features Using the SPD Engine on page 7.

Use the LIBNAME statement for the SPD Engine to establish the connection to the Hadoop cluster. See LIBNAME Statement for HDFS on page 28.

SAS and Hadoop Requirements

SAS Version

To store data in HDFS using the SPD Engine, you must have the first maintenance release or later for SAS 9.4.

Note: Access to data in HDFS using the SPD Engine is not supported from a SAS session in the z/OS operating environment.

Hadoop Distribution Support

In the third maintenance release for SAS 9.4, the SPD Engine supports the following Hadoop distributions, with or without Kerberos:

Cloudera CDH 4.x
Cloudera CDH 5.x
Hortonworks HDP 2.x
IBM InfoSphere BigInsights 3.x
MapR 4.x (for Microsoft Windows and Linux operating environments only)
Pivotal HD 2.x

Configuring Hadoop JAR Files

To store data in HDFS using the SPD Engine, you must use a supported Hadoop distribution and configure a required set of Hadoop JAR files. The JAR files must be available to the SAS client machine. The SAS environment variable SAS_HADOOP_JAR_PATH must be defined and set to the location of the Hadoop JAR files. For complete instructions, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.
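As a minimal sketch, both environment variables (the JAR path described here and the configuration file path described in the next topic) can be set with OPTIONS SET= statements before the LIBNAME statement is submitted; the pathnames below are placeholders for your site's locations:

/* Placeholder paths; point these at the JAR files and configuration files
   provided by your Hadoop administrator */
options set=SAS_HADOOP_JAR_PATH="/opt/sas/hadoop/jars";
options set=SAS_HADOOP_CONFIG_PATH="/opt/sas/hadoop/conf";

/* With both variables set, the SPD Engine LIBNAME statement can connect */
libname myspde spde '/user/abcdef' hdfshost=default;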

Making Required Hadoop Cluster Configuration Files Available to Your Machine

Hadoop cluster configuration files contain information such as the name of the computer that hosts the Hadoop cluster and the TCP port. To connect to the Hadoop cluster, Hadoop configuration files must be copied from the specific Hadoop cluster to a physical location that the SAS client machine can access. The SAS environment variable SAS_HADOOP_CONFIG_PATH must be defined and set to the location of the Hadoop configuration files. For complete instructions, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.

Supported SAS File Features Using the SPD Engine

The following SAS file features are supported for data sets using the SPD Engine:

Encryption
File compression
Member-level locking
SAS indexes
SAS passwords
Special missing values
Physical ordering of returned observations
User-defined formats and informats

Note: When you create a data set, you cannot request both encryption and file compression.

The following SAS file features are not supported for data sets using the SPD Engine:

Audit trails
Cross-Environment Data Access (CEDA)
Extended attributes
Generation data sets
Integrity constraints
NLS support (such as to specify encoding for the data)
Record-level locking
SAS catalogs, SAS views, and MDDB files

The following SAS software does not support SPD Engine data sets:

SAS/CONNECT
SAS/SHARE

Security

HDFS supports defined levels of permissions at both the directory and file levels. The SPD Engine honors those permissions. For example, if the file is available as Read only, you cannot modify it. If the Hadoop cluster supports Kerberos, the SPD Engine honors Kerberos authentication and authorization as long as the Hadoop cluster configuration files are accessed. For more information about accessing the Hadoop cluster configuration files, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.

Restricting access to members of SAS libraries by assigning SAS passwords to the members is supported when a data set is stored in HDFS. You can specify three levels of permission: Read, Write, and Alter. For more information about SAS passwords, see SAS Language Reference: Concepts.
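As a small sketch of password protection, assuming the myspde libref from earlier and a hypothetical Salary data set, passwords are assigned with the standard READ=, WRITE=, and ALTER= data set options:

/* Hypothetical data set and password values; READ=, WRITE=, and ALTER=
   are standard SAS data set options for assigning member passwords */
data myspde.salary (read=readpw write=writepw alter=alterpw);
   set work.salary;
run;

/* Readers must supply the Read password to access the data */
proc print data=myspde.salary (read=readpw);
run;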

Chapter 3: Using the SPD Engine

Overview: Using the SPD Engine . . . 10
How the SPD Engine Supports Data Distribution . . . 10
I/O Operation Performance . . . 11
Creating SAS Indexes . . . 11
Parallel Processing for Data in HDFS . . . 12
    Overview: Parallel Processing for Data in HDFS . . . 12
    Parallel Processing Considerations . . . 14
    Tuning Parallel Processing Performance . . . 14
WHERE Processing Optimization with MapReduce . . . 15
    Overview: WHERE Processing Optimization with MapReduce . . . 15
    WHERE Expression Syntax Support . . . 16
    Data Set and SAS Code Requirements . . . 16
    Hadoop Requirements . . . 17
SPD Engine File System Locking . . . 18
    Overview: SPD Engine File System Locking . . . 18
    Requesting Read Access Lock Files . . . 19
    Specifying a Pathname for the SPD Engine Lock Directory . . . 20
SPD Engine Distributed Locking . . . 20
    Overview: SPD Engine Distributed Locking . . . 20
    Understanding the Service Provider . . . 21
    Requirements for SPD Engine Distributed Locking . . . 21
    Requesting Distributed Locking . . . 22
Updating Data in HDFS . . . 23
Using SAS High-Performance Analytics Procedures . . . 24

Overview: Using the SPD Engine

The SPD Engine reads, writes, and updates data in HDFS. Specific SPD Engine features are supported for Hadoop storage and are explained in this document. For more information about the SPD Engine and its features that are not specific to Hadoop storage, see SAS Scalable Performance Data Engine: Reference.

How the SPD Engine Supports Data Distribution

When loading data into a Hadoop cluster, the SPD Engine ensures that the data is distributed appropriately. The SPD Engine uses the SPD Engine partition size and the HDFS block size to compute the maximum number of observations that fit within both the partition size and the block size. That is, observations never span multiple partitions or multiple blocks. After a data set is loaded into a Hadoop cluster, the actual block size of the loaded data might be less than the block size that was defined by the Hadoop administrator. The size difference can result from the SPD Engine calculations regarding the partition size, block size, and observation length.

Note: Defragmenting the Hadoop cluster is not recommended. Changing the block size and re-creating the files could result in the data becoming inaccessible by SAS.
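As a brief sketch, assuming a Hadoop directory that you can write to, a larger partition size can be requested at the library level so that each data partition file holds more observations:

/* Assumed HDFS directory; PARTSIZE=256M requests 256-megabyte data
   partition files instead of the 128-megabyte default */
libname myspde spde '/user/abcdef' hdfshost=default partsize=256m;

data myspde.bigfile;
   set work.bigfile;
run;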

I/O Operation Performance

To improve I/O operation performance, consider setting a different SPD Engine I/O block size. The larger the block size, the less I/O. For example, when reading a data set, the block size can significantly affect performance. When retrieving a large percentage of the data, a larger block size improves performance. However, when retrieving a subset of the data such as with WHERE processing, a smaller block size performs better. You can specify a different block size with the IOBLOCKSIZE= LIBNAME statement option and the IOBLOCKSIZE= data set option. For more information, see the IOBLOCKSIZE= LIBNAME statement option on page 33 and the IOBLOCKSIZE= data set option on page 40.

Creating SAS Indexes

When you create a SAS index for a data set that is stored in HDFS, a large index could require a long time to create. To provide efficient index creation, the SPD Engine partitions the two index files (.hbx and .idx). The index files are spread across multiple files based on the index partition size, which is 2 megabytes. Even though the index files are partitioned, the PARTSIZE= option, which specifies a size for the SPD Engine data partition file, does not affect the index partition size. You cannot increase or decrease the index partition size.

To improve the performance of creating an index, consider these options:

Request that indexes be created in parallel, asynchronously. To enable asynchronous parallel index creation, use the ASYNCINDEX= data set option.

Request more temporary utility file space for sorting the data. To allocate an adequate amount of space for processing, use the SPDEUTILLOC= system option. Specify the utility file location on the SAS client machine, not on the Hadoop cluster.

Request larger memory space for the sorting utility to use when sorting values for creating an index. To specify the amount of memory, use the SPDEINDEXSORTSIZE= system option.

For more information about these options, see SAS Scalable Performance Data Engine: Reference.

Parallel Processing for Data in HDFS

Overview: Parallel Processing for Data in HDFS

Parallel processing uses multiple threads that run in parallel so that a large operation is divided into multiple smaller ones that are executed simultaneously. The SPD Engine supports parallel processing to improve the performance of reading and writing data stored in HDFS. By default, the SPD Engine performs parallel processing only if a Read operation includes WHERE processing. If the Read operation does not include WHERE processing, the Read operation is performed by a single thread.

To request parallel processing for all Read operations (for all SAS releases) and for Write operations (in the third maintenance release for SAS 9.4 only), use these options:

The SPDEPARALLELREAD= system option on page 45 to request parallel read processing for the SAS session.

The PARALLELREAD= LIBNAME statement option on page 36 to request parallel read processing when using the assigned libref.

The PARALLELREAD= data set option on page 42 to request parallel read processing for the specific data set.

In the third maintenance release for SAS 9.4, the PARALLELWRITE= LIBNAME statement option on page 36 to request parallel write processing when using the assigned libref.

In the third maintenance release for SAS 9.4, the PARALLELWRITE= data set option on page 43 to request parallel write processing for the specific data set.

Here is an example of the SPDEPARALLELREAD= system option to request parallel processing for all Read operations for the SAS session:

options spdeparallelread=yes;

In this example, the LIBNAME statement requests parallel processing for all Read operations using the assigned libref. By specifying the PARALLELREAD= LIBNAME statement option, parallel processing is performed for all Read operations using the Class libref:

libname class spde '/user/abcdef' hdfshost=default parallelread=yes;

proc freq data=class.studentid;
   tables age;
run;

In this example, the PARALLELREAD= data set option requests parallel processing for all Read operations for the Class.StudentID data set:

libname class spde '/user/abcdef' hdfshost=default;

proc freq data=class.studentid (parallelread=yes);
   tables age;
run;

Here is an example of the PARALLELWRITE= LIBNAME statement option to request parallel processing for all Write operations using the assigned libref. By specifying the PARALLELWRITE= LIBNAME statement option, parallel processing is performed for all Write operations using the Class libref:

libname class spde '/user/abcdef' hdfshost=default parallelwrite=yes;

TIP: To display information in the SAS log about parallel processing, set the MSGLEVEL= system option to I. When you set options msglevel=i;, the SAS log reports whether parallel processing is in effect.
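A parallel Write request can also be made for a single data set. This sketch assumes the same Class libref and an existing Work.StudentID data set to load; MSGLEVEL=I is set so that the log confirms whether parallel processing occurred:

options msglevel=i;     /* report parallel processing in the SAS log */

libname class spde '/user/abcdef' hdfshost=default;

/* PARALLELWRITE= as a data set option applies only to this output data set */
data class.studentid (parallelwrite=yes);
   set work.studentid;
run;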

Parallel Processing Considerations

The following are considerations for requesting parallel processing:

For some environments, parallel processing might not improve the performance. The availability of network bandwidth and the number of CPUs on the SAS client machine determine the performance improvement. It is recommended that you set up a test in your environment to measure performance with and without parallel processing.

When parallel read processing occurs, the order in which the observations are returned might not be in the physical order of the observations in the data set. Some applications require that observations be returned in the physical order. For example, the COMPARE procedure expects that observations are read from the data set in the same order that they were written to the data set. Also, legacy code that uses the DATA step or the OBS= data set option might rely on physical order to produce the expected results.

Tuning Parallel Processing Performance

To tune the performance of parallel processing, consider these SPD Engine options:

The SPD Engine SPDEMAXTHREADS= system option specifies the maximum number of threads that the SPD Engine uses for processing.

The SPD Engine THREADNUM= data set option specifies the maximum number of threads to use for the processing.

For more information about these options, see SAS Scalable Performance Data Engine: Reference.

Note: The Base SAS NOTHREADS and CPUCOUNT= system options have no effect on SPD Engine parallel processing.
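A sketch of these tuning options follows, with placeholder thread counts that you would size to the CPUs on your SAS client machine; the SPDEMAXTHREADS= invocation syntax shown in the comment is an assumption, so check your site's configuration method:

/* SPDEMAXTHREADS= is typically set in a configuration file or at SAS
   invocation, for example:  sas -spdemaxthreads 16   (placeholder value) */

libname class spde '/user/abcdef' hdfshost=default parallelread=yes;

/* THREADNUM= caps the number of threads used for this particular read */
proc freq data=class.studentid (threadnum=8);
   tables age;
run;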

WHERE Processing Optimization with MapReduce

Overview: WHERE Processing Optimization with MapReduce

WHERE processing enables you to conditionally select a subset of observations so that SAS processes only the observations that meet specified conditions. To optimize the performance of WHERE processing, you can request that data subsetting be performed in the Hadoop cluster. Then, when you submit SAS code that includes a WHERE expression (which defines the condition that selected observations must satisfy), the SPD Engine instantiates the WHERE expression as a Java class. The SPD Engine submits the Java class to the Hadoop cluster as a component in a MapReduce program.

By requesting that data subsetting be performed in the Hadoop cluster, performance might be improved by taking advantage of the filtering and ordering capabilities of the MapReduce framework. As a result, only the subset of the data is returned to the SAS client. Performance is often improved with large data sets when the WHERE expression qualifies only a relatively small subset. By default, data subsetting is performed by the SPD Engine on the SAS client. To request that data subsetting be performed in the Hadoop cluster, you must specify the ACCELWHERE= LIBNAME statement option on page 31 or the ACCELWHERE= data set option on page 39.

Here is an example of a LIBNAME statement that connects to a Hadoop cluster and requests that data subsetting be performed in the Hadoop cluster. By specifying the ACCELWHERE= LIBNAME statement option, subsequent WHERE processing for all data sets accessed with the Class libref is performed in the Hadoop cluster.

libname class spde '/user/abcdef' hdfshost=default accelwhere=yes;

proc freq data=class.studentid;
   tables age;
   where age gt 14;
run;

In this example, the ACCELWHERE= data set option requests that data subsetting be performed in the Hadoop cluster. The WHERE processing for the Class.StudentID data set is performed in the Hadoop cluster. WHERE processing for any other data set with the Class libref is performed by the SPD Engine on the SAS client machine.

libname class spde '/user/abcdef' hdfshost=default;

proc freq data=class.studentid (accelwhere=yes);
   tables age;
   where age gt 14;
run;

WHERE Expression Syntax Support

In the third maintenance release for SAS 9.4, WHERE processing optimization is expanded. WHERE processing optimization supports the following syntax:

comparison operators such as EQ (=), NE (^=), GT (>), LT (<), GE (>=), LE (<=)

IN operator

full bounded range condition, such as where 500 <= empnum <= 1000;

BETWEEN-AND operator, such as where empnum between 500 and 1000;

compound expressions using the logical operators AND, OR, and NOT, such as where skill = 'java' or years = 4;

parentheses to control the order of evaluation, such as where (product='graph' or product='stat') and country='canada';

Data Set and SAS Code Requirements

To perform the data subsetting in the Hadoop cluster, the following data set and SAS code requirements must be met. If any of these requirements are not met, the subsetting of the data is performed by the SPD Engine, not by a MapReduce program in the Hadoop cluster.

The data set cannot be encrypted.

The data set cannot be compressed.

The data set must be larger than the HDFS block size.

The submitted SAS code cannot request BY-group processing.

The submitted SAS code cannot include the STARTOBS= or ENDOBS= options.

The LIBNAME statement cannot include the HDFSUSER= option.

The submitted WHERE expression cannot include any of the following syntax:

a variable as an operand, such as where lastname;

variable-to-variable comparison

SAS functions, such as SUBSTR, TODAY, UPCASE, and PUT

arithmetic operators *, /, +, -, and **

IS NULL or IS MISSING and IS NOT NULL or IS NOT MISSING operators

concatenation operator, such as || or !!

negative prefix operator, such as where z = -(x + y);

pattern-matching operators LIKE and CONTAINS

sounds-like operator SOUNDEX (=*)

truncated comparison operator using the colon (:) modifier, such as where lastname=: 'S';

TIP: To display information in the SAS log regarding WHERE processing optimization, set the MSGLEVEL= system option to I. When you issue options msglevel=i;, the SAS log reports whether the data filtering occurred in the Hadoop cluster. If the optimization occurred, the Hadoop Job ID is displayed in the SAS log. If the optimization did not occur, additional messages explain why.

Hadoop Requirements

To perform the data subsetting in the Hadoop cluster, the following Hadoop requirements must be met.

The Hadoop configuration file must include the properties to run MapReduce (MR1) or MapReduce 2 (MR2) and YARN.

The JRE version for the Hadoop cluster must be either 1.6, which is the default, or 1.7. If the JRE version is 1.7, use the ACCELJAVAVERSION= LIBNAME statement option on page 31 to specify the version.

SPD Engine File System Locking

Overview: SPD Engine File System Locking

The HDFS concurrent access model allows multiple readers and a single writer. If an application accesses a file to write to it, no other application can write to the file, but multiple applications can read the file. The SPD Engine supports a file system locking strategy that honors the HDFS concurrent access model and provides additional levels of concurrent access to ensure the integrity of the data stored in HDFS.

By default, the SPD Engine creates a Write access lock file when a data set stored in HDFS is opened for Write access. With the Write access lock file, no other SAS session can write to the file, but multiple SAS sessions can read the file if the readers accessed the data set before the Write access lock file was created. During concurrent access, the following describes the results of the default SPD Engine locking mechanism:

Once a SAS session opens a data set for Write access, any previous readers can continue to access the data set. However, the readers could experience unexpected data results. For example, the writer could delete the data set while the readers are accessing the data set.

Once a SAS session opens a data set for Write access, any subsequent reader is not allowed to access the data set.

With the Write access locking mechanism, a lock error message occurs in these situations:

When a SAS session requests Write access to a data set that another SAS session has open for Write access.

When a SAS session requests Read access to a data set that another SAS session has open for Write access.

When a SAS session requests to delete a data set that another SAS session has open for Write access.

In the third maintenance release for SAS 9.4, to store the lock files, the SPD Engine creates a lock directory in the /tmp directory. The lock directory name includes the name of the data set, an eight-character hexadecimal value (which is the checksum of the Hadoop cluster directory that contains the data set), and the suffix _spdslock9, such as BigFile_0000393a_spdslock9. In most situations, you will not see the lock directory because lock files are deleted when the process completes.

TIP: In some situations, such as an abnormal termination of a SAS session, lock files might not be properly deleted. The leftover lock files could prohibit access to a data set. If this occurs, the leftover lock files must be manually deleted by submitting HDFS commands.

Requesting Read Access Lock Files

In some situations, you might want to control the level of concurrent access to guarantee the integrity of the data by requesting that a Read access lock file be created. To request a Read access lock file, define the SAS environment variable SPDEREADLOCK and set it to YES. Then, when a SAS session opens a data set for Read access, a Read access lock file is created in addition to any Write access lock files. For more information, see SPDEREADLOCK SAS Environment Variable on page 52.

With the Read and Write access locking mechanism, a lock error message occurs in these situations:

When a SAS session requests Write access to a data set that another SAS session has open for either Read or Write access.

When a SAS session requests Read access to a data set that another SAS session has open for Write access.

When a SAS session requests to delete a data set that another SAS session has open for either Read or Write access.

Note: When you request a Read access lock file, all data access, even for Read access, requires Write permission to the Hadoop cluster.

TIP: By creating both Read and Write access lock files, the possibility of leftover lock files is increased. If you experience situations such as an abnormal termination of a SAS session, lock files that were not properly deleted must be manually deleted by submitting HDFS commands.

Specifying a Pathname for the SPD Engine Lock Directory

By default, for HDFS concurrent access, the SPD Engine creates a lock directory in the /tmp directory. The lock directory name includes the name of the data set, an eight-character hexadecimal value (which is the checksum of the Hadoop cluster directory that contains the data set), and the suffix _spdslock9, such as BigFile_0000393a_spdslock9. In the third maintenance release for SAS 9.4, you can specify a pathname for the SPD Engine lock directory by defining the SAS environment variable SPDELOCKPATH to specify a directory in the Hadoop cluster. For more information, see SPDELOCKPATH SAS Environment Variable on page 51.

SPD Engine Distributed Locking

Overview: SPD Engine Distributed Locking

In the third maintenance release for SAS 9.4, the SPD Engine supports distributed locking for data stored in HDFS. Distributed locking provides synchronization and group coordination services to clients over a network connection. For the service provider, the SPD Engine uses the Apache ZooKeeper coordination service, specifically the implementation of the recipe for Shared Lock that is provided by Apache Curator. Distributed locking provides the following benefits:

The lock server maintains the lock state information in memory and does not require Write permission to any client or data library disk storage locations.

A process requesting a lock on a data set that is not available (because the data set is already locked) can choose to wait for the data set to become available, rather than having the lock request fail immediately.

If a process abnormally terminates while holding locks on data sets, the lock server automatically drops all locks that the client was holding, which eliminates the possibility of leftover lock files.

Understanding the Service Provider

Apache ZooKeeper is an open-source distributed server that enables reliable distributed coordination for distributed client applications over a network. ZooKeeper safely coordinates access to shared resources with other applications or processes. At its core, ZooKeeper is a fault-tolerant multi-machine server that maintains a virtual hierarchy of data nodes that store coordination data. For more information about ZooKeeper and the ZooKeeper data nodes, see Apache ZooKeeper.

Apache Curator is a high-level API that simplifies using ZooKeeper. Curator adds many features that build on ZooKeeper and handles the complexity of managing connections to the ZooKeeper cluster. For more information about Curator, see Curator. The SPD Engine accesses the Curator API to provide the locking services.

Requirements for SPD Engine Distributed Locking

SPD Engine distributed locking has the following requirements:

ZooKeeper 3.4.0 or later must be downloaded, installed, and running on the Hadoop cluster. The zookeeper JAR file is required.

Curator 2.7.0 or later must be downloaded on the Hadoop cluster. The following Curator JAR files are required:

curator-client
curator-framework
curator-recipes

The following Hadoop distribution JAR files are required on the client side:

guava
log4j
slf4j

The JAR files must be available to the SAS client machine. The SAS environment variable SAS_HADOOP_JAR_PATH must be defined and set to the location of the Hadoop JAR files. For complete instructions, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.

TIP: To be effective, all access to SPD data sets must use the same locking method. If some processes or instances use distributed locking and others do not, proper coordination of access to the data sets cannot be guaranteed, and at a minimum, lock failures will be encountered.

Requesting Distributed Locking

To request distributed locking, you must first create an XML configuration file that contains information so that the SPD Engine can communicate with ZooKeeper. The format of the XML is similar to Hadoop configuration files in that the XML contains properties and attributes as name-value pairs. For an example of an XML configuration file, see XML Configuration File on page 46.

In addition, you must define the SAS environment variable SPDE_CONFIG_FILE to specify the location of the user-defined XML configuration file. The location must be available to the SAS client machine. For more information, see SPDE_CONFIG_FILE SAS Environment Variable on page 46.
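As a minimal sketch, assuming the XML configuration file has already been created and saved at a site-specific location, the environment variable can be defined with an OPTIONS SET= statement before the library is assigned:

/* Placeholder path to the user-defined ZooKeeper/Curator XML configuration file */
options set=SPDE_CONFIG_FILE="/opt/sas/spde/zookeeper_config.xml";

libname myspde spde '/user/abcdef' hdfshost=default;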

Updating Data in HDFS

HDFS does not support updating data. However, because traditional SAS processing involves updating data, the SPD Engine supports SAS Update operations for data stored in HDFS. To update data in HDFS, the SPD Engine uses an approach that replaces the data set's data partition file for each observation that is updated. When an update is requested, the SPD Engine re-creates the data partition file in its entirety (including all replications), and then inserts the updated data into the new data partition file. Because the data partition file is replaced for each observation that is updated, the greater the number of observations to be updated, the longer the process.

For a general-purpose data storage engine like the SPD Engine, the ability to perform small, infrequent updates can be beneficial. However, updating data in HDFS is intended for situations when the time it takes to complete the update outweighs the alternatives. The following are best practices for Update operations using the SPD Engine:

It is recommended that you set up a test in your environment to measure Update operation performance. For example, update a small number of observations to gauge how long updates take in your environment. Then, project the test results to a larger number of observations to determine whether updating is realistic.

It is recommended that you do not use the SQL procedure to update data in HDFS because of how PROC SQL opens, updates, and closes a file. Other SAS methods, such as the DATA step UPDATE statement and MODIFY statement (sketched below), provide better performance.

The performance of appending a data set can be slower if the data set has a unique index. Test case results show that appending a data set to another data set without a unique index was significantly faster than appending the same data set to another data set with a unique index.
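A hedged sketch of an in-place update with the MODIFY statement follows. It assumes a hypothetical StudentID data set with an ID key variable; keep in mind that each updated observation rewrites its entire data partition file.

libname myspde spde '/user/abcdef' hdfshost=default;

/* Locate one observation by key and rewrite it in place */
data myspde.studentid;
   modify myspde.studentid;
   where id = 1234;      /* assumed key variable and value */
   age = age + 1;
   replace;              /* write the modified observation back */
run;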

Using SAS High-Performance Analytics Procedures

You can use the SPD Engine with SAS High-Performance Analytics procedures to read and write the SPD Engine file format in HDFS. In many cases, the SPD Engine data used by the procedures can be read and written in parallel using the SAS Embedded Process.

The following are requirements for a SAS Embedded Process parallel read:

Access to the machines in the cluster where a SAS High-Performance Analytics deployment of Hadoop is installed and running.

The data set cannot be encrypted or compressed.

The STARTOBS= and ENDOBS= data set options cannot be specified.

The following are requirements for a SAS Embedded Process parallel write:

The ALIGN=, COMPRESS=, ENCRYPT=, and PADCOMPRESS= data set options cannot be specified.

The SAS client machine must have a data representation that is compatible with the data representation of the Hadoop cluster. The SAS client machine must be either Linux x64 or Solaris x64.

The following are best practices when using the SPD Engine with SAS High-Performance Analytics procedures:

With SAS Enterprise Miner, a SAS process can be terminated in such a way that the SPD Engine does not follow normal shutdown procedures, which can result in a lock file not being deleted. The orphan lock file could prevent a subsequent open of the data set. If this occurs, the orphan lock file must be manually deleted by submitting Hadoop commands. To delete the orphan lock file, you can use the HADOOP procedure to submit Hadoop commands.

For SAS High-Performance Analytics Work files, the SPD Engine uses the standard UNIX temporary directory /tmp. To override the default Work directory, you can define the SAS environment variable SPDE_HADOOP_WORK_PATH to specify a directory in the Hadoop cluster. The directory must exist and you must have Write access. For example, the following OPTIONS statement sets the Work directory:

options set=spde_hadoop_work_path="/sasdata/cluster1/hpawork";

For more information, see SPDE_HADOOP_WORK_PATH SAS Environment Variable on page 50.


Chapter 4: SPD Engine Reference

Overview: SPD Engine Reference . . . 27
Dictionary . . . 28
    LIBNAME Statement for HDFS . . . 28
    ACCELWHERE= Data Set Option for HDFS . . . 39
    IOBLOCKSIZE= Data Set Option for HDFS . . . 40
    PARTSIZE= Data Set Option for HDFS . . . 41
    PARALLELREAD= Data Set Option for HDFS . . . 42
    PARALLELWRITE= Data Set Option for HDFS . . . 43
    SPDEPARALLELREAD= System Option for HDFS . . . 45
    SPDE_CONFIG_FILE SAS Environment Variable . . . 46
    SPDE_HADOOP_WORK_PATH SAS Environment Variable . . . 50
    SPDELOCKPATH SAS Environment Variable . . . 51
    SPDEREADLOCK SAS Environment Variable . . . 52

Overview: SPD Engine Reference

The SPD Engine reads, writes, and updates data in HDFS. A specific SPD Engine LIBNAME statement and options are provided for Hadoop storage and are explained in this document. For more information about the SPD Engine LIBNAME statement and options that are not specific to Hadoop storage, see SAS Scalable Performance Data Engine: Reference.

Dictionary

LIBNAME Statement for HDFS

Associates a libref with a Hadoop cluster to read, write, and update a data set in HDFS.

Restrictions:
The SPD Engine LIBNAME statement arguments that are specific to HDFS are not supported in the z/OS operating environment.
You can connect to only one Hadoop cluster at a time per SAS session. You can submit multiple LIBNAME statements to different directories in the Hadoop cluster, but there can be only one Hadoop cluster connection per SAS session.

Requirements:
To associate a libref with a Hadoop cluster, you must have the first maintenance release or later for SAS 9.4.
Supported Hadoop distributions: Cloudera CDH 4.x, Cloudera CDH 5.x, Hortonworks HDP 2.x, IBM InfoSphere BigInsights 3.x, MapR 4.x (Microsoft Windows and Linux only), Pivotal HD 2.x, with or without Kerberos.
To store data in HDFS using the SPD Engine, you must use a supported Hadoop distribution and configure a required set of Hadoop JAR files. The JAR files must be available to the SAS client machine. The SAS environment variable SAS_HADOOP_JAR_PATH must be defined and set to the location of the Hadoop JAR files. For complete instructions, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.
To connect to the Hadoop cluster, Hadoop configuration files must be copied from the specific Hadoop cluster to a physical location that the SAS client machine can access. The SAS environment variable SAS_HADOOP_CONFIG_PATH must be defined and set to the location of the Hadoop configuration files. For complete instructions, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.

Example:
Chapter 5, How to Use Hadoop Data Storage, on page 55

Syntax

LIBNAME libref SPDE 'primary-pathname' HDFSHOST=DEFAULT
    <ACCELJAVAVERSION=version>
    <ACCELWHERE=NO | YES>
    <DATAPATH=('pathname')>
    <HDFSUSER=ID>
    <IOBLOCKSIZE=n>
    <NUMTASKS=n>
    <PARALLELREAD=NO | YES>
    <PARALLELWRITE=NO | YES | threads>
    <PARTSIZE=n | nM | nG | nT>;

Summary of Optional Arguments

ACCELJAVAVERSION=version
When requesting that WHERE processing be optimized by being performed in the Hadoop cluster, specifies the Java Runtime Environment (JRE) version for the Hadoop cluster.

ACCELWHERE=NO | YES
Determines whether WHERE processing is optimized by data subsetting being performed in the Hadoop cluster.

DATAPATH=('pathname')
When creating a data set, specifies the fully qualified pathname to a directory in the Hadoop cluster to store data partition files.

HDFSUSER=ID
Is an authorized user ID on the Hadoop cluster.

IOBLOCKSIZE=n
Specifies a size in bytes of a block of observations to be used in an I/O operation.

NUMTASKS=n
Specifies the number of MapReduce tasks when writing data in HDFS.

PARALLELREAD=NO | YES
Determines when the SPD Engine uses parallel processing to read data stored in HDFS.

PARALLELWRITE=NO | YES | threads
Determines whether the SPD Engine uses parallel processing to write data in HDFS.

PARTSIZE=n | nM | nG | nT
Specifies a size for the SPD Engine data partition file.

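For orientation, a minimal sketch of the statement follows. The libref and primary pathname are hypothetical placeholders; replace them with values for your own cluster. The optional arguments summarized above can be appended as needed.

libname myspde spde '/user/abcdef' hdfshost=default;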
30 Chapter 4 / SPD Engine Reference Required Arguments libref is a valid SAS library name that serves as a shortcut name to associate with a data set in a Hadoop cluster. The name can be up to eight characters long and must conform to the rules for SAS names. SPDE is the engine name for the SAS Scalable Performance Data (SPD) Engine. 'primary-pathname' specifies the fully qualified pathname to a directory in the Hadoop cluster, which is typically UNIX based. Enclose the primary pathname in single or double quotation marks. An example is '/user/abcdef/'. When data is loaded into a Hadoop cluster directory, the SPD Engine automatically creates a subdirectory with the specified data set name and the suffix _spde. The SPD Engine data partition files are contained in that subdirectory. For example, if you load a data set named BigFile into the directory /user/abcdef/, the data partition files are located at /user/abcdef/bigfile_spde/. The SPD Engine metadata and index files are located at /user/abcdef/. Restrictions Maximum length is 260 characters for Windows and 1024 characters for UNIX. The primary pathname must be unique for each assigned libref. Assigned librefs that are different but reference the same primary pathname can result in lost data. Requirement Interaction You must use valid directory syntax for the host. The pathname must be recognized by the operating environment. You can specify a different location to store the data partition files with the DATAPATH= option on page 32. HDFSHOST=DEFAULT specifies that you want to connect to the Hadoop cluster that is defined in the Hadoop cluster configuration files. The SPD Engine locates the Hadoop cluster

LIBNAME Statement for HDFS 31 configuration files using the SAS_HADOOP_CONFIG_PATH environment variable. The environment variable sets the location of the configuration files for a specific cluster. For more information about the SAS_HADOOP_CONFIG_PATH environment variable, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS. Requirement You must specify the HDFSHOST=DEFAULT argument. Optional Arguments ACCELJAVAVERSION=version When requesting that WHERE processing be optimized by being performed in the Hadoop cluster, specifies the Java Runtime Environment (JRE) version for the Hadoop cluster. The value must be either 1.6 or 1.7. Default 1.6 Interaction Example To request that data subsetting be performed in the Hadoop cluster, use the ACCELWHERE= LIBNAME statement option on page 31. By default, data subsetting is performed by the SPD Engine on the SAS client. Example 8: Optimizing WHERE Processing with MapReduce on page 69 ACCELWHERE=NO YES Determines whether WHERE processing is optimized by data subsetting being performed in the Hadoop cluster. NO specifies that data subsetting is performed by the SPD Engine on the SAS client. This is the default setting. YES specifies that data subsetting is performed by a MapReduce program in the Hadoop cluster.

32 Chapter 4 / SPD Engine Reference

Requirements: To perform data subsetting in the Hadoop cluster, there are data set and SAS code requirements. See WHERE Processing Optimization with MapReduce on page 15. To submit the MapReduce program to the Hadoop cluster, the Hadoop configuration file must include the properties to run MapReduce (MR1) or MapReduce 2 (MR2) and YARN.
Interactions: If the JRE version for the Hadoop cluster is 1.7 instead of the default version 1.6, use the ACCELJAVAVERSION= LIBNAME statement option on page 31 to specify the version. The ACCELWHERE= data set option overrides the ACCELWHERE= LIBNAME statement option. For more information, see ACCELWHERE= data set option on page 39.
Default: NO
Example: Example 8: Optimizing WHERE Processing with MapReduce on page 69

DATAPATH=('pathname')
When creating a data set, specifies the fully qualified pathname to a directory in the Hadoop cluster to store data partition files. Enclose the pathname in single or double quotation marks within parentheses. An example is datapath=('/sasdata'). When data is loaded into the directory, a subdirectory is automatically created with the specified data set name and the suffix _spde. The SPD Engine data partition files are contained in that subdirectory. For example, if you load a data set named BigFile into the directory /user/abcdef/ and specify datapath=('/sasdata/'), the data partition files are located at /sasdata/bigfile_spde/. The SPD Engine metadata and index files are located at /user/abcdef/.

LIBNAME Statement for HDFS 33 Restrictions You can specify only one pathname to store data partition files. Maximum length is 260 characters for Windows and 1024 characters for UNIX. The pathname must be unique for each assigned libref. Assigned librefs that are different but reference the same pathname can result in lost data. Requirement Interaction You must use valid directory syntax for the host. The pathname must be recognized by the operating environment. Specifying the DATAPATH= option overrides the primary pathname for storing the data partition files only. The SPD Engine metadata and index files are always stored in the primary pathname. HDFSUSER=ID Is an authorized user ID on the Hadoop cluster. You can specify a user ID to connect to the Hadoop cluster with a different ID than your current logon ID. Restrictions If the HDFSUSER= option is specified, Kerberos authentication is bypassed, which prevents access to a secure Hadoop cluster. If the HDFSUSER= option is specified, WHERE processing optimization with the ACCELWHERE= option cannot be performed in the Hadoop cluster. HDFSUSER= is not supported by a MapR Apache Hadoop distribution. IOBLOCKSIZE=n Specifies a size in bytes of a block of observations to be used in an I/O operation. The I/O block size determines the amount of data that is physically transferred together in an I/O operation. The larger the block size, the less I/O. The SPD Engine

34 Chapter 4 / SPD Engine Reference uses blocks in memory to collect the observations to be written to or read from a data component file. The IOBLOCKSIZE= option specifies the size of the block. (The actual size is computed to accommodate the largest number of observations that fit in the specified size of n bytes. Therefore, the actual size is a multiple of the observation length.) The block size affects I/O operations for compressed, uncompressed, and encrypted data sets. However, the effects are different and depend on the I/O operation. For a compressed data set, the block size determines how many observations are compressed together, which determines the amount of data that is physically transferred for both Read and Write operations. The block size is a permanent attribute of the file. To specify a different block size, you must copy the data set to a new data set, and then specify a new block size for the output file. For a compressed data set, a larger block size can improve performance for both Read and Write operations. For an encrypted data set, the block size is a permanent attribute of the file. For an uncompressed data set, the block size determines the size of the blocks that are used to read the data from disk to memory. The block size has no effect when writing data to disk. For an uncompressed data set, the block size is not a permanent attribute of the file. That is, you can specify a different block size based on the Read operation that you are performing. For example, reading data that is randomly distributed or reading a subset of the data calls for a smaller block size because accessing smaller blocks is faster than accessing larger blocks. In contrast, reading data that is uniformly or sequentially distributed or that requires a full data set scan works better with a larger block size.
Default: 1,048,576 bytes (1 megabyte)
Ranges: The minimum block size is 32,768 bytes. The maximum block size is half the size of the SPD Engine data partition file.
Restriction: The SPD Engine I/O block size must be smaller than or equal to the Hadoop cluster block size.

LIBNAME Statement for HDFS 35
Interaction: The IOBLOCKSIZE= data set option overrides the IOBLOCKSIZE= LIBNAME statement option. For more information, see IOBLOCKSIZE= Data Set Option for HDFS on page 40.
Tip: When reading a data set, the block size can significantly affect performance. If retrieving a large percentage of the data, a larger block size improves performance. However, if retrieving a subset of the data (such as with WHERE processing), a smaller block size performs better.
Example: Example 7: Setting the SPD Engine I/O Block Size on page 68

NUMTASKS=n
Specifies the number of MapReduce tasks when writing data in HDFS. This option controls parallel processing on the Hadoop cluster when writing output from a SAS High-Performance Analytics procedure using the SAS Embedded Process. When a high-performance procedure reads and writes Hadoop data, and the amount of output data is similar to the amount of input data, the same number of output tasks as input tasks should be a good default. However, if the amount of output data differs significantly from the amount of input data, you should use this option to tune the number of tasks proportionally to the output data.
Default: The number of MapReduce tasks is the number of SAS High-Performance Analytics nodes. Or, if the high-performance procedure reads a Hadoop file as input, it is the number of tasks that were used to read the input file.
Restriction: This option affects writing data in HDFS only when a high-performance procedure writes output to HDFS using the SAS Embedded Process.
Interaction: If the specified number of MapReduce tasks is less than the number of SAS High-Performance Analytics nodes on which the procedure runs, the setting is ignored.

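As a sketch of how a few of the arguments documented on the preceding pages might be combined, the following LIBNAME statement stores data partition files in a separate directory and sets a two-megabyte I/O block size. The libref, pathnames, and block size are hypothetical.

libname myspde spde '/user/abcdef' hdfshost=default
   datapath=('/sasdata')
   ioblocksize=2097152;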
36 Chapter 4 / SPD Engine Reference PARALLELREAD=NO YES Determines when the SPD Engine uses parallel processing to read data stored in HDFS. NO specifies that parallel processing occurs only if a Read operation includes WHERE processing. This is the default behavior for the SPD Engine. YES specifies parallel processing for all Read operations using the assigned libref. Default Interactions NO The SET statement POINT= option is inconsistent with parallel processing. When parallel processing occurs, the order in which the observations are returned might not be in the physical order of the observations in the data set. Some applications require that observations be returned in the physical order. The PARALLELREAD= LIBNAME statement option overrides the SPDEPARALLELREAD= system option. For more information, see SPDEPARALLELREAD= System Option for HDFS on page 45. The PARALLELREAD= LIBNAME statement option can be overridden by the PARALLELREAD= data set option. For more information, see PARALLELREAD= Data Set Option for HDFS on page 42. See Parallel Processing for Data in HDFS on page 12 PARALLELWRITE=NO YES threads Determines whether the SPD Engine uses parallel processing to write data in HDFS. NO specifies that parallel processing for a Write operation does not occur. This is the default behavior for the SPD Engine.

LIBNAME Statement for HDFS 37
YES
specifies parallel processing for all Write operations using the assigned libref. A thread is used for each CPU on the SAS client machine. For example, if eight CPUs exist on the SAS client machine, then eight threads are used to write data.
threads
specifies parallel processing for all Write operations using the assigned libref and specifies the number of threads to use for the Write operations.
Default: The default is 1, which specifies that parallel processing for a Write operation does not occur.
Range: 2 to 512

Default: NO
Restrictions: You cannot use parallel processing for a Write operation and also request to create a SAS index. You cannot use parallel processing for a Write operation and also request BY-group processing or sorting.
Interactions: When parallel Write processing occurs, the order in which the observations are written is unpredictable. The order in which the observations are returned cannot be determined unless the application imposes ordering criteria. The SPD Engine SPDEMAXTHREADS= system option specifies the maximum number of threads that the SPD Engine uses for processing. For more information, see SAS Scalable Performance Data Engine: Reference. The PARALLELWRITE= LIBNAME statement option can be overridden by the PARALLELWRITE= data set option. For more information, see PARALLELWRITE= Data Set Option for HDFS on page 43.

38 Chapter 4 / SPD Engine Reference
Note: The PARALLELWRITE= LIBNAME statement option is available in the third maintenance release for SAS 9.4.
See: Parallel Processing for Data in HDFS on page 12

PARTSIZE=n | nM | nG | nT
Specifies a size for the SPD Engine data partition file. Each partition is stored as a separate file with the file extension .dpf. Depending on the amount of data and the partition size, the data can consist of one or more physical files, but is referenced as one logical file. The value is specified in megabytes, gigabytes, or terabytes. If n is specified without M, G, or T, the default is megabytes. That is, partsize=64 is the same as partsize=64m.
Default: 128 megabytes
Restrictions: The minimum value is 16 megabytes. The maximum value is 8,796,093,022,207 megabytes.
Interaction: The PARTSIZE= data set option overrides the PARTSIZE= LIBNAME statement option. For more information, see PARTSIZE= Data Set Option for HDFS on page 41.
Tip: To update data, a smaller partition size provides the best performance. For example, when you update a value, the SPD Engine locates the appropriate partition, modifies the value, and rewrites all replications of the partition. Because each update requires that the partition be rewritten, it is recommended that you perform updates only occasionally or set a small partition size if you are planning to update the data frequently.

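The following sketch combines the partition size and parallel processing arguments described above on one LIBNAME statement. The libref, pathname, and partition size are hypothetical, and the PARALLELWRITE= option requires the third maintenance release for SAS 9.4.

libname myspde spde '/user/abcdef' hdfshost=default
   partsize=256m
   parallelread=yes
   parallelwrite=yes;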
ACCELWHERE= Data Set Option for HDFS 39 ACCELWHERE= Data Set Option for HDFS Determines whether WHERE processing is optimized by data subsetting being performed in the Hadoop cluster. Valid in: Category: Default: Requirements: Interaction: DATA step and PROC step Data Set Control NO To perform data subsetting in the Hadoop cluster, there are data set and SAS code requirements. For more information, see WHERE Processing Optimization with MapReduce on page 15. To submit the MapReduce program to the Hadoop cluster, the Hadoop configuration file must include the properties to run MapReduce (MR1) or MapReduce 2 (MR2) and YARN. If the JRE version for the Hadoop cluster is 1.7 instead of the default 1.6 version, use the ACCELJAVAVERSION= LIBNAME statement option on page 31 to specify the version. Syntax ACCELWHERE=NO YES Syntax Description NO specifies that data subsetting is performed by the SPD Engine on the SAS client. This is the default setting. YES specifies that data subsetting is performed by a MapReduce program in the Hadoop cluster. Comparisons The ACCELWHERE= data set option overrides the ACCELWHERE= LIBNAME statement option. See Also ACCELWHERE= LIBNAME statement option on page 31

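A minimal sketch of the data set option form follows. It assumes that a libref named MySpde has already been assigned to a Hadoop cluster directory and that a data set named BigFile with a numeric variable named Amount exists there; all three names are hypothetical.

proc print data=myspde.bigfile (accelwhere=yes);
   where amount > 1000;
run;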
40 Chapter 4 / SPD Engine Reference

IOBLOCKSIZE= Data Set Option for HDFS
Specifies a size in bytes of a block of observations to be used in an I/O operation.
Valid in: DATA step and PROC step
Category: Data Set Control
Default: 1,048,576 bytes (1 megabyte)
Ranges: The minimum block size is 32,768 bytes. The maximum block size is half the size of the SPD Engine data partition file.
Restriction: The SPD Engine I/O block size must be smaller than or equal to the Hadoop cluster block size.
Tip: When reading a data set, the block size can significantly affect performance. If retrieving a large percentage of the data, a larger block size improves performance. However, if retrieving a subset of the data (such as with WHERE processing), a smaller block size performs better.
Example: Example 7: Setting the SPD Engine I/O Block Size on page 68

Syntax
IOBLOCKSIZE=n

Syntax Description
n
is the size in bytes of a block of observations.

Details
The I/O block size determines the amount of data that is physically transferred together in an I/O operation. The larger the block size, the less I/O is required. The SPD Engine uses blocks in memory to collect the observations to be written to or read from a data component file. The IOBLOCKSIZE= data set option specifies the size of the block. (The actual size is computed to accommodate the largest number of observations that fit in the specified size of n bytes. Therefore, the actual size is a multiple of the observation length.) The block size affects I/O operations for compressed, uncompressed, and encrypted data sets. However, the effects are different and depend on the I/O operation.

For a compressed data set, the block size determines how many observations are compressed together, which determines the amount of data that is physically transferred for both Read and Write operations. The block size is a permanent attribute of the file. To specify a different block size, you must copy the data set to a new data set, and then specify a new block size for the output file. For a compressed data set, a larger block size can improve performance for both Read and Write operations. For an encrypted data set, the block size is a permanent attribute of the file. For an uncompressed data set, the block size determines the size of the blocks that are used to read the data from disk to memory. The block size has no effect when writing data to disk. For an uncompressed data set, the block size is not a permanent attribute of the file. That is, you can specify a different block size based on the Read operation that you are performing. For example, reading data that is randomly distributed or reading a subset of the data calls for a smaller block size because accessing smaller blocks is faster than accessing larger blocks. In contrast, reading data that is uniformly or sequentially distributed or that requires a full data set scan works better with a larger block size.

Comparisons
The IOBLOCKSIZE= data set option overrides the IOBLOCKSIZE= LIBNAME statement option.

See Also
IOBLOCKSIZE= LIBNAME statement option on page 33

PARTSIZE= Data Set Option for HDFS 41

PARTSIZE= Data Set Option for HDFS
Specifies a size for the SPD Engine data partition file.
Valid in: DATA step and PROC step
Category: Data Set Control
Default: 128 megabytes
Restrictions: The minimum value is 16 megabytes. The maximum value is 8,796,093,022,207 megabytes.

42 Chapter 4 / SPD Engine Reference Specify a data partition file size only when creating a new data set.
Tip: To update data, a smaller partition size provides the best performance. For example, when you update a value, the SPD Engine locates the appropriate partition, modifies the value, and rewrites all replications of the partition. Because each update requires that the partition be rewritten, it is recommended that you perform updates only occasionally or set a small partition size if you are planning to update the data frequently.

Syntax
PARTSIZE=n | nM | nG | nT

Syntax Description
n | nM | nG | nT
is the size of the data partition file in megabytes, gigabytes, or terabytes. If n is specified without M, G, or T, the default is megabytes. That is, partsize=64 is the same as partsize=64m.

Details
Each partition is stored as a separate file with the file extension .dpf. Depending on the amount of data and the partition size, the data can consist of one or more physical files, but is referenced as one logical file.

Comparisons
The PARTSIZE= data set option overrides the PARTSIZE= LIBNAME statement option.

See Also
PARTSIZE= LIBNAME statement option on page 38

PARALLELREAD= Data Set Option for HDFS
Determines when the SPD Engine uses parallel processing to read data stored in HDFS.
Valid in: DATA step and PROC step
Category: Data Set Control
Default: NO
Interactions: The SET statement POINT= option is inconsistent with parallel processing.

When parallel processing occurs, the order in which the observations are returned might not be in the physical order of the observations in the data set. Some applications require that observations be returned in the physical order. See: Parallel Processing for Data in HDFS on page 12 PARALLELWRITE= Data Set Option for HDFS 43 Syntax PARALLELREAD=NO YES Required Arguments NO specifies that parallel processing occurs only if a Read operation includes WHERE processing. This is the default behavior for the SPD Engine. YES requests parallel processing for all Read operations for the specific data set. Comparisons The PARALLELREAD= data set option overrides the SPDEPARALLELREAD= system option and the PARALLELREAD= LIBNAME statement option. See Also PARALLELREAD= LIBNAME Statement Option on page 36 SPDEPARALLELREAD= System Option for HDFS on page 45 PARALLELWRITE= Data Set Option for HDFS Determines whether the SPD Engine uses parallel processing to write data in HDFS. Valid in: Category: Default: Restrictions: DATA step and PROC step Data Set Control NO You cannot use parallel processing for a Write operation and also request to create a SAS index. You cannot use parallel processing for a Write operation and also request BY-group processing or sorting.

44 Chapter 4 / SPD Engine Reference Interactions: Note: When parallel Write processing occurs, the order in which the observations are written is unpredictable. The order in which the observations are returned cannot be determined unless the application imposes ordering criteria. The SPD Engine SPDEMAXTHREADS= system option specifies the maximum number of threads that the SPD Engine uses for processing. For more information, see SAS Scalable Performance Data Engine: Reference. The PARALLELWRITE= data set option is available in the third maintenance release for SAS 9.4. See: Parallel Processing for Data in HDFS on page 12 Syntax PARALLELWRITE=NO YES threads Required Arguments NO specifies that parallel processing for a Write operation does not occur. This is the default behavior for the SPD Engine. YES specifies parallel processing for all Write operations for the specific data set. A thread is used for each CPU on the SAS client machine. For example, if eight CPUs exist on the SAS client machine, then eight threads are used to write data. threads specifies parallel processing for all Write operations for the specific data set and specifies the number of threads to use for the Write operations. Default The default is 1, which specifies that parallel processing for a Write operation does not occur. Range 2 to 512 Comparisons The PARALLELWRITE= data set option overrides the PARALLELWRITE= LIBNAME statement option.

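As a hedged sketch of the data set option form, the following DATA step writes a new data set with eight parallel threads. The librefs and data set names are hypothetical, and no index creation or BY-group processing is requested, in keeping with the restrictions above.

data myspde.sales_copy (parallelwrite=8);
   set mybase.sales;
run;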
SPDEPARALLELREAD= System Option for HDFS 45

See Also
PARALLELWRITE= LIBNAME Statement Option on page 36

SPDEPARALLELREAD= System Option for HDFS
Determines when the SPD Engine uses parallel processing to read data stored in HDFS.
Valid in: SAS configuration file, SAS invocation, OPTIONS statement, SAS System Options window
Category: SAS Files
PROC OPTIONS GROUP= SASFILES
Default: NO
Interactions: The SET statement POINT= option is inconsistent with parallel processing. When parallel processing occurs, the order in which the observations are returned might not be in the physical order of the observations in the data set. Some applications require that observations be returned in the physical order.
See: Parallel Processing for Data in HDFS on page 12

Syntax
SPDEPARALLELREAD=NO | YES

Required Arguments
NO
specifies that parallel processing occurs only if a Read operation includes WHERE processing. This is the default behavior for the SPD Engine.
YES
requests parallel processing for all Read operations for the SAS session.

Comparisons
The SPDEPARALLELREAD= system option can be overridden by the PARALLELREAD= LIBNAME statement option and the PARALLELREAD= data set option.

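For example, the system option can be turned on for the remainder of a session with an OPTIONS statement and then overridden for an individual libref or data set as described above:

options spdeparallelread=yes;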
46 Chapter 4 / SPD Engine Reference See Also PARALLELREAD= LIBNAME Statement Option on page 36 PARALLELREAD= Data Set Option for HDFS on page 42 SPDE_CONFIG_FILE SAS Environment Variable Requests SPD Engine distributed locking by specifying the location of the user-defined XML configuration file. Valid in: Default: Note: SAS configuration file, SAS invocation, OPTIONS statement, SAS System Options window SPD Engine uses HDFS distributed locking. The SPDE_CONFIG_FILE SAS environment variable is available in the third maintenance release for SAS 9.4. See: SPD Engine Distributed Locking on page 20 Syntax SPDE_CONFIG_FILE='pathname' Required Argument 'pathname' specifies the fully qualified pathname to the user-defined XML configuration file. The location must be available to the SAS client machine. Enclose the primary pathname in single or double quotation marks. You can name the file whatever you want. An example is '/user/abcdef/hadoop/spde-site.xml'. Details XML Configuration File The XML configuration file contains the information so that the SPD Engine can communicate with ZooKeeper. The format of the XML configuration file is similar to a Hadoop configuration file in that the XML contains properties and attributes as name and value pairs. You must create an XML configuration file. The following is an example XML configuration file:

SPDE_CONFIG_FILE SAS Environment Variable 47

<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <property>
    <name>spde.zookeeper.quorum</name>
    <!-- Comma-separated list of Hadoop clusters running ZooKeeper server. -->
    <value>abcdef07.unx.sas.com,abcdef08.unx.sas.com,abcdef06.unx.sas.com</value>
  </property>
  <property>
    <name>spde.zookeeper.port</name>
    <!-- Port number used to connect to the ZooKeeper ensemble. -->
    <value>2181</value>
  </property>
  <property>
    <name>spde.zookeeper.connect.maxretries</name>
    <!-- Number of times to attempt to connect to ZooKeeper before failing. -->
    <value>3</value>
  </property>
  <property>
    <name>spde.zookeeper.connect.retrysleep</name>
    <!-- Number of milliseconds to sleep between connection attempts. -->
    <value>1000</value>
  </property>
  <property>
    <name>spde.zookeeper.connect.timeout</name>
    <!-- Number of milliseconds to wait before connection considered expired. -->
    <value>30000</value>
  </property>
  <property>
    <name>spde.zookeeper.session.timeout</name>
    <!-- Number of milliseconds to wait before session considered expired. -->
    <value>180000</value>
  </property>
  <property>
    <name>spde.zookeeper.lockwait.timeout</name>
    <!-- Number of milliseconds to wait before lock request considered failed. -->
    <value>10000</value>
  </property>
  <property>

48 Chapter 4 / SPD Engine Reference
    <name>spde.zookeeper.reaper.threshold</name>
    <!-- Number of milliseconds to wait before deleting an empty ZooKeeper data node. -->
    <value>3000</value>
  </property>
</configuration>

Creating the XML Configuration File
The following are XML configuration file properties. The first two properties, spde.zookeeper.quorum and spde.zookeeper.port, are required. The other properties have default values if they are not included in the XML configuration file.

spde.zookeeper.quorum
a comma-separated list of quorum machines that are configured to work together as a single server. The listed machines must be running a ZooKeeper server and servicing requests on the port that is specified in the spde.zookeeper.port property. This property is required.

spde.zookeeper.port
the I/O port on which the quorum machines that are listed in the spde.zookeeper.quorum property are configured to service requests. This property is required.

spde.zookeeper.connect.maxretries
the maximum number of times that Curator attempts to connect to ZooKeeper before failing. Values less than or equal to zero are ignored. The default is 3.

spde.zookeeper.connect.retrysleep
the milliseconds that Curator sleeps between attempts to connect to ZooKeeper. The sleep time starts with this setting, but increases between each attempt. Values less than or equal to zero are ignored. The default is 1000.

spde.zookeeper.connect.timeout
the milliseconds that Curator and the ZooKeeper client wait for a communication from the ZooKeeper server before considering the server connection to be expired. When operating normally, the client establishes a connection to the server and communicates with it over that connection. If the connection is non-responsive for more than the specified value, it is considered expired and is dropped, followed by an attempt to establish a new connection. Values less than or equal to zero are ignored. The default is 30000.

SPDE_CONFIG_FILE SAS Environment Variable 49

spde.zookeeper.session.timeout
the milliseconds that Curator and the ZooKeeper client wait for a communication from the ZooKeeper server before considering the client session to be expired. When operating normally, the client establishes a connection to the server and communicates with it over that connection. The connection might be dropped and reestablished as the network or server nodes experience faults, but the client session continues to exist for the duration of these interruptions. If an interruption persists for more than the specified value, the client session is considered expired and is terminated. No reconnection is possible after that. Values less than or equal to zero are ignored. The default is 180000.

spde.zookeeper.lockwait.timeout
the milliseconds that the ZooKeeper server waits for a lock to become available before declaring that a lock request has failed and returning control to the client. Values less than zero are ignored. A value of zero is valid.

spde.zookeeper.reaper.threshold
the milliseconds that the ZooKeeper server waits before deleting an empty ZooKeeper server node. The default is 3000.

Defining the SPDE_CONFIG_FILE Environment Variable
The following table includes examples of defining the SPDE_CONFIG_FILE environment variable:

Table 4.1 Defining the SPDE_CONFIG_FILE Environment Variable
SAS configuration file: -set SPDE_CONFIG_FILE /user/abcdef/hadoop/spde-site.xml
SAS invocation: -set SPDE_CONFIG_FILE /user/abcdef/hadoop/spde-site.xml
OPTIONS statement: options set=spde_config_file='/user/abcdef/hadoop/spde-site.xml';

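Putting the pieces together, a session might point the SPD Engine at the configuration file and then assign the HDFS libref so that subsequent access uses distributed locking. The configuration file location and library pathname are hypothetical.

options set=spde_config_file="/user/abcdef/hadoop/spde-site.xml";
libname myspde spde '/user/abcdef' hdfshost=default;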
50 Chapter 4 / SPD Engine Reference

SPDE_HADOOP_WORK_PATH SAS Environment Variable
Specifies a pathname for SAS High-Performance Analytics work files.
Valid in: SAS configuration file, SAS invocation, OPTIONS statement, SAS System Options window
Default: The SPD Engine uses the standard UNIX temporary directory /tmp.
See: Using SAS High-Performance Analytics Procedures on page 24

Syntax
SPDE_HADOOP_WORK_PATH='pathname'

Required Argument
'pathname'
specifies the fully qualified pathname to a directory in the Hadoop cluster, which is typically UNIX based. Enclose the pathname in single or double quotation marks. An example is '/sasdata/cluster1/hpawork'.
Requirement: The directory must exist, and you must have Write access.

Details
The following table includes examples of defining the SPDE_HADOOP_WORK_PATH environment variable:

Table 4.2 Defining the SPDE_HADOOP_WORK_PATH Environment Variable
SAS configuration file: -set SPDE_HADOOP_WORK_PATH /sasdata/cluster1/hpawork
SAS invocation: -set SPDE_HADOOP_WORK_PATH /sasdata/cluster1/hpawork
OPTIONS statement: options set=spde_hadoop_work_path='/sasdata/cluster1/hpawork';

SPDELOCKPATH SAS Environment Variable 51

SPDELOCKPATH SAS Environment Variable
Specifies a pathname for the SPD Engine lock directory for HDFS concurrent access.
Valid in: SAS configuration file, SAS invocation, OPTIONS statement, SAS System Options window
Default: The SPD Engine creates a lock directory in the /tmp directory. The lock directory name includes the name of the data set, an eight-character hexadecimal value (which is the checksum of the Hadoop cluster that contains the data set), and the suffix _spdslock9.
Note: The SPDELOCKPATH SAS environment variable is available in the third maintenance release for SAS 9.4.
See: SPD Engine File System Locking on page 18

Syntax
SPDELOCKPATH='pathname'

Required Argument
'pathname'
specifies the fully qualified pathname to a directory in the Hadoop cluster, which is typically UNIX based. Enclose the pathname in single or double quotation marks. An example is '/user/abcdef/'.
Tip: Specify only one lock directory pathname for each Hadoop cluster so that the same data set is not using different lock directories.

Details
The following table includes examples of defining the SPDELOCKPATH environment variable:

52 Chapter 4 / SPD Engine Reference

Table 4.3 Defining the SPDELOCKPATH Environment Variable
SAS configuration file: -set SPDELOCKPATH /user/abcdef
SAS invocation: -set SPDELOCKPATH /user/abcdef
OPTIONS statement: options set=spdelockpath='/user/abcdef';

SPDEREADLOCK SAS Environment Variable
Determines whether a Read access lock file is created.
Valid in: SAS configuration file, SAS invocation, OPTIONS statement, SAS System Options window
Default: NO
See: SPD Engine File System Locking on page 18

Syntax
SPDEREADLOCK=NO | YES

Required Arguments
NO
specifies that a Read access lock file is not created when a data set stored in HDFS is opened for Read access. This is the default behavior for the SPD Engine. Only Write access lock files are created.
YES
specifies that a Read access lock file is created when a data set stored in HDFS is opened for Read access. Once the lock file is created, no other SAS process can open the data set for Write access.

SPDEREADLOCK SAS Environment Variable 53

Details
To control the level of concurrent access, you can request a Read access lock file by defining the SAS environment variable SPDEREADLOCK and setting it to YES. Then, when a SAS session opens a data set for Read access, a lock file is created in addition to any Write access lock files. The following table includes examples of defining the SPDEREADLOCK environment variable:

Table 4.4 Defining the SPDEREADLOCK Environment Variable
SAS configuration file: -set SPDEREADLOCK YES
SAS invocation: -set SPDEREADLOCK YES
OPTIONS statement: options set=spdereadlock YES;

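As a sketch, a session that needs to block concurrent writers while it reads a data set might set the variable (mirroring the OPTIONS statement form shown in the table above) and then open the data set. The libref, pathname, and data set name are hypothetical.

options set=spdereadlock YES;
libname myspde spde '/user/abcdef' hdfshost=default;
proc print data=myspde.bigfile;
run;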

55 5 How to Use Hadoop Data Storage Overview: How to Use Hadoop Data Storage....................... 56 Example 1: Loading Existing SAS Data Using the COPY Procedure..................................................... 57 Details................................................................... 57 Program................................................................. 57 Program Description................................................... 57 Example 2: Creating a Data Set Using the DATA Step............ 58 Details................................................................... 58 Program................................................................. 58 Program Description................................................... 59 Example 3: Adding to Existing Data Set Using the APPEND Procedure................................................. 59 Details................................................................... 59 Program................................................................. 60 Program Description................................................... 60 Example 4: Loading Oracle Data Using the COPY Procedure... 61 Details................................................................... 61 Program................................................................. 61 Program Description................................................... 61 Example 5: Analyzing Data Using the FREQ Procedure.......... 62 Details................................................................... 62 Program................................................................. 62

56 Chapter 5 / How to Use Hadoop Data Storage Program Description................................................... 63 Example 6: Managing SAS Files Using the DATASETS Procedure................................................... 64 Details................................................................... 64 Program................................................................. 64 Program Description................................................... 64 Example 7: Setting the SPD Engine I/O Block Size................ 68 Details................................................................... 68 Program................................................................. 68 Program Description................................................... 68 Example 8: Optimizing WHERE Processing with MapReduce.. 69 Details................................................................... 69 Program................................................................. 69 Program Description................................................... 70 Overview: How to Use Hadoop Data Storage These examples illustrate how to use Hadoop data storage. The examples show you how to load existing data into a Hadoop cluster, how to create a new data set in a Hadoop cluster, and how to append data to an existing data set in a Hadoop cluster. Other examples show you how to load Oracle data into a Hadoop cluster and how to access data sets stored in a Hadoop cluster for data management and analysis. Note: The example data was created to illustrate SPD Engine functionality to read, write, and update data sets in a Hadoop cluster. The example data does not reflect the type of data or file size that might typically be loaded into a Hadoop cluster.

Example 1: Loading Existing SAS Data Using the COPY Procedure 57 Example 1: Loading Existing SAS Data Using the COPY Procedure Details This example loads existing SAS data into a Hadoop cluster. The example uses the default Base SAS engine, the SPD Engine, and the COPY procedure. The data set named MyBase.BigFile is copied, converted to the SPD Engine format, and then written to the Hadoop cluster as an SPD Engine data set named MySpde.BigFile. Program options set=sas_hadoop_config_path="\\sashq\root\u\abcdef\cdh45p1"; 1 options set=sas_hadoop_jar_path="\\sashq\root\u\abcdef\cdh45"; libname mybase 'C:\SASFiles'; 2 libname myspde spde '/data/spde' hdfshost=default; 3 proc copy in=mybase out=myspde; 4 select bigfile; run; Program Description 1 The OPTIONS statements include the SET system option to define the SAS_HADOOP_CONFIG_PATH environment variable and the SAS_HADOOP_JAR_PATH environment variable. The environment variables set the location of the configuration files and the JAR files for a specific Hadoop cluster. 2 The first LIBNAME statement assigns the libref MyBase to the physical location of the SAS library that stores the data set. (The default Base SAS engine for a LIBNAME statement is the V9 (or BASE) engine.)

58 Chapter 5 / How to Use Hadoop Data Storage 3 The second LIBNAME statement assigns the libref MySpde to a directory in the Hadoop cluster. The SPD Engine is specified. The HDFSHOST=DEFAULT argument specifies to connect to the Hadoop cluster that is defined in the Hadoop cluster configuration files. 4 The COPY procedure copies the data set named BigFile. The SPD Engine creates a subdirectory with the specified data set name and the suffix _spde, converts the data to the SPD Engine format, and writes the data to the directory in the Hadoop cluster. HDFS distributes the data on the Hadoop cluster. The SPD Engine data partition files for the data set BigFile are located at /data/spde/bigfile_spde/. The first partition file is named bigfile.dpf.080e0a8f.0.1.spds9. Example 2: Creating a Data Set Using the DATA Step Details This example creates a data set named MySpde.Fitness in a Hadoop cluster. The example uses the default Base SAS engine, the SPD Engine, and the DATA step SET statement to concatenate several data sets. The data sets are converted to the SPD Engine format and then written to a directory in the Hadoop cluster. Program options set=sas_hadoop_config_path="\\sashq\root\u\abcdef\cdh45p1"; 1 options set=sas_hadoop_jar_path="\\sashq\root\u\abcdef\cdh45dl"; libname mybase 'C:\SASFiles'; 2 libname myspde spde '/data/spde' hdfshost=default; 3 data myspde.fitness; 4 set mybase.fitness_2010 mybase.fitness_2011 mybase.fitness_2012; run;

Example 3: Adding to Existing Data Set Using the APPEND Procedure 59 Program Description 1 The OPTIONS statements include the SET system option to define the SAS_HADOOP_CONFIG_PATH environment variable and the SAS_HADOOP_JAR_PATH environment variable. The environment variables set the location of the configuration files and the JAR files for a specific Hadoop cluster. 2 The first LIBNAME statement assigns the libref MyBase to the physical location of the SAS library that stores the data sets. (The default Base SAS engine for a LIBNAME statement is the V9 (or BASE) engine.) 3 The second LIBNAME statement assigns the libref MySpde to a directory in the Hadoop cluster. The SPD Engine is specified. The HDFSHOST=DEFAULT argument specifies to connect to the Hadoop cluster that is defined in the Hadoop cluster configuration files. 4 The DATA statement assigns the name Fitness to the new data set. The SET statement lists the names of existing data sets to be read. The SPD Engine copies the three input data sets, concatenates them into one output data set named Fitness, converts the data to the SPD Engine format, and then writes the data to the directory in the Hadoop cluster. HDFS distributes the data on the Hadoop cluster. Example 3: Adding to Existing Data Set Using the APPEND Procedure Details This example adds data to an existing data set that is stored in a Hadoop cluster. The example uses the default Base SAS engine, the SPD Engine, and the APPEND procedure. The data sets named MyBase.September and MyBase.October are converted to the SPD Engine format and then written to the existing data set named Sales.YearToDate.

60 Chapter 5 / How to Use Hadoop Data Storage Program options set=sas_hadoop_config_path="\\sashq\root\u\abcdef\cdh45p1"; 1 options set=sas_hadoop_jar_path="\\sashq\root\u\abcdef\cdh45"; libname mybase 'C:\SASFiles'; 2 libname sales spde '/data/spde' hdfshost=default; 3 proc append base=sales.yeartodate data=mybase.september; 4 run; proc append base=sales.yeartodate data=mybase.october; 5 run; Program Description 1 The OPTIONS statements include the SET system option to define the SAS_HADOOP_CONFIG_PATH environment variable and the SAS_HADOOP_JAR_PATH environment variable. The environment variables set the location of the configuration files and the JAR files for a specific Hadoop cluster. 2 The first LIBNAME statement assigns the libref MyBase to the physical location of the SAS library that stores the data sets. (The default Base SAS engine for a LIBNAME statement is the V9 (or BASE) engine.) 3 The second LIBNAME statement assigns the libref Sales to a directory in the Hadoop cluster. The SPD Engine is specified. The HDFSHOST=DEFAULT argument specifies to connect to the Hadoop cluster that is defined in the Hadoop cluster configuration files. 4 The first PROC APPEND copies the data from MyBase.September to Sales.YearToDate. The SPD Engine converts the data to the SPD Engine format and then writes the data to the directory in the Hadoop cluster. HDFS distributes the data on the Hadoop cluster. 5 The second PROC APPEND copies the data from MyBase.October to Sales.YearToDate. The SPD Engine converts the data to the SPD Engine format and

Example 4: Loading Oracle Data Using the COPY Procedure 61 then writes the data to the directory in the Hadoop cluster. HDFS distributes the data on the Hadoop cluster. Example 4: Loading Oracle Data Using the COPY Procedure Details This example loads Oracle data into a Hadoop cluster. The example uses the SAS/ACCESS to Oracle engine, the SPD Engine, and the COPY procedure. The table named MyOracle.Oracle1 is written to the Hadoop cluster as an SPD Engine data set named MySpde.Oracle1. Program options set=sas_hadoop_config_path="\\sashq\root\u\abcdef\cdh45p1"; 1 options set=sas_hadoop_jar_path="\\sashq\root\u\abcdef\cdh45"; libname myoracle oracle user=myusr1 password=mypwd1 path=mysrv1; 2 libname myspde spde '/data/spde' hdfshost=default; 3 proc copy in=myoracle out=myspde; 4 select oracle1; run; Program Description 1 The OPTIONS statements include the SET system option to define the SAS_HADOOP_CONFIG_PATH environment variable and the SAS_HADOOP_JAR_PATH environment variable. The environment variables set the location of the configuration files and the JAR files for a specific Hadoop cluster.

62 Chapter 5 / How to Use Hadoop Data Storage 2 The first LIBNAME statement assigns the libref MyOracle, specifies the Oracle engine, and specifies the connection information to the Oracle database that contains the Oracle table. 3 The second LIBNAME statement assigns the libref MySpde to a directory in the Hadoop cluster. The SPD Engine is specified. The HDFSHOST=DEFAULT argument specifies to connect to the Hadoop cluster that is defined in the Hadoop cluster configuration files. 4 The COPY procedure copies the table named Oracle1. The SPD Engine creates a subdirectory with the specified data set name and suffix _spde, converts the data to the SPD Engine format, and writes the data to the directory in the Hadoop cluster as an SPD Engine data set. HDFS distributes the data on the Hadoop cluster. The SPD Engine data partition files for the data set Oracle1 are located at /data/spde/oracle1_spde/. Example 5: Analyzing Data Using the FREQ Procedure Details This example analyzes the data set StudentID that is stored in a Hadoop cluster. The data set contains 3,231,765 observations and three variables: ID, Age, and Name. The example uses the SPD Engine and the FREQ procedure to produce a one-way frequency table for the students' ages. Program options set=sas_hadoop_config_path="\\sashq\root\u\abcdef\cdh45p1"; 1 options set=sas_hadoop_jar_path="\\sashq\root\u\abcdef\cdh45"; libname class spde '/data/spde' hdfshost=default; 2 proc freq data=class.studentid; 3

Example 5: Analyzing Data Using the FREQ Procedure 63 tables age; run; Program Description 1 The two OPTIONS statements include the SET system option to define the SAS_HADOOP_CONFIG_PATH environment variable and the SAS_HADOOP_JAR_PATH environment variable. The environment variables set the location of the configuration files and the JAR files for a specific Hadoop cluster. 2 To read a data set that is stored in a Hadoop cluster, simply connect to the cluster with the LIBNAME statement for the SPD Engine. The LIBNAME statement assigns the libref Class to a directory in the Hadoop cluster and specifies the SPD Engine. The HDFSHOST=DEFAULT argument specifies to connect to the Hadoop cluster that is defined in the Hadoop cluster configuration files. 3 PROC FREQ produces a one-way frequency table for the students' ages. Figure 5.1 PROC FREQ One-Way Frequency Table

64 Chapter 5 / How to Use Hadoop Data Storage Example 6: Managing SAS Files Using the DATASETS Procedure Details This example illustrates how to manage SAS files that are stored in a Hadoop cluster. The example uses the DATASETS procedure to list the SAS files, describe the contents of a specific data set, and delete a data set from HDFS. Program options set=sas_hadoop_config_path="\\sashq\root\u\abcdef\cdh45p1"; 1 options set=sas_hadoop_jar_path="\\sashq\root\u\abcdef\cdh45"; libname myspde spde '/data/spde' hdfshost=default; 2 proc datasets library=myspde; 3 contents data=studentid (listfiles=yes); 4 run; delete bigfile; 5 run; quit; Program Description 1 The two OPTIONS statements include the SET system option to define the SAS_HADOOP_CONFIG_PATH environment variable and the SAS_HADOOP_JAR_PATH environment variable. The environment variables set the location of the configuration files and the JAR files for a specific Hadoop cluster. 2 To manage your SAS files that are stored in a Hadoop cluster, simply connect to the cluster with the LIBNAME statement for the SPD Engine. The LIBNAME statement

Example 6: Managing SAS Files Using the DATASETS Procedure 65 assigns the libref MySpde to a directory in the Hadoop cluster and specifies the SPD Engine. The HDFSHOST=DEFAULT argument specifies to connect to the Hadoop cluster that is defined in the Hadoop cluster configuration files. 3 PROC DATASETS lists the SAS files that are stored in the directory in the Hadoop cluster. 4 The CONTENTS statement describes the contents of the data set named StudentID, which includes the number of observations, whether the data set has an index, and the observation length. The LISTFILES= data set option lists the complete pathnames of the SPD Engine files such as the data partition files and the metadata file. 5 The DELETE statement removes the data set named BigFile. The SPD Engine data partition, metadata, and index files are removed. The data set name subdirectory is also removed unless the subdirectory contains files other than the data partition files.

66 Chapter 5 / How to Use Hadoop Data Storage Figure 5.2 MySpde Directory Listing

Example 6: Managing SAS Files Using the DATASETS Procedure 67 Figure 5.3 Contents of StudentID Data Set

68 Chapter 5 / How to Use Hadoop Data Storage Example 7: Setting the SPD Engine I/O Block Size Details This example illustrates how to set the SPD Engine I/O block size to improve performance. The example uses the SPD Engine, an uncompressed data set, and SAS procedures to analyze the data. Program options set=sas_hadoop_config_path="\\sashq\root\u\abcdef\cdh45p1"; 1 options set=sas_hadoop_jar_path="\\sashq\root\u\abcdef\cdh45"; libname class spde '/data/spde' hdfshost=default; 2 proc means data=class.studentid; 3 var age; run; proc print data=class.studentid (ioblocksize=32768); 4 where age > 18; run; Program Description 1 The two OPTIONS statements include the SET system option to define the SAS_HADOOP_CONFIG_PATH environment variable and the SAS_HADOOP_JAR_PATH environment variable. The environment variables set the location of the configuration files and the JAR files for a specific Hadoop cluster. 2 The LIBNAME statement assigns the libref Class to a directory in the Hadoop cluster. The SPD Engine is specified. The HDFSHOST=DEFAULT argument specifies to connect to the Hadoop cluster that is defined in the Hadoop cluster

Example 8: Optimizing WHERE Processing with MapReduce 69 configuration files. The LIBNAME statement does not include the IOBLOCKSIZE= option, so the default I/O block size is 1,048,576 bytes (1 megabyte). 3 The MEANS procedure calculates statistics on the Age variable. Because the Read operation requires a full data set scan, the procedure uses the default I/O block size, which was set from the LIBNAME statement. For this Read operation, including the IOBLOCKSIZE= data set option to specify a larger I/O block size could improve performance. When retrieving a large percentage of the data, a larger block size provides a performance benefit. 4 The PRINT procedure requests output where the value of the Age variable is greater than 18. Because the Read operation requests a subset of the data, the procedure includes the IOBLOCKSIZE= data set option to specify a smaller I/O block size. A smaller I/O block size provides better performance because the SPD Engine does not read large blocks of observations when it only needs a few observations from the block. Example 8: Optimizing WHERE Processing with MapReduce Details This example illustrates how to optimize WHERE processing by requesting that data subsetting be performed in the Hadoop cluster. This example analyzes the data set StudentID that is stored in a Hadoop cluster and submits the WHERE expression to the Hadoop cluster as a MapReduce program. By requesting that data subsetting be performed in the Hadoop cluster, performance is improved by taking advantage of the filtering and ordering capabilities of the MapReduce framework. As a result, only the subset of the data is returned to the SAS client. Program options msglevel=i; 1

70 Chapter 5 / How to Use Hadoop Data Storage options set=sas_hadoop_config_path="\\sashq\root\u\abcdef\cdh45p1"; 2 options set=sas_hadoop_jar_path="\\sashq\root\u\abcdef\cdh45"; libname class spde '/data/spde' hdfshost=default accelwhere=yes; 3 proc freq data=class.studentid; tables age; where age gt 14; 4 run; Program Description 1 The first OPTIONS statement specifies the MSGLEVEL=I SAS system option to request that informative messages be written to the SAS log. For WHERE processing optimization, the SAS log reports whether the data filtering occurred in the Hadoop cluster. 2 The next two OPTIONS statements include the SET system option to define the SAS_HADOOP_CONFIG_PATH environment variable and the SAS_HADOOP_JAR_PATH environment variable. The environment variables set the location of the configuration files and the JAR files for a specific Hadoop cluster. 3 The LIBNAME statement assigns the libref Class to a directory in the Hadoop cluster and specifies the SPD Engine. The HDFSHOST=DEFAULT argument specifies to connect to the Hadoop cluster that is defined in the Hadoop cluster configuration files. The ACCELWHERE=YES argument requests that data subsetting be performed by a MapReduce program in the Hadoop cluster. 4 PROC FREQ produces a one-way frequency table for the students' ages that are greater than 14. The WHERE expression, which defines the condition that selected observations must satisfy, is instantiated as a Java class. The SPD Engine submits the Java class to the Hadoop cluster as a component in a MapReduce program. As a result, only a subset of the data is returned to the SAS client.

Example 8: Optimizing WHERE Processing with MapReduce 71 Figure 5.4 PROC FREQ One-Way Frequency Table Optimized WHERE Processing Note: The SAS log reports that there were 2,371,486 observations read from the data set. That number of observations is a subset of the data set stored in the Hadoop cluster, which contains 3,231,765 observations. Log 5.1 SAS Log Reporting WHERE Optimization 1 options msglevel=i; 2 options set=sas_hadoop_config_path="\\sashq\root\u\abcdef\cdh45p1"; 3 options set=sas_hadoop_jar_path="\\sashq\root\u\abcdef\cdh45"; 4 libname class spde '/data/spde' hdfshost=default accelwhere=yes; NOTE: Libref CLASS was successfully assigned as follows: Engine: SPDE Physical Name: /data/spde/ 5 proc freq data=class.studentid; 6 tables age; 7 where age gt 14; whinit: WHERE (Age>14) whinit returns: ALL EVAL2 8 run; NOTE: Writing HTML Body file: sashtml.htm NOTE: There were 2371486 observations read from the data set CLASS.STUDENTID. WHERE age>14; WHERE processing is optimized on the Hadoop cluster. Hadoop Job ID: job_201405290931_14972 NOTE: PROCEDURE FREQ used (Total process time): real time 2:31.74 cpu time 1.70 seconds


73 Hive SerDe for SPD Engine Data Appendix 1 Accessing SPD Engine Data Using Hive............................. 73 Introduction.............................................................. 73 Requirements for Accessing SPD Engine Tables with Hive...... 74 Deploying the SPD Engine SerDe................................... 76 Registering the SPD Engine Table Metadata in the Hive Metastore................................................... 77 Reading SPD Engine Tables from Hive............................. 79 Logging Support........................................................ 80 How the SPD Engine SerDe Reads the Data...................... 80 Troubleshooting.......................................................... 82

Accessing SPD Engine Data Using Hive

Introduction
Hive uses an interface called SerDe to translate data that is stored in HDFS in proprietary formats such as JSON and Parquet. A SerDe deserializes the data into a Java object that HiveQL and other languages that are supported by HiveServer2 can manipulate. Hive provides a variety of built-in SerDes and supports custom SerDes. For more information about Hive SerDes, see your Hive documentation.

74 Appendix 1 / Hive SerDe for SPD Engine Data

In the third maintenance release for SAS 9.4, SAS provides a custom Hive SerDe for SPD Engine data that is stored in HDFS. The SerDe makes the data available for applications outside of SAS to query. The SPD Engine SerDe does not support creating, altering, or updating SPD Engine data in HDFS using HiveQL or other languages. That is, the SerDe is Read-only and cannot serialize data for storage in HDFS. If you want to process SPD Engine data stored in HDFS using SAS applications, you should access it directly with the SPD Engine.

In addition, if the SPD Engine table in HDFS has any of the following features, it cannot be registered in Hive or use the SerDe. You must access it by going through SAS and the SPD Engine. The following table features are not supported:
o compressed or encrypted tables
o tables with SAS informats
o tables that have user-defined formats
o password-protected tables
o tables owned by the SAS Scalable Performance Data Server

In addition, the following processing functionality is not supported by the SerDe and requires processing by the SPD Engine:
o Write, Update, and Append operations
o if preserving observation order is required

Requirements for Accessing SPD Engine Tables with Hive

The following are required to access SPD Engine tables using the SPD Engine SerDe:
o You must deploy SAS Foundation using the SAS Deployment Wizard. Select SAS Hive SerDe for SPDE Data.

Accessing SPD Engine Data Using Hive 75

Figure A1.1 SAS Deployment Wizard Product Selection Page

o You must be running a supported Hadoop distribution that includes Hive 0.13:
  o Cloudera CDH 5.2
  o Hortonworks HDP 2.1 or later
  o MapR 4.0.2 or later
o The SPD Engine table stored in HDFS must have been created using the SPD Engine.
o The SerDe is delivered as two JAR files, which must be deployed to all nodes in the Hadoop cluster.