
SAS 9.4 SPD Engine: Storing Data in the Hadoop Distributed File System

Third Edition

SAS Documentation

The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2015. SAS 9.4 SPD Engine: Storing Data in the Hadoop Distributed File System, Third Edition. Cary, NC: SAS Institute Inc.

SAS 9.4 SPD Engine: Storing Data in the Hadoop Distributed File System, Third Edition

Copyright 2015, SAS Institute Inc., Cary, NC, USA. All rights reserved. Produced in the United States of America.

For a hard-copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

For a web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time you acquire this publication. The scanning, uploading, and distribution of this book via the Internet or any other means without the permission of the publisher is illegal and punishable by law. Please purchase only authorized electronic editions and do not participate in or encourage electronic piracy of copyrighted materials. Your support of others' rights is appreciated.

U.S. Government License Rights; Restricted Rights: The Software and its documentation is commercial computer software developed at private expense and is provided with RESTRICTED RIGHTS to the United States Government. Use, duplication or disclosure of the Software by the United States Government is subject to the license terms of this Agreement pursuant to, as applicable, FAR , DFAR (a), DFAR (a) and DFAR and, to the extent required under U.S. federal law, the minimum restricted rights as set out in FAR (DEC 2007). If FAR is applicable, this provision serves as notice under clause (c) thereof and no other notice is required to be affixed to the Software or documentation. The Government's rights in Software and documentation shall be only those set forth in this Agreement.

SAS Institute Inc., SAS Campus Drive, Cary, North Carolina. July 2015.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.

Contents

What's New in the SAS 9.4 SPD Engine to Store Data in HDFS

Chapter 1: Introduction to Storing Data in HDFS
- Deciding to Store Data in HDFS
- Using the SPD Engine to Store Data in HDFS

Chapter 2: Storing Data in HDFS
- Overview: Storing Data in HDFS
- SAS and Hadoop Requirements
- Supported SAS File Features Using the SPD Engine
- Security

Chapter 3: Using the SPD Engine
- Overview: Using the SPD Engine
- How the SPD Engine Supports Data Distribution
- I/O Operation Performance
- Creating SAS Indexes
- Parallel Processing for Data in HDFS
- WHERE Processing Optimization with MapReduce
- SPD Engine File System Locking
- SPD Engine Distributed Locking
- Updating Data in HDFS
- Using SAS High-Performance Analytics Procedures

Chapter 4: SPD Engine Reference
- Overview: SPD Engine Reference
- Dictionary

Chapter 5: How to Use Hadoop Data Storage
- Overview: How to Use Hadoop Data Storage
- Example 1: Loading Existing SAS Data Using the COPY Procedure
- Example 2: Creating a Data Set Using the DATA Step
- Example 3: Adding to an Existing Data Set Using the APPEND Procedure
- Example 4: Loading Oracle Data Using the COPY Procedure
- Example 5: Analyzing Data Using the FREQ Procedure
- Example 6: Managing SAS Files Using the DATASETS Procedure
- Example 7: Setting the SPD Engine I/O Block Size
- Example 8: Optimizing WHERE Processing with MapReduce

Appendix 1: Hive SerDe for SPD Engine Data
- Accessing SPD Engine Data Using Hive
- Troubleshooting

Recommended Reading

Index

What's New in the SAS 9.4 SPD Engine to Store Data in HDFS

Overview

In the second maintenance release for SAS 9.4, the SPD Engine has improved performance. The SPD Engine creates a SAS index much faster, sets a larger I/O block size and expands the scope of the block size, expands parallel processing support for Read operations, performs data filtering in the Hadoop cluster, and enables you to control the number of MapReduce tasks when writing data in HDFS.

In the third maintenance release for SAS 9.4, the SPD Engine expands the supported Hadoop distributions, enables parallel processing for Write operations, expands WHERE processing optimization with more WHERE expression syntax, enhances file system locking by enabling you to specify a pathname for the SPD Engine lock directory, supports distributed locking, and provides a custom Hive SerDe so that SPD Engine data stored in HDFS can be accessed using Hive.

Hadoop Distribution Support

In the third maintenance release for SAS 9.4, the SPD Engine has expanded the supported Hadoop distributions. For the list of supported Hadoop distributions, see Hadoop Distribution Support on page 6.

Improved Performance When Creating a SAS Index

In the second maintenance release for SAS 9.4, when you create a SAS index for a data set in HDFS, the performance of creating a large index is significantly improved because the index is partitioned. For more information, see Creating SAS Indexes on page 11.

Improved Performance By Setting the SPD Engine I/O Block Size

In the second maintenance release for SAS 9.4, the scope of the SPD Engine I/O block size is expanded. The default block size is larger at 1,048,576 bytes (1 megabyte). The block size affects compressed, uncompressed, and encrypted data sets. The block size influences the size of I/O operations when reading all data sets and writing compressed data sets. For more information, see I/O Operation Performance on page 11. To specify an I/O block size, use the IOBLOCKSIZE= data set option on page 40 or the new IOBLOCKSIZE= LIBNAME statement option on page 33.
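For example, here is a minimal sketch (the libref and pathname are hypothetical) that sets a 2-megabyte I/O block size on the LIBNAME statement:

   libname myspde spde '/user/abcdef' hdfshost=default
      ioblocksize=2097152;   /* 2 megabytes; must not exceed the Hadoop cluster block size */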

Improved Performance of Reading Data in HDFS

In the second maintenance release for SAS 9.4, to improve the performance of reading data stored in HDFS, the SPD Engine has expanded its support of parallel processing. You can request parallel processing for all Read operations of data stored in HDFS. For more information, see Parallel Processing for Data in HDFS on page 12. To request parallel processing for all Read operations of data stored in HDFS, use the SPDEPARALLELREAD= system option on page 45, the PARALLELREAD= LIBNAME statement option on page 36, or the PARALLELREAD= data set option on page 42.

Improved Performance of Writing Data to HDFS

In the third maintenance release for SAS 9.4, you can now request parallel processing for all Write operations in HDFS. For more information, see Parallel Processing for Data in HDFS on page 12. To request parallel processing for Write operations, use the PARALLELWRITE= LIBNAME statement option on page 36 or the PARALLELWRITE= data set option on page 43.

Optimized WHERE Processing

To optimize the performance of WHERE processing, you can request that data subsetting be performed in the Hadoop cluster, which takes advantage of the filtering and ordering capabilities of the MapReduce framework. For more information, see WHERE Processing Optimization with MapReduce on page 15. To request that data subsetting be performed in the Hadoop cluster, use the ACCELWHERE= LIBNAME statement option on page 31 or the ACCELWHERE= data set option on page 39.

In the third maintenance release for SAS 9.4, optimized WHERE processing is expanded to include more operators and compound expressions. For more information, see WHERE Expression Syntax Support on page 16.

Controlling Tasks When Writing Data in HDFS

In the second maintenance release for SAS 9.4, to specify the number of MapReduce tasks when writing data in HDFS, you can use the NUMTASKS= LIBNAME statement option. This option controls parallel processing on the Hadoop cluster when writing output from a SAS High-Performance Analytics procedure. For more information, see the NUMTASKS= LIBNAME statement option on page 35.

SPD Engine File System Locking

In the second maintenance release for SAS 9.4, the SPD Engine implements a locking strategy that honors the HDFS concurrent access model and provides additional levels of concurrent access to ensure the integrity of the data stored in HDFS. For more information, see SPD Engine File System Locking on page 18.

In the third maintenance release for SAS 9.4, to store the lock files, the SPD Engine creates a lock directory in the /tmp directory. You can specify a pathname for the SPD Engine lock directory by defining the new SAS environment variable SPDELOCKPATH. For more information, see SPDELOCKPATH SAS Environment Variable on page 51.

SPD Engine Distributed Locking

In the third maintenance release for SAS 9.4, the SPD Engine supports distributed locking for data stored in HDFS.

Distributed locking provides synchronization and group coordination services to clients over a network connection. For more information, see SPD Engine Distributed Locking on page 20.

To request SPD Engine distributed locking, you must first create an XML configuration file, and then define the SAS environment variable SPDE_CONFIG_FILE to specify the location of the user-defined XML file that is available to the SAS client machine. For more information, see SPDE_CONFIG_FILE SAS Environment Variable on page 46.

Configuring the SPD Engine to Store Data in HDFS

To store data in HDFS using the SPD Engine, required Hadoop JAR files and Hadoop cluster configuration files must be available to the SAS client machine. For information about configuring the SPD Engine, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.

Accessing SPD Engine Data Using Hive

In the third maintenance release for SAS 9.4, SAS provides a custom Hive SerDe for SPD Engine data that is stored in HDFS. The SerDe makes the data available for applications outside of SAS to query using HiveQL. For more information, see Appendix 1, Hive SerDe for SPD Engine Data, on page 73.


Chapter 1: Introduction to Storing Data in HDFS

- Deciding to Store Data in HDFS
- Using the SPD Engine to Store Data in HDFS
  - What Is the SPD Engine?
  - Understanding the SPD Engine File Format
  - How to Use the SPD Engine

Deciding to Store Data in HDFS

Storing data in the Hadoop Distributed File System (HDFS) is a good strategy for very large data sets. HDFS is a component of Apache Hadoop, which is an open-source software framework of tools that are written in Java. HDFS provides distributed data storage and processing of large amounts of data.

Reasons for storing SAS data in HDFS include the following:

- HDFS is a low-cost alternative for data storage. Organizations are exploring it as an alternative to commercial relational database solutions.
- HDFS is well suited for distributed storage and processing using commodity hardware. It is fault tolerant, scalable, and simple to expand. HDFS manages files as blocks of equal size, which are replicated across the machines in a Hadoop cluster to provide fault tolerance.
- SAS provides support within the current SAS product offering and product roadmap. SAS provides the ability to manage, process, and analyze data in HDFS.

Hadoop storage is for big data. If standard SAS optimization techniques such as indexes no longer meet your performance needs, then storing the data in HDFS could improve performance.

Using the SPD Engine to Store Data in HDFS

What Is the SPD Engine?

The SAS Scalable Performance Data (SPD) Engine is a scalable engine delivered to SAS customers as part of Base SAS. The SPD Engine is designed for high-performance data delivery, reading data sets that contain billions of observations. The engine uses threads to read data very rapidly and in parallel. The SPD Engine reads, writes, and updates data in HDFS. You can use the SPD Engine with standard SAS applications to retrieve data for analysis, perform administrative functions, and update the data.

Understanding the SPD Engine File Format

The SPD Engine organizes data into a streamlined file format that has advantages for a distributed file system like HDFS. The advantages of the SPD Engine file format include the following:

- Data is separate from the metadata. The file format consists of separate files: one for data, one for metadata, and two for indexes. Each type of file has an identifying file extension. The extensions are .dpf for data, .mdf for metadata, and .hbx and .idx for indexes.
- The SPD Engine file format partitions the data by spreading it across multiple files based on a partition size. Each partition is stored as a separate physical file with the extension .dpf. Depending on the amount of data and the partition size, the data can consist of one or more physical files, but is referenced as one logical file.

The default partition size is 128 megabytes. You can specify a different partition size with the PARTSIZE= LIBNAME statement option on page 38 or the PARTSIZE= data set option on page 41.

How to Use the SPD Engine

The SPD Engine works like other SAS data access engines. That is, you execute a LIBNAME statement to assign a libref, specify the engine, and connect to the Hadoop cluster. You then use that libref throughout the SAS session where a libref is valid. The libref is associated with a specific directory in the Hadoop cluster. Arguments in the LIBNAME statement specify a libref, the engine name, the pathname to a directory in the Hadoop cluster, and the HDFSHOST=DEFAULT argument to indicate that you want to connect to a Hadoop cluster. Here is an example of a LIBNAME statement to connect to a Hadoop cluster:

   libname myspde spde '/user/abcdef' hdfshost=default;

To interface with Hadoop and connect to a specific Hadoop cluster, required Hadoop JAR files and Hadoop cluster configuration files must be available to the SAS client machine. To make the required files available, you must define two SAS environment variables to set the location of the required files. For more information about the SAS environment variables, see SAS and Hadoop Requirements on page 6.

Any data source that can be accessed with a SAS engine can be loaded into a Hadoop cluster using the SPD Engine. For example:

- You can use the default Base SAS engine to access an existing SAS data set and the SPD Engine to connect to the Hadoop cluster. You can then use SAS code to load the data to the Hadoop cluster (see the sketch after this list). See Example 1: Loading Existing SAS Data Using the COPY Procedure on page 57.
- You can use a SAS/ACCESS engine such as the SAS/ACCESS to Oracle engine to access an Oracle table and the SPD Engine to connect to the Hadoop cluster. You can then use SAS code to load the data to the Hadoop cluster. See Example 4: Loading Oracle Data Using the COPY Procedure on page 61.
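Here is a minimal sketch of the first pattern, assuming a hypothetical local library path and data set name (Example 1 on page 57 shows the documented version):

   libname local 'c:\mysasdata';                 /* hypothetical Base SAS library */
   libname myspde spde '/user/abcdef' hdfshost=default;

   proc copy in=local out=myspde;
      select bigfile;                            /* hypothetical data set name */
   run;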

Note: Most existing SAS programs can run with the SPD Engine with little modification other than to the LIBNAME statement. However, some limitations apply. For example, if your default Base SAS engine data has integrity constraints, then the integrity constraints are dropped when the data is converted for the SPD Engine. For more information about supported SAS file features, see Supported SAS File Features Using the SPD Engine on page 7.

Chapter 2: Storing Data in HDFS

- Overview: Storing Data in HDFS
- SAS and Hadoop Requirements
  - SAS Version
  - Hadoop Distribution Support
  - Configuring Hadoop JAR Files
  - Making Required Hadoop Cluster Configuration Files Available to Your Machine
- Supported SAS File Features Using the SPD Engine
- Security

Overview: Storing Data in HDFS

To store data in HDFS using the SPD Engine, you must do the following:

- Ensure that all version and configuration requirements are met. See SAS and Hadoop Requirements on page 6.
- Understand which SAS file features are supported and not supported when using the SPD Engine. See Supported SAS File Features Using the SPD Engine on page 7.
- Use the LIBNAME statement for the SPD Engine to establish the connection to the Hadoop cluster. See LIBNAME Statement for HDFS on page 28.

SAS and Hadoop Requirements

SAS Version

To store data in HDFS using the SPD Engine, you must have the first maintenance release for SAS 9.4 or later.

Note: Access to data in HDFS using the SPD Engine is not supported from a SAS session in the z/OS operating environment.

Hadoop Distribution Support

In the third maintenance release for SAS 9.4, the SPD Engine supports the following Hadoop distributions, with or without Kerberos:

- Cloudera CDH 4.x
- Cloudera CDH 5.x
- Hortonworks HDP 2.x
- IBM InfoSphere BigInsights 3.x
- MapR 4.x (for Microsoft Windows and Linux operating environments only)
- Pivotal HD 2.x

Configuring Hadoop JAR Files

To store data in HDFS using the SPD Engine, you must use a supported Hadoop distribution and configure a required set of Hadoop JAR files. The JAR files must be available to the SAS client machine. The SAS environment variable SAS_HADOOP_JAR_PATH must be defined and set to the location of the Hadoop JAR files. For complete instructions, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.
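For example, one way to define this variable (and the related SAS_HADOOP_CONFIG_PATH variable described in the next topic) is with OPTIONS SET= statements; the pathnames here are hypothetical, and the configuration guide describes where to set the variables for your deployment:

   options set=SAS_HADOOP_JAR_PATH="/opt/sas/hadoopjars";        /* hypothetical JAR file location */
   options set=SAS_HADOOP_CONFIG_PATH="/opt/sas/hadoopconfig";   /* hypothetical configuration file location */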

Making Required Hadoop Cluster Configuration Files Available to Your Machine

Hadoop cluster configuration files contain information such as the name of the computer that hosts the Hadoop cluster and the TCP port. To connect to the Hadoop cluster, Hadoop configuration files must be copied from the specific Hadoop cluster to a physical location that the SAS client machine can access. The SAS environment variable SAS_HADOOP_CONFIG_PATH must be defined and set to the location of the Hadoop configuration files. For complete instructions, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.

Supported SAS File Features Using the SPD Engine

The following SAS file features are supported for data sets using the SPD Engine:

- Encryption
- File compression
- Member-level locking
- SAS indexes
- SAS passwords
- Special missing values
- Physical ordering of returned observations
- User-defined formats and informats

Note: When you create a data set, you cannot request both encryption and file compression.

The following SAS file features are not supported for data sets using the SPD Engine:

- Audit trails

- Cross-Environment Data Access (CEDA)
- Extended attributes
- Generation data sets
- Integrity constraints
- NLS support (such as to specify encoding for the data)
- Record-level locking
- SAS catalogs, SAS views, and MDDB files

The following SAS software does not support SPD Engine data sets:

- SAS/CONNECT
- SAS/SHARE

Security

HDFS supports defined levels of permissions at both the directory and file levels. The SPD Engine honors those permissions. For example, if the file is available as Read only, you cannot modify it.

If the Hadoop cluster supports Kerberos, the SPD Engine honors Kerberos authentication and authorization as long as the Hadoop cluster configuration files are accessed. For more information about accessing the Hadoop cluster configuration files, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.

Restricting access to members of SAS libraries by assigning SAS passwords to the members is supported when a data set is stored in HDFS. You can specify three levels of permission: Read, Write, and Alter. For more information about SAS passwords, see SAS Language Reference: Concepts.
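For example, here is a minimal sketch that assigns all three password levels when creating a data set (the library, data set, and password values are hypothetical):

   libname myspde spde '/user/abcdef' hdfshost=default;

   data myspde.bigfile (read=green write=blue alter=red);   /* hypothetical passwords */
      set work.bigfile;                                     /* hypothetical input data set */
   run;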

Chapter 3: Using the SPD Engine

- Overview: Using the SPD Engine
- How the SPD Engine Supports Data Distribution
- I/O Operation Performance
- Creating SAS Indexes
- Parallel Processing for Data in HDFS
  - Overview: Parallel Processing for Data in HDFS
  - Parallel Processing Considerations
  - Tuning Parallel Processing Performance
- WHERE Processing Optimization with MapReduce
  - Overview: WHERE Processing Optimization with MapReduce
  - WHERE Expression Syntax Support
  - Data Set and SAS Code Requirements
  - Hadoop Requirements
- SPD Engine File System Locking
  - Overview: SPD Engine File System Locking
  - Requesting Read Access Lock Files
  - Specifying a Pathname for the SPD Engine Lock Directory
- SPD Engine Distributed Locking
  - Overview: SPD Engine Distributed Locking
  - Understanding the Service Provider
  - Requirements for SPD Engine Distributed Locking
  - Requesting Distributed Locking
- Updating Data in HDFS
- Using SAS High-Performance Analytics Procedures

Overview: Using the SPD Engine

The SPD Engine reads, writes, and updates data in HDFS. Specific SPD Engine features are supported for Hadoop storage and are explained in this document. For more information about the SPD Engine and its features that are not specific to Hadoop storage, see SAS Scalable Performance Data Engine: Reference.

How the SPD Engine Supports Data Distribution

When loading data into a Hadoop cluster, the SPD Engine ensures that the data is distributed appropriately. The SPD Engine uses the SPD Engine partition size and the HDFS block size to compute the maximum number of observations that can fit within both parameters. That is, observations never span multiple partitions or multiple blocks. After a data set is loaded into a Hadoop cluster, the actual block size of the loaded data might be less than the block size that was defined by the Hadoop administrator. The size difference can result from the SPD Engine calculations regarding the partition size, block size, and observation length.

Note: Defragmenting the Hadoop cluster is not recommended. Changing the block size and re-creating the files could result in the data becoming inaccessible by SAS.

I/O Operation Performance

To improve I/O operation performance, consider setting a different SPD Engine I/O block size. The larger the block size, the less I/O. For example, when reading a data set, the block size can significantly affect performance. When retrieving a large percentage of the data, a larger block size improves performance. However, when retrieving a subset of the data, such as with WHERE processing, a smaller block size performs better. You can specify a different block size with the IOBLOCKSIZE= LIBNAME statement option and the IOBLOCKSIZE= data set option. For more information, see the IOBLOCKSIZE= LIBNAME statement option on page 33 and the IOBLOCKSIZE= data set option on page 40.

Creating SAS Indexes

When you create a SAS index for a data set that is stored in HDFS, a large index could require a long time to create. To provide efficient index creation, the SPD Engine partitions the two index files (.hbx and .idx). The index files are spread across multiple files based on the index partition size, which is 2 megabytes. Even though the index files are partitioned, the PARTSIZE= option, which specifies a size for the SPD Engine data partition file, does not affect the index partition size. You cannot increase or decrease the index partition size.

To improve the performance of creating an index, consider these options:

- Request that indexes be created in parallel, asynchronously. To enable asynchronous parallel index creation, use the ASYNCINDEX= data set option.
- Request more temporary utility file space for sorting the data. To allocate an adequate amount of space for processing, use the SPDEUTILLOC= system option. Specify the utility file location on the SAS client machine, not on the Hadoop cluster.
- Request larger memory space for the sorting utility to use when sorting values for creating an index. To specify the amount of memory, use the SPDEINDEXSORTSIZE= system option.

For more information about these options, see SAS Scalable Performance Data Engine: Reference.
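For example, here is a minimal sketch (the data set and index variable names are hypothetical) that requests asynchronous parallel index creation when the data set is written:

   libname myspde spde '/user/abcdef' hdfshost=default;

   data myspde.bigfile (asyncindex=yes index=(empnum custid));   /* hypothetical indexes */
      set work.bigfile;                                          /* hypothetical input data set */
   run;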

Parallel Processing for Data in HDFS

Overview: Parallel Processing for Data in HDFS

Parallel processing uses multiple threads that run in parallel so that a large operation is divided into multiple smaller ones that are executed simultaneously. The SPD Engine supports parallel processing to improve the performance of reading and writing data stored in HDFS. By default, the SPD Engine performs parallel processing only if a Read operation includes WHERE processing. If the Read operation does not include WHERE processing, the Read operation is performed by a single thread.

To request parallel processing for all Read operations (in all SAS releases) and for Write operations (in the third maintenance release for SAS 9.4 only), use these options:

- SPDEPARALLELREAD= system option on page 45 to request parallel read processing for the SAS session.
- PARALLELREAD= LIBNAME statement option on page 36 to request parallel read processing when using the assigned libref.
- PARALLELREAD= data set option on page 42 to request parallel read processing for the specific data set.
- In the third maintenance release for SAS 9.4, PARALLELWRITE= LIBNAME statement option on page 36 to request parallel write processing when using the assigned libref.
- In the third maintenance release for SAS 9.4, PARALLELWRITE= data set option on page 43 to request parallel write processing for the specific data set.

Here is an example of the SPDEPARALLELREAD= system option to request parallel processing for all Read operations for the SAS session:

   options spdeparallelread=yes;

In this example, the LIBNAME statement requests parallel processing for all Read operations using the assigned libref. By specifying the PARALLELREAD= LIBNAME statement option, parallel processing is performed for all Read operations using the Class libref:

   libname class spde '/user/abcdef' hdfshost=default parallelread=yes;

   proc freq data=class.studentid;
      tables age;
   run;

In this example, the PARALLELREAD= data set option requests parallel processing for all Read operations for the Class.StudentID data set:

   libname class spde '/user/abcdef' hdfshost=default;

   proc freq data=class.studentid (parallelread=yes);
      tables age;
   run;

Here is an example of the PARALLELWRITE= LIBNAME statement option, which requests parallel processing for all Write operations using the Class libref:

   libname class spde '/user/abcdef' hdfshost=default parallelwrite=yes;

TIP: To display information in the SAS log about parallel processing, set the MSGLEVEL= system option to I. When you set options msglevel=i;, the SAS log reports whether parallel processing is in effect.

Parallel Processing Considerations

The following are considerations for requesting parallel processing:

- For some environments, parallel processing might not improve performance. The available network bandwidth and the number of CPUs on the SAS client machine determine the performance improvement. It is recommended that you set up a test in your environment to measure performance with and without parallel processing.
- When parallel read processing occurs, the order in which the observations are returned might not be the physical order of the observations in the data set. Some applications require that observations be returned in physical order. For example, the COMPARE procedure expects that observations are read from the data set in the same order that they were written to the data set. Also, legacy code that uses the DATA step or the OBS= data set option might rely on physical order to produce the expected results.

Tuning Parallel Processing Performance

To tune the performance of parallel processing, consider these SPD Engine options:

- The SPDEMAXTHREADS= system option specifies the maximum number of threads that the SPD Engine uses for processing.
- The THREADNUM= data set option specifies the maximum number of threads to use for the processing.

For more information about these options, see SAS Scalable Performance Data Engine: Reference.

Note: The Base SAS NOTHREADS and CPUCOUNT= system options have no effect on SPD Engine parallel processing.
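For example, here is a minimal sketch (the libref, data set, and thread count are hypothetical) that caps a single step at eight threads with the THREADNUM= data set option:

   libname class spde '/user/abcdef' hdfshost=default parallelread=yes;

   proc means data=class.studentid (threadnum=8);   /* hypothetical cap of 8 threads */
      var age;
   run;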

WHERE Processing Optimization with MapReduce

Overview: WHERE Processing Optimization with MapReduce

WHERE processing enables you to conditionally select a subset of observations so that SAS processes only the observations that meet specified conditions. To optimize the performance of WHERE processing, you can request that data subsetting be performed in the Hadoop cluster. Then, when you submit SAS code that includes a WHERE expression (which defines the condition that selected observations must satisfy), the SPD Engine instantiates the WHERE expression as a Java class. The SPD Engine submits the Java class to the Hadoop cluster as a component in a MapReduce program. By requesting that data subsetting be performed in the Hadoop cluster, performance might be improved by taking advantage of the filtering and ordering capabilities of the MapReduce framework. As a result, only the subset of the data is returned to the SAS client. Performance is often improved with large data sets when the WHERE expression qualifies only a relatively small subset.

By default, data subsetting is performed by the SPD Engine on the SAS client. To request that data subsetting be performed in the Hadoop cluster, you must specify the ACCELWHERE= LIBNAME statement option on page 31 or the ACCELWHERE= data set option on page 39.

Here is an example of a LIBNAME statement that connects to a Hadoop cluster and requests that data subsetting be performed in the Hadoop cluster. By specifying the ACCELWHERE= LIBNAME statement option, subsequent WHERE processing for all data sets accessed with the Class libref is performed in the Hadoop cluster:

   libname class spde '/user/abcdef' hdfshost=default accelwhere=yes;

   proc freq data=class.studentid;
      tables age;
      where age gt 14;
   run;

In this example, the ACCELWHERE= data set option requests that data subsetting be performed in the Hadoop cluster. The WHERE processing for the Class.StudentID data set is performed in the Hadoop cluster. WHERE processing for any other data set with the Class libref is performed by the SPD Engine on the SAS client machine:

   libname class spde '/user/abcdef' hdfshost=default;

   proc freq data=class.studentid (accelwhere=yes);
      tables age;
      where age gt 14;
   run;

WHERE Expression Syntax Support

In the third maintenance release for SAS 9.4, WHERE processing optimization is expanded. WHERE processing optimization supports the following syntax:

- comparison operators such as EQ (=), NE (^=), GT (>), LT (<), GE (>=), LE (<=)
- IN operator
- fully bounded range condition, such as where 500 <= empnum <= 1000;
- BETWEEN-AND operator, such as where empnum between 500 and 1000;
- compound expressions using the logical operators AND, OR, and NOT, such as where skill = 'java' or years = 4;
- parentheses to control the order of evaluation, such as where (product='graph' or product='stat') and country='canada';

Data Set and SAS Code Requirements

To perform the data subsetting in the Hadoop cluster, the following data set and SAS code requirements must be met. If any of these requirements are not met, the subsetting of the data is performed by the SPD Engine, not by a MapReduce program in the Hadoop cluster.

- The data set cannot be encrypted.
- The data set cannot be compressed.

- The data set must be larger than the HDFS block size.
- The submitted SAS code cannot request BY-group processing.
- The submitted SAS code cannot include the STARTOBS= or ENDOBS= options.
- The LIBNAME statement cannot include the HDFSUSER= option.
- The submitted WHERE expression cannot include any of the following syntax:
  - a variable as an operand, such as where lastname;
  - variable-to-variable comparison
  - SAS functions, such as SUBSTR, TODAY, UPCASE, and PUT
  - arithmetic operators *, /, +, -, and **
  - IS NULL or IS MISSING and IS NOT NULL or IS NOT MISSING operators
  - concatenation operator, such as || or !!
  - negative prefix operator, such as where z = -(x + y);
  - pattern-matching operators LIKE and CONTAINS
  - sounds-like operator SOUNDEX (=*)
  - truncated comparison operator using the colon (:) modifier, such as where lastname=: 'S';

TIP: To display information in the SAS log regarding WHERE processing optimization, set the MSGLEVEL= system option to I. When you issue options msglevel=i;, the SAS log reports whether the data filtering occurred in the Hadoop cluster. If the optimization occurred, the Hadoop Job ID is displayed in the SAS log. If the optimization did not occur, additional messages explain why.

Hadoop Requirements

To perform the data subsetting in the Hadoop cluster, the following Hadoop requirements must be met.

- The Hadoop configuration file must include the properties to run MapReduce (MR1) or MapReduce 2 (MR2) and YARN.
- The JRE version for the Hadoop cluster must be either 1.6, which is the default, or 1.7. If the JRE version is 1.7, use the ACCELJAVAVERSION= LIBNAME statement option on page 31 to specify the version.

SPD Engine File System Locking

Overview: SPD Engine File System Locking

The HDFS concurrent access model allows multiple readers and a single writer. If an application accesses a file to write to it, no other application can write to the file, but multiple applications can read the file. The SPD Engine supports a file system locking strategy that honors the HDFS concurrent access model and provides additional levels of concurrent access to ensure the integrity of the data stored in HDFS.

By default, the SPD Engine creates a Write access lock file when a data set stored in HDFS is opened for Write access. With the Write access lock file, no other SAS session can write to the file, but multiple SAS sessions can read the file if the readers accessed the data set before the Write access lock file was created. During concurrent access, the following describes the results of the default SPD Engine locking mechanism:

- Once a SAS session opens a data set for Write access, any previous readers can continue to access the data set. However, the readers could experience unexpected data results. For example, the writer could delete the data set while the readers are accessing it.
- Once a SAS session opens a data set for Write access, any subsequent reader is not allowed to access the data set.

With the Write access locking mechanism, a lock error message occurs in these situations:

- When a SAS session requests Write access to a data set that another SAS session has open for Write access.
- When a SAS session requests Read access to a data set that another SAS session has open for Write access.
- When a SAS session requests to delete a data set that another SAS session has open for Write access.

In the third maintenance release for SAS 9.4, to store the lock files, the SPD Engine creates a lock directory in the /tmp directory. The lock directory name includes the name of the data set, an eight-character hexadecimal value (which is the checksum of the Hadoop cluster directory that contains the data set), and the suffix _spdslock9, such as BigFile_<hexadecimal-value>_spdslock9. In most situations, you will not see the lock directory because lock files are deleted when the process completes.

TIP: In some situations, such as an abnormal termination of a SAS session, lock files might not be properly deleted. The leftover lock files could prohibit access to a data set. If this occurs, the leftover lock files must be manually deleted by submitting HDFS commands.

Requesting Read Access Lock Files

In some situations, you might want to control the level of concurrent access to guarantee the integrity of the data by requesting that a Read access lock file be created. To request a Read access lock file, define the SAS environment variable SPDEREADLOCK and set it to YES. Then, when a SAS session opens a data set for Read access, a Read access lock file is created in addition to any Write access lock files. For more information, see SPDEREADLOCK SAS Environment Variable on page 52.

With the Read and Write access locking mechanism, a lock error message occurs in these situations:

- When a SAS session requests Write access to a data set that another SAS session has open for either Read or Write access.

- When a SAS session requests Read access to a data set that another SAS session has open for Write access.
- When a SAS session requests to delete a data set that another SAS session has open for either Read or Write access.

Note: When you request a Read access lock file, all data access, even for Read access, requires Write permission to the Hadoop cluster.

TIP: By creating both Read and Write access lock files, the possibility of leftover lock files is increased. If you experience situations such as an abnormal termination of a SAS session, lock files that were not properly deleted must be manually deleted by submitting HDFS commands.

Specifying a Pathname for the SPD Engine Lock Directory

By default, for HDFS concurrent access, the SPD Engine creates a lock directory in the /tmp directory. The lock directory name includes the name of the data set, an eight-character hexadecimal value (which is the checksum of the Hadoop cluster directory that contains the data set), and the suffix _spdslock9, such as BigFile_<hexadecimal-value>_spdslock9. In the third maintenance release for SAS 9.4, you can specify a pathname for the SPD Engine lock directory by defining the SAS environment variable SPDELOCKPATH to specify a directory in the Hadoop cluster. For more information, see SPDELOCKPATH SAS Environment Variable on page 51.
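For example, a minimal sketch, assuming that OPTIONS SET= is used to define the environment variable and that the Hadoop cluster directory shown is hypothetical:

   options set=SPDELOCKPATH="/user/abcdef/spdelocks";   /* hypothetical lock directory in the Hadoop cluster */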

SPD Engine Distributed Locking

Overview: SPD Engine Distributed Locking

In the third maintenance release for SAS 9.4, the SPD Engine supports distributed locking for data stored in HDFS. Distributed locking provides synchronization and group coordination services to clients over a network connection. For the service provider, the SPD Engine uses the Apache ZooKeeper coordination service, specifically the implementation of the recipe for Shared Lock that is provided by Apache Curator.

Distributed locking provides the following benefits:

- The lock server maintains the lock state information in memory and does not require Write permission to any client or data library disk storage locations.
- A process requesting a lock on a data set that is not available (because the data set is already locked) can choose to wait for the data set to become available, rather than having the lock request fail immediately.
- If a process abnormally terminates while holding locks on data sets, the lock server automatically drops all locks that the client was holding, which eliminates the possibility of leftover lock files.

Understanding the Service Provider

Apache ZooKeeper is an open-source distributed server that enables reliable distributed coordination for distributed client applications over a network. ZooKeeper safely coordinates access to shared resources with other applications or processes. At its core, ZooKeeper is a fault-tolerant multi-machine server that maintains a virtual hierarchy of data nodes that store coordination data. For more information about ZooKeeper and the ZooKeeper data nodes, see Apache ZooKeeper.

Apache Curator is a high-level API that simplifies using ZooKeeper. Curator adds many features that build on ZooKeeper and handles the complexity of managing connections to the ZooKeeper cluster. For more information about Curator, see Curator. The SPD Engine accesses the Curator API to provide the locking services.

Requirements for SPD Engine Distributed Locking

SPD Engine distributed locking has the following requirements:

- ZooKeeper or later must be downloaded, installed, and running on the Hadoop cluster. The zookeeper JAR file is required.

- Curator or later must be downloaded on the Hadoop cluster. The following Curator JAR files are required:
  - curator-client
  - curator-framework
  - curator-recipes
- The following Hadoop distribution JAR files are required on the client side:
  - guava
  - log4j
  - slf4j
- The JAR files must be available to the SAS client machine. The SAS environment variable SAS_HADOOP_JAR_PATH must be defined and set to the location of the Hadoop JAR files. For complete instructions, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.

TIP: To be effective, all access to SPD data sets must use the same locking method. If some processes or instances use distributed locking and others do not, proper coordination of access to the data sets cannot be guaranteed, and at a minimum, lock failures will be encountered.

Requesting Distributed Locking

To request distributed locking, you must first create an XML configuration file that contains information so that the SPD Engine can communicate with ZooKeeper. The format of the XML is similar to Hadoop configuration files in that the XML contains properties and attributes as name-value pairs. For an example of an XML configuration file, see XML Configuration File on page 46. In addition, you must define the SAS environment variable SPDE_CONFIG_FILE to specify the location of the user-defined XML configuration file. The location must be available to the SAS client machine. For more information, see SPDE_CONFIG_FILE SAS Environment Variable on page 46.
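For example, a minimal sketch, assuming a hypothetical file location (the contents of the XML file follow the documented example on page 46):

   options set=SPDE_CONFIG_FILE="/u/abcdef/spde_zookeeper.xml";   /* hypothetical location of the XML configuration file */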

Updating Data in HDFS

HDFS does not support updating data. However, because traditional SAS processing involves updating data, the SPD Engine supports SAS Update operations for data stored in HDFS. To update data in HDFS, the SPD Engine uses an approach that replaces the data set's data partition file for each observation that is updated. When an update is requested, the SPD Engine re-creates the data partition file in its entirety (including all replications), and then inserts the updated data into the new data partition file. Because the data partition file is replaced for each observation that is updated, the greater the number of observations to be updated, the longer the process. For a general-purpose data storage engine like the SPD Engine, the ability to perform small, infrequent updates can be beneficial. However, updating data in HDFS is intended for situations in which the benefit of the update outweighs the time it takes to complete.

The following are best practices for Update operations using the SPD Engine:

- It is recommended that you set up a test in your environment to measure Update operation performance. For example, update a small number of observations to gauge how long updates take in your environment. Then, project the test results to a larger number of observations to determine whether updating is realistic.
- It is recommended that you do not use the SQL procedure to update data in HDFS because of how PROC SQL opens, updates, and closes a file. Other SAS methods provide better performance, such as the DATA step UPDATE statement and MODIFY statement (see the sketch after this list).
- The performance of appending a data set can be slower if the data set has a unique index. Test case results show that appending a data set to another data set without a unique index was significantly faster than appending the same data set to another data set with a unique index.
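Here is a minimal sketch of an in-place update with the MODIFY statement; the library, data set, variable names, and condition are hypothetical:

   libname myspde spde '/user/abcdef' hdfshost=default;

   data myspde.bigfile;
      modify myspde.bigfile;
      if empnum = 500 then do;      /* hypothetical condition */
         salary = salary * 1.1;     /* hypothetical update */
         replace;                   /* rewrite the current observation */
      end;
   run;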

Using SAS High-Performance Analytics Procedures

You can use the SPD Engine with SAS High-Performance Analytics procedures to read and write the SPD Engine file format in HDFS. In many cases, the SPD Engine data used by the procedures can be read and written in parallel using the SAS Embedded Process.

The following are requirements for a SAS Embedded Process parallel read:

- Access to the machines in the cluster where a SAS High-Performance Analytics deployment of Hadoop is installed and running.
- The data set cannot be encrypted or compressed.
- The STARTOBS= and ENDOBS= data set options cannot be specified.

The following are requirements for a SAS Embedded Process parallel write:

- The ALIGN=, COMPRESS=, ENCRYPT=, and PADCOMPRESS= data set options cannot be specified.
- The SAS client machine must have a data representation that is compatible with the data representation of the Hadoop cluster. The SAS client machine must be either Linux x64 or Solaris x64.

The following are best practices when using the SPD Engine with SAS High-Performance Analytics procedures:

- With SAS Enterprise Miner, a SAS process can be terminated in such a way that the SPD Engine does not follow normal shutdown procedures, which can result in a lock file not being deleted. The orphan lock file could prevent a subsequent open of the data set. If this occurs, the orphan lock file must be manually deleted by submitting Hadoop commands. To delete the orphan lock file, you can use the HADOOP procedure to submit Hadoop commands.
- For SAS High-Performance Analytics Work files, the SPD Engine uses the standard UNIX temporary directory /tmp.

To override the default Work directory, you can define the SAS environment variable SPDE_HADOOP_WORK_PATH to specify a directory in the Hadoop cluster. The directory must exist, and you must have Write access. For example, the following OPTIONS statement sets the Work directory:

   options set=spde_hadoop_work_path="/sasdata/cluster1/hpawork";

For more information, see SPDE_HADOOP_WORK_PATH SAS Environment Variable on page 50.


Chapter 4: SPD Engine Reference

- Overview: SPD Engine Reference
- Dictionary
  - LIBNAME Statement for HDFS
  - ACCELWHERE= Data Set Option for HDFS
  - IOBLOCKSIZE= Data Set Option for HDFS
  - PARTSIZE= Data Set Option for HDFS
  - PARALLELREAD= Data Set Option for HDFS
  - PARALLELWRITE= Data Set Option for HDFS
  - SPDEPARALLELREAD= System Option for HDFS
  - SPDE_CONFIG_FILE SAS Environment Variable
  - SPDE_HADOOP_WORK_PATH SAS Environment Variable
  - SPDELOCKPATH SAS Environment Variable
  - SPDEREADLOCK SAS Environment Variable

Overview: SPD Engine Reference

The SPD Engine reads, writes, and updates data in HDFS. A specific SPD Engine LIBNAME statement and options are provided for Hadoop storage and are explained in this document. For more information about the SPD Engine LIBNAME statement and options that are not specific to Hadoop storage, see SAS Scalable Performance Data Engine: Reference.

Dictionary

LIBNAME Statement for HDFS

Associates a libref with a Hadoop cluster to read, write, and update a data set in HDFS.

Restrictions:
- The SPD Engine LIBNAME statement arguments that are specific to HDFS are not supported in the z/OS operating environment.
- You can connect to only one Hadoop cluster at a time per SAS session. You can submit multiple LIBNAME statements to different directories in the Hadoop cluster, but there can be only one Hadoop cluster connection per SAS session.

Requirements:
- To associate a libref with a Hadoop cluster, you must have the first maintenance release for SAS 9.4 or later.
- Supported Hadoop distributions: Cloudera CDH 4.x, Cloudera CDH 5.x, Hortonworks HDP 2.x, IBM InfoSphere BigInsights 3.x, MapR 4.x (Microsoft Windows and Linux only), Pivotal HD 2.x, with or without Kerberos.
- To store data in HDFS using the SPD Engine, you must use a supported Hadoop distribution and configure a required set of Hadoop JAR files. The JAR files must be available to the SAS client machine. The SAS environment variable SAS_HADOOP_JAR_PATH must be defined and set to the location of the Hadoop JAR files. For complete instructions, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.
- To connect to the Hadoop cluster, Hadoop configuration files must be copied from the specific Hadoop cluster to a physical location that the SAS client machine can access. The SAS environment variable SAS_HADOOP_CONFIG_PATH must be defined and set to the location of the Hadoop configuration files. For complete instructions, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.

Example: Chapter 5, How to Use Hadoop Data Storage, on page 55

Syntax

   LIBNAME libref SPDE 'primary-pathname' HDFSHOST=DEFAULT
      <ACCELJAVAVERSION=version>
      <ACCELWHERE=NO|YES>
      <DATAPATH=('pathname')>
      <HDFSUSER=ID>
      <IOBLOCKSIZE=n>
      <NUMTASKS=n>
      <PARALLELREAD=NO|YES>
      <PARALLELWRITE=NO|YES|threads>
      <PARTSIZE=n|nM|nG|nT>;
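As a minimal sketch (the pathnames and values are hypothetical), several of these options can be combined on one statement:

   libname myspde spde '/user/abcdef' hdfshost=default
      datapath=('/sasdata')        /* hypothetical data partition location */
      accelwhere=yes
      partsize=256m;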

Summary of Optional Arguments

ACCELJAVAVERSION=version
   When requesting that WHERE processing be optimized by being performed in the Hadoop cluster, specifies the Java Runtime Environment (JRE) version for the Hadoop cluster.

ACCELWHERE=NO|YES
   Determines whether WHERE processing is optimized by data subsetting being performed in the Hadoop cluster.

DATAPATH=('pathname')
   When creating a data set, specifies the fully qualified pathname to a directory in the Hadoop cluster to store data partition files.

HDFSUSER=ID
   Is an authorized user ID on the Hadoop cluster.

IOBLOCKSIZE=n
   Specifies a size in bytes of a block of observations to be used in an I/O operation.

NUMTASKS=n
   Specifies the number of MapReduce tasks when writing data in HDFS.

PARALLELREAD=NO|YES
   Determines when the SPD Engine uses parallel processing to read data stored in HDFS.

PARALLELWRITE=NO|YES|threads
   Determines whether the SPD Engine uses parallel processing to write data in HDFS.

PARTSIZE=n|nM|nG|nT
   Specifies a size for the SPD Engine data partition file.

Required Arguments

libref
   is a valid SAS library name that serves as a shortcut name to associate with a data set in a Hadoop cluster. The name can be up to eight characters long and must conform to the rules for SAS names.

SPDE
   is the engine name for the SAS Scalable Performance Data (SPD) Engine.

'primary-pathname'
   specifies the fully qualified pathname to a directory in the Hadoop cluster, which is typically UNIX based. Enclose the primary pathname in single or double quotation marks. An example is '/user/abcdef/'. When data is loaded into a Hadoop cluster directory, the SPD Engine automatically creates a subdirectory with the specified data set name and the suffix _spde. The SPD Engine data partition files are contained in that subdirectory. For example, if you load a data set named BigFile into the directory /user/abcdef/, the data partition files are located at /user/abcdef/bigfile_spde/. The SPD Engine metadata and index files are located at /user/abcdef/.

   Restrictions: Maximum length is 260 characters for Windows and 1024 characters for UNIX. The primary pathname must be unique for each assigned libref. Assigned librefs that are different but reference the same primary pathname can result in lost data.

   Requirement: You must use valid directory syntax for the host. The pathname must be recognized by the operating environment.

   Interaction: You can specify a different location to store the data partition files with the DATAPATH= option on page 32.

HDFSHOST=DEFAULT
   specifies that you want to connect to the Hadoop cluster that is defined in the Hadoop cluster configuration files.

   The SPD Engine locates the Hadoop cluster configuration files using the SAS_HADOOP_CONFIG_PATH environment variable. The environment variable sets the location of the configuration files for a specific cluster. For more information about the SAS_HADOOP_CONFIG_PATH environment variable, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.

   Requirement: You must specify the HDFSHOST=DEFAULT argument.

Optional Arguments

ACCELJAVAVERSION=version
   When requesting that WHERE processing be optimized by being performed in the Hadoop cluster, specifies the Java Runtime Environment (JRE) version for the Hadoop cluster. The value must be either 1.6 or 1.7.

   Default: 1.6

   Interaction: To request that data subsetting be performed in the Hadoop cluster, use the ACCELWHERE= LIBNAME statement option on page 31. By default, data subsetting is performed by the SPD Engine on the SAS client.

   Example: Example 8: Optimizing WHERE Processing with MapReduce on page 69

ACCELWHERE=NO|YES
   Determines whether WHERE processing is optimized by data subsetting being performed in the Hadoop cluster.

   NO
      specifies that data subsetting is performed by the SPD Engine on the SAS client. This is the default setting.

   YES
      specifies that data subsetting is performed by a MapReduce program in the Hadoop cluster.

   Default: NO

   Requirements: To perform data subsetting in the Hadoop cluster, there are data set and SAS code requirements. See WHERE Processing Optimization with MapReduce on page 15. To submit the MapReduce program to the Hadoop cluster, the Hadoop configuration file must include the properties to run MapReduce (MR1) or MapReduce 2 (MR2) and YARN.

   Interactions: If the JRE version for the Hadoop cluster is 1.7 instead of the default version 1.6, use the ACCELJAVAVERSION= LIBNAME statement option on page 31 to specify the version. The ACCELWHERE= data set option overrides the ACCELWHERE= LIBNAME statement option. For more information, see ACCELWHERE= data set option on page 39.

   Example: Example 8: Optimizing WHERE Processing with MapReduce on page 69

DATAPATH=('pathname')
   When creating a data set, specifies the fully qualified pathname to a directory in the Hadoop cluster to store data partition files. Enclose the pathname in single or double quotation marks within parentheses. An example is datapath=('/sasdata'). When data is loaded into the directory, a subdirectory is automatically created with the specified data set name and the suffix _spde. The SPD Engine data partition files are contained in that subdirectory. For example, if you load a data set named BigFile into the directory /user/abcdef/ and specify datapath=('/sasdata/'), the data partition files are located at /sasdata/bigfile_spde/. The SPD Engine metadata and index files are located at /user/abcdef/.

   Restrictions: You can specify only one pathname to store data partition files. Maximum length is 260 characters for Windows and 1024 characters for UNIX. The pathname must be unique for each assigned libref. Assigned librefs that are different but reference the same pathname can result in lost data.

   Requirement: You must use valid directory syntax for the host. The pathname must be recognized by the operating environment.

   Interaction: Specifying the DATAPATH= option overrides the primary pathname for storing the data partition files only. The SPD Engine metadata and index files are always stored in the primary pathname.

HDFSUSER=ID
   Is an authorized user ID on the Hadoop cluster. You can specify a user ID to connect to the Hadoop cluster with a different ID than your current logon ID.

   Restrictions: If the HDFSUSER= option is specified, Kerberos authentication is bypassed, which prevents access to a secure Hadoop cluster. If the HDFSUSER= option is specified, WHERE processing optimization with the ACCELWHERE= option cannot be performed in the Hadoop cluster. HDFSUSER= is not supported by a MapR Apache Hadoop distribution.

IOBLOCKSIZE=n
   Specifies a size in bytes of a block of observations to be used in an I/O operation. The I/O block size determines the amount of data that is physically transferred together in an I/O operation. The larger the block size, the less I/O.

   The SPD Engine uses blocks in memory to collect the observations to be written to or read from a data component file. The IOBLOCKSIZE= option specifies the size of the block. (The actual size is computed to accommodate the largest number of observations that fit in the specified size of n bytes. Therefore, the actual size is a multiple of the observation length.)

   The block size affects I/O operations for compressed, uncompressed, and encrypted data sets. However, the effects are different and depend on the I/O operation.

   - For a compressed data set, the block size determines how many observations are compressed together, which determines the amount of data that is physically transferred for both Read and Write operations. The block size is a permanent attribute of the file. To specify a different block size, you must copy the data set to a new data set, and then specify a new block size for the output file. For a compressed data set, a larger block size can improve performance for both Read and Write operations.
   - For an encrypted data set, the block size is a permanent attribute of the file.
   - For an uncompressed data set, the block size determines the size of the blocks that are used to read the data from disk to memory. The block size has no effect when writing data to disk. For an uncompressed data set, the block size is not a permanent attribute of the file. That is, you can specify a different block size based on the Read operation that you are performing. For example, reading data that is randomly distributed or reading a subset of the data calls for a smaller block size because accessing smaller blocks is faster than accessing larger blocks. In contrast, reading data that is uniformly or sequentially distributed or that requires a full data set scan works better with a larger block size.

   Default: 1,048,576 bytes (1 megabyte)

   Range: The minimum block size is 32,768 bytes. The maximum block size is half the size of the SPD Engine data partition file.

   Restriction: The SPD Engine I/O block size must be smaller than or equal to the Hadoop cluster block size.

Interaction
  The IOBLOCKSIZE= data set option overrides the IOBLOCKSIZE= LIBNAME statement option. For more information, see IOBLOCKSIZE= Data Set Option for HDFS on page 40.

Tip
  When reading a data set, the block size can significantly affect performance. If retrieving a large percentage of the data, a larger block size improves performance. However, if retrieving a subset of the data (such as with WHERE processing), a smaller block size performs better.

Example
  Example 7: Setting the SPD Engine I/O Block Size on page 68

NUMTASKS=n
  Specifies the number of MapReduce tasks when writing data in HDFS. This option controls parallel processing on the Hadoop cluster when writing output from a SAS High-Performance Analytics procedure using the SAS Embedded Process. When a high-performance procedure reads and writes Hadoop data, and the amount of output data is similar to the amount of input data, the same number of output tasks as input tasks should be a good default. However, if the amount of output data differs significantly from the amount of input data, you should use this option to tune the number of tasks proportionally to the output data.

Default
  The number of MapReduce tasks is the number of SAS High-Performance Analytics nodes. Or, if the high-performance procedure reads a Hadoop file as input, it is the number of tasks that were used to read the input file.

Restriction
  This option affects writing data in HDFS only when a high-performance procedure writes output to HDFS using the SAS Embedded Process.

Interaction
  If the specified number of MapReduce tasks is less than the number of SAS High-Performance Analytics nodes on which the procedure runs, the setting is ignored.
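As an illustrative sketch (the libref, path, and task count are hypothetical), the following LIBNAME statement raises the number of MapReduce output tasks for a library whose high-performance procedure output is much larger than its input:

libname hpout spde '/data/spde' hdfshost=default
   numtasks=32;   /* tune the output task count toward the larger output volume */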

PARALLELREAD=NO | YES
  Determines when the SPD Engine uses parallel processing to read data stored in HDFS.
  NO specifies that parallel processing occurs only if a Read operation includes WHERE processing. This is the default behavior for the SPD Engine.
  YES specifies parallel processing for all Read operations using the assigned libref.

Default
  NO

Interactions
  The SET statement POINT= option is inconsistent with parallel processing.
  When parallel processing occurs, the order in which the observations are returned might not be in the physical order of the observations in the data set. Some applications require that observations be returned in the physical order.
  The PARALLELREAD= LIBNAME statement option overrides the SPDEPARALLELREAD= system option. For more information, see SPDEPARALLELREAD= System Option for HDFS on page 45.
  The PARALLELREAD= LIBNAME statement option can be overridden by the PARALLELREAD= data set option. For more information, see PARALLELREAD= Data Set Option for HDFS on page 42.

See
  Parallel Processing for Data in HDFS on page 12
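A brief sketch (the libref and path are hypothetical) that turns on parallel Read processing for every Read operation through the libref, not just those with WHERE processing:

libname myspde spde '/data/spde' hdfshost=default
   parallelread=yes;   /* all Read operations through this libref use parallel processing */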

PARALLELWRITE=NO | YES | threads
  Determines whether the SPD Engine uses parallel processing to write data in HDFS.
  NO specifies that parallel processing for a Write operation does not occur. This is the default behavior for the SPD Engine.
  YES specifies parallel processing for all Write operations using the assigned libref. A thread is used for each CPU on the SAS client machine. For example, if eight CPUs exist on the SAS client machine, then eight threads are used to write data.
  threads specifies parallel processing for all Write operations using the assigned libref and specifies the number of threads to use for the Write operations. The default is 1, which specifies that parallel processing for a Write operation does not occur. The range is 2 to 512.

Default
  NO

Restrictions
  You cannot use parallel processing for a Write operation and also request to create a SAS index.
  You cannot use parallel processing for a Write operation and also request BY-group processing or sorting.

Interactions
  When parallel Write processing occurs, the order in which the observations are written is unpredictable. The order in which the observations are returned cannot be determined unless the application imposes ordering criteria.
  The SPD Engine SPDEMAXTHREADS= system option specifies the maximum number of threads that the SPD Engine uses for processing. For more information, see SAS Scalable Performance Data Engine: Reference.
  The PARALLELWRITE= LIBNAME statement option can be overridden by the PARALLELWRITE= data set option. For more information, see PARALLELWRITE= Data Set Option for HDFS on page 43.

Note
  The PARALLELWRITE= LIBNAME statement option is available in the third maintenance release for SAS 9.4.

See
  Parallel Processing for Data in HDFS on page 12
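As a minimal sketch (the libref, path, and thread count are hypothetical), either form below enables parallel Write processing; remember that it cannot be combined with index creation or BY-group processing:

libname myspde spde '/data/spde' hdfshost=default
   parallelwrite=yes;   /* one write thread per CPU on the SAS client */

libname myspde spde '/data/spde' hdfshost=default
   parallelwrite=8;     /* explicitly request eight write threads */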

PARTSIZE=n | nM | nG | nT
  Specifies a size for the SPD Engine data partition file. Each partition is stored as a separate file with the file extension .dpf. Depending on the amount of data and the partition size, the data can consist of one or more physical files, but is referenced as one logical file. The value is specified in megabytes, gigabytes, or terabytes. If n is specified without M, G, or T, the default is megabytes. That is, partsize=64 is the same as partsize=64m.

Default
  128 megabytes

Restrictions
  The minimum value is 16 megabytes. The maximum value is 8,796,093,022,207 megabytes.

Interaction
  The PARTSIZE= data set option overrides the PARTSIZE= LIBNAME statement option. For more information, see PARTSIZE= Data Set Option for HDFS on page 41.

Tip
  To update data, a smaller partition size provides the best performance. For example, when you update a value, the SPD Engine locates the appropriate partition, modifies the value, and rewrites all replications of the partition. Because each update requires that the partition be rewritten, it is recommended that you perform updates only occasionally or set a small partition size if you are planning to update the data frequently.
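A minimal sketch (the libref, path, and size are hypothetical) that sets a smaller partition size for a library whose data sets will be updated frequently:

libname myspde spde '/data/spde' hdfshost=default
   partsize=32m;   /* 32 MB partitions keep the cost of rewriting a partition low */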

ACCELWHERE= Data Set Option for HDFS
Determines whether WHERE processing is optimized by data subsetting being performed in the Hadoop cluster.

Valid in: DATA step and PROC step
Category: Data Set Control
Default: NO
Requirements: To perform data subsetting in the Hadoop cluster, there are data set and SAS code requirements. For more information, see WHERE Processing Optimization with MapReduce on page 15.
  To submit the MapReduce program to the Hadoop cluster, the Hadoop configuration file must include the properties to run MapReduce (MR1) or MapReduce 2 (MR2) and YARN.
Interaction: If the JRE version for the Hadoop cluster is 1.7 instead of the default 1.6 version, use the ACCELJAVAVERSION= LIBNAME statement option on page 31 to specify the version.

Syntax
ACCELWHERE=NO | YES

Syntax Description
NO
  specifies that data subsetting is performed by the SPD Engine on the SAS client. This is the default setting.
YES
  specifies that data subsetting is performed by a MapReduce program in the Hadoop cluster.

Comparisons
The ACCELWHERE= data set option overrides the ACCELWHERE= LIBNAME statement option.

See Also
ACCELWHERE= LIBNAME statement option on page 31
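A short sketch (the libref, data set, and variable names are hypothetical) that requests in-cluster subsetting for a single step, overriding whatever the LIBNAME statement specifies:

proc freq data=myspde.bigfile (accelwhere=yes);   /* push the WHERE filter to MapReduce for this step only */
   tables region;
   where year > 2012;
run;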

IOBLOCKSIZE= Data Set Option for HDFS
Specifies a size in bytes of a block of observations to be used in an I/O operation.

Valid in: DATA step and PROC step
Category: Data Set Control
Default: 1,048,576 bytes (1 megabyte)
Ranges: The minimum block size is 32,768 bytes. The maximum block size is half the size of the SPD Engine data partition file.
Restriction: The SPD Engine I/O block size must be smaller than or equal to the Hadoop cluster block size.
Tip: When reading a data set, the block size can significantly affect performance. If retrieving a large percentage of the data, a larger block size improves performance. However, if retrieving a subset of the data (such as with WHERE processing), a smaller block size performs better.
Example: Example 7: Setting the SPD Engine I/O Block Size on page 68

Syntax
IOBLOCKSIZE=n

Syntax Description
n
  is the size in bytes of a block of observations.

Details
The I/O block size determines the amount of data that is physically transferred together in an I/O operation. The larger the block size, the less I/O. The SPD Engine uses blocks in memory to collect the observations to be written to or read from a data component file. The IOBLOCKSIZE= data set option specifies the size of the block. (The actual size is computed to accommodate the largest number of observations that fit in the specified size of n bytes. Therefore, the actual size is a multiple of the observation length.)
The block size affects I/O operations for compressed, uncompressed, and encrypted data sets. However, the effects are different and depend on the I/O operation.

For a compressed data set, the block size determines how many observations are compressed together, which determines the amount of data that is physically transferred for both Read and Write operations. The block size is a permanent attribute of the file. To specify a different block size, you must copy the data set to a new data set, and then specify a new block size for the output file. For a compressed data set, a larger block size can improve performance for both Read and Write operations.
For an encrypted data set, the block size is a permanent attribute of the file.
For an uncompressed data set, the block size determines the size of the blocks that are used to read the data from disk to memory. The block size has no effect when writing data to disk. For an uncompressed data set, the block size is not a permanent attribute of the file. That is, you can specify a different block size based on the Read operation that you are performing. For example, reading data that is randomly distributed or reading a subset of the data calls for a smaller block size because accessing smaller blocks is faster than accessing larger blocks. In contrast, reading data that is uniformly or sequentially distributed or that requires a full data set scan works better with a larger block size.

Comparisons
The IOBLOCKSIZE= data set option overrides the IOBLOCKSIZE= LIBNAME statement option.

See Also
IOBLOCKSIZE= LIBNAME statement option on page 33

PARTSIZE= Data Set Option for HDFS
Specifies a size for the SPD Engine data partition file.

Valid in: DATA step and PROC step
Category: Data Set Control
Default: 128 megabytes
Restrictions: The minimum value is 16 megabytes. The maximum value is 8,796,093,022,207 megabytes.
  Specify a data partition file size only when creating a new data set.
Tip: To update data, a smaller partition size provides the best performance. For example, when you update a value, the SPD Engine locates the appropriate partition, modifies the value, and rewrites all replications of the partition. Because each update requires that the partition be rewritten, it is recommended that you perform updates only occasionally or set a small partition size if you are planning to update the data frequently.

Syntax
PARTSIZE=n | nM | nG | nT

Syntax Description
n | nM | nG | nT
  is the size of the data partition file in megabytes, gigabytes, or terabytes. If n is specified without M, G, or T, the default is megabytes. That is, partsize=64 is the same as partsize=64m.

Details
Each partition is stored as a separate file with the file extension .dpf. Depending on the amount of data and the partition size, the data can consist of one or more physical files, but is referenced as one logical file.

Comparisons
The PARTSIZE= data set option overrides the PARTSIZE= LIBNAME statement option.

See Also
PARTSIZE= LIBNAME statement option on page 38
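A minimal sketch (the librefs and data set names are hypothetical) that sets the partition size on the output data set as it is created, overriding the LIBNAME statement setting:

data myspde.sales (partsize=256m);   /* 256 MB partitions for this new data set only */
   set work.sales_staging;
run;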

PARALLELREAD= Data Set Option for HDFS
Determines when the SPD Engine uses parallel processing to read data stored in HDFS.

Valid in: DATA step and PROC step
Category: Data Set Control
Default: NO
Interactions: The SET statement POINT= option is inconsistent with parallel processing.
  When parallel processing occurs, the order in which the observations are returned might not be in the physical order of the observations in the data set. Some applications require that observations be returned in the physical order.
See: Parallel Processing for Data in HDFS on page 12

Syntax
PARALLELREAD=NO | YES

Required Arguments
NO
  specifies that parallel processing occurs only if a Read operation includes WHERE processing. This is the default behavior for the SPD Engine.
YES
  requests parallel processing for all Read operations for the specific data set.

Comparisons
The PARALLELREAD= data set option overrides the SPDEPARALLELREAD= system option and the PARALLELREAD= LIBNAME statement option.

See Also
PARALLELREAD= LIBNAME Statement Option on page 36
SPDEPARALLELREAD= System Option for HDFS on page 45

PARALLELWRITE= Data Set Option for HDFS
Determines whether the SPD Engine uses parallel processing to write data in HDFS.

Valid in: DATA step and PROC step
Category: Data Set Control
Default: NO
Restrictions: You cannot use parallel processing for a Write operation and also request to create a SAS index.
  You cannot use parallel processing for a Write operation and also request BY-group processing or sorting.

Interactions: When parallel Write processing occurs, the order in which the observations are written is unpredictable. The order in which the observations are returned cannot be determined unless the application imposes ordering criteria.
  The SPD Engine SPDEMAXTHREADS= system option specifies the maximum number of threads that the SPD Engine uses for processing. For more information, see SAS Scalable Performance Data Engine: Reference.
Note: The PARALLELWRITE= data set option is available in the third maintenance release for SAS 9.4.
See: Parallel Processing for Data in HDFS on page 12

Syntax
PARALLELWRITE=NO | YES | threads

Required Arguments
NO
  specifies that parallel processing for a Write operation does not occur. This is the default behavior for the SPD Engine.
YES
  specifies parallel processing for all Write operations for the specific data set. A thread is used for each CPU on the SAS client machine. For example, if eight CPUs exist on the SAS client machine, then eight threads are used to write data.
threads
  specifies parallel processing for all Write operations for the specific data set and specifies the number of threads to use for the Write operations. The default is 1, which specifies that parallel processing for a Write operation does not occur. The range is 2 to 512.

Comparisons
The PARALLELWRITE= data set option overrides the PARALLELWRITE= LIBNAME statement option.

See Also
PARALLELWRITE= LIBNAME Statement Option on page 36
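A brief sketch (the librefs and data set names are hypothetical) that enables parallel Write processing for one output data set; note that the data set cannot also be indexed or written with BY-group processing:

data myspde.weblog (parallelwrite=yes);   /* parallel Write for this data set only */
   set work.weblog_staging;
run;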

SPDEPARALLELREAD= System Option for HDFS
Determines when the SPD Engine uses parallel processing to read data stored in HDFS.

Valid in: SAS configuration file, SAS invocation, OPTIONS statement, SAS System Options window
Category: SASFILES: SAS Files
PROC OPTIONS GROUP=: SASFILES
Default: NO
Interactions: The SET statement POINT= option is inconsistent with parallel processing.
  When parallel processing occurs, the order in which the observations are returned might not be in the physical order of the observations in the data set. Some applications require that observations be returned in the physical order.
See: Parallel Processing for Data in HDFS on page 12

Syntax
SPDEPARALLELREAD=NO | YES

Required Arguments
NO
  specifies that parallel processing occurs only if a Read operation includes WHERE processing. This is the default behavior for the SPD Engine.
YES
  requests parallel processing for all Read operations for the SAS session.

Comparisons
The SPDEPARALLELREAD= system option can be overridden by the PARALLELREAD= LIBNAME statement option and the PARALLELREAD= data set option.

See Also
PARALLELREAD= LIBNAME Statement Option on page 36
PARALLELREAD= Data Set Option for HDFS on page 42
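A minimal sketch showing the system option set for the whole session in an OPTIONS statement:

options spdeparallelread=yes;   /* session-wide parallel Read; LIBNAME and data set options can still override */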

SPDE_CONFIG_FILE SAS Environment Variable
Requests SPD Engine distributed locking by specifying the location of the user-defined XML configuration file.

Valid in: SAS configuration file, SAS invocation, OPTIONS statement, SAS System Options window
Default: The SPD Engine uses HDFS distributed locking.
Note: The SPDE_CONFIG_FILE SAS environment variable is available in the third maintenance release for SAS 9.4.
See: SPD Engine Distributed Locking on page 20

Syntax
SPDE_CONFIG_FILE='pathname'

Required Argument
'pathname'
  specifies the fully qualified pathname to the user-defined XML configuration file. The location must be available to the SAS client machine. Enclose the pathname in single or double quotation marks. You can name the file whatever you want. An example is '/user/abcdef/hadoop/spde-site.xml'.

Details

XML Configuration File
The XML configuration file contains the information so that the SPD Engine can communicate with ZooKeeper. The format of the XML configuration file is similar to a Hadoop configuration file in that the XML contains properties and attributes as name and value pairs. You must create an XML configuration file. The following is an example XML configuration file:

<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <property>
    <!-- Comma-separated list of Hadoop cluster machines running a ZooKeeper server. -->
    <name>spde.zookeeper.quorum</name>
    <value>abcdef07.unx.sas.com,abcdef08.unx.sas.com,abcdef06.unx.sas.com</value>
  </property>
  <property>
    <!-- Port number used to connect to the ZooKeeper ensemble. -->
    <name>spde.zookeeper.port</name>
    <value>2181</value>
  </property>
  <property>
    <!-- Number of times to attempt to connect to ZooKeeper before failing. -->
    <name>spde.zookeeper.connect.maxretries</name>
    <value>3</value>
  </property>
  <property>
    <!-- Number of milliseconds to sleep between connection attempts. -->
    <name>spde.zookeeper.connect.retrysleep</name>
    <value>1000</value>
  </property>
  <property>
    <!-- Number of milliseconds to wait before connection considered expired. -->
    <name>spde.zookeeper.connect.timeout</name>
    <value>30000</value>
  </property>
  <property>
    <!-- Number of milliseconds to wait before session considered expired. -->
    <name>spde.zookeeper.session.timeout</name>
    <value>180000</value>
  </property>
  <property>
    <!-- Number of milliseconds to wait before lock request considered failed. -->
    <name>spde.zookeeper.lockwait.timeout</name>
    <value>10000</value>
  </property>
  <property>
    <!-- Number of milliseconds to wait before deleting an empty ZooKeeper data node. -->
    <name>spde.zookeeper.reaper.threshold</name>
    <value>3000</value>
  </property>
</configuration>

Creating the XML Configuration File
The following are XML configuration file properties. The first two properties, spde.zookeeper.quorum and spde.zookeeper.port, are required. The other properties have default values if they are not included in the XML configuration file.

spde.zookeeper.quorum
  a comma-separated list of quorum machines that are configured to work together as a single server. The listed machines must be running a ZooKeeper server and servicing requests on the port that is specified in the spde.zookeeper.port property. This property is required.

spde.zookeeper.port
  the I/O port on which the quorum machines that are listed in the spde.zookeeper.quorum property are configured to service requests. This property is required.

spde.zookeeper.connect.maxretries
  the maximum number of times that Curator attempts to connect to ZooKeeper before failing. Values less than or equal to zero are ignored. The default is 3.

spde.zookeeper.connect.retrysleep
  the milliseconds that Curator sleeps between attempts to connect to ZooKeeper. The sleep time starts with this setting, but increases between each attempt. Values less than or equal to zero are ignored. The default is 1,000.

spde.zookeeper.connect.timeout
  the milliseconds that Curator and the ZooKeeper client wait for a communication from the ZooKeeper server before considering the server connection to be expired. When operating normally, the client establishes a connection to the server and communicates with it over that connection. If the connection is non-responsive for more than the specified value, it is considered expired and is dropped, followed by an attempt to establish a new connection. Values less than or equal to zero are ignored. The default is 30,000.

spde.zookeeper.session.timeout
  the milliseconds that Curator and the ZooKeeper client wait for a communication from the ZooKeeper server before considering the client session to be expired. When operating normally, the client establishes a connection to the server and communicates with it over that connection. The connection might be dropped and reestablished as the network or server nodes experience faults, but the client session continues to exist for the duration of these interruptions. If an interruption persists for more than the specified value, the client session is considered expired and is terminated. No reconnection is possible after that. Values less than or equal to zero are ignored. The default is 180,000.

spde.zookeeper.lockwait.timeout
  the milliseconds that the ZooKeeper server waits for a lock to become available before declaring that a lock request has failed and returning control to the client. Values less than zero are ignored. A value of zero is valid.

spde.zookeeper.reaper.threshold
  the milliseconds that the ZooKeeper server waits before deleting an empty ZooKeeper server node. The default is 3,000.

Defining the SPDE_CONFIG_FILE Environment Variable
The following table includes examples of defining the SPDE_CONFIG_FILE environment variable:

Table 4.1 Defining the SPDE_CONFIG_FILE Environment Variable

Method                  Example
SAS configuration file  -set SPDE_CONFIG_FILE /user/abcdef/hadoop/spde-site.xml
SAS invocation          -set SPDE_CONFIG_FILE /user/abcdef/hadoop/spde-site.xml
OPTIONS statement       options set=spde_config_file='/user/abcdef/hadoop/spde-site.xml';

SPDE_HADOOP_WORK_PATH SAS Environment Variable
Specifies a pathname for SAS High-Performance Analytics work files.

Valid in: SAS configuration file, SAS invocation, OPTIONS statement, SAS System Options window
Default: The SPD Engine uses the standard UNIX temporary directory /tmp.
See: Using SAS High-Performance Analytics Procedures on page 24

Syntax
SPDE_HADOOP_WORK_PATH='pathname'

Required Argument
'pathname'
  specifies the fully qualified pathname to a directory in the Hadoop cluster, which is typically UNIX based. Enclose the pathname in single or double quotation marks. An example is '/sasdata/cluster1/hpawork'.

  Requirement: The directory must exist, and you must have Write access.

Details
The following table includes examples of defining the SPDE_HADOOP_WORK_PATH environment variable:

Table 4.2 Defining the SPDE_HADOOP_WORK_PATH Environment Variable

Method                  Example
SAS configuration file  -set SPDE_HADOOP_WORK_PATH /sasdata/cluster1/hpawork
SAS invocation          -set SPDE_HADOOP_WORK_PATH /sasdata/cluster1/hpawork
OPTIONS statement       options set=spde_hadoop_work_path='/sasdata/cluster1/hpawork';

SPDELOCKPATH SAS Environment Variable
Specifies a pathname for the SPD Engine lock directory for HDFS concurrent access.

Valid in: SAS configuration file, SAS invocation, OPTIONS statement, SAS System Options window
Default: The SPD Engine creates a lock directory in the /tmp directory. The lock directory name includes the name of the data set, an eight-character hexadecimal value (which is the checksum of the Hadoop cluster that contains the data set), and the suffix _spdslock9.
Note: The SPDELOCKPATH SAS environment variable is available in the third maintenance release for SAS 9.4.
See: SPD Engine File System Locking on page 18

Syntax
SPDELOCKPATH='pathname'

Required Argument
'pathname'
  specifies the fully qualified pathname to a directory in the Hadoop cluster, which is typically UNIX based. Enclose the pathname in single or double quotation marks. An example is '/user/abcdef/'.

  Tip: Specify only one lock directory pathname for each Hadoop cluster so that the same data set is not using different lock directories.

Details
The following table includes examples of defining the SPDELOCKPATH environment variable:

Table 4.3 Defining the SPDELOCKPATH Environment Variable

Method                  Example
SAS configuration file  -set SPDELOCKPATH /user/abcdef
SAS invocation          -set SPDELOCKPATH /user/abcdef
OPTIONS statement       options set=spdelockpath='/user/abcdef';

SPDEREADLOCK SAS Environment Variable
Determines whether a Read access lock file is created.

Valid in: SAS configuration file, SAS invocation, OPTIONS statement, SAS System Options window
Default: NO
See: SPD Engine File System Locking on page 18

Syntax
SPDEREADLOCK=NO | YES

Required Arguments
NO
  specifies that a Read access lock file is not created when a data set stored in HDFS is opened for Read access. This is the default behavior for the SPD Engine. Only Write access lock files are created.
YES
  specifies that a Read access lock file is created when a data set stored in HDFS is opened for Read access. Once the lock file is created, no other SAS process can open the data set for Write access.

Details
To control the level of concurrent access, you can request a Read access lock file by defining the SAS environment variable SPDEREADLOCK and setting it to YES. Then, when a SAS session opens a data set for Read access, a lock file is created in addition to any Write access lock files. The following table includes examples of defining the SPDEREADLOCK environment variable:

Table 4.4 Defining the SPDEREADLOCK Environment Variable

Method                  Example
SAS configuration file  -set SPDEREADLOCK YES
SAS invocation          -set SPDEREADLOCK YES
OPTIONS statement       options set=spdereadlock YES;


5  How to Use Hadoop Data Storage

Overview: How to Use Hadoop Data Storage
Example 1: Loading Existing SAS Data Using the COPY Procedure
Example 2: Creating a Data Set Using the DATA Step
Example 3: Adding to Existing Data Set Using the APPEND Procedure
Example 4: Loading Oracle Data Using the COPY Procedure
Example 5: Analyzing Data Using the FREQ Procedure
Example 6: Managing SAS Files Using the DATASETS Procedure
Example 7: Setting the SPD Engine I/O Block Size
Example 8: Optimizing WHERE Processing with MapReduce

Overview: How to Use Hadoop Data Storage

These examples illustrate how to use Hadoop data storage. The examples show you how to load existing data into a Hadoop cluster, how to create a new data set in a Hadoop cluster, and how to append data to an existing data set in a Hadoop cluster. Other examples show you how to load Oracle data into a Hadoop cluster and how to access data sets stored in a Hadoop cluster for data management and analysis.

Note: The example data was created to illustrate SPD Engine functionality to read, write, and update data sets in a Hadoop cluster. The example data does not reflect the type of data or file size that might typically be loaded into a Hadoop cluster.

Example 1: Loading Existing SAS Data Using the COPY Procedure

Details
This example loads existing SAS data into a Hadoop cluster. The example uses the default Base SAS engine, the SPD Engine, and the COPY procedure. The data set named MyBase.BigFile is copied, converted to the SPD Engine format, and then written to the Hadoop cluster as an SPD Engine data set named MySpde.BigFile.

Program
options set=sas_hadoop_config_path="\\sashq\root\u\abcdef\cdh45p1"; /* 1 */
options set=sas_hadoop_jar_path="\\sashq\root\u\abcdef\cdh45";

libname mybase 'C:\SASFiles'; /* 2 */

libname myspde spde '/data/spde' hdfshost=default; /* 3 */

proc copy in=mybase out=myspde; /* 4 */
   select bigfile;
run;

Program Description
1  The OPTIONS statements include the SET system option to define the SAS_HADOOP_CONFIG_PATH environment variable and the SAS_HADOOP_JAR_PATH environment variable. The environment variables set the location of the configuration files and the JAR files for a specific Hadoop cluster.
2  The first LIBNAME statement assigns the libref MyBase to the physical location of the SAS library that stores the data set. (The default Base SAS engine for a LIBNAME statement is the V9 (or BASE) engine.)

3  The second LIBNAME statement assigns the libref MySpde to a directory in the Hadoop cluster. The SPD Engine is specified. The HDFSHOST=DEFAULT argument specifies to connect to the Hadoop cluster that is defined in the Hadoop cluster configuration files.
4  The COPY procedure copies the data set named BigFile. The SPD Engine creates a subdirectory with the specified data set name and the suffix _spde, converts the data to the SPD Engine format, and writes the data to the directory in the Hadoop cluster. HDFS distributes the data on the Hadoop cluster. The SPD Engine data partition files for the data set BigFile are located at /data/spde/bigfile_spde/. The first partition file is named bigfile.dpf.080e0a8f.0.1.spds9.

Example 2: Creating a Data Set Using the DATA Step

Details
This example creates a data set named MySpde.Fitness in a Hadoop cluster. The example uses the default Base SAS engine, the SPD Engine, and the DATA step SET statement to concatenate several data sets. The data sets are converted to the SPD Engine format and then written to a directory in the Hadoop cluster.

Program
options set=sas_hadoop_config_path="\\sashq\root\u\abcdef\cdh45p1"; /* 1 */
options set=sas_hadoop_jar_path="\\sashq\root\u\abcdef\cdh45dl";

libname mybase 'C:\SASFiles'; /* 2 */

libname myspde spde '/data/spde' hdfshost=default; /* 3 */

data myspde.fitness; /* 4 */
   set mybase.fitness_2010 mybase.fitness_2011 mybase.fitness_2012;
run;

Program Description
1  The OPTIONS statements include the SET system option to define the SAS_HADOOP_CONFIG_PATH environment variable and the SAS_HADOOP_JAR_PATH environment variable. The environment variables set the location of the configuration files and the JAR files for a specific Hadoop cluster.
2  The first LIBNAME statement assigns the libref MyBase to the physical location of the SAS library that stores the data sets. (The default Base SAS engine for a LIBNAME statement is the V9 (or BASE) engine.)
3  The second LIBNAME statement assigns the libref MySpde to a directory in the Hadoop cluster. The SPD Engine is specified. The HDFSHOST=DEFAULT argument specifies to connect to the Hadoop cluster that is defined in the Hadoop cluster configuration files.
4  The DATA statement assigns the name Fitness to the new data set. The SET statement lists the names of existing data sets to be read. The SPD Engine copies the three input data sets, concatenates them into one output data set named Fitness, converts the data to the SPD Engine format, and then writes the data to the directory in the Hadoop cluster. HDFS distributes the data on the Hadoop cluster.

Example 3: Adding to Existing Data Set Using the APPEND Procedure

Details
This example adds data to an existing data set that is stored in a Hadoop cluster. The example uses the default Base SAS engine, the SPD Engine, and the APPEND procedure. The data sets named MyBase.September and MyBase.October are converted to the SPD Engine format and then written to the existing data set named Sales.YearToDate.

Program
options set=sas_hadoop_config_path="\\sashq\root\u\abcdef\cdh45p1"; /* 1 */
options set=sas_hadoop_jar_path="\\sashq\root\u\abcdef\cdh45";

libname mybase 'C:\SASFiles'; /* 2 */

libname sales spde '/data/spde' hdfshost=default; /* 3 */

proc append base=sales.yeartodate data=mybase.september; /* 4 */
run;

proc append base=sales.yeartodate data=mybase.october; /* 5 */
run;

Program Description
1  The OPTIONS statements include the SET system option to define the SAS_HADOOP_CONFIG_PATH environment variable and the SAS_HADOOP_JAR_PATH environment variable. The environment variables set the location of the configuration files and the JAR files for a specific Hadoop cluster.
2  The first LIBNAME statement assigns the libref MyBase to the physical location of the SAS library that stores the data sets. (The default Base SAS engine for a LIBNAME statement is the V9 (or BASE) engine.)
3  The second LIBNAME statement assigns the libref Sales to a directory in the Hadoop cluster. The SPD Engine is specified. The HDFSHOST=DEFAULT argument specifies to connect to the Hadoop cluster that is defined in the Hadoop cluster configuration files.
4  The first PROC APPEND copies the data from MyBase.September to Sales.YearToDate. The SPD Engine converts the data to the SPD Engine format and then writes the data to the directory in the Hadoop cluster. HDFS distributes the data on the Hadoop cluster.
5  The second PROC APPEND copies the data from MyBase.October to Sales.YearToDate. The SPD Engine converts the data to the SPD Engine format and then writes the data to the directory in the Hadoop cluster. HDFS distributes the data on the Hadoop cluster.

Example 4: Loading Oracle Data Using the COPY Procedure

Details
This example loads Oracle data into a Hadoop cluster. The example uses the SAS/ACCESS to Oracle engine, the SPD Engine, and the COPY procedure. The table named MyOracle.Oracle1 is written to the Hadoop cluster as an SPD Engine data set named MySpde.Oracle1.

Program
options set=sas_hadoop_config_path="\\sashq\root\u\abcdef\cdh45p1"; /* 1 */
options set=sas_hadoop_jar_path="\\sashq\root\u\abcdef\cdh45";

libname myoracle oracle user=myusr1 password=mypwd1 path=mysrv1; /* 2 */

libname myspde spde '/data/spde' hdfshost=default; /* 3 */

proc copy in=myoracle out=myspde; /* 4 */
   select oracle1;
run;

Program Description
1  The OPTIONS statements include the SET system option to define the SAS_HADOOP_CONFIG_PATH environment variable and the SAS_HADOOP_JAR_PATH environment variable. The environment variables set the location of the configuration files and the JAR files for a specific Hadoop cluster.

2  The first LIBNAME statement assigns the libref MyOracle, specifies the Oracle engine, and specifies the connection information to the Oracle database that contains the Oracle table.
3  The second LIBNAME statement assigns the libref MySpde to a directory in the Hadoop cluster. The SPD Engine is specified. The HDFSHOST=DEFAULT argument specifies to connect to the Hadoop cluster that is defined in the Hadoop cluster configuration files.
4  The COPY procedure copies the table named Oracle1. The SPD Engine creates a subdirectory with the specified data set name and suffix _spde, converts the data to the SPD Engine format, and writes the data to the directory in the Hadoop cluster as an SPD Engine data set. HDFS distributes the data on the Hadoop cluster. The SPD Engine data partition files for the data set Oracle1 are located at /data/spde/oracle1_spde/.

Example 5: Analyzing Data Using the FREQ Procedure

Details
This example analyzes the data set StudentID that is stored in a Hadoop cluster. The data set contains 3,231,765 observations and three variables: ID, Age, and Name. The example uses the SPD Engine and the FREQ procedure to produce a one-way frequency table for the students' ages.

Program
options set=sas_hadoop_config_path="\\sashq\root\u\abcdef\cdh45p1"; /* 1 */
options set=sas_hadoop_jar_path="\\sashq\root\u\abcdef\cdh45";

libname class spde '/data/spde' hdfshost=default; /* 2 */

proc freq data=class.studentid; /* 3 */

   tables age;
run;

Program Description
1  The two OPTIONS statements include the SET system option to define the SAS_HADOOP_CONFIG_PATH environment variable and the SAS_HADOOP_JAR_PATH environment variable. The environment variables set the location of the configuration files and the JAR files for a specific Hadoop cluster.
2  To read a data set that is stored in a Hadoop cluster, simply connect to the cluster with the LIBNAME statement for the SPD Engine. The LIBNAME statement assigns the libref Class to a directory in the Hadoop cluster and specifies the SPD Engine. The HDFSHOST=DEFAULT argument specifies to connect to the Hadoop cluster that is defined in the Hadoop cluster configuration files.
3  PROC FREQ produces a one-way frequency table for the students' ages.

Figure 5.1 PROC FREQ One-Way Frequency Table

Example 6: Managing SAS Files Using the DATASETS Procedure

Details
This example illustrates how to manage SAS files that are stored in a Hadoop cluster. The example uses the DATASETS procedure to list the SAS files, describe the contents of a specific data set, and delete a data set from HDFS.

Program
options set=sas_hadoop_config_path="\\sashq\root\u\abcdef\cdh45p1"; /* 1 */
options set=sas_hadoop_jar_path="\\sashq\root\u\abcdef\cdh45";

libname myspde spde '/data/spde' hdfshost=default; /* 2 */

proc datasets library=myspde; /* 3 */
   contents data=studentid (listfiles=yes); /* 4 */
run;
   delete bigfile; /* 5 */
run;
quit;

Program Description
1  The two OPTIONS statements include the SET system option to define the SAS_HADOOP_CONFIG_PATH environment variable and the SAS_HADOOP_JAR_PATH environment variable. The environment variables set the location of the configuration files and the JAR files for a specific Hadoop cluster.
2  To manage your SAS files that are stored in a Hadoop cluster, simply connect to the cluster with the LIBNAME statement for the SPD Engine.

The LIBNAME statement assigns the libref MySpde to a directory in the Hadoop cluster and specifies the SPD Engine. The HDFSHOST=DEFAULT argument specifies to connect to the Hadoop cluster that is defined in the Hadoop cluster configuration files.
3  PROC DATASETS lists the SAS files that are stored in the directory in the Hadoop cluster.
4  The CONTENTS statement describes the contents of the data set named StudentID, which includes the number of observations, whether the data set has an index, and the observation length. The LISTFILES= data set option lists the complete pathnames of the SPD Engine files such as the data partition files and the metadata file.
5  The DELETE statement removes the data set named BigFile. The SPD Engine data partition, metadata, and index files are removed. The data set name subdirectory is also removed unless the subdirectory contains files other than the data partition files.

Figure 5.2 MySpde Directory Listing

Figure 5.3 Contents of StudentID Data Set

Example 7: Setting the SPD Engine I/O Block Size

Details
This example illustrates how to set the SPD Engine I/O block size to improve performance. The example uses the SPD Engine, an uncompressed data set, and SAS procedures to analyze the data.

Program
options set=sas_hadoop_config_path="\\sashq\root\u\abcdef\cdh45p1"; /* 1 */
options set=sas_hadoop_jar_path="\\sashq\root\u\abcdef\cdh45";

libname class spde '/data/spde' hdfshost=default; /* 2 */

proc means data=class.studentid; /* 3 */
   var age;
run;

proc print data=class.studentid (ioblocksize=32768); /* 4 */
   where age > 18;
run;

Program Description
1  The two OPTIONS statements include the SET system option to define the SAS_HADOOP_CONFIG_PATH environment variable and the SAS_HADOOP_JAR_PATH environment variable. The environment variables set the location of the configuration files and the JAR files for a specific Hadoop cluster.
2  The LIBNAME statement assigns the libref Class to a directory in the Hadoop cluster. The SPD Engine is specified. The HDFSHOST=DEFAULT argument specifies to connect to the Hadoop cluster that is defined in the Hadoop cluster configuration files.

The LIBNAME statement does not include the IOBLOCKSIZE= option, so the default I/O block size is 1,048,576 bytes (1 megabyte).
3  The MEANS procedure calculates statistics on the Age variable. Because the Read operation requires a full data set scan, the procedure uses the default I/O block size, which was set from the LIBNAME statement. For this Read operation, including the IOBLOCKSIZE= data set option to specify a larger I/O block size could improve performance. When retrieving a large percentage of the data, a larger block size provides a performance benefit.
4  The PRINT procedure requests output where the value of the Age variable is greater than 18. Because the Read operation requests a subset of the data, the procedure includes the IOBLOCKSIZE= data set option to specify a smaller I/O block size. A smaller I/O block size provides better performance because the SPD Engine does not read large blocks of observations when it only needs a few observations from the block.

Example 8: Optimizing WHERE Processing with MapReduce

Details
This example illustrates how to optimize WHERE processing by requesting that data subsetting be performed in the Hadoop cluster. This example analyzes the data set StudentID that is stored in a Hadoop cluster and submits the WHERE expression to the Hadoop cluster as a MapReduce program. By requesting that data subsetting be performed in the Hadoop cluster, performance is improved by taking advantage of the filtering and ordering capabilities of the MapReduce framework. As a result, only the subset of the data is returned to the SAS client.

Program
options msglevel=i; /* 1 */

options set=sas_hadoop_config_path="\\sashq\root\u\abcdef\cdh45p1"; /* 2 */
options set=sas_hadoop_jar_path="\\sashq\root\u\abcdef\cdh45";

libname class spde '/data/spde' hdfshost=default accelwhere=yes; /* 3 */

proc freq data=class.studentid;
   tables age;
   where age gt 14; /* 4 */
run;

Program Description
1  The first OPTIONS statement specifies the MSGLEVEL=I SAS system option to request that informative messages be written to the SAS log. For WHERE processing optimization, the SAS log reports whether the data filtering occurred in the Hadoop cluster.
2  The next two OPTIONS statements include the SET system option to define the SAS_HADOOP_CONFIG_PATH environment variable and the SAS_HADOOP_JAR_PATH environment variable. The environment variables set the location of the configuration files and the JAR files for a specific Hadoop cluster.
3  The LIBNAME statement assigns the libref Class to a directory in the Hadoop cluster and specifies the SPD Engine. The HDFSHOST=DEFAULT argument specifies to connect to the Hadoop cluster that is defined in the Hadoop cluster configuration files. The ACCELWHERE=YES argument requests that data subsetting be performed by a MapReduce program in the Hadoop cluster.
4  PROC FREQ produces a one-way frequency table for the students' ages that are greater than 14. The WHERE expression, which defines the condition that selected observations must satisfy, is instantiated as a Java class. The SPD Engine submits the Java class to the Hadoop cluster as a component in a MapReduce program. As a result, only a subset of the data is returned to the SAS client.

Figure 5.4 PROC FREQ One-Way Frequency Table, Optimized WHERE Processing

Note: The SAS log reports that there were 2,371,486 observations read from the data set. That number of observations is a subset of the data set stored in the Hadoop cluster, which contains 3,231,765 observations.

Log 5.1 SAS Log Reporting WHERE Optimization

1   options msglevel=i;
2   options set=sas_hadoop_config_path="\\sashq\root\u\abcdef\cdh45p1";
3   options set=sas_hadoop_jar_path="\\sashq\root\u\abcdef\cdh45";
4   libname class spde '/data/spde' hdfshost=default accelwhere=yes;
NOTE: Libref CLASS was successfully assigned as follows:
      Engine:        SPDE
      Physical Name: /data/spde/
5   proc freq data=class.studentid;
6      tables age;
7      where age gt 14;
whinit: WHERE (Age>14)
whinit returns: ALL EVAL2
8   run;
NOTE: Writing HTML Body file: sashtml.htm
NOTE: There were 2371486 observations read from the data set CLASS.STUDENTID.
      WHERE age>14;
      WHERE processing is optimized on the Hadoop cluster.
      Hadoop Job ID: job_ _14972
NOTE: PROCEDURE FREQ used (Total process time):
      real time  2:31.74
      cpu time   1.70 seconds


Appendix 1: Hive SerDe for SPD Engine Data

Accessing SPD Engine Data Using Hive
  Introduction
  Requirements for Accessing SPD Engine Tables with Hive
  Deploying the SPD Engine SerDe
  Registering the SPD Engine Table Metadata in the Hive Metastore
  Reading SPD Engine Tables from Hive
  Logging Support
  How the SPD Engine SerDe Reads the Data
Troubleshooting

Accessing SPD Engine Data Using Hive

Introduction
Hive uses an interface called SerDe to translate data that is stored in HDFS in proprietary formats such as JSON and Parquet. A SerDe deserializes data into a Java object that HiveQL and other languages that are supported by HiveServer2 can manipulate. Hive provides a variety of built-in SerDes and supports custom SerDes. For more information about Hive SerDes, see your Hive documentation.

In the third maintenance release for SAS 9.4, SAS provides a custom Hive SerDe for SPD Engine data that is stored in HDFS. The SerDe makes the data available for applications outside of SAS to query. The SPD Engine SerDe does not support creating, altering, or updating SPD Engine data in HDFS using HiveQL or other languages. That is, the SerDe is Read-only and cannot serialize data for storage in HDFS. If you want to process SPD Engine data stored in HDFS using SAS applications, you should access it directly with the SPD Engine.
In addition, if the SPD Engine table in HDFS has any of the following features, it cannot be registered in Hive or use the SerDe. You must access it by going through SAS and the SPD Engine. The following table features are not supported:
o compressed or encrypted tables
o tables with SAS informats
o tables that have user-defined formats
o password-protected tables
o tables owned by the SAS Scalable Performance Data Server
In addition, the following processing functionality is not supported by the SerDe and requires processing by the SPD Engine:
o Write, Update, and Append operations
o if preserving observation order is required

Requirements for Accessing SPD Engine Tables with Hive
The following are required to access SPD Engine tables using the SPD Engine SerDe:
o You must deploy SAS Foundation using the SAS Deployment Wizard. Select SAS Hive SerDe for SPDE Data.

Figure A1.1 SAS Deployment Wizard Product Selection Page

o You must be running a supported Hadoop distribution that includes Hive 0.13:
  o Cloudera CDH 5.2
  o Hortonworks HDP 2.1 or later
  o MapR or later
o The SPD Engine table stored in HDFS must have been created using the SPD Engine.
o The SerDe is delivered as two JAR files, which must be deployed to all nodes in the Hadoop cluster.


SAS 9.4 Intelligence Platform SAS 9.4 Intelligence Platform Installation and Configuration Guide Second Edition SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2015. SAS 9.4 Intelligence

More information

OnDemand for Academics

OnDemand for Academics SAS OnDemand for Academics User s Guide SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2015. SAS OnDemand for Academics: User's Guide. Cary, NC:

More information

SAS 9.3 Intelligence Platform

SAS 9.3 Intelligence Platform SAS 9.3 Intelligence Platform Application Server Administration Guide SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc 2011. SAS SAS 9.3 Intelligence

More information

SAS Task Manager 2.2. User s Guide. SAS Documentation

SAS Task Manager 2.2. User s Guide. SAS Documentation SAS Task Manager 2.2 User s Guide SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2015. SAS Task Manager 2.2: User's Guide. Cary, NC: SAS Institute

More information

SAS 9.4 Intelligence Platform: Migration Guide, Second Edition

SAS 9.4 Intelligence Platform: Migration Guide, Second Edition SAS 9.4 Intelligence Platform: Migration Guide, Second Edition SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2015. SAS 9.4 Intelligence Platform:

More information

SAS University Edition: Installation Guide for Linux

SAS University Edition: Installation Guide for Linux SAS University Edition: Installation Guide for Linux i 17 June 2014 The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2014. SAS University Edition: Installation Guide

More information

SAS. Cloud. Account Administrator s Guide. SAS Documentation

SAS. Cloud. Account Administrator s Guide. SAS Documentation SAS Cloud Account Administrator s Guide SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2014. SAS Cloud: Account Administrator's Guide. Cary, NC:

More information

SAS 9.3 Scalable Performance Data Engine: Reference

SAS 9.3 Scalable Performance Data Engine: Reference SAS 9.3 Scalable Performance Data Engine: Reference SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2011. SAS 9.3 Scalable Performance Data Engine:

More information

SAS University Edition: Installation Guide for Windows

SAS University Edition: Installation Guide for Windows SAS University Edition: Installation Guide for Windows i 17 June 2014 The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2015. SAS University Edition: Installation Guide

More information

When to Move a SAS File between Hosts

When to Move a SAS File between Hosts 3 CHAPTER Moving and Accessing SAS Files between Hosts When to Move a SAS File between Hosts 3 When to Access a SAS File on a Remote Host 3 Host Types Supported According to SAS Release 4 Avoiding and

More information

SAS 9.3 Logging: Configuration and Programming Reference
