EMC ISILON STORAGE BEST PRACTICES FOR ELECTRONIC DESIGN AUTOMATION
White Paper

EMC ISILON STORAGE BEST PRACTICES FOR ELECTRONIC DESIGN AUTOMATION

Abstract
This paper describes best practices for setting up and managing an EMC Isilon cluster to store data for electronic design automation.

June 2013
Copyright 2013 EMC Corporation. All Rights Reserved. EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

The information in this publication is provided as is. EMC Corporation makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license. For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com.

EMC², EMC, the EMC logo, Isilon, FlexProtect, InsightIQ, OneFS, SmartConnect, SmartPools, SmartQuotas, SnapshotIQ, and SyncIQ are registered trademarks or trademarks of EMC Corporation in the United States and other countries.
Table of Contents

Introduction
EDA workflows and workloads
EMC Isilon scale-out NAS
Isilon node
Overcome EDA storage challenges with EMC Isilon
Improve runtimes for concurrent jobs
Overview of best practices for EDA
Obtain statistics to tune an EDA workflow
Match workloads with nodes and storage pools
Network connections
File system and protocols
Limits for pools, directories, files, and names
Data protection
EMC Isilon SnapshotIQ
EMC Isilon SyncIQ
EMC Isilon SmartQuotas
Identity management, authentication, and access control
Permissions
Home directories
Summary of high-level best practices for EDA
Obtain statistics to tune an EDA workflow
Optimize performance with storage analytics
Test changes before putting them into production
Match workloads with nodes and storage pools
Analyze connections and access patterns
Align workloads with data access patterns
Align datasets with storage pools
SmartPools for EDA data
SSD strategies
Global namespace acceleration
Write caching with SmartCache
Guidelines for file pool management
Check the optimization settings of directories and files
Plan for network performance and throughput
Networking
Ethernet speeds
Frame sizes
Internal InfiniBand network redundancy
IP address allocation planning
EMC Isilon SmartConnect
Connection-balancing strategies
High availability with dynamic NFS failover
Connection-balancing and failover policies
Service subnets and DNS integration
SmartConnect and failover IP optimization
SmartConnect best practices
Size a SmartConnect pool
Optimize throughput and performance
File systems and data access protocols
Structure of the file system
Data layout
Caching
Request block sizes and random read and write performance
NFS
NFS performance settings
Use NFS version 3 or NFS over TCP, not UDP
Sync and async options
NFS rsize and wsize
Hard mounts
Multiple clients writing to the same files
Enable readdirplus on clients
Recommended client mount settings and system controls
NFS server threads
SMB
Change notify
SMB signing
Limits for pools, directories, files, and names
Pool capacity
Maximum number of files in a directory
Directory structures
Maximum depth of a directory
Maximum path length of names for nested directories
Data protection
N+M data protection
Data protection best practices
Data mirroring
Balance data protection with storage utilization
Virtual hot spare
EMC Isilon SnapshotIQ
Snapshot best practices
EMC Isilon SyncIQ
Business continuance
Avoid full dataset replications
Select the right source replication dataset
Performance tuning guidelines
Limitations and restrictions
Using SmartConnect with SyncIQ
Performance and policy job monitoring
Target-aware initial synchronization
Best practices for SyncIQ jobs
EMC Isilon SmartQuotas
Include data protection overhead in disk usage calculations
Include the space that snapshots consume
View reports to manage quotas
Additional best practices for setting quotas
Permissions for mixed environments
On-disk identity
Run the repair permissions job after changing the on-disk identity
ACL policies for mixed environments
Permissions policies
Run chmod on a file with an ACL
The inheritance of ACLs created on directories by chmod
Chown command on files with ACLs
Access checks (chmod, chown)
Advanced settings
Owner and group permissions
No deny ACEs
Home directories
Sizing guidance for storage capacity
SnapshotIQ
Capacity planning considerations
Capacity planning best practices
Conclusion
Introduction

The EMC Isilon scale-out network-attached storage (NAS) platform combines modular hardware with unified software to harness unstructured data. Powered by the distributed EMC Isilon OneFS operating system, an EMC Isilon cluster delivers a scalable pool of storage with a global namespace.

The use of distributed software to scale data across commodity hardware sets OneFS apart from other storage systems. Each node in an Isilon cluster controls data requests, boosts performance, and expands the cluster's capacity. For electronic design automation (EDA), the Isilon scale-out distributed architecture minimizes CPU bottlenecks, rapidly serves metadata, and optimizes wall clock performance for concurrent jobs. This paper describes best practices for managing an Isilon cluster to maximize performance for EDA workflows.

EDA workflows and workloads

An EDA workflow includes a logical and a physical phase. During the logical phase, engineers architect a chip design by compiling source files into a chip model. As engineers create the design, they check out its source code from a software configuration management system, such as Subversion (SVN) or Perforce, to refine specifications. Engineers then simulate the chip design by scheduling and running jobs in a large compute grid. Scheduling the compile and simulation jobs involves using a scheduler, such as the IBM Platform Load Sharing Facility (LSF) or the Oracle Grid Engine. The scheduler distributes the build and simulation jobs to the available slots on the compute resources. Efficiency in creating, scheduling, and executing both build and simulation jobs can reduce the time it takes to bring a chip to market.

The logical phase generates an input/output (I/O)-intensive workload: EDA applications read and compile thousands of source files to build and simulate a chip design. The logical phase also invokes other applications, scripts, and verification processes, which vary by environment. A storage system manages the various design projects and files so that different users, scripts, and applications can access the data.

Within the storage system, EDA workflows tend to store a large number of files in a single directory amid a deep directory structure on a large cluster, often with more than 20 Isilon nodes. Project directories dominate the file system, and the project directories require performance for the hundreds of projects stored on them. Projects that build source code read and write thousands of small files, while projects that build simulations read and write many large files. The workflow for the projects includes backing up data and taking several snapshots of the directories daily and nightly. Given the large number of files and directories as well as the growing size of design files over time, the storage system must make file management easy and provide seamless access to random file types.

EDA uses a large compute grid, or build farm, for high-performance computing. The grid, which can number more than 1,000 client computers, requires several IP addresses for each Isilon node to distribute client connections effectively and to redistribute connections in case a node fails. A cluster may also include home
directories for users and the safe archiving of design blocks for reuse in future projects.

The clients and applications in the compute grid determine the cluster's workload: Across the EDA industry, I/O profiles vary by tool and design stage, from many small random workloads to large sequential workloads. A common denominator is workloads that run many build jobs concurrently, generating high CPU usage on the storage system as the jobs access directories and files. In many cases, most file system operations get attributes, perform lookups, or retrieve other metadata as the workflow consolidates up to millions of small files, each describing a gate or block, into a file of several terabytes for the physical design phase. Meanwhile, other aspects of the design cycle depend heavily on read and write operations. Hardware-software verification tests, for example, can result in a large single-threaded read operation followed by many slow write operations. For most workflows, the intensity of I/O operations necessitates solid-state drives (SSDs), write coalescing, coherent caching, and clustered RAM to efficiently serve requests for files and metadata. To serve I/O requests with the kind of performance that EDA tools require, the storage system should hold the working set of data in memory.

After the logical phase produces a design that works, the physical design phase converts the logical design into a physical chip, verifies it, and prepares it for tape-out to a foundry, which manufactures the chip. The tape-out process combines the many layers of a chip's design into a single model to send to the foundry for manufacturing.

The workflows and workloads of the logical phase constitute a requirement to optimize the system that stores the design files. Tuning the storage system to serve data without bottlenecks and to support running concurrent jobs is key to reducing a chip's time to market. As the next sections of this paper demonstrate, the architecture of Isilon scale-out NAS is ideally suited to reduce bottlenecks and optimize concurrency. Later sections consider options for tuning an Isilon cluster to help meet these objectives.

EMC Isilon scale-out NAS

OneFS combines the three traditional layers of storage architecture (file system, volume manager, and data protection) into a scale-out NAS cluster. In contrast to a scale-up approach, EMC Isilon takes a scale-out approach by making a cluster of nodes that runs a distributed file system. Each node adds resources to the cluster. Because each node contains globally coherent RAM, as a cluster becomes larger, it becomes faster. Meanwhile, the file system expands dynamically and redistributes content, which eliminates the work of partitioning disks and creating volumes. There is no CPU controller head to cause bottlenecks. Nodes work as peers to spread data across the cluster. Segmenting and distributing data, a process known as striping, not only protects data, but also enables a client connecting to any node to take advantage of the entire cluster's performance.
For an EDA workflow, the Isilon scale-out distributed architecture eliminates CPU bottlenecks and optimizes the performance of processing concurrent jobs when you add the right number of nodes to the cluster to serve a compute grid.

Isilon node

As a rack-mountable appliance, a node includes the following components in a 2U or 4U chassis: CPUs, RAM, NVRAM, network interfaces, InfiniBand adapters, disk controllers, and storage media. Each node runs the Isilon OneFS operating system, the distributed file system software that unites the nodes into a cluster.

An Isilon cluster comprises three or more nodes, up to 144. A cluster's storage capacity ranges from a minimum of 18 TB to a maximum of 15.5 PB. When you add a node to a cluster, you increase the cluster's aggregate disk, cache, CPU, RAM, and network capacity. OneFS groups RAM into a single globally coherent cache so that a data request on a node benefits from data that is cached anywhere. NVRAM is grouped to write data with high throughput and to protect write operations from power failures. As the cluster expands, spindles and CPU combine to increase throughput, capacity, and I/O operations per second (IOPS). EMC Isilon makes several types of nodes, all of which can be added to a cluster to balance capacity and performance with throughput or IOPS.

Table 1: EMC Isilon nodes
S-Series: IOPS-intensive applications such as EDA
X-Series: High-concurrency and throughput-driven workflows
NL-Series: Near-primary accessibility, with near-tape value

The EMC Isilon Performance Accelerator extension node provides independent scaling for high performance by adding processing power, memory, bandwidth, and parallel read/write access.

Overcome EDA storage challenges with EMC Isilon

EMC Isilon S-Series nodes deliver the performance that EDA workflows and workloads demand. The EMC Isilon S200 node delivers 1.1 million network file system (NFS) SPECsfs2008 file operations per second with more than 100 GB/s of aggregate throughput. The S200 combines SSDs with 10,000 RPM 2.5-inch Serial Attached SCSI (SAS) drive technology, two quad-core Intel CPUs, and up to 13.8 TB of globally coherent cache. SSD technology accelerates namespace-intensive metadata operations. You can optimize storage for EDA workflows by creating policies to store metadata and other latency-sensitive data on SSDs.

To help eliminate bottlenecks, the OneFS operating system includes several cache types, a prefetching option, and dual 10 gigabit Ethernet (GbE) connections. The globally coherent cache provides rapid access to stored data by connecting to any node. You can also easily scale the globally coherent cache to hold your working data set in memory as the data set grows in size. The local coherent cache delivers rapid access to data stored on a node when a client connects to it.
By default, OneFS optimizes operations to get attributes (getattr) with a built-in stat() cache. Similarly, OneFS prefetches readdirplus data. You can adjust the number of file nodes that OneFS prefetches for readdirplus, an option that you can use to tailor readdirplus to your workflow. OneFS also provides several options for prefetching data to address streaming, random, and concurrent access patterns. The dual 10 GbE connections to each node support the high levels of network utilization that take place during EDA's simulation and verification phases.

Improve runtimes for concurrent jobs

A critical question is how many Isilon S-Series nodes should be included in a cluster so that the cluster processes, with the optimal level of performance, concurrent I/O requests from the compute grid. With the right number of nodes, an Isilon cluster can reduce wall clock runtime for concurrent jobs after the time that other storage systems hit a saturation point for the controller. Finding an Isilon cluster's optimal point, the point at which it scales in processing concurrent jobs and reduces wall clock runtimes in relation to other systems for the same workload, depends on the size of your compute grid, the number of jobs, the working datasets, and other factors. EMC Isilon recommends that you work with your Isilon representative to determine the number of nodes that will best serve your workflow.

Overview of best practices for EDA

You can tune an EMC Isilon cluster to boost performance and throughput for electronic design automation. But because every EDA workflow, network, and computing environment is different, tailoring a cluster to meet your needs requires research, consultation with your Isilon representative, testing, and a cost-benefit analysis. This document provides guidelines to help you make decisions about how to set up a large cluster that will maximize the storage performance of an EDA workflow. Keep in mind, however, that there are no generic answers or configuration templates for EDA. How you tune your cluster depends on several primary factors:

The characteristics of your workflows
The characteristics of your working datasets
The size of your compute grid

Obtain statistics to tune an EDA workflow

Before you tune your Isilon cluster or your clients, you should analyze how your EDA workflow interacts with the storage system by gathering statistics about your common file sizes and I/O operations, including CPU usage and latency.

Match workloads with nodes and storage pools

You should match performance-demanding workflows with high-performance nodes. The Isilon S200 with SSDs performs well for EDA workflows. The SSDs enhance the performance of NFS lookups, getattr operations, and other metadata operations. The
metadata operations pull data quickly from the disks when the metadata is not found in the cache.

An Isilon cluster can also combine nodes of different types into a single file system and then segment nodes into storage pools with different capacity-to-performance ratios. A best practice is to use EMC Isilon SmartPools policies to implement storage pools that maximize performance for working datasets while cost-effectively storing less important data. File pool policies can automate the distribution of different file types to a storage tier that matches their performance requirements. A file pool policy can also change the layout of the data on the underlying disk for optimal read and write performance.

Network connections

For EDA workflows, you should establish two 10 GbE connections to each node to support the high levels of network utilization that take place during the simulation and verification phases. Using 10 GbE connections also helps support the workload of EDA tools like Synopsys and Cadence. A best practice is to bind multiple IP addresses to each node interface in an EMC Isilon SmartConnect network pool. Generally, optimal balancing and failover is achieved when the number of addresses allocated to the network pool equals N * (N - 1), where N equals the number of node interfaces in the pool. Thus, if a pool is configured with a total of five node interfaces, the optimal IP address allocation would total 20 IP addresses (5 * (5 - 1) = 20) to allocate four IP addresses to each node interface in the pool. Assigning each workload or data store to a unique IP address enables Isilon SmartConnect to move each workload to one of the other interfaces, minimizing the additional workload that a remaining node in the SmartConnect pool must absorb and ensuring that the workload is evenly distributed across all the other nodes in the pool.

File system and protocols

You can optimize how the OneFS operating system lays out and prefetches data to match your dominant access pattern: concurrent, streaming, or random. For most EDA workflows, the default concurrent access pattern will usually work best. You can also modify the OneFS caches and NFS performance settings to improve performance for EDA.

Limits for pools, directories, files, and names

A best practice to maintain optimal performance is to fill a cluster's capacity only up to 80 percent per pool, especially for an EDA workflow. In general, it is more efficient to create a directory structure that consolidates files in a single directory than it is to spread files out over many subdirectories because OneFS can more efficiently add a B-tree block to a directory than it can add an inode to a subdirectory. With SSDs, you should limit the number of files in a directory to 100,000. With hard disk drives (HDDs), you should limit the number of files in a directory to 20,000.
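To check an existing project tree against these guidelines, you can count the immediate entries of each directory from the cluster or from any client that mounts it. This is only a sketch: the path is a placeholder, and the limit should be set to 100,000 for SSD pools or 20,000 for HDD pools as described above.

#!/bin/sh
# Flag directories whose immediate entry count exceeds the per-directory
# guidance. /ifs/proj is a placeholder path; adjust LIMIT to 100000 for
# SSD pools or 20000 for HDD pools.
LIMIT=20000
find /ifs/proj -type d | while read -r dir; do
    count=$(find "$dir" -mindepth 1 -maxdepth 1 | wc -l)
    if [ "$count" -gt "$LIMIT" ]; then
        echo "$count entries: $dir"
    fi
done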
You should probably limit the number of subdirectories to 10,000 because exceeding 10,000 subdirectories might affect the performance of command-line tools and other applications.

Data protection

For most EDA workflows, in which optimizing for an aggregation of workloads trumps optimizing for a single job, you should consider using +2:1 for clusters with fewer than 18 nodes. For clusters with more than 18 nodes, consider using +2. The protection level that you select, however, depends on various factors, including the number of nodes in the cluster. The right protection level for your cluster might also depend on which version of OneFS you are running. Because so many variables interact to determine the protection level for an environment, a best practice is to consult with your Isilon representative about selecting a protection level. Isilon can analyze the cluster to identify its mean time to data loss and then suggest an optimal policy for you.

EMC Isilon SnapshotIQ

The best practices for snapshots depend on factors and objectives that vary across environments. For EDA workflows, a key question is whether to take snapshots on a per-project basis or an all-project basis. A per-project approach, for instance, might take a separate snapshot of each directory in which a project resides. An all-project approach, in contrast, might take one snapshot at the high-level directory that contains all the project directories. One strategy is to minimize the number of snapshot policies, thereby reducing the number of snapshots that you need to manage and decreasing the number of snapshot delete jobs. When you delete snapshots, it is preferable to have a small number of snapshots in a snapshot delete job.

EMC Isilon SyncIQ

For EDA workflows, business continuance is important. EMC Isilon SyncIQ maximizes the use of network bandwidth, delivering fast replication times for tight recovery point objectives (RPOs). You can also use SyncIQ with EMC Isilon SnapshotIQ for storing as many point-in-time snapshots of your data as needed to support secondary activities like backing up data to tape. The performance tuning guidelines in this document can help you exploit SyncIQ to support business continuance.

EMC Isilon SmartQuotas

The EMC Isilon SmartQuotas module tracks disk usage with reports and enforces storage limits with alerts. Using the SmartQuotas module is a best practice because it helps optimize storage capacity while minimizing costs. EMC Isilon recommends that you send notifications to users when they approach their storage limits and that you analyze reports to allocate storage more efficiently. Each quota that you set should include the overhead for data protection and the space that snapshots consume. Do not create quotas on the root directory of the default OneFS share (ifs) because the quota might degrade performance. Governing a single directory with overlapping quotas can also degrade performance.
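As a rough way to apply the guidance above about including protection overhead and snapshot space, the sketch below pads a project's expected logical size before you choose a hard threshold. The 20 percent protection overhead and 10 percent snapshot reserve are assumptions to adjust; the actual protection overhead depends on your protection level and pool configuration.

#!/bin/sh
# Rough quota sizing: logical data plus assumed data protection overhead
# and snapshot reserve. The percentages are placeholders to adjust.
# Usage: sh quota_size.sh <logical_size_in_GB>
LOGICAL_GB=$1
PROTECT_PCT=20   # assumed data protection overhead, percent
SNAP_PCT=10      # assumed space consumed by snapshots, percent
QUOTA_GB=$(( LOGICAL_GB + LOGICAL_GB * PROTECT_PCT / 100 + LOGICAL_GB * SNAP_PCT / 100 ))
echo "Suggested hard quota: ${QUOTA_GB} GB"

For example, running sh quota_size.sh 500 with these placeholders suggests a 650 GB quota for a project expected to hold 500 GB of logical data.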
Identity management, authentication, and access control

A best practice is to use Microsoft Active Directory with Windows Services for UNIX and RFC 2307 attributes to manage Linux, UNIX, and Windows systems. Integrating UNIX and Linux systems with Active Directory centralizes identity management and eases interoperability. Make sure that your domain controllers are running Windows Server 2003 or later. If you use Lightweight Directory Access Protocol (LDAP) for Linux systems and Active Directory for Windows systems, the OneFS user mapping service can help manage users across domains.

Permissions

An Isilon cluster's default settings handle permissions securely and effectively for most networks that mix UNIX and Windows systems. EMC Isilon recommends that you use the option to configure permissions policies manually. Managing permissions policies manually gives you a range of options for responding to the kind of special cases that can surface in a mixed environment. In general, however, EMC Isilon recommends that you avoid changing the advanced settings for permissions. For chmod commands, using the option to merge new permissions with the existing ACL is recommended for mixed networks. Merging the new permissions with the ACL is the best way to balance the preservation of security with the expectations of users. For chown commands, using the option to leave the ACL unmodified is the recommended approach because it preserves the ACEs that explicitly allow or deny access to specific users and groups.

Home directories

Capacity planning entails scaling an Isilon cluster to accommodate the competing demands of combined workloads. In the case of home directories, workload requirements are driven by several factors: disk capacity to accommodate the combined data storage requirements of all targeted users; sufficient disk throughput to support the combined transactional requirements of all users; and enough network bandwidth to provide adequate throughput. This paper includes guidelines to plan home directories amid competing workflows.

Summary of high-level best practices for EDA

Analyze storage statistics to identify areas for optimization
Maximize the use of EMC Isilon S-Series nodes to deliver high-performance storage for EDA workflows
Store metadata and the active working set of file data on solid-state drives
Create SmartPools policies to manage how latency-sensitive data is stored
Add one or more Isilon Performance Accelerator nodes to help ensure that the active working set stays in memory
Work with Isilon sales engineers to establish the right number of nodes for your workflow to improve wall clock performance for concurrent jobs
Set up an external network with dual 10 GbE connections to each storage node
Bind multiple IP addresses to each node interface in a SmartConnect network pool
Use SnapshotIQ and SyncIQ for data protection and business continuance
Apply the SmartQuotas module to track disk usage with reports and enforce limits with alerts
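Several of these practices come together on the client side. Later sections cover NFS client settings in detail; as a preview, a Linux host in the compute grid would typically mount the cluster over NFS version 3 on TCP with a hard mount, along the lines of the sketch below. The host name, export path, mount point, and transfer sizes are placeholders; use the values recommended later in this paper for your environment.

# Example NFSv3 mount from a Linux build client; names and sizes are
# placeholders to adapt to your environment.
mount -t nfs -o vers=3,proto=tcp,hard,rsize=131072,wsize=131072 \
    cluster.example.com:/ifs/proj /mnt/proj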
Obtain statistics to tune an EDA workflow

Before you tune your Isilon cluster or your clients, you should analyze how your EDA workflow interacts with the storage system by gathering statistics about your common file sizes and I/O operations, including CPU usage and latency.

With OneFS 7.0 or later, you can obtain key statistics and timing data for delete, renew lease, create, remove, set userdata, get entry, and other file system operations by connecting to a node with SSH and running the following command as root to turn on the vopstat system control:

sysctl efs.util.vopstats.record_timings=1
efs.util.vopstats.record_timings: 0 -> 1

After you turn on vopstats, you can view them by running the following command as root (you can also run this command on OneFS 6.x, but the results do not include timing data):

sysctl efs.util.vopstats

Here is an example of the command's output:

efs.util.vopstats.ifs_snap_set_userdata.initiated: 26
efs.util.vopstats.ifs_snap_set_userdata.fast_path: 0
efs.util.vopstats.ifs_snap_set_userdata.read_bytes: 0
efs.util.vopstats.ifs_snap_set_userdata.read_ops: 0
efs.util.vopstats.ifs_snap_set_userdata.raw_read_bytes: 0
efs.util.vopstats.ifs_snap_set_userdata.raw_read_ops: 0
efs.util.vopstats.ifs_snap_set_userdata.raw_write_bytes: 0
efs.util.vopstats.ifs_snap_set_userdata.raw_write_ops: 0
efs.util.vopstats.ifs_snap_set_userdata.timed: 0
efs.util.vopstats.ifs_snap_set_userdata.total_time: 0
efs.util.vopstats.ifs_snap_set_userdata.total_sqr_time: 0
efs.util.vopstats.ifs_snap_set_userdata.fast_path_timed: 0
efs.util.vopstats.ifs_snap_set_userdata.fast_path_total_time: 0
efs.util.vopstats.ifs_snap_set_userdata.fast_path_total_sqr_time: 0

The time data captures the number of operations that cross the OneFS clock tick, which is 10 milliseconds. Independent of the number of events, the total_sqr_time value provides no actionable information because of the granularity of events. To analyze the operations, use the total_time value instead. The following example shows only the total time records in the vopstats:

sysctl efs.util.vopstats | grep -e "total_time: [^0]"
efs.util.vopstats.access_rights.total_time:
efs.util.vopstats.lookup.total_time:
efs.util.vopstats.unlocked_write_mbuf.total_time:
efs.util.vopstats.unlocked_write_mbuf.fast_path_total_time:
efs.util.vopstats.commit.total_time:
efs.util.vopstats.unlocked_getattr.total_time:
efs.util.vopstats.unlocked_getattr.fast_path_total_time:
efs.util.vopstats.inactive.total_time:
efs.util.vopstats.islocked.total_time:
efs.util.vopstats.lock1.total_time:
efs.util.vopstats.unlocked_read_mbuf.total_time:
efs.util.vopstats.readdir.total_time:
efs.util.vopstats.setattr.total_time:
efs.util.vopstats.unlock.total_time:
efs.util.vopstats.ifs_snap_delete_resume.timed:
efs.util.vopstats.ifs_snap_delete_resume.total_time:
efs.util.vopstats.ifs_snap_delete_resume.total_sqr_time:

With OneFS 6.0 or later, you can run the following command to obtain statistics for protocols, client connections, and the file system:

isi statistics

To view NFS protocol statistics, for example, you can run the isi statistics command like this:

isi statistics pstat --protocol=nfs3

Here is an example of the output:

NFS3 Operations Per Second
null 0.00/s
getattr 74.14/s
setattr 4.53/s
lookup 1.40/s
access 9.60/s
readlink 0.00/s
read /s
write /s
create 0.20/s
mkdir 0.00/s
symlink 0.00/s
mknod 0.00/s
remove 0.00/s
rmdir 0.00/s
rename 0.20/s
link 0.00/s
readdir 0.00/s
readdirplus 0.00/s
statfs 0.18/s
fsinfo 0.00/s
pathconf 0.00/s
commit 10.82/s
noop 0.00/s
TOTAL /s

CPU Utilization: user 1.5%, system 7.6%, idle 91.0%
OneFS Stats: In MB/s, Out MB/s, Total MB/s
Network Input: MB/s, Pkt/s, Errors/s 0.00
Network Output: MB/s, Pkt/s, Errors/s 0.00
Disk I/O: Read 7.52 MB/s, Write MB/s, Disk iops
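Because these counters and rates change as jobs run, it can help to capture them at intervals while a representative build or regression is in flight and compare the samples afterward. The following sketch simply records the NFSv3 operation rates and the nonzero vopstats total_time counters once a minute; the output directory and interval are arbitrary choices.

#!/bin/sh
# Sample NFSv3 operation rates and vopstats timing counters once a minute
# while a representative workload runs. /ifs/data/perf-samples is a
# placeholder output directory.
OUTDIR=/ifs/data/perf-samples
mkdir -p "$OUTDIR"
while true; do
    STAMP=$(date +%Y%m%d-%H%M%S)
    isi statistics pstat --protocol=nfs3 > "$OUTDIR/nfs3-$STAMP.txt"
    sysctl efs.util.vopstats | grep -e "total_time: [^0]" > "$OUTDIR/vopstats-$STAMP.txt"
    sleep 60
done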
You can also run the isi statistics command with the client argument to view I/O information and timing data by client:

isi statistics client

And you can run the isi statistics command with the system argument to view CPU utilization by protocol. For example:

isi statistics system
Node CPU SMB FTP HTTP ISCSI NFS HDFS Total NetIn NetOut DiskIn DiskOut
LNN %Used B/s B/s B/s B/s B/s B/s B/s B/s B/s B/s B/s
All K K 26K 134M M 37M 101M 178M 289M

Other performance statistics appear in the web administration interface. For the various methods of obtaining statistics about the storage system, see the OneFS Administration Guide and the OneFS Command Reference.

A key question is: Which types of operations consume the most CPUs or result in the most access latency? Create and remove operations, in particular, can consume more CPUs than other operations, but whether create and remove operations are affecting performance depends on your workflow and working set. For EDA, readdirplus as well as read and write operations can also be costly. If a directory contains a million files, enumeration might cost too many resources in comparison to other operations. Once you can associate a type of operation with a degradation in performance (by, for example, analyzing timing data and CPU usage), you can tailor the system to process the operation faster.

Optimize performance with storage analytics

EMC Isilon InsightIQ monitors and analyzes the performance of an Isilon cluster to help you optimize storage resources and forecast capacity. To maximize performance over time, a best practice is to track performance and capacity with InsightIQ, which is an optionally licensed feature.

Test changes before putting them into production

Before you modify your production cluster, you should, if possible, implement the changes on a test cluster that uses the same version and the same settings as your production cluster. You can validate your key workflows on a test cluster by simulating how your administrators, users, and applications interact with the system. Verifying optimizations on a test cluster can expose issues that could degrade the performance of your production system. But verifying optimizations on a test cluster is unlikely to be feasible for EDA workflows. The test environment will probably not replicate the real workflow, or test results might be inconclusive or, worse, they could be misleading. If you make changes to a production cluster, you must be prepared to revert them immediately if they produce side effects or if they fail to enhance performance.
Match workloads with nodes and storage pools

This section discusses how to match a workflow with the different types of Isilon nodes and how to use storage pools to streamline an EDA workflow.

Analyze connections and access patterns

A critical factor in building a cluster and tuning its performance is having a clear understanding of the access patterns of users and applications. For example:

How many clients does the cluster need to support?
What percentage of the client connections do you expect to be active at a given time?
What is the dominant file-sharing protocol, Server Message Block (SMB) or NFS?
What is the volume of workload that the active connections carry?
What types of workloads do the active connections carry? And what are the characteristics of the workloads: large files or small files?
What are the characteristics of the working datasets?
What is the dominant access pattern for each dataset?

The number of projected active user connections is a key figure. Storage workloads tend to be driven more by the number of active connections than by the number of overall connections. A thousand users, for example, might not require a high level of sustained throughput if only a hundred of the users maintain active connections at one time.

The dominant file-sharing protocol that your clients use can play a role in determining the cluster's requirements for CPUs, memory, and network bandwidth. SMB client connections typically require a higher amount of overhead per client than do comparable workloads from NFS clients, especially those using Network File System version 3 (NFSv3). In contrast, network and disk throughput rates stem more from the workload and use case than from the file-sharing protocol.

Align workloads with data access patterns

Once you identify your dominant access pattern (concurrent, streaming, or random), you can optimize how OneFS lays out data to match it. By default, OneFS optimizes data layout for concurrent access. For most EDA workflows, the default concurrent access pattern, with its medium level of prefetching data, will often work best. If your dominant access pattern is streaming (that is, it has lower concurrency, higher single-stream workloads), you can prefetch data well in advance of data requests to increase sequential read performance. Streaming is most effective on clusters or storage pools serving large files. If your workload is truly random, the random access pattern, which prefetches no data, might increase performance.

Table 2: OneFS access pattern options
Concurrent: Prefetches data at a medium level
Streaming: Prefetches data far in advance of requests
Random: Prefetches no data
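Whether a dataset leans toward streaming or random access often correlates with its file-size profile, so a coarse size histogram of a project tree can inform the choice before you change any layout settings. The path and the bucket boundaries below are arbitrary; note that stat -f '%z' is the FreeBSD form used on the cluster itself, while a Linux client would use stat -c '%s' instead.

#!/bin/sh
# Print a coarse file-size histogram for a project tree to help decide
# whether a dataset is dominated by small or large files.
# /ifs/proj is a placeholder path; bucket boundaries are arbitrary.
find /ifs/proj -type f -print0 | xargs -0 stat -f '%z' | awk '
    { if ($1 < 65536) small++; else if ($1 < 104857600) medium++; else large++ }
    END {
        printf "files under 64 KB: %d\n", small
        printf "64 KB to 100 MB:   %d\n", medium
        printf "files over 100 MB: %d\n", large
    }'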
With a SmartPools license, you can apply a different access pattern to each storage pool that you create. This enables you to set the optimal pattern for each dataset.

Align datasets with storage pools

With a SmartPools license, you can create node pools, file policies, and tiers of storage. Node pools segment nodes into groups so that you can align a dataset with its performance requirements. File policies can isolate and store files by type, path, size, and other attributes. Tiers optimize the storage of data by need, such as a frequently used high-speed tier or a rarely accessed archive.

Because you can combine EMC Isilon nodes from the S-Series, the X-Series, and the NL-Series into a cluster, you can set up pools of nodes to accelerate access to important working sets while placing inactive data in more cost-effective storage. For example, you can group S-Series nodes with 900 GB SAS drives and 400 GB SSDs per node in one pool, while you put NL-Series nodes with 3 TB SATA drives in another. As each type of node has a different capacity-to-performance ratio, you should assign different datasets to node pools that meet the dataset's capacity or performance requirements. In this way, the node pools can isolate the datasets of critical applications from other data sets to increase performance for important data and decrease storage costs for immaterial data.

After you set up node pools, you can establish file pools and govern them with policies. The policies move files, directories, and file pools among node pools or tiers. Routing files with a certain file extension into a node pool containing S-Series nodes can deliver faster read and write performance. Another policy can evaluate the last-modified date to archive old files.

SmartPools also manages data overflow with a spillover policy. When a pool becomes full, OneFS redirects write operations to another pool. It is a best practice to apply a spillover policy to a pool that handles a critical workflow.

SmartPools for EDA data

You can build node pools and policies to streamline an EDA workflow. If your EDA workflow includes timely critical data as well as historical, nontimely data, you can set up SmartPools to serve critical data quickly while serving historical data less quickly and less expensively. In addition, you can protect both data types equally. For example, with SmartPools, you can create two pools and one policy. One pool contains higher-performance SAS drives, while the other contains lower-cost SATA drives. Both pools are set at a high protection level. You need to create only one policy to manage the data, a policy that keeps current data on the faster drives.

To place an active working set on faster drives, the policy can use file attributes to route files to a pool of S-Series nodes with SSDs to increase throughput rates and decrease latency levels. At the same time, the policy can move older files, as identified by the last-modified or last-accessed date, to a data storage target of NL-Series disks to reserve the capacity in the S-Series node pool for newer data. When a policy moves a file or directory from one disk pool to another, the file's location in the file system tree remains the same, and you do not need to reconfigure clients to access the file in the new location. A file pool policy can apply to directories or files, and it can filter files by file name, path,
type, and size, as well as other file attributes. You can also separate snapshot storage from data storage to route snapshots to a cost-effective target.

SSD strategies

A SmartPools policy can set an SSD strategy that accelerates access to metadata and files. Moving the file and directory metadata from HDDs to SSDs can yield a noticeable improvement in workflow performance. Table 3 lists the SSD strategies from slowest to fastest.

Table 3: OneFS SSD use case options

Avoid SSDs
Description: Writes file data and metadata to HDDs only. Implementing this strategy might degrade performance.
Use case: The storage pool contains archived data or data without performance requirements.

Metadata read acceleration
Description: Writes both file data and metadata to HDDs. An extra mirror of the file metadata is written to SSDs. The SSD mirror is in addition to the number required to meet the protection level.
Use case: The storage pool contains metadata that is more frequently read than written to, and the workflow's performance depends on rapid metadata read operations.

Metadata read and write acceleration
Description: Writes file data to HDDs and all metadata mirrors to SSDs. This strategy accelerates metadata writes in addition to reads but requires about four to five times more SSD storage than metadata read acceleration. This strategy is available only with OneFS 7.0 or later.
Use case: The storage pool contains metadata that is frequently read and written to. The workflow's performance depends on rapid metadata read and write operations.

Data on SSDs
Description: Uses SSD node pools for both data and metadata. This strategy does not result in the creation of additional mirrors beyond the normal protection level but requires significantly increased storage requirements compared with the other SSD strategies.
Use case: The storage pool contains metadata and files that a critical workflow must access with high performance.

Global namespace acceleration

When only a few of the nodes in a cluster contain SSDs, you can speed up metadata access, an important aspect of an EDA workflow, by using global namespace acceleration. If your EDA data consists of many files spread across many project directories, accessing metadata can slow down performance. In such a case, you can forego some faster spinning media for much of the data by speeding up metadata read access with global namespace acceleration.
With SmartPools, you can keep a copy of all the cluster's metadata on SSDs in a few of the nodes. The result lets you quickly locate any data while reducing the cost of storing many of your files. Global namespace acceleration applies to all the nodes in a cluster; it is not tied to a pool or a policy. Global namespace acceleration, however, requires that a percentage of the cluster's nodes contain SSDs. For more information, see the OneFS Administration Guide.

Write caching with SmartCache

Write caching improves performance for most workflows. To optimize I/O operations, write caching coalesces data into a write-back cache and writes the data to disk at the best time. EMC Isilon calls write caching SmartCache. In general, EMC Isilon recommends that you leave write caching turned on.

OneFS interprets writes as either synchronous or asynchronous, depending on a client's specifications. The impact and risk of write caching depend on the protocols with which your clients write data to the cluster and on whether the writes are interpreted as synchronous or asynchronous. If you disable write caching, OneFS ignores your clients' specifications and writes data synchronously. For more information, see the OneFS Administration Guide.

Guidelines for file pool management

EMC Isilon recommends the following guidelines to improve overall performance:

Plan file pool policies carefully, especially the sequence in which they are applied. Without proper planning and analysis, policies can conflict with or override one another.
Note that directories with large amounts of stale data (for instance, data that has not been accessed for more than 60 days) can be migrated with a file pool policy to archive storage on an NL-Series pool. OneFS supplies a template for this policy.
If directories take a long time to load in a file manager like Windows Explorer because there are a large number of objects (directories, files, or both), enable metadata acceleration for the data to improve load time.

For more information on SmartPools file pool policies, see the white paper titled Next Generation Storage Tiering with EMC Isilon SmartPools on the EMC website and the OneFS Administration Guide.

Check the optimization settings of directories and files

You can check the attributes of directories and files to see your current settings by running the following command and replacing data in the example with the name of a directory or file. The command's output, which shows the properties of a directory named data, is abridged to remove some immaterial data.

isi get -D data

POLICY W LEVEL PERFORMANCE COAL ENCODING FILE IADDRS
default 4x/2 concurrency on N/A ./ <1,1, :512>, <1,4, :512>, <1,23, :512>, <2,0, :512>, <3,2, :512> ct: rt: 0
*************************************************
* IFS inode: [ 1,1, :512, 1,4, :512, 1,23, :512, 2,0, :512, 3,2, :512 ]
*************************************************
* Inode Version: 6
* Dir Version: 2
* Inode Revision: 7
* Inode Mirror Count: 5
* Physical Blocks: 0
* LIN: 1:0003:0000
* Last Modified:
* Last Inode Change:
* Create Time:
* Rename Time: 0
* Write Caching: Enabled
* Parent Lin 2
* Parent Hash:
* Manually Manage:
*   Access     False
*   Protection False
* Protection Policy: Diskpool default
* Current Protection: 4x
* Future Protection: 0x
* Disk pools: policy any pool group ID -> data target s200_13tb_400gb-ssd_48gb:6(6), metadata target s200_13tb_400gb-ssd_48gb:6(6)
* SSD Strategy: metadata
* SSD Status: complete
* Layout drive count: 0
* Access pattern: 0
* File Data (118 bytes):
* Metatree Depth: 1
* Dynamic Attributes (37 bytes):
ATTRIBUTE                 OFFSET  SIZE
New file attribute        0       23
Disk pool policy ID       23      5
Last snapshot paint time  28      9
*************************************************
* NEW FILE ATTRIBUTES
* Access attributes: active
* Write Cache: on
* Access Pattern: concurrency
* At_r: 0
* Protection attributes: active
* Protection Policy: Diskpool default
* Disk pools: policy any pool group ID
* SSD Strategy: metadata
*************************************************

Here is what some of these lines mean:

The isi get -D command displays the file system properties of a directory or file. For more information, see the OneFS Command Reference.
The directory's data access pattern is set to concurrency.
Write caching (SmartCache) is turned on.
The SSD strategy is set to metadata, which is the default.
Files that are added to the directory are governed by these settings, most of which can be changed by applying a file pool policy to the directory.
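Before you enable an archive policy based on the stale-data template described in the file pool guidelines above, you can estimate how much capacity such a policy would move by looking for files that have not been accessed in the last 60 days. The path is a placeholder, the 60-day threshold matches the example in the guidelines, and the result is meaningful only if access times are tracked for the dataset.

#!/bin/sh
# Estimate how much capacity a "not accessed in 60 days" archive policy
# would move. /ifs/proj is a placeholder path; this relies on access
# times being tracked for the data in question.
find /ifs/proj -type f -atime +60 -print0 | xargs -0 du -k | \
    awk '{ total += $1 } END { printf "stale data: %.1f GB\n", total / 1048576 }'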
Plan for network performance and throughput

As the number of clients that connect to a cluster increases, balancing the connections across network interfaces becomes more important. This section covers best practices for setting up a cluster's network and balancing client connections.

Networking

A cluster includes two networks: an internal network to exchange data between nodes and an external network to handle client connections. Nodes exchange data through the internal network over InfiniBand. Each node includes redundant InfiniBand ports that enable you to add a second internal network in case the first one fails. Clients reach the cluster with 1 GbE or 10 GbE. Every node includes Ethernet ports.

Ethernet speeds

Clients connect to a node with either a 1 GbE or 10 GbE connection. The practical throughput of a 1 GbE connection stands at about 100 MB/s per interface. The practical throughput of a 10 GbE connection stands at about 1,250 MB/s per interface, but when you use both interfaces at the same time, you are unlikely to see the full combined throughput for a variety of reasons.

Here's an example of how to calculate the maximum practical throughput for a cluster. Consider a node that has two 1 GbE connections. The number and size of each node's connections limit the practical throughput for a cluster of nodes of the same type to about 200 MB/s times the number of nodes. For example, a cluster of 10 nodes using both of the 1 GbE external interfaces delivers a total throughput of about 2000 MB/s.

For EDA workflows, you should establish dual 10 GbE connections to each node to support the high levels of network utilization that take place during the simulation and verification phases. Using 10 GbE connections also helps support the workload of EDA tools like Synopsys and Cadence.

Frame sizes

Jumbo frames, where the maximum transmission unit (MTU) is set to 9000 bytes, yield slightly better throughput performance with slightly less CPU usage than standard frames, where the MTU is set to 1500 bytes. For OneFS 6.x, jumbo frames provide about 5 percent better throughput and about 9 percent less CPU usage for 1 GbE connections. For 10 GbE connections, jumbo frames provide about 8 percent better throughput and about 8 percent less CPU usage.

Internal InfiniBand network redundancy

In a high-performance computing environment, you should use the redundant InfiniBand ports to create a second internal network in case the first network fails. When you build the redundant internal network, be sure to use separate switches.
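The same back-of-the-envelope calculation extends to other node counts and interface speeds. The sketch below uses the per-interface estimates from this section (about 100 MB/s per 1 GbE interface and about 1,250 MB/s per 10 GbE interface) and assumes two external interfaces per node; real-world throughput usually lands below the combined figure.

#!/bin/sh
# Back-of-the-envelope practical throughput estimate for a cluster.
# Usage: sh throughput.sh <node_count> <1g|10g>
NODES=$1
SPEED=$2
IFACES_PER_NODE=2                 # assumes dual external interfaces per node
case "$SPEED" in
    1g)  PER_IFACE_MBS=100 ;;     # ~100 MB/s practical per 1 GbE interface
    10g) PER_IFACE_MBS=1250 ;;    # ~1,250 MB/s practical per 10 GbE interface
    *)   echo "speed must be 1g or 10g"; exit 1 ;;
esac
echo "Estimated upper bound: $(( NODES * IFACES_PER_NODE * PER_IFACE_MBS )) MB/s"

For example, sh throughput.sh 10 1g reports 2000 MB/s, matching the 10-node example above.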
IP address allocation planning

Internal network

As a best practice, you should set up the cluster's internal InfiniBand network for redundancy. Doing so requires two separate subnets, one for the internal-a (int-a) pool and another for the internal-b (int-b) pool. Each internal interface requires a unique IP address. A cluster that might scale to the maximum of 144 nodes requires a larger pool of addresses than a cluster that is not expected to exceed 60 nodes. Once you create an internal subnet, you cannot expand it without taking the entire cluster offline. Thus, before you assign IP addresses to the internal subnet, your plan should include not only your available address ranges but also the number of nodes that you expect the cluster to contain.

EMC recommends that you isolate the internal subnets' IP address ranges from those of the external network by assigning nonoverlapping private addresses. By isolating the internal IP ranges, you can avoid having routing issues in the cluster and conflicting IP addresses in your enterprise network.

External network

Both SmartConnect Basic and SmartConnect Advanced static network pools are limited to one IP address per interface. Overprovisioning IP addresses to either type of pool wastes addresses.

EMC Isilon SmartConnect

By default, the OneFS SmartConnect module balances connections among nodes by using a round-robin policy with static IP addresses and one IP address pool for each subnet. A SmartConnect license adds advanced balancing policies to evenly distribute CPU usage, client connections, or throughput. The licensed mode also lets you define IP address pools to support multiple DNS zones in a subnet. The licensed version supports IP failover, also known as NFS failover, to provide continuous access to data when hardware or a network path fails. Because of the demands that EDA places on CPU usage and throughput, a best practice is to use the licensed version of SmartConnect Advanced. The rest of this section assumes that you have the licensed version in place.

SmartConnect pools provide seamless failover for NFS clients. Client connections that use other application-level protocols, such as SMB, do not support the failover mechanism that SmartConnect provides. As a result, SmartConnect Advanced with dynamic pools should be used only for NFS workloads. Static pools are recommended for SMB and all other protocols.

SmartConnect requires that you add a new name server (NS) record as a delegated domain to the authoritative DNS zone that contains the cluster.

You can improve the performance of both your cluster and your clients by evenly distributing the client connections among the nodes. SmartConnect can remove nodes that have gone offline from the request queue and prevent new clients from mounting a failed node. In addition, you can set SmartConnect to add new nodes to a
23 connection-balancing pool. For example, all the nodes in a cluster can be configured to appear as a single network host (that is, a single fully qualified domain name, or FQDN, can represent the entire cluster), or one node can appear as several discrete hosts (that is, a cluster can have more unique FQDNs than nodes). Connection-balancing strategies SmartConnect offers several connection-balancing strategies: Zoning: You can optimize storage performance by designating zones to support workloads or clients. For large clusters, you can use zoning to partition the cluster s networking resources and allocate bandwidth to each workload, which minimizes the likelihood that heavy traffic from one workload will affect network throughput for another. Inclusion and exclusion: You can add node interfaces and interface types to SmartConnect pools. For example, in an inclusive approach, a pool includes all the 1 GbE interfaces in the cluster, regardless of node type. An exclusive strategy, by contrast, limits membership to only one interface type from one node type for example, only 10 GbE interfaces on S-Series nodes. As a result, clients connecting to a pool receive similar performance from any node in the pool. High availability with dynamic NFS failover If a node or one of its interfaces fails, SmartConnect reassigns the dynamic IP addresses to the remaining interfaces in the pool. OneFS then routes connections to the newly assigned member interface without interrupting access. When the failed interface or the failed node comes back online, OneFS updates the list of online pool members and redistributes the dynamic IP addresses across the list without interrupting NFS traffic. A best practice is to use this failback mechanism to prevent interrupting NFS client connections. Connection-balancing and failover policies The licensed version of SmartConnect provides four policies for distributing traffic across the nodes in a network pool. Because you can set policies for each pool, static and dynamic SmartConnect address pools can coexist. The four policies are as follows: Round robin: This policy, which is the default, rotates connections among IP addresses in a list. Round robin is the best balancing option for most workflows. CPU usage: This policy examines average CPU usage in each node and then attempts to distribute the connections to balance the workload evenly across all the nodes. Network throughput: This policy evaluates the overall average network throughput per node and assigns new connections to the nodes with the lowest relative throughput in the pool. Connection count: This policy uses the number of open TCP connections on each node in the pool to determine which node a client will connect to. Table 4 summarizes the usage scenarios and connection-balancing methods for each scenario. Because every environment has unique performance requirements, the table offers suggestions, not rules. 23
For an EDA workload, round robin is likely to work best. If you find that round robin does not suit your workflow, you can experiment with the other options to test whether one of them improves performance.

Table 4: Example usage scenarios and recommended balancing options
Usage scenarios (columns): usage patterns unknown; heavy activity on a few clients; large number of persistent NFS and SMB connections; large number of transitory connections (HTTP, FTP); NFS automount or UNC paths are used
Round robin: X X X X X
CPU usage: X X
Network throughput: X X
Connection count: X X

Changing the connection-balancing policy for a SmartConnect network pool after it has been provisioned affects only new client connections; the policy does not rebalance established connections.

Service subnets and DNS integration

When you create a SmartConnect pool, you should set up a service subnet for DNS forwarding. A service subnet is the name of the external network subnet whose SmartConnect service responds to DNS requests for the IP address pool. For each pool, only one service subnet can answer DNS requests. As a best practice, you should designate a service subnet for each SmartConnect Advanced pool that clients use to mount NFS data stores. You should also create a delegation for the subnet's service IP address on the DNS servers that provide name resolution services to the clients. If you leave the service subnet option blank, OneFS ignores incoming DNS queries for the pool's zone name and restricts access to only the IP addresses that you enter manually. In addition, you cannot apply the SmartConnect load-balancing policies to a pool without setting a service subnet.

SmartConnect and failover IP optimization

While a single interface on a cluster can simultaneously belong to multiple pools and respond to multiple IP addresses, no single IP address can be bound to more than one interface at a time. A SmartConnect dynamic address pool with only one address per interface introduces a performance vulnerability if a failure occurs. If all the workloads on a node interface are bound to a single IP address when the path fails, SmartConnect moves the IP address and all its connections to another interface in the pool. Moving the IP address can degrade performance if the failover node is near its throughput capacity.
There are two preferred approaches to minimizing the performance impact from a SmartConnect failover event. One requires a large number of IP addresses relative to the number of nodes in a pool, and the other requires that one or more node interfaces in each pool remain idle to anticipate a hardware or path failure on another interface in the same pool.

Use multiple IP addresses per interface

This approach consists of binding multiple IP addresses to each node interface in a SmartConnect pool. The ideal number of IP addresses per interface depends on the size of the pool. The following recommendations apply to dynamic pools only. Because static pools include no failover capabilities, a static pool requires only one IP address per interface.

Generally, you can achieve optimal balancing and failover when the number of IP addresses allocated to the pool equals N * (N - 1), where N equals the number of node interfaces in the pool. If, for example, a pool contains five node interfaces, the optimal IP address allocation for the pool is 5 * (5 - 1) = 20 IP addresses. The equation allocates four IP addresses to each of the five node interfaces in the pool.

For a SmartConnect pool with four node interfaces in which the N * (N - 1) model is followed and results in three unique IP addresses being allocated to each node, a failure on one node interface results in each of that interface's three IP addresses failing over to a different node in the pool, ensuring that each of the three active interfaces remaining in the pool receives one IP address from the failed node interface. If client connections to that node were evenly balanced across its three IP addresses, SmartConnect distributes the workloads to the remaining pool members evenly.

Assigning each workload a unique IP address allows SmartConnect to move a workload to another interface, minimizing the additional workload that a remaining node in the pool must absorb and ensuring that SmartConnect evenly distributes the workload across the surviving nodes in the pool.

If you cannot use multiple IP addresses per interface (because, for example, you hold a limited number of IP addresses or because you must ensure that an interface failure within the pool does not affect performance for other workloads using the same pool), you should consider using a hot standby interface per pool.

Use one hot standby interface per pool

An alternate solution is to configure each SmartConnect dynamic pool with (N - 1) IP addresses, where N is again the number of nodes in the pool. With this approach, a node failure leads SmartConnect to fail over the node's workload to the unused interface in the pool, minimizing the performance impact on the rest of the cluster. Although this approach protects workloads when a failure occurs, it requires leaving one or more node interfaces unused.
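The arithmetic behind both approaches is easy to script when you are planning subnets. The sketch below simply applies the N * (N - 1) rule and the (N - 1) hot-standby rule to a given pool size; it does not query the cluster.

#!/bin/sh
# Compute SmartConnect dynamic pool IP allocations for a given number of
# node interfaces, using the two approaches described above.
# Usage: sh smartconnect_ips.sh <node_interfaces_in_pool>
N=$1
echo "Multiple IPs per interface: $(( N * (N - 1) )) addresses ($(( N - 1 )) per interface)"
echo "Hot standby approach:       $(( N - 1 )) addresses"

For example, sh smartconnect_ips.sh 5 reports 20 addresses (4 per interface) for the first approach, matching the five-interface example above.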
26 SmartConnect best practices Although SmartConnect balances traffic evenly across the nodes in a pool, you should develop a plan to distribute resources optimally in the context of your workflow. This section lists best practices for optimizing SmartConnect. Size a SmartConnect pool SmartConnect can balance loads optimally only if you tailor a pool to handle its target workloads. A best practice is to match the number of interfaces in the pool with the number of clients. To evenly distribute connections and optimize performance, EMC Isilon recommends sizing SmartConnect for the expected number of connections and for the expected overall throughput. The sizing factors for a pool include the total number of client connections expected to use the pool s bandwidth, the expected aggregate throughput that the pool needs to deliver, and the minimum performance and throughput requirements in case an interface fails. Optimize throughput and performance As a distributed file system, OneFS provides a client with access to all the metadata and files that are stored on the cluster, regardless of the type of node a client connects to or the node pool on which the data resides. For example, data stored for performance reasons on a pool of S-Series nodes can be mounted and accessed by connecting to an NL-Series node in the same cluster. The different types of Isilon nodes, however, deliver different performance. To avoid unnecessary network latency under most circumstances, EMC Isilon recommends configuring SmartConnect pools to mount workload client connections on the same nodes on which the workload s data resides. In other words, if a workload s data resides on a pool of S-Series nodes for performance reasons, the clients that work with the data should mount the cluster through a pool that includes the same S-Series nodes that host the data. The following guidelines can help you deploy SmartConnect to balance network connections and optimize network performance: The number of active clients per node affects overall performance more than their connection speed (1 GbE or 10 GbE). For EDA workflows, however, 10 GbE can help eliminate network bottlenecks. In the event that a connection failure occurs between the cluster and one of its nodes, connecting NFS clients through dynamic pools yields both the highest available throughput and the highest level of availability. SMB connections should use SmartConnect static pools to mount the cluster because the SMB protocol does not support session failover to another interface without reauthenticating the connection before a new session can be established. Overall network performance on a cluster is at its highest when the network connection is made to the same group of nodes that host the underlying storage data. For example, if home directories are stored on an X-Series node disk pool, mounting users to their home directories through the X-Series nodes network interfaces yields the best performance. 26
27 Idle client connections have a negligible overall effect on cluster resources. When determining how best to configure an Isilon storage cluster s pools, the determining factor should be the number of active user connections, rather than the number of total connections. File systems and data access protocols Structure of the file system OneFS presents all the nodes in a cluster as a global namespace that is, as the default file share, ifs. In the file system, directories are inode number links. An inode contains file metadata and an inode number, which identifies a file's location. OneFS dynamically allocates inodes, and there is no limit to the number of inodes. To distribute data among nodes, OneFS sends messages with a globally routable block address through the cluster's internal network. The block address identifies the node and the drive storing the block of data. Data layout OneFS evenly distributes data among a cluster's nodes with layout algorithms that maximize storage efficiency and performance. The system continuously reallocates data to conserve space. OneFS breaks data down into smaller sections called blocks, and then the system places the blocks in a stripe unit. By referencing either file data or erasure codes, a stripe unit helps safeguard a file from a hardware failure. The size of a stripe unit depends on the file size, the number of nodes, and the protection setting. After OneFS divides the data into stripe units, OneFS allocates, or stripes, the stripe units across nodes in the cluster. The width device list shows the width of a file's stripe. You can view the list by running the isi get command with the D option. When a client connects to a node, the client's read and write operations take place on multiple nodes. For example, when a client connects to a node and requests a file, the node retrieves the data from multiple nodes and rebuilds the file. You can optimize how OneFS lays out data to match your dominant access pattern concurrent, streaming, or random. By default, OneFS optimizes striping for concurrent access. If your dominant access pattern is streaming that is, lower concurrency, higher single-stream workloads, such as with video you can change how OneFS lays out data to increase sequential read performance. To better handle streaming access, OneFS stripes data across more drives. Streaming is most effective on clusters or subpools serving large files. For most EDA workflows, the default concurrent access pattern will often work best. Caching OneFS uses a combination of two caches to serve data requests: a locally coherent cache and a globally coherent cache. If a dedicated build job connects to only one node, you can set OneFS to retrieve file data only from the node s cache, which is the locally coherent cache, also known as the L1 cache. When a dedicated build job 27
connects to only a single node, serving file data from its local cache can speed up data requests slightly because it is more efficient to request data locally. The downside to this approach, however, is that if the dedicated build job connects to other nodes, using only the local cache can increase latency. You can set a node to use only its local cache by connecting to the node as root and changing the following system control to 1 in the node's system control override file at /etc/local/sysctl.conf:

efs.bam.drop_behind_act_diskless

You should remember to document your system controls and to back up the override files before you upgrade a cluster.

Request block sizes and random read and write performance
In general, larger block sizes perform better. Aligning blocks to multiples of 128k typically yields the best throughput. Read and write operations should be performed by using block sizes in 128k increments when possible. Read operations with block sizes of 512k or 640k obtain better throughput than with block sizes of 576k. Write operations have a similar performance characteristic: multiples of 128k provide better performance, but the difference is less pronounced compared with read operations. With both read and write operations, the larger the block size, the better the performance.

NFS
EDA applications, such as Synopsys and Cadence Design Systems, typically run on Linux systems that use the Network File System version 3 (NFSv3). Linux clients mount the Isilon OneFS file system to access stored data with remote procedure calls. Because OneFS presents the file system as a single namespace, there is no need to spread the directories that the clients mount across multiple volumes. By default, OneFS optimizes the distribution of data in its file system for rapid, concurrent access.

NFS performance settings
OneFS includes performance settings for NFS. You can adjust the values for commit asynchronously, directory read transfer, read transfer max, read transfer multiple, readdirplus prefetch, write transfer max, write transfer multiple, and other settings. Table 5 discusses several settings that might help streamline an EDA workflow. The values that you choose depend on your workflow, your working data set, and other factors. The default values are for OneFS 7.0 or later. For the defaults for OneFS 6.x, see the OneFS Administration Guide for your version of OneFS.

Table 5: OneFS NFS performance optimization options

Block size: The block size reported to NFSv2 and later clients. The default value is

Commit asynchronously: This sets NFSv3 and NFSv4 COMMIT operations to be asynchronous. Under most conditions, it is important to use asynchronous commits for an EDA workflow because the setting can prevent data loss if a node fails. In contrast, using synchronous commits increases risk because with OneFS 6.x, synchronous commits are not rewritten from NVRAM if a node fails during a write operation. With OneFS 7.0 or later, however, synchronous commits can be recovered from NVRAM.

Directory read transfer: The preferred directory read transfer size reported to NFSv3 and NFSv4 clients. The default value is

Read transfer max: The maximum read transfer size reported to NFSv3 and NFSv4 clients. The default value is The number of nodes in a cluster might affect the value that you choose for the read transfer max.

Read transfer multiple: The recommended read transfer size multiple reported to NFSv3 and NFSv4 clients. The default value is 512. The number of nodes in the cluster might affect the value that you select.

Read transfer preferred: The preferred read transfer size reported to NFSv3 and NFSv4 clients. The default value is

Readdirplus prefetch: The number of file nodes to be prefetched on readdir. The default value is 10. For an environment with a high file count, you can test setting the readdirplus prefetch to a value higher than the default. For a low file count environment, you can experiment with setting it lower than the default. In a workload that runs concurrent jobs, you should consider testing your changes until you find the value that works best for you.

Setattr asynchronous: If set to yes, it performs set attribute operations asynchronously. The default value is No.

Write datasync action: The action to perform for DATASYNC writes. The default value is DATASYNC.

Write datasync reply: The reply to send for DATASYNC writes. The default value is DATASYNC.

Write filesync action: The action to perform for FILESYNC writes. The default value is FILESYNC.

Write filesync reply: The reply to send for FILESYNC writes. The default value is FILESYNC.

Write transfer max: The maximum write transfer size reported to NFSv3 and NFSv4 clients. The default value is

Write transfer multiple: The recommended write transfer size reported to NFSv3 and NFSv4 clients. The default value is 512.

Write transfer preferred: The preferred write transfer size reported to NFSv3 and NFSv4 clients. The default value is

Write unstable action: The action to perform for UNSTABLE writes. The default value is UNSTABLE.

Write unstable reply: The reply to send for UNSTABLE writes. The default value is UNSTABLE.

For more information and instructions about how to change the settings, see the OneFS Administration Guide.

Use NFS version 3 or 4
EMC Isilon recommends avoiding NFS version 2 for mounting because it uses UDP and does not support files larger than 2 GB. Instead, you should use NFS version 3 or 4, but keep in mind that NFS version 4 does not support the failover that NFS version 3 provides because NFS version 3 is stateless and version 4 is stateful.

NFS over TCP, not UDP
On your NFS clients, you should use NFS over Transmission Control Protocol (TCP), not User Datagram Protocol (UDP). With UDP, a mismatch between the network
speeds of a client and a node can lead the protocol to drop packets and retransmit data, degrading performance. For better performance and resilience, use TCP instead. If security is a concern, add the Kerberos protocol.

Sync and async options
There are sync and async options for NFS exports and NFS client mounts:

Use sync exports from the cluster. With async exports, the cluster immediately replies to write requests, which restricts the cluster from signaling a write failure. In addition, NFS versions 3 and 4 enable safe asynchronous writes within the protocol. As a result, async exports are considered unsafe and unnecessary.

Use async mounts from the client. Using sync as a client mount option makes all write operations synchronous, usually resulting in poor write performance. Sync mounts should be used only when a client program relies on synchronous writes without specifying them.

NFS rsize and wsize
When you mount NFS mount points from a cluster, larger read and write sizes for remote procedure calls improve throughput. An Isilon cluster broadcasts a read size, or rsize, of 128 KB and a write size, or wsize, of 512 KB. By default, an NFS client uses the largest supported size, so you should not set the size on your clients. Setting the values too small on a client overrides the default and undermines performance.

Hard mounts
For most clients, EMC Isilon recommends using the hard and intr mount options. The hard option retries an I/O operation when NFS encounters an access error; the intr option allows a signal to interrupt the retry loop. Soft mounts, in contrast, can create timeouts of 60 seconds or longer and cause I/O errors.

Multiple clients writing to the same files
If multiple clients write to the same files, you should probably use file locking through the Network Lock Manager (NLM) protocol and disable attribute caching by the client. To disable attribute caching, use the actimeo=0 mount option. You should avoid using the noac mount option unless your workflow requires it. This option turns off all attribute caching, which increases getattr calls. It also turns on sync mode for the mount and makes all writes synchronous, which can reduce write performance. The lock mount option ensures that a client can obtain exclusive access to a file. Because OneFS uses advisory locks, as opposed to mandatory locks, you must set all clients to use locking for consistency. If multiple clients are not accessing the same files, it should be safe to enable attribute caching and use local locks or not use locks.

Enable readdirplus on clients
If your client supports the readdirplus option, you can use it to enable the use of the readdirplus call, which can improve performance, especially for Mac OS X clients. For more information about readdirplus, see Directory listing is slow with a large amount of files using Linux clients (article number emc in the EMC knowledge base).
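To pull the preceding mount recommendations together, the following /etc/fstab sketch shows how a Linux build client might mount the cluster through a SmartConnect zone name. The zone name and paths are hypothetical, and rsize and wsize are deliberately omitted so that the client negotiates the sizes the cluster advertises.

# Illustrative /etc/fstab entries for a Linux client (zone name and paths are examples only).
# General-purpose mount: NFSv3 over TCP with a hard, interruptible mount.
eda.isilon.example.com:/ifs/projects  /mnt/projects  nfs  vers=3,proto=tcp,hard,intr  0 0
# Shared-write area: add actimeo=0 to disable attribute caching and rely on NLM locking
# when several clients write to the same files.
eda.isilon.example.com:/ifs/scratch   /mnt/scratch   nfs  vers=3,proto=tcp,hard,intr,actimeo=0  0 0

The equivalent command-line form of the first entry is mount -t nfs -o vers=3,proto=tcp,hard,intr eda.isilon.example.com:/ifs/projects /mnt/projects.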
32 Recommended client mount settings and system controls The mount settings and system controls in knowledge base article emc on the EMC support website can help optimize different clients to connect to an Isilon cluster with NFS over TCP. The recommendations are general, however, and are not tailored to an EDA workflow. The knowledge base article covers the following settings for Linux, FreeBSD, Mac OS X, and Solaris: Rsize Wsize Network buffer for 1 GbE connections Network buffer and mount settings for 10 GbE connections Window scaling Task request slots for remote procedure calls (RPCs) Bandwidth delay product limiting NFS server threads EMC Isilon recommends that for NFS server threads, you set threads_min and threads_max to the same value. Increasing the number of threads can improve performance at the expense of stability. Before you change the number of threads, contact EMC Isilon Technical Support to determine the values that work best for your cluster and workflow. The values vary by CPUs, memory, the number of nodes, and other factors. After you determine the number of threads for your cluster by consulting with Isilon Technical Support, you can adjust the number of nfsd threads by running the following commands, replacing x with an integer: isi_sysctl_cluster vfs.nfsrv.rpc.threads_min=x isi_sysctl_cluster vfs.nfsrv.rpc.threads_max=x If you change a default sysctl, set it in the override file to preserve it when you upgrade a node or a cluster. To permanently override a sysctl, either set it in /etc/mcp/override/sysctl.conf for the cluster or in /etc/local/sysctl.conf for a node. Remember to document your sysctls and back up the override files. SMB Change notify The Microsoft Windows file system API monitors directories and subdirectories for changes. Receiving change notifications undermines performance. As a result, you should consider changing the OneFS SMB performance parameter for change notify to norecurse if the SMB clients in your workflow require no change notifications. For instructions, see "Change notification fails if buffer size is limited to 64 KB" in the EMC knowledge base and the SMB performance settings section of the OneFS Administration Guide. For more information on change notification, see MSDN. SMB signing SMB signing digitally signs packets that are sent between a client and a server over the SMB protocol. SMB signing, which is also known as security signatures, helps verify the packets' origination and authenticity to protect against tampering and attacks. 32
33 Although security signatures improve the security of the SMB protocol, they decrease the performance of file transactions. On OneFS, turning on security signatures disables zero-copy buffer transfers for read and write operations. To verify or calculate a signature on an SMB1 or SMB2 request or response, OneFS must read the entire packet buffer into memory. If a client sends or expects security signatures, the signatures in effect disable the fast path for read and write operations. In contrast, a read or write request without signatures sends the data buffer from the network socket directly to the file system in the kernel, decreasing CPU usage and response latency. OneFS includes two settings for security signatures: Enable security signatures Require security signatures For optimal performance, make sure both settings are turned off, which is the default, by running the following command as root: isi smb settings global view The output looks like this: Access Based Share Enum: No Audit Fileshare: none Audit Global SACL Failure: Audit Global SACL Success: Audit Logon: all Dot Snap Accessible Child: Yes Dot Snap Accessible Root: Yes Dot Snap Visible Child: No Dot Snap Visible Root: Yes Enable Security Signatures: No Guest User: nobody Ignore Eas: No Onefs Cpu Multiplier: 1 Onefs Num Workers: 0 Require Security Signatures: No Server String: Isilon Server Srv Cpu Multiplier: 4 Srv Num Workers: 0 Support NetBIOS: No Support Smb2: Yes For instructions on how to change the settings, see the OneFS Command Reference. For more information on SMB signing, see the Microsoft website. Limits for pools, directories, files, and names Pool capacity To maintain top performance, a best practice is to fill a cluster's capacity only up to 80 percent per pool, especially for an EDA workflow. Although you can continue 33
34 adding data beyond 80 percent of each pool's capacity for example, going to 100 percent on the S-Series pool and spilling over to the NL-Series pool doing so is not recommended for an EDA workflow that places a premium on performance. Maximum number of files in a directory In a workflow with a large number of files in a single directory as well as deep directory structures, as is often the case with EDA, the general best practices are as follows, assuming that the workload enumerates directories rather than looking up files with a fully qualified path and file name: With SSDs, limit the number of files in a directory to 100,000. With HDDs, limit the number of files in a directory to 20,000. Although you can exceed these limits, staying within them helps ensure that OneFS enumerates directories adequately for both users and applications. Directory structures In general, it is more efficient to create a directory structure that consolidates files in a single directory than it is to spread files out over many subdirectories because OneFS can more efficiently add a B-tree block in a directory than add an inode for a subdirectory. Maximum depth of a directory Many EDA workflows contain deep directory structures. An environment with small files and large file systems often results in a large LIN tree. As a result, an inode lookup for a file can include as many as four or five metatree lookups which can affect performance by adding latency to the workload. To minimize this impact, however, OneFS includes an inode cache, an MDS cache, and a metablock prefetching mechanism. The effectiveness of the caching system depends on many factors, making it difficult to recommend a maximum depth for a directory. You should probably limit the number of subdirectories to 10,000. Exceeding 10,000 subdirectories might affect the performance of command-line tools and other applications. You should certainly abide by the limit that the string length places on the path of nested directories. Maximum path length of names for nested directories Both the command-line interface and the web administration interface place a limit of 255 on the total number of concatenated characters in a directory name combined with the names of all its subdirectories. In other words, if you name a directory and all its subdirectories with long names, you will quickly exceed the 255 character limit. This character limit places a de facto limit on the depth of nested directories: If you name each nested directory with a single character name, the maximum number of nested directories, including the top level directory, is
35 Shorter path and file names translate into fewer NFS lookup operations. As a best practice, keep your path names as short as possible, especially in EDA workflows that include a large number of NFS lookup operations. Data protection An Isilon cluster is designed to serve data even when components fail. By default, OneFS protects data with erasure codes, enabling you to retrieve files when a node or disk fails. As an alternative to erasure codes, you can protect data with two to eight mirrors. When you create a cluster with five or more nodes, erasure codes deliver as much as 80 percent efficiency. On larger clusters, erasure codes provide as much as four levels of redundancy. Erasure codes are also known as forward error correction, or FEC. OneFS applies data protection at the level of the file, not the block. You can, however, set different protection levels on directories, files, file pools, subpools, and the cluster. Although a file inherits the protection level of its parent directory by default, you can change the protection level at any time. OneFS protects metadata and inodes with mirroring at the same protection level as their data. A system job called EMC Isilon FlexProtect detects and repairs degraded files. Note that you should not stop or cancel a FlexProtect or FlexProtectLin job unless instructed to do so by Isilon Technical Support. N+M data protection OneFS supports N+M erasure code levels of N+1, N+2, N+3, and N+4. In the N+M data model, N represents the number of nodes, and M represents the number of simultaneous failures of nodes or drives that the cluster can handle without losing data. For example, with N+2, the cluster can lose two drives on different nodes or lose two nodes. To protect drives and nodes separately, OneFS also supports N+M:B. In the N+M:B notation, M is the number of disk failures, and B is the number of node failures. For example, with N +3:1 protection, the cluster can lose three drives or one node without losing data. The default protection level for a cluster is +2:1. The quorum rule dictates the number of nodes required to support a protection level. For example, N+3 requires at least seven nodes so that you can maintain a quorum if three nodes fail. You can, however, set a protection level that is higher than the cluster can support. In a four-node cluster, for example, you can set the protection level at 5x. OneFS protects the data at 4x until a fifth node is added, after which OneFS automatically reprotects the data at 5x. For more information about data protection, see the discussion of protection levels and the space they consume in the OneFS Administration Guide. 35
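As a quick way to see these settings on your own data, the isi get command mentioned earlier in this paper reports the protection applied to a directory or file; the paths in the sketch below are hypothetical.

# Show the requested protection and layout settings for a directory (path is an example).
isi get /ifs/projects/chipA
# Add -D for detailed output, including the width device list described in the data layout section.
isi get -D /ifs/projects/chipA/netlist.v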
36 Data protection best practices In general, a best practice for most environments entails using the default level of protection typically, N+2:1. For larger clusters, you can set the level slightly higher, for example, to +2. For most EDA workflows in which optimizing for an aggregation of workloads takes precedence over optimizing for a single job you should consider using +2:1 for clusters with fewer than 18 nodes. For clusters with more than 18 nodes, consider using +2. An EDA workflow of many small files, however, might benefit from using mirroring instead of erasure codes, because the overhead for erasure codes becomes similar to what is used for mirroring small files. In addition, mirroring can help boost performance, but the gain in performance may come at the expense of storage capacity. In general, lower protection levels yield better throughput, but the amount varies by type of operation. Random write performance receives the biggest performance gain from using a lower protection level. Streaming writes, streaming reads, and random reads receive a smaller performance boost from lower protection levels. The protection level that you choose depends on a variety factors, including the number of nodes in the cluster. The right protection level for you might also depend on which version of OneFS you are running. Because so many variables interact to determine the optimal protection level for an environment, a best practice is to consult with an Isilon representative about selecting a protection level. Isilon professional services can analyze the cluster to identify its mean time to data loss and then suggest an optimal protection policy. Data mirroring You can protect on-disk data with mirroring, which copies data to multiple locations. OneFS supports two to eight mirrors. You can use mirroring instead of erasure codes, or you can combine erasure codes with mirroring. Mirroring, however, consumes more space than erasure codes. Mirroring data three times, for example, duplicates the data three times, which requires more space than erasure codes. As a result, mirroring suits transactions that require high performance but consume little space. You can also mix erasure codes with mirroring. During a write operation, OneFS divides data into redundant protection groups. For files protected by erasure codes, a protection group consists of data blocks and their erasure codes. For mirrored files, a protection group contains all the mirrors of a set of file blocks. OneFS can switch the type of protection group as it writes a file to disk. By changing the protection group dynamically, OneFS can continue writing data despite a node failure that prevents the cluster from applying erasure codes. After the node is restored, OneFS automatically converts the mirrored protection groups to erasure codes. Balance data protection with storage utilization You can set protection levels to balance protection requirements with storage space utilization. Higher protection levels typically consume more space than lower levels because you lose an amount of disk space to storing erasure codes. The overhead for 36
37 the erasure codes depends on the protection level, the file size, and the number of nodes in the cluster. Because OneFS stripes both data and erasure codes across nodes, the overhead declines as you add nodes. Virtual hot spare When a drive fails, OneFS uses space reserved in a subpool instead of a hot spare drive. The reserved space is known as a virtual hot spare. In contrast to a spare drive, a virtual hot spare automatically resolves drive failures and continues writing data. If a drive fails, OneFS migrates data to the virtual hot spare to reprotect it. You can reserve as many as four disk drives as virtual hot spares. A best practice is to use one or more virtual hot spares; EMC Isilon recommends that you leave a virtual hot spare enabled. EMC Isilon SnapshotIQ EMC Isilon SnapshotIQ protects data with a snapshot a logical copy of data stored on a cluster. Snapshots take less time and consume less space than backing up data on another storage device. When the data in a snapshot is modified, the snapshot stores a physical copy of the original data and references the copy. Snapshots require less space than a remote backup because a snapshot references, rather than recreates, unaltered data. If several snapshots contain the same data, only one copy of the unmodified data is made. The size of a snapshot reflects the amount of disk space consumed by physical copies of the data stored in that snapshot. There is no available space requirement for creating a snapshot. EMC Isilon recommends that a cluster contain no more than 2,048 snapshots. With OneFS, you can schedule snapshots and assign expiration dates to delete them. It is often advantageous to create more than one snapshot per directory and to assign short duration periods for frequent snapshots. Meanwhile, you can assign longer duration periods for snapshots that you take less frequently. OneFS snapshots, which usually take less than one second to create, are highly scalable. They create little performance overhead, regardless of the level of file system activity, the size of the file system, or the size of the directory being copied. When OneFS updates a snapshot, for efficiency OneFS stores only the blocks of a file that changed. Users can access snapshots through a /.snapshot hidden directory under each file system directory. The granular intervals of a snapshot improve recovery point objective (RPO) timeframes. SnapshotIQ can take read-only, point-in-time copies of any directory or subdirectory within OneFS and provides the following benefits: Snapshots are created at the directory level instead of the volume level, which improves granularity. There is no requirement to reserve space for snapshots. Snapshots can use as much or as little of the file system space as you want. 37
Integration with Windows Volume Shadow Copy Service (VSS) lets end users on Windows clients running Windows XP or later restore files from the Previous Versions tab of the file or directory, which enables them to recover their own data. Using SmartPools, snapshots can physically reside on a different disk tier than the original data, which can help keep storage costs low. Up to 1,024 snapshots can be created per directory.

Snapshot best practices
The best practices for snapshots depend on factors that vary by environment. A key question, however, entails whether to take snapshots on a per-project basis or an all-project basis. A per-project approach, for instance, might take a separate snapshot of each directory in which a project resides. An all-project approach, in contrast, might take one snapshot at the high-level directory that contains all the project directories. The following general best practices can help guide the approach that you decide to take, but keep in mind that the best option depends on your own projects, data-protection objectives, computing environment, and risk tolerance. Before you decide on an approach, you should conduct a cost-benefit analysis to evaluate the approach that will work best for you. Your cost-benefit analysis should factor in risk. Following are best practices for snapshots:

Minimize the number of snapshot policies to reduce the number of snapshots that you need to manage as well as the number of snapshot delete jobs. Take a high-level approach, if possible, to avoid creating a new policy for every new project.

Favor a small number of large snapshot delete jobs. Such a job is more efficient, even if the snapshots contain a larger collective amount of data to be removed, because it leverages the concurrency of the cluster. Many smaller snapshot delete jobs, by contrast, can be slower.

Spread out your snapshot schedules so that the snapshots do not take place concurrently.

Avoid out-of-order snapshot deletions because processing them requires copy-on-write operations, which are unnecessary disk operations.

EMC Isilon SyncIQ
EMC Isilon SyncIQ replicates data on another Isilon cluster and automates failover and failback operations between clusters. If a cluster becomes unusable, you can fail over to another Isilon cluster. SyncIQ can copy data for disaster recovery, business continuance, disk-to-disk backup, and remote archiving. SyncIQ can also use the same cluster as a target to create local replicas efficiently through the cluster's internal InfiniBand network. In addition, you can set up SyncIQ replication in a hub-and-spoke topology, in which a single source replicates to multiple targets, or a cascading topology, in which each cluster replicates to the next target in a chain. SyncIQ uses snapshots to asynchronously replicate directories and files on another cluster. You should automate replication by creating SyncIQ replication policies. After the first replication, SyncIQ optimizes data transfers over the network by replicating only the blocks that have changed.
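As a quick starting point for working with replication from the command line, the following sketch lists the configured SyncIQ policies and checks recent replication activity in the SyncIQ log described later in this section. It assumes that the policy list subcommand shown later in this paper can also be run without arguments to enumerate all policies; the policy names and output on your cluster will differ.

# List the SyncIQ policies configured on the cluster (assumes the list
# subcommand can be run without a specific --policy argument).
isi sync policy list
# Review recent replication activity on a node.
tail /var/log/isi_migrate.log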
39 If your storage capacity permits it, a best practice entails taking a snapshot on the target SyncIQ cluster after each replication to keep multiple versions of each replication sequence in case you need to revert to one of the versions. The granularity of the replication dataset depends on how you store your data. For example, NFS data stores let you replicate data by directory, which can help you restore a dataset that was segmented in a directory. EMC Isilon recommends that for reliability, you frequently test the restoration of the replicated datasets. Business continuance For EDA, business continuance is important. The architecture of SynclQ maximizes the use of network bandwidth to deliver fast replication times for tight RPOs. You can also use SynclQ with Isilon SnapshotIQ software to store as many point-in-time snapshots of your data as you need to support secondary activities, such as backing up data to tape. The following performance tuning guidelines for SyncIQ help support business continuance. Avoid full dataset replications Certain configuration changes cause a replication job to run a full baseline replication; that is, the job copies all the data in the source path regardless of whether the data has changed since the last replication. A full baseline replication typically takes much longer than incremental synchronizations; thus, to optimize performance, avoid triggering full synchronizations unnecessarily by changing the file selection criteria on the source dataset. Changes to the following parameters trigger a full synchronization: Source path: root path and the include and exclude paths Source file selection criteria: type, time, and regular expressions Select the right source replication dataset By default, OneFS synchronizes to the target cluster all the files and directories under the root directory that you select. With SynclQ policies, you can control dataset replication by selecting the directories to include or exclude or by creating file-filtering regular expressions. If you explicitly include directories in the policy, OneFS synchronizes only the files in the included directory. Specifying file criteria, however, slows down a copy or synchronize job. Using includes or excludes for directory paths does not affect performance. As a result, a best practice is to use include or exclude directory paths instead of file criteria. Performance tuning guidelines SynclQ uses a job engine to take advantage of aggregate CPU and networking resources. The engine divides a job into work items and assigns them to processes, or workers, that run on all the nodes in a cluster. Each process scans a part of the dataset for changes and transfers the changes to the target cluster. You can adjust the number of workers per node. Although OneFS manages the cluster s resources to maximize the performance of replication jobs, you can apply SyncIQ rules to control the performance of file 39
40 operations and the network. The rules can help protect the performance of other workflows. For more information, see the OneFS Administration Guide. Although no overarching formula dictates changes that can enhance performance, the following guidelines establish a methodology to tune SyncIQ jobs: Establish reference data for network performance by copying data from one cluster to another with common tools such as Secure Copy (scp) or NFS copy. The data establishes a baseline for a single-threaded data transfer over your network. After creating a policy but before running it for the first time, use the OneFS policy assessment option to see how long it takes to scan the source cluster s dataset with the default settings. Increase the workers per node in cases where network utilization is low, such as over a WAN. Increasing the number of workers can help overcome network latency by having more workers generate I/O on the wire. If adding more workers per node does not improve network utilization, avoid adding more workers because of diminishing returns and worker scheduling overhead. Increase the workers per node in datasets with many small files to process more files in parallel. Adding more workers, however, consumes more CPUs because of other cluster operations. Use file rate throttling to roughly control how much CPU and disk I/O SynclQ consumes while jobs are running throughout the day. Use SmartConnect IP address pools to control which nodes participate in a replication job and to avoid contention with other workflows. Use network throttling to control how much network bandwidth SynclQ can consume throughout the day. Use target-aware synchronization prudently. A target-aware synchronization consumes many more CPUs than a regular baseline replication, but a target-aware synchronization potentially yields much less network traffic when the source and target datasets contain similar data. Limitations and restrictions By default, a SynclQ source cluster can run up to five jobs at a time. OneFS queues additional jobs until a job execution slot becomes available. You can cancel jobs that are in the queue. Keep in mind the following limitations and restrictions: The maximum number of workers per node per policy is eight and the default number of workers per node is three. The number of workers per job is a product of the number of workers per node multiplied by the number of nodes in the smallest cluster participating in a job (which defaults to all nodes unless a SmartConnect IP address pool restricts the number of nodes). For example, if the source cluster has six nodes, the target has four nodes, and the number of workers per node is three, the total worker count equals 12. The maximum number of workers per job is 40. At any given time, 200 workers could potentially be running on the cluster (5 jobs with 40 workers each). 40
If a user sets a limit of 1 file per second, each worker gets a ration rounded up to the minimum allowed (1 file per second). If no limit is set, all workers are unlimited; if the limit is set to zero (stop), all workers get zero.

On the target cluster, there is a limit of workers per node, called sworkers, to avoid overwhelming the target cluster if multiple source clusters are replicating to the same target cluster. By default, the limit is 100 workers; you can adjust this limit with the max-sworkers-per-node parameter. To adjust the load on the target cluster that incoming SyncIQ jobs generate, contact Isilon Technical Support.

The target cluster must be running the same version of OneFS as the source cluster or a later version; this allows you to replicate from a source cluster that is running an earlier version of OneFS.

To turn on SyncIQ automated failover and failback, both clusters must be running OneFS 7.0 or later.

Using SmartConnect with SyncIQ
In most cases, all the nodes in a cluster participate in a SyncIQ replication job. To limit which nodes take part in the jobs, you can use a SmartConnect IP address pool. When you set a policy target cluster name or address, you can use a SmartConnect Domain Name Service (DNS) zone name instead of an IP address or a DNS node name. If you restrict the connection to nodes in the SmartConnect zone, the replication job connects only with the target IP addresses of nodes in the target cluster that are assigned to the zone. On the source cluster, SyncIQ uses the list of target IP addresses to connect local replication workers with remote workers on the target cluster. Although you can set zones for each SyncIQ policy, a better practice is to set them globally in the SyncIQ settings page in the web administration interface. By default, OneFS applies the global settings to new policies unless you override the settings in the policy. For more information, see the OneFS Administration Guide.

Performance and policy job monitoring
To help tune performance, a best practice is to use the SyncIQ Performance page in the web administration interface to monitor network utilization and file processing rates. You can adjust the network and CPU usage. The SyncIQ Summary page in the web administration interface monitors jobs and tracks dataset statistics. The job-specific datasets and statistics can help you improve the performance of SyncIQ jobs. You can also view the SyncIQ log files by connecting to a node with SSH and looking at the following file: /var/log/isi_migrate.log

Target-aware initial synchronization
If the target cluster already contains some or all of the data that you want to synchronize, you can improve the performance of the first synchronization by running the job from the OneFS command-line interface as a target-aware initial synchronization, which is also known as a differential replication. (The command option for target-aware initial synchronization is diff_sync.)
After the first synchronization finishes, however, you should turn off target-aware initial synchronization. Otherwise, if the policy definition changes in a way that triggers a baseline replication, OneFS uses a target-aware initial synchronization instead of a normal full baseline replication. A target-aware initial synchronization consumes CPUs on both source and target clusters while it compares hashes of file blocks. The job also runs slower than an incremental synchronization. On a cluster running OneFS 7.0 or later, for example, you can check whether the target content-aware initial synchronization option is turned off by running the following command, replacing newpolicy1 with the name of your policy:

isi sync policy list --policy newpolicy1 --verbose

Id: b746ba74c004087c1a2878cfb87ece26
Spec:
Name: newpolicy1
Description:
Source paths:
Root Path: /ifs/data/source
Exclude: /ifs/data/media
Exclude: /ifs/data/movies
Source node restriction:
Destination:
Cluster: localhost
Password is present: no
Path: /ifs/data/target
Make snapshot: off
Restrict target by zone name: off
Force use of interface in pool: off
Predicate: -path /data,something
Check integrity: yes
Skip source/target file hashing: no
Disable stf syncing: no
Rename snapshot: off
Log level: notice
Target content aware initial sync (diff_sync): no
Log removed files: no
Rotate report period (sec):
Max number of reports: 2000
Coordinator performance settings:
Workers per node: 3
Task: copy manually
State: on

For more information on the diff_sync option, see the OneFS Command Reference for your version of OneFS.

Best practices for SyncIQ jobs
The following options can improve the performance of SyncIQ replication jobs:
43 Avoid triggering full dataset replications unnecessarily. Certain configuration changes trigger a complete resynchronization, regardless of whether the data in the target directory has changed since the last run. To avoid triggering a full replication, avoid changing the parameters on the source dataset including the root path, the include-exclude paths, the file type, the change time, and regular expressions. Avoid replicating unnecessary data. Not all directories or data requires replication. Do not use hyphens or other special characters in bandwidth or throttle rules. When administering or executing SynclQ jobs remotely over SSH, install SSH client certificates on the Isilon cluster to avoid having to enter the user password for every policy job. Disable block-hash comparisons during replication jobs. By default, SyncIQ jobs compare every block of data between the source and target replication sets to check for changes since the last replication job. The comparison can increase replication time and add considerable CPU load on the nodes conducting the hash calculations in environments where there is a fast network link between the source and target cluster. Under these conditions, CPU load can slow down network transfer, and the hash calculations can disrupt other storage activity. EMC Isilon SmartQuotas The EMC Isilon SmartQuotas module tracks disk usage with reports and enforces storage limits with alerts. In general, using the SmartQuotas module is a best practice because the module optimizes storage capacity while minimizing costs. With quotas, you can buy less storage capacity up front; defer capacity expansions to match your actual usage; and save on power, cooling, and the other costs that are associated with unused capacity. The SmartQuotas module also helps ration storage resources: The quotas motivate users and groups to use their storage allotments economically. Quotas also prevent a user, department, or project from infringing on the storage space of others. SmartQuotas lets you send notifications to users and groups who approach or exceed their quotas. Notifying users as they approach their storage limits is a best practice because it shifts part of the burden of managing storage space to users. The SmartQuotas module also generates reports that you can analyze to identify storage usage patterns patterns that can help define storage policies, allocate storage more efficiently, and plan capacity. Include data protection overhead in disk usage calculations For each quota that you set, you should specify that data protection overhead be included in future disk usage calculations which makes your quota calculations explicit. If you include data protection overhead in usage calculations for a quota, future calculations will include the total amount of space that is required to store files and directories as well as the space that is required to protect the data. 43
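The following minimal sketch illustrates why including overhead matters, assuming for the example the roughly 80 percent storage efficiency cited earlier for erasure coding on clusters of five or more nodes; the numbers are illustrative only.

# Illustrative arithmetic only: 1000 GB of logical data stored at 80% efficiency
# consumes 1000 / 0.80 = 1250 GB of raw space, and a quota configured to include
# protection overhead charges the larger figure.
LOGICAL_GB=1000
EFFICIENCY_PCT=80
echo "Space charged to the quota: $(( LOGICAL_GB * 100 / EFFICIENCY_PCT )) GB"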
44 Include the space that snapshots consume You should configure quotas to include the space that is consumed by snapshots to make your quota calculations explicit. A single path can have two quotas applied to it: one without snapshot usage, which is the default, and one with snapshot usage. The actual disk usage is the sum of the current directory and the snapshots of the directory. You can see which snapshots are included in the calculation by examining the.snapshot directory for the quota path. View reports to manage quotas The SmartQuotas module includes reports to manage cluster resources and analyze usage statistics. You should routinely view the storage quota reports. You can, for example, produce summaries by applying a set of filtering parameters and sort types. Storage quota reports include information about violators, which are grouped by types of thresholds. As a best practice, you should use the reports to manage your quotas and anticipate changes that might require new limits. After you create a new quota with SmartQuotas, the quota begins to report data almost immediately, but the data is incomplete until the QuotaScan job finishes. Before using the data, verify that the QuotaScan job has finished. Additional best practices for setting quotas Here are other best practices for setting quotas: You should not create quotas of any type on the root directory of the default OneFS share (/ifs). A root-level quota may degrade performance. The actions of an administrator logged in as root may push a domain over a quota threshold. For example, changing the protection level or taking a snapshot has the potential to exceed quota parameters. System actions such as repairs may also push a quota domain over the limit. Keep such factors in mind when you administer your cluster. You should not create multiple quotas that govern the same directory. Overlapping quotas can degrade performance. Permissions for mixed environments In most cases, the Isilon cluster s default settings handle permissions securely and effectively for networks that mix UNIX and Windows systems. To deal with those cases when the default settings lead to unexpected behavior, this section discusses other options you can use to obtain the results you want, either by manually configuring OneFS with ACL policies or by using a preset configuration. On-disk identity OneFS freely mixes Windows and UNIX identities. It uses an on-disk identity to transparently map the requesting protocol to the right identifier. A user connecting to the cluster over NFS with a user identifier (UID) and group identifier (GID) is mapped to a security identifier (SID) for access to files that another user stored by using SMB. In the same way, a Windows user connecting to the cluster over SMB with a SID is mapped to a UID and GID to access files stored by a UNIX client. 44
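As a simple way to see this mapping in practice, you can compare the POSIX view of a file from an NFS client with the stored ACL that OneFS shows on the cluster; a fuller ls -le example appears later in this paper, and the paths below are hypothetical.

# From a Linux NFS client, the file appears with POSIX mode bits:
ls -l /mnt/projects/report.doc
# On the cluster, the OneFS ls -le option shows the stored ACL and the
# Windows identities mapped to the file:
ls -le /ifs/projects/report.doc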
45 When a user connects to a cluster, OneFS expands the user s identity to include the user s other identities and groups. Based on the identity mappings, OneFS generates an access token. When a user attempts to access a file, OneFS compares the access token generated during the connection with the authorization data on the file. Even though OneFS expands an identity to include identities from other identity management systems, OneFS stores an authoritative version of it the preferred ondisk identity. For more information, see the topics on configuring the on-disk identity and configuring identity mapping in the OneFS Administration Guide. Run the repair permissions job after changing the on-disk identity If you change the on-disk identity, a best practice is to run the repair permissions job. For more information, see the OneFS Administration Guide. The EMC Isilon documentation portal includes additional best practices for setting up a cluster to work with multiple directory services. ACL policies for mixed environments The Isilon cluster includes ACL policies that control how permissions are processed and managed. By default, the cluster is set to merge the new permissions from a chmod command with the file s ACL. Merging permissions is a powerful method of preserving intended security settings while meeting the expectations of users. In addition, managing permissions policies manually gives you a range of options to respond to the kind of special cases that can surface in mixed environments. An alternative method is to set the Isilon cluster s global permissions policy to balanced mode a mode designed to automate file sharing management for a network that mixes UNIX and Windows systems. In contrast to manually configuring ACL policies, balanced mode forces you to use a predetermined set of policies. If any of the balanced-mode policies are unsuitable for your mixed environment, it is recommended that you configure the policies manually. When you configure policies manually, there are two categories to choose from: permission policies and advanced settings. The sections that follow look at the application of some of these policies in the context of a multiprotocol file server. For instructions on how to apply them, see the OneFS Administration Guide. Permissions policies This section examines the basic permissions policies. The examples assume that you have selected the policy that allows the Isilon cluster to create ACLs over SMB. In addition, when you manually configure permission policies, some policies apply only in the context of another setting. For details, see the OneFS User Guide. Run chmod on a file with an ACL There are five policy options to configure how OneFS processes a chmod command run on a file with an ACL. The option to merge the new permissions with the ACL is the recommended approach because it best balances the preservation of security with the expectations of users. Table 6 explains each of the four other options. Table 1. Options for configuring how OneFS processes a chmod command 45
Merge the new permissions with the existing ACL: This option is the recommended approach because it best balances the preservation of security with the expectations of users. For more information, see the white paper titled EMC Isilon multiprotocol data access with a unified security model on the EMC website.

Remove the existing ACL and set UNIX permissions instead: This option can cause information from ACLs, such as the right to write a DACL to a file, to be lost, resulting in a behavior that gives precedence to the last person who changed the mode of a file. As a result, the expectations of other users might go unfulfilled. Moreover, in an environment governed by compliance regulations, you could forfeit the rich information in the ACL, such as access control entries that allow or deny access to specific groups, resulting in settings that might violate your compliance thresholds.

Remove the existing ACL and create an ACL equivalent to the UNIX permissions: This option can have the same effect as removing the ACL and setting UNIX permissions instead: Important security information that is stored in the original ACL can be lost, potentially leading to security or compliance violations.

Remove the existing ACL and create an ACL equivalent to the UNIX permissions for all the users and groups referenced in the old ACL: This option improves matters over the first two settings because it preserves the access of all the groups and users who were listed in the ACL.

Deny permission to modify the ACL: This option can produce unexpected results for users who are owners of the file and expect to be able to change its permissions.

The inheritance of ACLs created on directories by chmod
The inheritance of ACLs that are created on directories by running the chmod command presents a point of contention. The security models of SMB and NFS are at odds. On Windows, the access control entries for directories can define fine-grained rules for inheritance; on UNIX, the mode bits are not inherited. The policy that makes directories not inheritable may be the more secure setting for tightly controlled environments. Such a setting, however, may deny access to some Windows users who would otherwise expect access. For a list of the types of permissions inheritance in an ACE, see the white paper titled EMC Isilon multiprotocol data access with a unified security model on the EMC website.

Chown command on files with ACLs
The result of changing the ownership of a file is different over SMB and NFS. Over NFS, the chown command changes the permissions and the owner or owning group. On Windows systems and over SMB, changing file ownership leaves ACEs in the ACL
47 unmodified. The ACEs in the ACL associated with the old and new owner or group remain the same. The conflict between the two models becomes crucial when you run a chown command over NFS on a file that was stored over SMB. The file has an ACL, which can include ACEs that explicitly allow or deny access to certain users and groups. The chown command could wipe out the ACEs. Because the security models differ at such a crucial point, OneFS lets you choose the approach to changing file ownership that works best for you. For a mixed environment with multiprotocol file sharing, leaving the ACL unmodified is the recommended approach because it preserves the ACEs that explicitly allow or deny access to specific users and groups. Otherwise, a user or group who was explicitly denied access to a file or directory might be able to gain access, possibly leading to security or compliance violations. This setting does not affect chown commands performed over NFS on files with UNIX permissions, and it does not affect ownership changes to Windows files over SMB. Access checks (chmod, chown) With the UNIX security model, only the file owner or the superuser has the right to change the mode or the owner of a file rights to which UNIX users are accustomed. With the Windows model, users other than the file owner can also have the changepermissions right (WRITE_DAC) and the take-ownership right (WRITE_OWNER) rights that they expect to be able to use. In addition, in Windows Server 2008 and Windows Vista, you can curtail the default permissions (WRITE_DAC and READ_CONTROL) that an owner of an object automatically receives by adding the OwnerRights security principal to an object s ACL. With version or later, OneFS works with the OwnerRights SID to curtail the powerful WRITE_DAC right. According to MSDN, using the OwnerRights SID helps ensure that users who create files and folders cannot change the intended access control policy. The asymmetry between the two models that drives the access check policy is as follows: POSIX file systems chmod: only owner and root chown: only root chgrp: only owner and root, and only to groups they are in NTFS Change ACL: requires WRITE_DAC (or owner) Change owner: requires WRITE_OWNER, which can be taken but not given away Change group: requires WRITE_OWNER, which can be given to any group If you configure permissions settings manually, the access check policy enables you to apply the approach most likely to meet the expectations of users in your environment. 47
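A brief sketch of the POSIX side of this asymmetry, run from an NFS client with hypothetical paths and users, is shown below; the same file accessed over SMB could be modified by any Windows user holding the WRITE_DAC or WRITE_OWNER right.

# Run as a user who does not own the file and is not root; under the
# UNIX model both operations are refused.
chmod g+w /mnt/projects/results.log    # only the owner or root can change the mode
chown jdoe /mnt/projects/results.log   # only root can change the owner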
Advanced settings

This section examines some of the advanced ACL settings. For instructions on how to set these options, see the OneFS Administration Guide. More of the advanced settings are described in the white paper titled EMC Isilon multiprotocol data access with a unified security model, which is available on the EMC website.

Owner and group permissions

The advanced setting for owner and group permissions affects only how permissions are displayed in mode bits. For both the owner and group permissions, the option that approximates mode bits by using all possible owner or group ACEs displays a more permissive set of permissions than the actual mode bits themselves. If, for example, you are an owner who has only read access to a file but might be in a group with write and execute access, the owner's permissions are displayed as rwx on a UNIX client as well as on OneFS. Checking all the group access control entries before displaying the permissions would be, in terms of computing resources, too expensive.

Viewed from an export mounted on a UNIX client, the mode bits of the following file show rwx for group:

[root@rhel5d adir]# ls -l print.css
-rwxrwx Apr print.css

Although the mode bits are also set to rwx for group on OneFS, running the OneFS ls -le command on the Isilon cluster reveals that the group does not explicitly have permission to write to the file:

ls -le print.css
-rwxrwx W2K3I-UNO\jsmith W2K3I-UNO\market 807 Apr print.css
 OWNER: user:w2k3i-uno\jsmith
 GROUP: group:w2k3i-uno\marketing
 0: user:w2k3i-uno\jsmith allow file_gen_all
 1: group:w2k3i-uno\marketing allow file_gen_read,file_gen_execute
 2: user:root allow file_gen_all
 3: user:nobody allow file_gen_all
 4: everyone allow std_read_dac,std_synchronize,file_read_attr

OneFS displays rwx for group because either the root user or the nobody user might be a member of the marketing group. The permissiveness compensates for applications and NFS clients that incorrectly inspect the UNIX permissions when determining whether to attempt a file system operation. By default, this setting ensures that such applications and clients proceed with the operation so that the file system can properly determine user access through the ACL.

In an environment where no applications or clients incorrectly inspect UNIX permissions and where most users are accustomed to seeing a more literal representation of their owner or group permissions, you can set the ACL policy to approximate the owner or group mode bits by using only the ACE with the owner or group identity. Keep in mind, though, that this setting might result in access-denied problems for UNIX clients. In addition, although the mode bits will correctly represent the permissions in nearly every case, they might occasionally make the file appear to be more secure than it really is, a situation that, according to RFC 3530, should be avoided. It is therefore recommended that you use the default setting for mixed environments.
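A rough way to picture the permissive approximation is as a union of rights across the ACEs that a group member might hold. The following Python sketch reuses the ACEs from the listing above with a simplified rights vocabulary; it is illustrative only and does not reproduce the OneFS display logic.

    # Illustrative only: approximate the group mode bits by taking the union of
    # rights from every allow ACE a group member might hold, instead of checking
    # actual group membership for each entry (which would be too expensive).

    FULL = {"read", "write", "execute"}
    acl = [("user:jsmith", "allow", FULL),
           ("group:marketing", "allow", {"read", "execute"}),
           ("user:root", "allow", FULL),
           ("user:nobody", "allow", FULL),
           ("everyone", "allow", set())]   # std_read_dac and similar rights omitted in this model

    def displayed_group_bits(acl, owner="user:jsmith"):
        rights = set()
        for trustee, ace_type, ace_rights in acl:
            if ace_type == "allow" and trustee != owner:   # root or nobody might be in the group
                rights |= ace_rights
        return (("r" if "read" in rights else "-")
                + ("w" if "write" in rights else "-")
                + ("x" if "execute" in rights else "-"))

    print(displayed_group_bits(acl))   # 'rwx', even though marketing itself has only read and execute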
No deny ACEs

For files with mode bits, OneFS generates a synthetic ACL to control the access of Windows users. A synthetic ACL can contain both allow and deny entries, which either allow or deny access to specific users and groups. By default, OneFS removes the deny entries from synthetic ACLs. Although their removal can lead to representations that are less restrictive than the corresponding POSIX mode bits, there are only two fringe cases in which less restrictive representations occur:

- When a group is explicitly allowed access but a user who is a member of the group is explicitly denied access
- When everyone is allowed access but a group or a user is explicitly denied access

Here is an example. Suppose that a file stored by a UNIX system has the following mode bits and access control entries:

-r--rw-rwx
 SYNTHETIC ACL
 0: user:w2k3i-uno\jsmith allow file_gen_read
 1: user:w2k3i-uno\jsmith deny file_gen_write,file_gen_execute
 2: group:w2k3i-uno\marketing allow file_gen_read,file_gen_write
 3: group:w2k3i-uno\marketing deny file_gen_execute
 4: everyone allow file_gen_read,file_gen_write,file_gen_execute

In the example, the user jsmith is specifically denied write and execute access. If, however, jsmith is in the marketing group, jsmith will obtain write access to the document, an unexpected, but unlikely, consequence of the default setting. Such a consequence may be unacceptable in an environment governed by tight security policies or compliance regulations. In such cases, you can set OneFS to preserve the deny ACEs, but doing so can have detrimental side effects.

Windows systems require that deny ACEs come before allow ACEs. When a Windows user retrieves a file with a synthetic ACL from the cluster and modifies its ACL, the Windows ACL user interface rearranges the ACEs into the Windows canonical ACL order, meaning that all the deny entries are put before the allow entries. When the user returns the file to the Isilon cluster, the new ACE ordering makes the file more restrictive than it was intended to be. Consider the following example:

-rwxrw-rwx
 SYNTHETIC ACL
 0: user:0 allow file_gen_all
 1: group:0 allow file_gen_read,file_gen_write
 2: group:0 deny execute
 3: everyone allow file_gen_all

When the Windows ACL user interface sets the ACEs in canonical order, it puts the deny entries before the allow entries, a change that in effect denies execute access to the user and to everyone, even though both were explicitly allowed file_gen_all, which includes the execute access right.
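The effect of the reordering can be seen with a short sketch. The following Python fragment uses the synthetic ACL from the example above together with a deliberately simplified first-match evaluation; it is not an implementation of the Windows or OneFS access-check logic, but it shows why moving the deny entry to the front removes execute access.

    # Simplified first-match evaluation: walk the ACEs in order and stop at the
    # first entry that mentions the requested right for a matching trustee.
    def allowed(acl, principals, right):
        for trustee, ace_type, rights in acl:
            if trustee in principals and right in rights:
                return ace_type == "allow"
        return False

    FULL = {"read", "write", "execute"}
    original = [("user:0", "allow", FULL),              # synthetic ACL as generated by OneFS
                ("group:0", "allow", {"read", "write"}),
                ("group:0", "deny", {"execute"}),
                ("everyone", "allow", FULL)]

    # Windows canonical order places deny entries before allow entries.
    canonical = sorted(original, key=lambda ace: ace[1] != "deny")

    requester = {"user:0", "group:0", "everyone"}
    print(allowed(original, requester, "execute"))    # True: the user's allow entry is reached first
    print(allowed(canonical, requester, "execute"))   # False: the reordered deny entry now wins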
Home directories

At a high level, capacity planning entails scaling an Isilon cluster to accommodate the competing demands of combined workloads. In the case of home directories, the workload requirements stem from several factors: disk capacity to handle the combined storage requirements of all users, sufficient disk throughput to process the combined transactional requirements of all users, and enough network bandwidth to deliver adequate throughput.

The following characteristics mark typical home directories:

- User directories are mapped automatically at the time of user login, typically through either a login script or a persistent user profile connection.
- Most users store documents, spreadsheets, presentations, images, and media rather than high-transaction, high-volume datasets.
- Per-user connections are intermittent, with short bursts followed by long pauses.
- Enterprise requirements and user expectations for throughput and performance are subject to different standards than enterprise file services.
- Data is often retained for long periods without being accessed or modified.
- Home directory snapshots and backups are often managed by capture and retention policies that differ from those used for enterprise data.

Sizing guidance for storage capacity

Determining the overall disk capacity requirement is a straightforward process. The objective is to calculate the amount of disk space necessary to provide sufficient capacity for your users until a capacity expansion adds more disk space. If you expand the cluster once a year, for example, the initial number of nodes must provide enough disk capacity to last the year.

The following factors can help you estimate how much capacity to set aside for home directories (a rough worked example follows the list):

- The number of users with home directories
- The protection level for the top-level directory of home directories, typically /ifs/home
- The expected rate of growth for data stored in home directories; the rate might be affected by adding more users, increasing the per-user disk-capacity allocation, or changing your SnapshotIQ policies
- The performance requirements for the data in home directories
- The expected disk capacity allocation per user; an accurate assessment of the expected allocation may require the inclusion of other factors, such as snapshot settings (for example, the size of each SnapshotIQ operation and the rate of change to the home directory dataset), as well as space, archive, and retention policies and quota enforcement settings
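As a rough illustration of how these factors combine, the following Python sketch computes a first-pass capacity estimate for a hypothetical deployment. Every input value, the flat protection-overhead figure, and the way the rate of change and snapshot retention are applied are assumptions made for this example; an Isilon technical consultant would refine the model for a real environment.

    # First-pass home directory capacity estimate; every value is an assumption
    # chosen for illustration, not a sizing recommendation.
    users           = 2000      # number of users with home directories
    quota_gb        = 10        # expected per-user disk capacity allocation
    annual_growth   = 0.30      # expected yearly growth of the home directory dataset
    daily_change    = 0.02      # fraction of data changed per day (drives snapshot overhead)
    retention_days  = 30        # how long each snapshot is retained
    protection_ovhd = 0.20      # assumed overhead for the protection level of /ifs/home

    base_tb     = users * quota_gb / 1024.0
    grown_tb    = base_tb * (1 + annual_growth)              # capacity needed by year end
    snapshot_tb = grown_tb * daily_change * retention_days   # space held by retained snapshots
    raw_tb      = (grown_tb + snapshot_tb) * (1 + protection_ovhd)

    print("Year-end usable capacity: %.1f TB (including %.1f TB of snapshot overhead)"
          % (grown_tb + snapshot_tb, snapshot_tb))
    print("With protection overhead: %.1f TB" % raw_tb)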
Once this information is known, an Isilon technical consultant can determine the total amount of storage space required for home directory data, as well as the node type and configuration needed to satisfy the capacity requirement.

SnapshotIQ

SnapshotIQ can make read-only, point-in-time copies of OneFS directories and subdirectories. The snapshots create little performance overhead, regardless of the level of file system activity, the size of the file system, or the size of the directory. When you develop a plan for snapshots of home directories, consider the following:

- There is no requirement for reserved space for snapshots in OneFS. Snapshots can use as much or as little of the available file system space as you want.
- With SmartPools, snapshots can reside on a different disk tier than the original data.
- Up to 1,024 snapshots can be created per directory.

Capacity planning considerations

The following guidelines can help you plan the capacity of home directories:

- While an application-driven workload, such as EDA, may require sizing the cluster first for performance and second for capacity, the reduced SLAs typically in effect for home directories often mean that disk space is the primary consideration.
- Even with SmartQuotas policies in effect, you might implement policies that allow certain users or data types to bypass the standard quotas. Policies that are less restrictive than the norm may lead to home directories growing faster than your plan anticipates.
- Different types of users warrant different quota settings. Rather than a single per-user capacity quota and a single exceptions policy, consider a tiered-quota approach that allocates different quotas to different categories of users, such as managers and IT administrators. Some organizations use a stair-step approach to quota policy allocation, for example, a base 10 GB policy for most users, then a 25 GB policy for the next tier of users, then 50 GB, then 100 GB, and so forth. This approach lets you increase user quotas as necessary; a short sketch at the end of this section illustrates the idea.
- Although SmartQuotas can constrain data growth, snapshots of home directories create the opposite effect: more frequent snapshots lead to automatic increases in the amount of space in use, while the length of the retention policy (that is, the standard lifecycle of a single snapshot) determines how long the disk space remains in use. SmartQuotas capacity restriction policies can include or exclude the overhead associated with SnapshotIQ data.
- Isilon's unique architecture requires capacity planning when you consider the number of user connections per node that will be required to provide acceptable performance for home directory data. For more information, see the section on EMC Isilon SmartConnect.

Capacity planning best practices

EMC Isilon recommends the following for planning and managing home directory disk capacity:

- Configure a separate accounting quota for the /ifs/home directory to monitor overall disk usage and issue administrative alerts to avoid running out of space.
- If an organization's SLA with respect to home directory data differs from the default general file services SLA, the snapshot schedule and snapshot retention settings can be adjusted to reduce the amount of capacity that snapshot operations consume.
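The stair-step quota approach described above can be pictured as a simple lookup from user category to quota tier. The following Python sketch only illustrates the allocation idea; the user categories and user names are hypothetical, the tier sizes mirror the example in the text, and the quotas themselves would be configured with SmartQuotas on the cluster.

    # Illustrative stair-step quota allocation: map each category of user to a
    # quota tier instead of applying one flat per-user limit. The tier sizes
    # mirror the example in the text; the categories are assumptions.
    QUOTA_TIERS_GB = {"standard": 10, "power": 25, "manager": 50, "it_admin": 100}

    def quota_for(category):
        """Return the per-user quota, in GB, for a category of user."""
        return QUOTA_TIERS_GB.get(category, QUOTA_TIERS_GB["standard"])

    users = {"jsmith": "standard", "adavis": "manager", "backupsvc": "it_admin"}
    committed_gb = sum(quota_for(category) for category in users.values())
    print("Capacity committed if every quota fills: %d GB" % committed_gb)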
Conclusion

An EMC Isilon cluster with S-Series nodes and solid-state drives is ideally suited to maximize performance for electronic design automation. The best practices, technology, and strategies described in this document can help you take advantage of the Isilon scale-out distributed architecture to improve performance for working datasets, cost-effectively store less important data, rapidly serve metadata, minimize CPU bottlenecks, and optimize wall-clock performance for concurrent jobs.