Recent Advancements and Future Challenges of Storage Systems


INVITED PAPER

In the Internet environment, hard drives are becoming global storage devices that can access many types of data, manage and locate desired data, and securely preserve it wherever it resides.

By David H. C. Du, Fellow IEEE

ABSTRACT | The rapid development of the Internet, the fast drop in storage cost, and the great improvement in storage capacity have created an environment with an enormous amount of digital data. In addition to traditional goals like performance, scalability, availability, and reliability, other challenges, including manageability, searching for desired information, energy efficiency, long-term data preservation, and data security, have become increasingly important for storage systems. We first present the evolution path of the past development of storage systems. The impact of the advancement of disk technology on fault tolerance is discussed next. The potential solutions and research issues of the new challenges are also briefly discussed.

KEYWORDS | Disk technology; file systems; intelligent storage; object-based storage device (OSD)

I. INTRODUCTION

Manuscript received December 15, 2007; revised February 13, 2008. Current version published December 2, 2008. This work was supported in part by an NSF grant. The author is with the Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, USA.

The past decade has witnessed tremendous advances in computing, wired and wireless communication, and storage technologies. The storage device has increased its capacity to 1 Tbyte per drive. The rotational speed of the disk spindle has also increased to 15 thousand revolutions per minute (RPM). Just as important, remarkable cost reductions have made large computing and storage capacity available to an increasing number of consumers. Many wired and wireless networks are integrated with the Internet protocol (IP) into a single information technology infrastructure called the Internet. With this unprecedented connectivity provided by the Internet, many new applications have emerged and developed. A huge amount of data has become available for access to satisfy the demands of these new applications. Each person has access to a relatively large number of storage devices in desktop and laptop computers as well as in personal digital assistants and cellular phones. The traditional focus of storage systems on high-performance computing has been extended to cover this new environment. It increasingly becomes a challenge to manage the huge amount of data available at our fingertips and to locate the data we need at any time from anywhere. More data types, like audio and video, are converted into digital format, and data of various types need to be preserved for a very long term (beyond 100 years, as documented in the SNIA 100 Year Archive Requirements Survey).

In current computer systems, data are stored as files under a given file system or in databases. A file system is associated with a particular computer (or computers, in the case of distributed file systems) and its operating system. A file system can be on top of a storage system. A database can be on top of file systems or directly on a storage system. A storage system consists of a number of disks or other storage devices that are connected and perform as a single integrated storage space. A disk usually contains one or more platters, with magnetic heads attached to an arm that reads bits from the platters. Each platter is divided into tracks. Each track is further divided into sectors.
The same track on all the platters forms a cylinder. To access a particular sector on a platter, the disk arm moves in or out to position the head over the track where the sector is located, while the platter rotates until the sector passes under the magnetic head.
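These mechanical steps dominate the latency of a random access. As a rough illustration, the sketch below estimates the average access time of a single request from its three components: seek, rotational latency, and transfer. The drive parameters are invented for the example, not measurements of any particular product.

```python
def avg_access_time_ms(avg_seek_ms, rpm, request_kb, transfer_mb_s):
    """Rough single-request disk access-time model (illustrative only)."""
    rotation_ms = 60_000.0 / rpm             # one full revolution
    rotational_latency_ms = rotation_ms / 2  # half a revolution on average
    transfer_ms = request_kb / 1024.0 / transfer_mb_s * 1000.0
    return avg_seek_ms + rotational_latency_ms + transfer_ms

# Hypothetical 15,000-RPM drive: 4 ms average seek, 64-KB request,
# 100 MB/s sustained media rate.
print(f"{avg_access_time_ms(4.0, 15_000, 64, 100):.1f} ms")
# -> 6.6 ms; seek and rotation dominate small transfers.
```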

Since outer tracks are larger than the inner tracks and rotate at the same speed, many high-performance disks employ a layout with more sectors on the outer tracks than on the inner ones. This is called zone bit encoding. A disk employing zone bit encoding divides a platter into multiple zones, each consisting of a set of contiguous cylinders. The cylinders in the same zone have the same recording density. Outer zones have more sectors per cylinder and a higher data transfer rate.

A challenge for disk technologies is that the magnetic heads need to be close enough to sense the data on the disk platter yet far enough away not to touch the surface. It is extremely challenging in physics to reduce platter vibration at high rotational speeds. This is one of the reasons that disk rotation speeds have not improved as much as areal density. Today's popular off-the-shelf disks have rotation speeds of either 7,200 or 10,000 RPM. High-performance disks have rotation speeds from 10,000 to 15,000 RPM. Research on magnetic disks focuses on how to increase the areal density and how to improve the reliability of the drive.

Most of the disks sold today have one or two 32-bit internal microprocessors and several megabytes of internal random access memory (RAM). The processor and RAM on disks enable them to perform complex tasks that go unnoticed by the users. One of these complex tasks is error detection and correction. Sophisticated algorithms using digital error checking and correction codes are used on disks to ensure data integrity. Error-correction code (ECC) bits are stored along with the user data on disks. Such an error detection and correction mechanism is built into the drive control electronics as the first line of error correction, to deliver user data without delay. When this first line of error correction fails, additional error recovery mechanisms are invoked. Sophisticated algorithms are also used to schedule multiple disk requests in the most efficient way to access data. Another feature that has been included in today's disk drives is a failure warning provided to the host system when error rates exceed a threshold value. In 1994, a failure-warning standard called self-monitoring and reporting technology (SMART), specified in the ANSI-SCSI Informational Exception Control (IEC) document X3T10/, was adopted by the disk drive industry. SMART uses an algorithm running on the drive's microprocessor that continuously checks whether internal drive errors, such as bit-read errors and track-seek errors, have exceeded a threshold rate. A warning is given to the host CPU when the error rates exceed the threshold. On some disks, like fiber channel (FC) drives, the processor can be programmed to execute XOR computations. Computing XOR on disks helps reduce the need for resources such as the host CPU and redundant array of independent disks (RAID) controllers in parity computations [1], [2]. The processing capability of a disk is expected to continue to increase, and so is its memory capacity.
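As a toy rendering of the SMART mechanism just described, the sketch below counts per-class drive errors and raises a warning once an error rate crosses its threshold. The error classes and threshold values are invented for the example; real drives define their own attribute sets and vendor-specific thresholds.

```python
class SmartMonitor:
    """Toy SMART-style monitor running on the drive's microprocessor."""

    def __init__(self, thresholds):
        self.thresholds = thresholds                  # error-rate limits
        self.errors = {k: 0 for k in thresholds}
        self.operations = {k: 0 for k in thresholds}

    def record(self, kind, failed):
        self.operations[kind] += 1
        if failed:
            self.errors[kind] += 1

    def warnings(self):
        """Warnings the drive would report to the host CPU."""
        return [f"SMART warning: {kind} error rate exceeded"
                for kind, limit in self.thresholds.items()
                if self.operations[kind]
                and self.errors[kind] / self.operations[kind] > limit]

# Invented thresholds: 1 bit-read error per 10^6 reads, 1 seek error per 10^5.
mon = SmartMonitor({"bit_read": 1e-6, "track_seek": 1e-5})
for _ in range(1000):
    mon.record("track_seek", failed=True)   # a drive going bad
print(mon.warnings())
```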
The data available over the Internet today is highly unstructured and heterogeneous. Therefore, the traditional way of accessing data via a file name or URL may not be adequate. As we move in the direction of accessing data based on its semantics, how to store semantic information with the data and how to allow data retrieval based on semantics become extremely challenging. The design goals of a traditional storage system are performance, scalability, reliability, and availability. With the dramatic changes in our computing and communication environment, how to deal with various types of data, how to manage and locate the desired data, and how to securely preserve the data for the long term become increasingly challenging. In the extreme case, we can consider all the data stored on Web servers over the Internet today as a single large integrated storage system. To meet these challenges, we have to reexamine the potential and basic functions that can be supported by a single storage device (disk drive) as well as the architecture of future storage systems. The solutions may depend on how we take advantage of the onboard computing and memory capability of a disk. We shall call such disks intelligent storage devices.

The rest of this paper is organized as follows. Section II describes the evolution path of storage systems from directly attached devices to global storage systems. Section III addresses the fault-tolerance issue of storage systems and the impact of the advancement of today's disk drives on fault-tolerant storage systems. Some of the challenges and research issues for current and future storage systems are discussed in Section IV. It is clear that more research is needed to investigate these issues. We believe that intelligent storage is one of the promising approaches. However, we are still far away from solving these challenges. We offer some conclusions in the last section.

II. FROM DIRECTLY ATTACHED TO GLOBAL STORAGE DEVICES

Ever since the late 1950s and early 1960s, when online digital storage devices were first introduced, scientists and researchers have never stopped searching for technologies to support bigger, better, and more scalable storage solutions. The truth is that improvements in storage capacity alone will not satisfy the growing need for more scalable storage solutions. Many emerging applications, such as digital medical imaging in picture archive and communication systems, digital libraries, and video on demand, require not only huge storage capacity but also high-performance and fault-tolerant storage systems. Other applications, such as e-commerce and the scientific analysis and visualization of very large data sets such as the human genome, require more powerful and scalable capabilities for searching and data mining.

In the past, the storage model was always assumed to be the direct attached storage (DAS) model. In DAS, storage devices are directly attached to a host system.

This model is the simplest way to connect host and storage devices and is efficient for storage systems at a small scale. It provides security, manageability (at small scales), and performance. However, this model provides very limited scalability. When a system reaches its limit on the number of devices it can attach, the system must be replicated. In addition, when additional host systems are needed, data stored on the storage systems must be either replicated or partitioned. There is no direct device sharing or data sharing in the DAS model. Device sharing between two DAS systems must be served by the host systems through computer networks, which introduces extra CPU overhead and access latency. Many old mainframe systems use the DAS model and proprietary protocols to connect host and storage devices. In recent decades, standardized protocols such as the small computer systems interface (SCSI) [3] and advanced technology attachment (ATA, also known as IDE; specified in X3T10/0948D, Information Technology - ATA Attachment Interface with Extensions (ATA-2)) have been used to reduce cost and increase interoperability. However, both SCSI and ATA have very limited connectivity. A SCSI channel can connect only up to 15 devices, and its popular parallel interfaces have a maximum distance of only a few feet. Although the speed and reliability of parallel SCSI and ATA have been improved with their serial versions (serial SCSI and serial ATA; see the SATA-1 specification, Serial ATA: High Speed Serial AT Attachment) [4], the extensibility and the number of devices that can be connected have not. Examples of DAS include a laptop, a desktop PC, or a single server whose storage devices usually reside inside the system. When multiple servers are required in an organization, DAS may seem low in cost from each server's point of view; when the number of servers becomes large, however, the DAS model becomes inefficient and expensive in total cost of ownership.

Network attached storage (NAS) is a technology that provides file-level access on any type of network. A NAS device consists of one or more storage devices and an engine that implements and provides file services. A host system accesses data on NAS devices using a file system device driver with a file access protocol such as the network file system (NFS), server message block (SMB), or common Internet file system (CIFS). With NAS, the storage can increase in capacity and scale in number independently from the applications. Another advantage of NAS is that it can use any popular network transport, such as asynchronous transfer mode, gigabit Ethernet, or even IP networks. Therefore, the cost of networking is relatively low. NAS technology is mature and interoperable, and it is popular for providing scalable file systems. However, NAS may not be suitable for applications that require short latency, such as database transactions, due to its file-level access latency, or for other high-performance applications.

To provide scalable, low-latency data transfers from a storage device to a host or to another storage device over a network, the storage area network (SAN) emerged and has gained popularity in recent years. Its primary use is for transferring data between computer systems and storage devices as well as among storage devices. A SAN consists of a communication infrastructure, a management layer, storage devices, and computer systems. The communication infrastructure provides physical connections, and the management layer organizes them.
From each computer system's point of view, the storage devices are connected as if they were in a DAS model. Nonetheless, storage devices can be shared very easily through the SAN. This is very beneficial and cost effective, especially when sharing such expensive devices as tape libraries and CD-ROM jukeboxes. In addition, storage devices can be added to the storage network independently of the hosts. Therefore, scalability is greatly improved. Fiber channel [5] and InfiniBand (IB) are two popular standards that provide physical connections and communication protocols in SANs. To a lesser extent, serial storage architecture (SSA) [6]-[8] and gigabit (or 10-Gbit) Ethernet are also used for this purpose. Some of these protocols include a layer mapping between SCSI and their lower layer protocols. One of the big differences between NAS and SAN is that SAN provides block-level access to the storage devices, while NAS provides file-level access. Therefore, a SAN can be optimized for data movement from servers to disk and tape drives. Because of its block-level access, SAN is more suitable for applications that require low latency and high performance.

Among the communication protocols that support SANs, fiber channel is the most popular. It provides up to four gigabits per second of link bandwidth and can run over either optical or electrical connections. It supports three basic topologies. The first is point to point, which allows a direct connection between a server and a storage device. The second is the loop topology, also referred to as fiber channel-arbitrated loop (FC-AL) [9], [10]. Up to 127 devices can be connected to an FC-AL. The third topology is fabric (or switched), which connects hosts and storage devices through fiber channel switches. By adding more ports per switch or more switches, more devices can be added to a SAN, and the total bandwidth scales with the total number of ports in the network. However, fiber channel switches are expensive: each fiber channel switch port costs about two to three times more than an IP-based gigabit Ethernet port. FC-AL provides a low-cost alternative to the switched architecture. It provides connectivity for up to 127 hosts and devices without the need for a fiber channel switch. FC-AL uses an arbitration protocol to determine who has the access right to the loop. Only the node (a host or device) that wins an arbitration phase is allowed to use the loop for a transfer, and only one transfer is allowed on a loop at a time. When a transfer ends, the next transfer right is given to the node that wins the following arbitration phase.
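As a minimal sketch of the arbitration behavior just described (one winner per arbitration phase, one transfer on the loop at a time), consider the loop scheduler below. Real FC-AL arbitrates by the arbitrated loop physical address (AL_PA); here a smaller priority number simply wins, which is a simplification for the example.

```python
def fc_al_schedule(requests):
    """Serialize transfers on a shared loop, FC-AL style (simplified).

    requests: list of (priority, transfer_name) pairs competing for the loop.
    """
    pending = list(requests)
    order = []
    while pending:
        winner = min(pending)     # arbitration phase: exactly one winner
        pending.remove(winner)
        order.append(winner[1])   # the loop carries one transfer at a time;
                                  # losers re-arbitrate when it completes
    return order

# Three nodes keep requesting the loop; transfers are fully serialized.
print(fc_al_schedule([(3, "host->disk A"), (1, "disk B->tape"), (2, "host->disk C")]))
# ['disk B->tape', 'host->disk C', 'host->disk A']
```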

Another alternative solution is InfiniBand. IB was originally designed for "system area networks" that would connect CPUs, memory, and I/O devices in a scalable cluster environment. IB offers switched-fabric-based point-to-point connections. Each connection is a bidirectional serial link with a transmission rate of 2 Gbit/s in each direction. IB also offers double (DDR) and quad (QDR) data rates at 4 and 8 Gbit/s, respectively. Links can be aggregated in units of four (4X) or twelve (12X). Therefore, a high I/O rate can easily be accomplished. Other unique features of IB include remote direct memory access (RDMA) support for low CPU overhead and quality of service for data transmissions. A storage subsystem is connected via a target channel adapter and each processor via a host channel adapter. Data are transmitted in packets of up to 4 KB. IB uses a superset of primitives from the virtual interface architecture [11]. For an application to communicate with another application on a different node via IB, a work queue consisting of a queue pair (QP) has to be created first. For each work queue, a corresponding completion queue is also created. The QP represents the application on the source and destination nodes. An operation (send/receive, RDMA-read, RDMA-write, or RDMA atomics) is placed in the corresponding work queue as a work queue element (WQE). The channel adapter picks up the WQE to carry out the intended operation. Once a WQE has been completed, a completion queue element (CQE) is created and placed in the completion queue. The benefit of using CQEs in the completion queue is to reduce the interrupts that would otherwise be generated. For each created QP, one of the following types of transport service can be chosen: reliable connection (RC), unreliable connection (UC), reliable datagram (RD), unreliable datagram (UD), or raw datagram. A QP with raw datagram is a data link layer service that sends and receives raw datagram messages that are not interpreted. Through these services and a type of credit-based flow control, quality of service can be accomplished in IB.
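The queue-pair mechanics above map naturally onto a small data structure. The sketch below models the WQE/CQE flow in plain Python; names such as post_send and poll_cq echo the style of verbs-like APIs but are illustrative, not the actual InfiniBand verbs interface.

```python
from collections import deque

class QueuePair:
    """Toy model of the IB work-queue flow: WQEs in, CQEs out."""

    def __init__(self, transport="RC"):   # RC, UC, RD, UD, or raw datagram
        self.transport = transport
        self.send_queue = deque()         # work queue: pending WQEs
        self.completion_queue = deque()   # CQEs, polled instead of interrupts

    def post_send(self, op, payload):
        """Place a work queue element (send, rdma_read, rdma_write, ...)."""
        self.send_queue.append({"op": op, "payload": payload})

    def channel_adapter_step(self):
        """The channel adapter picks up one WQE and carries it out."""
        if self.send_queue:
            wqe = self.send_queue.popleft()
            # ... the actual transfer would happen here (omitted) ...
            self.completion_queue.append({"op": wqe["op"], "status": "ok"})

    def poll_cq(self):
        """The application polls for completions rather than taking interrupts."""
        return self.completion_queue.popleft() if self.completion_queue else None

qp = QueuePair()
qp.post_send("rdma_write", b"block 42")
qp.channel_adapter_step()
print(qp.poll_cq())   # {'op': 'rdma_write', 'status': 'ok'}
```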
SSA [6]-[8] is another possibility and was originally designed by IBM. It provides two input and two output links to and from each node (host or device). Each link has a 20 MB/s (megabytes per second) bandwidth. Three different topologies are supported in SSA. The first is the SSA string. In an SSA string, the nodes are connected one next to another as a string, with each node in the middle connecting to a left neighboring node and a right neighboring node. The beginning and ending nodes of a string are connected to only one other node. The second topology is the SSA loop. An SSA loop can be formed by connecting the beginning and ending nodes of an SSA string. An SSA loop provides two nonoverlapping paths from one node to any other node on the same loop. Therefore, SSA loops provide fault tolerance against a single link failure. The SSA loop is the most common topology in SSA. The third topology is switched, which uses a switch node with six or more ports to connect a mix of nodes, SSA strings, and SSA loops. One of the most attractive features of SSA is spatial reuse. In SSA, each link operates independently. As opposed to FC-AL, which requires arbitration before accessing the loop and allows only one simultaneous transfer, SSA does not need any arbitration for accessing the loop, and it allows multiple nonoverlapping simultaneous transfers on a loop. Furthermore, since each link operates independently, simultaneous transfers on nonoverlapping portions of the loop can each utilize the full link bandwidth. This is called spatial reuse. With spatial reuse, the total system throughput can be much higher than the link bandwidth. Although technically interesting, SSA has been less successful and less popular due to its proprietary nature.

We will not discuss gigabit or 10-Gbit Ethernet in this paper, since they are popular local-area network standards and are commonly understood by the research community. Recently, a new effort to integrate gigabit Ethernet, IB, and FC in the data center has emerged (called data center Ethernet, the subject of an emerging IEEE standard).

Storage area networks have some common limitations. The first is their distance limitation. Although fiber channel provides a maximum of 10 km in distance, this is still not long enough for connections across a wide area such as multiple cities and states. Although NAS transfers data via computer networks that can travel across wide-area networks, it does not provide the low-latency access that many applications need. For applications that require low-latency accesses over a large geographic area, neither NAS nor SAN provides a satisfactory solution. The second limitation is cost. Even with fiber channel's popularity, a fiber channel switch port costs about three times as much as an equivalent IP-based port such as gigabit Ethernet. Expertise in fiber channel, InfiniBand, and SSA is also relatively rare compared to expertise in IP-based networks. This adds to the total cost of ownership for a SAN. Both of these limitations lead to IP-based storage solutions.

IP storage is widely envisioned as the storage networking of the future, providing low-latency block-level access across longer distances. IP storage takes advantage of popular, low-cost IP networking and provides block-level access to storage devices. Three major IP storage standards are currently being defined: Internet SCSI (iSCSI) [12]-[15], fiber channel over IP (FCIP) [16], and the Internet fiber channel protocol (iFCP) [17]. The concept is to transport SCSI packets or fiber channel frames over IP networks. The objective is to utilize low-cost, high-speed IP networks to provide low-latency block-level access to storage devices over long distances. In the short term, IP storage protocols including iSCSI, FCIP, and iFCP will more likely be used for connecting multiple remote SANs. There are still many challenges to overcome before native IP storage is widely accepted. A short description of iSCSI, FCIP, and iFCP is provided below.

iSCSI defines a protocol that enables transferring SCSI data over TCP/IP networks [12]-[15]. To identify an iSCSI device, each iSCSI node has two types of identifiers: an iSCSI name and an iSCSI address. A permanent, globally unique iSCSI name of up to 255 bytes is assigned to each iSCSI initiator and target by a naming authority. The combination of TCP port and IP address provides a unique network address for an iSCSI device. While the iSCSI name remains unchanged regardless of the location of the iSCSI device, the TCP port and IP address combination changes from one subnet to another. The iSCSI address is composed of an IP address, a TCP port number, and the iSCSI name of the node. In order for an initiator to connect to a storage resource, the Internet storage name service (iSNS) provides a DNS-type service for device discovery for an iSCSI initiator. An initiator first queries an iSNS server to resolve the IP addresses of potential target resources. The IP address is then used to establish a TCP/IP connection to the target resource.

FCIP [16] is a protocol intended for transferring FC frames over IP networks. By creating point-to-point IP tunnels between FCIP endpoints on remote SANs, it forms a single unified FC SAN across IP networks. Once the tunnels are set up, they are transparent to the FC devices. The devices use the FC addressing scheme and view each of the devices on the unified SAN as if they were in a local SAN. A single FC fabric namespace is established. The FCIP bridge at the sending FCIP endpoint encapsulates the FC frames and then uses TCP/IP for the transfer. At the receiving FCIP endpoint, the encapsulated FC frames are decapsulated and transferred in FC frame format on the remote SAN.

iFCP [17] is also intended to extend FC SANs over wide-area networks. It provides an alternative to FCIP. As in FCIP, it uses the common FC encapsulation format specified by the Internet Engineering Task Force to encapsulate FC frames. The major difference between FCIP and iFCP is their addressing schemes. As opposed to FCIP, which establishes point-to-point tunnels to connect two FC SANs, iFCP is a gateway-to-gateway protocol. The iFCP protocol allows each interconnected SAN to retain its own independent namespace, as opposed to FCIP's single unified namespace.
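The split between a stable iSCSI name and a changeable network address is easy to mistake, so the sketch below makes it concrete with an iSNS-like lookup table. The IQN string and the addresses are made-up examples; only the default iSCSI TCP port (3260) is real.

```python
# Toy iSNS-style directory: the iSCSI name is permanent, while the
# (IP address, TCP port) pair may change as a device moves between subnets.
isns_directory = {
    "iqn.2008-11.edu.example:storage.array1": ("192.168.10.7", 3260),
}

def iscsi_address(iscsi_name):
    """Resolve a stable iSCSI name to its current full iSCSI address."""
    ip, port = isns_directory[iscsi_name]
    return ip, port, iscsi_name    # IP + TCP port + name = iSCSI address

# The device moves to another subnet: only the network half changes.
isns_directory["iqn.2008-11.edu.example:storage.array1"] = ("10.0.4.21", 3260)
print(iscsi_address("iqn.2008-11.edu.example:storage.array1"))
```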
As discussed above, to meet the new challenges of the future computing and communication environment, disk drives can be further improved by adding more intelligence to the devices. The advantages of putting intelligence on storage devices include direct connection to networks for the parallel processing power of a large number of devices, integration of metadata with data, better security, search and filtering capability, and layout-aware scheduling. The concept of on-device processing led to the recent development of an object-based storage device (OSD) standard by the Storage Networking Industry Association. OSD [18], [19] provides a layer between block-level and file-level access that is intended to make existing and future storage access more effective. It includes three logical components: the Object Manager, OSD Intelligence, and the File Manager. The main function of the Object Manager is to locate objects and secure access to objects. The File Manager provides the security mechanism to ensure data privacy and integrity. It also provides an application programming interface for legacy applications to access files on an OSD, for backward compatibility. OSD Intelligence is the software that runs on the storage devices. It provides the device with the capabilities of a sense of time, communication with other OSDs, device and data management, and interpretation of OSD commands and object attributes. OSD is intended to create a new class of storage-centric devices that are capable of understanding, interpreting, and processing the data they store more effectively.

The OSD specification [18] defines a new device-type-specific command set in the SCSI standards family. The object-based storage device model is defined by this specification. It specifies the required commands and behavior that are specific to the OSD device type. Fig. 1 depicts the abstract model of OSD in comparison to the traditional block-based device model for a file system. The traditional functionality of file systems is repartitioned primarily to take advantage of the increased intelligence that is available in storage devices. Object-based storage devices are capable of managing their own capacity and presenting file-like storage objects to their hosts. These storage objects are like files in that they are byte vectors that can be created and destroyed and can grow and shrink during their lifetimes. Like a file, a single command can be used to read or write any consecutive stream of the bytes constituting a storage object. In addition to mapping data to storage objects, the OSD storage management component maintains other information about the storage objects in attributes, e.g., size, usage quotas, associated user name, access control, and semantic information.

Fig. 1. Comparison of traditional and OSD storage models.
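To make the byte-vector-plus-attributes model just described concrete, here is a minimal sketch of a storage object. The string-keyed attributes are a simplification: the actual OSD standard organizes attributes into numbered attribute pages rather than named fields.

```python
class StorageObject:
    """Minimal OSD-style user object: a growable byte vector plus attributes."""

    def __init__(self, partition_id, object_id):
        self.partition_id = partition_id   # (partition ID, object ID) is unique
        self.object_id = object_id         # within one OSD logical unit
        self.data = bytearray()            # the byte vector
        self.attributes = {"size": 0}      # example attribute only; the real
                                           # standard uses numbered attribute pages

    def write(self, offset, payload):
        """One command writes any consecutive byte range, growing the object."""
        end = offset + len(payload)
        if end > len(self.data):
            self.data.extend(b"\x00" * (end - len(self.data)))
        self.data[offset:end] = payload
        self.attributes["size"] = len(self.data)

    def read(self, offset, length):
        return bytes(self.data[offset:offset + length])

obj = StorageObject(partition_id=1, object_id=42)
obj.write(0, b"hello object storage")
print(obj.read(6, 6), obj.attributes["size"])   # b'object' 20
```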

A. OSD Objects

In the OSD specification, the storage objects that are used to store regular data are called user objects. In addition, the specification defines three other kinds of objects to assist in navigating user objects: the root object, partition objects, and collection objects. There is one root object for each OSD logical unit [18]. It is the starting point for navigating the structure on an OSD logical unit, analogous to the partition table of a logical unit on a block device. User objects are collected into partitions that are represented by partition objects. There may be any number of partitions within a logical unit, up to a quota defined in the root object. Every user object belongs to one and only one partition. The collection, represented by a collection object, is a more flexible way to organize user objects for navigation. Each collection object belongs to one and only one partition and may contain zero or more user objects belonging to the same partition. Unlike user objects, the three kinds of navigation objects mentioned above do not contain a read/write data area. All relationships between objects are represented by object attributes, discussed above. Various storage objects are uniquely identified within an OSD logical unit by the combination of two identification numbers: the partition ID and the user object ID.

The OSD standard offers improved but still limited intelligence. We believe OSD is the beginning of the development of more sophisticated intelligent storage devices. Other possible future features include storing semantic information in the object attributes, indexing objects based on attribute values and metadata information on the drive, using onboard computing power to respond to some simple queries, and a small search engine to filter unwanted data objects when responding to a query.

To take full advantage of intelligent storage devices, the concept of a global file system has also been developed. One possible global file system, based on a hybrid architecture, is illustrated in Fig. 2. The base level of the multitier architecture contains an arbitrary number of variable-sized regions comprising numerous intelligent OSDs under the control of a single regional manager. These regions are then composed into a peer-to-peer overlay network. The regions can themselves be hierarchical and so may be divided recursively into some appropriate number of subregions.

B. Regional Organization

To support all of the operations necessary to make each data object accessible from anywhere at any time, each region, in addition to a set of OSD devices, will include the following two components. The regional manager serves as a metadata server and provides centralized management of the clients and objects within the region. It maintains the metadata for each object whose home is within the region (as well as replicated metadata), performs the required security-related tasks, and authenticates access requests from within and outside the region. It also provides name services for all of the objects for which it serves as the home region and supports consistency management, replication, and migration. The regional manager must monitor the storage devices registered within the region, and it must track both permanent and visiting clients. While the regional manager provides centralized functionality, its actual operations can be distributed among a cluster of servers and storage devices. Clients are end users or applications that want to access objects within a region. After obtaining location and access information from the regional manager, clients can use an Internet (IP) connection to access individual objects directly from the appropriate storage device. Clients will have a home region that stores important information about each client, such as basic profile information, but clients are not tied to a specific region and can freely move among regions while still maintaining the same personalized view of the data objects they are allowed to access.
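Here is a sketch of the access flow just described under the paper's two-tier model: the client asks its regional manager for an object's location and an access credential, then talks to the OSD directly over IP. All names and the credential format are invented for illustration.

```python
class RegionalManager:
    """Toy regional manager: metadata lookup plus access authorization."""

    def __init__(self, region_name):
        self.region_name = region_name
        self.locations = {}   # object key -> (device address, partition, object id)

    def register(self, key, device_addr, partition_id, object_id):
        self.locations[key] = (device_addr, partition_id, object_id)

    def open(self, client, key):
        # Authentication and authorization checks would happen here.
        device_addr, pid, oid = self.locations[key]
        capability = f"cap:{client}:{key}"   # invented credential format
        return device_addr, pid, oid, capability

# Client side: one metadata round-trip, then direct access to the device.
mgr = RegionalManager("campus-west")
mgr.register("budget.xls", device_addr="10.1.2.3", partition_id=1, object_id=42)
addr, pid, oid, cap = mgr.open("alice", "budget.xls")
# read_from_osd(addr, pid, oid, cap)   # direct IP connection to the OSD
print(addr, pid, oid, cap)
```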

Fig. 2. A two-tier architecture.

The home region of a client can be changed depending on access patterns and performance-related issues. An issue we are investigating is the appropriateness of allowing clients to automatically change their home regions or of limiting this to a manual operation. An intelligent storage device (an OSD in this case) can be a low-level storage device like a disk, a storage subsystem (i.e., the intelligence is implemented in the storage controller of the subsystem), or a high-level storage server. The partitioning of clients and storage devices into regions can be done in response to logical or physical affinities. For instance, it is likely that many of the clients located on a single physical campus of an enterprise will access the same subset of data objects from the overall universe of objects (the same subset of accounts, for example). In this case, a physical grouping of clients and storage devices into the same region will improve performance by allowing the system to exploit the physical locality of data accesses. In other situations, however, a logical grouping into a region may be more appropriate, as when a department within an enterprise tends to access a common subset of data but is physically distributed across different locales. Regions will dynamically merge and split as performance characteristics within and between regions change.

Some work in the area of distributed and object-based file systems exists. Several projects, for instance, have investigated techniques for sharing storage devices within a cluster environment. The Linux-based Lustre project [20], which uses object-based storage devices, separates metadata from object data and allows direct data access between storage devices and clients; it does not, however, support data replication and migration or mobile clients, and it assumes that all devices within the cluster can be trusted. Both Panasas's ActiveStor and EMC's Centera have implemented some aspects of OSD. The Direct Access File System (DAFS) (DAFS Collaborative), which evolved from the NFS 4.0 system [21], is based on a central server. StorageTank [22], [23] uses a storage area network (SAN) architecture somewhat similar to the description above, in which clients obtain the metadata needed to access files from a server cluster, while the data itself is accessed directly from the distributed storage devices. Another similar global file system is the OceanStore project [24], which proposed a global-scale persistent storage infrastructure constructed from what could potentially be millions of untrustworthy storage devices. A key design goal of that project was to support nomadic data that could flow freely through the network while being cached anywhere at any time. Another project with similarities is the SHORE project [25], which examined a hybrid two-tier network with peer-to-peer communication between servers in which each server manages a group of clients. In this system, a client's managing server provides access to both local data and data managed by a remote server.

III. FAULT-TOLERANT STORAGE

The storage systems of today may consist of hundreds or thousands of disks. At any time, the probability of having single or multiple disk failures is large enough that it cannot be ignored. The fundamental way of dealing with disk failure is to group disks into redundant arrays of independent disks (RAID) [26], [27]. RAID uses parity information to provide fault tolerance against disk failure and stripes data over multiple disks for performance. RAID is one of the most important storage technologies. There are several RAID variants; the most popular include RAID-0, RAID-1, RAID-3, RAID-5, RAID-6, and RAID-10. In RAID, a number of disks are grouped together as a RAID group. For RAID-0, disk space is divided into blocks of a fixed size. A disk stripe consists of one block from each disk in a RAID group, with different blocks on a disk belonging to different stripes. The disk space in the RAID group can be viewed as one huge space that is partitioned into many nonoverlapping stripes. When storing data, if the data are greater than one disk block, they are stored into data blocks in the same stripe. If more data blocks are needed, they are stored in the data blocks of the next stripe. RAID-1 is data mirroring: data on one half of the RAID group are mirrored on the other half of the group. RAID-3 uses a different striping than RAID-0. Each data byte is stored across eight disks (one bit on each disk), with an additional disk allocated for storing the parity bit of each stripe. Disks in a RAID-3 group need to be synchronized for either reading or writing data. RAID-5 [28] is the most popular RAID version that provides fault tolerance. It uses striping similar to RAID-0, with the difference that a parity block is allocated in each stripe for storing the parity of that stripe. The parity blocks are allocated in a circular fashion and spread evenly among the disks: the parity block of stripe k is allocated on disk (k mod n), where n is the number of disks in the RAID-5 group. In RAID-6, the striped set carries dual distributed parity, which provides fault tolerance against two disk failures. RAID-6 makes large RAID groups more practical, since the larger the RAID group, the higher the probability of more than one disk failure. RAID-10 is a combination of RAID-1 (mirroring) and RAID-0 (striping). Sets of two disks are used: each disk is paired with a mirroring disk, so there are N mirrored groups, and the data are striped over the N groups. RAID-10 can tolerate more than one disk failure as long as the two disks in the same mirrored group have not both failed. RAID can be found in almost all mission-critical systems.

The parity data in RAID are used to reconstruct the lost data blocks when a disk failure occurs. When data are written to a data block, the corresponding parity block is also updated with the exclusive-or (XOR) of the new data block, the old data block, and the old parity block. This task is usually referred to as a read-modify-write operation.
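Because the parity is a pure XOR across the stripe, both the read-modify-write update and the reconstruction of a lost block reduce to a few XOR operations. The sketch below shows both on toy 4-byte blocks; real arrays use much larger blocks and the rotated parity placement described above.

```python
def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def read_modify_write(old_data, new_data, old_parity):
    """RAID-5 parity update: new_parity = old_data XOR new_data XOR old_parity."""
    return xor_blocks(xor_blocks(old_data, new_data), old_parity)

def rebuild(surviving_blocks):
    """Reconstruct the block on a failed disk by XOR-ing all survivors."""
    result = bytes(len(surviving_blocks[0]))
    for block in surviving_blocks:
        result = xor_blocks(result, block)
    return result

# Three data blocks and their parity; lose one disk, rebuild its block.
d = [b"AAAA", b"BBBB", b"CCCC"]
p = rebuild(d)                             # parity is the XOR of all data blocks
assert rebuild([d[0], d[2], p]) == d[1]    # recover the lost block d[1]
```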
Based on how and where the parity blocks are computed, traditional RAID implementations fall into two categories. The first category uses RAID controllers to process all the tasks required by the RAID functions. This type of implementation is referred to as hardware RAID. A hardware RAID is more expensive because it usually requires multiple channel interfaces for the participating disks, large memory for buffering temporary data, and processing units to execute the XOR computations and all the other management tasks. The other category of implementation uses the host's main processor (CPU) for all the RAID-related tasks. The advantage of such implementations is reduced cost, since no RAID controller is required. On the other hand, the RAID-related tasks compete with the applications for the CPU, memory, and other system resources. As a result, the applications and RAID operations may encounter poor, unstable, and unpredictable performance.

With the capacity of a disk increasing beyond 1 Tbyte and only limited improvement in I/O bandwidth, the rebuild time of a failed disk becomes a concern. Rebuilding a RAID may take several days instead of a few hours for a disk of 1 Tbyte. If a RAID level like RAID-5 can tolerate only one disk failure, the chance of losing data is greatly increased, since another disk failure can occur during this long rebuild period. Because the rebuild period is long, the performance degradation during the rebuild also becomes a concern. This tends to favor RAID-6, since a single disk failure will not seriously degrade performance, and RAID-10, since mirroring is the easiest and most efficient way to rebuild a failed disk. RAID-10 can potentially tolerate even more than two disk failures, as long as the failed disks are not in the same mirrored group.
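The rebuild-time concern above is easy to quantify. Assuming, purely for illustration, a 1-Tbyte disk and a rebuild that may be throttled to leave bandwidth for foreground I/O, the arithmetic looks like this; the bandwidth figures are assumptions, not measurements.

```python
def rebuild_hours(capacity_gb, disk_bw_mb_s, rebuild_fraction):
    """Hours to rewrite a whole replacement disk at a throttled rate."""
    effective_mb_s = disk_bw_mb_s * rebuild_fraction
    return capacity_gb * 1024 / effective_mb_s / 3600

# Dedicated rebuild at an assumed 80 MB/s: ~3.6 hours for 1 Tbyte.
print(f"{rebuild_hours(1024, 80, 1.0):.1f} h")
# Throttled to 10% of bandwidth under foreground load: ~36 hours.
print(f"{rebuild_hours(1024, 80, 0.1):.1f} h")
```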

IV. ADDITIONAL CHALLENGES FOR CURRENT AND FUTURE STORAGE SYSTEMS

The concept of intelligent storage was first proposed in [29]-[31] and [47] as active storage. Further extending this concept, an intelligent storage device can recognize the data objects stored on the device, monitor the usage of the data, and reallocate data objects if necessary to meet the required I/O bandwidth (object storage). It can interact with other intelligent storage devices or metadata servers to enforce certain data storage policies for ease of management; for example, a duplicate copy must be maintained on another storage device for highly critical data objects. Similar to the recently proposed self-regulating autonomic computing systems, in which processes are performed automatically in response to internal causes or influences, the actions of intelligent storage devices are regulated by the needs of the data they contain (autonomic storage). Some of these possible approaches were discussed in previous sections. In this section, we focus our discussion on energy-efficient storage systems and on safe data protection and long-term data preservation.

Rapid digitization of content has led to extreme demands on storage systems. The nature of data access (simulation data dumps, check-pointing, real-time data-access queries, data warehousing queries, etc.) warrants an online data management solution. Most online data management solutions make use of hierarchical storage management techniques to accommodate the large volume of digital data. In such solutions, a major portion of the data set is usually hosted by tape-based archival solutions, which offer cheaper storage at the cost of higher access latency. This loss in performance due to tape-based archive solutions limits the performance of the higher level applications that make these different types of data accesses. This is particularly true since many queries may require access to older, archived data. The decreasing cost and increasing capacity of commodity disks are rapidly changing the economics of online storage and making the use of large disk arrays more practical for low-latency applications. Large disk arrays also enable system scaling, an important property as the growth of online content is predicted to be enormous.

The enhanced performance offered by disk-based solutions comes at a price, however. Keeping huge arrays of spinning disks has a hidden cost: energy. Industry surveys suggest that the cost of powering our country's data centers is growing at a rate of 25% every year [32]. Among the various components of a data center, storage is one of the biggest energy consumers, consuming almost 27% of the total. To make matters worse, increasing performance demands have led to disks with higher power requirements; moreover, storage demands are growing by 60% annually, according to an industry report [33]. Given the well-known growth in total cost of ownership, a solution that can mitigate the high cost of power, yet keep data online, is needed. Various studies of data access patterns in data centers suggest that on any given day, the total amount of data accessed is less than 5% of the total stored [36]. Most energy conservation techniques make use of various optimizations to conserve energy, but this usually comes with a huge performance penalty.

The massive array of idle disks (MAID) is a design philosophy that has recently been adopted [34], [35]. The central idea behind MAID is that not all disks in a MAID storage array are spinning all the time. Within a MAID subsystem, disks remain dormant (i.e., powered off) until the data they hold are requested. When a request arrives for data on a disk that is off, the controller turns on the disk, which takes around 7 to 10 s, and services the request. Additionally, a set of disks is designated as cache disks, which are always spinning (i.e., never turned off). This disk-based caching is necessary because the regular memory cache is usually not large enough to hold all of the frequently accessed data. The MAID concept works on the assumption that less than 5% of the stored data actually gets accessed on any given day. Keeping this in mind, the MAID controller tries to make sure that frequently accessed data are moved to the always-on cache disks. For this reason, the response time of the system is very tightly tied to the size of the cache disk set. By increasing the cache hit ratio, the controller tries to minimize the response time and also conserve energy. The savings increase as the storage environments get larger. A commercial product based on this idea, Copan MAID, has seen a great deal of success in the realm of archival systems.
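Here is a toy rendering of the MAID policy just described: requests that hit the always-on cache disks are served immediately, while a miss must first spin up a dormant disk and pay a multi-second penalty. The timings and the promotion policy are illustrative assumptions.

```python
class MaidController:
    """Toy MAID: always-on cache disks in front of mostly-dormant disks."""

    SPIN_UP_S = 8.0    # illustrative; the paper cites around 7 to 10 s

    def __init__(self, cache_capacity):
        self.cache = set()             # block ids resident on cache disks
        self.cache_capacity = cache_capacity
        self.spinning = set()          # dormant disks currently powered on

    def read(self, block_id, home_disk):
        if block_id in self.cache:
            return 0.01                # served from an always-on cache disk
        latency = 0.01
        if home_disk not in self.spinning:
            self.spinning.add(home_disk)
            latency += self.SPIN_UP_S  # pay the spin-up penalty on a miss
        # Promote data to the cache disks (simplistic stand-in policy).
        if len(self.cache) < self.cache_capacity:
            self.cache.add(block_id)
        return latency

maid = MaidController(cache_capacity=1000)
print(maid.read("blk7", home_disk=3))   # first access: ~8 s spin-up
print(maid.read("blk7", home_disk=3))   # now cached: ~0.01 s
```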
One of the main drawbacks of the MAID approach is that although it tries to keep the most frequently accessed data in the cache disk set, this does not ensure good response times for noncached data. Data that are not cached could include data being accessed for the first time or data that cannot be cached due to their sheer volume or access pattern. A study of using application hints to increase the efficiency of prefetching and to achieve better energy efficiency is presented in [37]. Application hinting has drawn a great amount of interest in the high-performance computing community [38], [39]. The idea is to use application hints to prefetch data ahead of time, thereby reducing file system I/O latencies.

Other approaches to increasing the energy efficiency of storage systems are possible. A new energy conservation technique for disk-array-based network servers, called popular data concentration (PDC), was proposed in [40]. According to this scheme, frequently accessed data are migrated to a subset of the disks. The main assumption here is that data exhibit heavily clustered popularities. PDC tries to lay data out across the disk array so that the first disk stores the most popular data, the second disk stores the next most popular data, and so on. Since data blocks are constantly moved to different locations in the disk array, this mapping mechanism becomes very important.

A new solution called Hibernator was presented in [41]. The main idea here is the dynamic switching of disk speeds based on observed performance. This approach makes use of multispeed disk drives that can run at different speeds but have to be shut down to make a transition between speeds. Such disks have been demonstrated by Sony. Several other power-aware storage cache management algorithms that provide more opportunities for the underlying disk power management schemes to save energy were discussed in [42]. The general consensus of all these works is that normal cache management algorithms are not necessarily the best option when it comes to power conservation. Specifically, they explore the use of spatial and temporal locality information together in order to develop cache replacement algorithms.

A new type of hard disk drive that can operate at multiple speeds has also been explored for energy saving [43]. It was demonstrated that using dynamic revolutions per minute (DRPM) speed control for power management in server disk arrays can provide large savings in power consumption with very little degradation in delivered performance. The DRPM technique dynamically modulates the hard disk rotation speed so that the disk can service requests at different RPMs. For such a scheme to work, a dynamically adjustable speed disk must first be designed, for which there are several manufacturing/mechanical challenges. Designing a disk that can operate at various speeds is very complex and almost infeasible.
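The PDC layout rule translates directly into a small placement routine: rank the data by popularity and fill the disks in rank order, so that the trailing disks stay cold and can be spun down. The capacity model (a fixed number of items per disk) is a simplification made for the sketch.

```python
def pdc_layout(popularity, num_disks, per_disk_capacity):
    """Popular data concentration: most popular items on the first disks.

    popularity: dict mapping item -> access count.
    Returns disk index -> list of items; high-numbered disks end up cold,
    so the power manager can spin them down.
    """
    ranked = sorted(popularity, key=popularity.get, reverse=True)
    layout = {d: [] for d in range(num_disks)}
    for i, item in enumerate(ranked):
        layout[i // per_disk_capacity].append(item)
    return layout

counts = {"a": 900, "b": 750, "c": 40, "d": 8, "e": 2, "f": 1}
print(pdc_layout(counts, num_disks=3, per_disk_capacity=2))
# {0: ['a', 'b'], 1: ['c', 'd'], 2: ['e', 'f']} -> disk 2 rarely spins up
```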

Of course, energy-efficient solutions for data centers are not limited to storage devices and systems; cooling and servers are important too. However, these topics are outside the scope of this paper.

As for data protection, one of the major issues is where to protect the data. An earlier (and the most common) mechanism is to protect the communication link between the client and the storage system (for example, using IPSec). In this case, the data are assumed to be stored without encryption in a safely protected storage system. When the storage system could be trusted, this was considered to be enough. However, as many recent security incidents indicate, this is no longer the case. When a disk is stolen, the access control mechanism enforced in the file system is useless. As demonstrated in [43], insider misuse is the most common cause of system break-ins. Administrators have full access to the storage system, and they become the central point of social attacks. Further, the data can pass through several different entities, such as administrators, backup servers, and users. To prevent or mitigate the risk of insider misuse, the data have to be encrypted by the writer and decrypted by the reader. This rather simple-looking transition introduces new challenges, since the overhead of key management is now shifted to the client. The client now has to remember (store) all the keys used to encrypt any file. When a file has to be shared with other clients, distributing the key becomes a new problem. When a key is lost, the client will not be able to recover any of the encrypted data. Therefore, designing an efficient, transparent, and user-friendly key management mechanism is crucial. This is especially crucial for enterprises to adopt data encryption: the enterprise has to control the keys in order to keep accessing the data for many years to come. We have identified a few challenging issues related to key management for long-term data preservation.

A. Backup and Efficient Retrieval of Keys

Key backup is one of the fundamental requirements of key management for long-term storage. The cryptographic keys should be secured as long as the data are considered to be in existence. It is almost impossible to recover encrypted data if the associated keys are lost. This implies the following. 1) If the data are backed up, the keys must be backed up. 2) If the data are deleted, the keys should be deleted, for ease of management. 3) Keys must be backed up securely. In order to retrieve data that were encrypted (or signed) several years ago, the system must also retrieve the keys that were used to encrypt (or sign) the data. Considering the large number of keys that will be created in an organization over time, this can be a daunting task. As a result, efficient search and retrieval of these keys is important.
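As a sketch of the backup-and-retrieval requirements above, the toy archive below versions keys by data object and time, so that data encrypted (or signed) years ago can still be matched with the key that protected it, and deleting the data deletes its keys. The in-memory dictionary is a stand-in for a hardened, access-controlled repository.

```python
import bisect

class KeyArchive:
    """Toy long-term key archive: find the key active when data was written."""

    def __init__(self):
        self.history = {}   # data_id -> sorted list of (timestamp, key_bytes)

    def backup_key(self, data_id, timestamp, key_bytes):
        self.history.setdefault(data_id, []).append((timestamp, key_bytes))
        self.history[data_id].sort()

    def key_for(self, data_id, written_at):
        """Return the newest key issued at or before the write time."""
        versions = self.history[data_id]
        i = bisect.bisect_right(versions, (written_at, b"\xff" * 64)) - 1
        return versions[i][1]

    def delete_data(self, data_id):
        """Requirement 2): deleting the data deletes its keys as well."""
        self.history.pop(data_id, None)

archive = KeyArchive()
archive.backup_key("report.doc", 2001, b"k-old")
archive.backup_key("report.doc", 2006, b"k-new")
print(archive.key_for("report.doc", 2004))   # b'k-old'
```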
B. Key Recovery

In practice, just like the data, keys can be lost or accidentally deleted. Loss of keys results in loss of data. Therefore, the key management mechanism must ensure that lost keys can be easily recovered. Further, it can happen that the user who encrypted a file is not available at the time of decryption (e.g., the user no longer works for the organization). Under such circumstances, designated agents of the organization should be able to retrieve the keys. Key recovery mechanisms must be secure and efficient. The failure of a key recovery mechanism can jeopardize proper operation and the underlying confidentiality, and it can also result in improper disclosure of keys.

C. Long-Term Management

Typically, enterprises are reorganized periodically. Reorganization can require merging or splitting existing groups, and as a result, the keys should be reorganized as well. Therefore, key management mechanisms should handle such reorganization activities efficiently. Since the data and the keys must be protected for a long duration, many unforeseen events can happen. For example, a key can be compromised, or a cryptographic algorithm can be broken, since an attacker has more time to attempt breaking a key or an algorithm. If attackers can compromise a key, they can read encrypted data or even write data without being detected, since they can generate a valid signature.

D. Usability

Strong security with poor usability and/or performance is not sufficient to make a system practical. As illustrated by [44]-[46], secure systems and cryptographic software are used less than we would expect, due to a lack of consideration for the usability of these products. A recent survey [46] performed by Sun Microsystems shows that 45% of the surveyed companies find storage management to be the most expensive task, and 95% of the surveyed companies state that it is the most difficult. Considering these statistics, usability (by the administrator and, naturally, by the client) is one of the major concerns. A user should be able to use a file system that provides long-term key management without having to know any of the underlying cryptographic operations. Transparency should be provided as much as is necessary.

E. Scalability

Due to the huge amount of data generated for the long term, the number of keys will increase with the amount of data. The key management system must scale well with respect to the number of keys. Further, the number of keys that have to be secured physically should be minimal. In addition to these requirements, long-term data preservation by itself poses a great challenge to future storage systems.
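One common way to reconcile several of these requirements (client-side encryption, recovery by designated agents, and a minimal number of physically secured keys) is envelope encryption: each file gets its own data key, which is stored wrapped under a master key, so only the master key needs physical protection. This scheme is an illustration, not the paper's proposal, and the sketch uses XOR as a stand-in cipher purely to stay self-contained; a real system would use authenticated encryption.

```python
import os

def xor_cipher(key, blob):
    """Stand-in for a real cipher; never use XOR like this in practice."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(blob))

MASTER_KEY = os.urandom(32)   # the one key that must be physically secured

def encrypt_file(plaintext):
    """Writer side: fresh data key per file, wrapped under the master key."""
    data_key = os.urandom(32)
    return xor_cipher(MASTER_KEY, data_key), xor_cipher(data_key, plaintext)

def recover_file(wrapped_key, ciphertext):
    """Reader, or a designated recovery agent: unwrap, then decrypt."""
    data_key = xor_cipher(MASTER_KEY, wrapped_key)
    return xor_cipher(data_key, ciphertext)

wrapped, ct = encrypt_file(b"quarterly results")
assert recover_file(wrapped, ct) == b"quarterly results"
```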

V. CONCLUSION

The rapidly changing computing and communication environment has imposed many new challenges on current and future storage systems. It is our belief that storage devices and systems have to be drastically redesigned to meet these challenges. In this paper, we have briefly covered some of the challenges and potential solutions. However, many more studies and experiments are necessary before a storage system that can meet these challenges is developed.

Acknowledgment

This work was partially supported by an NSF grant.

REFERENCES

[1] S. Shim, Y. Wang, J. Hsieh, T.-S. Chang, and D. H. C. Du, "Efficient implementation of RAID-5 using disk-based read modify writes," Dept. of Computer Science, Univ. of Minnesota, Tech. Rep.
[2] T.-S. Chang, S. Shim, and D. H. C. Du, "The designs of RAID with XOR engines on disks for mass storage systems," in Proc. 6th NASA Goddard Conf. Mass Storage Syst. Technol./15th IEEE Symp. Mass Storage Syst., Mar. 1998.
[3] Information Technology - Small Computer System Interface 2, ANSI X3.131:1994, American National Standards Institute.
[4] Serial Attached SCSI (SAS), working draft, American National Standards Institute. [Online]. Available: t10/drafts/sas2/sas2r12.pdf
[5] Fibre Channel 3rd Generation Physical Interface (FC-PH-3), ANSI X3.303:1998, American National Standards Institute.
[6] Information Technology - Serial Storage Architecture, Physical Layer 1 (SSA-PH1), ANSI X3.293:1996, American National Standards Institute.
[7] Information Technology - Serial Storage Architecture, Transport Layer 2 (SSA-TL2), Rev. 05b, ANSI NCITS.308:1997, American National Standards Institute.
[8] Information Technology - Serial Storage Architecture, SCSI-2 Protocol (SSA-S2P), ANSI T10.1/1051D, American National Standards Institute, draft proposed Rev. 05b.
[9] Fibre Channel - Arbitrated Loop (FC-AL), Rev. 4.5, American National Standards Institute.
[10] Fibre Channel 2nd Generation Arbitrated Loop (FC-AL-2), Rev. 6.4, NCITS 332:1999, American National Standards Institute.
[11] Compaq, Intel, and Microsoft, Virtual Interface Architecture Specification, Version 1.0. [Online]. Available: ftp://download.intel.com/design/servers/vi/vi_arch_Specification10.pdf
[12] J. Satran et al., "iSCSI," Internet Draft, Internet Engineering Task Force (IETF), Sep. 16, 2002. [Online].
[13] Intel iSCSI reference implementation. [Online]. Available: projects/intel-iscsi
[14] Y. Lu and D. H. C. Du, "Performance evaluation of iSCSI-based storage subsystem," IEEE Commun. Mag., vol. 41, Aug. 2003.
[15] S. Y. Tang, Y. Lu, and D. H. C. Du, "Performance study of software-based iSCSI security," in Proc. 1st Int. IEEE Security in Storage Workshop.
[16] M. Rajagopal et al., "Fibre Channel over TCP/IP (FCIP)," Internet Draft, Internet Engineering Task Force (IETF), Aug. 28, 2002. [Online].
[17] C. Monia et al., "iFCP - A protocol for Internet Fibre Channel storage networking," Internet Draft, Internet Engineering Task Force (IETF), Aug. 2, 2002. [Online]. Available: internet-drafts/draft-ietf-ips-ifcp-13.txt
[18] Information Technology - SCSI Object-Based Storage Device Commands (OSD), working draft, Project T10/1355-D.
[19] D. Du, D. H. He et al., "Experiences on building an object-based storage system based on the T10 standard," in Proc. IEEE Mass Storage Syst. Technol. Conf.
[20] P. J. Braam and M. J. Callahan, "Lustre: A SAN file system for Linux," white paper.
[21] S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M. Eisler, and D. Noveck, "NFS Version 4 Protocol," RFC 3010, Dec. 2000.
[22] IBM Almaden Research, StorageTank. [Online]. Available: ibm.com/storagesystems
[23] D. A. Pease, R. M. Rees, W. C. Hineman, D. L. Plantenberg, R. A. Becker-Szendy, R. Ananthanarayanan, M. S. nzimet, C. J. Sullivan, R. C. Burns, and D. D. E. Long, "IBM StorageTank - A distributed storage system," white paper.
[24] J. Kubiatowicz, D. Bindel et al., "OceanStore: An architecture for global-scale persistent storage," in Proc. ACM ASPLOS, 2000.
[25] M. Carey et al., "Shoring up persistent applications," in Proc. ACM SIGMOD Int. Conf. Manage. Data, Minneapolis, MN, 1994.
[26] D. Patterson, G. Gibson, and R. Katz, "A case for redundant arrays of inexpensive disks (RAID)," in Proc. ACM SIGMOD, Chicago, IL, Jun. 1988.
[27] P. Chen, E. Lee, G. Gibson, R. Katz, and D. Patterson, "RAID: High-performance, reliable secondary storage," ACM Comput. Surv., Jun. 1994.
[28] P. Chen and E. Lee, "Striping in a RAID level 5 disk array," in Proc. ACM SIGMETRICS Conf. Meas. Model. Comput. Syst., May.
[29] G. A. Gibson et al., "File server scaling with network-attached secure disks," in Proc. ACM SIGMETRICS.
[30] E. Riedel, G. A. Gibson, and C. Faloutsos, "Active storage for large scale data mining and multimedia," in Proc. Int. Conf. Very Large Data Bases (VLDB), Aug. 1998.
[31] H. R. Lim, V. Kapoor, C. Wighe, and D. Du, "Active disk file system: A distributed, scalable file system," in Proc. IEEE/NASA MSST.
[32] American Power Conversion (APC), Determining total cost of ownership for data center and network room infrastructure, 2004. [Online]. Available: com/salestools/cmrp-5t9pqg R3 EN.pdf
[33] M. Hopkins, "The onsite energy generation option," Data Center J., 2004. [Online]. Available: News/Article.asp?articleid=66
[34] D. Colarelli and D. Grunwald, "Massive arrays of idle disks for storage archives," in Proc. ACM/IEEE Conf. Supercomputing (Supercomputing '02), Los Alamitos, CA, 2002.
[35] D. Colarelli, D. Grunwald, and M. Neufeld, The case for massive arrays of idle disks (MAID). [Online]. Available: ist.psu.edu/colarelli02case.html
[36] K. Rajamani and C. Lefurgy, "On evaluating request distribution schemes for saving energy in server clusters," in Proc. ISPASS.
[37] N. Mandagere, J. Diehl, and D. Du, "GreenStor: Application-aided energy-efficient storage," in Proc. IEEE Mass Storage Syst. Technol. Conf.
[38] R. H. Patterson, G. A. Gibson, and M. Satyanarayanan, "Using transparent informed prefetching to reduce file read latency," in Proc. NASA Goddard Conf. Mass Storage Syst., 1992.
[39] D. Rochberg and G. Gibson, "Prefetching over a network: Early experience with CTIP," SIGMETRICS Perform. Eval. Rev., vol. 25, no. 3.
[40] E. Pinheiro and R. Bianchini, "Energy conservation techniques for disk array-based servers," in Proc. 18th Annu. Int. Conf. Supercomputing (ICS '04), New York, 2004.
[41] Q. Zhu, Z. Chen, L. Tan, Y. Zhou, K. Keeton, and J. Wilkes, "Hibernator: Helping disk arrays sleep through the winter," SIGOPS Oper. Syst. Rev., vol. 39, no. 5.
[42] A. P. Papathanasiou and M. L. Scott, "Energy efficient prefetching and caching," in Proc. USENIX 2004 Annu. Tech. Conf.
[43] S. Gurumurthi, A. Sivasubramaniam, M. Kandemir, and H. Franke, "DRPM: Dynamic speed control for power management in server class disks," in Proc. 30th Annu. Int. Symp. Comput. Architecture (ISCA '03), New York, 2003.
[44] A. Whitten and J. D. Tygar, "Why Johnny can't encrypt: A usability evaluation of PGP 5.0," in Proc. USENIX Security Symp., 1999.
[45] J. Kaiser and M. Reichenbach, "Evaluating security tools towards usable security: A usability taxonomy for the evaluation of security tools based on a categorization of user errors," in Proc. Usability.
[46] Sun Microsystems, Importance of total cost of management, Sep. 2003. [Online]. Available: presentation/files/data_center/tcm.pdf
[47] A. Acharya, M. Uysal, and J. Saltz, "Active disks: Programming model, algorithms and evaluation," in Proc. 8th Int. Conf. Architect. Support Program. Lang. Oper. Syst. (ASPLOS), San Jose, CA.

ABOUT THE AUTHOR

David H. C. Du (Fellow, IEEE) received the B.S. degree in mathematics from National Tsing-Hua University, Taiwan, R.O.C., in 1974, and the M.S. and Ph.D. degrees in computer science from the University of Washington, Seattle, in 1980 and 1981, respectively.

He is currently a Qwest Chair Professor in the Computer Science and Engineering Department, University of Minnesota, Minneapolis. He has also served as a Program Director with the National Science Foundation. His research interests include cyber security, sensor networks, multimedia computing, storage systems, high-speed networking, high-performance computing over clusters of workstations, database design, and computer-aided design for very large-scale integrated circuits. He has authored or coauthored more than 190 technical papers, including 95 refereed journal publications, in his research areas. He has also graduated 49 Ph.D. and 80 M.S. students. He is currently serving on a number of journal editorial boards, has been a Guest Editor for Communications of the ACM, and has been Conference Chair and Program Committee Chair for several conferences in the multimedia, database, and networking areas.

Dr. Du is a Fellow of the Minnesota Supercomputer Institute and has been a Guest Editor for IEEE Computer.
