WHITE PAPER

Better Object Storage With Hitachi Content Platform

The Fundamentals of Hitachi Content Platform

By Michael Ratner

November 2014
Contents

Executive Summary
Introduction
Main Concepts and Features
  Object-Based Storage
  Distributed Design
  Open Architecture
  Multitenancy
  Object Versioning
  Search
  Adaptive Cloud Tiering
  Spin-Down Capability
  Replication and Global Access Topology
Common Use Cases
  Cloud-Enabled Storage
  Backup-Free Data Protection and Content Preservation
  Fixed-Content Archiving
  Compliance, E-Discovery and Metadata Analysis
System Fundamentals
  Hardware Overview
  Software Overview
  System Organization
Namespaces and Tenants
  Main Concepts
  User and Group Accounts
  System and Tenant Management
  Object Policies
  Content Management Services
Conclusion
Executive Summary

One of IT's greatest challenges today is the explosive, uncontrolled growth of unstructured data. Continual growth of documents, video, Web pages, presentations, medical images and the like increases both complexity and risk. These difficulties are seen particularly in distributed IT environments, such as cloud service providers and organizations with branch or remote office sites. The vast quantity of data being created, the difficulties in management and proper handling of unstructured content, and the complexity of supporting more users and applications pose significant challenges to IT departments. Organizations often end up with sprawling storage silos for a multitude of applications and workloads, with few resources available to manage, govern, protect and search the data.

Hitachi Data Systems provides an alternative solution to these challenges through Hitachi Content Platform (HCP). This single object storage platform can be divided into virtual storage systems, each configured for the desired level of service. The great scale and rich features of this solution help IT organizations in both private enterprises and cloud service providers. HCP assists with management of distributed IT environments and control of the flood of storage requirements for unstructured content, and it addresses a variety of workloads.

The Hitachi Content Platform portfolio products integrate tightly with HCP to deliver powerful file sync and share capability, and elastic backup-free file services for remote and branch offices.

Built from end to end by Hitachi Data Systems, Hitachi Content Platform Anywhere (HCP Anywhere) provides safe, secure file sharing, collaboration and synchronization. End users simply save a file to HCP Anywhere and it synchronizes across their devices. These files and folders can then be shared via hyperlinks.
Because HCP Anywhere stores data in HCP, it is protected, compressed, single-instanced, encrypted, replicated and access-controlled. Hitachi Data Ingestor (HDI) combines with HCP to deliver elastic and backup-free file services beyond the data center. When a file is written to HDI, it is automatically replicated to HCP. From there, it can be used by another HDI for efficient content distribution and in support of roaming home directories, where users' permissions follow them to any HDI site. Files stay in the HDI file system until free space is needed. Then, HDI reduces any inactive files to pointers referencing the object on HCP. HDI drastically simplifies deployment, provisioning and management by eliminating the need to constantly manage capacity, utilization, protection, recovery and performance of the system. One infrastructure is far easier to manage than disparate silos of technology for each application or set of users. By integrating many key technologies in a single storage platform, Hitachi Data Systems object storage solutions provide a path to short-term return on investment and significant long-term efficiency improvements. They help IT evolve to meet new challenges, stay agile over the long term, and address future change and growth.
Introduction

Hitachi Content Platform (HCP) is a multipurpose distributed object-based storage system designed to support large-scale repositories of unstructured data. HCP enables IT organizations and cloud service providers to store, protect, preserve and retrieve unstructured content with a single storage platform. It supports multiple levels of service and readily evolves with technology and scale changes. With a vast array of data protection and content preservation technologies, the system can significantly reduce or even eliminate tape-based backups of itself or of edge devices connected to the platform.

HCP obviates the need for a siloed approach to storing unstructured content. Massive scale, multiple storage tiers, Hitachi reliability, nondisruptive hardware and software updates, multitenancy and configurable attributes for each tenant allow the platform to support a wide range of applications on a single physical HCP instance. By dividing the physical system into multiple, uniquely configured tenants, administrators create "virtual content platforms" that can be further subdivided into namespaces for further organization of content, policies and access. With support for thousands of tenants, tens of thousands of namespaces, and petabytes of capacity in one system, HCP is truly cloud-ready (see Figure 1).

Figure 1. A single Hitachi Content Platform supports a wide range of applications.

Main Concepts and Features

Object-Based Storage

Hitachi Content Platform, as a general-purpose object store, allows unstructured data files to be stored as objects. An object is essentially a container that includes both file data and associated metadata that describes the data. The objects are stored in a repository. Each object is treated within HCP as a single unit for all intents and purposes. The metadata is used to define the structure and administration of the data.
HCP can also leverage object metadata to apply specific management functions, such as storage tiering, to each object. The objects have intelligence that enables them to automatically take advantage of advanced storage and data management features to ensure proper placement and distribution of content. HCP architecture isolates stored data from the hardware layer. Internally, ingested files are represented as objects that encapsulate both the data and metadata required to support applications. Externally, HCP presents each object either as a set of files in a standard directory structure or as a uniform resource locator (URL) accessible by users and applications via HTTP or HTTPS.
Object Structure

An HCP repository object is composed of fixed-content data and the associated metadata, which in turn consists of system metadata and, optionally, custom metadata and an access control list (ACL). The structure of the object is shown in Figure 2.

Fixed-content data is an exact digital copy of the actual file contents at the time of its ingestion. It becomes immutable after the file is successfully stored in the repository. If the object is under retention, it cannot be deleted before the expiration of its retention period, except when using a special privileged operation. If versioning is enabled, multiple versions of a file can be retained. If appendable objects are enabled, data can be appended to an object (with the CIFS or NFS protocols) without modifying the original fixed-content data.

Figure 2. HCP Object

Metadata is system- or user-generated data that describes the fixed-content data of an object and defines the object's properties.

System metadata, the system-managed properties of the object, includes HCP-specific metadata and POSIX metadata. HCP-specific metadata includes the date and time the object was added to the namespace (ingest time), the date and time the object was last changed (change time), the cryptographic hash value of the object along with the namespace hash algorithm used to generate that value, and the protocol through which the object was ingested. It also includes the object's policy settings such as DPL, retention, shredding, indexing and versioning. POSIX metadata includes a user ID and group ID, a POSIX permissions value and POSIX time attributes.

Custom metadata is optional, user-supplied descriptive information about a data object that is usually provided as well-formed XML. It is typically intended for more detailed description of the object. This metadata can also be used by future users and applications to understand and repurpose the object content.
HCP supports multiple custom metadata fields for each object.
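The object structure described above can be pictured as a small data model. The following Python sketch is illustrative only: the class name, field names and the `ingest` helper are hypothetical and do not reflect HCP's internal format. It shows fixed-content data that cannot be changed after ingest, together with a hash value recorded as system metadata.

```python
import hashlib
import time
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class StoredObject:
    """Hypothetical model of a repository object; not HCP's real layout."""
    name: str
    data: bytes                  # fixed-content data, immutable after ingest
    hash_algorithm: str          # namespace hash algorithm (assumed "sha256")
    hash_value: str              # cryptographic hash of the data
    ingest_time: float           # when the object was added to the namespace
    custom_metadata: tuple = ()  # optional user-supplied XML documents
    acl: tuple = ()              # optional (grantee, permission) pairs


def ingest(name: str, data: bytes, hash_algorithm: str = "sha256") -> StoredObject:
    """Build an object and record the hash of its content as system metadata."""
    digest = hashlib.new(hash_algorithm, data).hexdigest()
    return StoredObject(name, data, hash_algorithm, digest, time.time())


obj = ingest("report.pdf", b"quarterly results")
```

Attempting `obj.data = b"other"` raises `dataclasses.FrozenInstanceError`, which mirrors the write-once behavior of fixed-content data in this simplified model.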
An ACL is optional, user-provided metadata containing a set of permissions granted to users or user groups to perform operations on an object. ACLs control data access at an individual object level and are the most granular data access mechanism.

In addition to data objects, HCP also stores directories and symbolic links in the repository. Only POSIX metadata is maintained for directories and symbolic links; they have no fixed-content data, custom metadata or ACLs.

All the metadata for an object is viewable; only some of it can be modified. The way metadata can be viewed and modified depends on the namespace configuration, the data access protocol and the type of metadata.

Object Representation

HCP presents objects to a user or application in 2 different ways, depending on the namespace access interface. With the RESTful HTTP protocols (HCP REST, Amazon S3), HCP presents each object as a URL. Both data and metadata are accessed through the REST interface. Metadata is handled by using URL query parameters and HTTP headers. Clients specify metadata values by including HCP-specific parameters in the request URL; HCP returns system metadata in HTTP response headers.

For non-RESTful namespace protocols (WebDAV, CIFS and NFS), HCP includes the HCP file system, a standard POSIX file system that allows users and applications to view stored objects as regular files, directories and symbolic links. The HCP file system allows data to be handled in familiar ways using existing methods. It presents each object as a set of files in 2 hierarchical directory structures that hold the components of the object: one for the object's data and another for the object's metadata.

For a data object (an object other than a directory or symbolic link), one of these files contains the fixed-content data. The name of this file is identical to the object's name, and its content is the same as the originally stored file. The other files contain object metadata.
These files, which are either plain text, XML or JSON, are called metafiles. Directories that contain metafiles are called metadirectories.

HCP File System

The HCP file system represents a single file system across a given namespace. Each HCP namespace that has any non-RESTful access protocol enabled exposes a separate HCP file system instance to clients.

The HCP file system maintains a directory structure with separate branches for data files and metafiles. The data top-level directory is a traditional file system view that includes fixed-content data files for all objects in the namespace. This directory hierarchy is created by a user adding files and directories to the namespace. Each data file and directory in this structure has the same name as the object or directory it represents.

The metadata top-level directory contains all the metafiles and metadirectories for objects and directories. This structure parallels that of data, excluding symbolic links, and is created by the HCP file system automatically as data and directories are added to the namespace by an end user.

HCP metafiles provide a means of viewing and manipulating object metadata through a traditional file system interface. Clients can view and retrieve metafiles through the WebDAV, CIFS and NFS protocols. These protocols can also be used to change metadata by overwriting the metafiles that contain changeable HCP-specific metadata. A sample HCP file system data and metadata structure, as seen through the CIFS, NFS and WebDAV protocols, is shown in Figure 3.
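The parallel data and metadata branches can be sketched as a simple path mapping. In the Python sketch below, the top-level directory names, the metafile name and the rule that a metadirectory carries the same name as its object are assumptions made for illustration; they are not taken from this paper.

```python
from pathlib import PurePosixPath

# Hypothetical names for the two top-level branches of a namespace.
DATA_ROOT = PurePosixPath("/data")
METADATA_ROOT = PurePosixPath("/metadata")


def metadirectory_for(object_path: str) -> PurePosixPath:
    """Map a data file path to its metadirectory in the parallel branch."""
    relative = PurePosixPath(object_path).relative_to(DATA_ROOT)
    return METADATA_ROOT / relative


def metafile_for(object_path: str, metafile_name: str) -> PurePosixPath:
    """Locate one metafile (for example a retention setting) for an object."""
    return metadirectory_for(object_path) / metafile_name


print(metafile_for("/data/images/scan01.jpg", "retention.txt"))
# prints /metadata/images/scan01.jpg/retention.txt
```

Under these assumptions, a client that can read and write ordinary files can inspect or overwrite a metafile without any protocol-specific tooling.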
Figure 3. HCP File System Data and Metadata Structure

Distributed Design

A single Hitachi Content Platform consists of both hardware and software. It is composed of many different components that are connected together to form a robust, scalable architecture for object-based storage. HCP runs on an array of servers, or nodes, that are networked together to form a single physical instance. Each node stores data objects and can also store the search index. All runtime operations and physical storage, including data, metadata and index, are distributed among the system nodes. All objects in the repository are distributed across all available storage space but still presented as files in a standard directory structure. Objects that are physically stored on any particular node are available from all other nodes.

Open Architecture

Hitachi Content Platform has an open architecture that insulates stored data from technology changes and from changes in HCP itself due to product enhancements. This open architecture ensures that users will have access to the data long after it has been added to the repository.

HCP acts both as a repository that can store customer data and as an online portal. As a portal, it enables access to that data by means of several industry-standard interfaces, as well as through an integrated search facility and Hitachi Data Discovery Suite (HDDS). The industry-standard HTTP REST, Amazon S3, WebDAV, CIFS and NFS protocols support various operations. These operations include storing data, creating and viewing directories, viewing and retrieving objects and their metadata, modifying object metadata, and deleting objects. Objects that were added using any protocol are immediately accessible through any other supported protocol. These protocols can be used to access the data with a Web browser, the HCP client tools, 3rd-party applications, Microsoft Windows Explorer, or native Windows or Unix tools.
HCP also allows special-purpose access to the repository through the SMTP protocol in order to support journaling.
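As noted under Object Representation, REST clients address each object by URL and pass metadata values as query parameters. The Python sketch below only builds such a URL and performs no network access. The host layout (namespace, tenant and domain), the `/rest` path prefix, and the `type` parameter are assumptions chosen for illustration and should be checked against the product documentation.

```python
from urllib.parse import quote, urlencode


def object_url(namespace: str, tenant: str, domain: str, path: str, **params: str) -> str:
    """Build a URL for an object; the layout is hypothetical, not HCP's documented form."""
    host = f"{namespace}.{tenant}.{domain}"
    url = f"https://{host}/rest/{quote(path)}"  # quote() keeps "/" and escapes spaces
    if params:
        url += "?" + urlencode(params)          # metadata values travel as query parameters
    return url


print(object_url("finance", "acme", "hcp.example.com",
                 "images/scan 01.jpg", type="custom-metadata"))
# prints https://finance.acme.hcp.example.com/rest/images/scan%2001.jpg?type=custom-metadata
```

Because every object has a stable URL in this scheme, any HTTP client library can store or retrieve content without a vendor-specific driver.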
HCP provides a number of HTTP-based RESTful open APIs for easy integration with customer applications. In addition to the HCP REST and Amazon S3-compatible HS3 interfaces that are used for namespace content access, HCP supports a metadata query API for searching for objects in a namespace and a management API (MAPI) for tenant- and namespace-level administration.

HCP implements the open, standards-based Internet Protocol version 6 (IPv6), the latest version of the Internet Protocol (IP). This protocol allows HCP to be deployed in very large-scale networks and helps meet the requirements of government agencies where IPv6 is mandatory. HCP provides IPv6 dual-stack capability that enables coexistence of the IPv4 and IPv6 protocols and corresponding applications. HCP can be configured in native IPv4, native IPv6, or dual IPv4 and IPv6 modes, where each virtual network supports either or both IP versions.

The IPv4 and IPv6 dual-stack feature is indispensable in heterogeneous environments during the transition to IPv6 infrastructure. Any network mode can be enabled when desired, and existing IPv4 applications can be upgraded to IPv6 independently and with minimal disruption in service. All standard networking protocols and existing HCP access interfaces are supported and can use IPv4 addresses, IPv6 addresses or both, based on the enabled network mode, which allows seamless integration with existing data center environments.

Multitenancy

Multitenancy support allows the repository in a single physical Hitachi Content Platform instance to be partitioned into multiple namespaces. A namespace is a logical partition that contains a collection of objects particular to one or more applications. Each namespace is a private object store that is represented by a separate directory structure and has a set of independently configured attributes. Namespaces provide segregation of data, while tenants, or groupings of namespaces, provide segregation of management.
An HCP system can have up to 1,000 tenants and 10,000 namespaces. Each tenant and its set of namespaces constitute a virtual HCP system that can be accessed and managed independently by users and applications. This HCP feature is essential in enterprise, cloud and service provider environments.

Data access to HCP namespaces can be either authenticated or nonauthenticated, depending on the type and configuration of the access protocol. Authentication can be performed using HCP local accounts or Microsoft Active Directory groups.

Object Versioning

Hitachi Content Platform supports object versioning, which is the capability of a namespace to create, store and manage multiple versions of objects in the HCP repository. This ability provides a history of how the data has changed over time. Versioning facilitates storage and replication of evolving content, thereby creating new opportunities for HCP in markets such as content depots and workflow applications.

Versioning is available in HCP namespaces and is configured at the namespace level. Versioning is supported only with the HTTP REST protocol. Other protocols cannot be enabled if versioning is enabled for the namespace. Versioning applies only to objects, not to directories or symbolic links. A new version of an object is created when an object with the same name and location as an existing object is added to the namespace. A special type of version, called a deleted version, is created when an object is deleted. This helps protect the content against accidental deletes. Updates to the object metadata affect only the current version of an object and do not create new versions.

Previous versions of objects that are older than a specified amount of time can be automatically deleted, or pruned. It is not possible to delete specific historical versions of an object; however, a user or application with appropriate permissions can purge the object to delete all its versions, including the current one.
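The versioning rules above can be summarized in a small in-memory model. This Python sketch is a simplification under stated assumptions: the class and method names are hypothetical, timestamps are plain numbers, and a deleted version is represented by a version whose data is `None`. It is not how HCP stores versions internally.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Version:
    data: Optional[bytes]  # None marks a "deleted version"
    created: float
    metadata: dict = field(default_factory=dict)


class VersionedNamespace:
    """Hypothetical model of a namespace with versioning enabled."""

    def __init__(self) -> None:
        self._history: Dict[str, List[Version]] = {}  # path -> versions, oldest first

    def put(self, path: str, data: bytes, now: float) -> None:
        # Same name and location as an existing object: a new version is added.
        self._history.setdefault(path, []).append(Version(data, now))

    def delete(self, path: str, now: float) -> None:
        # Deleting records a deleted version instead of discarding history.
        self._history[path].append(Version(None, now))

    def update_metadata(self, path: str, **changes) -> None:
        # Metadata changes touch only the current version; no new version is created.
        self._history[path][-1].metadata.update(changes)

    def prune(self, path: str, now: float, max_age: float) -> None:
        # Drop previous versions older than max_age; the current version always stays.
        *previous, current = self._history[path]
        kept = [v for v in previous if now - v.created <= max_age]
        self._history[path] = kept + [current]

    def purge(self, path: str) -> None:
        # Remove every version, including the current one.
        del self._history[path]

    def versions(self, path: str) -> List[Version]:
        return list(self._history.get(path, []))
```

Note that the model offers no method for deleting one specific historical version, matching the restriction described above: history shrinks only through pruning by age or a full purge.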
Search

Hitachi Content Platform includes comprehensive search capabilities that enable users to search for objects in namespaces, analyze namespace contents, and manipulate groups of objects. To satisfy government requirements, HCP supports e-discovery for audits and litigation.

HCP supports 2 search facilities and includes a Web application portal called the search console that provides an interactive interface to these search facilities. HCP provides the only integrated metadata query engine (MQE) on the market. The MQE search facility is integrated with HCP and is always available in any HCP system. The HDDS search facility interacts with Hitachi Data Discovery Suite, and this separate HDS product enables federated search across multiple HCP and other supported systems. HDDS performs search and returns results to the HCP search console. HDDS must be installed separately and configured in the HCP search console.

MQE can index and search only object metadata. The HDDS search facility indexes both content and metadata and allows full content search of objects in a namespace. MQE is also used by the metadata query API, a programmatic interface for querying namespaces.

Adaptive Cloud Tiering

Adaptive cloud tiering expands Hitachi Content Platform capacity to any storage device or cloud service. It enables hybrid cloud configurations to scale and share resources between public and private clouds. It also allows HCP to be used to build custom, evolving service level agreements (SLAs) for specific data sets using enhanced service plans.

HCP provides comprehensive storage-tiering capabilities as part of the long-term goal of supporting information lifecycle management (ILM) and intelligent objects. HCP supports a range of storage components that are grouped into storage pools. Storage pools virtualize access to one or more logically grouped storage components with similar price/performance characteristics.
The storage components can be either primary storage (HCP storage) or extended storage. Primary storage includes direct attached storage (DAS) and SAN storage; internal DAS storage is always running, while SAN storage may be running or spin-down-capable. Extended storage includes non-HCP external storage devices (NFS and S3-compatible) and public cloud storage services (Amazon S3, Microsoft Azure, Google Cloud Storage and Hitachi Cloud Services). The topology of adaptive cloud tiering is shown in Figure 4.

Objects are stored in storage pools and are managed by object life-cycle policies, which are defined in service plans. Service plans determine content life cycle from ingest to obsolescence or disposition and implement protection strategies at each tier; they effectively represent customer SLAs. Service plans can be offered to a tenant administrator so they can be applied to individual namespaces.

Storage tiering functionality is implemented as an HCP service. The storage tiering service applies service plans and moves objects between tiers of storage. Flexible service plans allow storage tiering to adapt to changes.

Spin-Down Capability

HCP spin-down-capable storage takes advantage of the power savings feature of Hitachi midrange storage systems and is one of the core elements of the storage tiering functionality and adaptive cloud tiering.

According to the storage tiering strategy that an organization specifies, the storage tiering service identifies objects that are eligible to reside on spin-down storage and moves them to and from the spin-down storage as needed.

Tiering selected content to spin-down-enabled storage lowers overall cost by reducing energy consumption for large-scale unstructured data storage, such as deep archives and disaster recovery sites.
Storage tiering can very effectively be used with customer-identified "dark data" (rarely accessed data) or data replicated for disaster recovery by moving that data to spin-down storage some time after ingestion or replication.
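A service plan that moves content to spin-down or extended storage some time after ingestion can be approximated as a list of age thresholds. In the Python sketch below, the pool names, the thresholds and the age-only selection rule are hypothetical; the paper does not specify how service plans are expressed.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TierRule:
    pool: str          # storage pool that holds objects matching this rule
    min_age_days: int  # objects at least this old are eligible for the pool


# Hypothetical service plan: running storage first, then spin-down, then cloud.
SERVICE_PLAN = (
    TierRule("primary-running", 0),
    TierRule("primary-spin-down", 90),
    TierRule("extended-cloud", 365),
)


def target_pool(age_days: int, plan: Sequence[TierRule] = SERVICE_PLAN) -> str:
    """Return the pool an object of the given age should reside in."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    eligible = [rule for rule in plan if age_days >= rule.min_age_days]
    # The rule with the highest threshold the object has passed wins.
    return max(eligible, key=lambda rule: rule.min_age_days).pool


print(target_pool(10))   # prints primary-running
print(target_pool(120))  # prints primary-spin-down
```

A tiering service built on this idea would periodically recompute `target_pool` for each object and move the objects whose current pool differs from the result.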
Figure 4. Adaptive Cloud Tiering

Replication and Global Access Topology

Replication, an add-on feature to HCP, is the process that keeps selected tenants and namespaces in 2 or more HCP systems in sync with each other. The replication service copies one or more tenants or namespaces from one HCP system to another, propagating object creations, object deletions and metadata changes. HCP also replicates tenant and namespace configuration, tenant-level user accounts, compliance and tenant log messages, and retention classes.

The replication process is object-based and asynchronous. The HCP system in which the objects are initially created is called the primary system. The second system is called the replica. Typically, the primary system and the replica are in separate geographic locations and connected by a high-speed wide area network.

HCP supports advanced traditional replication topologies, including many-to-one and chain configurations, as well as a global access topology in which globally distributed HCP systems are synchronized in a way that allows users and applications to access data from the closest HCP site for improved collaboration, performance and availability.

Global access topology is based on bidirectional, active-active replication links that allow read-and-write access to the same namespace on all participating HCP systems. The content is synchronized between systems (or locations) in both directions. This enables read-and-write access to data in any namespace and from any location across the entire replication topology, essentially creating a global content point-of-presence network.
Common Use Cases

Cloud-Enabled Storage

The powerful, industry-leading capabilities of Hitachi Content Platform make it well suited to the cloud storage space. An HCP-based infrastructure solution is sufficiently flexible to accommodate any cloud deployment model (public, private or hybrid) and simplify the migration to the cloud for both service providers and subscribers. HCP provides edge-to-core, secure multitenancy and robust management capabilities, and a host of features to optimize cloud storage operations.

HCP, in its role as an online data repository, is truly ready for a cloud-enabled market. While numerous HCP features were already discussed earlier in this paper, the purpose of this section is to summarize those that contribute the most to HCP cloud capabilities. They include:

- Large-scale multitenancy.
  - Management segregation. HCP supports up to 1,000 tenants, each of which can be uniquely configured for use by a separate cloud service subscriber.
  - Data segregation. HCP supports up to 10,000 namespaces, each of which can be uniquely configured for a particular application or workload.
- Massive scale.
  - Petabyte repository offers 80PB of storage, 80 nodes, 64 billion user objects and 30 million files per directory, all on a single physical system.
  - Best node density in the object storage industry supports 500TB and 800 million objects per node. With fewer nodes, HCP requires less power, less cooling and less floor space.
  - Unparalleled expandability allows organizations to "start small" and expand according to demand. Nodes and/or storage can be added to expand an HCP system's storage and throughput capacity, without disruptions.
  - Multiple storage systems are supported by a single HCP system.
- Easy tenant and storage provisioning.
- Geographical dispersal and global accessibility.
  - Global access topology that enables creation of a global content point-of-presence network.
  - WAN-friendly REST interface for namespace data access and replication.
  - WAN-optimized, high-throughput data transfer.
- High availability.
  - Fully redundant hardware.
  - Automatic routing of client requests around hardware failures.
  - Load balancing across all available hardware.
- Adaptive cloud tiering enables hybrid cloud configurations where resources can be easily scaled and shared between public and private clouds. Specific data sets can be migrated on-demand across various cloud services and local storage, and new cloud storage can be easily integrated and existing storage retired.
- Multiple REST interfaces. These interfaces include the HCP REST and Amazon S3-compatible REST APIs for namespace data access, the management API and the metadata query API. The REST API is a technology of choice for cloud enablers and consumers. Some of the reasons for its popularity include high efficiency and low overhead, caching at both the client and the server, and API uniformity. In addition, this technology offers a stateless nature that allows accommodation of the latencies of Internet access and potentially complex firewall configurations.
- Secure, granular access to tenants, namespaces and objects, which is crucial in any cloud environment. This access is facilitated by the HCP multilayer, flexible permission mechanism, including object-level ACLs.
- Usage metering. HCP has built-in chargeback capabilities, indispensable for cloud use, to facilitate provider-subscriber transactions. HCP also provides tools for 3rd-party vendors and customers to write to the API for easy integration with the HDS solution for billing and reporting.
- Low-touch system that is self-monitoring, self-managing and self-healing. HCP features advanced monitoring, audit and reporting capabilities. HCP services can automatically repair issues if they arise.
- Support for multiple levels of service. This support is provided through HCP policies, service plans and quotas that can be configured for each tenant. It helps enforce SLAs and allows the platform to accommodate a wide range of subscriber use cases and business models on a single physical system.
- Edge-to-core solution. HCP, working in tandem with Hitachi Data Ingestor, provides an integrated edge-to-core solution for cloud storage deployments. HCP serves as the "engine" at the core of the HDS cloud architecture. HDI resides at the edge of the storage cloud (for instance, at a remote office or subscriber site) and serves as the "onramp" for application data to enter the cloud infrastructure. HDI acts as a local storage cache while migrating data into HCP and maintaining links to stored content for later retrieval. Users and applications interact with HDI at the edge of the cloud but perceive bottomless, backup-free storage provided by HCP at the core.
- File-sync-and-share solution. HCP, working in tandem with Hitachi Content Platform Anywhere (HCP Anywhere), provides a secure file and folder synchronization and sharing solution for workforce mobility. HCP again serves as the "engine" at the core of the HDS cloud architecture. HCP Anywhere servers are deployed in conjunction with HCP and client applications that are installed on user devices including laptops, desktops and mobile devices. End users simply save a file to their HCP Anywhere folder and it automatically synchronizes to all of their registered devices and becomes available via popular Web browsers. Once saved to the HCP Anywhere folder, the file is protected, compressed, single-instanced, encrypted, replicated and access-controlled by the well-proven Hitachi Content Platform. Individual files or entire folders can then be shared with a simple hyperlink.

Backup-Free Data Protection and Content Preservation

Hitachi Content Platform is a truly backup-free platform. HCP protects content without the need for backup. It uses sophisticated data preservation technologies, such as configurable data and metadata protection levels, object versioning and change tracking, multisite replication with seamless application failover, and many others. HCP includes a variety of features designed to protect the integrity, provide the privacy, and ensure the availability and security of stored data. Below is a summary of the key HCP data protection features:

- Content immutability. This intrinsic feature of the HCP "write-once, read-many" (WORM) storage design protects the integrity of the data in the repository.
- Content verification. The content verification service maintains data integrity and protects against data corruption or tampering by ensuring that the data of each object matches its cryptographic hash value. Any violation is repaired in a self-healing fashion.
13 WHITE PAPER 13 Scavenging. The scavenging service ensures that all objects in the repository have valid metadata. In case metadata is lost or corrupted, the service tries to reconstruct it by using the secondary, or scavenging, metadata (a copy of the metadata stored with each copy of the object data). Data encryption. HCP supports encryption at rest capability that allows seamless encryption of data on the physical volumes of the repository. This ensures data privacy by preventing unauthorized access to the stored data. The encryption and decryption are handled automatically and transparently to users and applications. Versioning. HCP uses versioning to protect against accidental deletes and storing wrong copies of objects. Data availability. RAID protection. RAID storage technology provides efficient protection from simple disk failures. SAN-based HCP systems typically use RAID-6 erasure coding protection to guard against dual drive failures. Multipathing and zero-copy failover. These features provide data availability in SAN-based HCP systems. Data protection level (DPL) and protection service. In addition to using RAID and SAN technologies to provide data integrity and availability, HCP can use software mirroring to store the data for each object in multiple locations on different nodes. HCP groups system nodes into protection sets with the same number of nodes in each set. It tries to store all the copies of the data for an object in a single protection set where each copy is stored on a different node. The protection service enforces the required level of data redundancy by checking and repairing protection sets. In case of violation, it creates additional copies or deletes extra copies of an object to bring the object into compliance. If replication is enabled, the protection service can use an object copy from a replica system if the copy on the primary system is unavailable. Metadata redundancy. 
In addition to the data redundancy specified by DPL, HCP creates multiple copies of the metadata for an object on different nodes. Metadata protection level (MDPL) is a system-wide setting that specifies the number of copies of the metadata that the HCP system must maintain (normally 2 copies, MDPL2). Management of MDPL redundancy is independent of the management of data copies for DPL.
Nondisruptive software and hardware upgrades. HCP employs a number of techniques that minimize or eliminate any disruption of normal system functions during software and hardware upgrades. Nondisruptive software upgrade (NDSU) is one of these techniques. It includes greatly enhanced online upgrade support, nondisruptive patch management, and online upgrade performance improvements. HCP supports media-free and remote upgrades, HTTP or REST drain mode, and parallel operating system (OS) installation. It also supports automatic online upgrade commit, offline upgrade duration estimate, enhanced monitoring and alerts, and other features. Nodes can be added to an HCP system without causing any downtime. HCP also supports nondisruptive storage upgrades that allow online storage addition to SAN-based HCP systems without any data outage.
Seamless application failover. This feature is supported by HCP systems in a replicated topology. It includes a seamless failover routing feature that enables direct integration with customer-owned load balancers by allowing HTTP requests to be serviced by any HCP system in a replication topology. Seamless domain name system (DNS) failover is an HCP built-in, multisite, load-balancing and high-availability technology that is ideal for cost-efficient, best-effort customer environments.
Replication. If enabled, this feature provides a multitude of mechanisms to ensure data availability.
The replica system can be used both as a source for disaster recovery and to maintain data availability by providing good object copies for the protection and content verification services. If an object cannot be read from the primary system, HCP can try to read the object from the replica, provided that the read-from-replica feature is enabled.
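The interplay between content verification, self-healing repair and the replica can be illustrated with a minimal sketch. It is not HCP code: the function names, the in-memory dictionary of copies and the choice of SHA-256 are all assumptions made for illustration (the text above says only that a cryptographic hash is used).

```python
import hashlib
from typing import Dict, Optional


def sha256_hex(data: bytes) -> str:
    """Hash used for verification (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(data).hexdigest()


def verify_and_repair(
    copies: Dict[str, bytes],
    expected_hash: str,
    replica_copy: Optional[bytes] = None,
) -> Dict[str, str]:
    """Check every local copy against the stored hash and repair bad copies.

    `copies` maps a hypothetical node name to the bytes stored there and is
    updated in place. Returns a per-node status: 'ok', 'repaired' or
    'unrecoverable'.
    """
    # Prefer a good local copy as the repair source.
    good = next(
        (data for data in copies.values() if sha256_hex(data) == expected_hash),
        None,
    )
    # Fall back to the replica system's copy if no local copy is intact.
    if good is None and replica_copy is not None:
        if sha256_hex(replica_copy) == expected_hash:
            good = replica_copy

    status: Dict[str, str] = {}
    for node, data in copies.items():
        if sha256_hex(data) == expected_hash:
            status[node] = "ok"
        elif good is not None:
            copies[node] = good  # overwrite the damaged copy
            status[node] = "repaired"
        else:
            status[node] = "unrecoverable"
    return status
```

In this sketch a damaged copy is replaced from any intact local copy first, and the replica is consulted only when every local copy fails verification.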
Data security.
Authentication of management and data access.
Granular, multilayer data access permission scheme.
IP filtering technology and protocol-specific access or deny lists.
Secure Sockets Layer (SSL) support for HTTP and WebDAV data access, management access and replication.
Node login prevention.
Shredding policy and service.
Autonomic technology refresh. This feature is implemented as the HCP migration service. It enables organizations to maintain continuously operating content stores that allow them to preserve their digital content assets for the long term.
Fixed-Content Archiving
Hitachi Content Platform is optimized for fixed-content data archiving. Fixed-content data is information that does not change but must be kept available for future reference and be easily accessible when needed. A fixed-content storage system is one in which the data cannot be modified. HCP uses WORM storage technology, and a variety of policies and services (such as retention, content verification and protection) to ensure the integrity of data in the repository. WORM storage means that data, once ingested into the repository, cannot be updated or modified; that is, the data is guaranteed to remain unchanged from when it was originally stored. If the versioning feature is enabled within the HCP system, different versions of the data can be stored and retrieved, in which case each version is WORM.
Compliance, E-Discovery and Metadata Analysis
Custom metadata brings structure to unstructured content. It enables building massive unstructured data stores by providing a means for faster and more accurate access to content. Custom metadata gives storage managers the meaningful information they need to efficiently and intelligently process data and apply the right object policies to meet all business, compliance and protection requirements.
Structured custom metadata (content properties) and multiple custom metadata annotations take this capability to the next level by helping yield better analytic results and facilitating content sharing among applications.
Regulatory compliance features include namespace retention mode (compliance and enterprise), retention classes, retention hold, automated content disposition, and privileged delete and purge.
HCP search capabilities include support for e-discovery for litigation or audit purposes, and allow direct 3rd-party integration through built-in open APIs. The search console offers a structured environment for creating and executing queries (sets of criteria that each object in the search results must satisfy). End users can apply various selection criteria, such as objects stored before a certain date or larger than a specified size. Queries return metadata for objects included in the search result. This metadata can be used to retrieve the object. From the search console, end users can open objects, perform bulk operations on objects (hold, release, delete, purge, privileged delete and purge, change owner, set ACL), and export search results in standard file formats for use as input to other applications.
Search is enabled at both the tenant and namespace levels. Indexing is enabled on a per-namespace basis. Settings at the system and namespace levels determine whether custom metadata is indexed in addition to system metadata and ACLs. If indexing of custom metadata is disabled, the MQE index does not include custom metadata. If a namespace is not indexed at all, searches do not return any results for objects in this namespace.
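The indexing and selection rules described above can be modeled in a few lines. The sketch below is a local illustration only: it does not call the HCP search console or the metadata query API, and the `ObjectRecord` type, `build_index` and `search` are hypothetical names introduced here.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import Dict, Iterable, List, Optional, Set


@dataclass
class ObjectRecord:
    """Hypothetical stand-in for the metadata kept about one object."""
    namespace: str
    path: str
    ingest_time: datetime
    size_bytes: int
    custom_metadata: Optional[Dict[str, str]] = None


def build_index(
    objects: Iterable[ObjectRecord],
    indexed_namespaces: Set[str],
    index_custom_metadata: bool,
) -> List[ObjectRecord]:
    """Index only objects in index-enabled namespaces.

    Custom metadata is dropped from the index when its indexing is disabled.
    """
    index = []
    for obj in objects:
        if obj.namespace not in indexed_namespaces:
            continue  # objects in non-indexed namespaces are never found
        if not index_custom_metadata:
            obj = replace(obj, custom_metadata=None)
        index.append(obj)
    return index


def search(
    index: Iterable[ObjectRecord],
    stored_before: Optional[datetime] = None,
    larger_than: Optional[int] = None,
) -> List[ObjectRecord]:
    """Return metadata records that satisfy every supplied criterion."""
    hits = []
    for entry in index:
        if stored_before is not None and not entry.ingest_time < stored_before:
            continue
        if larger_than is not None and not entry.size_bytes > larger_than:
            continue
        hits.append(entry)
    return hits
```

As in the text, a query returns metadata records rather than object data, and a namespace that is not indexed contributes no results.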
MQE indexes system metadata, custom metadata (optionally), and ACLs of objects in each search-enabled and index-enabled namespace. In namespaces with versioning enabled, it indexes only the current version of an object. Each object has an index setting that affects indexing of custom metadata by the metadata query engine. If indexing is enabled for a namespace, MQE always indexes system metadata and ACLs, regardless of the index setting for an object. If the index setting is set to true, MQE also indexes custom metadata for this object.
The MQE index resides on designated logical volumes on the HCP nodes, sharing or not sharing the space on these volumes with the object data, depending on the type of system and volume configuration. The Hitachi Data Discovery Suite (HDDS) search facility creates and maintains its own index that resides separately in HDDS.
REST clients can search HCP programmatically using the metadata query API. As with the search console, the response to a query is metadata for the objects that meet the query criteria, in XML or JSON format. Two types of queries are supported:
Object-based query locates objects that currently exist in the repository based on their metadata, including system metadata, custom metadata and ACLs, as well as object location (namespace or directory). Multiple, robust metadata criteria can be specified in object-based queries. Objects must be indexed to support this type of query.
Operation-based query provides time-based retrieval of object transactions. It searches for objects based on operations performed on the objects during specified time periods. It retrieves records of object creation, deletion and purge (user-initiated actions) and of disposition and pruning (system-initiated actions). Operation-based queries return not only objects currently in the repository but also deleted, disposed, purged or pruned objects.
If versioning is enabled, both current and old versions of objects can be returned. The response is retrieved directly from the HCP metadata database and internal logs; thus, no indexing is required to support this type of query. Operation-based queries enable HCP integration with backup servers, search engines (such as HDDS), policy engines and other applications.
System Fundamentals
Hardware Overview
An individual physical Hitachi Content Platform instance, or HCP system, is not a single device; it is a collection of devices that, combined with HCP software, can provide all the features of an online object repository while tolerating node, disk and other component failures. From a hardware perspective, each HCP system consists of the following categories of components:
Nodes (servers).
Internal or SAN-attached storage.
Networking components (switches and cabling).
Infrastructure components (racks and power distribution units).
System nodes are the vital part of HCP. They store and manage the objects that reside in the physical system storage. The nodes are conventional off-the-shelf servers. Each node can have multiple internal physical drives and/or connect to external Fibre Channel storage (SAN).
In addition to using RAID and SAN technologies and a host of other features to protect the data, HCP uses software mirroring to store the data and metadata for each object in multiple locations on different nodes. For data, this feature is managed by the namespace data protection level (DPL) setting, which specifies the number of copies of each object HCP must maintain in the repository to ensure the required
level of data protection. For metadata, this feature is managed by the metadata protection level (MDPL), which is a system-wide setting.
An HCP system uses private back-end and public front-end networks. The isolated back-end network is used for vital internode communication and coordination. It uses a bonded Ethernet interface in each node, 2 Ethernet switches, and 2 sets of cables connecting the nodes to the switches, thereby making it fully redundant. The front-end network is used for customer interaction with the system and also uses a bonded Ethernet interface in each node. The recommended setup includes 2 independent switches that connect these ports to the front-end (corporate) network.
HCP runs on a redundant array of independent nodes (RAIN) or a SAN-attached array of independent nodes (SAIN). RAIN systems use the internal storage in each node. SAIN systems use the external SAN storage. HCP is offered as 2 products: HCP 300 (based on the RAIN configuration) and HCP 500 (based on the SAIN configuration).
Hitachi Content Platform RAIN (HCP 300)
The nodes in an HCP 300 system are Hitachi Compute Rack 210H (CR 210H) servers. RAIN nodes contain internal storage: a RAID controller and disks. All nodes use hardware RAID-5 data protection. In an HCP RAIN system, the physical disks in each node form a single RAID group, normally RAID-5 (5D+1P) (see Figure 5). This configuration helps ensure the integrity of the data stored on each node.
Figure 5. HCP 300 Hardware Architecture
An HCP 300 (RAIN) system must have a minimum of 4 nodes. Additional nodes are added in 4-node increments. An HCP 300 system can have a maximum of 20 nodes. HCP 300 systems are normally configured with a DPL setting of 2 (DPL2), which, coupled with hardware RAID-5, yields an effective RAID-5+1 total protection level.
Hitachi Content Platform SAIN (HCP 500/500XL)
The nodes in an HCP 500 system are either Hitachi Compute Rack 210H (CR 210H) or Hitachi Compute Rack 220S (CR 220S) servers. The HCP 500 nodes contain Fibre Channel host bus adapters (HBAs) and use external Fibre Channel SAN storage; they are diskless servers that boot from the SAN-attached storage. HCP 500 may use Fibre Channel switches or have nodes directly connected to external storage. The HCP 500 system using direct connect is shown in Figure 6.
The nodes in a SAIN system can have internal storage in addition to being connected to external storage. These nodes are called HCP 500XL nodes. They are an alternative to the standard HCP 500 nodes and have the same hardware configuration, except for the addition of the RAID controller and internal hard disk drives. A typical 500XL node internal storage configuration includes six 500GB 7200RPM SATA II drives in a single RAID-5 (5D+1P) RAID group, with 2 LUNs: 31GB (operating system) and 2.24TB (database). In HCP 500XL nodes the system metadata database resides on the local disks, which leads to more efficient and faster database operations. As a result, the system can better support larger capacity and higher object counts per node and address higher performance requirements. The HCP 500XL nodes are usually considered when the system configuration exceeds 4 standard nodes.
Figure 6. HCP 500 Hardware Architecture (Direct Connect)
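The storage cost of combining hardware RAID with software mirroring (DPL) can be estimated with simple arithmetic. The helper below is a rough sketch, not a sizing tool: `usable_fraction` is a hypothetical name, and the estimate deliberately ignores metadata copies, the operating system and database LUNs, file system overhead and spare capacity.

```python
def usable_fraction(data_disks: int, parity_disks: int, dpl: int) -> float:
    """Approximate share of raw disk capacity left for unique object data.

    RAID keeps data_disks / (data_disks + parity_disks) of the raw capacity,
    and DPL then stores `dpl` full copies of every object.
    """
    if data_disks <= 0 or parity_disks < 0 or dpl <= 0:
        raise ValueError("disk counts and DPL must be positive")
    raid_efficiency = data_disks / (data_disks + parity_disks)
    return raid_efficiency / dpl


# RAID-5 (5D+1P) with DPL2, the usual HCP 300 configuration described above:
# 5/6 of the raw capacity survives RAID, and half of that survives mirroring.
print(round(usable_fraction(5, 1, 2), 3))  # 0.417
```

Under the same simplifications, raising the DPL to 4 on the same RAID group halves the result again, to roughly 21 percent of raw capacity.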
Typically, the external SAN-attached storage uses RAID-6. The best protection and high availability of an HCP 500 system are achieved by giving each node its own RAID group or Hitachi Dynamic Provisioning (HDP) pool containing one RAID group. SAIN systems support multiple storage arrays in a single system or even for a single node.
HCP 500 and 500XL systems are supported with a minimum of 4 nodes. With a SAIN system, additional nodes are added in pairs, so the system always has an even number of nodes. A SAIN system can have a maximum of 80 nodes.
Both RAIN and SAIN systems can have a DPL as high as 4, which affords maximum data availability but greatly sacrifices storage utilization.
SAIN systems introduce a number of SAN-specific features that help maintain the organization's data availability. They include multipathing, cross-mapping and zero-copy failover.
In a SAN environment, multiple physical paths may be configured between an HCP node and any given LUN that maps to it. Multipathing facilitates uninterrupted read and write access to the system, protecting it against storage array controller, Fibre Channel switch, fiber optic cable and HBA port failures.
The process of one node automatically taking over management of storage previously managed by another, failed node is called zero-copy failover. To support zero-copy failover, each LUN that stores object data or the MQE index must map to 2 different nodes. The pair of nodes forms a set such that the LUNs that map to one of the nodes also map to the other. This is called cross-mapping. In a cross-mapped pair of nodes, the LUNs on a node that are managed by this node during normal operation are called primary LUNs; the LUNs from the other node that will be managed by this node after failover are called standby LUNs. Cross-mapping of LUNs from one node to another node in the system allows instantaneous access to data from failed nodes.
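The relationship between primary LUNs, standby LUNs and zero-copy failover can be modeled as a small data structure. This is a conceptual sketch only; the `CrossMappedPair` class and the node and LUN names are hypothetical and do not reflect how HCP implements the feature.

```python
from typing import Dict, Iterable, Set


class CrossMappedPair:
    """Two nodes whose LUNs are mapped to both of them."""

    def __init__(
        self,
        node_a: str,
        luns_a: Iterable[str],
        node_b: str,
        luns_b: Iterable[str],
    ) -> None:
        if node_a == node_b:
            raise ValueError("a cross-mapped pair needs two different nodes")
        # LUNs each node manages during normal operation (its primary LUNs).
        self.primary: Dict[str, Set[str]] = {
            node_a: set(luns_a),
            node_b: set(luns_b),
        }
        self.failed: Set[str] = set()

    def partner(self, node: str) -> str:
        (other,) = set(self.primary) - {node}
        return other

    def standby_luns(self, node: str) -> Set[str]:
        """LUNs this node would take over if its partner failed."""
        return set(self.primary[self.partner(node)])

    def fail(self, node: str) -> None:
        if self.partner(node) in self.failed:
            raise RuntimeError("both nodes of the pair are down")
        self.failed.add(node)

    def managed_luns(self, node: str) -> Set[str]:
        """LUNs the node manages right now, including any taken over."""
        if node in self.failed:
            return set()
        luns = set(self.primary[node])
        if self.partner(node) in self.failed:
            # Zero-copy failover: no data moves, only management changes.
            luns |= self.standby_luns(node)
        return luns
```

Because both nodes already have a path to every LUN in the pair, the surviving node in this model starts managing its standby LUNs without copying any data.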
Software Overview
Hitachi Content Platform system software consists of an operating system and core software. The Linux-based HCP operating system is called the appliance operating system. The core software includes components that:
Enable access to the object repository through the industry-standard HTTP or HTTPS, WebDAV, CIFS, NFS, SMTP and NDMP protocols.
Ingest fixed-content data, convert it into HCP objects, and manage the object data and metadata over time.
Maintain the integrity, stability, availability and security of stored data by enforcing repository policies and executing system services.
Enable configuration, monitoring and management of the HCP system through a human-readable interface.
Support searching the repository through an interactive Web interface (the search console) and a programmatic interface (the metadata query API).
System Organization
HCP is a fully symmetric, distributed application that stores and manages objects (see Figure 7). An HCP object encapsulates the raw fixed-content data that is written by a client application, and its associated system and custom metadata. Each node in an HCP system is a Linux-based server that runs a complete HCP instance. The HCP system can withstand multiple simultaneous node failures, and acts automatically to ensure that all object and namespace policies are valid.
External system communication is managed by the DNS manager, a distributed network component that balances client requests across all nodes to ensure maximum system throughput and availability. The DNS manager works in conjunction with a corporate DNS server to allow clients to access the system as a single entity, even though the system is made up of multiple independent nodes. The HCP system is configured as a subdomain of an existing corporate domain. Clients access the system using predefined protocol-specific or namespace-specific names.
Figure 7. The High-Level Structure of an HCP System
While not required, using DNS is important in ensuring balanced and problem-free client access to an HCP system, especially for the REST HTTP clients.
Each node in the HCP system runs a complete software stack made up of the appliance operating system and the HCP core software. All nodes have an identical software image to ensure maximum reliability and fully symmetrical operation of the system. An HCP system node can serve as both an object repository and an access point for client applications and is capable of taking over the functions of other nodes in the event of node failure.
All intranode and internode communication is based on scalable performance-oriented cluster communication (SPOCC). This efficient, reliable and easily expandable message-based middleware runs over TCP/IP. It functions as a unified message bus for distributed applications, forming the backbone of the back-end network where all node interaction occurs. SPOCC supports multicast and point-to-point connections and is designed to deal gracefully with network and hardware failures.
An HCP system is inherently a distributed system. Many of its core components, including the database, have a distributed nature. To process incoming client requests, software components on a particular node need to interact
with the components on other nodes across the system by means of the SPOCC-powered system backbone. All runtime operations are distributed among the system nodes. Each node bears equal responsibilities for processing requests, storing data and sustaining the overall health of the system. No single node becomes a bottleneck: All nodes are equally capable of handling any client request, ensuring reliability and performance.
Because HCP uses a distributed processing scheme, the system can scale linearly as the repository grows in size and in the number of clients accessing it. When a new node is added to the HCP system, the system automatically integrates that node into the overall workflow without manual intervention.
Namespaces and Tenants
Main Concepts
A Hitachi Content Platform repository is partitioned into namespaces. A namespace is a logical repository as viewed by an application. Each namespace consists of a distinct logical grouping of objects with its own directory structure, such that the objects in one namespace are not visible in any other namespace. Access to one namespace does not grant a user access to any other namespace. To the user of a namespace, the namespace is the repository. Namespaces are not associated with any preallocated storage; they share the same underlying physical storage.
Namespaces provide a mechanism for separating the data stored for different applications, business units or customers. For example, there may be one namespace for accounts receivable and another for accounts payable. While a single namespace can host one or more applications, it typically hosts only one application. Namespaces also enable operations to work against selected subsets of repository objects. For example, a search could target the accounts receivable and accounts payable namespaces but not the employees namespace.
Namespaces are owned and managed by tenants.
Tenants are administrative entities that provide segregation of management, while namespaces offer segregation of data. A tenant typically represents an actual organization such as a company or a department within a company that uses a portion of a repository. A tenant can also correspond to an individual person. Namespace administration is done at the owning tenant level. Clients can access HCP namespaces through HTTP or HTTPS, WebDAV, CIFS, NFS and SMTP protocols. These protocols can support authenticated and/or anonymous types of access. HCP namespaces are owned by HCP tenants. An HCP system can have multiple HCP tenants, each of which can own multiple namespaces. The number of namespaces each HCP tenant can own can be limited by an administrator. Figure 8 shows the logical structure of an HCP system with respect to its multitenancy features.
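The tenant and namespace relationship described above can be captured in a minimal data model. This is an illustrative sketch under stated assumptions: the `Tenant`, `Namespace` and `NamespaceLimitError` names, the in-memory storage and the refusal to overwrite an existing path are choices made for this example, not HCP interfaces.

```python
from typing import Dict, List, Optional


class NamespaceLimitError(Exception):
    """Raised when a tenant already owns its maximum number of namespaces."""


class Namespace:
    """A logical repository with its own, isolated set of objects."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._objects: Dict[str, bytes] = {}

    def put(self, path: str, data: bytes) -> None:
        # Fixed-content objects are not updated in place in this sketch.
        if path in self._objects:
            raise ValueError(f"{path} already exists in {self.name}")
        self._objects[path] = data

    def get(self, path: str) -> bytes:
        return self._objects[path]

    def list(self) -> List[str]:
        return sorted(self._objects)


class Tenant:
    """An administrative entity that owns and manages namespaces."""

    def __init__(self, name: str, max_namespaces: Optional[int] = None) -> None:
        self.name = name
        self.max_namespaces = max_namespaces  # limit set by an administrator
        self.namespaces: Dict[str, Namespace] = {}

    def create_namespace(self, name: str) -> Namespace:
        if name in self.namespaces:
            raise ValueError(f"namespace {name} already exists")
        if (
            self.max_namespaces is not None
            and len(self.namespaces) >= self.max_namespaces
        ):
            raise NamespaceLimitError(
                f"tenant {self.name} may own at most {self.max_namespaces} namespaces"
            )
        namespace = Namespace(name)
        self.namespaces[name] = namespace
        return namespace
```

For example, a tenant limited to 2 namespaces could create one namespace for accounts receivable and one for accounts payable; objects stored in one are not visible through the other, and a third `create_namespace` call raises `NamespaceLimitError`.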