TECHNOLOGY IN DEPTH
5 KEY QUESTIONS FOR BIG DATA STORAGE STRATEGIES
And Comprehensive Answers from DDN GRIDScaler
OCTOBER 2012

Big data is an increasingly complex and differentiated workflow with unique requirements at every stage. This presents a serious challenge for even the most sophisticated computing environments at global corporations, let alone smaller companies. These companies may not even be aware that they need big data solutions because they do not recognize the nature of their crucial data. Yet the consequences of traveling down that road without the right infrastructure can be severe: poor business decisions, poor business execution, loss of irreplaceable information, severely limited production, damage to reputations, lost revenue, and even business failure. Fortunately, big data initiatives, tools, and technologies are becoming democratized: more accessible and affordable for mid-sized companies and the enterprise than ever before.

Big data requires the ability to continuously act upon large and growing volumes of data. This process requires fast and frequent data movement between application servers, over network connections, and across storage. This is particularly true of big data analytics, which needs exceptionally fast and large feedback loops for decision-making or real-time process inputs. An added complexity is the high proportion of unstructured data, which lacks the native advantages of the query-friendly relational database. Big data can include databases but is primarily created from multiple input sources: text, static images, streaming video, web inputs, sensor data, and more. A third major trend is big data's multi-stage workflow and value chain. In this environment, many different data inputs flow into the big data environment. From there, specialized applications analyze, dissect, and reform data into meaningful reports and new data sets. Analysts and executives make decisions accordingly, adjust data input requests, and the entire process begins again.

If you are evaluating big data storage solutions for your enterprise or mid-sized company, Taneja Group has identified five strategic questions that you should ask your vendors during the evaluation process. In this Technology Brief, we'll review these five questions and look at one specific solution in the market, DataDirect Networks' (DDN's) GRIDScaler, and the democratization of big data.

Why is it important to have a specialized platform for big data?

We see four major categories of challenges to managing big data in the enterprise.

Big data is a hard-to-manage value chain. A big data workflow is constructed of multiple applications and workloads that may interact with large input data sets and generate yet other data sets as output. The net effect is a tangled web of data access and processing that can yield a multitude of distributed systems near-simultaneously accessing storage systems for data load, processing, output, and potentially even redistribution. This represents an entirely different degree and range of access that the traditional storage system is not built to handle.

Ingestion and analysis. Big data poses tremendous data retention and access challenges. Traditional storage systems have innovated around delivering massive capacity, as in the case of scalable archives, or around delivering high performance, as in the case of enterprise-class storage
systems. But big data analytics rapidly cascade through very different access patterns: high-throughput streaming reads of large shards, streaming writes as new data is created, and random IO for further analysis. These storage systems simply can't scale to support the required capacity and simultaneous access.

Data protection and compliance. Petabyte-scale storage volumes exceed the capabilities of traditional data protection solutions. Moving data protection copies through backup or replication simply can't be done for this amount of data within limited timeframes. Meanwhile, big data systems are typically unusually open for access and are subject to harm, loss, or misuse. Protecting and governing big data can consequently be a nightmare without specialized storage.

Scalability. Big data scalability usually distributes storage across many compute nodes. But this will not work for the mainstream market: the custom engineering, specialized skill sets, and ongoing management requirements are simply too great. Meanwhile, growing analytics infrastructures can rapidly throw a storage infrastructure out of balance and drive OPEX through the roof if the storage infrastructure scales ineffectively. Administrative cycles can be rapidly consumed by load balancing data across a growing number of separate storage systems, while scattered storage silos' floorspace and energy consumption can be tremendously inefficient.

Evaluating Big Data Storage: The Top 5 Questions

Traditional IT storage acquisition practices often revolve around traditional storage capabilities, an identification of planned capacity and performance requirements, and an assessment of total cost of ownership. Too often, such practices applied to the acquisition of big data storage will yield poor results. Not only may the fundamental capabilities required be far different from the capabilities expected of traditional storage, but expectations around capacity and performance may be quickly thrown out the window in the face of growing big data initiatives. Consequently, TCO or ROI analyses carried out on the wrong expectations may result in a storage infrastructure completely out of alignment with big data storage requirements.

FIVE STRATEGIC EVALUATION QUESTIONS FOR BIG DATA STORAGE SOLUTIONS
#1. Is the system specialized enough to handle complex big data workflows, and does it work seamlessly with other systems to fulfill the entire big data workflow requirements?
#2. Is the system architected to meet the aggregated and unpredictable performance and capacity demands of big data workflows now and in the future?
#3. Is the storage system efficient enough to make long-term big data initiatives viable at the lowest TCO?
#4. Is the storage system resilient under failures at scale, and can it provide the required level of QoS even under failure conditions?
#5. Is the system simple to manage at scale?

In the market today, there are a number of choices for big data storage. Each of these solutions is positioned as the magic bullet, even though each typically takes a distinctly different approach. It is clear that big data requires a highly specialized solution, but what criteria define the best solution is often less clear. To help narrow the field, we have identified five strategic questions that should be asked of all prospective vendors when evaluating storage solutions. These questions should inform assessment
strategies for big data projects, and will help align the resulting storage infrastructure with big data requirements. After reviewing these questions, we will turn to applying them to an assessment of one of the leading vendors' solutions for big data storage.

QUESTION 1: IS THE SYSTEM SPECIALIZED ENOUGH TO HANDLE BIG DATA WORKFLOWS AND DOES IT WORK SEAMLESSLY WITH OTHER SYSTEMS TO FULFILL ENTIRE BIG DATA WORKFLOW REQUIREMENTS?

Big data workflows are very complex. They contain multiple stages, each a challenge in itself, that are well past the ability of traditional storage systems to handle efficiently. Potential bottlenecks can occur at any stage of the big data workflow:

Ingestion. At the core of many big data solutions are data sets that may be relatively static once in place, but undergo routine loading onto systems. This ingestion phase can involve very large data sets, and consequently throughput rates that traditional storage systems cannot handle. Delays in ingestion reduce the frequency of analysis and may even cause data loss, thereby degrading the effectiveness of the entire solution.

Store and hold. After ingestion, data is at some point held in relatively static form between processing cycles. For big data solutions, this period is one of the few windows of time during which data management can take place. This includes traditional operations like antivirus scans or backups, but also data scrubbing to maintain the integrity of data in place. This process can be critical for petabyte-scale data that might never be formally protected by traditional approaches. Sufficient performance, often well beyond that planned by the organization for processing, is required to ensure these tasks can be completed during store-and-hold windows.

Figure 1: Typical big data workflow.

Processing. The most IO-intensive and varied stage of a big data workflow, processing may mean simultaneous access to shards on a storage system by hundreds of clients, analysis or transformation, and then output of new data sets or correlation of new data with existing data sets. Each of the tasks occurring in a processing stage can have widely varying IO and throughput patterns, and may far exceed the capabilities of traditional storage systems.

Distribution. Many big data solution sets revolve around the eventual output of content. Much of this content may be large, and go through a distribution step where it is moved to other systems (sometimes for broadcast if media content is involved, other times simply for reuse by other systems). The other systems involved may be simultaneously accessing the same data sets at high read rates, in contrast to the segmented data set access that happens during processing. This is yet another challenge for traditional storage systems, especially when the working data set size exceeds the working cache of the storage system.

Each of these stages has fundamentally different IO characteristics that can severely challenge traditional storage systems. Ingestion needs high-throughput writes, but then data must be held cost-effectively and with high levels of integrity after ingestion. Processing can then cast that data into a blender of random and/or sequential IO with widely varying IO sizes. Finally, shifting to distribution once again means high throughput, but this time reads, as data is output to additional systems.
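To make that contrast concrete, the short sketch below (purely illustrative, in Python) expresses each stage as a rough IO profile and sums the profiles to show the aggregate demand a shared storage system sees when stages overlap in time. The stage names come from the workflow above; every number and field name is a hypothetical assumption, not a measurement or a vendor specification.

```python
# Hypothetical sketch: each big data workflow stage as a rough IO profile,
# plus the aggregate demand when stages overlap. All figures are illustrative.

from dataclasses import dataclass

@dataclass
class StageProfile:
    name: str
    pattern: str          # dominant access pattern
    read_gbps: float      # sustained read throughput assumed (GB/s)
    write_gbps: float     # sustained write throughput assumed (GB/s)
    random_iops: int      # small random IO component assumed

stages = [
    StageProfile("ingestion",    "sequential write",         0.0, 5.0,      0),
    StageProfile("store & hold", "background scrub/backup",  1.0, 0.5,  2_000),
    StageProfile("processing",   "mixed random + streaming", 8.0, 4.0, 50_000),
    StageProfile("distribution", "sequential read",         10.0, 0.0,      0),
]

# If stages overlap in time (they usually do), the storage system sees the sum.
peak_read  = sum(s.read_gbps   for s in stages)
peak_write = sum(s.write_gbps  for s in stages)
peak_iops  = sum(s.random_iops for s in stages)

print(f"Aggregate demand: {peak_read:.1f} GB/s read, "
      f"{peak_write:.1f} GB/s write, {peak_iops} random IOPS")
```

The point of the exercise is that sizing for any single stage in isolation understates what the storage system must actually sustain, which leads directly into the next question.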
QUESTION #2. IS THE SYSTEM ARCHITECTED TO MEET AGGREGATED AND UNPREDICTABLE PERFORMANCE AND CAPACITY DEMANDS OF BIG DATA WORKFLOWS NOW AND IN THE FUTURE?

Big data workflows contain a potent mix of large and small IO that few storage systems can efficiently handle, and so require highly specialized storage. These varied access patterns represent difficult new challenges for workloads and performance, and they grow rapidly over time. This necessitates a storage system for big data that can meet challenging levels of performance and capacity demand now, and that can efficiently scale to meet rising demands. Storage systems architected for the traditional enterprise workload will typically not be able to deliver against the performance requirements of these big data workflow stages. Moreover, these stages should be looked at holistically, with an eye toward the aggregate demand for performance both today and after sustained growth of big data within the business. With those aggregate demands and future growth in mind, it rapidly becomes clear that big data performance requires a unique architecture far different from the controller-centric architecture of traditional arrays.

QUESTION #3. IS THE STORAGE SYSTEM EFFICIENT ENOUGH TO MAKE LONG TERM BIG DATA INITIATIVES VIABLE AT THE LOWEST TCO?

Big data capabilities are more likely than any other technology endeavor to be constrained by efficiency. Big data has all the potential to grow to the limits of data center floorspace, power, or cooling. Systems that aren't sufficiently dense, or that grow inefficiently, may even see other infrastructure systems such as networking collapse under their weight. Efficiency, and thereby density, will matter to the sustainability of a big data strategy.

But just as important are the secondary impacts of this limitation. If physical footprints become constraining too soon, big data solutions will be forced to employ additional data silos. And with those data silos come significant limitations in data accessibility and costly additional management overhead.

Customers able to deal with high levels of software customization have often turned to white-box strategies in the past, in the belief that these distributed, data-in-the-compute-node architectures increase density and flexibility. Too often they represent a false economy in total solution efficiency by compromising data consolidation, centralized management, and shared access, thereby leading to an eventual proliferation of storage architectures throughout the data center. These ultimately occupy more net floorspace and incur the price of excessive complexity and operational support. In contrast to the centralized, packaged solutions that are becoming more widely available in the market today, white-box infrastructures may represent an extreme of inefficiency.

We firmly believe that lack of total solution efficiency will eventually constrain any big data strategy, irrespective of the plans for the big data initiative at project outset. The constraint may vary, but it will ultimately impose a serious TCO penalty, whether the technological issue is one of rackspace or performance barriers. Failure to plan for achieving maximum IO and capacity per unit of floorspace will heighten the risk of running into physical constraints that force storage silo sprawl and push a big data initiative beyond the enterprise budget.

QUESTION #4. IS THE STORAGE SYSTEM RESILIENT UNDER FAILURES AT SCALE AND CAN IT PROVIDE REQUIRED QOS UNDER FAILURE CONDITIONS?
As big data scales into petabyte capacities, traditional backup simply cannot keep up. On traditional infrastructures, scaling data sets beyond the ability of the organization to protect them in narrow backup windows may leave data unprotected. Organizations can then find themselves with manually woven-together processes that create some semblance of data protection, and may rapidly be overcome
by device failures at scale. It may be an impossible race to recover and maintain a reliable storage infrastructure beneath growing big data that is entirely reliant on single storage systems built without sufficient resiliency for this use case. At the other extreme, white-box architectures may sacrifice alerting and visibility, and compromise data without clear warning that anything is at risk. Meanwhile, across an increasing number of storage spindles, data integrity is constantly at risk from silent bit corruption or media errors.

The other critical question is how the storage system maintains quality of service even under failure conditions. Many systems offer a level of QoS by failing over from a downed server to a standby server, or by architecting redundant nodes for uninterrupted processing in the event of node failure. However, given the complexity and sheer scale of big data environments, traditional approaches like these are hopelessly inadequate. Customers may attempt to provide QoS and resiliency at huge scale with solutions of enormous complexity and cost: storage-system or host-based replication that introduces entire additional layers of software, hardware, and potentially duplicate standby storage systems. But better approaches will be required by organizations hoping to sustain big data initiatives, and applicable offerings are rapidly emerging in the marketplace. Such solutions are built from the ground up for high levels of reliability and data integrity that can in effect deliver in-place protection. Equally important, such solutions are designed to operate with a lower level of maintenance than more traditional architectures.

QUESTION #5. IS THE SYSTEM SIMPLE TO MANAGE AT SCALE?

Big data by its nature is already big, but additional analytics and processing add exponentially more data, and often more rapidly than businesses anticipate. Moreover, longer-term analytic opportunities will arise for many big data initiatives that cannot be foreseen today: emerging innovations will continue to stream into the marketplace and bring about new opportunities to mine both existing and new data sources. Enterprises will continue to discover new geospatial and informational data sets and capture new rich sources of user and market information for correlation and mashup against existing data sources. In other fields, new rendering technologies, monitoring or design processes may bring about vast new media, operational, or sensor-based data sets. Each of these influences will cause big data to grow many times faster than can reasonably be anticipated today. In turn, any infrastructure that cannot adapt in both capacity and performance over time stands poised to undercut today's big data initiatives. The requirements for this adaptability include:

Linear and large. Systems must be built for efficient scaling across both capacity and performance. In addition, the system must provide enough flexibility to fine-tune either dimension, as the demands for capacity and performance may evolve cyclically and independently.

Scale while preserving efficiency. Scale shouldn't mean relying solely on a loose coupling or federation of systems that doesn't truly integrate data management and access across systems. Such recipes for scaling will inevitably fragment the storage infrastructure, create isolated storage pools with stranded and underutilized capacity, and rapidly increase management overhead.
Scaling should instead simply extend a tightly integrated architecture that can maintain consolidated, high-speed access at extreme capacity and with extreme performance.

Adaptable data management. It goes hand in hand with the previously defined requirement for an efficiently scaling architecture, but big data customers should also expect a consistent, highly efficient approach to data management as scaling takes place. Whether data is managed through file versioning, snapshots, or off-system backup, scaling a system shouldn't require additional operations or additional sets of infrastructure systems and software.
Differences in how vendors answer this question should elicit significant details about a given vendor's design philosophy and appropriateness for fast-growing big data environments.

FOCUS ON DDN GRIDSCALER

DataDirect Networks (DDN) is currently the world's largest privately held storage vendor, shipping extremely high-performing, scalable, and efficient storage for HPC and data-intensive industry solutions since 2000. This heritage puts them right at the forefront of the enterprise big data storage movement: they were big data before it had a name. Many of their long-term clients have been dealing with the largest of the large data sets in life sciences, media, and government (e.g. NASA, Homeland Security) well before newer enterprise clients discovered that they could store and productively analyze their own big data.

Over the last decade, working closely with the most demanding clientele, DDN evolved their core S2A architecture to produce a series of world-class storage platforms for extreme-bandwidth and real-time applications. Their latest-generation Storage Fusion Architecture (SFA), with its unique internal Storage Fusion Fabric, adds an extremely low-latency, high-IOPS capability while enabling massive scaling both up and out (e.g. 6.7 PB in a single two-rack system). Additionally, SFA takes advantage of convergence to support in-storage processing, hosting up to 16 virtual machines directly in the storage controllers, which makes them ideal platforms for DDN's parallel file system and NAS appliances.

Delivering the highest-performing, massively scalable storage is great, but where DDN really leaves the competition behind is in building it with world-class enterprise storage features for data protection and high availability. In addition to supporting multiple RAID types, arbitrarily mixed media (SAS/SATA, HDD/SSD), and single-console management, SFA provides features critical to massively scaled architectures that commonly exceed a thousand drives per system:

Read IO QoS delivers high performance even under drive failure
ReACT intelligent cache management optimizes both serial and random write IO in real time
Journaled drive rebuilds significantly reduce RAID rebuild times
DirectProtect data integrity provides real-time fault and error detection and dynamic correction

DDN's GRIDScaler parallel file system solution, packaged over the SFA platform as a complete appliance, delivers hundreds of thousands of file operations per second and sustained 200 GB/s streaming throughput per building block. Multiple such building blocks can be aggregated to reach any given performance and capacity requirement (a simple sizing sketch follows below). The parallel file system implementation on SFA supports demanding mixed workloads of both transactional random and high-volume streaming IO. As an appliance, it simplifies deployment, maintenance, and operations, while easily scaling to hundreds of PBs and several hundred GB/s of performance. DDN GRIDScaler seems perfectly aimed at massive big data sets.

DDN GRIDScaler: Answering the 5 Strategic Questions

Let's explore how GRIDScaler stacks up against our five strategic questions for evaluating big data storage. Keep in mind that if a system doesn't deliver on each criterion while performing at scale, it's simply not going to be a viable big data storage solution. In many storage systems there can be large trade-offs and compromises between the often-competing design goals of high performance, boundless scalability, and rock-solid data protection.
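Before turning to the individual questions, the building-block figures are worth a quick back-of-the-envelope exercise. The sketch below uses the roughly 200 GB/s of streaming throughput per building block cited above and the roughly 2.3 PB of usable capacity per appliance cited later in this brief; the project targets themselves are purely hypothetical assumptions. It simply illustrates how scale-out sizing works: the dimension that demands the most building blocks, throughput or capacity, sets the configuration.

```python
import math

# Per-building-block figures cited in this brief (approximate).
BLOCK_THROUGHPUT_GBPS = 200    # sustained streaming throughput per building block
BLOCK_USABLE_PB = 2.3          # usable capacity per appliance at RAID 6 (8+2)

# Hypothetical project targets -- assumptions for illustration only.
target_throughput_gbps = 600
target_capacity_pb = 12

blocks_for_throughput = math.ceil(target_throughput_gbps / BLOCK_THROUGHPUT_GBPS)  # 3
blocks_for_capacity   = math.ceil(target_capacity_pb / BLOCK_USABLE_PB)            # 6

# The configuration must satisfy the more demanding of the two dimensions.
blocks_needed = max(blocks_for_throughput, blocks_for_capacity)
print(f"Building blocks needed: {blocks_needed} "
      f"(throughput requires {blocks_for_throughput}, capacity requires {blocks_for_capacity})")
```

Because performance and capacity can also be scaled somewhat independently (more enclosures versus more file system heads, as discussed under Question #5), an actual configuration may differ; the point is that aggregate and future demand, not either dimension alone, should drive the block count.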
QUESTION #1: IS GRIDSCALER SPECIALIZED ENOUGH TO HANDLE COMPLEX BIG DATA WORKFLOWS AND DOES IT WORK SEAMLESSLY WITH OTHER SYSTEMS TO FULFILL THE ENTIRE BIG DATA WORKFLOW REQUIREMENTS? YES

GRIDScaler and Big Data Workflows: Broad multi-protocol support is basic to handling big data workflows where incoming data uses a variety of protocols and interconnects. GRIDScaler supports standard storage protocols, including NFS and CIFS, as well as Fibre Channel and InfiniBand interconnects. GRIDScaler is in fact the de facto leader in InfiniBand-attached storage, and supports RDMA over its InfiniBand connections for a real boost to serious IO applications. There are modifications available today that allow Hadoop to directly leverage RDMA, and we expect this to be a killer combination emulated by other analytical solutions.

Other integrations abound. DDN integrates with SAS Grid Computing's analytic platform. The SAS Grid Computing workload consists of two phases: a highly IO-intensive ETL stage (extract, transform and load) followed by an analytical modeling phase that is both IO- and CPU-intensive. GRIDScaler provides consistent and predictable high processing speeds for both phases, thanks to parallel file sharing and the locking algorithms necessary to support concurrent access.

DDN has recently taken a broader approach to their customers' multi-stage workflows with the introduction of DirectMon, which provides a single portal of management and visibility across multiple GRIDScaler and big data storage repositories, irrespective of whether they are entirely separate or joined together into one massive system. DirectMon looks poised to reduce the management overhead associated with the storage layer underneath big data workflows, and will likely evolve further to deliver even more data intelligence.

Finally, GRIDScaler also enables cloud tiering, and connects to clouds built using DDN's Web Object Scaler (WOS). WOS is the architecture behind economical and massively scaled clouds using object-based storage across distributed locations. And in concert with other tools like the Lustre-based EXAScaler as well as NASScaler, DDN provides a full big data solutions portfolio. It is clear that DDN is charting a course toward comprehensive, and uniquely deep, handling of all the big data customer's workflow needs.

QUESTION #2: IS GRIDSCALER ARCHITECTED TO MEET THE AGGREGATED AND UNPREDICTABLE PERFORMANCE AND CAPACITY DEMANDS OF BIG DATA WORKFLOWS NOW AND IN THE FUTURE? YES

GRIDScaler's Performance Architecture:
The heart of the GRIDScaler performance story is an industrial-strength parallel file system that can deliver not only high volumes of aggregate file IO but also stream large single files by leveraging as many spindles as possible. The real enabler is that DDN has been able to package a complex, fully featured parallel file system into an appliance with simplified management; many parallel file system offerings are difficult if not unruly to own and operate, requiring deep, dedicated staff expertise.

The SFA's internal backplane for storage is a customized, redundant, non-blocking SAS fabric that enables high internal bandwidth and low latency over massive disk capacities. DDN has been able to take advantage of this huge internal bandwidth to build in world-class capabilities for performance optimization and resiliency. For example, the seemingly boundless bandwidth comes into play when flash SSDs are deployed. In traditional storage architectures, the ROI of replacing HDD with SSD is severely limited by controllers, caches, and buses designed for HDD performance. In GRIDScaler, investments in flash are heavily leveraged because the internal fabric can handle all the flash IO at top volumes and speeds.

Performance is also an interconnect story. As we mentioned earlier, GRIDScaler features high-speed front-end interconnects. InfiniBand in particular mirrors GRIDScaler in delivering extremely high-performance IO at low latency. InfiniBand has been increasingly migrating from the HPC domain, where it's widely used to connect supercomputer clusters and storage, down into the enterprise data center to support dense clustering for virtualization, cloud, big data, and big storage. While there are some storage vendors using InfiniBand on their clustered storage backends, and a few offering it as a primary storage interface, DDN leads the pack in delivering this data firehose and driving it to potential saturation on the front end, producing top speeds even when configured for parallel file access.

QUESTION #3: IS GRIDSCALER EFFICIENT ENOUGH TO MAKE LONG TERM BIG DATA INITIATIVES VIABLE AT THE LOWEST TCO? YES

GRIDScaler's Efficiency Recipe: DDN built its efficiency recipe on lowering TCO in storage processing environments. DDN has a long track record of innovation, moving the state of the art toward higher performance, greater scale, and greater efficiency in concert. For example, their newest platform, the SFA12K-20E, supports what DDN calls In-Storage Processing, which can host applications directly in the storage controllers. GRIDScaler's parallel file system then runs right in the controller, eliminating the need for extra servers, cables, and switches, along with their management overhead and performance penalty. The opportunity to host data-intensive application modules directly in the controller of such a massively scaled system should get any big data architect salivating. But a few other innovative integration examples exist as well.

DDN GRIDScaler is built over the SFA platform, which provides a number of design efficiency advantages. First, the modular storage enclosure houses 84 drives in just 4U of rack space. The next important contribution stems from the use of the SFA12K-20E platform as the storage building block.
Each GRIDScaler appliance scales up to 10 enclosures with 840 drives that, at RAID 6 (8+2), deliver 2.3 PB of usable file storage. Multiple GRIDScaler appliances can be added to a configuration pool in scale-out fashion to create a high-performance 10+ PB system in a small, dense footprint. Real-time Adaptive Cache Technology (ReACT) provides both processing power and massive capacity handling, with the ability to analyze workloads in real time and adapt the cache to optimize performance.

GRIDScaler further lowers TCO by using policies to automatically tier big data across optimized storage tiers of SSD, SATA, and SAS. All tiered data remains instantly accessible and manageable from a single namespace. Extreme storage density also saves floorspace and energy in the data center, leading to even lower TCO for massively scaled performance and capacity.

The GRIDScaler parallel file system on SFA readily supports simultaneous mixed workloads of both high-bandwidth IO and low-latency transactions, avoiding the need to deploy multiple storage systems tailored to each workload type. In addition to exceeding service levels no matter the application IO profile, this effective storage consolidation greatly increases both immediate efficiency and ultimate density across the data center. Furthering efficiency, the SFA12K-20E runs GRIDScaler's parallel file system right on the storage controller, improving performance while removing the need to rack and power extra servers, adapters, and switches.

With traditional storage systems it's a pick-one-of-two proposition: either high performance or massive, cost-efficient capacity. DDN GRIDScaler accomplishes both, a critical capability for any storage system poised to handle big data.

QUESTION #4: IS GRIDSCALER RESILIENT UNDER FAILURES AT SCALE AND CAN IT PROVIDE THE REQUIRED LEVEL OF QOS EVEN UNDER FAILURE CONDITIONS? YES

GRIDScaler - Resiliency to an Extreme: Internally, GRIDScaler leverages RAID 6 for data and RAID 60 for metadata. Additionally, GRIDScaler supports synchronous volume-to-volume replication and up to 256 snapshots per file system. But what sets DDN apart is the built-in DirectProtect data integrity feature that automatically detects and corrects errors, which is critical when you have thousands of drives.

DirectProtect has several functions. On one hand, it is a data-scrubbing engine that constantly verifies the validity of data on disk and rebuilds data that has become corrupt or has deteriorated over time. For big data storage at scale, this is critically important: such data volumes may be difficult to routinely protect with traditional backup mechanisms, which makes high integrity at the place of primary storage essential.
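This brief does not detail DirectProtect's internals, but the idea behind continuous data scrubbing can be sketched generically. The toy Python below is an illustration under stated assumptions, not DDN's implementation: it re-verifies each stored block against its recorded checksum and repairs any mismatch from a redundant copy, which is how silent corruption gets caught before an application ever reads the bad block.

```python
# Generic illustration of background data scrubbing -- not DDN's DirectProtect,
# just the concept it names: periodically re-verify stored data against checksums
# and repair silent corruption from redundancy before it is ever read.

import hashlib

# Toy "disk": block id -> bytes, plus the checksum recorded when each block was written.
blocks = {0: b"alpha", 1: b"bravo", 2: b"charlie"}
stored_checksums = {i: hashlib.sha256(data).hexdigest() for i, data in blocks.items()}
redundant_copy = dict(blocks)          # stands in for RAID parity / a mirror

blocks[1] = b"brXvo"                   # simulate silent bit corruption on block 1

def scrub_pass():
    """One pass over all blocks; real systems run this continuously at low priority."""
    repaired = []
    for block_id, data in blocks.items():
        if hashlib.sha256(data).hexdigest() != stored_checksums[block_id]:
            # Mismatch: rebuild the block from redundancy and rewrite it in place.
            blocks[block_id] = redundant_copy[block_id]
            repaired.append(block_id)
    return repaired

print("repaired blocks:", scrub_pass())   # -> repaired blocks: [1]
```

The value of doing this in the storage system itself, rather than in a backup cycle, is exactly the point made above: at petabyte scale there may be no practical backup window in which corruption would otherwise be discovered.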
DirectProtect also performs fault detection, but goes well beyond the norm of most fault-detection technology. DirectProtect contains a set of algorithms and sub-routines designed to fully interrogate drives or components that are raising alerts, and then intelligently attempt device recovery. For systems utilizing large numbers of disks, such recovery is important, and can yield a significant reduction in management overhead and field service events. If DirectProtect can't recover a device from an error state, the RAID engine in the SFA controllers begins a high-speed recovery, using multiple distributed drives to rapidly stream rebuilt data onto a new drive at a rate that dwarfs the recovery time of most RAID solutions on the market today. Not to be missed: if a failure does occur, SFA's huge internal bandwidth also enables a feature called Read QoS that guarantees read performance even under disk failure and rebuild, with minimal impact to service.

Big data makes extreme demands of processing and capacity. It also makes extreme demands of lifecycle storage, which must offer massive capacity, flexible RAID, fast backup, and data/metadata replication. GRIDScaler accomplishes all this with built-in capabilities, and can also optimize third-party applications like Tivoli Storage Manager.

QUESTION #5: IS GRIDSCALER SIMPLE TO MANAGE AT SCALE? YES

GRIDScaler's design supports both scale-up and scale-out: a kind of diagonal scalability. In addition to high drive density per enclosure, each new enclosure can introduce high-performance SSD or performance- or capacity-oriented rotational drives in any mix. Either performance or capacity can be independently scaled by adding either file system heads or back-end disks. Then you can scale out by adding more GRIDScalers to create 10+ PB pools. This approach supports growing slowly or taking extreme jumps in size, all at consistently high performance and efficiency and without excessive architectural complexity.

Moreover, for all its sophisticated features and powerful architecture, GRIDScaler is simple to manage even at massive scale. DirectMon not only manages multiple GRIDScaler deployments, but manages across DDN's whole portfolio of storage solutions, including SFA storage arrays and EXAScaler file systems. DirectMon enables big data administrators to simplify big data workflow administration across the whole complex value chain, which is made up of multiple applications and workloads.
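DirectMon's actual interfaces are not described in this brief, so the sketch below is a generic, hypothetical illustration of what any single management console centralizes: rolling per-system capacity and health into one fleet-wide view and raising alerts against a policy threshold. All names, fields, thresholds, and numbers here are assumptions for illustration, not DirectMon features.

```python
# Hypothetical illustration of single-console monitoring across multiple storage
# building blocks -- not DirectMon's API, just the pattern such a console centralizes.

from dataclasses import dataclass

@dataclass
class SystemStatus:
    name: str
    used_tb: float
    total_tb: float
    failed_drives: int

CAPACITY_ALERT_THRESHOLD = 0.85   # assumed policy: alert at 85% full

def collect_status(systems):
    """Roll per-system status into one fleet-wide view plus a list of alerts."""
    alerts = []
    fleet_used = sum(s.used_tb for s in systems)
    fleet_total = sum(s.total_tb for s in systems)
    for s in systems:
        utilization = s.used_tb / s.total_tb
        if utilization >= CAPACITY_ALERT_THRESHOLD:
            alerts.append(f"{s.name}: {utilization:.0%} full, plan expansion")
        if s.failed_drives:
            alerts.append(f"{s.name}: {s.failed_drives} failed drive(s), rebuild in progress")
    return fleet_used / fleet_total, alerts

# Example fleet (names and figures are invented; 2300 TB echoes the per-appliance capacity above).
fleet = [
    SystemStatus("gridscaler-01", used_tb=2000, total_tb=2300, failed_drives=0),
    SystemStatus("gridscaler-02", used_tb=1200, total_tb=2300, failed_drives=1),
]
utilization, alerts = collect_status(fleet)
print(f"Fleet utilization: {utilization:.0%}")
for a in alerts:
    print("ALERT:", a)
```

The design point is simply that without one control point, each of these checks becomes a per-system chore, which is the management-overhead problem the next paragraphs describe.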
DirectMon is a centralized storage management console that is available as a pre-installed management appliance or as a software application for virtual machines. If big data initiatives require other DDN HPC-class storage (e.g. massive tier 2 archives), DirectMon can manage them all as one workflow-enabling system.

Centralized management across a far-flung storage environment is a huge advantage for DDN. Even simple storage systems can suffer from the lack of a single control point, let alone sprawling big data storage environments. DirectMon enables DDN administrators to effectively maintain high availability in their mission-critical storage environments using wide-ranging policies and automated configurations. System-wide management tasks include setting capacity thresholds and alerts for performance and hardware/software states. IT can run statistics reports from all systems, create meaningful trending graphs, and review logs and configuration data for system-wide optimization. DirectMon simplifies user creation and data security with pre-defined and customizable roles and users. System setup is also highly automated and simplified, and IT can discover, configure, duplicate, manage, and optimize complete systems directly from the console.

DDN's high-performance storage arrays and massively parallel file systems are excellent fits for many companies launching big data initiatives. DirectMon takes DDN's offerings a big step further by simplifying the extreme complexity of big data storage management.

Taneja Group Opinion

It is clear from the above criteria that big data presents a number of challenges, along with unique opportunities to address those challenges. Users should never take big data storage decisions lightly. Even high-end storage products may or may not be suitable for big data; many of the best-known systems are not. Prospective buyers need to understand big data's specialized capabilities and requirements so they can build a long-term, sustainable strategy for this business-critical undertaking.

On the storage vendor side, these requirements represent a great opportunity for vendors who understand the challenges and can develop specialized and integrated products for each part of the big data lifecycle. The field of big data is changing rapidly, and the right system architectures will let vendors tightly couple their solutions to big data environments. DDN is one of the few vendors who has achieved this level of big data and storage coupling, and can today deliver a solution for almost any big data need, as well as promise to address changing demands in the future. With DDN, it seems that each design point associated with our key questions enhances performance, enables greater capacity, and increases availability, which are in themselves key elements of a big data storage strategy. Moreover, DDN simultaneously weaves its products together into a well-managed whole, which puts DDN in an excellent position to help customers execute on their big data initiatives. Clearly DDN is building a unique big data storage expertise that we're hard pressed to say any other vendor has a chance of rivaling.

NOTICE: The information and product recommendations made by Taneja Group are based upon public information and sources and may also include personal opinions both of Taneja Group and others, all of which we believe to be accurate and reliable. However, as market conditions change and are not within our control, the information and recommendations are made without warranty of any kind.
All product names used and mentioned herein are the trademarks of their respective owners. Taneja Group, Inc. assumes no responsibility or liability for any damages whatsoever (including incidental, consequential or otherwise) caused by your use of, or reliance upon, the information and recommendations presented herein, nor for any inadvertent errors that may appear in this document.

Copyright The TANEJA Group, Inc. 2012. All Rights Reserved.