White Paper Version 1.2 May 2015 RAID Incorporated
Introduction

The abundance of Big Data — structured, semi-structured and unstructured datasets too large to be processed effectively by conventional data processing methods — holds the potential for analytical insight never before available. Mining Big Data is an immediate, critical challenge for many businesses and research organizations. Analysis of these data creates new datasets and metadata that must themselves be stored and acted upon. A new generation of high-performance approaches to modeling, optimization, text mining and statistical analysis is required to turn these terabytes, petabytes and even exabytes of information into actionable analyses.

Big Data may be spun from scientific and engineering applications, genomic research, biometrics, weather data, financial and consumer information collection, security logs, rich media recognition devices and sources yet to be explored. Streams from social media and internet feeds account for another almost limitless source of data. The life sciences and healthcare industries are at the forefront of finding value in the storage and manipulation of Big Data; biometric data collected over large populations are often key to predicting, identifying, preventing and treating disease. More and more, executive decision makers in diverse industries are recognizing the potential within the data they have, or could collect.

The need for data visualization is also growing in importance. Charts and graphics can make large datasets understandable in a small graphical space, enable comparisons and conclusions, and help bridge the gap between data scientists and business leaders. Innovation and competitive positioning demand real-time analysis and reporting that provide trending and predictive data for decision makers. As datasets grow, legacy software, hardware and transmission techniques can no longer meet these demands rapidly enough.
The term technical computing has been coined to describe approaches that use new mathematical and scientific computing principles to manipulate and analyze huge datasets and provide useful answers to users outside of those disciplines.

Big Data Analytics Demands Faster, Secure Processing

From DNA research to biometrics, technical computing has contributed to the wealth of knowledge about health, disease and heredity. In the recent past, DNA sequencing was time consuming and costly; new advances in technical computing have made it a viable tool for patient diagnosis and treatment. It took 13 years to map the first genome, at a cost of several billion dollars. In less than 10 years, the time and cost of sequencing another genome were reduced by a factor of one million; today, a personal genome can be mapped in just a few days for a few thousand dollars. [Forbes]
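For a sense of scale, the factor-of-a-million cost reduction can be unpacked with some back-of-the-envelope arithmetic (an illustrative calculation using round numbers, not figures from the source):

```python
import math

# Illustrative round numbers: a 1,000,000x reduction in sequencing
# cost over roughly 10 years (120 months).
reduction_factor = 1_000_000
months = 120

# Each halving cuts the cost in two, so 2**n = 1,000,000 gives the
# number of halvings needed over the period.
halvings = math.log2(reduction_factor)   # about 19.9 halvings
months_per_halving = months / halvings   # about 6 months per halving

print(f"{halvings:.1f} halvings, one every {months_per_halving:.1f} months")
```

In other words, sequencing costs halved roughly every six months over that period, several times faster than the 18-to-24-month cadence usually attributed to Moore's Law, which is one reason legacy infrastructure struggles to keep pace with such data.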
Businesses are using financial and consumer information collected across their customer base, from both terminals and websites, to better understand the trends and cycles of their market and better align their offerings and services. The ability to identify trends and perform accurate forecasting is critical to businesses positioning themselves both tactically and strategically. This analysis must be cost effective, meaningful and available within a time frame in which it is still useful; in some industries, such as financial market trading, microseconds can make the difference between success and failure.

Meteorology is another excellent example of the need for processing speed, a scientific field that uses massive amounts of data in real time. Scientists use meteorological data collected over wide areas to predict the weather, and those predictions are critical to the development and deployment of response plans for the general public, as well as for governmental agencies and businesses that depend on such warnings. Predictions must be developed and delivered quickly enough to permit proactive response: a weather forecast for yesterday is not useful.

Nearly all areas of research need analytics that are faster, more powerful and cheaper to address massive datasets growing geometrically. Data scientists who use public or general-purpose cloud resources for capacity and scalability quickly discover that data transfer rates pose a major limitation, and many have moved back to private HPC, whether cloud or purpose-built infrastructure, to surpass those limits. Data must be moved from primary sources to multiple researchers quickly for any time-sensitive analysis. New approaches in computing architecture, both hardware and software, are constantly being developed to address these mushrooming data demands while providing the scalability and performance required. Simultaneously, high-value data must be protected.
Whether the requirement is high availability in the face of hardware or infrastructure failure, or reliable and immediate retrieval of archived data, systems must be designed to accommodate these requirements. Data scientists must also protect data from intrusion, theft and malicious corruption. Because of the sensitivity of the subject matter in many areas of research, privacy, security and regulatory compliance are factors that drive decisions away from public and shared cloud environments and toward private cloud and protected infrastructure. Big Data analytics can also reduce a greater mass of data by extracting the relevant information to be analyzed and producing high-value metadata when the original information pool is too large to manipulate in a timely fashion or at a reasonable cost.

Big Data Requires Parallel Applications and File Systems

Legacy systems tended to be centralized and processed data serially; a new approach to processing Big Data is required. To scale beyond what a single system can accomplish, parallel processing requires decentralization for optimal results and performance. Huge improvements in
performance have been achieved across networked processors and storage disks using parallel applications and parallel file systems, such as IBM's General Parallel File System (GPFS), offering almost unlimited scalability. When a Big Data analytics infrastructure is needed, usually termed high performance computing (HPC), advanced systems use clusters of computers to address the complex operations required of technical computing. These clusters can contain thousands of individual computers, whatever is required to accomplish the analytical processing in the time frame required. HPC also demands the fastest low-latency, high-bandwidth networks, along with fast, high-bandwidth shared storage accessible to every computer in the cluster. Private cloud architectures abstract some of the physical infrastructure, allowing more flexible workloads and bursting to public cloud architectures when required.

Conclusion

Whether your current infrastructure no longer meets your needs, or the bandwidth limitations of public cloud solutions make them unrealistic for your research, you may need a responsive provider who can draw upon significant resources and security expertise to help design and implement a solution: a provider large enough to offer you economies of scale, yet still able to provide a dedicated, customer-centric focus. Most in-house IT personnel don't have the skills or experience needed to architect, build and operate an infrastructure that will scale to your Big Data analytic needs and provide for future expansion. Every component of your high performance computing infrastructure must align with the overall needs of the system and work seamlessly with your software ecosystem: clustered computers, specialty computing resources, shared storage systems, high-bandwidth networking and inter-process communications, switching and security.
HPC ecosystems require support by critical and specialized skills that can often be provided or supplemented through a partnership with an experienced provider. RAID, Inc. brings two key strengths to the table as a vendor: it speaks the language of researchers and understands how to translate a scientific problem into a computational process, and it has broad experience with best-of-breed products that can be used as building blocks to create a custom, optimized infrastructure supporting that process. Whether designing for a limited budget, a performance-sensitive application, or a large-scale flexible computing platform supporting a diverse faculty or audience, RAID has both directly applicable experience and extensive current industry knowledge, and will work with each customer to address their specific needs.
References

http://www.hrgresearch.com/high%20performance%20computing.html
http://www.raidinc.com/high-performance-computing-for-big-data/
Vanacek, Jacqueline. April 16, 2012. "How Cloud and Big Data are Impacting the Human Genome - Touching 7 Billion Lives." Forbes. http://www.forbes.com/sites/sap/2012/04/16/how-cloud-and-big-data-are-impacting-the-human-genome-touching-7-billion-lives/