Utilizing MapReduce to Address Big Data Enterprise Needs
Leveraging Big Data to shorten drug development cycles in the pharmaceutical industry
www.persistent.com
Contents

From the Vantage Point
What is in it for me?
Use Case: How Pharmaceutical Companies Can Leverage Big Data to Shorten Drug Development Cycles
Digital Microscopy: What is it and what are its challenges?
MapReduce and the Building Blocks for Solving the Business Problem
The Solution Architecture
Key Features
Flow Chart
Business Opportunities
Conclusion
From the Vantage Point

According to IDC, digital content will grow to 2.7 zettabytes (ZB) in 2012, up 48% from 2011. Over 90% of this information will be unstructured (e.g., images, videos, MP3 files, and files based on social media and Web-enabled workloads): full of rich information, but challenging to understand and analyze (IDC Predictions 2012; see References). With data growing this quickly, enterprises are struggling with information overload and are turning to Big Data technologies to transform data into opportunity. This is especially critical in the pharmaceutical industry, where a shorter time to market can make a significant difference to patients' lives, academic research and the healthcare community at large.

In this white paper you will find use cases of histology (microscopy) data which exemplify the need for an enterprise software platform to upload, store, visualize and analyze enormous amounts of data in a high performance environment. The document explains the architecture of, and the need for, a unified technology platform which makes managing, storing, processing and analyzing Big Data faster and more efficient.

We introduce the High Performance (GPU) & Cloud Computing Enterprise Solution, built on the MapReduce paradigm, which is designed to solve relevant workflows and provide new insights into the growing volume of data available at different process stages. Because the data is strategic, high computing capabilities act as the foundation for other applications such as Business Intelligence and dashboards, which in turn support critical business decisions. Major IT trends such as GPGPU (general-purpose computing on graphics processing units), Hadoop (MapReduce) and the Cloud make this solution possible. The High Performance (GPU) & Cloud Computing Enterprise Solution can be applied to a wide range of areas and industries.
This solution is industry agnostic and benefits verticals such as telecom, retail, pharmaceutical, banking, financial services and insurance, which not only generate huge amounts of data but also need to process this data to ensure continuous growth and performance. This solution:

- Handles huge amounts of data
- Provides data analytics in a parallel computing manner
- Enables business solutions
- Generates critical reports that solve underlying business problems

Here we provide an example of a pharma-specific paradigm that can easily be extended to other industries. For example, there are underlying similarities between the following case story and personalized healthcare: both require collecting, managing and processing valuable medical information such as electronic health records, blood samples and DNA sequencing data, so that physicians can make the best possible recommendations to patients.
What is in it for me?

Faced with an ever-increasing amount of data, you will learn how to leverage the MapReduce concept together with the latest IT trends, such as Graphical Processing Units (GPUs) deployed on a cloud infrastructure, to gain enormous compute capacity. Combined with a well-defined distributed processing layer, this capacity can handle the data and bring out the intelligence hidden within it.

At a high level, we will:

- Chart out the architectural decisions to be made by an enterprise system leveraging the latest trends in Cloud Computing, Distributed Processing, Mobility, Collaboration and High Performance Computing
- Show how the High Performance (GPU) & Cloud Computing Enterprise Solution creates value for its stakeholders (i.e. Business Head, VP Engineering, Architect and Solution Specialist)

Even though this paper focuses on a pharmaceutical use case, the information herein is helpful to any stakeholder involved with an enterprise implementation: executives, technical consultants, business analysts, users, and implementation partners, particularly those responsible for the overall success of systems that require high performance computing to process enormous amounts of data in close to real time.

© 2015 Persistent Systems Ltd. All rights reserved.
Use Case: How Pharmaceutical Companies Can Leverage Big Data to Shorten Drug Development Cycles

The data explosion occurred early in the Life Sciences and Healthcare sector. The industry has recognized the need for technologies that can mine, analyze and translate the archipelago of data from the Human Genome Project into specific therapeutic drug targets. The deepening R&D productivity crisis that characterizes today's pharmaceutical development pipelines requires the industry to validate and predict the clinical attributes of a drug earlier in the product lifecycle. Predictive modelling technologies and other BI tools are helping the pharma industry reduce the attrition rate of its drug pipelines.

Translational medicine aims to address the imbalance between the number of disease targets and therapeutic agents, and to enrich the drug pipeline by allowing scientists and clinicians to make associations between drug and disease earlier in the drug development process. Toxicology departments of pharmaceutical companies use translational medicine and integrate technologies earlier in the product life cycle to predict benign as well as harmful effects of chemicals during the developmental phase.

Digital Microscopy: What is it and what are its challenges?

There is a critical need in the preclinical toxicology departments of pharmaceutical companies to retrieve, digitize and store all the histological slides of the various organs from different animal models. Currently most of this analysis happens manually, and it is the most time-consuming portion of drug development, especially in generic drug development.
Pharmaceutical companies are moving from slides to digital microscopy thanks to the advent of high-throughput scanners. But this adds up to an enormous amount of digital data that must be kept for regulatory and analysis purposes, as slides are stored at various resolutions from 5X to 100X, ranging from 1.6 GB to 20 GB per image. Automated analysis involves segmentation, image processing and classification engines to aid in supervised feature detection and reporting. Moreover, the image and its associated analytical procedures need to be archived and maintained post necropsy for regulatory reasons. The problem is further compounded by the fact that an identified feature, and the abnormal effect of a chemical that it corresponds to, has to be cross-verified across various organs on animal models to make a correct histopathological analysis.

[Figure: Digital Microscopy Application. Recoverable labels: each image stored in various resolutions (5X to 100X) along with its thumbnail; data format and store; generated catalogue moved to a central location; algorithms process the data; classification engine separating normal tissues vs. abnormal tissues; various images, manual annotations and peer reviews.]

The amount of digital data created by digital microscopy is enormous: the datasets created in this business case are around 2-3 TB. A dataset of this kind would previously have been very challenging and expensive to take on with a traditional RDBMS using standard bulk load and ETL approaches. The solution needs to combine multiple data sources efficiently, such as multiple sites across several countries simultaneously, or data residing on multiple machines (often dozens). MapReduce platforms handle this effectively by using a distributed file system that is specifically designed for datasets spread across distributed server farms. Such a distributed file system should also be fault resilient and should not require RAID drives on individual nodes.
One scanner handles around 300 slides at a time, generating 0.5 TB of data for each digital scan.

High Performance Computing (GPU), Cloud and MapReduce analytics technologies enable the study of larger quantities of biological data, with higher precision and in shorter periods of time, ultimately helping to accelerate advancements in personalized medicine. These technologies are expected to be applied to molecular medicine, pharmaceuticals, biomedicine, and industrial biotechnologies.
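The data volumes quoted in this section can be cross-checked with some quick arithmetic. The sketch below is illustrative only: it assumes decimal units (1 TB = 1000 GB) and a batch of exactly 300 slides, and the variable names are not taken from the solution itself.

```python
# Back-of-the-envelope check of the data volumes quoted in the text.
# Assumption (not from the paper): decimal units, 1 TB = 1000 GB.
GB_PER_TB = 1000

slides_per_batch = 300      # "around 300 slides at a time"
tb_per_batch = 0.5          # "0.5 TB of data for each digital scan"
dataset_tb_range = (2, 3)   # "datasets ... are around 2-3 TB"

# Average size of one scanned slide in a batch.
avg_gb_per_slide = tb_per_batch * GB_PER_TB / slides_per_batch

# Number of scanner batches needed to accumulate the quoted dataset sizes.
batches_needed = tuple(tb / tb_per_batch for tb in dataset_tb_range)

print(f"average slide size: {avg_gb_per_slide:.2f} GB")
print(f"batches for a 2-3 TB dataset: {batches_needed[0]:.0f} to {batches_needed[1]:.0f}")
```

Under these assumptions the average slide is about 1.67 GB, which sits at the low end of the 1.6 GB to 20 GB range quoted for a single image, and a 2-3 TB dataset corresponds to roughly four to six scanner batches.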
MapReduce and the Building Blocks for Solving the Business Problem

The diagram below highlights the building blocks for solving the business problem of managing the data created by digital microscopy.

The Service Platform depicts the middleware in a pharma organization. The platform is augmented by a data acquisition component that acquires data in OEM vendor-specific formats while ensuring that data compression and security requirements are handled at the network layer. Each lab has its own workflow requirements, which orchestrate the data analysis procedures based on its study and biomedical requirements. The Reviewing/Monitoring and Reporting component helps technicians and physicians across sites to validate the decision-making process, and ensures that patients and insurance agencies receive well-documented reports that follow healthcare protocols.

[Figure: Generic Solution Platform for a Pharmaceutical Company]
MapReduce Component

Hadoop, the open source implementation of MapReduce, provides a distributed file system alongside its MapReduce component. This component processes data from multiple inputs (creating the "map"), and then reduces it using an image processing function specific to a given organ (as defined in the workflow component), which distills and extracts the desired results. The MapReduce component is designed to scale over thousands of nodes; because it is batch oriented, it tends to have high latency. GPGPU-enabled nodes allow MapReduce nodes to process large volumes of data stored in a write-once format. Hadoop provides efficient data file processing across various organs and across various sites. This enables distributed data processing without forcing data to be collected and processed in a central location. Compute Unified Device Architecture (CUDA) provides a second level of parallelism within each node, with the MapReduce framework serving as the first level.

MapReduce is a simple yet very powerful method for processing and analyzing extremely large data sets, reaching up to the multi-petabyte level.

Algorithm Component

This component represents the implementation of learning algorithms based on organ histology, as defined by a pathologist for a given animal model. GPU-based implementations can surpass the computational capabilities of multicore CPUs for such workloads, and have the potential to revolutionize the applicability of deep unsupervised learning methods. It is presumed that GPU-based machine learning tasks will be designed with the instruction-type and memory-access constraints of the GPU architecture in mind.
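The map and reduce roles described above can be illustrated with a small, single-process sketch. It is not Hadoop code and not the solution's implementation: the tile records, the `ABNORMAL_THRESHOLD` cut-off and the function names are all hypothetical, and the abnormality score is assumed to have been computed already (in the described design, that scoring would be the GPU-accelerated step inside each map task).

```python
from collections import defaultdict
from typing import Iterable, Iterator

# Hypothetical record: (organ, tile_id, abnormality_score in [0, 1]).
Tile = tuple[str, str, float]

ABNORMAL_THRESHOLD = 0.5  # illustrative cut-off, not from the paper


def map_tile(tile: Tile) -> Iterator[tuple[str, int]]:
    """Map step: emit ("organ:label", 1) for one image tile."""
    organ, _tile_id, score = tile
    label = "abnormal" if score >= ABNORMAL_THRESHOLD else "normal"
    yield (f"{organ}:{label}", 1)


def reduce_counts(key: str, values: Iterable[int]) -> tuple[str, int]:
    """Reduce step: sum the counts emitted for one key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer) -> dict:
    """Single-process stand-in for the map, shuffle and reduce phases."""
    grouped = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            grouped[key].append(value)     # shuffle: group values by key
    return dict(reducer(k, v) for k, v in sorted(grouped.items()))


tiles = [
    ("liver", "t1", 0.91),
    ("liver", "t2", 0.12),
    ("kidney", "t3", 0.40),
    ("kidney", "t4", 0.77),
    ("liver", "t5", 0.08),
]
counts = run_mapreduce(tiles, map_tile, reduce_counts)
print(counts)
# {'kidney:abnormal': 1, 'kidney:normal': 1, 'liver:abnormal': 1, 'liver:normal': 2}
```

On a real cluster the framework would run many mappers in parallel on the nodes that hold the data and perform the grouping over the network; the per-key logic stays the same, which is what lets the same code scale out.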
The Solution Architecture

The solution proposes an Enterprise Information Architecture modelled on a High Performance Computing grid that uses GPU nodes on a Hadoop MapReduce architecture.
Key Features
Flow Chart

1. Digital microscopy, with the help of scanners, generates the image.
2. Automated analysis, involving image processing, segmentation and classification engines, aids in supervised feature detection.
3. Feature identification, and its corresponding abnormal effect due to a chemical, is verified across various organs through a conclusion engine on animal models to make a correct histopathological analysis.
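The cross-organ verification in the last step can be sketched as follows. This is a minimal illustration, not the conclusion engine itself: the `Finding` schema, the `cross_verify` function and the rule "an abnormal feature must appear in at least two organs of the same animal" are assumptions made only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One classified feature from the automated analysis step (hypothetical schema)."""
    animal_id: str
    organ: str
    feature: str
    abnormal: bool


def cross_verify(findings: list[Finding], min_organs: int = 2) -> dict[tuple[str, str], list[str]]:
    """Keep abnormal features seen in at least `min_organs` distinct organs of one animal.

    The threshold rule is an assumption for illustration, not a rule from the paper.
    """
    organs_by_key: dict[tuple[str, str], set[str]] = {}
    for f in findings:
        if f.abnormal:
            organs_by_key.setdefault((f.animal_id, f.feature), set()).add(f.organ)
    return {
        key: sorted(organs)
        for key, organs in organs_by_key.items()
        if len(organs) >= min_organs
    }


findings = [
    Finding("rat-01", "liver", "necrosis", True),
    Finding("rat-01", "kidney", "necrosis", True),
    Finding("rat-01", "heart", "necrosis", False),
    Finding("rat-02", "liver", "necrosis", True),
]
confirmed = cross_verify(findings)
print(confirmed)
# {('rat-01', 'necrosis'): ['kidney', 'liver']}
```

In practice the verification criteria would come from the pathologist-defined workflow; the point of the sketch is only that findings are grouped per animal and feature before a conclusion is drawn.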
Business Opportunities

Due to security and data constraints across sites, the case study solution is an on-premise solution for pharmaceutical companies. However, enterprises in other industries are now willing to move data into private-public hybrid cloud environments once de-identification and compliance requirements are fulfilled.

Very low cost commodity hardware can be used to power MapReduce clusters, since redundancy and fault resistance are built into the software platform. This offers an alternative to expensive enterprise hardware or proprietary software solutions. A hybrid cloud solution based on OpenStack can be commercialized for this purpose. It becomes easier to add capacity, which makes the platform an affordable and very granular way to scale out instead of up.

With public cloud vendors providing options to choose GPU capabilities and computational power down to the node level, processing enormous amounts of data becomes feasible. It also enables companies to carry out detailed analysis of business data that would take too long, or be too expensive, with a traditional RDBMS.

The ability to take mountains of inbound or existing business data, spread the work over a large distributed cloud, add structure (workflow and GPU power), and import the result into an RDBMS makes this solution applicable across various industries.

Many organizations already have proven code that is tested, hardened and ready to use, but is limited without an enabling framework. The enterprise platform solution described above, together with a mature distributed computing layer, can transition these assets to a much larger and more powerful environment.
Conclusion

Graphical processors have emerged as a commodity platform for parallel computation. However, the development team needs knowledge of GPU architecture and must invest effort in performance tuning. The High Performance (GPU) & Cloud Computing Enterprise Solution uses a GPU-based MapReduce implementation which is scaled over public and private cloud. The platform will be extended in the future with data mining capabilities to use datasets shared across private and public domains.

Some academic implementations, e.g. Mars (a MapReduce framework on graphics processors), have already proven successful. Amazon.com is working aggressively to convert the same architecture into commercially available solutions. Taking projects like Mars, expanding the network fabric (over public and private cloud) and adding more power at the node level through GPUs will allow such solutions to be used across different industries.

In the future, projects will reach various levels of maturity, in which operations monitor power, cooling (for example, GreenHDFS) and reliability for a given problem and automatically orchestrate the components of a class of service based on accounting or rating rules. Rules will process the problem across a portfolio of sensors, network fabrics, storage fabrics, desktops and servers.

Massively parallel methods, combined with supervised engines that process the data intelligently, will help resolve Big Data problems across industries. By surpassing the computational capabilities of multicore CPUs, modern graphics processors have the potential to revolutionize the applicability of deep unsupervised learning methods.
About Persistent Systems

Persistent Systems (BSE & NSE: PERSISTENT) builds software that drives our customers' business; enterprises and software product companies with software at the core of their digital transformation. For more information, please visit: www.persistent.com

India
Persistent Systems Limited
Bhageerath, 402, Senapati Bapat Road
Pune 411016
Tel: +91 (20) 6703 0000
Fax: +91 (20) 6703 0009

USA
Persistent Systems, Inc.
2055 Laurelwood Road, Suite 210
Santa Clara, CA 95054
Tel: +1 (408) 216 7010
Fax: +1 (408) 451 9177
Email: info@persistent.com
References

- IDC Predictions 2012: Competing for 2020 (Doc #231720).
- David Garlan et al.: Architectural Description of Component-Based Systems. White paper, research community project at Wright Laboratory, Aeronautical Systems Center, Air Force Materiel Command, USAF.
- Dahman, K. et al.: Generation of Component Based Architecture from Business Processes: Model Driven Engineering for SOA. 2010 IEEE 8th European Conference.
- Hadoop-GIS: a High Performance Query System for Analytical Medical Imaging. Interdisciplinary biomedical research, which accelerates the diagnosis and understanding of brain tumor for better cure.
- J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.
- CUDA: http://developer.nvidia.com/nvidia-gpu-computing-documentation
- Shubin Zhang et al.: SJMR: Parallelizing Spatial Join with MapReduce on Clusters. IEEE Clusters Computing.
- http://doubleclix.wordpress.com/2011/03/17/hadoop-2-0-openstack-pbj/