Streamlining analytics and visualization infrastructure at the University of Calgary




Submitted to the Vice-President (Research) by the Advisory Committee on Analytics & Visualization (ACAV)

Big data = big opportunity

When he was a kid in 1969, James Gosling saw his first computer in a lab at the University of Calgary. He thought it was so cool that he started coming back to the computer science lab, breaking in by figuring out the simple combination lock on the door, and using one of the smallest computers, the size of a refrigerator, to teach himself how to write code. Gosling graduated from the University of Calgary with his BSc in 1977 and went on to create Java, the universal programming language that helped the Internet develop.

In the four decades since, the Internet and technological infrastructure have quite simply revolutionized our society, our lives and our work. And the revolution continues. Information technology is advancing at a staggeringly fast rate. Along with the hardware and software, the exponential proliferation of data continues: it's estimated that 90 per cent of the data in the world has been created in the last few years. And it all has to be stored, analyzed and understood so that better decisions can be made for the future.

The University of Calgary is striving to become one of the top research universities in the country. Our scholars are ever active in collecting and creating more data. Big data is creating a big opportunity for the university to organize and streamline how we manage our essential cyberinfrastructure, how we make sense of vast amounts of research data, and how we increase the impact of research results on society. But we have not only reached, but surpassed, our capacity in high-powered computing. Our inability to keep pace with technology is slowing us down. And if we don't speed it up, we will no longer be competitive.
For example, advancements in cryo-electron microscopy, which allows researchers to see individual proteins in atomistic detail, are revolutionizing structural biology and accelerating the amount of structural information available. But without sufficient HPC, our very strong concentration of experts in simulations of proteins won't be able to compete. Our researchers are losing ground to groups at other universities, and we are at risk of being unable to attract funding, excellent graduate students, postdocs and collaborators. Without increasing our HPC capacity to be competitive, some of our scholars' research programs are not viable. Consider:

The Reservoir Simulation Group at the Schulich School of Engineering needs to run data with 2 billion grid cells to accurately predict the performance of a petroleum reservoir. But that's impossible given the current compute power on campus. The group has an IBM cluster with limited memory size and speed, and so it also uses the Parallel cluster to run large-scale reservoir simulations. Parallel's memory speed isn't fast enough and it's often bottlenecked. Because many labs use Parallel, the queue to run large simulations can be weeks or more.

Libin Institute researcher Wayne Chen has discovered a protein, the ryanodine receptor, that is responsible for the initiation of calcium waves and calcium-triggered arrhythmias. This will lead to a better understanding of the molecular basis of anti-arrhythmic treatment. But we do not have sufficient HPC to analyze the protein and capitalize on our expertise. The research team was able to make temporary collaborations with institutions that do have the computing capacity.

Computational Biophysics researcher Gurpreet Singh needed to run a scaling test for a molecular simulation that required using 50 Nvidia graphics processing units (GPUs) for 20 minutes. The WestGrid/Compute Canada Parallel cluster is the only machine in Western Canada capable of running the test, and it is booked solid. Luckily, the cluster was scheduled for an annual operating system upgrade and Singh's job was allowed to run. But without an outage of some kind, our researchers would have had no chance to access the machine in the near future.

Mark Lowerison of the Clinical Research Unit (CRU) was also able to take advantage of WestGrid down times. He needed to run 300,000 simulations using the R statistical package within a three-week period. Each run needed five to 90 minutes. Normally, this would be impossible. In this case, the annual upgrade and subsequent down time created spare cycles on the three WestGrid clusters.

Plugging in to high performance computing (HPC)

Having significant compute and storage capacity for all researchers at the university (high performance computing) is every bit as essential as having sufficient electricity to power our operations. We have a state-of-the-art cogeneration plant to help provide our main campus with power.
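The CRU workload above amounts to a sizeable core-hour budget; a quick sketch makes the scale concrete. The midpoint average runtime used below is our assumption for illustration, not a measured figure from the report.

```python
# Back-of-the-envelope core-hour arithmetic for the CRU workload:
# 300,000 R runs, each taking 5 to 90 minutes.

RUNS = 300_000
MIN_MINUTES, MAX_MINUTES = 5, 90
avg_minutes = (MIN_MINUTES + MAX_MINUTES) / 2   # assumed average: 47.5 min

total_core_hours = RUNS * avg_minutes / 60
print(f"Total serial compute: {total_core_hours:,.0f} core-hours")

# Cores that would have to run continuously to finish in three weeks:
window_hours = 3 * 7 * 24
cores_needed = total_core_hours / window_hours
print(f"Cores needed for a three-week window: {cores_needed:,.0f}")
```

Under these assumptions, the job needs on the order of hundreds of cores running flat out for the full three weeks, which is why it was only feasible in the spare cycles created by a cluster outage.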
It's time we had a strategy to ensure we have access to leading-edge compute technology. Our current cyberinfrastructure is at capacity. As individual researchers and projects apply for funding and procure technology, we're not always acquiring the technology that would best serve the entire university community, nor are we creating a sustainable support infrastructure.

The three WestGrid clusters on campus are already over-committed to researchers from the Compute Canada catchment area. We have little room to maneuver except when an outage is scheduled (as the examples above illustrate). The WestGrid machines are due for retirement starting in 2016, but replacement clusters (which still would need to be funded) will only replace the current cycles; they will not add much-needed supplemental capacity. And the new Compute Canada clusters will not be placed on campus but at other institutions, reducing our ability to access downtime cycles.

The global trend in HPC is toward sharing services, and we see Campus Alberta capacity as a prerequisite for many projects. With very few exceptions, having access and control is more important than the physical location of any equipment.

It's imperative that we are competitive. We have to ensure that our research community has the capacity it requires. We have to be more strategic in how and what cyberinfrastructure we acquire, as well as how we make it accessible to researchers on campus and beyond. The Advisory Committee on Analytics & Visualization (ACAV) was struck in 2013 by the Vice-President (Research) to examine the university's growing requirements for technology, our current methods of procuring technology (through research grant proposals and other means), and to identify and explore opportunities for improvement.

Strategic recommendations for the path forward

Take inventory of all major analytics and visualization equipment across campus and define clear paths to access and support it. Survey faculties and departments to provide a clearer picture of cyberinfrastructure demands, requirements and bottlenecks. The survey could also help build an asset map that incorporates research interests. As well as the hardware and infrastructure, we need sufficient support and the correct expertise to manage it. These expenses should be considered an annual operational expenditure, not a capital one. The level of cyberinfrastructure service provided needs to be evaluated regularly to ensure sufficient capacity.

Clearly define points of contact to manage and access cyberinfrastructure on and off campus, including senior-level research personnel to coordinate initiatives across faculties, a UCIT contact person for system architecture, and data scientists. Access and control of HPC resources is more important than physical location. Deployment of new cyberinfrastructure in non-UofC data centres could be used if it is cost effective.

Ensure major cyberinfrastructure grants are coordinated centrally. The university needs to provide a high level of service and cover partial costs to encourage researchers to contribute to a shared infrastructure. Researchers successful in major infrastructure grant proposals will have priority access to the equipment, while surplus capacity is made available to other researchers on campus.

Develop an institutional strategy to catalogue and curate a research data library and make it available for future research projects. The research data archive can also facilitate sharing data, which would enable collaborations with people inside and outside the university. Different researchers and research groups would have easy access to data in a controlled and secure manner.
Create a Digital Data Commons (DDC), a physical space, as a nexus for collaboration:

o Sharing space will facilitate face-to-face collaboration between analytics and visualization research groups and application researchers. Core analytics/vis researchers can be hosted in this space and be joined by other researchers with needs that could be addressed by big data analytics.
o The Digital Data Commons should have oversight of shared digital research infrastructure (i.e. HPC, data analytics hardware, curated archives of research data).
o Support personnel should be part of the DDC, and research leadership needs to be provided (i.e. an academic director).

Develop benchmarks in 2015/16 that would demonstrate progress on access to HPC/data analytics capabilities.

Strategic recommendations for infrastructure investment

Invest in general research data storage and archiving. This would benefit many groups across campus and would also help meet requirements of tri-council and other funding bodies to keep research data for five years or longer.

Invest in an analytics cloud system, such as one based on Hadoop. To overcome our currently limited capacity, we need to engage with partner organizations to increase capabilities. Support personnel and data scientists need to be made available to all researchers.
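To make the analytics-cloud recommendation concrete, the programming model behind platforms like Hadoop can be sketched in plain Python: map each record to key/value pairs, shuffle by key, then reduce each group. The word-count task, the sample records and all function names here are illustrative only, not part of any campus system.

```python
# Minimal sketch of the MapReduce model (no Hadoop dependency).
from collections import defaultdict

def map_phase(records):
    # Emit a (word, 1) pair for every word in every record.
    for record in records:
        for word in record.lower().split():
            yield word, 1

def shuffle(pairs):
    # Group all values by key, as the framework's shuffle step would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Combine each key's values into a single result.
    return {key: sum(values) for key, values in groups.items()}

records = ["big data", "big analytics", "data analytics platform"]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts)
```

The appeal for a shared campus system is that each phase parallelizes independently: map tasks and reduce tasks can be spread across however many cluster nodes are available.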

Invest in supercomputers that combine GPUs or other accelerators with more standard CPUs. Canada lags significantly in this area, with no large GPU-based HPC facilities. Storing results in separate facilities and moving them to local equipment for analysis is no longer adequate where data analysis requires the power of the HPC facility combined with substantial storage for intermediate and final results (e.g. genomics, large-scale biomolecular simulation, whole-cell simulations, other computational biology, materials research, large-scale geospatial modeling).

Invest in next-generation sequencing. Its related techniques already present severe challenges for storing data in genomics and bioinformatics. Certain datasets have specific security and privacy concerns, particularly in the growing field of patient-related data. Sequencing data could soon be linked with patient data to enable personalized medicine, creating specific requirements for secure storage.

Invest in large-scale visualization equipment for making sense of vast amounts of data and inviting opportunities for collaborative work. Upgrade two existing major visualization facilities (CCIT and TFDL), including more powerful graphics hardware driving the CCIT projectors. Invest in software such as TechViz XL to allow seamless integration of key software applications (such as Petrel, Matlab and Paraview) with our virtual reality environments.

Further demand may exist for high-throughput streaming analytics capacity, which would be of use for oil, gas and health. An important application area has been smart cities and analytics problems related to managing the cities of the future.

Future of ACAV: Strategic recommendations

Identify co-chairs who can act as the first point of contact for researchers and senior administration. ACAV becomes a strategic oversight committee that will, among other things, review strategic recommendations every few years.
Sponsor domain-specific networking events that connect researchers across campus with complementary research interests. Examples could be "Informatics for the life sciences" or "Energy analytics".
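The sequencing-storage pressure noted in the infrastructure recommendations above can be sized with rough arithmetic. All figures below are assumptions for illustration: roughly 100 GB of raw FASTQ per 30x whole human genome, a hypothetical throughput of 500 samples per year, and a five-year retention period in line with tri-council expectations.

```python
# Back-of-the-envelope storage estimate for next-generation sequencing.
GB_PER_GENOME = 100      # assumed raw FASTQ size at 30x coverage
SAMPLES_PER_YEAR = 500   # hypothetical campus-wide throughput
RETENTION_YEARS = 5      # tri-council-style retention period

total_tb = GB_PER_GENOME * SAMPLES_PER_YEAR * RETENTION_YEARS / 1000
print(f"Raw storage over {RETENTION_YEARS} years: {total_tb:.0f} TB")
```

Even under these modest assumptions, raw reads alone reach hundreds of terabytes before any intermediate or derived data, which is why storage and archiving appear as a first-order investment rather than an afterthought.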

ACAV committee members 2014-2015

Names: Frank Maurer (co-chair), Sam Wiebe (co-chair), Carey Williamson, Christopher Hugenholtz, Deborah Marshall, Jason de Koning, Karen Bourrier, Kim Koh, Laleh Behjat, Loren Falkenberg, Michael Ranelli, Michael Ullyot, Parsa Samavati, Paul Galpern, Peter Tieleman, Robin Winsor, Sergei Noskov, Sheelagh Carpendale, Stafford Dean, Steve Liang, Thomas Hickerson

Faculties/Departments represented: Medicine/CRU, Geography, Community Health Sciences, Genomics/ACRI, Arts, Education, ENEL, Haskayne, UCIT, Arts Undergrads, EVDS, Bio, Cybera, Biological Sciences, AHS, DIMR, Geomatics, Library