Scaling up to Production

Overview
  - Productionize then Scale
  - Building Production Systems
  - Scaling Production Systems
  - Use Case: Scaling a Production Galaxy Instance
  - Infrastructure Advice
PRODUCTIONIZE THEN SCALE

Productionize then Scale
  - Productionize: put into operation for end users
  - Scale up: ability to manage increasing workloads
  - The characteristics of the production process affect how you scale up
Whole Genome Sequencing
  - WGS for humans at 60x coverage
  - ~200 GB of read data
  - ~80 MB of variant data
  - 3 hours of compute (e.g., Intel's highly optimized algorithms)
  Source: Gullapalli et al., Next generation sequencing in clinical medicine: Challenges and lessons for pathology and biomedical informatics, Journal of Pathology Informatics, 2012, Volume 3, Issue 1 [p. 40]
WGS in the Clinic
  - Patients, physicians, clinics, insurance, test lab, regulators, developers, EMR
  Source: Gullapalli et al., Next generation sequencing in clinical medicine: Challenges and lessons for pathology and biomedical informatics, Journal of Pathology Informatics, 2012, Volume 3, Issue 1 [p. 40]
WGS at Scale
  - Every newborn: 140 million/yr
  - PER YEAR, at ~200 GB of reads, ~80 MB of variants, and 3 compute hours per genome:
      - 28,000 PB of read data
      - 11.2 PB of variant data
      - 4.2 x 10^8 hours of compute
  - Other applications
      - Cancer T/N, time series, multi-tissue
      - Familial studies
  Source: Gullapalli et al., Next generation sequencing in clinical medicine: Challenges and lessons for pathology and biomedical informatics, Journal of Pathology Informatics, 2012, Volume 3, Issue 1 [p. 40]
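The yearly totals follow from multiplying the per-genome figures by the number of newborns. A minimal sketch of that arithmetic, assuming decimal units (1 PB = 10^6 GB) and using the per-genome values quoted on the Whole Genome Sequencing slide; the function and constant names are illustrative only:

```python
# Per-genome figures quoted on the Whole Genome Sequencing slide.
READ_DATA_GB = 200        # ~200 GB of read data per genome
VARIANT_DATA_MB = 80      # ~80 MB of variant data per genome
COMPUTE_HOURS = 3         # ~3 hours of compute per genome
NEWBORNS_PER_YEAR = 140_000_000


def yearly_totals(genomes):
    """Return (read data in PB, variant data in PB, compute hours).

    Assumes decimal units: 1 PB = 1e6 GB = 1e9 MB.
    """
    read_pb = genomes * READ_DATA_GB / 1e6
    variant_pb = genomes * VARIANT_DATA_MB / 1e9
    hours = genomes * COMPUTE_HOURS
    return read_pb, variant_pb, hours


read_pb, variant_pb, hours = yearly_totals(NEWBORNS_PER_YEAR)
print(f"{read_pb:,.0f} PB reads, {variant_pb:.1f} PB variants, {hours:.1e} compute hours")
```

With 3 hours per genome, the compute total works out to 420 million hours per year.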
BUILDING PRODUCTION SYSTEMS

Building Production Systems: Overview
  - Understanding Users
  - Managing Data
  - Assessing Applications
  - Additional Considerations
  - Optimizing the Process
  - Managing Change
Understanding Users
  - User profiles: end users, support staff, finance, etc.
  - Use cases
  - Number of users
  Tip: create a quick matrix of user profiles and the number of users within each profile, currently and in 3 years
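The matrix in the tip can be as simple as a small table kept in a script. The sketch below is one possible layout; the profile names and head counts are hypothetical placeholders, not figures from this deck.

```python
# Hypothetical profiles and head counts; replace with your own estimates.
user_matrix = {
    "end users":     {"now": 25, "in_3_years": 120},
    "support staff": {"now": 3,  "in_3_years": 8},
    "finance":       {"now": 2,  "in_3_years": 4},
}


def total(matrix, column):
    """Sum one column ("now" or "in_3_years") across all profiles."""
    return sum(row[column] for row in matrix.values())


def print_matrix(matrix):
    """Print the matrix as an aligned plain-text table with a totals row."""
    print(f"{'profile':<15}{'now':>8}{'in 3 yrs':>10}")
    for profile, row in matrix.items():
        print(f"{profile:<15}{row['now']:>8}{row['in_3_years']:>10}")
    print(f"{'total':<15}{total(matrix, 'now'):>8}{total(matrix, 'in_3_years'):>10}")


print_matrix(user_matrix)
```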
Managing Data
  - Data amount
  - Data access
  - Data characteristics
  - Data locality
  Tip: write down an estimate for the low and high bound of
      - data coming into the system, and from where
      - data generated by the system
      - data being delivered from the system, and to where
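A rough way to record these bounds is to keep a (low, high) pair per data flow and roll them up to a yearly range. This is only a sketch: the flow descriptions and the monthly TB values are hypothetical, and `yearly_range` is an illustrative helper.

```python
# Hypothetical monthly estimates in TB, as (low, high) per data flow.
flows = {
    "incoming (from sequencers)": (5, 20),
    "generated (intermediate files, results)": (10, 60),
    "delivered (to users, archives)": (1, 5),
}


def yearly_range(flows, months=12):
    """Return (low, high) total TB handled per year across all flows."""
    low = months * sum(lo for lo, _ in flows.values())
    high = months * sum(hi for _, hi in flows.values())
    return low, high


low, high = yearly_range(flows)
print(f"Data handled per year: {low} to {high} TB")
```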
Assessing Applications
  - Application types
  - Application requirements
  - Application characteristics
  Tip: document a few runs to get an idea of the memory use, CPU use, runtime, etc. of an application
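On a Unix-like system, one way to document a few runs is to wrap the application in a short script that records wall time, CPU time, and peak memory. The sketch below uses only the Python standard library; `profile_run` is a hypothetical helper and the command shown is a placeholder for the real application.

```python
import resource
import subprocess
import sys
import time


def profile_run(cmd):
    """Run cmd once and return its exit code, wall time, CPU time, and peak memory.

    Unix-only: relies on the standard-library `resource` module.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    completed = subprocess.run(cmd, check=False)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "returncode": completed.returncode,
        "wall_seconds": wall,
        # User plus system CPU time consumed by child processes during this call.
        "cpu_seconds": (after.ru_utime - before.ru_utime)
        + (after.ru_stime - before.ru_stime),
        # Peak resident set size of the largest child so far
        # (kilobytes on Linux, bytes on macOS).
        "peak_rss": after.ru_maxrss,
    }


# Placeholder command: substitute the real application invocation.
stats = profile_run([sys.executable, "-c", "sum(range(1_000_000))"])
print(stats)
```

Repeating this for a few representative inputs gives a first picture of whether an application is CPU-, memory-, or IO-heavy.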
Regulatory Compliance
  - Revolves around organizational policies to enforce best practices
  - ISO, Good Clinical Practice, Safe Harbor, CLIA, FDA, EMR/EHR, EMA
  - Significant time and financial investment is required to achieve regulatory compliance
Optimizing the Process
  - Evaluate users, data, applications, and other considerations individually and as a whole
  - What are the bottlenecks?
  - What can be optimized?
  - What can be automated?
Optimizing the Process
  - Bottlenecks
      - CPU bound
      - IO bound
      - Memory speed
      - Memory size
      - Network bandwidth
      - Network latency
  - Optimization
      - Instrument capture & data movement
      - Application
      - CPU
      - Memory
      - Input/Output
      - Networking
      - Options
      - Program Translation
  - Automation
      - QC
      - Business rules
      - Testing
      - Notifications
      - Workflows
      - Error handling
      - Reporting
Optimizing the Process: WGS Example
  - Network from sequencer to storage
  - Storage space for data
  - Available RAM and compute
  - Accessibility of data to compute resources
  - Secure storage and data transfer
  - Parallelize mapping and variant calling
  - Automate QC metrics
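As an illustration of the parallelization point, independent work units can be dispatched concurrently. The sketch below assumes the work is split per chromosome, which is only one possible strategy; `call_variants` is a hypothetical placeholder and does not invoke any real tool.

```python
from concurrent.futures import ThreadPoolExecutor

# Human chromosomes used as independent work units (an assumption; real
# pipelines may split by smaller regions or by read batches instead).
CHROMOSOMES = [f"chr{n}" for n in range(1, 23)] + ["chrX", "chrY"]


def call_variants(region):
    """Hypothetical placeholder for one variant-calling job.

    In practice this would launch an external tool (for example via
    subprocess) restricted to `region` and return its output path.
    """
    return f"{region}.vcf"


def run_parallel(regions, max_workers=8):
    # Threads suffice for orchestration because the heavy work would run in
    # external processes; map() returns results in the order of `regions`.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call_variants, regions))


outputs = run_parallel(CHROMOSOMES)
print(len(outputs), "outputs, first:", outputs[0])
```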
Managing Change
  - We are in a dynamic field
  - Tools will change, metrics will be refined, file formats will evolve, new regulations will be made
  - Key concepts often forgotten
      - Modularization
      - Interoperability
      - Usability
SCALING PRODUCTION SYSTEMS

Scaling Production Systems
  - When to Scale
  - How to Scale
  - Infrastructure
When to Scale
  - BE PROACTIVE, NOT REACTIVE
  - Forecast increases in
      - number of users
      - number of jobs
      - computational intensity of jobs
      - amount of data
  - Periodic reassessment of forecasts
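A forecast does not need to be elaborate to be useful. The sketch below projects each quantity forward under constant compound annual growth, which is a simplifying assumption; the starting values and growth rates are hypothetical and should be replaced and periodically revisited.

```python
def forecast(current, annual_growth, years):
    """Project a quantity forward assuming constant compound annual growth.

    Returns a list with the value for year 0 (today) through `years`.
    """
    return [current * (1 + annual_growth) ** y for y in range(years + 1)]


# Hypothetical starting points and growth rates: (current value, growth per year).
metrics = {
    "users": (50, 0.30),
    "jobs_per_month": (2_000, 0.50),
    "storage_tb": (100, 1.00),  # doubling every year
}

for name, (current, growth) in metrics.items():
    projection = forecast(current, growth, years=3)
    print(name, [round(v) for v in projection])
```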
How to Scale
  - Factors to consider when scaling
      - Number of users
      - Number of jobs
      - Types of jobs
      - Amount of data
Number of Users
  - Access control
  - User account management
  - Resource allocation
  - Prioritization
  - Individual usage monitoring/tracking
Number of Jobs
  - Job submission management
      - Job queues
      - Priorities
      - Status information
  - Ability to appropriately use resources for jobs
      - Load balancing
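Production sites normally rely on a dedicated job scheduler for this, but the core ideas of queues, priorities, and status information fit in a few lines. The sketch below is a toy in-memory queue for illustration only; `JobQueue` and the job names are hypothetical.

```python
import heapq
import itertools


class JobQueue:
    """Toy priority queue: lower number = higher priority, FIFO within a priority."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps submission order
        self.status = {}                   # job name -> "queued" or "running"

    def submit(self, name, priority=10):
        heapq.heappush(self._heap, (priority, next(self._counter), name))
        self.status[name] = "queued"

    def next_job(self):
        """Pop the highest-priority job and mark it running; None if the queue is empty."""
        if not self._heap:
            return None
        _, _, name = heapq.heappop(self._heap)
        self.status[name] = "running"
        return name


queue = JobQueue()
queue.submit("align_sample_A", priority=5)
queue.submit("clinical_rerun", priority=1)
queue.submit("align_sample_B", priority=5)
order = [queue.next_job() for _ in range(3)]
print(order)
```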
Types of Jobs
  - Memory intensive
  - IO intensive
  - Compute intensive
  - Optimized/custom applications
Amount of Data
  - Tiered storage
      - Handles different levels of availability
      - Cost trade-offs
      - Implement data management policy
  - Data transfer
      - Network bandwidth and latency
      - Data movement accelerators
  - Data ingestion
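A data management policy for tiered storage often comes down to a rule that maps how recently data was used to a storage tier. The sketch below shows one such rule; the tier names and age thresholds are hypothetical and would be set by each organization's availability and cost requirements.

```python
from datetime import date

# Hypothetical policy: (maximum days since last access, tier name).
TIERS = [(30, "fast"), (365, "capacity")]
ARCHIVE_TIER = "archive"  # anything older than the last threshold


def choose_tier(last_access, today):
    """Return the tier a dataset belongs on, based on days since last access."""
    age_days = (today - last_access).days
    for max_age, tier in TIERS:
        if age_days <= max_age:
            return tier
    return ARCHIVE_TIER


today = date(2024, 1, 15)
print(choose_tier(date(2024, 1, 1), today))
print(choose_tier(date(2023, 6, 1), today))
print(choose_tier(date(2020, 1, 1), today))
```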
Infrastructure
  - Technology
      - Fat nodes & appliances
      - Rackmounts and blades
      - Scale-out architectures
      - Shared memory architectures
  - Network requirements
CASE STUDY: SCALING A PRODUCTION GALAXY INSTANCE

Scaling a Production Galaxy Instance
  - Galaxy Overview
  - Considerations for Local Installation
  - Scaling Galaxy
Galaxy Overview
  - Galaxy is an open, web-based platform for data-intensive biomedical research
Considerations for Local Installation
  - Hosting Galaxy Locally
  - Managing the Software
  - Supporting the Users
Hosting Galaxy Locally
  - Host server must have sufficient storage because Galaxy needs direct access to data
      - Input data, intermediate data, results, metadata, static data resources
  - Host server must have a sufficient network connection
  - Host server needs sufficient compute resources
      - Analysis tools
Hosting Galaxy Locally
  - Personal workstation
      - Flexibility that comes with self-management
      - Very limited resources, requires know-how
  - Local shared cluster
      - Better ROI on upfront install effort investment
      - A lot of support and management overhead
  - Appliance
      - Dedicated high-performance server
      - Automated software management
      - Leverages other infrastructure to scale
Managing the Software
  - Galaxy software versions
  - Analysis tool versions
  - Software dependencies
  - Performance optimization
Supporting Users
  - Number of users
      - Manage job submission
      - Manage user accounts
      - Setting up quotas
  - Support services
Scaling Galaxy
  - Software
  - Hardware
      - Storage
      - Compute
      - Network
Software
  - Job scheduler: Grid Engine
  - Galaxy database: PostgreSQL, MySQL
  - Proxy server: Apache
  - Optimize configurations
Hardware: Storage
  - Storage local to the host server must be large enough to store all data handled by Galaxy
  - Increase local capacity
  - Network more storage
  - Port instance to machine with more storage
  - Backup
  - Move data off local storage to make room
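Whether local storage is "large enough" can be checked before a large dataset arrives. The sketch below uses the standard-library `shutil.disk_usage`; the helper name `has_headroom`, the 10% reserve, and the path are assumptions for illustration, not Galaxy settings.

```python
import shutil


def has_headroom(path, incoming_bytes, reserve_fraction=0.10):
    """Return True if the filesystem holding `path` can absorb `incoming_bytes`
    while still keeping `reserve_fraction` of its total capacity free."""
    usage = shutil.disk_usage(path)
    reserve = usage.total * reserve_fraction
    return usage.free - incoming_bytes >= reserve


# Example: would one ~200 GB read set fit on the filesystem holding "."?
print(has_headroom(".", incoming_bytes=200 * 10**9))
```

When the check fails, the options on the slide apply: add or network more storage, move the instance, or move older data off local storage.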
Hardware: Compute
  - Host must have access to sufficient compute
  - Port Galaxy to a more powerful host
  - Build additional computational resources on the Galaxy host
  - Leverage job scheduler to span resources
  - Burst to the cloud or other clusters
Hardware: Network
  - Connectivity to data sources
  - Internal network
      - Connection to HTP data generation instrument
  - External network
      - Supporting a global user base
  - Transfer protocols
      - Optimized tools for transferring data
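A quick estimate of transfer time helps decide whether a link is adequate. The sketch below converts a data size and nominal link speed into hours; the `efficiency` factor (the share of nominal bandwidth actually achieved) is an assumed parameter, and the 200 GB example size is the per-genome read data figure from earlier in the deck.

```python
def transfer_hours(size_gb, link_gbps, efficiency=0.7):
    """Estimate hours needed to move `size_gb` gigabytes over a `link_gbps` link.

    `efficiency` is the assumed fraction of nominal bandwidth achieved in
    practice (protocol overhead, contention); 0.7 is a placeholder value.
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be > 0 and efficiency in (0, 1]")
    size_gigabits = size_gb * 8  # bytes to bits
    seconds = size_gigabits / (link_gbps * efficiency)
    return seconds / 3600


for gbps in (1, 10):
    print(f"200 GB over {gbps} Gb/s: {transfer_hours(200, gbps):.2f} h")
```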
INFRASTRUCTURE ADVICE

Infrastructure Advice
  - Science is changing faster than we can refresh IT
  - Consider future flexibility as much as current needs
  - Avoid things that lock you into a vendor or platform
  - Continually evaluate your default assumptions
Infrastructure Advice
  - Physical and network data ingestion
      - Think about edge cases and the unexpected
      - Don't go crazy with upfront investment
  - Compute and analysis
      - Pay attention to areas that need optimization for your operations
Infrastructure Advice
  - Cloud strategy
      - Spend time to develop policies and procedures, assess risk, etc.
      - Consider laying the technical groundwork now so it is easier to make use of the cloud when needed
Infrastructure Advice
  - Storage and data management
      - Spend the bulk of attention and budget here
      - Understand the diversity of products and features to minimize risk and mistakes
      - Define your storage approach to match your organization's funding and staffing model
Final Takeaway
  - If you think big data is here now...
  - Know your key systems, technologies, and bottlenecks
  - Researchers and IT must work together to build environments that enable science
Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED AS IS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.