Geoclustering Git. Delivering Performance and Reliability When Using Git for Global Development Teams. Brett Taylor, Go2Group October 2015




TABLE OF CONTENTS

Introduction
GIT: the fastest growing version control system
Inherent value
Challenges
Achieving enterprise-class resiliency with Git
Clustering architectures
Types of clustering
Enterprise Git: Atlassian's Bitbucket and Bitbucket Data Center
Bitbucket's geographic limitations
Go2Group's Geoclusters
Overview
Data Flow
Bitbucket high availability options
Benefits of geoclustered architecture
Performance data
Conclusion
Contacting Go2Group
Go2Group and GSA
Contact
Notice

Introduction

Git is the fastest-growing version control system, but few Git systems meet enterprise requirements for performance and reliability, especially when deployed by globally distributed software development teams. In this white paper, we take a closer look at how geoclustering Git (placing clustered instances on multiple servers at multiple locations) guarantees availability and enhances performance by sharing the workload and preventing outages. Go2Group's Geoclusters for Atlassian Bitbucket provides the always-on, always-available experience modern enterprises demand.

Smart companies recognize faulty code as a significant business risk. In one of the biggest outages of 2014, cloud storage company Dropbox experienced a global outage when a bug in an upgrade script tried to reinstall an operating system on an active machine. As a result, IT's focus on fail-safe structure has moved down the stack from the network to the application server, and the developer's application code is now a focus as well.

Software development teams depend on version control systems to improve the product lifecycle delivery process. Git has become the fastest-growing and most widely distributed version control system on the market, and Atlassian's Bitbucket has become one of the most popular Git repository management products for the always-on enterprise.

As a proven architecture for both local and remote clusters for Bitbucket, Go2Group Geoclusters allows any company to benefit from a fully supported global mirroring solution for Bitbucket. Geoclusters provide redundant local Bitbucket mirrors for the best possible performance and an additional level of availability protection for intellectual property. Because this architecture can span data centers around the world, distances over 100 kilometers become viable.

Geoclusters build on Git's native mirroring capability to provide local performance speeds at remote sites, clustering to support continuous integration (CI) farms, and multiple copies of critical source code as part of a comprehensive disaster recovery (DR) solution. In this paper we'll discuss the use of Git as a version control system, achieving Git resiliency, incorporating Go2Group Geoclusters, and performance data.

GIT: the fastest growing version control system

Version control systems are key for any organization that develops software because software development is rarely a solitary effort. Modern development requires large amounts of data, and there's an ongoing demand for developers to version all information required for the release of a product. Git meets that demand.

Created in 2005 by Linus Torvalds, the father of Linux, Git is the fastest growing version control system. As of May 2014, 42.9% of all software developers used either Git or GitHub as their primary source control system, according to the Eclipse Community Development Survey. As of June 2015, GitHub had over 10,000,000 users. Thirty-three percent of respondents to a 2014 Forrester Consulting enterprise survey indicated that 60% or more of their code was stored and managed by Git-based systems.

Inherent value

Git gives developers local repositories, so they can make changes and create branches locally. Because Git is inherently a distributed version control system, developers can use it on shared projects whose workflow differs from that of a centralized version control system. Often, separate repositories are used to model more stable branches, and whoever maintains the more stable repository pulls completed work from contributors' repositories.

While all distributed version control systems provide some degree of disconnected operation, a major benefit of Git is its ability to work where network connectivity is unreliable or unavailable. The value of disconnected operation depends on how many of the developers on a project regularly work while disconnected, how frequently they do so, and for how long. Successful businesses seldom work in single-site isolation. Forrester Research describes this as the "extended enterprise," where employees are expected to perform their jobs anytime and anywhere. The more a project involves being disconnected, the more value a Git system provides.

Of course, the ability to work while disconnected is not the only benefit of having a local repository. Central repositories frequently become congested during integration, especially on large projects. In this case, speed of operation depends on how many people are trying to integrate at the same time, the number of conflicts, and the strength of the version control system's merging capabilities.

However, when the time comes to share work in an enterprise setting, all changes must eventually flow back into the central repository.

Challenges

If a developer is working in India and the master repository is in California, for example, every push suffers from latency due to network delays. And since Git doesn't keep remote mirrors in sync automatically out of the box, even read operations, like clones and pulls, can be slow. The same problems hold true for a master repository that supports a large number of concurrent users.

While Git is distributed and doesn't have the vulnerabilities of centralized version control systems, it still has some shortcomings, including:

- Access control: To cater to geographically dispersed teams, Git allows access to all parts of a company's source code. It can authenticate users, but local mirrors do not automatically apply Bitbucket's authorization rules. That is, Git can verify that users are who they claim to be but has no way to ensure that those users have the right to access a given resource.
- Backup and recovery: Procedures in Git must discover and account for all important repositories. All distributed version control systems require a comprehensive backup/recovery system to avoid outages if the central repository goes down. The centralized master repository feeds the build automation, code review, and other ALM systems.
- Centralized usage: While Git allows great freedom in the use of local branches and repositories, a central repository is still the focal point of collaboration. A centralized model poses performance bottlenecks for remote teams and scalability bottlenecks for larger sites.
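The latency penalty described at the start of this section can be estimated with simple arithmetic. The sketch below is illustrative only: it assumes signals travel through optical fiber at roughly 200,000 km/s (about two-thirds of the speed of light in a vacuum), ignores routing and queuing delays, and uses hypothetical distances and a hypothetical count of ten round trips per operation. The function names are invented for this example.

```python
# Rough lower bound on network round-trip time (RTT) over optical fiber.
# Assumption: signal speed in fiber is about 200,000 km/s; real links are
# slower because of routing, queuing, and protocol overhead.
FIBER_SPEED_KM_PER_S = 200_000


def min_rtt_ms(distance_km: float) -> float:
    """Best-case round-trip time in milliseconds for a one-way distance."""
    return 2 * distance_km / FIBER_SPEED_KM_PER_S * 1000


def min_operation_ms(distance_km: float, round_trips: int) -> float:
    """Best-case network wait for an operation that needs several round trips."""
    return round_trips * min_rtt_ms(distance_km)


if __name__ == "__main__":
    # Hypothetical one-way distances: metro, regional, and intercontinental.
    for km in (50, 100, 13_000):
        print(f"{km:>6} km: RTT >= {min_rtt_ms(km):.1f} ms, "
              f"10 round trips >= {min_operation_ms(km, 10):.1f} ms")
```

Even under these best-case assumptions, an operation that needs several round trips over an intercontinental link waits on the network for more than a second, while the same operation against a nearby server waits only a few milliseconds.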

Achieving enterprise-class resiliency with Git

Forrester describes continuous availability as the point at which high availability and disaster recovery become one and the same. An easy, automated process simplifies disaster recovery, reduces administration and application recovery times, facilitates business continuity, and minimizes user impact. As part of a comprehensive layered availability strategy, enterprises choose to rely on replicated data kept up to date in near real time.

Clustering architectures

While many components are required to achieve continuous availability, the most appropriate technology for an always-on, always-available Bitbucket is a clustered architecture. This solution provides multiple redundant copies of critical data, with either centralized or independent management of related metadata.

Clusters were devised over 50 years ago, when it became clear that some workloads could no longer fit on a single computer. A cluster is a set of servers viewed as a single system that, together, provide a more available and scalable platform for hosting applications. With clustering, work can be done in parallel. The goal of a cluster is to pool the resources of several servers while achieving high availability and sustained performance. As distributed solutions, clusters are often harder to set up and maintain than their centralized counterparts. However, they offer more resilience to failure and allow systems to grow beyond the capacity of a single server.

How does a clustered architecture augment distributed version control systems with regard to distance, latency, and the degree of protection?

- All clustered servers participate equally in servicing user requests and other processing, so the read load (typically 90% or more of the load on a Git server) is evenly balanced and distributed among the servers.
- If one server goes down, failover to the other servers happens automatically, without manual intervention, typically within seconds.
- When a new server joins (or an existing one rejoins) the cluster, it begins to service user requests and other processing automatically as soon as it comes online.
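The three behaviors above (balanced reads, automatic failover, and automatic rejoin) can be modeled in a few lines. This is a toy sketch, not how Bitbucket Data Center or any specific load balancer is implemented: the `Cluster` class, its method names, and the node names are hypothetical, and health is set by hand rather than detected by a real health check.

```python
from itertools import cycle


class Cluster:
    """Toy model: round-robin reads across healthy nodes, with failover."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._ring = cycle(self.nodes)  # endless round-robin over all nodes

    def mark_down(self, node):
        """Record a failed node so reads stop being routed to it."""
        self.healthy.discard(node)

    def mark_up(self, node):
        """A rejoining node starts receiving reads again immediately."""
        if node in self.nodes:
            self.healthy.add(node)

    def route_read(self):
        """Return the next healthy node, skipping any node that is down."""
        for _ in range(len(self.nodes)):
            node = next(self._ring)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


if __name__ == "__main__":
    cluster = Cluster(["node-1", "node-2", "node-3"])  # hypothetical names
    print([cluster.route_read() for _ in range(3)])
    cluster.mark_down("node-2")
    print([cluster.route_read() for _ in range(4)])
```

Round-robin is only one possible balancing policy; it is used here because it makes the even distribution of reads easy to see.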

Types of clustering

Examples of high availability clusters [1], their attributes, and geographies include:

- Local cluster: A single set of servers located at one data center or location. Network latency is negligible. Data is accessed synchronously by all servers.
- Metro cluster: A set of servers placed within a metro distance (generally up to 50 kilometers), with all sites connected by fiber. Network latency is usually low (<5 ms for distances of approximately 20 miles). Data is frequently replicated, either with mirroring or synchronous replication.
- Geocluster: Multiple geographically dispersed sites, each with a local cluster, that are thousands of kilometers apart. The sites communicate via IP. Geoclustering keeps multiple instances of servers, so it doubles as high availability redundancy while also offering performance benefits, since a cluster is local to each team. Geoclusters need to cope with limited network bandwidth and high latency. Data is replicated asynchronously. Most of the servers are not local and are set up with some distance between them.

[1] Clusters of this kind have been referred to by many names, including local clusters, campus clusters, metro clusters, geo clusters, stretched clusters, and extended clusters.

In geographies where systems are too far apart, communication between multiple sites must be done asynchronously. When specifying a solution for dispersed geographies, considerations need to include:

- how to make sure that a cluster is up and running
- how to make sure that resources are only started once
- how to manage failover between sites
- how to deal with high latency in the event that resources need to be stopped
- how to ensure a workload will be restarted on another cluster in a far removed location in the event of a catastrophe

Enterprise Git: Atlassian's Bitbucket and Bitbucket Data Center

As with all types of software, there are many flavors of Git. The Git-based products used by developers include Atlassian Bitbucket, CollabNet TeamForge, GitHub Enterprise, GitLab, and WANdisco Git MultiSite.

Atlassian Bitbucket was released in 2012. It serves as Atlassian's Git repository management tool for enterprise teams and allows everyone in an organization to easily collaborate on Git repositories. Atlassian released Bitbucket Data Center in 2014 to provide further scalability and resiliency, with enterprise workloads in mind. Furthermore, Atlassian integrated two of its own tools into Bitbucket Data Center to speed the development process: the JIRA bug tracking software and the Bamboo continuous integration software for quickly testing new versions of a program.

Bitbucket's geographic limitations

Atlassian's Bitbucket Data Center popularized the concept of high-availability Git through its active/active cluster configuration, but it is designed for clustered servers in a single data center, not for multiple sites. Since, like all Git solutions, Bitbucket encourages a distributed development environment, Git operations at remote sites with high network latency may perform slowly.

Because modern-day developers are often geographically remote, committed code is frequently moved from one repository to another. Git works well in a local environment where development is integrated at the same location and on the same network. However, it does not work as well for development teams spread across various locations. The pressing question for many developers is: What are my requirements for code availability, accessibility, and geographies?

Go2Group's Geoclusters

As the only ALM-specific geoclustering solution, Go2Group Geoclusters lets developers create clusters at any distance to maintain business continuity. Performance and availability for read-only operations take a significant leap forward, while all operations benefit from Bitbucket's authorization rules. Servers can be in different buildings or on different continents.

Overview

Go2Group Geoclusters for Bitbucket allows teams in remote locations to share the same code base as the local teams working on a project. Geo synchronization limits latency and bandwidth issues and keeps remote teams current with the updated code base. Prior to Geoclusters, remote offices had a difficult time receiving the most up-to-date code and dealing with resource-draining bandwidth requests from remote Git servers. They also had a tough time supporting the agile development and testing practice known as Git branch builds. Now developers can seamlessly connect their remote teams worldwide, as if they were all in the same location.

Geoclusters involve the use of multiple redundant computing resources located in different geographical locations to form what appears to be a single, highly available system. The biggest challenge in geoclustering is making sure that system states and their associated data stay consistent across multiple locations.

[Figure: Go2Group Geoclusters overview]

Synchronous replication from Bitbucket to the Geocluster nodes serves as an always-on backup, eliminating the need for conventional disk mirroring solutions that only work over a LAN. New changes are pushed to each mirror as they arrive, and monitoring tools provide up-to-date status of all mirrors.

Data Flow

The diagrams below show the data flow for user read and write operations.

[Figure: Read operations using geoclusters]

[Figure: Write operations using geoclusters]

The system automatically keeps the mirrors in sync as new updates arrive at the central repository.

[Figure: Data synchronization]
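The read, write, and synchronization flows described in this section can be modeled as a small simulation. This is a minimal sketch under stated assumptions, not Go2Group's implementation: a repository is reduced to a mapping from branch name to commit id, and the `Repo` and `Geocluster` classes, the site names, and the commit ids are all hypothetical.

```python
class Repo:
    """A repository reduced to a mapping of branch name -> commit id."""

    def __init__(self, name):
        self.name = name
        self.refs = {}


class Geocluster:
    """Toy model of one central repository plus one mirror per site."""

    def __init__(self, central, mirrors_by_site):
        self.central = central
        self.mirrors = dict(mirrors_by_site)

    def write(self, branch, commit_id):
        """Writes land on the central repository, then are pushed to each mirror."""
        self.central.refs[branch] = commit_id
        for mirror in self.mirrors.values():
            mirror.refs[branch] = commit_id

    def read(self, site, branch):
        """Reads are served by the caller's local mirror, else by the central repo."""
        repo = self.mirrors.get(site, self.central)
        return repo.refs.get(branch)


if __name__ == "__main__":
    cluster = Geocluster(
        Repo("central"),
        {"tokyo": Repo("tokyo-mirror"), "beijing": Repo("beijing-mirror")},
    )
    cluster.write("master", "abc123")        # hypothetical commit id
    print(cluster.read("tokyo", "master"))   # served from the Tokyo mirror
```

In this model a read never leaves the caller's site once the mirror is in sync, which is the property the read-operation diagram illustrates; only writes and the resulting mirror updates cross the wide-area link.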

Bitbucket high availability options

                  Atlassian Bitbucket   Atlassian Bitbucket Data Center   Bitbucket with Go2Group Geoclusters
# of sites        Single-site           Single-site                       Multi-site
Bandwidth         High                  Medium                            Low
Servers           Single Server         Multiple Servers                  Multiple Servers
Clustering        None                  Active-Active                     Synchronous push
Scalability       Zero                  High Scalability                  Highest Scalability
Network Latency   Negligible            Low                               High
Replication       Synchronous           Synchronous                       Synchronous
Communications    None                  Network Connection                Internet Protocol (IP)
Distance          None                  Single Data Center                Unlimited
Overall Rating    Good                  Better                            Best

Benefits of geoclustered architecture

Go2Group Geoclusters enables continuous availability through both architectural and design advantages over single-site solutions, including:

- Protection through multiple redundant copies of repositories: Go2Group Geoclustering works with Atlassian Bitbucket to improve recovery time objectives (RTO) by making multiple copies of valuable repositories available at several locations.
- Rapid recovery: Should one mirror site fail due to an event like a flood, Geoclusters can route all work to another site, which can take over the processing with nearly no interruption for connected users. Each mirror is periodically verified to make sure that it is consistent with the central repository.
- Full utilization of resources: Geoclusters' ability to distribute read activity across all servers, including running a single workload across the whole cluster, allows the greatest flexibility in terms of resources. Since Geoclusters uses one physical database across the distance, there is neither a lag in data freshness nor any requirement for implementing conflict resolution schemes.
- Simplicity in setup, managing, and monitoring: Metrics, verification, and system health for each site's status are presented in an intuitive graphical interface.
- Bandwidth efficiency in the WAN and improved remote site performance: Bandwidth is free in the LAN but not in the WAN. With Geoclusters, remote WAN users experience the same LAN-speed read performance as local users. This is done by maintaining the equivalent of a single copy of the data across the system. Checkouts and other read operations are always local, so no WAN traffic is generated.
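The periodic mirror verification mentioned under "Rapid recovery" can be illustrated by comparing the refs each repository advertises. The paper does not describe how Geoclusters performs this check, so the following is only a sketch of one plausible approach: it parses ref listings in the tab-separated `<commit id><TAB><ref name>` format that `git ls-remote` prints and reports any ref whose commit differs. The function names and the sample listings are hypothetical.

```python
def parse_refs(listing: str) -> dict:
    """Parse lines of '<commit id>\\t<ref name>' into a {ref: commit id} dict."""
    refs = {}
    for line in listing.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        commit_id, ref = line.split("\t", 1)
        refs[ref.strip()] = commit_id.strip()
    return refs


def diff_refs(central: dict, mirror: dict) -> dict:
    """Return {ref: (central commit, mirror commit)} for every mismatch."""
    mismatches = {}
    for ref in sorted(set(central) | set(mirror)):
        c, m = central.get(ref), mirror.get(ref)
        if c != m:
            mismatches[ref] = (c, m)  # None means the ref is missing on that side
    return mismatches


if __name__ == "__main__":
    # Hypothetical listings with shortened commit ids.
    central = parse_refs("aaa111\trefs/heads/master\nbbb222\trefs/heads/dev\n")
    mirror = parse_refs("aaa111\trefs/heads/master\nbbb000\trefs/heads/dev\n")
    print(diff_refs(central, mirror) or "mirror is consistent")
```

An empty result means the mirror advertises exactly the same refs as the central repository; any entry identifies a branch or tag that still needs to be synchronized.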

Performance data

Go2Group performed the following tests over distances of zero (local), 50, and 100 kilometers. The following graph shows the overall performance impact on Bitbucket due to distance, measured as a percentage of local performance.

[Figure: Bitbucket performance by distance, as a percentage of local performance]

Note: Write-intensive Bitbucket workloads are generally more affected by distance than read-intensive workloads.

Given these numbers, it can be concluded that Atlassian Bitbucket Data Center generally performs acceptably at distances from under 50 kilometers up to 100 kilometers. When distances exceed 100 kilometers, Go2Group Geoclusters for Bitbucket improves on Atlassian's Bitbucket Data Center.

Conclusion

Distance can have a huge effect on performance, so keeping the distance short and using dedicated, direct-attached networks is optimal, but not always possible. Go2Group Geoclustering for Bitbucket is an attractive architecture that allows scalability, rapid availability, and even partial disaster recovery protection. Compared to an Atlassian Bitbucket and Bitbucket Data Center configuration, Go2Group's geoclustering architecture provides the highest level of availability for an Atlassian environment where developers must have an always-on, always-available experience.

Contacting Go2Group

Go2Group is a global provider of consulting services, third-party application integrations, data migrations, software testing, and training services in Application Lifecycle Management (ALM) systems. We've implemented thousands of enterprise-level migrations. We specialize in complex, multi-platform ALM integration projects. Our goal: Make it easy. Our clients say, "We feel like you are part of our team."

An Enterprise and Platinum Atlassian Expert, we offer a full suite of services for all Atlassian products and are the world's largest reseller of Atlassian tools. We're certified partners for the best-of-breed ALM solutions, including Atlassian, HP, IBM, Microsoft, and Perforce. We specialize in integrating ALM tools such as Atlassian, HP, IBM, Microsoft, Perforce, ServiceNow, and many more: users work in the tools they prefer, and the data is synchronized automatically.

Go2Group and GSA

Products and services from Go2Group and its partners, including Atlassian, Microsoft, and Perforce, are available via GSA or several GWACs and procurement vehicles. We're expert in government policies and strategies.

Contact

http://www.go2group.com/
Corporate Office, USA: 138 North Hickory Avenue, Bel Air, MD 21014
Hawaii: 7007 Hawaii Kai Drive, Suite C26, Honolulu, HI 96825
Japan: Le Premier Akihabara 11th Floor, 73 Kanda Neribei-cho, Chiyoda-ku, Tokyo 101-0022
China: Great Wall Computer Building A301, 38 Xueyuan Road, Haidian District, Beijing 100083
Telephone: 877-442-4669 (U.S. toll free); +1-410-879-8102 (U.S.)
Email: sales@go2group.com

Notice

© 2015 Go2Group, Inc. All rights reserved. Subject to change without notice. ConnectALL is a registered trademark of Go2Group, Inc. in the U.S. and other countries. Bitbucket and Bitbucket Data Center are registered trademarks of Atlassian. All other brands or products are trademarks or registered trademarks of their respective holders and should be treated as such. This white paper is for informational purposes only. Go2Group makes no warranties, express, implied, or statutory, as to the information in this document.

WP-G2G-1000