Controller Failover for SDN Enterprise Networks

Vasily Pashkov, Alexander Shalimov, Ruslan Smeliansky
Lomonosov Moscow State University, Applied Research Center for Computer Networks, Moscow, Russia
pashkov@lvk.cs.msu.su, ashalimov@lvk.cs.msu.su, smel@cs.msu.su

Abstract. In an SDN network based on OpenFlow, the controller performs logically centralized control of the enterprise network infrastructure, network policies, and data flows. At the same time, the controller is a single point of failure, which can cause very serious problems (e.g., a network outage) for network reliability and production use cases. To address this problem, we consider different active/standby strategies for providing controller failover in case of a controller failure. We propose a high-available controller (HAC) architecture, which allows deploying a high-availability control plane for enterprise networks. We develop a HAC prototype to demonstrate the efficiency of our solution and also describe initial experimental results.

Keywords: Software-Defined Networking; Control Plane Design; Controller Architecture; Fault-Tolerance; Redundancy.

I. INTRODUCTION

SDN is a new approach to networking that significantly improves the programmability and flexibility of network management, simplifies the logic of network devices, and reduces the cost of deploying the network infrastructure and the cost of its maintenance in comparison with traditional approaches [1, 2]. SDN separates the control plane and the data plane, which enables their independent deployment, scaling, and maintenance. SDN involves centralized management of the network infrastructure and data flows, but this approach can lead to network resilience and scalability problems. The control plane can be deployed on one or several SDN controllers running on dedicated servers [3]. The set of hardware and software components that provides centralized network management in SDN is called the control platform. The controller maintains an up-to-date global network view (GNV), which is stored in its network information base (NIB). Using the network view, controller applications control the states of network devices and the data flows. That is why the performance, reliability, and scalability of an SDN network are defined by the characteristics of its control platform.

In spite of the advantages of SDN, one of its serious problems is that the controller is a critical point of failure and, therefore, decreases overall network availability. A controller failure can be caused by various reasons: failure of the server where the controller is running, failure of the server operating system, a power outage, abnormal termination of the controller process, a network application failure, network attacks on the controller, and many others.

In this paper we address the control plane for OpenFlow networks, one of the most promising implementations of the SDN approach [4]. The OpenFlow protocol is the open interface between the control plane and the data plane. The control plane in OpenFlow includes a controller (or NOS, network operating system) that monitors and controls the state of OpenFlow switches, a set of network applications for network traffic and policy management, the OpenFlow communication channels between the controller and the switches, and the OpenFlow protocol for their interaction. An OpenFlow controller can install rules for data flows in OpenFlow switches, supporting predictive, reactive, proactive, or hybrid flow installation modes. At present there are about 30 different OpenFlow controller implementations [5, 6]: NOX, POX, Beacon, MUL, Floodlight, and others. However, most of them do not support control plane restoration mechanisms in the case of a controller failure.
Only the distributed control platforms Onix [7] and Kandoo [8] and some proprietary controllers supporting OpenFlow 1.0 [4] provide a restoration procedure in case of a controller failure. Thus, a controller failure in the control plane of SDN/OpenFlow is a pressing issue. This paper presents an approach for improving the availability of the SDN control plane in case of a controller failure in enterprise software-defined networks.

In summary, this paper presents the following:
- a comparative analysis of the different active/standby strategies for providing controller failover;
- a fault-tolerant control plane architecture for enterprise software-defined networks;
- a High-Available Controller (HAC) architecture that gives the network the ability to recover the control plane quickly;
- the control recovery procedure and the procedure for network view synchronization between active and standby controller instances;
- the HAC prototype implementation supporting OpenFlow version 1.3.
II. BACKGROUND

A. Typical SDN controller architecture

A typical SDN/OpenFlow controller [6, 9, 10, 11] includes:
- The controller core, which handles and supports connectivity with switches and translates control protocol messages (e.g., OpenFlow) into internal controller events and vice versa.
- Controller network services, which control, form, and monitor the network view and the states of network devices, and provide an interface (the Northbound API) for controller applications. Network services usually include event dispatching, device management, topology management, and others.
- Controller network applications, which configure the network infrastructure and manage data flows to solve particular business use cases.

The interaction between the controller core, network services, and applications is based on the publish-subscribe model, as illustrated by the sketch below.
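To make the publish-subscribe interaction concrete, the following minimal C++ sketch shows how a controller core might dispatch internal events to subscribed services and applications. The EventDispatcher class, the Event structure, and the handler bodies are illustrative assumptions introduced only for explanation; they are not part of any particular controller implementation.

#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Illustrative controller-internal event, produced by the core after it
// translates an OpenFlow message (e.g. PACKET_IN) -- an assumption for this sketch.
struct Event {
    std::string type;     // e.g. "packet_in", "port_status"
    std::string payload;  // simplified payload
};

// Minimal publish-subscribe dispatcher: services and applications subscribe
// to event types; the controller core publishes translated protocol messages.
class EventDispatcher {
public:
    using Handler = std::function<void(const Event&)>;

    void subscribe(const std::string& type, Handler h) {
        handlers_[type].push_back(std::move(h));
    }
    void publish(const Event& ev) {
        for (auto& h : handlers_[ev.type]) h(ev);
    }
private:
    std::map<std::string, std::vector<Handler>> handlers_;
};

int main() {
    EventDispatcher dispatcher;
    // A topology-managing network service subscribes to port status events.
    dispatcher.subscribe("port_status", [](const Event& ev) {
        std::cout << "topology service updates network view: " << ev.payload << "\n";
    });
    // A forwarding application subscribes to packet-in events.
    dispatcher.subscribe("packet_in", [](const Event& ev) {
        std::cout << "forwarding app computes a rule for: " << ev.payload << "\n";
    });
    // The controller core publishes events translated from OpenFlow messages.
    dispatcher.publish({"packet_in", "flow 10.0.0.1 -> 10.0.0.2"});
    dispatcher.publish({"port_status", "switch 1, port 2 down"});
    return 0;
}

In a real controller the applications would react by installing flow rules through the core rather than printing, but the subscription and dispatch pattern is the same.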
B. Analysis of active/standby strategies for the control platform

Let us consider the basic redundancy approaches for improving control platform availability in enterprise software-defined networks. A controller in the control plane can be in active or standby mode. An active controller directly receives and processes OpenFlow messages from network devices. A standby controller duplicates the functionality of the active controller, but receives and processes OpenFlow messages from network devices only in case of an active controller failure. The number of standby controllers may be increased to tolerate more than one failure at a time. The primary controller for a network segment is the controller that configures the network devices of its segment and installs the rules for data flows in this segment.

There are active/standby strategies and active/active strategies for controller redundancy. With active/standby strategies the control platform has only one active primary controller. With active/active strategies the control platform can have multiple active controllers. In this paper we consider only active/standby strategies with one active primary controller in the control plane. In case of a primary controller failure, a standby controller automatically takes over network infrastructure control and data flow management. This procedure is called controller failover. Controller failback is the reverse procedure to failover; it is used when the primary controller is restored.

Based on the operational status of the standby controllers (switched on/off and loaded/unloaded before the start of work) and on failover transparency, the active/standby strategies are: no standby, cold standby, warm standby, and hot standby.

The «no standby» strategy. The control plane has a single active primary controller without connected standby controllers. In case of a primary controller failure, the network administrator manually resets the controller or replaces it. Thus, the control plane recovery time is significant and unpredictable and depends on the efficiency of the support service.

The «cold standby» strategy. The control plane has an additional unloaded server connected to the server of the primary controller. The «cold standby» strategy uses an automatic failover procedure. The standby controller is stateless. In case of a primary controller failure, the standby server starts the standby controller and its services and applications (including the topology discovery service that forms the network view) from scratch. This strategy is preferable for stateless services and applications. The recovery time is determined by the controller start time and the time needed to restore the actual state on the standby controller. In the cold standby case the redundant hardware component is usually unloaded, so it can be used for any optional extra work: testing, debugging, maintenance, and other services (e.g., testing new versions of controller network services and applications).

The «warm standby» strategy includes periodic replication of the primary controller state to the standby controllers and an automatic failover procedure. The «warm standby» strategy is usually provided by hardware and software redundancy. In case of a primary controller failure, the standby controller replaces the failed controller and continues to operate on the basis of its previous state. Control plane services for network devices are interrupted and some state is lost. The lost part of the control plane state consists of the state changes that occurred between the last state synchronization procedure and the primary controller failure.

The «hot standby» strategy includes full state synchronization of the primary and standby controllers and an automatic failover procedure. No controller state is lost, which provides the minimum recovery time. The state of the primary controller is replicated to the standby controller on every change. In case of a primary controller failure, the standby controller replaces the failed controller and continues on the basis of the current state. The «hot standby» strategy is implemented by software and hardware redundancy.

TABLE I. COMPARATIVE ANALYSIS OF THE ACTIVE/STANDBY STRATEGIES

Criterion                      | No standby                 | Cold standby               | Warm standby               | Hot standby
Redundancy                     | hardware                   | hardware                   | hardware and software      | hardware and software
Redundancy rate                | 1                          | 1+N                        | 1+N                        | 1+N
Active controllers             | 1                          | 1                          | 1                          | 1
Failover procedure             | manually                   | automatically              | automatically              | automatically
State loss                     | complete loss of the state | complete loss of the state | partial loss of the state  | without loss of the state
State and data synchronization | no                         | no                         | regularly                  | up-to-date (any change)
Failover time                  | unpredictable              | from minutes to seconds    | seconds                    | from seconds to milliseconds
Cost                           | no cost to low cost        | moderate                   | moderate to high           | moderate to high
Network user impact            | high                       | moderate                   | low                        | none
C. Key metrics

The key metrics that characterize a fault-tolerant control platform are the following:
- Controller redundancy degree: the number of standby controller instances included in the control platform. It determines the cost of the control platform and the number of failures that can be tolerated.
- Worst-case controller delay: the maximum delay for the controller to process a flow installation request from a network device, which is reached during the control plane recovery process.
- Controller failover time: the time during which network device requests can be lost due to the absence of the primary controller, i.e., the time during which the network has no primary controller.

Thus, the SDN control platform should have a controller redundancy degree of at least one and a worst-case delay of no more than 150 milliseconds, the recommended maximum delay for services. The failover time should be as low as possible and close to zero.

D. Fault-Tolerant control plane requirements

To support redundant controllers, the control platform must meet the following requirements:
- there must be at least two servers;
- the servers must have identical hardware and software configurations;
- there must be an internal network between the servers, to decouple control platform communications from the OpenFlow communication channels and to access the data store;
- each server must have access to the SDN network segment over independent links;
- the controller instances must be identical (with identical versions of controller network services and applications).

These requirements follow from three reasons: a single point of failure must be avoided; the standby controller must have sufficient computing resources for network infrastructure and data flow management in case of a primary controller failure; and the standby controller must provide the same set of functions as the primary controller.

III. PROPOSED APPROACH

A. Proposal

Since we have discussed the active/standby strategies, it is important to define the controller state. The controller state includes the states of the controller services and applications, the event queue state, the controller network view, and the controller data. The state of a controller service or application consists of the values of its significant internal variables. A service/application snapshot is the service/application state at a particular time. A controller snapshot consists of the snapshots of the controller services and applications together with the current network view (a minimal sketch of this decomposition is given below).

To solve the controller failure issue using active/standby strategies, we need to define the basic modes of the control platform: an initial mode, which describes the order in which controller instances are launched; an operational mode, which describes the synchronization procedure between controller instances; and a primary controller failure mode, which describes the failure detection and failover procedures.
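As an illustration of this decomposition of the controller state, the following sketch models a controller snapshot as the set of per-service/per-application snapshots plus the current network view. All type and field names (ControllerSnapshot, ServiceSnapshot, NetworkView, and so on) are assumptions made for the sake of the example, not types from the HAC implementation.

#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Illustrative network view entry: one switch and its active ports -- an assumption.
struct SwitchEntry {
    uint64_t datapath_id;
    std::vector<uint32_t> active_ports;
};

// Global network view (GNV) kept in the controller's NIB -- simplified.
struct NetworkView {
    std::map<uint64_t, SwitchEntry> switches;   // keyed by datapath id
};

// Snapshot of one service or application: the values of its significant
// internal variables at a particular time (reduced here to a key/value map).
struct ServiceSnapshot {
    std::string name;                           // e.g. "topology", "l2_learning"
    std::map<std::string, std::string> state;   // significant variables
};

// Controller snapshot: snapshots of all services and applications plus the
// current network view, as defined in Section III.A.
struct ControllerSnapshot {
    std::vector<ServiceSnapshot> services_and_apps;
    NetworkView network_view;
};

int main() {
    ControllerSnapshot snapshot;
    snapshot.network_view.switches[0x1] = SwitchEntry{0x1, {1, 2, 3}};
    snapshot.services_and_apps.push_back({"topology", {{"links_discovered", "12"}}});
    return 0;
}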
Initial controller mode. Starting the primary controller of the control plane:
- The controller starts in accordance with the configuration file.
- The controller launches a timer for connecting standby controllers.

Starting a standby controller of the control plane:
- The controller starts in accordance with the configuration file.
- The standby controller establishes a connection to the primary controller via the internal control network between the controllers.
- The standby controller requests the list of network services and applications of the primary controller and launches a similar set of applications and services.
- The standby controller requests the current network view, the list of network interfaces for control channel connections, and the current states of the network services and applications.
- The standby controller launches the primary controller state monitoring service.

Operational controller mode. In this mode the primary controller processes OpenFlow messages from network devices and controls network data flows, while the standby controllers monitor the primary controller state and synchronize with it. Controller state synchronization includes:
- network view synchronization;
- synchronization of the states of the controller network services and applications;
- controller data synchronization.

In this paper we use two strategies for controller synchronization. For network view redundancy and synchronization we use the hot active/standby strategy: the primary controller pushes each network view change to all standby controllers (see the sketch below). For the redundancy and synchronization of controller network services and applications we use the warm active/standby strategy: the primary controller periodically or conditionally pushes snapshots of services and applications to all standby controllers. For controller data synchronization we use a reliable shared data storage between the controllers.
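The following minimal sketch illustrates the hot strategy for the network view: the primary pushes every view change to all standby controllers as soon as it happens. The StandbyChannel and NetworkViewReplicator interfaces and the textual message format are illustrative assumptions, not the actual HAC middleware API.

#include <iostream>
#include <string>
#include <vector>

// Abstract channel to one standby controller over the internal control
// network between servers -- illustrative assumption.
class StandbyChannel {
public:
    virtual ~StandbyChannel() {}
    virtual void send(const std::string& message) = 0;
};

// Stand-in transport used only for this sketch.
class ConsoleChannel : public StandbyChannel {
public:
    void send(const std::string& message) override {
        std::cout << "to standby: " << message << "\n";
    }
};

// Hot strategy for the network view: every change is replicated immediately
// to all standby controllers, so no network view state is lost on failover.
class NetworkViewReplicator {
public:
    void addStandby(StandbyChannel* ch) { standbys_.push_back(ch); }

    // Called by the primary controller whenever the network view changes
    // (e.g. switch connected, port went down, link discovered).
    void onNetworkViewChange(const std::string& change) {
        for (StandbyChannel* ch : standbys_) ch->send("view_update " + change);
    }
private:
    std::vector<StandbyChannel*> standbys_;
};

int main() {
    ConsoleChannel standby;
    NetworkViewReplicator replicator;
    replicator.addStandby(&standby);
    replicator.onNetworkViewChange("switch 0x1 connected");
    replicator.onNetworkViewChange("port 3 on switch 0x1 down");
    return 0;
}

Under the warm strategy the same channel would instead carry whole service/application snapshots on a timer or on a condition, rather than one message per change.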
Primary controller failure mode. The control plane recovery procedure consists of two stages:

Failure detection stage. The primary controller failure detection mechanism is based on heartbeats. The main parameters are the heartbeat interval, i.e., the time interval between heartbeat messages, and the dead interval, i.e., the time interval after which a standby controller declares the primary controller failed.

Recovery stage. The recovery stage starts after the primary controller failure has been detected. It includes the following steps:
- Defining a new primary controller. The new primary controller is the standby controller with the highest ID (or IP). The new primary controller informs the other controllers about its status change.
- Restoration of the controller network services and applications.
- Bringing up the control network interfaces.
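A minimal sketch of the heartbeat-based detection on the standby side follows, assuming only the two parameters named above (heartbeat interval and dead interval). Transport, timers, and the actual failover steps are simplified, and the HeartbeatMonitor and startFailover names are illustrative assumptions rather than the HAC middleware API.

#include <chrono>
#include <iostream>
#include <thread>

using Clock = std::chrono::steady_clock;

// Standby-side failure detector: the primary is declared failed when no
// heartbeat message has been seen for longer than the dead interval.
class HeartbeatMonitor {
public:
    explicit HeartbeatMonitor(std::chrono::milliseconds dead_interval)
        : dead_interval_(dead_interval), last_heartbeat_(Clock::now()) {}

    // Called whenever a heartbeat message arrives from the primary controller.
    void onHeartbeat() { last_heartbeat_ = Clock::now(); }

    // Called periodically (e.g. from a timer); returns true if the primary
    // should be considered failed and the failover procedure started.
    bool primaryFailed() const {
        return Clock::now() - last_heartbeat_ > dead_interval_;
    }

private:
    std::chrono::milliseconds dead_interval_;
    Clock::time_point last_heartbeat_;
};

// Simplified failover entry point: in the proposed approach the standby with
// the highest ID becomes the new primary, notifies the other standbys,
// restores services and applications, and brings up the control interfaces.
void startFailover() {
    std::cout << "dead interval expired: taking over as primary controller\n";
}

int main() {
    // Dead interval of three missed 100 ms heartbeats -- illustrative values only.
    HeartbeatMonitor monitor(std::chrono::milliseconds(300));
    monitor.onHeartbeat();                                        // heartbeat from primary
    std::this_thread::sleep_for(std::chrono::milliseconds(350));  // no further heartbeats arrive
    if (monitor.primaryFailed()) startFailover();
    return 0;
}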
B. High-Available Controller Architecture

The high-available controller (HAC) architecture is based on adding cluster middleware between the controller core and the controller network services and applications (see Figure 1). To provide fault tolerance of the control platform, the HAC cluster middleware includes the following managers and services:
- Controller Manager, to coordinate the start/restart/stop of controller network services and applications and to bring the control interface for network device connections up and down.
- Cluster Manager, to control the operation of the controller cluster and distribute responsibilities (primary or standby) in accordance with the cluster configuration file.
- Sync Manager, to control the synchronization of controller network services and applications between the controller instances in the cluster.
- Recovery Manager, to coordinate the recovery process (failover and failback) in case of a controller instance failure in the platform.
- Message Service, to provide control message distribution to the other controller instances in the controller cluster.
- Event Service, to provide filtering, distribution, and processing of events sent to or received from other controller instances.
- Heartbeat Service, to monitor the operational status of the controllers and detect controller failures in the controller cluster.

Fig. 1. High-Available Controller architecture

C. Control Plane Design with HAC

In order to avoid single points of failure in the control platform, we propose the following design (see Figure 2): the primary and standby controllers exchange heartbeat and control messages over the internal network, the primary controller sends network view events and services/applications snapshots to the standby controller, which restores its state from those snapshots, and both controllers share a common controller data storage.

Fig. 2. Fault-tolerant control plane design with HAC controller

D. HAC Implementation

Based on our review of modern open-source SDN control platforms [5], we chose NOX13oflib from the CPqD laboratory [12] as the base controller for the HAC controller. This controller supports the OpenFlow control protocol version 1.3.0 [13]. All HAC cluster middleware managers and services have been implemented in C++ using the Qt 4.8.1 and Boost libraries. Application and service snapshots are formed using the Boost serialization mechanism. The interaction between the middleware services and managers is provided through the Qt signal-slot mechanism to ensure independence from the base controller.
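Since application and service snapshots are formed with the Boost serialization mechanism, the following sketch shows that mechanism applied to a simplified, made-up service state. The SampleServiceState type and its fields are illustrative assumptions; only the use of boost::archive text archives reflects the mechanism mentioned above.

#include <boost/archive/text_iarchive.hpp>
#include <boost/archive/text_oarchive.hpp>
#include <boost/serialization/map.hpp>
#include <boost/serialization/string.hpp>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

// Simplified state of one controller service -- illustrative assumption.
struct SampleServiceState {
    std::string service_name;
    std::map<std::string, std::string> variables;  // significant internal variables

    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        ar & service_name;
        ar & variables;
    }
};

int main() {
    // Primary side: form a snapshot of the service state.
    SampleServiceState state;
    state.service_name = "topology";
    state.variables["links_discovered"] = "12";

    std::ostringstream out;
    {
        boost::archive::text_oarchive oa(out);
        oa << state;                               // snapshot serialized for transfer
    }

    // Standby side: restore the service state from the received snapshot.
    SampleServiceState restored;
    std::istringstream in(out.str());
    {
        boost::archive::text_iarchive ia(in);
        ia >> restored;
    }
    std::cout << restored.service_name << " restored with "
              << restored.variables.size() << " variables\n";
    return 0;
}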
IV. EVALUATION

The HAC controller prototype implementation has been deployed on a Linux virtual machine for functional and performance testing. Our experimental evaluation includes two parts: evaluation of the synchronization overhead and evaluation of the controller failover time. In the first part we evaluate the performance overhead associated with synchronization between the primary and standby HAC controllers. Using cbench, we measure the throughput of NOX13oflib and the throughput of a two-node fault-tolerant HAC cluster. The synchronization overhead ranges from 5 to 23 percent of the NOX13oflib controller throughput (see Figure 3).

Fig. 3. HAC controller synchronization performance overhead

Figure 4 shows how the response time changes, depending on the packet-in message index, during a primary controller failure and the controller failover procedure. The initial experimental results show that the average failover time for a two-node HAC cluster is from 40 to 50 ms, which is less than the maximum allowed delay for services, so network services for end users are not interrupted during controller failover.

Fig. 4. Response time during HAC failover

V. CONCLUSION AND FUTURE WORK

The controller is a critical component of enterprise software-defined networks. In this paper, we showed the relevance and significance of the control plane availability problem for SDN in case of a controller failure. We carried out a comparative analysis of active/standby strategies and their applicability to the control plane. We formulated a set of necessary requirements for controller redundancy. Moreover, we presented a control plane design, the HAC controller architecture, and tools for synchronizing controller network applications and services as well as the network view between controllers in the control plane. We implemented the HAC cluster middleware, which can be easily adapted to other, more productive base controller implementations. We showed that our initial evaluation results are quite encouraging. Thus, in this paper we proposed an approach to solve the controller failover problem for the SDN control platform and a middleware implementation that opens opportunities for studying active/active strategies and for distributed controller development. We are continuing the implementation of the HAC cluster middleware with a focus on developing controller state synchronization algorithms and adding load balancing mechanisms between controller instances. We also plan to extend the list of failures that the control platform can prevent.

REFERENCES

[1] N. McKeown et al., "OpenFlow: Enabling innovation in campus networks," ACM SIGCOMM Computer Communication Review, vol. 38, no. 2, April 2008, pp. 69-74.
[2] R. L. Smeliansky, "Software-Defined Networks," Open Systems, no. 9, 2012. [in Russian]
[3] Open Networking Foundation, "Software-Defined Networking: The New Norm for Networks," ONF White Paper, 2012.
[4] Open Networking Foundation, "OpenFlow Switch Specification, Version 1.0.0 (Wire Protocol 0x01)," 2009.
[5] A. Shalimov, D. Zuikov, D. Zimarina, V. Pashkov, R. Smeliansky, "Advanced Study of SDN/OpenFlow controllers," Proceedings of CEE-SECR'13: Central and Eastern European Software Engineering Conference in Russia, ACM SIGSOFT, October 23-25, 2013, Moscow.
[6] A. Shalimov et al., Analysis of Software-Defined Networks Performance and Functionality, editor-in-chief R. Smelianskiy. Moscow: MAKS Press, 2014. 148 p. [in Russian]
[7] T. Koponen et al., "Onix: A Distributed Control Platform for Large-scale Production Networks," in OSDI, 2010.
[8] S. H. Yeganeh and Y. Ganjali, "Kandoo: a framework for efficient and scalable offloading of control applications," in HotSDN, 2012.
[9] D. Erickson, "The Beacon OpenFlow controller," in Proc. HotSDN, Aug. 2013.
[10] N. Gude, T. Koponen, J. Pettit, B. Pfaff, M. Casado, N. McKeown, and S. Shenker, "NOX: Towards an Operating System for Networks," in SIGCOMM CCR, 2008.
[11] Floodlight OpenFlow Controller. http://floodlight.openflowhub.org/.
[12] NOX 1.3 Oflib, https://github.com/cpqd/nox13oflib.
[13] Open Networking Foundation, "OpenFlow Switch Specification, Version 1.3.0 (Wire Protocol 0x04)," 2012.