Joseph L White, Juniper Networks
SNIA Legal Notice
The material contained in this tutorial is copyrighted by the SNIA. Member companies and individuals may use this material in presentations and literature under the following conditions:
- Any slide or slides used must be reproduced without modification.
- The SNIA must be acknowledged as the source of any material used in the body of any document containing material from these presentations.
This presentation is a project of the SNIA Education Committee.
Neither the Author nor the Presenter is an attorney and nothing in this presentation is intended to be, nor should be construed as, legal advice or opinion. If you need legal advice or a legal opinion please contact an attorney.
The information presented herein represents the Author's personal opinion and current understanding of the issues involved. The Author, the Presenter, and the SNIA do not assume any responsibility or liability for damages arising out of any reliance on or use of this information. NO WARRANTIES, EXPRESS OR IMPLIED. USE AT YOUR OWN RISK.
Abstract
Extending storage networks across distance is essential to BC/DR (Business Continuance/Disaster Recovery), compliance, and data center consolidation. This tutorial provides both an overview of available techniques and technologies for extending storage networks into the Metro and Wide area networks and a discussion of the applications and scenarios where distance is important. Transport technologies and techniques discussed include SONET, CWDM, DWDM, Metro Ethernet, TCP/IP, FC credit expansion, data compression, and FCP protocol optimizations (Fast Write, etc.). Scenarios discussed include disk mirroring (both synchronous and asynchronous), remote backup, and remote block access.
Learning Objectives
- Overview of transport technologies used in Metro and Wide area networks
- Overview of protocol and transport optimizations for Metro and Wide area networks, including data compression and fast write
- Overview of deployment scenarios and business drivers for extending storage networks across metro and wide area networks
Outline
- Motivation
- Basic definitions: SAN, MAN, WAN
- Protocols: SCSI, FCP, FCoE, iSCSI, FCIP, iFCP, FICON
- Transport: FC, TCP/IP, Ethernet, WDM, TDM (SONET/SDH)
- Effects of Distance: sources of latency, performance droop, buffers and data, bandwidth-delay product
- Application Behavior: synchronous vs. asynchronous, continuous vs. snapshot/backup
- Optimizations: compression, acceleration (e.g. fast write, tape acceleration)
Why is Distance Important?
- BC/DR: human error, HW/SW failures, power outages, natural disasters
- Business: consolidation, virtualization
- Security: lost tapes
- Regulatory: HIPAA, SoX, finance
It's about Data Protection! The goal is to minimize risk from a single threat source: the secondary location must be distant enough from the primary location for safety, yet close enough for cost-effective performance.
(Figure: natural hazard map; source: US Geological Survey & FEMA)
The WAN as seen by Storage
- A separate disaster recovery center is not always required; active data centers can back each other up.
- Remote data centers are used to support Disaster Recovery (DR).
- Close proximity data centers are used to support Business Continuance (BC).
- Synchronous vs. asynchronous replication is a separate distinction from BC/DR.
- Sites, distances, applications, etc. must be determined by data classification and risk analysis while considering Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO).
The MAN as seen by Storage
- 150-200 km max diameter: the effective range of synchronous applications
- Increasingly, longer range deployments (100 km+) are appearing
- Can be as short as a few hundred meters, i.e. to the next building
- 5-10 km separation between sites is common (older installs plus newer SMBs)
- Long range optics give 40-80 km reach for direct connect
Commonly used infrastructure:
- Direct fibre (may have been dark fiber previously)
- DWDM/CWDM
- SONET/SDH (TDM)
- FC direct connect is common at shorter ranges; FCIP comes in at longer ranges
(Figure: example metro footprint spanning New York City, New Jersey, and Philadelphia)
The SAN
- Conventionally a collection of FC switches operating together as a Fabric, supporting a set of FC services and allowing servers and storage devices (disk, tape, arrays) to communicate with each other using block protocols.
- SANs are always deployed as full dual rail.
- FC services include the configuration database and network management.
- MAN access is by direct connect.
- Appliances can be attached to provide data services (block virtualization, encryption, etc.).
- Gateways can be attached to provide WAN access.
(Figure: a simple FC fabric of four switches connected by ISLs, with F_Ports and N_Ports at the edge, a services appliance, direct MAN access, and a gateway to the WAN)
Layers
- Remote offices, central offices, and data centers are all linked.
- MAN and WAN already carry converged traffic: WAN traffic is largely TCP/IP, MAN traffic is mixed.
- Gateways are used to connect FC SANs to the MAN/WAN.
- WAN accelerators are also used for optimized remote office access.
- There are lots of physical interconnect options and lots of layering possible. For example, if talking FC we could have FC over SONET/SDH, FC over IP over Ethernet, FC over native optical, FC over WDM, etc.
Interconnect Topology/Technology
- Direct optical interconnect
- WDM interconnect 1: colored optics in the device; the external box is a mux only
- WDM interconnect 2: native interface locally, protocol agnostic; the external box does wavelength shifting
- TDM interconnect: bit level protocol dependencies (inter-frame gap, etc.)
- Gateway interconnect across other WAN infrastructure: FC and above dependencies
Applications
- Continuous Data Protection
- Remote Disk Mirroring
- Geo Clustering
- Remote Disk Replication
- Remote Tape Backup
The choice is driven by business requirements; more on application behavior later.
(Figure: relative cost (equipment / recurring / resources) versus time to recover data / age of data, from real-time through minutes, hours, days, to weeks)
Storage Networking Protocols
SCSI is the protocol and command set for block storage access, and it has multiple transports. The stack runs from the applications/operating system through SCSI, over FCP, FCoE, iSCSI, SRP, FCIP, or iFCP, onto FC, CEE Ethernet, Ethernet, or InfiniBand links (1/2/4/8G + 10G FC, 10/100/1G/10G Ethernet, 10G/20G IB). z/OS uses ECKD over FICON; NAS (NFS/CIFS over TCP/IP) is shown for context.
- FC: no-drop, credit flow control; Fabric Services; switched network
- FCoE: transports FC over Ethernet while maintaining the FC operational model; consolidates I/O for SAN and LAN fabrics
- CEE (Converged Enhanced Ethernet): makes Ethernet directly suitable for storage traffic
- iSCSI: direct transport of SCSI over TCP/IP; fabric services (iSNS); runs over existing infrastructure
- FCIP: tunnels FC using TCP/IP; mainly a long-distance solution; interconnects FC infrastructure into distributed SANs
- iFCP: interconnects FC devices across an IP network; local FC infrastructure is isolated from remote infrastructure
- FICON: transport of ECKD across FC infrastructure; IBM mainframe; replaced ESCON
- InfiniBand: mostly HPC environments
- NAS: file access semantics (instead of block); shown here for context
Local FCP: FC, FCoE
- FCP is the serialization of SCSI commands across the Fibre Channel transport: it uses the FC exchange, sequence, and frame structures and maps SCSI task management to FC constructs.
- Generally the term FC applies to FCP/FC plus the FC Fabric Services.
- FCoE refers to replacing the FC-1 and FC-0 layers with Ethernet as a transport; realistically this means CEE (Converged Enhanced Ethernet).
- Check out the SNIA Tutorial: Fibre Channel over Ethernet (FCoE)
Distance FCP: FCIP, iFCP
FCIP (FC over IP):
- FC is run across a TCP/IP tunnel between gateways
- Connects FC SAN segments into one SAN
- FC devices and fabric services are used as-is
- SAN routing can be used to isolate FC fabrics, just as for iFCP
iFCP (Internet Fibre Channel Protocol):
- FCP over TCP/IP
- Provides isolation of local FC SANs
- In practice used like FCIP
- Native iFCP devices would be allowed, but none were implemented
iSCSI
- iSCSI directly implements a SAN across an IP network using TCP/IP; traditionally aimed at the SME or SMB market.
- An iSCSI-FC gateway can be used to access native FC devices: FC storage, FC initiators, iSCSI initiators, and iSCSI storage in any combination. The usual deployment is FC storage and iSCSI storage accessed by iSCSI servers.
- iSNS provides fabric services (name, zone, config, SCN).
- Servers can contain an HBA, a TOE, or a standard NIC.
- Direct access works across local and metro distances. WAN devices could be accessed, since IP is a fully routed protocol, but most implementations would suffer significant performance degradation.
IP MAN/WAN Networking
- For the LAN, IP rides over Ethernet to hop across the network.
- For the MAN/WAN, IP rides over SONET/SDH or WDM, or over native Ethernet that is in turn carried over SONET/SDH or WDM.
- IP and Ethernet generally carry TCP or UDP traffic; under I/O convergence, Ethernet also carries storage traffic.
Ethernet + IP (this will also apply to FCoE):
- Well understood and accepted in the IT world
- Low service cost points for best-effort services
- Short-term bursty, file-based, small packets, connectionless
- Congestion common, retransmits, variable/high latency
- Services available: Ethernet Private Line, MPLS, RPR, Carrier Ethernet
Characteristics of TCP
For WAN networking TCP is critical (FCIP, iSCSI, iFCP).
Connection oriented:
- Full duplex byte stream (to the application)
- Port numbers identify application/service endpoints within an IP address
- Connection identification: IP address pair + port number pair (the "4-tuple")
- Well known port numbers for some services
- Reliable connection open and close
- Capabilities negotiated at connection initialization (TCP options)
Reliable, guaranteed in-order delivery:
- Segments carry sequence and acknowledgement information
- The sender keeps data until it is acknowledged
- The sender times out and retransmits when needed
- Segments are protected by a checksum
Flow control and congestion avoidance:
- Flow control is end to end (NOT port to port over a single link)
- Sender congestion window and receiver sliding window
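A practical consequence of the window mechanisms: a single connection can carry at most one window of unacknowledged data per round trip, which caps its throughput. A minimal sketch of that ceiling (Python; the window size and RTT are illustrative assumptions):

```python
def tcp_throughput_ceiling(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on a single TCP connection's throughput:
    at most one window of data can be in flight per round trip."""
    return window_bytes / rtt_s

# Example: a default 64 KB receive window over a 20 ms metro/WAN round trip
window = 64 * 1024          # bytes
rtt = 0.020                 # seconds
print(f"{tcp_throughput_ceiling(window, rtt) / 1e6:.1f} MB/s ceiling")
# ~3.3 MB/s -- far below GE line rate, which is why scaled windows matter
```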
TCP Send/Receive Illustration
(Figure only: illustration of TCP send and receive operation.)
TCP Retransmission Timeout
(Figure: throughput rate in MB/s versus time in seconds on a 10 ms RTT link, showing retransmission timeouts and unrecoverable drops; too many of these and the session closes, which can take minutes.)
- The sender times out on the oldest sent, unacknowledged data.
- This requires RTT estimation for the connection (typically with a 500 ms resolution TCP clock).
- Retransmission timeouts are 500 ms to 1 s, with exponential back-off as more timeouts occur; a sketch of the back-off follows below.
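A minimal sketch of how consecutive timeouts compound (Python; the 1 s initial RTO follows the range quoted above, and real TCP stacks differ in detail):

```python
def rto_schedule(initial_rto_s: float = 1.0, max_timeouts: int = 5) -> None:
    """Successive retransmission timeouts with exponential back-off."""
    rto = initial_rto_s
    waited = 0.0
    for attempt in range(1, max_timeouts + 1):
        waited += rto
        print(f"timeout {attempt}: stalled {waited:.0f} s total (next RTO {rto * 2:.0f} s)")
        rto *= 2.0

rto_schedule()
# After just 5 consecutive timeouts the connection has stalled for 31 s --
# an eternity for a block storage application.
```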
TCP Fast Retransmit, Fast Recovery
(Figure: throughput rate in MB/s versus time in seconds on a 10 ms RTT link, showing a packet drop followed by Fast Recovery and Congestion Avoidance.)
- Dropped frames can be detected by looking for duplicate ACKs.
- 3 duplicate ACKs trigger Fast Retransmit and Fast Recovery.
- With Fast Retransmit there is no retransmission timeout.
TCP/IP for Block Storage
Common adaptations when TCP/IP carries block storage traffic:
- Scaled receive windows
- Quick start
- Modified congestion controls
- Dealing with network reordering
- Detecting retransmission timeouts faster
- Implementing Selective Acknowledgement (SACK)
- Reducing the amount of data transferred (compression)
- Aggregating multiple TCP/IP sessions together
- Bandwidth management, rate limiting, traffic shaping
TCP/IP Summary
TCP/IP is both good and bad for block storage traffic.
- TCP/IP's fundamental characteristics are good: connection oriented, full duplex, guaranteed in-order delivery. Basic latency is not significant when compared to native FC (compare the Application/SCSI/iSCSI/TCP/IP/Ethernet stack with the Application/SCSI/FCP/FC stack).
- TCP/IP's congestion controls and lost segment recovery can cause problems for block storage: large latencies CAN occur when drops are happening (this is bad).
- However, many of TCP/IP's drawbacks can be mitigated. Some changes only improve TCP behavior, for example better resolution TCP timers (giving more precise retransmission timeouts) or SACK. Others have a possible negative effect on other traffic, for example removing congestion avoidance completely.
Physical Interconnects
- FC
- Ethernet
- Protocol agnostic: WDM (Wavelength Division Multiplexing) - CWDM, DWDM
- TDM: SONET/SDH, ATM and legacy
- Dark fiber: optical fiber in place but not used (i.e. unlit)
MAN/WAN Transport Options
Understand your actual throughput needs: changed data size divided by the backup window gives the required data rate (sketched below).
Many considerations:
- Application performance: latency, bandwidth
- Security
- Protection
- Distance
- Availability
- Cost: equipment / service cost
(Figure: Ethernet, TDM, and WDM options plotted against daily throughput requirement, from 10 up to 3000 GBytes/day)
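A minimal sketch of that sizing rule (Python; the 500 GB change rate and 8-hour window are illustrative assumptions):

```python
def required_rate_mbps(changed_data_gb: float, window_hours: float) -> float:
    """Sustained link rate (Mbit/s) needed to move the changed data
    within the backup window: changed data / window."""
    bits = changed_data_gb * 8 * 1e9          # GB -> bits (decimal)
    seconds = window_hours * 3600
    return bits / seconds / 1e6               # -> Mbit/s

# Example: 500 GB of changed data and an 8-hour nightly window
print(f"{required_rate_mbps(500, 8):.0f} Mbit/s sustained")  # ~139 Mbit/s
```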
Fibre Channel
- Switched network protocol (loop can still apply to disk array internal interconnects)
- 1/2/4/8G + 10G speeds
- Provides transport with credit based, link level flow control. A credit corresponds to 1 frame, independent of size. The amount of credit supported by a port, with average frame size taken into account, determines the maximum distance that can be traversed (see the sketch below).
- Switches are organized into fabrics with Fabric Services.
- FC has fabric and services related frames (Basic and Extended Link Services) in addition to transporting FCP or FICON.
- FC can transport other protocols, including IP, but this is not generally done.
- Check out the SNIA Tutorial: Fibre Channel Technologies: Current and Future
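A minimal sketch of the credit sizing rule implied above (Python; the 5 µs/km propagation figure and full-size-frame assumption are illustrative, and real designs add margin):

```python
import math

def bb_credits_needed(distance_km: float, line_rate_MBps: float,
                      avg_frame_bytes: int = 2148) -> int:
    """Buffer-to-buffer credits needed to keep an FC link full.
    One credit = one frame, and a credit only comes back after roughly
    one round trip (frame out, R_RDY back)."""
    prop_delay_s = distance_km * 5e-6            # ~5 us/km in fiber, one way
    frame_time_s = avg_frame_bytes / (line_rate_MBps * 1e6)
    return math.ceil(2 * prop_delay_s / frame_time_s)

# Example: 100 km at 4G FC (~400 MB/s) with full-size frames
print(bb_credits_needed(100, 400))    # ~187 credits
# The same span at 1G FC (~100 MB/s) needs only ~47: faster links need more credits.
```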
Ethernet
- Layer 2 interconnect, 10/100/1000 (1GE) / 10000 (10GE)
- Carries IP traffic (TCP, UDP) and FCoE
Protocol features:
- 802.3x: Flow Control (PAUSE)
- 802.1d/802.1w: STP/RSTP
- 802.3ad: Link Aggregation
- 802.1p: Class of Service
- 802.1q: VLAN
Pause frames and distance (see the sketch below):
- When the sender needs to be stopped, the receiver sends a PAUSE frame to notify the sender.
- If the receive buffer is overrun before the PAUSE takes effect, frames can be dropped.
- This puts a hard limit on the distance for storage traffic, unlike the case for FC using BB credits.
- PAUSE can also cause extensive congestion spreading.
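A minimal sketch of why PAUSE imposes a hard distance limit: when the receiver sends PAUSE, roughly one round trip of data is already in flight and must fit in its buffer (Python; the propagation figure and example distance are assumptions):

```python
def pause_headroom_bytes(distance_km: float, rate_gbps: float) -> float:
    """Buffer headroom a receiver needs when it issues an Ethernet PAUSE:
    roughly one round trip of in-flight data at line rate."""
    rtt_s = 2 * distance_km * 5e-6          # ~5 us/km propagation, each way
    return rate_gbps * 1e9 / 8 * rtt_s      # bytes still arriving after PAUSE

# Example: 10 GbE across 50 km of metro fiber
print(f"{pause_headroom_bytes(50, 10) / 1024:.0f} KB of headroom")  # ~610 KB
# Typical switch port buffers are far smaller, so lossless operation
# over such distances is not practical with plain PAUSE.
```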
CEE: Converged Enhanced Ethernet
Expands Ethernet so that it is better suited to converged networks. Proposals were presented to T11, the IEEE, and the IETF; the work is owned by the IEEE and IETF (see the standards organizations' web pages for details).
Check out the SNIA Tutorial: Ethernet Enhancements for Storage: Deploying FCoE
- Priority-based Flow Control (PFC), 802.1Qbb: provides no-packet-drop behavior
- Enhanced Transmission Selection (ETS), 802.1Qaz: multiple priority groups with bandwidth guarantees; strict priority
- Data Center Bridging Exchange protocol (DCBX): uses LLDP (802.1AB) to advertise connectivity and management information between two link peers
- Congestion Management (CM), 802.1Qau: provides link level congestion management and notification
- TRILL (IETF): Transparent Interconnection of Lots of Links, allowing L2 multipath
- Shortest Path Bridging, 802.1aq: eliminates spanning tree for L2 topologies and allows L2 multipath
WDM MAN/WAN Networking
Wavelength Division Multiplexing: multiple lasers, each shooting light of a particular wavelength through a single fiber, allow multiple streams of data traffic to be carried simultaneously. Prisms, or their electronic equivalent, combine and split the light at each end of the long haul optical link.
Each wavelength carries an input connection, from 50 Mbps up through 0.5G, 1G, 2G, 4G, and 10 Gbps, at full-rate throughput.
WDM Infrastructure
- Colored optics inserted into the device: the WDM combines the light; the mux is a prism or its electronic equivalent.
- Standard interface used in the device: the WDM shifts the wavelength; the mux still combines the signals.
- The input can also be TDM; a multi-input TDM card can put several standard interfaces onto a single wavelength.
- A "wavelength" or lambda is really a tight range of wavelengths. The resolving power of the equipment determines how many lambdas fit onto the fiber.
WDM: Flavors and Features
DWDM (Dense WDM):
- 8-40+ waves per fiber
- 500 mile reach with amplification
- 2.5 Gbps and 10 Gbps common
- Optical protection
- Optics experience needed
CWDM (Coarse WDM):
- 4-8 waves per fiber
- 50 mile reach
- 2.5 Gbps
- Optical protection
- Lower cost with passive optics
Each wavelength (aka lambda) can utilize its full bandwidth capacity for multiple services (e.g. OC-3, OC-12, OC-48/192, ESCON, GbE / FC / FICON, 10GbE, OTU1 / OTU2).
TDM MAN/WAN Networking
TDM: Time Division Multiplexing, i.e. SONET/SDH (OC-1+/T1+/E1+/DS1+/etc.)
- Well established and widely available
- Any distance supported, from metro to wide area
- Connection based with predictable low latency
- Highly reliable with path protection
- SDH is the international equivalent of SONET
- Some extension gateways have direct SONET/SDH interfaces
- Used to aggregate slower traffic onto faster links; this also applies to combining fast links into superfast links, for example stretching data centers across metro distances
Latency
Command completion time is what matters. Contributing factors (sum them all up):
- Distance (due to the speed of light in fiber): latency of the cables (2x10^8 m/s gives 1 ms of RTT per 100 km of separation)
- Hops: latency through the intermediate devices
- Queuing delays due to congestion
- Protocol handshake overheads
- Target response time
- Initiator response time
A sketch of this sum follows below.
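A minimal sketch that just sums these terms for one remote write (Python; the per-hop, queueing, and device response values are illustrative assumptions, and the two round trips correspond to the plain FCP write handshake covered later):

```python
def command_completion_ms(distance_km: float, round_trips: int = 2,
                          hop_latency_ms: float = 0.02, hops: int = 4,
                          queueing_ms: float = 0.1,
                          target_ms: float = 0.5, initiator_ms: float = 0.1) -> float:
    """Sum the latency contributors for one SCSI command.
    round_trips=2 reflects a plain FCP write (command/XFER_RDY, data/status)."""
    rtt_ms = distance_km / 100.0            # ~1 ms RTT per 100 km in fiber
    return (round_trips * rtt_ms
            + 2 * hops * hop_latency_ms     # device latency in both directions
            + queueing_ms + target_ms + initiator_ms)

# Example: synchronous write mirrored to an array 100 km away
print(f"{command_completion_ms(100):.2f} ms per command")  # ~2.86 ms
```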
Performance Droop
There are many sources of performance droop:
- Transport buffer relative to the bandwidth-delay product: number of credits, TCP transmit and receive buffering
- Available data relative to the bandwidth-delay product: outstanding commands, command request size (i.e. the bandwidth-delay product must be satisfied at each protocol level)
- Protocol handshakes or limitations: for example, the transfer ready in an FCP write command
Bandwidth Delay Product
"Long fat networks" have a large bandwidth-delay product.
- Bandwidth-delay product = the amount of data that must be in flight to saturate the network link.
- Rules of thumb: 1 ms of RTT corresponds to about 128 KB of buffering at 1 Gb/s, and to roughly 100 km of maximum separation.
- Example: over the same distance, a 1 Gb pipe needs 2.56 MB of both transmit data and receive window to sustain line rate, while a 100 Mb pipe needs only 256 KB.
A worked sketch follows below.
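A minimal sketch of the arithmetic (Python; the 20 ms RTT is inferred from the slide's 2.56 MB example together with the 1 ms ≈ 128 KB rule of thumb):

```python
def bdp_bytes(rate_gbps: float, rtt_ms: float) -> float:
    """Bandwidth-delay product: the data that must be in flight to fill the pipe."""
    return rate_gbps * 1e9 / 8 * rtt_ms / 1e3

# ~20 ms RTT, i.e. roughly 2000 km of fiber separation
for rate in (1.0, 0.1):                       # 1 Gb/s and 100 Mb/s pipes
    print(f"{rate:4.1f} Gb/s: {bdp_bytes(rate, 20) / 1e6:.2f} MB in flight")
# 1.0 Gb/s: 2.50 MB  (the slide's 2.56 MB rounds 1 ms at 1 Gb/s up to 128 KB)
# 0.1 Gb/s: 0.25 MB
```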
Performance Droop due to Distance
(Figures: throughput as a percentage of line rate versus distance in km, from 0 to 5000 km, for buffer/outstanding-data sizes of 1 MB, 256 KB, 64 KB, and 32 KB. One chart shows droop at GE line rate (125 MB/s), the other at OC-3 line rate (~18 MB/s). The smaller the buffer or outstanding data relative to the bandwidth-delay product, the earlier and steeper the droop.)
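The droop curves can be reproduced from the simple rule that throughput is capped at one buffer's (or one window's) worth of data per round trip. A minimal sketch under that assumption, approximating the GE chart (Python):

```python
def droop(line_rate_MBps: float, buffer_KB: float, distance_km: float) -> float:
    """Fraction of line rate achievable when only `buffer_KB` of data can be
    outstanding per round trip (throughput = min(line rate, buffer / RTT))."""
    rtt_s = 2 * distance_km * 5e-6                 # ~5 us/km each way in fiber
    if rtt_s == 0:
        return 1.0
    achievable = (buffer_KB * 1024) / rtt_s / 1e6  # MB/s
    return min(1.0, achievable / line_rate_MBps)

# GE line rate (125 MB/s) with a 256 KB buffer, as in the charts
for km in (100, 500, 1000, 2000):
    print(f"{km:5d} km: {droop(125, 256, km) * 100:5.1f}% of line rate")
#   100 km: 100.0%
#   500 km:  41.9%
#  1000 km:  21.0%
#  2000 km:  10.5%
```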
Application-Storage Interaction
- Synchronous replication: each command must be completed at the remote array before it is completed locally.
- Asynchronous replication: commands are completed locally as they happen, and the data is written to the remote array in the background.
The distance between the local and remote array has a large effect on the execution of the application. Most synchronous replication has about a 200 km range (see the sketch below).
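A minimal sketch contrasting the host-visible write latency in the two modes (Python; the 0.5 ms local service time and single remote round trip are illustrative assumptions):

```python
def write_latency_ms(distance_km: float, local_ms: float = 0.5,
                     synchronous: bool = True, round_trips: int = 1) -> float:
    """Host-visible write latency under replication.
    Synchronous: local service time plus the WAN round trip(s) to the remote
    array. Asynchronous: only the local service time is visible to the host."""
    if not synchronous:
        return local_ms
    rtt_ms = distance_km / 100.0            # ~1 ms RTT per 100 km of fiber
    return local_ms + round_trips * rtt_ms

for km in (50, 200, 1000):
    print(f"{km:5d} km: sync {write_latency_ms(km):4.1f} ms, "
          f"async {write_latency_ms(km, synchronous=False):4.1f} ms")
#    50 km: sync  1.0 ms, async  0.5 ms
#   200 km: sync  2.5 ms, async  0.5 ms
#  1000 km: sync 10.5 ms, async  0.5 ms
```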
Application-Storage Interaction
- Instead of continuously writing to remote storage, the storage device or server could do a snapshot or backup of the data. In this case a relatively large block of data must be moved across the network, and this may have to happen within a specific backup window.
- It is important to understand the behavior of the applications and storage devices in the SAN to know what demands this places upon the MAN or WAN network.
Optimization Examples
- Compression
- Write acceleration
- Tape acceleration
- Transport accelerations, such as those for TCP/IP already discussed
Compression
Compression increases the effective network capacity by the compression ratio (see the sketch below).
- Compression ratio: the size of the incoming data divided by the size of the outgoing data. Determined by the data pattern and the algorithm. History buffers help the compression ratio since they retain more data for potential matches.
- Compression rate: the speed at which incoming data is processed. Different algorithms need different processing power.
- There are many algorithms; higher compression ratios generally require more processing power to achieve the same throughput.
- Encrypted data is effectively incompressible.
- The latency added by compression is not usually significant on MAN or WAN time scales (it adds about a frame delay).
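A minimal sketch of the effective-capacity arithmetic (Python; the link rate, 3:1 ratio, and compression engine rate are illustrative assumptions):

```python
def effective_throughput_MBps(link_MBps: float, compression_ratio: float,
                              compressor_MBps: float) -> float:
    """Effective capacity of a compressed WAN link: the link carries
    `compression_ratio` times more user data, but never more than the
    compression engine itself can process (its compression rate)."""
    return min(link_MBps * compression_ratio, compressor_MBps)

# Example: an OC-3 link (~18 MB/s), 3:1 compressible data,
# compression engine rated at 100 MB/s
print(f"{effective_throughput_MBps(18, 3.0, 100):.0f} MB/s effective")  # 54 MB/s
```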
Write Acceleration (Fast Write)
- Without write acceleration, a remote write costs 2 x RTT plus response times (one round trip for the command and transfer ready, one for the data and status).
- With write acceleration, the transfer ready is answered locally, so the write costs 1 x RTT plus response times.
- iSCSI can do the same trick with immediate data and unsolicited data.
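A minimal sketch of the round-trip saving (Python; the fixed response-time term is an illustrative assumption):

```python
def remote_write_ms(distance_km: float, fast_write: bool,
                    response_ms: float = 0.5) -> float:
    """FCP write completion time across a WAN link.
    Plain FCP: command -> XFER_RDY -> data -> status = 2 round trips.
    Fast write: XFER_RDY is answered locally = 1 round trip."""
    rtt_ms = distance_km / 100.0                 # ~1 ms RTT per 100 km
    round_trips = 1 if fast_write else 2
    return round_trips * rtt_ms + response_ms

d = 1000  # km
print(f"plain: {remote_write_ms(d, False):.1f} ms, "
      f"fast write: {remote_write_ms(d, True):.1f} ms")
# plain: 20.5 ms, fast write: 10.5 ms -- roughly double the write rate
# for a latency-bound application at this distance.
```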
Tape Read Acceleration
- Tape devices only allow 1 outstanding command.
- The remote gateway reads ahead by issuing commands itself.
- The data is sent to, and buffered by, the local gateway until the host's command is received.
- This works well because a tape is a sequential access device.
Tape Write Acceleration
- Tape devices only allow 1 outstanding command.
- Both write acceleration and early response are applied to allow pipelined commands.
Security
- Not discussed here at all, but clearly important! There is an entire track on the security topic: check out the SNIA Tutorial Track: Security.
- In general, data in transit needs to be secured whenever it traverses an exposed network segment. This can be in lots of places, but generally it is where the network leaves a secure data center.
- Technologies include IPsec, FC-SP, etc.
The End
- MAN and WAN storage networking is a big topic with lots of diverse technologies.
- Once the technologies are chosen there are still lots of moving parts to worry about: you must design the SAN to match the MAN/WAN AND design the MAN/WAN to match the SAN.
- This world overlaps with WAN accelerators, remote file system access, grids and clouds, and more.
Q&A / Feedback
Please send any questions or comments on this presentation to SNIA: tracknetworking@snia.org
Many thanks to the following individuals for their contributions to this tutorial: Joseph L White, Simon Gordon, Viswesh Ananthakrishnan, Howard Goldstein, Walter Dey, Greg Schulz. Based upon the presentation by Stephen Barr.
- SNIA Education Committee