A VIRTUALIZATION ARCHITECTURE FOR WIRELESS NETWORK CARDS

A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy by Ranveer Chandra, January 2006

Ranveer Chandra, Ph.D.
Cornell University 2006

This doctoral dissertation describes the design and applications of a new virtualization architecture for wireless network cards, called MultiNet. MultiNet virtualizes a single wireless card so that it appears to the user as multiple virtual wireless cards. Each virtual card can then be configured separately on a physically different network. The goal of MultiNet is to provide a user-level illusion of simultaneous connectivity on all virtual cards, although the network card is on a single network at any instant. MultiNet achieves this transparency using intelligent buffering and switching algorithms. The switching and buffering mechanisms are implemented as a kernel driver, while the policies are implemented as a user-level service. The MultiNet system has been implemented over Windows XP and has been operational for over two years. It is agnostic of the upper layer protocols, and works well over popular IEEE 802.11 wireless LAN cards.

Further, MultiNet enables a new class of applications that were previously possible only with multiple wireless cards in the device. This dissertation describes two such applications: Slotted Seeded Channel Hopping (SSCH) and Client Conduit.

SSCH is a new channel hopping protocol that works over MultiNet, and utilizes frequency diversity to increase the capacity of IEEE 802.11 wireless networks. Each node using SSCH switches across channels in such a manner that nodes desiring to communicate overlap, while disjoint communications do not overlap, and hence do not interfere with each other. To achieve this, SSCH uses a novel scheme for distributed rendezvous and synchronization. Simulation results show that SSCH significantly increases network capacity in several multihop and single hop wireless networking scenarios.

Client Conduit is a novel technique for providing connectivity to disconnected wireless clients with the help of nearby connected clients. It is based on MultiNet and takes advantage of the beaconing and probing mechanisms of IEEE 802.11 to ensure that connected clients do not pay unnecessary overheads while helping disconnected clients. Client Conduit has been implemented over Windows XP as part of an architecture for diagnosing faults in wireless networks.

BIOGRAPHICAL SKETCH

Ranveer was born in Jamshedpur, an industrial town in Eastern India, on August 27, 1976, as the third in a family of four children. He lived in Jamshedpur for the first 18 years of his life and decided to appear for the IIT exam after finishing high school. Ranveer secured a good rank in the IIT qualifying exam and decided to go to IIT Kharagpur, which was within 100 miles of Jamshedpur. IIT Kharagpur provided an ideal setting for Ranveer to complete his undergraduate education in an environment that had good professors, extraordinary peers, little distraction, and still a lot of fun. Ranveer majored in Computer Science, and developed a keen interest in computer networking and distributed systems. The opportunity of solving challenging problems in these fields motivated Ranveer to study further. He applied to a few schools in the United States, and decided to go to Cornell University in Ithaca, NY for his PhD in Computer Science. Over the six years at Cornell University Ranveer worked with a number of people at Cornell. He also spent three summers in Microsoft Research and one at AT&T Labs - Research, and enjoyed working in industrial research labs. After completing his PhD, Ranveer is headed towards the North-West, where he has accepted an offer from Microsoft Research in Redmond, WA.

ACKNOWLEDGEMENTS

First, I want to thank my advisor, Ken Birman, for his constant support and guidance during my six years of PhD study at Cornell University. He kept me motivated and provided the right direction that enabled me to finish these challenging years of work. His sharp intellect and great comments were always the guiding feature in my PhD. Further, his towering figure in the field of Computer Science has been and will always be a role model of what I want to achieve with my research.

Secondly, I am grateful to Victor Bahl for bringing out in me what I really wanted to do in research. Interactions with him during the three internships made me realize the open problems in wireless networking, and what I needed to do to make an impact in this field. Victor has also been a constant source of encouragement. His unbridled enthusiasm on seeing results always motivated me to go further in research.

I am also grateful to my other committee members, Eva Tardos, Zygmunt Haas and Robbert van Renesse, who have been supportive of my research in every step of my PhD. Their comments have been very valuable in rewriting the final draft of this dissertation.

I would also like to thank my other coauthors at Microsoft Research. In particular, Atul Adya has been a great influence during my PhD. His views and ideas have influenced the way I write, present and do my research. Lili Qiu has shown me how perseverance, patience and good work always pay off. Finally, John Dunagan has been of great help in reviewing my work, and showing me the right direction. In addition, I would also like to thank Alec Wolman and Jitu Padhye for great research conversations.

Finally, I would like to acknowledge the contribution of my family members and friends for keeping me motivated to finish my PhD. My parents have shown their belief in me and supported me in every possible way. My sister and brother-in-law have always been with me through the troubled phases of my PhD. I would also like to thank Meenakshi, Biswanath, Ben, Rimon, Indranil and Rama for making the six years of stay in Ithaca very enjoyable.

TABLE OF CONTENTS

1 Introduction
  Problems with Existing Wireless Networks
  Thesis and Its Contributions
  Limitations of this Dissertation
  Roadmap of this Dissertation

2 The MultiNet Virtualization Approach
  Introduction
  Motivating Scenarios
  Prior Work
  Background
    Limitations in Existing Systems
    Power Save Mode (PSM) of IEEE 802.11
    Next Generation of IEEE 802.11 WLAN cards
  MultiNet
    Assumptions about the System
    MultiNet Design Goals
    The MultiNet Approach
    Delivering Packets to Virtual Interfaces
    Determining the Activity Period for a Network
    Handling Ad Hoc Networks with Multiple MultiNet Nodes
  Implementation
    MultiNet Driver
    MultiNet Service
    Implementing Buffering
    Implementing Slotted Synchronization
  System Evaluation
    Test Configuration
    Reducing the Switching Delay
    Comparing Different Switching Strategies
    Adaptive Switching using MultiNet
    MultiNet with and without Buffering
    MultiNet with Slotted Synchronization
    MultiNet on a Mobile Node
    MultiNet versus Multiple Radios
    Maximum Connectivity in MultiNet
    Summary
  Discussion on the MultiNet Architecture
    Reducing the Switching Overhead
    Network Port Based Authentication
    Can MultiNet be done in the Firmware?
  Future Research
  Summary

3 SSCH: Capacity Improvement Using MultiNet
  Introduction
  Background and Motivation
  Hardware and MAC Assumptions
  Prior Work
  SSCH
    Packet Scheduling
    Channel Scheduling
    Mathematical Properties of SSCH
  Performance Evaluation
    Microbenchmarks
    Macrobenchmarks: Single-hop Case
    Macrobenchmarks: Multihop and Mobility
  Implementation Considerations
  Alternatives to SSCH
  Future Research
  Summary

4 Client Conduit and Fault Diagnosis in Wireless Networks
  Introduction
  Faults in a Wireless Network
  Related Work
  System Architecture
    System Requirements
    System Components
    System Scaling
    System Security
  Client Conduit
    The Client Conduit Protocol
    Client Conduit Security and Attacks
  Fault Detection and Diagnosis
    Locating Disconnected Clients
    Network Performance Problems
    Rogue AP Detection
  Implementation
  System Evaluation
    Cost of Individual Operations
    Client Conduit
    Location Determination
    Estimating Wireless Delays
    Rogue AP Detection
    Scalability Analysis
  Future Work
  Summary

5 Conclusion

References

LIST OF TABLES

2.1 The Switching Delays between IS and AH networks for IEEE 802.11 cards with and without the optimization of trapping media connect and disconnect messages
2.2 The average throughput in the ad hoc and infrastructure networks using both strategies of MultiNet and two radios
2.3 The average packet delay in infrastructure mode for the various strategies
2.4 The average packet delay in infrastructure mode on varying the number of MultiNet connected networks
4.1 Different fault diagnosis mechanisms and entities that can diagnose them; the last column indicates if the solution can be supported using legacy APs
4.2 Times for different operations: U means time measured from user-level code; rest are times taken for the corresponding ioctl to complete

LIST OF FIGURES

2.1 The MultiNet Layer maintains virtual interfaces for networks 1, 2 and 3, and switches the physical card across all these networks. It gives the illusion of connectivity on all networks although the card is on network 2 at this instant
2.2 The steps of Spoofed Buffering when a node uses MultiNet to connect to two networks
2.3 Two nodes in communication range and using MultiNet that fail to overlap in the ad hoc network and hence experience a logical partitioning
2.4 The Network Stack with MultiNet
2.5 Time taken to complete a 47 MB FTP transfer on an ad hoc and infrastructure network using different switching strategies
2.6 Variation of the activity period for two networks with time. The activity period of a network is directly proportional to the relative traffic on it
2.7 TCP Performance with and without Spoofed Buffering
2.8 Effect on UDP flows when a node uses Slotted Synchronization to join an ad hoc network
2.9 MultiNet in a Mobile Scenario
2.10 Packet trace for the web browsing application over the infrastructure network
2.11 Packet trace for the presentation and chat workloads over the ad hoc network
2.12 Comparison of total energy usage when using MultiNet versus two radios
2.13 Energy usage when using MultiNet and two radios with IEEE 802.11 Power Saving
3.1 Only one of the three packets can be transmitted when all the nodes are on the same channel
3.2 Channel hopping schedules for two nodes with 3 channels and 2 slots. Node A always overlaps with Node B in slot 1 and the parity slot. The field of the channel schedule that determines the channel during each slot is shown in bold
3.3 The problem with a naive synchronization scheme. Node A has two slots, with (channel, seed) pairs represented by A_1 and A_2; nodes B and C are similarly depicted. At time t_1, node A synchronizes with node B. Node B synchronizes with node C at time t_2, after which A and B are no longer synchronized
3.4 Need for De-synchronization: All nodes converge to the same channel without de-synchronization
3.5 Switching and Synchronizing Overhead: Node 1 starts a maximum rate UDP flow to Node 2. We show the throughput for both SSCH and IEEE 802.11a
3.6 Overhead of an Absent Node: Node 1 is sending a maximum rate UDP stream to Node 2. Node 1 then attempts to send a packet to a nonexistent node
3.7 Overhead of a Parallel Session: Node 1 is sending a maximum rate UDP stream to Node 2. Node 1 then starts a second stream to another node
3.8 Overhead of Mobility: Node 1 is sending a maximum rate UDP stream to Node 2. Node 1 starts another maximum rate UDP session to Node 3. Node 3 moves out of range at 30 seconds, while Node 1 continues to attempt to send until 43 seconds
3.9 Overhead of Clock Skew: Throughput between two nodes using SSCH as a function of clock skew
3.10 Disjoint Flows: The throughput of each flow on increasing the number of flows
3.11 Disjoint Flows: The system throughput on increasing the number of flows
3.12 Non-disjoint Flows: The average throughput of each flow on increasing the number of flows. There is a flow from every node in the network
3.13 Non-disjoint Flows: The system throughput on increasing the number of flows. There is a flow from every node in the network
3.14 Effect of Flow Duration: Ratio of SSCH average throughput to IEEE 802.11a average throughput for flows having different durations
3.15 TCP over SSCH: Steady-state TCP throughput when varying the number of non-disjoint flows
3.16 Multihop Chain Network: Variation in throughput as chain length increases
3.17 Multihop Mesh Network of 100 Nodes: Average flow throughput on varying the number of flows in the network
3.18 Impact of SSCH on Unmodified MANET Routing Protocols: The average time to discover a route and the average route length for 10 randomly chosen routes in a 100 node network using DSR over SSCH
3.19 Dense Multihop Mobile Network: The per-flow throughput and the average route length for 10 flows in a 100 node network in a 200m × 200m area, using DSR over both SSCH and IEEE 802.11a
3.20 Sparse Multihop Mobile Network: The per-flow throughput and the average route length for 10 flows in a 100 node network in a 300m × 300m area, using DSR over both SSCH and IEEE 802.11a
4.1 Number of wireless related complaints logged by the IT department of a major US corporation
4.2 Fault Diagnosis Architecture
4.3 Client Conduit Mechanism (Steps 1 through 5 are described below)
4.4 Decision steps taken by the DS to determine if an AP is a Rogue AP or not
4.5 Components on DC and DAP
4.6 CPU usage in Promiscuous mode (1 GHz machine)
4.7 Breakdown of costs for Client Conduit. The protocol steps are executed from the bottom entry in the legend to the topmost, i.e., starting at Set channel
4.8 Time taken by a disconnected client to transfer data via MultiNet
4.9 Median error in locating disconnected clients. The lower and upper bounds of error bars correspond to min and max error. E(i) denotes that the i-th connected client's location contains error
4.10 EDEN's accuracy of estimating the delay at a client
4.11 Breakdown of delay at the client, AP, and the medium as estimated by EDEN
4.12 Overlapping channels on which an AP is overheard
4.13 Overlapping channels heard relative to distance
4.14 The maximum idle time duration available during every 5-minute period at different times of the day

CHAPTER 1
INTRODUCTION

There has been recent interest in using multiple wireless cards in a device [9, 64, 87, 95, 115, 119]. This dissertation provides a cheaper and more energy-efficient scheme that offers the functionality of multiple wireless cards while using only a single physical network interface. This approach, called MultiNet, is a new architecture for virtualizing wireless cards. MultiNet helps to solve some of the key problems in wireless networks, which we outline in the rest of this chapter.

1.1 Problems with Existing Wireless Networks

Wireless technology has an increasing presence in our lives, from cellular phones, wireless LANs, Bluetooth headphones, cordless phones and location systems to smart homes and many more. This trend will grow with an increasing deployment of sensor networks [61, 88], mesh networks [50, 93], and the recent WiMAX initiative [63, 133]. Although they are increasingly common, wireless networks are still relatively fragile and underutilized. In order to make wireless networks robust we have to solve a number of important problems, some of which can be grouped under the following categories:

Manageability: Wireless networks are frustratingly opaque. This leads to long delays in resolving performance and connectivity problems, as well as high manageability costs [5, 7, 39, 103, 131]. The state of the art will be significantly enhanced by a management infrastructure for wireless networks that diagnoses problems with minimum human intervention and informs the user of ways to recover from them [3, 8].

Capacity: Although the bandwidth of wireless networks is steadily increasing, capacity is still a bottleneck for many applications [40, 65, 95]. Any scheme that increases wireless capacity, through advanced antennas [34, 42] or smarter protocols [16, 114], will greatly impact the wireless performance of a number of applications.

Power: Limited battery power is the Achilles' heel of wireless applications [72]. Applications and protocols for mobile computing should prolong battery life by using schemes such as maximizing sleep durations of wireless cards [71], using transmit power control [69], or avoiding multiple wireless interfaces [30].

1.2 Thesis and Its Contributions

This doctoral dissertation contributes towards solving these problems for IEEE 802.11 wireless networks [58] by proposing a new virtualization architecture called MultiNet. MultiNet virtualizes a single wireless card to make it appear as multiple wireless cards to the user. The user can configure each virtual card separately to be on a physically different network. For example, when using an IEEE 802.11 card the user can connect one virtual card to an infrastructure network and another virtual card to an ad hoc network, although the network card is on a single physical network at any instant. The goal of MultiNet is to provide a user-level illusion of simultaneous connectivity on all wireless networks. MultiNet achieves this transparency using intelligent buffering and switching algorithms. MultiNet has been implemented over Windows XP and is available for download. In addition to describing this architecture, this thesis also explores three ways in which MultiNet alleviates the above problems of wireless networks.

Firstly, MultiNet enables a number of techniques to reduce power consumption. For example, it allows the functionality of multiple interfaces to be provided in situations where the fixed energy cost of multiple physical interfaces is not feasible. MultiNet also enables a new power saving mechanism by allowing nodes to function as relays using only one wireless card: nodes with low battery power can send their traffic to the Access Point at a lower transmit power using intermediate relay nodes.

Secondly, MultiNet facilitates a way to increase the capacity of wireless ad hoc networks by exploiting channel diversity. The capacity of ad hoc networks is known to scale poorly with the number of communicating nodes [67]. When multiple neighboring node pairs want to communicate using IEEE 802.11, only one pair can be active at a time. However, other nodes can talk simultaneously if they are on orthogonal frequency channels, since traffic on orthogonal channels does not interfere. But this breaks the semantics of wireless networks: two neighboring nodes in a network might be on different channels and then cannot communicate. MultiNet helps to solve this problem by creating one virtual interface per orthogonal channel. This dissertation proposes a new scheduling algorithm, called Slotted Seeded Channel Hopping (SSCH), which works with MultiNet to improve network capacity. The goal of SSCH is to have communicating nodes on the same channel and other nodes on randomly different channels at any instant, while ensuring that any two neighboring nodes overlap within a fixed period. SSCH achieves this goal by introducing the technique of partial synchronization, and also makes use of existing techniques such as pseudo-random generators. It is shown mathematically that SSCH has the desired synchronization properties. Using simulations in QualNet, it is shown that SSCH significantly improves the wireless capacity of IEEE 802.11 networks.

Finally, MultiNet enables a novel communication mechanism for disconnected machines, called Client Conduit, which is used to diagnose faults in infrastructure wireless networks. A recent surge in the deployment of large-scale enterprise and city-wide wireless networks [37] entails a pressing need for wireless network management tools, similar to those for wired networks [56, 94]. Network administrators want to know why users are suffering from poor performance and frequent disconnections. They are also interested in locating security breaches, for example an unauthorized (rogue) access point plugged into an enterprise's Ethernet jack that jeopardizes its resources. In our architecture, Client Conduit allows disconnected clients to transfer diagnostic messages to and from a backend server. It is implemented using MultiNet, since MultiNet allows connected clients to stay on the infrastructure network using one virtual interface, and form an ad hoc network with the disconnected client on another virtual interface. This thesis presents a lightweight mechanism to implement Client Conduit, where virtual interfaces are added dynamically and a connected client suffers no penalty in the common case. It also proposes algorithms to detect rogue access points, locate disconnected clients, and diagnose poor wireless performance. This architecture has been prototyped over Windows XP using off-the-shelf wireless cards and access points.

1.3 Limitations of this Dissertation

Although MultiNet has been implemented over Windows XP, it has not been tested over all cases and in large deployments. Consequently, simulation results were used to show the feasibility of MultiNet. Further, the inability of available hardware to quickly switch across frequency channels limited all results on SSCH to simulations in QualNet [62]. However, realistic simulation parameters were chosen and a mathematical analysis of SSCH was done to show that SSCH will significantly improve the capacity of wireless networks when the required hardware is available. MultiNet, SSCH and our fault diagnosis architecture have additional limitations, and we enumerate them in Chapters 2, 3 and 4 respectively.
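The SSCH goal stated in Section 1.2 (communicating nodes share a channel, other nodes spread out, and any two neighbors still meet within a fixed period) can be illustrated with a small sketch. The code below is a simplified illustration, not the protocol as specified in Chapter 3. It assumes that each slot carries a (channel, seed) pair and that the channel advances by its seed, modulo a prime number of channels, once per cycle; the update rule, the function names, and the example schedules are all assumptions made for illustration, and the parity slot is omitted.

```python
from typing import List, Tuple

Slot = Tuple[int, int]  # (channel, seed); this layout is an assumption


def next_schedule(schedule: List[Slot], num_channels: int) -> List[Slot]:
    """Advance every slot by its seed, modulo the number of channels."""
    return [((ch + seed) % num_channels, seed) for ch, seed in schedule]


def channels_over_time(schedule: List[Slot], num_channels: int,
                       cycles: int) -> List[int]:
    """Channels a node visits, slot by slot, over the given number of cycles."""
    visited: List[int] = []
    current = list(schedule)
    for _ in range(cycles):
        visited.extend(ch for ch, _ in current)
        current = next_schedule(current, num_channels)
    return visited


def overlap_times(a: List[Slot], b: List[Slot], num_channels: int,
                  cycles: int) -> List[int]:
    """Time indices (one per slot) at which both nodes are on the same channel."""
    ca = channels_over_time(a, num_channels, cycles)
    cb = channels_over_time(b, num_channels, cycles)
    return [t for t, (x, y) in enumerate(zip(ca, cb)) if x == y]


# Hypothetical schedules with 3 channels and 2 slots per cycle.
# The nodes share the (channel, seed) pair of their first slot, so they
# overlap in that slot in every cycle. Their second slots use different
# seeds, so with a prime channel count they still meet once every 3 cycles.
NUM_CHANNELS = 3
node_a = [(1, 2), (2, 1)]
node_b = [(1, 2), (0, 2)]
print(overlap_times(node_a, node_b, NUM_CHANNELS, cycles=3))
```

Under this assumed rule, two nodes that want to communicate can copy one (channel, seed) pair from each other and stay together in that slot, while unrelated nodes with different seeds drift apart but still rendezvous periodically. Two slots with the same seed but different channels would never meet, which is one reason a real design needs an extra mechanism such as the parity slot mentioned in the list of figures.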

1.4 Roadmap of this Dissertation

Chapter 2 describes the MultiNet architecture in detail. It also shows that MultiNet consumes less energy than an alternative approach of using multiple wireless cards. Chapter 3 describes the SSCH protocol and its properties, and analyzes the performance of SSCH. Chapter 4 then presents our fault diagnosis architecture, and describes and evaluates the design of Client Conduit. Finally, Chapter 5 concludes this dissertation.

Most of the contents of Chapters 2, 3 and 4 are adapted from previously written independent papers, in particular [30], [16] and [3] respectively. The contributions of the coauthors of each of these papers are listed in the last paragraph of each chapter.

CHAPTER 2
THE MULTINET VIRTUALIZATION APPROACH

2.1 Introduction

Systems research over the last two decades has revealed a number of benefits of virtualizing different system components, such as virtual machines [20, 49, 126, 130], virtual storage [55, 81] and virtual memory [23], among others. However, the benefits of virtualizing a wireless card have not been explored. This chapter describes MultiNet, a new virtualization architecture that abstracts a single wireless LAN (WLAN) [60] card, making it appear as multiple virtual cards to the user.

MultiNet enables several compelling scenarios. These include increased connectivity for end users; increased range of the wireless network; bridging between infrastructure and ad hoc wireless networks; and painless secure access to sensitive resources. We discuss these in detail in Section 2.2. To explore this problem space with current technology, one would have to use a separate WLAN card for each desired network [64, 115]. Doing so is costly, cumbersome, and consumes energy resources that are often limited. An alternative is to use the MultiNet virtualization approach.

Virtualizing a wireless card poses several research challenges. Firstly, a virtual wireless card should appear as a real (physical) wireless card to the user. Secondly, the user should get an illusion of simultaneous connectivity on all virtual cards, although the physical wireless card can only be on one network at any instant [58]. Thirdly, the system should be deployable and compatible with nodes not using virtualization. Moreover, the virtualization software should not require modifications to existing backbone infrastructure, such as Access Points (APs) [58] and routers.

MultiNet solves the above problems by creating a new virtual interface for each network to which connectivity is desired. The virtual interface exports itself as a new physical device to the network layer. It also maintains the state of the physical card required for connecting to the wireless network corresponding to this virtual interface. MultiNet achieves the illusion of simultaneous connectivity over all networks by switching the physical network card across the desired networks and activating the corresponding virtual interface. Further, MultiNet is deployable as it does not require changes to APs and routers. This is achieved by a new protocol called Spoofed Buffering, which leverages the Power Save Mode of the IEEE 802.11 [58] standard, and is described later in this chapter.

The main contributions of this chapter can be summarized as follows:

It presents the design of MultiNet, which is a new architecture for virtualizing WLAN cards. As part of the design it describes the state that needs to be stored for every virtual wireless card. It also describes in detail the implementation of MultiNet over Windows XP. The implementation works with modest modifications to the Operating System kernel, and without any modifications to the wireless card drivers.

It proposes a new protocol, called Spoofed Buffering, which delivers packets sent to a node using MultiNet when it is on another network. APs buffer packets for the nodes that have switched to another network, and deliver them when nodes switch back to their network. Spoofed Buffering achieves this functionality without requiring any changes to APs. This protocol has also been used in a recent work for fast handoff in IEEE 802.11 networks [100].

It analyzes the performance of MultiNet over a number of commercial WLAN cards, and shows that MultiNet is suitable for most applications. It describes a technique to reduce the overhead of switching a wireless card across networks, and shows that MultiNet consumes less battery power than an alternative approach of using multiple wireless cards in the device.

As of this writing, MultiNet has been operational for over two years. During this time, we have refined the protocols and analyzed them in greater detail. Many of the results we present in this chapter are based on real working systems that include current and next generation IEEE 802.11 wireless cards. For cases where it is not possible to study the properties of the system without large scale deployment and optimized hardware, we carry out simulation based studies. Most of our simulations are driven by traffic traces that represent typical traffic. For IEEE 802.11, our study shows that MultiNet nodes can save up to 50% of the energy consumed by nodes with two cards, while providing similar functionality. We also quantify the delay versus energy tradeoff for switching nodes over performance sensitive applications.

The rest of this chapter is organized as follows. Section 2.2 presents some scenarios and applications that motivate the need for MultiNet and for which MultiNet is currently being used. Section 2.3 presents some related research and Section 2.4 provides the background needed for the rest of the chapter. The MultiNet architecture is presented in Section 2.5, and its implementation is described in Section 2.6. Performance and feasibility are discussed in Sections 2.7 and 2.8. Future work is presented in Section 2.9 and we conclude in Section 2.10.

2.2 Motivating Scenarios

MultiNet enables several new applications that were earlier not possible using a single wireless card. A few examples include:

Concurrent Connectivity: A user can connect to multiple wireless networks. The user specifies a list of networks, and MultiNet simultaneously connects to all of them.

Network Elasticity: The range of an infrastructure network can be extended if border nodes use MultiNet to function as relays for authorized nodes that are outside the range of the Access Point (AP). We implemented this functionality as part of our fault diagnosis architecture, and describe it in detail in Chapter 4.

Gateway Node: A node that is part of a wireless ad hoc network and close to an AP that is connected to the Internet can use MultiNet to stay connected on both networks, and become a gateway node for the ad hoc network [26].

Network Security: Different groups (e.g. human resources personnel, secretaries, developers etc.) within a company may be given different permissions to access data servers. These servers could be on physically different networks. A privileged user, who has permission to access different networks, can use MultiNet to simultaneously connect to multiple networks.

Increased Capacity: The capacity of ad hoc networks can be increased if nodes within interference range communicate on orthogonal frequency channels [16, 114]. In Chapter 3, we describe SSCH, which uses MultiNet to virtualize a wireless card into as many instances as the number of orthogonal channels, and simultaneously connects on all of them.

Virtual Machines: Existing Virtual Machine architectures (for example, [28, 126, 130]) restrict all virtual machine instances to stay connected on the same wireless network. MultiNet allows users to connect different virtual machines to physically different wireless networks using only a single wireless card.

Seamless Roaming: The time to hand off from one AP to another is a significant overhead in mobile wireless networks [113]. MultiNet allows a wireless card to connect to an AP without disconnecting from its previous one. This technique has been used in a recent work, called SyncScan [100].

All the above scenarios require nodes to stay connected on more than one wireless network, and MultiNet achieves this functionality with only one wireless card.

2.3 Prior Work

Virtualization has been studied extensively for abstracting a single system resource as multiple available resources to the user. For example, Virtual Machine architectures, such as VMWare [126], Denali [130], Xen [20], Terra [49], etc., virtualize a single computer to give an illusion of many smaller virtual machines, each running its own operating system. Storage Virtualization systems, such as Facade [81] and Stonehenge [55], virtualize a storage device into multiple logical storage devices. Similarly, Virtual Memory [23, 41] presents an illusion of larger memory to user programs than is actually available. MultiNet is similar to the above systems in abstracting a single resource, in this case a wireless card, as multiple wireless cards to the user. However, to the best of our knowledge, it is the first system that virtualizes wireless network cards.

Prior work has looked at virtualizing the wired network interface on a machine. The Virtual Machine architectures discussed above [20, 28, 49, 126, 130] virtualize all hardware resources, including the network interface [120]. Other systems for low latency communication, such as U-Net [128] and VIA [29, 38], virtualize the network interface into multiple local virtual interfaces, one for each process. The physical network interface is multiplexed across the virtual interfaces to send packets sent by a process. Network Cloning [138] brings up multiple network stacks for a single physical interface. Similar to these systems, MultiNet abstracts the wireless interface as multiple virtual interfaces, and multiplexes the physical card across the virtual instances. However, it faces different challenges that do not arise in the case of wired networks. Firstly, each virtual wireless card might require connectivity to a physically different wireless network. Therefore, in contrast to the above systems, only one virtual instance is physically on the network at any time. Secondly, switching to a different network takes a few hundred milliseconds, as we show in Section 2.7. So, the approach used by the above systems, where packets from different virtual interfaces are serviced by the wired interface in the order in which they arrive, might incur a network switch overhead for every packet. This scheme may not be suitable for virtualizing wireless cards. MultiNet uses different switching and buffering algorithms, which are described later in this chapter.

Another set of related work looked at smart channel hopping schemes over a single wireless radio [66, 89, 114]. The idea is to distribute interfering traffic on different frequency channels to increase the capacity of wireless networks. MultiNet differs from these systems in two aspects. Firstly, MultiNet has to switch across multiple networks instead of channels, and consequently MultiNet has to store more state for each network. Secondly, all the above protocols have only been evaluated in simulation. We are not aware of any prior implementation of these systems.

As part of MultiNet's design goals, which we will describe in Section 2.5.1, any two neighboring nodes in an ad hoc network should overlap on the same frequency channel within a definite period. Our solution to this problem, described in Section 2.5.6, relies on clock synchronization provided by the Timer Synchronization Function (TSF) of IEEE 802.11 [58]. The TSF algorithm and its variants [54, 74, 110] are based on an algorithm proposed by Lamport [75], which shows that given the clock accuracy, link delay and network diameter, and assuming that a timestamp is sent successfully along each link at a constant frequency, the timing values of the entire network are guaranteed to be within an established bound. A previous work [54] has shown that these algorithms work reasonably well when there are no Byzantine failures [76] in the network. For our algorithms to work with such failures, we would need clock synchronization algorithms with stronger guarantees [116, 125]. However, handling these failures is out of the scope of this dissertation.

To the best of our knowledge, the idea of simultaneously connecting to multiple wireless networks has not been studied before in the context of wireless LANs. A related problem was considered for scatternet formation in Bluetooth [92] networks [77, 78]. Bluetooth networks comprise basic units, called piconets, that can have at most 7 nodes. Piconets are used to form bigger networks, called scatternets, by having some nodes on multiple piconets. However, the problem of enabling nodes in Bluetooth networks to listen to multiple piconets is significantly different from the problem of allowing nodes to connect to multiple IEEE 802.11 networks. Bluetooth uses a frequency hopping scheme for communication between multiple nodes on the network. A node can be on two networks simultaneously if it knows the correct hopping sequence of the two networks and hops fast enough. IEEE 802.11 networks, on the other hand, have no such scheme, as described next in Section 2.4.

An alternative to virtualizing wireless cards is to use multiple radios in the device, and this approach has been commonly used in commercial products [64, 115, 119] and wireless networking research [9, 87, 95]. However, as we show in Section 2.7, using multiple radios consumes more power, which is a scarce resource in battery operated devices. Further, a recent result shows that the performance of multi-radio systems is significantly degraded by the self interference among the radios on the device [106]. In Section 2.7.8, we show that MultiNet solves these problems of multi-radio systems at a cost of reduced throughput.

2.4 Background

This section first discusses the limitations of IEEE 802.11 and describes why maintaining simultaneous connections to multiple wireless networks is a non-trivial problem. It then briefly describes the Power Save Mode (PSM) [58] feature of IEEE 802.11, which is used in the Spoofed Buffering Protocol described in Section 2.5.4. Finally, it discusses the next generation of WLAN cards, over which we evaluate MultiNet.

2.4.1 Limitations in Existing Systems

Popular wireless networks, such as IEEE 802.11, work by association. Once associated to a particular network, either an AP based (infrastructure) or an ad hoc network, the wireless card can receive and send traffic only on that network. The card cannot interact with nodes in another network if the nodes are operating on a different frequency channel. Further, a node in an ad hoc network cannot interact with a node in the infrastructure network even when they are operating on the same channel. This is because the IEEE 802.11 standard defines different protocols for communication in the two modes and it does not address the difficult issue of synchronization between different networks. As a matter of practical concern, most commercially available WLAN cards trigger a firmware reset each time the mode is changed from infrastructure to ad hoc or vice versa.

2.4.2 Power Save Mode (PSM) of IEEE 802.11

The IEEE 802.11 standard defines Power Save Mode (PSM) for infrastructure wireless networks as a means to save battery power. When a node wants to use PSM, it sends a message to the AP and sets its wireless interface to sleep mode. The message to the AP also contains the duration for which the node wants to sleep. This duration is called the Listen Interval. When the AP receives a packet destined for the sleeping node, it buffers the packet. After a Listen Interval period, the node using PSM wakes up, and receives the packets buffered at the AP. Usually, the Listen Interval is set to be a multiple of the Beacon Period, where the Beacon Period is the interval at which an AP broadcasts its beacon. The Beacon Period is a parameter of the AP, while the Listen Interval is a parameter of the node using PSM.

2.4.3 Next Generation of IEEE 802.11 WLAN cards

In order to reduce the cost and commoditize wireless cards, IEEE 802.11 WLAN card vendors [11, 102] are minimizing the functionality of the code residing in the microcontroller of their cards. This next generation of wireless cards, which we refer to as Native WiFi cards, implements just the basic time-critical MAC functions, while leaving their control and configuration to the operating system. More importantly, these cards allow the operating system to maintain state and do not undergo a firmware reset on changing the mode of the wireless card. This is in contrast to the existing cards, which we refer to as legacy wireless cards in the rest of this dissertation.

2.5 MultiNet

This section first formulates the MultiNet problem and enumerates its design goals. It then describes the MultiNet system in detail.

2.5.1 Assumptions about the System

MultiNet is designed to work in a Wireless LAN environment, such as IEEE 802.11. We make the following assumptions about such a network:

• All nodes in a network are synchronized to within a millisecond. IEEE 802.11 maintains a timer and uses a distributed Timing Synchronization Function (TSF) [58] to synchronize these timers at all nodes in a network. IEEE 802.11b synchronizes the timers at all nodes in a network to within 224 µs [60]. TSF, or its modifications, ATSP [54, 74] or ASCP [110], can be used to achieve the required synchronization granularity even when broadcast packets are lost.

• APs implement Power Save Mode (PSM), and have enough buffer space to support all nodes using PSM in the network. This feature is defined in the IEEE 802.11 standard [58], and is implemented in some existing WLAN products [35, 121, 122].

• There is an overhead of switching a wireless card from one network to another. This comprises the time to switch to another channel and associate to the network. As we discuss in Section 2.7, this overhead is a few hundred milliseconds for most commercial wireless cards. MultiNet will give better performance when this delay is reduced, using the schemes we discuss later in this dissertation.

• The applications on machines running MultiNet can tolerate variable throughput and delays. Some sample applications supported by MultiNet are browsing, file transfers and web downloads. The reason why other applications are not supported is explained in Section 2.5.2.

• The device driver of a wireless card sends a disconnect message to the network layer when it disconnects from a network, and a connect message when it successfully connects to one. On modern operating systems, such as Linux and Windows XP, these messages are passed up to the user level and are used to display the current status of the physical interface. In Windows XP, the device driver sends a media disconnect and a media connect message on disconnection and connection respectively. In Linux, the device driver calls the netif_carrier_off and netif_carrier_on functions.

• A user knows if MultiNet is being used by more than one machine in an ad hoc network. Further, all nodes in an ad hoc network agree to install software to support MultiNet. Since ad hoc networks are usually cooperative networks, we expect this assumption to hold in most cases.

2.5.2 MultiNet Design Goals

A virtualized physical wireless card appears as multiple virtual network interfaces, where each virtual interface corresponds to a physically different wireless network. Further, MultiNet also strives to achieve the following design goals when virtualizing a wireless card:

• Transparency: To reduce the learning curve in using the system, we require virtual interfaces to appear as physical wireless cards to the user. The user should be able to connect different virtual cards to different wireless networks, although the physical card is only on one network at any instant. The architecture should ensure that packets sent to and from a virtual interface are not discarded if the physical wireless card is not on the corresponding network at that instant. Further, when a machine is mobile, the virtual interface should appear disconnected when the machine moves out of range of the network. However, it should appear connected when the machine moves back in the network range.

• Performance: The system should give the illusion of simultaneous connectivity on all virtual interfaces. Packet delays on a virtual interface should be minimized. The user should also be able to prioritize different virtual interfaces, so that packets on a more important network are sent and received with lower delay.

• Deployability: The system should be easy to deploy in an existing wireless network. It should work over the commonly used IEEE 802.11 standard, and with commercial wireless cards. Further, it should not require changes to the wireless card drivers or the network infrastructure. Nearly all of the modifications should be on the user's machines.

In addition to the above design goals, there are a few plausible goals that MultiNet does not achieve. Firstly, it does not aim to support real-time applications over the network, such as Voice over IP (VoIP) [127] or streaming video. This constraint arises from the few hundred milliseconds overhead when switching from one network to another. Unless this overhead is reduced, MultiNet will be unable to provide response time guarantees of less than a few hundred milliseconds on all networks. Secondly, MultiNet does not handle Byzantine failures in the network. Handling these failures would require changes to our buffering and synchronization protocols described in Sections 2.5.4 and 2.5.6 respectively, and is outside the scope of this dissertation. Thirdly, we defer the discussion of using MultiNet in multi-hop ad hoc networks to Chapter 3. In the rest of this chapter, we limit our discussion to using MultiNet in single hop ad hoc networks, where all nodes are in communication range of each other, and in infrastructure wireless networks. Finally, the current implementation of MultiNet allows a node to stay connected on only one ad hoc network in which multiple nodes use MultiNet. Enabling a node to use MultiNet for maintaining connections to more than one such ad hoc network is part of future work.

2.5.3 The MultiNet Approach

MultiNet achieves the above design goals by introducing functionality in a new layer, between the network and physical layers of the network stack, as shown in Figure 2.1. This layer, called the MultiNet Layer, initializes and maintains a new virtual network interface for every new network on which the user wants to stay connected. The IEEE 802.11 parameters [58] of the physical wireless card are duplicated at each virtual interface. So, each virtual interface has its own Service Set Identifier (SSID) and Network Mode and appears as a separate wireless card to the network layer. All virtual interfaces appear as connected to the network layer, even though the physical card is connected to only one wireless network at any instant. This is shown in Figure 2.1 where IP sees virtual interfaces 1, 2 and 3 as connected to networks 1, 2 and 3 respectively, although the physical card is connected to Network 2.

Figure 2.1: The MultiNet Layer maintains virtual interfaces for networks 1, 2 and 3, and switches the physical card across all these networks. It gives the illusion of connectivity on all networks although the card is on network 2 at this instant. (The figure shows the stack from top to bottom: Application at user level; Transport (TCP, UDP), IP, the MultiNet Layer with virtual interfaces for Networks 1, 2 and 3, and MAC and PHY at kernel level. The wireless card is on Network 2.)

Since all virtual interfaces appear as connected, the user might send packets on any of them. Packets sent to a virtual interface, when the physical card is not on its corresponding wireless network, are buffered in a packet buffer maintained at each virtual interface. Packets are sent over the network without any delay if the physical card is on the network.

MultiNet provides an illusion of simultaneous connectivity on all networks by multiplexing the physical wireless card across all virtual interfaces. The physical card stays connected on a network long enough to send and receive one or more packets on the corresponding virtual interface. The MultiNet Layer then switches the physical card to a network corresponding to another virtual interface. The information about the network is retrieved from the state stored in the virtual interface. After switching the physical card to another network, MultiNet waits for a media connect message from the lower layers. This message is sent only if the physical card successfully switches to another network. On receiving this message, MultiNet sends the packets buffered on the virtual interface, and stays connected to this network for some time. This cycle continues in round-robin fashion across all virtual interfaces.

Before describing the architecture further, we briefly define some terms we use in the rest of this chapter. The period of time for which a card stays on a network after successfully connecting to it is called the Activity Period for the network. The time to switch to another network, from the time switching is initiated to the time the card is associated to the wireless network, is called the Switching Delay for the network. The Activity Period is the useful time when a card sends and receives packets, while the Switching Delay is an overhead when the card is not on any network. The performance of MultiNet is better when the Switching Delays are small. The sum of the Activity Periods and Switching Delays over all connected networks is called the Switching Cycle.

Switching from one network to another requires the physical card to disconnect from one network and connect to the other. Correspondingly, as described in Section 2.5.1, the physical layer sends disconnect and connect messages to the upper layers. These messages change the connectivity status of the virtual network interface, and as a result only one virtual interface appears as connected at any time. This is a problem for MultiNet since the operating system drops packets sent on a disconnected network interface. MultiNet solves this problem by trapping the disconnect message sent by the physical layer immediately after a disconnection. This message is received at the MultiNet Layer and is prevented from going up the network stack. Consequently, layers above the MultiNet Layer see all the virtual interfaces as connected although the physical card switches across different networks.

MultiNet also manages the state of a virtual interface when a network disconnection is caused by factors such as mobility or weak signal strength. The virtual interface is made to appear disconnected when the physical card is unable to connect to its network, and is made to appear connected when the physical card regains connectivity to the network. MultiNet achieves this functionality by not trapping the disconnect message when it is caused by any source other than MultiNet. As a result the virtual card appears disconnected whenever the physical wireless card is unable to connect to its network. Further, MultiNet attempts to connect to all networks in its Switching Cycle, even if its previous attempt to connect was unsuccessful. When the physical wireless card successfully connects to a network, the connect message is passed up the network stack, and the corresponding virtual interface appears connected. We demonstrate this functionality in a mobile scenario in Section 2.7.
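The behaviour described in this section can be summarized in a small model. The following Python sketch is only an illustration under stated assumptions: the class and method names (VirtualInterface, MultiNetLayer, start_switch, on_media_disconnect and so on) are hypothetical and do not correspond to the actual driver, which is an NDIS intermediate driver. The sketch shows per-interface buffering, trapping of disconnect messages caused by MultiNet's own switching, and propagation of disconnects caused by anything else.

```python
from collections import deque


class VirtualInterface:
    """State kept for one network (hypothetical model, not the real driver)."""

    def __init__(self, ssid):
        self.ssid = ssid
        self.connected = True   # status reported to the network layer
        self.buffer = deque()   # packets held while the card is elsewhere


class MultiNetLayer:
    def __init__(self, ssids):
        self.vifs = {ssid: VirtualInterface(ssid) for ssid in ssids}
        self.active = None       # network the physical card is on, if any
        self.switching = False   # True while MultiNet itself is switching
        self.transmitted = []    # (ssid, packet) pairs handed to the card

    def send(self, ssid, packet):
        vif = self.vifs[ssid]
        if not vif.connected:
            return False         # the OS drops packets on a disconnected interface
        if ssid == self.active:
            self.transmitted.append((ssid, packet))   # sent without delay
        else:
            vif.buffer.append(packet)                 # held until the card returns
        return True

    def start_switch(self):
        # Called when the Activity Period of the current network expires.
        self.switching = True

    def on_media_disconnect(self):
        """Return True if the message is passed up the stack, False if trapped."""
        previous, self.active = self.active, None
        if self.switching:
            return False         # caused by MultiNet: trap it
        if previous is not None:
            self.vifs[previous].connected = False     # genuine loss of connectivity
        return True

    def on_media_connect(self, ssid):
        vif = self.vifs[ssid]
        self.active, self.switching = ssid, False
        vif.connected = True
        while vif.buffer:        # flush packets buffered for this network
            self.transmitted.append((ssid, vif.buffer.popleft()))

    def on_connect_failed(self, ssid):
        # The card could not reach this network: show it as disconnected.
        self.switching = False
        self.vifs[ssid].connected = False
```

In this model a disconnect is trapped only while a switch initiated by MultiNet is in progress, which is one simple way to distinguish the two causes of disconnection; the text does not say how the real driver makes that distinction.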

This design of MultiNet poses two interesting questions. Firstly, how are packets delivered to a virtual interface if the card has switched to another network? Secondly, how long should the card stay on a network? We first answer these questions for the scenario when only one machine in any ad hoc network uses MultiNet. We then develop our approach to handle the case when MultiNet is used by more than one node in an ad hoc network. An important question we defer to future work, in Section 2.9, is the interaction of TCP with MultiNet.

2.5.4 Delivering Packets to Virtual Interfaces

In this section, we present a buffering protocol that prevents packets sent to a virtual interface from being discarded when the physical card is not on the corresponding network. As part of the protocol, we describe a new approach that allows MultiNet to work in infrastructure networks without modifying the APs.

The buffering protocol works differently for ad hoc and infrastructure networks. For ad hoc networks, just before switching out of the network, a node broadcasts a packet that informs all other nodes in the network of its unavailability and when it will be back in the network. On switching back to the ad hoc network, the node broadcasts another packet announcing its availability. Packets destined for this node are buffered by other nodes in the ad hoc network, until either of the following two conditions holds: the broadcast announcing availability of the node is received, or the time by which the node was expected to be back in the network has elapsed. If the node is available, then the buffered packets are sent to it. Otherwise, if the timer has elapsed, then the buffered packets are discarded. This protocol requires modifications at all nodes in the ad hoc network, even if they do not use MultiNet to connect to multiple networks. This should not be very difficult to achieve, as was explained in Section 2.5.1.
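The ad hoc buffering rule above can be sketched from the point of view of a neighbouring node. This is a minimal Python illustration, not the implemented protocol: the PeerBuffer class, its method names and the use of plain numbers for time are assumptions made for the example. It also assumes that once a node's announced return time has passed, later packets are transmitted directly, which the text leaves unspecified.

```python
class PeerBuffer:
    """Buffering done by a neighbour for nodes that announced an absence."""

    def __init__(self):
        self.away_until = {}   # node -> time by which it promised to be back
        self.pending = {}      # node -> packets buffered for it

    def on_unavailable(self, node, return_time):
        # Broadcast received: the node is leaving and states when it returns.
        self.away_until[node] = return_time
        self.pending.setdefault(node, [])

    def on_available(self, node):
        # Broadcast received: the node is back. Return the packets to send it.
        self.away_until.pop(node, None)
        return self.pending.pop(node, [])

    def expire(self, now):
        # Discard packets for nodes whose announced return time has elapsed.
        dropped = []
        for node, deadline in list(self.away_until.items()):
            if now >= deadline:
                del self.away_until[node]
                dropped.extend(self.pending.pop(node, []))
        return dropped

    def send(self, node, packet, now):
        # Return the packets to transmit immediately (empty if buffered).
        self.expire(now)
        if node in self.away_until:
            self.pending[node].append(packet)
            return []
        return [packet]
```

A neighbour would call on_unavailable and on_available when the corresponding broadcasts arrive, and send for every outgoing packet.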

MultiNet could use a similar protocol for infrastructure networks. However, APs would need to be modified to buffer packets destined for nodes using MultiNet on their networks. This significantly affects the deployability of MultiNet, as discussed in Section 2.5.2. MultiNet solves this problem by proposing a new protocol, called Spoofed Buffering. Spoofed Buffering buffers packets at the APs without requiring modifications to them.

Spoofed Buffering works as follows. MultiNet spoofs sleep mode to the AP just before switching out of an infrastructure network. It sends a special IEEE 802.11 packet to the AP, which informs the AP that it is using IEEE 802.11 PSM to go to sleep mode, and the time for which it will sleep. While the AP knows the node to be sleeping, MultiNet switches the physical card to another network. As described in Section 2.4.2, PSM requires APs to buffer packets for nodes that are sleeping in its network, and to send the buffered packets when the nodes wake up. So, packets destined for the MultiNet nodes are buffered at the AP until the node switches back to the infrastructure network. The node then informs the AP that it is awake by sending another IEEE 802.11 packet. On receiving this packet, the AP sends all the buffered packets, which are received by the corresponding virtual interface.

Figure 2.2 illustrates the steps of Spoofed Buffering when a node uses MultiNet to connect to two wireless networks. Before switching out of network 1, the node informs the AP that it is going to sleep for a certain time. It then switches to network 2, where it announces that it is awake. The AP in network 2 then sends the buffered packets to the node, which forwards them up to the corresponding virtual interface. The virtual interface also sends its buffered packets to the AP. The node then stays on network 2 for the Activity Period. It then sends a message to the AP of network 2 announcing that it is going to sleep, and switches to network 1 and informs the AP of network 1 that it is awake. These steps continue as long as the node requires connectivity on both wireless networks.

Figure 2.2: The steps of Spoofed Buffering when a node uses MultiNet to connect to two networks. (Steps labelled in the figure, for a MultiNet node connected to networks 1 and 2: 1) I am sleeping, 2) I am awake, 3) Send/Receive packets, 4) I am sleeping, 5) I am awake.)

We note that despite our buffering protocol, packets might still be lost due to other reasons, such as mobility, wireless signal fade or interference. Further, buffering might not be possible at other nodes in the network, due to lack of cooperation from nodes in the ad hoc network or lack of PSM support at the APs. In such scenarios, MultiNet relies on higher layer protocols, such as TCP, to recover the lost packets. We compare MultiNet with and without buffering support in Section 2.7.5, and show that although MultiNet performs much better when the buffering protocols are implemented, its performance is reasonable in the bad case when no packets are buffered.
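The message sequence of Spoofed Buffering can be modelled with a toy access point. The Python sketch below is an illustration only: AccessPoint, notify_sleep, notify_awake and switch_network are hypothetical names, the "I am sleeping" and "I am awake" notifications stand in for the IEEE 802.11 frames the node actually sends, and the announced sleep duration is left out to keep the example short.

```python
class AccessPoint:
    """Toy AP that follows the PSM buffering rule described in Section 2.4.2."""

    def __init__(self, name):
        self.name = name
        self.sleeping = set()   # nodes the AP believes are asleep
        self.buffered = {}      # node -> packets held for it

    def notify_sleep(self, node):
        self.sleeping.add(node)

    def notify_awake(self, node):
        # The node woke up: hand over everything buffered for it.
        self.sleeping.discard(node)
        return self.buffered.pop(node, [])

    def deliver(self, node, packet):
        # Return the packets transmitted to the node right now.
        if node in self.sleeping:
            self.buffered.setdefault(node, []).append(packet)
            return []
        return [packet]


def switch_network(node, current_ap, next_ap):
    """One switch of the physical card between two infrastructure networks."""
    current_ap.notify_sleep(node)        # spoof sleep mode on the old network
    # The physical card would retune and associate here (Switching Delay).
    return next_ap.notify_awake(node)    # collect what the new AP buffered
```

Neither AP needs MultiNet-specific logic in this model: each one only applies its ordinary PSM behaviour, which is the point of the protocol.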

2.5.5 Determining the Activity Period for a Network

The Activity Period is the duration for which a wireless card stays connected on a network. MultiNet supports three schemes for determining this duration, each of which is useful in different scenarios.

• Fixed Slot Duration: In this approach, MultiNet divides time into slots of fixed duration. Every time the physical card switches to a network, it stays on that network for one slot. The slot duration includes the Switching Delay. This scheme is simple to implement, and is useful in cases where synchronization is required between multiple nodes using MultiNet in an ad hoc network. We use it for our algorithms in Section 2.5.6 and for SSCH described in Chapter 3.

• User Defined Priority: This scheme requires the user to prioritize all of their networks, and define the Total Activity Period. The Total Activity Period is the sum of Activity Periods of all networks, which is equal to the difference between the Switching Cycle and the sum of Switching Delays across all networks. MultiNet then calculates the Activity Period for each network based on its priority. So, if a user requires connectivity to a set of wireless networks, and has given network i a priority $x_i$, then the Activity Period of any network j is given by $x_j \cdot \frac{1}{\sum_k x_k} \cdot \mathrm{TotalActivityPeriod}$. This scheme is useful when there exists a predefined priority across all networks. For example, the Client Conduit Protocol, described in Chapter 4, uses user defined priorities to limit the duration for which a connected machine helps a disconnected wireless client.

• Adaptive Schemes: This approach does not require any intervention from the user. It dynamically prioritizes networks based on the amount of traffic seen on them, and uses these priorities to calculate the Activity Period for each network. Consequently, a network that sends and receives more packets has a longer Activity Period as compared to a less active one. So, if MultiNet has to switch across different networks, and network i has seen $P_i$ packets in its last Activity Period $ATP_i$, then the node stays in network j for an Activity Period given by $\frac{P_j}{ATP_j} \cdot \frac{1}{\sum_k P_k / ATP_k} \cdot \sum_k ATP_k$. The first term gives the network utilization of network j, the second normalizes by the total utilization across all networks, and the final term is the total amount of time the node is active across all networks. This approach is useful in scenarios where the user wants to get the best performance on multiple networks, without worrying about the traffic patterns on each network. We use this strategy to provide true zero configuration over MultiNet, as described later in this dissertation.

We evaluate two adaptive strategies for MultiNet in Section 2.7. Adaptive Buffer is a naive approach that prioritizes networks based on the number of packets buffered by their corresponding virtual interfaces during a Switching Cycle. Adaptive Traffic is a more sophisticated approach that maintains a history of packets sent and received on all virtual interfaces over a certain number of Switching Cycles. It then uses this history to prioritize across networks, and adapt their Activity Periods.

2.5.6 Handling Ad Hoc Networks with Multiple MultiNet Nodes

Supporting multiple nodes to use MultiNet in an ad hoc network poses a new problem. Any two nodes using MultiNet might not overlap in the ad hoc network for a significant period of time. Consequently, these nodes will be unable to communicate with each other for long durations even though they are in communication range of each other. This significantly affects the performance of MultiNet on the ad hoc network. Figure 2.3 illustrates this problem when two nodes A and B are in communication range of each other and use MultiNet with Fixed Slot Duration to connect to two networks: Infrastructure Network 1 and Ad Hoc Network 2. In this scenario, nodes A and B do not overlap in the ad hoc network, and cannot communicate in this network. However, note that this problem is specific to ad hoc networks, as these nodes can communicate in the infrastructure network using Spoofed Buffering to buffer packets at the APs. Further, this problem also arises for other switching protocols described in Section 2.5.5, as two nodes might overlap for a very small period of time, which is too small to send even a single packet.

Figure 2.3: Two nodes in communication range and using MultiNet that fail to overlap in the ad hoc network and hence experience a logical partitioning. (The figure shows a timeline for each machine. Machine A: IS Network 1, AH Network 2, IS Network 1, AH Network 2. Machine B: AH Network 2, IS Network 1, AH Network 2, IS Network 1.)

This section presents a simple approach, called Slotted Synchronization, to synchronize an overlap between any two nodes using MultiNet in a single hop ad hoc network. We discuss SSCH, which is a more sophisticated and efficient approach for multihop networks, in Chapter 3. Slotted Synchronization has a limitation that it allows a node to connect to only one ad hoc network in which multiple nodes use MultiNet. Extending this approach to allow nodes to stay connected in many ad hoc networks with multiple nodes using MultiNet is part of future work.

Slotted Synchronization uses what we term the Fixed Slot Duration switching scheme, in which time is divided into slots and nodes switch to a network at the beginning of a slot. All nodes use the same slot duration, and clocks at all nodes in a network are synchronized to within a millisecond of each other. The slot duration is chosen to be a few hundred milliseconds to accommodate the Switching Delay when switching to a network. We quantify the Switching Delay in Section 2.7. Slotted Synchronization makes the assumption, as described in Section 2.5.1, that the node starting an ad hoc network knows if more than one node in its network is going to use MultiNet.

Slotted Synchronization works as follows. The initiator node of an ad hoc network defines a recurrence period for the network. The recurrence period is the periodicity, measured in slots, at which MultiNet connects to the ad hoc network. As we show in Section 2.6.4, the SSID field of the IEEE 802.11 Beacon [58] can be modified to carry the information about the recurrence period of the network and the offset within the slot when the Beacon is transmitted. When a node uses MultiNet to join this network, it uses this information to synchronize the start time of its slots to that of the ad hoc network. Then, after every recurrence period slots, MultiNet switches the physical card of this node to the ad hoc network. Over the remaining slots, MultiNet switches the physical card across all the other networks.

This algorithm ensures that all nodes in the ad hoc network overlap for one slot every recurrence period slots, even when some nodes use MultiNet to stay connected on other networks. Slotted Synchronization achieves this guarantee by synchronizing the slots at all nodes in the network to the parameters specified by the initiator. Further, slot synchronization occurs only at the time of joining the network and so this algorithm is not affected by mobility in the network. Note that this algorithm might not work if a node uses it to synchronize slots to multiple networks, since the initiators' slots of these disjoint networks might not be synchronized. Therefore, we limit a node to use MultiNet to stay connected on only one ad hoc network in which multiple nodes use MultiNet.
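The slot schedule that Slotted Synchronization produces can be sketched as follows. This Python fragment is a hedged illustration, not the implemented algorithm: the function names are hypothetical, times are integers in microseconds, and it assumes that the slot in which the Beacon is heard counts as slot 0, which is an ad hoc slot, and that the remaining slots are shared round-robin among the node's other networks.

```python
def slot_start_from_beacon(beacon_rx_time_us, offset_in_slot_us):
    """Start of the initiator's current slot, from the offset carried in the Beacon."""
    return beacon_rx_time_us - offset_in_slot_us


def slot_index(now_us, slot_start_us, slot_duration_us):
    """Number of whole slots elapsed since the synchronized slot start."""
    return (now_us - slot_start_us) // slot_duration_us


def network_for_slot(k, recurrence_period, adhoc, others):
    """Network the physical card should be on during slot k."""
    if k % recurrence_period == 0 or not others:
        return adhoc
    # Count the non-ad-hoc slots before slot k, then rotate over the others.
    j = k - k // recurrence_period - 1
    return others[j % len(others)]


# Two nodes with different sets of other networks still meet on the
# ad hoc network in slots 0, 3, 6, ... when the recurrence period is 3.
schedule_a = [network_for_slot(k, 3, "adhoc", ["X", "Y"]) for k in range(7)]
schedule_b = [network_for_slot(k, 3, "adhoc", ["Z"]) for k in range(7)]
```

Because every node derives its slot boundaries from the same initiator parameters, the ad hoc slots coincide regardless of which other networks each node visits in between.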

However, it can connect to many infrastructure networks and ad hoc networks in which it is the only node using MultiNet.

2.6 Implementation

We implemented MultiNet on Windows XP as a combination of a kernel driver with a user level service. The mechanisms for storing network state, and for switching and buffering across networks are implemented in the kernel, while the respective policies are implemented in the service. The kernel driver is an NDIS intermediate driver, which exists as a layer between the network device drivers and IP.¹ MultiNet performs best when APs implement PSM and other nodes in an ad hoc network buffer packets for nodes using MultiNet. However, no changes are required in the wired nodes for MultiNet to work. The rest of this section describes the details of our implementation.

¹ Network Driver Interface Specification (NDIS) is a Windows construct that provides transport independence for the network card vendors. All networking protocols used by Windows call the NDIS interface to access the network.

2.6.1 MultiNet Driver

The MultiNet driver provides all the mechanisms required by the MultiNet architecture. It initializes and maintains the virtual interfaces, and provides support to switch a wireless card from one network to another and to buffer packets at the virtual interfaces if the physical card is not on the wireless network. This driver also sends the buffered packets when it receives a media connect message after switching to another network.

The MultiNet driver is implemented entirely as a Windows NDIS Intermediate driver. NDIS requires the lower binding of a network protocol, such as IP, to be a network miniport driver², such as the driver of a network interface. Similarly, NDIS requires the upper binding of miniport drivers to be a network protocol driver. We accommodate this requirement in the design of the MultiNet Driver, which includes two components: the MultiNet Protocol Driver (MPD), which provides an upper binding to the network card miniport driver, and the MultiNet Miniport Driver (MMD), which provides a lower binding to the network protocols, such as TCP/IP. The modified stack is illustrated in Figure 2.4.

² A miniport driver directly manages a network interface card (NIC) and provides an interface to higher-level drivers.

Figure 2.4: The Network Stack with MultiNet. (Labels in the figure: at user level, Application, Mobile Aware Application and MultiNet Service; WinSock 2.0; at kernel level, Legacy Protocols, TCP/IP and Native media-aware protocols; NDIS; the MultiNet Driver, comprising the MultiNet Miniport Driver (MMD) with virtual interfaces Net 1, Net 2, ..., Net N and the MultiNet Protocol Driver (MPD); NDIS WLAN extensions; NDIS miniport and NDIS WLAN miniport; Hardware.)

The MPD manages multiple virtual interfaces over one wireless card. It switches the association of the underlying card across different networks, and buffers packets if the SSID of the associated network is different from the SSID of the sending virtual interface. MPD also buffers packets on the instruction of the MultiNet Service, as we describe later in Section 2.6.3. Further, the MPD handles packets received by the wireless card. A packet received on the wireless card is sent to the virtual interface associated with the network on which the packet is received.

The MMD manages a virtual interface of a wireless card. It maintains the state for each virtual interface, which includes the SSID and operational mode of the wireless network. It is also responsible for handling query and set operations directed for the underlying physical wireless interface.

2.6.2 MultiNet Service

The MultiNet service implements the algorithms for switching across networks and buffering packets, described in Sections 2.5.5 and 2.5.4 respectively. This service is a user level daemon that uses I/O Control Codes (ioctls) to interact with the MultiNet Driver. It also broadcasts packets to interact with the service running at other nodes. These messages coordinate the buffering protocol for ad hoc networks, described in Section 2.5.4. Further, all the switching algorithms discussed in Section 2.5.5 are implemented in the MultiNet service. The service determines the duration of the Activity Period, and sends a signal to MPD when the Activity Period expires. This signal initiates the switching mechanism implemented in MPD. Finally, the service also coordinates the synchronization protocol described in Section 2.5.6. It embeds the recurrence period and offset in the IEEE 802.11 Beacon frame, and uses this information to synchronize the slot times of all nodes in the network.
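The Activity Period computation that the service performs for the User Defined Priority and Adaptive schemes can be written down directly from their definitions. The Python sketch below is an illustration under assumptions: the function names and the dictionary-based interface are hypothetical, and the fallback used when no traffic was seen on any network (keeping the previous periods) is a choice made for this example that the text does not specify.

```python
def priority_activity_periods(priorities, total_activity_period):
    """User Defined Priority: network j gets x_j / sum_k(x_k) of the total."""
    total_priority = sum(priorities.values())
    return {net: total_activity_period * x / total_priority
            for net, x in priorities.items()}


def adaptive_activity_periods(packets, last_periods):
    """Adaptive scheme: share the total active time in proportion to utilization.

    packets[net] is the packet count P seen in the network's last Activity
    Period, and last_periods[net] is the length ATP of that period.
    """
    utilization = {net: packets[net] / last_periods[net] for net in packets}
    total_utilization = sum(utilization.values())
    total_time = sum(last_periods.values())
    if total_utilization == 0:
        # Assumption for this sketch: with no traffic anywhere, keep the old split.
        return dict(last_periods)
    return {net: total_time * u / total_utilization
            for net, u in utilization.items()}


# Example: priorities 1 and 3 over a Total Activity Period of 400 ms.
by_priority = priority_activity_periods({"net1": 1, "net2": 3}, 400.0)

# Example: 10 and 30 packets seen over two previous periods of 200 ms each.
by_traffic = adaptive_activity_periods({"net1": 10, "net2": 30},
                                       {"net1": 200.0, "net2": 200.0})
```

In both schemes the resulting periods sum to the total active time, so the length of the Switching Cycle is unchanged and only its division among networks varies.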

Implementing Buffering

Spoofed Buffering, described in Section 2.5.4, buffers packets for MultiNet over infrastructure networks using IEEE 802.11 PSM. We successfully implemented this scheme over Native WiFi cards, which were described earlier. For non-native WiFi (legacy) cards, we were constrained by the proprietary software in the card drivers. This software does not expose any APIs in Windows to programmatically set the resolution of power save mode. Therefore, we were unable to implement the buffering algorithm for these WLAN cards. However, for prototyping Spoofed Buffering, we buffer packets at the end points of infrastructure networks, using a scheme similar to the one described for ad hoc networks. The MultiNet service keeps track of the end points of all ongoing sessions, and buffers packets if the destination is currently in another network.

Implementing Slotted Synchronization

The Slotted Synchronization Protocol, described in Section 2.5.6, requires an ad hoc network with multiple MultiNet nodes to have two parameters, in addition to the ones specified by IEEE 802.11. In particular, the initiator of such an ad hoc network has to specify the recurrence period and the offset within the slot when the IEEE 802.11 Beacon is sent. Any node joining this network has to learn both these parameters for Slotted Synchronization to work. One way to implement this requirement is to modify IEEE 802.11 packets to carry more information. However, this requires modifications to the wireless card driver, and might reduce the interoperability of MultiNet, as discussed earlier. We use an alternative approach to solve this problem. The two parameters are embedded in the SSID field of an IEEE 802.11 Beacon, which is broadcast once every

fixed interval (footnote 3). The SSID field of the Beacon frame is 32 bytes in length. The recurrence period is measured in slots, and its maximum value is the number of networks to which a user can connect. We limit this to 255, so 1 byte is sufficient to carry this information. Further, the offset within the slot is measured in microseconds, and we limit the maximum slot duration to 5 seconds. So, 2 bytes are enough to embed the value of the offset. Therefore, the user is allowed an SSID of up to 29 characters for such ad hoc networks. Based on experience, we believe that this does not significantly reduce the usability of IEEE 802.11 networks.

Footnote 3: The IEEE 802.11 protocol for joining an ad hoc network requires the joining node to use the information in the Beacons of that network.

2.7 System Evaluation

We studied the performance of MultiNet using a real implementation and a custom simulator. The implementation was used to study the throughput behavior with different switching algorithms. We then simulated MultiNet with realistic parameters, and compared it with the alternative approach of using multiple radios to connect to multiple networks. We compare the two approaches with respect to energy consumption and the average delay encountered by the packets. The results presented in this section confirm that MultiNet is a more energy-efficient way of achieving connectivity to multiple networks as compared to using multiple radios.

Test Configuration

MultiNet has been deployed and tested over a dozen commercial IEEE 802.11 wireless cards. The results in this section were derived over an IEEE 802.11b network [60]. The wireless cards used were the Cisco 340 series, Compaq WLAN 200, Orinoco Gold,

Netgear WAG 511 and the Native WiFi cards from AMD [11] and Realtek [102]. All these cards have a maximum data rate of 11 Mbps. The APs used were the Cisco 340 Series, EZConnect 2656, DLink DI-614+ and Native WiFi APs. IEEE 802.11 PSM was implemented only in the Native WiFi APs. Most of our results were consistent across all of this network equipment.

Reducing the Switching Delay

Good performance of MultiNet depends on a short delay when switching across networks. However, legacy IEEE 802.11b cards perform the entire association procedure every time they switch to a network. We carried out a detailed analysis of the time to associate to an IEEE 802.11 network. The results showed significant overhead when switching from one network to another. In fact, a delay of 3.9 seconds was observed from the time the card started associating to an ad hoc network, after switching from an infrastructure network, to the time it started sending data.

Table 2.1: The Switching Delays between IS and AH networks for IEEE 802.11 cards with and without the optimization of trapping media connect and disconnect messages.

Switching From    Unoptimized Legacy    Optimized Legacy    Optimized Native WiFi
IS to AH          3.9 s                 170 ms              25 ms
AH to IS          2.8 s                 300 ms              30 ms

Our investigations revealed that the cause of this delay is the media disconnect and media connect notifications to the IP stack. The IP stack damps the media disconnect and connect for a few seconds to protect itself and its clients from spurious signals. The

spurious connects and disconnects can be generated by network interface cards for a variety of reasons, ranging from buggy implementations of the card or switch firmware to the card or switch resetting itself to flush out old state. Windows was designed to damp the media disconnect and connect notifications for some time before rechecking the connectivity state of the adapter and taking the action commensurate with that state.

In the case of MultiNet, switching between networks is deliberate and meant to be hidden from higher protocols, such as IP, and from the applications. We hide switching by having MPD trap the media disconnect and media connect messages when it switches between networks. Since the MPD is placed below IP, it can prevent the network layer from receiving these messages. This minor modification significantly improves the Switching Delay, as shown in Table 2.1. Using the above optimization, we were able to reduce the switching delay from 2.8 seconds to 300 ms when switching from an ad hoc network to an infrastructure network, and from 3.9 seconds to 170 ms when switching from an infrastructure network to an ad hoc network. These numbers are further reduced to as low as 30 ms and 25 ms respectively when Native WiFi cards are used. We believe that this overhead is extraneous for purposes of MultiNet, and in Section 2.8 we suggest additional ways to make this delay negligible.

A nice consequence of masking the media connect and media disconnect messages is that all virtual adapters are visible to IP as connected, and our architecture is therefore consistent.

Comparing Different Switching Strategies

We implemented three switching strategies described in Section 2.5.5, i.e. User Defined Priority, Adaptive Buffer, and Adaptive Traffic. The test environment comprised a node that used MultiNet to stay connected to an infrastructure and an ad hoc network. The

Switching Delays from the ad hoc to the infrastructure network and vice versa were overestimated at 500 ms and 300 ms respectively (footnote 4). The total time available for switching between networks was 1 sec. We evaluated the switching strategies when simultaneously transferring a file of size 47 MB using FTP from the MultiNet node to two nodes on the different networks. An independent transfer of the file over the ad hoc network took seconds, while it took seconds over the infrastructure network.

Figure 2.5 shows the time taken to simultaneously transfer this file over MultiNet using different switching strategies for legacy cards. We evaluated 3 different User Defined Priority switching schemes. In the 50%IS 50%AH strategy the node stays on each network for 500 ms. In the 75%IS 25%AH scheme it stays on the infrastructure network for 750 ms and on the ad hoc network for 250 ms, and in the 25%IS 75%AH scheme the node stays on the infrastructure network for 250 ms and on the ad hoc network for 750 ms. For the Adaptive Traffic algorithm we used a window of 3 switching cycles to estimate the Activity Periods. In this case the window is 3 * 1.8 = 5.4 seconds, since the Switching Cycle is 1000 + 500 + 300 = 1800 ms.

Different switching strategies show different behavior, and each of them might be useful for different scenarios. For the User Defined Priority strategies, the network with higher priority gets a larger slot to remain connected. Therefore, the network with a higher priority takes less time to complete the FTP transfer. The results of the adaptive algorithms are similar. The Adaptive Buffer algorithm adjusts the time it stays on a network based on the number of packets buffered for that network. Since the maximum throughput on an infrastructure network is more than the throughput of an ad hoc network (footnote 5), the number of packets buffered for the infrastructure network is more.
Footnote 4: This overprovisioning helped to compare all the switching schemes evenly by fixing the duration of the Switching Cycle.

Footnote 5: Separate experiments revealed that the average throughput on a wireless network with commercial APs and wireless cards is 5.8 Mbps for an isolated infrastructure network and 4.4 Mbps for an isolated two node ad hoc network. These results are consistent with [52].

Figure 2.5: Time taken (in seconds) to complete a 47 MB FTP transfer on an ad hoc and infrastructure network using different switching strategies (25%IS 75%AH, 50%IS 50%AH, 75%IS 25%AH, Adaptive Buffer, Adaptive Traffic)

Therefore, the FTP transfer completes faster over the infrastructure network as compared to the 50%IS 50%AH case. For a similar reason the FTP transfer over the infrastructure network completes faster when using Adaptive Traffic switching. MultiNet sees much more traffic sent over the infrastructure network and proportionally gives more time to it.

Overall, the adaptive strategies work by giving more time to faster networks if there is maximum activity over all the networks. However, if some networks are more active than the others, then the active networks get more time. We expect these adaptive strategies to give the best performance if the user has no priority and wants to achieve the best performance over all the MultiNet networks.
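The timing arithmetic behind the User Defined Priority schemes in this comparison can be sketched in a few lines. This is only an illustration under stated assumptions: the function name and argument names are hypothetical, and the proportional split of the 1 second budget plus the two overestimated Switching Delays (500 ms and 300 ms) simply follows the numbers quoted in the text for this experiment.

```python
def user_defined_schedule(priorities, budget_ms=1000, switch_delays_ms=(500, 300)):
    """Split the activity budget across networks in proportion to priority.

    Returns the Activity Period per network and the resulting Switching
    Cycle, i.e. the activity budget plus one switching delay per switch.
    """
    total_priority = sum(priorities.values())
    activity_ms = {
        net: budget_ms * weight / total_priority
        for net, weight in priorities.items()
    }
    cycle_ms = budget_ms + sum(switch_delays_ms)
    return activity_ms, cycle_ms


# The 75%IS 25%AH scheme from the experiment.
activity_ms, cycle_ms = user_defined_schedule({"IS": 75, "AH": 25})

# Fraction of each Switching Cycle during which a network is reachable.
duty_cycle = {net: ms / cycle_ms for net, ms in activity_ms.items()}
```

With these inputs the infrastructure network is active for 750 ms and the ad hoc network for 250 ms of every 1800 ms cycle, which is why the overestimated switching delays account for a large share of the cycle on legacy cards.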

Figure 2.6: Variation of the activity period for two networks with time. The activity period of a network is directly proportional to the relative traffic on it. (Top: traffic in packets versus time in seconds; bottom: Activity Period in ms versus time in seconds, for the ad hoc and infrastructure networks.)

Adaptive Switching using MultiNet

The adaptability of MultiNet is demonstrated in Figure 2.6. The Adaptive Traffic switching strategy is evaluated by running our system for two networks, an ad hoc and an infrastructure network, for 150 seconds. The plots at the top of Figure 2.6 show the traffic seen on both the wireless networks, and the ones at the bottom of this figure show the corresponding effect on the Activity Period of each network. The adaptive switching strategy causes the Activity Period of the networks to vary according to the traffic seen on them. Initially, when there is no traffic on either network, MultiNet gives equal time to both networks. After 20 seconds there is more traffic on the ad hoc network, and so MultiNet allocates more time to it. The traffic on the infrastructure network is greater than the traffic on the ad hoc network after around 110 seconds. Consequently, the infrastructure network is allocated more time. This correspondence between relative traffic on a network and its activity periods is evident in Figure 2.6.
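The behavior described above, equal time when both networks are idle and time proportional to relative traffic otherwise, can be approximated with a small sliding-window allocator. This is a minimal sketch and not the MultiNet implementation: the class and method names are hypothetical, the 1000 ms budget and the 3-cycle window are borrowed from the FTP experiment earlier in this section, and the strictly proportional rule is an assumption (a deployed scheduler would probably also enforce a minimum Activity Period so that a quiet network is still visited).

```python
from collections import deque


class AdaptiveTrafficScheduler:
    """Allocate Activity Periods in proportion to recently observed traffic."""

    def __init__(self, networks, budget_ms=1000, window_cycles=3):
        self.networks = list(networks)
        self.budget_ms = budget_ms
        # Only the most recent `window_cycles` observations are kept.
        self.history = deque(maxlen=window_cycles)

    def record_cycle(self, packets_per_network):
        """Store the packet counts seen during one Switching Cycle."""
        self.history.append(dict(packets_per_network))

    def activity_periods(self):
        """Return the Activity Period (ms) for each network."""
        totals = {
            net: sum(cycle.get(net, 0) for cycle in self.history)
            for net in self.networks
        }
        grand_total = sum(totals.values())
        if grand_total == 0:
            # No traffic anywhere: give every network equal time.
            share = self.budget_ms / len(self.networks)
            return {net: share for net in self.networks}
        return {
            net: self.budget_ms * count / grand_total
            for net, count in totals.items()
        }


scheduler = AdaptiveTrafficScheduler(["AH", "IS"])
idle_periods = scheduler.activity_periods()
scheduler.record_cycle({"AH": 300, "IS": 100})
busy_ad_hoc_periods = scheduler.activity_periods()
```

Because old cycles fall out of the window, a shift in traffic from one network to the other is reflected in the allocation within three cycles.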

MultiNet, when used with adaptive switching schemes, provides true zero configuration. Prior schemes, such as Wireless Zero Configuration (WZC), require users to specify a list of preferred networks, and WZC only connects to the most preferred available wireless network. The adaptive switching strategies also require a user to specify a list of preferred networks, but the card connects to all of these networks, giving time to each network based on the amount of traffic on it.

MultiNet with and without Buffering

We have implemented Spoofed Buffering on infrastructure networks with Native WiFi cards using IEEE 802.11 PSM. However, many commercial APs do not implement PSM. Further, the ad hoc network buffering protocol, described in Section 2.5.4, relies on broadcast packets, which are more unreliable than unicast packets [91]. These packets might get lost, and packets destined to MultiNet's virtual interface might get dropped. The worst case occurs when no packets are buffered due to lost broadcast packets or lack of PSM support from commercial APs. Figure 2.7 compares this worst case to the scenarios when MultiNet implements buffering.

In our test scenario, we consider an infrastructure network with and without Spoofed Buffering. Packets were sent, using ntttcp, over the infrastructure network from the MultiNet node to another node in the network. Ntttcp, which is a port of ttcp [118] to Windows, works by establishing a TCP session between two nodes and sending the packets at the maximum rate. The Activity Period for both networks was fixed at 500 ms. We present results for three scenarios in Figure 2.7. NoMultiNet corresponds to the case when the sender and receiver are connected to just one network, MultiNetNoBuffer is when the sender is connected to two networks using MultiNet and the AP does not implement Spoofed Buffering, and MultiNetBuffer is when the APs implement Spoofed Buffering.

Figure 2.7: TCP Performance with and without Spoofed Buffering (TCP sequence number versus time in seconds for NoMultiNet, MultiNetBuffer and MultiNetNoBuffer)

Results show that the performance drops by a factor of four when using MultiNet with Spoofed Buffering and drops further when the AP does not buffer packets. When APs buffer packets, the MultiNet node can achieve a throughput proportional to the duration of its Activity Period, which is around a fourth of the Switching Cycle. Without buffering, the throughput of the system in this case goes down to a seventh of the maximum achievable throughput. Notice that although performance drops significantly, MultiNet is still usable with a throughput of around 500 Kbps.

MultiNet with Slotted Synchronization

We now study the performance of Slotted Synchronization, described earlier. We set up a three node network. The first machine always stays on the infrastructure network. Both the other machines use MultiNet. Before we start this experiment, the second node is connected to two networks, an ad hoc network and an infrastructure network. It is initially the only node in the ad hoc network. The third node, which we

also use as our test machine, is initially connected to only the infrastructure network. We start a UDP flow between the test machine and the first machine, which is only on the infrastructure network. We use Fixed Slot Duration switching, and set the duration of each slot to 800 ms. This duration includes the Switching Delay. IPerf [1] was used to initiate UDP flows of 1 Mbps with 512-byte packets. The MPD was also instrumented to report the total number of successful packets sent and received in every slot. This setup used Spoofed Buffering.

Figure 2.8: Effect on UDP flows when a node uses Slotted Synchronization to join an ad hoc network (throughput in Mbps versus time in seconds for the IS and ad hoc networks)

Figure 2.8 illustrates the instantaneous throughput, measured once per Switching Cycle, achieved by UDP flows when the test machine joins an ad hoc network that has more than one MultiNet node. Initially, when the test machine is only in the infrastructure network, there is no Switching Delay, and consequently the UDP throughput is around 1 Mbps. After 13 seconds, the test machine uses MultiNet to connect to the

ad hoc network, which already has one MultiNet node. The test machine takes around 15 seconds to initialize another virtual interface, build up its state, synchronize the slots to the MultiNet node in the ad hoc network and get a DHCP address for the virtual interface. After this time, the UDP flow between the test machine and the infrastructure network node resumes. We immediately start another UDP flow between the two MultiNet nodes in the ad hoc network. As we see in the figure, UDP throughput in the infrastructure network drops to around half the initial throughput. This is because the infrastructure network gets one of two slots in Fixed Slot Duration Switching, since MultiNet connects to two networks. The Switching Delay does not reduce the throughput further, because MultiNet is able to send the buffered packets over the Activity Period at the network's bandwidth, which is greater than the IPerf flow rate of 1 Mbps. Further, the flow over the ad hoc network achieves roughly the same throughput as the flow over the infrastructure network, which implies that Slotted Synchronization maintains a good overlap between multiple nodes using MultiNet to connect to an ad hoc network.

MultiNet on a Mobile Node

MultiNet does not aim to hide mobility from the user. As discussed in Section 2.5.2, MultiNet's virtual interfaces should behave as physical wireless cards when nodes are mobile. To illustrate this behavior, the same experimental setup as in the previous experiment was used. However, in this case, we focused on the throughput in the ad hoc network. After around 28 seconds, the test machine was moved away from the other MultiNet node in the ad hoc network. As we see in Figure 2.9, the IPerf throughput over the ad hoc network keeps falling as the machine moves away from the other node in the ad hoc network. With an increase in distance between the two nodes, the signal strength decreases, which increases the loss rate and reduces the throughput. After some time the

connection over the ad hoc network is lost. This state is propagated to the application layer, which halts IPerf. However, MultiNet keeps trying to reconnect to the ad hoc network, as described earlier. It regains connectivity at around 52 seconds. The IPerf flow is started immediately between the two nodes. As we see in the figure, the two nodes using MultiNet achieve the same throughput after reconnection as they had before the connection was lost. This shows that there is a significant overlap between the two nodes, and the performance of Slotted Synchronization is not significantly affected by mobility. The test machine was again moved at around 70 seconds, and we see a corresponding reduction in throughput.

Figure 2.9: MultiNet in a Mobile Scenario (throughput in Mbps versus time in seconds, with the Motion Start, Connection Lost and Connection Regained events marked)

MultiNet versus Multiple Radios

MultiNet is one way of staying connected to multiple wireless networks. The alternative approach is to use multiple wireless cards. Each card connects to a different network, and the machine is therefore connected to multiple networks. We simulated

this approach, and compared it with the MultiNet scheme with respect to the energy consumed and the average delay of packets over the different networks. We first present our simulation environment, and then compare the results of the MultiNet scheme to the alternative approach using multiple radios.

Simulation Environment

We simulated both approaches for a sample scenario of people wanting to share and discuss a presentation over an ad hoc network and browse the web over the infrastructure network at the same time. This feature is extremely useful in many scenarios. For example, consider the case where employees of a company, say Kisco, conduct a business meeting with employees of another company, say Macrosoft, at Macrosoft's headquarters. With MultiNet and a single wireless network card, Kisco employees can share documents, presentations, and data with Macrosoft's employees over an ad hoc network. Macrosoft's employees can stay connected to their internal network via the access point infrastructure while sharing electronic information with Kisco's employees. Macrosoft does not have to give Kisco employees access to its internal network in order for the two parties to communicate.

We model traffic over the two networks, and analyze the packet trace using our simulator. Traffic over the infrastructure network is considered to be mostly web browsing. We used Surge [18] to model HTTP requests according to the behavior of an Internet user. Surge is a tool that generates web requests with statistical properties similar to measured Internet data. The generated sequence of URL requests exhibits representative distributions for requested document size, temporal locality, spatial locality, user off times, document popularity and embedded document count. For our purposes, Surge was used to generate a web trace for a 1 hour 50 minute duration, and this web trace

was then broken down to a sample packet trace for this period. The distribution of the packet sizes over the infrastructure network is illustrated in Figure 2.10.

Figure 2.10: Packet trace for the web browsing application over the infrastructure network (packet size in bytes versus time in seconds)

The ad hoc network is used for two purposes: sharing a presentation, and supporting discussions using a sample chat application. Three presentations are shared in our application over a 1 hour 50 minute period. Each presentation is a 2 MB file, and is downloaded to the target machine using an FTP session over the ad hoc network. They are downloaded in the 1st minute, the 38th minute, and the 75th minute. Further, the user also chats continuously with other people in the presentation room, discussing the presentation and other relevant topics. Packet traces for both the applications, FTP and chat, were obtained by sniffing the network, using Ethereal [45], while running the respective applications. MSN messenger was used for a sample chat trace of 30 minutes' duration. The packet traces for FTP and chat were then extended over the duration of our application, and are illustrated in Figure 2.11.

In our simulations we assume that wireless networks operate at their maximum TCP throughput of 4.4 and 5.8 Mbps for an ad hoc and infrastructure network respectively. We then analyze the packet traces for independent networks, and generate another trace for MultiNet. We use the 75%IS 25%AH switching strategy presented earlier, with a switching cycle time of 400 ms. The switching delay is set to 1 ms, and we explain the reason for choosing this value in Section 2.8. Further, the power consumed when switching is assumed to be negligible. We do not expect these simplifying assumptions to greatly affect the results of our experiments.

We analyze packet traces for the two radio and MultiNet cases and compute the total power consumed and the average delay encountered by the packets. All the cards are assumed to be Cisco AIR-PCM350, and their corresponding power consumption numbers are used from [111]. Specifically, the card consumes 45 mW of power in sleep mode, 1.08 W in idle mode, 1.3 W in receive mode, and 1.875 W in transmit mode. Further, in PSM, the energy consumed by the Cisco AIR-PCM350 in one power save cycle is given by: n t millijoules, where n is the Listen Interval and t is the Beacon Period of the AP. The details of these numbers are presented in [111].

Table 2.2: The average throughput in the ad hoc and infrastructure networks using both strategies of MultiNet and two radios

Network           Two Radio    MultiNet
Ad Hoc            4.4 Mbps     1.1 Mbps
Infrastructure    5.8 Mbps     4.35 Mbps

Despite the performance advantages seen in Table 2.2, using multiple radios consumes more power. Each radio is always on, and therefore keeps transmitting and receiving over all the networks. Even when it is not, the radio is in idle mode, and drains

a significant amount of power. Figure 2.12 shows the amount of energy consumed by the MultiNet scheme and the two radio scheme for the above application. Two radios consume almost double the power consumed by the single MultiNet radio.

Figure 2.11: Packet trace for the presentation and chat workloads over the ad hoc network (packet size in bytes versus time in seconds)

Table 2.3: The average packet delay in infrastructure mode for the various strategies
Scheme Avg Delay (in Seconds) Two Radio MultiNet Two Radio PS MultiNet PS 0.167

Figure 2.12: Comparison of total energy usage when using MultiNet versus two radios (energy consumed in Joules versus time in seconds)

With Power Save Mode

The multiple radio approach can be modified to consume less power by allowing the network card in infrastructure mode to use PSM. Figure 2.13 shows the energy usage when the infrastructure radio uses PSM for our application. The Beacon Period is set to 100 ms, and the Listen Interval is 4. The amount of energy consumed in the two radio case using PSM is very close to the consumption of MultiNet without PSM. However, this saving comes at a price. It is no longer possible to achieve the high throughput for infrastructure networks if the cards are in PSM. Simulated results in Table 2.3 show that the average packet delay over the infrastructure network with PSM is now close to the average packet delay for MultiNet. Therefore, using two radios with PSM does not give significant benefits as compared to MultiNet without PSM.
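The energy comparison in this section comes down to multiplying the power drawn in each radio state by the time spent in that state. The sketch below uses the Cisco AIR-PCM350 power figures quoted earlier (45 mW sleep, 1.08 W idle, 1.3 W receive, 1.875 W transmit). The per-state durations are made-up illustrative values, not taken from the simulated traces, and the function name is hypothetical; the PSM energy formula is not modeled here.

```python
# Power draw per radio state in watts, as quoted for the Cisco AIR-PCM350.
POWER_W = {"sleep": 0.045, "idle": 1.08, "receive": 1.3, "transmit": 1.875}


def energy_joules(seconds_in_state):
    """Energy used by one radio given the seconds spent in each state."""
    return sum(POWER_W[state] * secs for state, secs in seconds_in_state.items())


# Illustrative hour of activity (assumed numbers): each network needs
# 60 s of transmission and 240 s of reception.
per_network = {"transmit": 60, "receive": 240}

# Two radios: each carries one network's traffic and idles the rest of the hour.
two_radios = 2 * energy_joules({**per_network, "idle": 3600 - 300})

# One MultiNet radio: carries both networks' traffic, idles the rest.
multinet = energy_joules({"transmit": 120, "receive": 480, "idle": 3600 - 600})

ratio = two_radios / multinet
```

Under these assumed durations idle time dominates the budget, so the second radio nearly doubles the total, which matches the trend reported for Figure 2.12.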

Without Power Save Mode

We analyze the two schemes of connecting to multiple networks with respect to the performance on the network and the amount of power consumed. In our simulated scenario, each of the radios gives the best achievable throughput on both the networks. As shown in Table 2.2, the average throughput of MultiNet in the infrastructure mode is 4.35 Mbps compared to 5.8 Mbps in the two radio case. The average throughput in the ad hoc network is 1.1 Mbps in MultiNet and 4.4 Mbps when using two radios. Switching results in lower throughput on each individual network, since the card is on a network for a smaller time period. Consequently, the scheme of using multiple cards gives much better throughput as compared to MultiNet when connected to multiple networks.

The power consumption of MultiNet can be reduced further by allowing it to enter the power save mode for infrastructure networks. In our experiment we chose the Switching Cycle to be 400 ms, with 75%IS 25%AH switching. For consistency in comparison, the Listen Interval is set to 4 and the Beacon Period to 100 ms. Consequently, every time the card switches to infrastructure mode, it listens for the traffic indication map from the AP. After it has processed all its packets it goes to sleep and wakes up after 300 ms. It then stays in the ad hoc network for 100 ms, and then switches back to the infrastructure network. The modified algorithm results in greater energy savings, as shown in Figure 2.13. The average delay per packet over the infrastructure network is not seriously affected, while the energy consumed is reduced by more than a factor of 3. We conclude that MultiNet is superior to the use of multiple cards when connecting to multiple networks in applications seeking convenience and power efficiency.

Note that we do not evaluate power saving in ad hoc mode because we are unaware of any commercial cards that implement this feature. As a result we were unable to get

performance numbers when using PSM in ad hoc mode. However, we believe that if such a scheme is implemented, we will be able to incorporate it in MultiNet, and further reduce the power consumption.

Figure 2.13: Energy usage when using MultiNet and two radios with IEEE 802.11 Power Saving (energy in Joules versus time in seconds for Two Radio PS, MultiNet No PS and MultiNet PS)

Maximum Connectivity in MultiNet

We use the simulation environment described above to evaluate the performance on increasing the number of connected networks. Table 2.4 presents the average delay seen by packets over the infrastructure network on varying the number of MultiNet networks from 2 to 6. We used a Fixed Priority switching strategy with equal priorities to all the networks. An increase in the number of connected networks results in a smaller Activity Period for each connected network when using Fixed Priority Switching. As a result, more packets are buffered and the average delay encountered by the packets on a network increases. This is shown in Table 2.4.

Table 2.4: The average packet delay in infrastructure mode on varying the number of MultiNet connected networks
Num Networks Avg Delay (in Seconds)

Summary

We summarize the conclusions of our performance analysis as follows:

No single switching strategy is best under all circumstances.

Adaptive strategies are best when no network preference is indicated. Both Adaptive Buffer and Adaptive Traffic give similar performance.

For the applications studied, MultiNet consumes 50% less energy than a two card solution.

As expected, the average packet delay with MultiNet varies linearly with an increase in the number of connected networks when all the networks are given equal activity periods.

Spoofed Buffering significantly improves the performance of MultiNet. However, MultiNet works even without Spoofed Buffering, although the performance goes down by a factor of 4.

Masking media connects and media disconnects below IP leads to significant

reduction in the switching overhead. The switching delay for legacy cards is reduced to around 300 ms, while this number goes down to 30 ms for Native WiFi cards.

Adaptive Switching eliminates the current zero configuration requirement to prioritize the preferred network. With MultiNet based zero configuration, the user connects to all preferred networks.

In mobile scenarios, MultiNet exposes the same connectivity status as a real card. Further, Slotted Synchronization works well in ad hoc networks with commercial wireless cards.

2.8 Discussion on the MultiNet Architecture

This section discusses ways in which the performance of MultiNet can be improved. In particular, it focuses on reducing the switching overhead, enabling 802.1X [57] authentication, and deployment.

Reducing the Switching Overhead

Good performance of MultiNet depends on low switching delays. The main cause of the switching overhead in current generation wireless cards is the 802.11 network association and authentication protocol [58], which is executed every time the card switches to a network. Further, these cards do not store state for more than one network in the firmware, and worse still, many card vendors force a firmware reset when changing the mode from ad hoc to infrastructure and vice versa.

Most of these problems are fixed in the next generation Native WiFi cards. These cards do not incur a firmware reset on changing their mode. Moreover, since switching

is forced by MultiNet, Native WiFi cards do not explicitly disconnect from the network when switching. However, they still carry out the 802.11 association procedure that causes the 25 to 30 ms delay. By allowing upper layer software to control associations, instead of automatically initializing them, this delay can be made negligible. The only overhead on switching is then the synchronization with the wireless network. This can be done reactively, if the card requests a synchronization beacon when it switches to a network.

Using the above optimizations, the time a WLAN card needs to switch to a network is bounded by the time the card takes to switch to a different channel and the time needed to load a network's state into a flash card. Recent research has shown that the time to switch to a different channel is less than 100 µsec for an IEEE 802.11a wireless card [51]. Further, as the network state to load is around 100 bytes, and the data transfer speed of flash cards is 8 Mbps [13], we expect the switching overhead to be less than 1 ms.

Network Port Based Authentication

IEEE 802.1X is a port based authentication protocol that is becoming popular for enterprise wireless LANs. For MultiNet to be useful in all environments it has to support this authentication protocol. However, the supplicant 802.1X protocol is implemented in the Wireless Zero Configuration Service (WZC) for Windows XP, and we had to turn off WZC for MultiNet to work. Only minor changes are needed in WZC for it to work with MultiNet. However, achieving good performance with IEEE 802.1X is difficult. We measured the overhead of the IEEE 802.1X authentication protocol and found it to be approximately 600 ms. It is clear that we need to prevent the card from going through a complete authentication procedure every time it switches across IEEE 802.1X enabled networks.
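The overhead figures quoted in this section can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text (about 100 bytes of network state, an 8 Mbps flash transfer rate, a channel switch of under 100 microseconds, and roughly 600 ms for a full 802.1X authentication); the function and variable names are hypothetical.

```python
def state_load_time_us(state_bytes=100, flash_rate_mbps=8):
    """Time in microseconds to move the per-network state at the flash rate.

    One Mbps is one bit per microsecond, so bits / Mbps gives microseconds.
    """
    return state_bytes * 8 / flash_rate_mbps


CHANNEL_SWITCH_US = 100        # upper bound quoted for an 802.11a card
DOT1X_AUTH_US = 600 * 1000     # measured 802.1X overhead, about 600 ms

# Optimized switch: change channel, then load the stored network state.
optimized_switch_us = CHANNEL_SWITCH_US + state_load_time_us()

# How much a full re-authentication on every switch would dwarf that cost.
auth_penalty_factor = DOT1X_AUTH_US / optimized_switch_us
```

The optimized switch comes to about 200 microseconds, comfortably under the 1 ms estimate, whereas repeating the 802.1X exchange on every switch would cost thousands of times more, which is why avoiding repeated authentication matters.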
We can eliminate the authentication cycles by storing the IEEE 802.1X state in the MPD and using this state instead of redoing the authentication procedure. Further,

the IEEE standard recommends an optimization called Preauthentication for the APs. Preauthentication works by having the APs maintain a list of authenticated nodes. When implemented, this optimization will eliminate the authentication overhead every time the wireless card switches to an 802.1X enabled network.

Can MultiNet be done in the Firmware?

The simple answer is yes; however, we strongly believe that the right place to implement MultiNet is as a kernel driver. Buffering imposes memory requirements that are best taken care of by the operating system, and the policy driven behavior can bloat the firmware. Additionally, by moving the intelligence into a general purpose PC, the cost of the wireless hardware can be reduced further, which is the trend for the next generation of WLAN cards that we described earlier.

Future Research

The switching behavior of MultiNet augurs badly for TCP performance. MultiNet is implemented below IP, and so TCP sees fluctuating behavior for packets sent by it. It receives immediate acknowledgements for packets sent when the network is active, and delayed acknowledgements for buffered packets. This behavior affects the way TCP estimates the RTT for the session, and from the way it is calculated, the RTT will always be an upper bound. An overestimate of RTT results in larger timeout values to ensure that packets are not lost. However, a larger than required RTT has other consequences with respect to flow control and congestion response. This problem is generally relevant for networks that have periodic connectivity. A solution to this problem has to mask the delay encountered by the buffered packets. We are currently exploring ways to achieve this, and improve TCP performance.
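
The effect on TCP's timers can be illustrated with a small numerical sketch. It uses the standard smoothed-RTT estimator of RFC 6298 (gains 1/8 and 1/4, timeout = SRTT + 4 x RTTVAR) rather than any particular TCP implementation, and it omits the RFC's clock-granularity term and minimum timeout. The 20 ms RTT on the active network, the extra 300 ms for every other (buffered) packet, and the function name are assumptions chosen purely for illustration; they are not MultiNet measurements.

```python
def rto_trace(samples, alpha=1/8, beta=1/4, k=4):
    """Return the retransmission timeout after each RTT sample (RFC 6298 style)."""
    srtt = rttvar = None
    trace = []
    for r in samples:
        if srtt is None:
            # The first measurement initialises both estimators.
            srtt, rttvar = r, r / 2
        else:
            # RTTVAR is updated using the old SRTT, then SRTT itself is updated.
            rttvar = (1 - beta) * rttvar + beta * abs(srtt - r)
            srtt = (1 - alpha) * srtt + alpha * r
        trace.append(srtt + k * rttvar)
    return trace


# Hypothetical samples: 20 ms when the network is active, +300 ms when buffered.
steady = rto_trace([0.020] * 50)
switched = rto_trace([0.020, 0.320] * 25)

print(f"timeout without switching: {steady[-1] * 1000:.0f} ms")
print(f"timeout with switching:    {switched[-1] * 1000:.0f} ms")
```

With alternating samples the variance term never decays, so under these assumptions the computed timeout settles well above even the delayed RTT, which is the qualitative concern raised above.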

Another open problem, as discussed in Section 2.5.6, is the synchronization of more than one ad hoc network that has multiple nodes using MultiNet. Solving this problem requires MultiNet to synchronize its slots to initiators of multiple ad hoc networks, and those initiators' slots might not be synchronized. We are looking at ways to handle this scenario by allowing all nodes, including the initiator, to resynchronize their slots.

As stated previously, we do not consider scenarios when a MultiNet node is participating in a multi-hop ad hoc network. The synchronization problem is complicated for such scenarios. A scheme that supports multi-hop networks has to handle partitioning issues of the ad hoc network, and ways to resynchronize it. SSCH, described in Chapter 3, is a step towards making MultiNet work in multi-hop networks. We hope to build on SSCH, and implement a protocol that works in such scenarios.

Summary

The main contributions of this chapter can be summarized as follows:

It describes a new virtualization architecture for wireless network cards, called MultiNet. Several compelling real-life scenarios are described that motivate the need for such an architecture. To the best of our knowledge, MultiNet is the first to articulate this problem and propose a solution for IEEE 802.11 hardware.

It proposes a deployable architecture for MultiNet. As part of the architecture, it presents Spoofed Buffering, which leverages IEEE 802.11 PSM to buffer packets at the APs without modifying them. Three switching algorithms are presented that are useful in different applications of MultiNet. It also presents Slotted Synchronization, which is a simple synchronization protocol that works in ad hoc networks with multiple MultiNet nodes.

It describes the implementation of MultiNet on Windows XP and over commercial wireless network cards. MultiNet requires no modifications to the wireless card drivers. This chapter also analyzes the performance of MultiNet in a number of scenarios, such as mobility and without Spoofed Buffering. Further, MultiNet is more power efficient than an alternative of using multiple wireless cards in the device.

MultiNet achieves the design goals of transparency, deployability and performance. Transparency is achieved by making virtual wireless cards appear as physical wireless cards to the user. Its deployability has been demonstrated by an implementation on Windows XP over commercial wireless cards. Finally, the performance of MultiNet has been studied in detail, and is shown to give good performance in most scenarios.

The MultiNet software is available for free download, and more information can be found at:

The contents of this chapter have benefitted from several helpful suggestions and comments. In particular, Victor Bahl and Pradeep Bahl were involved in discussions that helped develop the MultiNet architecture. Slotted Synchronization and the performance results were revised after inputs from Ken Birman. Further, some of MultiNet's design goals were motivated by the requirements of MultiNet users.

CHAPTER 3
SSCH: CAPACITY IMPROVEMENT USING MULTINET

3.1 Introduction

The problem of supporting multiple senders and receivers in wireless networks has received significant attention in the past decade. One domain where this communication pattern naturally arises is fixed wireless multi-hop networks, such as community networks [21, 70, 107, 109]. Increasing the capacity of such wireless networks has been the focus of much recent research (e.g., [40, 65, 95]).

An obvious way to increase the network capacity is to use frequency diversity [4, 114]. Commodity wireless networking hardware commonly supports a number of orthogonal channels, and distributing communication across channels permits multiple simultaneous communication flows. Channelization was added to the IEEE 802.11 standard to increase the capacity of infrastructure networks: neighboring access points are tuned to different channels so that traffic to and from these access points does not interfere [4]. Non-infrastructure (i.e., ad-hoc) networks have thus far been unable to exploit the benefits of channelization. The current practice in ad-hoc networks is for all nodes to use the same channel, irrespective of whether the nodes are within communication range of each other [107, 109].

Among its contributions, this thesis proposes a new protocol, Slotted Seeded Channel Hopping (SSCH), which extends the benefits of channelization to ad-hoc networks. Logically, SSCH operates at the link layer, but it can be implemented in software over an IEEE 802.11 compliant wireless Network Interface Card (NIC). The SSCH layer in a node handles three aspects of channel hopping: (i) implementing the node's channel hopping schedule and scheduling packets within each channel, (ii) transmitting the channel hopping schedule to neighboring nodes, and (iii) updating the node's channel

hopping schedule to adapt to changing traffic patterns. SSCH is a distributed protocol for coordinating channel switching decisions, but one that only sends a single type of message, a broadcast packet containing that node's current channel hopping schedule. The simulation results show that SSCH yields a significant capacity improvement in ad-hoc wireless networks, including both single-hop and multi-hop scenarios.

The primary research contributions of SSCH can be summarized as follows:

It is a new protocol that increases the capacity of IEEE 802.11 ad-hoc networks by exploiting frequency diversity. This extends the benefits of channelization to ad-hoc networks. The protocol is suitable for a multi-hop environment, does not require changes to the IEEE 802.11 standard, and does not require multiple radios.

SSCH introduces a novel technique, optimistic synchronization, for distributed rendezvous and synchronization. This technique allows control traffic to be distributed across all channels, and thus avoids control channel saturation, a bottleneck identified in prior work on exploiting frequency diversity [114].

SSCH introduces a second novel technique to achieve good performance for multi-hop communication flows. The partial synchronization technique allows a forwarding node to partially synchronize with a source node and partially synchronize with a destination node. This synchronization pattern allows the load for a single multi-hop flow to be distributed across multiple channels.

3.2 Background and Motivation

In this section, the discussion will be limited to the widely-deployed IEEE 802.11 Distributed Coordination Function (DCF) protocol [60]. We begin by reviewing some relevant details of this protocol. IEEE 802.11 recommends the use of a Request To Send

(RTS) and Clear To Send (CTS) mechanism to control access to the medium. A sender desiring to transmit a packet must first sense the medium free for a DCF interframe space (DIFS). The sender then broadcasts an RTS packet seeking to reserve the medium. If the intended receiver hears the RTS packet, the receiver sends a CTS packet. The CTS reserves the medium in the neighborhood of the receiver, and neighbors do not attempt to send a packet for the duration of the reservation. In the event of a collision or failed RTS, the node performs an exponential backoff. For additional details, the reader is referred to [60].

The IEEE 802.11 standard divides the available frequency into orthogonal (non-overlapping) channels. IEEE 802.11b specifies 11 channels in the 2.4 GHz spectrum, 3 of which are orthogonal, and IEEE 802.11a specifies 13 orthogonal channels in the 5 GHz spectrum. Packet transmissions on these orthogonal channels do not interfere if the communicating nodes on them are reasonably separated (at least 12 inches apart for common hardware [4]).

Using only a single channel limits the capacity of a wireless network. For example, consider the scenario in Figure 3.1 where 6 nodes are within communication range of each other, all nodes are on the same channel, and 3 of them have packets to send to distinct receivers. Due to interference on the single channel, only one of them, in this case node 3, can be active. In contrast, if all 3 orthogonal channels are used, all transmissions can take place simultaneously on distinct channels. SSCH captures the additional capacity provided by these orthogonal channels.

Figure 3.1: Only one of the three packets can be transmitted when all the nodes are on the same channel.

There were three important constraints in the design of SSCH:

SSCH should require only a single radio per node. Some of the previous work on exploiting frequency diversity has proposed that each node be equipped with multiple radios [4, 135]. Multiple radios draw more power, and energy consumption continues to be a significant constraint in mobile networking scenarios. By requiring only a single standards-compliant NIC per node, SSCH faces fewer deployability hurdles than schemes with additional hardware requirements.

SSCH should use an unmodified IEEE 802.11 protocol (including RTS/CTS) when not switching channels. Requiring standards-compliant hardware allows for easier deployment of this technology.

SSCH should not cause logical partitions; any two nodes in communication range should be able to communicate with each other despite channel hopping.

Because SSCH switches each NIC across frequency channels, different NICs may be on different channels most of the time. Despite this, any two nodes in communication range will overlap on a channel with moderate frequency (e.g., at least 10 ms out of every half second) and discovery is accomplished during this time. As we

will show in Section 3.5.3, the mathematical properties of the SSCH protocol guarantee that this overlap always occurs, even in the absence of synchronization.

SSCH exploits frequency diversity using an approach that we term optimistic synchronization. This design makes the common case be that nodes are aware of each other's channel hopping schedules. However, SSCH also allows any node to change its channel hopping schedule at any time. If node A has traffic to send to another node B, and A knows B's hopping schedule, A will probably be able to quickly send to B by changing its own schedule. In the uncommon case that A does not know B's schedule, or A has out-of-date information about B, the traffic incurs a latency penalty while A discovers B's new schedule. The SSCH design achieves this good common case behavior when SSCH is used with a workload where traffic patterns change (i.e., new flows are started) with lower frequency than hopping schedule updates are propagated. Because hopping schedule update propagation requires only tens of milliseconds, this is a good workload assumption for many wireless networking scenarios. Section 3.6 gives absolute numbers for these qualitative claims.

SSCH is designed to work in a single-hop or multi-hop environment, and therefore must support multi-hop flows. We introduce a partial synchronization technique to allow one node, say B, to follow a channel hopping schedule that overlaps half the time with another node A, and half the time with a third node C; this is necessary for node B to efficiently forward traffic from node A to node C. Although it is trivially possible for node B to have a channel hopping schedule that is an interleaving of A's and C's schedules, this leaves open how B will schedule itself when a fourth node desires to synchronize with B. The channel hopping design described later in this chapter resolves this issue.

3.3 Hardware and MAC Assumptions

We assume that all nodes are using IEEE 802.11a. SSCH could also be used with other MACs in the IEEE 802.11 family, but evaluation of such options is beyond the scope of this dissertation. IEEE 802.11a supports 13 orthogonal channels, and we assume no co-channel interference, a reasonable assumption for physically separated nodes [4]. We expect wireless cards to be capable of switching across channels. The clocks at all nodes are assumed to be synchronized to within 1 ms of each other using the Timing Synchronization Function of IEEE 802.11 [58] or its modifications proposed in the literature, such as ATSP [54, 74] or ASCP [110]. We justified this assumption earlier in this dissertation. As we discuss in more detail at the beginning of Section 3.6, recent work has reduced the channel switching delay to approximately 80 µs [51, 84]. We assume that each wireless card contains only a single half-duplex single-channel transceiver.

We require that NICs with a buffered packet wait after switching for the maximum length of a packet transmission before attempting to reserve the medium. This prevents hidden terminal problems from occurring just after switching. This hardware requirement is not necessary if the NIC packet buffer can be cleared whenever the channel is switched.

3.4 Prior Work

We divide prior work relevant to SSCH into two categories: prior uses of pseudo-random number generators in wireless networking, and alternative approaches to exploiting frequency diversity.

In the first category, we find that pseudo-random number generators have been used for a variety of tasks in wireless networking. For example, the SEEDEX protocol [108] uses pseudo-random generators to avoid RTS/CTS exchanges in a wireless network. Nodes build a schedule for sending and listening on a network, and publish their seeds to all the neighbors. A node attempts a transmission only when all its neighbors (including the receiver) are in a listening state. Assuming relatively constant wireless transmission ranges, this protocol also helps in overcoming the hidden and exposed terminal problem caused by the RTS/CTS approach.

The TSMA protocol [31, 32] is a channel access scheme proposed as an alternative to ALOHA and TDMA, for time-slotted multihop wireless networks. TSMA aims to achieve the guarantees of TDMA without incurring the overhead of transmitting large schedules in a mobile environment. Each node is bootstrapped with a fixed seed that determines its transmission schedule. The schedules are constructed using polynomials over Galois fields (which have pseudo-random properties), and the construction guarantees that each node will overlap with only a single other node within a certain time frame. The length of the schedule depends on the number of nodes and the degree of the network. Porting these schedules to a multichannel scenario, where the number of channels is fixed, remains an open problem, and even such a porting would not meet the SSCH goal of supporting traffic-driven overlap.

Redi et al. [33] use a pseudo-random generator to derive listening schedules for battery-constrained devices. Each device's seed is known to a base station, which can then schedule transmissions for the infrequent moments when the battery-constrained device is awake.

Although pseudo-random generators have been used for a number of tasks (as this survey of the literature makes clear), to the best of our knowledge, SSCH is the first protocol to use a pseudo-random generator to construct a channel hopping schedule.

A second category of prior work focuses on increasing network capacity by exploiting frequency diversity. This is a significant body of research.
The first division we make in this body of work is between approaches that assume a single NIC capable of

communicating on a single channel at any given instant in time, and those that assume more powerful radio technology, such as multiple NICs [4, 112] or NICs capable of listening on many channels simultaneously [66, 89], even if they can only communicate on one. Our work falls into the former category; the SSCH architecture can be deployed over a single standards-compliant NIC supporting fast channel switching.

Dynamic Channel Assignment (DCA) [135] and Multi-radio Unification Protocol (MUP) [4] are both technologies that use multiple radios (in both cases, two radios) to take advantage of multiple orthogonal channels. DCA uses one radio on a control channel, and the other radio switches across all the other channels sending data. Arbitration for channels is embedded in the RTS and CTS messages, and is executed on the control channel. Although this scheme may fully utilize the data channel, it does so at the cost of using an entire radio just for control. MUP uses both radios for data and control transmissions. Radios are assigned to orthogonal channels, and a packet is sent on the radio with better channel characteristics. This scheme gives good performance in many scenarios. However, it still only allows the use of as many channels as there are radios on each physical node.

From our perspective, the key drawback to both DCA and MUP is simply that they require the use of multiple radios. Recently, commercial products have appeared that support multiple radios on a single NIC [44]. It is not known whether these products will achieve as many radios on a NIC as there are available channels, nor what their power consumption will be.

A straightforward way to view the different potential gains of SSCH compared to a true multiple radio design is to consider two distinct sources of bottleneck in a single-radio, single-channel system: the saturation of the channel, and the saturation of any particular radio.
Conceptually, SSCH significantly increases the channel bandwidth, without increasing the bandwidth of any individual radio. In contrast, a true multiple

radio design increases both. A specific example of this difference is that a node using MUP (a true multiple radio design) can simultaneously send and receive packets on separate channels, while a node using SSCH can only perform one of these operations at a time.

We next turn our attention to work assuming more powerful radio technology than is currently technologically feasible. HRMA [137] is designed for frequency hopping spread spectrum (FHSS) wireless cards. Time is divided into slots, each corresponding to a small fraction of the time required to send a packet, and the wireless NIC is on a different frequency during each slot. All nodes are required to maintain synchronized clocks, where the synchronization is at the granularity of slot times that are much shorter than the duration of a packet. Each slot is subdivided into four segments of time for four different possible communications: HOP-RESERVED/RTS/CTS/DATA. The first three segments of time are assumed to be small in comparison with the amount of time spent sending a segment of the packet during the DATA time interval. To the best of our knowledge, a FHSS wireless card that supports this type of MAC protocol at high data rates is not commercially available.

Another line of related work assumes technology by which nodes can concurrently listen on all channels. For example, Nasipuri et al. [89] and Jain et al. [66] assume wireless NICs that can receive packets on all channels simultaneously, and where the channel for transmission can be chosen arbitrarily. In these schemes, nodes maintain a list of free channels, and either the sending or receiving node chooses a channel with the least interference for its data transfer. Wireless NICs do not currently support listening on arbitrarily many channels, and we do not assume the availability of such technology in the design of SSCH.

We finally consider prior work that only assumes the presence of a single NIC with a

single half-duplex transceiver. The only other approach that we are aware of to exploiting frequency diversity under this assumption is Multichannel MAC (MMAC) [114]. Like SSCH, MMAC attempts to improve capacity by arranging for nodes to simultaneously communicate on orthogonal channels. Briefly, MMAC operates as follows: nodes using MMAC periodically switch to a common control channel, negotiate their channel selections, and then switch to the negotiated channel, where they contend for the channel as in IEEE 802.11.

This scheme raises several concerns that SSCH attempts to overcome. First, MMAC has stringent clock synchronization requirements, and to the extent that these are relaxed, MMAC must spend more time on the common control channel doing discovery. Tight clock synchronization is particularly hard to provide in multi-hop wireless networks [54]. In contrast, SSCH does not require tight clock synchronization because SSCH does not have a common control channel or a dedicated neighbor discovery interval. Secondly, synchronization traffic in MMAC can be a significant fraction of the system traffic, and the common synchronization channel can become a bottleneck on system throughput. SSCH addresses this concern by distributing synchronization and control traffic across all the available channels. A third concern with MMAC is that it assumes wireless NICs are capable of switching across channels in less than a microsecond. As we will see in the beginning of Section 3.6, an 80 µs switching time better reflects the current state of the art in wireless NIC design, and SSCH performs well with this assumption. A fourth concern with MMAC is that it may not efficiently support multi-hop flows because forwarding nodes may not predictably split their time between their sending and receiving neighbors. SSCH addresses this by allowing nodes to achieve predictable partial synchronization with multiple neighbors.
Although this survey does not cover all related work, it does characterize the current state of the field. At the level of detail in this section, prior work such as CHMA [124]

is similar to HRMA [137], and MAC-SCC [80] and the MAC protocols implicit in the work of Li et al. [79] and Fitzek et al. [46] are similar to DCA [135].

However, a final related channel hopping technology that is worth mentioning is the definition of FHSS channels in the IEEE 802.11 [60] specification. At first glance, it may seem redundant that SSCH does channel hopping across logical channels, each one of which (per the IEEE 802.11 specification) may be employing frequency hopping across distinct frequencies at the physical layer. The IEEE 802.11 specification justifies this physical layer frequency hopping with the scenario of providing support for multiple Basic Service Sets (BSSs) that can coincide geographically without coinciding on the same logical channel. In contrast, SSCH does channel hopping so that any two nodes can coincide as much or as little of the time as they desire. This is also at the heart of the difference between SSCH and past work on channel-hopping protocols where nodes overlap a fixed fraction of the time [32]: the degree of overlap between any two nodes using SSCH is traffic-dependent.

3.5 SSCH

SSCH switches each radio across multiple channels and distributes flows within interfering range of each other on orthogonal channels. This results in significantly increased network capacity when the network traffic pattern consists of such flows. SSCH is a distributed protocol, suitable for deployment in a multi-hop wireless network. It does not require synchronization or leader election. Nodes do attempt to synchronize, but lack of synchronization results in at most a mild reduction in throughput.

SSCH is designed to work with MultiNet, where a slot is defined to be the time spent on a single channel. We choose a slot duration of 10 ms to amortize the overhead of channel switching (an 80 µs switch occupies 0.8% of a 10 ms slot). At 54 Mbps (the maximum data rate in IEEE 802.11a), 10 ms is

equivalent to 35 maximum-length packet transmissions. A longer slot duration would have further decreased the overhead of channel switching, but would have increased the delay that packets encounter during some forwarding operations.

The channel schedule is the list of channels that the node plans to switch to in subsequent slots and the time at which it plans to make each switch. Each node maintains a list of the channel schedules for all other nodes it is aware of; this information is allowed to be out-of-date, but the common case will be that it is accurate. The good performance exhibited by SSCH (Section 3.6) validates this claim.

We develop the SSCH protocol by first describing packet transmission attempts that are made by each node within a slot, and we refer to this as the packet schedule (Section 3.5.1). Next, we define the policy for updating the channel schedule and for propagating the channel schedule to other nodes (Section 3.5.2). We then describe the mathematical properties that guided SSCH's design (Section 3.5.3). Finally, we discuss implementation considerations for SSCH (Section 3.6.4).

3.5.1 Packet Scheduling

SSCH maintains packets in per-neighbor FIFO queues. These queues maintain standard higher-layer assumptions about in-order delivery. The per-neighbor FIFO queues are, in turn, maintained in a priority queue ordered by perceived neighbor reachability.

The SSCH scheduling strategy aims to maximize bandwidth utilization by minimizing the number of packets sent to nodes that are unreachable. It works as follows. At the beginning of a slot, packet transmissions are attempted in a round-robin manner among all flows. If a packet transmission to a particular neighbor fails, the corresponding flow is reduced in priority until a period of time equal to one half of a slot duration has elapsed; this limits the bandwidth wasted on flows targeted at nodes that are currently

on a different channel to at most two packets per slot whenever a flow to a reachable node also exists. Packets are only drawn from the flows that have not been reduced in priority unless only reduced priority flows are available.

Because nodes using SSCH will often be on different channels, broadcast packets transmitted in any one slot are likely to reach only some of the nodes within physical communication range. The SSCH layer handles this issue through repeated link-layer retransmission of broadcast packets enqueued by higher layers. Although broadcast packets sent this way may reach a different set of nodes than if all nodes had been on the same channel, we have not found this to present a difficulty to protocols employing broadcast packets; in Section 3.6 we show that as few as 6 transmissions allow DSR (a protocol that relies heavily on broadcasts) to function well. This behavior is not surprising because broadcast packets are known to be less reliable than unicast packets, and so protocols employing them are already robust to their occasional loss. However, the SSCH retransmission strategy may not be compatible with all uses of broadcast, such as its use for synchronization [43]. Also, deploying SSCH in an environment with a different number of channels might require the choice of 6 transmissions to be revisited. Finally, although retransmission increases the bandwidth consumed by broadcast packets, SSCH still delivers significant capacity improvement in the traffic scenarios we studied (Section 3.6).

An SSCH node with a packet to send may discover that a neighbor is not present on a given channel when no CTS is received in response to a transmitted RTS. However, the node may very well be present on another channel, in which case SSCH should still deliver the packet. To handle this, we initially retain the packet in the packet queue.
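
The scheduling policy described so far (per-neighbor FIFO queues, round-robin service, demotion of a flow for half a slot after a failed transmission, and retention of the failed packet) can be sketched as follows. This is a simplified illustration rather than the SSCH implementation: the class and method names are hypothetical, and the priority queue ordered by perceived reachability is reduced to two levels (normal and demoted).

```python
from collections import deque

SLOT_MS = 10.0               # slot duration used in the text
PENALTY_MS = SLOT_MS / 2     # a failed flow is demoted for half a slot


class PacketScheduler:
    """Per-neighbor FIFO queues served round-robin, with temporary demotion."""

    def __init__(self):
        self.queues = {}          # neighbor -> FIFO of pending packets
        self.order = deque()      # round-robin order of neighbors
        self.demoted_until = {}   # neighbor -> time (ms) at which demotion ends

    def enqueue(self, neighbor, packet):
        if neighbor not in self.queues:
            self.queues[neighbor] = deque()
            self.order.append(neighbor)
        self.queues[neighbor].append(packet)

    def next_neighbor(self, now_ms):
        """Pick the flow to try next, or None if nothing is queued."""
        backlog = [n for n in self.order if self.queues[n]]
        if not backlog:
            return None
        normal = [n for n in backlog if self.demoted_until.get(n, 0.0) <= now_ms]
        # Demoted flows are used only when no normal-priority flow has packets.
        chosen = (normal or backlog)[0]
        self.order.remove(chosen)
        self.order.append(chosen)      # move to the back for round-robin
        return chosen

    def report(self, neighbor, now_ms, delivered):
        if delivered:
            self.queues[neighbor].popleft()
        else:
            # Keep the packet at the head of its queue; only demote the flow.
            self.demoted_until[neighbor] = now_ms + PENALTY_MS


sched = PacketScheduler()
sched.enqueue("B", "b1")
sched.enqueue("C", "c1")
first = sched.next_neighbor(now_ms=0.0)            # B is tried first
sched.report(first, now_ms=0.0, delivered=False)   # no CTS: B is demoted
second = sched.next_neighbor(now_ms=1.0)           # C is preferred while B is demoted
```

In this sketch a failed packet is never removed by `report`, which mirrors the statement that the packet is initially retained in the queue.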
Packets are dropped only when SSCH gives up on all packets to a given destination, and this dropping of an entire flow occurs only when we have failed to transmit a packet to

the destination node for an entire cycle through the channel schedule. We will explain the meaning of a cycle through the channel schedule in Section 3.5.2, but with our chosen parameter settings the timeout is 530 ms. After a flow has been garbage collected, new packets with the same destination inserted in the queue are assigned to a new flow, and attempted in the normal manner.

This packet scheduling policy is simple to implement, and yields good performance in the common case where node schedules are known, and information about node availability is accurate. A potential drawback is that a node crash (or other failure events) can lead to a number of wasted RTSs to the failed node. When summed across channels, the number may exceed the IEEE 802.11 suggested value of 7 retransmission attempts for RTS packets. In Section 3.6, we quantify the cost of such failures and show that it is small.

3.5.2 Channel Scheduling

We begin our description of channel scheduling by describing the data structure used to represent the channel schedule. We then describe the policy nodes use to act on their own channel schedule, the mechanism to communicate channel schedules to other nodes, and finally the policy nodes implement for updating or changing their own channel schedule.

The channel schedule must capture a given node's plans for channel hopping in the future, and there is obvious overhead to representing this as a very long list. Instead, we compactly represent the channel schedule as a current channel and a rule for updating the channel; in particular, as a set of 4 (channel, seed) pairs. Our experimental results show that 4 pairs suffice to give good performance (Section 3.6). We represent the (channel, seed) pair as $(x_i, a_i)$. The channel $x_i$ is represented as an integer in the range [0, 12] (13 possibilities), and the seed $a_i$ is represented as an integer in the range [1, 12].
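
As a rough illustration of this representation, the sketch below stores a schedule as a list of (channel, seed) pairs and expands it into the channel visited in each slot of one cycle. It assumes the update rule and parity slot that the remainder of this section introduces: each channel advances by its seed modulo the number of channels after every pass over the slots, and the cycle ends with a parity slot on channel $a_1$. The function name is hypothetical, and the small 3-channel, 2-slot inputs mirror the example of Figure 3.2 rather than any real deployment.

```python
def cycle_channels(pairs, num_channels=13):
    """Expand (channel, seed) pairs into the channel used in each slot of a cycle."""
    pairs = list(pairs)
    visited = []
    # One pass per channel value: with a prime channel count and a nonzero
    # seed, each slot visits every channel exactly once over these passes.
    for _ in range(num_channels):
        visited.extend(x for x, _seed in pairs)
        pairs = [((x + a) % num_channels, a) for x, a in pairs]
    visited.append(pairs[0][1])   # parity slot: channel equals the first seed
    return visited


# Two nodes with 2 slots and 3 channels, as in the Figure 3.2 example.
node_a = cycle_channels([(1, 2), (2, 1)], num_channels=3)
node_b = cycle_channels([(1, 2), (0, 1)], num_channels=3)
overlap = [slot for slot, (ca, cb) in enumerate(zip(node_a, node_b)) if ca == cb]

print(node_a, node_b, overlap)
```

In this example the two nodes share a channel in every occurrence of slot 1 and in the parity slot. With the parameters in the text (4 pairs, 13 channels) the expanded cycle has 13 x 4 + 1 = 53 slots, which at 10 ms per slot is the 530 ms cycle mentioned above.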

84 70 Each node iterates through all of the channels in the current schedule, switching to the channel designated in the schedule in each new slot. The node then increments each of the channels in its schedule using the seed, x i (x i + a i )mod13 and repeats the process. We introduce one additional slot to prevent logical partitions. After the node has iterated through every channel on each of its 4 slots, it switches to a parity slot whose channel assignment is given by x parity = a 1. The term parity slot is derived from the analogy to the parity bits appended at the end of a string in some error correcting codes. The mathematical justification for this design is given in Section We use the term cycle to refer to the 530 ms iteration through all the slots, including the parity slot. In Figure 3.2, we illustrate possible channel schedules for two nodes in the case of 2 slots and 3 channels. In the Figure, node A and node B are synchronized in one of their two slots (they have identical (channel, seed) pairs), and they also overlap during the parity slot. The field of the channel schedule that determines the channel during each slot is shown in bold. Each time a slot reappears, the channel is updated using the seed. For example, node A s slot 1 initially has (channel, seed) = (1,2). The next time slot 1 is entered, the channel is updated by adding the seed to it mod 3 (mod 3 because in this example, there are 3 channels). The resulting channel is given by (1 + 2) mod 3 = 0. Nodes switch from one slot to the next according to a fixed schedule (every 10 ms in our current parameter settings). However, the decision to switch channels may occur while a node is transmitting or receiving a packet. In this case we delay the switch until after the transmission and ACK (or lack thereof) have occurred. Nodes learn each other s schedules by periodically broadcasting their seeds and the offset within this cycle through the channel schedule. We use the IEEE Long

Control Frame Header format to embed both the schedule and the node's current offset; this is discussed in more detail elsewhere in this dissertation. The SSCH layer at each node schedules one of these packets for broadcast once per slot.

Figure 3.2: Channel hopping schedules for two nodes with 3 channels and 2 slots. Node A always overlaps with Node B in slot 1 and the parity slot. The field of the channel schedule that determines the channel during each slot is shown in bold. (Legend: in a regular slot the node goes to the channel given by the bold field; in the parity slot the node goes to the parity channel and then repeats the cycle.)

Nodes also update their knowledge of other nodes' schedules by trying to communicate and failing. Whenever a node sends an RTS to another node, and that node fails to respond even though it was believed to be in this slot, the node sending the RTS updates the channel schedule for the other node to reflect that it does not currently know the node's schedule in this slot.

We now turn to the question of how a given node changes its own schedule. Schedules are updated in two ways: each node attempts to ensure that its slots start and stop at roughly the same time as those of other nodes, and that its channel schedule overlaps with

nodes for which it has packets to send. We embed the information needed for this synchronization within the Long Control Frame Header as well. Using this information, a simple averaging scheme such as described by Elson et al. [43] can be applied to achieve the loose synchronization required for good performance (Section 3.6 shows that a 100 µs skew in clock times leads to less than a 2% decrease in capacity).

At a high level, each node achieves overlap with nodes for which it has traffic straightforwardly, by changing part of its own schedule to match that of the other nodes. However, a number of minor decisions must be made correctly in order to achieve this high level goal.

Figure 3.3: The problem with a naive synchronization scheme. Node A has two slots, with (channel, seed) pairs represented by $A_1$ and $A_2$; nodes B and C are similarly depicted. At time $t_1$, node A synchronizes with node B. Node B synchronizes with node C at time $t_2$, after which A and B are no longer synchronized.

Nodes recompute their channel schedule right before they enqueue the packet announcing this schedule in the NIC (and so at least once per slot). In a naive approach, this node could examine its packet queue, and select the (channel, seed) pairs which lead to the best opportunity to send the largest number of packets. However, this ignores the interest this node has in receiving packets, and in avoiding congested channels. An

example of the kind of problem that might arise if one ignores the interest in receiving packets is given in Figure 3.3. Here, A synchronized with B, and then B synchronized with C in such a way that A was no longer synchronized with B. This could have been avoided if B had used its other slot to synchronize with C, as it would have if it considered its interest in receiving packets.

To account for this node's interest in receiving packets, we maintain per-slot counters for the number of packets received during the previous time the slot was active (ignoring broadcast packets). Any slot that received more than 10 packets during the previous iteration through that slot is labeled a receiving slot; if all slots are receiving slots, any one is allowed to be changed. If some slots are receiving slots and some are not, only the (channel, seed) pair on a non-receiving slot is allowed to be changed for the purpose of synchronizing with nodes it wants to send to.

SSCH has to avoid the scenario where all nodes in a network converge on the same (channel, seed) pair value. This situation could arise in a number of scenarios. For example, if a node, say A, initiates a flow to another node, say B, and then node C initiates a flow to node A, then A, B and C will synchronize to the same (channel, seed) value. Moreover, if these were the only nodes in the network, they would never change their (channel, seed) value. This situation is a problem for SSCH since all nodes will hop to the same channel in every slot, and therefore all flows will be on the same channel. Hence, the benefits of channelization are lost, and SSCH becomes equivalent to a single-channel MAC.

To account for this channel congestion, we propose a new de-synchronization scheme. A node compares the (channel, seed) pairs of all nodes from which it received packets in a given slot, with the list of (channel, seed) pairs of all the other nodes in its list of channel schedules.
If the number of nodes synchronized to the same (channel, seed)

pair is more than twice the number that this node communicated with in the previous occurrence of the slot, we attempt to de-synchronize it from these other nodes. De-synchronization just involves choosing a new (channel, seed) pair for this slot.

Figure 3.4: Need for De-synchronization: All nodes converge to the same channel without de-synchronization. (Axes: number of synchronized nodes versus time in slots; curves: without and with de-synchronization.)

The need for de-synchronization is illustrated in Figure 3.4. Our protocol is simulated for 10 stationary nodes, and one of them is randomly picked as a test node. All nodes are within communication range of each other, the slot duration is 10 ms, and each node has 4 (channel, seed) pairs. We consider IEEE 802.11a [59], which has 13 orthogonal channels. Initially, every node starts a flow to a randomly chosen destination node for a random duration between 1 and 500 ms. At the end of a flow, a node starts a different flow with a randomly picked destination and duration. Figure 3.4 plots the number of neighbors of the test node that have the same (channel, seed) pair in a slot as the test node. Without de-synchronization, the number of nodes with the same (channel,

seed) pair increases monotonically over time for each of the 4 (channel, seed) values. After around 370 slots, which is 370 × 10 ms = 3.7 seconds, all 9 neighbors of the test node converge to the same (channel, seed) pair on all slots. Consequently, all nodes switch to the same channel all the time, and SSCH becomes equivalent to single-channel IEEE 802.11. This scenario is avoided by our de-synchronization mechanism. In our experimental scenario, de-synchronization never allows more than 4 neighbors to have the same (channel, seed) pair as the test node.

The final constraints we add moderate the pace of change in schedule information. Each node only considers updating the (channel, seed) pair for the next slot, never for slots further in the future. If the previous set of criteria suggests updating some slot other than the next slot, we delay that decision. Given these constraints, picking the best possible (channel, seed) pair simply requires considering the choice that synchronizes with the set of nodes for which we have the largest number of queued packets. Additionally, the (channel, seed) pair for the first slot is only allowed to be updated during the parity slot; this helps to prevent logical partition, as will be explained in more detail in the discussion of the mathematical properties of SSCH.

This strategy naturally supports nodes acting as sources, sinks, or forwarders. A source node will find that it can assign all of its slots to support sends. A sink node will find that it rarely changes its slot assignment, and hence nodes sending to it can easily stay synchronized. A forwarding node will find that some of its slots are used primarily for receiving; after re-assigning the channel and seed in a slot to support sending, the slots that did not change are more likely to receive packets, and hence to stabilize on their current channel and seed as receiving slots for the duration of the current traffic patterns. Our simulation results (Section 3.6) support this conclusion.
We refer to the technique of enabling this synchronization pattern as partial synchronization.
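The schedule-update rules described in this section (receiving slots, the de-synchronization trigger, and the choice of the pair with the most queued packets) can be sketched as a few small functions. This is a minimal illustration under stated assumptions, not the SSCH implementation: every name below is hypothetical, node identifiers are plain strings, ties between equally good pairs are broken arbitrarily, and the new pair chosen on de-synchronization is assumed to be drawn uniformly at random, which the text does not specify.

```python
import random
from typing import Dict, List, Optional, Tuple

Pair = Tuple[int, int]  # (channel, seed)

RECEIVING_THRESHOLD = 10  # packets received in the slot's previous occurrence


def changeable_slots(recv_counts: List[int]) -> List[int]:
    """Slots whose pair may be changed in order to synchronize with a send target.

    A slot that received more than RECEIVING_THRESHOLD non-broadcast packets is
    a receiving slot. Only non-receiving slots may change, unless every slot is
    a receiving slot, in which case any one may change.
    """
    non_receiving = [i for i, n in enumerate(recv_counts) if n <= RECEIVING_THRESHOLD]
    return non_receiving if non_receiving else list(range(len(recv_counts)))


def needs_desync(num_synchronized: int, num_communicated: int) -> bool:
    """True if more than twice as many nodes share this slot's pair as this
    node communicated with in the slot's previous occurrence."""
    return num_synchronized > 2 * num_communicated


def random_pair(rng: random.Random, num_channels: int = 13) -> Pair:
    """Assumed de-synchronization step: draw a fresh (channel, seed) pair."""
    return rng.randrange(num_channels), rng.randrange(1, num_channels)


def best_pair(queued: Dict[str, int], next_slot_pairs: Dict[str, Pair]) -> Optional[Pair]:
    """Pick the pair for the next slot that matches the most queued packets.

    queued maps a destination to its number of queued packets; next_slot_pairs
    maps a destination to its known (channel, seed) pair for the next slot.
    """
    totals: Dict[Pair, int] = {}
    for node, packets in queued.items():
        pair = next_slot_pairs.get(node)
        if pair is not None and packets > 0:
            totals[pair] = totals.get(pair, 0) + packets
    return max(totals, key=totals.get) if totals else None
```

For example, with per-slot receive counts `[12, 3, 0, 25]` only slots 1 and 2 are candidates for change, so slots 0 and 3 keep serving their current senders. The additional rules from the text, that only the next slot is ever updated and that the first slot changes only during the parity slot, would be applied by the caller before using these helpers.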

Mathematical Properties of SSCH

Our discussion of the mathematical properties of SSCH will initially focus on the static case. The behavior of SSCH when channel schedules are not changing assures us that in a steady-state flow setting, nodes will rendezvous appropriately, in a sense that we make precise below. We will then expand our discussion to include the dynamics of channel scheduling in an environment where flows are starting and stopping. In our discussion, we assume that all nodes use IEEE 802.11 to synchronize their clocks within 1 ms of each other, and that there are no Byzantine failures in the network: a node never sends false information about its schedule.

The channel scheduling mechanism has three simultaneous design goals: allowing nodes to be synchronized in a slot, infrequent overlap between nodes that do not have data to send to each other, and ensuring that all nodes come into contact occasionally (to avoid a logical partition). To achieve these goals, we rely on a very simple mathematical technique, addition modulo a prime number [12].

Consider two nodes that want to be synchronized in a given slot. If they have identical (channel, seed) pairs for this slot, then clearly they will remain synchronized in future iterations (using the static assumption). Now consider two nodes that are not synchronized because they have different seeds. A simple calculation shows that these two nodes will overlap in exactly one out of every 13 iterations in this slot (recall that 13 is the number of channels). This is the behavior we want from these nodes: they overlap regularly enough that they can exchange their channel schedules, but they are mostly on different channels, and so do not interfere with each other's transmissions.

Now consider the rare case that two nodes share identical seeds in every slot, but different channels accompany each seed; this has at most a 1 in , 000 chance of occurring for randomly chosen (channel, seed) pairs. In this case, the nodes will

march in lock-step through the same set of channels in each slot, never overlapping. This would be problematic, and it is this situation that the parity slot prevents. To justify this claim, we consider two distinct situations. If both nodes enter their parity slot at the same time, then they overlap there because the parity channel is equal to the seed for the first slot for both nodes. With our chosen parameter settings of 10 ms per slot, 4 slots, and 13 channels, this overlap occurs once every 530 ms and lasts for 10 ms. If their parity slots do not occur at the same time, then the first node's parity slot offers a fixed target for the slot in which the second node is changing channels, and again, the two nodes will overlap. This overlap occurs once every 7 seconds. Although both these cases will be rare, the SSCH time synchronization mechanism allows us to ignore the second case entirely: a relative clock skew of 5 ms or less is sufficient to guarantee that two parity slots overlap in time.

Now considering the dynamic case (and assuming clock synchronization to within 5 ms), we note that nodes are not permitted to change the seed for the first of their four slots except during a parity slot. Therefore they will always overlap in either the first slot or the parity slot, and hence will always be able to exchange channel schedules within a moderate time interval.

The use of addition modulo a prime to construct channel hopping schedules does not restrict SSCH to scenarios where the number of channels is a prime number. If one desired to use SSCH with a wireless technology where the number of channels is not a prime, one could straightforwardly use a larger prime as the range of $x_i$, and then map down to the actual number of channels using a modulus reduction. Though the mapping would have some bias to certain channels, the bias could be made arbitrarily small by choosing a sufficiently large prime.
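The per-slot claims above follow from the fact that, after $k$ iterations, a slot with pair $(x, a)$ is on channel $(x + k a) \bmod p$. The sketch below checks them by brute force rather than by proof; the function names are hypothetical, and the 11-channel technology used to illustrate the mapping bias is an assumed example, not one from the text.

```python
from typing import Tuple

Pair = Tuple[int, int]  # (channel, seed)


def overlaps_per_period(pair_a: Pair, pair_b: Pair, p: int = 13) -> int:
    """Count the iterations k in [0, p) in which both slots are on the same channel.

    After k iterations a slot with pair (x, a) sits on channel (x + k*a) mod p.
    """
    (x, a), (y, b) = pair_a, pair_b
    return sum((x + k * a) % p == (y + k * b) % p for k in range(p))


def channel_bias(p: int, num_channels: int) -> float:
    """Ratio between the most and least frequently hit channel when the range
    [0, p) is mapped down with x mod num_channels (requires p >= num_channels)."""
    counts = [0] * num_channels
    for x in range(p):
        counts[x % num_channels] += 1
    return max(counts) / min(counts)


if __name__ == "__main__":
    # Different seeds: the congruence k*(a - b) = (y - x) mod 13 has one solution.
    print(overlaps_per_period((3, 5), (7, 2)))
    # Same seed, different channel: the two slots never meet.
    print(overlaps_per_period((3, 5), (7, 5)))
    # Mapping to a hypothetical 11-channel technology: a larger prime reduces bias.
    print(channel_bias(13, 11), channel_bias(101, 11))
```

Because 13 is prime, every nonzero seed difference has a multiplicative inverse, which is why the first case overlaps exactly once per 13 iterations for any choice of distinct seeds. The second case is the lock-step situation that the parity slot is designed to break.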
A final point about the use of addition modulo a prime is that SSCH can be modified

to require fewer bits to represent a node's schedule by reducing the number of choices for a seed. The only penalty to this reduction is increasing the protocol's reliance on the parity slot for avoiding logical partitions.

3.6 Performance Evaluation

This section presents the simulation results of SSCH in QualNet and compares its performance with the commonly used single-channel IEEE 802.11a protocol. We first present microbenchmarks quantifying the different SSCH overheads. Subsection 3.6.2 presents macrobenchmarks on the performance of SSCH with a large number of nodes in a single hop environment. A further subsection extends the macrobenchmark evaluation to encompass mobility and multihop routing. Our results show that SSCH incurs very low overhead, and significantly outperforms IEEE 802.11a in a multiple flow environment.

The simulation environment comprises a varying number of nodes in a 200 m × 200 m area. All nodes in a single simulation run use the same MAC, either SSCH or IEEE 802.11a. All nodes are set to operate at the same raw data rate, 54 Mbps. We assume 13 usable channels in the 5 GHz band. SSCH is configured to use 4 seeds, and each slot duration is 10 ms. All seeds are randomly chosen at the beginning of each simulation run. The macrobenchmarks are averages from 5 independent simulation runs, while the microbenchmarks are drawn from a single simulation run.

We primarily measure throughput under a traffic load of maximum rate UDP flows. In particular, we use Constant Bit Rate (CBR) flows of 512 byte packets sent every 50 µs. This data rate is more than the sustainable throughput of IEEE 802.11a operating at 54 Mbps.
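The claim that this CBR load exceeds what the link can sustain is easy to check with a little arithmetic. The sketch below assumes that only the 512-byte payload is counted (headers would add to the load); the function name is hypothetical.

```python
def cbr_offered_load_mbps(packet_bytes: int, interval_us: float) -> float:
    """Offered load of a constant-bit-rate flow.

    Bits per microsecond is numerically equal to megabits per second.
    """
    return packet_bytes * 8 / interval_us


if __name__ == "__main__":
    load = cbr_offered_load_mbps(packet_bytes=512, interval_us=50)
    raw_rate_mbps = 54  # raw data rate used in the simulations
    print(f"offered load: {load:.2f} Mbps, raw rate: {raw_rate_mbps} Mbps")
```

A 512-byte packet every 50 µs is 4096 bits per 50 µs, or roughly 82 Mbps of payload alone, which is well above the 54 Mbps raw rate, so each flow keeps its sender's queue saturated.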

For all our simulations, we modified QualNet to use a channel switch delay of 80 µs. This choice was informed by recent work in solid state electronics on reducing the settling time of the Voltage Controlled Oscillator (VCO) [85]. Switching the channel of a wireless card requires changing the input voltage of the VCO, which operates in a Phase Locked Loop (PLL) to achieve the desired output frequency. The delay in channel switching is due to this settling time. The specification of Maxim IEEE 802.11b transceivers [84] shows this delay to be 150 µs. More recent work [51] shows that this delay can be reduced to µs for IEEE 802.11a cards.

Microbenchmarks

We present microbenchmarks measuring the overhead of SSCH in several different scenarios. In Section 3.6.1, we measure the overhead during the successful initiation of a CBR stream. In Section 3.6.1, we measure the overhead on an existing session of failing to initiate a parallel CBR stream. In Section 3.6.1, we measure the overhead of supporting two streams simultaneously. In Section 3.6.1, we measure the overhead of continuing to attempt transmissions to a mobile node that has moved out of range. These scenarios cover many of the different dynamic events that a MAC must appropriately handle: a flow starting while a node is present, a flow starting while a node is absent, simultaneous flows where both nodes are present, simultaneous flows where one node moves out of range, etc. Finally, the last scenario (Section 3.6.1) measures the overhead of SSCH with respect to a different kind of event, clock skew.

Overhead of Switching and Synchronizing

In this experiment, we measured the overhead of successfully initiating a CBR stream between two nodes within communication range of each other. The first node initiates

the stream just after the parity slot. This incurs a worst-case delay in synchronization, because the first of the four slots will not be synchronized until 530 ms later. In Figure 3.5, we graph the instantaneous throughput at the receiver node. The sender quickly synchronizes with the receiver on three of the four slots, as it should, and on the fourth slot after 530 ms. The figure shows the throughput while synchronizing (oscillating around 3/4 of the raw bandwidth), and the time required to synchronize. After synchronizing, the channel switching and other protocol overheads of SSCH lead to only a 400 Kbps penalty in the steady-state throughput relative to IEEE 802.11a.

This penalty conforms to our intuition about the overheads in SSCH: a node spends 80 µs every 10 ms switching channels (80 µs / 10 ms = 0.008), and then must wait for the duration of a single packet to avoid colliding with pre-existing packet transmissions in the new channel (1 packet / 35 packets = 0.028). Adding these two overheads together leads to an expected cumulative overhead of 3.6%, which is in close agreement with the measured overhead of 400 Kbps / 12 Mbps = 3.3%. Note that the throughput of the session reaches a maximum of only 13 Mbps, although the raw data rate is 54 Mbps. This low utilization can be explained by the IEEE 802.11a requirement that the RTS/CTS packets be sent at the lowest supported data rate, 6 Mbps, along with other overheads [52].

Overhead of an Absent Node

SSCH requires more retransmissions than IEEE 802.11 in order to prevent logical partitions. These retransmissions waste bandwidth that could have been dedicated to a node that was present on the channel. To quantify this overhead, we initiated a CBR stream between two nodes, allowed the system to quiesce, and then initiated a send from the first node to a non-existent node. We present a moving average of the throughput over 80

ms in Figure 3.6. It shows that the sender takes 530 ms to time out on the non-existent node. During this time the session throughput drops by 550 Kbps, which is a small fraction (4.6%) of the total throughput.

Figure 3.5: Switching and Synchronizing Overhead: Node 1 starts a maximum rate UDP flow to Node 2. We show the throughput for both SSCH and IEEE 802.11a. (Axes: throughput in Mbps versus time in seconds; the time to totally synchronize is marked.)

Overhead of a Parallel Session

Next, we quantify the ability of SSCH to fairly share bandwidth between two flows, and to quickly achieve this fair sharing. We start with Node 1 sending a maximum rate UDP stream to Node 2. At 21.5 seconds, Node 1 starts a second maximum rate UDP stream to Node 3. Figure 3.7 presents a moving average of the throughput achieved by both nodes over a period of 140 ms. It illustrates the instantaneous throughput achieved at Nodes 2 and 3

(the receivers). The bandwidth is split between the receivers nearly perfectly (and with no decrease in net throughput) within 200 ms.

Figure 3.6: Overhead of an Absent Node: Node 1 is sending a maximum rate UDP stream to Node 2. Node 1 then attempts to send a packet to a non-existent node. (Axes: throughput in Mbps versus time in seconds; the attempted send to the absent node is marked.)

Overhead of Mobility

We now analyze the effect of mobility at a micro-level on the performance of SSCH. Ideally, SSCH should be able to detect a link breakage due to movement of a node, and subsequently re-synchronize to other neighbors. We show that SSCH can indeed handle this scenario with an experiment comprising 3 nodes and 2 sessions, and in Figure 3.8 we present a moving average of each session throughput, averaged over a period of 280 ms. Node 1 is initially sending a maximum rate UDP stream to Node 2. Node 1 initiates a second UDP stream to Node 3 at around 20.5 seconds. This bandwidth is then shared between both the sessions (as in the experiment of Section 3.6.1) until 30 seconds, when

Node 3 moves out of the communication range of Node 1. Our experiment configures Node 1 to continue to attempt to send to Node 3 until 43 seconds, and during this time it continues to consume a small amount of bandwidth. In contrast, the experiment on the overhead of an absent node measured the overhead of enqueueing a single packet to an absent node. When the stream to Node 3 finally stops, Node 2's received throughput increases back to its initial rate.

Figure 3.7: Overhead of a Parallel Session: Node 1 is sending a maximum rate UDP stream to Node 2. Node 1 then starts a second stream to Node 3. (Axes: throughput in Mbps versus time in seconds; curves: Node 2 and Node 3.)

Overhead of Clock Drift

As we described in Section 3.5.2, SSCH tries to synchronize slot begin and end times, though it is also designed to be robust to clock skew. In this experiment, we quantify the robustness of SSCH to moderate clock skew. We measure the throughput between two nodes after artificially introducing a clock skew between them, and disabling the SSCH

synchronization scheme for slot begin and end times. We vary the clock skew from 1 ns ($10^{-6}$ ms) to 1 ms such that the sender is always ahead of the receiver by this value, and present the results in Figure 3.9. Note the log scale on the x-axis.

Figure 3.8: Overhead of Mobility: Node 1 is sending a maximum rate UDP stream to Node 2. Node 1 starts another maximum rate UDP session to Node 3. Node 3 moves out of range at 30 seconds, while Node 1 continues to attempt to send until 43 seconds. (Axes: throughput in Mbps versus time in seconds; curves: Node 2 and Node 3.)

The throughput achieved between the two nodes is not significantly affected by a clock skew of less than 10 µs. The drop in throughput is larger for larger clock skews, although the throughput is still acceptable at 10.5 Mbps when the skew value is an extremely high 1 ms. These results provide justification for the design choice we made not to require nodes to switch synchronously across slots, as described in the discussion of channel scheduling. For example, a node will delay switching to receive an ACK, or to send a data packet if its channel reservation

is successful. In the 100 node experiment described in Section 3.6.3, we measured the skew in channel switching times for a traffic pattern of 50 flows to be approximately 20 µs. Figure 3.9 shows that this is a negligible amount.

Figure 3.9: Overhead of Clock Skew: Throughput between two nodes using SSCH as a function of clock skew. (Axes: throughput in Mbps versus clock drift, from 1 ns to 1 ms on a log scale.)

Macrobenchmarks: Single-hop Case

We now present simulation results showing SSCH's ability to achieve and sustain a consistently high throughput for a traffic pattern consisting of multiple flows. We first evaluate this using steady state UDP flows. We then extend our evaluation to consider a dynamic traffic scenario where UDP flows both start and stop. Finally, we study the performance of TCP over SSCH.

Disjoint Flows

We first look at the number of disjoint flows that can be supported by SSCH. All nodes in this experiment are in communication range of each other, and therefore two flows

are considered disjoint if they do not share either endpoint. Ideally, SSCH should utilize the available bandwidth on all the channels as the number of disjoint flows in the system increases. We evaluate this by varying the number of nodes in the network from 2 to 30 and introducing a flow between disjoint pairs of nodes, so that the number of flows varies from 1 to 15.

Figure 3.10: Disjoint Flows: The throughput of each flow on increasing the number of flows. (Axes: per-flow throughput in Mbps versus number of flows; curves: IEEE 802.11a and SSCH.)

Figure 3.10 shows the average per-flow throughput, and Figure 3.11 shows the total utilized system throughput. IEEE 802.11a performs marginally better when there is just one flow in the network. When there is more than one flow, SSCH significantly outperforms IEEE 802.11a. An increase in the number of flows decreases the per-flow throughput for both SSCH and IEEE 802.11a. However, the drop for IEEE 802.11a is much more significant. The drop for IEEE 802.11a is easily explained by Figure 3.11, which shows that the overall system throughput for IEEE 802.11a is approximately constant.

Figure 3.11: Disjoint Flows: The system throughput on increasing the number of flows. (Axes: system throughput in Mbps versus number of flows; curves: IEEE 802.11a and SSCH.)

It may seem surprising that the SSCH system throughput has not stabilized at 13 times the throughput of a single flow by the time there are 13 flows. However, this can be attributed to SSCH's use of randomness to distribute flows across channels. These random choices do not lead to a perfectly balanced allocation, and therefore there is still unused spectrum even when there are 13 flows in the system, as shown by the continuing positive slope of the curve in Figure 3.11.

Non-disjoint Flows

We now consider the case when the flows in the network are not disjoint: nodes participate as both sources and sinks, and in multiple flows. This scenario stresses SSCH's ability to efficiently support sharing among simultaneous flows that have a common endpoint. Each node in the network starts a maximum rate UDP flow with one other randomly chosen node in the network. We vary the number of nodes (and thus flows)

from 2 to 20. As in the previous experiment, all nodes are within communication range of each other. We present the per-flow and system throughput for SSCH and IEEE 802.11a in Figures 3.12 and 3.13 respectively. The curves are not monotonic because variation in the random choices leads to some receivers being recipients in multiple flows (and hence bottlenecks). This lack of monotonicity persisted even after averaging over 5 simulation runs. As in the disjoint flow experiment, SSCH performs slightly worse in the case of a single flow, but much better in the case of a large number of flows.

Figure 3.12: Non-disjoint Flows: The average throughput of each flow on increasing the number of flows. There is a flow from every node in the network. (Axes: per-flow throughput in Mbps versus number of flows; curves: IEEE 802.11a and SSCH.)

Figure 3.13: Non-disjoint Flows: The system throughput on increasing the number of flows. There is a flow from every node in the network. (Axes: system throughput in Mbps versus number of flows; curves: IEEE 802.11a and SSCH.)

Effect of Flow Duration

SSCH introduces a delay when flows start because nodes must synchronize. This overhead is more significant for shorter flows. We evaluate this overhead for maximum rate UDP flows with different flow lengths. In the first experiment the flow duration is chosen randomly between 20 and 30 ms, while for the second experiment it is between 0.5 and 1 second. In both experiments, each node starts a flow with a randomly selected node, discards all packets at the end of the designated sending window, pauses for a second at the end of the flow, and then starts another flow with a new randomly selected node. This process continues for 30 seconds. We run these experiments for both SSCH and IEEE 802.11a, and vary the number of nodes from 2 to 16.

We present the ratio of the average throughput achieved by SSCH to that achieved by the flows when using IEEE 802.11a in Figure 3.14. For small numbers of sufficiently short-lived flows, IEEE 802.11a offers superior performance; short flows do indeed suffer from a more pronounced synchronization overhead. However, as soon as there are more than 4 simultaneous flows in the network, the ability of SSCH to spread transmissions across multiple channels leads to a higher total throughput than IEEE 802.11a in both the short and long flow scenarios.

Figure 3.14: Effect of Flow Duration: Ratio of SSCH average throughput to IEEE 802.11a average throughput for flows having different durations. (Axes: throughput ratio SSCH/802.11 versus number of nodes; one curve per flow duration range.)

TCP Performance over SSCH

We now study the behavior of TCP over SSCH. SSCH allows a node to stay synchronized to multiple nodes over different slots. However, this might cause significant jitter in packet delivery times, which could adversely affect TCP. To evaluate this concern quantitatively, we run an experiment where we vary the number of nodes in the network from 2 to 9, such that all nodes are in communication range of one another. We then start an infinite-size file transfer over FTP from each node to a randomly selected other node. This choice to use non-disjoint flows is designed to stress the SSCH implementation by requiring nodes to be synchronized as either senders or receivers with multiple other nodes. In Figure 3.15 we present the resulting cumulative steady-state TCP throughput over all the flows in the network.

Figure 3.15 shows that the TCP throughput for a small number of flows is lower

for SSCH than the throughput over IEEE 802.11a. However, as the number of flows increases, SSCH does achieve a higher system throughput. Although TCP over SSCH does provide higher aggregate throughput than over IEEE 802.11a, the performance improvement is not nearly as good as for UDP flows. This shows that jitter due to SSCH does have an impact on the performance of TCP. A more detailed analysis of the interaction between TCP and SSCH, and modifications to support better interactions between TCP and SSCH, is a subject we plan to address in our future work.

Figure 3.15: TCP over SSCH: Steady-state TCP throughput when varying the number of non-disjoint flows. (Axes: throughput in Mbps versus number of flows; curves: SSCH and IEEE 802.11a.)

Macrobenchmarks: Multihop and Mobility

We now evaluate SSCH's performance when combined with multihop flows and mobile nodes. We first analyze the behavior of SSCH in a multihop chain network. We then

consider large scale multihop networks, both with and without mobility. As part of this analysis, we study the interaction between SSCH and MANET routing protocols.

Performance in a Multihop Chain Network

IEEE 802.11 is known to encounter significant performance problems in a multihop network [136]. For example, if all nodes are on the same channel, the RTS/CTS mechanism allows at most one hop in an A-B-C-D chain to be active at any given time. SSCH reduces the throughput drop due to this behavior by allowing nodes to communicate on different channels. To examine this, we evaluate both SSCH and IEEE 802.11a in a multihop chain network.

Figure 3.16: Multihop Chain Network: Variation in throughput as chain length increases. (Axes: throughput in Mbps versus number of nodes; curves: SSCH and IEEE 802.11a.)

We vary the number of nodes, which are all in communication range, from 2 to 18. We initiate a single flow that encounters every node in the network. Although more

than 4 nodes transmitting within interference range of each other would be unlikely to arise from multihop routing of a single flow, it could easily arise in a more general distributed application. Figure 3.16 shows the maximum throughput as the number of nodes in the chain is varied. We see that there is not much difference between SSCH and IEEE 802.11a for flows with few hops. As the number of hops increases, SSCH performs much better than IEEE 802.11a since it distributes the communication on each hop across all the available channels.

Performance in a Multihop Mesh Network

We now analyze the performance of SSCH in a large scale multihop network without mobility. We place 100 nodes uniformly in a 200 m × 200 m area, and set each node to transmit with a power of 21 dBm. The Dynamic Source Routing (DSR) [68] protocol is used to discover the source route between different source-destination pairs. These source routes are then fed into a static variant of DSR that does not perform discovery or maintain routes. We vary the number of maximum rate UDP flows from 10 to 50. We generate source and destination pairs by choosing randomly, and rejecting pairs that are within a single hop of each other.

We present the average flow throughput in Figure 3.17. Increasing the number of flows leads to greater contention, and the average throughput of both SSCH and IEEE 802.11a drops. For every considered number of flows, SSCH provides significantly higher throughput than IEEE 802.11a. For 50 flows, the inefficiencies of sharing a single channel are sufficiently pronounced that SSCH yields more than a factor of 15 capacity improvement.

Figure 3.17: Multihop Mesh Network of 100 Nodes: Average flow throughput on varying the number of flows in the network. (Axes: throughput in Mbps versus number of flows; curves: SSCH and IEEE 802.11a.)

Impact of Channel Switching on MANET Routing Protocols

Previous work on multi-channel MACs has often overlooked the effect of channel switching on routing protocols. Most of the proposed protocols for MANETs, such as DSR [68] and AODV [97], rely heavily on broadcasts. However, neighbors using a multi-channel MAC could be on different channels, which could cause broadcasts to reach significantly fewer neighbors than in a single-channel MAC. SSCH addresses this concern using a broadcast retransmission strategy discussed elsewhere in this dissertation.

We study the behavior of DSR [68] over SSCH in the same experimental setup used in Section 3.6.3, with 100 nodes in a 200 m × 200 m area. However, we reduce the transmission power of each node to 16 dBm to force routes to increase in length (and hence to stress DSR over SSCH). We select 10 source-destination pairs at random, and we use DSR to discover routes between them. In Figure 3.18 we compare the performance of

DSR over SSCH, when varying the SSCH broadcast transmission count parameter (the number of consecutive slots in which each broadcast packet is sent once).

Figure 3.18: Impact of SSCH on Unmodified MANET Routing Protocols: The average time to discover a route and the average route length for 10 randomly chosen routes in a 100 node network using DSR over SSCH. (Axes: time to discover route in seconds and average route length in hops versus number of broadcasts; reference lines show the average route length and average route discovery time for IEEE 802.11a.)

Figure 3.18 shows that the performance of DSR over SSCH improves with an increase in the broadcast transmission count. The DSR Route Request packets see more neighbors when SSCH broadcasts them over a greater number of slots. This increases the likelihood of discovering shorter routes, and the speed with which routes are discovered. However, there seems to be little additional benefit to increasing the broadcast parameter to a value greater than 6. The slight bumpiness in the curves can be attributed to the stochastic nature of DSR, and its reliance on broadcasts. Comparing SSCH to IEEE 802.11a, we see that SSCH discovers routes that are

comparable in length. However, the average route discovery time for SSCH is much higher than for IEEE 802.11a. Because each slot is 10 ms in length, broadcasts are only retransmitted once every 10 ms, and this leads to a significantly longer time to discover a route to a given destination node. We believe that this latency is a fundamental difficulty in using a reactive protocol such as DSR with SSCH. We plan to explore the interaction of other proactive and hybrid routing protocols with SSCH in the future.

Performance in Multihop Mobile Networks

We now present the impact of mobility in a network using DSR over IEEE 802.11a and SSCH. In this experiment, we place 100 nodes randomly in a square and select 10 flows. Each node transmits packets at 21 dBm. Node movement is determined using the Random Waypoint model. In this model, each node has a predefined minimum and maximum speed. Nodes select a random point in the simulation area, and move towards it with a speed chosen randomly from the interval between the minimum and maximum speed. After reaching its destination, a node rests for a period chosen from a uniform distribution between 0 and 10 seconds. It then chooses a new destination and repeats the procedure. In our experiments, we fix the minimum speed at 0.01 m/s and vary the maximum speed from 0.2 to 1.0 m/s. Although we have studied SSCH at higher speeds, the results are not significantly different. We performed this experiment using two different areas for the nodes, a 200 m × 200 m area and a 300 m × 300 m area. We refer to the smaller area as the dense network, and the larger area as the sparse network; the average path is 0.5 hops longer in the sparse network. For all these experiments, we set the SSCH broadcast transmission count parameter to 6. Figure 3.19 shows that in a dense network, SSCH yields much greater throughput than IEEE 802.11a even when there is mobility. Although DSR discovers shorter

routes over IEEE 802.11a, the ability of SSCH to distribute traffic on a greater number of channels leads to much higher overall throughput.

Figure 3.19: Dense Multihop Mobile Network: The per-flow throughput and the average route length for 10 flows in a 100 node network in a 200 m × 200 m area, using DSR over both SSCH and IEEE 802.11a. (Axes: throughput in Mbps and route length in hops versus speed in m/s.)

Figure 3.20 evaluates the same benchmarks in a sparse network. The results show that the per-flow throughput decreases in a sparse network for both SSCH and IEEE 802.11a. This is because the route lengths are greater, and it takes more time to repair routes. However, the same qualitative comparison continues to hold: SSCH causes DSR to discover longer routes, but still leads to an overall capacity improvement. DSR discovers longer routes over SSCH than over IEEE 802.11a because broadcast packets sent over SSCH may not reach a node's entire neighbor set. Furthermore, some optimizations of DSR, such as promiscuous mode operation of nodes, are not as effective in a multi-channel MAC such as SSCH. Thus, although the throughput of mobile nodes

using DSR over SSCH is much better than their throughput over IEEE 802.11a, we conclude that a routing protocol that takes the channel switching behavior of SSCH into account will likely lead to even better performance.

Figure 3.20: Sparse Multihop Mobile Network: The per-flow throughput and the average route length for 10 flows in a 100 node network in a 300 m × 300 m area, using DSR over both SSCH and IEEE 802.11a. (Axes: throughput in Kbps and route length in hops versus speed in m/s.)

Implementation Considerations

When simulating SSCH in QualNet [62], we made two technical choices that seem to be relatively uncommon based on our reading of the literature. The first technical choice relates to how we added SSCH to an existing system, and the second relates to a little-utilized part of the IEEE 802.11 specification. In order to implement SSCH, we had to implement new packet queuing and retransmission strategies. To avoid requiring modifications to the hardware (in QualNet, the

hardware model) or the network stack, SSCH buffers packets below the network layer, but above the NIC device driver. To maintain control over transmission attempts, we configure the NIC to buffer at most one packet at a time, and to attempt exactly one RTS for each packet before returning to the SSCH layer. By observing NIC-level counters before and after every attempted packet transmission, we are able to determine whether a CTS was heard for the packet, and if so, whether the packet was successfully transmitted and acknowledged. All the necessary parameters to do this are exposed by the hardware model we used in QualNet.

For efficiency reasons, we choose to use the IEEE 802.11 Long Control Frame Header format to broadcast channel schedules and current offsets, rather than using a full broadcast data packet. The most common control frames in IEEE 802.11 (RTS, CTS, and ACK) use the alternative short format. The long format was included in the IEEE 802.11 standard to support inter-operability with legacy 1-Mbps and 2-Mbps DSSS systems [60]. The format contains 6 unused bytes; we use 4 to embed the 4 (channel, seed) pairs, and another 2 to embed the offset within the cycle (i.e., how far the node has progressed through the 530 ms cycle).

Lastly, we comment that the beaconing mechanism used in IEEE 802.11 ad-hoc mode for associating with a Basic Service Set (BSS) works unchanged in the presence of SSCH. A newly-arrived node can associate to a BSS as soon as it overlaps in the same channel with any already-arrived node.

3.7 Alternatives to SSCH

This section discusses alternative designs for SSCH within the constraints that were enumerated in Section 3.2. SSCH distributes the rendezvous and control traffic across all the channels. One

straightforward alternative scheme, which still only requires one radio, is to use one of the channels as a control channel, and all the other channels as data channels (e.g., [66]). Each node must then somehow split its time between the control channel and the data channels. Such a scheme will have difficulty in preventing the control channel from becoming a bottleneck. Suppose that two nodes exchange RTS/CTS on the control channel, and then switch to a data channel to do transmission. Unless all other nodes were also on the control channel during the RTS/CTS exchange, these two nodes will still need to do an RTS/CTS on this channel in order to avoid the hidden terminal problem. The two nodes should wait to even do the RTS/CTS until after an entire packet transmission interval has elapsed, because another pair of nodes might have also switched to this channel, orchestrating that decision on the control channel during a time that the first pair of nodes were not on the control channel. In order to amortize this startup cost, the nodes should have several packets to send to each other. However, while any one node remains on a data channel, any other node that desires to send it a packet must remain idle on the control channel waiting for the node it desires to reach to re-appear. If the idle node on the control channel chooses not to wait, and instead switches to a data channel with another node for which it has traffic, it may repeatedly fail to rendezvous with the first node, leading to a significant imbalance in throughput and possibly a logical partition.

The problems with a dedicated control channel may be solvable, but it is clear that a straightforward approach with unsynchronized rendezvous presents several difficulties. If one instead tried to synchronize rendezvous on the control channel, the control channel could again become a bottleneck simply because many nodes simultaneously desire to schedule packets on that channel.

3.8 Future Research

SSCH is a promising technology. In our future work, we plan to investigate how SSCH will perform when implemented over actual hardware, and subjected to the normal environmental vagaries of wireless networks, such as unpredictable variations in signal strength. As part of this implementation effort, we also plan to evaluate how metrics reflecting environmental conditions, such as ETX [40], can be integrated into SSCH.

Our results on MANET routing protocols show that existing routing protocols do not give the best performance over SSCH. In particular, we find that the time to discover a route can be quite large in a reactive routing protocol being run over SSCH. In the future, we plan to more thoroughly evaluate routing over SSCH (as opposed to classical single channel routing), and to explore a wider variety of proactive and hybrid routing protocols over SSCH.

There are at least four additional topics that would also need to be addressed before SSCH can be deployed. One is interoperability with nodes that are not running SSCH. Another is the evaluation of power consumption under this scheme. We have not attempted to evaluate the energy cost of switching channels, nor have we attempted to enable a power-saving strategy such as in the IEEE 802.11 specification for access-point mode. A third topic of investigation is the evaluation of SSCH in conjunction with auto-rate adaptation mechanisms. A fourth topic is a more detailed evaluation of the interplay between SSCH and TCP.

3.9 Summary

We have presented SSCH, a new protocol that extends the benefits of channelization to ad-hoc networks. This protocol is compatible with the IEEE 802.11 standard, and is

suitable for a multi-hop environment. SSCH achieves these gains using a novel approach called optimistic synchronization. We expect this approach to be useful in additional settings beyond channel hopping. We have shown through extensive simulation that SSCH yields significant capacity improvement in a variety of single-hop and multi-hop wireless scenarios. In the future, we look forward to exploring SSCH in more detail using an implementation over actual hardware. More information about SSCH and the QualNet simulation code can be obtained from:

Work on SSCH was done jointly with people at Microsoft Research. The SSCH protocol was co-developed with John Dunagan. Victor Bahl was involved in the entire research project and made sure that we proceeded in the right direction. Finally, this work benefitted greatly from Ken Birman's insightful comments.

CHAPTER 4

CLIENT CONDUIT AND FAULT DIAGNOSIS IN WIRELESS NETWORKS

4.1 Introduction

The convenience of wireless networking has led to a wide-scale adoption of IEEE 802.11 networks [58]. Corporations, universities, homes, and public places are deploying these networks at a remarkable rate. However, a significant number of pain points remain for end-users and network administrators. Users experience a number of problems such as intermittent connectivity, poor performance, lack of coverage, and authentication failures. These problems occur due to a variety of reasons such as poor access point layout, device misconfiguration, hardware and software errors, the nature of the wireless medium (e.g., interference, propagation), and traffic congestion. Figure 4.1 shows the number of such wireless-related complaints logged by the Information Technology (IT) department of Microsoft Corporation over a period of six months. The company has a large deployment of IEEE 802.11 networks with several thousand Access Points (APs) spread over more than forty buildings. Each complaint is an indication of end-user frustration and loss of productivity for the corporation. Furthermore, resolution of each complaint results in additional support personnel costs to the IT department; our research revealed that this cost is several tens of dollars, and this does not include the cost due to the loss of end-user productivity.

To resolve complaints quickly and efficiently, network administrators need tools for detecting, isolating, diagnosing, and correcting faults. To the best of our knowledge, there is no previous research that addresses fault diagnostic problems in IEEE 802.11 infrastructure networks. However, as discussed in Section 4.3, there has been considerable prior work on fault diagnosis in other settings, which we can leverage here. The

importance of diagnosing these problems in the real world is apparent from the number of companies that offer solutions in this space [5, 7, 39, 103, 131].

Figure 4.1: Number of wireless-related complaints logged by the IT department of a major US corporation. (Axes: number of wireless-related problems versus month.)

These products do a reasonable job of presenting statistical data from the network; however, they lack a number of desirable features. Specifically, they do not do a comprehensive job of gathering and analyzing the data to establish the possible causes of a problem. Furthermore, most products only gather data from the APs and neglect the client-side view of the network. Some products that monitor the network from the client's perspective require hardware sensors, which can be expensive to deploy and maintain. Also, current solutions do not provide any support for disconnected clients even though these are the ones that need the most help. We discuss these products in more detail in Section 4.3.

This chapter presents a flexible architecture for detecting and diagnosing faults in infrastructure wireless networks. We instrument wireless clients and (if possible) access points to monitor the wireless medium and devices that are nearby. Our architecture supports both proactive and reactive fault diagnosis. We use this monitoring framework to address some of the problems plaguing wireless users. We present a novel technique called Client Conduit that enables disconnected clients to diagnose their problems with

the help of nearby clients. This technique takes advantage of the beaconing and probing mechanisms of IEEE 802.11 to ensure that connected clients do not pay unnecessary overheads for detecting disconnected clients. We also present a simple technique for finding the approximate location of disconnected clients. We present a technique that uses nearby wireless clients for diagnosing wireless network performance problems. Finally, we show how our monitoring architecture naturally lends itself to detecting rogue or unauthorized access points in enterprise wireless networks.

We have implemented and evaluated the basic architectural framework, Client Conduit, and Rogue AP detection on the Windows operating system using off-the-shelf IEEE 802.11 network cards; we have evaluated our other mechanisms using tools such as AiroPeek [132] and WinDump [134]. Our results show that our techniques are effective; furthermore, they impose negligible overheads when clients are not experiencing problems.

We summarize the primary contributions of our chapter as follows:

• We believe ours is the first work to identify fault diagnosis in IEEE 802.11 infrastructure networks as an important area of research. The identification of various problems in such environments is an important contribution since wireless fault diagnosis is an area that needs attention.

• We present a flexible client-based architecture for detecting and diagnosing faults in an IEEE 802.11 infrastructure network. Our fault-diagnosis approach is unique in the wireless context since we use clients (and if possible, infrastructure APs) to monitor the network and the radio frequency (RF) environment.

• We describe a simple and efficient technique called Client Conduit that allows disconnected clients to communicate via nearby connected clients; this mechanism can be used to bootstrap wireless clients and resolve connectivity problems.

• We present novel solutions that use our architecture for detecting and diagnosing a variety of faults: locating disconnected clients, diagnosing performance problems, and detecting Rogue APs.

Our work is just a first step in the direction of self-healing wireless networks and there are a number of issues that still need to be addressed. From the vast number of wireless problems faced by end-users and network administrators every day, we have focused only on a subset of those problems; our selection was based on conversations with network administrators [24] along with the high-priority problems observed in user-complaint logs. Even though some of our techniques are applicable to other deployments (e.g., hotspots, homes), our main emphasis has been diagnosing faults in enterprise wireless networks. We ensure that our techniques do not introduce new security attacks but we do not focus on denial-of-service and greedy MAC attacks [101].

The rest of the chapter is organized as follows: In Section 4.2, we discuss the most important problems that users and network administrators complain about with respect to wireless LAN deployment. Section 4.3 discusses related work. Section 4.4 describes the components of our client-based architecture. Section 4.5 presents the Client Conduit protocol. Section 4.6 focuses on locating disconnected clients, performance isolation, and Rogue AP detection. Section 4.7 describes the implementation of our system and Section 4.8 presents an evaluation of our techniques. Finally, we discuss future work in Section 4.9 and then conclude.

4.2 Faults in a Wireless Network

We enumerate the most important problems that users and network administrators face when using and maintaining corporate wireless networks. Our list has been derived from interviews and discussions we conducted with network administrators and operation

engineers of Microsoft's IT department. These individuals are responsible for managing over 4,400 IEEE 802.11 APs distributed over forty buildings in the company.

Connectivity problems: End-users complain about inconsistent or absent network connectivity in certain areas of a building. Such dead spots or RF holes can occur due to a weak RF signal, lack of a signal, changing environmental conditions, or obstructions. Locating an RF hole automatically is critical for wireless administrators; they can then resolve the problem by relocating APs, by increasing the density of APs in the problem area, or by adjusting the power settings on nearby APs for better coverage.

Performance problems: This category includes all the situations where a client observes degraded performance, e.g., low throughput or high latency. There could be a number of reasons why the performance problem exists, e.g., traffic slow-down due to congestion, RF interference due to a microwave oven or cordless phone, multi-path interference, large co-channel interference due to poor network planning, or a poorly configured client/AP. Performance problems can also occur as a result of problems in the non-wireless part of the network, e.g., due to a slow server or proxy. It is therefore necessary that the diagnostic tool be able to determine whether the problem is in the wireless network or elsewhere. Furthermore, identifying the cause in the wireless part is important for allowing network administrators to better provision the system and improve the experience for end-users.

Network security: Large enterprises often use solutions such as IEEE 802.1x [57] to secure their networks. However, a nightmare scenario for IT managers occurs when employees unknowingly compromise the security of the network by connecting an unauthorized AP to an Ethernet tap of the corporate network. The problem is commonly

referred to as the Rogue AP Problem [5, 7, 36]. These Rogue APs are one of the most common and serious breaches of wireless network security. Due to the presence of such APs, external users are allowed access to resources on the corporate network; these users can leak information or cause other damage. Furthermore, Rogue APs can cause interference with other access points in the vicinity. Detecting Rogue APs in a large network via a manual process is expensive and time-consuming; thus, it is important to detect such APs proactively.

Authentication problems: According to the IT support group's logs, a number of complaints are related to users' inability to authenticate themselves to the network. In wireless networks secured by technologies such as IEEE 802.1x [57], authentication failures are typically due to missing or expired certificates. Thus, detecting such authentication problems and helping clients to bootstrap with valid certificates is important.

In this chapter, we focus on detecting RF holes, diagnosing performance problems, detecting Rogue APs, and helping a client to recover from an authentication problem via Client Conduit. As part of our future work, we will investigate diagnosis of authentication problems as well.

4.3 Related Work

To the best of our knowledge, there has been no previous research on fault diagnosis in IEEE 802.11 infrastructure networks. However, there are a number of commercial products that provide varying degrees of support for network management tasks, e.g., AirWave [7], Network Systems and Management (NSM) [39], Wireless Security Advisor [103], AirDefense [5], SpectraMon/SpectraGuard [131], AirMagnet [6], and Symbol [123]. Due to their proprietary nature, the available documentation typically describes the

feature-set and not the techniques; the comparison below is based on our understanding of their brochures.

The emphasis in most of these products is more towards managing wireless networks than diagnosing faults. These tools allow network administrators to obtain and visualize data from access points, upgrade firmware, manage security policies, etc. Some of them also provide real-time WLAN performance monitoring through IEEE 802.11 statistics such as packet throughput, number of retries, number of dropped packets at the AP, etc. Even though these low-level statistics are useful for network administrators, it is more desirable to provide higher-level fault detection and diagnosis, e.g., our approach detects network performance problems and pinpoints the components that are problematic.

Many of these products (e.g., AirWave, Unicenter) operate from the AP or the server side only, i.e., clients are not instrumented. Given the asymmetry and variability of the wireless medium, observing data from the client side is important for fault diagnosis, e.g., since conditions such as interference near the client can be drastically different from the conditions near the AP, client-side information is needed to do a detailed performance breakdown. Furthermore, our approach of modifying clients allows us to help disconnected clients via Client Conduit, locate Rogue APs and disconnected clients, and obtain better coverage for detecting Rogue APs.

Some products like AirMagnet and AirDefense obtain the complete view of the enterprise by deploying specialized sensors throughout the organization; these sensors pass all the packets to the server for analysis. Anecdotal evidence from talking to various network administrators suggests that products that use sensor-based monitoring are expensive to deploy; furthermore, their performance degrades significantly even when very few sensors are deployed due to the network traffic. Our approach uses regular

wireless clients to avoid extra hardware deployment costs. Of course, a limitation of our approach is that we rely on the presence of nearby clients for diagnosing some of the wireless faults; however, the increasing usage of wireless clients in organizations is making it easier to satisfy this requirement.

Since Rogue APs are a serious security problem, all the products listed above perform Rogue AP detection. Unlike our solution, most of these products achieve this goal either by using other APs [7, 39] or by using specialized sensors [5, 6, 131]; as discussed above, these approaches have deployment and fault-detection limitations. Our technique of using both clients and APs for detecting Rogue APs is similar to the Symbol technique [123]. However, unlike their approach, our technique can also detect Rogue APs that use MAC address spoofing of real APs; furthermore, we leverage our client and AP instrumentation to approximately locate Rogue APs using DIAL.

None of the above products provide solutions for assisting disconnected clients even though they need the most help. Our Client Conduit mechanism allows live and reactive diagnosis to be performed for such clients that are unable to access the infrastructure wireless network.

The notion of making wireless clients snoop the environment for ensuring secure and correct routing has been suggested for ad hoc networks. In [83], the authors propose a watchdog mechanism to detect network unreliability problems stemming from selfish nodes. The basic idea is to have watchdog nodes observe their neighbors and determine if they are forwarding traffic as expected; this approach for detecting routing anomalies has been further refined by others as well [15, 27].
Inspired by the watchdog mechanism, we also use nearby clients to monitor the RF conditions and traffic flow around them; in our architecture, the watchdog mechanism is used for fault detection (e.g., Rogue APs) and fault diagnosis (e.g., Client Conduit, locating disconnected clients, performance

isolation). Recent work [101] has used snooping wireless clients for detecting greedy and malicious behavior in hotspot environments; these techniques are orthogonal to our work and can be incorporated in our framework as well.

Researchers have developed techniques for diagnosing performance problems over the Internet. For example, Barford et al. [19] use traffic traces at the end points and classify delays as occurring due to a slow server, a slow client, or the network. While EDEN has similar goals over a wireless network, it does so without requiring tracing support from both end points. Tulip [82] is another approach for diagnosing delays over Internet paths. The client sends ICMP packets and uses their responses from different components to determine the cause, such as lost packets, packet reordering, or queueing delay. EDEN also uses ICMP packets. However, the broadcast nature of the wireless medium enables EDEN to use a novel approach of snooping these packets as a mechanism for diagnosing component delays.

4.4 System Architecture

In this section, we first discuss the requirements and then describe the components that make up our fault detection and diagnosis architecture.

System Requirements

Before we describe the system components, we enumerate the requirements for our system:

• We require that the software on clients be augmented for monitoring. In our system, software modifications on APs are needed only for better scalability and for analyzing an AP's performance (Section 4.6.2). Since our approach does not require hardware modifications, the bar for deploying our system is lower.

• For some of our mechanisms, we need the ability to control beacons and probes. We also require that clients have the capability of starting an infrastructure network (i.e., become an AP) or an ad hoc network on their own; this ability is supported by many wireless cards, e.g., Atheros [14], Native WiFi [86]. Whenever faced with a choice of starting an ad hoc or an infrastructure network, we prefer the latter since infrastructure mode is better supported in current cards.

• We rely on the availability of a database that keeps track of the location of all the access points; such location databases are typically maintained by network administrators.

• Some of our techniques require the presence of nearby clients or access points. With the increasing deployment of access points and the use of wireless laptops and PDAs in enterprise wireless networks, this requirement is becoming relatively easy to satisfy in these environments. In fact, based on SNMP data collected from APs over a period of two days, we observed the presence of associated wireless clients on our floor (approximately 2500 sq. meters) during working hours of the day; thus, with such client densities, there is a high likelihood that our requirement will be satisfied.

Compared with the existing products that require deploying special wireless sensors throughout the enterprise, our approach takes advantage of nearby clients and access points instrumented with software sensors, thereby imposing a lower deployment cost.

System Components

Our system consists of the following components: a Diagnostic Client (DC) that runs on a wireless client machine, an optional Diagnostic AP (DAP) that runs on an Access

Point, and a Diagnostic Server (DS) that runs on a backend server of the organization (see Figure 4.2). Below, we describe each of these in detail.

Diagnostic Client module or DC: The Diagnostic Client module monitors the RF environment and the traffic flow from neighboring clients and APs. Note that during normal activity, the client's wireless card is not placed in promiscuous mode. The DC uses the collected data to perform local fault diagnosis. Depending on the individual fault-detection mechanism, a summary of this data is transmitted to the DAPs or DSs at regular intervals, e.g., for Rogue AP detection, the DC in our prototype sends MAC and channel information of nearby APs every 30 seconds. In addition, the DC is geared to accept commands from the DAP or the DS to perform on-demand data gathering, e.g., switching to promiscuous mode and analyzing a nearby client's performance problems. In case the wireless client becomes disconnected, the DC logs data to a local database/file. This data can be analyzed by the DAP or DS at some future time when network connectivity is resumed.

Diagnostic Access Point module or DAP: The Diagnostic AP's main function is to accept diagnostic messages from DCs, merge them along with its own measurements, and send a summary report to the DS. The Diagnostic AP is not a fundamental requirement of our architecture; it is primarily needed for offloading work from the DS. Most of our techniques can work in an environment with a mixture of legacy APs and DAPs: if an AP is a legacy AP, its monitoring functions are performed by the DCs and its summarizing functions and checks are performed at the DS. In the rest of the chapter, for ease of exposition, we assume the presence of DAPs.

Diagnostic Server module or DS: The Diagnostic Server accepts data from DCs and DAPs and performs the appropriate analysis to detect and diagnose different faults. The

DS also has access to a database that stores each AP's location. Network administrators may deploy multiple DSs in the system to balance the load, e.g., each AP's MAC address could be hashed to a particular DS. In the rest of the chapter, we present our mechanisms as if one Diagnostic Server is present in the system.

Figure 4.2: Fault Diagnosis Architecture. (Components shown: the Diagnostic Server (DS) with user-level and authentication information from the RADIUS, Kerberos, and DHCP servers; a DAP and a legacy AP exchanging diagnostic messages and actions with the DS; DCs sending monitoring information; and a disconnected peer whose messages are forwarded by a DC via Client Conduit.)

Figure 4.2 gives a schematic view of our fault diagnosis system. As shown, the Diagnostic Server interacts with other network servers, e.g., the RADIUS [105] and Kerberos [90] servers, to get client authorization and user information. Our architecture allows disconnected clients to communicate with the DS via a nearby connected client using the Client Conduit protocol; this mechanism is presented in Section 4.5.
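The load-balancing idea mentioned above, hashing each AP's MAC address to a particular DS, could be sketched as follows. This is only an illustration under stated assumptions: the text does not specify a hash function or server naming, so the use of SHA-256 with a modulo, the function names, and the server identifiers are all hypothetical.

```python
import hashlib

def normalize_mac(mac):
    """Reduce a MAC address to 12 lowercase hex digits, ignoring ':' or '-'."""
    hex_digits = mac.replace(":", "").replace("-", "").lower()
    if len(hex_digits) != 12 or any(c not in "0123456789abcdef" for c in hex_digits):
        raise ValueError(f"not a valid MAC address: {mac!r}")
    return hex_digits

def pick_diagnostic_server(ap_mac, servers):
    """Map an AP to one DS by hashing its MAC address (assumed scheme)."""
    if not servers:
        raise ValueError("at least one Diagnostic Server is required")
    digest = hashlib.sha256(normalize_mac(ap_mac).encode("ascii")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]

if __name__ == "__main__":
    ds_pool = ["ds-0", "ds-1", "ds-2"]  # hypothetical server identifiers
    print(pick_diagnostic_server("00:11:22:AA:BB:CC", ds_pool))
```

Because the mapping depends only on the MAC address and the server list, every DC and DAP that reports on a given AP reaches the same DS without coordination. A plain modulo remaps most APs whenever the number of servers changes; a deployment that adds DSs frequently might prefer a consistent-hashing variant.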

Our system supports both reactive and proactive monitoring. In proactive monitoring, DCs and DAPs monitor the system continuously: if an anomaly is detected by a DC, DAP, or DS, an alarm is raised for a network administrator to investigate. The reactive monitoring mode is used when support personnel want to diagnose a user complaint. They can issue a directive to a DC from one of the DSs to collect and analyze the data for diagnosing the problem. We believe that it is acceptable to increase the network and CPU load (on the DCs, DAPs, DSs) by a small amount during reactive monitoring; of course, in the proactive case, these overheads must be kept low.

Our architecture itself imposes negligible overheads with respect to power management; the individual techniques have to be designed to prevent unnecessary battery wastage. Both the proactive and reactive techniques presented later in this chapter consume very little bandwidth, CPU, or disk resources; as a result, they should have negligible impact on battery consumption. Only during data transfer in Client Conduit does a connected client send/receive messages on behalf of a disconnected client. To ensure that the helping client's applications (or battery) are not affected significantly, it is offered a knob to control the amount of resources it wants to devote to this transfer (see Section 4.5.2).

Table 4.1 shows the various problems diagnosed in this chapter, the entities (DCs, DAPs, and DSs) involved in the diagnosis, and whether the solution can be used with legacy APs.

System Scaling

We have designed our system to scale with the number of clients and APs in the system. The two shared resources in our system are DSs and DAPs. To prevent a single Diagnostic Server from becoming a potential bottleneck in our system, the design allows

more DSs to be added as the system load increases. Furthermore, we offload work from each individual DS by sharing the diagnosis burden with the DCs and the DAPs. The DS is used only when the DCs and DAPs are unable to diagnose the problem and the analysis requires a global perspective and additional data, e.g., signal strength information obtained from multiple DAPs may be needed for locating a disconnected client. As stated earlier, the presence of legacy APs degrades scalability since the work usually performed by DAPs would need to be performed by the DSs.

Table 4.1: Different fault diagnosis mechanisms and entities that can diagnose them; the last column indicates if the solution can be supported using legacy APs

Fault Diagnosis              Where performed   Support for legacy APs?
Help disconnected client     DC                Yes
Locate disconnected client   DS                Yes
Performance Isolation        DC and DAP        Partially
Detect Rogue APs             DS                Yes

Similarly, since the DAP is a shared resource, making it do extra work can potentially hurt the performance of all its associated clients. To reduce the load on a DAP, different fault diagnosis mechanisms can use a simple technique that we refer to as Busy AP Optimization: with this optimization, an AP does not perform active scanning if any client is associated with it; the associated clients perform these operations as needed. The AP continues to perform passive monitoring activities that have a negligible effect on its performance. If there is no client associated, the AP is idle and it can perform these monitoring operations. This approach ensures that most of the physical area around the AP is monitored without hurting the AP's performance.
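The Busy AP Optimization described above amounts to a small scheduling rule, sketched below. The function and task names are hypothetical labels chosen for this illustration and do not come from the prototype.

```python
from typing import List


def monitoring_tasks_for_ap(associated_clients: int) -> List[str]:
    """Return the monitoring work a DAP takes on under Busy AP Optimization."""
    if associated_clients < 0:
        raise ValueError("associated_clients cannot be negative")
    # Passive monitoring is cheap, so the AP always performs it.
    tasks = ["passive_monitoring"]
    # Active scanning is left to the associated clients; the AP scans
    # only while it is idle, i.e., when no client is associated with it.
    if associated_clients == 0:
        tasks.append("active_scan")
    return tasks


if __name__ == "__main__":
    print(monitoring_tasks_for_ap(0))  # idle AP
    print(monitoring_tasks_for_ap(3))  # busy AP
```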

System Security

The interactions between the DC, DAP, and DS are secured using EAP-TLS [2] certificates issued over IEEE 802.1x. An authorized certification authority (CA) issues certificates to DCs, DAPs and DSs; we use these certificates to ensure that all communication between these entities is mutually authenticated. We do not address malicious behavior by legitimate users in our environment. Researchers have developed techniques for detecting greedy and malicious behavior for hotspot environments [101]; others have suggested techniques to handle problems due to false information sent by malicious clients to central entities such as the DS [99]. These approaches are complementary to our design and could be included in our system.

4.5 Client Conduit

This section presents a novel mechanism called Client Conduit that allows disconnected wireless clients to convey information to network administrators and support personnel. If a wireless client cannot connect to the network, the DC logs the problem in its database. When the client is connected later (e.g., via a wired connection), this log is uploaded to the DS, which performs the diagnosis to determine the cause of the problem. However, the client is sometimes within range of other connected clients; it may be disconnected because it is just outside the range of any AP or because of authentication problems. In this situation, it would be desirable to perform fault diagnosis with the DS immediately and, if possible, rectify the problem. We now focus on this scenario.

On first thought one may ask: why not have the disconnected node simply send a message to its connected neighbor? Unfortunately, this approach does not work because

IEEE does not allow a client to be connected to two networks at the same time. Since the connected node has already associated to an infrastructure network, it cannot simultaneously connect to an ad hoc network with the disconnected client D; if it wants to receive a message from D, it first has to disconnect and then join the ad hoc network started by D. This is inefficient and unfair to a normally-functioning connected client.

One can imagine solving this problem using multiple radios on the connected client (one dedicated on an ad hoc network for diagnosis), or using MultiNet (which allows a client to multiplex a single wireless card such that it is present on multiple networks), or by making a connected client periodically scan all channels. All these approaches have the undesirable property of penalizing the normal-case operation/costs to deal with a problem that is expected to occur infrequently. In the periodic scanning case, switching the wireless card across channels or networks can cause packet drops at the connected client. In the MultiNet case, the wireless card will periodically spend time on the ad hoc network, and will thus consume bandwidth on the connected client. On the other hand, our Client Conduit approach imposes no overheads in the common case when no disconnected clients are present in the neighborhood.

The Client Conduit Protocol

We now discuss our Client Conduit protocol that allows a disconnected client to be diagnosed by a DS via one of the connected clients. Client Conduit achieves its efficiency (of not penalizing connected clients) by exploiting two operational facts about the IEEE protocol. First, even when a client is associated to an AP, it continues to receive beacons from neighboring APs or ad hoc networks at regular intervals. Second, a connected client can send directed or broadcast Probe Requests without disconnecting from the infrastructure network.

We now present the Client Conduit protocol for a scenario where a disconnected client D is in the vicinity of a connected client C (see Figure 4.3). In the following description, we refer to the first 4 steps of the protocol as the Connection Setup phase and the last step as the Data Transfer phase.

Figure 4.3: Client Conduit Mechanism (Steps 1 through 5 are described below)

1. The DC on the disconnected client D configures the machine to operate in promiscuous mode. It scans all channels to determine if any nearby client is connected to the infrastructure network. If it detects such a connected client on a channel, it starts a new infrastructure or an ad hoc network on the channel on which it detected the client's packets. For the reasons discussed in Section 4.4.1, and for the simplicity of exposition, we assume that client D switches mode to become an AP and starts an infrastructure network.
2. This newly-formed AP at D now broadcasts its beacon like a regular AP, with an SSID of the form SOS HELP <num> where num is a 32-bit random number to differentiate between multiple disconnected clients.

Footnote 1: By examining the ToDS and FromDS fields of IEEE data frames [58], client D can determine whether the data packet is part of an infrastructure network and is being sent to/from an AP.

3. The DC on the connected client C detects the SOS beacon of this new AP. At this point, C needs to inform D that its request has been heard and it can stop beaconing. If client C tries to connect to D, it would need to disconnect from the infrastructure network, thereby hurting the performance of C's applications. Instead, we utilize the active scanning mechanism of IEEE networks: C sends a Probe Request of the form SOS ACK <num> to D. Note that the Probe Request is sent with a different SSID than the one being advertised by the AP running on D. This approach prevents some other nearby client that is not involved in the Client Conduit protocol from inadvertently sending a Probe Request to D (as part of that client's regular tests of detecting new APs in its environment).
4. When D hears this Probe Request (and perhaps other requests as well), it stops being an AP, and becomes a station again. Note that in response to the Probe Request, a Probe Response is sent out by D; client C now knows that it does not need to send more Probe Requests (it would have stopped anyway when D's beacons stopped). More importantly, D's Probe Response indicates if D would like to use client C as a hop for exchanging diagnostic messages with the DS. This response mechanism ensures that if multiple connected clients try to help D, only one of them is chosen by D for setting up the conduit with the DS.
5. Now D starts an ad hoc network and C joins this network via MultiNet [30]. At this point, C becomes a conduit for D's messages and D can exchange diagnostic messages with the DS through C.

The key advantage of the Client Conduit protocol is that connected clients do not experience unnecessary overheads during normal operation. Their overheads during the execution of the protocol are discussed later in this section.
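The SSID handling in the Connection Setup phase can be sketched as follows. This is only an illustration: the underscore separators in the `SOS_HELP_` and `SOS_ACK_` prefixes are an assumption (the exact byte layout of the SSID is not given here), and all function names are hypothetical.

```python
import secrets
from typing import Optional

SOS_PREFIX = "SOS_HELP_"  # assumed spelling of the beacon SSID prefix
ACK_PREFIX = "SOS_ACK_"   # assumed spelling of the Probe Request SSID prefix


def new_sos_ssid() -> str:
    """SSID beaconed by disconnected client D, tagged with a 32-bit random number."""
    return f"{SOS_PREFIX}{secrets.randbits(32)}"


def parse_sos_ssid(ssid: str) -> Optional[int]:
    """Return the 32-bit number from an SOS beacon SSID, or None if it is not one."""
    if not ssid.startswith(SOS_PREFIX):
        return None
    digits = ssid[len(SOS_PREFIX):]
    if not (digits.isascii() and digits.isdigit()):
        return None
    num = int(digits)
    return num if num < 2 ** 32 else None


def ack_ssid_for(beacon_ssid: str) -> Optional[str]:
    """SSID that connected client C places in its Probe Request to D."""
    num = parse_sos_ssid(beacon_ssid)
    # The acknowledgment deliberately uses a different SSID than the beacon,
    # so that clients outside the protocol do not probe D by accident.
    return None if num is None else f"{ACK_PREFIX}{num}"


if __name__ == "__main__":
    beacon = new_sos_ssid()
    print(beacon, "->", ack_ssid_for(beacon))
```

Echoing the same number in the acknowledgment lets D tell which of several simultaneous SOS beacons a given Probe Request answers.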

It is important to note that the Client Conduit mechanism can also be used for bootstrapping clients. For example, suppose that a client tries to access a wireless network for the first time and does not have EAP-TLS certificates, but has other credentials such as Kerberos credentials; Client Conduit can be used to authenticate the user/machine with the backend RADIUS/Kerberos servers. New certificates can then be installed on the client machine; similarly, a client's expired certificates can also be refreshed without requiring a wired connection.

It is possible that a client D is within the range of an AP and is disconnected because of IEEE 802.1x authentication problems [24]. Client Conduit can be used if a connected client is in range as well. If there is no such client, one can dynamically configure the AP to allow D's diagnostic messages to the backend DS (or to the RADIUS servers, which can forward them to the DS) via the uncontrolled port [57].

Client Conduit Security and Attacks

We must ensure that the Client Conduit protocol does not introduce any new security leaks or opportunities for denial-of-service attacks in the system. To ensure that a malicious/unauthorized client does not obtain arbitrary access to the network, the connected client allows a disconnected client's packets to be exchanged only with the DS or backend authentication servers. We now discuss two potential abuses of Client Conduit: hurting the performance of helping clients and disguising a Rogue AP as a disconnected client.

Performance Degradation of Helping Clients

When a connected client C helps a disconnected client via Client Conduit, we need to ensure that the performance of C's applications is not adversely affected. During the Connection Setup part of Client Conduit, the connected client C only needs to process the beacon message and send/receive probe messages; no messages are forwarded by C on the disconnected client's behalf. These steps not only consume negligible resources on C but they also do not result in any security leak or compromise on C; of course, C can further rate-limit or stop performing these steps if it discovers that the disconnected client is making it perform these steps often.

We now consider the Data Transfer part of the protocol for possible security and denial-of-service attacks. Switching to MultiNet mode can consume bandwidth at the connected client [30]. There are two problems that need to be addressed. First, a malicious client should not be allowed to waste a connected client C's resources by making it enter MultiNet mode unnecessarily. Second, even when helping a legitimate client, C should be able to control the amount of resources that it wants to allocate for the disconnected client D during the MultiNet transfer.

The second problem can be addressed by providing a knob to the client that allows it to limit the percentage of time that it spends on the ad hoc network relative to the infrastructure network; client C may also limit this usage to save battery power. Section characterizes the disconnected client's performance overheads due to this tradeoff.

To prevent the first problem due to malicious clients, we add the following authentication step before Data Transfer to ensure that only legitimate clients are allowed to connect via client C. After the Connection Setup phase, client C switches to MultiNet mode for performing authentication. To prevent a denial-of-service (DoS) attack where C is forced into MultiNet mode repeatedly, C can limit the number of times per minute that it performs such an authentication step.
As part of the authentication step, client C obtains the EAP-TLS machine certificate from the disconnected client and validates it (for ensuring

mutual authentication, client D can perform these steps as well). If the disconnected client has no certificates or its certificates have expired, client C acts as an intermediary for running the desired authentication protocol, e.g., C could help D perform Kerberos authentication from the back end Kerberos servers and obtain the relevant tickets. If the disconnected client D still cannot authenticate, C asks D to send the last (say) 10 KBytes of its diagnosis log to C and C forwards this log to the DS. To prevent a possible DoS attack in which a malicious client tries to send this unauthenticated log repeatedly (e.g., while spoofing different MAC addresses), the connected client can limit the total amount of unauthenticated data that it sends in a fixed time period, e.g., C could say that it will send at most 10 KBytes of such data every 5 minutes.

Preventing Disguised Rogue APs

As discussed in Section 4.2, unauthorized APs are a serious security problem in an enterprise wireless network. An attacker who wants to set up an unauthorized AP and remain undetected may try to exploit the properties of Client Conduit. The attacker's AP can be set up to beacon with an SOS SSID; our Rogue AP detection mechanism (Section 4.6.3) will assume that this beaconing device is actually a disconnected client and not declare it as a Rogue AP. Thus, we need to distinguish between the cases where the beaconing device is a legitimate client and where it is actually a Rogue AP.

In Client Conduit, when a disconnected client becomes an AP or starts an ad hoc network during the Connection Setup and starts beaconing, it does not send or receive any data packets. Thus, if a DC ever detects an AP (or a node in ad hoc mode) that is beaconing the SOS SSID and sending/receiving data packets, the DC can immediately flag it as a Rogue device. There is another test that can be used to detect such a Rogue device: when the helping client hears the Probe Response in step 4 of the Client Conduit

protocol, it knows that the disconnected client no longer needs to beacon. Thus, if the helping client continues to hear the SOS beacons after a few seconds, it can flag the device as a disguised Rogue device.

4.6 Fault Detection and Diagnosis

This section discusses our techniques for detecting and diagnosing faults in an IEEE wireless network. Section describes a simple technique for locating disconnected clients. Section presents our mechanisms for isolating performance problems and Section describes how we detect rogue access points.

Locating Disconnected Clients

The ability to locate disconnected wireless clients automatically in a fault diagnosis system is valuable for proactively determining problematic regions in a deployment, e.g., poor coverage or high interference (locating RF holes) or for locating possibly faulty APs. A disconnected client can determine that it is in an RF hole if it does not hear beacons from any AP (as opposed to being disconnected due to some other reason such as authentication failures). To approximately locate disconnected clients (and hence help in locating RF holes), we now discuss a technique called Double Indirection for Approximating Location or DIAL.

As discussed earlier, when a client D discovers that it is disconnected, it becomes an AP or starts an ad hoc network and starts beaconing. To determine the approximate location of this client, nearby connected clients hear D's beacons and record the signal strength (RSSI) of these packets. They inform the DS that client D is disconnected and send the collected RSSI data. At this point, the DS executes the first step of DIAL to determine the location of the connected clients: this can be done using any known

location-determination technique in the literature [17, 73]. In the next step of DIAL, the DS uses the locations of the connected clients as anchor points and the disconnected client's RSSI data to estimate its approximate location. This step can be performed using any scheme that uses RSSI values from multiple clients for determining a machine's location [17, 25, 73]. Since the locations of the connected clients are themselves estimates with some error, using them as anchor points further increases the error in the disconnected client's estimated location. In Section 4.8.3, we show that this error is approximately 10 to 12 meters, which is acceptable for estimating the location of disconnected clients.

Network Performance Problems

Our design for diagnosing network performance problems comprises two lightweight components: a proactive/passive monitoring component and a reactive diagnosing component. The monitoring component runs in the background at the DC and informs the diagnosing component when it detects connections experiencing poor performance. At this point, the diagnosing component analyzes the connections and outputs a report that gives a breakdown of the delays, i.e., the extent of the delays in the wired and the wireless part, and for the latter, a further breakdown into delays at the client, AP, and the medium. Note that the monitoring component can be conservative in declaring that network problems are being encountered; a false alarm simply invokes our diagnosing component. Since this component has low overheads, invoking it has a small impact on the performance of clients and APs. These components have not been implemented yet in our current prototype but we have evaluated the effectiveness of some of these techniques using tools such as AiroPeek and WinDump.
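Returning to DIAL, its second step (estimating the disconnected client's position from anchor clients and their RSSI readings) can be illustrated with a deliberately simple stand-in. The sketch below uses an RSSI-weighted centroid, which is an assumption chosen for brevity and is not the scheme from [17, 25, 73]; the function name and the coordinate units are likewise hypothetical.

```python
from typing import Sequence, Tuple

# Each anchor is (x in meters, y in meters, RSSI in dBm of D's beacons at that anchor).
Anchor = Tuple[float, float, float]


def estimate_location(anchors: Sequence[Anchor]) -> Tuple[float, float]:
    """Weighted-centroid estimate of the disconnected client's position."""
    if not anchors:
        raise ValueError("at least one anchor is required")
    # Convert dBm to linear power (mW) so that stronger readings,
    # which suggest a closer anchor, receive larger weights.
    weights = [10 ** (rssi / 10.0) for _, _, rssi in anchors]
    total = sum(weights)
    x = sum(w * ax for w, (ax, _, _) in zip(weights, anchors)) / total
    y = sum(w * ay for w, (_, ay, _) in zip(weights, anchors)) / total
    return x, y


if __name__ == "__main__":
    # Anchor positions would themselves come from the first step of DIAL.
    print(estimate_location([(0.0, 0.0, -45.0), (20.0, 0.0, -60.0), (0.0, 15.0, -70.0)]))
```

Because the anchor positions are estimates too, any error in them carries over into the result, which is the error compounding discussed above.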

Detecting Network Performance Problems

We focus on diagnosing performance problems for TCP connections since TCP is the most widely used transport protocol in the Internet. For a TCP connection, we can passively diagnose performance problems by leveraging the connection's data and acknowledgment (ACK) packets. For other transport protocols, we can determine end-to-end loss rate and round-trip times using either active probing or performance reports (e.g., RTCP reports [53]).

Network performance problems can manifest themselves in different ways, such as low throughput, high loss rate, and high delay. We do not use throughput as a metric for detecting a problem since it is dependent on the workload (i.e., the client's application may not need a high throughput) and on specific parameters of the transport protocol (e.g., initial window size, sender and receiver window sizes in TCP). Instead, we use packet loss rate and round-trip time for detecting performance problems. Estimating the round-trip time (RTT) in a TCP connection is simple: if the client is a sender, it already keeps track of the RTT; if the client is a receiver, it can apply the heuristic proposed in [139] to estimate the round-trip time.

To estimate the loss rate, we use heuristics suggested in [47] and [10] on the client side. We compute different loss rates for packets sent and received by the client. For data packets sent by the client, the loss rate is estimated as the ratio of retransmitted packets to the packets sent over the last L RTTs [10]. This estimation mechanism assumes that the TCP implementation uses Selective ACKs so that loss rate is not overestimated unnecessarily; this is a reasonable assumption since a number of operating systems now support this option by default, e.g., Windows, Linux, Solaris. As shown in [10], this estimate can be higher than the actual loss rate when timeouts occur in a TCP connection.
For our purposes, this inaccuracy is acceptable for two reasons: first, if a TCP

connection is experiencing timeouts, it is probably experiencing problems and is worth diagnosing; second, the only consequence of a mistake is to trigger our diagnosis component, which incurs low overhead. If more accurate analysis is needed, the LEAST approach suggested in [10] can be used.

For the data packets received by the client, we use an approach similar to the one suggested in [47] to estimate the number of losses: if a packet is received such that its starting sequence number is not the next expected sequence number, the missing segment is considered lost. The loss rate is estimated as the ratio of lost packets to the total number of expected packets in the last L RTTs. Note that the expected number of bytes is calculated as the maximum observed sequence number minus the minimum during the last L RTTs; we apply the idea in [139] to estimate maximum segment size (MSS), and estimate the number of packets by dividing the number of bytes by MSS. Our assumption is that segments are rarely delivered out-of-order in a TCP connection (which has been observed by others [22]).

Our detection component triggers the diagnosis component if a connection is very lossy or it experiences high delay. A connection is detected as experiencing high delays if the RTT of a particular packet is more than 250 msec or is higher than twice the current TCP RTT [140]. To avoid invoking our diagnosis algorithm for high delays that occur temporarily, we flag a connection only when D or more packets experience a high delay. A connection is classified as lossy if its loss rate (for transmitted or received packets) is higher than 5% [96, 140].

Both D and L are configurable parameters and each represents a tradeoff between responsiveness of the detection component and unnecessary invocation of the diagnosis component. That is, with a low value of D or L, any change in delays/losses will be detected quickly but it may also result in invoking the diagnosis component unnecessarily. For high values, apart from slow responsiveness, another problem occurs: the TCP connection may end before a sufficient number of samples have been collected. Such a situation can occur with short Web transfers. We can alleviate this problem by aggregating loss rate and delay information between the client and remote hosts across TCP connections. We are currently exploring such techniques along with choosing appropriate values of D and L.

Isolating Wireless and Wired Problems

When the DC at a client detects a network performance problem for a TCP connection, it communicates with its associated DAP to differentiate between the delays on the wired and wireless parts of the path. The DAP then starts monitoring the TCP data and ACK packets for that client's connection. If the DC is the sender in the TCP connection, the DAP computes the difference between the time at which it receives a data packet sent by the client to the remote host and the time at which it receives the corresponding TCP ACK packet; this time difference is an estimate of the delay incurred in the wired network. To ensure that the round-trip time estimate is reasonable, various heuristics used by TCP need to be applied to these round-trip measurements as well, e.g., Karn's algorithm [117]. The DAP sends this estimate to the DC, which can now determine the wireless part of the delay by subtracting this estimate from the TCP round-trip time.

A similar approach can be used to compute this breakdown when the client is a receiver: the DAP determines the wireless delay by monitoring the data packets from the remote host to the client and the corresponding ACK packets. Note that the amount of state maintained at the DAP is small since it corresponds to the number of unacknowledged TCP packets; this can be reduced further by sampling.
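The detection thresholds and the wired/wireless split described above can be put together in a short sketch. The 250 msec, twice-the-RTT, and 5% values come from the text; the function names, the argument layout, and the clamping of the wireless delay at zero are assumptions made for this illustration.

```python
from typing import Sequence

HIGH_DELAY_MS = 250.0   # absolute per-packet RTT threshold from the text
LOSSY_THRESHOLD = 0.05  # a loss rate above 5% marks a connection as lossy


def sent_loss_rate(retransmitted: int, sent: int) -> float:
    """Loss rate for packets sent by the client over the last L RTTs."""
    return 0.0 if sent == 0 else retransmitted / sent


def is_high_delay(sample_rtt_ms: float, current_tcp_rtt_ms: float) -> bool:
    """A packet sees high delay if its RTT exceeds 250 ms or twice the TCP RTT."""
    return sample_rtt_ms > HIGH_DELAY_MS or sample_rtt_ms > 2.0 * current_tcp_rtt_ms


def should_trigger_diagnosis(sample_rtts_ms: Sequence[float],
                             current_tcp_rtt_ms: float,
                             loss_rate: float,
                             d_packets: int) -> bool:
    """Trigger when the connection is lossy or D or more packets saw high delay."""
    high = sum(1 for rtt in sample_rtts_ms if is_high_delay(rtt, current_tcp_rtt_ms))
    return loss_rate > LOSSY_THRESHOLD or high >= d_packets


def wireless_delay_ms(tcp_rtt_ms: float, wired_estimate_ms: float) -> float:
    """Wireless share of the delay: TCP RTT minus the DAP's wired estimate."""
    # Clamping at zero is an assumption, guarding against measurement noise.
    return max(0.0, tcp_rtt_ms - wired_estimate_ms)


if __name__ == "__main__":
    rtts = [80.0, 300.0, 90.0, 310.0, 95.0]
    print(should_trigger_diagnosis(rtts, 100.0, sent_loss_rate(1, 100), d_packets=2))
    print(wireless_delay_ms(tcp_rtt_ms=120.0, wired_estimate_ms=70.0))
```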

Diagnosing Wireless Network Problems

A client may experience poor wireless performance due to a number of reasons, such as an overloaded processor at the AP or the client, problems in the wireless medium, some driver or other kernel issues at either the AP or the client. We quantify the effect of these problems by observing their impact on packet delay in the wireless network path. We group these performance problems into three categories: packet delay at the client, packet delay at the AP, and packet delay in the wireless medium. In this section, we describe a collaborative scheme, called Estimating Delay using Eavesdropping Neighbors or EDEN, which leverages the presence of other clients to quantify the delay experienced in each of the above categories. Since electromagnetic waves travel at the speed of light, we can safely assume that RF propagation delays are negligible relative to the client or AP delays.

When a client D's performance diagnosis component is triggered, it starts broadcasting packets asking for diagnosis help from nearby clients. All clients who hear these packets switch to promiscuous mode and ask the DAP to start the diagnosis (Section shows that the CPU overheads of entering promiscuous mode are low on modern processors). Security mechanisms similar to the ones discussed in Section can be used to prevent attacks on these clients. Note that we use multiple snooping clients in EDEN primarily for robustness: multiple clients increase the likelihood that at least one client hears the EDEN protocol requests and responses discussed below.

EDEN proceeds in two phases. In the first phase, the DAP to which client D is associated estimates the delay at D. The DAP periodically (say every 2 seconds) sends Snoop request packets to client D. When D receives a Snoop request packet, it immediately replies with a Snoop response message.
The eavesdropping clients log the time when they hear a Snoop request and the first attempt by D to send the corresponding

Snoop response packet, i.e., we only record the times of response packets for which the retransmission bit is clear. If an eavesdropping client misses either of these packets, it ignores the timing values for that request/response pair. The difference between the recorded times is the client delay, i.e., application and OS delays experienced by D after receiving the request packet. For robustness, Snoop requests are sent a number of times (say 20); the client and AP delays are averaged over all these instances.

In the second phase, a similar technique is used to measure the AP delay, i.e., client D sends the Snoop request packets and the AP sends the responses. Client D also records the round-trip times to the AP for these Snoop requests and responses along with the number of request packets for which it did not receive a response, e.g., because the request or response was lost. Strictly speaking, this client and AP delay also includes the delay due to contention experienced in the wireless medium. In Section 4.8.4, we discuss the extent of inaccuracy introduced in EDEN's estimates due to traffic congestion.

At the end of the protocol, all the eavesdropping clients send the AP and client delay times to the client D. The difference between the round-trip time reported by D, and the sum of the delays at the client and the AP, approximates the sum of the delay experienced by the packet in the forward and backward wireless link. The client can then report the client/AP/medium breakdown to the network administrator; it can also report the percentage of unacknowledged request packets as an indicator of the network-level loss rate on the wireless link.

Rogue AP Detection

As discussed in Section 4.2, Rogue APs are unauthorized APs that have been connected to an Ethernet tap in an enterprise or university network; such APs can result in security

holes, and unwanted RF and network load. Rogue APs are considered a major security issue for enterprise wireless LANs [5, 7, 36]. Our architectural framework of using clients and (if possible) APs to monitor the environment around them naturally lends itself to detecting Rogue APs. Our basic approach is to make clients and DAPs collect information about nearby access points and send it to the DS. When the DS receives information about an AP X, it checks the AP location database and ensures that X is a registered AP in the expected location and channel.

Assumptions

We assume that all Rogue APs and the corresponding connected rogue clients use off-the-shelf IEEE compliant hardware. Our approach essentially raises the bar such that non-compliant APs with low-level modifications are needed to defeat our scheme: to avoid detection, an attacker must modify the Rogue AP to not beacon and not respond to probe requests. Of course, an attacker can simply use a proprietary access point or one with different technology, e.g., HIPERLAN. Detecting such intruders requires special hardware and is not our goal. We simply want a low-cost mechanism that addresses the (common case) Rogue AP problem being faced in current deployments: for many network administrators, the main goal is to detect APs inadvertently installed by employees for experimentation or convenience [24]. As part of future research, we will investigate the detection of non-compliant Rogue access points and clients as well.

If two companies have neighboring wireless networks, our mechanisms will classify the other company's access points as Rogue APs. If this classification is unacceptable, the network administrators of the respective companies can share their AP location databases.

Monitoring at clients and APs

In our system, each DC monitors the packets in its vicinity (non-promiscuous mode), and for each AP that it detects, it sends a 4-tuple < MAC address, SSID, channel, RSSI > to the DS. Essentially, the 4-tuple uniquely identifies an AP in a particular location and channel. To get this information, a DC needs to determine the MAC addresses of all APs around it.

The DC can obtain the MAC address of an AP by switching to promiscuous mode and observing data packets (it can use the FromDS and ToDS bits in the packet to determine which address belongs to the AP). However, we can achieve the same effect using a simpler approach: since IEEE requires all APs to broadcast beacons at regular intervals, the DC can obtain the MAC addresses from the beacons of all the APs that it can hear. In Section 4.8.5, we show that a DC hears beacons not only on its own channel but possibly on overlapping channels as well; this property increases the likelihood of a Rogue AP being detected.

To ensure that we do not miss a Rogue AP even if no client is present on any channel overlapping with the AP, we use the Active Scanning mechanism of the IEEE protocol: when a client wants to find out what APs are nearby, the client goes to each of the 11 channels (in b), sends Probe Requests and waits for Probe Responses from all APs that hear those Probe Requests; from these responses, the DC can obtain the APs' MAC addresses. Every IEEE compliant AP must respond to such requests and in some chipsets [86], no controls are provided to disable this functionality. Consistent with our framework, we use the Busy AP Optimization (see Section 4.4.3) so that active scans in an AP's vicinity are performed by the AP only when it has no client associated with it.
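The 4-tuple report that each DC sends can be modeled with a small record type, as in the sketch below. The class and function names are hypothetical, and keeping only the strongest RSSI per AP within a reporting interval is an assumed summarization policy that the text does not specify.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class ApReport:
    """One < MAC address, SSID, channel, RSSI > tuple observed by a DC."""
    mac: str
    ssid: str
    channel: int
    rssi_dbm: int


def summarize_beacons(beacons: Iterable[ApReport]) -> List[ApReport]:
    """Collapse repeated beacons into one report per (MAC, SSID, channel)."""
    strongest: Dict[Tuple[str, str, int], ApReport] = {}
    for beacon in beacons:
        key = (beacon.mac.lower(), beacon.ssid, beacon.channel)
        best = strongest.get(key)
        # Assumed policy: keep the strongest reading seen for each AP.
        if best is None or beacon.rssi_dbm > best.rssi_dbm:
            strongest[key] = ApReport(key[0], beacon.ssid, beacon.channel, beacon.rssi_dbm)
    return sorted(strongest.values(), key=lambda r: (r.mac, r.channel, r.ssid))


if __name__ == "__main__":
    heard = [
        ApReport("00:0D:88:AA:BB:01", "CorpNet", 6, -61),
        ApReport("00:0d:88:aa:bb:01", "CorpNet", 6, -48),
        ApReport("00:0D:88:AA:BB:02", "CorpNet", 11, -70),
    ]
    for report in summarize_beacons(heard):
        print(report)
```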

Analysis at the DS

When the DS receives information for an AP from various clients, it uses DIAL to estimate the AP's approximate location based on these clients' locations and the AP's RSSI values from them. The DS classifies an AP as rogue if a 4-tuple does not correspond to a known legal AP in the DS's AP location database, i.e., if the MAC address is not present in the database, or if the AP is not in the expected location, or the SSID does not correspond to the expected SSID(s) in the organization. Note that if an AP's SSID corresponds to an SOS SSID, the DS skips further analysis since this AP actually corresponds to a disconnected client that is executing the Connection Setup phase of the Client Conduit protocol.

The channel information is used in a slightly different way. As stated above, an AP on a certain channel may also be heard on overlapping channels. Thus, an AP is classified as rogue only if it is reported on a channel that does not overlap with the one on which it is expected. Note that if the channel on an AP is changed, the DAP can ask the DS to update its AP location database (recall that the communication between the DAP and the DS is authenticated; if the AP is a legacy AP, the administrator can update the AP location database when the AP's channel is changed). The checks that the DS executes are summarized in Figure 4.4.

A Rogue AP, say R, may try MAC address spoofing to avoid being detected, i.e., send packets using the MAC address of an authorized AP, say G. However, the DS can still detect R as it will reside in a different location or channel than G (if it is on the same channel and location, G would immediately detect it). Our approach also detects a Rogue AP that does not broadcast an SSID in its beacons since a DC can still obtain the AP's MAC address. Of course, we can detect such unauthorized APs in an even simpler manner by disallowing APs that do not broadcast SSIDs in an enterprise LAN.
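The checks described in this subsection can be sketched as a single function. This is only an illustration under stated assumptions: the `RegisteredAP` record, the 15-metre location tolerance, the `SOS_` SSID prefix, and the rule that 802.11b channels fewer than five apart overlap are all assumptions made for the example, not values taken from the prototype.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class RegisteredAP:
    """Hypothetical entry of the AP location database, keyed by MAC address."""
    ssids: frozenset   # SSIDs this AP may advertise
    channel: int       # channel on which the AP is expected
    x: float           # registered position in metres (assumed coordinates)
    y: float


def channels_overlap(a: int, b: int) -> bool:
    # Assumption: 802.11b channels are spaced 5 MHz apart and are about
    # 22 MHz wide, so channels fewer than 5 apart overlap.
    return abs(a - b) < 5


def is_rogue(mac, ssid, channel, est_xy, db, max_dist_m=15.0, sos_prefix="SOS_"):
    """Apply the DS checks to one reported AP; est_xy is the DIAL estimate."""
    if ssid.startswith(sos_prefix):
        return False                       # disconnected client in Connection Setup
    ap = db.get(mac)
    if ap is None:
        return True                        # MAC address is not registered
    if hypot(est_xy[0] - ap.x, est_xy[1] - ap.y) > max_dist_m:
        return True                        # not in the expected location
    if not channels_overlap(channel, ap.channel):
        return True                        # heard on a non-overlapping channel
    return ssid not in ap.ssids            # unexpected SSID


db = {"00:20:a6:4c:c7:85": RegisteredAP(frozenset({"UNIV_LAN"}), 6, 10.0, 10.0)}
print(is_rogue("00:20:a6:4c:c7:85", "UNIV_LAN", 4, (12.0, 11.0), db))
```

Checking the SOS SSID first mirrors the statement that the DS skips further analysis for clients running Client Conduit.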

Figure 4.4: Decision steps taken by the DS to determine if an AP is a Rogue AP or not. (Flowchart checks: Is MAC registered? Is AP in expected location? Is AP on the expected channel? Is AP advertising the expected SSID? A No answer leads to "Rogue AP detected".)

Thus, given the above strategy, an unauthorized AP may stay undetected for a short time by spoofing an existing AP X near X's location, beaconing a valid SSID in the organization, and staying on a channel on which no DC or AP can overhear its beacons. However, when a nearby client performs an active scan, the Rogue AP will be detected; as we show in Section 4.8.5, a DC can easily perform such a scan every 5 minutes.

4.7 Implementation

We now describe the details of our fault diagnosis implementation. We have implemented the basic architecture consisting of the DC, DAP and DS daemons; the authentication and logging mechanisms have not been implemented. We have also implemented

the Client Conduit protocol and the Rogue AP detection mechanism. Support for DIAL and EDEN is currently being added. Our system has been implemented on the Windows operating system with Netgear MA b cards.

On the DS, we simply run a daemon process that accepts information from DAPs. The DS reads the list of legitimate APs from a file; support for reading this information from a database can be easily added.

The code on the DC or DAP consists of a user-level daemon and kernel-level drivers (see Figure 4.5). These pieces are structured such that code is added to the kernel drivers only if the functionality cannot be achieved in the user-level daemon or if the performance penalty is too high.

Figure 4.5: Components on DC and DAP. (User mode: Diagnostics Daemon. Kernel mode: TCP/IP, Native WiFi IM Driver with Diagnostics IM Module, NDIS, Native WiFi Miniport Driver with Diagnostics Miniport Module, Native WiFi NIC.)

Kernel drivers: There are two drivers in our system: a miniport driver and an intermediate driver (IM driver) called the Native WiFi driver [86]. The miniport driver communicates directly with the hardware and provides basic functionalities such as sending/receiving packets, setting channels, etc. It exposes sufficient interfaces such that functions like association, authentication, etc. can be handled in the IM driver. The IM driver supports a number of interfaces (exposed via ioctls) for querying various parameters such as the current channel, transmission level, power management mode, SSID, etc. In addition to allowing the parameters to be set, it allows the user-level code to request active scans, associate with a particular SSID, capture packets, etc. In general, it provides a significant amount of flexibility and control to the user-level code.

Even though some of the required operations were already present in the IM driver, we still had to make significant modifications to expose certain functionalities and to improve the performance of our protocols. The miniport driver was changed to expose certain packet types to the IM driver. In the IM driver, we added the following support:

Capturing packet headers and packets: We allow filters to be set such that only certain packets or packet headers are captured, e.g., filters based on specific MAC addresses, packet types, packet subtypes (such as management and beacon packets), etc.

Storing the RSSI values from received packets: We obtain the RSSI value of every received packet and maintain a table called the NeighborInfo table that keeps track of the RSSI value from each neighbor (indexed on the MAC address). We maintain an exponentially weighted average in which each new value is given a constant weighting factor. The RSSI information is needed for estimating the location of disconnected clients and APs using DIAL.

Keeping track of AP information: In the NeighborInfo table, we keep track of the channels on which packets were heard from a particular MAC address, SSID information (from beacons), and whether the device is an AP or a station. This information needs to be sent to the DAP/DS for Rogue AP detection.

Kernel event support for protocol efficiency: We added an event that is shared between the kernel and user-level code. The kernel triggers this event when an interesting event occurs; this allows some of our protocols to be interrupt-driven rather than polling-based. Currently, the kernel sets this event whenever it hears an SOS beacon from a disconnected client during Client Conduit, thereby resulting in a lower protocol latency (see Section 4.8.2).

We added a number of ioctls to get and clear the information discussed above.

Fault Diagnostic daemon: This daemon gathers information and implements various mechanisms discussed in this chapter, e.g., collecting MAC addresses of APs for Rogue AP detection, performing Client Conduit, etc. If the device is an AP, it communicates diagnostic information with the DS and the DCs; if the device is just a DC, it communicates with its associated AP to convey the diagnostic information.

The Diagnostic daemon on the DC obtains the current NeighborInfo table from the kernel every 30 seconds. If any new node has been discovered or if the existing data has changed significantly (e.g., the RSSI value of a client has changed by more than a factor of 2), it is sent to the DAP. The DAP also maintains a similar table indexed on MAC addresses. However, it only sends information about disconnected clients and APs to the DS; otherwise, the DS would end up getting updates for every client in the system, making it less scalable. The DAP sends new or changed information about APs to the DS periodically (every 30 seconds in our current prototype). Furthermore, if the DAP has any

pending information about a disconnected client D, it informs the DS immediately so that D can be serviced in a timely fashion.

All messages from the DC to the DAP and DAP to the DS are sent as XML messages. A sample message format from the DC is shown below (timestamps have been removed):

    <DiagPacket Type="RSSIInfo" TStamp="...">
      <Clients TStamp="...">
        <MacInfo MAC="00:40:96:27:dd:cc" RSSI="23" Channels="19" SSID="" TStamp="..."/>
      </Clients>
      <Real-APs TStamp="...">
        <MacInfo MAC="00:20:a6:4c:c7:85" RSSI="89" Channels="12" SSID="UNIV_LAN" TStamp="..."/>
        <MacInfo MAC="00:20:a6:4c:bb:ad" RSSI="7" Channels="10" SSID="EXPER" TStamp="..."/>
      </Real-APs>
      <Disconnected-Clients TStamp="...">
        <MacInfo MAC="00:40:96:33:34:3e" RSSI="57" Channels="2048" SSID="SOS_764" TStamp="..."/>
      </Disconnected-Clients>
    </DiagPacket>

As the sample message shows, the DC sends information about other connected clients, APs, and disconnected clients. For each such class of entities, it sends the MAC address of a machine along with RSSI, SSID, and a channel bitmap which indicates the channels on which the particular device was overheard.
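To make the message layout concrete, the following sketch reads the sample message with Python's standard XML parser and expands each channel bitmap into the positions of its set bits. The function names are hypothetical. How a bit position maps to a channel number is not stated in the text, so the sketch reports bit positions only and leaves that mapping open.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<DiagPacket Type="RSSIInfo" TStamp="...">
  <Clients TStamp="...">
    <MacInfo MAC="00:40:96:27:dd:cc" RSSI="23" Channels="19" SSID="" TStamp="..."/>
  </Clients>
  <Real-APs TStamp="...">
    <MacInfo MAC="00:20:a6:4c:c7:85" RSSI="89" Channels="12" SSID="UNIV_LAN" TStamp="..."/>
    <MacInfo MAC="00:20:a6:4c:bb:ad" RSSI="7" Channels="10" SSID="EXPER" TStamp="..."/>
  </Real-APs>
  <Disconnected-Clients TStamp="...">
    <MacInfo MAC="00:40:96:33:34:3e" RSSI="57" Channels="2048" SSID="SOS_764" TStamp="..."/>
  </Disconnected-Clients>
</DiagPacket>"""


def set_bits(bitmap: int) -> list:
    """Return the positions of the bits that are set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


def parse_diag_packet(text: str) -> list:
    """Flatten a DiagPacket into one dict per MacInfo element."""
    entries = []
    for group in ET.fromstring(text):          # Clients, Real-APs, ...
        for info in group.findall("MacInfo"):
            entries.append({
                "class": group.tag,
                "mac": info.get("MAC"),
                "rssi": int(info.get("RSSI")),
                "channel_bits": set_bits(int(info.get("Channels"))),
                "ssid": info.get("SSID"),
            })
    return entries


for entry in parse_diag_packet(SAMPLE):
    print(entry["class"], entry["mac"], entry["channel_bits"])
```

For instance, the bitmap 19 in the sample has bits 0, 1 and 4 set, i.e., that client was overheard on three channels.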

4.8 System Evaluation

We now evaluate our mechanisms and show that they are not only effective but also impose low overheads. For the basic architecture evaluation, Client Conduit, and Rogue AP detection, we use our prototype. To demonstrate the effectiveness of EDEN and DIAL, we use a combination of tools such as AiroPeek [132] and WinDump [134].

Section 4.8.1 presents the timings for individual operations that are used by our protocols. Section 4.8.2 presents the breakdown of the costs involved in the Client Conduit mechanism and shows that it can be used to help disconnected clients in a timely manner. Section 4.8.3 shows the effectiveness of our DIAL technique for locating disconnected clients. In Section 4.8.4, we evaluate the effectiveness of the EDEN technique to isolate performance problems. Section 4.8.5 shows that the scanning requirements of our Rogue AP detection mechanism impose low overheads on client machines. Finally, in Section 4.8.6, we discuss scalability issues with respect to the Client Conduit protocol, DIAL, EDEN, and Rogue AP detection mechanisms.

4.8.1 Cost of Individual Operations

To better understand the cost of various operations involved in our detection and diagnosis mechanisms (e.g., Client Conduit), we ran a series of micro-benchmarks. We believe that these numbers are valuable for other researchers for modeling purposes as well. Table 4.2 shows the results. Note that the cost of changing a machine from AP to Station mode is less than 2 seconds (731 msecs for the actual change, followed by a wait of a few hundred msecs as required by the hardware specifications).

Additionally, we ran some experiments to understand the overheads of placing a card in promiscuous mode. We first ran an experiment with 4 machines, A, B, C, and D to determine if placing a machine in promiscuous mode has any effect on the machine's

incoming/outgoing bandwidth. We set up the machines such that machine A did a TCP transfer to C at full blast and B performed a full blast TCP transfer to D. The experiment was performed three times; in each case, machine C was placed in normal mode first and then in promiscuous mode. We observed that C's throughput was largely unaffected by being in promiscuous mode: C achieved an incoming bandwidth of KB/sec (standard deviation of 63.7 KB/sec) in the normal mode case and a bandwidth of KB/sec (standard deviation of 21.7 KB/sec) in the promiscuous mode case.

Table 4.2: Times for different operations. U means time measured from user-level code; the rest are times taken for the corresponding ioctl to complete. (Columns: Operation, Time (msecs), Std. dev. Rows: Mostly No-op Ioctl (U), RPC-based Ioctl (U), Set channel, Set beacon period, Set AP/STA mode, Active Scan, Set SSID.)

We ran another experiment to determine a machine's CPU utilization when it is placed in promiscuous mode. We ran a full blast TCP transfer between two machines A and B; during this process, we first placed machine M in normal mode and then in promiscuous mode. Figure 4.6 shows the CPU overhead for machine M (a 1 GHz Pentium III machine). Even for such a relatively old machine, the CPU overhead of placing it in promiscuous mode is quite low, mostly staying below 10%; we also observed that none of the packets were dropped.

Thus, these results show that the CPU overheads on a machine due to promiscuous mode are reasonably low.

Figure 4.6: CPU usage in Promiscuous mode (1 GHz machine). (X-axis: Time (secs); Y-axis: CPU Usage; series: Promiscuous mode, Normal mode.)

4.8.2 Client Conduit

To measure the performance of the Client Conduit protocol, we set up an experiment with one AP, one connected client C and a disconnected client D. The connected client is a 1 GHz Pentium III machine and the disconnected machine is an 800 MHz Pentium III machine. Both machines have 512 MB of memory and Netgear MA b cards.

Figure 4.7 shows the total time taken along with a breakdown of the Connection Setup part of the protocol. User time indicates the end-to-end time taken by our user-level implementation whereas Kernel time indicates the time taken by the relevant ioctls for the same functionality. The costs in both cases are similar, thereby justifying our approach of implementing only the essential mechanisms at the kernel level and driving most of the protocol from the user level (for ease of debugging). In the first two bars, the user-level daemon at the connected client shares an event with the kernel, which immediately informs the daemon when a disconnected client's beacon is detected

(see Section 4.7). Thus, the disconnected client needs to wait only a short time before it hears the Probe Request message from the connected client C indicating that C is ready to help (see the Get ACK times). This delay would be much higher if the daemon obtained the disconnected machine information from the kernel periodically instead of being interrupt-driven. The third bar shows the delay breakdown for an implementation where the daemon polls the kernel for this information every 10 seconds (from a disconnected client's perspective, the Get ACK delay is higher).

We now clarify a couple of details about our experiment. First, the initial step of setting the channel and checking for available clients takes approximately 190 msecs. In the worst case, the disconnected client may have to scan all channels and check for connected clients; in that case, this step may take 2-3 seconds. Second, the steps in which we set the AP/Station mode of the machine take approximately 730 msec; however, the hardware specifications require that the operating system wait for a few hundred milliseconds before using the card in the new mode. For robustness, we added a one second delay after such a mode change; the figure includes these delays after each mode change.

From the figure, one can see that the Connection Setup and association time for the disconnected client is quite reasonable: it takes less than 5 seconds to run the setup and another 1.9 seconds to associate with a connected client C in ad-hoc mode so that the MultiNet protocol can be started on C.

After MultiNet starts running on the connected client, the disconnected client can interact with the DS to diagnose its problems, e.g., transfer certificates or log files to the DS. To evaluate the time taken to perform these transfers via MultiNet, we ran an experiment in which a machine D sent files of different sizes (100KB, 500KB and 1MB) to the DS through connected client C.
Figure 4.8 shows the time taken when

the connected client C allows 17-50% of its time to be used for ad hoc mode; client C stays on the infrastructure network for 500 msecs, and the time on the ad-hoc network is varied between 100 and 500 msecs. In our experiment, the time to switch from ad-hoc to infrastructure mode is 500 msecs and from infrastructure to ad hoc mode is 300 msecs. As expected, the results show that the file transfer speed is a direct function of the time a connected client stays in the ad hoc network. We expect that as the switching delay overhead reduces (as in newer cards) the transfer speeds will improve. Thus, our results show that Client Conduit allows a disconnected client's problem to be reported (and even be resolved, e.g., updating expired certificates) in a few seconds.

Figure 4.7: Breakdown of costs for Client Conduit. The protocol steps are executed from the bottom entry in the legend to the topmost, i.e., starting at Set channel. (Y-axis: Time (msec); bars: User time (ms), Kernel time (ms), User time with polling (ms); legend, top to bottom: Adhoc-mode association, Sleep 1 second, Become STA, Get Ack, Set Beacon Period, Set SSID, Sleep 1 second, Become AP, Set channel.)

Figure 4.8: Time taken by a disconnected client to transfer data via MultiNet. (X-axis: % Time of connected node; Y-axis: Time to transfer data (sec); series: 1MB, 500KB, 100KB.)

4.8.3 Location Determination

We now evaluate the accuracy of locating disconnected clients (or Rogue APs) using our DIAL scheme described in Section 4.6.1. Unlike previous work on location determination, the location calculated by DIAL incurs extra error since the locations of the reference points themselves may not be known accurately. We evaluated DIAL using RADAR [17] for locating the disconnected clients from the anchor points due to its simplicity; more sophisticated RSSI-based schemes such as the one suggested in [73] can be used for reducing the errors of DIAL even further.

In our experiment, we placed 3 connected clients in 3 offices on the same floor of our building. We obtained the floor map, and applied the Cohen-Sutherland line-clipping algorithm [48] to compute the number of walls between each of the three connected clients and the other rooms. We placed a disconnected client at 7 different locations while it sent out broadcast packets. We used AiroPeek [132] to measure the RSSI of the disconnected client's packets received at the connected machines. We then applied the equation specified in [17] to compute the wall attenuation factor (WAF). Based on the WAF, we inferred that the disconnected client is in location X if the predicted signal strength at X is closest to the observed signal strength at the three connected clients.

We ran the RADAR algorithm on the collected RSSI data for locating the disconnected client D using the precise location of the connected clients. We computed the error in D's predicted location with respect to its actual location; the No Error bar in Figure 4.9(a) shows this error.

Figure 4.9: Median error in locating disconnected clients. (a) Estimated location of connected client is one room off from its true location. (b) Estimated location of connected client is two rooms off from its true location. The lower and upper bounds of error bars correspond to min and max error. E(i) denotes that the i-th connected client's location contains error. (Y-axis: Median Location Error (metres); bars in each panel: No Error, E(1), E(2), E(1,2), E(3), E(1,3), E(2,3), E(1,2,3).)

Then, we ran the algorithm again by assuming that there was an error in estimating the location of one connected client by a distance of 3.3 meters; this distance corresponds to the average width of a room in our building. For example, if connected client A was placed in room X, we assume its estimated location to be a neighbouring room Y

when using it as an anchor point in RADAR. The second bar in Figure 4.9(a) shows the error when such a situation occurs. The rest of the bars show the error in locating the disconnected client when the location of either one, two or three connected clients is estimated incorrectly by one room; Figure 4.9(b) shows the error when the estimated location is off by a distance equivalent to that of two rooms.

The results show that when there is no error in the known location of the connected clients, the median error is 9.7 meters. This error increases to at most 12 meters when the estimated location of one or more clients is one or two rooms off from its true location. Of course, when the estimated locations of the connected clients are off by two rooms, the maximum error is substantially higher, e.g., 33 meters for the case when the location of all three clients is incorrect. This case occurs when the estimated locations of the connected clients are off in different directions, e.g., client A's location is off towards north and client B's location is off towards south.

Note that the error in the location of the anchor points (i.e., connected clients) can be kept low (less than one room off) by using mechanisms such as Cricket [98] and Active Badges [129] for locating connected clients. With accurate location of anchor points, DIAL's error would be similar to that of the best-known RSSI-based location mechanism. Note that even errors of this magnitude (for our experimental setup using RADAR) are acceptable since the goal of DIAL is to approximately locate disconnected clients or Rogue APs. Thus, based on our results, we can say that DIAL is a practical approach for helping network administrators estimate the approximate location of problematic areas.
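The location inference used in this experiment (pick the candidate location whose predicted signal strengths best match those observed at the three connected clients) can be sketched as below. The propagation formula is a commonly cited form of the wall attenuation factor model and is an assumption here, since the exact equation from [17] is not reproduced in this chapter; all parameter values, coordinates and names are hypothetical.

```python
from math import hypot, log10


def predicted_rssi(p_d0, n, d, d0, n_walls, waf, c):
    """Assumed WAF model: P(d) = P(d0) - 10 n log10(d/d0) - min(n_walls, c) * WAF."""
    d = max(d, d0)                     # avoid log of distances below the reference
    return p_d0 - 10.0 * n * log10(d / d0) - min(n_walls, c) * waf


def locate(observed, anchors, candidates, walls, **model):
    """Return the candidate whose predictions are closest to the observations.

    observed:   signal strength measured at each anchor (connected client)
    anchors:    (x, y) position of each anchor
    candidates: name -> (x, y) of each candidate room
    walls:      (name, anchor index) -> number of walls in between
    """
    best_name, best_err = None, float("inf")
    for name, (x, y) in candidates.items():
        err = 0.0
        for i, (ax, ay) in enumerate(anchors):
            pred = predicted_rssi(d=hypot(x - ax, y - ay),
                                  n_walls=walls[(name, i)], **model)
            err += (pred - observed[i]) ** 2      # squared distance in signal space
        if err < best_err:
            best_name, best_err = name, err
    return best_name


anchors = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
candidates = {"A": (2.0, 2.0), "B": (8.0, 3.0), "C": (3.0, 9.0)}
walls = {(name, i): 1 for name in candidates for i in range(len(anchors))}
model = dict(p_d0=-40.0, n=2.0, d0=1.0, waf=3.0, c=4)
observed = [predicted_rssi(d=hypot(8.0 - ax, 3.0 - ay), n_walls=1, **model)
            for ax, ay in anchors]
print(locate(observed, anchors, candidates, walls, **model))
```

Shifting an entry of `anchors` by a room width before calling `locate` reproduces the kind of anchor-error experiment reported in Figure 4.9.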

4.8.4 Estimating Wireless Delays

In Section 4.6.2, we presented the EDEN scheme that uses nearby clients to measure the delay encountered by a wireless station or an AP. We now show that EDEN can estimate the delay encountered at these endpoints with reasonable accuracy.

The EDEN technique measures the time spent on a client (or an AP) by measuring the times of the Snoop request and response packets at nearby clients. However, this measurement includes the delay at the machine due to medium contention. To understand the extent of this congestion delay, we set up a simple experiment with 4 machines A, B, C and D on the same channel. Machine A performed a full-blast data transfer to machine B, thereby creating traffic congestion in the medium. Then we associated client C with the Native WiFi AP machine D. The Native WiFi AP then sent 20 ping packets to the associated client, which in turn sent ping reply packets. We ran the experiment twice: once with no extra client delays and once with an extra 40 msec added at the client between the ping request and reply. Using a fifth machine running AiroPeek, we observed that EDEN over-estimated the client delay by approximately 3 msec. When examining scenarios where the client or the AP is the bottleneck, such inaccuracies may be acceptable. However, when these entities are not bottlenecks, or when EDEN is examining a scenario with low delays, or when contention is even worse (e.g., the contention delay can even be more than 20 msec in 802.11b), a better estimation is needed; we are currently exploring mechanisms to reduce such inaccuracies.

Next, we ran an experiment to determine EDEN's accuracy in determining delays at an endpoint. In this setup, a client machine was associated with another machine running as an access point; both machines had Netgear MA b cards and the corresponding Native WiFi drivers. We then injected delays in the path of all packets at the client (varying from 30 to 300 msecs).
To emulate the EDEN protocol, the AP sent

20 ping packets to the client; the ping packets and replies emulate the Snoop request and response messages in EDEN. A third machine running AiroPeek was used to snoop on these ping packets; this machine effectively emulates the eavesdropping client in EDEN. The collected AiroPeek data was then analyzed to estimate the delays at the client. Figure 4.10 shows that EDEN is reasonably accurate in estimating the delays at an endpoint: EDEN can estimate client delays with an error less than 5% of the actual introduced delay.

Figure 4.10: EDEN's accuracy of estimating the delay at a client. (X-axis: Delay Introduced at client (msec); Y-axis: Error in Estimated Delay (%).)

Finally, we studied EDEN's effectiveness in classifying delays at the client, AP, and the medium. We used a 3-machine setup similar to the one in the previous experiment; in this case, to estimate delays at the AP, the client also sent ping packets to the AP. To introduce delays in the medium, we increased the distance between the client and the AP. The medium delay increased relative to the case when the AP and client were nearby because there were more retries (the increased distance resulted in an increase in the number of walls between the two machines, thereby weakening the received signal). For better accuracy, we ran these experiments in the night when the wireless traffic was expected to be low (since the corporate LAN is

actively used by employees during the day, we did not want traffic interference to affect our measurements).

Figure 4.11: Breakdown of delay at the client, AP, and the medium as estimated by EDEN. (Y-axis: Roundtrip delay (msec); stacked components: Medium Delay, AP Delay, Client Delay; bars: near, far, 0-0-far.)

Figure 4.11 shows EDEN's breakdown for three different scenarios. The near bar corresponds to the scenario when the AP and client were placed near each other, and we added a 40 msec delay to all packets at both machines. The far scenario is similar except that the client and the AP were placed far from each other. Finally, the 0-0-far case is one in which we did not introduce any delays at the client or the AP but they were placed far from each other.

In the near case, EDEN estimates approximately equal delays for the client and the AP. With an increase in the distance (the far and 0-0-far cases), the medium delays increase and EDEN is able to estimate this change as well. Note that the client and the AP delays increased in the latter two cases by a few milliseconds because the wireless cards transmitted the packets at a lower transmission rate (1 Mbps) in order to decrease the error rate. These results show that EDEN is an effective mechanism for obtaining a delay breakdown in a wireless setting.
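The endpoint-delay measurement emulated in these experiments can be sketched as follows: an eavesdropper timestamps each request and the matching response, and the difference is taken as the delay at the endpoint. This is a simplified illustration; the function names and the timestamps are invented, and, as noted above, the difference also includes any medium contention delay.

```python
from statistics import mean


def endpoint_delay_ms(sniffed):
    """Mean delay at the endpoint from (request_ms, response_ms) pairs.

    Each pair holds the times at which the eavesdropper overheard a
    request addressed to the endpoint and the endpoint's response.
    """
    return mean(resp - req for req, resp in sniffed)


def percent_error(estimated_ms, injected_ms):
    """Estimation error relative to the delay actually injected."""
    return 100.0 * abs(estimated_ms - injected_ms) / injected_ms


# Invented trace: a 40 msec delay was injected, and contention adds a
# few msec to what the eavesdropper observes.
trace = [(0.0, 43.0), (100.0, 142.0), (200.0, 244.0)]
estimate = endpoint_delay_ms(trace)
print(estimate, percent_error(estimate, 40.0))
```

The relative error shrinks as the injected delay grows, since the contention component stays roughly constant while the denominator increases.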

4.8.5 Rogue AP Detection

In this section, we explore two issues related to Rogue AP detection. We first show that overlapping channels help in quicker detection of Rogue APs that are hiding on channels where no AP or client is present. We then show that even if Rogue APs are not overheard on overlapping channels, there is ample opportunity for clients to perform active scanning without hurting their performance. To check the effectiveness of our implementation, we ran our Rogue AP detection mechanism on our building floor and were able to detect all known Rogue APs (these were experimental APs being used by our colleagues).

Overlapping Channels

It is known that overlapping channels in IEEE 802.11 not only interfere with one another, but it is sometimes possible for a NIC on one channel to decode packets from another overlapping channel. This characteristic is helpful in detecting Rogue APs: if a client is present on a channel that overlaps with a Rogue AP's channel, it will detect the AP's presence if it is able to hear the AP's beacons. To verify the extent of this overlap, we performed an experiment in which an AP was placed on channel 1 and a nearby client checked for the AP's beacons on all 11 channels. We repeated this experiment by placing the AP on each of channels 2 to 11 and documented where it could be heard. In one run, the client lingered on each channel for 1 second and in the second run, it stayed for 5 seconds.

Figure 4.12 plots the channels on which the AP is heard (Y-axis) when it is placed on a specific channel (X-axis). Clearly, the overlap across various channels is non-negligible and is helpful for detection of Rogue APs. Furthermore, given sufficient time (see the 5-second run), there is an even higher likelihood that some packet from a Rogue AP leaks through to a monitoring DC.

Figure 4.12: Overlapping channels on which an AP is overheard. (X-axis: Channel on which AP beacons; Y-axis: Channel on which beacons are decoded correctly; markers: Detected in 1 and 5 sec runs, Detected only in 5 sec run.)

In the above experiments, the AP and the client were placed 5 feet apart with one obstacle between them. We wanted to study the change in leakage across overlapping channels on increasing the distance between the AP and the client. For this, we placed an AP machine at 10 different locations on our floor in various rooms and repeated the above experiment. Figure 4.13 shows that as the distance between the AP and the monitoring client increases, the AP is heard on fewer channels (the decrease is not monotonic due to obstructions).

The above results show that even though one cannot rely on overlap as a guaranteed mechanism for detecting Rogue APs, it does reduce the need to perform frequent active scans. This observation also implies that there are more opportunities for detecting Rogue APs: for a Rogue AP to go undetected, it must be far away from any client that is on an overlapping channel.

Availability of Idle Times for Active Scans

As shown in Section 4.8.2, active scans can take up to 2 seconds. Our current implementation performs an active scan every 5 minutes; we refer to this period as the Active Scan Period. Even though 2 seconds out of 300 seconds is a small fraction of the time, it

is important for clients to perform these scans at appropriate times; otherwise, network traffic on a client may get disrupted: packets sent to this client may be dropped, TCP may time out, etc.

Figure 4.13: Overlapping channels heard relative to distance. (X-axis: Distance (in feet); Y-axis: Num Channels Sensing AP Beacons; one series per AP channel.)

Figure 4.14: The maximum idle time duration available during every 5-minute period at different times of the day. (X-axis: Time of day (hours); Y-axis: Max idle period (seconds) every 5 minutes.)

Ideally, these scans should be done when the node is idle and has no ongoing network transfers. To determine whether such idle times exist in current usage, we used

Ethereal [45] to obtain traces from 3 desktop machines of our colleagues over multiple days. Note that even though these traces are from desktops attached to wired networks, they still give us a reasonable estimate of network traffic generated by users; as users start using laptops as their primary machines, it is likely that the network and idle time behavior will be similar to that of desktop clients. We divided the traces into 5-minute periods (the Active Scan Period) and for each period, we determined the maximum period of time for which the network was idle.

Figure 4.14 presents the maximum idle period in every 5-minute interval during a 24-hour period. Each point in the graph (e.g., for 12:00 pm to 12:05 pm) is obtained by averaging the maximum idle time value across multiple days and multiple machines for the same 5-minute period. The figure shows that there are large chunks of idle periods available for performing active scans: the smallest idle period available in a 5-minute interval was 118 seconds and typically, idle periods of more than 2.5 to 3 minutes were easily available. Thus, a large window of opportunity is available to our rogue detection scheme for performing active scans every 5 minutes.

Given the availability of such opportunities, one can use any heuristic to predict idle times for launching an active scan (which takes 2 seconds). We studied the effectiveness of a simple history-based heuristic: if the network has been idle for X seconds, it predicts that the network will be idle for the next 2 seconds. Thus, after every 5 minutes, the Rogue AP detection module can perform an active scan whenever it observes that the network interface has been idle for X seconds. We evaluated the effectiveness of this heuristic over our 3-machine traces with two different values of X: 5 and 10 seconds. With both values of X, we observed that the active scan would complete within the idle period for more than 95% of the cases.
The effectiveness of this heuristic shows that wireless clients can perform active scans for Rogue AP detection without hurting performance.
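The history-based heuristic can be evaluated over a packet trace with a few lines of code. The sketch below is a simplification made for illustration: it treats every gap of at least X seconds as a point where a scan would be launched, rather than limiting scans to one per 5-minute Active Scan Period, and the function name and trace values are invented.

```python
def evaluate_heuristic(packet_times, x, scan=2.0):
    """Return (scans launched, scans that fit inside the idle gap).

    packet_times: sorted packet timestamps in seconds
    x:            idle time (seconds) after which a scan is launched
    scan:         duration of an active scan in seconds
    """
    launched = completed = 0
    for prev, nxt in zip(packet_times, packet_times[1:]):
        gap = nxt - prev
        if gap >= x:                  # idle for x seconds: heuristic fires
            launched += 1
            if gap >= x + scan:       # the next packet arrives after the scan ends
                completed += 1
    return launched, completed


trace = [0.0, 1.0, 12.0, 13.0, 19.5, 40.0]     # invented timestamps
for x in (5.0, 10.0):
    launched, completed = evaluate_heuristic(trace, x)
    print(x, launched, completed)
```

The ratio of completed to launched scans corresponds to the success rate quoted above; a larger X launches fewer scans but waits longer before each one.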

4.8.6 Scalability Analysis

As discussed in Section 4.4.3, our architecture is designed to scale with the number of access points and clients in the system. We now discuss why our proactive and reactive techniques maintain the scalability property. We also argue why our reactive mechanisms impose low network overhead even if a number of clients are experiencing wireless problems in an area.

As discussed in Section 4.7, each DC proactively sends the RSSI, SSID, and MAC address information about nearby devices to the DAP every 30 seconds; this information is necessary for Rogue AP detection. The DAP filters this data and sends information about APs to the DS every 30 seconds. To understand the network bandwidth consumed on the wireless link, we set up an experiment with a single DC, DAP and DS for 4 hours. We observed that the bandwidth consumption by the DC was less than 0.2 Kbps and the DAP's bandwidth requirements were less than 0.01 Kbps. This result implies that even if a large number of clients were present, the bandwidth usage would still be low, e.g., under 20 Kbps in total for 100 DCs. Thus, for proactive monitoring, our techniques have negligible bandwidth requirements.

We now analyze the bandwidth overheads of our reactive diagnosis mechanisms, i.e., Client Conduit and EDEN; we do not discuss DIAL's overheads since DIAL's beaconing messages are part of Client Conduit and the overheads of sending the RSSI information to the DAP have already been discussed above. The bandwidth requirements of EDEN and the Connection Setup part (beacons and probe messages) of Client Conduit are low since these protocols send small broadcast or beacon packets at a low frequency, e.g., every 100 msecs in Client Conduit and every 2 seconds in EDEN. The bandwidth consumption while using MultiNet can also be controlled: as stated in Section 4.5.2, the connected client can limit the amount of

169 155 bandwidth that it allocates to the disconnected client. Thus, if a single client needs help, our reactive mechanisms impose little overhead. We now analyze the overheads when a large number of clients (say 50) in an area have wireless faults and are utilizing our reactive mechanisms to diagnose their problems. Our basic idea for ensuring that the performance of the network does not deteoriate is to rate-limit our mechanisms; we have not implemented these protocol extensions in our current prototype. In Client Conduit, when a disconnected client overhears the beacons on N disconnected clients, instead of choosing a fixed beacon period of 100 msec, it sends out a beacon every K msecs where K is a random number between 0 and 100*N msecs. This self-regulation ensures that the network is not swamped out by Client Conduit beacons if a sudden loss of coverage occurs in an area. A similar selfregulatory mechanism is used to limit the rate at which the initial broadcast packets are sent in EDEN. Furthermore, to limit the overheads on a connected client C (and possibly reduce the reactive scheme s load on the DAP and DS), we can use a policy such that C helps only one client at any given point. Thus, with these policy decisions, we can ensure that Client Conduit and EDEN impose low bandwidth overheads even when a large number of clients are experiencing problems. 4.9 Future Work There are a number of additional problems in wireless fault diagnosis that require further research. We plan to pursue these in the near future. We presented a technique for detecting Rogue APs in a deployment. A related problem is to detect Rogue Ad-hoc Networks. Such networks are created when a user connected to the corporate network (e.g., via a wired network) sets up an IEEE ad-hoc network with an unauthenticated client. Thus, like the

170 156 Rogue AP scenario, such a network can compromise the security of the corporate network. The problem of performing root-cause analysis on client authentication problems was not discussed in this chapter. For example, the system could analyze the IEEE 802.1x protocol messages to determine the point at which authentication failed. In Section 4.6.1, we show how the location of disconnected clients can be determined when a few connected clients are present nearby. The question remains, what should be done when there are no connected clients in the neighborhood. One approach may be to have the client log its last known location where connectivity was available. Using heuristics, such as movement trajectory, it might be possible to determine the approximate location of the dead spot. The next logical step after diagnosis is recovery. Once a fault has been detected, one needs to determine what automatic steps should the system take to resolve the situation without necessarily involving a network administrator Summary The rising popularity of IEEE networks has made fault detection and diagnosis an important problem for IT managers responsible for maintaining these networks. Interestingly, the wireless research community has overlooked these problems, perhaps because maintenance issues surface only after large deployments are in place, which is a relatively recent phenomenon. In this chapter, we presented novel solutions for detecting a variety of faults and proposed approaches for analyzing performance problems experienced by end-users. Our initial results show that our mechanisms of locating RF holes, detecting Rogue

171 157 APs, and diagnosing performance problems are effective and impose low overheads. Furthermore, we show that a novel mechanism called Client Conduit can be used for assisting disconnected clients in real-time. These techniques in conjunction with our general architecture that uses clients, APs, and backend servers together for diagnosing wireless networks make our system unique and practical. The general problem space of effective network management for IEEE networks is large. Our fault diagnosis architecture is a first attempt at addressing some of the critical problems identified to us by network administrators managing a large deployment. It is our hope that this work will stimulate other researchers to investigate such problems further and propose solutions that will eventually result in the smooth operation of IEEE networks. The contents of this chapter were developed in joint work with Atul Adya, Victor Bahl and Lili Qiu. The idea to work on this problem was conceived by Victor Bahl. He also helped define the problem space. I designed the fault diagnosis architecture, described in Section 4.4, the Client Conduit Protocol and the Rogue AP algorithm, along with Atul Adya. I implemented the Client Conduit protocol, and Atul implemented the Rogue AP algorithm of Section I also designed the performance isolation and location determination algorithms, presented in Sections and 4.6.2, in joint work with Lili Qiu.
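
The beacon rate-limiting extension proposed in the Scalability Analysis of this chapter (which was not implemented in the prototype) can be illustrated with a minimal Python sketch. The text specifies only that K is a random number between 0 and 100*N msecs. The function names, the constant, the uniform distribution for K, the fallback to the fixed 100 msec period when no other disconnected client is overheard (N = 0), and the assumption in the rate estimate that every disconnected client overhears all the others are all hypothetical choices made for illustration.

```python
import random

# Fixed Client Conduit beacon period stated in the text (msec).
BASE_BEACON_PERIOD_MS = 100.0


def next_beacon_delay_ms(n_overheard, rng=random):
    """Delay before the next beacon when n_overheard other disconnected
    clients have been heard: K drawn from [0, 100 * N] msec."""
    if n_overheard < 0:
        raise ValueError("n_overheard must be non-negative")
    if n_overheard == 0:
        # Assumption: with no competing beacons, keep the fixed period.
        return BASE_BEACON_PERIOD_MS
    # Assumption: K is uniformly distributed over the stated range.
    return rng.uniform(0.0, BASE_BEACON_PERIOD_MS * n_overheard)


def mean_beacon_delay_ms(n_overheard):
    """Mean of the delay above (half the range for a uniform draw)."""
    if n_overheard == 0:
        return BASE_BEACON_PERIOD_MS
    return BASE_BEACON_PERIOD_MS * n_overheard / 2.0


def aggregate_beacons_per_sec(num_clients, rate_limited=True):
    """Long-run total beacon rate of num_clients disconnected clients,
    assuming each one overhears all of the others."""
    if num_clients <= 0:
        return 0.0
    n_overheard = num_clients - 1 if rate_limited else 0
    return num_clients * 1000.0 / mean_beacon_delay_ms(n_overheard)


if __name__ == "__main__":
    for clients in (1, 10, 50):
        fixed = aggregate_beacons_per_sec(clients, rate_limited=False)
        limited = aggregate_beacons_per_sec(clients)
        print(clients, round(fixed, 1), round(limited, 1))
```

Under these assumptions the mean interval is 50*N msecs, so 50 disconnected clients together send roughly 20 beacons per second, compared with 500 per second if each kept the fixed 100 msec period. The aggregate rate therefore stays roughly constant as more clients lose coverage, which is the self-regulation the text describes.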

CHAPTER 5 CONCLUSION

To the best of our knowledge, this dissertation is the first to look at the problem of simultaneously connecting to multiple wireless networks with a single wireless card. The MultiNet solution is a new virtualization architecture for wireless network cards. It has been implemented on Windows XP, and is distributed by Microsoft Research as part of its Mesh Networking Academic Resource Toolkit [104].

In addition to describing MultiNet, this dissertation also presents two of its applications: SSCH and Client Conduit. SSCH is a channel hopping protocol for increasing the capacity of wireless ad hoc networks. SSCH can be implemented in the link layer of the network stack and works over the IEEE 802.11 standard. To our knowledge, it is the first multi-channel protocol that works over a single wireless card without requiring a dedicated control channel. We show that SSCH significantly increases the capacity of wireless ad hoc networks.

Client Conduit is a key component of our fault diagnosis architecture. It uses MultiNet to provide a thin pipe of communication to disconnected wireless clients by using the bandwidth of connected machines. Client Conduit has been implemented on Windows XP and is shown to be lightweight and secure.

In addition to SSCH and Client Conduit, MultiNet enables the design of a whole new class of applications. System designers are no longer constrained by the number of wireless cards they can fit into a system. They are free to design systems and applications that can connect to many wireless networks at the same time. We believe that MultiNet is the first system to relax this physical constraint.

Through its constructions, this dissertation contributes towards solving some of the key problems in existing wireless networks, in particular power, capacity and manageability. MultiNet saves battery power by not requiring multiple wireless cards to stay connected on multiple wireless networks. SSCH improves the capacity of wireless ad hoc networks by distributing interfering flows on orthogonal frequency channels. Finally, this dissertation presents a new client-centric fault diagnosis architecture for infrastructure wireless networks.

REFERENCES

[1] NLANR/DAST: Iperf TCP/UDP Bandwidth Measurement Tool.
[2] B. Aboba and D. Simon. PPP EAP TLS Authentication Protocol. In RFC 2716, October
[3] A. Adya, P. Bahl, R. Chandra, and L. Qiu. Architecture and Techniques for Diagnosing Faults in IEEE Infrastructure Networks. In Proc. of ACM MobiCom, Philadelphia, PA, September
[4] Atul Adya, Paramvir Bahl, Jitendra Padhye, Alec Wolman, and Lidong Zhou. A Multi-Radio Unification Protocol for IEEE Wireless Networks. Technical Report MSR-TR , Microsoft Research, July
[5] AirDefense. Wireless LAN Security.
[6] AirMagnet. AirMagnet Distributed System.
[7] AirWave. AirWave Management Platform.
[8] A. Akella, G. Judd, S. Seshan, and P. Steenkiste. Self Management in Chaotic Wireless Deployments. In ACM MobiCom 2005, August
[9] M. Alicherry, R. Bhatia, and L. Li. Joint Channel Assignment and Routing for Throughput Optimization in Multi-radio Wireless Mesh Networks. In MobiCom, August
[10] M. Allman, W. Eddy, and S. Ostermann. Estimating Loss Rates With TCP. In ACM Perf. Evaluation Review 31(3), Dec
[11] AMD. Advanced Micro Devices.
[12] T. M. Apostol. Introduction to Analytic Number Theory. Springer-Verlag, NY,
[13] ATA. Flash Memory Cards.
[14] Atheros Communications.
[15] B. Awerbuch, D. Holmer, and H. Rubens. Provably Secure Competitive Routing against Proactive Byzantine Adversaries via Reinforcement Learning. In JHU Tech Report Version 1, May
[16] P. Bahl, R. Chandra, and J. Dunagan. SSCH: Slotted Seeded Channel Hopping for Capacity Improvement in IEEE Ad-Hoc Wireless Networks. In Proc. of ACM MobiCom, Philadelphia, PA, September

[17] P. Bahl and V. N. Padmanabhan. RADAR: An Inbuilding RF-based User Location and Tracking System. In Proc. of IEEE INFOCOM, Tel-Aviv, Israel, March
[18] P. Barford and M. Crovella. Generating Representative Web Workloads for Network and Server Performance Evaluation. In ACM SIGMETRICS 1998, pages , July
[19] P. Barford and M. Crovella. Critical Path Analysis of TCP Transactions. In Proc. of ACM SIGCOMM, Stockholm, Sweden, Aug
[20] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the Art of Virtualization. In Proc. of ACM SOSP, Bolton Landing, NY, October
[21] BAWUG. Bay Area Wireless Users Group.
[22] J. Bellardo and S. Savage. Measuring Packet Reordering. In Proc. of ACM Internet Measurement Workshop, Marseille, France, Nov
[23] A. Bensoussan, C. T. Clingen, and R. C. Daley. The Multics Virtual Memory. In Proc. of ACM SOSP, Princeton, NJ, October
[24] D. Berry and G. Breeze. Microsoft IT division. Private Communication,
[25] Bluetooth SIG. Location Working Group.
[26] J. Broch, D. A. Maltz, and D. B. Johnson. Supporting Hierarchy and Heterogeneous Interfaces in Multi-Hop Wireless Ad Hoc Networks. In Workshop on Mobile Computing held in conjunction with the International Symposium on Parallel Architectures, June
[27] S. Buchegger and J. Le Boudec. The Effect of Rumor Spreading in Reputation Systems for Mobile Ad-Hoc Networks. In Proc. of WiOpt, France, March
[28] E. Bugnion, S. Devine, and M. Rosenblum. Disco: Running Commodity Operating Systems on Scalable Multiprocessors. In Sixteenth ACM Symposium on Operating System Principles, October
[29] P. Buonadonna, A. Geweke, and D. Culler. An Implementation and Analysis of the Virtual Interface Architecture. In Proc. of SC, November
[30] R. Chandra, P. Bahl, and P. Bahl. MultiNet: Connecting to Multiple IEEE Networks Using a Single Wireless Card. In Proc. of IEEE INFOCOM, Hong Kong, Mar
[31] I. Chlamtac and A. Farago. Making Transmission Schedules Immune to Topology Changes in Multi-Hop Packet Radio Networks. IEEE/ACM Transactions on Networking, 2(1):23-29, February 1994.

[32] I. Chlamtac and A. Farago. Time-Spread Multiple-Access (TSMA) Protocols for Multihop Mobile Radio Networks. IEEE/ACM Transactions on Networking, 5(6): , December
[33] I. Chlamtac, C. Petrioli, and J. Redi. Energy-Conserving Access Protocols for Identification Networks. IEEE/ACM Transactions on Networking, 7(1):51-61, February
[34] R. R. Choudhury, X. Yang, R. Ramanathan, and N. H. Vaidya. Using Directional Antennas for Medium Access Control in Ad Hoc Networks. In Proc. of ACM MobiCom, September
[35] Cisco. Cisco Aironet 350 series. ao350ap.
[36] Cisco. CiscoWorks Wireless LAN Solution Engine.
[37] Executive Committee. Wireless Philadelphia.
[38] Intel Compaq and Microsoft Corporations. Virtual Interface Specification. Version 1.0. December
[39] Computer Associates. Unicenter Solutions: Enabling a Successful Wireless Enterprise.
[40] D. De Couto, D. Aguayo, J. Bicket, and R. Morris. A High-Throughput Path Metric for Multi-Hop Wireless Routing. In ACM MobiCom 2003, September
[41] P. J. Denning. Virtual Memory. In ACM Computing Surveys, volume 2, pages , September
[42] T. ElBatt and B. Ryu. On the Channel Reservation Schemes for Ad-hoc Networks: Utilizing Directional Antennas. In IEEE International Symposium on Wireless Personal Multimedia Communications, October
[43] J. Elson, L. Girod, and D. Estrin. Fine-Grained Network Time Synchronization using Reference Broadcast. In Operating Systems Design and Implementation (OSDI 2002), December
[44] Engim. Intelligent, Wideband WLAN Chipsets with ASAP Functionality.
[45] Ethereal. A network protocol analyzer.
[46] F. Fitzek, D. Angelini, G. Mazzini, and M. Zorzi. Design and performance of an enhanced IEEE MAC protocol for multihop coverage extension. IEEE Wireless Communications, 10(6):30-39, December 2003.

[47] S. Floyd, M. Handley, J. Padhye, and J. Widmer. Equation-Based Congestion Control for Unicast Applications. In Proc. of ACM SIGCOMM, Stockholm, Sweden, Aug
[48] J. D. Foley, A. van Dam, S. K. Feiner, and J. F. Hughes. Computer Graphics Principles and Practice (2nd Edition). Addison Wesley,
[49] T. Garfinkel, B. Pfaff, J. Chow, M. Rosenblum, and D. Boneh. Terra: A Virtual Machine-Based Platform for Trusted Computing. In Proc. of ACM SOSP, Bolton Landing, NY, October
[50] Motorola Government and Enterprise. Motorola's Mobile Mesh Networks Technology.
[51] F. Herzel, G. Fischer, and H. Gustat. An Integrated CMOS RF Synthesizer for a Wireless LAN. IEEE Journal of Solid-state Circuits, Vol. 38, No. 10, October
[52] M. Heusse, F. Rousseau, G. Berger-Sabbatel, and A. Duda. Performance Anomaly of b. In IEEE INFOCOM,
[53] Hung-Yun Hsieh, Kyu-Han Kim, Yujie Zhu, and Raghupathy Sivakumar. A receiver-centric transport protocol for mobile hosts with heterogeneous wireless interfaces. In Proceedings of the 9th annual international conference on Mobile computing and networking, pages ACM Press,
[54] L. Huang and T. Lai. On the scalability of IEEE ad hoc networks. In Proceedings of the 3rd ACM international symposium on Mobile Ad Hoc Networking & Computing, MobiHoc, pages ACM Press,
[55] L. Huang, G. Peng, and T. Chiueh. Multi-Dimensional Storage Virtualization. In Proc. of ACM SIGMETRICS, New York, June
[56] IBM. Tivoli Software.
[57] IEEE. IEEE 802.1x-2001 IEEE Standards for Local and Metropolitan Area Networks: Port-Based Network Access Control,
[58] IEEE Computer Society. Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications. IEEE Standard ,
[59] IEEE802.11a. Wireless LAN Medium Access Control (MAC) and Physical (PHY) Layer Specification: High Speed Physical Layer Extensions in the 5 GHz Band
[60] IEEE802.11b/D3.0. Wireless LAN Medium Access Control (MAC) and Physical (PHY) Layer Specification: High Speed Physical Layer Extensions in the 2.4 GHz Band

[61] Crossbow Technology Inc. Motes, Smart Dust Sensors, Wireless Sensor Networks. Sensor Networks.htm.
[62] Scalable Networks Inc. The Qualnet Simulator. m/.
[63] Intel. WiMAX - Broadband Wireless Access Technology. netcomms/technologies/wimax/.
[64] InterEpoch Technology Inc. IWE1100-T Series. products/iwe1100t.asp.
[65] K. Jain, J. Padhye, V. Padmanabhan, and L. Qiu. Impact of Interference on Multihop Wireless Network Performance. In ACM MobiCom 2003, September
[66] N. Jain and S. R. Das. A Multichannel CSMA MAC Protocol with Receiver-Based Channel Selection for Multihop Wireless Networks. In International Conference on Computer Communications and Networks (IC3N), October
[67] Jinyang Li, Charles Blake, Douglas S. J. De Couto, Hu Imm Lee, and Robert Morris. Capacity of Ad Hoc wireless networks. In Mobile Computing and Networking, pages 61-69,
[68] D. Johnson, D. Maltz, and J. Broch. DSR: The Dynamic Source Routing Protocol for Multihop Wireless Ad Hoc Networks. In C.E. Perkins, editor, Ad Hoc Networking, chapter 5, pages Addison-Wesley,
[69] E. Jung and N. Vaidya. An Energy Efficient MAC Protocol for Wireless LANs. In IEEE INFOCOM 2002, June
[70] R. Karrer, A. Sabharwal, and E. Knightly. Enabling Large-scale Wireless Broadband: The Case for TAPs. HotNets
[71] R. Krashinsky and H. Balakrishnan. Minimizing Energy for Wireless Web Access with Bounded Slowdown. In ACM MobiCom 2002, pages , September
[72] R. Kravets and R. Krishnan. Power Management Techniques for Mobile Communications. In ACM MobiCom 1998, October
[73] A. Ladd, K. Bekris, A. Rudys, G. Marceau, L. Kavraki, and D. Wallach. Robotics-Based Location Sensing using Wireless Ethernet. In Proc. of ACM MobiCom, Atlanta, GA, Sept
[74] T. H. Lai and D. Zhou. Efficient and Scalable IEEE Ad-Hoc Mode Timing Synchronization Function. In Proc. of International Conference on Advanced Information Networking and Applications, March 2003.

[75] L. Lamport. Time, Clocks and the Ordering of Events in Distributed Systems. In Communications of the ACM, volume 21, pages ,
[76] L. Lamport, R. Shostak, and M. Pease. The Byzantine Generals Problem. ACM TOPLAS, 4(3): , July
[77] C. Law, A. K. Mehta, and K. Siu. A New Bluetooth Scatternet Formation Protocol. To appear in ACM Mobile Networks and Applications Journal,
[78] C. Law and K. Siu. A Bluetooth Scatternet Formation Algorithm. In IEEE Symposium on Ad Hoc Wireless Networks 2001, November
[79] J. Li, Z. J. Haas, M. Sheng, and Y. Chen. Performance Evaluation of Modified IEEE MAC for Multi-Channel Multi-Hop Ad Hoc Network. In International Conference on Advanced Information Networking and Applications (AINA),
[80] Y. Li, H. Wu, D. Perkins, N. Tzeng, and M. Bayoumi. MAC-SCC: Medium Access Control with a Separate Control Channel for Multihop Wireless Networks. In 23rd International Conference on Distributed Computing Systems Workshops (ICDCSW),
[81] C. R. Lumb, A. Merchant, and G. A. Alvarez. Facade: Virtual Storage Devices with Performance Guarantees. In Proc. of USENIX FAST, San Francisco, April
[82] R. Mahajan, N. Spring, D. Wetherall, and T. Anderson. User-level Internet Path Diagnosis. In Proc. of ACM SOSP, Bolton Landing, NY, October
[83] S. Marti, T. Giuli, K. Lai, and M. Baker. Mitigating Routing Misbehavior in Mobile Ad Hoc Networks. In Proc. of ACM MobiCom, Boston, MA, August
[84] Maxim. Maxim 2.4GHz b Zero-IF Transceivers.
[85] Maxim. Tracking Advances in VCO Technology. an/an1768.pdf.
[86] Microsoft Corp. Native Framework for IEEE Networks. w.microsoft.com.
[87] A. Miu, H. Balakrishnan, and C. E. Koksal. Achieving Loss Resiliency through Multi-Radio Diversity in Wireless Networks. In ACM MobiCom 2005, August
[88] Expert Monitoring. WiSNet Wireless Sensor Networks. om/products.html.

[89] A. Nasipuri and S. R. Das. Multichannel CSMA with Signal Power-Based Channel Selection for Multihop Wireless Networks. In IEEE Vehicular Technology Conference (VTC), September
[90] B. Neuman and T. Tso. An Authentication Service for Computer Networks. In IEEE Communications, Karlsruhe, Germany, Sept
[91] S. Ni, Y. Tseng, Y. Chen, and J. Sheu. The Broadcast Storm Problem in a Mobile Ad Hoc Network. In ACM MobiCom, August
[92] L. Nord and J. Haartsen. The Bluetooth Radio Specification and The Bluetooth Baseband Specification. Bluetooth,
[93] Nortel. Wireless Mesh Network Solution.
[94] HP Openview. Management Solutions for Your Adaptive Enterprise. managementsoftware.hp.com/.
[95] J. Padhye, R. Draves, and B. Zill. Routing in Multi-radio, Multi-hop Wireless Mesh Networks. In ACM MobiCom,
[96] J. Padhye, V. Firoiu, D. Towsley, and J. Kurose. Modeling TCP Throughput: a Simple Model and its Empirical Validation. In Proc. of ACM SIGCOMM, Vancouver, BC, September
[97] C. Perkins, E. Belding-Royer, and S. Das. Ad hoc On-Demand Distance Vector (AODV) Routing. In IETF RFC 3561, July
[98] N. B. Priyantha, A. Chakraborty, and H. Balakrishnan. The Cricket Location-Support System. In Proc. of ACM MobiCom, Boston, MA, August
[99] L. Qiu, P. Bahl, A. Rao, and L. Zhou. Fault Detection, Isolation, and Diagnosis in Multihop Wireless Networks. Technical Report MSR-TR , Microsoft Research, Redmond, WA, Dec
[100] I. Ramani and S. Savage. SyncScan: Practical Fast Handoff for Infrastructure Networks. In Proc. of IEEE Infocom, Miami, FL, March
[101] M. Raya, J. P. Hubaux, and I. Aad. DOMINO: A System to Detect Greedy Behavior in IEEE Hotspots. In Proc. of MobiSys, Boston, MA, June
[102] Realtek. Rtl8185l.
[103] IBM Security Research. Wireless Security Auditor (WSA).
[104] Microsoft Research. Mesh Networking Academic Resource Toolkit. ch.microsoft.com/netres/software.aspx.

[105] C. Rigney, A. Rubens, W. Simpson, and S. Willens. Remote Authentication Dial In User Service (RADIUS). In RFC 2138, IETF, April
[106] J. Robinson, K. Papagiannaki, C. Diot, X. Guo, and L. Krishnamurthy. Experimenting with a Multi-Radio Mesh Networking Testbed. In Workshop on Wireless Network Measurements, April
[107] RoofNet. MIT RoofNet.
[108] R. Rozovsky and P. Kumar. SEEDEX: A MAC Protocol for Ad Hoc Networks. In ACM MobiHoc,
[109] SeattleWireless. Seattle Wireless.
[110] J. Sheu, C. Chao, and C. Sun. A Clock Synchronization Algorithm for Multi-Hop Wireless Ad Hoc Networks. In Proc. of IEEE International Conference on Distributed Computing Systems, ICDCS, Tokyo, March
[111] E. Shih, P. Bahl, and M. Sinclair. Wake On Wireless: An Event Driven Energy Saving Strategy for Battery Operated Devices. In MOBICOM, September
[112] E. Shih, P. Bahl, and M. Sinclair. Wake on Wireless: An event driven power saving strategy for battery operated devices. In ACM MobiCom 2002, September
[113] M. Shin, A. Mishra, and W. Arbaugh. Improving the Latency of Handoffs Using Neighbor Graphs. In Proc. of MobiSys, Boston, MA, June
[114] J. So and N. H. Vaidya. A Multi-channel MAC Protocol for Ad Hoc Wireless Networks. In UIUC Technical Report, also accepted to MobiHoc 2004, January
[115] Sputnik. Sputnik Managed Wi-Fi Networks.
[116] T. K. Srikanth and S. Toueg. Optimal Clock Synchronization. Journal of the ACM, 34(3): , July
[117] R. Stevens. TCP/IP Illustrated (Vol. 1): The Protocols. Addison Wesley,
[118] R. Stine. FYI on a Network Management Tool: Catalog Tools for Monitoring and Debugging TCP/IP Internets and Interconnected Devices. In IETF RFC 1147, April
[119] Strix Systems. Networks without Wires.
[120] J. Sugerman, G. Venkitachalam, and B. Lim. Virtualizing I/O devices on VMware Workstation's hosted virtual machine monitor. In Annual Usenix Technical Conference, June 2001.

[121] SuperPass. Wireless LAN PCI card for 2.4 GHz. PCI-01.html.
[122] Symbol. Spectrum AccessPoint. less/ap4131.html.
[123] Symbol Technologies Inc. SpectrumSoft: Wireless Network Management System.
[124] A. Tzamaloukas and J. J. Garcia-Luna-Aceves. Channel-hopping multiple access. In IEEE International Communications Conference (ICC),
[125] P. Verissimo and L. Rodrigues. A Posteriori Agreement for Fault Tolerant Clock Synchronization on Broadcast Networks. In Proc. of International Symposium on Fault-Tolerant Computing (FTCS), page 85, July
[126] VMware. Enterprise-Class Virtualization Software.
[127] VoIP. Voice Over Internet Protocol.
[128] T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. In Proc. of ACM SOSP, New York, December
[129] R. Want, A. Hopper, V. Falcao, and J. Gibbons. The Active Badge Location System. ACM Transactions on Information Systems, 10(1), January
[130] A. Whitaker, M. Shaw, and S. D. Gribble. Scale and Performance in the Denali Isolation Kernel. In Fifth Symposium on Operating Systems Design and Implementation, December
[131] Wibhu Technologies Inc. SpectraMon.
[132] WildPackets Incorporation. Airopeek Wireless LAN Analyzer. packets.com.
[133] WiMax.com. WiMAX technology, news, training and conferences. imax.com/.
[134] WinDump: Tcpdump for Windows.
[135] S.-L. Wu, C.-Y. Lin, Y.-C. Tseng, and J.-P. Sheu. A New Multi-Channel MAC Protocol with On-Demand Channel Assignment for Mobile Ad Hoc Networks. In International Symposium on Parallel Architectures, Algorithms and Networks (I-SPAN),
[136] S. Xu and T. Saadawi. Does the IEEE MAC Protocol Work Well in Multihop Wireless Ad Hoc Networks? IEEE Communications Magazine, pp , June 2001.

[137] Z. Tang and J. Garcia-Luna-Aceves. Hop-reservation multiple access (HRMA) for ad-hoc networks. In IEEE INFOCOM,
[138] M. Zec. Implementing a Clonable Network Stack in the FreeBSD Kernel. In Proc. of USENIX Annual Technical Conference, June
[139] Y. Zhang, L. Breslau, V. Paxson, and S. Shenker. On the Characteristics and Origins of Internet Flow Rates. In Proc. of ACM SIGCOMM, Pittsburgh, PA, August
[140] Y. Zhang, N. Duffield, V. Paxson, and S. Shenker. On the Constancy of Internet Path Properties. In Proc. of ACM Internet Measurement Workshop, San Francisco, CA, Nov 2001.