From Lock-Free to Wait-Free: Linked List

Edward Duong
School of Computer Science
Carleton University
Ottawa, Canada K1S 5B6

April 23, 2014

Abstract

Lock-free data structures guarantee that at least one thread makes progress over time. To achieve a stronger progress guarantee that prevents thread starvation, we must look to wait-freedom. Wait-free data structures guarantee that every thread completes its operation within a finite number of steps. However, building wait-free data structures is challenging and often leads to inefficient algorithms. We apply a recent methodology for transforming lock-free data structures into the highly desirable wait-free form. The data structure we select for this transformation is the locality-conscious linked list. To our knowledge of the current literature, this is the first time that this particular linked list variant has been implemented in a wait-free form. Our experimental results show that the performance of the wait-free form is fair.

1 Introduction

With the emergence of multiprocessing systems in past decades, it is clear that a shift in the way we think about and construct data structures is required. Traditional data structures give little consideration to their execution in concurrent environments. It is not sufficient to simply move a traditional data structure into a concurrent environment and expect immediate improvements in performance.

To satisfy the growth in parallel computing, concurrent data structure designs try to maximize operation throughput. In such cases, a thread is spawned every time an operation is made. Creators of these algorithms work hard to remove or strongly reduce the critical sections where access is serialized. Critical sections form a bottleneck for asynchronous operations, because all other threads are either halted or delayed. Often, designs avoid heavy synchronization mechanisms, such as mutexes or monitors, entirely and rely solely on atomic primitive instructions.
One popular atomic instruction is the compare-and-swap (CAS). A CAS is given three parameters: a target address, an expected value and a new value. It replaces the value at the target address with the new value on the condition that the current value matches the expected value. If the values do not match, the CAS does nothing.

Part of the difficulty[4] in constructing concurrent data structures arises from the complexity of dealing with threads. Threads are controlled by the operating system and are subject to scheduling, interrupts, and preemption from context switches. We must keep in mind that a set of instructions may be executed by multiple threads in an arbitrarily interleaved fashion. This makes it very difficult to design concurrent algorithms and to prove them correct. It also highlights the need for data consistency. We must take extra precautions to ensure that the state of a data structure remains valid throughout a mixture of concurrent read, write, and modify operations. A second difficulty that arises from concurrent data structures is the need to effectively limit interference among threads.

To formally address these issues, wait-free data structures[5], defined by J. Aspnes and M. Herlihy in 1990, guarantee that a thread can complete any operation in a finite number of steps regardless of the interference of any other thread. Years later, M. Herlihy et al. provided a weaker progress guarantee called lock-free[16]. An implementation following the lock-free principle makes system-wide progress if given sufficient processing steps. Note that, unlike wait-free implementations, it may starve individual threads as long as some thread in the system progresses. Existing wait-free and lock-free implementations make use of lightweight synchronization mechanisms such as latches, atomic primitives or compare-and-swap (CAS) primitives in order to meet their progress guarantees. A common strategy for threads attempting to modify an atomic value under such lightweight mechanisms is to retry continuously until they succeed. This contrasts with the technique used in locking, where a thread is put to sleep as opposed to spinning in contention on a shared resource.

The data structure we examine in this paper is a locality-conscious linked list[6]. Traditional linked lists have the advantage that entries (nodes) are allocated or deallocated dynamically as they are inserted or deleted; however, this commonly fragments memory access, because entries may have no predictable placement in memory.
The authors tackle this issue by introducing a mechanism that logically groups entries in a bid to enhance cache-awareness. The focus of our work is to discuss and examine the performance of transforming a lock-free locality-conscious linked list into its wait-free form.

We begin with Section 2, a brief literature review of our two main topics: lock-free linked lists and methodologies for transforming lock-free algorithms into wait-free counterparts. In Section 3, we provide details of the locality-conscious linked list. In Section 4, we discuss the requirements and steps for the transformation. Following this, Section 5 outlines the transformation applied to the linked list. The paper closes with Section 6 and Section 7, which look at our experimental results and provide closing statements, respectively.

2 Literature Review

To date, there have been many attempts to provide a methodology or technique to transform sequential data structures into a parallel equivalent[4][23][14][15]. However, they are fraught with difficulties[17][11]; often the proposed method is infeasible in practice due to memory constraints or excessive overhead. In some cases the concurrent data structure even runs slower than the sequential version.

Since the discovery of wait-free data structures, one of the most explored forms of conversion has been universal construction. The technique was first introduced in 1990 by Herlihy[14]. In his work, a sequential program with no explicit synchronization is automatically transformed through a special set of synchronization and memory management algorithms. The concept has evolved since its inception. Afek[1] provided a universal construction based on a group update algorithm: a thread first completes its own operation and then helps the group of active threads finish their operations. More recently, in 2010, Chuong[9] explored universal construction under transactional memory. Threads interact with the shared data structure through a special Perform function, which handles synchronization using compare-and-swaps. Universal construction through transactions was attempted again by Crain[10]. They defined a deterministic software transactional memory (STM) system which abstracts the sequential algorithm from the underlying shared memory. Any operation performing a transaction is executed only once, and knowledge of the commit/abort concept is unneeded.

Lastly, we examine the transformation[24] we apply in this paper. Timnat et al. exploit a common pattern of construction in lock-free data structures in order to build fast and effective wait-free algorithms without the need for a transactional memory layer.

The concurrent data structure we focus on is the locality-conscious linked list. The linked list is a favourable backbone for many traditional data structures, such as the stack or queue, and, unsurprisingly, sees use across numerous existing applications and systems today. The linked list is no stranger to concurrent optimizations. The first lock-free linked list was designed by Valois[25]. His construction was known for adding a backlink pointer to each entry. This backlink pointer allowed operations encountering interference to traverse backwards to a point where they could resume their work. In 2001, Harris[13] provided another, algorithmically simpler lock-free design, which showed better experimental results than Valois's linked list. In 2004, Fomitchev and Ruppert[12] provided a lock-free variant which uses a smart retreat technique. This allowed operations to avoid restarting from scratch should a CAS fail. There have also been improvements to existing designs. In 2002, Michael[20][21] made use of hazard pointers to improve memory management, allowing entries to be reclaimed in a lock-free fashion.
In 2010, Braginsky and Petrank[6] improved on Michael's work by grouping entries in a contiguous block of memory. This enhanced the cache retrieval capabilities of the linked list, giving traversals an increase in performance. Although it is not a lock-free design, the most recent publication is that of Braginsky et al.[8], who in 2012 published the first wait-free linked list and showed its performance to be comparable to Harris's design. Their wait-free design builds on Harris's linked list, but does not incorporate their earlier work to enhance traversals.

3 Locality-Conscious Lock-Free Linked List

The lock-free data structure we apply the transformation to is the one conceived by Braginsky and Petrank[6]. In their work, they add to the state-of-the-art lock-free linked list by improving the performance of traversals. This is done by grouping entries into a consecutive block of memory, denoted as a chunk. An example of a chunk is shown in Figure 1. Because entries are physically close to one another, they can be fitted into a single virtual memory page, which can lead to performance improvements in caching. This locality-conscious enhancement is not a new idea; however, it is the first time it has been attempted in a lock-free fashion.

3.1 Overview

While we cannot outline every detail of the data structure, we first provide a brief overview and then highlight the operations that are important for the transformation. The linked list provides three basic operations: search, insert and delete. In addition, it is ordered; that is to say, every key of a predecessor must be smaller and every key of a successor must be greater. This property holds true for both entries and chunks. All keys within a predecessor chunk must be smaller and all keys within a successor chunk must be greater. There are no duplicate keys. Each key is unique, and attempting to insert an existing key will fail.

Figure 1: High-level view of chunks and entries

As mentioned earlier, the linked list has a locality-conscious enhancement that groups entries into a chunk. A chunk has 3 properties: a) it maintains a fixed-size array of entries (nodes), b) it holds 2 pointers (nextchunk and newchunk), one pointing to the next chunk in the list and another to a chunk which may eventually replace it, and c) it keeps a lower-bound counter on the number of entries within it. Because maintaining an exact count in a lock-free fashion is complex, the algorithm is designed to function using only a lower bound on the actual count. Intuitively, the chunk counter increases on insert and decreases on delete. A range is selected to prevent cases where a chunk becomes too small or too full. When a delete causes the entry count to fall below this range, the chunk is typically frozen and merged with its left neighbor. When an insert causes the chunk to become full, it is most often frozen and split into two new chunks. Figures 2 and 3 depict examples of the merge and split. Ideally, a range is selected such that merges and splits do not occur too frequently.

A chunk head provides a central starting point. Beginning at the chunk head, every chunk connects to a successor through a pointer. The last chunk points to null. A first-time reader might remark that the data structure resembles a traditional linked list (entries) nested within an outer linked list (chunks).
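The chunk layout and the freeze triggers described above can be sketched as follows. This is a minimal, single-threaded illustration only: the names `Entry`, `Chunk`, and `structural_change`, as well as the values of `MAX_ENTRIES` and `MIN_ENTRIES`, are assumptions made for the example and are not taken from the original implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_ENTRIES = 4  # assumed chunk capacity (hypothetical value)
MIN_ENTRIES = 2  # assumed lower end of the allowed range (hypothetical value)


@dataclass
class Entry:
    key: Optional[int] = None          # None means the slot is empty
    data: object = None
    next_entry: Optional[int] = None   # index of the successor slot in the chunk


@dataclass
class Chunk:
    # a) fixed-size array of entries
    entries: List[Entry] = field(
        default_factory=lambda: [Entry() for _ in range(MAX_ENTRIES)])
    # b) two pointers: successor chunk and eventual replacement chunk
    next_chunk: Optional["Chunk"] = None
    new_chunk: Optional["Chunk"] = None
    # c) lower bound on the number of entries in the chunk
    counter: int = 0


def structural_change(chunk: Chunk, op: str) -> Optional[str]:
    """Return the restructuring that `op` would typically trigger, if any."""
    if op == "insert" and chunk.counter >= MAX_ENTRIES:
        return "split"   # chunk is full: freeze, then split into two chunks
    if op == "delete" and chunk.counter - 1 < MIN_ENTRIES:
        return "merge"   # chunk too small: freeze, then merge with left neighbor
    return None


full = Chunk(counter=MAX_ENTRIES)
sparse = Chunk(counter=MIN_ENTRIES)
print(structural_change(full, "insert"), structural_change(sparse, "delete"))
```

The sketch only decides which restructuring would be triggered; the freeze, split, and merge procedures themselves are not modelled.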

Figure 2: Two chunks merging into a new chunk

Figure 3: A chunk splitting into two new chunks

3.2 Searching for a Key

The search operation is central to the algorithm; it is used in both insert and delete. It is a combination of two smaller searches: one that searches the chunk list (chunk-level) and another that searches the entry list (entry-level).

A search begins by traversing the chunk list, starting with the chunk head, until it identifies the chunk in which the key-to-find should reside. It identifies the correct chunk by comparing the key-to-find against the keys of the first entries of the chunks in a window, as depicted in Figure 4. A window contains the predecessor and successor chunks between which the key-to-find should fall. Using the ordered property, if the key-to-find is smaller or equal, we narrow our search to the current chunk. Otherwise, the key-to-find is larger and we move to the next window, repeating the process until the chunk is found.

To search the entries within a chunk, we traverse its list starting at the entry head, following the same algorithm as a basic linked list search. If an entry with the exact key is found, the data value is returned and the search operation returns success. If the key is not found, the operation returns failure.

Two details worth mentioning for the purpose of our transformation are the secondary roles that search plays. Firstly, while traversing chunks, if the newchunk pointer of a chunk shows that it is ready to be replaced by a new chunk, the old chunk is atomically swapped with the new one. Secondly, while traversing entries, should an entry be marked for deletion, it is atomically swapped out of the list before the traversal continues.

Figure 4: Comparing the first entry of two chunks

3.3 Inserting a Key

We begin the discussion of the insert operation by outlining its success path. The insert operation begins by identifying the correct chunk in which to insert its key / data pair by running the same chunk-level search algorithm. Following this, it attempts to atomically claim an empty entry within the chunk by setting the entry's key / data pair. If successful, it searches for the window into which the entry should be inserted by running an entry-level search on the current chunk. A window contains the predecessor and successor entries between which the new entry should be inserted. Two atomic compare-and-swaps (CAS) are used to connect the new entry to the list. The first CAS causes the new entry to point to the window's successor. The second CAS causes the window's predecessor to point to the new entry. An example of these two CASes is shown in Figure 5.
Lastly, before returning from the operation, we atomically increment the entry counter by one.

The insert path has numerous locations where it can encounter interference. The first such place is when atomically claiming an empty entry. When a chunk is full, no empty entry can be claimed. The insert must begin an irreversible freeze of the chunk, which likely results in a split of the frozen chunk into two new chunks. The freeze may aid the insert by pre-inserting the key / data pair into a new chunk before it replaces the frozen chunk; however, if multiple threads are freezing the same chunk, this aid cannot be guaranteed. In the case where the freeze completes but was unable to aid the insert, the operation again attempts to claim an empty entry, this time on the new chunk created by the freeze.
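The two compare-and-swaps that connect a new entry to the list can be sketched as below. Python offers no user-level CAS instruction, so the hypothetical `AtomicCell` class emulates one with an internal lock; it illustrates only the CAS semantics, not the lock-free progress guarantee. The names `Entry` and `link_entry` are likewise assumptions for this example, and the claiming of an empty entry and the counter increment are omitted.

```python
import threading


class AtomicCell:
    """Emulated atomic word; the lock merely stands in for a hardware CAS."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        # Swap only if the current value is the expected one (pointer identity).
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False


class Entry:
    def __init__(self, key, data=None):
        self.key = key
        self.data = data
        self.next = AtomicCell(None)


def link_entry(pred, succ, new_entry):
    """Connect new_entry between the window (pred, succ) using two CASes."""
    # First CAS: the new entry points to the window's successor.
    current = new_entry.next.load()
    if not new_entry.next.compare_and_swap(current, succ):
        return False
    # Second CAS: the window's predecessor points to the new entry.
    return pred.next.compare_and_swap(succ, new_entry)


first, last = Entry(10), Entry(30)
first.next.compare_and_swap(None, last)

e20 = Entry(20)
linked = link_entry(first, last, e20)    # window (10, 30) is still valid

e25 = Entry(25)
stale = link_entry(first, last, e25)     # window is stale: 10 now points to 20
retried = link_entry(e20, last, e25)     # retry with the re-searched window
```

When the second CAS fails, as in the `stale` call, the caller is expected to search for a fresh window and retry, which mirrors the retry behaviour described for the insert operation.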

The next potential failure points are the two compare-and-swaps that connect the new entry into the list. Should one of these fail due to interference from other operations, we re-perform the search for a window and retry until they succeed. Note that it is also possible that an entry with the same key is detected while searching for a window. The memory addresses of the two entries are compared in order to determine whether it is in fact a duplicate key or whether another thread simply helped connect our entry to the list. In the latter case, the operation immediately returns success. If a duplicate key is found, cleanup is initiated. A cleanup requires freeing the entry claimed earlier by atomically clearing its next pointer and reverting its key / data pair to empty. Should either atomic instruction fail, it must be due to another thread performing a freeze. We help freeze and check one last time to see if our entry was inserted.

The last point of potential failure we would like to highlight is the atomic increment of the entry counter. Should it fail, it is simply retried.

Figure 5: Two compare-and-swaps for inserting an entry

3.4 Deleting a Key

The success path of the delete operation begins by using the chunk-level search to identify the chunk in which the key-to-delete should reside. Before proceeding with the delete, we atomically decrement the chunk's counter and check that it does not fall below the minimum threshold. If it does, an irreversible freeze is performed (we elaborate on this in a later paragraph). Otherwise, an entry-level search is used to find the window belonging to the entry to delete. If no entry is found, the operation simply returns failure.

In order to delete an entry, it is marked. Marking is done by atomically flipping a special bit on the entry's next pointer address. This prevents the pointer from changing value, since all subsequent compare-and-swap operations that modify the pointer expect the bit not to be set. The final step is to disconnect the entry from the list. A compare-and-swap that makes the deleted entry's predecessor point to the deleted entry's successor is sufficient; the entry's memory can then be reclaimed. An interesting point about this last step is that it can also be done independently of the delete operation. In fact, search operations help disconnect deleted entries should they come across one in their traversal. An example of this is shown in the bottom half of Figure 6.

The delete operation must be able to handle failures in certain paths. Early on, if it fails to atomically decrement the chunk's counter, it simply retries until it succeeds. If marking an entry as deleted fails, it also retries by searching for the entry's window again and re-attempting the mark. The final step of disconnecting the entry and recycling it can be done by any ongoing search operation; thus the delete operation tries this only once, since a failure would indicate that another operation has helped complete it.

Regarding the freeze that is performed when the chunk's counter falls below a threshold, the most common outcome is that the chunk is merged with its left neighbor. A merge requires that both chunks be irreversibly frozen and that a new chunk be created with the combined contents of both chunks.
The new chunk is then swapped into the chunk list atomically and the two frozen chunks are freed. Just as the freeze mechanism can aid an insert by pre-inserting the key / data pair into the new chunk, it can aid a delete by pre-deleting the entry before the new chunk is connected. When multiple threads are helping to freeze, this aid cannot be guaranteed. Should the aid fail, the delete operation simply restarts on the new chunk returned from the freeze.
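The logical (marking) and physical (disconnecting) steps of a delete can be sketched as follows. This is a simplified illustration under stated assumptions: the hypothetical `AtomicCell` class emulates CAS with a lock because Python has no user-level CAS, the delete bit is modelled as the second element of a `(successor, marked)` tuple rather than as a bit stolen from a pointer, and the names `mark_deleted` and `unlink` are invented for the example. Counter handling and freezing are omitted.

```python
import threading


class AtomicCell:
    """Emulated atomic word; the lock merely stands in for a hardware CAS."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


class Entry:
    def __init__(self, key, succ=None):
        self.key = key
        # The next field holds (successor, deleted_bit) as one atomic value.
        self.next = AtomicCell((succ, False))


def mark_deleted(entry):
    """Logical deletion: set the deleted bit on the entry's next pointer."""
    succ, marked = entry.next.load()
    if marked:
        return False
    return entry.next.compare_and_swap((succ, False), (succ, True))


def unlink(pred, victim):
    """Physical deletion: make pred point past a marked victim."""
    succ, marked = victim.next.load()
    if not marked:
        return False
    return pred.next.compare_and_swap((victim, False), (succ, False))


c = Entry(30)
b = Entry(20, c)
a = Entry(10, b)

marked = mark_deleted(b)
# Any later CAS on b.next expects an unmarked pointer, so it fails.
blocked = b.next.compare_and_swap((c, False), (a, False))
unlinked = unlink(a, b)      # may equally be done by a helping search
again = unlink(a, b)         # a second attempt fails: already disconnected
```

The failing `again` call corresponds to the case where the delete operation tries to disconnect the entry once and treats a failure as evidence that another operation has already helped.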

Figure 6: Logical and physical deletion of an entry

4 Lock-Free to Wait-Free Transformation

The work of Timnat and Petrank[24] provides a practical technique for transforming linearizable lock-free data structures into the coveted linearizable wait-free form. Their concept draws on the ubiquitous fast-path-slow-path methodology[2][3][19][22]. This methodology separates operations that typically succeed quickly, with little to no interference, from the ones that are difficult and can easily be starved for long periods. As an example, in their previous work on a wait-free queue[18], the fast-path would execute the lock-free algorithm to attain good performance. Only when a failure to make progress was detected did it switch over to the slower wait-free algorithm, which was guaranteed to make progress.

In a similar way, their transformation mimics the fast-path-slow-path design by separating operations into two paths: normal and helped. An operation begins on the normal path and only moves to the helped path if it detects that no progress is being made. When helped, the operation is guaranteed to make progress and to complete eventually. We provide an overview of the technique, our implementation of the transformation, and our experiences in doing so.

Not all lock-free data structures are eligible for transformation. There are a few requirements that must first be met. The data structure must be lock-free and linearizable, and all of its atomic instructions must be in the form of a compare-and-swap (CAS). Additionally, the data structure operations must depend only on the input parameters and the shared data structure itself. The locality-conscious lock-free linked list we select fits these requirements.

4.1 Help Queue

Before the lock-free algorithm can be run in a fast-path-slow-path manner, it must undergo modifications. The outcome is a normalized form of the original algorithm. Getting an algorithm into this form is a major part of the overall transformation.

The first addition over the original algorithm is to initialize an empty wait-free queue[8] that will contain all operations that ask for help. In order to ask for help, an operation needs to be able to express its current computation state in a succinct description. We show an example of this from our transform below. A thread begins by running its operation in the normalized form. If a thread requires help, it creates a description of its current operation state and enqueues it in the help queue before moving on to help other operations in the queue. After helping any operation, a thread checks whether the operation belongs to it. If not, it continues helping until it completes its own. The result of an operation is always written to the description, whether by the parent thread or by a helping thread, so that the parent thread simply reads it and reports the result when it finishes.

Another modification to the original algorithm is that any new operation checks the help queue and helps one operation in the slow-path before moving on to perform its own operation. We provide a very simplistic state diagram showing the two paths in Figure 7.
    struct OperationRecord
        int ownertid;
        OperationType optype;          // search, insert, delete
        OperationInput input;
        OperationRecordState state;    // restart, failure, success...
        boolean result;
        Array casList;                 // list of CASes to be executed

    struct OperationInput
        int key;
        Data data;
        Data datareturn;
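The helping protocol of Section 4.1 can be sketched in a single-threaded form as below. This is only an illustration under loud assumptions: a `collections.deque` stands in for the wait-free help queue, the shared data structure is reduced to a plain Python set, and the names `ask_for_help` and `apply_once` are hypothetical. The record is a trimmed version of the operation record shown above.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class OperationRecord:
    owner_tid: int
    op_type: str                   # "search", "insert" or "delete"
    key: int
    state: str = "restart"         # "restart", "failure" or "success"
    result: Optional[bool] = None


keys = {7}  # stand-in for the shared data structure (assumption)


def apply_once(op):
    """Run one attempt of op; return True/False, or None to signal a restart."""
    if op.op_type == "insert":
        if op.key in keys:
            return False
        keys.add(op.key)
        return True
    if op.op_type == "delete":
        if op.key not in keys:
            return False
        keys.discard(op.key)
        return True
    return op.key in keys          # search


def ask_for_help(queue, record):
    """Enqueue record, then help queued operations until record completes."""
    queue.append(record)
    while record.state == "restart":
        head = queue[0]            # always help the oldest request first
        outcome = apply_once(head)
        if outcome is None:
            continue               # the helped operation must restart
        # The result is written to the description, whoever computed it.
        head.result = outcome
        head.state = "success" if outcome else "failure"
        queue.popleft()
    return record.result


help_queue = deque()
other = OperationRecord(owner_tid=1, op_type="insert", key=5)
help_queue.append(other)           # an earlier request from another thread

mine = OperationRecord(owner_tid=2, op_type="delete", key=7)
my_result = ask_for_help(help_queue, mine)
```

In this run the thread with `owner_tid` 2 first completes the older insert on behalf of thread 1 and only then completes its own delete, after which it reads its result from its own record.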

Figure 7: Simplified states of the helping mechanism

4.2 Detecting Failure

While running in the fast-path, a normalized operation may encounter contention from other threads. Contention typically occurs in the form of a CAS failure. In order to detect that no progress is being made, each operation maintains a contention counter. This counter is incremented by one each time a CAS fails. After each increment, it is checked against a threshold. If the threshold has been exceeded, the thread returns from its current task, creates a succinct description of its operation state, and enqueues it in the help queue. The operation has now entered the slow-path, where it is guaranteed to make progress.

    struct ContentionInfo_t
        int counter;

    function runcaslist(CASList, ContentionInfo)
        for each CAS descriptor in CASList
            result = runcas()
            if result is true
                cas->state = CAS_STATE_SUCCESS
            else
                cas->state = CAS_STATE_FAILURE
                contentionInfo->counter++
                break
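A runnable approximation of the pseudocode above is sketched below. It is not the implementation used in our experiments: the hypothetical `AtomicCell` class emulates CAS with a lock because Python has no user-level CAS, `CONTENTION_THRESHOLD` is an arbitrary assumed value, and the function returns a flag indicating whether the caller should ask for help, which is an addition made for the example.

```python
import threading
from dataclasses import dataclass

CONTENTION_THRESHOLD = 2  # assumed value; the real threshold is a tuning choice


class AtomicCell:
    """Emulated atomic word; the lock merely stands in for a hardware CAS."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


@dataclass
class CasDescriptor:
    target: AtomicCell
    expected: object
    new: object
    state: str = "pending"     # "pending", "success" or "failure"


@dataclass
class ContentionInfo:
    counter: int = 0


def run_cas_list(cas_list, contention):
    """Execute descriptors in order and stop at the first failed CAS.

    Returns True when the contention counter exceeds the threshold,
    i.e. when the caller should move to the slow-path and ask for help.
    """
    for cas in cas_list:
        if cas.target.compare_and_swap(cas.expected, cas.new):
            cas.state = "success"
        else:
            cas.state = "failure"
            contention.counter += 1
            break
    return contention.counter > CONTENTION_THRESHOLD


x, y = AtomicCell(0), AtomicCell(0)
info = ContentionInfo()
cas_list = [
    CasDescriptor(x, 0, 1),    # succeeds
    CasDescriptor(y, 5, 6),    # fails: y holds 0, not 5
    CasDescriptor(x, 1, 2),    # never attempted
]
needs_help = run_cas_list(cas_list, info)
states = [cas.state for cas in cas_list]
```

Descriptors after the first failure are left untouched, matching the `break` in the pseudocode.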

4.3 Normalized Form

The second part of the normalized form is the more involved part. It requires separating the atomic CAS instructions of the original algorithm so that they fit within the normalized form. There are three stages of the normalized form, run consecutively, that need to be considered: a preparatory stage, an execution stage and a post-execution stage. Their formal names in the paper are CAS Generator, CAS Executor, and Wrap-up, respectively. Each original operation in the algorithm, i.e. search, insert or delete, executes all three stages, one after another. Any stage can be executed by one or more threads, although in the fast-path it is only executed by the parent thread. It should also be noted that the operation outcome must be the same whether it is executed by one thread or by many. A formal proof of this is found in the original paper.

One point worth mentioning is that many CASes in the original algorithm do not need to be run in the CAS Executor. A complete definition of this type of CAS is given in the original paper along with its formal name, auxiliary CAS. Typically, auxiliary CASes are found in a function that can be run safely in parallel. We further discuss the uses of auxiliary CASes in the CAS Generator and Wrap-up stages of our transformation.

    function NormalizedOperation(...)
        checkandhelpqueue()
        casList = CASGenerator(...)
        runcaslist(casList)
        results = WrapUp(casList)
        return results

CAS Generator

The CAS Generator is responsible for generating a list of compare-and-swap (CAS) descriptors that must be run exactly once. More specifically, these CASes have the property that they cannot be done in parallel because they must be executed by the thread that initiated the operation. In the original paper they are referred to as owner CASes.
For example, in our locality-conscious linked list, the CAS belonging to the delete operation that marks an entry as deleted is an owner CAS. No other thread can claim ownership, since the parent thread is tasked with the operation and that operation performs the marking CAS. However, not all CASes are owner CASes. As seen later, some can be performed safely in parallel in either the CAS Generator or Wrap-up stages, avoiding altogether the need to be run in the CAS Executor.

CAS Executor

In the CAS Executor, each CAS descriptor from the previous stage is executed in order, one by one, and a result is stamped onto it. The difficulty is to make all threads executing the list aware of the results of all other threads without using heavy synchronization mechanisms that spoil wait-freedom. To achieve this, a modification bit is reserved on the primitive that the CAS targets. An arbitrary number of threads attempt the CAS, but all assume that the expected value does not have this modification bit set. The new value, however, has the modification bit set. Therefore all threads attempt the CAS with the same expected value (without the modification bit set) and the same new value (with the modification bit set), but only one thread succeeds, since the expected value no longer matches afterwards.

In addition to the modification bit, a few more bits are reserved as a version counter. In our implementation, we use a version counter that is 1 byte in size, which allows for 256 distinct version values. The version counter solves the well-known ABA problem (which we do not discuss further here for the sake of brevity). After attempting a CAS, regardless of its own result, a thread checks whether the modification bit is set. Seeing this bit set means that some thread was successful. Before success is stamped onto the CAS descriptor, the modification bit is first cleared and the version counter is incremented. Once all CAS descriptors in the list are successfully executed, they are passed on to the next stage. If a CAS fails, failure is stamped onto the CAS descriptor and we move directly to the next stage without attempting any further CAS descriptors.

Wrap-Up

Wrap-up assesses the list of CAS descriptors from the CAS Executor and ultimately decides the final result of the operation. It chooses a result that is either success, failure, or restart from scratch. At this point, the algorithm may execute any non-owner CASes to finish the last steps of the operation. If the operation is in the slow-path, the result of Wrap-up is written back to the descriptor that was originally enqueued. The result determines whether the thread(s) should restart the three stages or simply remove the descriptor from the queue and report the operation's result.

5 Normalized Form: Linked List

In this section, we provide the details of our transformation from the original locality-conscious linked list to the normalized form.

5.1 Contention Counter

A contention failure counter for the locality-conscious linked list is implemented by counting the number of failed CASes.

5.2 Search

We begin with the simplest operation, search.
Search contains no owner CASes and only two auxiliary CASes: one to swap an old chunk out for a new chunk and another to physically disconnect an entry from the list. The lack of owner CASes means that its CAS Generator always returns an empty list. Essentially, the original search algorithm takes place entirely in the Wrap-up function. We provide the normalized form below.

CASes performed by Search(key) that are run in the CAS Executor:
    None

The CAS Generator function for Search(key):
    Return an empty list of CASes

The Wrap-up function for Search(key):
    Call findchunk(key)
    Call find(chunk, key) on the chunk returned above
    If an entry with the requested key was found, exit with result true and the data associated with the key
    Else, exit with result false and null for data

5.3 Insert

The insert operation has two owner CASes: one to set the key / data of an empty entry (I-1) and a second to increment the chunk entry counter by one (I-4). However, there is a complication that causes two additional CASes, which could otherwise have been done outside of the CAS Executor, to be included. These two CASes are responsible for connecting the new entry into the list (I-2 and I-3), but because they are done after (I-1) and before (I-4) in the original algorithm, they must be included in the CAS Generator's list. If these four CASes are not executed in the specified order, the chunk entry counter can no longer guarantee that it is a lower bound on the actual number of entries. This would foil the algorithm's ability to properly detect when it should merge chunks that have become too small.

The insert operation makes use of a few auxiliary CASes, many of which are already placed in parallelizable functions. The first examples are the two secondary roles that the search plays. As discussed in Section 3.2, as an insert tries to find the window where it should insert an entry, it may use auxiliary CASes along the way to help replace old, frozen chunks with new chunks. In addition, it also uses an auxiliary CAS to help remove deleted entries from the list. A second parallelizable function which contains numerous auxiliary CASes is the freeze function. In the original lock-free algorithm, the freeze mechanism is constructed in such a way that multiple threads may help to freeze a chunk. It should be noted that, for the sake of algorithmic simplicity, we opt to perform a freeze that does not aid the insert by pre-inserting the entry.

Within the Wrap-up, there is a slight inefficiency that was introduced in the transformation.
Insert operations attempt to acquire an empty entry before checking if a duplicate key exists. When this case happens, the operation will undo changes to the new entry to return it to an empty state using the clearentry function. Firstly, it uses a CAS to clears the entry s nextentry pointer, followed by a second CAS to clear the key / data. Unfortunately, this function does not readily support being run by multiple threads since it expects only one thread to complete both CASes, one after another. In the worst case, any one of two CASes will fail and a potentially unnecessary freeze will occur, but the overall state of the chunk will still remain valid after the freeze. Thus, we make no changes and accept this as part of the transformation. We demonstrate the transformation stages below. CASes performed by Insert(key) that are run in the CAS Executor: I-1 : Set the key / data to an empty entry I-2 : After locating a window in which the new entry should be inserted into, set the new entry to point to the successor I-3 : After locating a window in which the new entry should be inserted into, swap the predecessor to point the new entry I-4 : Increment the chunk counter by 1 The CAS Generator function for Insert(key): 15

- Call findchunk(key)
- Find the address of an empty entry, which becomes the target of CAS descriptor I-1
- If none are empty, call freeze and restart the CAS Generator
- Else, call find(chunk, key) to locate a window
- If an entry with the same key is found, return an empty list of CASes
- Else, return a list of CAS descriptors containing I-1, I-2, I-3 and I-4

The Wrap-up function for Insert(key):
- If the list of CASes is empty, exit with the result false (operation failed, key already exists)
- If no CASes succeed, restart the operation from scratch
- If I-1 succeeds but I-2 or I-3 fails, call find(chunk, key) to check whether another thread helped complete the insert. Return result true if the entry is found. Otherwise, call clearentry() to undo I-1. If clearentry() fails, return result true (a freeze was performed and our entry was inserted). If clearentry() succeeds, restart the operation.
- If I-1, I-2 and I-3 succeed, return result true

Note: Even if I-4 fails, the failure is ignored, since skipping the increment cannot violate the property that the chunk counter is only a lower bound on the actual entry count.

5.4 Delete

The delete transformation is more straightforward than the insert transformation. There are two owner CASes: one to decrement the chunk counter by one (D-1), and another to mark the entry as deleted (D-2). It also makes use of the findchunk() and find() functions, which use auxiliary CASes to swap new chunks into the list and to physically remove deleted entries from the list. Like insert, the delete operation also makes use of the parallelizable freeze function should the chunk entry counter fall too low. Once again, we opt for a freeze that does not aid the delete operation by pre-deleting an entry in the chunk(s) created after a freeze. The transformation for delete follows.

CASes performed by Delete(key) that are run in the CAS Executor:
D-1: Decrement the chunk counter by 1.
D-2: After locating the entry to delete, mark its nextentry pointer with the deleted bit.
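
To make the descriptor lists above more concrete, the following is a minimal sketch in C11 of a list of CAS descriptors for Delete and an executor that attempts them in order. All names (cas_desc, execute_cas_list, make_delete_list, DELETED_BIT) are hypothetical and do not come from the original implementation. We also assume that the executor stops at the first failed CAS and that the deleted mark is the low bit of the nextentry word. The modification bit and version counter used by the real CAS Executor are left out for brevity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor: one CAS to be attempted by the executor. */
typedef struct {
    _Atomic uint64_t *target;   /* word to modify */
    uint64_t expected;          /* value observed by the generator */
    uint64_t desired;           /* value to install */
} cas_desc;

/* Attempt the CASes in list order and stop at the first failure
 * (assumed behaviour). Returns the index of the first failed CAS,
 * or -1 if every CAS succeeded. */
static int execute_cas_list(cas_desc *list, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t expected = list[i].expected;
        if (!atomic_compare_exchange_strong(list[i].target, &expected,
                                            list[i].desired))
            return (int)i;
    }
    return -1;
}

/* Assumption: the low bit of the nextentry word marks a deleted entry. */
#define DELETED_BIT UINT64_C(1)

/* Build the two Delete descriptors from the values the generator read. */
static void make_delete_list(cas_desc out[2],
                             _Atomic uint64_t *counter, uint64_t seen_count,
                             _Atomic uint64_t *next, uint64_t seen_next)
{
    assert(seen_count > 0);  /* the generator freezes before this point otherwise */
    out[0] = (cas_desc){ counter, seen_count, seen_count - 1 };      /* D-1 */
    out[1] = (cas_desc){ next, seen_next, seen_next | DELETED_BIT }; /* D-2 */
}
```

A Wrap-up function would then branch on the returned index: 0 means D-1 failed, 1 means D-1 succeeded but D-2 failed, and -1 means both succeeded.
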
The CAS Generator function for Delete(key):
- Call findchunk(key)
- Check that the chunk counter is above the minimum threshold
- If it is under the threshold, perform a plain freeze() and restart the generator
- Otherwise, call find(chunk, key)
- If no entry with the requested key is found, return an empty list of CASes (the operation fails, there is no entry with that key)
- Else, return a list of CAS descriptors containing D-1 and D-2

The Wrap-up function for Delete(key):
- If the list of CASes is empty, exit with the result false (the operation failed, there is no entry with that key)
- If D-1 fails, restart the operation from scratch
- If D-2 fails, check if the chunk is frozen. If so, help freeze and restart the operation from scratch
- If all CASes succeed, call find(chunk, key) to physically remove the entry and return result true

Note: If D-1 succeeds but D-2 fails, we make no attempt to re-increment the chunk counter. The lower bound property of the counter will not be violated.

5.5 Points of Interest

There are two points of interest we encountered that are worth mentioning. The first is the need for proper memory management. In our transformation we chose to ignore this topic and have left the current memory management scheme through hazard pointers as is. Ideally, we would have liked to use a wait-free garbage collector to abstract the details of memory reclamation.

A second area of interest is the set of changes needed to add a modification bit and a version counter to a primitive for use in the CAS Executor, as outlined earlier. A primitive has a fixed size. The algorithm typically uses one that is 8 bytes in length. When a primitive is used for internal purposes under our control, e.g. a simple counter, we can partition it easily to reserve parts of it. However, adding a modification bit and version counter to an 8-byte pointer is more difficult because we have no knowledge of how the system may manipulate the bytes within. Although there are 15 unreserved bits in the least significant portion of a pointer, we needed to break some rules in order to get enough room for the modification.
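
For the easy case mentioned above, a primitive under our control such as a counter, the partitioning could look like the following sketch. The field widths (1 modification bit, a 31-bit version counter, and a 32-bit payload) and all names are our assumptions for illustration; the text does not specify the individual widths.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(uint64_t) == 8, "the primitive is assumed to be 8 bytes");

/* Assumed layout of the 8-byte word (illustrative only):
 *   bit  63      modification bit
 *   bits 32..62  version counter (31 bits)
 *   bits 0..31   payload, e.g. the counter value */
#define MOD_BIT       (UINT64_C(1) << 63)
#define VERSION_SHIFT 32
#define VERSION_MASK  (UINT64_C(0x7FFFFFFF) << VERSION_SHIFT)
#define PAYLOAD_MASK  UINT64_C(0xFFFFFFFF)

/* Combine the three fields into one word; the version is truncated to 31 bits. */
static uint64_t pack_word(uint32_t payload, uint32_t version, int modified)
{
    uint64_t w = payload;
    w |= (uint64_t)(version & 0x7FFFFFFFu) << VERSION_SHIFT;
    if (modified)
        w |= MOD_BIT;
    return w;
}

static uint32_t word_payload(uint64_t w)
{
    return (uint32_t)(w & PAYLOAD_MASK);
}

static uint32_t word_version(uint64_t w)
{
    return (uint32_t)((w & VERSION_MASK) >> VERSION_SHIFT);
}

static int word_modified(uint64_t w)
{
    return (w & MOD_BIT) != 0;
}

/* Return the word with its version advanced by one (wrapping at 31 bits),
 * keeping the payload and the modification bit unchanged. */
static uint64_t bump_version(uint64_t w)
{
    return pack_word(word_payload(w), word_version(w) + 1, word_modified(w));
}
```

Because all three fields live in a single 8-byte word, one CAS can check and update the payload, the version and the modification bit together.
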
We partitioned the most-significant 4 bytes of the pointer and use that part to store our modification bit and version counter. In practice, the system rarely reaches an address in the upper ranges; we therefore make a calculated but potentially dangerous assumption in using those bytes to store our extra data.

6 Results

Our experiment compares the performance of the original lock-free locality-conscious linked list against its wait-free transformation. In the wait-free algorithm, we run the normalized form in both the fast-path and the slow-path. The contention counter threshold is set to k = 4, which allows any operation to fail at most 4 CASes in the fast-path before moving to the slow-path. All tests were run using C on a system with 4 GB of memory and an Intel Core 2 Duo E8400, which has 2 cores running at 3.0 GHz. The two cores share a 6 MB L2 cache and do not support hyperthreading.

The benchmark we performed runs in two steps. First, all threads are used to pre-fill the data structures with 10,000 entries using inserts.

Second, we delegate a specific role to each thread: 15% of the threads perform inserts, another 15% perform deletes, and the remaining 70% perform searches. All inserts use randomly generated keys in the range [1, 10,000,000]. This test is repeated five times and we report the average results in the following figure.

Figure 8: Lock-free versus Wait-Free

We found that, on average, the wait-free algorithm has a 56% longer runtime than its lock-free counterpart. This number is not unexpected; there are significant additions, e.g. maintaining a help queue, that cause overhead.

7 Conclusion

While a silver-bullet transformation from a lock-free data structure to a wait-free form is highly desirable, it is not attained without encountering a few obstacles. As shown in the experimental results, the performance cost may not be an acceptable tradeoff for the guarantee that each operation is bounded by a finite number of steps. Fortunately, there is room for improvement. One possible optimization is to run the lock-free version instead of the normalized version in the fast-path. In the experimental results reported by the authors of the methodology, this optimization performed well on a variety of data structures. On average, a difference in performance of 2% was shown with this optimization. A lesser improvement could lie in optimizing the current normalized form, in particular by reducing the number of CASes that the insert operation generates in the CAS Generator. It is also possible to better parallelize some of the functions that commonly encounter CAS failures. For example, when attempting to acquire an empty entry, a randomly chosen empty entry could be returned in place of the sequentially next empty entry. These ideas could help reduce interference among insert operations.

Ultimately, this work may lead to more interesting transformations in the future. We can envision applying this algorithm to the lock-free B+Tree [7], which extends the work done on the locality-conscious linked list. A wait-free B+Tree would have more significant applications, as it is the data structure of choice for databases.

References

[1] Yehuda Afek, Dalia Dauber, and Dan Touitou. Wait-free made fast. In Proceedings of the twenty-seventh annual ACM symposium on Theory of computing. ACM.
[2] James H Anderson and Yong-Jik Kim. Fast and scalable mutual exclusion. In Distributed Computing. Springer.
[3] James H Anderson and Yong-Jik Kim. Adaptive mutual exclusion with local spinning. In Distributed Computing. Springer.
[4] Arvind, Rishiyur S. Nikhil, and Keshav K. Pingali. I-structures: Data structures for parallel computing. ACM Trans. Program. Lang. Syst., 11(4), October.
[5] J. Aspnes and M. Herlihy. Wait-free data structures in the asynchronous PRAM model. In Proceedings of the Second Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '90, New York, NY, USA. ACM.
[6] Anastasia Braginsky and Erez Petrank. Locality-conscious lock-free linked lists. In Distributed Computing and Networking. Springer.
[7] Anastasia Braginsky and Erez Petrank. A lock-free B+tree. In Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA '12, pages 58-67, New York, NY, USA. ACM.
[8] Shahar Timnat, Anastasia Braginsky, Alex Kogan, and Erez Petrank. Wait-free linked-lists.
[9] Phong Chuong, Faith Ellen, and Vijaya Ramachandran. A universal construction for wait-free transaction friendly data structures. In Proceedings of the 22nd ACM symposium on Parallelism in algorithms and architectures. ACM.
[10] Tyler Crain, Damien Imbs, and Michel Raynal. Towards a universal construction for transaction-based multiprocess programs. In Distributed Computing and Networking. Springer.
[11] Faith Fich, Danny Hendler, and Nir Shavit. On the inherent weakness of conditional synchronization primitives. In Proceedings of the Twenty-third Annual ACM Symposium on Principles of Distributed Computing, PODC '04, pages 80-87, New York, NY, USA. ACM.
[12] Mikhail Fomitchev and Eric Ruppert. Lock-free linked lists and skip lists. In Proceedings of the twenty-third annual ACM symposium on Principles of distributed computing. ACM.
[13] Timothy L Harris. A pragmatic implementation of non-blocking linked-lists. In Distributed Computing. Springer.
[14] M. Herlihy. A methodology for implementing highly concurrent data structures. In Proceedings of the Second ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming, PPOPP '90, New York, NY, USA. ACM.
[15] Maurice Herlihy. A methodology for implementing highly concurrent data objects. ACM Trans. Program. Lang. Syst., 15(5), November.
[16] Maurice Herlihy, Victor Luchangco, and Mark Moir. Obstruction-free synchronization: Double-ended queues as an example. In Proceedings of the 23rd International Conference on Distributed Computing Systems, ICDCS '03, page 522, Washington, DC, USA. IEEE Computer Society.
[17] Maurice P. Herlihy. Impossibility and universality results for wait-free synchronization. In Proceedings of the Seventh Annual ACM Symposium on Principles of Distributed Computing, PODC '88, New York, NY, USA. ACM.
[18] Alex Kogan and Erez Petrank. Wait-free queues with multiple enqueuers and dequeuers. ACM SIGPLAN Notices, 46(8).
[19] Leslie Lamport. A fast mutual exclusion algorithm. ACM Transactions on Computer Systems (TOCS), 5(1):1-11.
[20] Maged M Michael. High performance dynamic lock-free hash tables and list-based sets. In Proceedings of the fourteenth annual ACM symposium on Parallel algorithms and architectures. ACM.
[21] Maged M Michael. Hazard pointers: Safe memory reclamation for lock-free objects. Parallel and Distributed Systems, IEEE Transactions on, 15(6).
[22] Mark Moir and James H Anderson. Wait-free algorithms for fast, long-lived renaming. Science of Computer Programming, 25(1):1-39.
[23] Bratin Saha, Ali-Reza Adl-Tabatabai, Richard L. Hudson, Chi Cao Minh, and Benjamin Hertzberg. McRT-STM: A high performance software transactional memory system for a multi-core runtime. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '06, New York, NY, USA. ACM.
[24] Shahar Timnat and Erez Petrank. A practical wait-free simulation for lock-free data structures.
[25] John D Valois. Lock-free linked lists using compare-and-swap. In Proceedings of the fourteenth annual ACM symposium on Principles of distributed computing. ACM.