From Lock-Free to Wait-Free: Linked List


Edward Duong
School of Computer Science
Carleton University
Ottawa, Canada K1S 5B6

April 23, 2014

Abstract

Lock-free data structures guarantee that at least one thread must make progress over time. To achieve a stronger progress guarantee that prevents thread starvation, we must look to wait-freedom. Wait-free data structures guarantee that every thread completes its operation within a finite number of steps. However, building wait-free data structures is challenging and often leads to inefficient algorithms. We attempt to apply a recent methodology for transforming lock-free data structures into the highly desirable wait-free form. The data structure we select for this transformation is the locality-conscious linked list. According to the current literature, this is the first time that this particular linked list variant has been realized in a wait-free form. Our experimental results show that the performance of the wait-free form is fair.

1 Introduction

With the emergence of multiprocessing systems in past decades, it is clear that a shift in the way we think about and construct data structures is required. Traditional data structures give little consideration to their execution in concurrent environments. It is not sufficient to simply move a traditional data structure into a concurrent environment and expect immediate improvements in performance.

To satisfy the growth in parallel computing, concurrent data structure designs try to maximize operation throughput. In such cases, a thread is spawned every time an operation is made. Creators of these algorithms strive to remove or strongly reduce the critical sections where access is serialized. Critical sections form a bottleneck for asynchronous operations, where all other threads are either halted or delayed. Often, designs entirely avoid heavy synchronization mechanisms, such as mutexes or monitors, and focus solely on the use of atomic primitive instructions.
One popular atomic instruction is the compare-and-swap (CAS). A CAS is given three parameters: a target address, an expected value and a new value. It replaces the value at the target address with the new value on the condition that the current value at the target matches the expected value. If the values do not match, the CAS does nothing.

Part of the difficulty[4] in constructing concurrent data structures arises from the complexity of dealing with threads. Threads are controlled by the operating system and are subject to scheduling, interrupts, and preemption from context switches. We must keep in mind that a set of instructions may be executed in an arbitrary interleaving by multiple threads. This makes it very difficult to design concurrent algorithms and to prove them correct. It also highlights the need for data consistency. We must take extra precautions to ensure that the state of a data structure remains valid throughout a mixture of concurrent read, write, and modify operations. A second difficulty that arises from concurrent data structures is the need to effectively limit interference among threads.

To formally address these issues, wait-free data structures[5], defined by J. Aspnes and M. Herlihy in 1990, guarantee that a thread can complete any operation in a finite number of steps regardless of the interference of any other thread. Years later, M. Herlihy et al. provided a weaker progress guarantee called lock-free[16]. An implementation following the lock-free principle makes system-wide progress if given sufficient processing steps. Note that, unlike wait-free implementations, it may starve individual threads as long as some thread in the system progresses. Existing wait-free and lock-free implementations make use of lightweight synchronization mechanisms such as latches, atomic primitives or compare-and-swap (CAS) primitives in order to meet their progress guarantees. A common strategy for threads attempting to modify an atomic value under such lightweight mechanisms is to retry continuously until they succeed. This contrasts with the technique used in locking, where a thread is put to sleep as opposed to spinning in contention on a shared resource.

The data structure we examine in this paper is a locality-conscious linked list[6]. Traditional linked lists have the advantage that entries (nodes) are allocated or deallocated dynamically as they are inserted or deleted; however, this leads to the commonly encountered problem of fragmented memory access, because entries may follow no predictable pattern in memory.
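The compare-and-swap semantics and the retry-until-success strategy described in this section can be sketched in C11 as follows. This is a minimal illustration rather than code from the paper: the names `cas_int` and `increment_with_retry` are hypothetical, and the sketch assumes the C11 `<stdatomic.h>` interface.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* CAS as described in the text: store new_value into *target only if the
 * current value equals expected; otherwise leave *target untouched. */
bool cas_int(atomic_int *target, int expected, int new_value)
{
    /* C11 overwrites its "expected" argument with the observed value on
     * failure. We pass the address of our by-value parameter, so the
     * caller's variable is never modified. */
    return atomic_compare_exchange_strong(target, &expected, new_value);
}

/* Retry-until-success: reread the value and try again whenever another
 * thread changed it between our read and our CAS. Returns the number of
 * failed attempts, which is 0 when no other thread interferes. */
int increment_with_retry(atomic_int *counter)
{
    assert(counter != NULL);
    int failures = 0;
    for (;;) {
        int seen = atomic_load(counter);
        if (cas_int(counter, seen, seen + 1))
            return failures;
        failures++;
    }
}
```

The loop in `increment_with_retry` is lock-free but not wait-free: some thread always succeeds, yet a particular thread may in principle fail indefinitely, which is exactly the starvation that wait-freedom rules out.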
The authors of [6] tackle this issue by introducing a mechanism to logically group entries in a bid to enhance cache-awareness. The focus of our work is to discuss and examine the performance of transforming a lock-free locality-conscious linked list into its wait-free form. We begin with Section 2, a brief literature review on our two main topics: lock-free linked lists and methodologies for transforming lock-free algorithms into a wait-free counterpart. In Section 3, we provide details of the locality-conscious linked list. In Section 4, we discuss the requirements and steps for the transformation. Following this, Section 5 outlines the transformation applied to the linked list. The paper closes with Section 6 and Section 7, which look at our experimental results and provide closing statements, respectively.

2 Literature Review

To date, there have been many attempts to provide a methodology or technique to transform sequential data structures into a parallel equivalent[4][23][14][15]. However, they are fraught with difficulties[17][11]; often the proposed method is infeasible in practice due to memory constraints or excessive overhead. In some cases the concurrent data structure even runs slower than the sequential version.

Since the discovery of wait-free data structures, one of the most explored forms of conversion has been the use of universal construction. The technique was first introduced in 1990 by Herlihy[14]. In his work, a sequential program with no explicit synchronization is automatically transformed through a special set of synchronization and memory management algorithms. The concept has evolved ever since its inception. Afek[1] provided a universal construction based on a group update algorithm. A thread first completes its own operation and then helps the group of active threads finish their operations. More recently, in 2010, Chuong[9] explored universal construction under transactional memory. Threads would interact with the shared data structure through a special Perform function which handled synchronization through the use of compare-and-swaps. The concept of universal construction through transactions is again attempted by Crain[10]. They defined a deterministic software transactional memory (STM) system which abstracts the sequential algorithm from the underlying shared memory. Any operation performing a transaction is executed only once, and knowledge of the concept of commit/abort is unneeded. Lastly, we examine the transformation[24] we apply in this paper. Timnat et al. exploit a common pattern of construction in lock-free data structures in order to build fast and effective wait-free algorithms without the need for a transactional memory layer.

The concurrent data structure we focus on is the locality-conscious linked list. The linked list is a favourable backbone for many traditional data structures such as the stack or queue and, unsurprisingly, sees use across numerous existing applications and systems today. The linked list is no stranger to concurrent optimizations. The first lock-free linked list was designed by Valois[25]. His construction was known for adding a backlink pointer to each entry. This backlink pointer allowed operations encountering interference to traverse backwards to a point where they could resume their work. In 2001, Harris[13] provided another, algorithmically simpler lock-free design, which showed better experimental results than Valois's linked list. In 2004, Fomitchev and Ruppert[12] provided a lock-free variant which uses a smart retreat technique. This allowed operations to avoid restarting from scratch should a CAS fail. There have also been improvements to existing designs. In 2002, Michael[20][21] made use of hazard pointers to improve memory management, which allowed for the reclamation of entries in a lock-free fashion.
In 2010, Braginsky and Petrank[6] improved on Michael's work by grouping entries in a contiguous block of memory. This enhanced the cache retrieval capabilities of the linked list, giving traversals an increase in performance. The most recent publication is not a lock-free design: in 2012, Braginsky et al.[8] published the first wait-free linked list and showed its performance to be comparable to Harris's design. Their wait-free design builds on Harris's linked list, but does not incorporate their earlier work to enhance traversals.

3 Locality-Conscious Lock-Free Linked List

The lock-free data structure we apply the transformation to is the one conceived by Braginsky and Petrank[6]. In their work, they add to the state-of-the-art lock-free linked list by improving the performance of traversals. This is done by grouping entries into a consecutive block of memory, denoted as a chunk. An example of a chunk is shown in Figure 1. Because entries are physically close to one another, they can be fitted into a single virtual memory page, which can lead to performance improvements in caching. This locality-conscious enhancement is not a new idea; however, it is the first time it has been attempted in a lock-free fashion.

3.1 Overview

While we cannot outline every detail of the data structure, we first provide a brief overview and then highlight the operations that are important for the transformation. The linked list provides three basic operations: search, insert and delete. In addition, it is ordered; that is to say, every key of a predecessor must be smaller and every key of a successor must be greater. This property holds true for both entries and chunks: all keys within a predecessor chunk must be smaller and all keys within a successor chunk must be greater. There are no duplicate keys. Each key is unique, and attempting to insert an existing key will fail.

Figure 1: High-level view of chunks and entries

As mentioned earlier, the linked list has a locality-conscious enhancement that groups entries into a chunk. A chunk has three properties: a) it maintains a fixed-size array of entries (nodes), b) it holds two pointers (nextchunk and newchunk), one pointing to the next chunk in the list and the other to a chunk which may eventually replace it, and c) it keeps a lower-bound counter on the number of entries within it. Due to the complex nature of maintaining an exact count in a lock-free fashion, the algorithm is designed to function using only a lower bound on the actual count. Intuitively, the chunk counter increases on insert and decreases on delete. A range is selected to prevent cases where a chunk becomes too small or too full. When a delete causes the entry count to fall below this range, the chunk is typically frozen and merged with its left neighbor. When an insert causes the chunk to become full, it is most often frozen and split into two new chunks. Figures 2 and 3 depict examples of the merge and split. Ideally, a range is selected such that merges and splits do not occur too frequently.

A chunk head provides a central starting point. Beginning at the chunk head, every chunk connects to a successor through a pointer. The last chunk points to null. A first-time reader might remark that the data structure resembles a traditional linked list (entries) nested within an outer linked list (chunks).
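The three chunk properties listed above can be pictured with the following C11 layout. This is a sketch under stated assumptions, not the layout used in [6] or in our implementation: the constants `CHUNK_CAPACITY`, `CHUNK_MIN` and `EMPTY_ENTRY`, the packing of key and data into one 64-bit word, and the helper names are all hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_CAPACITY 8           /* hypothetical size of the entries array */
#define CHUNK_MIN      2           /* hypothetical lower edge of the range   */
#define EMPTY_ENTRY    UINT64_MAX  /* hypothetical "unclaimed" sentinel      */

typedef struct Entry {
    _Atomic(uint64_t)  keydata;  /* key and data packed so one CAS claims both */
    _Atomic(uintptr_t) next;     /* successor entry within the chunk           */
} Entry;

typedef struct Chunk {
    Entry entries[CHUNK_CAPACITY];      /* (a) fixed-size array of entries     */
    _Atomic(struct Chunk *) nextchunk;  /* (b) next chunk in the list          */
    _Atomic(struct Chunk *) newchunk;   /* (b) chunk that may replace this one */
    atomic_int counter;                 /* (c) lower bound on the entry count  */
} Chunk;

void chunk_init(Chunk *c)
{
    assert(c != NULL);
    for (size_t i = 0; i < CHUNK_CAPACITY; i++) {
        atomic_init(&c->entries[i].keydata, EMPTY_ENTRY);
        atomic_init(&c->entries[i].next, (uintptr_t)0);
    }
    atomic_init(&c->nextchunk, NULL);
    atomic_init(&c->newchunk, NULL);
    atomic_init(&c->counter, 0);
}

/* Claim the first empty entry with a single CAS. A return value of -1 means
 * the chunk is full, which is the situation that leads to a freeze and,
 * most often, a split. */
int claim_entry(Chunk *c, uint32_t key, uint32_t data)
{
    uint64_t packed = ((uint64_t)key << 32) | data;
    for (int i = 0; i < CHUNK_CAPACITY; i++) {
        uint64_t expected = EMPTY_ENTRY;
        if (atomic_compare_exchange_strong(&c->entries[i].keydata,
                                           &expected, packed))
            return i;
    }
    return -1;
}

/* A delete that would push the counter below the range triggers a freeze,
 * typically followed by a merge with the left neighbor. */
bool delete_needs_freeze(int counter_before)
{
    return counter_before - 1 < CHUNK_MIN;
}
```

With the hypothetical sentinel chosen here, the pair (key, data) whose packed value equals `UINT64_MAX` cannot be stored; a real implementation would reserve its empty marker differently.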

Figure 2: Two chunks merging into a new chunk

Figure 3: A chunk splitting into two new chunks

3.2 Searching for a Key

The search operation is central to the algorithm; it is used in both insert and delete. It is a combination of two smaller searches, one that searches the chunk list (chunk-level) and another that searches the entries list (entry-level). A search begins by traversing the chunk list, starting with the chunk head, until it identifies the chunk in which the key-to-find should reside. It identifies the correct chunk by comparing the key-to-find against the keys of the first entries of the chunks in a chunk window, as depicted in Figure 4. A window contains the predecessor and successor chunks between which the key-to-find should fall. Using the ordered property, if the key-to-find is smaller or equal, then we narrow down our search to within the current chunk. Otherwise, the key-to-find is larger and we move to the next window and repeat the process until the chunk is found.

To search the entries within a chunk, we traverse its list starting at the entry head, following the same algorithm as a basic linked list search. If an entry with the exact key is found, the data value is returned and the search operation returns success. If the key is not found, the operation returns with the result failed.

Two details worth mentioning for the purpose of our transformation are the secondary roles that search plays. First, while traversing chunks, the search checks the newchunk pointer that each chunk maintains; if a chunk is ready to be replaced by a new chunk, the old chunk is atomically swapped with the new one. Second, while traversing entries, should an entry be marked for deletion, it is atomically swapped out of the list before the search continues.

Figure 4: Comparing the first entry of two chunks

3.3 Inserting a Key

We begin the discussion of the insert operation by outlining its success path. The insert operation begins by identifying the correct chunk in which to insert its key / data pair by running the same chunk-level search algorithm. Following this, it attempts to atomically claim an empty entry within the chunk by setting the entry's key / data pair. If successful, it searches for the window into which the entry should be inserted by running an entry-level search on the current chunk. A window contains the predecessor and successor entries between which the new entry should be inserted. Two atomic compare-and-swaps (CAS) are used to connect the new entry to the list. The first CAS causes the new entry to point to the window's successor. The second CAS causes the window's predecessor to point to the new entry. An example of these two CASes is shown in Figure 5.
Lastly, before returning from the operation, we atomically increment the entry counter by one.

The insert path has numerous locations where it can encounter interference. The first such place is when atomically claiming an empty entry. When a chunk is full, no empty entry can be claimed. The insert must begin an irreversible freeze of the chunk, which likely results in a split of the frozen chunk into two new chunks. The freeze may aid the insert by pre-inserting the key / data pair into a new chunk before it replaces the frozen chunk; however, if multiple threads are freezing the same chunk, this aid cannot be guaranteed. In the case where the freeze completes but was unable to aid the insert, the operation attempts again to claim an empty entry on the new chunk created after the freeze.
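The window search and the two linking compare-and-swaps on the insert success path can be sketched as below. This is a simplified illustration on a plain sorted list with a sentinel head, assuming C11 atomics; it omits chunks, entry claiming, the counter and freezing, and the names `Node`, `Window`, `find_window` and `insert_node` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Node {
    int key;
    _Atomic(struct Node *) next;
} Node;

typedef struct Window {
    Node *pred;  /* last node whose key is smaller than the search key  */
    Node *succ;  /* first node whose key is greater or equal, or NULL   */
} Window;

void node_init(Node *n, int key)
{
    n->key = key;
    atomic_init(&n->next, NULL);
}

/* Entry-level search; head is a sentinel whose key is never compared. */
Window find_window(Node *head, int key)
{
    Window w = { head, atomic_load(&head->next) };
    while (w.succ != NULL && w.succ->key < key) {
        w.pred = w.succ;
        w.succ = atomic_load(&w.succ->next);
    }
    return w;
}

/* Returns false when the key already exists, true once the node is linked. */
bool insert_node(Node *head, Node *node)
{
    assert(head != NULL && node != NULL);
    for (;;) {
        Window w = find_window(head, node->key);
        if (w.succ != NULL && w.succ->key == node->key)
            return false;  /* duplicate key */

        /* First CAS: point the new entry at the window's successor. */
        Node *old_next = atomic_load(&node->next);
        if (!atomic_compare_exchange_strong(&node->next, &old_next, w.succ))
            continue;

        /* Second CAS: swing the predecessor from the successor to the new
         * entry. It fails if the window changed in the meantime. */
        Node *expected = w.succ;
        if (atomic_compare_exchange_strong(&w.pred->next, &expected, node))
            return true;
        /* Interference: search for a fresh window and retry. */
    }
}
```

Because the second CAS expects the predecessor to still point at the successor found by the search, any concurrent change to the window makes it fail and sends the operation back to the search.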

The next potential failure points are the two compare-and-swaps that connect the new entry into the list. Should one of these fail due to interference from other operations, we re-perform the search for a window and retry until they succeed. Note that it is also possible that, while searching for a window, an entry with the same key is detected. The memory addresses of the two entries are compared in order to determine whether it is in fact a duplicate key or whether another thread simply helped connect our entry to the list. In the latter case, the operation immediately returns with a result of success. If a duplicate key is found, cleanup is initiated. A cleanup requires freeing the entry claimed earlier by atomically clearing its next pointer and reverting its key / data pair to empty. Should either atomic instruction fail, it must be due to another thread performing a freeze. We help freeze and check one last time to see if our entry was inserted. The last point of potential failure we would like to highlight is atomically incrementing the entry counter. Should it fail, it is simply retried.

Figure 5: 2 compare-and-swaps for inserting an entry

3.4 Deleting a Key

The success path of the delete operation begins by using the chunk-level search to identify the chunk in which the key-to-delete will reside. Before proceeding with the delete, we atomically decrement the chunk's counter and check that it does not fall below the minimum threshold. If it does, an irreversible freeze is performed (we will elaborate on this in a later paragraph). Otherwise, an entry-level search is used to find the window belonging to the entry to delete. If no entry is found, the operation simply returns with the result failed.

In order to delete an entry, it is marked. Marking is done by atomically setting a special bit in the entry's next pointer. This prevents the pointer from changing value, since all subsequent compare-and-swap operations that modify the pointer will expect the bit not to be set. The final step is to disconnect the entry from the list. A compare-and-swap that makes the deleted entry's predecessor point to the deleted entry's successor is sufficient, after which the entry's memory can be reclaimed. An interesting point about this last step is that it can also be done independently of the delete operation. In fact, search operations will help disconnect deleted entries should they come across one in their traversal. An example of this is shown in the bottom half of Figure 6.

The delete operation must be able to handle failures in certain paths. Early on, if it fails to atomically decrement the chunk's counter, it simply retries until it succeeds. If it fails to mark an entry as deleted, it also retries by performing a search for the entry's window and attempting to mark the entry again. The final step of disconnecting the entry and recycling it can be done by any ongoing search operation; thus the delete operation will only try this once, since a failure would indicate that another operation has helped complete it.

Regarding the freeze that is performed when the chunk's counter falls below a threshold, the most common outcome is that the chunk is merged with its left neighbor. A merge requires that both chunks be irreversibly frozen and that a new chunk be created with the combined contents of both chunks.
The new chunk is then swapped into the chunk list atomically and the two frozen chunks are freed. Similar to how the freeze mechanism can aid an insert by pre-inserting the key / data pair into the new chunk, it can do the same here by pre-deleting the entry before the new chunk is connected. When multiple threads are helping to freeze, this aid cannot be guaranteed. Should the aid fail, the delete operation will simply restart on the new chunk returned from the freeze.
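The marking and disconnecting steps of Section 3.4 can be sketched in C11 as follows. The sketch assumes that entries are aligned so that the lowest bit of an entry address is always zero and can therefore carry the deleted mark; `DELETED_BIT`, `mark_deleted` and `unlink_marked` are hypothetical names, and chunk counters and freezing are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DELETED_BIT ((uintptr_t)1)  /* assumes at least 2-byte alignment */

typedef struct Node {
    int key;
    _Atomic(uintptr_t) next;  /* successor address; low bit = deleted mark */
} Node;

static inline Node *ptr_of(uintptr_t word)
{
    return (Node *)(word & ~DELETED_BIT);
}

static inline bool is_marked(uintptr_t word)
{
    return (word & DELETED_BIT) != 0;
}

/* Logical deletion: set the mark bit in the victim's own next pointer.
 * Later CASes that expect the unmarked value will fail, so the pointer
 * is effectively frozen. */
bool mark_deleted(Node *victim)
{
    uintptr_t next = atomic_load(&victim->next);
    if (is_marked(next))
        return false;  /* already deleted by another operation */
    return atomic_compare_exchange_strong(&victim->next, &next,
                                          next | DELETED_BIT);
}

/* Physical deletion: make the predecessor skip a marked victim. Any thread
 * may do this, including a search that passes by. */
bool unlink_marked(Node *pred, Node *victim)
{
    assert(((uintptr_t)victim & DELETED_BIT) == 0);  /* alignment assumption */
    uintptr_t victim_next = atomic_load(&victim->next);
    if (!is_marked(victim_next))
        return false;  /* never unlink an entry that is not marked */
    uintptr_t expected  = (uintptr_t)victim;
    uintptr_t successor = (uintptr_t)ptr_of(victim_next);
    return atomic_compare_exchange_strong(&pred->next, &expected, successor);
}
```

A failed `unlink_marked` is harmless for the deleting thread: it means either that the entry was already disconnected by a helper or that the window changed, in which case a later traversal will disconnect it.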

Figure 6: Logical and physical deletion of an entry

4 Lock-Free to Wait-Free Transformation

The work of Timnat and Petrank[24] provides a practical technique for transforming linearizable lock-free data structures into the coveted linearizable wait-free form. Their concept draws on the ubiquitous fast-path-slow-path methodology[2][3][19][22]. This methodology separates operations that typically succeed quickly, with little to no interference, from the ones that are difficult and can easily be starved for long periods. As an example, in their previous work on a wait-free queue[18], the fast-path would execute the lock-free algorithm to attain good performance. Only when failure to make progress was detected did it switch over to the slower wait-free algorithm that was guaranteed to make progress. In a similar way, their transformation mimics the fast-path-slow-path design by separating operations into two paths: normal and helped. An operation begins on the normal path and moves to the helped path only if it detects that no progress is being made. When helped, the operation is guaranteed to make progress and eventually completes. We will provide an overview of the technique, our implementation of the transformation, and our experiences in doing so.

Not all lock-free data structures are eligible for transformation. There are a few requirements that must first be met. The data structure must be lock-free and linearizable, and all of its atomic instructions must be in the form of a compare-and-swap (CAS). Additionally, the data structure operations must depend only on the input parameters and the shared data structure itself. The locality-conscious lock-free linked list we select fits these requirements.

4.1 Help Queue

Before the lock-free algorithm can be run in a fast-path-slow-path manner, it must undergo modifications. The outcome is a normalized form of the original algorithm. Getting an algorithm into this form is a major part of the overall transformation. The first addition over the original algorithm is to initialize an empty wait-free queue[8] that will contain all operations that ask for help. In order to ask for help, an operation needs to be able to express itself as a succinct description of its current computation state. We show an example of this from our transform below.

A thread begins by running its operation in the normalized form. If a thread requires help, it creates a description of its current operation state and enqueues it to the help queue before moving on to help other operations in the queue. After helping any operation, a thread checks whether that operation was its own. If not, it continues helping until its own operation completes. The result of an operation will always be written to the description, whether by the parent thread or by a helping thread, so that the parent thread simply reads it and reports the result when it finishes. Another modification to the original algorithm is that any new operation will first check the help queue and help one operation in the slow-path before moving on to perform its own operation. We provide a very simplistic state diagram to show the two paths in Figure 7.
    struct OperationRecord {
        int ownertid;
        OperationType optype;         // search, insert, delete
        OperationInput input;
        OperationRecordState state;   // restart, failure, success...
        boolean result;
        Array casList;                // list of CASes to be executed
    }

    struct OperationInput {
        int key;
        Data data;
        Data datareturn;
    }

Figure 7: Simplified states of the helping mechanism

4.2 Detecting Failure

While running a normalized operation in the fast-path, a thread may encounter contention from other threads. Contention typically occurs in the form of a CAS failure. In order to detect that no progress is being made, each operation maintains a contention counter. This counter increments by one each time a CAS fails. Each time it is incremented, it is checked against a threshold. If the threshold has been exceeded, the thread returns from its current task to create a succinct description of its operation state and enqueues it in the help queue. It has now entered the slow-path, where it is guaranteed to make progress.

    struct ContentionInfo_t {
        int counter;
    }

    function runcaslist(CASList, ContentionInfo)
        for each CAS descriptor cas in CASList
            result = runcas(cas)
            if result is true
                cas->state = CAS_STATE_SUCCESS
            else
                cas->state = CAS_STATE_FAILURE
                contentionInfo->counter++
                break
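As a complement to the pseudocode, the following C11 sketch shows one way the contention counter and the threshold check could fit together. It is not our implementation: the descriptor layout, the `CAS_STATE_PENDING` state, the value of `CONTENTION_THRESHOLD` and the function `should_ask_for_help` are assumptions made for illustration, and the CAS is executed directly rather than through the helping protocol of Section 4.3.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CONTENTION_THRESHOLD 2  /* hypothetical limit on failed CASes */

typedef enum {
    CAS_STATE_PENDING,
    CAS_STATE_SUCCESS,
    CAS_STATE_FAILURE
} CasState;

typedef struct CasDescriptor {
    _Atomic(uintptr_t) *target;  /* word the CAS operates on   */
    uintptr_t expected;          /* value it must currently hold */
    uintptr_t new_value;         /* value to install on success  */
    CasState state;              /* outcome stamped by the executor */
} CasDescriptor;

typedef struct ContentionInfo {
    int counter;  /* number of failed CASes seen by this operation */
} ContentionInfo;

/* Execute the descriptors in order and stop at the first failure.
 * Returns true only if every CAS in the list succeeded. */
bool run_cas_list(CasDescriptor *list, size_t n, ContentionInfo *info)
{
    assert(info != NULL);
    for (size_t i = 0; i < n; i++) {
        uintptr_t expected = list[i].expected;
        if (atomic_compare_exchange_strong(list[i].target, &expected,
                                           list[i].new_value)) {
            list[i].state = CAS_STATE_SUCCESS;
        } else {
            list[i].state = CAS_STATE_FAILURE;
            info->counter++;
            return false;
        }
    }
    return true;
}

/* The operation leaves the fast-path once the counter exceeds the threshold. */
bool should_ask_for_help(const ContentionInfo *info)
{
    return info->counter > CONTENTION_THRESHOLD;
}
```

A caller would invoke `should_ask_for_help` after each failed pass and, when it returns true, build its operation description and enqueue it in the help queue.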

4.3 Normalized Form

The second part of the normalized form is the more involved part. It requires separating the atomic CAS instructions of the original algorithm so that they fit within the normalized form. There are three stages of the normalized form, run consecutively, that need to be considered: a preparatory stage, an execution stage and a post-execution stage. Their formal names in the original paper are CAS Generator, CAS Executor, and Wrap-up, respectively. Each original operation in the algorithm, i.e. search, insert or delete, executes all three stages, one after another. Any stage can be executed by one or more threads, although while in the fast-path it will only be executed by the parent thread. It should also be noted that the operation outcome should be the same whether executed by one thread or by many. A formal proof of this is found in the original paper.

One point worth mentioning is that many CASes in the original algorithm do not need to be run in the CAS Executor. A complete definition of this type of CAS is given in the original paper along with its formal name, auxiliary CAS. Typically, auxiliary CASes are found in a function that can be run safely in parallel. We will further discuss the uses of auxiliary CASes in the CAS Generator and Wrap-up stages of our transformation.

    function NormalizedOperation(...)
        checkandhelpqueue()
        casList = CASGenerator(...)
        runcaslist(casList)
        results = WrapUp(casList)
        return results

CAS Generator

The CAS Generator has the responsibility of generating a list of compare-and-swap (CAS) descriptors that must be run exactly once. More specifically, these CASes have the property that they cannot be done in parallel because they must be executed by the thread that initiated the operation. In the original paper they are referred to as owner CASes.
For example, in our locality-conscious linked list, the CAS belonging to the delete operation that marks an entry as deleted is an owner CAS. No other thread can claim ownership since the parent thread is tasked with the operation, and that operation performs the marking CAS. However, not all CASes are owner CASes. As seen later, some can be performed safely in parallel in either the CAS Generator or Wrap-up stages, avoiding altogether the need to be run in the CAS Executor.

CAS Executor

In the CAS Executor, each CAS descriptor from the previous stage is executed in order, one by one, and a result is stamped onto it. The difficulty is to make all threads executing the list aware of the results obtained by the other threads without using heavy synchronization mechanisms that spoil wait-freedom. To achieve this, a modification bit is reserved on the primitive that the CAS targets. An arbitrary number of threads will attempt the CAS, each assuming that the expected value does not have the modification bit set. The new value, however, will have the modification bit set. Therefore all threads will attempt the CAS with the same expected value (without the modification bit set) and the same new value (with the modification bit set), but only one thread will succeed since the expected value will no longer match afterwards.

In addition to the modification bit, a few more bits are reserved as a version counter. In our implementation, we use a version counter that is 1 byte in size, which allows for 256 distinct version values. The version counter addresses the well-known ABA problem (which we do not discuss further here for the sake of brevity). After attempting a CAS, regardless of its own result, a thread checks whether the modification bit is set. Seeing this bit set means that some thread was successful. Before success is stamped onto the CAS descriptor, the modification bit is cleared and the version counter is incremented. Once all CAS descriptors in the list are successfully executed, they are passed on to the next stage. If a CAS fails, failure is stamped onto its CAS descriptor and we move directly to the next stage without attempting any further CAS descriptors.

Wrap-Up

Wrap-up assesses the list of CAS descriptors from the CAS Executor and ultimately decides the final result of the operation. It chooses a result that is either success, failure, or restart from scratch. At this point, the algorithm may execute any non-owner CASes to finish the last steps of the operation. If the operation is in the slow-path, the result of Wrap-up is written back to the descriptor that was originally enqueued. The result determines whether the thread(s) should restart the three stages or simply remove the descriptor from the queue and report the operation's result.

5 Normalized Form: Linked List

In this section, we provide the details of our transformation from the original locality-conscious linked list to the normalized form.

5.1 Contention Counter

A contention failure counter for the locality-conscious linked list is implemented by counting the number of failed CASes.

5.2 Search

We begin with the simplest operation, search.
Search contains no owner CASes and only two auxiliary CASes: one to swap an old chunk out with a new chunk and another to physically disconnect an entry from the list. The lack of owner CASes means that its CAS Generator always returns an empty list. Essentially, the original search algorithm takes place entirely in the Wrap-up function. We provide the normalized form below.

CASes performed by Search(key) that are run in the CAS Executor:
- None

The CAS Generator function for Search(key):
- Return an empty list of CASes

The Wrap-up function for Search(key):
- Call findchunk(key)
- Call find(chunk, key) on the chunk returned above
- If an entry with the requested key was found, exit with result true and the data associated with the key
- Else, exit with result false and null for data

5.3 Insert

The insert operation has two owner CASes: one to set the key / data of an empty entry (I-1) and a second to increment the chunk entry counter by one (I-4). However, there is a complication that causes two additional CASes, which could otherwise have been done outside of the CAS Executor, to be included. These two CASes are responsible for connecting the new entry into the list (I-2 and I-3), but because they are done after (I-1) and before (I-4) in the original algorithm, they must be included in the CAS Generator's list. If these four CASes are not executed in the specified order, the chunk entry counter can no longer guarantee that it is a lower bound on the actual number of entries. This would foil the algorithm's ability to properly detect when it should merge chunks that have become too small.

The insert operation makes use of a few auxiliary CASes, many of which are already placed in parallelizable functions. The first such examples are the two secondary roles that the search plays. As discussed in Section 3.2, as an insert tries to find the window where it should insert an entry, it may along the way use auxiliary CASes to help replace old, frozen chunks with new chunks. In addition, it will also use an auxiliary CAS to help remove deleted entries from the list. A second parallelizable function which contains numerous auxiliary CASes is the freeze function. In the original lock-free algorithm, the freeze mechanism is constructed in such a way that multiple threads may help to freeze a chunk. It should be noted that, for the sake of algorithmic simplicity, we opt to perform a freeze that does not aid the insert by pre-inserting the entry.

Within the Wrap-up, there is a slight inefficiency that was introduced in the transformation.
Insert operations attempt to acquire an empty entry before checking if a duplicate key exists. When this case happens, the operation will undo changes to the new entry to return it to an empty state using the clearentry function. Firstly, it uses a CAS to clears the entry s nextentry pointer, followed by a second CAS to clear the key / data. Unfortunately, this function does not readily support being run by multiple threads since it expects only one thread to complete both CASes, one after another. In the worst case, any one of two CASes will fail and a potentially unnecessary freeze will occur, but the overall state of the chunk will still remain valid after the freeze. Thus, we make no changes and accept this as part of the transformation. We demonstrate the transformation stages below. CASes performed by Insert(key) that are run in the CAS Executor: I-1 : Set the key / data to an empty entry I-2 : After locating a window in which the new entry should be inserted into, set the new entry to point to the successor I-3 : After locating a window in which the new entry should be inserted into, swap the predecessor to point the new entry I-4 : Increment the chunk counter by 1 The CAS Generator function for Insert(key): 15

- Call findchunk(key)
- Find the address of an empty entry, which becomes the target of CAS descriptor I-1
  - If none are empty, call freeze and restart the CAS Generator
  - Else, call find(chunk, key) to locate a window
    - If an entry with the same key is found, return an empty list of CASes
    - Else, return a list of cas-descriptors containing I-1, I-2, I-3 and I-4

The Wrap-up function for Insert(key):
- If the list of CASes is empty, exit with the result false (operation failed, key already exists)
- If no CASes succeed, restart the operation from scratch
- If I-1 succeeds but I-2 or I-3 fails, call find(chunk, key) to check whether another thread helped complete the insert
  - If the entry is found, return result true
  - Otherwise, call clearentry() to undo I-1. If clearentry() fails, return result true (a freeze was performed and our entry was inserted). If clearentry() succeeds, restart the operation
- If I-1, I-2 and I-3 succeed, return result true

Note: Even if I-4 fails, the failure is ignored. Skipping the increment leaves the chunk counter below the actual entry count, so the counter remains a lower bound.

5.4 Delete

The delete transformation is more straightforward than the insert transformation. There are two owner CASes: one to decrement the chunk counter by one (D-1), and another to mark the entry as deleted (D-2). Delete also makes use of the findchunk() and find() functions, which use auxiliary CASes to swap new chunks into the list and to physically remove deleted entries from the list. Like the insert, the delete operation makes use of the parallelizable freeze function should the chunk entry counter fall too low. Once again, we opt to use a freeze that does not aid the delete operation by pre-deleting an entry in the chunk(s) created after a freeze. The transformation for delete follows.

CASes performed by Delete(key) that are run in the CAS Executor:
- D-1: Decrement the chunk counter by 1
- D-2: After locating the entry-to-delete, mark its nextentry pointer with the deleted bit
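The owner CASes listed above for insert and delete reach the CAS Executor as an ordered list of descriptors. The following is a minimal sketch in C11, not the implementation used in this work. The names cas_descriptor and execute_cas_list are hypothetical, each target is assumed to be a single 64-bit word, and the executor is assumed to attempt the CASes in order and stop at the first failure. Handling of the modification bit and version counter is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor for one CAS on a 64-bit word. */
typedef struct {
    _Atomic uint64_t *target;  /* word to modify */
    uint64_t expected;         /* value observed by the CAS Generator */
    uint64_t desired;          /* value to install */
} cas_descriptor;

/*
 * Attempt the CASes in list order and stop at the first failure.
 * Returns the index of the first failed CAS, or n if all succeeded.
 */
static size_t execute_cas_list(cas_descriptor *list, size_t n)
{
    assert(list != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* Local copy: a failed CAS overwrites its "expected" argument. */
        uint64_t expected = list[i].expected;
        if (!atomic_compare_exchange_strong(list[i].target, &expected,
                                            list[i].desired)) {
            return i;
        }
    }
    return n;
}
```

A Wrap-up function can branch on the returned index. For a delete list holding D-1 and D-2, a result of 0 would mean D-1 failed, 1 would mean D-2 failed, and 2 would mean both succeeded.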

The CAS Generator function for Delete(key):
- Call findchunk(key)
- Check that the chunk counter is above the minimum threshold
  - If it is under the threshold, perform a plain freeze() and restart the generator
  - Otherwise, call find(chunk, key)

    - If no entry with the requested key is found, return an empty list of CASes (the operation fails, there is no entry with that key)
    - Else, return a list of cas-descriptors containing D-1 and D-2

The Wrap-up function for Delete(key):
- If the list of CASes is empty, exit with the result false (the operation failed, there is no entry with that key)
- If D-1 fails, restart the operation from scratch
- If D-2 fails, check if the chunk is frozen. If so, help freeze and restart the operation from scratch
- If all CASes succeed, call find(chunk, key) to physically remove the entry and return result true

Note: If D-1 succeeds but D-2 fails, we make no attempt to re-increment the chunk counter. The counter then undercounts, so its lower-bound property is not violated.

5.5 Points of Interest

There are two points of interest we encountered that are worth mentioning. The first is the need for proper memory management. In our transformation we chose to ignore this topic and have left the existing memory management scheme, based on hazard pointers, as is. Ideally, we would have liked to use a wait-free garbage collector to abstract the details of memory reclamation.

A second area of interest is the set of changes required to add a modification bit and a version counter to a primitive for use in the CAS Executor, as outlined earlier. A primitive has a fixed size. The algorithm typically uses one that is 8 bytes in length. When a primitive is used for internal purposes under our control, e.g. a simple counter, we can easily partition it to reserve parts of it. However, adding a modification bit and version counter to an 8-byte pointer is more difficult because we have no knowledge of how the system may manipulate the bytes within. Although there are 15 unreserved bits in the least significant portion of a pointer, we needed to break some rules in order to get enough space for the modification.
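One way to picture such a partitioned word is sketched below in C. This is an illustration under stated assumptions rather than the layout used in this work: the address is assumed to fit in the low 4 bytes, the high 4 bytes are assumed to hold one modification bit and a 31-bit version counter, and all names (pack_word, word_addr, and so on) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed layout of one 8-byte word (the split of the upper 4 bytes
 * is a guess made for this sketch):
 *   bit  63      modification bit
 *   bits 62..32  version counter (31 bits)
 *   bits 31..0   address, assumed to fit in 32 bits
 */
#define ADDR_MASK     UINT64_C(0x00000000FFFFFFFF)
#define MOD_BIT       (UINT64_C(1) << 63)
#define VERSION_SHIFT 32
#define VERSION_MAX   UINT64_C(0x7FFFFFFF)

static uint64_t pack_word(uint64_t addr, uint32_t version, bool modified)
{
    /* The risky assumption: no address ever uses the upper 4 bytes. */
    assert((addr & ~ADDR_MASK) == 0);
    assert(version <= VERSION_MAX);
    return addr | ((uint64_t)version << VERSION_SHIFT)
                | (modified ? MOD_BIT : 0);
}

static uint64_t word_addr(uint64_t w)
{
    return w & ADDR_MASK;
}

static uint32_t word_version(uint64_t w)
{
    return (uint32_t)((w >> VERSION_SHIFT) & VERSION_MAX);
}

static bool word_modified(uint64_t w)
{
    return (w & MOD_BIT) != 0;
}

/* Same address, next version (wrapping), modification bit cleared. */
static uint64_t bump_version(uint64_t w)
{
    uint32_t next = (word_version(w) + 1u) & (uint32_t)VERSION_MAX;
    return pack_word(word_addr(w), next, false);
}
```

In a debug build, the assertion in pack_word makes the assumption explicit: an address that spills into the upper 4 bytes is rejected instead of silently corrupting the extra fields.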
We partitioned off the most significant 4 bytes of the pointer and use that part to store our modification bit and version counter. In practice, the system rarely reaches an address in the upper ranges; we therefore make a calculated but potentially dangerous assumption in using those bytes to store our extra data.

6 Results

Our experiment compares the performance of the original lock-free locality-conscious linked list against its wait-free transformation. In the wait-free algorithm, we run the normalized form in both the fast-path and the slow-path. The contention counter threshold is set to k = 4, which allows any operation to fail at most 4 CASes in the fast-path before moving to the slow-path. All tests were implemented in C and run on a system with 4 GB of memory and an Intel Core2 Duo E8400, which houses 2 cores running at 3.0 GHz. Both cores share a 6 MB L2 cache and do not support hyperthreading.

The benchmark runs in two steps. First, all threads are used to pre-fill the data structures with 10,000 entries using inserts.

Second, we delegate a specific role to each thread: 15% of the threads perform inserts, another 15% perform deletes and the remaining 70% perform searches. All inserts use randomly generated keys in the range [1, 10,000,000]. This test is repeated five times and we report the average results in the following figure.

Figure 8: Lock-free versus Wait-Free

We found that, on average, the wait-free algorithm has a 56% longer runtime than its lock-free counterpart. This number is not unexpected; there are significant additions, e.g. maintaining a help queue, that cause overhead.

7 Conclusion

While a silver-bullet transformation from a lock-free data structure to a wait-free form is highly desirable, it is not attained without encountering a few obstacles. As shown in the experimental results, the performance cost may not be an acceptable tradeoff for the guarantee that each operation is bounded by a finite number of steps. Fortunately, there is room for improvement. One possible optimization is to run the lock-free version instead of the normalized version in the fast-path. From the author's experimental results, this optimization showed good performance on a variety of data structures. On average, a difference in performance of 2% was shown with this optimization. A lesser improvement could lie in optimizing the current normalized form, in particular by reducing the number of CASes that the insert operation generates in the CAS Generator. It is also possible to better parallelize some of the functions that commonly encounter CAS failures. For example, when attempting to acquire an empty entry, a randomized entry could be returned in place of the sequentially next empty entry. These ideas could help reduce interference among insert operations.

Ultimately, this work may lead to more interesting transformations in the future.
We can envision applying this algorithm to the lock-free B+Tree [7], which extends the work done on the locality-conscious linked list. A wait-free B+Tree would have more significant applications, as the B+Tree is the data structure of choice for databases.
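The randomized acquisition of an empty entry suggested above could look roughly like the sketch below. It assumes that a chunk exposes its key / data words as an array, that the value 0 marks an empty entry, and that each thread keeps its own nonzero generator state. The names find_empty_from, find_empty_randomized and xorshift64 are hypothetical and are not functions of the original algorithm.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define EMPTY_ENTRY UINT64_C(0)  /* assumed encoding of an unused entry */

/*
 * Scan for an empty entry beginning at index start and wrapping around.
 * Returns the index of an entry that looked empty, or n if none did.
 * The caller would then make that entry the target of CAS I-1.
 */
static size_t find_empty_from(_Atomic uint64_t *entries, size_t n,
                              size_t start)
{
    assert(n == 0 || start < n);
    for (size_t i = 0; i < n; i++) {
        size_t idx = (start + i) % n;
        if (atomic_load(&entries[idx]) == EMPTY_ENTRY) {
            return idx;
        }
    }
    return n;
}

/* Small xorshift generator; the state is per thread and must be nonzero. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Begin the scan at a random index so concurrent inserts spread out. */
static size_t find_empty_randomized(_Atomic uint64_t *entries, size_t n,
                                    uint64_t *rng_state)
{
    if (n == 0) {
        return 0;
    }
    return find_empty_from(entries, n, (size_t)(xorshift64(rng_state) % n));
}
```

Starting each scan at a different index means that concurrent inserts are less likely to target the same entry with I-1, which is the interference the sequential scan invites.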

References

[1] Yehuda Afek, Dalia Dauber, and Dan Touitou. Wait-free made fast. In Proceedings of the twenty-seventh annual ACM symposium on Theory of computing, pages ACM,
[2] James H Anderson and Yong-Jik Kim. Fast and scalable mutual exclusion. In Distributed Computing, pages Springer,
[3] James H Anderson and Yong-Jik Kim. Adaptive mutual exclusion with local spinning. In Distributed Computing, pages Springer,
[4] Arvind, Rishiyur S. Nikhil, and Keshav K. Pingali. I-structures: Data structures for parallel computing. ACM Trans. Program. Lang. Syst., 11(4): , October
[5] J. Aspnes and M. Herlihy. Wait-free data structures in the asynchronous PRAM model. In Proceedings of the Second Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '90, pages , New York, NY, USA, ACM.
[6] Anastasia Braginsky and Erez Petrank. Locality-conscious lock-free linked lists. In Distributed Computing and Networking, pages Springer,
[7] Anastasia Braginsky and Erez Petrank. A lock-free B+tree. In Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA '12, pages 58-67, New York, NY, USA, ACM.
[8] Shahar Timnat, Anastasia Braginsky, Alex Kogan, and Erez Petrank. Wait-free linked-lists
[9] Phong Chuong, Faith Ellen, and Vijaya Ramachandran. A universal construction for wait-free transaction friendly data structures. In Proceedings of the 22nd ACM symposium on Parallelism in algorithms and architectures, pages ACM,
[10] Tyler Crain, Damien Imbs, and Michel Raynal. Towards a universal construction for transaction-based multiprocess programs. In Distributed Computing and Networking, pages Springer,
[11] Faith Fich, Danny Hendler, and Nir Shavit. On the inherent weakness of conditional synchronization primitives. In Proceedings of the Twenty-third Annual ACM Symposium on Principles of Distributed Computing, PODC '04, pages 80-87, New York, NY, USA, ACM.
[12] Mikhail Fomitchev and Eric Ruppert. Lock-free linked lists and skip lists. In Proceedings of the twenty-third annual ACM symposium on Principles of distributed computing, pages ACM,
[13] Timothy L Harris. A pragmatic implementation of non-blocking linked-lists. In Distributed Computing, pages Springer,
[14] M. Herlihy. A methodology for implementing highly concurrent data structures. In Proceedings of the Second ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming, PPOPP '90, pages , New York, NY, USA, ACM.
[15] Maurice Herlihy. A methodology for implementing highly concurrent data objects. ACM Trans. Program. Lang. Syst., 15(5): , November
[16] Maurice Herlihy, Victor Luchangco, and Mark Moir. Obstruction-free synchronization: Double-ended queues as an example. In Proceedings of the 23rd International Conference on Distributed Computing Systems, ICDCS '03, pages 522, Washington, DC, USA, IEEE Computer Society.
[17] Maurice P. Herlihy. Impossibility and universality results for wait-free synchronization. In Proceedings of the Seventh Annual ACM Symposium on Principles of Distributed Computing, PODC '88, pages , New York, NY, USA, ACM.
[18] Alex Kogan and Erez Petrank. Wait-free queues with multiple enqueuers and dequeuers. ACM SIGPLAN Notices, 46(8): ,
[19] Leslie Lamport. A fast mutual exclusion algorithm. ACM Transactions on Computer Systems (TOCS), 5(1):1-11,
[20] Maged M Michael. High performance dynamic lock-free hash tables and list-based sets. In Proceedings of the fourteenth annual ACM symposium on Parallel algorithms and architectures, pages ACM,
[21] Maged M Michael. Hazard pointers: Safe memory reclamation for lock-free objects. Parallel and Distributed Systems, IEEE Transactions on, 15(6): ,
[22] Mark Moir and James H Anderson. Wait-free algorithms for fast, long-lived renaming. Science of Computer Programming, 25(1):1-39,
[23] Bratin Saha, Ali-Reza Adl-Tabatabai, Richard L. Hudson, Chi Cao Minh, and Benjamin Hertzberg. McRT-STM: A high performance software transactional memory system for a multi-core runtime. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '06, pages , New York, NY, USA, ACM.
[24] Shahar Timnat and Erez Petrank. A practical wait-free simulation for lock-free data structures
[25] John D Valois. Lock-free linked lists using compare-and-swap. In Proceedings of the fourteenth annual ACM symposium on Principles of distributed computing, pages ACM,

Brewer s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services

Brewer s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services Brewer s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services Seth Gilbert Nancy Lynch Abstract When designing distributed web services, there are three properties that

More information

Wait-Free Queues With Multiple Enqueuers and Dequeuers

Wait-Free Queues With Multiple Enqueuers and Dequeuers Wait-Free Queues With Multiple Enqueuers and Dequeuers Alex Kogan Department of Computer Science Technion, Israel sakogan@cs.technion.ac.il Erez Petrank Department of Computer Science Technion, Israel

More information

SEER PROBABILISTIC SCHEDULING FOR COMMODITY HARDWARE TRANSACTIONAL MEMORY. 27 th Symposium on Parallel Architectures and Algorithms

SEER PROBABILISTIC SCHEDULING FOR COMMODITY HARDWARE TRANSACTIONAL MEMORY. 27 th Symposium on Parallel Architectures and Algorithms 27 th Symposium on Parallel Architectures and Algorithms SEER PROBABILISTIC SCHEDULING FOR COMMODITY HARDWARE TRANSACTIONAL MEMORY Nuno Diegues, Paolo Romano and Stoyan Garbatov Seer: Scheduling for Commodity

More information

Memory Allocation. Static Allocation. Dynamic Allocation. Memory Management. Dynamic Allocation. Dynamic Storage Allocation

Memory Allocation. Static Allocation. Dynamic Allocation. Memory Management. Dynamic Allocation. Dynamic Storage Allocation Dynamic Storage Allocation CS 44 Operating Systems Fall 5 Presented By Vibha Prasad Memory Allocation Static Allocation (fixed in size) Sometimes we create data structures that are fixed and don t need

More information

Conallel Data Structures: A Practical Paper

Conallel Data Structures: A Practical Paper Blocking and non-blocking concurrent hash tables in multi-core systems ÁKOS DUDÁS Budapest University of Technology and Economics Department of Automation and Applied Informatics 1117 Budapest, Magyar

More information

Towards Relaxing STM. Christoph M. Kirsch, Michael Lippautz! University of Salzburg. Euro-TM Workshop, April 2014

Towards Relaxing STM. Christoph M. Kirsch, Michael Lippautz! University of Salzburg. Euro-TM Workshop, April 2014 Towards Relaxing STM Christoph M. Kirsch, Michael Lippautz! University of Salzburg! Euro-TM Workshop, April 2014 Problem linear scalability positive scalability good performance throughput (#transactions/time)

More information

Chapter 6, The Operating System Machine Level

Chapter 6, The Operating System Machine Level Chapter 6, The Operating System Machine Level 6.1 Virtual Memory 6.2 Virtual I/O Instructions 6.3 Virtual Instructions For Parallel Processing 6.4 Example Operating Systems 6.5 Summary Virtual Memory General

More information

A Methodology for Creating Fast Wait-Free Data Structures

A Methodology for Creating Fast Wait-Free Data Structures A Methodology for Creating Fast Wait-Free Data Structures Alex Kogan Department of Computer Science Technion, Israel sakogan@cs.technion.ac.il Erez Petrank Department of Computer Science Technion, Israel

More information

DATABASE CONCURRENCY CONTROL USING TRANSACTIONAL MEMORY : PERFORMANCE EVALUATION

DATABASE CONCURRENCY CONTROL USING TRANSACTIONAL MEMORY : PERFORMANCE EVALUATION DATABASE CONCURRENCY CONTROL USING TRANSACTIONAL MEMORY : PERFORMANCE EVALUATION Jeong Seung Yu a, Woon Hak Kang b, Hwan Soo Han c and Sang Won Lee d School of Info. & Comm. Engr. Sungkyunkwan University

More information

Garbage Collection in the Java HotSpot Virtual Machine

Garbage Collection in the Java HotSpot Virtual Machine http://www.devx.com Printed from http://www.devx.com/java/article/21977/1954 Garbage Collection in the Java HotSpot Virtual Machine Gain a better understanding of how garbage collection in the Java HotSpot

More information

Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms

Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms Maged M. Michael Michael L. Scott Department of Computer Science University of Rochester Rochester, NY 14627-0226 fmichael,scottg@cs.rochester.edu

More information

Victor Shoup Avi Rubin. fshoup,rubing@bellcore.com. Abstract

Victor Shoup Avi Rubin. fshoup,rubing@bellcore.com. Abstract Session Key Distribution Using Smart Cards Victor Shoup Avi Rubin Bellcore, 445 South St., Morristown, NJ 07960 fshoup,rubing@bellcore.com Abstract In this paper, we investigate a method by which smart

More information

Persistent Binary Search Trees

Persistent Binary Search Trees Persistent Binary Search Trees Datastructures, UvA. May 30, 2008 0440949, Andreas van Cranenburgh Abstract A persistent binary tree allows access to all previous versions of the tree. This paper presents

More information

How To Write A Multi Threaded Software On A Single Core (Or Multi Threaded) System

How To Write A Multi Threaded Software On A Single Core (Or Multi Threaded) System Multicore Systems Challenges for the Real-Time Software Developer Dr. Fridtjof Siebert aicas GmbH Haid-und-Neu-Str. 18 76131 Karlsruhe, Germany siebert@aicas.com Abstract Multicore systems have become

More information

COS 318: Operating Systems

COS 318: Operating Systems COS 318: Operating Systems File Performance and Reliability Andy Bavier Computer Science Department Princeton University http://www.cs.princeton.edu/courses/archive/fall10/cos318/ Topics File buffer cache

More information

Stream Processing on GPUs Using Distributed Multimedia Middleware

Stream Processing on GPUs Using Distributed Multimedia Middleware Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research

More information

The Classical Architecture. Storage 1 / 36

The Classical Architecture. Storage 1 / 36 1 / 36 The Problem Application Data? Filesystem Logical Drive Physical Drive 2 / 36 Requirements There are different classes of requirements: Data Independence application is shielded from physical storage

More information

Lock-free Dynamically Resizable Arrays

Lock-free Dynamically Resizable Arrays Lock-free Dynamically Resizable Arrays Damian Dechev, Peter Pirkelbauer, and Bjarne Stroustrup Texas A&M University College Station, TX 77843-3112 {dechev, peter.pirkelbauer}@tamu.edu, bs@cs.tamu.edu Abstract.

More information

Facing the Challenges for Real-Time Software Development on Multi-Cores

Facing the Challenges for Real-Time Software Development on Multi-Cores Facing the Challenges for Real-Time Software Development on Multi-Cores Dr. Fridtjof Siebert aicas GmbH Haid-und-Neu-Str. 18 76131 Karlsruhe, Germany siebert@aicas.com Abstract Multicore systems introduce

More information

Dynamic Load Balancing. Using Work-Stealing 35.1 INTRODUCTION CHAPTER. Daniel Cederman and Philippas Tsigas

Dynamic Load Balancing. Using Work-Stealing 35.1 INTRODUCTION CHAPTER. Daniel Cederman and Philippas Tsigas CHAPTER Dynamic Load Balancing 35 Using Work-Stealing Daniel Cederman and Philippas Tsigas In this chapter, we present a methodology for efficient load balancing of computational problems that can be easily

More information

Node-Based Structures Linked Lists: Implementation

Node-Based Structures Linked Lists: Implementation Linked Lists: Implementation CS 311 Data Structures and Algorithms Lecture Slides Monday, March 30, 2009 Glenn G. Chappell Department of Computer Science University of Alaska Fairbanks CHAPPELLG@member.ams.org

More information

Data Structures Fibonacci Heaps, Amortized Analysis

Data Structures Fibonacci Heaps, Amortized Analysis Chapter 4 Data Structures Fibonacci Heaps, Amortized Analysis Algorithm Theory WS 2012/13 Fabian Kuhn Fibonacci Heaps Lacy merge variant of binomial heaps: Do not merge trees as long as possible Structure:

More information

Concurrent Data Structures

Concurrent Data Structures 1 Concurrent Data Structures Mark Moir and Nir Shavit Sun Microsystems Laboratories 1.1 Designing Concurrent Data Structures............. 1-1 Performance Blocking Techniques Nonblocking Techniques Complexity

More information

14.1 Rent-or-buy problem

14.1 Rent-or-buy problem CS787: Advanced Algorithms Lecture 14: Online algorithms We now shift focus to a different kind of algorithmic problem where we need to perform some optimization without knowing the input in advance. Algorithms

More information

Hagit Attiya and Eshcar Hillel. Computer Science Department Technion

Hagit Attiya and Eshcar Hillel. Computer Science Department Technion Hagit Attiya and Eshcar Hillel Computer Science Department Technion !!" What are highly-concurrent data structures and why we care about them The concurrency of existing implementation techniques Two ideas

More information

Memory Management in the Java HotSpot Virtual Machine

Memory Management in the Java HotSpot Virtual Machine Memory Management in the Java HotSpot Virtual Machine Sun Microsystems April 2006 2 Table of Contents Table of Contents 1 Introduction.....................................................................

More information

Predictive modeling for software transactional memory

Predictive modeling for software transactional memory VU University Amsterdam BMI Paper Predictive modeling for software transactional memory Author: Tim Stokman Supervisor: Sandjai Bhulai October, Abstract In this paper a new kind of concurrency type named

More information

1) The postfix expression for the infix expression A+B*(C+D)/F+D*E is ABCD+*F/DE*++

1) The postfix expression for the infix expression A+B*(C+D)/F+D*E is ABCD+*F/DE*++ Answer the following 1) The postfix expression for the infix expression A+B*(C+D)/F+D*E is ABCD+*F/DE*++ 2) Which data structure is needed to convert infix notations to postfix notations? Stack 3) The

More information

Sequential Data Structures

Sequential Data Structures Sequential Data Structures In this lecture we introduce the basic data structures for storing sequences of objects. These data structures are based on arrays and linked lists, which you met in first year

More information

Operatin g Systems: Internals and Design Principle s. Chapter 10 Multiprocessor and Real-Time Scheduling Seventh Edition By William Stallings

Operatin g Systems: Internals and Design Principle s. Chapter 10 Multiprocessor and Real-Time Scheduling Seventh Edition By William Stallings Operatin g Systems: Internals and Design Principle s Chapter 10 Multiprocessor and Real-Time Scheduling Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Bear in mind,

More information

Operating Systems. Virtual Memory

Operating Systems. Virtual Memory Operating Systems Virtual Memory Virtual Memory Topics. Memory Hierarchy. Why Virtual Memory. Virtual Memory Issues. Virtual Memory Solutions. Locality of Reference. Virtual Memory with Segmentation. Page

More information

INTRODUCTION The collection of data that makes up a computerized database must be stored physically on some computer storage medium.

INTRODUCTION The collection of data that makes up a computerized database must be stored physically on some computer storage medium. Chapter 4: Record Storage and Primary File Organization 1 Record Storage and Primary File Organization INTRODUCTION The collection of data that makes up a computerized database must be stored physically

More information

Chapter 13 File and Database Systems

Chapter 13 File and Database Systems Chapter 13 File and Database Systems Outline 13.1 Introduction 13.2 Data Hierarchy 13.3 Files 13.4 File Systems 13.4.1 Directories 13.4. Metadata 13.4. Mounting 13.5 File Organization 13.6 File Allocation

More information

Chapter 13 File and Database Systems

Chapter 13 File and Database Systems Chapter 13 File and Database Systems Outline 13.1 Introduction 13.2 Data Hierarchy 13.3 Files 13.4 File Systems 13.4.1 Directories 13.4. Metadata 13.4. Mounting 13.5 File Organization 13.6 File Allocation

More information

TREE BASIC TERMINOLOGIES

TREE BASIC TERMINOLOGIES TREE Trees are very flexible, versatile and powerful non-liner data structure that can be used to represent data items possessing hierarchical relationship between the grand father and his children and

More information

A COOL AND PRACTICAL ALTERNATIVE TO TRADITIONAL HASH TABLES

A COOL AND PRACTICAL ALTERNATIVE TO TRADITIONAL HASH TABLES A COOL AND PRACTICAL ALTERNATIVE TO TRADITIONAL HASH TABLES ULFAR ERLINGSSON, MARK MANASSE, FRANK MCSHERRY MICROSOFT RESEARCH SILICON VALLEY MOUNTAIN VIEW, CALIFORNIA, USA ABSTRACT Recent advances in the

More information

Name: 1. CS372H: Spring 2009 Final Exam

Name: 1. CS372H: Spring 2009 Final Exam Name: 1 Instructions CS372H: Spring 2009 Final Exam This exam is closed book and notes with one exception: you may bring and refer to a 1-sided 8.5x11- inch piece of paper printed with a 10-point or larger

More information

Lock-free Dynamically Resizable Arrays

Lock-free Dynamically Resizable Arrays Lock-free Dynamically Resizable Arrays Damian Dechev, Peter Pirkelbauer, and Bjarne Stroustrup Texas A&M University College Station, TX 77843-3112 {dechev, peter.pirkelbauer}@tamu.edu, bs@cs.tamu.edu Abstract.

More information

Introduction Disks RAID Tertiary storage. Mass Storage. CMSC 412, University of Maryland. Guest lecturer: David Hovemeyer.

Introduction Disks RAID Tertiary storage. Mass Storage. CMSC 412, University of Maryland. Guest lecturer: David Hovemeyer. Guest lecturer: David Hovemeyer November 15, 2004 The memory hierarchy Red = Level Access time Capacity Features Registers nanoseconds 100s of bytes fixed Cache nanoseconds 1-2 MB fixed RAM nanoseconds

More information

Chapter 13: Query Processing. Basic Steps in Query Processing

Chapter 13: Query Processing. Basic Steps in Query Processing Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing

More information

Chapter 13 Embedded Operating Systems

Chapter 13 Embedded Operating Systems Operating Systems: Internals and Design Principles Chapter 13 Embedded Operating Systems Eighth Edition By William Stallings Embedded System Refers to the use of electronics and software within a product

More information

FPGA-based Multithreading for In-Memory Hash Joins

FPGA-based Multithreading for In-Memory Hash Joins FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded

More information

Cheap Paxos. Leslie Lamport and Mike Massa. Appeared in The International Conference on Dependable Systems and Networks (DSN 2004 )

Cheap Paxos. Leslie Lamport and Mike Massa. Appeared in The International Conference on Dependable Systems and Networks (DSN 2004 ) Cheap Paxos Leslie Lamport and Mike Massa Appeared in The International Conference on Dependable Systems and Networks (DSN 2004 ) Cheap Paxos Leslie Lamport and Mike Massa Microsoft Abstract Asynchronous

More information

Versioned Transactional Shared Memory for the

Versioned Transactional Shared Memory for the Versioned Transactional Shared Memory for the FénixEDU Web Application Nuno Carvalho INESC-ID/IST nonius@gsd.inesc-id.pt João Cachopo INESC-ID/IST joao.cachopo@inesc-id.pt António Rito Silva INESC-ID/IST

More information

Multi- and Many-Core Technologies: Architecture, Programming, Algorithms, & Application

Multi- and Many-Core Technologies: Architecture, Programming, Algorithms, & Application Multi- and Many-Core Technologies: Architecture, Programming, Algorithms, & Application 2 i ii Chapter 1 Scheduling DAG Structured Computations Yinglong Xia IBM T.J. Watson Research Center, Yorktown Heights,

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 11 Memory Management Computer Architecture Part 11 page 1 of 44 Prof. Dr. Uwe Brinkschulte, M.Sc. Benjamin

More information

1. The memory address of the first element of an array is called A. floor address B. foundation addressc. first address D.

1. The memory address of the first element of an array is called A. floor address B. foundation addressc. first address D. 1. The memory address of the first element of an array is called A. floor address B. foundation addressc. first address D. base address 2. The memory address of fifth element of an array can be calculated

More information

A Survey of Parallel Processing in Linux

A Survey of Parallel Processing in Linux A Survey of Parallel Processing in Linux Kojiro Akasaka Computer Science Department San Jose State University San Jose, CA 95192 408 924 1000 kojiro.akasaka@sjsu.edu ABSTRACT Any kernel with parallel processing

More information

How To Make A Correct Multiprocess Program Execute Correctly On A Multiprocedor

How To Make A Correct Multiprocess Program Execute Correctly On A Multiprocedor How to Make a Correct Multiprocess Program Execute Correctly on a Multiprocessor Leslie Lamport 1 Digital Equipment Corporation February 14, 1993 Minor revisions January 18, 1996 and September 14, 1996

More information

A High-Throughput In-Memory Index, Durable on Flash-based SSD

A High-Throughput In-Memory Index, Durable on Flash-based SSD A High-Throughput In-Memory Index, Durable on Flash-based SSD Insights into the Winning Solution of the SIGMOD Programming Contest 2011 Thomas Kissinger, Benjamin Schlegel, Matthias Boehm, Dirk Habich,

More information

InfiniteGraph: The Distributed Graph Database

InfiniteGraph: The Distributed Graph Database A Performance and Distributed Performance Benchmark of InfiniteGraph and a Leading Open Source Graph Database Using Synthetic Data Objectivity, Inc. 640 West California Ave. Suite 240 Sunnyvale, CA 94086

More information

Lecture 10: Dynamic Memory Allocation 1: Into the jaws of malloc()
