Dynamic Storage Allocation: A Survey and Critical Review *
Paul R. Wilson, Mark S. Johnstone, Michael Neely, and David Boles**
Department of Computer Sciences, University of Texas at Austin, Austin, Texas, 78751, USA
(wilson | markj | neely@cs.utexas.edu)

Abstract. Dynamic memory allocation has been a fundamental part of most computer systems since roughly 1960, and memory allocation is widely considered to be either a solved problem or an insoluble one. In this survey, we describe a variety of memory allocator designs and point out issues relevant to their design and evaluation. We then chronologically survey most of the literature on allocators between 1961 and 1995. (Scores of papers are discussed, in varying detail, and over 150 references are given.) We argue that allocator designs have been unduly restricted by an emphasis on mechanism, rather than policy, while the latter is more important; higher-level strategic issues are still more important, but have not been given much attention. Most theoretical analyses and empirical allocator evaluations to date have relied on very strong assumptions of randomness and independence, but real program behavior exhibits important regularities that must be exploited if allocators are to perform well in practice.

1 Introduction and Contents

In this survey, we will discuss the design and evaluation of conventional dynamic memory allocators. By "conventional," we mean allocators used for general purpose "heap" storage, where a program can request a block of memory to store a program object, and free that block at any time. A heap, in this sense, is a pool of memory available for the allocation and deallocation of arbitrary-sized blocks of memory in arbitrary order.3 An allocated block is typically used to store a program "object," which is some kind of structured data item such as a

* This work was supported by the National Science Foundation under grant CCR, and by a gift from Novell, Inc.
** Author's current address: Convex Computer Corporation, Dallas, Texas, USA. (dboles@zeppelin.convex.com)

3 This sense of "heap" is not to be confused with a quite different sense of "heap," meaning a partially ordered tree structure.
Pascal record, a C struct, or a C++ object, but not necessarily an object in the sense of object-oriented programming.4

Throughout this paper, we will assume that while a block is in use by a program, its contents (a data object) cannot be relocated to compact memory (as is done, for example, in copying garbage collectors [Wil95]). This is the usual situation in most implementations of conventional programming systems (such as C, Pascal, Ada, etc.), where the memory manager cannot find and update pointers to program objects when they are moved.5 The allocator does not examine the data stored in a block, or modify or act on it in any way. The data areas within blocks that are used to hold objects are contiguous and nonoverlapping ranges of (real or virtual) memory. We generally assume that only entire blocks are allocated or freed, and that the allocator is entirely unaware of the types or values of data stored in a block--it only knows the size requested.

Scope of this survey. In most of this survey, we will concentrate on issues of overall memory usage, rather than time costs. We believe that detailed measures of time costs are usually a red herring, because they obscure issues of strategy and policy; we believe that most good strategies can yield good policies that are amenable to efficient implementation. (We believe that it's easier to make a very fast allocator than a very memory-efficient one, using fairly straightforward techniques (Section 3.12). Beyond a certain point, however, the effectiveness of speed optimizations will depend on many of the same subtle issues that determine memory usage.) We will also discuss locality of reference only briefly. Locality of reference is increasingly important, as the difference between CPU speeds and main memory (or disk) speeds has grown dramatically, with no sign of stopping.
Locality is very poorly understood, however; aside from making a few important general comments, we leave most issues of locality to future research. Except where locality issues are explicitly noted, we assume that the cost of a unit of memory is fixed and uniform. We do not address possible interactions with unusual memory hierarchy schemes such as compressed caching, which may complicate locality issues and interact in other important ways with allocator design [WLM91, Wil91, Dou93].

4 While this is the typical situation, it is not the only one. The "objects" stored by the allocator need not correspond directly to language-level objects. An example of this is a growable array, represented by a fixed-size part that holds a pointer to a variable-sized part. The routine that grows an object might allocate a new, larger variable-sized part, copy the contents of the old variable-sized part into it, and deallocate the old part. We assume that the allocator knows nothing of this, and would view each of these parts as separate and independent objects, even if normal programmers would see a "single" object.

5 It is also true of many garbage-collected systems. In some, insufficient information is available from the compiler and/or programmer to allow safe relocation; this is especially likely in systems where code written in different languages is combined in an application [BW88]. In other, real-time and/or concurrent systems, it is difficult for the garbage collector to relocate data without incurring undue overhead and/or disruptiveness [Wil95].
We will not discuss specialized allocators for particular applications where the data representations and allocator designs are intertwined.6 Allocators for these kinds of systems share many properties with the "conventional" allocators we discuss, but introduce many complicating design choices. In particular, they often allow logically contiguous items to be stored noncontiguously, e.g., in pieces of one or a few fixed sizes, and may allow sharing of parts or (other) forms of data compression. We assume that if any fragmenting or compression of higher-level "objects" happens, it is done above the level of abstraction of the allocator interface, and the allocator is entirely unaware of the relationships between the "objects" (e.g., fragments of higher-level objects) that it manages. Similarly, parallel allocators are not discussed, due to the complexity of the subject.

Structure of the Paper. This survey is intended to serve two purposes: as a general reference for techniques in memory allocators, and as a review of the literature in the field, including methodological considerations. Much of the literature review has been separated into a chronological review, in Section 4. This section may be skipped or skimmed if methodology and history are not of interest to the reader, especially on a first reading. However, some potentially significant points are covered only there, or only made sufficiently clear and concrete there, so the serious student of dynamic storage allocation should find it worthwhile. (It may even be of interest to those interested in the history and philosophy of computer science, as documentation of the development of a scientific paradigm.7)

The remainder of the current section gives our motivations and goals for the paper, and then frames the central problem of memory allocation--fragmentation--and the general techniques for dealing with it.
Section 2 discusses deeper issues in fragmentation, and methodological issues (some of which may be skipped) in studying it. Section 3 presents a fairly traditional taxonomy of known memory allocators, including several not usually covered. It also explains why such mechanism-based taxonomies are very limited, and may obscure more important policy issues. Some of those policy issues are sketched. Section 4 reviews the literature on memory allocation. A major point of this section is that the main stream of allocator research over the last several decades has focused on oversimplified (and unrealistic) models of program behavior, and

6 Examples include specialized allocators for chained-block message-buffers (e.g., [Wol65]), "cdr-coded" list-processing systems [BC79], specialized storage for overlapping strings with shared structure, and allocators used to manage disk storage in file systems.

7 We use "paradigm" in roughly the sense of Kuhn [Kuh70], as a "pattern or model" for research. The paradigms we discuss are not as broad in scope as the ones usually discussed by Kuhn, but on our reading, his ideas are intended to apply at a variety of scales. We are not necessarily in agreement with all of Kuhn's ideas, or with some of the extreme and anti-scientific purposes they have been put to by others.
that little is actually known about how to design allocators, or what performance to expect. Section 5 concludes by summarizing the major points of the paper, and suggesting avenues for future research.

Table of Contents

1 Introduction and Contents
  1.1 Motivation
  1.2 What an Allocator Must Do
  1.3 Strategies, Placement Policies, and Splitting and Coalescing
    Strategy, policy, and mechanism
    Splitting and coalescing
2 A Closer Look at Fragmentation, and How to Study It
  2.1 Internal and External Fragmentation
  2.2 The Traditional Methodology: Probabilistic Analyses, and Simulation Using Synthetic Traces
    Random simulations
    Probabilistic analyses
    A note on exponentially-distributed random lifetimes
    A note on Markov models
  2.3 What Fragmentation Really Is, and Why the Traditional Approach is Unsound
    Fragmentation is caused by isolated deaths
    Fragmentation is caused by time-varying behavior
    Implications for experimental methodology
  2.4 Some Real Program Behaviors
    Ramps, peaks, and plateaus
    Fragmentation at peaks is important
    Exploiting ordering and size dependencies
    Implications for strategy
    Implications for research
    Profiles of some real programs
    Summary
  2.5 Deferred Coalescing and Deferred Reuse
    Deferred coalescing
    Deferred reuse
  2.6 A Sound Methodology: Simulation Using Real Traces
    Tracing and simulation
    Locality studies
3 A Taxonomy of Allocators
  3.1 Allocator Policy Issues
  3.2 Some Important Low-Level Mechanisms
    Header fields and alignment
    Boundary tags
    Link fields within blocks
    Lookup tables
    Special treatment of small objects
    Special treatment of the end block of the heap
  3.3 Basic Mechanisms
  3.4 Sequential Fits
  3.5 Discussion of Sequential Fits and General Policy Issues
  3.6 Segregated Free Lists
  3.7 Buddy Systems
  3.8 Indexed Fits
    Discussion of indexed fits
  3.9 Bitmapped Fits
  3.10 Discussion of Basic Mechanisms
  3.11 Quick Lists and Deferred Coalescing
    Scheduling of coalescing
    What to coalesce
    Discussion
  3.12 A Note on Time Costs
4 A Chronological Review of The Literature
  4.1 The first three decades: 1960 to 1990
  4.2 Recent Studies Using Real Traces
    Zorn, Grunwald, et al.
    Vo
    Wilson, Johnstone, Neely, and Boles
5 Summary and Conclusions
    Models and Theories
    Strategies and Policies
    Mechanisms
    Experiments
    Data
    Challenges and Opportunities
1.1 Motivation

This paper is motivated by our perception that there is considerable confusion about the nature of memory allocators, and about the problem of memory allocation in general. Worse, this confusion is often unrecognized, and allocators are widely thought to be fairly well understood. In fact, we know little more about allocators than was known twenty years ago, which is not as much as might be expected. The literature on the subject is rather inconsistent and scattered, and considerable work appears to be done using approaches that are quite limited. We will try to sketch a unifying conceptual framework for understanding what is and is not known, and suggest promising approaches for new research.

This problem with the allocator literature has considerable practical importance. Aside from the human effort involved in allocator studies per se, there are effects in the real world, both on computer system costs, and on the effort required to create real software. We think it is likely that the widespread use of poor allocators incurs a loss of main and cache memory (and CPU cycles) worth upwards of a billion (10^9) U.S. dollars worldwide--a significant fraction of the world's memory and processor output may be squandered, at huge cost.8

Perhaps even worse is the effect on programming style due to the widespread use of allocators that are simply bad--either because better allocators are known but not widely known or understood, or because allocation research has failed to address the proper issues. Many programmers avoid heap allocation in many situations, because of perceived space or time costs.9

It seems significant to us that many articles in non-refereed publications--and a number in refereed publications outside the major journals of operating systems and programming languages--are motivated by extreme concerns about the speed or memory costs of general heap allocation. (One such paper [GM85] is discussed in Section 4.1.)
Often, ad hoc solutions are used for applications that should not be problematic at all, because at least some well-designed general allocators should do quite well for the workload in question. We suspect that in some cases, the perceptions are wrong, and that the costs of modern heap allocation are simply overestimated. In many cases, however, it appears that poorly-designed or poorly-implemented allocators have led to a widespread and quite understandable belief that general heap allocation is

8 This is an unreliable estimate based on admittedly casual last-minute computations, approximately as follows: there are on the order of 100 million PC's in the world. If we assume that they have an average of 10 megabytes of memory at $30 per megabyte, there is 30 billion dollars worth of RAM at stake. (With the expected popularity of Windows 95, this seems like it will soon become a fairly conservative estimate, if it isn't already.) If just one fifth (6 billion dollars worth) is used for heap-allocated data, and one fifth of that is unnecessarily wasted, the cost is over a billion dollars.

9 It is our impression that UNIX programmers' usage of heap allocation went up significantly when Chris Kingsley's allocator was distributed with BSD 4.2 UNIX--simply because it was much faster than the allocators they'd been accustomed to. Unfortunately, that allocator is somewhat wasteful of space.
necessarily expensive. Too many poor allocators have been supplied with widely distributed operating systems and compilers, and too few practitioners are aware of the alternatives. This appears to be changing, to some degree. Many operating systems now supply fairly good allocators, and there is an increasing trend toward marketing libraries that include general allocators which are at least claimed to be good, as a replacement for default allocators. It seems likely that there is simply a lag between the improvement in allocator technology and its widespread adoption, and another lag before programming style adapts. The combined lag is quite long, however, and we have seen several magazine articles in the last year on how to avoid using a general allocator. Postings praising ad hoc allocation schemes are very common in the Usenet newsgroups oriented toward real-world programming.

The slow adoption of better technology and the lag in changes in perceptions may not be the only problems, however. We have our doubts about how well allocators are really known to work, based on a fairly thorough review of the literature. We wonder whether some part of the perception is due to occasional programs that interact pathologically with common allocator designs, in ways that have never been observed by researchers. This does not seem unlikely, because most experiments have used non-representative workloads, which are extremely unlikely to generate the same problematic request patterns as real programs. Sound studies using realistic workloads are too rare. The total number of real, nontrivial programs that have been used for good experiments is very small, apparently less than 20. A significant number of real programs could exhibit problematic behavior patterns that are simply not represented in studies to date. Long-running processes such as operating systems, interactive programming environments, and networked servers may pose special problems that have not been addressed.
Most experiments to date have studied programs that execute for a few minutes (at most) on common workstations. Little is known about what happens when programs run for hours, days, weeks or months. It may well be that some seemingly good allocators do not work well in the long run, with their memory efficiency slowly degrading until they perform quite badly. We don't know--and we're fairly sure that nobody knows. Given that long-running processes are often the most important ones, and are increasingly important with the spread of client/server computing, this is a potentially large problem. The worst case performance of any general allocator amounts to complete failure due to memory exhaustion or virtual memory thrashing (Section 1.2). This means that any real allocator may have lurking "bugs" and fail unexpectedly for seemingly reasonable inputs. Such problems may be hidden, because most programmers who encounter severe problems may simply code around them using ad hoc storage management techniques--or, as is still painfully common, by statically allocating "enough" memory for variable-sized structures. These ad-hoc approaches to storage management lead to "brittle" software with hidden limitations (e.g., due to the use
of fixed-size arrays). The impact on software clarity, flexibility, maintainability, and reliability is quite important, but difficult to estimate. These hidden costs should not be underestimated, however, because they can lead to major penalties in productivity and to significant human costs in sheer frustration, anxiety, and general suffering. A much larger and broader set of test applications and experiments is needed before we have any assurance that any allocator works reliably--in a crucial performance sense--much less works well. Given this caveat, however, it appears that some allocators are clearly better than others in most cases, and this paper will attempt to explain the differences.

1.2 What an Allocator Must Do

An allocator must keep track of which parts of memory are in use, and which parts are free. The goal of allocator design is usually to minimize wasted space without undue time cost, or vice versa. The ideal allocator would spend negligible time managing memory, and waste negligible space.

A conventional allocator cannot control the number or size of live blocks--they are entirely up to the program requesting and releasing the space managed by the allocator. A conventional allocator also cannot compact memory, moving blocks around to make them contiguous and make free memory contiguous. It must respond immediately to a request for space, and once it has decided which block of memory to allocate, it cannot change that decision--that block of memory must be regarded as inviolable until the application program chooses to free it.10 It can only deal with memory that is free, and only choose where in free memory to allocate the next requested block. (Allocators record the locations and sizes of free blocks of memory in some kind of hidden data structure, which may be a linear list, a totally or partially ordered tree, a bitmap, or some hybrid data structure.)
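To make the allocator's bookkeeping role concrete, here is a minimal sketch of this interface. It is not from the paper, all names are our own, and it deliberately sidesteps the policy questions discussed later by using a simple first-fit scan; it only illustrates that the allocator tracks free ranges, carves allocated blocks out of them, and never moves a block once placed:

```python
class ToyAllocator:
    """Illustrative bookkeeping only: tracks which ranges of a fixed
    heap are free, without ever relocating allocated blocks."""

    def __init__(self, heap_size):
        # One free block covering the whole heap: list of (offset, size).
        self.free_blocks = [(0, heap_size)]
        self.allocated = {}  # offset -> size, so free() knows block sizes

    def malloc(self, size):
        # First-fit scan (placement policy is a separate question;
        # first fit is used here purely for brevity).
        for i, (off, blk) in enumerate(self.free_blocks):
            if blk >= size:
                if blk == size:
                    del self.free_blocks[i]
                else:
                    # Split: the remainder stays on the free list.
                    self.free_blocks[i] = (off + size, blk - size)
                self.allocated[off] = size
                return off
        return None  # exhausted (or too fragmented to satisfy the request)

    def free(self, off):
        size = self.allocated.pop(off)
        # No coalescing in this sketch; adjacent free blocks stay separate.
        self.free_blocks.append((off, size))
```

Note that `free` here simply returns the block to the free list; the splitting and coalescing techniques discussed in Section 1.3 address what a real allocator does with such returned blocks.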
An allocator is therefore an online algorithm, which must respond to requests in strict sequence, immediately, and its decisions are irrevocable. The problem the allocator must address is that the application program may free blocks in any order, creating "holes" amid live objects. If these holes are too numerous and small, they cannot be used to satisfy future requests for larger blocks. This problem is known as fragmentation, and it is a potentially disastrous one. For the general case that we have outlined--where the application program may allocate arbitrary-sized objects at arbitrary times and free them at any later time--there is no reliable algorithm for ensuring efficient memory usage, and none is possible. It has been proven that for any possible allocation algorithm, there will always be the possibility that some application program will allocate and deallocate blocks in some fashion that defeats the allocator's strategy, and forces it into severe fragmentation [Rob71, GGU72, Rob74, Rob77]. Not only are 10 We use the term "application" rather generally; the "application" for which an allocator manages storage may be a system program such as a file server, or even an operating system kernel.
there no provably good allocation algorithms, there are proofs that any allocator will be "bad" for some possible applications.

The lower bound on worst-case fragmentation is generally proportional to the amount of live data11 multiplied by the logarithm of the ratio between the largest and smallest block sizes, i.e., M log2 n, where M is the amount of live data and n is the ratio between the largest and smallest object sizes [Rob71]. (In discussing worst-case memory costs, we generally assume that all block sizes are evenly divisible by the smallest block size, and n is sometimes simply called "the largest block size," i.e., in units of the smallest.) Of course, for some algorithms, the worst case is much worse, often proportional to the simple product of M and n.

So, for example, if the minimum and maximum object sizes are one word and a million words, then fragmentation in the worst case may cost an excellent allocator a factor of ten or twenty in space. A less robust allocator may lose a factor of a million, in its worst case, wasting so much space that failure is almost certain.

Given the apparent insolubility of this problem, it may seem surprising that dynamic memory allocation is used in most systems, and the computing world does not grind to a halt due to lack of memory. The reason, of course, is that there are allocators that are fairly good in practice, in combination with most actual programs. Some allocation algorithms have been shown in practice to work acceptably well with real programs, and have been widely adopted. If a particular program interacts badly with a particular allocator, a different allocator may be used instead. (The bad cases for one allocator may be very different from the bad cases for other allocators of different design.) The design of memory allocators is currently something of a black art.
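To put rough numbers on the worst-case bounds discussed above, a quick calculation (ours, purely illustrative; the figure for M is arbitrary) with a size ratio of n = 10^6, matching the one-word-to-a-million-words example:

```python
import math

M = 10_000_000   # words of live data (arbitrary illustrative figure)
n = 1_000_000    # ratio of largest to smallest block size

# A robust allocator's worst case grows like M * log2(n)...
robust_worst = M * math.log2(n)    # log2(10^6) is about 19.9, so a
                                   # factor-of-twenty blow-up at worst
# ...while a fragile allocator's worst case grows like M * n.
fragile_worst = M * n              # a factor-of-a-million blow-up
```

The log2(n) factor of roughly 19.9 is where the paper's "factor of ten or twenty" comes from; the M * n product is the "factor of a million."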
Little is known about the interactions between programs and allocators, and which programs are likely to bring out the worst in which allocators. However, one thing is clear--most programs are "well behaved" in some sense. Most programs combined with most common allocators do not squander huge amounts of memory, even if they may waste a quarter of it, or a half, or occasionally even more. That is, there are regularities in program behavior that allocators exploit, a point that is often insufficiently appreciated even by professionals who design and implement allocators. These regularities are exploited by allocators to prevent excessive fragmentation, and make it possible for allocators to work in practice. These regularities are surprisingly poorly understood, despite 35 years of allocator research, and scores of papers by dozens of researchers. 1.3 Strategies, Placement Policies, and Splitting and Coalescing The main technique used by allocators to keep fragmentation under control is placement choice. Two subsidiary techniques are used to help implement that 11 We use "live" here in a different sense from that used in garbage collection or in compiler flow analysis. Blocks are "live" from the point of view of the allocator if it doesn't know that it can safely reuse the storage--i.e., if the block was allocated but not yet freed.
choice: splitting blocks to satisfy smaller requests, and coalescing of free blocks to yield larger blocks.

Placement choice is simply the choosing of where in free memory to put a requested block. Despite potentially fatal restrictions on an allocator's online choices, the allocator also has a huge freedom of action--it can place a requested block anywhere it can find a sufficiently large range of free memory, and anywhere within that range. (It may also be able to simply request more memory from the operating system.) An allocator algorithm therefore should be regarded as the mechanism that implements a placement policy, which is motivated by a strategy for minimizing fragmentation.

Strategy, policy, and mechanism. The strategy takes into account regularities in program behavior, and determines a range of acceptable policies as to where to allocate requested blocks. The chosen policy is implemented by a mechanism, which is a set of algorithms and the data structures they use. This three-level distinction is quite important. In the context of general memory allocation,

- a strategy attempts to exploit regularities in the request stream,
- a policy is an implementable decision procedure for placing blocks in memory, and
- a mechanism is a set of algorithms and data structures that implement the policy, often over-simply called "an algorithm."12

An ideal strategy is "put blocks where they won't cause fragmentation later"; unfortunately that's impossible to guarantee, so real strategies attempt to heuristically approximate that ideal, based on assumed regularities of application programs' behavior. For example, one strategy is "avoid letting small long-lived

12 This set of distinctions is doubtless indirectly influenced by work in very different areas, notably Marr's work in natural and artificial visual systems [Mar82] and McClamrock's work in the philosophy of science and cognition [McC91, McC95].
The distinctions are important for understanding a wide variety of complex systems, however. Similar distinctions are made in many fields, including empirical computer science, though often without making them quite clear. In "systems" work, mechanism and policy are often distinguished, but strategy and policy are usually not distinguished explicitly. This makes sense in some contexts, where the policy can safely be assumed to implement a well-understood strategy, or where the choice of strategy is left up to someone else (e.g., designers of higher-level code not under discussion). In empirical evaluations of very poorly understood strategies, however, the distinction between strategy and policy is often crucial. (For example, errors in the implementation of a strategy are often misinterpreted as evidence that the expected regularities don't actually exist, when in fact they do, and a slightly different strategy would work much better.) Mistakes are possible at each level; equally important, mistakes are possible between levels, in the attempt to "cash out" (implement) the higher-level strategy as a policy, or a policy as a mechanism.
objects prevent you from reclaiming a larger contiguous free area." This is part of the strategy underlying the common "best fit" family of policies. Another part of the strategy is "if you have to split a block and potentially waste what's left over, minimize the size of the wasted part." The corresponding (best fit) policy is more concrete--it says "always use the smallest block that is at least large enough to satisfy the request."

The placement policy determines exactly where in memory requested blocks will be allocated. For the best fit policies, the general rule is "allocate objects in the smallest free block that's at least big enough to hold them." That's not a complete policy, however, because there may be several equally good fits; the complete policy must specify which of those should be chosen, for example, the one whose address is lowest.

The chosen policy is implemented by a specific mechanism, chosen to implement that policy efficiently in terms of time and space overheads. For best fit, a linear list or ordered tree structure might be used to record the addresses and sizes of free blocks, and a tree search or list search would be used to find the one dictated by the policy.

These levels of the allocator design process interact. A strategy may not yield an obvious complete policy, and the seemingly slight differences between similar policies may actually implement interestingly different strategies. (This results from our poor understanding of the interactions between application behavior and allocator strategies.) The chosen policy may not be obviously implementable at reasonable cost in space, time, or programmer effort; in that case some approximation may be used instead.
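As an illustration of how a complete placement policy can be stated precisely, here is a small sketch (our own code, with hypothetical names) of the best-fit rule over a free list of (address, size) pairs, with the lowest-address tie-break mentioned above. A naive linear search like this is one possible mechanism; a size-ordered tree would implement the same policy more efficiently:

```python
def best_fit(free_blocks, size):
    """Best-fit placement: among free blocks large enough for the
    request, pick the smallest; break ties by lowest address."""
    candidates = [(off, blk) for off, blk in free_blocks if blk >= size]
    if not candidates:
        return None  # external fragmentation: no block is big enough
    # Order by (block size, address): tightest fit wins, and among
    # equally tight fits, the lowest address wins.
    return min(candidates, key=lambda c: (c[1], c[0]))
```

For example, given free blocks of sizes 64, 16, 16, and 8, a request for 12 words selects the 16-word block at the lower address, not the 64-word block a first-fit scan might return.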
The strategy and policy are often very poorly-defined, as well, and the policy and mechanism are arrived at by a combination of educated guessing, trial and error, and (often dubious) experimental validation. In case the important distinctions between strategy, policy, and mechanism are not clear, a metaphorical example may help. Consider a software company that has a strategy for improving productivity: rewarding the most productive programmers. It may institute a policy of rewarding programmers who produce the largest numbers of lines of program code. To implement this policy, it may use the mechanisms of instructing the managers to count lines of code, and providing scripts that count lines of code according to some particular algorithm. This example illustrates the possible failures at each level, and in the mapping from one level to another. The strategy may simply be wrong, if programmers aren't particularly motivated by money. The policy may not implement the intended strategy, if lines of code are an inappropriate metric of productivity, or if the policy has unintended "strategic" effects, e.g., due to programmer resentment. The mechanism may also fail to implement the specified policy, if the rules for line-counting aren't enforced by managers, or if the supplied scripts don't correctly implement the intended counting function. This distinction between strategy and policy is oversimplified, because there may be multiple levels of strategy that shade off into increasingly concrete policies. At different levels of abstraction, something might be viewed as a strategy or policy. The key point is that there are at least three qualitatively different kinds of levels
Splitting and coalescing. Two general techniques for supporting a range of (implementations of) placement policies are splitting and coalescing of free blocks. (These mechanisms are important subsidiary parts of the larger mechanism that is the allocator implementation.)

The allocator may split large blocks into smaller blocks arbitrarily, and use any sufficiently-large subblock to satisfy the request. The remainders from this splitting can be recorded as smaller free blocks in their own right and used to satisfy future requests.

The allocator may also coalesce (merge) adjacent free blocks to yield larger free blocks. After a block is freed, the allocator may check to see whether the neighboring blocks are free as well, and merge them into a single, larger block. This is often desirable, because one large block is more likely to be useful than two small ones--large or small requests can be satisfied from large blocks.

Completely general splitting and coalescing can be supported at fairly modest cost in space and/or time, using simple mechanisms that we'll describe later. This allows the allocator designer the maximum freedom in choosing a strategy, policy, and mechanism for the allocator, because the allocator can have a complete and accurate record of which ranges of memory are available at all times. The cost may not be negligible, however, especially if splitting and coalescing work too well--in that case, freed blocks will usually be coalesced with neighbors to form large blocks of free memory, and later allocations will have to split smaller chunks off of those blocks to obtain the desired sizes. It often turns out that most of this effort is wasted, because the sizes requested later are largely the same as the sizes freed earlier, and the old small blocks could have been reused without coalescing and splitting.
Because of this, many modern allocators use deferred coalescing--they avoid coalescing and splitting most of the time, but use them intermittently to combat fragmentation.

(Continuing the discussion of levels of abstraction above [McC91]: at the upper levels, there is the general design goal of exploiting expected regularities, and a set of strategies for doing so; there may be subsidiary strategies, for example to resolve conflicts between strategies in the best possible way. At a somewhat lower level there is a general policy of where to place objects, and below that is a more detailed policy that exactly determines placement. Below that there is an actual mechanism that is intended to implement the policy (and presumably effect the strategy), using whatever algorithms and data structures are deemed appropriate. Mechanisms are often layered, as well, in the usual manner of structured programming [Dij69]. Problems at (and between) these levels are the best understood--an algorithm may not implement its specification, or may be improperly specified. Analogous problems occur at the upper levels as well--if expected regularities don't actually occur, or if they do occur but the strategy doesn't actually exploit them, and so on.)

2 A Closer Look at Fragmentation, and How to Study It

In this section, we will discuss the traditional conception of fragmentation, and the usual techniques used for studying it. We will then explain why the usual
understanding is not strong enough to support scientific design and evaluation of allocators. We then propose a new (though nearly obvious) conception of fragmentation and its causes, and describe more suitable techniques used to study it. (Most of the experiments using sound techniques have been performed in the last few years, but a few notable exceptions were done much earlier, e.g., [MPS71] and [LH82], discussed in Section 4.)

2.1 Internal and External Fragmentation

Traditionally, fragmentation is classed as external or internal [Ran69], and is combated by splitting and coalescing free blocks.

External fragmentation arises when free blocks of memory are available for allocation, but can't be used to hold objects of the sizes actually requested by a program. In sophisticated allocators, that's usually because the free blocks are too small, and the program requests larger objects. In some simple allocators, external fragmentation can occur because the allocator is unwilling or unable to split large blocks into smaller ones.

Internal fragmentation arises when a large-enough free block is allocated to hold an object, but there is a poor fit because the block is larger than needed. In some allocators, the remainder is simply wasted, causing internal fragmentation. (It's called internal because the wasted memory is inside an allocated block, rather than being recorded as a free block in its own right.)

To combat internal fragmentation, most allocators will split blocks into multiple parts, allocating part of a block, and then regarding the remainder as a smaller free block in its own right. Many allocators will also coalesce adjacent free blocks (i.e., neighboring free blocks in address order), combining them into larger blocks that can be used to satisfy requests for larger objects.
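The distinction can be made concrete with a toy calculation (our own illustration; the block sizes are arbitrary):

```python
# Toy illustration (ours, simplified) of the two kinds of fragmentation.
# Internal: memory wasted *inside* allocated blocks (block bigger than request).
# External: free memory exists, but no single free block fits a request.

allocated = [(16, 10), (32, 32), (64, 40)]   # (block size, requested size)
free_blocks = [8, 8, 12]                     # sizes of the free blocks

internal = sum(block - req for block, req in allocated)

request = 24                                 # a pending allocation request
total_free = sum(free_blocks)
fits = any(b >= request for b in free_blocks)
externally_fragmented = total_free >= request and not fits

print(internal)                # 30 units lost inside allocated blocks
print(externally_fragmented)   # True: 28 units free, but largest block is 12
```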
In some allocators, internal fragmentation arises due to implementation constraints within the allocator--for speed or simplicity reasons, the allocator design restricts the ways memory may be subdivided. In other allocators, internal fragmentation may be accepted as part of a strategy to prevent external fragmentation--the allocator may be unwilling to fragment a block, because if it does, it may not be able to coalesce it again later and use it to hold another large object.

2.2 The Traditional Methodology: Probabilistic Analyses, and Simulation Using Synthetic Traces

(Note: readers who are uninterested in experimental methodology may wish to skip this section, at least on a first reading. Readers uninterested in the history of allocator research may skip the footnotes. The following section (2.3) is quite important, however, and should not be skipped.)

Allocators are sometimes evaluated using probabilistic analyses. By reasoning about the likelihood of certain events, and the consequences of those events for future events, it may be possible to predict what will happen on average. For the
general problem of dynamic storage allocation, however, the mathematics are too difficult to do this for most algorithms and most workloads. An alternative is to do simulations, and find out "empirically" what really happens when workloads interact with allocator policies. This is more common, because the interactions are so poorly understood that mathematical techniques are difficult to apply.

Unfortunately, in both cases, to make probabilistic techniques feasible, important characteristics of the workload must be known--i.e., the probabilities of relevant characteristics of "input" events to the allocation routine. The relevant characteristics are not understood, and so the probabilities are simply unknown. This is one of the major points of this paper.

The paradigm of statistical mechanics has been used in theories of memory allocation, but we believe that it is the wrong paradigm, at least as it is usually applied. Strong assumptions are made that frequencies of individual events (e.g., allocations and deallocations) are the base statistics from which probabilistic models should be developed, and we think that this is false. The great success of statistical mechanics in other areas is due to the fact that such assumptions make sense there. Gas laws are pretty good idealizations, because the aggregate effects of a very large number of individual events (e.g., collisions between molecules) do concisely express the most important regularities.

This paradigm is inappropriate for memory allocation, for two reasons. The first is simply that the number of objects involved is usually too small for asymptotic analyses to be relevant, but this is not the most important reason. The main weakness of the statistical mechanics approach is that there are important systematic interactions that occur in memory allocation, due to the phase behavior of programs.
No matter how large the system is, basing probabilistic analyses on individual events is likely to yield the wrong answers if there are systematic effects involved which are not captured by the theory. Assuming that the analyses are appropriate for "sufficiently large" systems does not help here--the systematic errors will simply attain greater statistical significance.

Consider the case of evolutionary biology. If an overly simple statistical approach to individual animals' interactions is used, the theory will not capture predator/prey and host/symbiote relationships, sexual selection, or other pervasive evolutionary effects such as niche filling.[14] Developing a highly predictive evolutionary theory is extremely difficult--and some would say impossible--because too many low-level details matter,[15] and there may be intrinsic unpredictabilities in the systems described.[16]

We are not saying that the development of a good theory of memory allocation is as hard as developing a predictive evolutionary theory--far from it. The

[14] Some of these effects may emerge from lower-level modeling, but for simulations to reliably predict them, many important lower-level issues must be modeled correctly, and sufficient data are usually not available, or sufficiently understood.

[15] For example, the different evolutionary strategies implied by the varying replication techniques and mutation rates of RNA-based vs. DNA-based viruses.

[16] For example, a single mutation that results in an adaptive characteristic in one individual may have a major impact on the subsequent evolution of a species and its entire ecosystem.
problem of memory allocation seems far simpler, and we are optimistic that a useful predictive theory can be developed. Our point is simply that the paradigm of simple statistical mechanics must be evaluated relative to other alternatives, which we find more plausible in this domain.

There are major interactions between workloads and allocator policies, which are usually ignored. No matter how large the system, and no matter how asymptotic the analyses, ignoring these effects seems likely to yield major errors--e.g., analyses will simply yield the wrong asymptotes.

A useful probabilistic theory of memory allocation may be possible, but if so, it will be based on a quite different set of statistics from those used so far--statistics which capture the effects of systematicities, rather than assuming such systematicities can be ignored. As in biology, the theory must be tested against reality, and refined to capture systematicities that had previously gone unnoticed.

Random simulations. The traditional technique for evaluating allocators is to construct several traces (recorded sequences of allocation and deallocation requests) thought to resemble "typical" workloads, and use those traces to drive a variety of actual allocators. Since an allocator normally responds only to the request sequence, this can produce very accurate simulations of what the allocator would do if the workload were real--that is, if a real program generated that request sequence.

Typically, however, the request sequences are not real traces of the behavior of actual programs. They are "synthetic" traces that are generated automatically by a small subprogram; the subprogram is designed to resemble real programs in certain statistical ways. In particular, object size distributions are thought to be important, because they affect the fragmentation of memory into blocks of varying sizes.
Object lifetime distributions are also often thought to be important (but not always), because they affect whether blocks of memory are occupied or free.

Given a set of object size and lifetime distributions, the small "driver" subprogram generates a sequence of requests that obeys those distributions. This driver is simply a loop that repeatedly generates requests, using a pseudo-random number generator; at any point in the simulation, the next data object is chosen by "randomly" picking a size and lifetime, with a bias that (probabilistically) preserves the desired distributions. The driver also maintains a table of objects that have been allocated but not yet freed, ordered by their scheduled death (deallocation) time. (That is, the step at which they were allocated, plus their randomly-chosen lifetime.) At each step of the simulation, the driver deallocates any objects whose death times indicate that they have expired. One convenient measure of simulated "time" is the volume of objects allocated so far--i.e., the sum of the sizes of objects that have been allocated up to that step of the simulation.[17]

[17] In many early simulations, the simulator modeled real time, rather than just discrete steps of allocation and deallocation. Allocation times were chosen based on
An important feature of these simulations is that they tend to reach a "steady state." After running for a certain amount of time, the volume of live (simulated) objects reaches a level that is determined by the size and lifetime distributions, and after that objects are allocated and deallocated in approximately equal numbers. The memory usage tends to vary very little, wandering probabilistically (in a random walk) around this "most likely" level. Measurements are typically made by sampling memory usage at points after the steady state has presumably been reached, or by averaging over a period of "steady-state" variation. These measurements "at equilibrium" are assumed to be important.

There are three common variations of this simulation technique. One is to use a simple mathematical function to determine the size and lifetime distributions, such as uniform or (negative) exponential. Exponential distributions are often used because it has been observed that programs are typically more likely to allocate small objects than large ones,[18] and are more likely to allocate short-lived objects than long-lived ones.[19] (The size distributions are generally truncated at some plausible minimum and maximum object size, and discretized, rounding them to the nearest integer.)

The second variation is to pick distributions intuitively, i.e., out of a hat, but in ways thought to resemble real program behavior. One motivation for this is to model the fact that many programs allocate objects of some sizes in large numbers, and objects of other sizes in small numbers or not at all; we refer to these distributions as "spiky."[20]

The third variation is to use statistics gathered from real programs, to make the distributions more realistic. In almost all cases, size and lifetime distributions are assumed to be independent--the fact that different sizes of objects may have different lifetime distributions is generally assumed to be unimportant.
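The driver loop described above fits in a few lines. The sketch below is our own illustration; the particular distributions (uniform sizes, exponential lifetimes measured in allocation steps) are placeholder assumptions, not taken from any study. Run long enough, it exhibits the steady-state behavior discussed above:

```python
import random

# Sketch of a synthetic-trace driver of the kind described in the text.
# Sizes and lifetimes are drawn independently from fixed distributions;
# the specific choices here are illustrative assumptions only.

random.seed(42)

def synthetic_trace(steps):
    deaths = {}            # death step -> list of sizes (the "death table")
    live_volume = 0
    history = []
    for step in range(steps):
        # deallocate objects whose scheduled death time has arrived
        for size in deaths.pop(step, []):
            live_volume -= size
        # allocate one new object with a random size and lifetime
        size = random.randint(1, 100)
        lifetime = 1 + int(random.expovariate(1 / 50))   # mean ~50 steps
        deaths.setdefault(step + lifetime, []).append(size)
        live_volume += size
        history.append(live_volume)
    return history

history = synthetic_trace(20000)

# Live volume climbs at first, then wanders around a "steady state" level
# determined by the distributions (roughly mean size times mean lifetime).
early = sum(history[:100]) / 100
late = sum(history[-5000:]) / 5000
print(early < late)        # True: still ramping up during the first steps
```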
In general, there has been something of a trend toward the use of more realistic distributions,[21] but this trend is not dominant. Even now, researchers often use simple and smooth mathematical functions to generate traces for allocator evaluation.[22]

The use of smooth distributions is questionable, because it bears directly on issues of fragmentation--if objects of only a few sizes are allocated, the free (and uncoalescable) blocks are likely to be of those sizes, making it possible to find a perfect fit. If the object sizes are smoothly distributed, the requested sizes will almost always be slightly different, increasing the chances of fragmentation.

Probabilistic analyses. Since Knuth's derivation of the "fifty percent rule" [Knu73] (discussed later, in Section 4), there have been many attempts to reason probabilistically about the interactions between program behavior and allocator policy, and to assess the overall cost in terms of fragmentation (usually) and/or CPU time.

These analyses have generally made the same assumptions as random-trace simulation experiments--e.g., random object allocation order, independence of sizes and lifetimes, steady-state behavior--and often stronger assumptions as well. These simplifying assumptions have generally been made in order to make the mathematics tractable. In particular, assumptions of randomness and independence make it possible to apply the well-developed theory of stochastic processes (Markov models, etc.) to derive analytical results about expected behavior.

Unfortunately, these assumptions tend to be false for most real programs, so the results are of limited utility. It should be noted that these are not merely convenient simplifying assumptions that allow solution of problems that closely resemble real problems. If that were the case, one could expect that, with refinement of the analyses--or with sufficient empirical validation that the assumptions don't matter in practice--the results would come close to reality. There is no reason to expect such a happy outcome.

[17, continued] randomly chosen "arrival" times, generated using an "interarrival distribution," and their deaths were scheduled in continuous time, rather than in discrete time based on the number and/or sizes of objects allocated so far. We will generally ignore this distinction in this paper, because we think other issues are more important. As will become clear, in the methodology we favor, this distinction is not important, because the actual sequences of actions are sufficient to guarantee exact simulation, and the actual sequence of events is recorded rather than being (approximately) emulated.

[18] Historically, uniform size distributions were the most common in early experiments; exponential distributions then became increasingly common, as new data became available showing that real systems generally used many more small objects than large ones. Other distributions have also been used, notably Poisson and hyperexponential. Still, relatively recent papers have used uniform size distributions, sometimes as the only distribution.

[19] As with size distributions, there has been a shift over time toward non-uniform lifetime distributions, often exponential. This shift occurred later, probably because real data on size information was easier to obtain, and lifetime data appeared later.

[20] In general, this modeling has not been very precise. Sometimes the sizes chosen out of a hat are allocated in uniform proportions, rather than in skewed proportions reflecting the fact that (on average) programs allocate many more small objects than large ones.
[21] The trend toward more realistic distributions can be explained historically and pragmatically. In the early days of computing, the distributions of interest were usually the distribution of segment sizes in an operating system's workload. Without access to the inside of an operating system, this data was difficult to obtain. (Most researchers would not have been allowed to modify the implementation of the operating system running on a very valuable and heavily-timeshared computer.) Later, the emphasis of study shifted away from segment sizes in segmented operating systems, and toward data object sizes in the virtual memories of individual processes running in paged virtual memories.

[22] We are unclear on why this should be, except that a particular theoretical and experimental paradigm [Kuh70] had simply become thoroughly entrenched by the early 1970's. (It's also somewhat easier than dealing with real data.)

These assumptions dramatically change the key features of the problem; the ability to perform the analyses hinges on the very facts that make them much less relevant to the general problem of memory allocation. Assumptions of randomness and independence make the problem irregular, in a superficial sense, but they make it very smooth (hence mathematically tractable) in a probabilistic sense. This smoothness has the advantage that it makes it possible to derive analytical results, but it has the disadvantage that it turns a real and deep scientific problem into a mathematical puzzle that is much less significant for our purposes.

The problem of dynamic storage allocation is intractable, in the vernacular sense of the word. As an essentially data-dependent problem, we do not have a grip on it, because we simply do not understand the inputs. "Smoothing" the problem to make it mathematically tractable "removes the handles" from something that is fundamentally irregular, making it unlikely that we will get any real purchase or leverage on the important issues. Removing the irregularities removes some of the problems--and most of the opportunities as well.

A note on exponentially-distributed random lifetimes. Exponential lifetime distributions have become quite common in both empirical and analytic studies of memory fragmentation over the last two decades. In the case of empirical work (using random-trace simulations), this seems an admirable adjustment to some observed characteristics of real program behavior. In the case of analytic studies, it turns out to have some very convenient mathematical properties as well. Unfortunately, it appears that the apparently exponential appearance of real lifetime distributions is often an artifact of experimental methodology (as will be explained in Sections 2.3 and 4.1), and that the emphasis on distributions tends to distract researchers from the strongly patterned underlying processes that actually generate them (as will be explained in Section 2.4).

We invite the reader to consider a randomly-ordered trace with an exponential lifetime distribution.
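Such a trace is easy to examine directly. The following small experiment (our own illustration) draws exponential lifetimes and shows the "half-life" decay property: among surviving objects, the mean remaining lifetime is about the same regardless of how long they have already lived.

```python
import random

# Demonstration (ours) of the memoryless property of exponential lifetimes:
# an object's age tells the allocator nothing about how much longer it
# will live, so age is useless as a predictor of death time.

random.seed(1)
mean_lifetime = 100.0
lifetimes = [random.expovariate(1 / mean_lifetime) for _ in range(200_000)]

def mean_remaining(age):
    """Average remaining lifetime among objects that survived past `age`."""
    survivors = [t - age for t in lifetimes if t > age]
    return sum(survivors) / len(survivors)

# All of these come out near 100, independent of age.
for age in (0, 50, 100, 200):
    print(round(mean_remaining(age)))
```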
In this case there is no correlation at all between an object's age and its expected time until death--the "half-life" decay property of the distribution and the randomness ensure that allocated objects die completely at random, with no way to estimate their death times from any of the information available to the allocator.[23] (An exponential random function exhibits only a half-life property, and no other pattern, much like radioactive decay.)

In a sense, exponential lifetimes are thus the reductio ad absurdum of the synthetic trace methodology--all of the time-varying regularities have been systematically eliminated from the input. If we view the allocator's job as an online problem of detecting and exploiting regularities, we see that this puts the allocator in the awkward position of trying to extract helpful hints from pure noise. This does not necessarily mean that all allocators will perform identically under randomized workloads, however, because there are regularities in size distributions, whether they are real distributions or simple mathematical ones, and some allocators may simply shoot themselves in the foot.

Analyses and experiments with exponentially distributed random lifetimes may say something revealing about what happens when an allocator's strategy is completely orthogonal to the actual regularities. We have no real idea whether

[23] We are indebted to Henry Baker, who has made quite similar observations with respect to the use of exponential lifetime distributions to estimate the effectiveness of generational garbage collection schemes [Bak93].
this is a situation that occurs regularly in the space of possible combinations of real workloads and reasonable strategies. (It's clear that it is not the usual case, however.) The terrain of that space is quite mysterious to us.

A note on Markov models. Many probabilistic studies of memory allocation have used first-order Markov processes to approximate program and allocator behavior, and have derived conclusions based on the well-understood properties of Markov models. In a first-order Markov model, the probabilities of state transitions are known and fixed. In the case of fragmentation studies, this corresponds to assuming that a program allocates objects at random, with fixed probabilities of allocating different sizes.

The space of possible states of memory is viewed as a graph, with a node for each configuration. There is a start state, representing an empty memory, and a transition probability for each possible allocation size. For a given placement policy, there will be a known transition from a given state for any possible allocation or deallocation request. The state reached by each possible allocation is another configuration of memory. For any given request distribution, there is a network of possible states reachable from the start state, via successions of more or less probable transitions.

In general, for any memory above a very, very small size, and for arbitrary distributions of sizes and lifetimes, this network is inconceivably large. As described so far, it is therefore useless for any practical analyses. To make the problem more tractable, certain assumptions are often made. One of these is that lifetimes are exponentially distributed as well as random, and have the convenient half-life property described above, i.e., objects die completely at random as well as being born at random. This assumption can be used to ensure that both the states and the transitions between states have definite probabilities in the long run.
That is, if one were to run a random-trace simulation for a long enough period of time, all reachable states would be reached, and all of them would be reached many times--and the number of times they were reached would reflect the probabilities of their being reached again in the future, if the simulation were continued indefinitely. If we put a counter on each of the states to keep track of the number of times each state was reached, the ratio between these counts would eventually stabilize, plus or minus small short-term variations. The relative weights of the counters would "converge" to a stable solution.

Such a network of states is called an ergodic Markov model, and it has very convenient mathematical properties. In some cases, it's possible to avoid running a simulation at all, and analytically derive what the network's probabilities would converge to.

Unfortunately, this is a very inappropriate model for real program and allocator behavior. An ergodic Markov model is a kind of (probabilistic) finite automaton, and as such the patterns it generates are very, very simple, though randomized and hence unpredictable. They're almost unpatterned, in fact, and hence very predictable in a certain probabilistic sense.
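The convergence property is easy to illustrate on a toy chain (our own; the three states are abstract, not actual memory configurations): long-run visit frequencies match the distribution that can be derived analytically, without running the simulation at all.

```python
import random

# Toy ergodic Markov chain (ours; unrelated to any real allocator)
# illustrating convergence: long-run visit frequencies stabilize to a
# stationary distribution that can also be derived analytically.

P = [[0.9, 0.1, 0.0],    # transition probabilities among 3 states
     [0.5, 0.0, 0.5],
     [0.0, 0.3, 0.7]]

# Analytical route: repeatedly apply P to any starting distribution
# (power iteration) until it stops changing.
dist = [1.0, 0.0, 0.0]
for _ in range(1000):
    dist = [sum(dist[i] * P[i][j] for i in range(3)) for j in range(3)]

# Empirical route: run the chain and count state visits.
random.seed(0)
counts = [0, 0, 0]
state = 0
for _ in range(200_000):
    counts[state] += 1
    state = random.choices(range(3), weights=P[state])[0]
freqs = [c / sum(counts) for c in counts]

# The two agree: the chain is ergodic, so visit frequencies converge.
print([round(d, 2) for d in dist])
print([round(f, 2) for f in freqs])
```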
Such an automaton is extremely unlikely to generate many patterns that seem likely to be important in real programs, such as the creation of the objects in a linked list in one order, and their later destruction in exactly the same order, or exactly the reverse order.[24]

There are much more powerful kinds of machines--which have more complex state, like a real program--which are capable of generating more realistic patterns. Unfortunately, the only machines that we are sure generate the "right kinds" of patterns are actual real programs. We do not understand what regularities exist in real programs well enough to model them formally and perform probabilistic analyses that are directly applicable to real program behavior. The models we have are grossly inaccurate in respects that are quite relevant to problems of memory allocation.

There are problems for which Markov models are useful, and a smaller number of problems where assumptions of ergodicity are appropriate. These problems involve processes that are literally random, or can be shown to be effectively random in the necessary ways. The general heap allocation problem is not in either category. (If this is not clear, the next section should make it much clearer.)

Ergodic Markov models are also sometimes used for problems where the basic assumptions are known to be false in some cases--but they should only be used in this way if they can be validated, i.e., shown by extensive testing to produce the right answers most of the time, despite the oversimplifications they're based on. For some problems it "just turns out" that the differences between real systems and the mathematical models are not usually significant.
For the general problem of memory allocation, this turns out to be false as well--recent results clearly invalidate the use of simple Markov models [ZG94, WJNB95].

[24] Technically, a Markov model will eventually generate such patterns, but the probability of generating a particular pattern within a finite period of time is vanishingly small if the pattern is large and not very strongly reflected in the arc weights. That is, many quite probable kinds of patterns are extremely improbable in a simple Markov model.

[25] It might seem that the problem here is the use of first-order Markov models, whose states (nodes in the reachability graph) correspond directly to states of memory, and that perhaps "higher-order" Markov models would work, where nodes in the graph represent sequences of concrete state transitions. However, we do not believe these higher-order models will work any better than first-order models do. The important kinds of patterns produced by real programs are generally not simple very-short-term sequences of a few events, but large-scale patterns involving many events. To capture these, a Markov model would have to be of such high order that analyses would be completely infeasible. It would essentially have to be pre-programmed to generate specific literal sequences of events. This not only begs the essential question of what real programs do, but seems certain not to concisely capture the right regularities. Markov models are simply not powerful enough--i.e., not abstract enough in the right ways--to help with this problem. They should not be used for this purpose, or for any similarly poorly understood purpose where complex patterns may be very important. (At least, not without extensive validation.) The fact that the regularities are complex and unknown is not a good reason to assume that they're effectively random [ZG94, WJNB95] (Section 4.2).
Compact Representations and Approximations for Compuation in Games Kevin Swersky April 23, 2008 Abstract Compact representations have recently been developed as a way of both encoding the strategic interactions
More informationComparing Alternate Designs For A Multi-Domain Cluster Sample
Comparing Alternate Designs For A Multi-Domain Cluster Sample Pedro J. Saavedra, Mareena McKinley Wright and Joseph P. Riley Mareena McKinley Wright, ORC Macro, 11785 Beltsville Dr., Calverton, MD 20705
More informationProcess Intelligence: An Exciting New Frontier for Business Intelligence
February/2014 Process Intelligence: An Exciting New Frontier for Business Intelligence Claudia Imhoff, Ph.D. Sponsored by Altosoft, A Kofax Company Table of Contents Introduction... 1 Use Cases... 2 Business
More informationCHAPTER - 5 CONCLUSIONS / IMP. FINDINGS
CHAPTER - 5 CONCLUSIONS / IMP. FINDINGS In today's scenario data warehouse plays a crucial role in order to perform important operations. Different indexing techniques has been used and analyzed using
More informationWHITE PAPER. Understanding IP Addressing: Everything You Ever Wanted To Know
WHITE PAPER Understanding IP Addressing: Everything You Ever Wanted To Know Understanding IP Addressing: Everything You Ever Wanted To Know CONTENTS Internet Scaling Problems 1 Classful IP Addressing 3
More informationMemory Allocation. Static Allocation. Dynamic Allocation. Memory Management. Dynamic Allocation. Dynamic Storage Allocation
Dynamic Storage Allocation CS 44 Operating Systems Fall 5 Presented By Vibha Prasad Memory Allocation Static Allocation (fixed in size) Sometimes we create data structures that are fixed and don t need
More informationOperating Systems, 6 th ed. Test Bank Chapter 7
True / False Questions: Chapter 7 Memory Management 1. T / F In a multiprogramming system, main memory is divided into multiple sections: one for the operating system (resident monitor, kernel) and one
More informationModule 11. Software Project Planning. Version 2 CSE IIT, Kharagpur
Module 11 Software Project Planning Lesson 27 Project Planning and Project Estimation Techniques Specific Instructional Objectives At the end of this lesson the student would be able to: Identify the job
More informationModule 2. Software Life Cycle Model. Version 2 CSE IIT, Kharagpur
Module 2 Software Life Cycle Model Lesson 4 Prototyping and Spiral Life Cycle Models Specific Instructional Objectives At the end of this lesson the student will be able to: Explain what a prototype is.
More informationMoral Hazard. Itay Goldstein. Wharton School, University of Pennsylvania
Moral Hazard Itay Goldstein Wharton School, University of Pennsylvania 1 Principal-Agent Problem Basic problem in corporate finance: separation of ownership and control: o The owners of the firm are typically
More informationChapter 12 File Management
Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 12 File Management Dave Bremer Otago Polytechnic, N.Z. 2008, Prentice Hall Roadmap Overview File organisation and Access
More informationVirtual Routing: What s The Goal? And What s Beyond? Peter Christy, NetsEdge Research Group, August 2001
Virtual Routing: What s The Goal? And What s Beyond? Peter Christy, NetsEdge Research Group, August 2001 Virtual routing is a software design method used to provide multiple independent routers that share
More informationChapter 12 File Management. Roadmap
Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 12 File Management Dave Bremer Otago Polytechnic, N.Z. 2008, Prentice Hall Overview Roadmap File organisation and Access
More informationFile Management. Chapter 12
Chapter 12 File Management File is the basic element of most of the applications, since the input to an application, as well as its output, is usually a file. They also typically outlive the execution
More informationThere are a number of factors that increase the risk of performance problems in complex computer and software systems, such as e-commerce systems.
ASSURING PERFORMANCE IN E-COMMERCE SYSTEMS Dr. John Murphy Abstract Performance Assurance is a methodology that, when applied during the design and development cycle, will greatly increase the chances
More informationPractical Calculation of Expected and Unexpected Losses in Operational Risk by Simulation Methods
Practical Calculation of Expected and Unexpected Losses in Operational Risk by Simulation Methods Enrique Navarrete 1 Abstract: This paper surveys the main difficulties involved with the quantitative measurement
More information54 Robinson 3 THE DIFFICULTIES OF VALIDATION
SIMULATION MODEL VERIFICATION AND VALIDATION: INCREASING THE USERS CONFIDENCE Stewart Robinson Operations and Information Management Group Aston Business School Aston University Birmingham, B4 7ET, UNITED
More information1-04-10 Configuration Management: An Object-Based Method Barbara Dumas
1-04-10 Configuration Management: An Object-Based Method Barbara Dumas Payoff Configuration management (CM) helps an organization maintain an inventory of its software assets. In traditional CM systems,
More informationConcept of Cache in web proxies
Concept of Cache in web proxies Chan Kit Wai and Somasundaram Meiyappan 1. Introduction Caching is an effective performance enhancing technique that has been used in computer systems for decades. However,
More informationRafael Witten Yuze Huang Haithem Turki. Playing Strong Poker. 1. Why Poker?
Rafael Witten Yuze Huang Haithem Turki Playing Strong Poker 1. Why Poker? Chess, checkers and Othello have been conquered by machine learning - chess computers are vastly superior to humans and checkers
More informationAmajor benefit of Monte-Carlo schedule analysis is to
2005 AACE International Transactions RISK.10 The Benefits of Monte- Carlo Schedule Analysis Mr. Jason Verschoor, P.Eng. Amajor benefit of Monte-Carlo schedule analysis is to expose underlying risks to
More informationContributions to Gang Scheduling
CHAPTER 7 Contributions to Gang Scheduling In this Chapter, we present two techniques to improve Gang Scheduling policies by adopting the ideas of this Thesis. The first one, Performance- Driven Gang Scheduling,
More informationUsing simulation to calculate the NPV of a project
Using simulation to calculate the NPV of a project Marius Holtan Onward Inc. 5/31/2002 Monte Carlo simulation is fast becoming the technology of choice for evaluating and analyzing assets, be it pure financial
More informationUniversität Karlsruhe (TH) Forschungsuniversität gegründet 1825. Inheritance Depth as a Cost Factor in Maintenance
Universität Karlsruhe (TH) Forschungsuniversität gegründet 1825 Why is Inheritance Important? A Controlled Experiment on Inheritance Depth as a Cost Factor in Maintenance Walter F. Tichy University of
More informationFile-System Implementation
File-System Implementation 11 CHAPTER In this chapter we discuss various methods for storing information on secondary storage. The basic issues are device directory, free space management, and space allocation
More informationTopics in Computer System Performance and Reliability: Storage Systems!
CSC 2233: Topics in Computer System Performance and Reliability: Storage Systems! Note: some of the slides in today s lecture are borrowed from a course taught by Greg Ganger and Garth Gibson at Carnegie
More informationFactoring & Primality
Factoring & Primality Lecturer: Dimitris Papadopoulos In this lecture we will discuss the problem of integer factorization and primality testing, two problems that have been the focus of a great amount
More informationHow to handle Out-of-Memory issue
How to handle Out-of-Memory issue Overview Memory Usage Architecture Memory accumulation 32-bit application memory limitation Common Issues Encountered Too many cameras recording, or bitrate too high Too
More informationSoftware Engineering Introduction & Background. Complaints. General Problems. Department of Computer Science Kent State University
Software Engineering Introduction & Background Department of Computer Science Kent State University Complaints Software production is often done by amateurs Software development is done by tinkering or
More informationMuse Server Sizing. 18 June 2012. Document Version 0.0.1.9 Muse 2.7.0.0
Muse Server Sizing 18 June 2012 Document Version 0.0.1.9 Muse 2.7.0.0 Notice No part of this publication may be reproduced stored in a retrieval system, or transmitted, in any form or by any means, without
More information1 The Java Virtual Machine
1 The Java Virtual Machine About the Spec Format This document describes the Java virtual machine and the instruction set. In this introduction, each component of the machine is briefly described. This
More information(Refer Slide Time: 01:52)
Software Engineering Prof. N. L. Sarda Computer Science & Engineering Indian Institute of Technology, Bombay Lecture - 2 Introduction to Software Engineering Challenges, Process Models etc (Part 2) This
More informationPUBLIC HEALTH OPTOMETRY ECONOMICS. Kevin D. Frick, PhD
Chapter Overview PUBLIC HEALTH OPTOMETRY ECONOMICS Kevin D. Frick, PhD This chapter on public health optometry economics describes the positive and normative uses of economic science. The terms positive
More informationOperating Systems CSE 410, Spring 2004. File Management. Stephen Wagner Michigan State University
Operating Systems CSE 410, Spring 2004 File Management Stephen Wagner Michigan State University File Management File management system has traditionally been considered part of the operating system. Applications
More informationAdvanced Tutorials. Numeric Data In SAS : Guidelines for Storage and Display Paul Gorrell, Social & Scientific Systems, Inc., Silver Spring, MD
Numeric Data In SAS : Guidelines for Storage and Display Paul Gorrell, Social & Scientific Systems, Inc., Silver Spring, MD ABSTRACT Understanding how SAS stores and displays numeric data is essential
More informationIS YOUR DATA WAREHOUSE SUCCESSFUL? Developing a Data Warehouse Process that responds to the needs of the Enterprise.
IS YOUR DATA WAREHOUSE SUCCESSFUL? Developing a Data Warehouse Process that responds to the needs of the Enterprise. Peter R. Welbrock Smith-Hanley Consulting Group Philadelphia, PA ABSTRACT Developing
More informationAN INTRODUCTION TO PREMIUM TREND
AN INTRODUCTION TO PREMIUM TREND Burt D. Jones * February, 2002 Acknowledgement I would like to acknowledge the valuable assistance of Catherine Taylor, who was instrumental in the development of this
More informationRecommendations for Performance Benchmarking
Recommendations for Performance Benchmarking Shikhar Puri Abstract Performance benchmarking of applications is increasingly becoming essential before deployment. This paper covers recommendations and best
More informationAutomatic Inventory Control: A Neural Network Approach. Nicholas Hall
Automatic Inventory Control: A Neural Network Approach Nicholas Hall ECE 539 12/18/2003 TABLE OF CONTENTS INTRODUCTION...3 CHALLENGES...4 APPROACH...6 EXAMPLES...11 EXPERIMENTS... 13 RESULTS... 15 CONCLUSION...
More informationPROJECT RISK MANAGEMENT
11 PROJECT RISK MANAGEMENT Project Risk Management includes the processes concerned with identifying, analyzing, and responding to project risk. It includes maximizing the results of positive events and
More informationJava's garbage-collected heap
Sponsored by: This story appeared on JavaWorld at http://www.javaworld.com/javaworld/jw-08-1996/jw-08-gc.html Java's garbage-collected heap An introduction to the garbage-collected heap of the Java
More informationWHITE PAPER. Dedupe-Centric Storage. Hugo Patterson, Chief Architect, Data Domain. Storage. Deduplication. September 2007
WHITE PAPER Dedupe-Centric Storage Hugo Patterson, Chief Architect, Data Domain Deduplication Storage September 2007 w w w. d a t a d o m a i n. c o m - 2 0 0 7 1 DATA DOMAIN I Contents INTRODUCTION................................
More informationOPTIMUS SBR. Optimizing Results with Business Intelligence Governance CHOICE TOOLS. PRECISION AIM. BOLD ATTITUDE.
OPTIMUS SBR CHOICE TOOLS. PRECISION AIM. BOLD ATTITUDE. Optimizing Results with Business Intelligence Governance This paper investigates the importance of establishing a robust Business Intelligence (BI)
More informationHow to Write a Successful PhD Dissertation Proposal
How to Write a Successful PhD Dissertation Proposal Before considering the "how", we should probably spend a few minutes on the "why." The obvious things certainly apply; i.e.: 1. to develop a roadmap
More informationMeasuring the Performance of an Agent
25 Measuring the Performance of an Agent The rational agent that we are aiming at should be successful in the task it is performing To assess the success we need to have a performance measure What is rational
More information8. KNOWLEDGE BASED SYSTEMS IN MANUFACTURING SIMULATION
- 1-8. KNOWLEDGE BASED SYSTEMS IN MANUFACTURING SIMULATION 8.1 Introduction 8.1.1 Summary introduction The first part of this section gives a brief overview of some of the different uses of expert systems
More informationMultimedia Caching Strategies for Heterogeneous Application and Server Environments
Multimedia Tools and Applications 4, 279 312 (1997) c 1997 Kluwer Academic Publishers. Manufactured in The Netherlands. Multimedia Caching Strategies for Heterogeneous Application and Server Environments
More informationMemory Allocation Technique for Segregated Free List Based on Genetic Algorithm
Journal of Al-Nahrain University Vol.15 (2), June, 2012, pp.161-168 Science Memory Allocation Technique for Segregated Free List Based on Genetic Algorithm Manal F. Younis Computer Department, College
More informationChapter 24 - Quality Management. Lecture 1. Chapter 24 Quality management
Chapter 24 - Quality Management Lecture 1 1 Topics covered Software quality Software standards Reviews and inspections Software measurement and metrics 2 Software quality management Concerned with ensuring
More informationCONTENT STORE SURVIVAL GUIDE
REVISED EDITION CONTENT STORE SURVIVAL GUIDE THE COMPLETE MANUAL TO SURVIVE AND MANAGE THE IBM COGNOS CONTENT STORE CONTENT STORE SURVIVAL GUIDE 2 of 24 Table of Contents EXECUTIVE SUMMARY THE ROLE OF
More informationThe Mathematics of Alcoholics Anonymous
The Mathematics of Alcoholics Anonymous "As a celebrated American statesman put it, 'Let's look at the record. Bill Wilson, Alcoholics Anonymous, page 50, A.A.W.S. Inc., 2001. Part 2: A.A. membership surveys
More informationBENCHMARKING PERFORMANCE AND EFFICIENCY OF YOUR BILLING PROCESS WHERE TO BEGIN
BENCHMARKING PERFORMANCE AND EFFICIENCY OF YOUR BILLING PROCESS WHERE TO BEGIN There have been few if any meaningful benchmark analyses available for revenue cycle management performance. Today that has
More informationACH 1.1 : A Tool for Analyzing Competing Hypotheses Technical Description for Version 1.1
ACH 1.1 : A Tool for Analyzing Competing Hypotheses Technical Description for Version 1.1 By PARC AI 3 Team with Richards Heuer Lance Good, Jeff Shrager, Mark Stefik, Peter Pirolli, & Stuart Card ACH 1.1
More informationUnderstanding Linux on z/vm Steal Time
Understanding Linux on z/vm Steal Time June 2014 Rob van der Heij rvdheij@velocitysoftware.com Summary Ever since Linux distributions started to report steal time in various tools, it has been causing
More informationThe Phases of an Object-Oriented Application
The Phases of an Object-Oriented Application Reprinted from the Feb 1992 issue of The Smalltalk Report Vol. 1, No. 5 By: Rebecca J. Wirfs-Brock There is never enough time to get it absolutely, perfectly
More informationSimulating the Structural Evolution of Software
Simulating the Structural Evolution of Software Benjamin Stopford 1, Steve Counsell 2 1 School of Computer Science and Information Systems, Birkbeck, University of London 2 School of Information Systems,
More informationThe data centre in 2020
INSIDE TRACK Analyst commentary with a real-world edge The data centre in 2020 Dream the impossible dream! By Tony Lock, January 2013 Originally published on http://www.theregister.co.uk/ There has never
More informationCompetitive Analysis of QoS Networks
Competitive Analysis of QoS Networks What is QoS? The art of performance analysis What is competitive analysis? Example: Scheduling with deadlines Example: Smoothing real-time streams Example: Overflow
More informationAbstraction in Computer Science & Software Engineering: A Pedagogical Perspective
Orit Hazzan's Column Abstraction in Computer Science & Software Engineering: A Pedagogical Perspective This column is coauthored with Jeff Kramer, Department of Computing, Imperial College, London ABSTRACT
More informationOptimal Load Balancing in a Beowulf Cluster. Daniel Alan Adams. A Thesis. Submitted to the Faculty WORCESTER POLYTECHNIC INSTITUTE
Optimal Load Balancing in a Beowulf Cluster by Daniel Alan Adams A Thesis Submitted to the Faculty of WORCESTER POLYTECHNIC INSTITUTE in partial fulfillment of the requirements for the Degree of Master
More informationTaking the First Steps in. Web Load Testing. Telerik
Taking the First Steps in Web Load Testing Telerik An Introduction Software load testing is generally understood to consist of exercising an application with multiple users to determine its behavior characteristics.
More informationCredit Card Market Study Interim Report: Annex 4 Switching Analysis
MS14/6.2: Annex 4 Market Study Interim Report: Annex 4 November 2015 This annex describes data analysis we carried out to improve our understanding of switching and shopping around behaviour in the UK
More informationDeployment of express checkout lines at supermarkets
Deployment of express checkout lines at supermarkets Maarten Schimmel Research paper Business Analytics April, 213 Supervisor: René Bekker Faculty of Sciences VU University Amsterdam De Boelelaan 181 181
More informationIdentifying and Managing Project Risk, Second Edition 2008 Tom Kendrick. The PERIL Database
The PERIL Database Good project management is based on experience. Fortunately, the experience and pain need not all be personal; you can also learn from the experience of others, avoiding the aggravation
More informationInterpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters
Interpreters and virtual machines Michel Schinz 2007 03 23 Interpreters Interpreters Why interpreters? An interpreter is a program that executes another program, represented as some kind of data-structure.
More informationæ A collection of interrelated and persistent data èusually referred to as the database èdbèè.
CMPT-354-Han-95.3 Lecture Notes September 10, 1995 Chapter 1 Introduction 1.0 Database Management Systems 1. A database management system èdbmsè, or simply a database system èdbsè, consists of æ A collection
More informationWhitepaper: performance of SqlBulkCopy
We SOLVE COMPLEX PROBLEMS of DATA MODELING and DEVELOP TOOLS and solutions to let business perform best through data analysis Whitepaper: performance of SqlBulkCopy This whitepaper provides an analysis
More informationBroadband Networks. Prof. Dr. Abhay Karandikar. Electrical Engineering Department. Indian Institute of Technology, Bombay. Lecture - 29.
Broadband Networks Prof. Dr. Abhay Karandikar Electrical Engineering Department Indian Institute of Technology, Bombay Lecture - 29 Voice over IP So, today we will discuss about voice over IP and internet
More informationManaging Capacity Using VMware vcenter CapacityIQ TECHNICAL WHITE PAPER
Managing Capacity Using VMware vcenter CapacityIQ TECHNICAL WHITE PAPER Table of Contents Capacity Management Overview.... 3 CapacityIQ Information Collection.... 3 CapacityIQ Performance Metrics.... 4
More informationTexas Success Initiative (TSI) Assessment. Interpreting Your Score
Texas Success Initiative (TSI) Assessment Interpreting Your Score 1 Congratulations on taking the TSI Assessment! The TSI Assessment measures your strengths and weaknesses in mathematics and statistics,
More informationHow to Plan a Successful Load Testing Programme for today s websites
How to Plan a Successful Load Testing Programme for today s websites This guide introduces best practise for load testing to overcome the complexities of today s rich, dynamic websites. It includes 10
More informationOn Benchmarking Popular File Systems
On Benchmarking Popular File Systems Matti Vanninen James Z. Wang Department of Computer Science Clemson University, Clemson, SC 2963 Emails: {mvannin, jzwang}@cs.clemson.edu Abstract In recent years,
More informationIan Stewart on Minesweeper
Ian Stewart on Minesweeper It's not often you can win a million dollars by analysing a computer game, but by a curious conjunction of fate, there's a chance that you might. However, you'll only pick up
More informationHow To Make A Backup System More Efficient
Identifying the Hidden Risk of Data De-duplication: How the HYDRAstor Solution Proactively Solves the Problem October, 2006 Introduction Data de-duplication has recently gained significant industry attention,
More informationChapter 6: The Information Function 129. CHAPTER 7 Test Calibration
Chapter 6: The Information Function 129 CHAPTER 7 Test Calibration 130 Chapter 7: Test Calibration CHAPTER 7 Test Calibration For didactic purposes, all of the preceding chapters have assumed that the
More informationReport to the 79 th Legislature. Use of Credit Information by Insurers in Texas
Report to the 79 th Legislature Use of Credit Information by Insurers in Texas Texas Department of Insurance December 30, 2004 TABLE OF CONTENTS Executive Summary Page 3 Discussion Introduction Page 6
More informationAPPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM
152 APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM A1.1 INTRODUCTION PPATPAN is implemented in a test bed with five Linux system arranged in a multihop topology. The system is implemented
More information