Memory System Characterization of Big Data Workloads

Martin Dimitrov*, Karthik Kumar*, Patrick Lu**, Vish Viswanathan*, Thomas Willhalm*
*Software and Services Group, **Datacenter and Connected Systems Group, Intel Corporation
2013 IEEE International Conference on Big Data

Abstract—Two recent trends have emerged: (1) rapid growth in big data technologies, with new types of computing models to handle unstructured data, such as MapReduce and NoSQL, and (2) a growing focus on the memory subsystem for performance and power optimizations, particularly with emerging memory technologies offering characteristics different from conventional DRAM (bandwidths, read/write asymmetries). This paper examines how these trends may intersect by characterizing the memory access patterns of various Hadoop and NoSQL big data workloads. Using memory DIMM traces collected with special hardware, we analyze the spatial and temporal reference patterns to bring out several insights related to memory and platform usage, such as memory footprints, read-write ratios, bandwidths, and latencies. We develop an analysis methodology to understand how conventional optimizations such as caching, prediction, and prefetching may apply to these workloads, and discuss the implications on software and system design.

Keywords—big data, memory characterization

I. INTRODUCTION

The massive information explosion over the past decade has resulted in zettabytes of data [1] being created each year, with most of this data in the form of files, videos, logs, documents, images, etc. (unstructured formats). Data continues to grow at an exponential pace; for example, in 2010 the world created half as much data as it had in all previous years combined [2]. With such growth, it becomes challenging for conventional computing models to handle these large volumes. Big data analytics [3] [4] [5] [6] has emerged as the solution to parse, analyze, and extract meaningful information from these large volumes of unstructured data. Extracting this information provides opportunities and insights on a variety of fronts: making more intelligent business decisions, understanding trends in usages and markets, detecting fraud and anomalies, and so on, with many of these possible in real time. As a result of these advantages, big data processing is becoming increasingly popular. Two primary big data computing models have emerged, (1) Hadoop-based computing and (2) NoSQL-based computing, and the two are among the fastest growing software segments in modern computer systems. On the other hand, computing systems themselves have seen a shift in optimizations: unlike a decade earlier, when most optimizations targeted the processor, there is now more focus on the overall platform, and particularly the memory subsystem, for performance and power improvements. Recent studies have shown memory to be a first-order consideration for performance, and sometimes even the dominant consumer of power in a computer system. This emphasis becomes even more important with recent trends in emerging memory technologies [7] [8], which are expected to offer characteristics different from conventional DRAM, such as higher latencies, differing capacities, persistence, etc.
In order to have software run efficiently on such technologies, it becomes critical to characterize and understand the memory usages. While various studies have been performed on memory characterization of workloads [9] [10] [11] [12] over the past decade, most of them focus on the SPEC benchmark suites and traditional computing models. Very few studies have examined the memory behavior of big data workloads, and these are mostly specific to one optimization, such as TLB improvements [13] [14]. This paper addresses this gap by providing a detailed characterization of the spatial and temporal memory references of various big data workloads. We analyze various building blocks of big data operations, such as sort, wordcount, aggregations, and joins on Hadoop, and building indexes to process data on a NoSQL data store. We characterize the memory behavior by monitoring various metrics such as memory latencies; first, second, and last level processor cache miss rates; code and data TLB miss rates; and peak memory bandwidths. We examine the impact of Hadoop compression both on performance and on the spatial patterns. Using specially designed hardware, we are able to observe and trace the memory reference patterns of all the workloads at the DIMM level with precise timing information. These traces give us the unique ability to obtain insights based on the spatial and temporal references, such as memory footprints and the spatial histograms of the memory references over those footprints. This paper also examines the potential for big data workloads to tolerate the higher latencies expected in emerging memory technologies. The classic mechanisms for doing this are caching in a faster memory tier, and predicting future memory references to prefetch sections of memory. We examine the cacheability of big data workloads by running the memory traces through a cache simulator with different cache sizes. An interesting insight we observe is that many of the workloads operate on only a small subset of their spatial footprint at a time. As a result, we find that a cache that is less than 0.1% the size of the footprint can provide a hit rate as high as 40% of all the memory references. For prefetchability, we observe that using existing prefetcher schemes to predict the precise next memory reference is a hard problem at the memory DIMM level, due to the mixing of streams from different processor cores, and this mixed stream getting further interleaved across different memory ranks for performance. Clearly, more sophisticated algorithms and techniques are required if prefetching is to be transparent to the application at the lower levels of the memory hierarchy. To examine this potential, we use signal processing techniques, entropy and trend analysis (correlation with known signals), to bring out insights related to the memory patterns. We believe this is the first paper to examine the design space for memory architectures running big data workloads by analyzing spatial patterns using DIMM traces and providing a detailed memory characterization. The study brings out a wealth of insights for system and software design. The experiments are performed on a 4-node cluster of 2-socket servers with Intel Xeon E5 processors, with each node configured with 128GB of DDR3 memory and 2TB of SSD storage. (We intentionally selected fast storage and large memory capacities: with the price of flash and non-volatile media continuing to drop, we chose this configuration to understand forward-looking usages.) The remainder of this paper is organized as follows: Section II describes related work. Section III describes the workloads used in this paper. Section IV describes the experimental methodology, Section V presents the results and observations, and Section VI concludes the paper and discusses future work.

II. RELATED WORK

A. Memory system characterization

Several papers over the past decade discuss memory system characterization of enterprise workloads. Barroso et al. [9] characterize the memory references of various commercial workloads. Domain-specific characterization studies include memory characterization of parallel data mining workloads [15], of the ECperf benchmark [16], of memcached [17], and of the SPEC CPU2000 and CPU2006 benchmark suites [10] [11] [12]. Particularly noteworthy is the work of Shao et al. [12], where statistical measures are used for memory characterization. The common focus of all these works is using instrumentation techniques and platform monitoring to understand how specific workloads use memory. With emerging memory technologies [8] [18] [19] [20] having different properties from the conventional DRAM that has been used for the past decade, this type of characterization becomes particularly important.

B. Big Data workloads

Various researchers have proposed benchmarks and workloads representative of big data usages; the common focus of these benchmarks is that they deal with processing unstructured data, typically using Hadoop or NoSQL. The HiBench suite developed by Intel [21] consists of several Hadoop workloads, such as sort, wordcount, and hive aggregation, that are proxies for real-world big data usages. In this paper, we use several Hadoop workloads from the HiBench suite. Another class of data stores for unstructured data are NoSQL databases [22], which are specialized for query and search operations. They differ from conventional databases in that they typically do not offer transactional guarantees, a trade-off made in exchange for very fast retrieval. Recent studies have also proposed characterizing and understanding these big data usage cases.
These can be classified as follows. Implications on system design and architecture: a study from IBM Research [23] examines how big data workloads may be suited for the IBM POWER architecture, and Chang et al. [24] examine the implications of big data analytics on system design. Modeling big data workloads: Yang et al. [25] propose statistics-based techniques for modeling MapReduce, and Atikoglu et al. [26] model and analyze the behavior of a key-value store (memcached). Performance characterization of big data workloads: Ren et al. characterize the behavior of a production Hadoop cluster using a specific case study [27], and Issa et al. [28] present a power and performance characterization of Hadoop with memcached. Very few studies focus on understanding the memory characteristics of big data workloads. Noteworthy among these are Basu et al. [14], which focuses on page-table and virtual-memory optimizations for big data workloads, and Jia et al. [13], which presents a characterization of the L1, L2, and LLC cache misses observed for a Hadoop workload cluster. Both of these studies focus on characterization at the virtual memory and cache hierarchy, as opposed to the DRAM level.

C. Contributions

The following are the unique contributions of this paper: We believe this is the first study analyzing the memory reference patterns of big data workloads. Using hardware memory traces at the DIMM level, we are able to analyze references to physical memory. We introduce various metrics to qualitatively and quantitatively characterize the memory reference patterns, and we discuss the implications of these metrics for future system design.

III. WORKLOADS

We use several Hadoop workloads from the HiBench workload suite [21] and a NoSQL datastore that builds indexes from text documents. We use a performance-optimized Hadoop configuration in our experiments. Since Hadoop has a compression codec for both input and output data, all the Hadoop workloads are examined with and without compression. The following is a brief description of the workloads used:

A. Sort

Sort is a good proxy for a common type of big data operation: transforming data from one representation to another. The workload sorts its text input data, which is generated using the Hadoop RandomTextWriter example. In our setup, we sort a 96GB dataset in HDFS on the 4-node cluster, with 24GB of data per node.

B. WordCount

WordCount also represents a common big data operation: extracting a small amount of interesting data from a large dataset, a needle-in-a-haystack search. The workload counts the number of occurrences of each word in the input data set, which is generated using the Hadoop RandomTextWriter example. In our setup, we perform wordcount on a 128GB dataset in HDFS, distributed across the 4 nodes as 32GB per node.

C. Hive Join

The Hive join workload approximates a complex analytic query, representative of typical OLAP workloads. It computes the average and the sum for each group by joining two different tables. The join task consists of two sub-tasks that perform a complex calculation on two data sets. In the first part of the task, each system must find the sourceIP that generated the most revenue within a particular date range. Once these intermediate records are generated, the system must then calculate the average pageRank of all the pages visited during this interval. The generated data set approximates web-server logs, with hyperlinks following a Zipfian distribution. In this case, we simulated nearly 130 million user visits to nearly 18 million pages.

D. Hive Aggregation

Hive aggregation approximates a complex analytic query, representative of typical OLAP workloads, by computing the inlink count for each document in the dataset, a task often used as a component of PageRank calculations. The first step is to read each document and search for all the URLs that appear in its contents. The second step is, for each unique URL, to count the number of unique pages that reference that particular URL across the entire set of documents. MapReduce is believed to be commonly used for this type of task.

E. NoSQL indexing

The NoSQL workload uses a NoSQL data store to build indexes from 240GB of text files, distributed across the 4 nodes. This type of computation is heavy in regular-expression comparisons and is a very common big data use case.

IV. EXPERIMENTAL METHODOLOGY

In this section, we discuss the experimental methodology used in the paper. It is focused on two objectives: (1) providing insights about how big data applications use the memory subsystem (Section IV-A), and (2) examining the latency tolerance of big data workloads, since emerging memory technologies have higher latencies than DRAM. Latency tolerance is examined through the potential of classic techniques to hide latency: cacheability in a faster tier (Sections IV-B, IV-C, and IV-D) and prefetching into a faster tier (Sections IV-D and IV-E).

A. General characterization

Performance counter monitoring allows us to analyze various characteristics of the workload and its memory references. Metrics of interest include:

Memory footprint: the span of memory that is touched at least once by the workload.
It can also be viewed as the minimum amount of memory required to keep the workload in memory.

CPI: cycles per instruction, a measure of the average number of hardware cycles required to execute one instruction. A lower value indicates that instructions are running efficiently on the hardware, with fewer stalls, dependencies, waits, and bottlenecks.

L1, L2, and Last Level Cache (LLC) misses per instruction (MPI): the processor has 3 levels of cache hierarchy. The L1 and L2 caches are small (KB-sized), fast caches that are private to each core. The LLC is the last-level cache, shared among all the cores in a socket. Since the LLC is the last level in the hierarchy, the penalty for an LLC miss is a reference to DRAM, requiring tens of nanoseconds of wait time. Hence the LLC MPI is often a good indicator of the memory intensiveness of a workload.

Memory bandwidth: the data rate measured for references to the memory subsystem. Intel Xeon E5 two-socket server platforms can easily support bandwidths greater than 60,000 MB/s.

Instruction and Data Translation Lookaside Buffer (TLB) MPI: the TLBs are a cache for page table entries. A higher miss rate for the data TLB indicates that memory references are more widespread in distribution, since a TLB miss occurs whenever a 4kB boundary is crossed and a page entry that is not cached in the TLB is referenced.

B. Cache line working set characterization

The spatial nature of the memory references of a workload can be identified by characterizing the cache line references. For example, the memory referenced by a workload may span 5GB, yet most of the references may be concentrated on a smaller 100MB region within the 5GB. To understand such behavior, we employ the following methodology: (1) we create a histogram of the cache lines and their memory references: against each cache line, we record the number of times it is referenced; (2) we sort the cache lines by their number of references, with the most frequently referenced cache line first; (3) we select a footprint size (for example, 100MB) and compute the percentage of references contained within this footprint size. In this example, approximately 1.6 million cache lines (= 100MB / 64 bytes per cache line) are required to span 100MB; we compute the total number of references against that many of the first cache lines in the list from step (2), and divide by the total number of references in the trace. This gives us the spatial distribution of the references within the hottest 100MB, against the overall memory footprint. Intuitively, if one had a fixed cache and had to pin the cache lines, with no replacements or additions possible, then the cache lines to select would be the ones highlighted by this analysis.
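As a concrete illustration, the following is a minimal sketch of this working-set analysis, assuming the DIMM trace is available as a sequence of physical addresses; the function and variable names are ours, not the paper's.

```python
from collections import Counter

CACHE_LINE = 64  # bytes per cache line

def working_set_coverage(trace, footprint_bytes):
    """Steps (1)-(3) of Section IV-B: fraction of all references that
    fall within the hottest `footprint_bytes` worth of cache lines."""
    # (1) Histogram: reference count per cache line.
    refs = Counter(addr // CACHE_LINE for addr in trace)
    # (2) Sort reference counts, hottest cache line first.
    counts = sorted(refs.values(), reverse=True)
    # (3) References captured by the hottest N lines spanning the footprint.
    n_lines = footprint_bytes // CACHE_LINE
    coverage = sum(counts[:n_lines]) / sum(counts)
    footprint = len(refs) * CACHE_LINE  # bytes touched at least once
    return footprint, coverage

# Synthetic example (not the paper's data): one hot line, two cold ones.
trace = [0x1000, 0x1008, 0x2000, 0x1010, 0x3000, 0x1000]
fp, cov = working_set_coverage(trace, footprint_bytes=1 * CACHE_LINE)
print(f"footprint={fp}B, hottest line covers {cov:.0%} of references")
```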

C. Cache simulation

The previous section considers the spatial distribution of the memory references; however, it does not consider their temporal nature. For example, the workload could be repeatedly streaming through memory in very large, repeated scans. If the working set spanned by the scan is much larger than the size of the cache, this will result in very poor cacheability. Moreover, there could be cache eviction and replacement patterns that result in poor cacheability that is not apparent from inspecting the spatial distribution. On the other hand, it is also possible that the workload focuses on small regions of memory at a time, resulting in very good cacheability; again, this may not be apparent from the spatial distribution. To account for such cases, we run the memory reference traces through a cache simulator with different cache sizes, and observe the corresponding hit rates. High hit rates indicate that a tiered memory architecture could work well: a first tier of DRAM could be used to cache a good portion of the memory references to a second, larger tier based on a non-volatile technology.
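The paper does not detail the simulator's organization; the sketch below assumes a fully associative cache of 64-byte lines with LRU replacement, which is enough to reproduce the scan-thrashing behavior described above.

```python
from collections import OrderedDict

def lru_hit_rate(trace, cache_bytes, line=64):
    """Replay a physical-address trace through a fully associative LRU
    cache (assumed organization) and return the resulting hit rate."""
    capacity = cache_bytes // line
    cache = OrderedDict()  # cache line id -> None, kept in LRU order
    hits = 0
    for addr in trace:
        ln = addr // line
        if ln in cache:
            hits += 1
            cache.move_to_end(ln)        # mark as most recently used
        elif len(cache) < capacity:
            cache[ln] = None
        else:
            cache.popitem(last=False)    # evict least recently used
            cache[ln] = None
    return hits / len(trace)

# Three scans over a 64KB region: a 16KB LRU cache thrashes (0% hits),
# while a 128KB cache captures the second and third scans (~67% hits).
scan = [i * 64 for i in range(1024)] * 3
print(lru_hit_rate(scan, 16 * 1024), lru_hit_rate(scan, 128 * 1024))
```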
D. Entropy

The previous metrics provide information about the spatial distribution and about how the temporal pattern impacts cacheability. Another important consideration for memory-level optimizations is predictability and compressibility. This is related to information content, based on the observation that a signal with a high amount of information content is harder to compress and potentially difficult to predict. To quantify and compare this feature across traces, we use the entropy of the memory references as a metric, as was done in [12] for understanding the memory reference patterns of SPEC workloads. The entropy is a measure of the information content in the trace, and therefore gives a good indication of its cacheability and predictability. For a set of cache lines K, the entropy is defined as

H = -\sum_{c \in K} p(c) \log p(c)

where p(c) denotes the probability of cache line c. The following example demonstrates how entropy can be used to characterize memory references. Consider a total of 10 distinct cache lines {a, b, c, d, e, f, g, h, i, j} referenced in a trace, and the following three scenarios, each consisting of 100 references to these cache lines: (1) each of the 10 cache lines is referenced 10 times; (2) cache lines a, b, c, d, e are referenced 19 times each, and cache lines f, g, h, i, j are referenced 1 time each; (3) cache line a is referenced 91 times, and cache lines b, c, d, e, f, g, h, i, j are referenced 1 time each. All three access patterns use all 10 cache lines and have 100 cache line references; metrics like footprint and reference count are therefore identical in all 3 cases. However, in the last case a single cache line contains 91% of the references, versus only 19% of the references in case (2) and 10% in case (1). Similarly, a set of 3 cache lines contains 93% of the references in case (3), 57% in case (2), and 30% in case (1). Therefore, from a footprint or working-set point of view, (3) is preferable over (2), which in turn is preferable over (1). This is nicely reflected in the entropy, which (taking logarithms to base 10) is 1 for scenario (1), approximately 0.79 for scenario (2), and approximately 0.22 for scenario (3). The reference counts of the cache lines in a trace are converted to a probability distribution over those cache lines, so the measure is relative between the cache lines; in particular, the entropy is independent of the length of the trace.
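For concreteness, here is a short sketch (ours) that computes this entropy from a trace and reproduces the three scenarios; base-10 logarithms are assumed, which is what makes scenario (1) come out to exactly 1.

```python
import math
from collections import Counter

def trace_entropy(trace, line=64, base=10.0):
    """H = -sum_{c in K} p(c) * log p(c) over cache lines c (Section IV-D)."""
    counts = Counter(addr // line for addr in trace)
    total = sum(counts.values())
    return -sum(n / total * math.log(n / total, base) for n in counts.values())

# The three 100-reference scenarios from the text, using synthetic
# addresses (cache line k lives at address k * 64):
s1 = [k for k in range(10) for _ in range(10)]                  # 10 lines x 10
s2 = [k for k in range(5) for _ in range(19)] + list(range(5, 10))
s3 = [0] * 91 + list(range(1, 10))
for s in (s1, s2, s3):
    print(round(trace_entropy([k * 64 for k in s]), 2))         # 1.0, 0.79, 0.22
```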

E. Correlation and trend analysis

To further understand the predictability of memory references, we examine the traces for trends. For example, if we knew that a trace had a trend of increasing physical address ranges in its references, then aggressively prefetching large sections in the upper physical address ranges into an upper-level cache would result in a fair percentage of cache hits. To quantify this, we use correlation analysis with known signals. The computation can be expressed as

c(n) = (f \star g)(n) = \sum_{m=-\infty}^{\infty} f[m] g[m+n]   (1)

where g is the trace and f is a known signal. For our analysis, we use a single sawtooth function:

f_{s,l}(n) = s \cdot n if 0 \le n \le l, and 0 otherwise.

The correlation output then examines the trace g, looking for the known signal f. With a slope of s = 64 and a length of l = 1000, the test function f mimics an ascending stride through memory of 1000 cache lines. Note that the infinite sum in (1) collapses to

(f_{s,l} \star g)(n) = \sum_{m=0}^{l} f_{s,l}[m] g[m+n] = \sum_{m=0}^{l} s \cdot m \cdot g[m+n]   (2)

Furthermore, it is worth noting that a test function for a descending stride, with negative slope -s, simply results in a negated correlation:

f_{-s,l} \star g = (-f_{s,l}) \star g = -(f_{s,l} \star g)
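A sketch (ours) of this trend analysis using NumPy: the trace is standardized by its mean and standard deviation, echoing the normalization mentioned in Section V-F, and then correlated with the sawtooth of equation (2). The scaling is our choice; only the sign and relative magnitude matter.

```python
import numpy as np

def sawtooth_correlation(trace, slope=64, length=1000):
    """Correlate an address trace g with the ascending sawtooth
    f[m] = slope * m for 0 <= m <= length (equations (1)-(2))."""
    g = np.asarray(trace, dtype=float)
    g = (g - g.mean()) / g.std()       # standardize, as in Section V-F
    f = slope * np.arange(length + 1, dtype=float)
    f = (f - f.mean()) / f.std()
    # c[n] = sum_m f[m] * g[m + n]; np.correlate slides f along g.
    return np.correlate(g, f, mode="valid") / len(f)

# An ascending 64-byte stride yields positive correlation throughout;
# reversing the trace (a descending stride) flips the sign.
ascending = [0x100000 + 64 * i for i in range(2000)]
print(sawtooth_correlation(ascending).max() > 0)           # True
print(sawtooth_correlation(ascending[::-1]).max() < 0)     # True
```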

Figure 1. Memory footprints of the workloads

V. RESULTS

A. Experimental Setup

We perform our experiments on a 4-node cluster, with each node an Intel Xeon E5 2.9GHz two-socket server platform with 128GB of 1600MHz DDR3 memory and 2TB of SSD storage. One of the nodes is fitted with the special hardware required to collect memory traces of the workloads at the DIMM level. The memory tracer interposes between the DIMM and the motherboard; it is completely unobtrusive, while at the same time capable of recording all signals arriving at the pins of the memory DIMM, capturing the physical addresses referenced without any overhead. We keep the cluster configuration identical across nodes, and verified during our experiments that Hadoop and the NoSQL workload distribute tasks equally among the nodes; hence, observations made at one node can be generalized.

B. General characterization

In this section, we describe the memory characteristics of all the workloads.

Memory footprints: Figure 1 shows the memory footprints of the workloads, in GB. Most workloads have footprints of 10GB or greater, with the NoSQL workload and the uncompressed sort workload having the largest footprints. It can also be observed that compression reduces the memory footprints and helps reduce the execution time, as seen in Figure 3. The footprints are mostly read-intensive, with the NoSQL workload and the wordcount map phase having read-write ratios of 2 or greater. Among the Hadoop workloads, Hive join and Hive aggregation were found to have more writes than the sort and wordcount workloads. In most cases, we observed that enabling Hadoop compression reduced the read-write ratio.

CPI: The CPI of the workloads is shown in Figure 2. Most of the workloads have a CPI close to 1, with uncompressed sort having the largest CPI. This difference with compression for the sort workload is also apparent in the execution times, shown in Figure 3. From Figures 1, 2, and 3, it can be observed that the sort and wordcount workloads benefit most from compression, followed by the Hive aggregation workload.

Figure 2. Cycles per instruction to execute the workloads
Figure 3. Execution times for the workloads

L1, L2, LLC MPI: Figures 4, 5, and 6 show the misses per thousand instructions for the three levels of the cache hierarchy, respectively. The sort workload has the highest cache miss rates. Intuitively, this makes sense because it transforms a large volume of data from one representation to another. The benefits of compression are also apparent in the last-level cache miss rates of all the workloads.

Figure 4. First level data cache misses per 1000 instructions
Figure 5. Second level cache misses per 1000 instructions
Figure 6. Last level cache misses per 1000 instructions

Memory bandwidth: The peak memory bandwidths of the workloads are shown in Figure 7. All the workloads have peak bandwidths of several tens of GB/s, within the platform capability of 70 GB/s. Wordcount is the most bandwidth-intensive of the workloads. We note that while some workloads have higher bandwidth with compression enabled, the total data traffic to memory (the product of execution time and bandwidth) is lowered in all cases when compression is enabled.

Figure 7. Peak memory bandwidths recorded for the workloads

Instruction and Data TLB MPI: It is interesting to note that although the sort workload has an almost order-of-magnitude larger footprint than the wordcount workload, wordcount has much higher data TLB miss rates (Figure 8). This indicates that the memory references of the workload are not well contained within page granularities, and are more widespread. In terms of instruction TLBs, the NoSQL workload has the highest miss rates (Figure 9).

Figure 8. Data TLB misses per 1000 instructions
Figure 9. Instruction TLB misses per 1000 instructions

C. Cache line working set characterization

Figure 11 shows the working set characterization described in Section IV-B. The hottest 100MB of cache lines contains 20% of the memory references for all workloads except the NoSQL workload and the map phase of the uncompressed wordcount workload. The NoSQL workload stands out from the other workloads in this characterization across the various footprint sizes. A 1GB footprint is observed to contain 60% of the memory references of all but this workload. It is also interesting to note that even though the sort workload has footprints of more than 100GB, more than 60% of its memory references are contained in 1GB, i.e., less than 1% of its footprint.

Figure 11. Percentage of references contained in memory footprints

D. Cache simulation

Figure 12 shows the cacheability when the temporal nature of the reference patterns is accounted for. It is interesting to compare Figures 11 and 12 and observe that the percentage of cache hits is higher in Figure 12. This indicates that the big data workloads do not operate on their entire footprint at once; rather, they operate on spatial subsets of the total footprint, which makes the hit rates higher in a cache that allows evictions and replacements. A 100MB cache has hit rates of 40-50% for most of the workloads, and a 1GB cache has hit rates of 80% for most workloads, indicating that these workloads are extremely cache-friendly. Observing the trends, it is interesting that the NoSQL workload has the lowest hit rates and slopes in both the working set analysis and the cache simulations.

Figure 12. Cache miss rates for different cache sizes

E. Entropy

Figure 10 shows the entropies of all the workloads' cache line references. Most of the big data workloads have entropies ranging from 13 to 16, with the NoSQL and sort workloads having the highest entropies, indicating large information content, harder predictability, and poorer compressibility. A common feature of these workloads is that they operate on entire datasets: the inputs and outputs are of comparable size, with large transforms being performed. A noteworthy comparison for Figure 10 is with the entropies of the SPEC workloads in [12]: most of the SPEC workloads have entropies in the range of 8-13 (lower than the big data workloads), with equake and mcf being the only workloads with entropies close to 16.

Figure 10. Entropy of cache line references

F. Correlation and trend analysis

Figure 13 shows the correlation analysis described in Section IV-E, with a known signal that has an increasing slope of 64, for some of the workloads. We normalize the correlation using the mean and standard deviation, to ensure fair comparisons can be made between the different workloads. A higher correlation magnitude indicates that the trend (known signal) is observed strongly in the trace, with a positive value denoting that successive physical addresses are likely to increase in magnitude (by 64 bytes) and a negative value indicating that successive physical addresses are likely to decrease. The Hive aggregation workload has high correlation magnitudes overall, indicating it may be beneficial to predict and prefetch memory references in the higher address ranges (when compression is disabled) and lower address ranges

(when compression is enabled). In most cases (other than the NoSQL workload), enabling the prefetchers during trace collection results in higher correlation, as would be expected from the prefetcher hitting adjacent addresses. In the case of the NoSQL workload, on further examination of the trace, we observed several local phases of increasing and decreasing trends that changed over the duration of the trace.

Figure 13. Correlation of traces with a known signal; suffixes are as follows. p: prefetch, no: no prefetch, c: compression, nc: no compression

VI. CONCLUSION AND OUTLOOK

We examine the design space for memory architectures running big data workloads by analyzing spatial patterns using DIMM traces, providing a detailed memory characterization, and highlighting various observations and insights for system design. Our study shows that several big data workloads can potentially hide latencies by caching references in a faster tier of memory. Moreover, there are observable trends (increasing address ranges) in these workloads, indicating potential for aggressively prefetching large sections of the dataset into a faster tier. For future work, we plan to expand the measurements to include more big data workloads, and to explore further ways to characterize the workloads, with variations in dataset size, etc.

REFERENCES

[1] Wikibon, "Big data statistics."
[2] SeekingAlpha, "Opportunities to play the explosive growth of big data," opportunities-to-play-the-explosive-growth-of-big-data.
[3] D. Boyd and K. Crawford, "Six provocations for big data," Oxford Internet Institute: A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society.
[4] B. Brown, M. Chui, and J. Manyika, "Are you ready for the era of big data?" McKinsey Quarterly, vol. 4.
[5] K. Bakshi, "Considerations for big data: Architecture and approach," in IEEE Aerospace Conference, 2012.
[6] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. H. Byers, "Big data: The next frontier for innovation, competition and productivity," McKinsey Global Institute, Tech. Rep.
[7] Y. Xie, "Modeling, architecture, and applications for emerging memory technologies," IEEE Design & Test of Computers, vol. 28, no. 1.
[8] A. Makarov, V. Sverdlov, and S. Selberherr, "Emerging memory technologies: Trends, challenges, and modeling methods," Microelectronics Reliability, vol. 52, no. 4.
[9] L. A. Barroso, K. Gharachorloo, and E. Bugnion, "Memory system characterization of commercial workloads," ACM SIGARCH Computer Architecture News, vol. 26, no. 3, pp. 3-14.
[10] A. Jaleel, "Memory characterization of workloads using instrumentation-driven simulation: a Pin-based memory characterization of the SPEC CPU2000 and SPEC CPU2006 benchmark suites," Intel Corporation, VSSAD.
[11] F. Zeng, L. Qiao, M. Liu, and Z. Tang, "Memory performance characterization of SPEC CPU2006 benchmarks using TSIM," Physics Procedia, vol. 33.
[12] Y. S. Shao and D. Brooks, "ISA-independent workload characterization and its implications for specialized architectures," in IEEE International Symposium on Performance Analysis of Systems and Software, 2013.
[13] Z. Jia, L. Wang, J. Zhan, L. Zhang, and C. Luo, "Characterizing data analysis workloads in data centers," arXiv preprint.
[14] A. Basu, J. Gandhi, J. Chang, M. D. Hill, and M. M. Swift, "Efficient virtual memory for big memory servers," in International Symposium on Computer Architecture, 2013.
[15] J.-S. Kim, X. Qin, and Y. Hsu, "Memory characterization of a parallel data mining workload," in Workload Characterization: Methodology and Case Studies, IEEE, 1999.
[16] M. Karlsson, K. Moore, E. Hagersten, and D. Wood, "Memory characterization of the ECperf benchmark," in Workshop on Memory Performance Issues.
[17] Y. Xu, E. Frachtenberg, S. Jiang, and M. Paleczny, "Characterizing Facebook's memcached workload," IEEE Internet Computing, vol. 99, p. 1.
[18] Y. Xie, "Future memory and interconnect technologies," in Design, Automation & Test in Europe Conference & Exhibition (DATE), IEEE, 2013.
[19] J. Chen, R. C. Chiang, H. H. Huang, and G. Venkataramani, "Energy-aware writes to non-volatile main memory," ACM SIGOPS Operating Systems Review, vol. 45, no. 3.
[20] E. Chen et al., "Advances and future prospects of spin-transfer torque random access memory," IEEE Transactions on Magnetics, vol. 46, no. 6.
[21] S. Huang, J. Huang, J. Dai, T. Xie, and B. Huang, "The HiBench benchmark suite: Characterization of the MapReduce-based data analysis," in International Conference on Data Engineering Workshops, IEEE, 2010.
[22] M. Stonebraker, "SQL databases v. NoSQL databases," Communications of the ACM, vol. 53, no. 4.
[23] A. E. Gattiker, F. H. Gebara, A. Gheith, H. P. Hofstee, D. A. Jamsek, J. Li, E. Speight, J. W. Shi, G. C. Chen, and P. W. Wong, "Understanding system and architecture for big data," IBM Research.
[24] J. Chang, K. T. Lim, J. Byrne, L. Ramirez, and P. Ranganathan, "Workload diversity and dynamics in big data analytics: implications to system designers," in Workshop on Architectures and Systems for Big Data, 2012.
[25] H. Yang, Z. Luan, W. Li, D. Qian, and G. Guan, "Statistics-based workload modeling for MapReduce," in Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012.
[26] B. Atikoglu, Y. Xu, E. Frachtenberg, S. Jiang, and M. Paleczny, "Workload analysis of a large-scale key-value store," in ACM International Conference on Measurement and Modeling of Computer Systems, 2012.
[27] Z. Ren, X. Xu, J. Wan, W. Shi, and M. Zhou, "Workload characterization on a production Hadoop cluster: A case study on Taobao," in IEEE International Symposium on Workload Characterization, 2012.
[28] J. Issa et al., "Hadoop and memcached: Performance and power characterization and analysis," Journal of Cloud Computing, vol. 1, no. 1.


More information

Telecom Data processing and analysis based on Hadoop

Telecom Data processing and analysis based on Hadoop COMPUTER MODELLING & NEW TECHNOLOGIES 214 18(12B) 658-664 Abstract Telecom Data processing and analysis based on Hadoop Guofan Lu, Qingnian Zhang *, Zhao Chen Wuhan University of Technology, Wuhan 4363,China

More information

Infrastructure Matters: POWER8 vs. Xeon x86

Infrastructure Matters: POWER8 vs. Xeon x86 Advisory Infrastructure Matters: POWER8 vs. Xeon x86 Executive Summary This report compares IBM s new POWER8-based scale-out Power System to Intel E5 v2 x86- based scale-out systems. A follow-on report

More information

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica

More information

An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct

More information

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Rekha Singhal and Gabriele Pacciucci * Other names and brands may be claimed as the property of others. Lustre File

More information

Shareability and Locality Aware Scheduling Algorithm in Hadoop for Mobile Cloud Computing

Shareability and Locality Aware Scheduling Algorithm in Hadoop for Mobile Cloud Computing Shareability and Locality Aware Scheduling Algorithm in Hadoop for Mobile Cloud Computing Hsin-Wen Wei 1,2, Che-Wei Hsu 2, Tin-Yu Wu 3, Wei-Tsong Lee 1 1 Department of Electrical Engineering, Tamkang University

More information

VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5

VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5 Performance Study VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5 VMware VirtualCenter uses a database to store metadata on the state of a VMware Infrastructure environment.

More information

An efficient Join-Engine to the SQL query based on Hive with Hbase Zhao zhi-cheng & Jiang Yi

An efficient Join-Engine to the SQL query based on Hive with Hbase Zhao zhi-cheng & Jiang Yi International Conference on Applied Science and Engineering Innovation (ASEI 2015) An efficient Join-Engine to the SQL query based on Hive with Hbase Zhao zhi-cheng & Jiang Yi Institute of Computer Forensics,

More information

Impact of Big Data: Networking Considerations and Case Study

Impact of Big Data: Networking Considerations and Case Study 30 Impact of Big Data: Networking Considerations and Case Study Yong-Hee Jeon Catholic University of Daegu, Gyeongsan, Rep. of Korea Summary which exceeds the range possible to store, manage, and Due to

More information

Dynamic resource management for energy saving in the cloud computing environment

Dynamic resource management for energy saving in the cloud computing environment Dynamic resource management for energy saving in the cloud computing environment Liang-Teh Lee, Kang-Yuan Liu, and Hui-Yang Huang Department of Computer Science and Engineering, Tatung University, Taiwan

More information

The MAX5 Advantage: Clients Benefit running Microsoft SQL Server Data Warehouse (Workloads) on IBM BladeCenter HX5 with IBM MAX5.

The MAX5 Advantage: Clients Benefit running Microsoft SQL Server Data Warehouse (Workloads) on IBM BladeCenter HX5 with IBM MAX5. Performance benefit of MAX5 for databases The MAX5 Advantage: Clients Benefit running Microsoft SQL Server Data Warehouse (Workloads) on IBM BladeCenter HX5 with IBM MAX5 Vinay Kulkarni Kent Swalin IBM

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

System Requirements Table of contents

System Requirements Table of contents Table of contents 1 Introduction... 2 2 Knoa Agent... 2 2.1 System Requirements...2 2.2 Environment Requirements...4 3 Knoa Server Architecture...4 3.1 Knoa Server Components... 4 3.2 Server Hardware Setup...5

More information

Intel Solid- State Drive Data Center P3700 Series NVMe Hybrid Storage Performance

Intel Solid- State Drive Data Center P3700 Series NVMe Hybrid Storage Performance Intel Solid- State Drive Data Center P3700 Series NVMe Hybrid Storage Performance Hybrid Storage Performance Gains for IOPS and Bandwidth Utilizing Colfax Servers and Enmotus FuzeDrive Software NVMe Hybrid

More information

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

More information

The Methodology Behind the Dell SQL Server Advisor Tool

The Methodology Behind the Dell SQL Server Advisor Tool The Methodology Behind the Dell SQL Server Advisor Tool Database Solutions Engineering By Phani MV Dell Product Group October 2009 Executive Summary The Dell SQL Server Advisor is intended to perform capacity

More information

Performance Impacts of Non-blocking Caches in Out-of-order Processors

Performance Impacts of Non-blocking Caches in Out-of-order Processors Performance Impacts of Non-blocking Caches in Out-of-order Processors Sheng Li; Ke Chen; Jay B. Brockman; Norman P. Jouppi HP Laboratories HPL-2011-65 Keyword(s): Non-blocking cache; MSHR; Out-of-order

More information

Microsoft SQL Server 2014 Fast Track

Microsoft SQL Server 2014 Fast Track Microsoft SQL Server 2014 Fast Track 34-TB Certified Data Warehouse 103-TB Maximum User Data Tegile Systems Solution Review 2U Design: Featuring Tegile T3800 All-Flash Storage Array http:// www.tegile.com/solutiuons/sql

More information

Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays. Red Hat Performance Engineering

Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays. Red Hat Performance Engineering Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays Red Hat Performance Engineering Version 1.0 August 2013 1801 Varsity Drive Raleigh NC

More information

2009 Oracle Corporation 1

2009 Oracle Corporation 1 The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material,

More information

News and trends in Data Warehouse Automation, Big Data and BI. Johan Hendrickx & Dirk Vermeiren

News and trends in Data Warehouse Automation, Big Data and BI. Johan Hendrickx & Dirk Vermeiren News and trends in Data Warehouse Automation, Big Data and BI Johan Hendrickx & Dirk Vermeiren Extreme Agility from Source to Analysis DWH Appliances & DWH Automation Typical Architecture 3 What Business

More information

Big Data Are You Ready? Thomas Kyte http://asktom.oracle.com

Big Data Are You Ready? Thomas Kyte http://asktom.oracle.com Big Data Are You Ready? Thomas Kyte http://asktom.oracle.com The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated

More information

Big Data: Study in Structured and Unstructured Data

Big Data: Study in Structured and Unstructured Data Big Data: Study in Structured and Unstructured Data Motashim Rasool 1, Wasim Khan 2 [email protected], [email protected] Abstract With the overlay of digital world, Information is available

More information

Accelerating Business Intelligence with Large-Scale System Memory

Accelerating Business Intelligence with Large-Scale System Memory Accelerating Business Intelligence with Large-Scale System Memory A Proof of Concept by Intel, Samsung, and SAP Executive Summary Real-time business intelligence (BI) plays a vital role in driving competitiveness

More information

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce Analytics in the Cloud Peter Sirota, GM Elastic MapReduce Data-Driven Decision Making Data is the new raw material for any business on par with capital, people, and labor. What is Big Data? Terabytes of

More information