Transactional Memory Coherence and Consistency

1 Transactional Memory Coherence and Consistency (LBA Reading Group Presentation). Shimin Chen. Based on talk slides at

2 The Need for Parallelism. Uniprocessor system scaling is hitting limits: power consumption is increasing dramatically, wire delays are becoming a limiting factor, design and verification complexity is now overwhelming, and a single core exploits only limited instruction-level parallelism (ILP). So we need support for multiprocessors, which inherently avoid many of these design problems: replicate small cores rather than designing big ones (multi-core processors) and exploit thread-level parallelism (TLP), while still using ILP within each core. But now we have new problems...

3 Existing Parallel Programming Models. Message-passing: large data packages, relatively infrequent; simple hardware requirements, but significant programmer effort to divide data and work. Shared memory: cache-block-sized packages, frequent and with high performance requirements (important because it backs individual load/store instructions); complex hardware (cache coherence, memory consistency) for an only slightly simpler programming model.

4 Shared Memory Software Problems. Parallel systems are often programmed with synchronization through barriers and shared-variable access control through locks. Lock granularity and organization must balance performance and correctness: coarse-grain locking causes lock contention, while fine-grain locking adds extra overhead. Programmers must be careful to avoid deadlocks and races, and careful not to leave anything unprotected. Performance tuning is not intuitive: bottlenecks are related to low-level events such as false sharing and coherence misses, and feedback is often indirect (cache lines, not variables).

5 Shared Memory Hardware Complexity. Cache coherence protocols are complex: they must track ownership of cache lines, and it is difficult to implement and verify all the corner cases. Memory consistency protocols are also complex: they must provide rules to correctly order individual loads and stores, which is difficult for both hardware and software. Current protocols rely on low latency, not bandwidth, with critical short control messages on ownership transfers (2-3 hops). The latency of short messages is unlikely to scale well in the future, while bandwidth is likely to scale much better: high-speed inter-chip connections, and chip multiprocessors mean on-chip bandwidth!

6 The Key Question. Is there a shared memory model with a simple programming model, a simple hardware implementation, and good performance?

7 Transactional Memory: Idea (1/2). A transaction is a sequence of instructions that is guaranteed to execute atomically. Goal: perform coherence and consistency tasks at the granularity of transactions instead of individual memory references.
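
To make the programmer-visible idea concrete, here is a minimal C sketch contrasting a lock-protected update with the same update written as a transaction. TCC_BEGIN and TCC_END are hypothetical markers invented for this illustration (not the actual TCC API); they stand for the points where the hardware would start speculative execution and request a commit.

    #include <pthread.h>

    /* Hypothetical markers: on TCC hardware these would be new instructions,
     * not macros.  Defined as no-ops here so the sketch compiles.            */
    #define TCC_BEGIN()  /* start speculative execution, checkpoint registers */
    #define TCC_END()    /* arbitrate, broadcast buffered writes, commit      */

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long counter;

    /* Conventional shared-memory version: correctness depends on the lock
     * discipline, and a forgotten lock corrupts data silently.              */
    void add_locked(long v) {
        pthread_mutex_lock(&lock);
        counter += v;
        pthread_mutex_unlock(&lock);
    }

    /* Transactional version: the whole region executes atomically, so there
     * is no lock object, no lock-ordering discipline, and no deadlock risk. */
    void add_transactional(long v) {
        TCC_BEGIN();
        counter += v;        /* conflicts are detected as violations and retried */
        TCC_END();
    }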

8 Transactional Memory: Idea (2/2). Each processor executes its current transaction speculatively, keeping track of speculatively read and written data and buffering all writes. Once the transaction on processor P finishes, P asks for system-wide arbitration for commit and then broadcasts its buffered writes. Other processors check the broadcast against their own speculative state and roll back if there is a violation. Upon commit, P checkpoints its local register state.
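
As a rough software model of this mechanism (purely illustrative: WriteBuffer, tcc_store, and tcc_commit are names invented for this sketch, and real TCC does all of this in cache hardware rather than with a mutex), each processor buffers its stores locally and makes them visible only as a single block once it wins commit arbitration:

    #include <pthread.h>
    #include <stddef.h>
    #include <stdint.h>

    #define MEM_WORDS    4096
    #define WBUF_ENTRIES 64

    static uint64_t shared_mem[MEM_WORDS];               /* stand-in for shared memory   */
    static pthread_mutex_t commit_token =
        PTHREAD_MUTEX_INITIALIZER;                        /* stand-in for commit arbitration */

    /* Per-processor speculative write buffer (illustrative; real TCC keeps this
     * in hardware next to the L1 cache, tagged with read/modified bits).       */
    typedef struct {
        size_t   idx[WBUF_ENTRIES];   /* word indices written speculatively */
        uint64_t val[WBUF_ENTRIES];
        size_t   n;
    } WriteBuffer;

    /* Speculative store: buffered locally, invisible to other processors.
     * Assumes n < WBUF_ENTRIES; overflow is discussed on a later slide.    */
    void tcc_store(WriteBuffer *wb, size_t word, uint64_t v) {
        wb->idx[wb->n] = word;
        wb->val[wb->n] = v;
        wb->n++;
    }

    /* Commit: win arbitration, then drain the buffered stores to shared memory
     * as one atomic block.  A real implementation would also broadcast the
     * addresses so other processors can check them against their read sets.   */
    void tcc_commit(WriteBuffer *wb) {
        pthread_mutex_lock(&commit_token);     /* system-wide arbitration          */
        for (size_t i = 0; i < wb->n; i++)     /* writes become visible in commit order */
            shared_mem[wb->idx[i]] = wb->val[i];
        pthread_mutex_unlock(&commit_token);
        wb->n = 0;                             /* buffer is empty for the next transaction */
    }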

9 Transactional Memory I. Transactions appear to execute in commit order; RAW dependence errors cause transaction violation and restart. (Figure: timeline in which Transaction A stores X and commits, while Transaction B has already speculatively loaded X; B detects the violation and re-executes its loads and stores with the new data.)

10 Transactional Memory II. Antidependencies are handled automatically (details later): WAW dependences are handled by writing buffers to memory only in commit order, and WAR dependences by keeping all writes private until commit. (Figure: Transaction A stores X = 1 and commits, Transaction B stores X = 2 and commits later; any local stores supply data to the transaction's own loads, and later transactions start with, or are updated to, the committed shared state, so X ends up as 2.)

11 TCC's Difference. So what? Transactional memory is old news: Herlihy et al. proposed it as a replacement for locks a decade ago, Rajwar and Goodman and Martinez and Torrellas proposed more automated versions of the same thing recently, and thread-level speculation (TLS) uses transactional memory. TCC's new idea: leave transactions on all of the time. This provides many new benefits and completely eliminates conventional cache coherence and consistency models.

12 The TCC Cycle. Transactions now run in a cycle that continues for all execution. Each processor speculatively executes code and buffers its stores, then waits for commit permission (this wait-for-phase step provides synchronization, if necessary) and arbitrates with the other CPUs. It then commits its stores together, as a block: this provides a well-defined write ordering, can invalidate or update other caches, and the large block utilizes bandwidth effectively. And repeat! (Figure: timelines for several processors showing the Execute Code, Wait for Phase, Arbitrate, and Commit stages between the points where a transaction starts, completes, requests commit, starts its commit, and finishes its commit.)
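
The cycle can be summarized in C-flavored pseudocode (all of the function names below are illustrative stubs; in reality these are hardware actions, not software calls):

    /* Illustrative stubs: in real TCC these are hardware mechanisms. */
    static void begin_transaction(void)             { /* checkpoint registers, clear speculative bits */ }
    static void execute_code_speculatively(void)    { /* loads set read bits; stores go to the write buffer */ }
    static void wait_for_phase(void)                { /* optional ordering / synchronization point */ }
    static void arbitrate_with_other_cpus(int node) { (void)node; /* request commit permission */ }
    static void commit_write_buffer_as_block(void)  { /* broadcast buffered stores as one packet */ }

    /* One processor's view of the TCC cycle: execute -> wait -> arbitrate -> commit. */
    void tcc_processor_loop(int node_id) {
        for (;;) {
            begin_transaction();
            execute_code_speculatively();
            wait_for_phase();
            arbitrate_with_other_cpus(node_id);
            commit_write_buffer_as_block();
            /* ...and repeat with the next transaction. */
        }
    }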

13 Advantages of TCC. TCC trades bandwidth for simplicity and latency tolerance: it is easier to build and not dependent on the timing/latency of individual loads and stores. Transactions eliminate locks: transactions are inherently atomic, which catches the most common parallel programming errors. Shared memory consistency is simplified: the conventional model sequences individual loads and stores, while now the hardware only has to sequence transaction commits. Shared memory coherence is simplified: processors may have copies of cache lines in any state (no MESI), and the commit order implies an ownership sequence.

14 How to Use TCC I. Divide code into potentially parallel tasks, usually loop iterations, code after function calls, etc. For the initial division, tasks = transactions, but transactions can be subdivided or grouped to match hardware limits (buffering). This is similar to threading in conventional parallel programming, except that we do not have to verify parallelism in advance and locking is handled automatically, so it is much easier to get a parallel program running correctly. The programmer then orders transactions as necessary. Ordering techniques are implemented using phase numbers: assign an age number to each transaction. The scheme is deadlock-free (at least one transaction is always the oldest) and livelock-free (watchdog hardware can easily insert barriers anywhere). Three common scenarios...
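
As a sketch of this programming style (the TCC_PARALLEL_FOR marker below is a hypothetical stand-in for the preliminary TCC API mentioned in the evaluation section, defined here as a no-op so the code stays sequential and compilable), each loop iteration becomes one transaction, and we never have to prove in advance that the iterations are independent:

    #include <stddef.h>

    #define N 1000000
    static double a[N], b[N], sum;

    /* Hypothetical marker: "run each iteration of the following loop as its own
     * transaction".  It expands to nothing here; on TCC hardware the iterations
     * would execute speculatively in parallel.                                  */
    #define TCC_PARALLEL_FOR

    void scale_and_sum(double k) {
        TCC_PARALLEL_FOR
        for (size_t i = 0; i < N; i++) {
            a[i] = k * b[i];   /* independent work: these transactions never conflict      */
            sum += a[i];       /* a true shared dependence: any conflict shows up as a
                                  violation and a re-execution, never as silent corruption */
        }
    }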

15 How to Use TCC II. Unordered transactions for purely parallel code; fully ordered transactions to specify sequential tasks; partially ordered transactions to insert synchronization such as barriers. (Figure: timelines for several CPUs showing unordered transactions, a barrier implemented as a phase transition, and fully ordered transactions for parallelized sequential code.)
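
A small sketch of how phase numbers might be exposed (hypothetical: tcc_set_phase is invented for this illustration and simply records the "age" a transaction would be given): transactions that share a phase number are unordered with respect to each other, strictly increasing phase numbers give fully ordered execution, and a barrier is just a transition to a higher phase.

    /* Hypothetical phase register; on real hardware the phase/age would be
     * attached to each transaction when it is created.                      */
    static int current_phase;

    static void tcc_set_phase(int phase) {
        current_phase = phase;   /* commits of phase p wait for all phases < p */
    }

    /* Partially ordered: two parallel sections separated by a barrier. */
    void barrier_example(void) {
        tcc_set_phase(0);        /* unordered: phase-0 transactions commit in any order      */
        /* ... first parallel section ... */
        tcc_set_phase(1);        /* barrier: no phase-1 commit before every phase-0 commit   */
        /* ... second parallel section ... */
    }

    /* Fully ordered: give every transaction its own, increasing phase. */
    void ordered_example(void) {
        for (int step = 0; step < 10; step++) {
            tcc_set_phase(step); /* transaction 'step' commits after transaction 'step-1'    */
            /* ... work for this step, executed speculatively ahead of its turn ... */
        }
    }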

16 How to Use TCC III. Performance tuning: based on feedback from runtime violation reports, choose transactions to maximize parallelism and minimize dependences.

17 Sample TCC Hardware. (Figure: block diagram of one node with a processor core, an L1 cache carrying transaction control bits (read, modified, etc.), the local cache hierarchy, a write buffer fed by stores only, and commit control with a node number and phase, connected to the other nodes over a broadcast bus or network for snooping and commits.) The required additions are: write buffers and some L1 cache bits (read, modified, optionally renamed), with the write buffer in the processor, before broadcast; a broadcast bus or network to distribute commit packets, so that all processors see the commits in a single order; snooping on broadcasts, which triggers violations if necessary; and commit arbitration/sequencing logic.
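
A minimal model of the snooping step is sketched below (illustrative only: real TCC keeps the read and modified bits on L1 cache lines and performs this comparison in the cache controller, not with a software table, and the structure names here are invented):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define LINES 256                     /* number of L1 lines in this toy model */

    typedef struct {
        uintptr_t tag[LINES];
        bool      read_bit[LINES];        /* line was speculatively read in the current transaction */
        bool      modified_bit[LINES];    /* line was speculatively written (still in write buffer)  */
        bool      violated;               /* another node's commit hit a speculatively read line     */
    } NodeState;

    static size_t line_index(uintptr_t addr) { return (addr / 64) % LINES; }   /* 64-byte lines */

    /* Called for every address in a commit packet snooped off the broadcast bus. */
    void snoop_committed_address(NodeState *node, uintptr_t addr) {
        size_t i = line_index(addr);
        if (node->tag[i] == addr / 64 && node->read_bit[i]) {
            node->violated = true;        /* RAW violation: roll back and restart the transaction */
        }
        /* An update protocol would also refresh the line's data here;
         * an invalidation protocol would simply invalidate the line.  */
    }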

18 Sample TCC Hardware (continued). Speculative state cannot be allowed to overflow: it must be held in the cache or a victim cache. A fix: ask for commit permission early, after which the state is no longer speculative; this pseudo-overflow mechanism is also used to handle I/O.
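
The overflow fix might be pictured with the following control-flow sketch (a guess at the mechanism, with invented helper names): when speculative state is about to spill out of the cache or victim cache, the transaction requests commit permission early; once it holds the commit token, no other transaction can commit in between, so it can continue to completion without buffering further state speculatively.

    #include <stdbool.h>

    /* Illustrative stubs for hardware conditions and actions. */
    static bool speculative_state_about_to_overflow(void) { return false; }
    static void request_commit_permission_and_wait(void)  { /* arbitrate early, then hold the token */ }

    /* After commit permission is acquired, later stores can go straight to
     * memory: atomicity is preserved because no other node may commit in
     * between, even though the state is no longer buffered speculatively.
     * The same "pseudo-overflow" path is used for I/O, which cannot be
     * rolled back.                                                          */
    void maybe_handle_overflow(bool *running_non_speculatively) {
        if (!*running_non_speculatively && speculative_state_about_to_overflow()) {
            request_commit_permission_and_wait();
            *running_non_speculatively = true;
        }
    }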

19 Evaluation Methodology. We simulated a wide range of applications (SPLASH-2, SPEC, Java, SPECjbb), divided into transactions using a preliminary TCC API. The analysis is trace-based: we generated execution traces from sequential execution, then analyzed the traces while varying the number of processors, the interconnect bandwidth, and the communication overheads. Simplifications: the results shown assume infinite caches and write buffers, but we track the amount of state stored in them; and we assume a fixed one cycle per instruction, which would require a reasonable superscalar processor to achieve.

20 Limits of Available Parallelism. TCC speedups are similar to conventional ones, and sometimes better: SPECjbb eliminates locking overhead within warehouses. (Figure: speedup over 1 processor vs. number of processors for explicitly parallel applications — art, equake_l, equake_s, lu, radix, SPECjbb, swim, tomcatv, water — and for TLS-Java applications — euler, fft, jbyte_b5, jbyte_b6, jbyte_b8, jbyte_b9, moldyn, mtrt, raytrace, shallow.)

21 Write Buffering Needs. Only a few KB of write buffering is needed, set by the natural transaction sizes in the applications; occasional overflow can be handled by committing early. This is reasonable for on-chip buffers. (Figure: write state in KB, with 64-byte lines, for each application at several percentage cutoffs.)

22 Broadcast Bandwidth. Another issue is broadcast bandwidth (an update protocol). If data is sent with the commit, then to avoid broadcast saturation it needs about 16 bytes/cycle at 32 processors with whole modified lines, but only about 8 bytes/cycle with dirty data only. This is high, but feasible on-chip. (Figure: bytes per cycle for whole lines vs. dirty data only, for radix (xl, l, m, s, xs), SPECjbb, tomcatv, and fft.)

23 Bandwidth Sensitivity. Most parallel applications are tolerant of limited bandwidth; SPECjbb shows some server-code noise in its speedup variation. (Figure: normalized speedup vs. bandwidth limit in bytes/cycle, at 8 processors, for the explicitly parallel and TLS-Java applications.)

24 Snoop Bandwidth. Snooping requirements are quite reasonable: significantly less than 1 address per cycle on most systems. Address-only commits could reduce the bandwidth requirements further: broadcast only addresses for an invalidation-based protocol and send full packets only to memory. (Figure: addresses per cycle for radix (xl, l, m, s, xs), SPECjbb, tomcatv, and fft.)

25 Conclusions. TCC simplifies shared memory control hardware: it trades higher interconnect bandwidth for simpler protocols, eliminates traditional MESI coherence protocols, and moves most communication into large, less latency-sensitive packets; scaling trends favor these trade-offs in the future. TCC also eases parallel programming: transactions provide error tolerance and free locking, allowing anything from all-manual to nearly automated parallelization. More on this at ASPLOS-XI in October.
