Optimizing a Semantic Comparator using CUDA-enabled Graphics Hardware
Aalap Tripathy, Suneil Mohan, Rabi Mahapatra
Embedded Systems and Codesign Lab, codesign.cse.tamu.edu
(Presented at ICSC 2011, September 19, 2011 in Palo Alto, CA)
09/19/2011
Overview
- Introduction
- Current vs. future technologies
- Key steps in computation
- Description of architecture
- Experimental Setup & Results
- Conclusions
Introduction
Search is a key activity on the Internet
- 3 billion queries a month (3500/sec)
- Growing rapidly (38% annually)
Search Engines need to be Precise
- Increased user expectations from search results
- Not 100 links, but a few relevant documents
Search Engines need to be Scalable
- Search engines are deployed as distributed systems
- Newer methods make more computational demands
Search Engines need to consume Low Energy
- Tens of megawatts (.5 MW/year)
- Coarse-grained, task-parallel approaches are insufficient
Objective: Deploy meaning-based search to enhance search quality while consuming less energy and meeting timing constraints
Current Search Engines
[Architecture diagram: a query q enters a Front-End Web App, which returns (URL, snippet) results; a Query Processor fans the query out to pools of Index Servers (I_0 ... I_N, Pools 0 ... N_p) and receives (score, docid) pairs; a Document Processor retrieves the matching documents from pools of Doc Servers.]
Current Search Paradigm: Vector Method
Example: "The American man ate Indian food."
D = s_American V_American + s_man V_man + s_ate V_ate + s_Indian V_Indian + s_food V_food
- D: descriptor representing the meaning of the entire text
- V_American: basis vector representing the term "American"
- s: scalar weight / coefficient denoting the relative importance of the term
The Vector Method cannot differentiate between two documents containing the same keywords:
- "American man ate Indian food" vs. "Indian man ate American food"
- Hundreds of irrelevant results are produced: no precision
- Hundreds of redundant operations are performed in the process
- Tens of MW of power are consumed in the process
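The weakness just described can be seen in a few lines of Python (an illustrative sketch, not the paper's code; the term weights are made up):

```python
# Vector-method similarity: each document is a sparse map from basis
# terms to scalar weights, and similarity is the dot product over
# the terms the two documents share.
def vector_similarity(d1, d2):
    """Dot product of two sparse term-weight vectors (dicts)."""
    return sum(w * d2[t] for t, w in d1.items() if t in d2)

# Both sentences contain exactly the same terms, so they map to the
# same vector -- the vector method cannot tell them apart.
doc_a = {"american": 0.4, "man": 0.3, "ate": 0.1, "indian": 0.4, "food": 0.3}
doc_b = {"indian": 0.4, "man": 0.3, "ate": 0.1, "american": 0.4, "food": 0.3}
print(vector_similarity(doc_a, doc_b))  # same score as comparing doc_a to itself
```

Since word order and term relationships are discarded, the cross-document score equals the self-similarity score, which is exactly the precision failure the slide points out.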
Future Search Paradigm: Tensor Method
Example: "The American man ate Indian food" vs. "The Indian man ate American food" (3 keyword terms each)
Basis vector terms for T1: a = "man", b = "American", c = "ate", a∧b = "man American", b∧c = "American ate", a∧b∧c = "man American ate", ... with scalar coefficients s_1 ... s_7
Basis vector terms for T2: a = "man", e = "Indian", c = "ate", a∧e = "man Indian", e∧c = "Indian ate", a∧e∧c = "man Indian ate", ... with scalar coefficients s_1 ... s_7
The Tensor Method differentiates documents containing the same keywords:
- It captures the relationships between terms
- At what cost? An exponentially larger number of terms
Semantic Comparison using Tensors
Tensor T1 ("The American man ate Indian food") and Tensor T2 ("The Indian man ate American food"), each a set of basis vectors with scalar coefficients s_1 ... s_7.
1. Identify the common basis vectors
2. Multiply their scalar coefficients
3. Find the sum of all products
Similarity(T1, T2) = T1 · T2 = sum over common basis vectors of the coefficient products (< 1)
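The three steps above can be sketched directly (a toy illustration with hypothetical coefficient values; basis terms are frozensets so "man∧ate" matches regardless of term order):

```python
# Tensor-method similarity: each document is a map from basis vectors
# (single terms and term combinations) to scalar coefficients.
def tensor_similarity(t1, t2):
    # 1. identify common basis vectors, 2. multiply their scalar
    # coefficients, 3. sum all the products.
    return sum(c * t2[b] for b, c in t1.items() if b in t2)

def basis(*terms):
    return frozenset(terms)

# "The American man ate Indian food" vs "The Indian man ate American
# food": shared single terms, but different term *combinations*.
t_american = {basis("man"): 0.5, basis("american"): 0.6, basis("ate"): 0.7,
              basis("man", "american"): 0.4, basis("american", "ate"): 0.3,
              basis("man", "american", "ate"): 0.1}
t_indian = {basis("man"): 0.5, basis("indian"): 0.6, basis("ate"): 0.7,
            basis("man", "indian"): 0.4, basis("indian", "ate"): 0.3,
            basis("man", "indian", "ate"): 0.1}
score = tensor_similarity(t_american, t_indian)
print(score)  # only "man" and "ate" are common: 0.5*0.5 + 0.7*0.7 < 1
```

Unlike the vector method, the combination terms ("man American" vs. "man Indian") fail to match, so the two sentences score below self-similarity.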
Key Steps in Semantic Computation
For two tensors of sizes n1, n2, search is O(n1 · n2) or O(n1 · log n2).
Can we improve upon this?
Bloom Filters
A Bloom Filter enables a compact representation of a set.
- Parameters: the number of elements to be inserted (m), the size of the Bloom Filter (size_n), the number of indices used to represent each element (k)
- The probability of false positives can be controlled
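A minimal Bloom filter can be sketched as below (a host-side illustration, not the paper's CUDA kernel; MD5 is used since the appendix slides mention it, and the k indices are derived from two halves of the digest by double hashing, which is an assumption about the construction):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit array plus k indices per element,
    derived from two hashes of the element via double hashing."""
    def __init__(self, m_bits, k):
        self.m, self.k = m_bits, k
        self.bits = [0] * m_bits

    def _indices(self, element):
        # Two base hashes from one MD5 digest; k indices combined from them.
        digest = hashlib.md5(element.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, element):
        for idx in self._indices(element):
            self.bits[idx] = 1

    def test(self, element):
        # May report false positives, never false negatives.
        return all(self.bits[idx] for idx in self._indices(element))

bf = BloomFilter(m_bits=10240, k=7)
bf.insert("man^american")
print(bf.test("man^american"))  # True (an inserted element always tests positive)
print(bf.test("man^indian"))    # almost certainly False
```

The controllable false-positive rate comes from the choice of m_bits and k, as the slide notes.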
Details of Comparison
Coefficient table of Query Tensor (Table 1): rows of (Tensor id Id_1 ... Id_n1, scalar coefficient Coeff_i, set of BF bit indices {x_i : 0 ≤ x_i ≤ size_n1}), e.g. Coeff_i = 0.4 with indices {0, 1, 5}.
Coefficient table of Doc Tensor (Table 2): rows of (Tensor id Id_1 ... Id_n2, coefficient, set of BF bit indices {x_i : 0 ≤ x_i ≤ size_n2}), e.g. {0, 1, 5}; these indices are also set in the Bloom filter (bit array) of the Doc Tensor.
1. Identify the common basis vectors (filtered)
2. Multiply the coefficients
3. Compute the sum of products: T1 · T2 = ... + s_i s'_i + ... + s_j s'_j + ...
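Putting the pieces together, a CPU-side analogue of this filtered comparison might look like the following (a sketch under assumed details: tables are dicts keyed by basis-term strings, the filter size and hash construction are illustrative, and survivors of the filter are confirmed with an exact lookup to discard false positives):

```python
import hashlib

def bf_indices(term, m, k):
    """k bit indices for a term, via double hashing of its MD5 digest."""
    d = hashlib.md5(term.encode()).digest()
    h1, h2 = int.from_bytes(d[:8], "big"), int.from_bytes(d[8:], "big")
    return [(h1 + i * h2) % m for i in range(k)]

def filtered_similarity(query_table, doc_table, m=10240, k=7):
    # Encode the doc table's basis terms into a Bloom filter bit array.
    bits = [0] * m
    for term in doc_table:
        for idx in bf_indices(term, m, k):
            bits[idx] = 1
    # Test each query term against the filter; only survivors proceed to
    # the coefficient lookup, where false positives are weeded out.
    total = 0.0
    for term, coeff in query_table.items():
        if all(bits[i] for i in bf_indices(term, m, k)) and term in doc_table:
            total += coeff * doc_table[term]
    return total

q = {"man": 0.5, "ate": 0.7, "man^american": 0.4}
d = {"man": 0.5, "ate": 0.7, "man^indian": 0.4}
print(filtered_similarity(q, d))  # matches on "man" and "ate" only
```

The point of the filter is that the cheap bit tests prune most non-matching terms before any coefficient lookup, which is what makes the comparison amenable to massively parallel per-term threads.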
Architecture Description: CUDA
CUDA (Compute Unified Device Architecture)
- A device architecture spec
- An extension to C (libraries, APIs, compiler)
GPGPUs use a heterogeneous parallel computing model:
- A kernel is called by the host and run by the GPU
- Each SIMD processor executes the same instruction over different data elements in parallel
- Thousands of threads can be processed simultaneously
- The number of logical threads and thread blocks surpasses the number of physical execution units
Architecture Description: CUDA Memory Model
- Thread: registers (per thread); local memory (off-chip)
- Block: shared memory between threads
- Device: global memory between kernels; constant memory (read-only, stores invariants); texture memory (limited, but can cache parts of global memory)
Semantic Comparison using CUDA
Phase A: Copy Table 1 and Table 2 from the Host PC to CUDA Global Memory
Phase B: Encode Table 2 in a Bloom Filter
Phase C: Encode Table 1; test presence in the Bloom Filter
Phase D: Compute Semantic Similarity using the filtered elements
CUDA Programming Model:
- Split a task into subtasks
- Divide input data into chunks that fit global memory
- Load a data chunk from global memory into shared memory
- Each data chunk is processed by a thread block
- Copy results from shared memory back to global memory
Optimization 1: Maximize independent parallelism
Phase A: Copy Data from Host to GPU
Copy the two tables to be compared into CUDA Global Memory
- Data has to be explicitly copied into CUDA Global Memory
- Optimization 2: The data structure is flattened to increase coalesced memory accesses
- Maximize the available bandwidth (76.8 GB/s for the NVIDIA C870)
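The flattening in Optimization 2 can be illustrated host-side: instead of an array of per-term records, the table is laid out as parallel arrays, so consecutive threads touch consecutive addresses (a sketch with made-up field values; the record layout is an assumption based on the coefficient-table description):

```python
# Array-of-structures: one record (id, coefficient, BF bit indices)
# per basis term -- on a GPU this produces strided, uncoalesced reads.
records = [("id_0", 0.4, [0, 1, 5]),
           ("id_1", 0.7, [2, 9, 3]),
           ("id_2", 0.1, [8, 4, 6])]

# Structure-of-arrays: each field stored contiguously, so thread i
# reading coeffs[i] sits next to thread i+1 reading coeffs[i+1]
# (coalesced into one memory transaction).
ids = [r[0] for r in records]
coeffs = [r[1] for r in records]
bit_indices = [idx for r in records for idx in r[2]]  # flattened, k=3 per term

print(coeffs)       # [0.4, 0.7, 0.1]
print(bit_indices)  # [0, 1, 5, 2, 9, 3, 8, 4, 6]
```

Flattened arrays also transfer over PCIe in one contiguous copy rather than many small ones, which is the other half of the bandwidth argument.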
Phase B: Encode Coefficient Table in Bloom Filter
Encode Table 2 (the document basis-coefficient table) in the Bloom Filter
- The i-th Doc_Basis term is hashed using two hash functions
- k additional Bloom Filter indices are generated by combining the two hashes (e.g., via double hashing: index_i = (h1 + i · h2) mod m)
- Turn on every such index position in the BF bit array in CUDA texture memory
- Optimization 3: At least n2 threads are launched. Limit the number of blocks and increase the number of threads; increase shared-memory reuse.
Phase C: Encode & Test Table 1 using the Bloom Filter
Encode Table 1 in the Bloom Filter and test for membership
- The i-th Query_Basis term is hashed using the same two hash functions
- k additional Bloom Filter indices are generated
- Those index positions are tested in the BF bit array in CUDA texture memory
- At least n1 threads are launched
- If all indices are 1, Query_Basis_i is a filtered element; store Index(i)
- Optimization 4: Shared memory is inaccessible after the end of a kernel. This data is transferred to Global Memory at the end of each thread block.
Phase D: Compute Semantic Similarity using the Filtered Elements
Extract the corresponding scalar coefficients, multiply, and sum
- The index of the potential match in Table 1 is used to look up Coeff_1
- The corresponding match in Table 2 is used to look up Coeff_2
- The same kernel performs the multiplication (interim products)
- Intermediate products from multiple threads are summed in parallel
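The parallel summation at the end of Phase D is the classic tree-reduction pattern; its shape can be mimicked sequentially (an illustration of the pattern only, not the CUDA kernel):

```python
def tree_reduce_sum(values):
    """Pairwise (tree) reduction in log2(n) rounds, mirroring how CUDA
    threads in a block would sum partial products held in shared memory."""
    vals = list(values)
    while len(vals) > 1:
        if len(vals) % 2:          # pad odd lengths with the additive identity
            vals.append(0.0)
        # One "round": every pair is summed; on a GPU each pair is one thread.
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0] if vals else 0.0

products = [0.25, 0.49, 0.12, 0.08, 0.3]
print(tree_reduce_sum(products))  # equals sum(products) up to float rounding
```

On the GPU, each round halves the number of active threads, so n interim products are reduced in O(log n) steps instead of the O(n) steps of a serial accumulation.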
Other Algorithmic Optimizations
Partitioning the computation to keep all stream cores busy
- Optimization 5: Multiple threads and multiple thread blocks in constant use
Monitoring per-processor resource utilization
- Optimization 6: Low utilization per thread block allows multiple active blocks per multiprocessor
Experimental Setup
Device Characteristics (NVIDIA Tesla C870):
- Stream Processors / cores: 128 / 16
- Core Frequency: 600 MHz
- CUDA Toolkit: 3.x
- Interface: 16x PCI-Express
- Memory Clock: 1.6 GHz
- Global Memory: 1.6 GB
- Constant Memory: 65 KB
- Shared Memory per block: 16 KB
- Registers per block: 8192
- Threads per block: 512
- Memory Bus Bandwidth: 76.8 GB/s, 384-bit-wide GDDR3
- Warp Size (threads per thread processor): 32
- Host: P4 CPU, GB RAM, Ubuntu 9.10
Experimental Parameters: Table Size (N), Similarity between tables (c), CUDA parameters (num_blocks, threads per block)
Experimental Measurements: Execution Time, Power/Energy, Throughput (Comparisons/sec)
Results: Execution Time Profiling
[Plot: Execution time (ms), 0-200, for GPU_Exec_Time(ms) vs. CPU_Exec_Time(ms) against the number of entries (terms) in Tables 1 & 2 (n1 = n2 = N): 1000, 5000, 10000, 50000, 100000, 250000.]
- Exponential increase in CPU execution time for large tables
- The same dataset on a GPU is up to 4x faster (similarity c = 0%)
Results: Power Profiling
[Plot: Dynamic power (W), 0-300, for GPU Dynamic Power(W) vs. CPU Dynamic Power(W) against the number of entries in Tables 1 & 2 (n1 = n2 = N): 1000 to 250000.]
CPU-GPU Power Characteristics (measured using a WattsUp power meter at the mains):
- System Base Power: 5 W
- System Idle Power (GPU cold shutdown): 50 W
- System Idle Power (GPU awake, idle): 86 W
- GPU Idle Power: 36 W
GPU dynamic power is lower, but approaches that of the CPU for N > 50000.
GPUs are known to be energy-efficient, but not necessarily power-efficient.
Results: Energy Saved per Comparison
Table Size (N) | CPU Exec Time (s) | CPU Avg Power (W) | GPU Exec Time (s) | GPU Avg Power (W) | Energy Saved (%)
5k | 0.8 | 3 | 0.05 | 59 | 79.65
10k | 0.74 | 39 | 0. | 56 | 77.64
50k | 0.0 | 4 | 4.93 | 88 | 77.7
100k | 8.4 | 46 | 9.57 | 7 | 77.96
250k | 85.3 | 5 | 43.83 | 33 | 78.04
Computing Energy Saved (Wh%)
- Experiments over 5000 < N < 250000; similarity between tables c = 75%
- Energy savings of ~78% per comparison
- A future semantic search engine can either reduce its energy footprint, or increase throughput within the same footprint
Results: Profiling the Semantic Comparator Kernels on the GPU
[Plot: per-phase execution breakdown against the number of entries in Tables 1 & 2 (n1 = n2 = N).]
Profiling the semantic kernels:
- Data copy from CPU to GPU (Phase A) ceases to be a bottleneck for N > 5k
- Extracting the scalar coefficients (Phase D) becomes the bottleneck
- Computing the hash functions and insertion into the Bloom Filter (Phases B, C) are computationally negligible
Results: Throughput Improvement
Table Size (N) | CPU Throughput (comparisons/s) | GPU Throughput (comparisons/s) | Improvement
5k | 53996.9 | 73097.43 | 3.0
10k | 3458.9 | 4675.69 | 3.47
50k | 499.40 | 05.8 | 4.05
100k | .39 | 50.85 | 4.0
250k | 53.94 | 8. | 4.2
Improvement in throughput:
- Experiments were run with randomly varying similarity between tables for a given N
- Throughput was defined as the inverse of the averaged execution time for a given N
- GPU throughput improvement is higher for larger values of N
- For smaller values of N, the overhead of data transfer from CPU to GPU dominates
Conclusions
Semantic search requires the introduction of fine-grained parallelism at the compute nodes
- Search Engine Precision: use the Tensor Method for meaning representation
- Search Engine Scalability: handle the explosive growth in coefficient tables within a compute node; leverage off-the-shelf hardware like GPUs as co-processors
- Search Engine Energy Consumption: the GPU-based semantic comparator has extraordinary energy efficiency
We have designed a GPU-based co-processor that provides a 4x speedup and 78% energy savings over a traditional CPU for semantic search
Thank You
Q&A
Optimizing a Semantic Comparator using CUDA-enabled Graphics Hardware
Appendix (Extra Slides)
Optimizing a Semantic Comparator using CUDA-enabled Graphics Hardware
Comparison with Prior Art
Characteristic | Traditional Microprocessor | GPGPU | ASIC
Time/Cycles | Worst performing | Medium | 4 cycles @ realizable clock frequency
Energy Savings | Worst performing | Moderately high | Very high
Adoption Cost | Low | Intermediate | High fabrication, development, and integration costs; I/O issues not addressed
Overall characterization | Low speed, low cost | Balanced cost and speed | High speed, high cost
Future Work
Memory I/O issues:
- Transmit only the hashed datasets to the GPU: reduces the dataset from N×40×2 to N×8×2 bytes per table (5 times smaller)
- Transmit only one hash instead of two to the GPU, and compute the second set of hashes on the GPU from the first: reduces the dataset from N×8×2 to N×8×1
- One kernel cannot call another; control has to pass through the CPU
Vary GPU parameters:
- Experimentation with multiple grids (in this paper a single grid was used)
- Further experimentation with varying numbers of blocks and threads per block
References
S. Mohan, A. Tripathy, A. Biswas, and R. Mahapatra, "Parallel Processor Cores for Semantic Search Engines," presented at the Workshop on Large-Scale Parallel Processing (LSPP), held at the IEEE International Parallel and Distributed Processing Symposium (IPDPS'11), Anchorage, Alaska, USA, 2011.
S. Mohan, A. Biswas, A. Tripathy, J. Panigrahy, and R. Mahapatra, "A parallel architecture for meaning comparison," presented at the IEEE International Symposium on Parallel & Distributed Processing (IPDPS), Atlanta, GA, 2010.
A. Biswas, S. Mohan, A. Tripathy, J. Panigrahy, and R. Mahapatra, "Semantic Key for Meaning Based Searching," in 2009 IEEE International Conference on Semantic Computing (ICSC 2009), 14-16 September 2009, Berkeley, CA, USA.
A. Biswas, S. Mohan, and R. Mahapatra, "Search Co-ordination by Semantic Routed Network," in 18th International Conference on Computer Communications and Networks (ICCCN 2009), 2009, San Francisco, CA, USA.
A. Biswas, S. Mohan, J. Panigrahy, A. Tripathy, and R. Mahapatra, "Representation of complex concepts for semantic routed network," in 10th International Conference on Distributed Computing and Networking (ICDCN 2009), 2009, Hyderabad, pp. 127-138.
A. Biswas, S. Mohan, and R. Mahapatra, "Optimization of Semantic Routing Tables," in 17th International Conference on Computer Communications and Networks (ICCCN 2008), 2008, U.S. Virgin Islands.
Challenge: Explosive Growth in the Number of Terms with the Tensor Model
Representation of intent | Index terms (Vector Method) | Index terms (Tensor Method)
a b c | 3 | 7
a b c d e | 5 | 31
a b c d e f g h | 8 | 255
a b c d e f g h i | 9 | 511
The Tensor Method indexes every non-empty term combination: 2^n - 1 terms for n keywords.
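The growth pattern in the table is the count of non-empty subsets of the index terms; a one-liner reproduces the tensor column (assuming, as the table suggests, that every term combination is indexed):

```python
# Index-term counts: the vector method keeps the n single terms, while
# the tensor method keeps every non-empty term combination: 2**n - 1.
def tensor_term_count(n):
    return 2 ** n - 1

for n in (3, 5, 8, 9):
    print(n, tensor_term_count(n))  # 7, 31, 255, 511
```

This exponential blow-up in the coefficient tables is exactly what motivates the Bloom-filter pruning and GPU parallelism in the main slides.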
Current Search Paradigm: Vector-Based Model
- Assign weights to keywords
- Compute similarity using the dot product
Example: "The sales manager took the order."
D = s_sales V_sales + s_manager V_manager + s_took V_took + s_order V_order
- D: descriptor representing the meaning of the entire text
- s: scalar weight / coefficient denoting the presence of the term
- V_sales: basis vector representing the term "sales"
Comparing Current vs. Future Methods
Creating coefficient tables:
- The first column shows terms; the second column shows coefficients
- The Tensor Method introduces two additional steps: the Concept Tree and the Tensor Form
- More computation, but increased precision
Future: A Semantic Search Engine
[Architecture diagram: the same Front-End Web App / Query Processor / Document Processor front end, with the Index Server and Doc Server pools reorganized.]
Reorganize the index shards of a search engine:
- Small World Network
- Reduces the query rate to < Q/√N << Q
- Query resolution is guaranteed within an average of 3 hops
- What is the downside?
Approach
- Use a tensor-based representation for meaning
- Meaning comparison is based on the dot product of the tensors
Example: "The American man ate Indian food"
Basis vector terms: a = "man", b = "American", c = "ate", a∧b = "man American", b∧c = "American ate", a∧b∧c = "man American ate", with scalar coefficients s_1 ... s_7
Conversion of a Tree to a Tensor
Example of a specific concept tree; generic concept tree with C child nodes, depth D, and L leaves
Salient features:
- Concept Tree: a hierarchical acyclic directed n-ary tree
- Leaf nodes represent terms, whereas the tree describes their interrelationships
Expansion of the tree:
- Bottom-up: make all possible polyadic combinations
- Generic case (assume): an intermediate node I with C child nodes, where one child node P contains L leaves
- The number of terms at I due to P will be 2^L - 1
- Each of the C child nodes contributes its combinations in the same way
Tensor Comparison
Tensor T1 ("The American man ate Indian food") and Tensor T2 ("The Indian man ate American food"), each with scalar coefficients s_1 ... s_7 over its basis vectors and their combinations.
1. Identify the common basis vectors
2. Multiply their scalar coefficients
3. Find the sum of all products
Similarity(T1, T2) = T1 · T2 = sum over common basis vectors of the coefficient products (< 1)
Dot Product Challenge
Example: "The American man ate Indian food" with basis terms a, b, c, a∧b, a∧c, b∧c, a∧b∧c and coefficients s_1 ... s_7.
When tensors are large, identification of the common basis vectors is time consuming.
For two tensors of sizes n1, n2, search is O(n1 · n2) or O(n1 · log n2).
Can we improve upon this?
Bloom Filter
Compact representation of a set:
- An m-bit-long bit vector
- k hash functions
An element X (with identifier Id_i) is mapped to bit positions F_1(Id_i) = 0, F_2(Id_i) = 1, ..., F_k(Id_i) = j in the bit array (positions 0 ... m-1).
Bloom Filter: Insertion
To insert element X, compute F_1(Id_i), F_2(Id_i), ..., F_k(Id_i) and set the corresponding bit positions (e.g. 0, 1, ..., j) in the m-bit array to 1.
Bloom Filter: Testing for Presence (Membership Test)
To test element X, compute F_1(Id_i), ..., F_k(Id_i) and check the corresponding bit positions.
- Can have false positives
- Never has false negatives
- The false-positive rate can be reduced by choosing a large m and an optimal k value
- For n = 10^3 elements, k = 7, m = 10240 bits: probability of a false positive ~ 8×10^-3
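The quoted figure can be sanity-checked with the standard false-positive approximation p ≈ (1 − e^(−kn/m))^k (assuming n = 10^3, k = 7, and m = 10240 bits as read from the slide):

```python
import math

def false_positive_rate(n_elements, m_bits, k_hashes):
    """Standard Bloom-filter false-positive approximation:
    p ~ (1 - exp(-k*n/m)) ** k."""
    return (1 - math.exp(-k_hashes * n_elements / m_bits)) ** k_hashes

p = false_positive_rate(n_elements=1000, m_bits=10240, k_hashes=7)
print(f"{p:.4f}")  # ~7e-3, the same order as the slide's ~8x10^-3
```

With these parameters k = 7 is close to the optimum k ≈ (m/n)·ln 2 ≈ 7.1, which is why the filter achieves a sub-percent false-positive rate.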
Data Structures
Example: for the basis element a∧c of "The American man ate Indian food", Id = MD5("a∧c").
Bit index generation: the hash functions F_1(Id_i) = 0, F_2(Id_i) = 1, ..., F_k(Id_i) = j set bits in the Bloom filter's bit array.
Coefficient table: rows of (Tensor id Id_1 ... Id_n, coefficient Coeff_i, set of BF bit indices {x_i : 0 ≤ x_i ≤ m}), e.g. {0, 1, j}.
Together these form the Bloom filter and the coefficient table.