Is OpenCL a suitable platform for algorithm development in health care systems?


UPTEC IT
Examensarbete 15 hp (degree project, 15 credits)
August 2012

Is OpenCL a suitable platform for algorithm development in health care systems?

Mattias Larsson


Abstract

Is OpenCL a suitable platform for algorithm development in health care systems?

Mattias Larsson

Teknisk-naturvetenskaplig fakultet, UTH-enheten. Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0.

This thesis reviews whether OpenCL is a suitable and cost-effective platform for algorithm development in health care systems. Aspects such as maintainability, performance, portability and integration with high-level languages (in this case Python) are analyzed. The review is done by implementing one part of a dose calculation algorithm that is complex enough to provide a realistic case. The vision is that OpenCL can replace multiple platforms for both multi-core CPU and GPU computing and remove the need to implement an optimized version of an algorithm for every platform. To achieve performance portability, automatic optimization is done using parameter tuning. Its effects on both performance and code structure are analyzed. The conclusion is that OpenCL coupled with auto tuning is not a suitable platform, due to problems with code structure, language limitations, programming portability, tool support and the effort and difficulty of implementing auto tuning.

Supervisor: Anders Edin
Subject reviewer: David Black-Schaffer
Examiner: Arnold Pears
Series: UPTEC IT
Printed by: Reprocentralen ITC


Summary (Sammanfattning)

In modern radiation therapy, powerful computers are used to plan the treatment as well as possible. The need for computational power makes the possibility of using specialized hardware such as GPUs interesting. To use specialized hardware, however, the software has to be rewritten for a platform that supports that hardware. One such platform is OpenCL, an open standard supported by most of the common hardware manufacturers. The vision is that an algorithm written in OpenCL can be run with reasonable performance both on ordinary hardware (CPU) and on specialized hardware (GPU), and thereby replace the need for multiple platforms.

This thesis investigates how OpenCL, together with techniques for automatically adapting the software to the current hardware, affects factors such as maintainability, performance, portability and integration with high-level languages. This is investigated by implementing a part of a dose calculation algorithm that is complex enough to correspond to a real case. To automatically adapt the software to the current hardware, the behavior of the software can be adjusted by means of parameters. How well the automatic adaptation works is analyzed with respect to both code structure and performance.

The conclusion is that OpenCL as a platform, together with automatic adaptation of the software, is not a suitable way to go at present. This is due to negative effects on code structure, limitations in the programming language, problems with portability, lack of tool support and the difficulty involved in implementing automatic adaptation of the software.


Contents

1 Introduction
  1.1 Background
  1.2 Goal
  1.3 Scope
2 Method and theory
  2.1 Calculation of the fluence map
    2.1.1 Fluence map basics
    2.1.2 Ray tracing
  2.2 Non-functional software requirements
    2.2.1 Safety
    2.2.2 Portability
    2.2.3 Maintainability and performance
  2.3 OpenCL
    2.3.1 The OpenCL architecture
    2.3.2 The OpenCL programming language
    2.3.3 Tool support
  2.4 Automatic tuning
    2.4.1 Related work
    2.4.2 Model based optimization
    2.4.3 Empirical optimization
3 Design and implementation
  3.1 Program structure and implementation details
    3.1.1 Modules
    3.1.2 OpenCL C language considerations
    3.1.3 Accuracy adjustment
    3.1.4 Parallelization and concurrency
    3.1.5 General optimizations
  3.2 Optimization parameters and automatic tuning
    3.2.1 Work-group size and shape
    3.2.2 Address spaces
    3.2.3 Structure
    3.2.4 Scene
    3.2.5 Intersection algorithms
    3.2.6 Automatic tuning
  3.3 Integration with Python
    3.3.1 PyOpenCL
    3.3.2 C structures and alignment
4 Results and analysis
  4.1 Test setup
    4.1.1 Hardware platforms
    4.1.2 Test scene
    4.1.3 Search heuristic
  4.2 Test results and parameter analysis
    4.2.1 Performance results
    4.2.2 Parameter search statistics
    4.2.3 Work-group parameters
    4.2.4 Address space parameters
    4.2.5 Algorithm, scene and structure parameters
    4.2.6 Parameter importance
  4.3 Structure analysis
5 Conclusion
6 Discussion and future work
  6.1 Ray tracing improvements
    6.1.1 Intersection algorithms
    6.1.2 Integration and sampling techniques
    6.1.3 Hierarchies of bounding volumes
  6.2 Automatic optimization
  6.3 The future of OpenCL
7 References

1 Introduction

1.1 Background

Radiation therapy is a type of cancer treatment where high-energy radiation is used to kill cancer cells. The radiation damages the DNA and stops the cells' ability to divide. Both cancer cells and healthy cells are affected by the radiation, so it is essential to minimize the radiation to the healthy cells. Developments in medical informatics have enabled better treatment of cancer patients with radiation therapy, much with the help of powerful computers and smart software. Treatments are planned in advance and computers simulate a patient's expected radiation dose. The exact anatomy of a patient is known with the help of computed tomography, so even the radiation dose on individual organs can be simulated. The radiation can then be shaped to fit and only affect the tumor and to minimize the radiation dose on important organs, much like when you use your hands in front of a lamp to form shadows on a wall. The lamp in this case is radiation from an accelerator and the hands are decimeter-thick blocks of tungsten, called a collimator, that refract and absorb most of the radiation.

The simulation of radiation dose is done in two steps: first, simulate the shape formed by the collimator onto a virtual plane called a fluence map, which captures the shape and intensity of the radiation; second, use the fluence map to calculate the dose in the patient [5]. The simulation of radiation dose on a patient's body is a computationally expensive operation and is not done in real time. This limitation influences how medical staff plan the treatment and the quality of the planning. If the simulation could be done in real time, the hope is that the planning could become more effective. There is an endless need for computational power, which can be used to increase either speed or accuracy.

Recent hardware and software developments have started to expose the computing power of multi-core CPUs, GPUs and other types of specialized hardware. This gives hope of doing both faster and more accurate simulations of radiation dose. OpenCL is a platform for getting access to the computing power of the new hardware and is defined by a non-profit group that is supported by all major hardware developers. OpenCL supports execution on several types of hardware (CPUs, GPUs etc.) without any modification of the code. OpenCL introduces a C99-based programming language in which portable computation kernels can be written. The kernels are compiled at run time and can be run on any available and supported hardware. Even though OpenCL supports programming portability, the performance is not portable [2]. To get good performance, the kernels often have to be optimized for a particular hardware architecture. Recent studies have shown that auto tuning of optimizations can be used to provide more general code that is optimized for the hardware it currently executes on, which possibly fixes the problem of performance portability [3, 4].

1.2 Goal

The goal of this thesis is to review whether OpenCL is a suitable and cost-effective platform for lower-level algorithms by implementing the ray-tracing algorithm in OpenCL. This is done by analyzing maintainability, performance portability through auto tuning, and how OpenCL can interact with high-level platforms such as C#/.NET and Python. The first part of the dose simulation (the calculation of the fluence map) is implemented and then the solution is analyzed.

1.3 Scope

The implementation and study are limited to using a single OpenCL device. The implementation is only of proof-of-concept quality and does not aim for precision or adherence to medical standards. The implementation and data should be complex and realistic enough to test the limitations of OpenCL together with auto tuning.

2 Method and theory

2.1 Calculation of the fluence map

2.1.1 Fluence map basics

As described in the introduction, the expected radiation dose from a treatment is calculated in a simulation. One way of doing this is in a two-step manner [5]: the calculation of the effect of the collimators is separated from the step of calculating how the radiation is spread in the patient.

[Figure 1. A patient in radiation treatment. A marks where the ray source is located and B marks where the collimator is located. Figure from [48].]

The collimators block the radiation, generated by an accelerator, from hitting the patient. The goal is to only hit the tumor, but radiation leakage is unavoidable due to limitations in the collimator. A collimator is constructed from a set of leaves made out of a radiation-blocking material, typically tungsten. The number and width of the leaves determine what shapes can be created and at which precision. The thickness determines how much radiation will pass through to the blocked areas of the patient. Sometimes a backup single-leaf collimator is positioned aligned with the most open leaf to reduce the radiation leakage even more. The leaves are movable and allow the collimator to change shape to best fit a tumor for different angles around a patient. There is also another single-leaf collimator in the other direction, orthogonal to the multi-leaf collimator, which is called the jaw. Usually, the tumor is exposed to radiation from a couple of directions around the patient, but there are also types where it is exposed to radiation from all angles around the patient in two dimensions.

[Figure 2. A two-step dose calculation model: (1) multisource beam fluence modelling, which produces the total fluence in the fluence map; (2) dose calculation from fluence (pencil beam, C/S, collapsed cone, Monte Carlo, ...), a process independent of field size. Figure from [5].]

As seen in figure 2, the radiation from a radiation source is projected onto a plane with the effects of the collimators, which shape the radiation beam. This is called a fluence map and contains the fluence in a two-dimensional plane of points. The fluence map is then used to calculate the radiation dose in the patient.

The fluence Ψ at the point (x, y) is calculated as a sum of the contributions of multiple radiation sources:

$$\Psi(A, x, y) = \Psi_{\text{direct source}}(A, x, y) + \dots$$

where A represents the collimator settings and x and y are coordinates in the fluence map. In this case only the radiation directly from the source is considered, because it accounts for the major part of the fluence.

The dose D in the patient at the point x is then calculated as a function of the fluence:

$$D(A, P) = D(\Psi_A, P)$$

where A represents the collimator settings and P represents the body of the patient.

This thesis focuses on the first step in the dose calculation. The problem is complex enough to test and analyze the suitability of the OpenCL platform.

Since the goal is to calculate a fluence map, the problem is similar to rendering a scene using ray tracing in computer graphics. In ray tracing, every pixel in a two-dimensional virtual camera plane casts a ray which interacts with the scene and eventually hits a light source or goes to infinity. This is also called backward ray tracing because the rays are cast in the opposite direction of the actual photons. In forward ray tracing, rays are cast from the light sources and interact with the scene until they hit a pixel on the virtual camera plane or go to infinity. Both methods are in fact equivalent, but the implementation details differ [42]. Backward ray tracing is more common in the literature, so that is the one used in this case study.

2.1.2 Ray tracing

Recursive backward ray tracing was first introduced by Whitted [7]. Whitted's model is based on Phong's model, but Phong's model only supports points of light infinitely far away from the objects in the scene [7, 8]. Whitted's model supports point light sources in the scene and uses recursion. When an object is hit, new rays are cast from the point of intersection recursively. The rendering equation is defined by Whitted as:

$$I = I_a + k_d \sum_{j} (\mathbf{N} \cdot \mathbf{L}_j) + k_s S + k_t T$$

where I is the intensity, I_a is the intensity due to ambient light, k_d is the diffuse intensity coefficient, N is the unit surface normal, L_j is the vector in the direction of the j:th light source, k_s is the specular intensity coefficient, S is the intensity of light from the specular reflection, k_t is the transmission coefficient and T is the intensity of light from transmission.

The resulting intensity is composed of four parts: ambient, diffuse, specular and transmitted intensity. The ambient and specular intensity is removed in this model of the fluence map calculation. Whitted's model has one disadvantage: it only supports point light sources. In the calculation of the fluence map, it is important to account for the area of the ray source. That means that Whitted's model alone is not sufficient.

Accounting for the area of the light source is important in the case where only a part of the light source is visible. A natural way of calculating the area is to integrate over the visible area of the light source. Analytic integration is not a feasible technique in this case, so a numerical method has to be used. Here a disc-shaped light source is used, which is not trivial to integrate over. By sampling over a simpler shape like a rectangle, which is easier to integrate over, the visible area of the disc can be determined. The simplest way is to do a uniform sampling over the smallest rectangle that fits the disc, using the midpoint rule in two dimensions. This is done by subdividing the source into small rectangular parts:

$$\iint f(x, y)\,dx\,dy \approx a \sum_{i}\sum_{j} f(x_i, y_j)$$

where a is the area of each part and f(x_i, y_j) is the intensity at the center of each part [38].

In a scene where objects can hide a light source, visibility is also an important aspect. That is the case if point x in a pixel cannot see the point x' on the light source. A visibility function can encode this property:

$$S(x, x') = \begin{cases} 1 & \text{if there is a clear path between point } x \text{ and } x' \\ 0 & \text{if an object hides point } x \text{ from point } x' \end{cases}$$
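The sampling scheme described above can be illustrated with a small sketch. It is not the thesis implementation (which is written in OpenCL C); it is a plain Python illustration under stated assumptions: the disc is centred at the origin, and the names `integrate_over_disc`, `uniform`, `always_visible` and `right_half_visible` are hypothetical helpers introduced only for this example.

```python
import math


def integrate_over_disc(radius, n, intensity, visibility):
    """Midpoint-rule estimate of the integral of intensity * visibility
    over a disc of the given radius centred at the origin.

    The smallest square enclosing the disc is split into n x n cells and
    each cell is sampled once, at its centre.
    """
    h = 2.0 * radius / n      # side length of one cell
    cell_area = h * h         # the area "a" of each part
    total = 0.0
    for i in range(n):
        x = -radius + (i + 0.5) * h
        for j in range(n):
            y = -radius + (j + 0.5) * h
            if x * x + y * y > radius * radius:
                continue      # cell centre lies outside the disc
            total += cell_area * intensity(x, y) * visibility(x, y)
    return total


def uniform(x, y):
    """Hypothetical source with the same intensity everywhere."""
    return 1.0


def always_visible(x, y):
    """Visibility function for an unobstructed source."""
    return 1.0


def right_half_visible(x, y):
    """Hypothetical occluder hiding the half of the source where x <= 0."""
    return 1.0 if x > 0.0 else 0.0


full = integrate_over_disc(1.0, 200, uniform, always_visible)
half = integrate_over_disc(1.0, 200, uniform, right_half_visible)
print(full, half)
```

With a uniform intensity of 1 the unobstructed result approximates the disc area, and hiding half of the source halves it. The accuracy depends on the number of cells, which is the kind of trade-off between sample count and precision that the text refers to.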

The effect of the collimators can be determined by the amount of material a ray has to go through. Since the scene consists of several collimators and each collimator consists of one or more leaves, each leaf has to be tested for intersection by the ray. If the ray intersects a leaf, the amount of material (the thickness) it has to pass will affect the intensity of the ray that comes out of the material. The intensity can be calculated by the Beer-Lambert law:

$$I = I_0 e^{-\alpha d}$$

where I_0 is the initial intensity, α is the attenuation coefficient of the material and d is the thickness of the material [11]. The attenuation of a ray consists of both scatter and absorption, but it is assumed that if a ray is attenuated, it loses all its importance in the scene and can be omitted. The attenuation coefficient is determined by the material of the collimator leaf and the energy of the ray. One example of a collimator leaf is one made out of tungsten with a thickness of 7.8 cm [12]. The total absorption can be described as:

$$S_{\text{absorption}} = \prod_{i \,\in\, \text{all collimator leaves}} e^{-c_{\text{absorption}}\, d_i}$$

where c_absorption is the absorption coefficient and d_i is the distance the ray has to pass through leaf i. This replaces the visibility function S for the case where the visibility is blocked by a collimator leaf.

The distance from the pixel to the light source is also a factor to take into account, because a ray source loses intensity as a function of distance. By projecting the light source onto a unit half sphere centred at the ray origin, the intensity loss with distance can be calculated. The shape of the light source is approximated by a rectangle. The distance decay is calculated by:

$$I_{\text{distance decay}} = \frac{\alpha_x \alpha_y}{2\pi}$$

where α_x is the angle around the x-axis and α_y is the angle around the y-axis.

The resulting intensity in a pixel is described by:

$$I(x_i, y_j) = I_{\text{distance decay}} \, f(x_i, y_j) \, S(x_i, y_j, x', y')$$

where f(x_i, y_j) is the intensity in the point (x_i, y_j) at the source and S(x_i, y_j, x', y') is the visibility between the point (x', y') and the point (x_i, y_j) at the source. The total intensity in a pixel is the integral over the entire source, using the midpoint rule.

Intersection algorithms

If and where a ray hits an object on its way towards a light source is a central question in ray tracing. Therefore, the selection of algorithms for finding the intersections between rays and objects is important. All the following algorithms are standard algorithms in ray tracing. They have probably been developed for a CPU and not for any specialized hardware.
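As a worked illustration of the attenuation and distance decay terms described above, the following Python sketch combines them for a single sample. It is only a sketch under assumptions: the coefficient value is made up for the example (it is not a measured value for tungsten), and the function names `absorption`, `distance_decay` and `sample_intensity` are hypothetical.

```python
import math


def absorption(path_lengths_cm, c_absorption):
    """Beer-Lambert attenuation after passing through several leaves.

    The product of exp(-c * d_i) over all leaves equals
    exp(-c * sum(d_i)), which is what is computed here.
    """
    return math.exp(-c_absorption * sum(path_lengths_cm))


def distance_decay(alpha_x, alpha_y):
    """Fraction of the unit half sphere (solid angle 2*pi) covered by a
    small rectangle that spans the angles alpha_x and alpha_y (radians)."""
    return alpha_x * alpha_y / (2.0 * math.pi)


def sample_intensity(source_intensity, alpha_x, alpha_y, path_lengths_cm,
                     c_absorption):
    """Intensity contribution of one source sample to one pixel."""
    return (distance_decay(alpha_x, alpha_y)
            * source_intensity
            * absorption(path_lengths_cm, c_absorption))


# Illustrative numbers only: an absorption coefficient of 1.0 per cm is an
# assumption for the example, not a property taken from the thesis.
C_EXAMPLE = 1.0
unblocked = sample_intensity(1.0, 0.01, 0.01, [], C_EXAMPLE)
one_leaf = sample_intensity(1.0, 0.01, 0.01, [7.8], C_EXAMPLE)
print(unblocked, one_leaf)
```

A ray that passes through no leaf keeps its full intensity (absorption factor 1), so the same expression covers both the blocked and the unblocked case.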

A scene can have three kinds of primitive objects: triangle, axis-aligned box and disc. A disc is only used for the ray source, axis-aligned boxes are used for bounding volumes and all other objects are built out of triangles. That means that algorithms for intersection checks are needed between rays and triangles, axis-aligned boxes and discs. There is quite a lot of research on fast intersection algorithms, especially for axis-aligned boxes and triangles, because they are common in ray tracing. It is worth looking into these rather than using the naïve approach.

The ray-triangle intersection used in this case study is the one from Möller and Trumbore [13]. Its performance is good and the required memory is relatively low. It also requires no precomputation of the plane equation, inverse direction vectors or ray type. That makes it a good fit for this implementation, because it uses the data that is available. The standard form of this intersection algorithm gives a true or false intersection result and the distance from the ray origin to the intersection point. With an adjustment to the intersection algorithm, the intersection point itself can be calculated:

$$I_P = P_0 + V_{\text{direction}} \cdot \text{distance}$$

where P_0 is the ray origin, V_direction is the normalized ray direction and distance is the distance from P_0 to the intersection point I_P given by the intersection algorithm. Getting the intersection point is necessary to enable refraction of a ray when the ray enters a material.

The intersection of a ray and an axis-aligned box is also an essential intersection test. In this case the intersection algorithm from Williams et al. [14] is used, but without the precomputed inverted ray direction, to save memory. This algorithm relies on some of the properties of the IEEE-754 floating point standard: when a positive number is divided by zero the result is +∞ and when a negative number is divided by zero the result is −∞. OpenCL supports the IEEE-754 floating point standard [15, p. 248].
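The ray-triangle test and the intersection point calculation described above can be sketched as follows. This is a plain Python rendering of the well-known Möller-Trumbore algorithm for illustration, not the OpenCL C code of the case study; the helper names (`sub`, `dot`, `cross`, `ray_triangle`, `intersection_point`) and the tolerance `eps` are assumptions made for this example.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Return the distance along the ray to the triangle, or None on a miss.

    Only the ray and the three vertices are needed; no plane equation or
    inverse direction is precomputed.
    """
    edge1 = sub(v1, v0)
    edge2 = sub(v2, v0)
    p = cross(direction, edge2)
    det = dot(edge1, p)
    if abs(det) < eps:
        return None                     # ray is parallel to the triangle
    inv_det = 1.0 / det
    t_vec = sub(origin, v0)
    u = dot(t_vec, p) * inv_det         # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(t_vec, edge1)
    v = dot(direction, q) * inv_det     # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    distance = dot(edge2, q) * inv_det
    if distance < 0.0:
        return None                     # triangle lies behind the ray origin
    return distance


def intersection_point(origin, direction, distance):
    """Origin plus normalized direction times distance."""
    return tuple(o + d * distance for o, d in zip(origin, direction))


triangle = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
hit = ray_triangle((0.25, 0.25, 1.0), (0.0, 0.0, -1.0), *triangle)
miss = ray_triangle((2.0, 2.0, 1.0), (0.0, 0.0, -1.0), *triangle)
print(hit, miss, intersection_point((0.25, 0.25, 1.0), (0.0, 0.0, -1.0), hit))
```

The example ray starts one unit above the triangle and points straight down, so the returned distance is 1 and the intersection point lies in the plane of the triangle.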
Bounding volumes

Instead of testing every triangle in the scene for intersection with a ray, triangles can be grouped in a bounding volume which can be checked for intersection. If a ray intersects a bounding volume, all its triangles are checked for intersection. If it does not intersect a bounding volume, none of the triangles have to be tested for intersection with the ray. That makes it possible for a specific ray to skip the intersection tests for potentially most triangles in the scene, depending on how the bounding volumes are constructed.

Any type of volume can be used as a bounding volume, but axis-aligned boxes are common because of their low memory requirements (two points, min and max) and the fast intersection algorithms that are available.

In scenes with a large number of triangles and a large number of bounding volumes it is also common to use hierarchies of bounding volumes. The hierarchy forms a tree structure and if a node is intersected, then its leaf nodes are also tested for intersection. In this case study, hierarchies of bounding volumes are not used.

2.2 Non-functional software requirements

Non-functional software requirements describe desired non-functional characteristics of a system. They describe a property or a quality a system must have to make its functionality usable [39].

2.2.1 Safety

A software failure in a cancer treatment planning system can result in injuries or even death. Such a system is defined as a safety-critical system [1, p. 300]. In a system as complex as a cancer treatment planning system, formal verification is not viable, so verification has to be done through testing.

2.2.2 Portability

If the same code base for a performance critical algorithm could be used to run on different kinds of hardware

Radiation therapy planning software is used in all parts of the world and the resources of each individual hospital can be very different. Imposing too strict hardware requirements can be a selling disadvantage. One such example is to require a GPU that supports OpenCL and has error-correcting code (ECC) memory, which is required today for medical hardware of this kind. In a treatment facility, several computers are often used to access the treatment planning software. If some of the computers have cheaper hardware and can still run the software, but with lower performance, that is a good selling point. Medical staff with lower salaries can use the slower computers when the faster and more expensive computers are occupied by doctors doing the treatment verification. Radiation equipment such as accelerators and collimators is bought separately from information and planning systems. In Sweden the procurements for these different categories of hardware are required to be separate. In practice that makes the hardware cost of information and planning systems a more important factor, since it does not get hidden by the cost of the other radiation equipment. For the reasons given above, portability is an important factor when developing medical software of this kind.

One can distinguish between several types of portability. Two of them are discussed here: functional portability and performance portability. Functional portability is when software is portable across several platforms. Even if software is designed and tested on one platform, it can be run on another platform with the same functionality. If the functionality is to multiply two matrices, the result of the multiplication of the same two matrices should be the same on all platforms, not considering floating point differences. Performance portability is when performance is portable across platforms. This is not the case for heavily optimized software. Studies have shown that auto tuning can accomplish at least some level of performance portability [2, 3].

2.2.3 Maintainability and performance

Maintenance is defined by Sommerville as doing one or more of the following activities on existing software: repairing faults, adapting to a changed environment or introducing new functionality. Writing code that is easy to maintain is essential for keeping down software development costs. Maintenance costs usually take up two thirds of the total cost in an IT project [1, p. 242]. In medical applications, maintenance costs are expected to be higher because of a greater need for verification testing.

Performance is described by van Vliet as speed, efficiency, resource consumption, throughput and response time [39]. The performance that matters in this case is how fast a fluence map can be calculated given a precision requirement, which depends on the number of samples per second (throughput) and techniques for minimizing the number of samples (efficiency). Later, in section 4.2.1, the performance is measured as throughput.

The common way of programming for GPUs is to write performance-focused code that is optimized for a single specific hardware architecture. With that kind of optimized code, the performance is often not portable across hardware architectures from different manufacturers or even across architectures from the same manufacturer [2, 4]. Hardware architectures are in general updated every second or third year [6, 16]. To adapt to the changed environment and support all the new capabilities of the newest architecture while simultaneously supporting the older architectures, several code bases have to be maintained, one for each hardware architecture. Duplicated code is considered by Fowler to be the worst problem in code [41, p. 76]. If maintenance is done on the software, all code bases have to be updated. All code bases also have to be tested separately. This is both expensive and complex. It is much preferred to only have to maintain a single code base for the performance critical algorithms.

A system designed with a focus on maintainability is potentially less costly to maintain and test [40, p. 459]. Since maintenance costs are a large part of the total cost of a system, the choice of not focusing on maintenance can be a costly one. On the other hand, focusing on performance gives a better product that can bring in higher revenue through more sales of the product. Unfortunately, strategies for creating maintainable code and well-performing code can be opposing [41, p. 69]. For instance, large software components can give better performance but are also harder to maintain [1, p. 153]. On the other hand, a well-structured and easy to maintain program can be easier to tune for performance [41, p. 69], so not all strategies necessarily have to be opposing. What it comes down to when deciding on the trade-off between maintainability and performance is the costs and the potential revenue gains.

2.3 OpenCL

OpenCL is a framework for programming a collection of heterogeneous hardware resources including CPUs and GPUs. It includes a programming language, an API, libraries and a runtime system. The programming language is based on C99 with some restrictions and some extensions [15].

2.3.1 The OpenCL architecture

The OpenCL architecture consists of four models: the platform model, the execution model, the memory model and the programming model [15, section 3].

The platform model consists of a host device and one or more OpenCL devices. Every OpenCL device can then include one or more compute units, each of which includes one or more processing elements.

The execution model defines how execution is done. A host device sets up a context with OpenCL devices, kernels, program objects and memory objects. A kernel is a function that can be run on an OpenCL device, initiated by the host device. Each OpenCL device has its own program object to implement a kernel, which is usually compiled and linked at runtime. Version 1.2 of the specification separates compilation and linking and supports off-line compilation of kernels. In that way kernels can be precompiled and distributed with an executable without the need to compile the kernel at runtime [17]. Memory objects map objects in the memory of the host device to the memory of an OpenCL device. Sometimes (especially for GPUs) the memory objects have to be transferred to the memory on the device, which incurs an overhead.

The execution model also defines how parallel work of kernels is structured. It is structured into an N-dimensional index space (called NDRange) of one to three dimensions. The index space consists of work-groups with the same number of dimensions as the index space. The smallest part is called a work-item and is one running instance of a kernel. Each work-item has both a global ID in the index space and a local ID in its work-group. Each work-group also has a work-group ID. All work-items in a work-group are executed concurrently [15, section 3.2].

[Figure: ... items. Figure from [15, p. 24].]

The memory model has four distinct memory spaces: global, constant, local and private memory. The global memory is accessible by every work-item in the index space. The constant memory is also accessible by every work-item in the index space but is read-only. Local memory is only accessible from work-items within the same work-group. Private memory is accessible only by its work-item.
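The relation between global IDs, work-group IDs and local IDs described above can be sketched in Python. This is only a host-side model of the indexing, not an OpenCL program: the function `enumerate_work_items` is a hypothetical helper, the sketch ignores global offsets, and it assumes that the global size is an exact multiple of the work-group size in every dimension.

```python
from itertools import product


def enumerate_work_items(global_size, local_size):
    """Yield (global_id, group_id, local_id) for every work-item.

    global_size and local_size are tuples with one to three entries,
    one per dimension of the index space.
    """
    if len(global_size) != len(local_size) or not 1 <= len(global_size) <= 3:
        raise ValueError("sizes must have the same 1 to 3 dimensions")
    if any(g % l != 0 for g, l in zip(global_size, local_size)):
        raise ValueError("global size must be a multiple of the local size")

    groups_per_dim = tuple(g // l for g, l in zip(global_size, local_size))
    for group_id in product(*(range(n) for n in groups_per_dim)):
        for local_id in product(*(range(l) for l in local_size)):
            # global ID = work-group ID * work-group size + local ID
            global_id = tuple(g * l + i
                              for g, l, i in zip(group_id, local_size, local_id))
            yield global_id, group_id, local_id


# A 4 x 2 index space split into work-groups of 2 x 2 work-items.
items = list(enumerate_work_items((4, 2), (2, 2)))
for global_id, group_id, local_id in items:
    print(global_id, group_id, local_id)
```

In this model, local memory would be shared by the work-items that have the same `group_id`, while global memory is shared by all of them.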

[Table 1. Memory spaces and their allocation and accessibility capabilities. Figure from [15, p. 27].]

The programming model in OpenCL explicitly supports both the data parallel programming model and the task parallel programming model, but the data parallel model is the only one that gives good performance on today's GPUs [15, 18].

The architecture of OpenCL largely reflects the architecture of GPUs [18].

2.3.2 The OpenCL programming language

The OpenCL programming language is based on C99, but has both restrictions and extensions to it [15, chapter 6].

Some of the extensions include:

- Implementation of four disjoint address spaces: global, local, constant and private.
- The __kernel function qualifier.
- The __attribute__ qualifier.

The address space qualifiers are used to define in what region of memory a variable is allocated upon variable declaration. The __kernel qualifier declares functions as kernels which can be executed on an OpenCL device, initiated by a host device.

The padding of structures can be adjusted by using the __attribute__ qualifier. The attribute packed is used to minimize the required memory. For alignment purposes the packed attribute is not always the most appropriate. Sometimes extra padding can make a data type better aligned to fit the hardware [19].
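The effect of padding can be illustrated from the host side with Python's standard `struct` module. This is a sketch under assumptions: the record layout (a one-byte flag followed by a 32-bit float) is made up for the example, and the `struct` format prefixes only mimic what the packed attribute does to a C structure; they are not part of OpenCL.

```python
import struct

# Hypothetical record: a one-byte hit flag followed by a 32-bit float.
ALIGNED = "@Bf"   # native alignment: padding is inserted before the float
PACKED = "=Bf"    # no padding, comparable to a structure declared as packed

aligned_size = struct.calcsize(ALIGNED)
packed_size = struct.calcsize(PACKED)
print("aligned:", aligned_size, "bytes, packed:", packed_size, "bytes")

# Host data must be laid out exactly as the device-side structure expects,
# so the same format string has to be used for packing and unpacking.
record = struct.pack(PACKED, 1, 7.8)
hit, distance = struct.unpack(PACKED, record)
print(len(record), hit, distance)
```

The packed layout uses the minimum number of bytes, while the aligned layout is larger on typical platforms because the float is placed at an address that suits the hardware. A mismatch between the host layout and the kernel's structure definition leads to misread fields, which is why alignment matters when passing structures from a high-level language.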

Some of the restrictions include:

- No recursion.
- No dynamic memory allocation or variable-sized arrays.
- Pointers to functions are not allowed.
- A pointer to one address space cannot be cast to a pointer to another address space.

OpenCL C supports the IEEE 754 floating point standard for single precision floating point numbers. Double precision floating point numbers can be supported as an extension up to version 1.1 of the OpenCL standard, and are an optional core feature in version 1.2 [15, 17]. OpenCL C also supports vectors with 2, 3, 4, 8 and 16 components, with common vector operations such as addition, subtraction, multiplication and division by vector or scalar, dot product, cross product, normalization and length.

2.3.3 Tool support

Tools can be an important part of software development. Tools for debugging and profiling are a great help in writing good programs [1 p. 197, 40]. Most vendors with their own implementation of the OpenCL standard supply their own set of tools.

NVIDIA

NVIDIA supplies a set of tools for its proprietary but free platform CUDA. Some of them also support OpenCL, but often with a limited set of features. This includes the Visual Studio plugin Parallel Nsight, which can debug and profile CUDA kernels. For OpenCL, Parallel Nsight is limited to profiling, and only with a heavily limited set of information such as memory usage and kernel timings.

NVIDIA also supplies the cross-platform tool Visual Profiler, which supports profiling of CUDA as well as OpenCL kernels. The profiler can report memory usage, kernel timings, register usage, number of threads, number of divergent threads, reads and writes to global memory, and occupancy. It can also give hints on optimization areas and on what is limiting performance for individual kernels.

NVIDIA's compiler can also give some useful statistics at compile time. With the compiler option -cl-nv-verbose, the compiler can output stack, register and shared memory usage for individual kernels [35].

Intel

Intel supplies a Visual Studio plugin for debugging OpenCL kernels. Unfortunately it only supports C/C++ projects and cannot be used for this study, because Python projects are used. Intel also supplies an offline kernel compiler which can output assembly code [36].

AMD

AMD supplies several Visual Studio plugins for OpenCL. gDEBugger is for debugging and APP Profiler is for profiling OpenCL kernels. Unfortunately they both only support C/C++ projects and cannot be used for this study. Parts of the profiling features are accessible through a command-line utility that can profile kernels executed from any environment. The command-line utility can generate a data file which can be opened by the Visual Studio plugin to show the information in a more appealing way. Unfortunately the information is sparse. AMD also supplies an offline kernel compiler which can output assembly code for CPUs and different families of GPUs [37].

2.4 Automatic tuning

Finding good optimizations for a piece of software in a given environment can be hard and tedious work, and often requires expert knowledge. In this case it would require expert knowledge of ray tracing, hardware architectures and programming. A framework for automatic tuning can replace an expert in finding good optimizations. It also scales better: for every new instance of a system that needs to be tuned, one simply needs to copy and include the auto tuning framework, whereas a human expert can only work with one instance at a time. For an installation of a medical specialist application it is reasonable that the automatic tuning is part of the installation process and that it may be allowed to run overnight. Any more than that might cause inconvenience for the staff.

2.4.1 Related work

There are a number of implementations and articles that use auto tuning, of which FFTW and ATLAS are the best known.

FFTW is an implementation of the discrete Fourier transform. It contains fragments that can be composed into an implementation that calculates the Fourier transform. Different fragments contain different optimizations, and the combinations of fragments form a search space of Fourier transform calculators. The search space is then explored with dynamic programming to find the fastest one [32].

ATLAS (Automatically Tuned Linear Algebra Software) is a project to produce a performance portable linear algebra library. Like FFTW, it generates optimized code from an abstract description. The different optimizations are timed to find out which one is the best. ATLAS varies the blocking factor and loop unrolling, among other things whose effect is hard to predict. To reduce the search space, it uses a search heuristic that determines the best value for one optimization at a time, which finds a local optimum [33].

GATLAS is an attempt at implementing ATLAS on a GPU using OpenCL. OpenCL source code is generated from C++ template classes. Optimizations are applied to a base class using inheritance and, where the C++ meta-programming facilities are not sufficient, the mixin pattern. It uses expectation-maximization and dynamic programming to find the best optimizations. The optimizations include work-group size, data layout and inner blocking, among others [34].

Maestro is an open source library for data orchestration on one or more OpenCL devices. It uses empirical auto-tuning to tune work-group sizes, buffer chunk sizes and load balancing between multiple OpenCL devices [3].
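The heuristic attributed to ATLAS in this section, which fixes the best value for one optimization at a time, can be sketched as follows. This is an illustration of the general idea only, not ATLAS code: the parameter names, the candidate values and the synthetic cost function (standing in for a timed run) are all assumptions.

```python
from typing import Callable, Dict, List

Config = Dict[str, int]


def one_at_a_time_search(
    space: Dict[str, List[int]],
    cost: Callable[[Config], float],
) -> Config:
    """Tune each parameter in turn while holding all others fixed."""
    # Start from the first candidate value of every parameter.
    best = {name: values[0] for name, values in space.items()}
    best_cost = cost(best)
    for name, values in space.items():
        for value in values:
            candidate = dict(best, **{name: value})
            candidate_cost = cost(candidate)
            if candidate_cost < best_cost:
                best, best_cost = candidate, candidate_cost
    return best


# Hypothetical search space; the cost is synthetic, not a real measurement.
space = {"block": [8, 16, 32, 64], "unroll": [1, 2, 4, 8]}


def synthetic_cost(config: Config) -> float:
    return abs(config["block"] - 32) + abs(config["unroll"] - 4)


if __name__ == "__main__":
    print(one_at_a_time_search(space, synthetic_cost))
```

For n parameters with k candidates each, this evaluates on the order of n * k configurations instead of k to the power n, which is why it can only be expected to find a local optimum.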

2.4.2 Model based optimization

In model based optimization, a model is supplied to the auto tuning framework. The model can for instance describe the hardware architecture, with a map of the different memories, their speeds and properties, and the arithmetic units. Based on that map, the auto tuning framework can tune the program so that it utilizes the available hardware to the maximum. That can mean setting buffers to exactly fit the available memory. On a GPU, it could mean setting the size of a buffer to exactly fit the local memory size reported by OpenCL for the device. A problem is that a model is not always available, and if it is available, it can be wrong. If a model is missing, tuning cannot be done at all. If the parameters are tuned to the wrong model, the result is not optimal for that hardware. An advantage of model based optimization is that it enables precalculation of optimizations without access to the actual environment where the system is installed [31]. This can shorten the installation process of the software that is being optimized.

2.4.3 Empirical optimization

In empirical optimization, the software runs tests to figure out which optimizations work best in a given environment. Essentially no information about the environment is needed beforehand, but access to run tests in the environment is required. Therefore no precalculation of optimizations can be done.

ATLAS sets up a set of requirements for automatic empirical optimization of software [33]:

- isolation of performance-critical code
- a method of adapting software to differing environments
- robust and context sensitive timers
- an appropriate search heuristic

If the search space is big, a search heuristic is needed to find a solution in reasonable time. A heuristic is only guaranteed to find a local optimum, but that is probably good enough. If the search space is small, all combinations of optimizations can be tested and, given correct timing data, the globally optimal set of optimizations in the search space will be found.
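For the small search space case described above, an exhaustive empirical search can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the parameter names and the synthetic measure are hypothetical. Taking the fastest of several repetitions is one simple way to make a timer more robust, and is not necessarily what ATLAS or this study does.

```python
import itertools
import time
from typing import Callable, Dict, List

Config = Dict[str, object]


def best_time(run: Callable[[], None], repeats: int = 5) -> float:
    """Return the fastest of several timed runs to damp timing noise."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        timings.append(time.perf_counter() - start)
    return min(timings)


def exhaustive_search(
    space: Dict[str, List[object]],
    measure: Callable[[Config], float],
) -> Config:
    """Measure every combination in the space and return the cheapest one."""
    names = list(space)
    candidates = (
        dict(zip(names, values))
        for values in itertools.product(*(space[name] for name in names))
    )
    return min(candidates, key=measure)


# Hypothetical space and a synthetic measure standing in for real timings.
space = {"depth_first": [True, False], "pieces": [1, 2, 4]}


def synthetic_measure(config: Config) -> float:
    return config["pieces"] + (0.5 if config["depth_first"] else 0.0)


if __name__ == "__main__":
    print(exhaustive_search(space, synthetic_measure))
```

In real use, the measure would wrap best_time around building and launching the kernels with the given configuration.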

3 Design and implementation

3.1 Program structure and implementation details

To test whether an acceptable trade-off between maintainability and performance can be achieved, the structure is initially focused more on maintainability than on performance, and is then gradually shifted towards performance by applying optimizations. A number of strategies are used to achieve maintainable code. These include separation of concerns, reuse, modularity, understandable structure and clarity in code.

3.1.1 Modules

The OpenCL code is structured into modules according to functionality. A module consists of a header file, a source file and a unit test file. There is one module for primitive scene objects and intersection algorithms, one for collimator objects and one for ray tracing algorithms. This gives a hierarchical dependency graph. Each module can be tested individually, given that its dependencies are fulfilled. It is the ray tracing module that exposes functionality to the host.

The Primitives module contains all the primitive scene objects and their intersection tests. The primitive scene objects are Ray, Triangle, Rectangle, Plane, Disc, BoundingBox and Box. A Ray is represented by an origin and a direction vector. A Triangle is represented by three vertex points. A Rectangle is made up of two triangles. The difference between a BoundingBox and a Box is that a BoundingBox is an axis-aligned box represented by a minimum and a maximum point, while a Box is made up of 10 triangles, with the back face missing.

The Collimator module contains the definition of a collimator and its different scene representations. It also contains intersection tests that depend on the Primitives module.

The Ray tracing module contains all ray tracing functionality and exposes OpenCL kernels to the host. The complete fluence map calculation is separated into three steps, each implemented as a separate kernel. First, the intensity of each ray is calculated when it is cast from a point on the fluence plane towards the ray source. Second, the intensity decay is calculated. Third, all the intensities that make up the total intensity of a pixel are summed and multiplied by the intensity decay factor. Because the ray source is not a point but a disc, several rays are cast from each pixel to integrate over the visible area of the ray source, as described in 2.12.

3.1.2 OpenCL C language considerations

OpenCL C does not support classes, so all objects are defined as C structs. All object-specific functionality is located in functions within the respective module. Functionality is reused whenever possible. One example is the intersection test between a ray and a Box, which uses the intersection test between a ray and a Triangle. Since OpenCL C does not support variable-sized arrays, all array sizes have to be known when a kernel is compiled. Kernels are compiled at runtime, however, so constants that define array sizes can be set by the host before compilation. OpenCL supports setting macros through compiler options, which can be used to solve this problem [15].
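Setting array sizes through compiler options can be sketched on the host side as below. The -D option is the standard OpenCL preprocessor option for defining a macro; the helper name and the macro names are hypothetical, since the names actually used in the implementation are not given here.

```python
from typing import Mapping, Union


def build_options(defines: Mapping[str, Union[int, str]]) -> str:
    """Format host-side constants as -D preprocessor options."""
    return " ".join(f"-D {name}={value}" for name, value in defines.items())


# Hypothetical macro names, chosen only for illustration.
options = build_options({"TRIANGLES_PER_BOX": 10, "LEAF_COUNT": 40})

if __name__ == "__main__":
    print(options)  # -D TRIANGLES_PER_BOX=10 -D LEAF_COUNT=40
```

The resulting string would be handed to the OpenCL compiler when the program is built, for example through the options argument of Program.build if PyOpenCL is the host binding, so that a kernel can declare an array such as float4 vertices[TRIANGLES_PER_BOX * 3] with a size fixed at compile time.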

3.1.3 Accuracy adjustment

The fluence calculation can be performed with different degrees of accuracy. When the positions of the collimator leaves are being optimized to fit a tumor, only a rough estimate is needed, and the execution time of the calculation should be as low as possible so that the optimizer can try many possible positions. On the other hand, when a final setup of the collimator is decided on, the calculation should be as accurate as possible, within limits. There is therefore a need to adjust the degree of accuracy to obtain either a fast or an accurate fluence calculation. This can be done in a number of ways: by adjusting the accuracy of the geometry of the scene objects, the accuracy of the integration over the light source, the resolution of the fluence map or the numerical accuracy. The primary way chosen here is the representation of the collimator leaves. This implementation supports three different representations of the collimator: one where the leaves are infinitely thin rectangles, one where the leaves are axis-aligned boxes and one where the leaves are focused boxes. Focused in this context means that they are shaped to minimize the soft shadows cast by the collimator on the fluence plane. All representations are generated from a more general description of a collimator which holds the minimum amount of information. That allows the approximated representation of the collimator to differ depending on, for instance, the design of the collimator blades. All generation of collimator geometry is done by the host for simplicity, but could equally be done on an OpenCL device.

3.1.4 Parallelization and concurrency

The ray tracing is implemented so that the intensity of each ray is calculated independently of every other ray. That enables flexible use of auto tuning. Grouping rays together using packet traversal is common in ray tracing, but the results of Aila and Laine [28] show that independent ray traversal is more efficient on GPUs than packet traversal.

The three steps in the calculation of the fluence map are each placed in their own kernel. This gives an implicit synchronization between the steps, since the kernels execute one at a time. The final summation step depends on this synchronization, so that it does not start before all preceding calculations have finished. Explicit synchronization is only done when global data is copied to a local buffer in a work-group.

3.1.5 General optimizations

Depending on the type of OpenCL device, the transfer of memory objects from the host to the device can be a considerable overhead. Typically the transfer is not needed on a CPU, but it is needed on a GPU because of its separate memory. Therefore the size of the memory objects should be minimized [19]. In this case, the memory objects (scene information and result) are small, so the overhead is negligible (see Table 13).

The majority of the scene data consists of the vectors that represent the vertices of triangles. That data is therefore separated into its own array. This both makes it easier to cache the data in local memory and makes the accesses more aligned.
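Separating the vertex data into its own array can be sketched as follows. This is a hypothetical host-side layout, not the one used in the implementation: each scene object keeps only the index of its first vertex and its vertex count, while all coordinates live in one contiguous list that could be uploaded as a single buffer.

```python
from typing import List, Sequence, Tuple

Vertex = Tuple[float, float, float]
Triangle = Tuple[Vertex, Vertex, Vertex]


def flatten_scene(
    objects: Sequence[Sequence[Triangle]],
) -> Tuple[List[float], List[Tuple[int, int]]]:
    """Return (vertex_data, ranges); ranges[i] is (first_vertex, vertex_count)."""
    vertex_data: List[float] = []
    ranges: List[Tuple[int, int]] = []
    for triangles in objects:
        first = len(vertex_data) // 3  # three floats per vertex
        for triangle in triangles:
            for vertex in triangle:
                vertex_data.extend(vertex)
        ranges.append((first, len(vertex_data) // 3 - first))
    return vertex_data, ranges


# Two hypothetical scene objects: one with a single triangle, one with two.
t = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
vertex_data, ranges = flatten_scene([[t], [t, t]])

if __name__ == "__main__":
    print(len(vertex_data), ranges)
```

Because the vertices of one object are contiguous, copying that object into a local buffer becomes a single linear copy of a known range.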

A common way of structuring data is to have one array that contains several C structs of the same type. This pattern is called array of structures. Commonly only one variable in a structure is read at a time, in a loop over all structures in an array. This can prevent aligned access. A way to make the memory access aligned is to instead create one C struct with an array for each variable, containing the values of all objects. This pattern is called structure of arrays. All C structs are structured according to the structure of arrays pattern. This is the way to structure data recommended by both NVIDIA and Intel, because it gives a memory access pattern that is more cache-friendly [19, 20].

Array of structures: Object 0 (X, Y, Z), Object 1 (X, Y, Z), Object 2 (X, Y, Z), ...
Structure of arrays: X: (0, 1, 2, ...), Y: (0, 1, 2, ...), Z: (0, 1, 2, ...)

Figure 5. Illustration of the difference between array of structures and structure of arrays.

3.2 Optimization parameters and automatic tuning

Optimization parameters are parameters that can be changed to alter the behavior of the program in order to maximize the performance for a specific platform and hardware. The optimization parameters are grouped into categories according to the kind of behavior they change. The categories are: work-group size and shape, use of address spaces, structure, scene and intersection algorithms.
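The parameter categories above can be represented on the host as a search space. The following is a minimal sketch: the address space, structure and intersection algorithm values are taken from Table 2, while the key names and the candidate lists for work-group sizes and pieces are assumptions made only to give the space a finite size.

```python
from math import prod
from typing import Dict, List

search_space: Dict[str, List[object]] = {
    # Work-group size and shape (candidate values are assumed).
    "work_group_x": [1, 2, 4, 8, 16, 32, 64],
    "work_group_y": [1, 2, 4, 8, 16, 32, 64],
    "work_group_z": [1, 2, 4],
    # Address spaces (valid values from Table 2).
    "ray_space": ["private", "local"],
    "scene_space": ["constant", "global"],
    "triangle_space": ["local", "constant", "global"],
    "triangle_buffer_space": ["private", "local", "constant", "global"],
    # Structure: False means breadth-first.
    "depth_first": [True, False],
    # Scene: pieces per scene object (candidate values are assumed).
    "pieces": [1, 2, 4, 8, 20, 40],
    # Triangle intersection algorithm (valid values from Table 2).
    "intersection": ["DS", "MT1", "MT2", "MT3"],
}


def search_space_size(space: Dict[str, List[object]]) -> int:
    """Number of distinct configurations, ignoring invalid combinations."""
    return prod(len(values) for values in space.values())


if __name__ == "__main__":
    print(search_space_size(search_space))
```

Even with these short candidate lists the space holds several hundred thousand configurations, which gives a sense of why the choice between exhaustive testing and a search heuristic matters.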

Category                   Parameter                        Valid values                      Notes
Work-group size and shape  X, Y, Z                          integer in [1, ∞) for each        Has to be lower than the index space.
Address spaces             Ray                              private, local
                           Scene information                constant, global
                           Triangle data                    local, constant, global
                           Triangle data buffer             private, local, constant, global
Structure                  Depth-first                      [True, False]                     False means breadth-first.
Scene                      Pieces                           integer in [1, ∞)                 Max is the number of collimator leaves.
Intersection algorithms    Triangle intersection algorithm  [DS, MT1, MT2, MT3]

Table 2. Summary of the optimization parameters.

3.2.1 Work-group size and shape

The work-group size decides how many work-items are grouped together in the same work-group. Typically, work-items in the same work-group have access to a fast on-chip memory where shared data can be stored. If shared data is copied from the global address space to the shared on-chip address space, a performance gain can be expected, provided that the data is reused enough to outweigh the cost of copying it to the on-chip memory. The size of the work-group often decides how much data is allocated in the shared memory space. The best size depends on many factors, and the following considerations have to be taken into account:

- Buffering data in the on-chip shared memory space reduces the number of accesses to the slower global memory space.
- If too much data is allocated in the on-chip memory, the program fails to run.
- The on-chip memory is sometimes used as an automatic cache. More memory allocated by the program can mean a smaller cache and less automatic caching of data from the global address space [25].
- Registers can be stored in the on-chip memory. Each work-item needs a certain number of registers, and if the on-chip memory is full, the registers are spilled to slower memory or the program fails to run.

On GPUs, the work-group size and shape can have a big impact on performance because it is usually what decides how much data is allocated in the controllable on-chip memory. A common approach is to use as much of the on-chip memory as possible without overflowing it. On CPUs the work-group size is not as important, because OpenCL typically does not have control over the faster caches. Intel suggests not setting the work-group size at all [19, 20, 24, 25]. This implementation can adjust the work-group size and shape in three dimensions. OpenCL implementations set limits on the size and shape of a work-group, where the X and Y dimensions typically support a larger size than the Z dimension.

Vendor                                X size      Y size   Z size
NVIDIA compute capability 1.0 - 2.x   65535       65535    64
NVIDIA compute capability 3.0         2^31 - 1    65535    64
AMD GPU with SDK 2.1                  Product of all dimensions <= 256
Apple Mac OS X                        Dependent on hardware

Table 3. Overview of maximum allowed size per dimension and implementation [24, 26, 27].

3.2.2 Address spaces

OpenCL supports allocation in four different memory spaces: global, constant, local and private [15]. Since no guarantees are made about the performance of the memory spaces, it is not known in advance where data should reside to make the best use of the hardware.

Data                  Type       Size               Available memory spaces
Rays                  R/W        32 bytes           private, local
Scene information     Read only  1512-16172 bytes   constant, global
Triangle data         Read only  39360 bytes        local, constant, global
Triangle data buffer  R/W*       480-19200 bytes    private, local, constant, global

Table 4. Data, type, size and where it can be allocated for each work-item using the test scene described in 4.1.2.

Rays are allocated and created on the OpenCL device, so only the private and local address spaces are available. The host could allocate global memory to store rays in, but this is not done, for simplicity.

The triangle data buffer caches the triangle data of a single scene object. When it is constant or global, the buffer is just a pointer to constant or global memory and nothing is copied. When both the triangle data and the triangle data buffer are in the local address space, all scene objects are copied to the triangle data buffer at the start. When the buffer is private or local and the triangle data is constant or global, data has to be copied to the chosen address space on the device, because data cannot be copied to that address space directly from the host. Triangle data is copied to the buffer when needed. The size of the scene information and of the triangle data buffer is variable because a scene object can be split into smaller parts. The smallest division creates a triangle data buffer of 448 bytes and the full buffer is 19200 bytes. This is for a scene with two single-leaf collimators and two forty-leaf collimators, as described in the test scene in 4.1.2. There also exist multi-leaf collimators with 80 leaves, which would increase the triangle data to 77760 bytes, which in turn would not fit on an NVIDIA GPU with compute capability 2.0 (see Table 6).
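The size argument above can be checked with a small calculation. The sketch below rests on two labeled assumptions. First, a per-leaf size of 480 bytes, inferred because it reproduces both the 39360 byte and the 77760 byte figures quoted in the text. Second, a capacity of 48 KiB used as a stand-in for the limit referred to in Table 6; in practice the value would be queried from the device, for example through the CL_DEVICE_LOCAL_MEM_SIZE device property.

```python
from typing import Sequence

BYTES_PER_LEAF = 480               # assumption: reproduces the quoted sizes
ASSUMED_CAPACITY_BYTES = 48 * 1024  # assumption: stand-in for the device limit


def triangle_data_bytes(leaves_per_collimator: Sequence[int]) -> int:
    """Total triangle data for a scene, one Box per collimator leaf."""
    return sum(leaves_per_collimator) * BYTES_PER_LEAF


def fits(required_bytes: int, capacity_bytes: int) -> bool:
    """True if the data fits in a memory space of the given capacity."""
    return required_bytes <= capacity_bytes


test_scene = [1, 1, 40, 40]    # two single-leaf and two forty-leaf collimators
large_scene = [1, 1, 80, 80]   # the same scene with 80-leaf collimators

if __name__ == "__main__":
    for scene in (test_scene, large_scene):
        size = triangle_data_bytes(scene)
        print(size, fits(size, ASSUMED_CAPACITY_BYTES))
```

A check of this kind could be used by a tuner to discard configurations that place the triangle data in an address space too small to hold it, before any kernel is built or timed.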


More information

GSM Voice Mail Service TDM Call Control

GSM Voice Mail Service TDM Call Control IT 12 072 Examensarbete 30 hp December 2012 GSM Voice Mail Service TDM Call Control Ebby Wiselyn Jeyapaul Institutionen för informationsteknologi Department of Information Technology Abstract GSM Voice

More information

Website Globalization in Monetary Gaming

Website Globalization in Monetary Gaming IT 08 021 Examensarbete 30 hp Maj 2008 Website Globalization in Monetary Gaming Jan Sundman Institutionen för informationsteknologi Department of Information Technology Abstract Website Globalization

More information

Mitigation of Virtunoid Attacks on Cloud Computing Systems

Mitigation of Virtunoid Attacks on Cloud Computing Systems IT 15 005 Examensarbete 15 hp Februari 2015 Mitigation of Virtunoid Attacks on Cloud Computing Systems Daniel McKinnon Forsell Institutionen för informationsteknologi Department of Information Technology

More information

User Behavior Analysis and Prediction Methods for Large-scale Video-ondemand

User Behavior Analysis and Prediction Methods for Large-scale Video-ondemand IT 15071 Examensarbete 30 hp August 2015 User Behavior Analysis and Prediction Methods for Large-scale Video-ondemand System Huimin Zhang Institutionen för informationsteknologi Department of Information

More information

Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration

Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration Jinglin Zhang, Jean François Nezan, Jean-Gabriel Cousin, Erwan Raffin To cite this version: Jinglin Zhang,

More information
