Acceleration for Personalized Medicine Big Data Applications
Zaid Al-Ars
Computer Engineering (CE) Lab, Delft Data Science, Delft University of Technology
Introduction / Definition & relevance
- Personalized medicine is the customization of healthcare, with medical decisions, practices, and products tailored to the individual patient.
- It is an example of a societally critical, highly demanding big data application domain.
Introduction / Scientific and societal challenges
- Exponentially growing data volumes
- Increasing complexity of analysis
- Both computational and data challenges
Introduction / Scientific and societal challenges
- Urgent clinical diagnostics, for example targeted cancer and neonatal diagnostics
  → We provide techniques to reduce compute time
- Cost is prohibitive for society: more patients and diseases to be treated
  → We provide techniques to reduce cost
[Figure labels: COMPUTE COST, COMPUTE TIME]
Introduction / Master class outline
- Introduction and background
  - Field of personalized medicine
  - Challenges and opportunities
  - Relations to other big data fields
- Computational big data pipeline
  - Stages of a typical personalized medicine pipeline
  - Methods to reduce computation time
  - Methods to reduce pipeline cost
- Solution demonstration
Background / Field of personalized medicine
- Vision: P4 medicine, that is, medicine that is predictive, preventive, personalized, and participatory
Background / Field of personalized medicine
Sources of personalized information:
- Measurements of vitals & body data
- Regular testing of blood, spit, urine, etc.
- Genome data sequencing
Background / Field of personalized medicine
Measurements of vitals & body data
- Pros
  - Body is minutely and continuously monitored
  - Corporate support from big industry
- Cons
  - Use is not yet clear
  - Health risks are not monitored
→ Not known if applications in health are possible
Background / Field of personalized medicine
Regular testing of blood, spit, urine, etc.
- Pros
  - Measurement of hundreds of molecules in the body
  - Direct correlation to health risk
- Cons
  - Still too expensive
  - No specific health advice possible yet
→ Possible future use if cost becomes manageable
Background / Field of personalized medicine
Genome data sequencing
- Pros
  - Detailed knowledge of genetic information
  - Known markers to diagnose disease
- Cons
  - Huge computational effort
→ Can be used today if the computational effort becomes manageable
Background / DNA-based diagnostics
Background / DNA-based diagnostics
- DNA mutation results in abnormal cell behavior
- Some mutations cause cells to divide without control, causing cancer
- Cancer can be diagnosed by identifying which mutations are in the DNA
→ Cancer diagnostics is the main use for DNA data today
Big data pipeline / Computational big data pipeline
Three main stages:
1. Data generation (GENERATE): generate and store DNA data using specialized compression techniques
2. Data analysis (ANALYZE): accelerate the mapping and variant calling algorithms of genetic analysis on hardware
3. Data visualization (INTERPRET): understand the analyzed genetic data to make clinical decisions for the patient
Big data pipeline / Data generation
- DNA processing passes through 3 stages
  - Sequence generation
  - Data analysis
  - Result interpretation
- Sequence generation faces size bottlenecks
[Figure: log-scale plot over the years 2003 to 2011. Source: Lincoln D. Stein, "The case for cloud computing in genome informatics", Genome Biology, 11:207, 2010.]
Big data pipeline / Data analysis
- Growth in data generation throughput is faster than growth in CPU processing capacity
  - Growth is exponential
  - Need for rapidly increasing processing capacity
[Figure: DNA sequencing (bp/day) and CPU speed (M inst./s) on a log scale, 2003 to 2011. Source: Po-Ru Loh, Michael Baym & Bonnie Berger, "Compressive genomics", Nature Biotechnology, 30:627-630, 2012.]
Big data pipeline / Data interpretation
- Relative cost of interpretation is increasing
  - Number of sequenced genomes increases
  - Cross-referencing multiple genomes to identify correlations
  - Need for innovative DNA visualization
- Sequence generation faces size bottlenecks
[Figure: genotyping versus interpretation, 2012 and 2020, on a 0% to 100% scale. Source: Ingo Helbig, "Be literate when the exome goes clinical", http://channelopathist.net/, June 6, 2012.]
Big data pipeline / Current solution
- Current solution: increasing capacity in local or cloud clusters
  - Not always the best solution
[Figure: growth in DNA and CPU computational complexity; DNA sequencing (bp/day) and CPU speed (M inst./s) on a log scale, 2003 to 2011. Source: Po-Ru Loh, Michael Baym & Bonnie Berger, "Compressive genomics", Nature Biotechnology, 30:627-630, 2012.]
Big data pipeline / CE Lab solution: compression
- Domain-specific compression
  - Enables high compression ratios
  - Allows reduced infrastructure footprint
- Possible transparent compression from and to the file system
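To make the idea of domain-specific compression concrete, here is a minimal sketch that packs each nucleotide into 2 bits instead of an 8-bit character. This is only an illustration of exploiting the four-letter DNA alphabet; the slides do not say which technique the CE Lab uses, and the names `pack` and `unpack` are hypothetical.

```python
# Hypothetical illustration, not the CE Lab codec: store A/C/G/T in 2 bits each.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so every byte is decoded the same way.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


packed = pack("GATTACA")
restored = unpack(packed, 7)
```

The sequence length has to be stored alongside the packed bytes, because the last byte may contain padding. Real sequencing data also carries ambiguous bases and quality scores, which this sketch deliberately ignores.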
Big data pipeline / CE Lab solution: acceleration
- Hybrid core computing
  - Means using dedicated computing chips for specific algorithms
  - Next to traditional general-purpose CPUs (Intel processors)
- Dedicated chips use FPGAs (field-programmable gate arrays), such as those from Xilinx
- Recreate small compute elements in hardware
- Can run tens of computations in parallel
- Becoming mainstream: used by Intel, IBM, Microsoft, Facebook, etc.
Big data pipeline / CE Lab solution: acceleration
- Compare and align nucleotide or protein sequences
- Algorithm scores every possible alignment
  - Each cell of the matrix compares an element of the query with an element of the database
  - Much parallelism, both within and between sequences
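The matrix-based scoring described above can be sketched in software as follows. The slide does not name the algorithm; this sketch assumes a Smith-Waterman style local alignment with a linear gap penalty, and the scoring values (`match`, `mismatch`, `gap`) are illustrative assumptions.

```python
def smith_waterman_score(query: str, target: str,
                         match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Return the best local alignment score between query and target.

    Assumed scoring scheme (not from the slides): linear gap penalty.
    """
    rows, cols = len(query) + 1, len(target) + 1
    # H[i][j] holds the best score of an alignment ending at query[i-1], target[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if query[i - 1] == target[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment
                H[i - 1][j - 1] + step,  # align the two elements
                H[i - 1][j] + gap,    # gap in the target
                H[i][j - 1] + gap,    # gap in the query
            )
            best = max(best, H[i][j])
    return best


score = smith_waterman_score("TTACGTT", "GGACGGG")
```

Each cell depends only on its upper, left, and upper-left neighbours, so all cells on the same anti-diagonal are independent of each other. That is the parallelism within a sequence pair; separate query and database pairs can additionally be scored independently.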
Big data pipeline / CE Lab solution: distribution
- Efficient utilization of available hardware resources
  - Less hardware is used for the same algorithms
- Tuning of the hardware-software system to the use case
  - More parallelism extracted from algorithms
[Figure: task diagram with tasks P1, P2, P3, ..., Pn and S1, S2, S3, ..., Sn]
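One way to read the task diagram is that independent tasks run side by side while each task works through its own ordered steps. The sketch below illustrates that reading under stated assumptions: the interpretation of P and S as parallel and serial tasks, the three steps, and the function names `run_serial_steps` and `run_parallel` are all hypothetical and not taken from the slides.

```python
from concurrent.futures import ThreadPoolExecutor


def run_serial_steps(chunk):
    """Hypothetical per-chunk work: the steps must run in this order."""
    cleaned = [read.upper() for read in chunk]            # step 1: normalize
    kept = [read for read in cleaned if "N" not in read]  # step 2: drop ambiguous reads
    return sum(len(read) for read in kept)                # step 3: count remaining bases


def run_parallel(chunks, workers=4):
    """Run the serial steps for every chunk, with chunks processed concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input chunks.
        return list(pool.map(run_serial_steps, chunks))


totals = run_parallel([["acgt", "acnn"], ["gg"], []])
```

Threads keep the sketch short and easy to run. For CPU-bound work in Python, a process pool or a native or hardware backend would be the more realistic choice.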
Big data pipeline / CE Lab solution: distribution
- Higher performance
  - 5x to 25x speed gains
- Energy saving
  - Up to 90% power reduction
- Easy to use, program, and manage
  - Standard Linux ecosystem
  - Transparent to the user
- Well suited for bioinformatics
  - Inherent parallelism exploited by pipelining
  - Small data types use logic efficiently
Next steps / Delft Data Science research agenda
CE Lab provides a holistic approach to optimizing big data infrastructure:
1. Addressing big data storage limitations: effective compression techniques
2. Addressing big data computational time: acceleration of big data algorithms
3. Addressing big data system cost: effective utilization of system resources
[Figure labels: storage limitations, computational bottlenecks, infrastructure cost optimizations]
Next steps / Collaboration opportunities
- Collaborations on big data infrastructure
  - Work together on industrially relevant challenges
  - Transfer of expert knowledge to organizations
- CE Lab is leading research in
  - Pipeline-wide performance optimization
  - Integrated system cost optimization
- Large network of leading technology providers
  - IBM, Intel, Altera, etc.
Next steps / Contact for further discussion
Contact for further discussion on collaborations, questions, or feedback:
Zaid Al-Ars
CE Lab / TUDelft
Mekelweg 4, 2628 CD Delft
Email: z.al-ars@tudelft.nl
Web: ce.ewi.tudelft.nl/zaid
Tel: 015 27 89097
Next steps / Future prospects
- Genetic analysis has significant potential
  - Personalized medicine
  - Preemptive intervention
  - Trait selection and enhancement
  - Etc.
- Early detection & cure of diabetes with iPOP (60 terabytes of data)
Next steps / Solution demonstration