Big Data Processing and Analytics for Mouse Embryo Images
Liangxiu Han, Zheng Xie, Richard Baldock
The AGILE Project team
FUNDS Research Group - Future Networks and Distributed Systems
School of Computing, Mathematics and Digital Technology, Manchester Metropolitan University
Group webpage: http://www.scmdt.mmu.ac.uk/research/funds/
Outline
- Background
- Motivation and Challenges
- Methodology
- Experimental Evaluation
- Exploration of Parallelisation
- Ongoing Work
Background - Big Data Era
Increasing capability of generating and capturing data:
- User-generated data (e.g. social media and networks)
- Machine and sensor data (e.g. experimental simulations, environmental sensors, RFID)
- Open government and public data
Big data era: data-intensive, data-centric, data-driven
[Figure: data size trends (IDC report), showing data volumes in zettabytes (10^21 bytes) for 2010, 2011 and 2020]
Background - Big Data Era
Data from different domains: astrophysics, biomedical science, geoscience, social science, etc.
- Facebook: 625,000 TB per day
- Sloan Sky Survey: over 40 TB
- caBIG: over 4.7 million biomedical images for cancers
- Gene expression data in GEO and ArrayExpress: over 1 million
- GenBank: over
Background - Big Data Era
[Figure: number of base pairs (log scale, 10^5 to 10^12) against date (mm/yy), from 01/82 to 01/14. Source: Nature 487, 282-283 (19 July 2012), doi:10.1038/487282a]
Background - Big Data Era
What is big data?
- A relative term (don't define it in terms of size being larger than a certain number of terabytes or petabytes)
- Data that are larger, more complex, and harder to access, organise and analyse than existing tools can handle (the threshold varies by sector)
- Volume (amount of data), variety (types of data), velocity (the growth rate of data, coupled with the need to deliver insights and make decisions faster) and complexity (difficulties in transforming and integrating or linking various data) beyond the capability of an organisation to capture, store, manage and process
Background - Challenges
- How to filter and reduce the amount of data to enable timely insight and decisions? Data analytics (computational modelling)
- How to process large-scale data efficiently (data movement)? Data processing (parallel and distributed computing, high-performance computing)
[Figure: data processing and analysis, relating data volume, variety and complexity (low to high) to data value (low to high). Source: http://dicomputing.pnnl.gov/]
Background - AGILE Project
BBSRC-funded AGILE project: A Cloud Approach to Automatic Gene Expression Pattern Recognition and Annotation over Large-Scale Images
- Foundational work: automatic gene expression pattern recognition and annotation (data analytics)
- Goal I: development of parallel approaches to allow efficient exploitation of Cloud computing (data processing)
- Goal II: development of a generic data reuse mechanism and standard services for performance enhancement and cost reduction in the Cloud (data processing)
Motivation & Challenges - I
Annotation of gene expression patterns: tagging the gene expression pattern of an anatomical component in an image with the corresponding anatomical term from the ontology
Motivation & Challenges - II
[Images: euxaxssay_007708_02.jpg, euxaxssay_007708_06.jpg, euxaxssay_007708_16.jpg]
Motivation & Challenges - III
- Gene expression patterns: a way to understand the interaction between genes
- The availability of both ontological annotation and spatial gene patterns: a resource for identifying the mechanism of embryo organisation
- The current manual annotation: costly and time-consuming
- Massive amounts of data and a complicated organism: the annotation process needs to be automated
Motivation & Challenges - IV
- Big data: now over 20 TB
- Multiple components coexist in one image
- Variable shape, location and orientation of images
- The number of images associated with a given gene is uneven
- The dimensionality of each image is high (3k x 4k pixels)
Methodology - I
The framework
Methodology - II
Core methodologies:
- Image processing
- Wavelet transform
- Fisher ratio
- LDA (SVM, ANN, LSVM)
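The slides name the Fisher ratio without giving a formula. A minimal sketch, assuming the common two-class form (squared difference of class means divided by the sum of class variances) and hypothetical toy data, could rank features as follows. The function names `fisher_ratio` and `rank_features` are illustrative, not taken from the project.

```python
from statistics import fmean, pvariance

def fisher_ratio(pos, neg):
    """Two-class Fisher ratio of one feature (assumed form):
    (mean_pos - mean_neg)^2 / (var_pos + var_neg)."""
    gap = fmean(pos) - fmean(neg)
    spread = pvariance(pos) + pvariance(neg)
    if spread == 0:
        # Degenerate case: no within-class variation at all.
        return 0.0 if gap == 0 else float("inf")
    return gap * gap / spread

def rank_features(pos_rows, neg_rows):
    """Score every feature column and return (indices best-first, scores)."""
    n_features = len(pos_rows[0])
    scores = [
        fisher_ratio([row[j] for row in pos_rows], [row[j] for row in neg_rows])
        for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores

# Hypothetical toy data: two features per sample, three samples per class.
pos = [[1.0, 5.0], [1.2, 9.0], [0.8, 7.0]]
neg = [[3.0, 6.0], [3.2, 8.0], [2.8, 7.0]]
order, scores = rank_features(pos, neg)
```

In this toy data the first feature separates the classes and the second does not, so the first feature is ranked ahead of the second.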
Methodology - III
Image processing: filtering
Methodology - IV
Wavelet transform: wavelet decomposition
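The slide does not say which wavelet family is used. As an illustration only, the sketch below assumes the Haar wavelet in its unnormalised averaging form and performs a single-level 2D decomposition of a small grid with even dimensions. The band names (`approx`, `detail_x`, `detail_y`, `detail_xy`) are hypothetical labels chosen for this example.

```python
def haar_1d(values):
    """One Haar step: pairwise averages, then pairwise half-differences."""
    pairs = list(zip(values[0::2], values[1::2]))
    averages = [(a + b) / 2 for a, b in pairs]
    details = [(a - b) / 2 for a, b in pairs]
    return averages + details

def haar_2d(image):
    """Single-level 2D Haar decomposition of a grid with even width and height."""
    rows = [haar_1d(list(row)) for row in image]           # transform along each row
    cols = [haar_1d(list(col)) for col in zip(*rows)]      # then along each column
    out = [list(row) for row in zip(*cols)]                # transpose back
    half_r, half_c = len(out) // 2, len(out[0]) // 2
    return {
        "approx": [row[:half_c] for row in out[:half_r]],     # averages in both directions
        "detail_x": [row[half_c:] for row in out[:half_r]],   # differences along rows
        "detail_y": [row[:half_c] for row in out[half_r:]],   # differences along columns
        "detail_xy": [row[half_c:] for row in out[half_r:]],  # differences in both directions
    }

# Hypothetical 4x4 image made of four constant 2x2 blocks.
image = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 4, 4],
    [3, 3, 4, 4],
]
bands = haar_2d(image)
```

Because each 2x2 block of the toy image is constant, the approximation band holds one value per block and every detail band is zero. Applying `haar_2d` again to `bands["approx"]` would give the next decomposition level.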
Methodology - V
Fisher's ratio
LDA (Linear Discriminant Analysis)
- Linear discriminant function
- Target function
- Between-class scatter matrix
- Within-class scatter matrix
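The four quantities listed on this slide appear here without their formulas. In standard textbook notation, which is an assumption and may differ from the symbols the authors used, they can be written as follows. The classes are indexed c = 1, ..., C, class c has N_c samples in the set D_c with mean mu_c, and mu is the overall mean.

```latex
\begin{align}
  % Linear discriminant function
  y(\mathbf{x}) &= \mathbf{w}^{\top}\mathbf{x} + w_0 \\
  % Target function (Fisher criterion), maximised over w
  J(\mathbf{w}) &= \frac{\mathbf{w}^{\top} S_B\, \mathbf{w}}{\mathbf{w}^{\top} S_W\, \mathbf{w}} \\
  % Between-class scatter matrix
  S_B &= \sum_{c=1}^{C} N_c\, (\boldsymbol{\mu}_c - \boldsymbol{\mu})(\boldsymbol{\mu}_c - \boldsymbol{\mu})^{\top} \\
  % Within-class scatter matrix
  S_W &= \sum_{c=1}^{C} \sum_{\mathbf{x}_i \in \mathcal{D}_c} (\mathbf{x}_i - \boldsymbol{\mu}_c)(\mathbf{x}_i - \boldsymbol{\mu}_c)^{\top}
\end{align}
```

For two classes, the direction that maximises J is proportional to S_W^{-1}(mu_1 - mu_2), so it can be found without an iterative search.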
Experimental Evaluation
http://www.eurexpress.org/ee/databases/assay.jsp?assayid=euxassay_010351
This work has been published in the journal Bioinformatics:
Han, L., van Hemert, J., Baldock, R., "Automatically Identifying and Annotating Mouse Embryo Gene Expression Patterns", Bioinformatics 27(8), pp. 1101-1107, Oxford University Press, 2011. DOI: 10.1093/bioinformatics/btr105
Parallelisation Exploration - I
Parallelisation (e.g. multicore, cloud computing), together with parallel programming models (e.g. MPI, MapReduce), is a sought-after solution to big data problems.
Three considerations for parallelising an application:
- How to distribute workloads or decompose an algorithm into parts
- How to map the tasks onto computing nodes and execute subtasks in parallel
- How to coordinate the subtasks on those computing nodes and communicate between them
Parallelisation Exploration - II
- Data parallelism: the workload is distributed across computing nodes, and the same task is executed on different subsets of the data simultaneously
- Task parallelism: tasks are independent and can be executed fully in parallel
- Pipelining: one iteration of a task consists of several stages that are chained and executed in order, with the output of one stage being the input of the next
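Data parallelism and stage chaining can be sketched with Python's standard `concurrent.futures` module. This is a minimal illustration on one machine, not the project's implementation. The stage functions `denoise` and `extract_feature` and the toy images are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def denoise(pixels):
    """Hypothetical stage 1: clamp negative values to zero."""
    return [max(p, 0) for p in pixels]

def extract_feature(pixels):
    """Hypothetical stage 2: summarise an image by its mean intensity."""
    return sum(pixels) / len(pixels)

def run_stages(pixels):
    """Chained stages: the output of one stage is the input of the next.
    A full pipeline would additionally overlap stages across images."""
    return extract_feature(denoise(pixels))

def process_all(images, workers=4):
    """Data parallelism: the same chain of stages runs on different images concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_stages, images))  # results keep the input order

# Hypothetical toy "images" as flat lists of pixel values.
images = [[-1, 3, 4], [2, 2, 2], [0, -5, 9]]
features = process_all(images)
```

Threads keep the sketch simple. For CPU-bound work in CPython, `ProcessPoolExecutor` offers the same `map` interface and runs the workers in separate processes.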
Parallelisation Exploration - III
Two parallel implementations:
- MapReduce
- MPI (Message Passing Interface)
[Figure: workflow annotated with task parallelism and data parallelism. Stages shown for the training and testing datasets: read image file, image rescale, image denoise, feature generation, query, sample split, feature selection, feature extraction, classifier, prediction evaluation]
Publications:
- Xie, Z., Han, L., Baldock, R., "Enhancing Parallelism of Data-Intensive Bioinformatics Applications", 8th EUROSIM Congress on Modelling and Simulation (EUROSIM 2013), Cardiff, 2013.
- Han, L., Ong, H.-Y., "Accelerating biomedical data intensive application using MapReduce", 13th ACM/IEEE International Conference on Grid Computing (Grid 2012), Sep. 20-23, 2012, Beijing, China.
Parallelisation Exploration - IV
MapReduce implementation
[Figure: an MR controller runs a sequence of MapReduce jobs: image processing, wavelet transform, feature generation (training-set and testing-set features), a data packer, and a cross-validation loop covering feature selection, classifier and prediction]
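The MapReduce programming model can be illustrated in plain Python without a cluster. This is a single-process sketch under assumptions, not the project's MapReduce jobs. The record layout (label, feature vector) and the choice of computing per-class mean feature vectors are hypothetical.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit (class label, (feature vector, 1)) for one image record."""
    label, features = record
    yield label, (features, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(label, values):
    """Reduce: average the feature vectors that share one label."""
    total = [0.0] * len(values[0][0])
    count = 0
    for features, n in values:
        total = [t + f for t, f in zip(total, features)]
        count += n
    return label, [t / count for t in total]

def run_job(records):
    """Run map, shuffle and reduce in sequence within one process."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

# Hypothetical records: (annotation label, feature vector).
records = [
    ("expressed", [1.0, 2.0]),
    ("expressed", [3.0, 4.0]),
    ("not_expressed", [0.0, 2.0]),
]
means = run_job(records)
```

In a real framework the map calls run on the nodes holding the data and the reduce calls run per key, so only the grouped intermediate pairs move across the network.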
Parallelisation Exploration - V
MPI implementation
[Figure: parallel architecture for the workflow, in three stages.
Image processing and feature generation: MPI Scatter distributes sub-datasets of the image dataset from the rank 0 processor; each rank (rank 0, rank 1, ..., rank N) processes its images into a feature dataset; MPI Gather forms the complete feature dataset at the rank 0 processor.
Feature selection and extraction: MPI Scatter distributes sub-feature-datasets and indices from the rank 0 processor; each rank performs fine-grain feature selection; MPI Gather brings the results to the rank 0 processor for feature extraction.
Classification: MPI Scatter distributes sub-feature-datasets and indices from the rank 0 processor; fine-grain group means (negative and positive) and a fine-grain global mean are computed, with MPI Send/Recv steps; fine-grain covariance and the fine-grain discrimination function follow, with the test dataset as input; MPI Gather brings the results to the rank 0 processor for classification prediction.]
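A scatter, local-compute, gather pattern, which MPI expresses with the collective operations MPI_Scatter and MPI_Gather, can be simulated in one process to show the data flow. The sketch below is a stand-in written in plain Python and does not call an MPI library. The helper names `scatter`, `local_stats` and `gather_mean` and the sample values are hypothetical.

```python
def scatter(items, n_ranks):
    """Stand-in for a scatter from the root: split items into n_ranks chunks."""
    return [items[rank::n_ranks] for rank in range(n_ranks)]

def local_stats(chunk):
    """Work done independently on one rank: partial sum and count."""
    return sum(chunk), len(chunk)

def gather_mean(partials):
    """Stand-in for a gather at the root: combine partial results into a global mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

# Hypothetical feature values held by the root before distribution.
values = [2.0, 4.0, 6.0, 8.0, 10.0]
chunks = scatter(values, n_ranks=3)
partials = [local_stats(chunk) for chunk in chunks]  # one call per rank; concurrent under MPI
global_mean = gather_mean(partials)
```

Sending partial sums and counts rather than raw values keeps the gathered message size independent of how many items each rank holds.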
Parallelisation Exploration - VI
Experiment results (speedup)
[Figure: two panels, MapReduce and MPI, plotting speedup against the number of nodes (0 to 16) for workloads from 1X up to 32X in one panel and up to 64X in the other, where X = 128 images, with the ideal speedup shown for reference]
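Speedup is conventionally the single-node run time divided by the run time on p nodes, and parallel efficiency is speedup divided by p. The sketch below applies these definitions to hypothetical timings. The numbers are invented for illustration and are not the measurements behind the figure.

```python
def speedup(t_single, t_parallel):
    """Speedup S(p) = T(1) / T(p)."""
    return t_single / t_parallel

def efficiency(t_single, t_parallel, nodes):
    """Parallel efficiency E(p) = S(p) / p; 1.0 corresponds to ideal speedup."""
    return speedup(t_single, t_parallel) / nodes

# Hypothetical run times in seconds, keyed by number of nodes.
timings = {1: 800.0, 2: 420.0, 4: 230.0, 8: 140.0, 16: 100.0}

table = {
    nodes: (speedup(timings[1], t), efficiency(timings[1], t, nodes))
    for nodes, t in timings.items()
}
```

The ideal line in a speedup plot corresponds to S(p) = p, that is, an efficiency of 1.0 at every node count.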
Ongoing Work - I
Data analytics: feature extraction (locate the region, process that region of the image, and reduce computing time) and classification algorithms
Ongoing Work - II
Data analytics: combination of high-level concepts (semantics) and low-level features (image processing and data mining)
Ongoing Work - III
Data processing (parallelisation and cloud computing):
- Development of a data-reuse mechanism for cost-effective optimisation of data-intensive applications running in the Cloud
- Large-scale evaluation in the Cloud
FUNDS Research Group
Group members: 4 academic staff, 1 PDRA and 8 PhD students, plus two associate members. More new members are expected to join.
Areas of interest:
- Novel architectures for networked distributed systems (parallel and distributed computing, cloud computing, wired and wireless sensor networks, IoT, etc.)
- Large-scale data mining (application domains: biomedical images, environmental sensors, computer network traffic, web pages including content and link graphs, social network analysis, etc.)
Thank you