Big Data Processing and Analytics for Mouse Embryo Images


Big Data Processing and Analytics for Mouse Embryo Images
Liangxiu Han, Zheng Xie, Richard Baldock
The AGILE Project team
FUNDS Research Group - Future Networks and Distributed Systems
School of Computing, Mathematics and Digital Technology, Manchester Metropolitan University
Group webpage: http://www.scmdt.mmu.ac.uk/research/funds/

Outline
- Background
- Motivation and Challenges
- Methodology
- Experimental Evaluation
- Exploration of Parallelisation
- Ongoing Work

Background: Big Data Era
- Increasing capability of generating and capturing data: user-generated data (e.g. social media/networks), machine and sensor data (e.g. experimental simulations, environmental sensors, RFID), open government and public data, etc.
- Big Data era: data-intensive, data-centric, data-driven
[Figure: data size trends (IDC report); data volumes in zettabytes (10^21 bytes) for 2010, 2011 and 2020.]

Background: Big Data Era
Data from different domains: astrophysics, biomedical science, geoscience, social science, etc.
- Facebook: 625,000 TB per day
- Sloan Sky Survey: over 40 TB
- caBIG: 4.7+ million biomedical images for cancers
- Gene expression data in GEO and ArrayExpress: over 1 million
- GenBank: over

Background: Big Data Era
[Figure: number of base pairs (log scale, 10^5 to 10^12) against date (mm/yy, 01/82 to 01/14). Source: Nature 487, 282-283 (19 July 2012), doi:10.1038/487282a]

Background: Big Data Era
What is Big Data?
- A relative term (don't define it in terms of size being larger than a certain number of terabytes or petabytes)
- Data that are larger, more complex and harder to access, organise and analyse than existing tools can handle (this varies by sector)
- The volume (amount of data), variety (types), velocity (the growth rate of data, coupled with the need to deliver insights and make decisions faster) and complexity (difficulties in transforming and integrating/linking various data) of the data are beyond the capability of an organisation to capture, store, manage and process

Background: Challenges
- How to filter and reduce the amount of data to enable timely insight and decisions? Data analytics (computational modelling)
- How to process large-scale data efficiently (data movement)? Data processing (parallel and distributed/high performance computing)
[Figure. Source: http://dicomputing.pnnl.gov/]

Background: AGILE Project
BBSRC-funded AGILE project: A Cloud Approach to Automatic Gene Expression Pattern Recognition and Annotation over Large-Scale Images
- Foundational work: automatic gene expression pattern recognition and annotation (data analytics)
- Goal I: development of parallel approaches to allow efficient exploitation of Cloud computing (data processing)
- Goal II: development of a generic data reuse mechanism and standard services for performance enhancement and cost reduction in the Cloud (data processing)


Motivation & Challenges I
Annotation of gene expression patterns: tagging the gene expression pattern of an anatomical component in an image with the corresponding anatomical term from the ontology

Motivation & Challenges II
[Images: euxaxssay_007708_02.jpg, euxaxssay_007708_06.jpg, euxaxssay_007708_16.jpg]

Motivation & Challenges III
- Gene expression patterns: a way to understand the interaction between genes
- The availability of both ontological annotation and spatial gene patterns: a resource for identifying the mechanisms of embryo organisation
- The current manual annotation: costly and time-consuming
- Massive amounts of data and a complicated organism: the annotation process needs to be automated

Motivation & Challenges IV
- Big data: now over 20 TB
- Multiple anatomical components coexist in one image
- Shape, location and orientation vary from image to image
- The number of images associated with a given gene is uneven
- The dimensionality of each image is high (3k x 4k pixels)


Methodology I: The Framework

Methodology II: Core Methodologies
- Image processing
- Wavelet transform
- Fisher ratio
- LDA (SVM, ANN, LSVM)

Methodology III: Image Processing - Filtering

Methodology IV: Wavelet Transform
Wavelet decomposition
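To make the idea of wavelet decomposition concrete, here is a minimal sketch of one decomposition level using the Haar wavelet. The slides do not say which wavelet family or how many levels the project used, so the Haar choice, the averaging normalisation and all function names below are assumptions made purely for illustration.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def haar_1d(signal: List[float]) -> Tuple[List[float], List[float]]:
    """One Haar level on an even-length signal: pairwise averages and differences."""
    averages = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    details = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return averages, details


def _transform_columns(block: Matrix) -> Tuple[Matrix, Matrix]:
    """Apply haar_1d to every column of block and return (low, high) sub-bands."""
    lows, highs = [], []
    for column in zip(*block):
        low, high = haar_1d(list(column))
        lows.append(low)
        highs.append(high)
    # Each entry of lows/highs is one transformed column, so transpose back to rows.
    return [list(r) for r in zip(*lows)], [list(r) for r in zip(*highs)]


def haar_2d(image: Matrix) -> Tuple[Matrix, Matrix, Matrix, Matrix]:
    """One 2D Haar level on an image with even width and height.

    Returns (ll, lh, hl, hh): ll is the half-resolution approximation,
    the other three hold the detail coefficients.
    """
    low_rows, high_rows = [], []
    for row in image:
        low, high = haar_1d(row)
        low_rows.append(low)
        high_rows.append(high)
    ll, lh = _transform_columns(low_rows)   # low along rows, then low/high along columns
    hl, hh = _transform_columns(high_rows)  # high along rows, then low/high along columns
    return ll, lh, hl, hh


# Toy 4x4 "image" (hypothetical data).
toy_image = [[1.0, 2.0, 3.0, 4.0],
             [5.0, 6.0, 7.0, 8.0],
             [9.0, 10.0, 11.0, 12.0],
             [13.0, 14.0, 15.0, 16.0]]
approximation, _, _, _ = haar_2d(toy_image)
```

Each entry of the approximation sub-band is the mean of a 2x2 block of the input, which is why repeated decomposition shrinks a 3k x 4k image to a far smaller set of coefficients.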

Methodology V: Fisher's Ratio and LDA (Linear Discriminant Analysis)
- Linear discriminant function
- Target function
- Between-class scatter matrix
- Within-class scatter matrix
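The formulas for the four quantities named on this slide did not survive in the transcript. For orientation, the standard textbook forms are sketched below; the notation is an assumption and may differ from the authors' own (C classes, class c with sample set D_c, N_c samples and mean mu_c, and overall mean mu).

```latex
% Linear discriminant function
g(\mathbf{x}) = \mathbf{w}^{T}\mathbf{x} + w_{0}

% Target function (Fisher criterion), maximised over \mathbf{w}
J(\mathbf{w}) = \frac{\mathbf{w}^{T} S_{B}\,\mathbf{w}}{\mathbf{w}^{T} S_{W}\,\mathbf{w}}

% Between-class scatter matrix
S_{B} = \sum_{c=1}^{C} N_{c}\,(\boldsymbol{\mu}_{c} - \boldsymbol{\mu})(\boldsymbol{\mu}_{c} - \boldsymbol{\mu})^{T}

% Within-class scatter matrix
S_{W} = \sum_{c=1}^{C} \sum_{\mathbf{x} \in D_{c}} (\mathbf{x} - \boldsymbol{\mu}_{c})(\mathbf{x} - \boldsymbol{\mu}_{c})^{T}
```

Maximising J(w) favours projections in which class means are far apart relative to the spread within each class.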


Experimental Evaluation
http://www.eurexpress.org/ee/databases/assay.jsp?assayid=euxassay_010351
This work has been published in the journal Bioinformatics:
* Han, L., van Hemert, J., Baldock, R., "Automatically Identifying and Annotating Mouse Embryo Gene Expression Patterns", Bioinformatics 27(8), pp. 1101-1107, Oxford University Press, 2011. DOI: 10.1093/bioinformatics/btr105


Parallelisation Exploration I
Parallelisation (e.g. multicore, cloud computing), along with parallel programming models (e.g. MPI, MapReduce), is a sought-after solution to big data problems.
Three considerations for parallelising an application:
- How to distribute workloads or decompose an algorithm into parts
- How to map the tasks onto various computing nodes and execute subtasks in parallel
- How to coordinate and communicate between subtasks on those computing nodes

Parallelisation Exploration II
- Data parallelism: the workload is distributed across different computing nodes, and the same task is executed on different subsets of the data simultaneously
- Task parallelism: tasks are independent and can be executed purely in parallel
- Pipelining: a task consists of many stages that are chained and executed in order, with the output of one stage being the input of the next
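The three patterns can be sketched on a single machine. This is an illustration only: threads stand in for computing nodes, the stage functions (rescale, mean_intensity) are hypothetical, and nothing here reflects the project's actual code.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

_DONE = object()  # sentinel marking the end of the stream in the pipeline


def rescale(image):
    """Hypothetical stage: map 8-bit pixel values to the range [0, 1]."""
    return [pixel / 255 for pixel in image]


def mean_intensity(image):
    """Hypothetical stage: reduce an image to a single feature."""
    return sum(image) / len(image)


def data_parallel(images, workers=4):
    """Data parallelism: the same task runs on different subsets of the data."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(mean_intensity, images))


def task_parallel(image):
    """Task parallelism: independent tasks run at the same time on one input."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        darkest = pool.submit(min, image)
        brightest = pool.submit(max, image)
        return darkest.result(), brightest.result()


def _run_stage(func, inbox, outbox):
    """Worker loop for one pipeline stage: read, transform, pass on."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(func(item))


def pipelined(images, stages):
    """Pipelining: stages are chained; each stage's output feeds the next."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [threading.Thread(target=_run_stage, args=(func, queues[i], queues[i + 1]))
               for i, func in enumerate(stages)]
    for thread in threads:
        thread.start()
    for image in images:
        queues[0].put(image)
    queues[0].put(_DONE)
    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results


images = [[0, 255], [51, 51]]  # two tiny hypothetical "images"
features = data_parallel(images)
extremes = task_parallel([3, 1, 2])
pipeline_out = pipelined(images, [rescale, mean_intensity])
```

In the pipeline, one stage can already be working on the second image while the next stage is still handling the first, which is where the overlap comes from.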

Parallelisation Exploration III
Two parallel implementations:
- MapReduce
- MPI (Message Passing Interface)
[Figure: task parallelism and data parallelism in the workflow over the training and testing datasets. Stages include: read image file, image rescale, image denoise, feature generation, sample split, feature selection, feature extraction, classifier, prediction evaluation.]
Publications
* Xie, Z., Han, L., Baldock, R., "Enhancing Parallelism of Data-Intensive Bioinformatics Applications", 8th EUROSIM Congress on Modelling and Simulation (EUROSIM 2013), Cardiff, 2013.
* Han, L., Ong, H.-Y., "Accelerating biomedical data intensive application using MapReduce", 13th ACM/IEEE International Conference on Grid Computing (Grid 2012), Sep. 20-23, Beijing, China, 2012.

Parallelisation Exploration IV
MapReduce implementation
[Figure: MapReduce implementation. An MR controller chains MapReduce jobs for image processing (IP), wavelet transform (WT), feature generation (training-set and testing-set features), data packing, and a loop for 10-fold cross-validation covering feature selection (FS), classification and prediction.]
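As a rough illustration of the map, shuffle and reduce steps behind one such job, the sketch below runs them in a single process. It is not Hadoop code and not the project's implementation; the record layout (gene name, pixel list) and every function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle (group by key) and reduce sequentially in one process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map: emit (key, value) pairs
            groups[key].append(value)       # shuffle: collect values per key
    return {key: reducer(key, values) for key, values in groups.items()}  # reduce


def mean_intensity_mapper(record):
    """Hypothetical mapper: one image in, one (gene, feature) pair out."""
    gene, pixels = record
    yield gene, mean(pixels)


def average_reducer(gene, values):
    """Hypothetical reducer: average the per-image features of one gene."""
    return mean(values)


# Toy records: (gene, pixel values of one image). The data are made up.
records = [
    ("geneA", [1.0, 3.0]),
    ("geneB", [2.0]),
    ("geneA", [5.0]),
]
per_gene = run_mapreduce(records, mean_intensity_mapper, average_reducer)
```

In a real cluster the mapper calls run on different nodes and the framework performs the grouping, but the mapper and reducer keep this shape.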

Parallelisation Exploration V
MPI implementation
[Figure: parallel architecture for the workflow. Three phases: image processing and feature generation; feature selection and extraction; classification. The Rank 0 processor scatters sub-datasets and feature datasets to the Rank 0, Rank 1 and Rank 2 processors with MPI scatter, results are collected at the Rank 0 processor with MPI gather, and processors exchange data with MPI send/receive.]

Parallelisation Exploration VI
Experiment results (speedup)
[Figure: speedup against number of nodes (0 to 16) for the MapReduce and MPI implementations, with x = 128 images. One panel shows curves for 1X to 32X and the other for 1X to 64X, each alongside the ideal speedup.]
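For reading speedup plots of this kind, the usual definitions are speedup = serial time / parallel time and efficiency = speedup / number of nodes; ideal speedup equals the node count. A minimal sketch follows. The timings in it are invented purely to exercise the functions and are not the project's measurements.

```python
def speedup(serial_seconds: float, parallel_seconds: float) -> float:
    """Speedup: how many times faster the parallel run is than the serial run."""
    return serial_seconds / parallel_seconds


def efficiency(serial_seconds: float, parallel_seconds: float, nodes: int) -> float:
    """Parallel efficiency: achieved speedup as a fraction of the ideal (= nodes)."""
    return speedup(serial_seconds, parallel_seconds) / nodes


# Hypothetical wall-clock times in seconds, keyed by node count (made-up values).
timings = {1: 160.0, 4: 50.0, 16: 20.0}
serial = timings[1]
for nodes, seconds in timings.items():
    print(nodes, speedup(serial, seconds), efficiency(serial, seconds, nodes))
```

A curve that falls below the ideal line as nodes are added corresponds to efficiency dropping below 1, typically because of communication and coordination overhead.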


Ongoing Work I
Data analytics: feature extraction (locate the region of interest, process only that region of the image, and reduce computing time) and classification algorithms

Ongoing Work II
Data analytics: combination of high-level concepts (semantics) and low-level features (image processing and data mining)

Ongoing Work III
Data processing side (parallelisation and cloud computing):
- Development of a data-reuse mechanism for cost-effective operation and optimisation of data-intensive applications running in the Cloud
- Large-scale evaluation in the Cloud

FUNDS Research Group
Group members: 4 academic staff + 1 PDRA + 8 PhD students, plus two associate members. More new members are expected to join.
Areas of interest:
- Novel architectures for networked distributed systems (parallel and distributed computing, cloud computing, wired and wireless sensor networks, IoT, etc.)
- Large-scale data mining (application domains: biomedical images, environmental sensors, computer network traffic, web pages including content and link graphs, social network analysis, etc.)

Thank you