Hadoop Bioinformatics Big Data
Paolo D'Onorio De Meo (p.donoriodemeo@cineca.it)
Mattia D'Antonio (m.dantonio@cineca.it)
Big Data Too much information!
Big Data
- Explosive data growth: proliferation of data capture, interconnection (= more data), inexpensive storage
- Not just the size of the data: access all the data, increase retention, machine learning, data compression
Big Data Where are you going?
Big Data: the main problem
Data analysis is slower than data creation
Big Data
- Semi-structured data is looser: though there may be a schema, it is often ignored
- Why can't we use databases for large-scale batch analysis? Seek time is improving more slowly than transfer rate
Old approach: HPC and Grid Computing
- Doing large-scale data processing for years: APIs such as the Message Passing Interface (MPI) distribute the work across a cluster of machines that access a shared filesystem (hosted by a SAN)
- Works well for compute-intensive jobs, but becomes a problem when nodes need to access larger data volumes: the network bandwidth is the bottleneck
Bandwidth
- When bandwidth is the bottleneck, compute nodes become idle, and HPC and Grid clusters can be overloaded
- Bandwidth-oriented solutions focus on obtaining better performance for very fast workloads
The Google approach
10 years ago: the famous MapReduce paper
MapReduce
- A batch query processor: the ability to run ad hoc queries against your whole dataset and get the results in a reasonable time
- Unstructured or semi-structured data: designed to interpret the data at processing time
MapReduce: bandwidth
- MapReduce tries to co-locate the data with the compute node, so data access is fast because it is local
- This is known as data locality, and it is the reason for MapReduce's good performance; a plain-Python sketch of the model follows below
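To make the model concrete, here is a minimal sketch of the map / shuffle-and-sort / reduce phases in plain Python, with no Hadoop involved (the names map_phase and reduce_phase are ours, not part of any Hadoop API):

from itertools import groupby
from operator import itemgetter

def map_phase(records, mapper):
    # apply the user's mapper to every record, collecting (key, value) pairs
    return [pair for record in records for pair in mapper(record)]

def reduce_phase(pairs, reducer):
    # group pairs by key: this is the "shuffle and sort" Hadoop does for you
    pairs.sort(key=itemgetter(0))
    return [reducer(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

# the classic word-count example
records = ["big data", "big cluster"]
mapped = map_phase(records, lambda line: [(word, 1) for word in line.split()])
print(reduce_phase(mapped, lambda key, values: (key, sum(values))))
# [('big', 2), ('cluster', 1), ('data', 1)]

Hadoop runs the same three phases, but distributes them across the cluster and moves each map task to the node that stores its block of input.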
Apache Hadoop
- Apache works on Lucene (2005), a full-featured text search engine library, with an indexing system written from scratch
- The developers decide to go with MapReduce, and the work splits into a new project: Hadoop (2006)
- April 2008: Hadoop broke a world record, becoming the fastest system to sort a terabyte of data
Apache Hadoop
An open source platform for data storage and processing:
- Scalable
- Fault tolerant
- Distributed
HDFS (Hadoop Distributed File System)
- Built to avoid transferring data over the network
- Hadoop works on input splits, where split time << job execution time
- With small splits, faster nodes consume more splits and jobs than slower ones; but if splits are too small, the overhead breaks performance
- Fine tuning: the best split size = the HDFS block size (64 MB by default); see the sketch after this list
- Hadoop needs to know the cluster topology: don't distribute data on different racks if not needed (data locality optimization)
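As a back-of-the-envelope illustration of the split-size tuning above (the file size is a made-up number; 64 MB is the HDFS default block size):

import math

file_size_mb = 10 * 1024   # hypothetical 10 GB input file
block_size_mb = 64         # HDFS default block size

# with split size == block size, each map task reads one local block;
# a larger split would force part of each task's input over the network
n_splits = math.ceil(file_size_mb / block_size_mb)
print("%d map tasks, each reading one local block" % n_splits)  # 160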
MapReduce: Hadoop jobs
Within a single job:
- Map tasks read input splits and write their outputs locally
- Reduce tasks write their output to HDFS, with redundancy
The Hadoop framework
- Hadoop is written in Java, and its main framework is Java-based
- You can write code in many languages (e.g. Python)
- An API is available to check cluster status and configuration
Hadoop: hands on
To work on any example, even the simplest, you clearly need a Hadoop cluster. There are two ways of simulating a Hadoop cluster on your local machine:
1. A pseudo-distributed single-node Hadoop cluster on Linux/Ubuntu
2. A pre-configured virtual machine
Hadoop: hands on
A Python example. Why Python?
- It is not native to Hadoop, which will help you better understand how the Hadoop system works
- It is easy to write code for mappers and reducers
Hadoop: hands on
Example file (columns):
- Stadium (String): the name of the stadium
- Capacity (Int): the capacity of the stadium
- ExpandedCapacity (Int): the expanded capacity of the stadium
- Location (String): the location of the stadium
- PlayingSurface (String): the type of grass, etc. that the stadium has
- IsArtificial (Boolean): whether the playing surface is artificial
- Team (String): the name of the team that plays at the stadium
- Opened (Int): the year the stadium opened
- WeatherStation (String): the name of the weather station closest to the stadium
- RoofType (possible values: None, Retractable, Dome): the type of roof in the stadium
- Elevation: the elevation of the stadium
Our question: find the number of stadiums with artificial and natural playing surfaces. A hypothetical input line is shown below.
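For illustration, a made-up input line following this schema (hypothetical values, not taken from the real dataset):

Example Stadium,68000,70000,Example City,FieldTurf,TRUE,Example Team,2002,KXYZ,None,89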
Python Mapper (mapper.py)

#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.strip()
    # the input is one CSV record per line
    (stadium, capacity, expanded, location, surface, turf,
     team, opened, weather, roof, elevation) = line.split(",")
    # emit the IsArtificial flag as the key and "1" as the value
    results = [turf, "1"]
    print("\t".join(results))
In the middle? ...streaming...

TRUE	1
TRUE	1
TRUE	1
TRUE	1
FALSE	1
FALSE	1
FALSE	1

The reducer interface for streaming is actually different than in Java: instead of receiving reduce(k, Iterator[V]), your script is sent one line per value, including the key.
Python Reducer (reducer.py)

#!/usr/bin/env python
import sys

last_turf = None
turf_count = 0

for line in sys.stdin:
    line = line.strip()
    turf, count = line.split("\t")
    count = int(count)
    if not last_turf:
        # first iteration
        last_turf = turf
    if turf == last_turf:
        # same key as the previous line: accumulate
        turf_count += count
    else:
        # key changed: emit the total for the previous key
        result = [last_turf, turf_count]
        print("\t".join(str(v) for v in result))
        last_turf = turf
        turf_count = count

# catch the final count after all records have been received
print("\t".join(str(v) for v in [last_turf, turf_count]))
Testing on Hadoop

$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.4.0.jar \
    -mapper mapper.py \
    -reducer reducer.py \
    -input nfldata/stadiums \
    -output nfldata/pythonoutput \
    -file simple/mapper.py \
    -file simple/reducer.py
# ...twiddle thumbs for a while
$ hadoop fs -text nfldata/pythonoutput/part-*
FALSE	15
TRUE	17
Test on a laptop

# Testing the same code as a bash pipe
$ cat ~/workspace/nfldata/unixstadiums.csv | simple/mapper.py | sort | simple/reducer.py
# FALSE 15
# TRUE 17
JobTracker
The Hadoop web dashboard: status and statistics of the jobs executed on our Hadoop cluster
A new cluster prototype, working @CINECA
- Cloud: virtual nodes, virtual networks, virtual FS
- OpenStack + Hadoop
NGS and bioinformatics
- Next Generation Sequencing (NGS): new platforms, high throughput
- Many analysis applications, new algorithms & codes, and new challenges
- Small costs! Producing Big Data
Why NGS fits Hadoop
Embarrassingly parallel:
- Little or no effort to separate the problem into a number of parallel tasks
- Often no dependency (or communication) between parallel tasks
vs. a distributed system, whose components are located on networked computers and communicate and coordinate their actions by passing messages
NGS Hadoop today
- Hadoop-BAM (mapping utilities): http://bioinformatics.oxfordjournals.org/content/28/6/876
- Solve bio: http://techcrunch.com/2012/03/29/cloud-will-cure-cancer/
- Crossbow (mapping): http://bowtie-bio.sourceforge.net/crossbow/index.shtml
- CloudBurst: http://sourceforge.net/apps/mediawiki/cloudburst-bio/index.php?title=cloudburst
- Eoulsan (RNA-seq): http://transcriptome.ens.fr/eoulsan/
- Myrna (RNA-seq gene expression): http://bowtie-bio.sourceforge.net/myrna/manual.shtml
- SeqPig (Hadoop Pig for processing sequences): http://sourceforge.net/projects/seqpig/
- NextBio + Intel!!
Coverage problem
- New sequencing platforms produce big data files with many (short) sequences
- Targeted sequencing outputs many sequences inside the same small genomic regions
- Alignment: mapping sequences on a reference genome
- Coverage: how deeply each genomic position is covered in the experiment, base per base (one nucleotide at a time)
- If coverage is too low (given a threshold) in one region, we cannot use that region in our results; a MapReduce sketch follows below
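As a sketch of how the coverage question maps onto the same MapReduce pattern used above, here is a hypothetical streaming mapper. It is deliberately simplified: it assumes ungapped alignments, so a read covers positions POS .. POS + len(SEQ) - 1; a real solution must parse the CIGAR string to find the positions a read actually covers.

#!/usr/bin/env python
# Simplified coverage mapper for Hadoop streaming (assumes ungapped
# alignments; real code must expand the CIGAR string).
import sys

for line in sys.stdin:
    if line.startswith("@"):            # skip SAM header lines
        continue
    fields = line.rstrip("\n").split("\t")
    rname, pos, seq = fields[2], int(fields[3]), fields[9]
    if rname == "*":                    # skip unmapped reads
        continue
    for offset in range(len(seq)):      # one key/value pair per covered base
        print("%s:%d\t1" % (rname, pos + offset))

The reducer is then the same sum-by-key pattern as reducer.py above.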
SAM format
Example header lines:

@HD	VN:1.0	SO:coordinate
@SQ	SN:1	LN:249250621	AS:NCBI37	UR:file:/data/local/ref/GATK/human_g1k_v37.fasta	M5:1b22b98cdeb4a9304cb5d48026a85128
@SQ	SN:2	LN:243199373	AS:NCBI37	UR:file:/data/local/ref/GATK/human_g1k_v37.fasta	M5:a0d9851da00400dec1098a9255ac712e
@SQ	SN:3	LN:198022430	AS:NCBI37	UR:file:/data/local/ref/GATK/human_g1k_v37.fasta	M5:fdfd811849cc2fadebc929bb925902e5
@RG	ID:UM0098:1	PL:ILLUMINA	PU:HWUSI-EAS1707-615LHAAXX-L001	LB:80	DT:2010-05-05T20:00:00-0400	SM:SD37743
@RG	ID:UM0098:2	PL:ILLUMINA	PU:HWUSI-EAS1707-615LHAAXX-L002	LB:80	DT:2010-05-05T20:00:00-0400	SM:SD37743

Example alignments (tab-separated: QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL, then optional tags):

1:497:R:-272+13M17D24M	113	1	497	37	37M	15	100338662	0	CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG	0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>>	XT:A:U	NM:i:0	SM:i:37	AM:i:0	X0:i:1	X1:i:0	XM:i:0	XO:i:0	XG:i:0	MD:Z:37
19:20389:F:275+18M2D19M	99	1	17644	0	37M	=	17919	314	TATGACTGCTAATAATACCTACACATGTTAGAACCAT	>>>>>>>>>>>>>>>>>>>><<>>><<>>4::>>:<9	RG:Z:UM0098:1	XT:A:R	NM:i:0	SM:i:0	AM:i:0	X0:i:4	X1:i:0	XM:i:0	XO:i:0	XG:i:0	MD:Z:37
19:20389:F:275+18M2D19M	147	1	17919	0	18M2D19M	=	17644	-314	GTAGTACCAACTGTAAGTCCTTATCTTCATACTTTGT	;44999;499<8<8<<<8<<><<<<><7<;<<<>><<	XT:A:R	NM:i:2	SM:i:0	AM:i:0	X0:i:4	X1:i:0	XM:i:0	XO:i:1	XG:i:2	MD:Z:18^CA19
9:21597+10M2I25M:R:-209	83	1	21678	0	8M2I27M	=	21469	-244	CACCACATCACATATACCAAGCCTGGCTGTGTCTTCT	<;9<<5><<<<><<<>><<><>><9>><>>>9>>><>	XT:A:R	NM:i:2	SM:i:0	AM:i:0	X0:i:5	X1:i:0	XM:i:0	XO:i:1	XG:i:2	MD:Z:35
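Since the project below works on SAM files, here is a small helper showing how the 11 mandatory columns of an alignment line can be pulled apart in Python (the field names follow the SAM specification; the dict layout is our own choice):

# Parse the 11 mandatory fields of a SAM alignment line into a dict;
# everything after column 11 is an optional TAG:TYPE:VALUE triple.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def parse_sam_line(line):
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for field in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[field] = int(record[field])   # numeric columns
    record["TAGS"] = cols[11:]               # optional fields, raw strings
    return record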
NGS project
Write a Python Hadoop job which calculates, from a SAM file, the coverage of each available genomic position.
Extra: also count the single bases (A, C, G, T, N) at each position.
NGS project: skills
- What you need: a little Python knowledge, Linux experience, curiosity
- What you learn: Python, Hadoop installation and comprehension, how to work on a real-case scenario, bioinformatics problems
Thesis: working with us
Write a Python Hadoop daemon to distribute the steps of the bioinformatics pipeline of a real bioinformatics service, while tuning the available cloud resources, based on the OpenStack and Hadoop APIs.
This is not the end... ...but the beginning!