Applications of MapReduce to the Life Science

Transcription

1 Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland Wednesday, June 10, 2009 This work is licensed under a Creative Commons Attribution- Noncommercial-Share Alike 3.0 United States. See for details

2 Cloud Maryland Teaching Cloud computing course (version 1.0): Spring 2008 Part of the Google/IBM Academic Cloud Computing Initiative Cloud computing course (version 2.0): Fall 2008 Sponsored by Amazon Web Services through a teaching grant Research Web-scale text processing Statistical machine translation Bioinformatics

3 Learning Translation Models Prodi ha erigido hoy un verdadero muro contra esas acciones, espero que el Sr. Moscovici i lo haya comprendido bien, y realmente también espero que esta tendencia se rompa en los Consejos de Biarritz y de Niza, y se rectifique. Mr Prodi has put an emphatic stop to this kind of action, which has hopefully resonated with Mr Moscovici, and I truly hope that this trend can be broken and reversed at the Councils in Nice and Biarritz. Esas negociaciones sabemos que son muy difíciles y hacen temer un fracaso o un acuerdo de mínimos en Niza, lo que sería aún más grave y usted ya lo ha dicho, señor Ministro. These are, as we know, very tricky negotiations and raise fears of a setback or a watered-down agreement in Nice which, as you have already acknowledged, Mr Moscovici, would be even more serious. Maria bofetada una bruja verde bruja verde no dio Mary slap a witch green green witch did not We built systems for learning translation models in Hadoop sort of like the word count example, but with more math

4 Translation as a Tiling Problem Maria no dio una bofetada a la bruja verde Mary not give a slap to the witch green did not a slap by green witch no did not give slap to the to the slap the witch Example from Koehn (2006)

5 From Text to DNA Sequences Text processing: [0-9A-Za-z]+ DNA sequence processing: [ATCG]+ (Nope, not really) Michael Schatz (Ph.D. student, Computer Science; Spring 2008) Ben Langmead (M.S. student, Computer Science; Fall 2008)

6 Analogy (And two disclaimers)

7 Strangely-Formatted Manuscript Dickens: A Tale of Two Cities Text written on a long spool It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness,

8 With Duplicates Dickens: A Tale of Two Cities Backup on four more copies It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness,

9 Shredded Book Reconstruction Dickens accidently shreds the manuscript It was the It was best the ofbest times, of times, it was it was the worst the worst of of times, it it was the the age age of of wisdom, it it wasthe age the age of of foolishness, It was the It was besthe best of times, of times, it was it was the the worst of times, it was the the age age of wisdom, of wisdom, it was it the was age the of foolishness, age of foolishness, It was the It was best the best of times, of times, it was it was the the worst worst of times, of times, it it was the age of wisdom, it it was was the the age age of of foolishness, It was It the was best the of best times, of times, it was it was the the worst worst of times, of times, it was the age of wisdom, it was it was the the age age of of foolishness, It was It the was best the best of times, of times, it was it was the the worst worst of of times, it was the age of of wisdom, it it was was the the age of age of foolishness, How can he reconstruct the text? 5 copies x 138,656 words / 5 words per fragment = 138k fragments The short fragments from every copy are mixed together Some fragments are identical i

10 Overlaps It was the best of age of wisdom, it was best of times, it was it was the age of it was the age of it was the worst of of times, it was the of times, it was the of wisdom, it was the It was the best of was the best of times, It was the best of of times, it was the It was the best of of wisdom, it was the 4 word overlap 1 word overlap 1 word overlap the age of wisdom, it the best of times, it the worst of times, it times, it was the age times, it was the worst was the age of wisdom, was the age of foolishness, was the best of times, Generally prefer longer overlaps to shorter overlaps In the presence of error, we might allow the overlapping fragments to differ by a small amount th t f ti

11 Greedy Assembly It was the best of age of wisdom, it was best of times, it was it was the age of it was the age of it was the worst of of times, it was the of times, it was the of wisdom, it was the It was the best of was the best of times, the best of times, it best of times, it was of times, it was the of times, it was the times, it was the worst times, it was the age the age of wisdom, it the best of times, it the worst of times, it times, it was the age times, it was the worst was the age of wisdom, was the age of foolishness, was the best of times, The repeated sequence makes the correct reconstruction ambiguous th t f ti

12 Th R l P bl The Real Problem (The easier version)

13 GATGCTTACTATGCGGGCCCC CGGTCTAATGCTTACTATGC GCTTACTATGCGGGCCCCTT AATGCTTACTATGCGGGCCCCTT TAATGCTTACTATGC? AATGCTTAGCTATGCGGGC AATGCTTACTATGCGGGCCCCTT AATGCTTACTATGCGGGCCCCTT CGGTCTAGATGCTTACTATGC AATGCTTACTATGCGGGCCCCTT CGGTCTAATGCTTAGCTATGC ATGCTTACTATGCGGGCCCCTT Subject genome Reads Sequencer

14 DNA Sequencing ATCTGATAAGTCCCAGGACTTCAGT Genome of an organism encodes genetic information in long sequence of 4 DNA nucleotides: ATCG Bacteria: ~5 million bp Humans: ~3 billion bp Current DNA sequencing machines can generate 1-2 Gbp of sequence per day, in millions of short reads (25-300bp) Shorter reads, but much higher throughput Per-base error rate estimated at 1-2% (Simpson, et al, 2009) Recent studies of entire human genomes have used 3.3 (Wang, et al., 2008) & 4.0 (Bentley, et al., 2008) billion 36bp reads ~144 GB of compressed sequence data GCAAGGCAAACCCGAGCCCAGTTT TCCAGTTCTAGAGTTTCACATGATC GGAGTTAGTAAAAGTCCACATTGAG

15 How do we put humpty dumpty back together?

16 Human Genome A complete human DNA sequence was published in 2003, marking the end of the Human Genome Project 11 years, cost $3 billion your tax dollars at work!

17 Subject reads CTATGCGGGC CTAGATGCTT AT CTATGCGG TCTAGATGCT GCTTATCTAT AT CTATGCGG AT CTATGCGG AT CTATGCGG TTA T CTATGC GCTTATCTAT CTATGCGGGC Alignment CGGTCTAGATGCTTAGCTATGCGGGCCCCTT Reference sequence

18 Subject reads ATGCGGGCCC CTAGATGCTT CTATGCGGGC TCTAGATGCT ATCTATGCGG CGGTCTAG ATCTATGCGG CTT CGGTCT TTATCTATGC CCTT CGGTC GCTTATCTAT GCCCCTT CGG GCTTATCTAT GGCCCCTT CGGTCTAGATGCTTATCTATGCGGGCCCCTT Reference sequence

19 Reference: ATGAACCACGAACACTTTTTTGGCAACGATTTAT Query: ATGAACAAAGAACACTTTTTTGGCCACGATTTAT Insertion Deletion Mutation

20 CloudBurst 1. Map: Catalog K mers Emit every k mer in the genome and non overlapping k mers in the reads Non overlapping k mers sufficient to guarantee an alignment will be found 2. Shuffle: Coalesce Seeds Hadoop internal shuffle groups together k mers shared by the reads and the reference Conceptually build a hash table of k mers and their occurrences 3. Reduce: End to end alignment Locally extend alignment beyond seeds by computing match distance If read aligns end to end, record the alignment Map Human chromosome 1 shuffle Reduce Read 1 Read 1, Chromosome 1, Read 2 Read 2, Chromosome 1,

21 Runtime (s) Running Time vs Number of Reads on Chr Millions of Reads Running Time vs Number of Reads on Chr 22 Runtime (s s) Millions of Reads Results from a small, 24-core cluster, with different number of mismatches Michael Schatz. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics, 2009, in press.

22 Running time (s s) Running Time on EC2 High-CPU Medium Instance Cluster Number of Cores CloudBurst running times for mapping 7M reads to human chromosome 22 with at most 4 mismatches on EC2 Michael Schatz. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics, 2009, in press.

23 Wh t N t? What s Next? (Michael Schatz s Ph.D. dissertation)

24 Wait, no reference?

25 de Bruijn Graph Construction D k = (V,E) V = All length-k subfragments (k > l) E = Directed edges between consecutive subfragments Nodes overlap by k-1 words Oi Original i lfragment It was the best of Directed Edge It was the best was the best of Locally constructed graph reveals the global sequence structure Overlaps implicitly computed de Bruijn, 1946 Idury and Waterman, 1995 Pevzner, Tang, Waterman, 2001

26 de Bruijn Graph Assembly It was the best was the best of the best of times, best of times, it of times, it was times, it was the it was the worst was the worst of the worst of times, worst of times, it it was the age was the age of the age of foolishness the age of wisdom, age of wisdom, it of wisdom, it was wisdom, it was the

27 Compressed de Bruijn Graph It was the best of times, it it was the worst of times, it of times, it was the the age of foolishness it was the age of the age of wisdom, it was the Unambiguous non-branching paths replaced by single nodes An Eulerian traversal of the graph spells a compatible reconstruction of the original text There may be many traversals of the graph Different sequences can have the same string graph It was the best of times, it was the worst of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness,

28 H d ifi Hadoopification (Stay tuned!)

29 Cloud worthy?

30 How much data?

31 Bottom Line: Bioinformatics Great use case of Hadoop Interesting est computer science ce problems Help unravel life s mysteries?

32 Questions? Comments? Thanks to the organizations who support our work: