NGS data format
    @SRR031028.1708655
    GGATGATGGATGGATAGATAGATGAAGAGATGGATGGATGGGTGGGTGGTATGCAGCATACCTGAAGTGC
    +
    BBBCB=ABBB@BA=?BABBBBA??B@BAAA>ABB;@5=@@@?8@:==99:465727:;41'.9>;933!4
    @SRR031028.843803
    AGCCTAGGAGTTTGAAGCTGCAGTGAGCTAAGATCGTGCCACTGCACTCCAGCCTGGGTGACAGAGCAAG
    +
    B==<)0.67;.><==>A2622;<7555@A?%%%%%%%%%!%!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    @SRR031028.1600012
    TGCAGGGAATCAGGGACCCACACCCGGAGCTGATTATTCACAGCCATTGCTGACCTCTCTCTGTGAGAAC
    +
    BCCCCB?<AAAAC@7?<BACBC;@@A=2(?@?;@=2:;:%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Each read is stored as four lines:

    line 1: Sample_ID.read_number
    line 2: Sequence
    line 3: additional_sample_info
    line 4: Quality Scores
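Quality Scores

    @SRR031028.1708655
    GGATGATGGATGGATAGATAGATGAAGAGATGGATGGATGGGTGGGTGGTATGCAGCATACCTGAAGTGC
    +
    BBBCB=ABBB@BA=?BABBBBA??B@BAAA>ABB;@5=@@@?8@:==99:465727:;41'.9>;933!4

Each character in the quality line encodes a quality score (Q). Look up the character in the ASCII table and subtract the offset:

    Q = ASCII value - offset
    Q = ASCII value of ? - offset
    Q = 63 - 33
    Q = 30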
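The arithmetic above can be checked in Perl with the built-in ord function, which returns the ASCII value of a character. This is only a minimal sketch: the sub name phred_score and the default offset of 33 are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert one quality character to its score.
# phred_score is a hypothetical helper name; the offset defaults to 33,
# the value used in the worked example on the slide.
sub phred_score {
    my ($char, $offset) = @_;
    $offset = 33 unless defined $offset;
    return ord($char) - $offset;
}

print phred_score('?'), "\n";    # 63 - 33 = 30
print phred_score('B'), "\n";    # 66 - 33 = 33
print phred_score('!'), "\n";    # 33 - 33 = 0
```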
Code review from last week: write a script to get %GC of a DNA sequence

 - get the DNA sequence as a command line argument
 - print the percent GC

Hints:

 - increment a value each time a nucleotide matches
 - all of these lines of code do the same calculation:

    $gc_count = $gc_count + 1;
    $gc_count += 1;
    $gc_count++;
script to get %GC of a DNA sequence

    ./calculate_percent_gc.pl GACTCTCAG

    nucleotide   G  A  C  T  C  T  C  A  G
    position     0  1  2  3  4  5  6  7  8

    #!/usr/bin/perl -w
    # this script will calculate %gc of a sequence
    $dna_sequence = shift || die "DNA sequence argument needed\n";
    $dna_length = length($dna_sequence);
    $gc_count = 0;                       # start at 0 so a sequence with no G or C still works
    for (0 .. ($dna_length - 1)) {
        $position = $_;
        $nucleotide = substr($dna_sequence, $position, 1);
        # the test must sit inside the loop so every position is checked
        if ($nucleotide eq 'G' || $nucleotide eq 'C') {
            $gc_count++;
        }
    }
    $gc_percent = ($gc_count / $dna_length) * 100;
    print "GC content of $dna_sequence is $gc_percent%\n";
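For comparison, here is a shorter variant of the same calculation. It is not the approach used in the script above: it swaps the position-by-position loop for Perl's tr operator, which returns the number of characters it matched. The sub name percent_gc, the handling of lowercase letters and the check for an empty sequence are assumptions added for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the percent GC of a DNA sequence, or undef for an empty string.
# percent_gc is a hypothetical helper name.
sub percent_gc {
    my ($dna_sequence) = @_;
    my $dna_length = length($dna_sequence);
    return undef if $dna_length == 0;               # avoid dividing by zero
    my $gc_count = ($dna_sequence =~ tr/GCgc//);    # tr returns the number of matches
    return ($gc_count / $dna_length) * 100;
}

# GACTCTCAG has 5 G/C out of 9 nucleotides
printf "GC content of %s is %.1f%%\n", 'GACTCTCAG', percent_gc('GACTCTCAG');
```

With an empty search replacement and no modifiers, tr only counts and leaves the sequence unchanged.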
Variables in Perl

Variables are used to hold data. Different data types use different variable types. There are three data structures in Perl: scalar, array and hash.

You should use a variable name that is informative of its contents. Use lowercase characters and separate words with the underscore character.

Array variables start with @ and hold multiple units of data containing text, numbers, ...

    @weekdays = ("Mon", "Tue", "Wed", "Thur", "Fri");
    @book_titles = ("The cat in the hat", "Grapes of Wrath");
    @grades = (100, 95, 80, 99);
    @good_grades = @grades;
    @chromosomes = (1 .. 22, "X", "Y");
Arrays are structured like a linked group of scalar data

    @weekdays = ("Mon", "Tue", "Wed", "Thur", "Fri");

    @weekdays =        Mon   Tue   Wed   Thur   Fri
    index                0     1     2      3     4
    alternate index     -5    -4    -3     -2    -1

You could think of an array as being similar to a stack of boxes where the box on the floor has an index value of 0:

    index   value
      4     Fri
      3     Thur
      2     Wed
      1     Tue
      0     Mon

Access individual array data values using the index value in square brackets [ ]. Since the individual value is now a scalar, use $ instead of @ to access individual values.

    $weekdays[0] contains the value: Mon
    $weekdays[1] contains the value: Tue
    $weekdays[4] contains the value: Fri
    $weekdays[-1] contains the value: Fri
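A short script, offered as a sketch, that prints values from @weekdays using both the regular and the alternate (negative) index:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @weekdays = ("Mon", "Tue", "Wed", "Thur", "Fri");

print "$weekdays[0]\n";     # Mon
print "$weekdays[4]\n";     # Fri
print "$weekdays[-1]\n";    # Fri (same box as index 4)
print "$weekdays[-5]\n";    # Mon (same box as index 0)
```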
Use the shift function to capture the value of index 0

shift always removes the value from index 0.

    @weekdays = ("Mon", "Tue", "Wed", "Thur", "Fri");
    $first_day = shift(@weekdays);

Before shift:

    index   value
      4     Fri       $weekdays[4]
      3     Thur      $weekdays[3]
      2     Wed       $weekdays[2]
      1     Tue       $weekdays[1]
      0     Mon       $weekdays[0]

After shift:

    index   value
      4     (empty)
      3     Fri
      2     Thur
      1     Wed
      0     Tue

Now $first_day contains the value: Mon and $weekdays[0] contains: Tue

Mon is shifted into the value of $first_day and the data stack shifts down. The array is now one element shorter, so index 4 no longer holds a value ($weekdays[4] is undefined).
Use the push function to add data to your array (at the top of your data stack)

    push(@weekdays, $first_day);

The value of $first_day gets pushed to the top of your array.

Before push:

    index   value
      3     Fri
      2     Thur
      1     Wed
      0     Tue

After push:

    index   value
      4     Mon
      3     Fri
      2     Thur
      1     Wed
      0     Tue

$first_day still contains the value: Mon
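The two slides on shift and push can be combined into one small script. This is a sketch that prints the array after each step so the movement of Mon is visible:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @weekdays = ("Mon", "Tue", "Wed", "Thur", "Fri");

my $first_day = shift(@weekdays);     # removes Mon from index 0
print "first_day = $first_day\n";     # first_day = Mon
print "after shift: @weekdays\n";     # after shift: Tue Wed Thur Fri

push(@weekdays, $first_day);          # adds Mon at the highest index
print "after push: @weekdays\n";      # after push: Tue Wed Thur Fri Mon
print "first_day = $first_day\n";     # first_day = Mon (unchanged by push)
```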
Using array variables in Perl

    cd ~/genomics_lab/ws3
    gedit using_arrays.pl &

    #!/usr/bin/perl -w
    @weekdays = ("Mon", "Tue", "Wed", "Thur", "Fri");
    @all_days = @weekdays;
    push(@all_days, "Sat");
    push(@all_days, "Sun");
    print "The days of the week are:\n";
    foreach (@all_days) {        # loop through each element in your array
        $day = $_;               # capture value of $_ in the variable $day
        print "$day\n";
    }
Skipping and Exiting for loops

    gedit using_arrays.pl &

    #!/usr/bin/perl -w
    @weekdays = ("Mon", "Tue", "Wed", "Thur", "Fri");
    @all_days = @weekdays;
    push(@all_days, "Sat");
    push(@all_days, "Sun");
    print "The days of the week are:\n";
    foreach (@all_days) {
        $day = $_;
        next if ($day eq "Wed");     # skip printing if day is Wed
        last if ($day eq "Fri");     # exit loop when day is Fri
        print "$day\n";
    }

output:

    The days of the week are:
    Mon
    Tue
    Thur
File Handles in Perl: read files

    gedit read_file.pl &

    #!/usr/bin/perl -w
    # The following will read in a file that currently exists one line at a time
    # you can use any text for your file handle name (INFILE, IN, FILE) just keep it in ALL_CAPS
    open(INFILE, "ws3_data.tsv");    # your assigned file handle is INFILE
    while(<INFILE>) {                # while will read in your file one line at a time
        chomp;                       # chomp removes newline character \n from $_
        $line = $_;
        print "$line\n";
    }
    close(INFILE);                   # always close your file handle when finished
File Handles in Perl: write files

    #!/usr/bin/perl -w
    # Read in a file and write to a new file
    # Use > to write to a new file or overwrite existing file
    # Use >> to append to an existing file or create and write to the file if it doesn't exist
    open(INFILE, "ws3_data.tsv");                  # open a file for reading
    open(OUTFILE, ">", "ws3_data_copy.tsv");       # open a file for writing
    while(<INFILE>) {
        chomp;
        $line = $_;
        print OUTFILE "$line\n";
    }
    close(INFILE);
    close(OUTFILE);
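To see the difference between > and >> in practice, the sketch below writes one line with >, appends a second with >>, then reads the file back. The file name ws3_append_demo.txt is made up for this example and is deleted at the end; the || die checks are an addition not shown on the slide.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $demo_file = "ws3_append_demo.txt";    # hypothetical file name

# > creates the file, or overwrites it if it already exists
open(OUTFILE, ">", $demo_file) || die "Cannot write $demo_file: $!\n";
print OUTFILE "first line\n";
close(OUTFILE);

# >> keeps the existing contents and adds to the end
open(OUTFILE, ">>", $demo_file) || die "Cannot append to $demo_file: $!\n";
print OUTFILE "second line\n";
close(OUTFILE);

# read the file back to confirm that both lines are present
open(INFILE, "<", $demo_file) || die "Cannot read $demo_file: $!\n";
my @lines = <INFILE>;
close(INFILE);

print "The file has ", scalar(@lines), " lines\n";    # The file has 2 lines
unlink($demo_file);                                   # remove the demo file
```

Opening the file with > a second time instead of >> would leave only the second line.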
split and join functions

# Data is generally delimited by some character like tab, space, comma, colon, ...
# tab is a good delimiter to use since it makes the data easier to visualize in your text editor

You see your tab delimited data (.tsv) in gedit like this:

    05    21    30    33    36    42
    01    11    12    23    27    40
    11    20    22    23    29    38

Perl sees your tab delimited data like this:

    05\t21\t30\t33\t36\t42\n
    01\t11\t12\t23\t27\t40\n
    11\t20\t22\t23\t29\t38\n

    #!/usr/bin/perl -w
    open(INFILE, "ws3_data.tsv");
    while(<INFILE>) {
        chomp;
        $line = $_;
        @line = split("\t", $line);          # split a scalar into an array of values
        $line_copy = join("\t", @line);      # join array of values into a scalar
        print "line[0] = $line[0]\n";        # print values for validating variables and debugging
        print "line_copy = $line_copy\n";
    }
    close(INFILE);

For the first line of the file, the array @line holds:

    index   value
      3     33        $line[3]
      2     30        $line[2]
      1     21        $line[1]
      0     05        $line[0]
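The same two functions can be tried without a data file. The sketch below starts from the first row of the example data as a string, splits it on tabs, and joins it again with commas; the variable names @fields and $csv_line are chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "05\t21\t30\t33\t36\t42\n";    # first row of the example data
chomp($line);

my @fields = split("\t", $line);          # scalar -> array of six values
print "number of fields = ", scalar(@fields), "\n";    # number of fields = 6
print "fields[3] = $fields[3]\n";                      # fields[3] = 33

my $csv_line = join(",", @fields);        # array -> scalar with a new delimiter
print "$csv_line\n";                      # 05,21,30,33,36,42
```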
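Array attributes

    @weekdays = ("Mon", "Tue", "Wed");    # initialize your array values
    $count = @weekdays;                   # now $count = 3
    @copy = @weekdays;                    # now @copy has same values as @weekdays
    print scalar(@copy);                  # prints: 3
    print "@copy";                        # prints: Mon Tue Wed
    $count = $count - @copy;              # now $count = 0

An array used where a single value is expected gives its number of elements. A bare print @copy; prints the elements with nothing between them (MonTueWed), so use scalar(@copy) to print the count.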
Scope of variables in Perl

use strict;

# requires that you declare a variable using my the first time it is used in your code
# helps when your scripts get long (many lines of code) by making sure you don't mistype a variable name in your code
# helps if other people will be using/modifying your code
# helps in debugging your code

    #!/usr/bin/perl -w
    use strict;
    my $name = "Watson";                # global declaration of your variable
    print "My name is $name\n";
    for (1 .. 2) {
        my $name = "Crick";             # local declaration of your variable
        print "My name is $name\n";     # limited to region within the { }
    }
    print "My name is $name\n";

The script prints Watson once, Crick twice, and then Watson again: the $name declared inside the { } disappears when the loop ends, and the global $name was never changed.
Reading NGS data into Perl

    #!/usr/bin/perl -w
    use strict;
    my $infile = "SRR031028_subset.fastq.gz";
    open(INFILE, "gunzip -c $infile |");     # the | reads the output of gunzip as if it were a file
    while(<INFILE>) {
        chomp;
        my $id = $_;                         # line 1 of the read
        my $sequence = <INFILE>;             # line 2
        my $id_2 = <INFILE>;                 # line 3
        my $quality_string = <INFILE>;       # line 4
        chomp($sequence, $id_2, $quality_string);
        my @quals = split('', $quality_string);
        #add code to process each read here
    }
    close(INFILE);

    @SRR031028.1708655
    GGATGATGGATGGATAGATAGATGAAGAGATGGATGGATGGGTGGGTGGTATGCAGCATACCTGAAGTGC
    +
    BBBCB=ABBB@BA=?BABBBBA??B@BAAA>ABB;@5=@@@?8@:==99:465727:;41'.9>;933!4
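The four-lines-per-read loop can be tested without the gzip file. The sketch below is an assumption-laden stand-in: it pastes two reads from the slides into a string (with a + as the third line of each read) and opens that string as if it were a file, which Perl allows when open is given a reference to a scalar. The names $fastq_text and @ids are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two reads copied from the slides; the single-quoted END_FASTQ marker
# stops Perl from interpreting the @ and $ characters in the data.
my $fastq_text = <<'END_FASTQ';
@SRR031028.1708655
GGATGATGGATGGATAGATAGATGAAGAGATGGATGGATGGGTGGGTGGTATGCAGCATACCTGAAGTGC
+
BBBCB=ABBB@BA=?BABBBBA??B@BAAA>ABB;@5=@@@?8@:==99:465727:;41'.9>;933!4
@SRR031028.843803
AGCCTAGGAGTTTGAAGCTGCAGTGAGCTAAGATCGTGCCACTGCACTCCAGCCTGGGTGACAGAGCAAG
+
B==<)0.67;.><==>A2622;<7555@A?%%%%%%%%%!%!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
END_FASTQ

# open the string as an in-memory file (stand-in for the gunzip pipe)
open(my $fh, "<", \$fastq_text) || die "Cannot open in-memory data: $!\n";

my @ids;
while (my $id = <$fh>) {
    my $sequence       = <$fh>;
    my $id_2           = <$fh>;
    my $quality_string = <$fh>;
    chomp($id, $sequence, $id_2, $quality_string);
    push(@ids, $id);
    print "$id has ", length($sequence), " bases\n";
}
close($fh);

print "reads found: ", scalar(@ids), "\n";    # reads found: 2
```

Replacing the in-memory open with the gunzip pipe from the slide leaves the body of the loop unchanged.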
# this splits a string between each character so each character is in a different index

    my @quals = split('', $quality_string);

    $quals[0] now contains B
    $quals[1] now contains B
    $quals[2] now contains B
    $quals[3] now contains C
    $quals[4] now contains B
    $quals[5] now contains =
    ...
    $quals[69] now contains 4

# This is how to use Perl to get the actual Phred quality score from the ASCII character

    my $quality_score = ord( $quals[0] ) - 33;

    # subtract 33 for Sanger-style FASTQ data (this includes files from the Sequence Read Archive, such as this one, and Illumina 1.8 and later)
    # subtract 64 for Illumina v1.3 to v1.7 FASTQ data

    @SRR031028.1708655
    GGATGATGGATGGATAGATAGATGAAGAGATGGATGGATGGGTGGGTGGTATGCAGCATACCTGAAGTGC
    +
    BBBCB=ABBB@BA=?BABBBBA??B@BAAA>ABB;@5=@@@?8@:==99:465727:;41'.9>;933!4
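Putting split and ord together gives a score for every position in a read. The sketch below does this for a short string made of a few characters taken from the read above; the sub name quality_scores is hypothetical, and the offset is passed in so that either 33 or 64 can be used.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decode every character of a quality string into a score.
# quality_scores is a hypothetical helper name.
sub quality_scores {
    my ($quality_string, $offset) = @_;
    my @quals = split('', $quality_string);    # one character per index
    my @scores;
    foreach my $qual (@quals) {
        push(@scores, ord($qual) - $offset);
    }
    return @scores;
}

# the first six and last two characters of the read above
my @scores = quality_scores('BBBCB=!4', 33);
print "@scores\n";    # 33 33 33 34 33 28 0 19
```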
To be completed before next class

Programming Assignment

Write a Perl script that will trim your sequence based on one of the following:

 - scanning the quality scores from the right, trim until a score of >= 20 is reached
 - scanning the quality scores from the left, trim where the score falls to <= 20

Only print the sequences if the trimmed length is >= 36

Print the results to a file in FASTA format:

    >@SRR031028.221
    GATTAGCCTATATCGC
    >@SRR031028.224
    CTAGATGTCGTAGCATCGAT

 - Read in the file SRR031028_subset.fastq.gz located in the following directory: ~/genomics_lab/ws3
 - Use the code to open and read in a gzip file
 - Use default trimmed length of 36 but accept an argument if provided
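As a starting point only, here is a sketch of one possible reading of the first trimming rule: scan from the right end of the read, drop bases until the first one with a score of at least 20, and keep everything to its left. The sub name trim_right, the toy read, the record name example_read and the minimum length of 5 are all assumptions made for illustration; the assignment itself asks for a minimum length of 36 and file output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Trim low-quality bases from the right end of a read.
# Returns the trimmed sequence (an empty string if no base passes).
# trim_right is a hypothetical helper name.
sub trim_right {
    my ($sequence, $quality_string, $min_score, $offset) = @_;
    my $keep = length($quality_string);    # number of bases to keep
    while ($keep > 0) {
        my $score = ord(substr($quality_string, $keep - 1, 1)) - $offset;
        last if ($score >= $min_score);    # first good base from the right
        $keep--;
    }
    return substr($sequence, 0, $keep);
}

# toy read: the last three quality characters (% % !) score 4, 4 and 0
my $trimmed = trim_right('GATTAGCCTA', 'BBBBBB5%%!', 20, 33);

# print in FASTA format only if the trimmed read is long enough
print ">example_read\n$trimmed\n" if (length($trimmed) >= 5);
```

In the toy read the character 5 has a score of exactly 20, so trimming stops there and GATTAGC is printed.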