NGS data format

Each read in a FASTQ file occupies four lines:

    @SRR031028.1708655
    GGATGATGGATGGATAGATAGATGAAGAGATGGATGGATGGGTGGGTGGTATGCAGCATACCTGAAGTGC
    +
    BBBCB=ABBB@BA=?BABBBBA??B@BAAA>ABB;@5=@@@?8@:==99:465727:;41'.9>;933!4
    @SRR031028.843803
    AGCCTAGGAGTTTGAAGCTGCAGTGAGCTAAGATCGTGCCACTGCACTCCAGCCTGGGTGACAGAGCAAG
    +
    B==<)0.67;.><==>A2622;<7555@A?%%%%%%%%%!%!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    @SRR031028.1600012
    TGCAGGGAATCAGGGACCCACACCCGGAGCTGATTATTCACAGCCATTGCTGACCTCTCTCTGTGAGAAC
    +
    BCCCCB?<AAAAC@7?<BACBC;@@A=2(?@?;@=2:;:%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

    line 1: Sample_ID.read_number
    line 2: Sequence
    line 3: additional_sample_info
    line 4: Quality Scores

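Quality Scores

    @SRR031028.1708655
    GGATGATGGATGGATAGATAGATGAAGAGATGGATGGATGGGTGGGTGGTATGCAGCATACCTGAAGTGC
    +
    BBBCB=ABBB@BA=?BABBBBA??B@BAAA>ABB;@5=@@@?8@:==99:465727:;41'.9>;933!4

Each character in the quality line encodes one quality score (Q). Look up the character in the ASCII table and subtract the offset. For the character ? :

    Q = ASCII value - offset
    Q = '?' - offset
    Q = 63 - 33
    Q = 30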

Code review from last week: write a script to get %GC of a DNA sequence

- get the DNA sequence as a command line argument
- print the percent GC

hints:
- increment a value each time a nucleotide matches
- all of these lines of code do the same calculation:

    $gc_count = $gc_count + 1;
    $gc_count += 1;
    $gc_count++;

script to get %GC of a DNA sequence

    ./calculate_percent_gc.pl GACTCTCAG

    nucleotide  G A C T C T C A G
    position    0 1 2 3 4 5 6 7 8

    #!/usr/bin/perl -w
    # this script will calculate %gc of a sequence
    $dna_sequence = shift || die "DNA sequence argument needed\n";
    $dna_length = length($dna_sequence);
    $gc_count = 0;
    for (0 .. ($dna_length - 1)) {
        $position = $_;
        $nucleotide = substr($dna_sequence, $position, 1);
        # the test must sit inside the loop so every position is checked
        if ($nucleotide eq 'G' || $nucleotide eq 'C') {
            $gc_count++;
        }
    }
    $gc_percent = ($gc_count / $dna_length) * 100;
    print "GC content of $dna_sequence is $gc_percent%\n";
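As an alternative to walking the sequence with substr, the count can be taken in one step with the tr/// operator. This is only a sketch: the helper name percent_gc and the decision to count lowercase g and c as well are assumptions, not part of the class script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): return %GC of a DNA string.
sub percent_gc {
    my ($dna_sequence) = @_;
    my $dna_length = length($dna_sequence);
    return 0 if $dna_length == 0;    # avoid dividing by zero on empty input

    # tr/// with an empty replacement list counts the matching characters
    # and leaves the string unchanged. Counting lowercase too is an assumption.
    my $gc_count = ($dna_sequence =~ tr/GCgc//);

    return ($gc_count / $dna_length) * 100;
}

# GACTCTCAG has 5 G/C out of 9 nucleotides
printf "GC content of %s is %.1f%%\n", 'GACTCTCAG', percent_gc('GACTCTCAG');
```

The loop version above is still the better one for learning substr and for loops; tr/// is simply shorter once the idea is clear.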

Variables in Perl

Variables are used to hold data. Different data types use different variable types. There are three data structures in Perl: scalar, array and hash.

You should use a variable name that is informative of its contents. Use lowercase characters and separate words with the underscore character.

An array variable starts with @ and holds multiple units of data containing text, numbers, ...

    @weekdays = ("Mon", "Tue", "Wed", "Thur", "Fri");
    @book_titles = ("The cat in the hat", "Grapes of Wrath");
    @grades = (100, 95, 80, 99);
    @good_grades = @grades;
    @chromosomes = (1 .. 22, "X", "Y");

Arrays are structured like a linked group of scalar data

    @weekdays = ("Mon", "Tue", "Wed", "Thur", "Fri");

    value             Mon   Tue   Wed   Thur  Fri
    index               0     1     2     3     4
    alternate index    -5    -4    -3    -2    -1

You could think of an array as being similar to a stack of boxes where the box on the floor has an index value of 0:

    index   value
    4       Fri
    3       Thur
    2       Wed
    1       Tue
    0       Mon

Access individual array data values using the index value in square brackets [ ]. Since the individual value is now a scalar, use $ instead of @ to access individual values.

    $weekdays[0] contains the value: Mon
    $weekdays[1] contains the value: Tue
    $weekdays[4] contains the value: Fri
    $weekdays[-1] contains the value: Fri

Use the shift function to capture the value of index 0

shift always removes the value from index 0.

    @weekdays = ("Mon", "Tue", "Wed", "Thur", "Fri");
    $first_day = shift(@weekdays);

    before shift               after shift
    index          value       index   value
    $weekdays[4]   Fri         4
    $weekdays[3]   Thur        3       Fri
    $weekdays[2]   Wed         2       Thur
    $weekdays[1]   Tue         1       Wed
    $weekdays[0]   Mon         0       Tue

Now $first_day contains the value: Mon and $weekdays[0] contains: Tue

Mon is shifted into the value of $first_day and the data stack shifts down. The array now holds four values, so index 4 is empty (reading $weekdays[4] returns undef).

Use the push function to add data to your array (at the top of your data stack)

    push(@weekdays, $first_day);

The value of $first_day gets pushed to the top of your array:

    before push        after push
    index   value      index   value
    4                  4       Mon
    3       Fri        3       Fri
    2       Thur       2       Thur
    1       Wed        1       Wed
    0       Tue        0       Tue

$first_day still contains the value: Mon
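The indexing, shift and push steps from the last three slides can be run together as one short script. This is a minimal sketch; the variable names $last_day and $size_after_shift are added here for illustration and do not appear in the slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @weekdays = ('Mon', 'Tue', 'Wed', 'Thur', 'Fri');

# a negative index counts from the end of the array
my $last_day = $weekdays[-1];               # Fri

# shift removes index 0 and returns it; the array gets one element shorter
my $first_day = shift(@weekdays);           # Mon
my $size_after_shift = scalar(@weekdays);   # 4 elements remain

# push appends to the end (the "top of the stack" in the slide's picture)
push(@weekdays, $first_day);

print "first_day = $first_day\n";
print "last_day  = $last_day\n";
print "weekdays  = @weekdays\n";            # Tue Wed Thur Fri Mon
```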

Using array variables in Perl

    cd ~/genomics_lab/ws3
    gedit using_arrays.pl &

    #!/usr/bin/perl -w
    @weekdays = ("Mon", "Tue", "Wed", "Thur", "Fri");
    @all_days = @weekdays;
    push(@all_days, "Sat");
    push(@all_days, "Sun");
    print "The days of the week are:\n";
    foreach (@all_days) {       # loop through each element in your array
        $day = $_;              # capture value of $_ in the variable $day
        print "$day\n";
    }

Skipping and Exiting for loops

    gedit using_arrays.pl &

    #!/usr/bin/perl -w
    @weekdays = ("Mon", "Tue", "Wed", "Thur", "Fri");
    @all_days = @weekdays;
    push(@all_days, "Sat");
    push(@all_days, "Sun");
    print "The days of the week are:\n";
    foreach (@all_days) {
        $day = $_;
        next if ($day eq "Wed");    # skip printing if day is Wed
        last if ($day eq "Fri");    # exit loop when day is Fri
        print "$day\n";
    }

output:

    The days of the week are:
    Mon
    Tue
    Thur

File Handles in Perl: read files

    gedit read_file.pl &

The following will read in a file that currently exists, one line at a time. You can use any text for your file handle name (INFILE, IN, FILE); just keep it in ALL_CAPS and use the same spelling everywhere in the script.

    #!/usr/bin/perl -w
    open(INFILE, "ws3_data.tsv");   # your assigned file handle is INFILE
    while (<INFILE>) {              # while will read in your file one line at a time
        chomp;                      # chomp removes newline character \n from $_
        $line = $_;
        print "$line\n";
    }
    close(INFILE);                  # always close your file handle when finished

File Handles in Perl: write files

Read in a file and write to a new file.
- Use > to write to a new file or overwrite an existing file
- Use >> to append to an existing file, or to create the file and write to it if it doesn't exist

    #!/usr/bin/perl -w
    open(INFILE, "ws3_data.tsv");               # open a file for reading
    open(OUTFILE, ">", "ws3_data_copy.tsv");    # open a file for writing
    while (<INFILE>) {
        chomp;
        $line = $_;
        print OUTFILE "$line\n";
    }
    close(INFILE);
    close(OUTFILE);
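The scripts on the file handle slides do not check whether open succeeded. A common variation, sketched here under assumptions, uses the three-argument form of open with lexical file handles (my $in, my $out) and stops with a message if a file cannot be opened. The helper name copy_lines and its return value are hypothetical and not part of the class material.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): copy a text file line by line
# and return the number of lines written.
sub copy_lines {
    my ($in_path, $out_path) = @_;

    # $! holds the operating system's reason when open fails
    open(my $in,  '<', $in_path)  or die "Cannot read $in_path: $!";
    open(my $out, '>', $out_path) or die "Cannot write $out_path: $!";

    my $line_count = 0;
    while (my $line = <$in>) {
        chomp $line;
        print {$out} "$line\n";
        $line_count++;
    }

    close($in);
    close($out) or die "Cannot close $out_path: $!";
    return $line_count;
}

# Example call with the workshop file names from the slide:
# copy_lines('ws3_data.tsv', 'ws3_data_copy.tsv');
```

Changing '>' to '>>' in the second open would append to the output file instead of overwriting it.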

split and join functions

Data is generally delimited by some character like tab, space, comma, colon, ...
Tab is a good delimiter to use since it makes the data easier to visualize in your text editor.

You see your tab delimited data (.tsv) in gedit like this:

    05  21  30  33  36  42
    01  11  12  23  27  40
    11  20  22  23  29  38

Perl sees your tab delimited data like this:

    05\t21\t30\t33\t36\t42\n
    01\t11\t12\t23\t27\t40\n
    11\t20\t22\t23\t29\t38\n

    #!/usr/bin/perl -w
    open(INFILE, "ws3_data.tsv");
    while (<INFILE>) {
        chomp;
        $line = $_;
        @line = split("\t", $line);         # split a scalar into an array of values
        $line_copy = join("\t", @line);     # join array of values into a scalar
        print "line[0] = $line[0]\n";       # print values for validating variables and debugging
        print "line_copy = $line_copy\n";
    }
    close(INFILE);

After the split of the first line:

    index      value
    $line[3]   33
    $line[2]   30
    $line[1]   21
    $line[0]   05
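To try split and join without a data file, the first row of the example data can be placed directly in a string. This is a small sketch; the variable names @fields and $as_csv are introduced here for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# first row of the example data, as Perl sees it
my $line = "05\t21\t30\t33\t36\t42\n";
chomp $line;

my @fields    = split(/\t/, $line);   # scalar -> array, one value per column
my $line_copy = join("\t", @fields);  # array -> scalar, tab delimited again
my $as_csv    = join(',', @fields);   # the same values with a different delimiter

print "fields[0] = $fields[0]\n";
print "number of fields = ", scalar(@fields), "\n";
print "as csv = $as_csv\n";
```

Joining with a different delimiter than the one used for the split is a quick way to convert between formats such as .tsv and .csv, as long as the values themselves contain no commas.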

Array attributes

    @weekdays = ("Mon", "Tue", "Wed");   # initialize your array values
    $count = @weekdays;                  # now $count = 3
    @copy = @weekdays;                   # now @copy has same values as @weekdays
    print scalar(@copy);                 # prints: 3
    print "@copy";                       # prints: Mon Tue Wed
    $count = $count - @copy;             # now $count = 0
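Whether an array gives its element count or its values depends on the context it is used in. The sketch below puts the common cases side by side; the variable names are chosen for illustration and are not from the slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @weekdays = ('Mon', 'Tue', 'Wed');

my $count      = @weekdays;            # scalar context: number of elements
my ($first)    = @weekdays;            # list context: first element
my $last_index = $#weekdays;           # index of the last element
my $in_quotes  = "@weekdays";          # interpolated: values joined by spaces
my $no_spaces  = join('', @weekdays);  # what a plain "print @weekdays;" would show

print "count      = $count\n";
print "first      = $first\n";
print "last index = $last_index\n";
print "in quotes  = $in_quotes\n";
print "no spaces  = $no_spaces\n";
```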

Scope of variables in Perl

use strict;
# requires that you declare a variable using my the first time it is used in your code
# helps when your scripts get long (many lines of code) by making sure you don't mistype a variable name in your code
# helps if other people will be using/modifying your code
# helps in debugging your code

    #!/usr/bin/perl -w
    use strict;
    my $name = "Watson";                # global declaration of your variable
    print "My name is $name\n";
    for (1 .. 2) {
        my $name = "Crick";             # local declaration of your variable
        print "My name is $name\n";     # limited to region within the { }
    }
    print "My name is $name\n";

Reading in NGS data into Perl

    #!/usr/bin/perl -w
    use strict;
    my $infile = "SRR031028_subset.fastq.gz";
    # the trailing | makes Perl read the output of the gunzip command
    open(INFILE, "gunzip -c $infile |");
    while (<INFILE>) {
        chomp;
        my $id = $_;
        my $sequence = (<INFILE>);
        my $id_2 = (<INFILE>);
        my $quality_string = (<INFILE>);
        chomp($sequence, $id_2, $quality_string);
        my @quals = split('', $quality_string);
        #add code to process each read here
    }
    close(INFILE);

Each pass through the loop reads one four-line record:

    @SRR031028.1708655
    GGATGATGGATGGATAGATAGATGAAGAGATGGATGGATGGGTGGGTGGTATGCAGCATACCTGAAGTGC
    +
    BBBCB=ABBB@BA=?BABBBBA??B@BAAA>ABB;@5=@@@?8@:==99:465727:;41'.9>;933!4

    @SRR031028.1708655
    GGATGATGGATGGATAGATAGATGAAGAGATGGATGGATGGGTGGGTGGTATGCAGCATACCTGAAGTGC
    +
    BBBCB=ABBB@BA=?BABBBBA??B@BAAA>ABB;@5=@@@?8@:==99:465727:;41'.9>;933!4

    # this splits a string between each character so each character is in a different index
    my @quals = split('', $quality_string);

    $quals[0] now contains B
    $quals[1] now contains B
    $quals[2] now contains B
    $quals[3] now contains C
    $quals[4] now contains B
    $quals[5] now contains =
    ...
    $quals[69] now contains 4

    # This is how to use Perl to get the actual Phred quality score from the ASCII character
    my $quality_score = ord( $quals[0] ) - 33;

    # subtract 33 for Sanger-style (Phred+33) fastq data, such as this SRA file
    # subtract 64 for Illumina v1.3 to v1.7 fastq data (Phred+64)
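The single-character conversion can be extended to a whole quality string with map. This is a sketch under assumptions: the helper name phred_scores and its optional offset argument (defaulting to 33) are hypothetical and not part of the class scripts.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): convert a quality string
# into a list of Phred scores. The offset defaults to 33 (an assumption
# that matches the SRA example data); pass 64 for Phred+64 files.
sub phred_scores {
    my ($quality_string, $offset) = @_;
    $offset = 33 unless defined $offset;
    return map { ord($_) - $offset } split('', $quality_string);
}

# first 15 quality characters of read SRR031028.1708655
my @scores = phred_scores('BBBCB=ABBB@BA=?');
print "@scores\n";
```

Here B (ASCII 66) becomes 33 and ? (ASCII 63) becomes 30, which agrees with the worked example on the Quality Scores slide.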

To be completed before next class

Programming Assignment

Write a Perl script that will trim your sequence based on one of the following:
- the quality scores, starting from the right, reach a score of >= 20
- the quality scores, starting from the left, fall to a score of <= 20

Only print the sequences if the trimmed length is >= 36.

Print the results to a file in FASTA format:

    >@SRR031028.221
    GATTAGCCTATATCGC
    >@SRR031028.224
    CTAGATGTCGTAGCATCGAT

- Read in the file SRR031028_subset.fastq.gz located in the following directory: ~/genomics_lab/ws3
- Use the code to open and read in a gzip file
- Use a default trimmed length of 36 but accept an argument if provided