Package hoarder. June 30, 2015




Type: Package
Title: Information Retrieval for Genetic Datasets
Version: 0.1
Date: 2015-06-29
Author: [aut, cre], Anu Sironen [aut]
Maintainer: <daniel.fischer@luke.fi>
Depends: R (>= 3.2.0)
Imports: httr (>= 0.2), XML (>= 3.98-1.1), stringr (>= 0.6.2), MASS (>= 7.3-31), R.utils (>= 1.32.4), stats (>= 3.2.0), utils (>= 3.2.0), graphics (>= 3.2.0)
Description: Information retrieval from National Center for Biotechnology Information (NCBI) databases, with main focus on identifying genes in unannotated organisms via Blast similarity search in annotated organisms.
License: GPL (>= 2)
LazyData: true
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2015-06-30 17:49:03

R topics documented:

    hoarder-package
    blastseq
    getassemblies
    getensginfo
    getgenelocation
    getgeneseq
    importblasttab
    importfa
    importgff3

    importgtf
    importpedmap
    importxml
    print.ensginfo
    print.fa
    print.pedmap
    print.xmlimport
    species
    speciesfigure
    subdose
    subgprobs
    subphased
    summary.ensginfo
    summary.fa
    summary.pedmap


hoarder-package    Collect and Retrieve Annotation Data for Various Genetic Data.

Description

The hoarder package is designed for collecting, retrieving and transforming data from various sources. The current main focus is on setting up a connection to the NCBI Blast service. In addition, the gene information for Ensembl genes can be retrieved from NCBI. Methods for visualizing the results are currently under development.

The latest nightly build of the package can be retrieved from https://github.com/fischuu/hoarder

Details

    Package:  hoarder
    Type:     Package
    Version:  0.1
    Date:     2015-06-29
    License:  GPL
    LazyLoad: yes

Author(s)

, Anu Sironen

Maintainer: <daniel.fischer@luke.fi>

blastseq    Sending Genomic Sequences to NCBI Blast service

Description

This function sends genomic sequences to the NCBI Blast service.

Usage

    blastseq(seq, n_blast=20, delay_req=3, delay_rid=60, email=NULL,
             xmlfolder=NULL, logfolder=NULL, keepinmemory=TRUE,
             database="chromosome", verbose=TRUE, createlog=FALSE)

Arguments

    seq           The fasta sequence that should be blasted.
    n_blast       Number of parallel Blast requests, in case seq is a vector.
    delay_req     Seconds between the single Blast requests.
    delay_rid     Seconds between the single result requests.
    email         User email, required information from NCBI (String).
    xmlfolder     Path to the result folder.
    logfolder     Path to the log folder.
    keepinmemory  Logical, shall the results be kept in memory.
    database      The NCBI database to use.
    verbose       Logical, shall the program give extensive feedback.
    createlog     Logical, create log files; needed for continuing a crashed program.

Details

This function sends fasta sequences to the NCBI Blast service. The default delays are required by NCBI, and the delays must not be set below the default values. NCBI also asks the user to provide an email address.

The input seq can be a vector of strings, in which case the sequences are processed one after another. The option n_blast then sets the upper limit on how many Blast requests are sent to the NCBI Blast service at a time and kept running there in parallel. It is the user's responsibility not to misuse the service with too many parallel requests.

The xmlfolder parameter specifies the folder in which the XML results are stored. If the folder does not exist, R will create it. This option is advisable for larger projects, as they can easily exhaust the available memory.

If the option keepinmemory is set to TRUE, the Blast results are kept in memory; otherwise they are only written to the HDD, in the given xmlfolder. Especially if many sequences are sent to NCBI, it is recommended not to keep the results in memory.
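The constraints on the delays and on n_blast described above can be illustrated with a small base R sketch. The helper plan_blast_batches is hypothetical and not part of the package. It only checks that the delays are not below the documented defaults and groups the input into sets of at most n_blast sequences. This is a simplification, since the package may instead keep a rolling window of running requests.

```r
# Hypothetical helper (not part of the package): validate the delays and
# group the sequences so that at most n_blast are submitted together.
plan_blast_batches <- function(seq, n_blast = 20, delay_req = 3, delay_rid = 60) {
  # The manual states that the delays must not be below the defaults.
  stopifnot(delay_req >= 3, delay_rid >= 60, n_blast >= 1)
  # Assign consecutive sequences to groups of at most n_blast elements.
  batch_id <- ceiling(seq_along(seq) / n_blast)
  unname(split(seq, batch_id))
}

# Five made-up sequences with at most two requests at a time.
batches <- plan_blast_batches(c("ACGT", "GGCA", "TTAG", "CCGA", "ATAT"), n_blast = 2)
lengths(batches)
```

No request is sent in this sketch; it only shows how a vector of sequences could be partitioned before submission.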

If log files should be written (createlog=TRUE), a log path should be given in logfolder. However, if an xmlfolder is given and the option createlog=TRUE is set, the log folder is created automatically in the parent folder of the xmlfolder and is called hoarderlogs. Setting the option createlog=TRUE is required to continue crashed Blast runs.

Value

An xml file that contains the NCBI result.

Examples

    ## Not run:
    blastseq("acgtgcatcgactagctacgactacgactatc")
    ## End(Not run)


getassemblies    Extracting Assemblies.

Description

This function extracts the assemblies from an xml object.

Usage

    getassemblies(xml)

Arguments

    xml  An xml object.

Details

This function extracts the information from an imported xml object.

Value

A matrix.

Examples

    ## Not run:
    getassemblies(xml)
    ## End(Not run)


getensginfo    Retrieve Gene Information From The NCBI Database.

Description

This function retrieves the information corresponding to a given Ensembl number from the NCBI database.

Usage

    getensginfo(ensg)

Arguments

    ensg  Ensembl ID.

Details

This function retrieves the information corresponding to a given Ensembl number from the NCBI database. The object ensg can also be a vector of Ensembl IDs.

Value

A matrix with information.

Examples

    ## Not run:
    ensg <- c("ENSG00000174482", "ENSG00000113494")
    getensginfo(ensg)
    ## End(Not run)

getgenelocation    Extracting Gene Locations.

Description

This function extracts the gene locations from an imported gtf or gff3 file.

Usage

    getgenelocation(gtf)

Arguments

    gtf  An imported gtf object.

Details

This function extracts the information from an imported gtf object.

Value

A matrix.

Examples

    ## Not run:
    getgenelocation(gtf)
    ## End(Not run)


getgeneseq    Extracting a gene sequence from NCBI database.

Description

This function retrieves a gene sequence from the NCBI database.

Usage

    getgeneseq(chr, start, end, organism)

Arguments

    chr       Chromosome number, numeric/string
    start     Start position, numeric
    end       End position, numeric
    organism  Name of the organism, string

Details

Extracting a gene sequence from NCBI database.

Value

A string that contains the genomic sequence.

Examples

    ## Not run:
    # Extracting for Sus Scrofa, build version 3:
    getgeneseq(1,2134,14532,"susscr3")
    ## End(Not run)


importblasttab    Import a Tab Delimited Blast Output File

Description

This function imports a tab delimited blast output.

Usage

    importblasttab(file)

Arguments

    file  File name of the file.

Details

This function imports a tab delimited blast output file. It currently behaves the same as read.table.

Value

A data frame containing the columns of the file.
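Since the Details state that the import currently matches read.table, a plain base R call gives an impression of the result. The sketch below is an illustration only: the two hits are made up, and the twelve column names follow the common tabular BLAST layout, which is an assumption and not something the manual specifies.

```r
# Two made-up hits in a 12-column tab-delimited layout (assumed format).
lines <- c(
  "query1\tsubjA\t98.50\t200\t3\t0\t1\t200\t1001\t1200\t1e-100\t360",
  "query1\tsubjB\t85.00\t180\t27\t1\t5\t184\t501\t680\t2e-40\t160"
)
tab_file <- tempfile(fileext = ".tab")
writeLines(lines, tab_file)

# Plain read.table call on the tab-delimited file.
hits <- read.table(tab_file, sep = "\t", header = FALSE, stringsAsFactors = FALSE)

# Assumed column names; the file itself carries no header.
names(hits) <- c("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore")

# Keep only hits with at least 90 percent identity.
hits[hits$pident >= 90, c("qseqid", "sseqid", "pident")]
```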

importfa    Importing a Fasta File.

Description

This function imports a standard fasta file.

Usage

    importfa(file)

Arguments

    file  Specifies the filename/path.

Details

This function imports a standard fasta file. It assumes that label and sequence lines alternate: the odd lines contain the sequence names, each starting with >, and the even lines contain the corresponding sequences.

Value

An object of class fa containing the sequences. The names correspond to the sequence names given in the fasta file.

See Also

print.fa, summary.fa

Examples

    ## Not run:
    importfa(file="myfasta.fa")
    ## End(Not run)
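The alternating-line assumption described in the Details can be sketched in base R. The function read_alternating_fasta is a hypothetical stand-in, not the package implementation, and it returns a plain named character vector instead of an object of class fa.

```r
# Write a small example file with alternating label and sequence lines.
fa_file <- tempfile(fileext = ".fa")
writeLines(c(">seq1", "ACGTAC", ">seq2", "GGCATTAG"), fa_file)

# Hypothetical reader: odd lines are labels, even lines are sequences.
read_alternating_fasta <- function(file) {
  lines <- readLines(file)
  labels <- lines[seq(1, length(lines), by = 2)]
  seqs <- lines[seq(2, length(lines), by = 2)]
  # Fail early if the file does not follow the alternating layout.
  stopifnot(length(labels) == length(seqs), all(startsWith(labels, ">")))
  names(seqs) <- sub("^>", "", labels)
  seqs
}

fa <- read_alternating_fasta(fa_file)
nchar(fa)         # sequence lengths per name
range(nchar(fa))  # minimum and maximum length
```

A fasta file whose sequences wrap over several lines would violate this layout and would need to be unwrapped first.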

importgff3    Import a GFF3 File

Description

This function imports a gff3 file.

Usage

    importgff3(gff)

Arguments

    gff  File name of the gff3 file

Details

This function imports a gff file and splits the last column, which is usually tricky to handle because the order of the variables is not always the same.

Value

A data frame containing the columns of the gff3 file, including the split last column.


importgtf    Import a GTF File

Description

This function imports a gtf file.

Usage

    importgtf(gtf, skip = 0, nrow = -1)

Arguments

    gtf   File name of the gtf file.
    skip  Rows to skip from the top.
    nrow  Total number of rows read.
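The difficulty with the last column is that its key/value pairs can appear in any order, so they have to be matched by key rather than by position. The following base R sketch shows one way to do this for GTF-style attributes of the form key "value";. The helper split_attributes and the example attribute strings are hypothetical, and the package may handle this differently.

```r
# Hypothetical helper: turn one attribute string into a named vector.
split_attributes <- function(attr) {
  fields <- trimws(strsplit(attr, ";", fixed = TRUE)[[1]])
  fields <- fields[nzchar(fields)]
  keys <- sub(" .*$", "", fields)          # text before the first space
  values <- sub("^[^ ]+ +", "", fields)    # text after the key
  values <- gsub('"', "", values, fixed = TRUE)
  setNames(values, keys)
}

# Same keys, different order (made-up gene names).
attr1 <- 'gene_id "ENSG00000174482"; gene_name "geneA";'
attr2 <- 'gene_name "geneB"; gene_id "ENSG00000113494";'

rows <- lapply(c(attr1, attr2), split_attributes)
keys <- unique(unlist(lapply(rows, names)))

# Index each row by key so the column order is consistent.
attr_df <- as.data.frame(do.call(rbind, lapply(rows, function(r) r[keys])),
                         stringsAsFactors = FALSE)
names(attr_df) <- keys
attr_df
```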

Details

This function imports a gtf file and splits column 9, which is usually tricky to handle because the order of the variables is not always the same.

Value

A data frame containing the columns of the gtf file, including the split last column.


importpedmap    Import a ped/map File Pair

Description

This function imports a ped/map file pair.

Usage

    importpedmap(ped, map=NULL, pedsep="\t", pedheader=FALSE, genosep=" ",
                 mapsep="\t", mapheader=FALSE, na.value="0")

Arguments

    ped        File name of the ped file.
    map        File name of the map file, optional, see details.
    pedsep     Column separator in the ped file.
    pedheader  Logical, ped file contains header.
    genosep    Separator for Genotype, see details.
    mapsep     Column separator in the map file.
    mapheader  Logical, map file contains header.
    na.value   Character, encoding of missing values.

Details

This function imports a ped/map file pair. It is sufficient to provide the file name of the ped file if the map file has the same name with the .map ending (e.g. myfile.ped and myfile.map). The file suffix .ped can also be omitted.

The genosep option provides the separator between the alleles within one genotype, e.g. A A (genosep=" ") or A/A (genosep="/").
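The two conventions described above, deriving the map file name from the ped file name and splitting a genotype at genosep, can be sketched in base R as follows. Both helpers are hypothetical illustrations and are not taken from the package source.

```r
# Hypothetical helper: derive the map file name from the ped file name.
derive_map_name <- function(ped) {
  base <- sub("\\.ped$", "", ped)  # the .ped suffix may be omitted
  paste0(base, ".map")
}
derive_map_name("myfile.ped")
derive_map_name("myfile")

# Hypothetical helper: split genotypes into their two alleles.
split_genotypes <- function(geno, genosep = " ") {
  alleles <- strsplit(geno, genosep, fixed = TRUE)
  do.call(rbind, alleles)  # one row per genotype, one column per allele
}
split_genotypes(c("A A", "A G"))
split_genotypes(c("A/A", "G/G"), genosep = "/")
```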

Value

A list of type pedmap with the three list items:

    map   Matrix with the Genotype Map information.
    fam   Matrix with the family information.
    geno  Matrix with the genotype information.


importxml    Import an XML File

Description

This function imports an xml file produced by blastseq.

Usage

    importxml(fa, folder, idth = 0.8, verbose=TRUE)

Arguments

    fa       Sequence names.
    folder   Folder, where the xml files are stored.
    idth     Identity threshold, see details.
    verbose  Logical, shall the function give status messages.

Details

This function imports xml files produced by the blastseq function. The idth option sets the minimum identity a hit must reach to be included in the result data frame.

Value

A data frame containing the results.
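The effect of an identity threshold such as idth can be illustrated on a small data frame. The column names identities and align_len, the definition of identity as their ratio, and the inclusive comparison are all assumptions made for this sketch; the manual does not state how the package computes the value.

```r
# Made-up hits; column names are assumptions for illustration.
hits <- data.frame(
  query = c("seq1", "seq1", "seq2"),
  identities = c(95, 60, 180),   # number of identical positions
  align_len = c(100, 100, 200),  # alignment length
  stringsAsFactors = FALSE
)

# Hypothetical filter: keep hits whose identity ratio reaches idth.
filter_by_identity <- function(hits, idth = 0.8) {
  ratio <- hits$identities / hits$align_len
  hits[ratio >= idth, , drop = FALSE]
}

kept <- filter_by_identity(hits)
kept
```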

print.ensginfo    Print an ensginfo Object

Description

Prints an ensginfo object.

Usage

    ## S3 method for class 'ensginfo'
    print(x, full=FALSE, ...)

Arguments

    x     Object of class ensginfo.
    full  Logical, shall the full information be printed.
    ...   Additional arguments.

Details

The print function displays an ensginfo object. By default only the Ensembl ID and the corresponding gene name are printed. Setting the option full=TRUE provides further information.


print.fa    Print an fa Object

Description

Prints an fa object.

Usage

    ## S3 method for class 'fa'
    print(x, n=2, seq.out=50, ...)

Arguments

    x        Object of class fa.
    n        Number of elements to be displayed, numeric.
    seq.out  Length of each element to be displayed, numeric.
    ...      Additional parameters.

Details

The print function displays an fa object. By default only the first two elements with their first 50 bases are displayed. To display the full sequence, set seq.out=NULL.


print.pedmap    Print a pedmap Object

Description

Prints a pedmap object.

Usage

    ## S3 method for class 'pedmap'
    print(x, nrow=5, ncol=10, ...)

Arguments

    x     Object of class pedmap.
    nrow  Number of rows to be displayed, numeric.
    ncol  Number of columns to be displayed, numeric.
    ...   Additional parameters.

Details

The print function displays a pedmap object. By default only the first 5 rows of each list item and the first 10 columns of the geno matrix are displayed.

print.xmlimport    Print an xmlimport Object

Description

Prints an xmlimport object.

Usage

    ## S3 method for class 'xmlimport'
    print(x, n=2, ...)

Arguments

    x    Object of class xmlimport.
    n    Number of elements to be displayed, numeric.
    ...  Additional parameters.

Details

The print function displays an xmlimport object. By default only the first two elements are displayed.


species    Species Name for Blast Search

Description

Standardized species names for blast search.

Usage

    species

Format

This vector contains 134 species names.

Note

The names were extracted from the NCBI server on 1.6.2014.

speciesfigure    Showing Quantities of Different Species.

Description

This function visualizes the quantities of blast matches per species.

Usage

    speciesfigure(xml, species=NULL, type="chr", n=2:11, plot=TRUE)

Arguments

    xml      An xml object.
    species  A vector with species names.
    type     The type of plot.
    n        Ranks to be plotted.
    plot     Logical, shall the plot be drawn.

Details

This function plots a frequency barplot of the blast results, divided by species. The species of interest can be provided with the species object.

Value

A figure.

Examples

    ## Not run:
    speciesfigure(xml, species=NULL, type="chr", n=2:11, plot=TRUE)
    ## End(Not run)

subdose    Rewrite the Dose File from a Beagle Output

Description

This function takes a Dose Beagle output and rewrites the output.

Usage

    subdose(file=NULL, vmmk=NULL, out=NULL, removeinsertions=TRUE,
            verbose=TRUE)

Arguments

    file              Location of the original Beagle file (String).
    vmmk              Location of the Variant Map Master key (String).
    out               Name and location of the output file (String).
    verbose           Logical, the function gives feedback.
    removeinsertions  Logical, all Indels will be removed.

Details

This function takes a Beagle Dose file and rewrites the alleles from numerical to character, based on the information provided in a variant map master key.

Value

A rewritten beagle phased file.


subgprobs    Rewrite the Gprobs File from a Beagle Output

Description

This function takes a Gprobs Beagle output and rewrites the output.

Usage

    subgprobs(file=NULL, vmmk=NULL, out=NULL, chunksize=100000,
              removeinsertions=TRUE, verbose=TRUE, writeout=TRUE)

Arguments

    file              Location of the original Beagle file (String).
    vmmk              Location of the Variant Map Master key (String).
    out               Name and location of the output file (String).
    chunksize         For large Beagle files, the chunk size.
    removeinsertions  Logical, all Indels will be removed.
    verbose           Logical, the function gives feedback.
    writeout          Logical, write the output back to the HDD.

Details

This function takes a Beagle Gprobs file and rewrites the alleles from numerical to character, based on the information provided in a variant map master key. For larger files the function can process the rewriting in chunks in order to save memory.

Value

A rewritten beagle Gprobs file.


subphased    Rewrite the Phased File from a Beagle Output

Description

This function takes a phased Beagle output and rewrites the output.

Usage

    subphased(file=NULL, vmmk=NULL, out=NULL, chunksize=100000,
              verbose=TRUE, removeinsertions=TRUE)

Arguments

    file              Location of the original Beagle file (String).
    vmmk              Location of the Variant Map Master key (String).
    out               Name and location of the output file (String).
    chunksize         For large Beagle files, the chunk size.
    verbose           Logical, the function gives feedback.
    removeinsertions  Logical, all Indels will be removed.

Details

This function takes a Beagle phased file and rewrites the alleles from numerical to character, based on the information provided in a variant map master key. For larger files the function can process the rewriting in chunks in order to save memory.

Value

A rewritten beagle phased file.


summary.ensginfo    Summarize an ensginfo Object

Description

Summarizes and prints an ensginfo object in an informative way.

Usage

    ## S3 method for class 'ensginfo'
    summary(object, ...)

Arguments

    object  Object of class ensginfo.
    ...     Additional parameters.

Details

Summary for an ensginfo object, providing the number of different gene types in the query.

summary.fa    Summarize an fa Object

Description

Summarizes and prints an fa object in an informative way.

Usage

    ## S3 method for class 'fa'
    summary(object, ...)

Arguments

    object  Object of class fa.
    ...     Additional parameters.

Details

Summary for an fa object, providing the number of sequences, the minimum and maximum length as well as the average length.


summary.pedmap    Summarize a pedmap Object

Description

Summarizes a pedmap object in an informative way.

Usage

    ## S3 method for class 'pedmap'
    summary(object, ...)

Arguments

    object  Object of class pedmap.
    ...     Additional parameters.

Details

Summary for a pedmap object, providing the dimensions of the map, the fam and the geno matrix, the total number of genotypes A/A, A/B and B/B, the amount of missing data, and the number of monomorphic locations.
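The kind of counts listed above can be sketched in base R on a small genotype matrix. This is an illustration under several assumptions: genotypes are stored as strings such as "A A", rows are individuals, columns are markers, and "0" encodes missing values. The helper summarise_geno is hypothetical, and it reports homozygous and heterozygous calls instead of the A/A, A/B and B/B split, because that split requires a per-marker definition of the A and B alleles.

```r
# Made-up genotype matrix: 3 individuals (rows) by 3 markers (columns).
geno <- matrix(c("A A", "A G", "G G",
                 "A A", "A A", "0 0",
                 "A A", "A A", "A A"),
               nrow = 3, byrow = TRUE)

# Hypothetical summary helper, not the package implementation.
summarise_geno <- function(geno, genosep = " ", na.value = "0") {
  alleles <- strsplit(as.vector(geno), genosep, fixed = TRUE)
  a1 <- vapply(alleles, `[`, character(1), 1)
  a2 <- vapply(alleles, `[`, character(1), 2)
  missing <- a1 == na.value | a2 == na.value
  # Number of distinct observed alleles per marker, ignoring missing calls.
  marker <- as.vector(col(geno))
  n_alleles <- tapply(seq_along(a1), marker, function(i) {
    ok <- !missing[i]
    length(unique(c(a1[i][ok], a2[i][ok])))
  })
  c(homozygous = sum(!missing & a1 == a2),
    heterozygous = sum(!missing & a1 != a2),
    missing = sum(missing),
    monomorphic = sum(n_alleles == 1))
}

res <- summarise_geno(geno)
res
```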

Index

Topic datasets
    species

Topic methods
    blastseq, getassemblies, getensginfo, getgenelocation, getgeneseq, importfa, print.ensginfo, print.fa, print.pedmap, print.xmlimport, speciesfigure, subdose, subgprobs, subphased, summary.ensginfo, summary.fa, summary.pedmap

Topic multivariate
    hoarder-package

Topic print
    print.ensginfo, print.fa, print.pedmap, print.xmlimport, summary.ensginfo, summary.fa, summary.pedmap