Using Databases in R



Similar documents
Package RSQLite. February 19, 2015

Computing with large data sets

Creating a New Annotation Package using SQLForge

org.rn.eg.db December 16, 2015 org.rn.egaccnum is an R object that contains mappings between Entrez Gene identifiers and GenBank accession numbers.

Package RPostgreSQL. February 19, 2015

Lecture 25: Database Notes

Introduction. Overview of Bioconductor packages for short read analysis

FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem

IRanges, GenomicRanges, and Biostrings

Session 6: ROracle. <Insert Picture Here> Mark Hornick, Director, Oracle Advanced Analytics Development Oracle Advanced Analytics

database abstraction layer database abstraction layers in PHP Lukas Smith BackendMedia

SQL Injection. SQL Injection. CSCI 4971 Secure Software Principles. Rensselaer Polytechnic Institute. Spring

How To Use The Assembly Database In A Microarray (Perl) With A Microarcode) (Perperl 2) (For Macrogenome) (Genome 2)

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

Package TSfame. February 15, 2013

Time Series Database Interface: R MySQL (TSMySQL)

Integrating VoltDB with Hadoop

Principles of Database Management Systems. Overview. Principles of Data Layout. Topic for today. "Executive Summary": here.

Package GEOquery. August 18, 2015

Classes for record linkage of big data sets

Zend Framework Database Access

CummeRbund: Visualization and Exploration of Cufflinks High-throughput Sequencing Data

DNA Sequence formats

HowTo: Querying online Data

WI005 - Offline data sync con SQLite in Universal Windows Platform

Yale Pseudogene Analysis as part of GENCODE Project

Using Temporary Tables to Improve Performance for SQL Data Services

Database Design Patterns. Winter Lecture 24

Chapter 9 Joining Data from Multiple Tables. Oracle 10g: SQL

Simplifying Data Interpretation with Nexus Copy Number

Real SQL Programming 1

Final Project Report

Database Management System Choices. Introduction To Database Systems CSE 373 Spring 2013

What is a database? COSC 304 Introduction to Database Systems. Database Introduction. Example Problem. Databases in the Real-World

Extracting META information from Interbase/Firebird SQL (INFORMATION_SCHEMA)

Gene Models & Bed format: What they represent.

Physical Design. Meeting the needs of the users is the gold standard against which we measure our success in creating a database.

MySQL for Beginners Ed 3

Relational Databases and SQLite

Tutorial. Reference Genome Tracks. Sample to Insight. November 27, 2015

Package cgdsr. August 27, 2015

Unified access to all your data points. with Apache MetaModel

The Saves Package. an approximate benchmark of performance issues while loading datasets. Gergely Daróczi

How to Improve Database Connectivity With the Data Tools Platform. John Graham (Sybase Data Tooling) Brian Payton (IBM Information Management)

Lecture 11 Data storage and LIMS solutions. Stéphane LE CROM

Package retrosheet. April 13, 2015

Object Oriented Design with UML and Java. PART XVIII: Database Technology

SQL Server for developers. murach's TRAINING & REFERENCE. Bryan Syverson. Mike Murach & Associates, Inc. Joel Murach

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

SAP Business Objects Business Intelligence platform Document Version: 4.1 Support Package Data Federation Administration Tool Guide

DIPLOMA IN WEBDEVELOPMENT

Linas Virbalas Continuent, Inc.

Practical Differential Gene Expression. Introduction

GenBank, Entrez, & FASTA

Information Systems SQL. Nikolaj Popov

SQL Server. 1. What is RDBMS?

Scenario 2: Cognos SQL and Native SQL.

ECS 165A: Introduction to Database Systems

Accessing Your Database with JMP 10 JMP Discovery Conference 2012 Brian Corcoran SAS Institute

Automating SQL Injection Exploits

Files. Files. Files. Files. Files. File Organisation. What s it all about? What s in a file?

Analyzing moderately large data sets

The Benefits of Data Modeling in Data Warehousing

How To Write A Store On A Microsoft Database On A Flash (Orchestra) On A Linux Computer (Ortho) On An Ipro (Orroster) On Your Computer Or Your Computer (For Ahem) On The

Tracking Database Schema Changes. Guidelines on database versioning using Devart tools

Intro to Databases. ACM Webmonkeys 2011

D61830GC30. MySQL for Developers. Summary. Introduction. Prerequisites. At Course completion After completing this course, students will be able to:

CSCE 156H/RAIK 184H Assignment 4 - Project Phase III Database Design

Tuning Tableau and Your Database for Great Performance PRESENT ED BY

Relational Databases. Charles Severance

VBA and Databases (see Chapter 14 )

CSI 2132 Lab 3. Outline 09/02/2012. More on SQL. Destroying and Altering Relations. Exercise: DROP TABLE ALTER TABLE SELECT

Instant SQL Programming

CS2Bh: Current Technologies. Introduction to XML and Relational Databases. Introduction to Databases. Why databases? Why not use XML?

Oracle Database 10g Express

A Brief Introduction to MySQL

Q&A for Zend Framework Database Access

Towards Integrating the Detection of Genetic Variants into an In-Memory Database

Geodatabase Programming with SQL

Unit 10: Microsoft Access Queries

BCA. Database Management System

There and Back Again As Quick As a Flash

AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE

Oracle Database 12c: Introduction to SQL Ed 1.1

Oracle Database: SQL and PL/SQL Fundamentals NEW

Introduction to Triggers using SQL

Physical Database Design Process. Physical Database Design Process. Major Inputs to Physical Database. Components of Physical Database Design

Quiz! Database Indexes. Index. Quiz! Disc and main memory. Quiz! How costly is this operation (naive solution)?

Knocker main application User manual

Scaling up = getting a better machine. Scaling out = use another server and add it to your cluster.

Database 2 Lecture I. Alessandro Artale

Databases and SQL. The Bioinformatics Lab SS Wiki topic 10. Tikira Temu. 04. June 2013

SQL and Programming Languages. SQL in Programming Languages. Applications. Approaches

Transcription:

Using Databases in R Marc Carlson Fred Hutchinson Cancer Research Center May 20, 2010

Introduction Example Databases: The GenomicFeatures Package Basic SQL Using SQL from within R

Outline Introduction Example Databases: The GenomicFeatures Package Basic SQL Using SQL from within R

Relational Databases Relational database basics Data stored in tables Tables related through keys Relational model called a schema Tables designed to avoid redundancy Beneficial uses by R packages Out-of-memory data storage Fast access to data subsets Databases accessible by other software

Uses of Relational Databases in Bioconductor Annotation packages Organism, genome (e.g. org.hs.eg.db) Microarray platforms (e.g. hgu95av2.db) Homology (e.g. hom.hs.inp.db) Software packages Transcript annotations (e.g. GenomicFeatures) NGS experiments (e.g. Genominator) Annotation infrastructure (e.g. AnnotationDbi)

Outline Introduction Example Databases: The GenomicFeatures Package Basic SQL Using SQL from within R

The GenomicFeatures package What is does Builds databases on the fly as needed. Supported sources: UCSC tracks, biomart, or custom. Has accessor methods to make it easy to use this data. Why it was built Needed a generic way to support access to multiple resources Users needed reproducible data for research Data needed to be formatted in a convenient way Why it uses a database Annotation data is highly relational Other annnotation data is also stored in SQLite databases Allows access from disc (which can free up memory)

What do I mean by relational?

TranscriptDb Basics Making a TranscriptDb object > library(genomicfeatures) > mm9kg <- + maketranscriptdbfromucsc(genome = "mm9", + tablename = "knowngene") Saving and Loading > savefeatures(mm9kg, file="mm9kg.sqlite") > mm9kg <- + loadfeatures(system.file("extdata", "mm9kg.sqlite", + package = "AdvancedR"))

Accessing the TranscriptDb > head(transcripts(mm9kg), 3) GRanges with 3 ranges and 2 elementmetadata values seqnames ranges strand tx_id <Rle> <IRanges> <Rle> <integer> [1] chr9 [3215314, 3215339] + 24312 [2] chr9 [3335231, 3385846] + 24315 [3] chr9 [3335473, 3343608] + 24313 tx_name <character> [1] uc009oas.1 [2] uc009oat.1 [3] uc009oau.1 seqlengths chr1 chr2... chrx_random chry_random 197195432 181748087... 1785075 58682461

What actually happens when we do this? > options(verbose=true) > txs <- transcripts(mm9kg) SQL QUERY: SELECT tx_chrom, tx_start, tx_end, tx_strand, transcript._tx_id AS tx_id, tx_name FROM transcript ORDER BY tx_chrom, tx_strand, tx_start, tx_end Notice how the database query is pretty simple? DB joins promote flexible access BUT: there is a cost if using A LOT of joins Therefore (in this case) a hybrid approach: Retrieve relevant records and subset in R

Our database schema

Outline Introduction Example Databases: The GenomicFeatures Package Basic SQL Using SQL from within R

SQL in 3 slides Structured Query Language (SQL) is the most common language for interacting with relational databases.

Database Retrieval Single table selections SELECT * FROM gene; SELECT gene_id, gene._tx_id FROM gene; SELECT * FROM gene WHERE _tx_id=49245; SELECT * FROM transcript WHERE tx_name LIKE '%oap.1'; Inner joins SELECT gene.gene_id,transcript._tx_id FROM gene, transcript WHERE gene._tx_id=transcript._tx_id; SELECT g.gene_id,t._tx_id FROM gene AS g, transcript AS t WHERE g._tx_id=t._tx_id AND t._tx_id > 10;

Database Modifications CREATE TABLE CREATE TABLE foo ( id INTEGER, string TEXT ); INSERT INSERT INTO foo (id, string) VALUES (1,"bar"); CREATE INDEX CREATE INDEX fooind1 ON foo(id);

Outline Introduction Example Databases: The GenomicFeatures Package Basic SQL Using SQL from within R

The DBI package Provides a nice generic access to databases in R Many of the functions are convenient and simple to use

Some popular DBI functions > library(rsqlite) #loads DBI too, (but we need both) > drv <- dbdriver("sqlite") > con <- dbconnect(drv, dbname=system.file("extdata", + "mm9kg.sqlite", package="advancedr")) > dblisttables(con) [1] "cds" "chrominfo" "exon" "gene" [5] "metadata" "splicing" "transcript" > dblistfields(con,"transcript") [1] "_tx_id" "tx_name" "tx_chrom" "tx_strand" [5] "tx_start" "tx_end"

The dbgetquery approach > dbgetquery(con, "SELECT * FROM transcript LIMIT 3") _tx_id tx_name tx_chrom tx_strand tx_start tx_end 1 24308 uc009oap.1 chr9-3186316 3186344 2 24309 uc009oao.1 chr9-3133847 3199799 3 24310 uc009oaq.1 chr9-3190269 3199799

The dbsendquery approach If you use result sets, you also need to put them away > res <- dbsendquery(con, "SELECT * FROM transcript") > fetch(res, n= 3) _tx_id tx_name tx_chrom tx_strand tx_start tx_end 1 24308 uc009oap.1 chr9-3186316 3186344 2 24309 uc009oao.1 chr9-3133847 3199799 3 24310 uc009oaq.1 chr9-3190269 3199799 > dbclearresult(res) [1] TRUE Calling fetch again will get the next three results. This allows for simple iteration.

Your turn Select the exons from the minus strand of chromosome 9.

Setting up a new DB First, lets close the connection to our other DB: > dbdisconnect(con) [1] TRUE Then lets make a new database. Notice that we specify the database name with dbname This allows it to be written to disc instead of just memory. > drv <- dbdriver("sqlite") > con <- dbconnect(drv, dbname="mynewdb.sqlite") Once you have this, you may want to make a new table > dbgetquery(con, "CREATE Table foo (id INTEGER, string TEXT)") NULL

Your turn again Create a database and then put a table in it called genepheno to store the genes mutated and a phenotypes associated with each. Plan for genepheno to hold the following gene IDs and phenotypes (as a toy example): data = data.frame(id=c(69773,20586,258822,18315), string=c("dead", "Alive", "Dead", "Alive"), stringsasfactors=false)

The RSQLite package Provides SQLite access for R Much better support for complex inserts

Prepared queries > data <- data.frame(c(226089,66745), + c("c030046e11rik","trpd52l3"), + stringsasfactors=false) > names(data) <- c("id","string") > sql <- "INSERT INTO foo VALUES ($id, $string)" > dbbegintransaction(con) [1] TRUE > dbgetpreparedquery(con, sql, bind.data = data) NULL > dbcommit(con) [1] TRUE Notice that we want strings instead of factors in our data.frame

Your turn part 3 Now take a moment to insert that data into your database.

in SQLite, you can ATTACH Dbs The SQL what we want looks quite simple: ATTACH "mm9kg.sqlite" AS db; So we just need to do something like this: > db <- system.file("extdata", "mm9kg.sqlite", + package="advancedr") > dbgetquery(con, sprintf("attach '%s' AS db",db)) NULL

You can join across attached Dbs The SQL this looks like: SELECT * FROM db.gene AS dbg, foo AS f WHERE dbg.gene_id=f.id; Then in R: > sql <- "SELECT * FROM db.gene AS dbg, + foo AS f WHERE dbg.gene_id=f.id" > res <- dbgetquery(con, sql) > res gene_id _tx_id id string 1 226089 48508 226089 C030046E11Rik 2 226089 48509 226089 C030046E11Rik 3 226089 48511 226089 C030046E11Rik 4 226089 48510 226089 C030046E11Rik 5 66745 48522 66745 Trpd52l3

Your turn part 4 Now create a cross join to your database and extract the tx id s from the gene table there using your gene IDs as a foreign key.

Your turn part 5 Now conect your cross join to the transcript table in the database and extract the fields from that table while still using your gene IDs as a foreign key.