Lecture 25: Database Notes



Similar documents
Relational Database: Additional Operations on Relations; SQL

2874CD1EssentialSQL.qxd 6/25/01 3:06 PM Page 1 Essential SQL Copyright 2001 SYBEX, Inc., Alameda, CA

Database Administration with MySQL

Introduction to SQL for Data Scientists

Introduction to SQL and SQL in R. LISA Short Courses Xinran Hu

PSU SQL: Introduction. SQL: Introduction. Relational Databases. Activity 1 Examining Tables and Diagrams

Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification

SQL - QUICK GUIDE. Allows users to access data in relational database management systems.

Creating QBE Queries in Microsoft SQL Server

Oracle Database: SQL and PL/SQL Fundamentals NEW

Database Query 1: SQL Basics

How To Create A Table In Sql (Ahem)

[Refer Slide Time: 05:10]

The join operation allows you to combine related rows of data found in two tables into a single result set.

Oracle SQL. Course Summary. Duration. Objectives

Chapter 1 Overview of the SQL Procedure

SQL Injection. SQL Injection. CSCI 4971 Secure Software Principles. Rensselaer Polytechnic Institute. Spring

Financial Data Access with SQL, Excel & VBA

Oracle Database: SQL and PL/SQL Fundamentals

1 Structured Query Language: Again. 2 Joining Tables

A Brief Introduction to MySQL

SUBQUERIES AND VIEWS. CS121: Introduction to Relational Database Systems Fall 2015 Lecture 6

Using SQL Queries in Crystal Reports

Instant SQL Programming

Using Multiple Operations. Implementing Table Operations Using Structured Query Language (SQL)

Package sjdbc. R topics documented: February 20, 2015

Microsoft Access 3: Understanding and Creating Queries

PostgreSQL Concurrency Issues

White Paper. Blindfolded SQL Injection

Programming with SQL

AN INTRODUCTION TO THE SQL PROCEDURE Chris Yindra, C. Y. Associates

Part A: Data Definition Language (DDL) Schema and Catalog CREAT TABLE. Referential Triggered Actions. CSC 742 Database Management Systems

Performing Queries Using PROC SQL (1)

Relational Databases. Charles Severance

Alternatives to Merging SAS Data Sets But Be Careful

The Basics of Querying in FoxPro with SQL SELECT. (Lesson I Single Table Queries)

Relational Databases and SQLite

Package TSfame. February 15, 2013

Accessing Your Database with JMP 10 JMP Discovery Conference 2012 Brian Corcoran SAS Institute

Oracle Database: SQL and PL/SQL Fundamentals

Using Databases in R

To increase performaces SQL has been extended in order to have some new operations avaiable. They are:

Computing with large data sets

MAS 500 Intelligence Tips and Tricks Booklet Vol. 1

SQL Nested & Complex Queries. CS 377: Database Systems

Welcome to the topic on queries in SAP Business One.

Retrieving Data Using the SQL SELECT Statement. Copyright 2006, Oracle. All rights reserved.

Duration Vendor Audience 5 Days Oracle End Users, Developers, Technical Consultants and Support Staff

Using AND in a Query: Step 1: Open Query Design

Outline. SAS-seminar Proc SQL, the pass-through facility. What is SQL? What is a database? What is Proc SQL? What is SQL and what is a database

Intro to Databases. ACM Webmonkeys 2011

Inquiry Formulas. student guide

COMP 5138 Relational Database Management Systems. Week 5 : Basic SQL. Today s Agenda. Overview. Basic SQL Queries. Joins Queries

G563 Quantitative Paleontology. SQL databases. An introduction. Department of Geological Sciences Indiana University. (c) 2012, P.

Lab # 5. Retreiving Data from Multiple Tables. Eng. Alaa O Shama

Oracle Database 10g: Introduction to SQL

Introduction to Proc SQL Steven First, Systems Seminar Consultants, Madison, WI

3.GETTING STARTED WITH ORACLE8i

David M. Kroenke and David J. Auer Database Processing: Fundamentals, Design and Implementation

Scaling up = getting a better machine. Scaling out = use another server and add it to your cluster.

FileMaker 11. ODBC and JDBC Guide

CS 145: NoSQL Activity Stanford University, Fall 2015 A Quick Introdution to Redis

Once the schema has been designed, it can be implemented in the RDBMS.

Writing Scripts with PHP s PEAR DB Module

Structured Query Language (SQL)

Introduction to Microsoft Jet SQL

Tips and Tricks SAGE ACCPAC INTELLIGENCE

Package RSQLite. February 19, 2015

To use MySQL effectively, you need to learn the syntax of a new language and grow

The Saves Package. an approximate benchmark of performance issues while loading datasets. Gergely Daróczi

DATABASE DESIGN AND IMPLEMENTATION II SAULT COLLEGE OF APPLIED ARTS AND TECHNOLOGY SAULT STE. MARIE, ONTARIO. Sault College

Introduction to SQL: Data Retrieving

Creating and Using Databases with Microsoft Access

Data Tool Platform SQL Development Tools

SQL Basics. Introduction to Standard Query Language

SQL SELECT Query: Intermediate

Query Optimization Approach in SQL to prepare Data Sets for Data Mining Analysis

Boats bid bname color 101 Interlake blue 102 Interlake red 103 Clipper green 104 Marine red. Figure 1: Instances of Sailors, Boats and Reserves

The F distribution and the basic principle behind ANOVAs. Situating ANOVAs in the world of statistical tests

Appendix: Tutorial Introduction to MATLAB

Entity/Relationship Modelling. Database Systems Lecture 4 Natasha Alechina

The software shall provide the necessary tools to allow a user to create a Dashboard based on the queries created.

Hypercosm. Studio.

ENHANCEMENTS TO SQL SERVER COLUMN STORES. Anuhya Mallempati #

its not in the manual

GEM Network Advantages and Disadvantages for Stand-Alone PC

Katie Minten Ronk, Steve First, David Beam Systems Seminar Consultants, Inc., Madison, WI

Oracle Database: SQL and PL/SQL Fundamentals NEW

HTSQL is a comprehensive navigational query language for relational databases.

Exposed Database( SQL Server) Error messages Delicious food for Hackers

Horizontal Aggregations In SQL To Generate Data Sets For Data Mining Analysis In An Optimized Manner

CS 338 Join, Aggregate and Group SQL Queries

Subsetting Observations from Large SAS Data Sets

Financial Reporting Using Microsoft Excel. Presented By: Jim Lee

SQL. Short introduction

T-SQL STANDARD ELEMENTS

Database Programming with PL/SQL: Learning Objectives

Transcription:

Lecture 25: Database Notes 36-350, Fall 2014 12 November 2014 The examples here use http://www.stat.cmu.edu/~cshalizi/statcomp/ 14/lectures/23/baseball.db, which is derived from Lahman s baseball database (http://www.seanlahman.com/baseball-archive/statistics/), more fully documented at http://seanlahman.com/files/database/readme58.txt. 1 Basics; Queries A database consists of one or more tables; each table is like a data-frame in R, where each record is like a data-frame s row, and each field is like a column. Related tables are linked by the use of unique identifiers or keys (.e.g., patient IDs). A database server hosts one or more databases. The server controls access to the databases, and handles the actual business of reading out information from the tables efficiently (or writing information to them). Database clients are programs which interact with the server. (Think of the difference between the Web browser program on your computer or phone [a client], and the Web server program which actually hosts the website.) The main thing you will be doing with a database is issuing queries asking it to retrieve selected parts of a database. Just as we used regular expressions to specify patterns of text, we need a special syntax to specify queries unambiguously. The de facto standard is the Structured Query Language, SQL. Most popular database server programs are designed to work with it, sometimes with slight variants from program to program. With thanks to Vince Vu 1

1.1 Common Operations SQL command SQLite variant List available databases SHOW DATABASES.databases List tables in a database SHOW TABLES IN database.tables List fields in a table SHOW COLUMNS IN table Describe the types of the columns in a table DESCRIBE table.schema table Change the default database USE database 2

2 Select The main tool is the SELECT command: you give it a specification of what information you want, and it returns a table: SELECT columns or computations FROM table WHERE condition GROUP BY columns HAVING condition ORDER BY column [ASC DESC] LIMIT offset,count; Most of this is optional (but not the semi-colon at the end). 2.1 Selecting Columns Columns get specified by name (which are case-sensitive), with multiple columns separated by commas. * specifies all columns. SELECT PlayerID,yearID,AB,H FROM Batting; returns a sub-table, with the specified columns, from the table Batting. The R equivalent on a dataframe would be Batting[,c("PlayerID","yearID","AB","H")] SELECT * FROM Salaries; returns all columns from the table Salaries. (What would be the R equivalent?) SELECT * FROM Salaries ORDER BY Salary; re-organizes so the records are sorted by the Salary field. (What would be the R equivalent?) The default is ascending order. SELECT * FROM Salaries ORDER BY Salary DESC; reverses the order. SELECT * FROM Salaries ORDER BY Salary DESC LIMIT 10; will return all columns for the records with the 10 highest salaries. (What would be the R equivalent?) 3

2.2 Selecting Records SQL uses Boolean expressions in the WHERE clause to decide which records get selected, like subset() or which() in R. SELECT PlayerID,yearID,AB,H FROM Batting WHERE AB > 100 AND H > 0; will return the specified columns from Batting, but only for the records where the AB field is over 100 and the H field is over 0. The R equivalent would be Batting[Batting$AB > 100 & Batting$H > 0, c("playerid","yearid","ab","h")] or with(batting,batting[ab > 100 & H > 0, c("playerid","yearid","ab","h")]) or (perhaps closest to the SQL) subset(batting,subset=(ab > 100 & H > 0), select=c("playerid","yearid","ab","h")) 2.3 Calculated Columns We can have the server do some calculations for us as part of the query, including simple arithmetic and basic summaries, like SUM, AVG, MIN, MAX, COUNT, VAR SAMP, STDDEV SAMP. SELECT MAX(AB) FROM Batting; SELECT MIN(AB), AVG(AB), MAX(AB) FROM Batting; do what you d expected. SELECT AB, H, H/AB FROM Batting; doesn t do quite what you expect since H and AB are both integers, H/AB is just the integer part of the ratio. SELECT AB,H,H/CAST(AB AS REAL) FROM Batting; gives the correct floating-point ratio. We can give calculated columns names, if we want to refer to them, e.g., for sorting: SELECT PlayerID,yearID,H/CAST(AB AS REAL) AS BattingAvg FROM Batting ORDER BY BattingAvg DESC LIMIT 10; returns the records with the 10 highest values of our new field BattingAvg. 4

2.4 Aggregation To aggregate, i.e., to do calculations on grouped subsets of records, we use GROUP BY clauses, much like aggregate and d*ply in R. SELECT playerid, SUM(salary) FROM Salaries GROUP BY playerid gives the total salary paid over time to each player. These calculations can be named: SELECT playerid, SUM(salary) AS totalsalary FROM Salaries GROUP BY playerid ORDER BY totalsalary DESC LIMIT 10; Exercise: What would SELECT salary, COUNT(playerID) FROM Salaries GROUP BY salary do? 2.5 Selecting Records with Aggregation SELECT processes part of what you tell it to do in order: WHERE clauses come first (initial selection of records), then GROUP BY (aggregation of records), then HAVING for post-aggregation selection. SELECT playerid, SUM(salary) AS totalsalary FROM Salaries GROUP BY playerid HAVING totalsalary > 200000000 will return grouped records (i.e., players) where the total salary exceeds 200 million. 5

3 JOIN If information is split across tables, we need to join it back together. The operation which does so is called a join, naturally enough, which in SQL shows up as a JOIN clause in SELECT, modifying the FROM. (In R, we join dataframes with the merge command; it has all the functionality of JOIN, though it s usually not as fast or as memory-efficient.) The most natural kind of JOIN is where we find records in two tables with matching keys (unique identifiers), and merge the sets of fields. (Conceptually, we look at every combination of a record from table A with a record from table B, and ask whether the pair satisfies some JOIN condition or JOIN predicate ; if it does, the pair goes in to the new table as a single record. In the obvious sort of JOIN, the condition is do the keys match? Actually examining all pairs of records that way isn t very efficient, and there are lots of tricks to avoid having to do it whenever possible.) SELECT namelast,namefirst,yearid,ab,h FROM Master INNER JOIN Batting USING(playerID); The INNER JOIN... USING part creates a (virtual) table, merging the fields from Master and Batting where playerid matches. An equivalent: SELECT namelast,namefirst,yearid,ab,h FROM Master INNER JOIN Batting ON Master.playerID == Batting.playerID; This form can be used when the key has different names in different tables. If there is just one field in common between two tables, we can write NAT- URAL JOIN instead of INNER JOIN, and skip USING or ON. SELECT namelast,namefirst,yearid,ab,h FROM Master NATURAL JOIN Batting; (However, many experts advise against using NATURAL JOIN, since if anyone ever changes the column names or adds columns with overlapping names, your code will not do what you expect.) A single SELECT can contain multiple JOINs, to combine more than two tables: SELECT namelast,namefirst,yearid,ab,h,salary FROM Master NATURAL JOIN Batting NATURAL JOIN Salaries ORDER BY salary DESC LIMIT 10; Two tables may have columns with the same name which we want to keep track of separately; then the SQL convention is to refer to them as table.var. The AS command lets us rename them inside a SELECT: SELECT patient.last_name AS patname, physicians.last_name AS docname FROM patients NATURAL JOIN physicians; 6

AS can also be used to abbreviate or rename tables: SELECT p.last_name AS patname, d.last_name AS docname FROM patients AS p NATURAL JOIN physicians AS d; An OUTER JOIN is one where a record from one table which doesn t have any matching records in the other gets added anyway, with NA values for the fields from the other table. In a LEFT OUTER JOIN, this only is done for records in the first table being joined; in a RIGHT OUTER JOIN, those in the second table; and in a FULL OUTER JOIN, both of them. 7

4 Accessing Databases from R Each database management system (DBMS) has a somewhat different protocol for actually dealing with servers. Rather than trying to master all of them, the sensible approach in R is to use the DBI package, which provides a single uniform interface, no matter what the DBMS. The programmers of DBI have to figure out how to deal with them, but you don t. The sub-programs which handle different DBMS s are called drivers; see http://cran.r-project.org/ package=dbi. install.packages("dbi", dependencies = TRUE) # Install DBI install.packages("rsqlite", dependencies = TRUE) # Install driver for SQLite Here we ve installed the driver for one particular DBMS called SQLite, largely because it works pretty reliably on the vast majority of students machines. R uses persistent objects called connections to keep track of which database server is being used for what task. 1 So the procedure goes like so: Load the driver for the DBMS you ll be dealing with. Establish a connection to the database server. Interact with the DBMS. Close the connection. Unload the driver. 4.1 Load Driver, Establish Connection library(rsqlite) drv <- dbdriver( SQLite ) con <- dbconnect(drv, dbname="baseball.db") con now is the name for the connection to the SQLite database baseball.db. We could also give dbconnect arguments host (an Internet address), use (a user name) and password. 4.2 Interacting with a Database through DBI The basic commands: dblisttables(conn) dblistfields(conn, name) dbreadtable(conn, name) # Get tables in the database (returns vector) # List fields in a table # Import a table as a data frame The really useful one: 1 Connections actually get used for other things as well, like file input/output. 8

dbgetquery(conn, statement) The second argument is an SQL query, which gets issued to the database; it returns a data frame. When building the statement, as when building a regular expression, it often helps to use paste: df <- dbgetquery(con, paste( "SELECT namelast,namefirst,yearid,salary", "FROM Master NATURAL JOIN Salaries")) 4.3 Disconnecting and unloding the driver There s an overhead involved in keeping these around, so remove them once they re not needed: dbdisconnect(con) dbunloaddriver(drv) Closing the connection also tells the database server that it doesn t have to worry any more about the possibility of your writing something into the database. 9