Introduction to SQL and SQL in R LISA Short Courses Xinran Hu 1
Laboratory for Interdisciplinary Statistical Analysis LISA helps VT researchers benefit from the use of Statistics Collaboration: Visit our website to request personalized statistical advice and assistance with: Experimental Design Data Analysis Interpreting Results Grant Proposals Software (R, SAS, JMP, SPSS...) LISA statistical collaborators aim to explain concepts in ways useful for your research. Great advice right now: Meet with LISA before collecting your data. Short Courses: Designed to help graduate students apply statistics in their research Walk-In Consulting: M-F 1-3 PM GLC Video Conference Room; 11 AM-1 PM Old Security Building Room 103 For questions requiring <30 mins All services are FREE for VT researchers. We assist with research not class projects or homework. 2
Course Plan Part I. SQL 1. Basic Concepts in SQL 2. Querying From One Table 3. Nested Queries 4. Querying From Multiple Tables Part II. SQL in R 1. sqldf Package 2. Write SQL Command in R 3
Why SQL? SQL is a special-purpose programming language designed for managing data held in a relational database management system (RDBMS). (wikipedia) In the era of Big Data, (1) Simple Method + Massive Dataset (2) The time spent on data management often exceeds the time spent on analysis. 4
Basic Concepts: Table A Table (or a Relation) is a set of records that have the same attributes. 1. A table always have an unique table name in SQL. Name Gender Age Occupation Jessica Female 28 Student Michael Male 28 Professor Emma Female 34 Professor 5
Basic Concepts: Column A Column ( or Attribute or Field) is a characteristics of objects. 1. A column always have a unique column name within a table. 2. A column is either numeric (age), categorical (gender) or one of the other pre-defined types in SQL, e.g. date. 3. Primary Key: a column that has unique value for each object. Name(PK) Gender Age Occupation Jessica Female 28 Student Michael Male 28 Professor Emma Female 34 Professor 6
Basic Concepts: Row A Row ( or a Tuple or Record) is an object represented by a list of attributes. 1. A row always have an unique value in Primary Key. Name(PK) Gender Age Occupation Jessica Female 28 Student Michael Male 28 Professor Emma Female 34 Professor 7
Basic SQL Command Example dataset: a world fact dataset (sqlzoo.net) http://sqlzoo.net/wiki/select_basics (Table name: world) 8
SELECT FROM Function: Selecting one or more columns from a table. Example: SELECT * FROM world SELECT name FROM world SELECT name, gdp FROM world A SQL query always start with a SELECT FROM command. 9
DISTINCT Clause Function: Removing duplicate records in query result. Example: SELECT continent FROM world (This query includes duplicate results) SELECT DISTINCT continent FROM world (This query excludes duplicate results) 10
LIMIT clause Function: Limiting the number of records returned. Example: SELECT DISTINCT continent FROM world (This query returns all distinct results) SELECT DISTINCT continent FROM world LIMIT 3 (This query returns three distinct results) 11
WHERE clause Function: Filtering the records returned by a select command. Example: Filtering based on a numeric attribute SELECT name, area FROM world /* do not forget this line*/ WHERE area = 9596961 (km^2) WHERE area<> 9596961 WHERE area > 5000000 WHERE area < 1 WHERE area IN (0, 9596961) WHERE area BETWEEN 9000000 AND 10000000 WHERE area NOT BETWEEN 1 AND 10000000 12
WHERE clause (cont.) Example: Filtering based on a categorical attribute. SELECT name, continent FROM world WHERE continent = 'north america' WHERE continent <> 'north america' WHERE continent >'north america' WHERE continent < 'north america' WHERE continent IN ('north america', south america') WHERE continent BETWEEN D AND F WHERE continent NOT BETWEEN D AND F WHERE continent LIKE '%america 13
WHERE clause (cont.) Example: Filtering based on two or more conditions. SELECT name, continent FROM world WHERE area>9000000 AND area<10000000 WHERE continent = 'north america AND area>9000000 WHERE continent = 'north america OR area>9000000 WHERE (continent = 'north america AND area>500000) OR (continent = south america AND area>500000) 14
ORDERED BY clause Function: Sorting the query result by a column. Example: Sort Ascending (Default) SELECT name, area FROM world ORDER BY area (ASC) Sort Descending SELECT name, area FROM world ORDER BY area DESC 15
Arithmetic operations SQL permits arithmetic operations on columns. Example: Query GDP: select name, gdp from world order by gdp DESC Query GDP in Trillion: select name, gdp/1000000000 from world order by gdp DESC Query GDP per capita select name, gdp/population from world order by gdp/population DESC 16
Practice: SQLZOO:SELECT basics Write the SQL queries in the first tutorial. http://sqlzoo.net/wiki/select_basics 17
Aggregate Functions Function: Producing descriptive statistics of a column. Example: Query the average GDP of the world. SELECT avg(gdp) FROM world More Aggregate Functions: max(), min(), sum(), count(), var(), stdev() Note, max() and min() operates on strings as well. 18
GROUP BY Function: Stratifying aggregate result by a categorical column Example: Query the average GDP of each continent. SELECT avg(gdp) FROM world. GROUP BY continent 19
HAVING Clause Function: Filtering aggregate results. Example: Find the continent with average country area greater than 1 million. SELECT continent, avg(area) FROM world GROUP BY continent HAVING avg(area)>1000000 20
Nested Query Query result can be used in the where clause of another query. This is referred as Nested Query SELECT name FROM world WHERE population > (SELECT population FROM world WHERE name= Russia') If we run the nested query, we will get 146000000, so the above query is equivalent to SELECT name FROM world WHERE population > 146000000 21
UNION Command Function: Concatenating two tables vertically Name Gender Age Occupation Jessica Female 28 Student Michael Male 28 Professor Emma Female 34 Professor Name Gender Age Occupation James Male 23 Student Anna Female 22 Staff Andy Male 45 Professor Name Gender Age Occupation Jessica Female 28 Student Michael Male 28 Professor Emma Female 34 Professor James Male 23 Student Anna Female 22 Staff Andy Male 45 Professor 22
UNION Command Example: SELECT * FROM world WHERE name = Germany UNION SELECT * FROM world WHERE name= France Why is UNION command useful if we can use WHERE clause? Answer: These queries could come from different tables. 23
JOIN Command Function: Concatenating two tables horizontally. Name Gender Age Occupation Jessica Female 28 Student Michael Male 28 Professor Emma Female 34 Professor Name Jessica Michael Emma Department Math Statistics Art Name Gender Age Occupation Department Jessica Female 28 Student Math Michael Male 28 Professor Statistics Emma Female 34 Professor Art 24
JOIN Command JOIN is generally more complicated than union because you have to specify a common attribute to link two tables. In the previous example, we joined tables with the name column. If both tables contain a column with the same column name, we should distinguish two columns by attaching the table name in front Table1Name.ColumnName vs Table2Name.ColumnName 25
JOIN Command There are four different types of join: (1) INNER JOIN (2) LEFT JOIN (3) RIGHT JOIN (4) FULL OUTER JOIN 26
INNER JOIN INNER JOIN produce a record if it exist in BOTH tables Name Gender Age Occupation Jessica Female 28 Student Michael Male 28 Professor Emma Female 34 Professor Name Jessica John Department Math CS Name Gender Age Occupation Department Jessica Female 28 Student Math 27
LEFT JOIN LEFT JOIN produce a record if it exists in the left table Name Gender Age Occupation Jessica Female 28 Student Michael Male 28 Professor Emma Female 34 Professor Name Jessica John Department Math CS Name Gender Age Occupation Department Jessica Female 28 Student Math Michael Male 28 Professor NULL Emma Female 34 Professor NULL 28
RIGHT JOIN RIGHT JOIN produce a record only if it exists in the right table Name Gender Age Occupation Jessica Female 28 Student Michael Male 28 Professor Emma Female 34 Professor Name Jessica John Department Math CS Name Gender Age Occupation Department Jessica Female 28 Student Math John NULL NULL NULL CS 29
FULL OUTER JOIN FULL OUTER JOIN produce a record if it exists in either table. Note, FULL OUTER JOIN is not supported in MySQL. Name Gender Age Occupation Jessica Female 28 Student Michael Male 28 Professor Emma Female 34 Professor Name Jessica John Department Math CS Name Gender Age Occupation Department Jessica Female 28 Student Math Michael Male 28 Professor NULL Emma Female 34 Professor NULL John NULL NULL NULL CS 30
JOIN syntax Olympic Game Dataset http://sqlzoo.net/wiki/select_.._join Left Table Name: Games Right Table Name: City Year City 1896 Athens 1948 London 2004 Athens 2008 Beijing 2012 London Name Sydney Athens Beijing London Country Australia Greece China UK 31
JOIN Example INNER JOIN SELECT * FROM games JOIN city ON games.city=city.name LEFT JOIN SELECT * FROM games LEFT JOIN city ON games.city=city.name RIGHTT JOIN SELECT * FROM games RIGHT JOIN city ON games.city=city.name FULL OUTER JOIN SELECT * FROM games FULL OUTER JOIN city ON games.city=city.name 32
SQL in R Two ways of applying SQL queries in R (1) A Local Way. Query from a dataframe. (2) A Remote Way Accessing a SQL database remotely. 33
Package sqldf Package sqldf allows us writing SQL-like commands to query from a dataframe just like querying from a relational table. All we need to do is wrap a SQL command in double quotation and send it to the sqldf function. E.g. library(sqldf) sqldf( select * from hpc ) 34