Below is a table called raw_search_log containing details of search queries. user_id INTEGER ID of the user that made the search.



Similar documents
Accessing your 1098T online through General Dynamics Information Technology (Vangent).

Netezza SQL Class Outline

CS 145: NoSQL Activity Stanford University, Fall 2015 A Quick Introdution to Redis

Oracle Database 10g Express

A basic create statement for a simple student table would look like the following.

Retrieving Data Using the SQL SELECT Statement. Copyright 2006, Oracle. All rights reserved.

Using SQL Server Management Studio

SQL Server An Overview

USING MYWEBSQL FIGURE 1: FIRST AUTHENTICATION LAYER (ENTER YOUR REGULAR SIMMONS USERNAME AND PASSWORD)

UCBI Web Capture Remote Deposit User Instructions

2-Factor Verification Remote Access

Create a Database Driven Application

Guide to Complete EIA SSO (Single Sign-On) Registration. 1. Open your Internet Browser, enter this address, and press Enter

Important Tips when using Ad Hoc

Databases What the Specification Says

Database Design Patterns. Winter Lecture 24

TIM 50 - Business Information Systems

AIM Dashboard-User Documentation

First, we ll look at some basics all too often the things you cannot change easily!

Banner Frequently Asked Questions (FAQs)

Once the schema has been designed, it can be implemented in the RDBMS.

W-2 DUPLICATE OR REPRINT PROCEDURE/ IMPORTING W-2 INFORMATION INTO TAX RETURN

When to consider OLAP?

Accessing multidimensional Data Types in Oracle 9i Release 2

The Relational Data Model: Structure

How To Use Econnect On A Pc Or Mac Or Ipad (For A Non Profit)

news from Tom Bacon about Monday's lecture

B.1 Database Design and Definition

Jet Data Manager 2012 User Guide

SQL NULL s, Constraints, Triggers

database abstraction layer database abstraction layers in PHP Lukas Smith BackendMedia

Support Portal User Guide. Version 3.0

Introduction This document s purpose is to define Microsoft SQL server database design standards.

Lost in Space? Methodology for a Guided Drill-Through Analysis Out of the Wormhole

Week 13: Data Warehousing. Warehousing

MS SQL Performance (Tuning) Best Practices:

Project Management Development Scheduling Plan: Data Mart. By Turell Makins

3.GETTING STARTED WITH ORACLE8i

Teacher Activities Page Directions

Advanced SQL Queries for the EMF

How To Use Databook On A Microsoft Powerbook (Robert Birt) On A Pc Or Macbook 2 (For Macbook)

Secure Portal. A Step-by-Step Guide for Using KRS ZixCorp Secure Solution

Webapps Vulnerability Report

Microsoft End to End Business Intelligence Boot Camp

CHAPTER 5: BUSINESS ANALYTICS

Frequently Asked Questions for logging in to Online Banking

Using the SQL TAS v4

Setting up a basic database in Access 2003

LOBs were introduced back with DB2 V6, some 13 years ago. (V6 GA 25 June 1999) Prior to the introduction of LOBs, the max row size was 32K and the

Client Portal User Guide

How to Edit Your Website

OLAP Systems and Multidimensional Expressions I

ETL Overview. Extract, Transform, Load (ETL) Refreshment Workflow. The ETL Process. General ETL issues. MS Integration Services

SQL Injection. Blossom Hands-on exercises for computer forensics and security

INTRODUCTION TO DATABASES USING MICROSOFT ACCESS

Delivering Business Intelligence With Microsoft SQL Server 2005 or 2008 HDT922 Five Days

First Time Users: Setting Up Your Account ADP Online Payroll Instructions

Oracle Database 10g: Introduction to SQL

Basics of Dimensional Modeling

CSCE 156H/RAIK 184H Assignment 4 - Project Phase III Database Design

Implementing Data Models and Reports with Microsoft SQL Server 20466C; 5 Days

Top Tips 9 IS Portal Tips & Tricks

Course 6234A: Implementing and Maintaining Microsoft SQL Server 2008 Analysis Services

Sikorsky Aircraft. Supplier Portal Password Activation Process. Revision H

Chapter 30 Exporting Inventory Management System Data

Reducing the size of your Exchange mailbox

Instant SQL Programming

Analyzing Data Using Access

THE WINNING ROULETTE SYSTEM.

Access Tutorial 1 Creating a Database

Safewhere*Identify 3.4. Release Notes

Oracle USF

Teradata Utilities Class Outline

Microsoft Implementing Data Models and Reports with Microsoft SQL Server

SAP BUSINESS OBJECTS BO BI 4.1 amron

Course 20463:Implementing a Data Warehouse with Microsoft SQL Server

Customizing MISys Manufacturing Reports Using ADO-XML 6/3/2013

The following are tips compiled by PeerPlace to assist you as you transition to the new Senior Tracking, Analysis and Reporting System (STARS).

Sample Table. Columns. Column 1 Column 2 Column 3 Row 1 Cell 1 Cell 2 Cell 3 Row 2 Cell 4 Cell 5 Cell 6 Row 3 Cell 7 Cell 8 Cell 9.

SQL Simple Queries. Chapter 3.1 V3.0. Napier University Dr Gordon Russell

Contents WEKA Microsoft SQL Database

Optimizing the Performance of the Oracle BI Applications using Oracle Datawarehousing Features and Oracle DAC

Access Part 2 - Design

Cognos 8 Best Practices

Conventional Files versus the Database. Files versus Database. Pros and Cons of Conventional Files. Pros and Cons of Databases. Fields (continued)

DbSchema Tutorial with Introduction in SQL Databases

Note: Password must be 7-16 characters and contain at least one uppercase letter and at least one number.

DATA WAREHOUSING II. CS121: Introduction to Relational Database Systems Fall 2015 Lecture 23

How To Run A Run Dashboard On An Aca.Edu

Table of Contents SQL Server Option

A Brief Introduction to MySQL

1. To start Installation: To install the reporting tool, copy the entire contents of the zip file to a directory of your choice. Run the exe.

econtrol 3.5 for Active Directory & Exchange Self-Service Guide

The Benefits of Data Modeling in Data Warehousing

IS466 Decision Support Systems. SQL Server Business Intelligence Development Studio 2008 User Guide

Data warehousing/dimensional modeling/ SAP BW 7.3 Concepts

N o r m a l i z a t i o n


Relational model. Relational model - practice. Relational Database Definitions 9/27/11. Relational model. Relational Database: Terminology

Transcription:

%load_ext sql %sql sqlite:///olap.db OLAP and Cubes Activity 1 Data and Motivation You're given a bunch of data on search queries by users. (We can pretend that these users are Google search users and you are an engineer on the Google Web Search team). You want to analyze the number of search queries made, who they are made by, and how successful your search engine is at returning a result that people want to click on. A particular user can only make one query at a time. Below is a table called raw_search_log containing details of search queries. raw_search_log Field Type Description user_id ID of the user that made the search. timestamp TIMESTAMP Time at which the search occurred. A particular user_id can only make a single query at a given timestamp. query VARCHAR(500) Search query (text typed into the search bar). rank click_url VARCHAR(200) The rank of the search result they clicked on (how high it appears in the results). This is NULL if they never clicked on a result. The URL of the search result they clicked on. This is NULL if they never clicked on a result. city VARCHAR(50) The location of the user (is constant for a given user). age The age of the user (is constant for a given user). 2 OLAP Database Design We will split up the attributes of the raw data schema into a star schema with 2 dimension tables (users_dim, dates_dim), and a fact table (searches_fact). Design the star schema. users_dim User information.

searches_fact user_id city age VARCHAR(50) Search queries that were performed (who, when, what). dates_dim user_id timestamp TIMESTAMP query rank click_url VARCHAR(500) VARCHAR(200) Date/time information on search queries. 1. Write the CREATE TABLE statement for users_dim. Don't forget about the primary key! 2. Write the CREATE TABLE statement for dates_dim. This table contains attributes not found in the raw schema because we want to be able to do more detailed analysis on the data. Don't forget about the primary key! 3. Write the CREATE TABLE statement for searches_fact. Don't forget about foreign keys into the dimension tables and the primary key. (If you forgot the FOREIGN KEY syntax, take a look at Lecture 5: Design Theory3 from the course website). Discussion Why did we design it this way? What kind of queries might we be able to do with this schema design?

3 Populating Tables After designing the tables, we need to populate them with data from raw_search_log in order to do our analysis! Example. Populate the users_dim table with an INSERT...SELECT statement. [18250 rows ] 1. Populate the dates_dim table with an INSERT...SELECT statement. [28138 rows ] </p> You must extract the date (using the DATE function on the timestamp field) and hour of day (using the EXTRACT function) from timestamp. The documentation for EXTRACT is here4 and the syntax is EXTRACT( FROM timestamp). You must figure out whether or not the date is a weekend. Look at the EXTRACT function again. You will want to use a CASE statement. 2. Write an INSERT...SELECT statement to populate the searches_fact table. [28195 rows ] Discussion Think about how much work was required to set up the data for analysis. Compare this to NoSQL/Redis. How much design work did you have to do before analyzing the data? What if you wanted to add a field source to denote where the user queried from (desktop, Android browser, Safari, etc.)? What would you have to do in SQL versus what you do in Redis? What if source were unknown? How would this be denoted in SQL versus Redis? 4 OLAP Queries Now, we want to use our dimension and fact tables to analyze the data. We don't want to use raw_search_log as that would defeat the purpose of this exercise. NATURAL JOINs will come in handy and are easy to use by design of the schema. Example. Find the number of queries performed by people between the ages of 18 and 25, ordered by age. Problems

1. How many characters long are the search queries done by people at Stanford? Find the top 10 longest queries and their length for queries done by users in Stanford. LENGTH (documentation here6) and LIMIT (documentation here7) will be helpful. [Longest query is 255 characters ] 2. We want to get summary statistics on queries in each city. For each city, get the number of unique users and number of queries made. The result's schema should be (city, num_users, num_queries). 3. Find the maximum length of search queries each day (not timestamp!) that returned a click_url (i.e. searches that resulted in a user clicking on a search result). We're interested in seeing if the search engine is getting better at returning results for queries, or if people are better at searching with shorter queries. The result should be of the form (date, max_query_length). 4. Find the bad kids! Write a query that returns the user_ids of users under the age of 18 who've made searches between the hours of 2 am and 7 am (hour values of 2 and 7), inclusive. [176 rows] 5 Discussion Think about these questions below and we will discuss it together as a class. Installation and setup-wise, what was easier to do in Redis? What kind of analysis was easier to do in SQL? What are some of the disadvantages or limitations of using SQL rather than Redis? 6 Bonus Problems (Extra Practice) Do these if you have extra time! This is not required as part of the activity. 1. Report number of queries and clicked websites viewed by each user in Palo Alto. The result should be of the form (user_id, num_queried, num_clicked). [First row of results is (281371, 2, 2) ]

2. For dates between '2006-03-04' and '2006-03-07', find the number of queries. Your result should be of the form (date, num_queries). Remember, some queries could have occurred at the same time! [335, 400, 348, 369] 3. Are older people less effective at making a good search? Get the number of queries that did not return a click_url from users 50 or older who don't live in Oldsville. [2430 ] 4. Super-Bonus! It would be interesting to see if the average queries per hour is different between days during the week and days during the weekend. Do people make more searches during the weekend? Do they make more at night? Write a query that computes the number of queries per hour, averaged over the weekdays and over the weekends. The result should have rows and columns like this: timestamp date hour TIMESTAMP DATE is_weekend BOOLEAN hour avg_weekday_queries avg_weekend_queries 0 11.89 10.69 1 7.58 9.27......... avg_weekday_queries contains the number of queries per hour, averaged over all weekdays (the is_weekend field in dates_dim will be helpful), and avg_weekend_queries contains the number of queries per hour, averaged over all weekend days. Hint: Break it apart into two queries (one for the weekend, one for the weekday) and join them together at the end. %%sql