Extracting YouTube videos using SAS. Dr Craig Hansen


Craig Hansen has been using SAS for 15 years within the health research setting. He gained his doctorate in epidemiology at the University of the Sunshine Coast and has since worked at:
- the University of Queensland School of Medicine (Australia), in cardiovascular research
- the United States Environmental Protection Agency (USEPA), in air pollution research
- the Centers for Disease Control and Prevention (CDC, USA), in birth defects research
- Kaiser Permanente Center for Health Research (USA), on health studies using electronic medical records
He recently accepted a position at the South Australian Health and Medical Research Institute as Senior Epidemiologist within the Wardliparingga Aboriginal Health Unit.

Extracting YouTube Videos using SAS
By Craig Hansen, PhD
South Australian Health & Medical Research Institute

Overview
- YouTube API
- SAS PROC HTTP
- SAS XML Mapper
- Example: medications during pregnancy
- Questions

Have you ever wanted to scrape all the information from YouTube videos?
- Length
- Uploader
- Views
- Date
- Comments

You can, using the YouTube Data API and SAS!
YouTube Data API (metadata for videos) -> PROC HTTP -> XML file -> XML Mapper -> SAS datasets

YouTube API
Google's APIs allow you to integrate YouTube videos and functionality into websites or applications. Examples:
- YouTube Analytics API
- YouTube Data API
- YouTube Player API

Video feed example: https://gdata.youtube.com/feeds/api/videos?q=surfing
- Returns an XML file of all the videos found with the search term "surfing"
- Only allows 25 videos per XML file
- Only allows 500 videos in total per search term

XML
- Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable
- A structured ("tree-like") text that can be used to store information in a hierarchical format
- The structure is based on tags: <tag>..text..</tag>
- This is what SAS XML Mapper reads

YouTube XML schema (only a snippet)

<?xml version='1.0' encoding='utf-8'?>
<feed xmlns='http://www.w3.org/2005/atom'
      xmlns:opensearch='http://a9.com/-/spec/opensearch/1.1/'
      xmlns:gml='http://www.opengis.net/gml'
      xmlns:georss='http://www.georss.org/georss'
      xmlns:media='http://search.yahoo.com/mrss/'
      xmlns:batch='http://schemas.google.com/gdata/batch'
      xmlns:yt='http://gdata.youtube.com/schemas/2007'
      xmlns:gd='http://schemas.google.com/g/2005'
      gd:etag='w/"ce4eqh47ecp7ima9wxrqgeq."'>
<id>tag:youtube.com,2008:standardfeed:global:most_popular</id>
<updated>2008-07-18t05:00:49.000-07:00</updated>
<category scheme='http://schemas.google.com/g/2005#kind' term='http://gdata.youtube.com/schemas/2007#video'/>
<title>Most Popular</title>
<id>tag:youtube,2008:video:ztuvgyoen_b</id>
<published>2008-07-05t19:56:35.000-07:00</published>
<updated>2008-07-18t07:21:59.000-07:00</updated>
<category scheme='http://gdata.youtube.com/schemas/2007/categories.cat' term='people' label='people'/>
<title>Shopping for Coats</title>
<content type='application/x-shockwave-flash' src='http://www.youtube.com/v/ztuvgyoen_b?f=gdata_standard...'/>
<link rel='alternate' type='text/html' href='https://www.youtube.com/watch?v=ztuvgyoen_b'/>
<author>
<name>googledevelopers</name>
<uri>https://gdata.youtube.com/feeds/api/users/googledevelopers</uri>
<yt:userid>_x5xg1ov2p6uzz5fsm9ttw</yt:userid>
<yt:aspectratio>widescreen</yt:aspectratio>
<yt:duration seconds='79'/>
<yt:uploaded>2008-07-05t19:56:35.000-07:00</yt:uploaded>
<yt:uploaderid>uc_x5xg1ov2p6uzz5fsm9ttw</yt:uploaderid>
<yt:videoid>ztuvgyoen_b</yt:videoid>
</media:group>
<gd:rating min='1' max='5' numraters='14763' average='4.93'/>
<yt:recorded>2008-07-04</yt:recorded>
<yt:statistics viewcount='383290' favoritecount='7022'/>
<yt:rating numdislikes='19257' numlikes='106000'/>
</entry>
</feed>

SAS PROC HTTP
- Issues Hypertext Transfer Protocol (HTTP) requests
- YouTube API search (https://developers.google.com/youtube/v3/)

    FILENAME myxml "C:\Current Work\YT.xml" encoding="utf-8";

    PROC HTTP
        out=myxml
        url="https://gdata.youtube.com/feeds/api/videos?q=surfing"
        method="get";
    RUN;
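Before mapping the XML it can help to confirm that the request actually succeeded. The following is a minimal sketch, not the speaker's code: it repeats the request from the slide and additionally captures the response headers with the HEADEROUT= option so they can be inspected in the log. The fileref name hdrs is an assumption introduced here, and the gdata feed URL is the one shown on the slide, which may no longer respond.

```sas
/* Output file from the slide; adjust the path to your environment */
filename myxml "C:\Current Work\YT.xml" encoding="utf-8";

/* Hypothetical fileref for the response headers (temporary file) */
filename hdrs temp;

proc http
    url="https://gdata.youtube.com/feeds/api/videos?q=surfing"
    method="get"
    out=myxml
    headerout=hdrs;
run;

/* Echo the response headers, including the HTTP status line, to the log */
data _null_;
    infile hdrs;
    input;
    put _infile_;
run;
```

If the status line in the log is not a 200 response, the XML file is unlikely to contain the feed and the mapping step will not produce useful datasets.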

SAS XML Mapper
- We now need to parse all the text in the XML into a dataset
- SAS XML Mapper reads in the XML or schema file and interprets the structure based on the tags: <tag>..text..</tag>
- Perfect: we can use this to map all the structured text in the XML into a dataset. It does all the work for you!
- Download SAS XML Mapper from the SAS website

SAS XML Mapper: Create a map
[Screenshot: XML window, XML Map window, Source window]

SAS XML Mapper: Create a map
- Open the XML or schema file
- The datasets it will create
- XML map generated: save this map

SAS XML Mapper: Creating datasets
- See how the datasets are linked by ID variables

Information available
Datasets:
- Feed information
- Title
- Description
- Author
- Date published
- Category
- Duration
- Number of views
- Ratings
- Comments (links to another API)
- Restrictions
- URL

SAS Code

    /* Request the feed and save the XML response */
    FILENAME test1 "C:\Current Work\Surf.xml" encoding="utf-8";

    PROC HTTP
        out=test1
        url="https://gdata.youtube.com/feeds/api/videos?q=surfing"
        method="get";
    RUN;

    /* Point the XMLV2 engine at the saved XML and at the map saved from XML Mapper */
    FILENAME YOUTUBET 'C:\Current Work\Surf.xml';
    FILENAME SXLEMAP 'C:\Current Work\YouTubeGetData.map';
    LIBNAME YOUTUBET xmlv2 xmlmap=sxlemap ACCESS=READONLY;

- This will create all the linked datasets generated from the map
- Merge the datasets based on the linked IDs
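The merge step could look something like the sketch below. The dataset names (entry, author, statistics), the key entry_ORDINAL, and the selected columns are all assumptions for illustration; the real names depend on the map that XML Mapper generated, so check the YOUTUBET library (for example with PROC CONTENTS) before adapting it.

```sas
/* Hypothetical dataset, key, and column names: replace them with the ones
   in your generated map. Assumes the YOUTUBET libref from the code above. */
proc sql;
    create table work.videos as
    select e.entry_ORDINAL,
           e.title,
           e.published,
           a.name      as uploader,
           s.viewCount as views
    from youtubet.entry as e
         left join youtubet.author as a
             on e.entry_ORDINAL = a.entry_ORDINAL
         left join youtubet.statistics as s
             on e.entry_ORDINAL = s.entry_ORDINAL;
quit;
```

Left joins keep every video even when a child dataset (such as statistics) has no matching row for it.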

My project: Medications during Pregnancy
Assess the following:
- How many videos have information on medications during pregnancy
- The source of these videos
- The popularity of these videos
- The information in the videos (manual review)
- Enter additional information in a database upon review

Challenges
- 25 videos per XML
  - You can pull in videos based on start index = SAS macro
- 500 per search term
  - Pull in the 500 videos using a SAS macro
  - Repeat using variations on each search term = 500 per each variation
  - Duplicates = that's OK, because you cast a wide net and clean these later
- 2023 search terms

    %MACRO YOUTUBE;
      /* Outer loop: steps through the search terms (2023 search terms) */
      %DO k = 1 %TO &MAXID. %BY 1;
        PROC SQL NOPRINT;
          SELECT DISTINCT LEFT(TRIM(MEDSEARCH)) INTO :MEDSEARCH
          FROM SEARCH_TERMS
          WHERE ID=&k.;
        QUIT;

        /* Inner loop: increments the start index, getting 20 videos
           per XML until 500 are retrieved */
        %DO i = 1 %TO 500 %BY 20;
          %LET URL = "https://gdata.youtube.com/feeds/api/videos?q=&medsearch%nrstr(&max)-results=20%nrstr(&start)-index=&i";

          [insert the proc http I showed before]

          PROC SQL;
            CREATE TABLE VIDEOS&i. AS
            SELECT DISTINCT [insert some SQL joins] ;
          QUIT;

          PROC SQL NOPRINT;
            SELECT COUNT(*) INTO :NOBS FROM VIDEOS&i.;
          QUIT;

          /* Move on to the next search term once a request returns no videos */
          %IF &NOBS.=0 %THEN %GOTO EXIT;

          /* Append the videos from each iteration of the inner/outer loops */
          PROC APPEND DATA=VIDEOS&i. BASE=videos FORCE; RUN;
          PROC DATASETS LIB=WORK NOLIST; DELETE VIDEOS&i.; QUIT; RUN;
        %END;
        %EXIT:
      %END;
    %MEND YOUTUBE;

    %YOUTUBE;
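The macro expects a SEARCH_TERMS dataset with ID and MEDSEARCH columns and a MAXID macro variable, none of which are shown on the slides. A minimal sketch of how these inputs might be prepared follows. The two search terms are made up for illustration, and URL-encoding the terms with the URLENCODE function is an addition assumed here so that spaces survive in the query string; it is not something the talk describes.

```sas
/* Hypothetical search terms; the talk used 2023 medication/pregnancy terms */
data search_terms;
    length medsearch $200;
    infile datalines truncover;
    input rawterm $char100.;
    id = _n_;  /* sequential ID used by the macro's WHERE ID=&k. */
    /* URLENCODE escapes spaces and reserved characters for use in a URL */
    medsearch = urlencode(strip(rawterm));
    datalines;
paracetamol pregnancy
ibuprofen while pregnant
;
run;

/* Upper bound for the outer %DO loop */
proc sql noprint;
    select max(id) into :maxid trimmed
    from search_terms;
quit;

%put NOTE: &maxid search terms loaded.;
```

With these two rows in place, the outer loop of the macro would run twice, once per search term.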

Import into ACCESS database
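One way to move the appended results into Access is PROC EXPORT, sketched below. This assumes SAS/ACCESS Interface to PC Files is licensed; the database path and table name are hypothetical, and the slides do not say which method was actually used.

```sas
/* Hypothetical database path and table name; requires
   SAS/ACCESS Interface to PC Files */
proc export data=work.videos
    outtable="Videos"
    dbms=access
    replace;
    database="C:\Current Work\YouTube.accdb";
run;
```

If SAS and Microsoft Office differ in bitness (for example 64-bit SAS with 32-bit Office), DBMS=ACCESSCS with the PC Files Server may be needed instead.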

Results

What I learned
- A lot of non-relevant videos
- Duplicates within each iteration of videos extracted
- Same video but different title
- Need to have selection criteria to identify relevant videos (e.g. some additional data cleaning)
- YouTube category variable is not reliable
- Lag time between metadata and actual time (data on YouTube)

Selection of videos
1. Medication + pregnancy search terms: n = 2023
2. YouTube API data feed: n = 97,480 records
   - Excluded: duplicate records, n = 3812
3. Distinct videos: n = 41,438 (93,668 records)
   - Excluded: medication or pregnancy search term not in title or description, n = 30,976 videos
4. Distinct videos: n = 10,462
5. Medication + pregnancy search term in title: n = 651 videos
6. Manual review
   - Excluded: not relevant videos, total = 336
     - No mention of med = 243
     - Duplicate = 70
     - Not avail./no sound = 13
     - Not English = 5
     - Legal process = 5
7. Final analyses: n = 315 videos
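The first two cleaning steps, removing duplicate records and keeping videos whose title or description mentions a search term, could be sketched as follows. The dataset and variable names (work.videos, videoid, title, description) and the search stem 'pregnan' are assumptions for illustration only; the actual selection criteria used in the project are not shown on the slides.

```sas
/* Hypothetical names: work.videos has one row per API record,
   videoid identifies a video, title and description are character */
proc sort data=work.videos out=work.videos_distinct nodupkey;
    by videoid;  /* keep one record per distinct video */
run;

data work.videos_relevant;
    set work.videos_distinct;
    /* 'i' modifier = case-insensitive search; the stem is illustrative only */
    if find(title, 'pregnan', 'i') > 0
       or find(description, 'pregnan', 'i') > 0;
run;
```

A subsetting IF keeps only rows where the condition is true, so work.videos_relevant holds the candidates for manual review.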

Useful Resources
- Jason Secosky: Executing a PROC from a DATA Step. http://support.sas.com/resources/papers/proceedings12/227-2012.pdf
- George Zhu: Accessing and Extracting Data from the Internet Using SAS. http://support.sas.com/resources/papers/proceedings12/121-2012.pdf
- Eric Lewerenz: An Example of Website Screen Scraping. http://www.mwsug.org/proceedings/2009/appdev/mwsug-2009-a09.pdf
- SAS PROC HTTP: http://support.sas.com/documentation/cdl/en/proc/65145/html/default/viewer.htm#n197g47i7j66x9n15xi0gaha8ov6.htm
- Google APIs: https://developers.google.com/apis-explorer/#p/
- YouTube APIs: https://developers.google.com/youtube/

Thank you I love SAS! Questions Feedback Thumbs up? Thumbs down?

WRAP UP Survey Please complete & hand back for Lucky Draw IAPA Chapter Meeting SANZOC SAS Australia & New Zealand Online Community www.communities.sas.com Questions / Suggestions Hanlie.Myburgh@sas.com Thank you Lucky Draw Copyright 2012, SAS Institute Inc. All rights reserved.
