Extracting YouTube videos using SAS Dr Craig Hansen
Craig Hansen has been using SAS for 15 years in the health research setting. He gained his doctorate in epidemiology at the University of the Sunshine Coast and has since worked at:
  University of Queensland School of Medicine (Australia), in cardiovascular research
  United States Environmental Protection Agency (USEPA), in air pollution research
  Centers for Disease Control and Prevention (CDC, USA), in birth defects research
  Kaiser Permanente Center for Health Research (USA), on health studies using electronic medical records
He recently accepted a position at the South Australian Health and Medical Research Institute as Senior Epidemiologist within the Wardliparingga Aboriginal Health Unit.
Extracting YouTube Videos using SAS
By Craig Hansen, PhD
South Australian Health & Medical Research Institute
Overview
  YouTube API
  SAS PROC HTTP
  SAS XML Mapper
  Example: medications during pregnancy
  Questions
Have you ever wanted to scrape all the information from YouTube videos?
  Length
  Uploader
  Views
  Date
  Comments
You can, using the YouTube Data API and SAS!
  YouTube Data API (metadata for videos) -> PROC HTTP -> XML file -> XML Mapper -> SAS datasets
YouTube API
Google's APIs allow you to integrate YouTube videos and functionality into websites or applications. Examples:
  YouTube Analytics API
  YouTube Data API
  YouTube Player API
Video feed example: https://gdata.youtube.com/feeds/api/videos?q=surfing
  Returns an XML file of all the videos found with the search term "surfing"
  Only 25 videos are returned per XML file
  Only 500 videos in total can be retrieved per search term
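The two limits above mean a search has to be paged. A minimal sketch of building a paged request URL, assuming the feed accepts the max-results and start-index query parameters (the same ones used in the macro later in this talk). The macro variable names term, startidx and url are illustrative only.

```sas
/* Build a paged feed URL for one search term (sketch).              */
/* max-results and start-index are assumed paging parameters;        */
/* the macro variable names are hypothetical.                        */
%let term     = surfing;
%let startidx = 26;   /* first video of page 2 when a page holds 25  */

/* %nrstr() masks the ampersands so SAS does not try to resolve      */
/* &max and &start as macro variables.                               */
%let url = https://gdata.youtube.com/feeds/api/videos?q=&term.%nrstr(&max)-results=25%nrstr(&start)-index=&startidx.;

%put NOTE: Request URL is &url;
```

Masking the ampersands matters because an unmasked &max or &start inside the URL would trigger macro resolution warnings, or silently substitute a value if a macro variable of that name happened to exist.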
XML
  Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable
  A structured (tree-like) text that can be used to store information in a hierarchical format
  The structure is based on tags: <tag>..text..</tag>
  This is what SAS XML Mapper reads
YouTube XML schema (only a snippet)
    <?xml version='1.0' encoding='utf-8'?>
    <feed xmlns='http://www.w3.org/2005/atom'
          xmlns:opensearch='http://a9.com/-/spec/opensearch/1.1/'
          xmlns:gml='http://www.opengis.net/gml'
          xmlns:georss='http://www.georss.org/georss'
          xmlns:media='http://search.yahoo.com/mrss/'
          xmlns:batch='http://schemas.google.com/gdata/batch'
          xmlns:yt='http://gdata.youtube.com/schemas/2007'
          xmlns:gd='http://schemas.google.com/g/2005'
          gd:etag='w/"ce4eqh47ecp7ima9wxrqgeq."'>
      <id>tag:youtube.com,2008:standardfeed:global:most_popular</id>
      <updated>2008-07-18T05:00:49.000-07:00</updated>
      <category scheme='http://schemas.google.com/g/2005#kind' term='http://gdata.youtube.com/schemas/2007#video'/>
      <title>Most Popular</title>
      ...
      <entry>
        <id>tag:youtube,2008:video:ztuvgyoen_b</id>
        <published>2008-07-05T19:56:35.000-07:00</published>
        <updated>2008-07-18T07:21:59.000-07:00</updated>
        <category scheme='http://gdata.youtube.com/schemas/2007/categories.cat' term='people' label='people'/>
        <title>Shopping for Coats</title>
        <content type='application/x-shockwave-flash' src='http://www.youtube.com/v/ztuvgyoen_b?f=gdata_standard...'/>
        <link rel='alternate' type='text/html' href='https://www.youtube.com/watch?v=ztuvgyoen_b'/>
        <author>
          <name>googledevelopers</name>
          <uri>https://gdata.youtube.com/feeds/api/users/googledevelopers</uri>
          <yt:userid>_x5xg1ov2p6uzz5fsm9ttw</yt:userid>
        </author>
        ...
        <media:group>
          ...
          <yt:aspectratio>widescreen</yt:aspectratio>
          <yt:duration seconds='79'/>
          <yt:uploaded>2008-07-05T19:56:35.000-07:00</yt:uploaded>
          <yt:uploaderid>uc_x5xg1ov2p6uzz5fsm9ttw</yt:uploaderid>
          <yt:videoid>ztuvgyoen_b</yt:videoid>
        </media:group>
        <gd:rating min='1' max='5' numraters='14763' average='4.93'/>
        <yt:recorded>2008-07-04</yt:recorded>
        <yt:statistics viewcount='383290' favoritecount='7022'/>
        <yt:rating numdislikes='19257' numlikes='106000'/>
      </entry>
    </feed>
SAS PROC HTTP
  Issues Hypertext Transfer Protocol (HTTP) requests
  YouTube API search (https://developers.google.com/youtube/v3/)
    /* File that will receive the XML response */
    FILENAME myxml "C:\Current Work\YT.xml" encoding="utf-8";
    /* Send the GET request and write the response to that file */
    PROC HTTP
      out=myxml
      url="https://gdata.youtube.com/feeds/api/videos?q=surfing"
      method="get";
    RUN;
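Before mapping the file, it can help to confirm that XML actually came back. A minimal sketch that prints the start of the downloaded file to the log, reusing the myxml fileref and path from the slide above. The 200-character width and the five-line limit are arbitrary choices for illustration.

```sas
/* Peek at the start of the downloaded file to confirm XML was returned. */
FILENAME myxml "C:\Current Work\YT.xml" encoding="utf-8";

DATA _NULL_;
  /* Read at most 5 records; TRUNCOVER tolerates short lines */
  INFILE myxml LRECL=32767 TRUNCOVER OBS=5;
  INPUT line $CHAR200.;
  /* Write the first 200 characters of each record to the log */
  PUT line;
RUN;
```

If the log shows an HTML error page or nothing at all instead of a line starting with <?xml, the request failed and the XML Mapper step will not find the expected structure.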
SAS XML Mapper
  We now need to parse all the text in the XML into a dataset
  SAS XML Mapper is perfect for this: it maps all the structured text in the XML into datasets and does all the work for you!
  It reads in the XML or schema file and interprets the structure based on the tags <tag>..text..</tag>
  Download SAS XML Mapper from the SAS website
SAS XML Mapper: Create a map
  XML window
  XML Map window
  Source window
SAS XML Mapper: Create a map
  Open XML or Schema
  XML file
  The datasets it will create
  XML map generated: save this map
SAS XML Mapper: Creating Datasets
  See how the datasets are linked by ID variables
Information available in the datasets
  Feed information
  Title
  Description
  Author
  Date published
  Category
  Duration
  Number of views
  Ratings
  Comments (links to another API)
  Restrictions
  URL
SAS Code
    /* Download the feed to a local XML file */
    FILENAME test1 "C:\Current Work\Surf.xml" encoding="utf-8";
    PROC HTTP
      out=test1
      url="https://gdata.youtube.com/feeds/api/videos?q=surfing"
      method="get";
    RUN;
    /* Point the XMLV2 engine at the XML file and the saved map */
    FILENAME YOUTUBET 'C:\Current Work\Surf.xml';
    FILENAME SXLEMAP 'C:\Current Work\YouTubeGetData.map';
    LIBNAME YOUTUBET xmlv2 xmlmap=sxlemap ACCESS=READONLY;
  This will create all the linked datasets generated from the map
  Merge the datasets based on the linked IDs
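The merge step might look like the sketch below. The table names (entry, author), the key entry_ORDINAL and the columns title and name are assumptions about what a generated map would contain; check the datasets your own map actually creates. The tables are copied to WORK first because the XML engine reads sequentially, which can be a problem for a join.

```sas
/* Sketch only: entry, author, entry_ORDINAL, title and name are      */
/* hypothetical names; use the ones produced by your own XML map.     */

/* Copy the mapped tables out of the XML libref into WORK */
DATA work.entry;
  SET youtubet.entry;
RUN;

DATA work.author;
  SET youtubet.author;
RUN;

/* Join each video entry to its author on the linking ID variable */
PROC SQL;
  CREATE TABLE work.video_authors AS
  SELECT e.entry_ORDINAL,
         e.title,
         a.name AS uploader
  FROM work.entry AS e
  LEFT JOIN work.author AS a
    ON e.entry_ORDINAL = a.entry_ORDINAL;
QUIT;
```

A left join keeps every video even when no author row was mapped for it.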
My project: Medications during Pregnancy
Assess the following:
  How many videos have information on medications during pregnancy
  The source of these videos
  The popularity of these videos
  The information in the videos (manual review)
    Enter additional information in a database upon review
Challenges
  25 videos per XML file
    You can pull in further videos based on the start index = SAS macro
  500 videos per search term
    Pull in the 500 videos using a SAS macro
    Repeat using variations on each search term = 500 per variation
  Duplicates = that's OK, because you cast a wide net and clean these later
  2,023 search terms
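To make the paging concrete: at 25 videos per file and a ceiling of 500, each search term needs 20 requests, so 2,023 terms means 40,460 requests in total. A minimal sketch that lists the start-index values for one term (the dataset and variable names are illustrative):

```sas
/* Generate the start-index values needed to page through one term.  */
/* per_page and max_total come from the limits on the slide above;   */
/* the dataset and variable names are hypothetical.                  */
DATA work.pages;
  per_page  = 25;
  max_total = 500;
  DO start_index = 1 TO max_total BY per_page;
    OUTPUT;   /* one row per request: 1, 26, 51, ..., 476 */
  END;
RUN;
```

With a page size of 20, as in the macro shown in this talk, the same loop gives 25 requests per term instead of 20.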
    %MACRO YOUTUBE;
      /* Outer loop: steps through the 2,023 search terms */
      %DO k = 1 %TO &MAXID. %BY 1;
        PROC SQL NOPRINT;
          SELECT DISTINCT LEFT(TRIM(MEDSEARCH)) INTO :MEDSEARCH
          FROM SEARCH_TERMS
          WHERE ID=&k.;
        QUIT;
        /* Inner loop: increments the start index, getting 20 videos */
        /* per XML file until 500 are retrieved                      */
        %DO i = 1 %TO 500 %BY 20;
          %LET URL = "https://gdata.youtube.com/feeds/api/videos?q=&medsearch%nrstr(&max)-results=20%nrstr(&start)-index=&i";
          [insert the proc http I showed before]
          PROC SQL;
            CREATE TABLE VIDEOS&i. AS
            SELECT DISTINCT [insert some SQL joins] ;
          QUIT;
          PROC SQL NOPRINT;
            SELECT COUNT(*) INTO :NOBS FROM VIDEOS&i.;
          QUIT;
          /* Stop paging this search term once a request returns no videos */
          %IF &NOBS.=0 %THEN %GOTO EXIT;
          /* Append the videos from each iteration of the inner/outer loops */
          PROC APPEND DATA=VIDEOS&i. BASE=videos FORCE;
          RUN;
          PROC DATASETS LIB=WORK NOLIST;
            DELETE VIDEOS&i.;
          RUN;
          QUIT;
        %END;
        %EXIT:
      %END;
    %MEND YOUTUBE;
    %YOUTUBE;
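Because overlapping search terms return the same videos many times, the appended dataset needs de-duplicating afterwards. A minimal sketch, assuming the appended videos dataset carries a video identifier column; the name videoid is hypothetical.

```sas
/* Keep one record per video. videoid is an assumed column name;   */
/* use whichever ID variable your map produces for the video.      */
PROC SORT DATA=videos OUT=videos_distinct NODUPKEY;
  BY videoid;
RUN;
```

NODUPKEY keeps the first record for each videoid and writes the result to a new dataset, so the raw appended data stays available for checking.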
Import into ACCESS database
Results: What I learned
  A lot of non-relevant videos
  Duplicates within each iteration of videos extracted
    Same video but different title
  Need selection criteria to identify relevant videos (e.g. some additional data cleaning)
  YouTube category variable is not reliable
  Lag time between the metadata and actual time (data on YouTube)
Results: Flow of videos
  Medication+Pregnancy search terms, n=2,023
  YouTube API data feed, n=97,480 records
    Excluded: duplicate records, n=3,812
  Distinct videos, n=41,438 (93,668 records)
    Excluded: medication or pregnancy search term not in title or description, n=30,976 videos
  Distinct videos, n=10,462
  Medication+Pregnancy search term in title, n=651 videos
  Manual review
    Excluded: not relevant videos, total=336
      No mention of med = 243
      Duplicate = 70
      Not avail./no sound = 13
      Not English = 5
      Legal process = 5
  Final analyses, n=315 videos
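A selection rule like the title and description filter could be sketched as follows. This is a simplification, not the criteria actually used in the project: it assumes the videos dataset holds title, description and the medsearch term that retrieved each record (all hypothetical column names) and checks for the whole term as a case-insensitive substring.

```sas
/* Sketch: flag records whose search term appears in the title or    */
/* description. title, description and medsearch are assumed columns. */
DATA work.relevant;
  SET videos;
  /* The 'i' modifier makes FIND ignore case */
  in_title = (FIND(title, STRIP(medsearch), 'i') > 0);
  in_desc  = (FIND(description, STRIP(medsearch), 'i') > 0);
  /* Keep only records matching in at least one of the two fields */
  IF in_title OR in_desc;
RUN;
```

Keeping the in_title flag on the output makes it easy to narrow to title-only matches in a later step.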
Useful Resources
  Jason Secosky: Executing a PROC from a DATA Step
    http://support.sas.com/resources/papers/proceedings12/227-2012.pdf
  George Zhu: Accessing and Extracting Data from the Internet Using SAS
    http://support.sas.com/resources/papers/proceedings12/121-2012.pdf
  Eric Lewerenz: An Example of Website Screen Scraping
    http://www.mwsug.org/proceedings/2009/appdev/mwsug-2009-a09.pdf
  SAS PROC HTTP
    http://support.sas.com/documentation/cdl/en/proc/65145/html/default/viewer.htm#n197g47i7j66x9n15xi0gaha8ov6.htm
  Google APIs
    https://developers.google.com/apis-explorer/#p/
  YouTube APIs
    https://developers.google.com/youtube/
Thank you
I love SAS!
  Questions
  Feedback: Thumbs up? Thumbs down?
WRAP UP
  Survey: please complete and hand back for the Lucky Draw
  IAPA Chapter Meeting
  SANZOC: SAS Australia & New Zealand Online Community, www.communities.sas.com
  Questions / Suggestions: Hanlie.Myburgh@sas.com
  Thank you
  Lucky Draw
Copyright 2012, SAS Institute Inc. All rights reserved.