Package RedditExtractoR




Type Package
Title Reddit Data Extraction Toolkit
Version 2.0.2
Date 2015-12-05
Author Ivan Rivera <ivan.s.rivera@gmail.com>
Maintainer Ivan Rivera <ivan.s.rivera@gmail.com>
Depends R (>= 3.2.0)
Imports RJSONIO, utils, igraph, grDevices, graphics
Description Reddit is an online bulletin board and social networking website where registered users can submit and discuss content. This package uses the Reddit API to extract Reddit data. Note that due to API limitations, the number of comments that can be extracted is limited to 500 per thread. The package consists of four functions: one for extracting relevant URLs, one for extracting features out of given URLs, one that does both together, and one that constructs graphs based on the structure of a thread.
License GPL-3
RoxygenNote 5.0.1
NeedsCompilation no
Repository CRAN
Date/Publication 2015-12-05 15:42:57

R topics documented:
construct_graph
get_reddit
RedditExtractoR
reddit_content
reddit_urls
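The four functions above are designed to be chained: find thread URLs for a query, pull the comment data for one of them, then turn that thread into a graph. A minimal sketch under the documented signatures (the search term and the choice of the first result are illustrative only; live network access to the Reddit API is required, and the URL column name in the result may vary by package version):

```r
library(RedditExtractoR)

# Step 1: find thread URLs matching a search query (one page of results)
urls <- reddit_urls(search_terms = "science", page_threshold = 1)

# Step 2: extract comments and attributes of the first thread found;
# the URL column name is assumed here -- check names(urls) first
thread_url <- urls[1, "URL"]
content <- reddit_content(url = thread_url, wait_time = 2)

# Step 3: build (and plot) a graph of the thread's comment structure;
# the root node is the thread itself rather than any single comment
g <- construct_graph(content, plot = TRUE)
```

wait_time = 2 is kept explicit because the manual states it is also the minimum interval permitted by the API rate limit.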

construct_graph    Create a graph file from a single Reddit thread

Description
Create a graph object from a single Reddit thread.

Usage
construct_graph(content_data, plot = TRUE, write_to = NA)

Arguments
content_data  A data frame produced by the reddit_content function.
plot          A logical parameter indicating whether the graph is to be plotted (TRUE by default). Note that the root node corresponds to the thread itself rather than to any of the comments.
write_to      A character string specifying the path and file name where the graph object will be saved. The file must end with an extension such as *.gml. The following formats are allowed: edgelist, pajek, ncol, lgl, graphml, dimacs, gml, dot, leda.

Value
A graph object.

Examples
my_url = "reddit.com/r/web_design/comments/2wjswo/design_last_reordering_the_web_design_process/"
url_data = reddit_content(my_url)
graph_object = construct_graph(url_data)

get_reddit    Get all data attributes from a search query

Description
Get all data attributes from a search query.

Usage
get_reddit(search_terms = NA, regex_filter = "", subreddit = NA,
  cn_threshold = 1, page_threshold = 1, sort_by = "comments",
  wait_time = 2)

Arguments
search_terms    A string of terms to be searched on Reddit.
regex_filter    An optional regular expression filter that will remove URLs whose titles do not match the condition.
subreddit       An optional character string that will restrict the search to the specified subreddit.
cn_threshold    Comment count threshold that removes URLs with fewer comments than cn_threshold. 1 by default.
page_threshold  Page threshold that controls the number of pages to be searched for a given search word. 1 by default.
sort_by         Sorting parameter, either "comments" (default) or "new".
wait_time       Wait time in seconds between page requests. 2 by default, which is also the minimum (API rate limit).

Value
A data frame with the position of the comment with respect to other comments (structure), ID (id), post/thread date (post_date), comment date (comm_date), number of comments within a post/thread (num_comments), subreddit (subreddit), upvote proportion (upvote_prop), post/thread score (post_score), author of the post/thread (author), user corresponding to the comment (user), comment score (comment_score), controversiality (controversiality), comment (comment), title (title), post/thread text (post_text), URL referenced (link) and domain of the referenced URL (domain).

Examples
reddit_data = get_reddit(search_terms = "science", subreddit = "science", cn_threshold = 10)

RedditExtractoR    Reddit Data Extraction Toolkit

Description
Reddit is an online bulletin board and social networking website where registered users can submit and discuss content. This package uses the Reddit API to retrieve comments together with all corresponding attributes from Reddit threads. Note that at this stage the extraction produces a data frame with a flat structure, i.e. without preserving the order or hierarchy of individual comments. This may be addressed in the next version of this package. Also note that due to API limitations, the number of comments available for retrieval is limited to 500 per thread.

Details
Package:  RedditExtractoR
Type:     Package
Version:  1.1.1
Date:     2015-06-14
License:  GPL-3

The package contains three main functions: reddit_urls, which extracts URLs based on a search query; reddit_content, which extracts URL attributes or features; and get_reddit, which serves as a wrapper for the first two. It is, however, recommended to use reddit_urls and reddit_content directly, as they will help you refine your query.

Author(s)
Ivan Rivera
Maintainer: Ivan Rivera <ivan.s.rivera@gmail.com>

References
https://www.reddit.com/dev/api
www.reddit.com

Examples
example_urls = reddit_urls(search_terms = "science")
example_attr = reddit_content(url = "reddit.com/r/gifs/comments/39tzsy/whale_watching")
example_data = get_reddit(search_terms = "economy")

reddit_content    Extract data attributes

Description
Extract data attributes from Reddit URLs.

Usage
reddit_content(url, wait_time = 2)

Arguments
url        A string or a vector of strings with the URL addresses of the pages of interest.
wait_time  Wait time in seconds between page requests. 2 by default, which is also the minimum (API rate limit).

Value
A data frame with the position of the comment with respect to other comments (structure), ID (id), post/thread date (post_date), comment date (comm_date), number of comments within a post/thread (num_comments), subreddit (subreddit), upvote proportion (upvote_prop), post/thread score (post_score), author of the post/thread (author), user corresponding to the comment (user), comment score (comment_score), controversiality (controversiality), comment (comment), title (title), post/thread text (post_text), URL referenced (link) and domain of the referenced URL (domain).

Examples
example_attr = reddit_content(url = "reddit.com/r/gifs/comments/39tzsy/whale_watching")

reddit_urls    Return relevant Reddit URLs

Description
Returns relevant Reddit URLs for a given search query.

Usage
reddit_urls(search_terms = NA, regex_filter = "", subreddit = NA,
  cn_threshold = 0, page_threshold = 1, sort_by = "relevance",
  wait_time = 2)

Arguments
search_terms    A character string to be searched on Reddit.
regex_filter    An optional regular expression filter that will remove URLs whose titles do not match the condition.
subreddit       An optional character string that will restrict the search to the specified subreddits (separated by spaces).
cn_threshold    Comment count threshold that removes URLs with fewer comments than cn_threshold. 0 by default.
page_threshold  Page threshold that controls the number of pages to be searched for a given search word. 1 by default.
sort_by         Sorting parameter, "relevance" by default; "comments" and "new" are also accepted.
wait_time       Wait time in seconds between page requests. 2 by default, which is also the minimum (API rate limit).

Value
A data frame with URLs (links), number of comments (num_comments), title (title), date (date) and subreddit (subreddit).

Examples
example_urls = reddit_urls(search_terms = "science")
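Because reddit_urls exposes a regular-expression filter and a comment-count threshold, a noisy query can be narrowed before any thread content is downloaded. A sketch under the documented signature (the query, pattern, subreddit and thresholds are illustrative only, and live network access to the Reddit API is required):

```r
library(RedditExtractoR)

# Search two pages of results in r/datascience, keep only threads whose
# titles mention "python" and that have at least 20 comments, sorted by
# comment count
urls <- reddit_urls(search_terms   = "machine learning",
                    regex_filter   = "[Pp]ython",
                    subreddit      = "datascience",
                    cn_threshold   = 20,
                    page_threshold = 2,
                    sort_by        = "comments")
```

Filtering at this stage keeps the subsequent reddit_content calls (and the associated 2-second-per-request rate limit) to a minimum.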

Index

Topic reddit
  RedditExtractoR

construct_graph
get_reddit
RedditExtractoR
reddit_content
reddit_urls