Package urltools. October 11, 2015



Similar documents
Package searchconsoler

Package uptimerobot. October 22, 2015

Package retrosheet. April 13, 2015

Package sendmailr. February 20, 2015

Package OECD. R topics documented: January 17, Type Package Title Search and Extract Data from the OECD Version 0.2.

Package whoapi. R topics documented: June 26, Type Package Title A 'Whoapi' API Client Version Date Author Oliver Keyes

Package AzureML. August 15, 2015

Package RCassandra. R topics documented: February 19, Version Title R/Cassandra interface

Package WikipediR. January 13, 2016

LabVIEW Internet Toolkit User Guide

Package tagcloud. R topics documented: July 3, 2015

Package httprequest. R topics documented: February 20, 2015

Package R4CDISC. September 5, 2015

Specify the location of an HTML control stored in the application repository. See Using the XPath search method, page 2.

Package bizdays. March 13, 2015

HireDesk API V1.0 Developer s Guide

Package syuzhet. February 22, 2015

Package ggrepel. February 8, 2016

Package TSfame. February 15, 2013

Package sjdbc. R topics documented: February 20, 2015

INSTALLATION AND CONFIGURATION MANUAL ENCODER

Package decompr. August 17, 2016

Package cgdsr. August 27, 2015

Package wordnet. January 6, 2016

HOST EUROPE CLOUD STORAGE REST API DEVELOPER REFERENCE

Package GEOquery. August 18, 2015

MXSAVE XMLRPC Web Service Guide. Last Revision: 6/14/2012

Package LexisPlotR. R topics documented: April 4, Type Package

Package hoarder. June 30, 2015

Package pxr. February 20, 2015

Package PKI. July 28, 2015

Admin Guide Product version: Product date: November, Technical Administration Guide. General

Package ChannelAttribution

Internationalizing the Domain Name System. Šimon Hochla, Anisa Azis, Fara Nabilla

JAVASCRIPT AND COOKIES

Package optirum. December 31, 2015

Package pdfetch. R topics documented: July 19, 2015

Package RedditExtractoR

URI and UUID. Identifying things on the Web.

Package fimport. February 19, 2015

Listeners. Formats. Free Form. Formatted

Package franc. R topics documented: November 12, Title Detect the Language of Text Version 1.1.1

Architecting the Future of Big Data

Monitoring Infrastructure (MIS) Software Architecture Document. Version 1.1

2- Electronic Mail (SMTP), File Transfer (FTP), & Remote Logging (TELNET)

IBM. Implementing SMTP and POP3 Scenarios with WebSphere Business Integration Connect. Author: Ronan Dalton

Web Server Manual. Mike Burns Greg Pettyjohn Jay McCarthy November 20, 2006

Package packrat. R topics documented: March 28, Type Package

Package DCG. R topics documented: June 8, Type Package

Package TimeWarp. R topics documented: April 1, 2015

Integrating VoltDB with Hadoop

D61830GC30. MySQL for Developers. Summary. Introduction. Prerequisites. At Course completion After completing this course, students will be able to:

Package translater. R topics documented: February 20, Type Package

Writing R packages. Tools for Reproducible Research. Karl Broman. Biostatistics & Medical Informatics, UW Madison

US Secure Web API. Version 1.6.0

DataPA OpenAnalytics End User Training

Safeguard Ecommerce Integration / API

PHP Integration Kit. Version User Guide

FileMaker Server 15. Custom Web Publishing Guide

TZWorks Windows Event Log Viewer (evtx_view) Users Guide

Package PKI. February 20, 2013

Design Notes for an Efficient Password-Authenticated Key Exchange Implementation Using Human-Memorable Passwords

CGI An Example. CGI Model (Pieces)

Mail Programs. Manual

Package hive. July 3, 2015

MySQL for Beginners Ed 3

Implementing 2-Legged OAuth in Javascript (and CloudTest)

OAuth 2.0 Developers Guide. Ping Identity, Inc th Street, Suite 100, Denver, CO

Login with Amazon. Developer Guide for Websites

Introduction to Parallel Programming and MapReduce

Global Transport Secure ecommerce. Web Service Implementation Guide

Cyber Security Workshop Encryption Reference Manual

Login with Amazon. Getting Started Guide for Websites. Version 1.0

Puppet Firewall Module and Landb Integration

Package date. R topics documented: February 19, Version Title Functions for handling dates. Description Functions for handling dates.

Package tvm. R topics documented: August 21, Type Package Title Time Value of Money Functions Version Author Juan Manuel Truppia

IP Phone Service Administration and Subscription

Application Security Testing. Generic Test Strategy

Configuring and Using the TMM with LDAP / Active Directory

IP Phone Services Configuration

Course Information Course Number: IWT 1229 Course Name: Web Development and Design Foundation

Database 10g Edition: All possible 10g features, either bundled or available at additional cost.

ORCA-Registry v1.0 Documentation

MarkLogic Server. Java Application Developer s Guide. MarkLogic 8 February, Copyright 2015 MarkLogic Corporation. All rights reserved.

reference: HTTP: The Definitive Guide by David Gourley and Brian Totty (O Reilly, 2002)

SonicOS Enhanced 3.2 LDAP Integration with Microsoft Active Directory and Novell edirectory Support

The Simple Submission URL. Overview & Guide

Transcription:

Type Package Package urltools October 11, 2015 Title Vectorised Tools for URL Handling and Parsing Version 1.3.2 Date 2015-10-09 Author Oliver Keyes [aut, cre], Jay Jacobs [aut, cre], Mark Greenaway [ctb], Bob Rudis [ctb] Maintainer Oliver Keyes <ironholds@gmail.com> A toolkit for handling URLs that so far includes functions for URL encoding and decoding, parsing, and parameter extraction. All functions are designed to be both fast and entirely vectorised. It is intended to be useful for people dealing with web-related datasets, such as server-side logs, although may be useful for other situations involving large sets of URLs. License MIT + file LICENSE LazyData TRUE LinkingTo Rcpp Imports Rcpp, methods Suggests testthat, knitr BugReports https://github.com/ironholds/urltools/issues VignetteBuilder knitr NeedsCompilation yes Repository CRAN Date/Publication 2015-10-11 00:38:28 R topics documented: domain........................................... 2 fragment........................................... 3 parameters.......................................... 3 param_get.......................................... 4 1

2 domain param_remove....................................... 5 param_set.......................................... 6 path............................................. 7 port............................................. 7 scheme........................................... 8 suffix_dataset........................................ 9 suffix_extract........................................ 9 urltools........................................... 10 url_compose......................................... 10 url_decode.......................................... 11 url_parse.......................................... 12 Index 14 domain Get or set a URL s domain as in the lubridate package, individual components of a URL can be both extracted or set using the relevant function call - see the examples. domain(x) domain(x) <- value x value a URL, or vector of URLs a replacement value for x s scheme. scheme, port, path, parameters and fragment for other accessors. #Get a component example_url <- "http://cran.r-project.org/submit.html" domain(example_url) #Set a component domain(example_url) <- "en.wikipedia.org"

fragment 3 fragment Get or set a URL s fragment as in the lubridate package, individual components of a URL can be both extracted or set using the relevant function call - see the examples. fragment(x) fragment(x) <- value x value a URL, or vector of URLs a replacement value for x s fragment. scheme, domain, port, path and parameters for other accessors. #Get a component example_url <- "http://en.wikipedia.org/wiki/aaron_halfaker?debug=true#test" fragment(example_url) #Set a component fragment(example_url) <- "production" parameters Get or set a URL s parameters as in the lubridate package, individual components of a URL can be both extracted or set using the relevant function call - see the examples. parameters(x) parameters(x) <- value

4 param_get x value a URL, or vector of URLs a replacement value for x s parameters. scheme, domain, port, path and fragment for other accessors. #Get a component example_url <- "http://en.wikipedia.org/wiki/aaron_halfaker?debug=true" parameters(example_url) #[1] "debug=true" #Set a component parameters(example_url) <- "debug=false" param_get get the values of a URL s parameters URLs can have parameters, taking the form of name=value, chained together with & symbols. param_get, when provided with a vector of URLs and a vector of parameter names, will generate a data.frame consisting of the values of each parameter for each URL. param_get(urls, parameter_names) url_parameters(urls, parameter_names) Value urls a vector of URLs parameter_names a vector of parameter names a data.frame containing one column for each provided parameter name. Values that cannot be found within a particular URL are represented by an empty string. url_parse for decomposing URLs into their constituent parts and param_set for inserting or modifying key/value pairs within a query string.

param_remove 5 #A very simple example url <- "https://google.com:80/foo.php?this_parameter=selfreferencing&hiphop=awesome" parameter_values <- param_get(url, c("this_parameter","hiphop")) param_remove Remove key-value pairs from query strings URLs often have queries associated with them, particularly URLs for APIs, that look like?key=value&key=value&key=valu param_remove allows you to remove key/value pairs while leaving the rest of the URL intact. param_remove(urls, keys) urls keys a vector of URLs. These should be decoded with url_decode but don t have to have been otherwise processed. a vector of parameter keys to remove. Value the original URLs but with the key/value pairs specified by keys removed. param_set to modify values associated with keys, or param_get to retrieve those values. # Remove multiple parameters from a URL param_remove(urls = "https://en.wikipedia.org/wiki/api.php?action=list&type=query&format=json", keys = c("action","format"))

6 param_set param_set Set the value associated with a parameter in a URL s query. URLs often have queries associated with them, particularly URLs for APIs, that look like?key=value&key=value&key=valu param_set allows you to modify key/value pairs within query strings, or even add new ones if they don t exist within the URL. param_set(urls, key, value) urls key value a vector of URLs. These should be decoded (with url_decode) but do not have to have been otherwise manipulated. a string representing the key to modify the value of (or insert wholesale if it doesn t exist within the URL). a value to associate with the key. This can be a single string, or a vector the same length as urls Value the original vector of URLs, but with modified/inserted key-value pairs. param_get to retrieve the values associated with multiple keys in a vector of URLs, and param_remove to strip key/value pairs from a URL entirely. # Set a URL parameter where there s already a key for that param_set("https://en.wikipedia.org/api.php?action=query", "action", "pageinfo") # Set a URL parameter where there isn t. param_set("https://en.wikipedia.org/api.php?list=props", "action", "pageinfo")

path 7 path Get or set a URL s path as in the lubridate package, individual components of a URL can be both extracted or set using the relevant function call - see the examples. path(x) path(x) <- value x value a URL, or vector of URLs a replacement value for x s path scheme, domain, port, parameters and fragment for other accessors. #Get a component example_url <- "http://cran.r-project.org:80/submit.html" path(example_url) #Set a component path(example_url) <- "bin/windows/" port Get or set a URL s port as in the lubridate package, individual components of a URL can be both extracted or set using the relevant function call - see the examples. port(x) port(x) <- value

8 scheme x value a URL, or vector of URLs a replacement value for x s port. scheme, domain, path, parameters and fragment for other accessors. #Get a component example_url <- "http://cran.r-project.org:80/submit.html" port(example_url) #Set a component port(example_url) <- "12" scheme Get or set a URL s scheme as in the lubridate package, individual components of a URL can be both extracted or set using the relevant function call - see the examples. scheme(x) scheme(x) <- value x value a URL, or vector of URLs a replacement value for x s scheme. domain, port, path, parameters and fragment for other accessors. #Get a component example_url <- "http://cran.r-project.org/submit.html" scheme(example_url) #Set a component scheme(example_url) <- "https"

suffix_dataset 9 suffix_dataset Dataset of public suffixes This dataset contains a registry of public suffixes, as retrieved from and defined by the public suffix list. It is sorted by how many periods(".") appear in the suffix, to optimise it for suffix_extract. data(suffix_dataset) Format A vector of 7430 elements. Note Last updated 2015-05-06. suffix_extract for extracting suffixes from domain names. suffix_extract extract the suffix from domain names domain names have suffixes - common endings that people can or could register domains under. This includes things like ".org", but also things like ".edu.co". A simple Top Level Domain list, as a result, probably won t cut it. suffix_extract takes the list of public suffixes, as maintained by Mozilla (see suffix_dataset) and a vector of domain names, and produces a data.frame containing the suffix that each domain uses, and the remaining fragment. suffix_extract(domains) domains a vector of damains, from domain or url_parse. Alternately, full URLs can be provided and will then be run through domain internally.

10 url_compose Value a data.frame of four columns, "host" "subdomain", "domain" & "suffix". "host" is what was passed in. "subdomain" is the subdomain of the suffix. "domain" contains the part of the domain name that came before the matched suffix. "suffix" is, well, the suffix. suffix_dataset for the dataset of suffixes. # Using url_parse domain_name <- url_parse("http://en.wikipedia.org")$domain suffix_extract(domain_name) # Using domain() domain_name <- domain("http://en.wikipedia.org") suffix_extract(domain_name) #Using internal parsing suffix_extract("http://en.wikipedia.org") urltools Tools for handling URLs This package provides functions for URL encoding and decoding, parsing, and parameter extraction, designed to be both fast and entirely vectorised. It is intended to be useful for people dealing with web-related datasets, such as server-side logs. the package vignette. url_compose Recompose Parsed URLs Sometimes you want to take a vector of URLs, parse them, perform some operations and then rebuild them. url_compose takes a data.frame produced by url_parse and rebuilds it into a vector of full URLs (or: URLs as full as the vector initially thrown into url_parse). This is currently a beta feature; please do report bugs if you find them.

url_decode 11 url_compose(parsed_urls) parsed_urls a data.frame sourced from url_parse scheme and other accessors, which you may want to run URLs through before composing them to modify individual values. #Parse a URL and compose it url <- "http://en.wikipedia.org" url_compose(url_parse(url)) url_decode Encode or decode a URI encodes or decodes a URI/URL url_decode(urls) url_encode(urls) urls a vector of URLs to decode or encode. Details URL encoding and decoding is an essential prerequisite to proper web interaction and data analysis around things like server-side logs. The relevant IETF RfC mandates the percentage-encoding of non-latin characters, including things like slashes, unless those are reserved. Base R provides URLdecode and URLencode, which handle URL encoding - in theory. In practise, they have a set of substantial problems that the urltools implementation solves:: No vectorisation: Both base R functions operate on single URLs, not vectors of URLs. This means that, when confronted with a vector of URLs that need encoding or decoding, your only option is to loop from within R. This can be incredibly computationally costly with large datasets. url_encode and url_decode are implemented in C++ and entirely vectorised, allowing for a substantial performance improvement.

12 url_parse Value No scheme recognition: encoding the slashes in, say, http://, is a good way of making sure your URL no longer works. Because of this, the only thing you can encode in URLencode (unless you refuse to encode reserved characters) is a partial URL, lacking the initial scheme, which requires additional operations to set up and increases the complexity of encoding or decoding. url_encode detects the protocol and silently splits it off, leaving it unencoded to ensure that the resulting URL is valid. ASCII NULs: Server side data can get very messy and sometimes include out-of-range characters. Unfortunately, URLdecode s response to these characters is to convert them to NULs, which R can t handle, at which point your URLdecode call breaks. url_decode simply ignores them. a character vector containing the encoded (or decoded) versions of "urls". url_decode("https://en.wikipedia.org/wiki/file:vice_city_public_radio_%28logo%29.jpg") url_encode("https://en.wikipedia.org/wiki/file:vice_city_public_radio_(logo).jpg") ## Not run: #A demonstrator of the contrasting behaviours around out-of-range characters URLdecode("%gIL") url_decode("%gil") ## End(Not run) url_parse split URLs into their component parts url_parse takes a vector of URLs and splits each one into its component parts, as recognised by RfC 3986. url_parse(urls) urls a vector of URLs Details It s useful to be able to take a URL and split it out into its component parts - for the purpose of hostname extraction, for example, or analysing API calls. This functionality is not provided in base R, although it is provided in parse_url; that implementation is entirely in R, uses regular expressions, and is not vectorised. It s perfectly suitable for the intended purpose (decomposition in the context of automated HTTP requests from R), but not for large-scale analysis.

url_parse 13 Value a data.frame consisting of the columns scheme, domain, port, path, query and fragment. See the relevant IETF RfC for definitions. If an element cannot be identified, it is represented by an empty string. url_parameters for extracting values associated with particular keys in a URL s query string, and url_compose, which is url_parse in reverse. url_parse("https://en.wikipedia.org/wiki/article")

Index Topic datasets suffix_dataset, 9 domain, 2, 3, 4, 7 9 domain<- (domain), 2 fragment, 2, 3, 4, 7, 8 fragment<- (fragment), 3 param_get, 4, 5, 6 param_remove, 5, 6 param_set, 4, 5, 6 parameters, 2, 3, 3, 7, 8 parameters<- (parameters), 3 parse_url, 12 path, 2 4, 7, 8 path<- (path), 7 port, 2 4, 7, 7, 8 port<- (port), 7 scheme, 2 4, 7, 8, 8, 11 scheme<- (scheme), 8 suffix_dataset, 9, 9, 10 suffix_extract, 9, 9 url_compose, 10, 13 url_decode, 11 url_encode (url_decode), 11 url_parameter (param_get), 4 url_parameters, 13 url_parameters (param_get), 4 url_parse, 4, 9 11, 12 URLdecode, 11 URLencode, 11 urltools, 10 urltools-package (urltools), 10 14