
An Eprints Apache Log Filter for Non-Redundant Document Downloads by Browser Agents

Ed Sponsler
Caltech Library System
http://resolver.caltech.edu/caltechlib:spoeal04
December, 2004

Contents

1 Abstract
2 Processing Apache Log Files
  2.1 Human vs. Robot Access
    2.1.1 Scripts
3 Processing Algorithm
  3.1 Document File Format
  3.2 Successful Downloads
  3.3 Removing Redundancy
  3.4 Obtaining Document Metadata
  3.5 Preparing Output for Web Analysis Software
  3.6 Script I/O
    3.6.1 Script
  3.7 Shell Script for Batch Processing
4 Generating Statistical Reports
  4.1 Accesses vs. Submission Date Scatter Plot
  4.2 Growth per Month
  4.3 AWstats and other Web Analysis Software
  4.4 Geographical Analysis
  4.5 Archive Growth
5 Quality Control

1 Abstract

Web log files record a vast amount of information, much of which only gets in the way of meaningful observational studies of usage. It is therefore necessary to filter out the junk in a deliberate way before making statements about how the web is being used.

This report describes the methods and scripts used to accomplish Apache web log filtering and report generation. It is open to scrutiny and freely available for others to use. These methods were used to generate reports presented at the following conferences. Attendees at these conferences may be curious to know exactly how the data were generated, and that is the purpose of this report.

Douglas, Kim. (2004) Keynote: Experiences and Challenges. International Conference on Developing Digital Institutional Repositories, Hong Kong.

Van de Velde, Eric. (2004) CODA, an Eprints Open Digital Archive. International Conference on Developing Digital Institutional Repositories, Hong Kong.

Sponsler, Ed. (2004) Caltech ETD Collection Analysis: Who Accesses What and Why? The 7th International Symposium on Electronic Theses and Dissertations, U. of Kentucky.

2 Processing Apache Log Files

This study analyzes Apache combined format log files generated by various versions of Eprints (http://eprints.org) and Virginia Tech ETD-db (http://scholar.lib.vt.edu/etd-db/developer/download/) software.

2.1 Human vs. Robot Access

For the purposes of these observational studies, only log entries generated by web browsers are interesting. Known web browsers may be identified by the agent identifier recorded in the Apache log. Several resources on the web catalog Agent IDs, among them ZyTrax (http://www.zytrax.com/tech/web/browser_ids.htm), PGTS (http://www.pgts.com.au/pgtsj/pgtsj0212d.html) and Psychedelix (http://www.psychedelix.com/agents.html). The ZyTrax source is useful since each browser Agent ID is explained in detail, giving us some confidence that these strings indicate browsers and not robots. Once a list of known browser Agent IDs is prepared, it may be used by a script to cull the log entries matching one of the known browser agent identifiers.

2.1.1 Scripts

The browser Agent ID list is prepared by extracting the identifier strings from the other text in the html file and outputting them to a text file. The utility wget is used to obtain the source html from ZyTrax. The filter script parse-browser-ids.pl goes to the trouble of sorting the browser strings so that artifacts caused by human error in the html code float to the top. Examples of these artifacts are strings which begin with a space or contain misplaced tags. These are easy enough to edit by hand before the list is used by another script to parse a log file.

The log parsing script split-human-robot.pl produces three output files: log entries which match a browser agent ID (access-log-human), those which don't (access-log-robot) and those which do not contain an Agent ID (access-log-no-agent). In this way all of the original log entries may be found in one of these three files. The script also sends access-log-human to stdout to support piping the data to another script.

Usage:

$ wget http://www.zytrax.com/tech/web/browser_ids.htm
$ ./parse-browser-ids.pl < browser_ids.htm > browser-ids.dat
[Fix possible artifacts in browser-ids.dat]
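The resulting browser-ids.dat file is simply one agent identifier string per line. The entries below are hypothetical examples of the kind of strings the ZyTrax page catalogs, shown only to illustrate the file format:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0
Opera/7.54 (Windows NT 5.1; U) [en]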

Script: parse-browser-ids.pl

#!/usr/bin/perl -w
# parse-browser-ids.pl
# Extract browser agent identifier strings from the ZyTrax html source.
my (%brstrs, $brstr);
while (<>) {
    # capture the agent string inside the <p> elements that carry them
    $brstrs{$1} = 0 if (m|c[sn]">(.+?)</p>|);
}
# sorting floats artifacts (leading spaces, stray tags) to the top of the list
foreach $brstr (sort keys %brstrs) {
    print "$brstr\n";
}

Script: split-human-robot.pl

#!/usr/bin/perl -w
# split-human-robot.pl
# Split an Apache combined log into human, robot and no-agent entries.
open (ROBOT, ">access-log-robot") || die;
open (HUMAN, ">access-log-human") || die;
open (ERROR, ">access-log-no-agent") || die;
open (IDS,   "<browser-ids.dat")   || die;
my (%ids, $id);
my $match = 0;
while (<IDS>) { chomp; $ids{$_} = 1; }
while (<>) {
    if (/^.+\s"(.+)"$/) {       # the Agent ID is the last quoted field
        foreach $id (keys %ids) {
            $match = 1 if ($id eq $1);
        }
    } else {
        print ERROR;            # no Agent ID present
        next;
    }
    if ($match) {
        print;                  # to stdout, for piping to the next script
        print HUMAN;
        $match = 0;
    } else {
        print ROBOT;
    }
}

3 Processing Algorithm

The following algorithm is used after the logs have been filtered of robots. The main script also filters the logs based on the other parameters described in this section.

1) Read the next line of the log file.

2) Store the IP number if this is the first time we have seen it, otherwise go to 4). This list of IPs is used to generate the Geographical Analysis.

3) Extract the month/year from the log and add one to a counter tracking the number of times step 2) has been true for this month. This data is used in the Usage Chart to show the number of first-time visitors.

4) Extract the record identifier and IP. If this combination has never been seen before then store the combination as well as the log line itself, otherwise go to 1). The number of IPs associated with the same record identifier is calculated and provides the range (y-axis) in the Scatter Plot. The log lines collected at this step become the raw material used by AWstats to produce the web stats pages.

5) Extract the month/year and add one to the counter tracking the number of times step 4) has been true for this month. This data is used to create downloads/month in the Usage Chart.

6) Go to 1).

3.1 Document File Format

Document file types of interest are isolated from the logs by matching filenames with known file format extensions such as .pdf and .ps. Files having other extensions such as .htm, .gif or .png are ignored.
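For concreteness, here is a hypothetical combined-format entry of the kind that survives the filters described here and in the following sections; the host, record number and document name are invented:

131.215.1.10 - - [07/Dec/2004:14:32:07 -0800] "GET /archive/00000123/01/thesis.pdf HTTP/1.1" 200 1048576 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

From such a line the regular expressions in section 3.6.1 would extract the IP (131.215.1.10), the date and time parts, the record identifier (123), the pdf extension and the 200 status code.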

3.2 Successful Downloads

Log entries are discriminated based on http status code. Successful codes are accepted for analysis; all others are ignored. Accepted codes are: 200 (OK), 206 (partial content) and 304 (not modified). Ignored codes include 403 (forbidden), 404 (file not found) and 500 (internal server error). More information on http status codes is available at http://www.w3.org/protocols/rfc2616/rfc2616-sec10.html.

3.3 Removing Redundancy

A redundant download is defined as a host accessing the same record more than once. It is possible to remove these from the log file by storing the set of host IPs that have accessed each record and rejecting subsequent log entries whose source IP matches one in the set. This process is important since it removes uninteresting log entries from analysis.

There are many ways redundancy is introduced into log files. It is common for a web server to chop a large document into smaller chunks prior to downloading to the browser. When this happens, a single download event appears misleadingly as multiple downloads in the log file. There may also be cases in which a robot agent masquerades its identifier as one on the known browser list, or a user may return to the server multiple times to download the same document rather than store a local copy. Removing redundancies is a simple way to control for these confounding influences, as the sketch below illustrates.
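As a minimal illustration of the redundancy filter, separate from the full script in section 3.6.1, the sketch below keeps only the first access to each (record, IP) pair. The tab-separated input framing is an assumption made for the example, not the format the main script actually parses:

#!/usr/bin/perl -w
# Minimal sketch: pass a log line through only the first time a given
# (record id, host IP) pair is seen. Assumes each input line has already
# been reduced to "IP<tab>record_id<tab>logline" by some upstream step.
my %seen;
while (<>) {
    my ($ip, $record, $logline) = split /\t/, $_, 3;
    next if $seen{$record}{$ip}++;   # redundant: this host already fetched this record
    print $logline;                  # first access by this host: keep it
}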

3.4 Obtaining Document Metadata

A scatter plot of document downloads over submission date provides a qualitative picture of the archive's activity. The submission date is obtained by querying the archive database using the identifier parsed from the Apache log. It is useful to obtain other record metadata as well, such as title and URL.

3.5 Preparing Output for Web Analysis Software

One of the files produced by the following script is a chronologically ordered, filtered log file suitable for processing by other software such as AWstats (http://awstats.sourceforge.net/).

3.6 Script I/O

The script takes an Apache log in combined format as input. After filtering, it outputs:

1) first-time accesses, in Apache log format, to NON_REDUNDANT_LOGS;
2) redundant accesses to REDUNDANT_LOGS;
3) log entries which don't match a legitimate-access regular expression to ILLEGITIMATE_LOGS;
4) a table of the number of non-redundant accesses per document, including the document's submission date and title, to COUNT_RECORD;
5) a table of the number of accesses and number of new visitors per month to COUNT_MONTHLY;
6) a complete list of IP numbers that have downloaded at least one document to ALL_IPS;
7) a list of record identifiers missing from the archive database to MISSING_ID.

The script ensures that every log line, every unexpected regular expression mismatch and every missing record identifier ends up in an appropriate output file.

3.6.1 Script

This is the main perl script. (Download: http://resolver.caltech.edu/caltechlib:spoeal04)

Script: filter-apache.pl

#!/usr/bin/perl -w
# filter-apache.pl
# by Ed Sponsler, December, 2004
# eds@library.caltech.edu
use Time::Local;
use DBI;

#######################
# BEGIN Configuration #
#######################
my $db_name = 'databasename';
my $db_user = 'user';
my $db_pass = 'password';

open (NON_REDUNDANT_LOGS, ">access-log-non-redundant");
open (REDUNDANT_LOGS,     ">access-log-redundant");
open (ILLEGITIMATE_LOGS,  ">access-log-illegitimate");
open (ALL_IPS,            ">all-ips.dat");
open (COUNT_MONTHLY,      ">count-monthly.dat");
open (COUNT_RECORD,       ">count-record.dat");
open (MISSING_ID,         ">count-record.err");

# Regular Expression Numbered Variables
# $1 = IP
# $2 = day
# $3 = month
# $4 = year
# $5 = hour
# $6 = minute
# $7 = second
# $8 = record id
#######################################################################
# NOTE: Each regular expression should be one long line! They have been
# typeset as shown in order to fit on the page. You will need to join
# the lines, making one long line per regular expression assignment,
# prior to running the script.
#######################################################################
my $re_1 = qr{^(.+?)\s.+\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+)\s.+/archive/0*([1-9]\d*)/\d+/.+?\.(pdf|ps|PDF|PS)\s.+?\s(200|206|304)\s.+$};

my $re_2 = qr{^(.+?)\s.+\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+)\s.+/\d+/\d+/\d+/\d+/0*([1-9]\d*)/\d+/.+?\.(pdf|ps|PDF|PS)\s.+?\s(200|206|304)\s.+$};

my $re_3 = qr{^(.+?)\s.+\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+)\s.+"GET\s/(\d+)/\d+/.+?\.(pdf|ps|PDF|PS)\s.+?\s(200|206|304)\s.+$};

#####################
# END Configuration #
#####################

my %mon_1 = (Jan=>0, Feb=>1, Mar=>2, Apr=>3, May=>4,  Jun=>5,
             Jul=>6, Aug=>7, Sep=>8, Oct=>9, Nov=>10, Dec=>11);
my %mon_2 = (Jan=>'01', Feb=>'02', Mar=>'03', Apr=>'04', May=>'05', Jun=>'06',
             Jul=>'07', Aug=>'08', Sep=>'09', Oct=>'10', Nov=>'11', Dec=>'12');

#
# Process Each Line of the Apache Log File
#
while (<>) {
    if (($_ =~ $re_1) || ($_ =~ $re_2) || ($_ =~ $re_3)) {
        $month = "$4-$mon_2{$3}-01";
        $months_represented{$month} = 1;
        # count non-redundant archive visits per month
        if (! exists $all_ips{$1}) {
            $all_ips{$1} = 1;
            if (exists $count_non_redundant_visits{$month}) {
                $count_non_redundant_visits{$month} += 1;
            } else {
                $count_non_redundant_visits{$month} = 1;
            }
        }
        # store non-redundant record access logs
        if (! exists $non_redundant_record_accesses{$8}{$1}) {
            $non_redundant_record_accesses{$8}{$1} = 1;
            $epoch_secs = timelocal($7, $6, $5, $2, $mon_1{$3}, $4);
            $logs{$epoch_secs} = $_;
            # and then count non-redundant record accesses per month
            if (exists $count_non_redundant_accesses{$month}) {
                $count_non_redundant_accesses{$month} += 1;
            } else {
                $count_non_redundant_accesses{$month} = 1;
            }
        } else {
            print REDUNDANT_LOGS;
        }
    } else {
        print ILLEGITIMATE_LOGS;
    }
}

#
# Output Reports
#

# monthly totals
$accesses_sum = 0;
$visits_sum = 0;
print COUNT_MONTHLY "MONTH\tACCESSES\tACCESSES_SUM\tVISITS\tVISITS_SUM\n";
foreach $month (sort keys %months_represented) {
    $accesses_sum += $count_non_redundant_accesses{$month};
    $visits_sum   += $count_non_redundant_visits{$month};
    print COUNT_MONTHLY "$month\t$count_non_redundant_accesses{$month}\t$accesses_sum" .
          "\t$count_non_redundant_visits{$month}\t$visits_sum\n";
}

# a list of all the IPs
foreach $ip (sort keys %all_ips) {
    print ALL_IPS "$ip\n";
}

# non-redundant logs for analysis by AWstats, webalyzer, etc.
# (numeric sort on the epoch seconds gives chronological order)
foreach $epoch_secs (sort { $a <=> $b } keys %logs) {
    print NON_REDUNDANT_LOGS $logs{$epoch_secs};
}

# record totals with title and submission date
$dsn = "DBI:mysql:database=$db_name";
$dbh = DBI->connect($dsn, $db_user, $db_pass, { RaiseError => 1 });
print COUNT_RECORD "UID\tSUB_DATE\tDOWNLOADS\tTITLE\n";
foreach $id (sort keys %non_redundant_record_accesses) {
    $count_ips = scalar(keys %{ $non_redundant_record_accesses{$id} });
    $sql = "SELECT datestamp, title from archive where eprintid = $id";
    $sth = $dbh->prepare($sql);
    $sth->execute;
    $sth->bind_columns(\($sdate, $title));
    if ($sth->fetch) {
        print COUNT_RECORD "$id\t$sdate\t$count_ips\t$title\n";
    } else {
        print MISSING_ID "Record ID: $id not found!\n";
    }
}
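The main script is normally driven by the batch script in the next section, but it can also be run by hand on an already split log. A minimal invocation, assuming the script is executable and the human-only log from section 2.1.1 is at hand:

$ ./filter-apache.pl < access-log-human

The seven output files described in section 3.6 are then written to the current directory.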

3.7 Shell Script for Batch Processing

In this example, Apache logs are rotated weekly into a special directory and gzipped. The directory name is determined by a short string representing the archive, such as cstr for the Computer Science Technical Reports archive. Given this string as its only argument, the shell script knows where to find the Apache logs for that archive and what to name the various output files. The shell variable DATA stores the path where you would like to store the output files and LIB is the path to the scripts. Be sure split-human-robot.pl is able to find browser-ids.dat.

Usage:

$ ./get_log_proc.sh cstr

#!/bin/bash
# get_log_proc.sh
# Shell script for processing apache log files
DATA=/home/eprints/proj-coda-analysis/current/log-proc/output
LIB=/home/eprints/proj-coda-analysis/current/log-proc/lib
find /var/log/httpd/$1/archive/ -name access-log\* | \
    xargs gunzip -c | \
    $LIB/split-human-robot.nov2004.pl | \
    $LIB/filter-apache.nov2004.pl

mv access-log-human         $DATA/$1.human
mv access-log-robot         $DATA/$1.robot
mv access-log-no-agent      $DATA/$1.no-agent
mv access-log-illegitimate  $DATA/$1.illegitimate
mv access-log-non-redundant $DATA/$1.non-redundant
mv access-log-redundant     $DATA/$1.redundant
mv all-ips.dat              $DATA/$1.all-ips
mv count-monthly.dat        $DATA/$1.count-monthly
mv count-record.dat         $DATA/$1.count-record
mv count-record.err         $DATA/$1.count-record.err

4 Generating Statistical Reports

At this point we have various output files useful for generating charts or for further analysis by other applications such as AWstats. The underlying theme for all the data is to count only document downloads by humans (web browsers), counting any particular record at most once per host IP.

4.1 Accesses vs. Submission Date Scatter Plot

In this view, the number of accesses per document is plotted over the date of submission to observe the archive's activity. To generate this report, import the data from the count-record.dat file into Excel and create a scatter plot using the submission date as the x-axis and the access count per record as the y-axis.

4.2 Growth per Month

How many documents are submitted per month into each archive? This data is contained in count-monthly.dat.

4.3 AWstats and other Web Analysis Software

Web analysis software packages such as AWstats may use any of the various Apache log output files: access-log-human, access-log-robot or access-log-non-redundant.

4.4 Geographical Analysis

The all-ips.dat file is useful if you are interested in knowing where your users are coming from. A neat utility called geoip-lookup is freely available as a perl module (http://www.maxmind.com/app/perl). Its database associates IP number ranges with registrant information, such as the country in which the IP registrant is located. Once the perl module is installed and the database is downloaded, the following simple script will turn your list of IP numbers (all-ips.dat) into a table of the number of accesses per country.

#!/usr/bin/perl -w
while (<>) {
    chomp;                                  # strip the newline from the IP
    $country = `geoip-lookup -l $_`;
    chomp($country);
    if (exists $count_ip{$country}) { $count_ip{$country} += 1; }
    else                            { $count_ip{$country}  = 1; }
}
foreach $country (sort keys %count_ip) {
    print "$country\t$count_ip{$country}\n";
}
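Saved under a name of your choosing, say ip-to-country.pl (the file name here is an assumption for the example), the script is run over the IP list produced in section 3.6:

$ ./ip-to-country.pl < all-ips.dat > country-counts.dat

The resulting two-column table can be imported into a spreadsheet for charting.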

4.5 Archive Growth

Another way to look at archive activity is to chart the number of deposits made per month. This has nothing to do with Apache log files, but can be used together with the other reports. The following script simply looks up all of the datestamp values in each of the supplied eprints databases and produces a report of the number of deposits per month. It takes no input and outputs to stdout. You will of course need to supply the eprints database names, user and password.

#!/usr/bin/perl -w
use DBI;
my @dbs = ('eprints_database1', 'eprints_database2');
my $db_user = 'user';
my $db_pass = 'password';
foreach $db (@dbs) {
    $dsn = "DBI:mysql:database=$db";
    $dbh = DBI->connect($dsn, $db_user, $db_pass, { RaiseError => 1 });
    $sql = "select datestamp from archive";
    $sth = $dbh->prepare($sql);
    $sth->execute;
    $sth->bind_columns(\($sdate));
    while ($sth->fetch) {
        if ($sdate =~ /(\d+-\d+)-\d+/) {    # keep the year-month portion
            if (exists $dat{$db}{$1}) { $dat{$db}{$1} += 1; }
            else                      { $dat{$db}{$1}  = 1; }
        }
    }
}
# header row: one column per database
foreach $db (sort keys %dat) { print "\t$db"; }
# collect every month seen in any database
foreach $db (sort keys %dat) {
    foreach $sdate (sort keys %{ $dat{$db} }) { $dates{$sdate} = 1; }
}
# one row per month, one deposit count per database
foreach $date (sort keys %dates) {
    print "\n$date";
    foreach $db (sort keys %dat) {
        if (exists $dat{$db}{$date}) { print "\t$dat{$db}{$date}"; }
        else                         { print "\t0"; }
    }
}
print "\n";
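The output is a tab-separated pivot table, one row per month and one column per database. With the placeholder database names above it would look something like the following, where the counts are invented purely for illustration:

	eprints_database1	eprints_database2
2004-09	12	3
2004-10	15	7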

5 Quality Control

It is important to survey the output files to fully appreciate what the scripts are doing. Every possible outcome is accounted for: every log line input will find a home in one of the various output log files, and all regular expression mismatches and failed sql database lookups end up in one file or another. After you run the scripts, glance over all of the output files to verify that this method accurately describes the observed usage.