and Transaction Log Analysis Bibliometrics Citation Analysis Transaction Log Analysis
Definitions: Quantitative study of literatures as reflected in bibliographies Use of quantitative analysis and statistics to describe patterns of publication within a given field or body of literature
Generally speaking, bibliometrics helps explore questions about bodies of literature and the authors that produce it: How scholarly is the cited literature? How current is the cited literature? How research oriented is it? How interdisciplinary is it? Who writes that literature? Where does the literature appear? How do other authors use that literature?
More specifically, enables investigation of basic research questions: Provide macro perspective on scientific communication Determining influence of a single author Describing relationship between two or more authors or works Demonstrating emergence of new subject fields Describing growth of literature on a subject Quantifying productivity of individual authors Measuring dispersion of articles on a subject across journals Characterizing obsolescence of literature
Findings can be applied to range of practical problems: Collection development Thesaurus development Development of indexes, abstracts, taxonomies, metadata Collection pruning Journal, database acquisition and purchasing
Two distinct bibliometric approaches have developed in parallel Analysis of distribution properties resulting in statistical laws or mathematical models Range of methods that enable specific descriptions of the content, structure, and development of research fields
Bibliometric laws Lotka's Law of scientific productivity Describes the frequency of publication by authors in a given field Demonstrates that only a small percentage of authors in a field are highly productive
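A minimal sketch of Lotka's Law in its classic inverse-square form. The exponent of 2 is an assumption here, since the slide does not state it, and the function name and the figure of 100 single-paper authors are purely illustrative.

```python
def lotka_expected_authors(single_paper_authors: int, n_papers: int,
                           exponent: float = 2.0) -> float:
    """Expected number of authors with n_papers publications.

    Assumes the inverse-power form: authors(n) = authors(1) / n**exponent.
    """
    return single_paper_authors / (n_papers ** exponent)


# Hypothetical field with 100 authors who each published exactly one paper.
for n in (1, 2, 3, 10):
    print(n, round(lotka_expected_authors(100, n), 1))
```

Under these assumptions the expected count falls off quickly as the number of papers rises, which is the "small percentage of highly productive authors" pattern the law describes.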
Bibliometric laws Bradford's Law of Core and Scatter in Journals Demonstrates that a small portion of journals in a field contains a substantial portion of relevant articles in the field Journals in a single field can be divided into three parts, each containing the same number of articles:
1. A core of journals, few in number, that produces one-third of all the articles
2. A second zone, containing the same number of articles as the first, but a greater number of journals
3. A third zone, containing the same number of articles as the second, but a still greater number of journals
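The three-zone division can be sketched as follows. This is one simple way to partition a ranked journal list, not a standard implementation; the journal names and article counts are hypothetical.

```python
def bradford_zones(articles_per_journal: dict[str, int],
                   n_zones: int = 3) -> list[list[str]]:
    """Split journals, ranked by productivity, into zones of roughly equal article counts."""
    ranked = sorted(articles_per_journal.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(count for _, count in ranked)
    zones: list[list[str]] = [[] for _ in range(n_zones)]
    if total == 0:
        return zones
    running = 0
    for journal, count in ranked:
        # The zone is decided by the cumulative share reached before this journal.
        zone = min(int(running * n_zones / total), n_zones - 1)
        zones[zone].append(journal)
        running += count
    return zones


# Hypothetical counts: 36 articles spread over 7 journals.
sample = {"A": 12, "B": 6, "C": 6, "D": 3, "E": 3, "F": 3, "G": 3}
for number, zone in enumerate(bradford_zones(sample), start=1):
    print(f"Zone {number}: {len(zone)} journals", zone)
```

In this toy data each zone holds 12 articles, while the number of journals per zone grows from 1 to 2 to 4, mirroring the core-and-scatter pattern.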
Bibliometric laws Zipf's Law of Word Frequency Predicts the frequency of words within a text Based on ranking words in decreasing order of frequency of occurrence
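A short sketch of the ranking step behind Zipf's Law: count the words, then list them by decreasing frequency. The tokenizing rule (lowercase letters and apostrophes only) and the sample sentence are assumptions made for illustration.

```python
import re
from collections import Counter


def rank_frequency(text: str) -> list[tuple[int, str, int]]:
    """Return (rank, word, frequency) tuples, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]


# Hypothetical sample text.
for rank, word, freq in rank_frequency("the cat and the dog and the bird"):
    print(rank, word, freq)
```

On a large text, Zipf's Law predicts that rank multiplied by frequency stays roughly constant; a sample this small only demonstrates the ranking itself.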
Citation analysis Tool to identify core sets of articles, authors, or journals of particular fields of study, and to describe relationships and trends within and between these entities When one author cites another author, a relationship is established between: Authors Journals, publishers Disciplines, fields, subject areas Keywords Institutions, countries, languages Citations both from and to a given work can be the unit of analysis
Citation analysis Three distinct approaches Co-citation analysis Bibliographic coupling Co-word analysis
Co-citation analysis Method used to establish a subject similarity between two documents Number of times two documents are jointly cited in other documents If papers A and B are both cited by paper C, they can be said to be related to one another, even though they don't directly cite each other The more papers that cite both A and B, the stronger their relationship is Can be used to map the topical relatedness of clusters of authors, journals or articles Can also be based on authors or journals as units of analysis
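Co-citation counting can be sketched in a few lines. The input structure (a mapping from each citing paper to its reference list) and the paper labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists: dict[str, list[str]]) -> Counter:
    """Count how many citing papers cite each pair of documents together."""
    counts: Counter = Counter()
    for cited in reference_lists.values():
        # Each unordered pair in one reference list is co-cited once by that paper.
        for pair in combinations(sorted(set(cited)), 2):
            counts[pair] += 1
    return counts


# Hypothetical citing papers C, D, F and the documents they cite.
references = {"C": ["A", "B"], "D": ["A", "B", "E"], "F": ["A", "E"]}
for pair, strength in sorted(cocitation_counts(references).items()):
    print(pair, strength)
```

Here A and B are co-cited by two papers (C and D), so their co-citation strength is 2. The same function works with authors or journals as the unit of analysis if the lists hold those instead.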
Co-citation analysis Influential Authors in LIS 2000-2002 - A First Author Co-citation Map http://www.umu.se/inforsk/Imageindexing/imageindex.htm
Co-citation analysis AuthorLink Co-citation Map http://faculty.cis.drexel.edu/~xlin/authorlink.html
Bibliographic coupling Assumes two documents that both cite the same document have something in common Links two papers that cite the same articles, so that if papers A and B both cite paper C, they may be said to be related, even though they do not directly cite each other The more papers they both cite, the stronger their relationship is
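Bibliographic coupling looks at what two papers cite, rather than at who cites them. A minimal sketch, with hypothetical paper labels and reference lists:

```python
from itertools import combinations


def coupling_strength(refs_a: list[str], refs_b: list[str]) -> int:
    """Number of references two papers have in common."""
    return len(set(refs_a) & set(refs_b))


def coupling_matrix(reference_lists: dict[str, list[str]]) -> dict[tuple[str, str], int]:
    """Coupling strength for every pair of papers that shares at least one reference."""
    strengths = {}
    for a, b in combinations(sorted(reference_lists), 2):
        shared = coupling_strength(reference_lists[a], reference_lists[b])
        if shared:
            strengths[(a, b)] = shared
    return strengths


# Hypothetical papers and the documents they cite.
references = {"P1": ["X", "Y"], "P2": ["X", "Y", "Z"], "P3": ["Z"]}
print(coupling_matrix(references))
```

P1 and P2 share two references, so they are more strongly coupled than P2 and P3, which share one. P1 and P3 share none and do not appear in the result.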
Co-word analysis Based on analysis of co-occurrence of keywords used to index documents Useful for: Mapping the content of research in a field Creating indexes or thesauri for a given subject domain Supplementing search terms in information retrieval systems
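One way co-word data might supplement search terms is to suggest keywords that frequently co-occur with a term the user has entered. The sketch below assumes each document is represented by its list of index keywords; the keywords and function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(indexed_docs: list[list[str]]) -> dict[str, Counter]:
    """For each keyword, count how often every other keyword indexes the same document."""
    related: dict[str, Counter] = defaultdict(Counter)
    for keywords in indexed_docs:
        for a, b in permutations(set(keywords), 2):
            related[a][b] += 1
    return related


def suggest_terms(related: dict[str, Counter], term: str, limit: int = 3) -> list[str]:
    """Most frequently co-occurring keywords, ties broken alphabetically."""
    ranked = sorted(related.get(term, Counter()).items(),
                    key=lambda kv: (-kv[1], kv[0]))
    return [keyword for keyword, _ in ranked[:limit]]


# Hypothetical index keywords for three documents.
docs = [
    ["back pain", "physical therapy", "posture"],
    ["back pain", "lifting", "posture"],
    ["back pain", "physical therapy"],
]
related = build_cooccurrence(docs)
print(suggest_terms(related, "back pain", limit=2))
```

The same co-occurrence counts could feed a map of a field's content or a first draft of thesaurus relationships.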
Co-word analysis ConceptLink http://faculty.cis.drexel.edu/~xlin/conceptlink.html [Figure: a concept map for the search "back pain". Four clusters are clearly seen (as indicated by the circles, which are not generated by the system; they are added here for illustration only).]
Example of practical value of citation analysis Collection development Collection planning: determine information needs, make decisions about priorities Collection implementation: organizing collection, creating useful indexing aids for finding resources Tasks require knowledge about the structure of a subject field, about information resources used, about important themes and terminology upon which the collection can be organized and indexed Co-citation analysis, bibliographic coupling, co-word analysis can each be useful: Mapping the structure and use of the relevant literature Determine terms for indexing, thesauri, searching and browsing interfaces
Measuring growth and obsolescence Use of citation data to measure half-life of articles, journals, fields Median citation age: based on publishing years of citing publications and publishing years of citations Price index: share of the citations in a publication that are at most five years old at the time of publishing Index value is a measure of the increase of publications in the subject field If a field grows 10% per year, its literature doubles in about 7 years, and about 39% of the literature was published during the past five years Humanities have a low Price index; obsolescence is slow Emerging sciences have a high Price index; obsolescence is relatively quick Can be calculated annually to demonstrate changes and trends
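The figures on this slide can be checked with a short calculation. The sketch assumes continuous exponential growth, which is the model under which a 10% rate yields about 7 years and about 39%. The function names and the sample reference years are hypothetical.

```python
import math


def price_index(citing_year: int, cited_years: list[int], window: int = 5) -> float:
    """Share of a publication's references that are at most `window` years old."""
    if not cited_years:
        raise ValueError("cited_years must not be empty")
    recent = sum(1 for year in cited_years if 0 <= citing_year - year <= window)
    return recent / len(cited_years)


def doubling_time(growth_rate: float) -> float:
    """Years for the literature to double, assuming continuous exponential growth."""
    return math.log(2) / growth_rate


def share_published_recently(growth_rate: float, years: int = 5) -> float:
    """Fraction of the whole literature published in the last `years` years."""
    return 1 - math.exp(-growth_rate * years)


print(round(doubling_time(0.10), 1))             # about 7 years
print(round(share_published_recently(0.10), 2))  # about 39%
# Hypothetical paper from 2003 citing works from 2002, 2000, 1998 and 1990.
print(price_index(2003, [2002, 2000, 1998, 1990]))
```

In the last line, three of the four references are at most five years old, so the Price index for that hypothetical paper is 0.75.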
Impact Factor Measure of the frequency with which the average article in a journal has been cited in a particular year or period
A = total citations in a year (example: 2001)
B = 2001 citations to journal (X) articles published in years 1999-2000 (subset of A)
C = number of articles published in journal (X) in years 1999-2000
D = B/C = 2001 impact factor
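The B/C calculation above can be written as a small function. The counts in the example are invented for illustration and do not describe any real journal.

```python
def impact_factor(citations_to_prior_two_years: int,
                  articles_in_prior_two_years: int) -> float:
    """Impact factor D = B / C.

    B: citations received this year to articles from the two preceding years.
    C: number of articles the journal published in those two years.
    """
    if articles_in_prior_two_years <= 0:
        raise ValueError("the journal must have published at least one article")
    return citations_to_prior_two_years / articles_in_prior_two_years


# Hypothetical journal X: 150 citations in 2001 to its 100 articles from 1999-2000.
print(impact_factor(150, 100))
```

With these made-up numbers, the average 1999-2000 article in journal X was cited 1.5 times during 2001.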
Impact Factor Provides an approximation of the prestige of journals in which individuals have been published Gives library administrators information about journals in the existing collection and journals being considered for acquisition Can be useful, but there are many caveats about its use (eliminating self-citations, variations between fields, journal coverage in ISI indexes, etc.)
Strengths of bibliometrics as a research approach Methods are objective and repeatable Results have a wide range of potential practical value Does not require human subject interaction High reliability in that data are collected unobtrusively, from the published record, and can be easily replicated by others
Limitations of bibliometrics as a research approach Results are only valid to the extent that citations are assumed to represent a significant link between citing and cited documents, a questionable assumption: Citations are made for many reasons other than topic similarity or quality Citations which should be made are often not Technical issues related to data obtained from citation indexes and bibliographies Variations and misspellings of author names, authors with the same name, incomplete coverage of non-English publications
Bibliometric methods not widely used by librarians for practical problems In recent years, however: Rapid emergence of new subject fields and interdisciplinary publications Explosive growth in number of available documents Bibliometrics provides tools that can help librarians deal with the challenges posed: Collection development, subject indexing, metadata and thesaurus creation, etc.
Bibliometric related resources ISI Web of Knowledge Simmons Libraries -> GSLIS -> Online databases pulldown menu Userid: simm23 Password: educate Try: ISI Web of Science - citations to a given article or author ISI Journal Citation Reports - Social Sciences, subject category; information & library science, sort by impact factor
Transaction Log Analysis Number of digital documents and users of those documents growing rapidly Findings from the How Much Information? project (http://www.sims.berkeley.edu/research/projects/how-much-info-2003/) New stored information grew about 30% a year between 1999 and 2002 Almost 800 MB of recorded information is produced per person each year The World Wide Web contains about 170 terabytes of information on its surface; about seventeen times the size of the Library of Congress print collections The deep Web is estimated to be 400 to 450 times larger
Transaction Log Analysis Basic concepts of bibliometrics can also be applied to patterns of usage beyond citations Transaction log analysis or webmetrics Analyzing usage patterns in a digital environment Allows range of other types of observations Citations do not necessarily reflect usage Transaction logs generally do reflect real usage Web server log analysis ILL records, circulation records Browsing data
Transaction Log Analysis Web log data One or more log files on the Web server can record: IP address of requesting computer Date and time of request Page (filename) requested Referrer page (URL of page that brought user) Web browser/operating system of requesting computer Search terms used from search engine Can also create relatively easily customized logs for a given system to gather more specific data
Transaction Log Analysis Types of possible analysis Session level: complete sequence of requests/queries by a given user Characterize actions of and information sought by user What is the user trying to accomplish? What types of things do users in aggregate try to do?
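Session-level analysis first requires grouping requests into sessions. The sketch below treats requests from one IP address as a single session until 30 minutes pass without activity. The 30-minute cutoff is a common convention assumed here, not something the slides specify, and the sample requests are hypothetical.

```python
from datetime import datetime, timedelta

Request = tuple[str, datetime, str]  # (ip, timestamp, page)


def sessionize(requests: list[Request],
               timeout: timedelta = timedelta(minutes=30)) -> list[list[Request]]:
    """Group requests into sessions by IP address and inactivity timeout."""
    sessions: list[list[Request]] = []
    open_sessions: dict[str, list[Request]] = {}  # ip -> that IP's current session
    for ip, timestamp, page in sorted(requests, key=lambda r: r[1]):
        current = open_sessions.get(ip)
        if current is None or timestamp - current[-1][1] > timeout:
            current = []
            sessions.append(current)
            open_sessions[ip] = current
        current.append((ip, timestamp, page))
    return sessions


# Hypothetical requests from two IP addresses.
day = datetime(2003, 10, 10)
requests = [
    ("192.0.2.1", day.replace(hour=10, minute=0), "/main"),
    ("192.0.2.2", day.replace(hour=10, minute=2), "/main"),
    ("192.0.2.1", day.replace(hour=10, minute=5), "/search"),
    ("192.0.2.1", day.replace(hour=11, minute=0), "/main"),
]
sessions = sessionize(requests)
print(len(sessions), [len(s) for s in sessions])
```

Because an IP address does not reliably identify one user, sessions built this way are only an approximation of individual behavior.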
Transaction Log Analysis Types of possible analysis Page/object level: access to specific pages or objects in the system Which pages are most popular? Which files, images, videos are most frequently looked at or downloaded? Errors resulting from page or resource requests Query level: how users navigate or attempt to find information or resources Which query terms are used? What combinations of terms are used? How long or short are queries?
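The query-level questions can be answered with simple counts. This sketch assumes queries are whitespace-separated terms and that uppercase AND, OR and NOT mark Boolean queries; operators count toward query length but not toward term frequencies. The sample queries are hypothetical.

```python
from collections import Counter

BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def query_stats(queries: list[str]) -> dict:
    """Summarize term usage, query length and Boolean use for a list of queries."""
    if not queries:
        raise ValueError("queries must not be empty")
    term_counts: Counter = Counter()
    lengths = []
    boolean_queries = 0
    for query in queries:
        terms = query.split()
        lengths.append(len(terms))
        if any(term in BOOLEAN_OPERATORS for term in terms):
            boolean_queries += 1
        term_counts.update(t.lower() for t in terms if t not in BOOLEAN_OPERATORS)
    return {
        "mean_terms_per_query": sum(lengths) / len(lengths),
        "boolean_share": boolean_queries / len(queries),
        "top_terms": term_counts.most_common(3),
    }


# Hypothetical queries.
stats = query_stats(["apollo moon landing", "NASA AND apollo", "classroom video"])
print(stats)
```

Page-level popularity can be tallied the same way by passing requested page names to a `Counter`.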
Transaction Log Analysis Example 1: Analyzing user queries from Excite search engine logs Jansen, Bernard J., & Amanda Spink. (2000). Methodological approach in discovering user search patterns through web log analysis: using the Excite search engine. Bulletin of the American Society for Information Science. 27, no1: 15-17. http://www.asis.org/bulletin/oct-00/janses spink.html Log of 1 million queries each in 1997 and 1999: Mean queries per user session: 4.8 in 1997, 2.0 in 1999 Mean terms per query: 2.4 in 1997, 2.35 in 1999 Users most often view at most 10 results Only about 8% of users use Boolean queries
Transaction Log Analysis Example 2: Analyzing user activity on Open Video site The open-video.org Web site redesigned in September, 2003 How are users using the redesigned site? Which pages are most popular? Which options in the search results page do they use? Log data can provide evidence upon which to make design and information architecture decisions within the Web site or digital library
Transaction Log Analysis Example 2: Analyzing user activity on Open Video site User activity in 4 months after redesign: Total of 69,589 unique visitors Total of 140,135 downloads
Page                         Views
Video Details                348,974
Search Results               276,745
Main                         150,622
Popular Video                61,429
Special Collection Details   12,227
New Video                    4,133
Project Information          4,004
Detailed Search              3,097
Special Collections          3,013
Related Video                2,842
Project News                 2,427
Random Video                 1,835
Contributing Video Info      1,503
Help on Playing Video        1,465
Project Publications         521
Browser Compatibility        390
Project Contacts             334
Transaction Log Analysis Example 2: Analyzing user activity on Open Video site User activity in 4 months after redesign: Finding video by popularity much more common than by lists of new or random video (page views: Popular Video 61,429; New Video 4,133; Random Video 1,835)
Transaction Log Analysis Example 2: Analyzing user activity on Open Video site Which options do users use to sift search results? Visual layout of results Ordering criteria Size of visible set
Transaction Log Analysis Example 2: Analyzing user activity on Open Video site Sifting options - User choice of visual layout of results options
Large thumbnails   221,540 *
Text               13,223
Small thumbnails   16,029
Thumbnails only    12,730
* Default choice
Transaction Log Analysis Example 2: Analyzing user activity on Open Video site Sifting options - User choice of ordering criteria of results
Option       # of Selections
Relevance    258,386 *
Title        3,700
Year         6,735
Duration     1,320
Popularity   6,604
* Default choice
Transaction Log Analysis Example 2: Analyzing user activity on Open Video site Sifting options - User choice of size of visible set of results
Option   # of Selections
10       252,207 *
20       4,923
30       3,600
50       5,585
100      7,350
All      10,430
* Default choice
Transaction Log Analysis Limitations of transaction log analysis Assumption that an IP address represents unique user often not true Dynamic IP addresses - same user can have different IP addresses Shared computers - different users can have same IP address Web pages can be cached, both by the client machine and by the Internet Service Provider (ISP) Do not know user motivation for page, query selection Privacy concerns - user registration can obviate variable IP address issues, but has its own issues