Report to Identify Duplicate Bibliographic Records in Local System

AMENDED TO EXCLUDE PINES-WIDE DATA
Due to Evergreen's inconsistency in its ability to handle the request for local and PINES-wide copy numbers, these instructions have been edited.

Objective: Identify duplicate bibliographic records within the local library system. This report provides a list of TCNs with title, subtitle, publication date and publisher information to help identify duplicate records by using features in Excel.

About publication dates: This report is filtered by publication date so that Evergreen can provide reasonably sized reports. Too large a report can become unwieldy in Excel because of the calculations used. The range of publication dates to use will depend on the size and age of the collection. I would run reports for less popular years by the decade; otherwise, 5-year increments are recommended.

Please note that report data will include records without publication dates. These records will continue to appear in every report, despite the publication date filter, until they are merged, replaced, etc.

Final note: These instructions are based on Excel 2007. Because of the size of the reports and the time it takes Excel to perform certain tasks, it is recommended that all other programs be closed while processing this report. Once the final report is complete, it should be a very reasonable size and multi-tasking can be resumed.
EVERGREEN REPORT TEMPLATE

Template Source: Item

Display Fields

Item Call Number/Volume Bib Record TCN Value
Item Call Number/Volume Bib Record Flattened MARC Fields Subfield
    Extracts subfield info from the 245 field so we can get subtitle info
Item Call Number/Volume Bib Record Flattened MARC Fields Normalized Value
    Extracts the text value of the subfield
Item Call Number/Volume Bib Record Simple Record Extracts Title
Item Call Number/Volume Bib Record Simple Record Extracts ISBN
Item Call Number/Volume Bib Record Simple Record Extracts Publication Yr
Item Call Number/Volume Bib Record Simple Record Extracts Publisher

Cristina Hernandez Trotter, chtrotter@ocrl.org
Base Filters

Item Is Deleted [EQUALS] False
Item Call Number/Volume Owning Library Org Unit ID [IN LIST]
    At time of creating report, select all branches within local library system
Item Copy Status Name [NOT IN LIST] Discard/Weed
Item Call Number/Volume Bib Record Simple Record Extracts Pub Yr [BETWEEN]
    Set at the time of creating report. Depending on the size and age of the region's collection, the years included will affect the size of the report and the speed at which Excel can calculate functions. Based on my experience, the more popular decades should be dealt with in 5-year increments.
Item Call Number/Volume Bib Record Flattened MARC Fields Tag [EQUALS] 245
Item Call Number/Volume Bib Record Flattened MARC Fields Subfield [IN LIST] a,b
    Although we really only want subtitle info (subfield b), we must include both title and subtitle subfields to prevent the report from excluding those bib records that do not have a subtitle.

In Excel 2007:

1. Sort data by Column A ("TCN") A-Z and then Column B ("Subfield") Z-A. (Be careful here: one is A-Z, the other Z-A. This is very important.)
2. Create a new, blank Column E by selecting the current Column E, then right-click -> Insert
   a. Give Column E the header "Find Subtitles"
   b. Insert this formula into E2
      =IF(COUNTIF(A:A,A2)>1,IF(VLOOKUP(A2,A:B,2,FALSE)="b",VLOOKUP(A2,A:C,3,FALSE)),". ")
   c. Copy E2 and paste throughout the column
3. Create a new, blank Column F by selecting the current Column F, then right-click -> Insert
   a. Give Column F the header "Full Title"
   b. Insert this formula into F2
      =CONCATENATE(D2,"",E2)
   c. Copy F2 and paste throughout the column

Please note: Depending on the size of the data report, Excel might need some time to calculate results. You can see the calculation progress in the status bar at the bottom of the Excel window.

4. To prevent Excel from having to calculate these formulas repeatedly:
   a. Select Columns E & F and copy data
   b. With the same columns selected, click Home -> Paste -> Paste Values
5. Filter results for unique TCNs:
   a. Select Column A ("TCNs")
   b. Click Data -> Advanced
   c. Check "Unique Records Only"
   d. Click OK
6. Copy and paste results into a new workbook. You will no longer need the previous file. This step helps to keep the file a reasonable size. (Of course, if the data set is tiny and your computer is having no trouble working with the data, you can just move the data to another worksheet in the same file.)

September 7th, 2010
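For readers who prefer to check the logic outside Excel, Steps 1 through 6 can be sketched in Python. This is only an illustration under stated assumptions, not part of the Excel procedure: the function name build_full_titles and the tuple layout (TCN, subfield, value, title, i.e. Columns A through D) are hypothetical, and the sketch simplifies the formulas to "append the first subfield b value if one exists, otherwise append the same fallback text the formula in E2 uses".

```python
def build_full_titles(rows, no_subtitle=". "):
    """Return {tcn: full_title}, one entry per TCN, in first-seen order.

    rows: iterable of (tcn, subfield, value, title) tuples, one per 245
    subfield row in the report (hypothetical layout for Columns A-D).
    no_subtitle: text appended when a record has no subfield b row; the
    default copies the fallback value in the E2 formula.
    """
    titles = {}
    subtitles = {}
    for tcn, subfield, value, title in rows:
        # Keep the first title seen for each TCN (the unique-TCN filter).
        titles.setdefault(tcn, title)
        # Keep the first subfield b value for each TCN (the subtitle lookup).
        if subfield == "b":
            subtitles.setdefault(tcn, value)
    # Join title and subtitle with no separator, as CONCATENATE(D2,"",E2) does.
    return {tcn: title + subtitles.get(tcn, no_subtitle)
            for tcn, title in titles.items()}


# Made-up sample rows: TCN 100 has a subtitle row, TCN 200 does not.
sample_rows = [
    ("100", "b", "a novel", "dune"),
    ("100", "a", "dune", "dune"),
    ("200", "a", "emma", "emma"),
]
for tcn, full_title in build_full_titles(sample_rows).items():
    print(tcn, repr(full_title))
```

Because the sketch looks up subtitles by TCN directly, it does not depend on the A-Z / Z-A sort order that the VLOOKUP approach requires.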
In the new workbook:

7. Name Column J "Title, Pub Yr. and Publisher"
   a. Insert the formula below in J2
      =CONCATENATE(F2, H2,I2)
   b. Copy J2 and paste throughout the column
8. Sort data by Column J ("Title, Pub Yr. and Publisher") A-Z
9. Name Column K "Probable Dup"
   a. Insert the formula below in K2
      =IF(COUNTIF(J:J,J2)>1,"yes","no")
   b. Copy K2 and paste throughout the column

Please note: Depending on the size of the data report, Excel might need some time to calculate results. You can see the calculation progress in the status bar at the bottom of the Excel window.

10. To prevent Excel from having to calculate this formula again:
   a. Select Column K and copy data
   b. With the same column selected, click Home -> Paste -> Paste Values (as in Step 4)
11. Filter Column K for rows with "yes":
   a. Select Column K
   b. Click Data -> Filter
   c. Click on the drop-down box that appears on the header row for Column K
   d. Uncheck all boxes except for "yes"
   e. Click OK

***THESE RECORDS ARE PROBABLY DUPLICATES***
These records have matching title, publisher, and publication year information. They should be double-checked in Evergreen and merged if they are truly duplicates. Of course, a few will not be duplicates. For instance, some of these records could be for individual volumes of multi-volume sets.

It is possible to continue to find possible duplicates using this worksheet: these are records where either title and publication year match OR title and publisher match. Often the publisher information will be slightly different ("Company" rather than "Co.") or one record will be missing a publication date while the duplicate record has one. While duplicates will be found, there will also be many false positives. The final list of possible duplicates will require additional visual comparison in Excel before going to Evergreen. If you are interested, please let me know and I will post additional instructions. Remember to save this file if you think you will be interested in looking for possible duplicates!

12. Copy and paste the results into a new workbook so that your final working file will be compact and contain only the data you need.
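The matching in Steps 7 through 11, and the looser matching described above, can be sketched in Python as a cross-check. This is a rough illustration under assumptions rather than the Excel procedure itself: the function names and the dictionary keys (tcn, full_title, pub_year, publisher) are hypothetical stand-ins for Columns A, F, H and I. Keys are lowercased because COUNTIF compares text without regard to case.

```python
from collections import Counter


def probable_duplicates(records):
    """TCNs whose title + pub year + publisher key occurs more than once.

    Mirrors Column J (concatenated key) and Column K (COUNTIF > 1).
    """
    def key(r):
        return (r["full_title"] + r["pub_year"] + r["publisher"]).lower()

    counts = Counter(key(r) for r in records)
    return [r["tcn"] for r in records if counts[key(r)] > 1]


def possible_duplicates(records):
    """TCNs where title + pub year match OR title + publisher match.

    Looser than probable_duplicates, so expect more false positives.
    """
    def by_year(r):
        return (r["full_title"].lower(), r["pub_year"])

    def by_publisher(r):
        return (r["full_title"].lower(), r["publisher"].lower())

    year_counts = Counter(by_year(r) for r in records)
    publisher_counts = Counter(by_publisher(r) for r in records)
    return [r["tcn"] for r in records
            if year_counts[by_year(r)] > 1
            or publisher_counts[by_publisher(r)] > 1]


# Made-up sample data for illustration only.
sample = [
    {"tcn": "1", "full_title": "dune", "pub_year": "1965", "publisher": "chilton"},
    {"tcn": "2", "full_title": "dune", "pub_year": "1965", "publisher": "chilton"},
    {"tcn": "3", "full_title": "emma", "pub_year": "1815", "publisher": "murray"},
    {"tcn": "4", "full_title": "dune", "pub_year": "1965", "publisher": "chilton co."},
]
print(probable_duplicates(sample))
print(possible_duplicates(sample))
```

In the sample, record 4 differs from records 1 and 2 only in how the publisher is written, which is the kind of near match the looser comparison is meant to surface for visual review.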