Final Report: Discovery and usage of digital items with enhanced metadata: a test case of machine-aided indexing

Material Information

Final Report: Discovery and usage of digital items with enhanced metadata: a test case of machine-aided indexing
Stapleton, Suzanne
Dinsmore, Chelsea
Van Kleeck, David
Spears, Laura
Perry, Laura
University of Florida
Physical Description:
Grant proposal,


This project complements digitization of the historic Florida Cattleman & Livestock Journal. A controlled comparison of Access Innovations' machine-aided indexing will provide a unique contribution to development of best practices for digital records metadata. Metadata in a sample of these new digital records will either include additional Dublin Core descriptions (enhanced metadata) or not (control). After publicizing the digital initiative, an assessment of discovery and usage will be conducted, using Google Analytics landing page statistics. Results of the project will inform best practices for future digital initiatives, particularly for specialized collections.
Collected for University of Florida's Institutional Repository by the UFIR Self-Submittal tool. Submitted by Danielle Sessions.
General Note:
University of Florida George A. Smathers Libraries Strategic Opportunities Program

Record Information

Source Institution:
University of Florida Institutional Repository
Holding Location:
University of Florida
Rights Management:
All rights reserved by the submitter.


Full Text


George A. Smathers Libraries Strategic Opportunities Program
Awarded Grant Final Report Form

Date: January 29, 2019
PI: Suzanne Stapleton
Co-PI: Chelsea Dinsmore
Project Title: Discovery and usage of digital items with enhanced metadata: a test case of machine-aided indexing
Funds Requested: $4,925
Cost Share: $7,079
Total Funds Expended: $12,004
Funds Remaining: none

Brief Description of Project: This project was designed to assess the benefits of enhancing metadata of digital items with machine-aided indexing. Metadata for digital records has significant impact on the records' discoverability. Academic libraries have invested much time and effort in developing digital collections and institutional repositories, and having insight into how these digital assets are discovered by search engines is necessary (Yang, 2016). Machine-aided indexing offers the potential for efficiently and comprehensively adding descriptive subject terms to digital record metadata to improve item usage. The assessment was designed to compare rates of use of new digital resources with and without expanded subject descriptions in the METS records. This technique offers a unique opportunity to inform development of best practices for metadata using patron-driven behavior. Results from this project will contribute to the broader professional discussion of best practices for digital initiatives in general and automated indexing in particular.

This research project was developed to accompany the digital initiative, Preserving Florida's Agricultural History: Digitization of the Florida Cattleman & Livestock Journal. The project was designed as one of the first test cases by an academic library of Access Innovations' Data Harmony machine-aided indexing. Results of this project contribute to the George A. Smathers Libraries' strategic directions for digital and digitized collections to create efficient systems and workflows to facilitate the creation and management of accurate and appropriate metadata for Library resources (University of Florida George A. Smathers Libraries Strategic Directions, 2014). Results of this project may also contribute to national conversations on best practices for efficient and sufficient metadata creation for digital collections.

Full proposal is available at: http://ufdc.ufl.edu/IR00010234/00001
Promotional video is available at:


Results: As a result of this Strategic Opportunities Program Grant, the project team accomplished a number of goals, developed digital skills and learned information useful to future digital initiatives.

Enhanced metadata (selection of additional Dublin Core subject descriptors):
- 283 subject terms vetted by the research project Advisory Team were added to UF's instance of the Ithaka S+R JSTOR thesaurus.
- Library staff skills were expanded in use of Access Innovations' Data Harmony software for machine-aided indexing.
- 458 names of state importance were contributed to the Name Authority thesaurus under development.
- A publicity campaign was conducted to encourage patron use of the new digital collection in UFDC in anticipation of item usage assessment. Publicity included the published article, Historic Florida Cattleman Moves Online (Appendix B), the LibGuide Preserving Florida's agricultural history at the University of Florida: Florida Cattleman & Livestock Journal, and numerous social media posts (see Appendix C).
- A rubric was developed for manual selection and prioritization of subject terms representing a journal issue. This method was developed in response to delays in access to the Access Innovations machine-aided indexing software. The rubric is used to assign quantitative values representing importance of the terms based upon the terms' locations in a specific issue of the serial. The goal of the rubric is to standardize selection of appropriate subject terms and minimize researcher bias. See Appendix D for rubric.
- Machine-aided indexing software licensed from Access Innovations generated frequency reports of subject terms from the modified JSTOR thesaurus for 30 randomly selected issues.

Impact of enhanced metadata on digital material usage: Unanticipated constraints in the framework of Sobek CM, the platform that hosts the University of Florida Digital Collections (UFDC), prevented the assessment component of this project. As a result, the question of how enhanced metadata affects usage of digital items remains unanswered. A previously working search element was discovered to be broken, most likely as a result of a recent code update to Sobek CM. The error was reported and discussed with the research team, Library IT and Digital Support Services on various dates and in various formats. Appendix E provides a detailed description of the system impediments. A request to fix Sobek has been submitted to the software development community and is under review by Library IT.


Lessons Learned: Project Team Recommendations:

(1) The research team for this project recommends that the system error in Sobek CM be corrected so that future scholars can readily access and use search results from serials in the University of Florida Digital Collections. Although the results of these limitations are documented for the Florida Cattleman & Livestock Journal, the same limitations will apply to all other digitized serial collections in UFDC, including the Florida Digital Newspaper Library. It will be in the best interests of the UF Libraries to provide additional support to DSS and Library IT to address and remedy feature limitations in Sobek CM, the platform supporting UFDC.

(2) The research team for this project recommends providing greater explanation to patrons of existing search functionality in UFDC. Specifically, the explanation should mention that Text Search searches a digital item's full text and will NOT include metadata; Advanced Search searches only metadata and will NOT include full text.

(3) The research team for this project recommends that publicity campaigns that include social media be planned as beneficial accompaniments to digitization projects. Social media campaigns with external partners should include agreement upfront to like, share and retweet posts about the project to increase visibility.

Budget:

Expense Category                                      Cost
Personnel
  OPS: Lauren Cooney, student research assistant      $4,945.15 (a)
  Stapleton, Suzanne, PI (5%)                         $3,741
  Dinsmore, Chelsea, Co-PI (1%)                       $1,064 (b)
  Van Kleeck, David (1%)                              $700 (b)
  Spears, Laura (1%)                                  $900 (b)
  Perry, Laurie (1%)                                  $600 (b)
  Phillips, Robert                                    none
  Ma, Xiaoli                                          none
Total                                                 $12,004

(a) A student research assistant, Lauren Cooney, was hired on OPS funding at $15/hour for this project for Spring and Summer 2018. The student was a senior Veterinary Medicine undergraduate whose knowledge and experience in the cattle industry were an additional benefit to this project.
(b) Cost share provided by library personnel.


Still to be completed: A final comparison of subject terms from the randomly selected issues, using Access Innovations' Data Harmony machine-aided indexing reports versus priority subject terms identified via the rubric for manual selection, is planned. It was not possible to complete this comparison due to delays in access to the Access Innovations Data Harmony software.

Updated Timeline: Activity in months 1-12
- Hired and trained OPS student research assistant
- Developed advisory team
- Initial evaluation of project thesaurus
- Identified random sample of journal issues
- Identified 283 custom terms and 458 cattle industry names to add to thesaurus and Name Authority
- Developed rubric for manual selection of priority subject terms
- Reported Sobek search functionality concerns
- Custom terms added to UF JSTOR thesaurus using Access Innovations Data Harmony
- Access Innovations Data Harmony machine-aided indexing reports generated on random sample of issues

Appendices
A. Background on digital material and machine-aided indexing
B. Copy of published article, Historic Florida Cattleman Moves Online
C. List of project publicity
D. Rubric for manual selection of enhanced metadata
E. Detailed description of structural limitations of Sobek CM


Appendix A. Background on digital material and machine-aided indexing

Background on Digital Material

The Florida Cattleman & Livestock Journal provides a unique record documenting the evolution of Florida's cattle industry, an industry influential in the historic development of the state. The journal documents changes in ranching practices through articles by UF/IFAS faculty, farmers and oral historians, and through advertisements. Influential families in Florida history are often featured, as many have been involved in cattle ranching. With the newly digitized Florida Cattleman & Livestock Journal, we anticipated that patron interest would be greatest in searching for historic production practices and names of families, farms or events. UF/IFAS scholars provide recommendations to farmers on issues of concern that persist today, such as grassland management, the role of feed additives in cattle nutrition, beef prices and marketing regulations. Examples of prominent names of interest include the Adams and Lykes families, Silver Springs Rodeo and organizations such as Block & Bridle Clubs, 4-H programs and the Florida Brangus Association (representing improved cattle breeds of Brahman and Angus heritage), among many others. This project was designed to test whether issues of the Florida Cattleman & Livestock Journal with additional subject description terms would be used more frequently by patrons using the University of Florida Digital Collections (UFDC).

Background on Machine-Aided Indexing

Machine-aided indexing compares the full text of a digital record to a controlled vocabulary (thesaurus) to produce a report on the frequency of terms. From this report we can determine which terms to add to the UFDC record in order to enhance discoverability. The first step to prepare for machine-aided indexing was to obtain an appropriate thesaurus.
Ithaka S+R's JSTOR Thesaurus was selected as the most appropriate in breadth of subject coverage, since the thesaurus would also be used in the electronic theses & dissertations (ETD) project. George A. Smathers Libraries obtained license permission to install the JSTOR thesaurus. To identify regionally important terms, the PI worked with an Advisory Team and a Vet Med student assistant, who brought greater expertise in Florida's ranching industry. Upon review of the JSTOR thesaurus, the project team determined that a significant number of terms regionally important to Florida's cattle industry were missing and should be added. In the process of identifying important production/industry terms, the project team collected important cattle family names to add to the Name Authority thesaurus in development. These names will be a valuable addition to the thesaurus, which will be used for other UFDC projects as well. Customizing the controlled vocabulary in the thesaurus enables current and future projects to identify the frequency of terms of most importance to our patrons.

Access Innovations determined that a maximum of 11 subject terms in metadata is beneficial, based upon work with other clients (Marjorie Hlava, Taxonomy Fundamentals Training 10/13/2017). The Florida Cattleman & Livestock Journal catalog record includes 3 standard Library of Congress Subject Headings (see Fig. 1), leaving a maximum of 8 additional terms. The research team elected to limit additional subject terms to a maximum of 5 per record for production/industry terms and up to 5 terms for a separate thesaurus, the Name Authority.

Fig. 1 Standard subject terms for Florida Cattleman & Livestock Journal
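
The frequency comparison described under Background on Machine-Aided Indexing can be illustrated with a minimal sketch. This is not how Data Harmony works internally; it is a simple phrase-count stand-in, and the function name, the thesaurus layout (preferred term mapped to a list of variants) and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter


def term_frequency_report(full_text, thesaurus):
    """Count occurrences of each thesaurus term (or its variants) in the text.

    `thesaurus` maps a preferred term to a list of variant phrases.
    This layout is hypothetical, not the JSTOR thesaurus format.
    Overlapping variants (e.g. "Brahman" and "Brahman cattle") would each
    be counted, so variants should not contain one another.
    """
    counts = Counter()
    for preferred, variants in thesaurus.items():
        # Lower-case and de-duplicate so case variants are not counted twice.
        phrases = {p.lower() for p in (preferred, *variants)}
        for phrase in phrases:
            pattern = r"\b" + re.escape(phrase) + r"\b"
            counts[preferred] += len(
                re.findall(pattern, full_text, flags=re.IGNORECASE)
            )
    # Keep only terms that occur; most frequent first, ties alphabetical.
    return sorted(
        ((term, n) for term, n in counts.items() if n > 0),
        key=lambda item: (-item[1], item[0]),
    )


# Invented sample text and vocabulary, for illustration only.
sample_text = (
    "Brahman breeders met at the Kissimmee Rodeo. "
    "Results of the Brahman sale follow. "
    "Screwworm control remains a concern."
)
sample_thesaurus = {
    "Brahman": [],
    "Kissimmee Rodeo": [],
    "Screwworm": ["screw worm"],
    "Grassland management": ["pasture management"],
}

for term, n in term_frequency_report(sample_text, sample_thesaurus):
    print(f"{term}: {n}")
```

A report like this would still need human review before terms are added to a record; slicing the result (for example `report[:5]`) is one way to respect the per-record cap of 5 production/industry terms chosen by the research team.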


Appendix B. Copy of published article

Stapleton, S. C. & Cooney, L. Historic Florida Cattleman Moves Online. Florida Cattleman & Livestock Journal v82(11): 100, 102-103


Appendix C. List of project publicity

Promotional Video:
- Cattleman's Convention Promotional Video, 6/19/2018. Produced and edited by L. Cooney. [MP4 Video file, 2:19]

Presentations:
- UF/IFAS Annual Beef Short Course, 5/9/2018, estimated attendance 200
- Florida Cattlemen's Association Board of Directors Annual Meeting, 6/20/2018
- Florida Cattlemen's Association Annual Convention [tabling]
- Presentations on Project Ceres award (2): USAIN presentation 5/15/2018, UF Libraries Grants Showcase 8/27/2018

LibGuide: Preserving Florida's Agricultural History at the University of Florida: Florida Cattleman & Livestock Journal. 267 views to date.

Social media presence from University of Florida George A. Smathers Libraries:
- Twitter: 9 posts from June-October 2018
- Facebook: 10 posts from June-October 2018


Appendix D. Rubric for manual selection of enhanced metadata

Rubric for manual priority selection of enhanced metadata subject description terms.

Manually selected terms must be featured in at least one of two locations, the Cover or the Table of Contents (TOC), unless otherwise suggested by the investigator as an issue-important term or author name. Note that not all issues contain text for cover topics and/or table of contents. The actual published term or phrase is captured, and investigators may identify the preferred term (e.g. stud directory may = horse breeding). If there are multiple variations of a term, the most probable search term is made preferred. Key term phrases with repeated terms within them may be consolidated to allow capture of other issue topics, e.g. Brahman breeders, Brahman directory, and Brahman sale would be lumped to simply Brahman. Terms in the standard metadata for the item (e.g. already in UFDC metadata) as LCSH or FAST will not be considered. A list of these terms is generated and location-based point values distributed.

Location Based (Category: Notes. Points value)
- Issue cover topics: Terms featured on front cover. Not applicable to all issues. Include terms from cover photo caption. +3
- Table of contents: Terms featured in the digitized TOC. Not applicable to all issues. Ignore Schedule of events and side panel TOC in UFDC. +2

Terms on the previously generated list are awarded point values according to their informative values. Additional points are added based on term properties below. All points are summed and the top 7 valued terms selected.

Term Properties (Category: Notes. Points value)
- Florida Important (Events, Names, Organizations, and Production Practices): Note: Names of Families, Ranches, and Organizations will be added to the Naming Authority vocabulary. Names of Geographical Locations will be added to the GeoThes, a thesaurus of geographical terms. These terms will not be used as subject description terms but will be graded by the same rubric and point values. Events are of national interest or unique to Florida, e.g. Kissimmee Rodeo. Production practice terms are from UF customized JSTOR vocabulary. Include terms from cover photo caption. +4
- Issue Theme: Some issues are assigned monthly or annual themes which capture overarching issue topics, e.g. Better Bulls, Annual stud directory. Not applicable to all issues. Investigators may identify an issue theme. +2
- Repetitive, Non-informative: Terms not adding to the description of the specific issue, e.g. bovine, ranching, fair, rodeo, show, sale, tour. These will generally be omitted with the initial manual selection of key terms. -2

After the grading process, the top 5 valued terms will be highlighted in each category according to their classification as either a production term (Green) or name/organization/Geoterm (Blue). Investigators may use the article author names, preferably unique, for issues lacking the minimum 5 terms in the name/organization/Geoterm category. Investigators will conduct a Quality Control of these top 5 valued production terms and top 5 names/organizations/Geoterms to confirm appropriateness. In the event of a tie, investigators will select the most appropriate subject description terms. The general guideline is to choose the more unique term, specific to the particular issue, to include in enhanced metadata. During this process, investigators will capture and add new terms with high point values into the customized JSTOR vocabulary. If a ranch or family name is partial on the cover or in the TOC, find the full name within the article and add it to the Naming Authority vocabulary. Names should be in the preferred format: Last name, Suffix; First Name. Capture names of articles about that person (e.g. Kowbelles, CowBelles).
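
The point arithmetic of the rubric can be illustrated with a short sketch. It is only an illustration under stated assumptions: the property keys, the `Candidate` class, the sample terms and their flags are all invented here, and the value for repetitive, non-informative terms is taken to be a 2-point deduction. Ties are broken alphabetically as a placeholder, whereas the rubric leaves ties to investigator judgment.

```python
from dataclasses import dataclass

# Point values from the rubric; the key names are hypothetical labels.
POINTS = {
    "cover": 3,              # issue cover topics
    "toc": 2,                # table of contents
    "florida_important": 4,  # Florida-important event, name, organization, practice
    "issue_theme": 2,        # monthly or annual issue theme
    "non_informative": -2,   # repetitive term (deduction is an assumption)
}


@dataclass
class Candidate:
    term: str
    category: str    # "production" or "name" (name/organization/Geoterm)
    properties: set  # subset of the POINTS keys


def score(candidate):
    """Sum the rubric points for every property the term has."""
    return sum(POINTS[p] for p in candidate.properties)


def select_top(candidates, per_category=5):
    """Return {category: [(term, score), ...]}, highest scores first."""
    ranked = {}
    for c in candidates:
        ranked.setdefault(c.category, []).append((c.term, score(c)))
    for category, scored in ranked.items():
        # Placeholder tie-break: alphabetical. The rubric asks investigators
        # to resolve ties by choosing the more issue-specific term.
        scored.sort(key=lambda item: (-item[1], item[0]))
        ranked[category] = scored[:per_category]
    return ranked


# Invented candidates for one hypothetical issue.
candidates = [
    Candidate("Brahman", "production", {"cover", "toc", "florida_important"}),
    Candidate("Grassland management", "production", {"toc", "florida_important"}),
    Candidate("Rodeo", "production", {"toc", "non_informative"}),
    Candidate("Lykes family", "name", {"cover", "florida_important"}),
    Candidate("Florida Brangus Association", "name", {"toc", "florida_important"}),
]

for category, terms in select_top(candidates, per_category=2).items():
    print(category, terms)
```

Keeping the point values in one table makes it easy to adjust a weight and re-rank every issue consistently, which supports the rubric's stated goal of standardizing selection and minimizing researcher bias.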


Appendix E. Detailed description of structural limitations of Sobek CM

Sobek CM provides the means to search metadata (Advanced Search) or full text (Text Search tab). Although Boolean search strings are not well supported, phrases can be searched in quotes. Results are generated rapidly and displayed with facets for further refinement. Unfortunately, searches within serials currently display results at the parent record level and not at the individual item level, which makes it functionally impossible to locate desired terms in the individual issues of Florida Cattleman & Livestock Journal in UFDC. This same limitation in functionality applies to all serials in UFDC, including for example the Florida Newspaper Project, and is under review by the Library IT department.

Alternative searching of UFDC via Google was discussed by the research team. The algorithm for prioritizing results in Google is not known, so it is unclear whether Google searches include metadata, and the frequency with which Google updates its data harvests is unknown. In order to evaluate usage of UFDC records with and without enhanced metadata, researchers would need to know whether Google searches include metadata terms and when the records are harvested, so that an assessment period could be started for issues with and without additional Dublin Core subject terms (the enhanced metadata sample).

Assessment of usage by Google Analytics is also problematic due to the current error in Sobek. Since search results default to the parent-level record, usage of individual issues is not accurately recorded. Library patrons are better served when directed to the serial issue and/or page where search terms appear.

In June, three subject terms (Palooka, Swollen joints and Florida Arcadia) were added to the metadata of the October 1944 issue of the Florida Cattleman & Livestock Journal to test search and discoverability of enhanced metadata. Metadata searches in UFDC, using Advanced Search, result in facets showing the frequency of the search term, so it appears that Sobek conducts the search for metadata terms correctly. However, the display of the results needs to be adjusted to show the instances of the term by the issues where it appears instead of by the journal title.

Full text search (Text Search) displays the actual issues with the search terms highlighted, but when the item is selected, the serial parent record, with all volumes, is displayed (Fig. 1). Advanced Search of metadata shows results in facets (Figs. 2, 3), but again when a term is selected, the user is directed to the serial parent record with no indication of the issue(s) where the search term appears (Fig. 4).

Example: Search palooka and florida cattleman


Fig. 1 Full text search (Text Search) for palooka and florida cattleman yields eight instances of Palooka, but clicking on any of the issues brings the user to the all volumes list for this serial.
Fig. 2 Search using Advanced Search feature for palooka and florida cattleman to search only for metadata.


Fig. 3 Sobek successfully searched metadata and found 1 instance of Palooka in the subject field.
Fig. 4 Clicking on Palooka in the facets brings users back to the journal title (parent record).