- Permanent Link:
- Increasing Discoverability to ETDs through Machine Assisted Indexing: A UF Case Study
- Series Title:
- USETDA 2018
- Shorey, Christy
- Publication Date:
- Physical Description:
- Conference Poster
- The University of Florida Libraries began working with Access Innovations in 2016 to build a Machine Assisted Indexing (MAI) tool for use on UF's digital collections. Using word frequency determined by a full text scan, keywords are suggested for use on titles. A rules set prioritizes words, and redirects to preferred terms, with the final step being human review. MAI helps eliminate the time needed to identify relevant keywords, and uses more natural-language terms than found in traditional cataloging. ETDs were the first collection to go through this process. This poster looks at the steps we took towards MAI, analyzes the impact on the discoverability of the ETDs within our Institutional Repository, and looks at future improvements.
- Collected for University of Florida's Institutional Repository by the UFIR Self-Submittal tool. Submitted by Christy Shorey.
- General Note:
- Presented September 13, 2018 at USETDA 2018
- Source Institution:
- University of Florida Institutional Repository
- Holding Location:
- University of Florida
- Rights Management:
- Copyright Creator/Publisher. Permission granted to University of Florida to digitize and display this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Increasing Discoverability to ETDs through Machine Assisted Indexing: A UF Case Study

By Christy Shorey and Chelsea Dinsmore

September 13, 2018

USETDA 2018, Denver, CO

Overview

The University of Florida Libraries began working with Access Innovations in 2016 to build a Machine Assisted Indexing (MAI) tool. Beginning with full text scans, the system uses word frequency to suggest keywords for titles. A rules set prioritizes words and redirects to preferred terms, with human review as the final step. MAI helps minimize the time needed to identify relevant keywords, and enables us to use more natural language terms than are typical in traditional cataloging. We chose our ETD collection as our pilot project for this process.

Pilot Project: Digital Theses and Dissertations

Why ETDs?

- UFETD collection good size for test (25,000 titles)
- Both digitized and born digital files
- Good variation in quality and formatting over the years
- 1934-2016 (at time of pilot)

How did we start?

- Training with Access Innovations, Oct. 12-18, 2017:
  - Theory of Knowledge
  - Taxonomy Fundamentals
  - How does Search Work
  - What can I do with a Taxonomy
  - Introduction to Data Harmony (the MAI tool)

Building the Thesaurus and Rules Base

Top terms found in the UFETD collection. [...] we can remove by defining more rules around the contexts where it appears.

Our Thesauri Work

- Began with a thesaurus developed for Social Studies and Humanities content
- Enhanced an existing Geographical thesaurus to cover more Florida terms
- Developed a Florida Name Authority set to better identify relevant subject terms
- Will continue as an ongoing, iterative process

Challenges

Inconsistent Metadata

- Full Metadata Fields
- Standards Vary Over Years
- Metadata Mapped to Wrong Fields
- Missing & Incorrect Metadata

OCR Quality

- Original Scan: Scans of older works vary in quality, based on factors like paper type and typist skill
- OCR quality: Poor scans can result in poor OCR. Poor OCR means the full text search for terms is ineffective
- Improved OCR: Access Innovations recreated OCR for many ETD titles

Cross System Communication

- Need:
  - Crosswalk from UFDC fields to Access Innovations fields
  - Mechanism to send XML, txt and PDF files to Access Innovations, and receive enhanced XML to ingest
- Solution: Marshaling Applications Website (MAW), an in-house tool connecting UFDC and Access Innovations

Establish Workflows

As we incorporate MAI into our daily routine, we must decide where it will best fit in the workflow.

- Current: Scan or upload item; create METS XML from existing metadata (MARC or XML); enhance metadata as needed
- Option 1: Scan or upload item; send through MAI; create METS XML from existing metadata (MARC or XML) + MAI subject terms; enhance metadata as needed
- Option 2: Scan or upload item; create METS XML from existing metadata (MARC or XML); send through MAI; enhance metadata with MAI subject terms, then as needed

Evaluation

What is it? Can we measure it?

We can measure how many records we enhanced by:

- Improving subject terms based on full text headings

How do we measure impact on users?

- Number of views and downloads?
- Increased citations?
- Other?

Lessons Learned

Success in this project is the process, not the final product.

The pilot exposed areas of our system that need more attention:

- Quality changes of OCR over time
- Flaws in system, such as bad field mapping
- Areas that need to be updated to reflect new practices
- Metadata inconsistencies

MAI is just that, machine assisted indexing, requiring humans to:

- Validate selected terms
- Create rules relevant to our collection

Future Directions

After assessing the results of our pilot project, we plan to:

- Apply MAI to more UFDC collections
- Use MAW to transfer content between systems
- Finalize workflow for enhancing existing digital records
- Develop workflow for newly digitized items

Poster online at http://ufdc.ufl.edu/IR00010552/00001
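As a rough illustration of the approach described in the overview (scan the full text, count word frequency, redirect variants to preferred terms, then hand the suggestions to a person), here is a minimal Python sketch. It is not Data Harmony's rule syntax or API, and it does not reproduce the UF rules base. The stop-word list, the `PREFERRED_TERMS` mapping, the sample text, and the `suggest_keywords` function are all hypothetical, and prioritization is approximated by raw frequency only.

```python
import re
from collections import Counter

# Hypothetical stop words and rules, for illustration only. A real rules
# base would be much larger and tuned to the collection.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "for", "on", "with", "are"}
PREFERRED_TERMS = {
    "gators": "alligators",
    "alligator": "alligators",
    "wetland": "wetlands",
}


def suggest_keywords(full_text, top_n=5):
    """Return (term, count) suggestions for a human reviewer to accept or reject."""
    words = re.findall(r"[a-z]+", full_text.lower())
    counts = Counter()
    for word in words:
        if word in STOP_WORDS:
            continue
        # Redirect variants to the preferred term before counting.
        counts[PREFERRED_TERMS.get(word, word)] += 1
    return counts.most_common(top_n)


# Made-up sample text standing in for the OCR or born-digital full text.
sample = (
    "Alligators in Florida wetlands: the alligator population of the "
    "Everglades wetland and the history of gators in Florida."
)
for term, count in suggest_keywords(sample, top_n=3):
    print(f"{term}: {count}")
```

The output is only a list of suggestions. In the workflow the poster describes, a person still validates the selected terms and writes further rules, for example to suppress a frequent term in contexts where it is not meaningful.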