Archived Documentation for the UF Digital Library Center (approx. 2007-2011): Optical Character Recognition (OCR), Interactive & Performance Media Preservation, Layered Media Preservation, and Basic Digital Curation /Data Curation
Taylor, Laurie N.
Clifton, Gus
Sullivan, Mark V.
George A. Smathers Libraries, University of Florida
Gainesville, FL
Legacy documentation from Digital Services (formerly the Digital Library Center) at the University of Florida from approx. 2007-2011, archived here for research and reference purposes. Includes documentation on OCR and text zoning as well as named entity recognition with Prime Recognition. Also includes documentation that was on web pages: Optical Character Recognition (OCR); Interactive & Performance Media Preservation; Layered Media Preservation; Basic Digital Curation /Data Curation; Preparation of Project in lieu of Thesis; Aerials (2009 LSTA grant workflow notes); Bulk Loading Internet Archive Files.

University of Florida
Applicable rights reserved.
Optical Character Recognition (OCR)

After digitally imaging text documents and conducting the necessary processing for image quality and skew, the DLC uses PrimeOCR for optical character recognition, and for image zoning if the target data is arranged in columns or tables. The OCR process recognizes the text within the images and creates plain text files from the TIFF image files.

Named Entity Recognition (NER)

The textual content is scanned against different database tables, and hits are tagged accordingly: geographic names; personal names; and company names. For example, in this text excerpt, the program tagged place names (geographical hits) with HTML to highlight them in red:

earl Hld WMOM Crossing, for the site at which the Florida Souwh was completed froalatka to Gaisesville in 188 co e e1eninsular Railroad. families of Calvin Waits and James Hawthorn. A diverse economy stimulated the development of Hawthorne The addition of an "e" to the town's name

1. Waits Baker House (606 Sid Martin Highway, U.S. 301) The original log house on this site, which was occupied in the nineteenth century by Hawthorne founder Calvin Waits and his family, had a separate kitchen and dining room to minimize the possibility of fire. Members of the Baker family have lived in it since 1909 when farmer and rural mail carrier R. B. Baker moved into the house with his new bride.

2. Hawthorne State Bank (2 N. Johnson St.) This building housed Hawthorne's first bank, which was organized in 1911 and not long after advertised that it had assets of $15,000. Francis J. Hammond leading merchant and grandson of town founder James M. Hawthorn, donated the land, and A. L. Webb, proprietor of a successful general store in Hawthorne served as first president of the institution.


3. Moore House (207 W. Lake Ave.) This house, which was built in 1911, still looks much as it did not long after Glen D. Moore purchased the house in 1913. A sleeping porch, casement windows, and a bathroom were added by Moore, whose father, William Shepard Moore, had arrived in Hawthorne from Tennessee in 1882.

4. Mahin House (301 W. Lake Ave.) The broad porch on three sides of this turn of the century dwelling provided a perfect place for occupants to sit and catch the breezes. Lottie Mahin, widow of a businessman influential in the town during the second decade of the twentieth century, lived in the house in the 1920s and rented part of it as apartments.

5. State Historical Marker (On the church grounds, corner N. Johnson St. and N.W. 3rd Ave.) A brief history of the town of Hawthorne is provided.

6. First Baptist Church (Corner N. Johnson St. and N.W. 2nd Ave.) The First Baptist Church in Hawthorne was organized in 1853. This building was erected in 1900. Gothic style windows punctuate the white horizontal clapboard siding that covers the exterior.

PrimeOCR features: six OCR engines (WordScan; TextBridge; Recore; TypeReader; OmniPage; FineReader); voting and weighting of the engine results; 11 Western languages; fault-tolerant architecture with automatic engine recovery; image pre-processing; 1 to 6 CPUs; 16 output formats including .PRO (metadata on location, confidence, etc., per character); and features for developers including an application programming interface (API) with 40 documented calls, dynamic link libraries (DLLs), and configurable initialization files (INIs).

Job Server (Batch OCR; Prioritized jobs; Network aware; Job file)

    Prime Recognition Job File
    Version=3.90
    1
    E:\Prdev\Images\UF70000002n001.tif
    E:\Prdev\Templates\UF70000002n001.ptm

Document Template file:

    Prime Recognition Document Template
    Version=3.90
    0, 1
    1,0,1,0,10,0,64,1,0,0
    4
    0,0,1,999999,0,0,0,0,
    1,0,1,999999,0,0,0,0,
    2,0,1,999999,0,0,0,0,
    3,0,1,999999,0,0,0,0,

The Job Server, always running on the remote server, reads the locations of the TIFF image and the document template from the job file. It processes the TIFF file according to the template and outputs the selected file types, PDF+Image and TXT, leaving the original, archival TIFF unchanged in its folder.

    Prime Recognition Job File
    Version=3.90
    1
    \\Smathersnt19\ScanTech\ScanQC\Complete\UF00016972\00116.tif
    \\Smathersnt19\Prdev\Templates\UF70000002n001.ptm

Image Zoning

The zoning tool features a GUI to draw zones on the image.
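The job files shown earlier in this section follow a simple line layout, so a batch of them could be written by a short script. The following is a minimal sketch in Python that only mirrors the single-image samples in this document. It is not taken from Prime Recognition's documentation, the function name build_job_file is hypothetical, and reading the third line ("1") as the number of images in the job is an assumption.

```python
from pathlib import PureWindowsPath

JOB_HEADER = "Prime Recognition Job File"
JOB_VERSION = "Version=3.90"  # version string copied from the samples in this document


def build_job_file(image_path: str, template_path: str) -> str:
    """Return job file text laid out like the single-image samples.

    Assumption: the third line ("1") is the number of images in the job.
    """
    # PureWindowsPath is used only to inspect the extension; the paths are
    # written back exactly as given, including UNC paths such as \\server\share.
    if PureWindowsPath(image_path).suffix.lower() not in {".tif", ".tiff"}:
        raise ValueError(f"expected a TIFF image, got {image_path!r}")
    if PureWindowsPath(template_path).suffix.lower() != ".ptm":
        raise ValueError(f"expected a .ptm template, got {template_path!r}")
    lines = [JOB_HEADER, JOB_VERSION, "1", image_path, template_path]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_job_file(
        r"\\Smathersnt19\ScanTech\ScanQC\Complete\UF00016972\00116.tif",
        r"\\Smathersnt19\Prdev\Templates\UF70000002n001.ptm",
    ))
```

The sketch checks only file extensions. Whether the Job Server expects a particular line ending or encoding is not stated in this document, so that would need to be confirmed against a working job file.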


Interactive & Performance Media Preservation

Standards for digital preservation of interactive and performance digital media are available through the Electronic Literature Organization (ELO):
- "Acid-Free Bits: Recommendations for Long-Lasting Electronic Literature"
- "Born-Again Bits: A Framework for Migrating Electronic Literature"

Resources
- Berkeley Art Museum, et al., "Archiving the Avant Garde: Documenting and Preserving Digital / Variable Media Art"
- Collection of articles and perspectives on archiving methods for variable media, The Variable Media Approach: Permanence through Change
- Digital Preservation at the Library of Congress
- Variable Media
- "Toward an Organic Hypertext" on Connection Muse, a system for creating standards-based hypertext poetry and fiction

Layered Media Preservation

Standards for digital preservation of interactive and performance digital media are available from texts on preservation, and the following is adapted from Erich J. Kesse's "Strategies for Microfilming Scrapbooks and Layered Objects" in RLG Archives Microfilming Manual.

Scrapbooks, diaries, and unconventional book structures often contain pages with layered or dimensional objects. Such pages include parts which display one of the following characteristics:
- Overlapping parts (e.g., page layered with memorabilia).
- Enveloped parts (e.g., a diary containing letters in pasted-down envelopes).
- Folded or hinged parts (e.g., a page containing folded newspaper clippings).
- Movable or pop-up parts (e.g., an "artist's book" or a child's pop-up book).

Steps for Digitizing:

First, film an image of the page as found, with all facets or openings on the page closed. Then, film each facet of each item in logical sequence for reading order (so this should vary as applicable): left to right and top to bottom. Within this order, overlaying items should be filmed before underlying items, and envelopes should be filmed prior to filming their contents. Every facet of an item should be filmed before the filming of another item begins, and every item on a page should be filmed before the filming of another page begins.
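The ordering rules above can be expressed as a small sorting routine, which may help when planning a shot list for a complex page. The following is a minimal sketch in Python and not part of the adapted guidance. The Item model, its row, col, and layer fields, and the function names are all hypothetical, and it assumes a default reading order of top to bottom, then left to right, which the text notes should vary as applicable.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """One object mounted on a scrapbook page (hypothetical model)."""
    name: str
    row: int        # reading-order row of the item's position (top to bottom)
    col: int        # reading-order column of the item's position (left to right)
    layer: int = 0  # 0 = topmost overlay; larger values lie underneath
    facets: List[str] = field(default_factory=lambda: ["front"])
    contents: List["Item"] = field(default_factory=list)  # e.g. a letter in an envelope


def reading_order(items: List[Item]) -> List[Item]:
    """Top to bottom, then left to right; overlays before what they cover."""
    return sorted(items, key=lambda i: (i.row, i.col, i.layer))


def capture_sequence(page: str, items: List[Item]) -> List[str]:
    """List the shots for one page in the order the steps above describe."""
    shots = [f"{page}: page as found, all facets closed"]

    def visit(item: Item) -> None:
        # Every facet of an item is captured before moving to another item.
        for facet in item.facets:
            shots.append(f"{page}: {item.name} ({facet})")
        # A container such as an envelope is captured before its contents.
        for inner in reading_order(item.contents):
            visit(inner)

    for item in reading_order(items):
        visit(item)
    return shots


if __name__ == "__main__":
    page_items = [
        Item("clipping", row=1, col=0, facets=["folded", "unfolded"]),
        Item("envelope", row=0, col=1,
             contents=[Item("letter", 0, 0, facets=["recto", "verso"])]),
        Item("photo under ticket", row=0, col=0, layer=1),
        Item("ticket", row=0, col=0, layer=0),
    ]
    for shot in capture_sequence("p. 12", page_items):
        print(shot)
```

Sorting on position first and layer second keeps overlapping items together only when they share the same row and column, so the operator assigns a covered item the position of the item that covers it.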


If a layered or dimensional item can be safely detached, remove it and film it separately from the page on which it is found. Otherwise, a blank sheet of ivory or black paper can be laid behind the individual item to mask it from the rest of the page.

Resources
- The Jackson Henson MacDonald Scrapbook and the Visionaires Scrapbook (1938-1948) are examples of basic scrapbooks: no movable parts, all images on pages, and no separation of the parts of the images on the pages, instead having only the pages with images scanned.
- The Bardin Scrapbook is a slightly more in-depth example because each scrapbook page was scanned, followed by high-resolution scans of each of the individual images on the pages.
- Creating Online Historical Scrapbooks with a User Friendly Interface: article on using Flash page flip and embedded SWF files for scrapbooks
- Testing OpenLaszlo for pop-ups and other interactive books in DHTML and Flash

Digital & Data Curation: Digital Curation

Processing Workflow

DLC
- runs virus scan
- reviews for file types, normalizing any unusable formats and/or identifying needed software
- maintains 1 complete copy until the copy in process is complete and archived
- sends files and any processing notes/information on software needed or recommended to SASC

SASC
- reviews for organization, maintaining any existing organization (folders, if intellectually grouped and significant contents), organizing other materials as appropriate based on underlying structure, and then placing other materials into a general folder(s)
- decides whether this accession is closed/1 set or if it can be added to/expanded with future donations (processing may change based on this)
- creates a finding guide/spreadsheet for all materials at the applicable level

DLC
- imports the spreadsheet
- renames all folders into BIB_VID format
- normalizes all files and file names
- properly places all dark items for dark processing (note: multiple versions of the same file are normally not kept for print collections, unless literary collections, because of space concerns; multiple versions will be retained for digital-only collections unless size/cost is an issue)

Other Processing Notes

In the past, SASC received only small portions of collections as digital and so would preserve materials by printing files. In the future, all digital files should go to DLC for digital preservation. The DLC will then also print the files for SASC as wanted by SASC (and this will likely be the case for any collections with only small portions that are digital, where having the material printed is needed for ease of use for the collection as a whole).

Data Curation

Around the world, key scientific data are at risk of being lost, either because they are held on fragile or obsolete media or because they may be destroyed by researchers who are unaware of their value. Now a team of scientists is planning to scour museums and research institutes to draw up a global inventory of threatened data. Launched on 29 October, shortly after the biennial
conference of the Committee on Data for Science and Technology in Stellenbosch, South Africa, the project aims to publish the inventory online in 2012.

The digital files [...] humanities, arts, and social sciences. Modern research requires digital tools, and the research process includes the selection, collection, organization, and analysis of digital files. Those digital files, spanning all types of content including scientific data, images of archival letters, interactive media files, and more, are all too often held in precarious ways. As such, the potential for total loss exists, and this is in addition to the lost possibilities from the materials being inaccessible.

The UF Digital Collections include digital objects created through the digitization of source materials as well as born-digital materials. All materials in the UF Digital Collections are curated to support optimal access and ensure long-term preservation. Please contact us for more information on the UF Digital Collections and services.

Data Management

NSF requires plans for data management to be included with grant proposals. The full information from NSF also links to the requirements by directorate, office, division, program, and other NSF units. The NSF grant proposal preparation instructions are copied below for ease of reference.

Plans for data management and sharing of the products of research. Proposals must include a supplementary document of no more than two pages labeled "Data Management Plan". This supplement should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results (see AAG Chapter VI.D.4), and may include:

1. the types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project;
2. the standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies);
3. policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements;
4. policies and provisions for re-use, re-distribution, and the production of derivatives; and
5. plans for archiving data, samples, and other research products, and for preservation of access to them.

Data management requirements and plans specific to the Directorate, Office, Division, Program, or other NSF unit, relevant to a proposal are available from NSF. If guidance specific to the program is not available, then the requirements established in this section apply.

Simultaneously submitted collaborative proposals and proposals that include subawards are a single unified project and should include only one supplemental combined Data Management Plan, regardless of the number of non-lead collaborative proposals or subawards included. FastLane will not permit submission of a proposal that is missing a Data Management Plan. Proposals for supplementary support to an existing award are not required to include a Data Management Plan.

A valid Data Management Plan may include only the statement that no detailed plan is needed, as long as the statement is accompanied by a clear justification. Proposers who feel that the plan cannot fit within the supplement limit of two pages may use part of the 15-page Project Description for additional data
management information. Proposers are advised that the Data Management Plan may not be used to circumvent the 15-page Project Description limitation. The Data Management Plan will be reviewed as an integral part of the proposal, coming under Intellectual Merit or Broader Impacts or both, as appropriate for the scientific community of relevance.

Resources
- Born Digital Collections: An Inter-Institutional Model for Stewardship (AIMS)
- Paradigm workbook on digital private papers
- Acid-Free Bits
- NSF: Data Management for Engineering Directorate Proposals and Awards
  - Investigators are expected to share with other researchers, at no more than incremental [...] of work under NSF grants. (Section VI.D.4.b.)
  - [...] material commonly accepted in the scientific community as necessary to validate [...]
  - The basic level of digital data to be archived and made available includes (1) analyzed data and (2) the metadata that define how these data were generated. These are data that are or that should be published in theses, dissertations, refereed journal articles, supplemental data attachments for manuscripts, books and book chapters, and other print or electronic publication formats. Analyzed data are (but are not restricted to) digital information that would be published, including digital images, published tables, and tables of the numbers used for making published graphs. Necessary metadata are (but are not restricted to) descriptions or suitable citations of experiments, apparatuses, raw materials, computational codes, and computer calculation input conditions.
  - What data are not included at the basic level? The Office of Management and Budget statement (1999) specifies that this [...] for future research, peer reviews, or communications with colleagues [...] requirements may be specified in particular NSF solicitations or result from local [...]
  - Expected Data: The Data Management Plan (DMP) should describe the types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project. It should then describe the expected types of data to be retained.
  - Period of data retention: The DMP should describe the period of data retention. Minimum data retention of research data is three years after conclusion of the award or three years after public release, whichever is later. Public release of data should be at the earliest reasonable time.
  - Data formats and dissemination: The DMP should describe the specific data formats, media, and dissemination approaches that will be used to make data available to others, including any metadata. Policies for public access and sharing should be described, including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements.
  - Data storage and preservation of access: The DMP should describe physical and cyber resources and facilities that will be used for the effective preservation and storage of research data. In collaborative proposals or proposals involving sub-awards, the lead PI is responsible for assuring data storage and access.
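The minimum retention rule quoted above (three years after conclusion of the award or three years after public release, whichever is later) is easy to misapply when the two dates are far apart, so a worked calculation may help. The following is a minimal sketch in Python. The function names are hypothetical, the dates in the example are invented for illustration, and treating a 29 February start date as ending on 28 February is an assumption, not an NSF rule.

```python
from datetime import date
from typing import Optional


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years, mapping Feb 29 to Feb 28 when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap target year (assumption: use Feb 28).
        return d.replace(year=d.year + years, day=28)


def minimum_retention_end(award_end: date,
                          public_release: Optional[date] = None,
                          years: int = 3) -> date:
    """Earliest date on which the minimum retention period has elapsed.

    Takes the later of (award end + years) and (public release + years).
    If the data have not been released yet, only the award end is used.
    """
    candidates = [add_years(award_end, years)]
    if public_release is not None:
        candidates.append(add_years(public_release, years))
    return max(candidates)


if __name__ == "__main__":
    # Illustrative dates only: award ends 31 Aug 2012, data released 15 Jan 2013.
    print(minimum_retention_end(date(2012, 8, 31), date(2013, 1, 15)))
```

This covers only the minimum stated in the guidance. A specific solicitation or institutional policy may require longer retention.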


