Digitization: Florida Digital Newspaper Library Specifications

The Florida Digital Newspaper Library accepts contributions of Florida and Florida-related newspapers and ensures their searchability and preservation. Contributions to the Library are attributed to the contributing institution, with links to the institution's web pages. All contributions to the Library are digitally archived with the state-of-the-art Florida Digital Archive to ensure long-term preservation.

The following information is provided to help libraries, newspaper publishers, and others contribute. We welcome contributions from others, so please contact us directly to help make yesterday's news readable again and to preserve all of Florida's news. In addition to the information below, please see the Florida LSTA Grant Program website, this sample LSTA grant from the Hendry County Library Cooperative to digitize the 1928-1945 issues of the Clewiston News, and information on grants that helped establish the Florida Digital Newspaper Library.

Detailed information for contributing materials to the Florida Digital Newspaper Library:

  1. Contact Us
  2. Copyright
  3. Transfer of Digital Objects to the Florida Digital Newspaper Library
  4. Attribution Requirements
  5. General Bibliographic Metadata Requirements
  6. Imaging, General Requirements
  7. Microfilm Digitization
  8. Derivatives, Part 1 – Web display images
  9. Derivatives, Part 2 – Searchable text
  10. OCR
  11. Mark-up and METS
  12. Information for State University Libraries

Contact Us

Will Canova, Coordinator for the Florida Digital Newspaper Library
E-Mail: ufdc@uflib.ufl.edu
Telephone: (352) 273-2900
FAX: (352) 846-3702
Mailing Address:
Florida Digital Newspaper Library
c/o Digital Library Center
University of Florida Libraries
P.O. Box 117007
Gainesville, FL 32611
U.S.A.
Shipping Address:
Florida Digital Newspaper Library
c/o Digital Library Center - Stop 117003
University of Florida Libraries
200 Smathers
Gainesville, FL 32611
U.S.A.

Copyright

  • Title Level Clearances
    • All titles published before 1923 are free of the following provisions.
    • Copyright clearance is the responsibility of the contributing institution. Consult your institution's legal counsel for advice.
    • Contributing institutions will be required to certify that they have cleared Internet Distribution Rights with the publisher(s) of the title(s) contributed and to hold the Florida Digital Newspaper Library, the University of Florida, and its staff harmless.
    • Sample letters and requests used by the University of Florida are available for consultation at our Newspaper Permission Letters page. Again, the contributing institution's own counsel should advise the institution.
    • The Florida Digital Newspaper Library may decline to accept any titles without proper clearance.
  • Issue and Article Level Clearances
    • All issues and articles published before 1923 are free of the following provisions.
    • A newspaper copyright holder generally may not grant permissions for articles, comics, or other materials purchased for publication without proof that these materials were "work-for-hire". Syndicated content is never "work-for-hire".
    • All such non-permitted content must be "copyright-blurred" - the protected content must be blurred and a note indicating copyright status inserted. Example of "copyright-blur".
    • Copyright blur is non-trivial. Florida Digital Newspaper Library staff can negotiate either training for local project staff or providing the service for a fee.
    • The Florida Digital Newspaper Library may "copyright-blur" content of this nature. Extensive application of this method may incur a fee. Before starting your project, share sample newspaper issues with Florida Digital Newspaper Library staff to ensure agreement in advance.

Transfer of Digital Objects to the Florida Digital Newspaper Library

  • (preferred) Ship digital objects (see below) on well-packed external hard drive(s).  Drives will be shipped back once their contents have been loaded into the Florida Digital Newspaper Library or onto the University of Florida Libraries’ Digital Library Center (DLC) active storage space.  Return shipping information, including billing information, must be provided.
  • Use of UF DLC external drives for transfer between your vendor and the DLC may be negotiated, depending on availability of drives.  Caveat: For shipments directly from your vendor to the DLC, we will be unable to verify invoicing.  For this reason, we recommend pass-through shipping even though this adds cost and time and risks damage to the files.  Your institution needs to be able to verify that the vendor product received was as billed and that images meet general quality control/production requirements.  The DLC will not have seen the source materials.  We can point to general quality issues but will not know whether they result from vendor performance or source document condition.
  • FTP may be arranged, but requires negotiation of data transfer amounts so that we can plan to have space available.
  • Please provide contact information for the person at your library who will be responsible for the product.

Attribution Requirements

The Florida Digital Newspaper Library (FDNL) would like to attribute your contribution to you.  We do that by providing a link to your institution on the side-bar of individual issue/page display and in citation information.  Please provide the following:

  • Wordmark(s) for your institution and funding source(s), measuring no larger than 120 by 120 pixels.
  • URL to which each wordmark should link, pointing to an appropriate institution or funding source page.  Note that FDNL is based on and must comply with the Internet Policy of the University of Florida.  We will be unable to display wordmarks or provide links to commercial bodies.  The policy does not prevent links from being directed to a location on your website that provides more complete attribution.
  • Funding statement, e.g., “Funded by the XYZ of Florida and the Venice Public Library, with help from the Chamber of Commerce and the Friends of the Library.”  Funding statements contain no links and may name a commercial source.

General Bibliographic Metadata Requirements

  • If not embedded in mark-up (METS files), please provide OCLC catalog record numbers for each title to be supplied.  
    FDNL imports record information including subject cataloging.

Imaging, General Requirements

  • One page per image
  • 300 dpi
  • 8-bit (i.e., grey-scale) or, as necessary, 24-bit (color) – use color as necessary to capture meaningful color content (news photos, color integral to the understanding of a story, but maybe not color advertisements, decorative borders, and the like)
  • Uncompressed, unlayered TIFF v.6
  • At 100% blow back, i.e., no reduction from original source document page dimensions.
  • Rotated so that text lines are level (on-cant), with not more than 1% skew (off-cant).
  • Cropped to the paper’s edge (preferable); a slight margin around the page is acceptable.  Never crop inside the edge of the paper.
  • (not required but if manipulating images) Histogram image correction only (see our Digital Library of the Caribbean MANUAL, Section 8 : Image Correction at http://ufdc.ufl.edu/English/training/Section8.pdf)
  • No image correction or other image manipulation once the image has gone into text generation and derivative production.
  • Save files to directories with chronological nesting, e.g.,
    • Title 1
      • 1884
        • January
          • 1
            • 00001.tif
            • 00002.tif
          • 2
            • 00001.tif
            • 00002.tif
        • February
          • 1
            • 00001.tif
            • 00002.tif
      • 1885
        • January
          • 1
            • 00001.tif
            • 00002.tif
    • Title 2
      • 1886
        • January
          • 1
            • 00001.tif
            • 00002.tif
          • 2
            • 00001.tif
            • 00002.tif
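The chronological nesting above can be sketched in code. This is a minimal illustration, not an FDNL tool: the function name page_path is hypothetical, and it assumes that month folders use full English month names, day folders are unpadded, and page files are numbered with five digits, as in the example listing.

```python
from datetime import date
from pathlib import PurePosixPath

# Full English month names, spelled out to avoid any locale dependence.
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def page_path(title: str, issue_date: date, page_number: int) -> PurePosixPath:
    """Build Title/Year/Month/Day/NNNNN.tif for one page image."""
    if page_number < 1:
        raise ValueError("page numbers start at 1")
    return PurePosixPath(
        title,
        str(issue_date.year),
        MONTHS[issue_date.month - 1],
        str(issue_date.day),           # unpadded, as in the listing above
        f"{page_number:05d}.tif",      # 00001.tif, 00002.tif, ...
    )


# Second page of the 2 January 1884 issue of "Title 1".
print(page_path("Title 1", date(1884, 1, 2), 2))
```

Generating paths from the issue date, rather than typing folder names by hand, keeps spelling and numbering consistent across a large batch.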

Microfilm Digitization

  • Page Separation:
    When digitizing microfilm containing two pages per frame, require that the vendor split the pages into separate images, even when this cuts across a bi-fold spread.
  • Cost:
    An increasing number of microfilm digitization vendors are experienced and use very good automation to accomplish page separation. On the high side of current (January 2008) pricing, expect to pay no more than $0.50 per page (or $1 per frame). Most of the more experienced newspaper microfilm digitization vendors will quote less for both image and searchable text generation, particularly if you are working in bulk.
  • Vendors:
    To find a vendor search National Digital Newspaper Program state projects and make inquiries about their vendors.
    The Florida Digital Newspaper Library is prohibited by University of Florida regulations from recommending vendors. However, information about the vendors that have successfully completed our project is available. These, by no means, are the only vendors available. Exclusion from our listing does not mean that a vendor does not do good work - it only means that we haven't worked with them.
  • Quality Assurance:
    It is important, both for the vendor and for you, that the vendor complete a no-fee demonstration. This document may be used as a guide. Florida Digital Newspaper Library staff are available to complete assessments of demonstration product for newspapers to be contributed to the Library. Assessment is free but requires access to the source microfilm as well as the vendor's product and the specifications given to the vendor for completion of this work. We prefer to evaluate vendor product in the blind, without knowing about the vendor.
  • Source Microfilm:
    Always try to find and use a print-master microfilm. Print-masters are films used to create the microfilm in your collection. They are usually negatives. And, they almost always allow the production of the best possible quality digital image. The University of Florida Libraries hold the largest collection of Florida newspaper microfilm print-masters, more than 10,000 reels for virtually every city in Florida; contact the Florida Digital Newspaper Library for assistance.
    When a print-master cannot be identified, assess your film for scratches and blotches. Consider purchase of a fresh print if possible. But be aware that use of a commercially produced microfilm may not be legal. You should also know that most old microfilm was poorly filmed and poorly developed. Trash in, trash out: bad film generates poor images. Most of the microfilm of Florida newspapers produced before 1990, even the University of Florida's film, isn't very good. Frequently, however, it's all that remains.
    If the newspaper still exists in paper, assess and compare the quality of both the film and the paper. Frequently, a well-stored paper produces a better image than a poorly filmed, poorly developed microfilm. If the vendor is willing, ask for side-by-side demonstration product for both the film and the paper.
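The high-side pricing given under Cost can be turned into a rough budget ceiling. The sketch below is illustrative arithmetic only: the function name and the frame count in the example are made up, and the rate is the January 2008 figure quoted above, which actual vendor quotes may undercut.

```python
from decimal import Decimal

# High-side January 2008 price quoted above: $0.50 per page ($1 per two-page frame).
MAX_PER_PAGE = Decimal("0.50")


def budget_ceiling(frames: int, pages_per_frame: int = 2) -> Decimal:
    """Upper-bound cost in dollars for digitizing the given number of frames."""
    if frames < 0 or pages_per_frame < 1:
        raise ValueError("frames must be >= 0 and pages_per_frame >= 1")
    pages = frames * pages_per_frame
    return pages * MAX_PER_PAGE


# Hypothetical project: 1,000 frames filmed two pages per frame.
print(budget_ceiling(1000))  # 2,000 pages at the $0.50 ceiling
```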

N.B.  and Other Notes

  • We appreciate any effort to provide derivatives, including text generation, and metadata creation.  However, we’d be happy to accept files without the following work.
  • Digital objects received without the following, however, are subject to longer through-put times due to additional processing demands on FDNL systems and staff.
  • FDNL digitally archives all contributed content, without fee, in the Florida Digital Archive to ensure long term preservation, backup and data migration.

Derivatives, Part 1 – Web display images

All derivatives should share or be built upon the file name of the master TIFF as specified below.  Derivative sets will be needed for each TIFF.

  • JPEG2000 (JP2 – base standard; not JPX – no embedded metadata): compliant with National Digital Newspaper Program JPEG 2000 Profile (http://www.loc.gov/ndnp/pdf/JPEG2kSpecs.pdf), excluding the XML and RDF specifications (in Profile version 2.7, excluding specifications 19 and 20)
    • File name = [master TIFF file name].jp2
  • JPEG Page Image: standard JPEG, progressive encoding, optimal image quality, 630 pixels wide (variable length, depending upon source document)
    • File name = [master TIFF file name].jpg
  • JPEG Thumbnail: standard JPEG, progressive encoding, optimal image quality, maximum 200 pixels wide and maximum 300 pixels high.
    • File name = [master TIFF file name]thm.jpg
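The naming and sizing rules above can be sketched as follows. This is a hedged illustration, not FDNL's production code: the function names are hypothetical, the thumbnail is given a .jpg extension here because it is a JPEG file, and integer arithmetic is used to scale dimensions so the results are exact. It computes names and target pixel sizes only; actual image conversion is left to your imaging software.

```python
from pathlib import PurePosixPath


def derivative_names(master_tiff: str) -> dict:
    """Derive web-display file names from the master TIFF file name."""
    stem = PurePosixPath(master_tiff).stem  # "00001.tif" -> "00001"
    return {
        "jpeg2000": f"{stem}.jp2",
        "page_jpeg": f"{stem}.jpg",
        "thumbnail": f"{stem}thm.jpg",  # assumption: JPEG thumbnail, so .jpg
    }


def page_jpeg_size(width: int, height: int, target_width: int = 630) -> tuple:
    """Scale to 630 pixels wide; height varies with the source page."""
    scaled_height = (height * target_width + width // 2) // width  # rounded
    return target_width, scaled_height


def thumbnail_size(width: int, height: int, max_w: int = 200, max_h: int = 300) -> tuple:
    """Fit inside 200 x 300 pixels while preserving the aspect ratio."""
    if width <= max_w and height <= max_h:
        return width, height  # already small enough; do not enlarge
    if width * max_h >= height * max_w:  # width is the binding limit
        return max_w, max(1, height * max_w // width)
    return max(1, width * max_h // height), max_h


# A 16 x 22 inch page scanned at 300 dpi is 4800 x 6600 pixels.
print(derivative_names("00001.tif"))
print(page_jpeg_size(4800, 6600))
print(thumbnail_size(4800, 6600))
```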

Note that FDNL does not use PDFs.  The PDF for any given page is large, and PDFs of bundled pages are larger still.  We hope to build a PDF-on-the-fly service later, when bandwidth to the average user becomes greater or PDF otherwise becomes more efficient.

Derivatives, Part 2 – Searchable text

All derivatives should share or be built upon the file name of the master TIFF as specified below.

  • Text file generated by optical character recognition (OCR), with or without word formation, with or without correction using dictionaries.
    • File name = [master TIFF file name].txt
  • Column recognition (column formation) must be on.
  • For OCR with single engines (one software package), we recommend OmniPage Pro or ABBYY FineReader, which have produced excellent results with older newspapers.
  • We recommend use of OCR systems with multiple, voting engines.  FDNL uses Prime Recognition (http://www.primerecognition.com/) configured with six voting engines.  Such systems OCR a single page image multiple times and vote on character/word choice.  This improves the accuracy of the resulting searchable text.
  • Correction of text is currently very expensive.  We recommend it only for the most important articles, and even then only infrequently.

OCR

OCR, or Optical Character Recognition, is computer software that converts digital images of text, such as the picture of a page, to searchable text, such as that created in word-processing software.  The software sees the image of an “e”, for example, and recognizes it as the letter “e”.  OCR software is frequently combined with spell-check software to ensure greater accuracy, checking formed words against a dictionary and suggesting changes.

Prime Recognition, the OCR software used by the Florida Digital Newspaper Library (FDNL), simultaneously runs six different OCR software “engines”.  The results of recognition are voted on, such that if four of the engines recognize the letter “e”, one recognizes the letter “c” and the other recognizes the letter “o”, the image is converted to an “e”.  The more OCR engines running, the more accurate the conversion.  This is especially important when processing old newspapers, particularly those converted from old microfilm.  Discoloration of newsprint due to aging reduces contrast between print and background paper and can reduce OCR’s accuracy.  Poor and uneven inking or worn type used in printing newspapers of a certain age also contributes to inaccuracies.  Older microfilms, likely produced without compliance to standards, may exacerbate these effects.  Poor or uneven lighting of the page when microfilmed fades some text and darkens other text; both can reduce accuracy, visually changing the letter “e”, for example, to a “c” through fading or an “o” through darkening.  Additionally, older microfilms tend to suffer aging effects, such as splotching and scratching, that further limit accuracy.
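The voting idea described above can be illustrated with a simple majority vote. This is a toy sketch and not Prime Recognition's actual algorithm: the function names are hypothetical, the engine outputs are assumed to be already aligned character by character and of equal length, and ties simply go to the reading seen first.

```python
from collections import Counter


def vote(candidates):
    """Return the reading that the most engines agreed on."""
    winner, _count = Counter(candidates).most_common(1)[0]
    return winner


def vote_text(engine_outputs):
    """Vote position by position across equal-length engine outputs."""
    if len({len(text) for text in engine_outputs}) != 1:
        raise ValueError("this sketch assumes aligned, equal-length outputs")
    return "".join(vote(chars) for chars in zip(*engine_outputs))


# Six engines read one character: four see "e", one "c", one "o".
print(vote(["e", "e", "e", "e", "c", "o"]))

# The same vote applied across a whole word.
print(vote_text(["news", "news", "ncws", "news", "nows", "news"]))
```

Production voting systems also have to align outputs of different lengths and weigh engine confidence, which this sketch leaves out.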

FDNL currently runs two instances of Prime Recognition. In addition to converting images to searchable text, Prime Recognition also provides the location or bitmap reference of each letter in the image to enable text-highlighting in a future release of the FDNL site. See the Wikipedia entry on Optical Character Recognition for more information.

Mark-up and METS

This will be detailed online. In the meantime, contact Will Canova at wcanova@uflib.ufl.edu or 352.273.2900 for more information.

Information for State University Libraries

State University Libraries should contact the Florida Digital Newspaper Library to see which file transfer method best matches their needs. UF's Digital Library Center offers several tools for easily transferring files. In order for the Florida Digital Newspaper Library to process files from SULs, the SUL needs to provide:

  • Bibliographic and Attribution information:
    • Please provide an Aleph number, OCLC number, or other bibliographic information so that we can import or create a METS record for the newspaper.
    • Please provide a funding statement (e.g., “Funded by the XYZ of Florida”) if one should be included.
    • What wordmark(s) does the SUL want? These need to be under 120x120 pixels and they will be on the left-side display for each page and issue of the paper. If the SUL has wordmarks prepared, those should be sent; otherwise, we will make them and send the draft designs to the SUL for approval.
    • All titles can have customized interfaces and supplemental pages. Please let us know what will or might be wanted so we can implement any preferred designs or work with you to draft a design and any supplemental pages for your review and approval.
  • Digital files:
    • The basic bibliographic information provides the core metadata for all issues of a given newspaper. For each of the issues, the volume and issue number and date are needed. This information can be submitted through one of the tools provided by the Digital Library Center or even in a spreadsheet (Excel or csv file).
    • TIFF files are required, and file format information for them is explained above.
    • All TIFF files for a single issue of a single newspaper should be in a single folder.
    • The SUL will need to provide the TIFF files via FTP or on a portable external hard drive, or contact the Digital Library Center to arrange another transfer method.
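For the per-issue information (volume, issue number and date), a CSV file can be produced along the following lines. This is only a sketch: the column names and the sample rows are invented for illustration, and the Digital Library Center may prefer a different layout, so confirm the format before submitting.

```python
import csv
import io
from datetime import date

# Hypothetical column names; confirm the preferred layout with the DLC.
FIELDS = ["title", "volume", "issue", "date"]


def issues_to_csv(issues) -> str:
    """Serialize per-issue metadata as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for issue in issues:
        writer.writerow({
            "title": issue["title"],
            "volume": issue["volume"],
            "issue": issue["issue"],
            "date": issue["date"].isoformat(),  # YYYY-MM-DD sorts chronologically
        })
    return buffer.getvalue()


# Made-up sample rows for two issues of "Title 1".
sample = [
    {"title": "Title 1", "volume": 1, "issue": 1, "date": date(1884, 1, 1)},
    {"title": "Title 1", "volume": 1, "issue": 2, "date": date(1884, 1, 2)},
]
print(issues_to_csv(sample))
```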