Group Title: Phase I : initial grant
Title: Phase II : OCR gateway to indexing
Permanent Link: http://ufdc.ufl.edu/UF00095853/00002
 Material Information
Title: Phase II : OCR gateway to indexing
Physical Description: Book
Language: English
Creator: George A. Smathers Libraries, University of Florida
Publisher: George A. Smathers Libraries, University of Florida
Place of Publication: Gainesville, Fla.
 Record Information
Bibliographic ID: UF00095853
Volume ID: VID00002
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.








University of Florida Smathers Libraries


Project Proposal submitted to
The Andrew W. Mellon Foundation

CARIBBEAN NEWSPAPER IMAGING PROJECT

Phase II : OCR Gateway to Indexing



Context and Proposal
Users of electronic images come to digital media with a set of expectations greater than
those they have of other media. They anticipate extensive indexing, directly and interactively
linked to the indexed information. With this proposal, the University of Florida proposes to test
the viability and costs associated with use of optical character recognition (OCR) as an
alternative to manually indexing electronic newspapers.
With funding support from the Andrew W. Mellon Foundation, the University of Florida has
scanned its microfilmed newspaper holdings of the Diario de la Marina (Havana, Cuba),
1947-1960, and Le Nouvelliste (Port-au-Prince, Haiti), 1899-1979. In the process, these
newspapers were indexed selectively by reviewers knowledgeable in the languages.
Selective indexing is not ideal: it is highly labor-intensive and far from comprehensive.
This proposal is made to assess the value and cost effectiveness of OCR indexing of these
same newspapers.
The proposed study will evaluate OCR effectiveness for the following target groups:
Digitizing methods and technologies;
Filming methods and technologies;
Publication dates;
Language of the source newspaper; and
OCR software technologies.
While OCR of page images smaller than a newspaper's folio dimensions has been
successfully demonstrated and cost-effectively applied, newspapers represent an
unaddressed challenge. Too large for extant flat-bed scanners and often too fragile for rotary
plotter-scanners, source newspapers must be reproduced on film prior to scanning or scanned
with a digital camera. Newspapers already on microfilm represent an additional challenge.
Scanned images of newspapers from microfilm are problematic for a number of reasons.
The de facto "standard"2 for production of film intermediaries for oversized source documents
calls for 105 mm film rather than the library "standard"3 35 mm film on which newspapers are
currently microfilmed. Formulas for digitization of images on film,4 compared against
scanner manufacturers' literature and claims, show that no microfilm scanner or digital
camera currently available can adequately scan newspapers from 35 mm microfilm.
Evaluation of the OCR indexing in each target group will be compared with the usefulness of
the human indices. Results will be published with recommendations for OCR-optimized
imaging of newspapers and optimized micro-photography for future imaging and OCR.




University of Florida proposal submitted to The Andrew W. Mellon Foundation
CARIBBEAN NEWSPAPER IMAGING PROJECT
Phase II: OCR Gateway to Indexing



Target Newspapers
Targeted titles include Diario de la Marina, Le Nouvelliste and Trinidad Guardian.
Published in one of the three predominant Caribbean languages and extensive in holdings,
each targeted newspaper affords analysis of language and printing variables, the latter
including use of fonts and print quality over time.
Because the titles were microfilmed over time to changing standards, comparison of OCR accuracy
from images generated from these microfilms will quantify the probability of successful OCR.
The Diario and Nouvelliste have been digitized. Select page images will be rescanned for test
of additional digital methodologies. Select page images of the Trinidad Guardian will be
digitized and indexed, for the first time, for purposes of this project.
The Trinidad Guardian was selected from among the University of Florida's English language
newspaper microfilm holdings for its documentation of the colonial British West Indies and of
the various independence and republican movements of the English speaking Caribbean
nations. Trinidad and Tobago, persuaded by the rhetoric of Dr. Eric Williams, compelled the
Caribbean toward a Caribbean identity and nationhood.

Selection Procedure
For each of three newspaper titles, target issues will be selected as follows:
For any given test group, a minimum of 400 page images will be selected to
maintain statistical validity consistent with ±5% accuracy.
For any given sub-sample, a minimum of 200 page images will be selected to
maintain statistical validity consistent with ±10% accuracy.
Page images grouped in one test group may also be included in other test
groups.
To establish data resolution sufficient to afford comparison across titles, issues will be
selected from comparable dates, e.g., the first issue of every fourth month.
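
The sample-size minimums above can be checked against the usual normal-approximation formula for the margin of error of a sample proportion. The sketch below is illustrative only: it assumes simple random sampling, a 95% confidence level (z = 1.96), and the worst-case proportion p = 0.5, none of which is stated in the proposal. The function name is hypothetical.

```python
import math


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of the normal-approximation confidence interval
    for a proportion estimated from n sampled page images.

    p = 0.5 is the worst case (widest interval); z = 1.96 corresponds
    to an assumed 95% confidence level.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    return z * math.sqrt(p * (1.0 - p) / n)


# Sample sizes named in the selection procedure.
for n in (400, 200):
    print(f"n={n}: +/-{margin_of_error(n) * 100:.1f}%")
```

Under these assumptions, 400 page images give roughly a 4.9% margin and 200 page images roughly a 6.9% margin, so both minimums fall within the stated tolerances.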

Test Groups
General Application Note
TextBridge OCR, a reasonably priced, well-tested, highly accurate, and very flexible OCR
package with English, French and Spanish dictionaries and fonts, will be used in testing
across all test groups.
Target Summary
Total new images created: ...................................... 1,800
    Using Minolta microfilm scanners: .......................... 1,400
    Using flat-bed transparency scanners: ........................ 400
Additional Phase I source images selected: ..................... 1,200
Total images OCR processed: .................................... 5,000
Note: Images or OCR used in a given test group may also have been used in another test
group. The total number of items per category above is therefore smaller than the sum of
items for all test groups.
Analysis
Test data will be recorded for subsequent analysis. In addition to accuracy rates and
"demographic" data (e.g., page counts, font names, etc.), processing time will also be
documented for use in time and cost studies. Accuracy, time and cost accounting will
document: 1) human indexing (from Phase I), 2) an initial machine pass, and 3) a
secondary human pass to correct obvious errors in the recognition.
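
The proposal does not define how accuracy rates are to be measured. One common convention, shown here purely as an assumption, is character accuracy: one minus the edit distance between a hand-corrected reference transcription and the OCR output, divided by the reference length. The function names below are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Fraction of reference characters recognized correctly,
    floored at 0.0 when the OCR output is badly wrong."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


# One substituted character in a 14-character place name.
print(round(character_accuracy("Port-au-Prince", "Port-au-Prinee"), 3))
```

Recording this figure after the machine pass and again after the human correction pass would give the two accuracy points that the time and cost accounting refers to.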








1. Digitizing Methods & Technologies Group
Factors in imaging and image characteristics are fundamental to OCR accuracy. In OCR
of newspapers, those factors and characteristics which make electronic images of text
readable handicap the generation of electronic text from those images.
Research at Cornell University5 suggests that scanning at increased bit-depth may
enhance the legibility of fine detail from the source document. Yet, current OCR software is
either optimized for or restricted to use of the lowest bit-depth, i.e., of bitonal scans.
Optimization is driven by records managers rather than by librarians and archivists.
Bitonal images consume fewer computer resources for OCR processing. Regardless, the
suggestion has little utility when scanning from bitonal, high contrast microfilm rather than
from the newspapers themselves.
Although higher is better when setting digital resolution, measured in dots per inch (dpi),
microfilm scanners currently manufactured are not capable of meeting an adequate dpi
per the Cornell formulas. Devices which scan from projection rather than directly from film
require 50% reduction for whole-page imaging. Metering projected newspapers into
segments for optimal capture continues to require human intervention.
Digital resolution and reduction will be examined as factors in accuracy. The test group
will consist of three sub-samples:
whole-page 400 dpi scans made at 50% reduction;
partial-page 400 dpi scans made without reduction; and
whole-page 400 dpi scans made without reduction.
The Minolta microfilm scanners will be used to create images in the first and second
sub-samples. 400 dpi is the optimal scanning resolution of Minolta microfilm scanners at 50%
reduction/blow-back of original newspaper size.
For testing purposes, a flat-bed scanner with transparency attachment and a 600 dpi
native-hardware linear-array will be used to create images in the third sub-sample. Using
the scanner's interpolation software, 8,400 dpi is approximately the resolution required for
a moderately good (Quality Index 5) scan. This 8,400 dpi scan from film with a 21:1
reduction ratio is the equivalent of a 400 dpi scan from the newspaper itself.
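
The arithmetic behind these figures can be sketched as follows. The relationship used (resolution at film scale divided by the reduction ratio gives the equivalent resolution at original newspaper scale) is inferred from the numbers in the paragraph above; the function names are hypothetical.

```python
def equivalent_source_dpi(film_scan_dpi: float, reduction_ratio: float) -> float:
    """Resolution relative to the original newspaper for a scan made
    directly from film shot at the given reduction ratio (e.g. 21 for 21:1)."""
    return film_scan_dpi / reduction_ratio


def required_film_scan_dpi(target_source_dpi: float, reduction_ratio: float) -> float:
    """Resolution needed at film scale to reach a target resolution
    at original newspaper scale."""
    return target_source_dpi * reduction_ratio


def interpolation_factor(output_dpi: float, native_dpi: float) -> float:
    """How far software interpolation must stretch the native hardware resolution."""
    return output_dpi / native_dpi


print(equivalent_source_dpi(8400, 21))   # 400.0
print(required_film_scan_dpi(400, 21))   # 8400
print(interpolation_factor(8400, 600))   # 14.0
```

The last figure shows why interpolation matters for this sub-sample: a 600 dpi native array reaches 8,400 dpi only by a fourteen-fold software interpolation.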
As with OCR, interpolation is software-driven and, as such, performs subject to the
programmer's conditions. In some cases, interpolation is thought to degrade rather than
improve the quality of the image. Its use is suspect for bitonal text enhancement, where
interpolation between black and white is stark and the majority of text is formed with
horizontal and vertical strokes.
To ensure that publication artifacts, language and other factors do not skew data, target
images will be taken from the Trinidad Guardian issued between 1990 and 1997. (The
University of Florida is in contact with the Trinidad Guardian's publisher to resolve
copyright issues associated with reproduction. The publisher's initial reaction has been
positive.)
Total images created: 1,200
Total images OCR processed: 1,200







2. Filming Methods & Technologies Group
(Filming Date Group)
Factors in microfilming and film characteristics are fundamental to optimal image capture and
subsequent OCR accuracy. In newspaper microfilming, there have been four eras, each
defined by a set of standards or the lack thereof:
PRE-MODERN (pre-1977; defined by "best practices");
MODERN (1977-1986; defined by Library of Congress/ANSI standard6);
POST-MODERN (1987-present; defined by the first Research Libraries Group standard and
revision of Library of Congress/ANSI standard); and
CONTEMPORARY (present/select application; defined by the so-called "OCR-optimized" standard,
i.e., RLG standard modified for allowable 1% skew, fixed reduction, and "one-up" filming).
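
For recording test data, the date-bounded eras above could be assigned mechanically from the year a reel was filmed. This is only a sketch with a hypothetical function name. The CONTEMPORARY era is applied selectively and overlaps POST-MODERN in date, so it cannot be inferred from the year and would have to be flagged by hand for each reel.

```python
def filming_era(film_year: int) -> str:
    """Map the year a reel was filmed to one of the date-bounded eras.

    CONTEMPORARY ("OCR-optimized") filming is not returned here because
    it is defined by practice, not by date.
    """
    if film_year < 1977:
        return "PRE-MODERN"
    if film_year <= 1986:
        return "MODERN"
    return "POST-MODERN"


for year in (1965, 1977, 1986, 1987):
    print(year, filming_era(year))
```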
Regardless of standards, film image quality is conditioned by focus and depth of field;
reduction level; exposure levels and light dispersion; and the density of imaged film. Microfilm
is a high-contrast technology optimized for capture of text, but unsuitable exposure or uneven
lighting, in particular among these conditions, can erode the legibility of text.
The role of each standards set and filming conditions will be examined as factors in accuracy.
The test group will consist of four sub-samples corresponding to microfilming eras. Analysis
will look, as well, across sub-samples at specific image quality factors.
To ensure that publication artifacts, language and other factors do not skew data, target
images will be taken from the Trinidad Guardian. To ensure consistency in analysis of
samples, whole-page images will be created at 400 dpi with 50% reduction on Minolta
microfilm scanners.
Total images created: ......... 800 (includes 200 created in test group 1, sub-sample 1)
Total images OCR processed: ... 800 (includes 200 processed in test group 1, sub-sample 1)

3. Publication Date Group
The condition and characteristics of the source newspaper set bounds on the quality of the
film image. Microfilm is a non-additive technology; the film image is never better than the
source newspaper.
Printing technology; print defects; font face; and font sizes will be examined as factors in
accuracy.
Issue availability necessitates a division of the group to ensure equitable assessment across language
types (cf. test group four: language group). Images from the Trinidad Guardian are suggested as a
control across date groups. Sub-samples include:
1890-1920 (Le Nouvelliste, Trinidad Guardian);
1920-1950 (Le Nouvelliste, Trinidad Guardian);
1950-1960 (Diario de la Marina, Le Nouvelliste, Trinidad Guardian); and
1960-1997 (Trinidad Guardian).
The Diario de la Marina is available only between 1947 and 1960.
Le Nouvelliste is available only before 1960.
To ensure that other factors do not skew data, whole-page images will be created at 400 dpi
with 50% reduction on Minolta microfilm scanners.
Total images created: ............... 800 (comprised of 800 created in test group 2)
Phase I source images selected: ..... 800
Total images OCR processed: ....... 1,600 (includes 800 processed in test group 2)








4. Language Group
OCR software is written primarily by English speakers for use on English language texts.
Software, with added dictionaries and character sets, provides non-English language
capability. However good it is, OCR software is suspected to be less efficient or
accurate with non-English texts and with the accented characters found in French and
Spanish in particular.
Language will be examined as a factor in accuracy. This test group has three sub-samples:
English (Trinidad Guardian);
French (Le Nouvelliste); and
Spanish (Diario de la Marina).
To ensure that other factors do not skew data, whole-page images of issues published
between 1950 and 1960 will be created at 400 dpi with 50% reduction on Minolta microfilm
scanners.
Total images created: ............. 200 (comprised of 200 created in test group 3, sub-sample 3)
Phase I source images selected: ... 400
Total images OCR processed: ....... 600 (includes 200 processed in test group 3, sub-sample 3)
5. OCR Software Technologies Group
OCR software is optimized for measures of digital resolution (dpi) associated with commonly
available hardware linear-arrays or an evenly divided measure of those arrays, e.g., 300 and
600 dpi. Cornell states that images with dpi not consistent with these measures may not be
as accurate as those that are consistent.7 Regardless, OCR is software; code is written
differently for different software packages. Software publishers' claims aside, it is reasonable
to suggest that some software packages are more reliable than others.
Digital resolution and different OCR packages will be examined as factors in accuracy. The
test group will consist of several sub-samples:
partial-page 400 dpi scans made without reduction; and
whole-page 400 dpi scans made without reduction.
To ensure that publication artifacts, language and other factors do not skew data, target
images will be taken, and selectively indexed, from the Trinidad Guardian between 1990 and
1997. (This is to compare time, cost, and effectiveness of human indexing, as done in Phase
1 for Le Nouvelliste and Diario de la Marina.)
Sub-samples will be processed using each of three major software packages: TextBridge,
OmniPage Pro, and TypeReader. (Because of its cost, the Prime Recognition software used
by the University of Michigan and JSTOR will not be tested in this phase unless it can be
shown that different software produces different levels of accuracy.) Sub-samples
will also be processed using software bundled with Adobe Exchange. Exchange,
increasingly, is the software engine used by electronic newspaper distributors (e.g.,
NewsExpress, http://www.foreigmedia.com) capturing electronic mark-up of current
newspapers; its text recognition software is the foundation of the indexing systems used by
distributors.
Total images created: 1,000 (comprised of 1,000 created in test group 1)
Total images OCR processed: 4,000 (the 1,000 images created in test group 1, each processed by four software packages)
In addition to this machine pass, a secondary human pass will be made for a sub-sample of
each software package's work. The human pass will correct obvious errors in the
recognition.







Total images corrected: 1,600


Budget
OCR Software
ITEM                                                   REQUESTED  COST SHARE       TOTAL
TextBridge (5-workstation site license)                   427.50        0.00      427.50
OmniPage Pro v.9.0                                          0.00       99.00       99.00
TypeReader v.5.0 (5-wkstn site license upgrade)             0.00      275.00      275.00
Adobe/Acrobat (w/Exchange)                                  0.00       45.00       45.00
TOTAL                                                     427.50      419.00      846.50
Hardware (Workstation Upgrades)
ITEM                                                   REQUESTED  COST SHARE       TOTAL
Workstation upgrades (2)                                2,200.00    2,200.00    4,400.00
Memory upgrades (2)                                       512.00      512.00    1,024.00
TOTAL                                                   2,712.00    2,712.00    5,424.00
Note: Upgrades rebuild existing hardware, effectively building new workstations. The
computer processor and memory are the most important non-human factors in quick,
efficient operation.
Staffing
ITEM                                                   REQUESTED  COST SHARE       TOTAL
Project Management & Evaluation
Kesse, Erich (Project Manager), 5% FTE                      0.00    2,250.00    2,250.00
Bruner, Thomas (Imaging Supervisor), 5% FTE                 0.00    1,125.00    1,125.00
Botero, Cecilia (Indexing Supervisor), 5% FTE               0.00    1,850.00    1,850.00
Phillips, Richard (Evaluation Coordinator), 5% FTE          0.00    2,235.45    2,235.45
Covey, William (Computer Support), 2.5% FTE                 0.00    1,389.15    1,389.15
Imaging (with Minolta microfilm scanners)                 303.34        0.00      303.34
  Method of Calculation:
  (1,400 images x 2 min) x $6.50/hr
Imaging (with flat-bed transparency scanners)             130.00        0.00      130.00
  Method of Calculation:
  [(400 images x 15 min) x $6.50/hr] ÷ 5
  Note: Division assumes that 5 Minolta scans can
  be created while the flat-bed scans 1 image.
Indexing (Traditional Method)                           1,808.00        0.00    1,808.00
  Method of Calculation:
  (800 issues reviewed x 2 articles indexed) x $1.13
  Note: Method and cost calculation correspond with
  those used in Phase I of the Andrew W. Mellon
  Foundation funded Caribbean Newspaper Imaging
  Project (CNIP).
  Note: Necessary to allow comparison of traditional
  indexing (CNIP1) with OCR-based key-word
  indexing (CNIP2) methods for the Trinidad Guardian,
  which was not part of CNIP, Phase I.
OCR processing: Machine pass                              270.83        0.00      270.83
  Method of Calculation:
  [(6,000 page images x 5 min) x $6.50/hr] ÷ 12
  Note: Division assumes that 12 images can be
  batch processed in an hour.
OCR processing: Human pass                              3,300.00        0.00    3,300.00
  Method of Calculation:
  (1,600 page images x 15 min) x $8.25/hr
Benefits                                                    0.00    2,477.89    2,477.89
  Method of Calculation: 28% of full-time staff cost
TOTAL                                                   5,912.17   11,327.49   17,239.66
Miscellaneous
ITEM                                                   REQUESTED  COST SHARE       TOTAL
DAT (primary backup media)                                  0.00       65.00       65.00
CD-ROM (secondary backup media)                             0.00       42.25       42.25
TOTAL                                                       0.00      107.25      107.25

GRAND TOTAL                                             9,051.67   14,565.74   22,637.41
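
The time-based staffing lines in the budget follow one pattern: item count times minutes per item, converted to hours, times an hourly rate, and divided by a concurrency factor where the notes say so. The sketch below reproduces that pattern, along with the benefits line, using only figures printed in the budget. The function names are hypothetical and conventional rounding to the cent is assumed.

```python
def labor_cost(items: int, minutes_per_item: float, hourly_rate: float,
               divisor: float = 1.0) -> float:
    """Cost of processing `items` at `minutes_per_item`, paid hourly.

    `divisor` models the budget's concurrency notes, e.g. 12 images
    batch processed at once or 5 Minolta scans per flat-bed scan.
    """
    hours = items * minutes_per_item / 60.0
    return round(hours * hourly_rate / divisor, 2)


def benefits(staff_costs: list[float], rate: float = 0.28) -> float:
    """Benefits as a flat percentage of full-time staff cost."""
    return round(sum(staff_costs) * rate, 2)


lines = {
    "Imaging (Minolta)":        labor_cost(1400, 2, 6.50),
    "Imaging (flat-bed)":       labor_cost(400, 15, 6.50, divisor=5),
    "OCR machine pass":         labor_cost(6000, 5, 6.50, divisor=12),
    "OCR human pass":           labor_cost(1600, 15, 8.25),
    "Benefits (28% of staff)":  benefits([2250.00, 1125.00, 1850.00, 2235.45, 1389.15]),
}
for name, cost in lines.items():
    print(f"{name:<26}{cost:>10,.2f}")
```

With conventional rounding the Minolta line computes to 303.33, one cent below the printed 303.34; the other lines match the printed amounts.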







Notes
1. Lenzar, the Florida company which manufactured large format linear-array flat-bed
scanners, went out of business in 1997. It was the only manufacturer of such products.
2. Gertz, Janet. Oversize Color Images Project, 1994-1995: A Report to the Commission on
Preservation and Access. Washington, D.C.: Commission on Preservation and Access,
1995.
3. RLG Preservation microfilming handbook, edited by Nancy E. Elkington. Mountain View,
CA: Research Libraries Group, 1992.
The Handbook was preceded by a set of guidelines in: Research Libraries Group. RLG
Preservation Manual (second edition). Stanford, CA: the Group, 1986.
4. "Film scanning formulas" insert in: Kenney, Anne R. and Stephen Chapman. Digital
imaging for libraries and archives. Ithaca, NY: Cornell University Library, Department of
Preservation and Conservation, 1996.
5. Kenney, Anne R. and Stephen Chapman. Digital imaging for libraries and archives.
Ithaca, NY: Cornell University Library, Department of Preservation and Conservation,
1996.
6. Library of Congress newspaper microfilming guidelines, compiled in 1971 as: Library of
Congress. Specifications for the microfilming of newspapers in the Library of Congress.
Washington, D.C.: the Library, 1971.
Republished as a national standard: American National Standards Institute.
Recommended practice for microfilming newspapers (MS 111). Silver Spring, MD:
National Micrographics Association, 1977.
A revised standard was issued in 1987 and published by the Association for
Information and Image Management.
7. Kenney, Anne R. and Stephen Chapman. Digital imaging for libraries and archives.
Ithaca, NY: Cornell University Library, Department of Preservation and Conservation,
1996. Page 133.



