www.dloc.com

Module 3: Imaging
Basic Theory and Specifications

dLOC Requirement for Digital Master Files
o 8-bit Grayscale or 24-bit RGB Color (depending on whether the original has significant color)
o 300 dpi for standard text or 600 dpi for stand-alone images (photographs, maps)
o Save archival files as uncompressed TIFFs

Bit Depth

In digitization three levels of Bit Depth are widely used: 1 Bit, 8 Bit, and 24 Bit images.

A 1 Bit image is referred to as "bi-tonal" or, less precisely, as "black-and-white". Each picture element of a 1 Bit image is expressed as a single bit. That bit may represent either of two colors, most frequently black or white.

An 8 Bit image is referred to as "grey-scale", though an 8 Bit image may represent a very limited color spectrum as well. Most scanning equipment defaults 8 Bit imaging to grey-scale. The picture elements of an 8 Bit image are expressed in strings of eight bits, for example: 00001111. 8 Bit images allow for as many as 256 shades or colors, numbered 0 through 255.

A 24 Bit image is referred to as "true color" or, less precisely, as a "color" image. The picture elements of a 24 Bit image are expressed in strings of twenty-four bits. 24 Bit images allow for as many as 16,777,216 shades or colors. You may hear digitization specialists using the short-hand "sixteen million colors". The 24 bits are divided into three 8 Bit channels, one for each of three component colors (Red, Green, and Blue).

Color Space

Color fidelity is fundamental to accurate reproduction of the source. Digitization that is faithful to original colors requires a basic understanding of color and of how color reproduction differs between printing technology and digital technology. Fundamental to these differences is the medium on which a color image is printed.
The color space most commonly used by digitization projects and required by dLOC, is a standardized Red/Green/Blue (sRGB) color space.
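The bit-depth arithmetic described under Bit Depth can be checked with a short Python sketch. It is illustrative only: the function names are hypothetical, and packing the red channel into the highest byte of a 24 Bit value is an assumption made for the example, not something this module specifies.

```python
def tonal_values(bit_depth: int) -> int:
    """Number of distinct values one pixel can take at the given bit depth."""
    return 2 ** bit_depth


def split_rgb(pixel: int) -> tuple[int, int, int]:
    """Split a 24-bit pixel value into three 8-bit channels.

    Assumption for illustration: red occupies the highest byte,
    green the middle byte, and blue the lowest byte.
    """
    red = (pixel >> 16) & 0xFF
    green = (pixel >> 8) & 0xFF
    blue = pixel & 0xFF
    return red, green, blue


for depth in (1, 8, 24):
    print(f"{depth:>2} Bit: {tonal_values(depth):,} possible values")

# A hypothetical reddish pixel, written in hexadecimal.
print(split_rgb(0xCC3300))
```

Each added bit doubles the number of values a pixel can hold, which is why three 8 Bit channels together yield the "sixteen million colors" of a 24 Bit image.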
Choosing the Appropriate Bit Depth and Color Space

[Figure: the same page captured as a 1 Bit image, an 8 Bit image, and a 24 Bit image]

dLOC recommends that 1 Bit imaging not be used. 1 Bit images, even at very high resolution (see Resolution below), tend to pixelate text. Imperfections on the page or artifacts of age may read as black, obscuring text in 1 Bit images. In the 1 Bit page image above, bleed-through from the text printed on the reverse of the page, as well as artifacts of age, obscures the text. Obscured text introduces imperfections that reduce the accuracy of text conversion by optical character recognition (OCR) software.

The 8 Bit grayscale image above captures the textual information, and the reader of the page can make sense of the text. Only the 24 Bit image, however, preserves the red text. Readers of Latin religious texts, such as that seen above, will recognize the red text as instructions to the faithful, commentary on the spoken text of a religious service, or the narrative of the priest as opposed to the congregation's response.

dLOC advocates preserving meaningful color. Meaningful color is color required to interpret the text. In the case of a newspaper with colored images, a color image accompanying an article demonstrates meaningful color, while a color advertisement may not.

It is true that "The greater the Bit Depth the greater the size of the digital image file". But digitization technicians are encouraged to produce images that meet the reader's needs rather than the technician's need to conserve space.

Resolution
The resolution of digital images is expressed in terms of pixels. A pixel is a picture element or, simply, a block of solid shade or color that, together with other picture elements, comprises a digital image.

[Figure: Trinidad and Tobago's Coat of Arms (zoom area in black box)]

RESOLUTION                              USE FOR
300 pixels per inch (ppi)               Printed text with normal sized fonts
OR 118 pixels per centimeter (ppc)      Oversized documents and maps
                                        Manuscripts with legible script
600 pixels per inch (ppi)               Photographs and select graphic arts
OR 236 pixels per centimeter (ppc)      Printed text with very small fonts
                                        Manuscripts with difficult scripts

dLOC's minimum digital resolution standard for printed text with normal sized fonts is 300 pixels per inch (ppi) or 118 pixels per centimeter (ppc). This threshold is based on both the characteristics of printed graphics and optical character recognition (OCR) tests.

300 ppi / 118 ppc

The Rationale for Printed Graphics

In general, the resolution of printed graphics does not exceed 300 dots per inch (dpi) or 118 dots per centimeter (dpc). Dots per inch/centimeter are rough equivalents of pixels per inch/centimeter, so comparison is appropriate.
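The relationship between resolution, bit depth, and file size can be made concrete with a small Python sketch. It rests on stated assumptions: the 8.5 x 11 inch page is a hypothetical example, the function names are invented for illustration, and the size estimate counts pixel data only, ignoring the small overhead of TIFF headers and tags.

```python
CM_PER_INCH = 2.54


def ppi_to_ppc(ppi: float) -> float:
    """Convert pixels per inch to pixels per centimeter."""
    return ppi / CM_PER_INCH


def pixel_dimensions(width_in: float, height_in: float, ppi: int) -> tuple[int, int]:
    """Pixel width and height of a scan of the given physical size."""
    return round(width_in * ppi), round(height_in * ppi)


def uncompressed_bytes(width_in: float, height_in: float, ppi: int, bit_depth: int) -> int:
    """Approximate size of the uncompressed pixel data, in bytes."""
    width_px, height_px = pixel_dimensions(width_in, height_in, ppi)
    return width_px * height_px * bit_depth // 8


PAGE = (8.5, 11.0)  # hypothetical page size in inches

for ppi in (300, 600):
    width_px, height_px = pixel_dimensions(*PAGE, ppi)
    print(f"{ppi} ppi = {ppi_to_ppc(ppi):.0f} ppc, {width_px} x {height_px} pixels")
    for depth in (8, 24):
        size_mb = uncompressed_bytes(*PAGE, ppi, depth) / 1_000_000
        print(f"  {depth:>2} Bit: about {size_mb:.1f} MB uncompressed")
```

Doubling the resolution quadruples the pixel count, and moving from 8 Bit to 24 Bit triples the data per pixel, so the choice of 600 ppi and 24 Bit color is best reserved for the materials that need it.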
[Figure: Carifesta '72 logo printed in Guyana's Sunday Post and Weekend Argosy (zoom area in red box)]

Graphics printed in newspapers, for example, often have 80 to 100 dpi (32 to 40 dpc). Most graphics in magazines are printed with 120 dpi (48 dpc) print resolution, while graphics in high-end magazines and on post-cards are printed with 300 dpi (118 dpc) print resolution. Digitization of printed graphics at a resolution greater than 300 ppi (118 ppc) would be excessive.

The Rationale for Optical Character Recognition (Text Generation)

When a document page is digitized, an image of the page is created. All text page images sent to dLOC's central servers are subject to Optical Character Recognition (OCR). OCR is a process by which page images are converted to searchable text.

Several OCR programs are in common use. Most are optimized for the conversion of images digitized with 200, 300, 400 or 600 ppi (80, 118, 158 or 236 ppc). Images created at other resolutions can be converted to searchable text but, generally, with less accurate results.

Resolution and OCR Accuracy in high contrast images

                          75 ppi Image   150 ppi Image   300 ppi Image   600 ppi Image
OCR results               L ~iC          Label C         Label C         Label C
OCR results (continued)   ud L~ddPa      of d Laid Pa    of d Laid Pa    of d Laid Pa

The Importance of Bit-Depth on Text Recognition: the Latin word Feltis = Goodness

1 Bit Image     This letter may be any of the following: c – e – o – 0
8 Bit Image     This letter may be any of the following: c – e – o – 0
24 Bit Image    The letter e now appears to be more probable.

dLOC central servers use the Prime Recognition OCR system, configured with six OCR engines to ensure a high level of accuracy in text generation. For printed texts with normal size fonts, whether plain (sans serif) or embellished (serif), tests demonstrate that the average modern printed document is accurately recognized at 200 ppi (80 ppc).
dLOC sets a slightly higher standard, 300 ppi (118 ppc), for printed texts with normal size fonts to compensate for occasional uses of small fonts or colored, aged (discolored), or blemished paper.

In tests, digitization of normal printed texts at higher resolution (e.g., 600 ppi/236 ppc) generally showed no increase in text conversion accuracy. 600 ppi/236 ppc images result in higher conversion accuracy only when the source document is printed with very small fonts.

600 ppi / 236 ppc

dLOC recommends digitizing at 600 ppi (236 ppc) only when working with printed texts with very small fonts, photographs and other continuous-tone graphics, and manuscripts with difficult scripts.

Photographs

Photographs, unlike printed graphics, have continuous tone. In the source document, one shade or color blends into adjacent shades and colors. Continuous-tone images may be digitized at any resolution.

dLOC recommends 600 ppi (236 ppc) resolution to facilitate special uses of images. Users of digital photographs frequently consult images for their various subjects as well as for the whole image. A user may want to zoom in on the jewelry or hair braids in the photograph of a woman or on shop signs in the photograph of a street scene. dLOC central servers use JPEG 2000 technology to facilitate zoom. Digitizing at 600 ppi (236 ppc) produces clearer, sharper, and more readable images than digitizing at 300 ppi (118 ppc).

Saving Files and Image Compression

Once the digital image is created, there remains the issue of saving or archiving the file. The digitization technician prefers not to lose a quality image to the imperfections of file saving and image compression routines.

[Figure: the same image saved as TIFF, JPEG, and GIF]
TIFF contains all image data. JPEG compresses the image, seen here at leaf edges. GIF also compresses the image, seen here in color patches.

Saving Files

When saving an image file, the technician has a choice of file types, commonly including GIF, JPEG and TIFF.

GIF and JPEG (sometimes: JPG) are Internet deliverable file formats. dLOC creates these derivative or secondary file formats for participating institutions from a digital master. Institutions either not participating in dLOC or not using dLOC's central servers should observe similar practice.

Only the TIFF (sometimes: TIF; Tagged Image File Format) is considered archival within the international digital library community. It alone serves as a digital master. There are several reasons for this, chief among them image compression. The illustration above demonstrates image quality issues as a factor in file choice.

Image Compression

When saving an image file, often regardless of file type, the technician will be given the opportunity to compress the image. Compression saves file space but can produce other, unwelcome artifacts. There are two classes of compression: lossy and lossless.

Lossless compression may sound like an oxymoron. The essential point is that a lossless image contains every bit of information created during the scanning process; dLOC master files go further and are saved with no compression at all. Here is a simplification: when the scanner captures the bit-stream

1 1 1 1

the lossless file saves

1 1 1 1

Though this makes for large files, it also makes for an ideal archival format, one that is optimal for file recovery should the digital master ever be damaged in use or degrade in storage.

Lossy data compression technologies attempt to eliminate redundant or unnecessary information, storing a mathematical representation of the eliminated data.
Here is yet another simplification: when the scanner captures the bit-stream

1 1 1 1

the lossy file saves a representation of

4

Because lossy images generate smaller files, they can be delivered to readers via the Internet quickly. The human eye compensates for image loss by filling in the gaps. But, because there is image loss, recovery from damage or degradation is more difficult and, in many cases, may be impossible without great expense.

Effects of Compression on Image Quality

[Figure: the same image at three compression levels, each shown both as scanned and in a color enhanced version]

0% Compression (9 KB file size): No image artifacts.

50% Compression (5 KB file size): Image artifacts appear as dark discoloration at the bridge of the nose, and lightening together with slight blockiness at the temple.

85% Compression (3 KB file size): Image artifacts appear as blocky discoloration. Compression brings similar colors together, resulting in the block effect.
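The practical difference between the two classes of compression, namely whether the original data can be recovered exactly, can be demonstrated with a minimal Python sketch. It is a stand-in, not a model of any image format: zlib is a general-purpose lossless codec from the Python standard library, the hypothetical quantize function is only a crude imitation of the way lossy methods merge similar values, and the row of pixel values is invented for the example.

```python
import zlib


def quantize(data: bytes, step: int = 16) -> bytes:
    """Crude stand-in for lossy compression.

    Rounds every value down to a multiple of `step`, so that
    similar shades collapse into one and detail is discarded.
    """
    return bytes((value // step) * step for value in data)


# Hypothetical row of 8-bit grayscale pixels: a smooth gradient.
scanned = bytes(range(100, 164))

# Lossless: decoding returns exactly the bytes that were scanned.
restored = zlib.decompress(zlib.compress(scanned))
print("lossless copy identical to scan:", restored == scanned)

# Lossy: similar shades are merged and cannot be separated again.
lossy = quantize(scanned)
print("lossy copy identical to scan:", lossy == scanned)
print("distinct shades before:", len(set(scanned)), "after:", len(set(lossy)))
```

Merging neighboring shades into a single value is the same effect described above as blocky discoloration, and it is the reason a lossy derivative cannot stand in for the digital master.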
Scanning

Creating Directories

Before scanning, the toolkit should create the folder(s) in which you save the scanned images. For each item a separate folder is created at C:/DLOC/ with the appropriate dLOC ID, for example
o C:/DLOC/CA00000001/00001
o C:/DLOC/CA00000002/00001
o C:/DLOC/CA00000003/00001

dLOC Requirement for Digital Master Files
o 8-bit Grayscale or 24-bit RGB Color (depending on whether the original has significant color)
o 300 dpi for standard text or 600 dpi for stand-alone images (photographs, maps)
o Save archival files as uncompressed TIFFs

Flatbed Scanning: Epson Expression 10000 XL

The following screenprints are specific to the Epson Expression scanner, but the same settings apply to any flatbed scanner.

Scan Settings

Scan documents using Adobe Photoshop rather than the scanner's stand-alone image capture software, and check the scanner settings with each new document.

Before opening Adobe Photoshop, turn on the scanner and make sure that the bed is clean and free of any dust, debris, etc. If necessary, clean the glass with a lint free cloth and a very small amount of glass cleaning fluid.

1. Launch Adobe Photoshop
2. Select: File > Import > Epson Expression 10000 XL
   The scanning interface will then open two windows: a scan settings window and a preview window.
   [Figure: scan settings window and preview window]
3. Select the appropriate settings for your document
   a. At the top of the scan settings window select PROFESSIONAL MODE
   b. Always select the following settings
      1. Document Type: REFLECTIVE
      2. Document Source: DOCUMENT TABLE
      3. Auto Exposure Type: PHOTO
      4. Document Size: DO NOT ADJUST
      5. Target Size: ORIGINAL
   c. Select the appropriate color space and bit depth
      1. 8-bit grayscale for items without significant color
      2. 24-bit RGB color for all other items
   d. Select the appropriate resolution
      1. 300 dpi for mostly textual items
      2. 600 dpi for stand-alone image items (photographs, maps, etc.)
   e. Click the CONFIGURATION button (below the Preview and Scan buttons)
      1. Click the COLOR tab
      2. Select NO COLOR CORRECTION
      3. Click OK

Scanning

1. Place the item, image down, on the scanner glass. Be careful to place the item as straight as possible in order to save time later. Close the scanner lid as much as the item permits.
2. Click the PREVIEW button in the Scan Settings window. A small preview of your image will appear in the preview window. Make sure the entire document is visible; if it is not, reposition the item on the glass and preview again.
3. Draw a bounding box around your entire image. If your original has 2 pages facing each other, draw a second box by selecting the dual marquee button. Arrange each box to completely include each side of the item. Once you are satisfied with the positioning of the boxes, click the ALL button. DO NOT move the boxes or change settings after pressing this button!
4. Click the SCAN button

Saving Files

1. Save your image by selecting: File > Save
2. Select the dLOC ID folder to which the image being saved belongs, that is, the separate folder at C:/DLOC/ with the appropriate dLOC ID, for example
   o C:/DLOC/CA00000001/00001
   o C:/DLOC/CA00000002/00001
   o C:/DLOC/CA00000003/00001
3. Type in a sequential four digit file name, such as 0001, 0002, 0003, etc.
4. Select TIFF from the file format drop down menu
5. Always uncheck the ICC profile box
6. Click Save

For TIFF Options, select the settings shown in the figure.

[Figure: TIFF Options dialog]
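The folder layout and sequential file naming described under Creating Directories and Saving Files can be sketched in Python. This is only an illustration of the naming pattern, not the dLOC toolkit itself: the function names and the sub_folder parameter are hypothetical, the .tif extension is an assumption, and a temporary directory stands in for C:/DLOC so that the sketch runs on any machine.

```python
import tempfile
from pathlib import Path


def item_folder(root: Path, dloc_id: str, sub_folder: str = "00001") -> Path:
    """Build the folder path for one item, e.g. <root>/CA00000001/00001."""
    return root / dloc_id / sub_folder


def page_file_name(sequence: int) -> str:
    """Sequential four digit file name; the .tif extension is an assumption."""
    if not 1 <= sequence <= 9999:
        raise ValueError("sequence must fit in four digits")
    return f"{sequence:04d}.tif"


def prepare_item(root: Path, dloc_id: str, page_count: int) -> list[Path]:
    """Create the item folder and return the paths the page scans would use."""
    folder = item_folder(root, dloc_id)
    folder.mkdir(parents=True, exist_ok=True)
    return [folder / page_file_name(n) for n in range(1, page_count + 1)]


# A temporary directory stands in for C:/DLOC in this sketch.
with tempfile.TemporaryDirectory() as scratch:
    for path in prepare_item(Path(scratch), "CA00000001", page_count=3):
        print(path.relative_to(scratch).as_posix())
```

Zero-padded names sort in page order in any file browser, which is one practical reason for the four digit convention.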
Image Correction

The intent of any digitization should be a faithful reproduction of the original document. Toward this goal, images will need to be deskewed and cropped to fit the in-hand original. In addition, it may be desirable to perform color correction to reproduce either the in-hand original or the original state of the document. Applying these techniques in Adobe Photoshop is the topic of this section.

Image Correction in Adobe Photoshop

1. To straighten drastically skewed images:
   a. Click and hold the Eyedropper Tool in the Photoshop tool box, then select the Measure Tool
   b. Click and draw a line that follows the bottom of any printed text, line or image (the line is red, here, for purposes of illustration)
   c. Select: Image > Rotate Canvas > Arbitrary (DO NOT change the angle), then click OK
2. Crop the image, using the crop tool, to remove any excess borders added during straightening
3. If necessary (e.g., if the image is muddy), adjust the levels/histogram by selecting Image > Adjustments > Levels
   Adjust the black, white and midpoints to improve your image quality and contrast. If the image is COLOR, you may make histogram adjustments for each RGB channel: Red, Green and Blue. But do not over correct and eliminate detail.

   [Figure: Levels window with histogram]

   A histogram shows the distribution of tones over a range. The image characterized by the histogram above is predominantly white. While the image contains shades of gray, deeper tones of black are almost entirely absent.
4. Images with good, thick printed text can also be quickly corrected by selecting the document's white point. This is done by opening the levels/histogram window by selecting Image > Adjustments > Levels

   In the levels window, select the eyedropper furthest to the right and then select the point in your image that should be the brightest white. The images below show this effect before and after the white point selection.

   You will notice that the background becomes almost uniformly white, but the text is also lightened. Before selecting OK in the levels/histogram window, you will need to bring in the black point in order to improve the text. This is done by moving the arrow furthest to the left in towards the right. You will notice that the numbers in the Input Levels boxes increase.
   It is helpful to perform this correction while zoomed in to 100% on your image.

   [Figure: levels correction performed at 100% zoom]
5. If the image is extremely stained, the document should be scanned in RGB and, if possible, the stains should be lightened using Image > Adjustments > Replace Color
   Select "Image" and not "Selection" in the Replace Color window. Then, using the eyedropper tool, select the darker color of the stain. Adjust the Lightness, Saturation and Hue slider bars as needed to minimize the stains. The fuzziness meter indicates how closely a color must match the selected color to be replaced. Be aware that stains may be similar in color to text; too much manipulation is therefore undesirable, because information may be lost. Often it is useful to zoom into one section of text while performing the color replacement. Be careful not to make the text harder for the OCR engine to read.
6. Remember that any adjustments made to images can be undone as long as the file remains open. Keep your History window open by selecting Window > History in Photoshop, then simply select the previous step. You can always go back several steps and re-correct your image.

Other Adobe Photoshop Resources

The original Adobe Photoshop installation package should include a tutorial for the software you purchased. In addition, Adobe has an on-line resource at the following URL:
http://www.adobe.com/products/tips/photoshop.html
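For readers who want to see what the levels adjustment in steps 3 and 4 of Image Correction does to individual pixel values, here is a minimal Python sketch. It uses a common textbook formulation of a levels mapping and should not be read as Photoshop's internal algorithm; the function names are hypothetical, and the sample values (paper at 220, ink at 60) are invented for the example.

```python
def apply_levels(value: int, black: int = 0, white: int = 255, gamma: float = 1.0) -> int:
    """Map one 8-bit input value through a black point, white point, and midpoint.

    Values at or below `black` become 0, values at or above `white`
    become 255, and values in between are stretched across the range.
    A gamma above 1.0 lightens the midtones; below 1.0 darkens them.
    """
    if not 0 <= black < white <= 255:
        raise ValueError("require 0 <= black < white <= 255")
    clipped = min(max(value, black), white)
    normalized = (clipped - black) / (white - black)
    return round(255 * normalized ** (1.0 / gamma))


def histogram(pixels: list[int]) -> list[int]:
    """Count how many pixels hold each of the 256 possible 8-bit values."""
    counts = [0] * 256
    for value in pixels:
        counts[value] += 1
    return counts


# Hypothetical scan: grayish paper at 220, faded ink at 60.
for original in (60, 100, 220, 240):
    corrected = apply_levels(original, black=60, white=220)
    print(f"input {original:>3} -> output {corrected:>3}")
```

Setting the white point at the paper tone pushes the background to pure white, and bringing the black point in to the ink tone restores dark text. Any values outside the two points are clipped, which is why over-correction eliminates detail.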
Adobe, the Adobe Logo, and Photoshop are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.