LISTSERV - ARCHIVES Archives - LISTSERV.MIAMIOH.EDU

If you do not have images, but only text from the original, you face the
possibility that the rendered PDF text can be wrong, badly read by the OCR
program, and/or not reflect changes made in handwriting to the originals.
In many cases the optical character recognition done to make the PDF cannot
read or reflect what is in the original documents.  The program may decide
automatically what is text and what is graphics, then either try to read
graphics as text or "decide" that text is graphics, and make a hodgepodge of
your page.  Trying to correct PDF or other text derived from essentially
graphic documents is extremely time-consuming, frustrating, and sometimes
impossible.

We've been facing that here as we work to set up a digitization project.
Published reports can often be simply rendered into PDF without incident,
but mixed (graphics/text), handwritten, or hand-corrected documents can
cause problems.  One telling example that we scanned early on was a
photocopy of a carbon copy in which one important mistake had been corrected
by hand.  The OCR text needed a large investment of time to correct because
the original was so hard to read, and only an image file could pick up the
handwritten correction and make it legible.  I thought it inauthentic to
simply correct the stenographer's mistake in the PDF file.
We do use PDFs as a text-file format.  One advantage of an image format is
that you always have an image from which to correct your text files should
this not be possible on first pass, or should errors be discovered later.

The PDF format can be used to create image-only files, but then you have no
searchable text unless you make one image and one text file of the same
document.  (We've been unable to see and correct the OCR'd "invisible layer"
of text in the Adobe "Searchable Image (Exact)," and we haven't been able to
get an answer out of Adobe on how to do this.) So you would need to make two
files of each document:  one image and one text, unless there is nothing
remarkable about the image (a published document), or no text worth
searching in a mixed or graphic document.  Since you're saving two files
anyway, why not make the image a TIFF?

You can see in the archives of this list the discussions of the reasons for
using the TIFF format for images.  One excellent article to which I think I
was referred from this list is at
http://aic.stanford.edu/sg/emg/library/pdf/vitale/2006-01-vitale-digital_image_file_formats.pdf
(For those not interested in the history of imaging systems, start with
#5, page 30, "Image File Formats.")  The arguments for
TIFF include the lack of compression (in at least one variety) and its
likely projection into the future as an imaging format.  Note that there are
different TIFF formats, and that you will need to be sure that the format
you choose from your imaging program produces files that can be opened and
read by other software.  (We scanned upwards of 10 images before realizing
that we were using an incompatible TIFF format.)  
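
One way to catch this earlier is to check which compression scheme a
scanner's TIFF output actually uses before committing to a batch.  The
following is only a minimal sketch in Python, not a validated tool:  it
reads the Compression tag (259) from the first image directory of a classic
TIFF file using nothing but the standard library.  The function names and
the table of scheme names are my own illustrative choices, and the table
lists only the common values.  BigTIFF files and files whose tag layout is
unusual are not handled.

```python
import struct

# Common values of the TIFF Compression tag (259); not an exhaustive list.
COMPRESSION_NAMES = {
    1: "uncompressed",
    2: "CCITT RLE",
    3: "CCITT Group 3",
    4: "CCITT Group 4",
    5: "LZW",
    6: "JPEG (old-style)",
    7: "JPEG",
    8: "Deflate",
    32773: "PackBits",
}


def tiff_compression(data: bytes) -> int:
    """Return the Compression tag value from the first IFD of a classic TIFF."""
    if len(data) < 8:
        raise ValueError("too short to be a TIFF file")
    if data[:2] == b"II":
        endian = "<"  # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"  # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file: bad byte-order mark")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file (magic number is not 42)")
    (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    for i in range(entry_count):
        entry = ifd_offset + 2 + 12 * i  # each directory entry is 12 bytes
        tag, field_type, count = struct.unpack_from(endian + "HHI", data, entry)
        if tag == 259:
            # A single SHORT value sits in the first two bytes of the value field.
            (value,) = struct.unpack_from(endian + "H", data, entry + 8)
            return value
    return 1  # TIFF 6.0 default when the tag is absent: no compression


def describe_tiff(path: str) -> str:
    """Read a file and return a human-readable name for its compression."""
    with open(path, "rb") as f:
        value = tiff_compression(f.read())
    return COMPRESSION_NAMES.get(value, "unknown scheme %d" % value)
```

Calling describe_tiff() on a sample scan (the path is whatever your
scanning software wrote) returns a name such as "uncompressed" or "LZW".
A result other than "uncompressed" is a prompt to test the file in the
other software you expect to use, not proof that the file is unreadable.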

So, yes, I agree that your best policy is to save a "master" TIFF file of
each document image, and a PDF (if that's your chosen format) for searchable
text.  If you do only one or the other, even assuming a perfectly legible
textual document, you have lost either the ability to search the text or the
ability to correct the searchable text without consulting the original
document.

Arel Lucas, C.A.
Archives/Special Collections Librarian
Embry-Riddle Aeronautical University
Prescott Campus
-----Original Message-----
From: Archives & Archivists [mailto:[log in to unmask]] On Behalf
Of Rhue, Monika
Sent: Friday, September 08, 2006 12:19 PM
To: [log in to unmask]
Subject: Digitization Question
Importance: High

We are working on a digitization project which involves scanning
original documents from our archival collection. The web master wants to
scan these letters, correspondence, etc. into PDF without creating a
master file. From my research, we should scan all original documents in
TIFF as the master files. Maybe the PDF can be the means by which people
access the documents. It was stated that both the Florida Digital
Archive and the Deep Blue repository at the University of Michigan
consider PDF an acceptable archival format.

http://www.fcla.edu/digitalArchive/pdfs/PDFGuideline.pdf#search=%22adobe%20acrobat%20professional%20A-1b%20standard%22

http://deepblue.lib.umich.edu/about/deepbluepreservation.jsp

However, I want to get the opinion of my colleagues.

Thanks

Nooma Monika Rhue, MLIS
Archivist/Archival Services Librarian
Inez Moore Parker Archives and Research Center
Johnson C. Smith University
100 Beatties Ford Road
Charlotte, NC 28216
704-371-6741
Email: [log in to unmask]

<http://archives.jcsu.edu/echo>

A posting from the Archives & Archivists LISTSERV List sponsored by the
Society of American Archivists, www.archivists.org.
For the terms of participation, please refer to
http://www.archivists.org/listservs/arch_listserv_terms.asp.

To subscribe or unsubscribe, send e-mail to [log in to unmask]
      In body of message:  SUB ARCHIVES firstname lastname
                    *or*:  UNSUB ARCHIVES
To post a message, send e-mail to [log in to unmask]

Or to do *anything* (and enjoy doing it!), use the web interface at
     http://listserv.muohio.edu/archives/archives.html

Problems?  Send e-mail to Robert F Schmidt <[log in to unmask]>
