Digitization of Text Documents Using PDF/A

Abstract

The purpose of this article is to demonstrate a practical use case of the PDF/A file format for digitizing textual documents, following the recommendation that PDF/A be used as a preferred digitization file format. The authors show how to convert and combine all the TIFFs of a document, together with their associated metadata, into a single PDF/A-2b file. Using open source software and real-life examples, they show readers how to convert TIFF images, extract the associated metadata and ICC profiles, and validate the result with the newly released PDF/A validator. The generated PDF/A file is a self-contained and self-describing container that accommodates all the data produced by digitizing textual materials, including page-level metadata and/or ICC profiles. Through theoretical analysis and empirical examples, the authors argue that the PDF/A file format has many advantages over TIFF and JPEG2000, the traditionally preferred file formats for digitizing textual documents.
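
The abstract does not name the tools the authors used, so the following is only a rough illustration of the "extract associated metadata and ICC profiles" step, not the authors' method. It is a minimal Python sketch that reads the first image file directory (IFD) of a classic TIFF and returns the raw bytes stored under the XMP tag (700) and the ICC profile tag (34675). The function name read_tiff_tags is hypothetical, and the sketch assumes a non-BigTIFF file held fully in memory.

```python
import struct

# Tag numbers registered for TIFF files.
TAG_XMP = 700             # XMP metadata packet
TAG_ICC_PROFILE = 34675   # embedded ICC colour profile

# Size in bytes of one value for the TIFF field types handled here:
# 1 BYTE, 2 ASCII, 3 SHORT, 4 LONG, 5 RATIONAL, 7 UNDEFINED.
TYPE_SIZES = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8, 7: 1}


def read_tiff_tags(data, wanted=(TAG_XMP, TAG_ICC_PROFILE)):
    """Return {tag: raw bytes} for the wanted tags in the first IFD.

    Hypothetical helper for illustration; handles classic TIFF only.
    """
    order = data[:2]
    if order == b"II":
        endian = "<"      # little-endian ("Intel") byte order
    elif order == b"MM":
        endian = ">"      # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file")

    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("unsupported TIFF variant (BigTIFF is not handled)")

    (entry_count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    found = {}
    for i in range(entry_count):
        # Each IFD entry is 12 bytes: tag, type, count, value-or-offset.
        start = ifd_offset + 2 + 12 * i
        tag, ftype, count = struct.unpack(endian + "HHI", data[start:start + 8])
        if tag not in wanted:
            continue
        size = TYPE_SIZES.get(ftype, 1) * count
        if size <= 4:
            # Small values are stored inline in the entry itself.
            found[tag] = data[start + 8:start + 8 + size]
        else:
            # Larger values live elsewhere; the entry holds their offset.
            (offset,) = struct.unpack(endian + "I", data[start + 8:start + 12])
            found[tag] = data[offset:offset + size]
    return found
```

Calling read_tiff_tags(open(path, "rb").read()) on a scanned page returns a dictionary that may contain the XMP packet and the ICC profile, if the scanner wrote them. Embedding those bytes in a PDF/A-2b file and validating the result are separate steps that this sketch does not cover.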

Author Biography

Xueheng Wan, The University of Arizona
Department of Computer Science

Published

2018-03-19

How to Cite

Han, Y., & Wan, X. (2018). Digitization of Text Documents Using PDF/A. Information Technology and Libraries, 37(1), 52-64. https://doi.org/10.6017/ital.v37i1.9878

Section

Communications