Digitization of Text Documents Using PDF/A

Authors

DOI:

https://doi.org/10.6017/ital.v37i1.9878

Abstract

This article demonstrates a practical use case of the PDF/A file format for digitizing textual documents, following the recommendation that PDF/A be used as a preferred digitization file format. The authors show how to convert and combine all of a document's TIFF images, together with their associated metadata, into a single PDF/A-2b file. Using open source software and real-life examples, they show readers how to convert TIFF images, extract associated metadata and ICC profiles, and validate the output with the newly released PDF/A validator. The generated PDF/A file is a self-contained, self-describing container that accommodates all the data produced by digitizing textual materials, including page-level metadata and/or ICC profiles. Based on theoretical analysis and empirical examples, the authors argue that the PDF/A file format has many advantages over the traditionally preferred file formats, TIFF and JPEG2000, for digitization of textual documents.

Author Biography

Xueheng Wan, The University of Arizona

Department of Computer Science

Published

2018-03-19

How to Cite

Han, Y., & Wan, X. (2018). Digitization of Text Documents Using PDF/A. Information Technology and Libraries, 37(1), 52–64. https://doi.org/10.6017/ital.v37i1.9878

Section

Communications