Digitization of Text Documents Using PDF/A

Yan Han; Xueheng Wan

doi:10.6017/ital.v37i1.9878

Authors

Yan Han The University of Arizona http://orcid.org/0000-0001-9518-2684
Xueheng Wan The University of Arizona http://orcid.org/0000-0001-5577-0502

DOI:

https://doi.org/10.6017/ital.v37i1.9878

Abstract

The purpose of this article is to demonstrate a practical use case of the PDF/A file format for the digitization of textual documents, following the recommendation to use PDF/A as a preferred digitization file format. The authors show how to convert and combine all the TIFFs for a document, with their associated metadata, into a single PDF/A-2b file. Using open source software and real-life examples, the authors show readers how to convert TIFF images, extract the associated metadata and ICC profiles, and validate the result against the newly released PDF/A validator. The generated PDF/A file is a self-contained and self-described container that accommodates all the data from the digitization of textual materials, including page-level metadata and/or ICC profiles. Through theoretical analysis and empirical examples, the authors argue that the PDF/A file format has many advantages over the traditional preferred file formats, TIFF and JPEG2000, for the digitization of textual documents.

Author Biography

Xueheng Wan, The University of Arizona

Department of Computer Science
