Digitization of Text Documents Using PDF/A

Yan Han, Xueheng Wan

Abstract


This article demonstrates a practical use case of the PDF/A file format for digitizing textual documents, following the recommendation that PDF/A be a preferred digitization file format. The authors show how to convert and combine all of a document's TIFF images, together with their associated metadata, into a single PDF/A-2b file. Using open source software and real-life examples, they show readers how to convert TIFF images, extract the associated metadata and ICC profiles, and validate the result with the newly released PDF/A validator. The generated PDF/A file is a self-contained, self-describing container that accommodates all the data produced by digitizing textual materials, including page-level metadata and/or ICC profiles. Theoretical analysis and empirical examples indicate that the PDF/A file format has many advantages over the traditionally preferred file formats, TIFF and JPEG2000, for digitizing textual documents.
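
As a rough illustration of the "extract associated metadata and ICC profiles" step, the sketch below reads the first image file directory (IFD) of a classic TIFF using only the Python standard library and pulls out the ImageDescription tag (270) and the embedded ICC profile tag (34675). It is not the authors' tooling, which the abstract does not name; the function names `read_first_ifd` and `page_metadata` are hypothetical, and the sketch assumes a classic (non-BigTIFF) file whose metadata sits in the first IFD.

```python
import struct

# Byte sizes of the TIFF 6.0 field types 1-12 (BYTE, ASCII, SHORT, LONG, ...).
TYPE_SIZES = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8, 6: 1,
              7: 1, 8: 2, 9: 4, 10: 8, 11: 4, 12: 8}
IMAGE_DESCRIPTION_TAG = 270   # free-text description of the page image
ICC_PROFILE_TAG = 34675       # embedded ICC colour profile


def read_first_ifd(data: bytes) -> dict:
    """Return {tag: raw value bytes} for the first IFD of a classic TIFF."""
    order = data[:2]
    if order == b"II":
        endian = "<"          # little-endian file
    elif order == b"MM":
        endian = ">"          # big-endian file
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF is not handled)")
    (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    fields = {}
    for i in range(entry_count):
        entry = ifd_offset + 2 + 12 * i          # each entry is 12 bytes
        tag, ftype, n = struct.unpack_from(endian + "HHI", data, entry)
        size = TYPE_SIZES.get(ftype, 1) * n
        if size <= 4:
            start = entry + 8                    # value stored inline
        else:
            (start,) = struct.unpack_from(endian + "I", data, entry + 8)
        fields[tag] = data[start:start + size]
    return fields


def page_metadata(data: bytes) -> dict:
    """Collect the description text and ICC profile bytes, if present."""
    fields = read_first_ifd(data)
    description = fields.get(IMAGE_DESCRIPTION_TAG, b"").rstrip(b"\x00")
    return {
        "description": description.decode("ascii", errors="replace"),
        "icc_profile": fields.get(ICC_PROFILE_TAG),  # None when absent
    }
```

A caller would pass the bytes of one page image, for example `page_metadata(open(path, "rb").read())`, and carry the returned values into whatever tool assembles the PDF/A file. A multi-page TIFF would additionally require following the next-IFD offset stored after the last entry, which this sketch leaves out.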


DOI: https://doi.org/10.6017/ital.v37i1.9878


Copyright (c) 2018 Information Technology and Libraries

License URL: http://creativecommons.org/licenses/by/3.0/

ISSN: 2163-5226
