Accessibility of Tables in PDF Documents
Issues, Challenges and Future Directions
People access and share information over the web and in other digital environments, including digital libraries, in the form of documents such as books, articles, and technical reports. These documents come in a variety of formats, of which the Portable Document Format (PDF) is the most widely used because of its emphasis on preserving the layout of the original material. Retrieving relevant material from these derivative documents is challenging for information retrieval (IR) because their rich semantic structure is lost, which makes important units such as images, figures, algorithms, mathematical formulas, and tables difficult to retrieve. Among these elements, tables are particularly important: if they are made retrievable and presentable to readers, they can add value to the resource description, discovery, and accessibility of documents, both on the web and in libraries. Sighted users rely on visual cues to make sense of tables, whereas blind and visually impaired users must rely on assistive technologies, including text-to-speech tools and screen readers. However, these technologies do not handle tables well enough to present them effectively to visually impaired individuals. Tables in PDF documents therefore need to be made not only retrievable but also comprehensible. Before such solutions can be developed, the available assistive technologies, tools, and frameworks must be reviewed for their capabilities, strengths, and limitations from the comprehension perspective of blind and visually impaired people, along with suitable environments such as digital libraries. We found no review article that critically and analytically presents and evaluates these technologies.
To fill this gap in the literature, this review paper reports on the current state of the accessibility of PDF documents, digital libraries, assistive technologies, tools, and frameworks that make PDF tables comprehensible and accessible to blind and visually impaired people. The study findings have implications for libraries, information sciences, and information retrieval.
Copyright (c) 2021 Nosheen Fayyaz, Shah Khusro, and Shakir Ullah
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Authors that submit to Information Technology and Libraries agree to the Copyright Notice.