Automatic Extraction of Figures from Scientific Publications in High-Energy Physics

Authors

  • Piotr Adam Praczyk, CERN, Universidad de Zaragoza
  • Javier Nogueras-Iso, Universidad de Zaragoza

DOI:

https://doi.org/10.6017/ital.v32i4.3670

Abstract

Plots and figures play an important role in the process of understanding a scientific publication, providing overviews of large amounts of data or ideas that are difficult to present intuitively using text alone. The state of the art in digital libraries, which serve as gateways to knowledge encoded in scholarly writings, does not take full advantage of the graphical content of documents. Enabling machines to automatically unlock the meaning of scientific illustrations would allow immense improvements in the way scientists work and in the way knowledge is processed.

In this paper we present a novel solution to the initial problem of processing graphical content: obtaining figures from scholarly publications stored in PDF format. Our method relies on the vector properties of documents and, as such, does not introduce the additional errors characteristic of methods based on raster image processing. Emphasis has been placed on correctly processing documents in High Energy Physics. The described approach distinguishes between different classes of objects appearing in PDF documents and uses spatial clustering techniques to group objects into larger logical entities. A number of heuristics allow the rejection of incorrect figure candidates and the extraction of different types of metadata.
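To make the idea of spatial clustering concrete, the following is a minimal Python sketch, not the authors' algorithm. It assumes that graphical objects have already been reduced to axis-aligned bounding boxes, groups boxes whose gap is at most a hypothetical `margin`, and applies one illustrative size heuristic to reject candidates. The names `boxes_close`, `cluster_boxes`, `keep_candidate`, and all thresholds are assumptions introduced here for illustration.

```python
from typing import Dict, List, Tuple

# A bounding box as (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1.
Box = Tuple[float, float, float, float]


def boxes_close(a: Box, b: Box, margin: float) -> bool:
    """Return True if the gap between a and b is at most `margin` on both axes."""
    return (a[0] - margin <= b[2] and b[0] - margin <= a[2]
            and a[1] - margin <= b[3] and b[1] - margin <= a[3])


def cluster_boxes(boxes: List[Box], margin: float) -> List[Box]:
    """Group nearby boxes with union-find and return one merged box per group.

    Uses a simple pairwise comparison (quadratic in the number of boxes),
    which is enough for a sketch but not tuned for large pages.
    """
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if boxes_close(boxes[i], boxes[j], margin):
                parent[find(i)] = find(j)

    groups: Dict[int, List[Box]] = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)

    # Merge each group into the smallest box that encloses all its members.
    return [
        (min(b[0] for b in g), min(b[1] for b in g),
         max(b[2] for b in g), max(b[3] for b in g))
        for g in groups.values()
    ]


def keep_candidate(box: Box, min_width: float, min_height: float) -> bool:
    """Illustrative heuristic: reject candidates that are too small."""
    return (box[2] - box[0]) >= min_width and (box[3] - box[1]) >= min_height


if __name__ == "__main__":
    page_objects = [(0, 0, 10, 10), (12, 0, 20, 10), (100, 100, 110, 110)]
    candidates = cluster_boxes(page_objects, margin=5)
    figures = [c for c in candidates if keep_candidate(c, 15, 5)]
    print(figures)
```

In this sketch the first two boxes are 2 units apart and merge into a single candidate, while the distant third box forms its own group and is then rejected by the size threshold. A real extractor would also need to separate text operations from graphical ones before clustering, as the abstract describes.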

Author Biography

Piotr Adam Praczyk, CERN, Universidad de Zaragoza

PhD student affiliated with both CERN and the University of Zaragoza

Published

2013-12-22

How to Cite

Praczyk, P. A., & Nogueras-Iso, J. (2013). Automatic Extraction of Figures from Scientific Publications in High-Energy Physics. Information Technology and Libraries, 32(4), 25–52. https://doi.org/10.6017/ital.v32i4.3670

Section

Articles