Efficiently Processing and Storing Library Linked Data using Apache Spark and Parquet

Authors

  • Kumar Sharma, Research Scholar, Department of Computer Science and Engineering, University of Kalyani, Kalyani, West Bengal. http://orcid.org/0000-0003-0133-3926
  • Ujjal Marjit, System-in-Charge, Center for Information Resource Management (CIRM), University of Kalyani, Kalyani, West Bengal.
  • Utpal Biswas, Professor, Department of Computer Science and Engineering, University of Kalyani, Kalyani, West Bengal.

DOI:

https://doi.org/10.6017/ital.v37i3.10177

Abstract

Resource Description Framework (RDF) is a commonly used data model in the Semantic Web environment. Libraries and various other communities have been using the RDF data model to store valuable data after it is extracted from traditional storage systems. However, because of the large volume of the data, processing and storing it has become very difficult for traditional data-management tools. This challenge demands a scalable, distributed system that can manage data in parallel. This article proposes a distributed solution for efficiently processing and storing the large volume of library linked data stored in traditional storage systems. Apache Spark is used for parallel processing of large data sets, and a column-oriented schema is proposed for storing RDF data. The storage system is built on top of the Hadoop Distributed File System (HDFS) and uses the Apache Parquet format to store data in a compressed form. The experimental evaluation showed that storage requirements were reduced significantly compared with the Jena TDB, Sesame, RDF/XML, and N-Triples formats. SPARQL queries are processed using Spark SQL to query the compressed data. The experimental evaluation also showed good query response times, which decrease significantly as the number of worker nodes increases.
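The abstract does not spell out the proposed column-oriented schema, so the following is only a minimal sketch of one common column-oriented layout for RDF, vertical partitioning by predicate, and may differ from the authors' design. It uses plain Python lists in place of Spark DataFrames and Parquet files so that it runs without a cluster. The function names, the sample triples, and the simplified N-Triples parsing are all assumptions made for illustration.

```python
from collections import defaultdict

def parse_ntriple(line):
    """Split one simple N-Triples line into (subject, predicate, object).

    Simplified on purpose: assumes single spaces between terms and a
    trailing " ." terminator; it is not a full N-Triples parser.
    """
    subject, predicate, obj = line.rstrip(" .\n").split(" ", 2)
    return subject, predicate, obj

def to_columns(lines):
    """Group triples by predicate; each group keeps two parallel columns."""
    tables = defaultdict(lambda: {"subject": [], "object": []})
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        subject, predicate, obj = parse_ntriple(line)
        tables[predicate]["subject"].append(subject)
        tables[predicate]["object"].append(obj)
    return dict(tables)

def objects_of(tables, predicate, subject):
    """Answer a one-pattern query: all ?o such that (subject predicate ?o)."""
    table = tables.get(predicate, {"subject": [], "object": []})
    return [o for s, o in zip(table["subject"], table["object"]) if s == subject]

# Hypothetical sample data, not taken from the article.
SAMPLE = [
    '<http://example.org/book/1> <http://purl.org/dc/terms/title> "Dune" .',
    '<http://example.org/book/1> <http://purl.org/dc/terms/creator> <http://example.org/person/1> .',
    '<http://example.org/book/2> <http://purl.org/dc/terms/title> "Emma" .',
]

tables = to_columns(SAMPLE)
titles = objects_of(
    tables,
    "<http://purl.org/dc/terms/title>",
    "<http://example.org/book/1>",
)
```

In a layout like this, a query that touches one predicate reads only that predicate's columns, and a column of repeated or similar values tends to compress well. Those are the general properties that make columnar formats such as Parquet attractive for RDF storage.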

Published

2018-09-26

How to Cite

Sharma, K., Marjit, U., & Biswas, U. (2018). Efficiently Processing and Storing Library Linked Data using Apache Spark and Parquet. Information Technology and Libraries, 37(3), 29–49. https://doi.org/10.6017/ital.v37i3.10177

Section

Articles