Applying Topic Modeling for Automated Creation of Descriptive Metadata for Digital Collections

Keywords: metadata, subject headings, natural language processing, topic modeling, R programming language


Creating descriptive metadata for digital objects is a laborious process. Subject analysis in particular, which classifies the intellectual content of digitized documents, typically requires considerable time and effort to determine the subject headings that best represent those documents. This project examines the use of topic modeling to streamline the workflow for assigning subject headings to a digital collection of New Mexico State University news releases issued between 1958 and 2020. Optimizing this workflow enables timely scholarly access to unique primary source documentation.

Author Biography

Monika Glowacka-Musial, New Mexico State University

Assistant Professor/Metadata Librarian

Technical Services, New Mexico State University Library


How to Cite
Glowacka-Musial, M. (2022). Applying Topic Modeling for Automated Creation of Descriptive Metadata for Digital Collections. Information Technology and Libraries, 41(2).