Batch Ingesting into EPrints Digital Repository Software

This paper describes the batch importing strategy and workflow used to import thesis metadata and PDF documents into the EPrints digital repository software. A two-step strategy of importing metadata in MARC format followed by attachment of PDF documents is described in detail, including Perl source code for the scripts used. The processes described were used to ingest the metadata and PDFs of approximately 6,000 theses into an EPrints institutional repository.


INTRODUCTION
Tutorials have been published about batch ingestion of ProQuest metadata and electronic theses and dissertations (ETDs),1 as well as an EndNote library,2 into the Digital Commons platform. The procedures for bulk importing of ETDs using DSpace have also been reported.3 However, bulk importing into the EPrints digital repository software has not been exhaustively addressed in the literature.4 A recent article by Walsh provides a literature review of batch importing into institutional repositories.5 The only published report on batch importing into the EPrints platform describes Perl scripts for metadata-only records import from Thomson Reuters Reference Manager.6 Bulk importing is often one of the first tasks after launching a repository, so it is unsurprising that requests for reports and documentation on EPrints-specific workflows have recurred on the EPrints Tech List.7 A recently published review of EPrints identifies "the absence of a bulk uploading feature" as its most significant weakness.8 Although EPrints' graphical user interface for bulk importing is limited to the use of the installed import plugins, the software does have a versatile infrastructure for this purpose. Leveraging EPrints' import functionality requires some Perl scripting, structuring the data for import, and using the command line interface.
In 2009, when Concordia University launched Spectrum,9 its research repository, the first task was a batch ingest of approximately 6,000 theses dated from 1967 to 2003. The source material for this import consisted of MARC records from an Innovative Interfaces integrated library system and PDF documents from ProQuest. This paper reports on the strategy and workflow adopted for batch ingestion of this content into the EPrints digital repository software.
Command-line importing is recommended because it is easier to monitor the operation in real time by adding progress information output to the import plugin code.
The task of batch importing can be split into the following subtasks:
1. import of metadata of each item
2. import of associated documents, such as full-text PDF files
The strategy adopted was to first import the metadata for all of the new items into the inbox of an editor's account. After this first step was completed, a script was used to loop through the newly imported eprints and attach the corresponding full-text documents. Although documents can be imported from the local file system or via HTTP, the files were imported from the local file system.
The batch import procedure varies depending on the format of the metadata and documents to be imported. Metadata import requires a mapping of the source schema fields to the default or custom fields in EPrints. The source metadata must also be converted into one of the formats supported by EPrints' import plugins, or a custom plugin must be created. Import plugins are available for many popular formats, including BibTeX, DOI, EndNote, and PubMedXML. In addition, community-contributed import plugins such as MARC and ArXiv are available at EPrints Files.11 Since most repositories use custom metadata fields, some customization of the import plugins is usually necessary.

MARC Plugin for EPrints
In EPrints, the import and export plugins ensure interoperability of the repository with other systems. Import plugins read metadata from one schema and load it into the EPrints system through a mapping of the fields into the EPrints schema. Loading MARC-encoded files into EPrints requires the installation of the import/export plugin developed by Romero and Miguel.12 The installation of this plugin requires the following two CPAN modules: MARC::Record and MARC::File::USMARC. The MARC plugin was then subclassed to create an import plugin named "Concordia Theses," which is customized for thesis MARC records.

Concordia Theses MARC Plugin
The MARC plugin features a central configuration file (see appendix A) in which each MARC field is paired with a corresponding mapping to an EPrints field. Most of the fields were configured through this configuration file (see table 1).
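The idea of a declarative field mapping can be sketched as follows. This is an illustration in Python rather than the plugin's Perl configuration file; the field pairs and the helper name map_simple_fields are assumptions chosen for the example and are not the mappings listed in table 1.

```python
# Hypothetical mapping from MARC tag+subfield to an EPrints field name.
# These pairs are illustrative assumptions, not the plugin's actual configuration.
MARC_TO_EPRINTS = {
    "245a": "title",
    "520a": "abstract",
    "260c": "date",
}


def map_simple_fields(marc_record, mapping=MARC_TO_EPRINTS):
    """Copy each mapped MARC value into a dict keyed by EPrints field name.

    marc_record is a plain dict such as {"245a": "Some title"}. Tags that
    are absent from the mapping are skipped so that custom code can deal
    with them separately.
    """
    eprint = {}
    for tag, value in marc_record.items():
        field = mapping.get(tag)
        if field is not None:
            eprint[field] = value.strip()
    return eprint
```

Keeping the simple one-to-one pairs in a table of this kind leaves only the compound or irregular fields to be handled in code.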
The source MARC records from the Innovative Interfaces Integrated Library System (ILS) encode the physical description of each item using the Anglo-American Cataloguing Rules, as in the following example: "ix, 133 leaves : ill.; 29 cm." Since the default EPrints field for number of pages is of the type integer and does not allow multipart physical descriptions from the MARC 300 field, a custom text field for these physical descriptions (pages_aacr) had to be added.
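To illustrate why such a description does not fit an integer field, the sketch below separates a MARC 300-style string into extent, other details, and dimensions, and extracts a main count only where one is clearly present. It is written in Python for illustration and is not part of the Concordia plugin; the helper name split_physical_description is hypothetical, and the code assumes the punctuation shown in the example above (a colon before other details, a semicolon before dimensions).

```python
import re


def split_physical_description(text):
    """Split an AACR-style physical description into its parts.

    Assumes the layout: extent [: other details] [; dimensions].
    'count' is the first arabic number directly followed by 'leaves',
    'pages' or 'p.' in the extent, or None if no such number is found.
    """
    # Dimensions follow the semicolon, when present.
    before_dims, _, dims = text.partition(";")
    # Other physical details (e.g. "ill.") follow the colon.
    extent, _, details = before_dims.partition(":")
    match = re.search(r"(\d+)\s*(?:leaves|pages|p\.)", extent)
    return {
        "extent": extent.strip(),
        "details": details.strip() or None,
        "dimensions": dims.strip() or None,
        "count": int(match.group(1)) if match else None,
    }
```

Even when a count can be extracted, the preliminary pagination, illustrations, and dimensions are lost, which is the reason for storing the full string in a text field.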
The marc.pl configuration file cannot be used to map compound fields, such as author names; these fields need a custom mapping implemented in Perl. For instance, the MARC 100 and 700 fields are transferred into the EPrints author compound field (in MARC.pm). Similarly, MARC 599 is mapped into a custom thesis advisor compound field. Knüttel added two methods that make it easier to subclass the general MARC plugin and add unique mappings: handle_marc_specialities and post_process_eprint. The post_process_eprint function was not used to attach the full-text documents to each eprint. Instead, the full-text documents were imported using a separate attach_documents script (see "Theses Document File Attachment" below). Import of all of the specialized fields, such as thesis type (mapped from MARC 710t), program, department, and proquest id, was implemented in the function handle_marc_specialities of ConcordiaTheses.pm. For instance, 502a in the MARC record contains the department information, whereas an EPrints system like Spectrum stores the department hierarchy as subject objects in a tree. Therefore, importing the department information based on the value of 502a required regular expression searches of this MARC field to find the mapping into a corresponding subject id. This was implemented in the handle_marc_specialities function.
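The regular-expression lookup for departments can be sketched as follows. This is an illustration in Python, not the Perl code of handle_marc_specialities; the patterns, the subject ids, and the function name department_subject_id are hypothetical, and a real repository would use the ids from its own subject tree.

```python
import re

# Hypothetical (pattern, subject id) pairs, checked in order.
DEPARTMENT_PATTERNS = [
    (re.compile(r"\bcomputer science\b", re.IGNORECASE), "dept_compsci"),
    (re.compile(r"\bmechanical engineering\b", re.IGNORECASE), "dept_mecheng"),
    (re.compile(r"\bhistory\b", re.IGNORECASE), "dept_history"),
]


def department_subject_id(field_502a):
    """Return the subject id of the first pattern found in the 502a text.

    Returns None when no pattern matches, so the caller can log the
    record for manual review instead of guessing a department.
    """
    for pattern, subject_id in DEPARTMENT_PATTERNS:
        if pattern.search(field_502a):
            return subject_id
    return None
```

Because the first match wins, more specific patterns should be listed before more general ones (a hypothetical "art history" pattern would need to precede "history").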

Execution of the Theses Metadata Import
The depositing user's name is displayed along with the metadata for each eprint. A batchimporter user with the corporate name "Concordia University Libraries" was created to carry out the import. As a result, the public display of the imported items shows the following as a part of the metadata: "Deposited By: Concordia University Libraries." The MARC plugin requires the encoding of the source MARC files to be UTF-8, whereas the records are exported from the ILS with MARC-8 encoding. Therefore, MarcEdit software developed by Reese was used to convert the MARC file to UTF-8.14 To activate the import, the main MARC import plugin and its subclass, ConcordiaTheses.pm, have to be placed in the plugin folder /perl_lib/EPrints/Plugin/Import/MARC/. The configuration file (see appendix A) must also be placed with the rest of the configurable files in /archives/REPOSITORYID/cfg/cfg.d. The plugin can then be activated from the command line using the import script in the /bin folder. A detailed description of this script and its usage is documented on the EPrints Wiki. The following EPrints command from the /bin folder was used to launch the import:
import REPOSITORYID --verbose --user batchimporter eprint MARC::ConcordiaTheses Theses-utf8.mrc
Following the aforementioned steps, all the theses metadata was imported into the EPrints software. The new items were imported with their statuses set to inbox. A status of inbox means that the imported items are in the work area of the batchimporter user and need to be moved to live public access by switching their status to archive.

Theses Document File Attachment
After the process of importing the metadata of each thesis is complete, the corresponding document files need to be attached. The proquest id was used to link the full-text PDF documents to the metadata records. All of the MARC records contained the proquest id, while the PDF files, received from ProQuest, were delivered with the corresponding proquest id as the filename. The PDFs were uploaded to a folder on the repository web server using FTP. The attach_documents script (see appendix B for source code) was then used to attach the documents to each of the imported eprints in the batchimporter's inbox and to move the imported eprints to the live archive.
Several variables need to be set at the beginning of the attach_documents operation (see table 2).

Variable                                 Comment
$root_dir = 'bin/importdata/proquest'    The root folder where all the associated documents are uploaded by FTP.
$depositor = 'batchimporter'             Only the items deposited by a defined depositor, in this case batchimporter, will be moved from inbox to live archive.
$dataset_id = 'inbox'                    Limits the dataset to those eprints with status set to inbox.
$repositoryid = 'library'                The internal EPrints identifier of the repository.

Table 2. Variables to be Set in the attach_documents Script
The following command is used to proceed with file attachment, while the output log is redirected and saved in the file ATTACHMENT:
/bin/attach_documents.pl > ./ATTACHMENT 2>&1
The thesis metadata record was made live even if it did not have a corresponding document file. The eprint ids of theses without a corresponding full-text PDF document are listed at the end of the log file, along with a count of the theses that were made live.
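The matching and reporting logic described above can be sketched as follows. This is an illustration in Python and not the attach_documents script itself (which is written in Perl and uses the EPrints API); the function names, the tuple shape of the input, the "<proquest id>.pdf" naming, and the wording of the log lines are all assumptions made for the example.

```python
from pathlib import Path


def plan_attachments(eprints, root_dir):
    """Pair each eprint with the PDF named after its ProQuest id.

    eprints: iterable of (eprint_id, proquest_id) tuples (hypothetical shape).
    root_dir: folder holding files named '<proquest_id>.pdf' (assumed naming).
    Returns (matches, missing): a dict of eprint_id -> Path, and a list of
    eprint ids for which no PDF was found.
    """
    root = Path(root_dir)
    matches, missing = {}, []
    for eprint_id, proquest_id in eprints:
        pdf = root / f"{proquest_id}.pdf"
        if pdf.is_file():
            matches[eprint_id] = pdf
        else:
            missing.append(eprint_id)
    return matches, missing


def summary_lines(total_live, missing):
    """Build closing log lines: the count made live, then ids lacking a PDF."""
    lines = [f"Eprints made live: {total_live}"]
    lines.append(f"Eprints without a PDF: {len(missing)}")
    lines.extend(f"  {eprint_id}" for eprint_id in missing)
    return lines
```

Collecting the unmatched ids rather than stopping at the first missing file mirrors the behavior described above, in which every record is made live and the gaps are reviewed afterwards from the log.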
After the import operation is complete, all the abstract pages need to be regenerated with the following command:

CONCLUSIONS
This paper is a detailed report on batch importing into the EPrints system. The authors believe that this paper and its accompanying source code are a useful contribution to the literature on batch importing into digital repository systems. In particular, it should be useful to institutions that are adopting the EPrints digital repository software. Batch importing of content is a fundamental function of a repository system, which is why the topic has come up repeatedly on the EPrints Tech List and in a repository software review.
The methods that we describe for carrying out batch importing in EPrints make use of the command line and require Perl scripting.More robust administrative graphical user interface support for batch import functions would be a useful feature to develop in the platform.

Figure 1. Concordia Theses Class Diagram, created with the Perl module UML::Class::Simple

Table 1. Mapping Table from MARC to EPrints
Helge Knüttel's refinements to the MARC plugin shared on the EPrints Tech List were employed in the implementation of a new subclass of MARC import for the Concordia Theses MARC records. In the implementation of the Concordia Theses plugin, ConcordiaTheses.pm inherits from MARC.pm. (See figure 1.)13