in Metadata Processing: Matching in Large Databases

This article discusses structural, systems, and other types of bias that arise in matching new records to large databases. The focus is databases for bibliographic utilities, but other related database concerns will be discussed. Problems of satisfying a “match” with sufficient flexibility and rigor in an environment of imperfect data are presented, and sources of unintentional variance are discussed.

Editor's note: This article was submitted in honor of the fortieth anniversaries of LITA and ITAL.

Sameness is a sometime thing. Libraries and other information-intensive organizations have long faced the problem of large collections of records growing incrementally. Computerized records in a networked environment have encouraged the recognition that duplicate records pose a serious threat to efficient information retrieval. Yet what constitutes a duplicate record may be neither exact nor completely predictable. Levels of discernment are required to permit matches on records that do not differ significantly while keeping apart records that do.

Initial definitions
Matching is defined as the process by which additions to a large database are screened and compared with existing database records. Ideally, this process of matching ensures that duplicates are not added, nor erroneous replacements made of record pairs that are not really equivalent.
OCLC (Online Computer Library Center, Inc.) is a nonprofit organization serving member libraries and related institutions throughout the world. Its WorldCat database is the chief database capital of the organization, and it is "owned" in a sense by the member libraries worldwide that use and contribute to it. At this writing, it contains over seventy-three million records. This discussion focuses chiefly on OCLC's Extended WorldCat (XWC), though many of the issues are common to other bibliographic databases. Examples of these include the Research Libraries Group's Research Libraries Information Network (RLIN) database, PICA (a European cooperative of libraries headquartered in the Netherlands), and other union catalogs. The literature will demonstrate that the problems described exist in many if not most large bibliographic databases.
The database contents are representations or surrogates of the objects in shared collections. Individual records in XWC are complex bibliographic representations of physical or virtual objects: books, films, URLs, maps, slides, and much more. Each of these records consists of metadata, i.e., "structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource" 1 (appendix A). The records use an XML variation of the MARC communications format. 2 For example, a record for a book might typically contain fields for author, title, publisher, and date, and many more in addition. The representation of any one object can be quite complex, containing scores of fields and subfields. Such a record may be quite brief, or several thousand characters long. The depth and richness of the records varies enormously. They may describe materials in more than 450 languages. This is a database against which millions of searches and millions of records are processed each month.
Why is matching a challenge? Two records describing the same intellectual creation or work (e.g., Shakespeare's Othello) can vary by physical form and other attributes. Two records describing both the same work and exactly the same form can differ from each other if the records were created under different rules of record description (cataloging). Two records intended to describe the same object can vary unintentionally if typographical or other entry errors are present in one or both. Thus sorting out significant from insignificant differences is critical. An example of the challenges of developing matching software in the Metadata Capture Project is described elsewhere. 3
The scope of misinformation is limited to information storage and retrieval, and specifically to comparison of incoming records to candidate matches in the database. The authors define misinformation as follows:
1. Anything that can cause two database records, i.e., representations of different items, to be mistaken as representations of the same item. These can lead to inappropriate merging or updates.
2. The effect of techniques or processes of search that can obscure distinctions in differing items.
3. Any case where matching misses an appropriate match due to nonsignificant differences in two records that really represent the same item.
Note that disinformation (the intentional effort to misrepresent) is not considered in scope for this discussion.
The assumption is that cooperation is in the interests of all parties contributing to a shared database. We do not assume that all institutions sharing the database have the same goals.

INFORMATION TECHNOLOGY AND LIBRARIES | JuNE 2007
What is bias? Bias can be defined as factors in the creation or processing of database records that feed on misinformation or missing information, and skew characterizations of the database records in question.

Context-Matching and bias
How are matching and bias related to each other? The growth of a database is in part a function of the matching process. If matching is not tuned correctly, the database can grow or change in nonoptimal ways.
Another way to look at the problem is to consider the goal of success in searching, and the need to know when to stop. Human beings recognize that failure to find the best information for a given problem may be costly. Finding the best information when less would suffice may also be costly. Systems need to know this. For a large shared database, hundreds of thousands of records may be processed in a day; the system must be as efficient as possible.
What are some costs? Fail to match when one should, and duplicates may proliferate in the database. Match badly, and there is risk of merging multiple records that do not represent the same item.
A system of matching can fail in more than one way. Balance is needed.
1. Searches, which are based on data in the incoming record, may be too precise to find legitimate matches. Loosen the criteria too much, and the search may return too many records to compare.
2. Once retrieved, candidate matches are evaluated. Compare candidates too narrowly, and records with insignificant differences will be rejected. Fail to take note of salient differences between incoming record and database record, and the match will be wrong, undetected, and potentially hard to detect in the future.
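The two failure points above can be made concrete with a minimal sketch of a two-stage matcher: a search stage that retrieves candidates, and a comparison stage that accepts or rejects each one. Everything here is an assumption for illustration only: the `Record` fields, the all-title-words search rule, the exact-agreement comparison, and the `MAX_CANDIDATES` cap are hypothetical and do not describe the XWC matching software.

```python
from dataclasses import dataclass

MAX_CANDIDATES = 3  # hypothetical cap on the result set the matcher will analyze


@dataclass(frozen=True)
class Record:
    title: str
    publisher: str
    year: str


def normalize(text):
    """Lowercase the text and reduce punctuation and runs of spaces to single spaces."""
    kept = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(kept.split())


def search(incoming, database):
    """Stage 1: retrieve records whose title contains every word of the incoming title."""
    wanted = set(normalize(incoming.title).split())
    return [rec for rec in database if wanted <= set(normalize(rec.title).split())]


def compare(incoming, candidate):
    """Stage 2: accept a candidate only if the salient fields agree after normalization."""
    return (
        normalize(incoming.title) == normalize(candidate.title)
        and normalize(incoming.publisher) == normalize(candidate.publisher)
        and incoming.year == candidate.year
    )


def match(incoming, database):
    """Return a (status, matches) pair for one incoming record."""
    candidates = search(incoming, database)
    if len(candidates) > MAX_CANDIDATES:
        # The search was too loose: too many records to compare.
        return "too-many-candidates", []
    matches = [c for c in candidates if compare(incoming, c)]
    return ("match" if matches else "no-match"), matches
```

In this sketch, tightening `search` (for example, requiring an exact title string) misses legitimate matches, while loosening it trips the cap for generic titles. Tightening `compare` rejects records with insignificant differences, and loosening it risks wrong matches.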
The goals vary in different matching projects. For some projects, setting "holdings," the indication that a member library owns a copy of something, is the main goal of the processing. This does not involve adding, replacing, or merging database records. For other projects, the goal is to update the database, either by replacing matched records, merging multiple duplicate records into one, or by adding new records if no match is found in the database. For the latter, bad matching could compromise database contents.

Background
Hickey and Rypka provide a good review of the problems of identifying duplicates and the implications for matching software. 4 Their study notes concerns from a variety of library networks including that of the University of Toronto (UTLAS), Washington Library Network (WLN), and Research Libraries Group (RLIN). They also reference studies on duplicate detection in the Illinois statewide bibliographic database and at Oak Ridge National Laboratories. Background discussion of broader misinformation issues in shared library catalogs can be found in Bade's paper. 5 A good, though dated, review of duplicate record problems can be found in the O'Neill, Rogers, and Oskins article. 6 The authors discuss their analysis of differences in records that are similar but not identical, and which elements caused failure to match two records for the same item. For example, when there was only one differing element in a pair, they found that element was most often publication date. Their study shows the difficulties for experts to determine with certainty that a bibliographic record is for the same item.
Problems of typographical errors in shared bibliographic records come under discussion by Beall and Kafadar. 7 Their study of copy cataloging errors found only 35.8 percent were corrected later by libraries, though the ordinary assumption is that copy cataloging will be updated when more information is available for an item. Pollock and Zamora report on a spelling error detection project at Chemical Abstracts Service (CAS) and characterize the types of errors they found. 8 Chemical Abstracts databases are among the most searched databases in the world. CAS is usually characterized as a set of sources with considerable depth and breadth. Of the four most common typographical errors they describe, errors of omission are most common, with insertion second, substitution third, and transposition fourth. Over 90 percent of the errors they found were single-letter errors. This is in agreement with the findings of O'Neill and Aluri, though the databases were substantially different. 9
Another study on moving image materials focuses on problems of near-equivalents in cataloging. 10 Yee suggests that cataloging practice tends to lead to making too many separate records for near equivalents. Owen Gingerich provides insight in the use of holdings information in OCLC and other bibliographic utilities such as RLIN for scholarly research in locating early editions of Copernicus' De Revolutionibus. 11 Among other sources, he used holdings information in multiple bibliographic utilities to help in collecting a census of copies of De Revolutionibus, and plotting its movements through Europe in the sixteenth century. His article highlights the importance of distinguishing very similar items for scholarly research.
Shedenhelm and Burk discuss the introduction of vendor records into OCLC's WorldCat database. 12 Their results indicate that these minimal-level records increase the duplication rate within the database and can be costly to upgrade. (See further discussion in the section Change in Contributor Characteristics below.) One problem in analysis of sources of mismatch in previous studies is that there is no good way to detect and characterize typos that form real words. Jasco reviews studies characterizing types and sources of errors. 13
Sheila Intner compares the quality issues in the databases of OCLC and the Research Libraries Group (RLG) and finds the issues similar. 14 Intner used matched samples of records from both WorldCat and RLIN to list and compare types of errors in the records. She noted that while the perception at that time was that RLIN had higher-quality cataloging, the differences found were not statistically significant.
Jeffrey Beall, while focusing in his study on the full-text online database JSTOR, notes the commonality of problems in metadata quality. 15 In addition, he discusses the special quality problems in a database of scanned images. The scanning software itself may introduce typographical errors. Like XWC, the database changes rapidly. O'Neill and Vizine-Goetz present a survey of quality control issues in online databases. 16 Their sections on duplicate detection and on matching algorithms illustrate the commonalities of these problems in a variety of shared cataloging databases. They cite variation in title as the most common reason for failure to identify a duplicate record that should match. Variations in publisher, names, and pagination were noted as common.
Lei Zeng presents a study of Chinese language records in the OCLC and RLIN databases. 17 Zeng discusses quality problems including (1) format errors such as field and subfield tagging and incorrect punctuation; (2) content errors such as missing fields and internal record inconsistencies; and (3) editing and inputting errors such as spacing and misspelling. Part 2 of her study presents the results of the prototype rule-based system developed to catch such errors. 18 While the author refrains from comparing the quality of OCLC and RLIN Chinese language catalog records, the discussion makes clear that the quality issues are common to a number of online databases.
More work is needed on quality and accuracy of shared records in non-Roman scripts, or in other languages transliterated to Roman script.

Types of bias to be considered
Specific factors that may tend to bias an attempt to match one record to another include:

Violated Expectations
Data may not conform to expectations. Expectations about the nature of records in the databases are frequently violated. What seem to be good rules for matching may not work well if the incoming data is not well formed, or simply not constructed as expected.
Biasing sources in the incoming data include the following:
1. Typographical errors occur in titles and other parts of the record. Anywhere the software has to parse text, an entry error (or even correction of an entry error by a later update) could confound matching. This could confound both (a) query execution and (b) candidate comparisons. Basically the system expects textual data such as the name of a title or publisher to be correct, and machine-based efforts to detect errors in data are expensive to run. Spelling detection techniques can compensate in some ways for data problems, but will not identify cases of real-word errors. See Kukich for a survey of spelling error, real-word, and context-dependent techniques. 20 There is also the issue of real-word differences in similar text strings. The assumption made here is that the use of all possible syllables contained in the title should tend to mitigate language problems. Nothing short of semantic analysis by the software is likely to solve such a problem, and contextual approaches to detection have had most success (in the production environment) in carefully controlled cases. Matching overall must be generic in its problem-solving techniques.
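To illustrate why spelling-oriented techniques help with single-letter errors yet cannot settle real-word errors, the following minimal sketch computes the optimal string alignment distance, which counts omissions, insertions, substitutions, and transpositions of adjacent letters. This is a textbook technique chosen here for illustration; it is not claimed to be the method used in any of the systems discussed, and the one-error threshold is an assumption.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between two strings.

    Counts the fewest single-character insertions, deletions, substitutions,
    and transpositions of adjacent characters needed to turn a into b.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or exact match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[rows - 1][cols - 1]


def within_one_error(a, b):
    """True if the two words differ by at most one single-letter error (assumed threshold)."""
    return osa_distance(a.lower(), b.lower()) <= 1
```

Under this sketch, `within_one_error("libary", "library")` is true, which is the desired tolerance for an omission. However, `within_one_error("form", "from")` is also true: the measure cannot tell a typo from a legitimately different word, which is the real-word problem described above.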

Temporal bias
Large databases developed over time have their contents influenced by changes in standards for record creation, changes in contributor perception of the role of the database, and changes in technology to be described. Changes may include the following:
1. Description level: e.g., changes such as book or electronic book. These have evolved from format to content-based descriptions that transcend format. Over time, the cataloging rules for describing formats have changed. Thus a format description created earlier might inadvertently "mismatch" the newer description of exactly the same item. For example, the rules for describing a book on a CD originally emphasized the CD format, whereas now, the emphasis might be shifted to focus on the intellectual content, the fact that it is a book.
2. The role of the database, once perceived as chiefly repository or even backup source for a given library, has become that of a shared resource with responsibilities to a community larger than any one library.
3. Over time, the use of the database may change. (This is further discussed in the section on Growth of the Environment later.) Searching has to satisfy the reference function of the database, but matching as a process also relies on searching, and its goals are different.
4. Varied standards worldwide challenge cooperation. While U.S. libraries usually follow AACR2 and use the MARC21 communications format, other parts of the world may use UNIMARC and country-specific cataloging rules. For instance, the PICA Bibliotekssystem, which hosts the Dutch Union Catalog, used the Prussian cataloging rules, which tended to focus on title entries. 22 The switch to the RAK was made by the early nineties. 23
5. Some libraries may not use any form of MARC but submit a spreadsheet that is then converted to MARC. There is some potential for ambiguities in those conversions due to lack of 1:1 correspondence of parts.
6. Even within a country, standards change over time, so that "correct" cataloging in one decade may not match that in a later period. Neither is wrong, in its own temporal context, but each results in different metadata being created to describe the same item. Intner points out that OCLC's database was initiated a full decade before RLG implemented RLIN, and RLIN started almost the same time as the AACR2 publication. 24 Thus RLIN had many fewer pre-AACR2 records in its database, while WorldCat had many more preexisting records to try to match with the newer AACR2 forms.
7. Objects referenced in the database may change over time. For instance, a record describing an electronic resource may point to a location no longer valid for that resource.
8. Vendor records are created as advance advertising, but there is no guarantee the records will be updated later. Estimating the time before updates occur is impossible.
9. Records themselves change over time as they are copied, derived, and migrated into other systems. They may be enhanced or corrected in any system where they reside. So when they return to the originating database, they may have been transformed so far as to be unrecognizable as representations of the same item. This problem is not unique to XWC; it is a challenge for any shared database where export of records and reentry is likely.

Design bias
The title, author, publisher, place of publication, and other elements of a record, designed in a time when most of the contents of a library were books, may not appear as clear or usable for other forms of information, such as Web sites or software. There is a risk to any design of a representation for an object, that it may favor distinctions in one format over another. Or representations imported from other schemes may lose distinctions in the crosswalk from one scheme to another. A crosswalk is a mechanism for the mapping of data elements/content from one metadata scheme to another. Dublin Core and MARC are just two examples of schemes used by library professionals. Software exists to convert Dublin Core metadata to MARC format, but the process of converting less complex data to a scheme of more structured data has inevitable limitations. For instance, Dublin Core has "SUBJECT" while MARC has dozens of ways to indicate subject, each with a different kind of designation for subject aspects of an item. 25 See discussion in Beall. 26
Libraries commonly exchange or purchase records from external sources to reduce the volume or costs of in-house cataloging. If an institution harvests metadata from multiple sources, there can be varying structures, content standards, and overall quality, all of which can make record comparisons error prone. While library and information science professionals have been creating metadata in the form of catalog records for a long time, the wider community of digital repositories may be outside the LIS community, and have varied understanding of the need for consistent representations of data. Robertson discusses the challenges of metadata creation outside the library community. 27
Museums and archives may take a different view of what quality standards in metadata are. For example, for a museum, extensive detail about the provenance of an object is necessary. Archives often record information at the collection level rather than the object level; for example, a box of miscellaneous papers, as opposed to a record for each of the papers within the box. Educators need to describe resources such as learning objects. A learning object is any entity, digital or nondigital, which can be used, reused, or referenced during technology-supported learning. 28 For these objects a metadata record using the IEEE LOM standard may be used. 29 While this is as complex as a MARC record, it has less bibliographic description and more focus on description of the nature and use of the learning object. In short, for one type of institution the notion of appropriate granularity of description may be too detailed or too vague for the needs of another type of institution.
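The loss of distinctions in a crosswalk, discussed earlier in this section, can be shown with a minimal sketch that maps a few Dublin Core elements to MARC tags. The table below is a simplified assumption that loosely follows the Library of Congress Dublin Core to MARC crosswalk for unqualified elements; the function name and the record layout (a dictionary of element names to lists of values) are hypothetical.

```python
# Hypothetical, simplified crosswalk: Dublin Core element -> (MARC tag, subfield code).
DC_TO_MARC = {
    "title": ("245", "a"),
    "creator": ("720", "a"),
    "subject": ("653", "a"),      # every subject becomes an uncontrolled index term
    "publisher": ("260", "b"),
    "date": ("260", "c"),
    "description": ("520", "a"),
}


def crosswalk(dc_record):
    """Convert a Dublin Core record to (tag, subfield, value) triples.

    Returns a pair: the converted MARC-like fields, and the list of
    (element, value) pairs that the table has no mapping for.
    """
    marc_fields = []
    unmapped = []
    for element, values in dc_record.items():
        target = DC_TO_MARC.get(element)
        for value in values:
            if target is None:
                unmapped.append((element, value))
            else:
                tag, code = target
                marc_fields.append((tag, code, value))
    return marc_fields, unmapped
```

Given subjects such as "Shakespeare, William" and "Venice (Italy)", this sketch places both in field 653, whereas a cataloger working directly in MARC could distinguish a personal name subject (600) from a geographic one (651). A comparison that relies on those finer distinctions has nothing to compare in the converted record.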

Judgment calls
Two persons creating independent records for the same item exercise judgment in describing what is most important about the object. One may say it is a book with an accompanying CD, another may say it is software on a CD, accompanied by a book of documentation. Another example of legitimate variation is the choice of use of ellipses […] to leave out parts of long titles in a metadata description. One record creator may list the whole title, another may list only the first part followed by the mark of ellipsis to indicate abbreviation of the lengthy title. Either is correct, but the two may not match each other without special techniques. See appendix B for the perils of ellipsis handling.
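One possible "special technique" for the ellipsis case is sketched below: a title that ends in an ellipsis mark is treated as a prefix of the full title. This is a hedged illustration only, not the handling described in appendix B. The function names are hypothetical, and the sketch assumes the ellipsis appears only at the end of the abbreviated title.

```python
import re

# A trailing ellipsis mark: "...", "…", or either form in square brackets.
TRAILING_ELLIPSIS = re.compile(r"\s*(?:\[\s*)?(?:\.{3}|…)(?:\s*\])?\s*[.,;:]?\s*$")


def normalize(text):
    """Lowercase the text and reduce punctuation and runs of spaces to single spaces."""
    kept = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(kept.split())


def split_trailing_ellipsis(title):
    """Return (text, truncated): the title without a trailing ellipsis, and whether one was found."""
    found = TRAILING_ELLIPSIS.search(title)
    if found:
        return title[: found.start()], True
    return title, False


def titles_compatible(a, b):
    """True if the titles are equal, or one is the other abbreviated with a trailing ellipsis."""
    text_a, cut_a = split_trailing_ellipsis(a)
    text_b, cut_b = split_trailing_ellipsis(b)
    words_a, words_b = normalize(text_a).split(), normalize(text_b).split()
    if not words_a or not words_b:
        return False
    if cut_a and words_b[: len(words_a)] == words_a:
        return True
    if cut_b and words_a[: len(words_b)] == words_b:
        return True
    return words_a == words_b
```

The peril is visible in the sketch itself: a heavily abbreviated title such as "Proceedings […]" is compatible with every title that begins with "Proceedings", so prefix matching of this kind would need support from other fields before a match is accepted.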
The form of name of a publisher, given other occurrences of a publisher name in a record, may be abbreviated. For instance, in one place the corporate author who is also the publisher might be listed in the author field as "Department of Health and Human Services" and then abbreviated, or not, in the publisher area as "The Department."
Note that there are limitations inherent to the validation of any system of matching, in that human reviewers may not be able to determine whether two representations in fact describe the same item.

Structural bias
1. Process bias refers to any features of the software which at runtime may change the way matching is carried out, whether by shortening or lengthening the analysis, or otherwise branching the logical flow. This can arise from many sources, including but not limited to the following factors.
a. There is need for efficient processing of large numbers of incoming records. This can force an emphasis on speedy matching. That is, matching not required to replace records tends to be optimized to stop searching/matching as early as is reasonable. In the case where unique key searching finds a single match to an incoming record, it is fairly easy for the software to "justify" stopping. If there are multiple matches found, more analysis may be needed before the decision to stop matching can be made. Over time the number of records processed has increased enormously.
b. Matching needs to exploit "unique" keys to speed searching, yet these may not prove to be unique. Though agreements are in place for use of numeric keys such as ISBNs, creation of these keys is not under the control of any one organization.
c. Problems arise when brief records are compared with fuller records. Comparisons may be biased inadvertently towards false matches. Such sparseness of data has been identified as a problem in RLIN matching as well as in XWC.
d. At the same time there is bias toward less generic titles in matching. Requirements of system throughput mandate an upper limit on the size of result set that the matching software will even attempt to analyze. This upper limit could tend to discriminate against effective retrieval of generic titles. Matching will reject very large result sets of searches. So the query that has fewer title terms may tend to retrieve too much. Titles such as "Proceedings" or "Bulletin" may be difficult to match if insufficient other information is present in the record for the query to use. Ironically this can mean addition of more generic titles to the database, since what is there is in effect less findable.
e. Transparency can contribute to bias in that, for each layer of transparency, a layer of opacity may be added when information is filtered out from a user's view. That user may be a human or an application. OpenURL access to "appropriate copy" is an example from the standards world. The complexity of choosing among multiple online copies has become known as the "appropriate copy" problem. There are a number of instances where more than one legitimate copy of an electronic article may exist, such as mirroring or aggregator databases. It is essentially a problem of where and how to introduce localization into the linking process. 30 Even more confusing, a URI that is not broken may point to content which has changed to the point where the metadata no longer describes the item it once did. At one extreme, Bruce and Hillmann describe the curious case of citation of judicial opinions, for which a record of the opinion may be created as much as eighteen months before the volume with the official citation is printed, and thus the official citation cannot be created. 31
e. Expectations for creation of metadata play a role as well. Traditional cataloging has generally had an expectation that most metadata is being created once and reused. Yet current practice may be more iterative, and must be, if such problems as records with broken Internet URIs are to be avoided.
f. Loss of synchronization can subvert processing. Note that other elements of metadata may become divorced or out of synch with the original target/purpose. The prefix to an ISBN was originally intended to describe the publisher, but is now an unreliable discriminator. Numeric keys intended to identify items uniquely can retrieve multiple items, if the scheme for assigning them is not applied consistently. In the worst case, meaningful data elements may become so corrupted as to be useless for record retrieval or even comparison of two records.
g. Ownership issues can detract from optimal database management. Member institutions' perceptions of ownership of individual records can conflict with the goals of efficient search and retrieval. Members may resist the idea of a "better" record being merged with a "lesser" one. So systems have ways of ranking records by source or contents with the general goal of trying to avoid losing information, but with the specific effect of refraining from actions that might be enriching in a given case.
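The bias toward false matches when a brief record meets a fuller one (item c above) can be sketched as follows. A comparison that looks only at fields present in both records will accept a sparse record on very thin evidence; one hedged remedy is to require a minimum number of compared fields. The field list, the dictionary record layout, and the `min_compared` threshold are assumptions for illustration, not features of the XWC or RLIN software.

```python
FIELDS = ("title", "author", "publisher", "year", "pages")  # hypothetical field list


def agreement(incoming, candidate):
    """Return (agreeing, compared), counting only fields populated in both records."""
    compared = 0
    agreeing = 0
    for name in FIELDS:
        left, right = incoming.get(name), candidate.get(name)
        if left and right:
            compared += 1
            if left.strip().lower() == right.strip().lower():
                agreeing += 1
    return agreeing, compared


def naive_match(incoming, candidate):
    """Accept when every field the two records share agrees, however few they share."""
    agreeing, compared = agreement(incoming, candidate)
    return compared > 0 and agreeing == compared


def guarded_match(incoming, candidate, min_compared=3):
    """Accept only when enough fields were comparable and all of them agree."""
    agreeing, compared = agreement(incoming, candidate)
    return compared >= min_compared and agreeing == compared
```

With a brief record holding only the title "Bulletin", `naive_match` accepts any fuller record with that title, while `guarded_match` declines to decide. The guard trades false matches for possible duplicates, which is the balance discussed earlier in the article.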

Changes in contributor characteristics
Copy cataloging practices in an institution can affect XWC indirectly. An institution previously oriented to fixing downloaded records may adopt a policy of refraining from changing downloaded records. Historical independence of libraries is one illustration. Prior to the 1970s, most libraries did not share their cataloging with other libraries. Many institutions, especially smaller ones, were outside the loop and did things their own way. They used what rules they felt were useful, if they used any rules at all. Later they converted sparse and poorly formed data into MARC records and sent them to OCLC for matching, perhaps in an effort to get back a more complete and useful record. Yet the matching process is not always able to distinguish or interpret these local dialects. Changes in specialization of cataloging staff at an institution, or cutbacks in staff, can lead to reduced facility in providing original cataloging. Outsourcing of cataloging work can affect handling of specialized materials as well. The introduction of vendor records and their characteristics has been noted by Shedenhelm and Burk. 32 As they note, these records are very brief bibliographic records originally designed to advertise an item for sale by the vendor. These minimal-level records have a relatively high degree of duplication with existing records (37.5 percent in their study) and because of their sparseness can increase the cost of cataloging.

Conclusions
In this review we have seen that characterizing metadata at a high level is difficult. Challenges for adding to a large, complex database include some of the following:
This is a major problem with automated authority control where context clues may not be trustworthy.
3. Errors of formatting of variable fields in the metadata contribute to false mismatch. The rules for data entry in the MARC record are complex and have changed over time. Erroneous placement or coding of subfields poses challenges for identification of relevant data. The software must be fault tolerant wherever possible. Changes in the format of the data itself in these fields/subfields may further complicate record comparisons. ISBNs (International Standard Book Numbers) and LCCNs (Library of Congress Control Numbers) have both changed format in the recent past.
4. Errors occur in the fields that indicate format of the information. In bibliographic records, format information is used to derive the overall type of material being described: book, URL, DVD, and so on. Errors in the data in combination can generate an incorrect material type for the record.
5. Language of cataloging: this comparison has in the past caused inappropriate mismatches. The requirements in the new matching aimed to address this.
6. Language in formation of queries: MARC records frequently are a mixture of languages. As has been seen in other projects with intensive comparison of text, overlap in languages has the potential to confuse comparisons of short strings of text.
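A change in the format of a numeric key can be absorbed by normalizing both forms before comparison. The minimal sketch below assumes the ISBN format change referred to above is the move from ten-digit to thirteen-digit ISBNs, and converts either form to a bare thirteen-digit string using the standard 978 prefix and the ISBN-13 check digit (alternating weights of 1 and 3). The function names are hypothetical, and the sketch does not validate the check digit of its input.

```python
ASCII_DIGITS = set("0123456789")


def all_digits(text):
    """True if text is non-empty and consists only of the digits 0-9."""
    return bool(text) and all(ch in ASCII_DIGITS for ch in text)


def isbn13_check_digit(first12):
    """Compute the ISBN-13 check digit for the first twelve digits."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def to_isbn13(isbn):
    """Normalize an ISBN-10 or ISBN-13 to a bare thirteen-digit string."""
    compact = isbn.replace("-", "").replace(" ", "").upper()
    if len(compact) == 13 and all_digits(compact):
        return compact
    if len(compact) == 10 and all_digits(compact[:9]) and (compact[9] in ASCII_DIGITS or compact[9] == "X"):
        core = "978" + compact[:9]  # the old check digit is dropped and recomputed
        return core + isbn13_check_digit(core)
    raise ValueError(f"not an ISBN-10 or ISBN-13: {isbn!r}")


def same_isbn(a, b):
    """Compare two ISBNs regardless of which format each one uses."""
    return to_isbn13(a) == to_isbn13(b)
```

For example, `to_isbn13("0-306-40615-2")` returns `"9780306406157"`, so a record carrying the older form can still be retrieved by a query built from the newer one. As the article notes, agreement on such a key still does not guarantee that the two records describe the same item.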
Equivalence tables can cross-reference known variations on well-known publisher names, but cannot predict merges and other organizational changes. Or consider author names: are "John Smith" and "Jon Smith" the same?
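A minimal sketch of an equivalence table for publisher names follows. The table entries are hypothetical examples built around the publisher name used earlier in this article; a production table would be far larger and, as noted above, would still lag behind organizational changes.

```python
# Hypothetical equivalence table: normalized variant -> normalized preferred form.
PUBLISHER_EQUIVALENTS = {
    "dept of health and human services": "department of health and human services",
    "dhhs": "department of health and human services",
}


def normalize_name(name):
    """Lowercase the name and reduce punctuation and runs of spaces to single spaces."""
    kept = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(kept.split())


def canonical_publisher(name):
    """Look up the preferred form of a publisher name; unknown names pass through normalized."""
    key = normalize_name(name)
    return PUBLISHER_EQUIVALENTS.get(key, key)


def same_publisher(a, b):
    """True if both names resolve to the same preferred form."""
    return canonical_publisher(a) == canonical_publisher(b)
```

The table resolves "Dept. of Health and Human Services" to the full name, but it has no basis for resolving "The Department", whose meaning depends on the rest of the record. The same limit applies to author names: a table cannot say whether "Jon Smith" is a variant of "John Smith" or a different person.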
A shared database can grow in unpredictable ways. A change in the relative proportions of different types of materials or topical coverage can render once-effective searches ineffective due to large result sets. An example of this is the number of Internet-related entries in XWC. A search such as "dog" restricted to "Internet-related" entries in 1995 retrieved thirty-four hits. This might be a manageable number. But in 2005, 225 entries were in the result set. Similarly with subject headings, one search on "computer animation" retrieved fourteen hits in 1980, and 342 in 2005. In both cases the result sets grew from manageable to "too large" over time. The increase in the number of foreign language entries in a database can cause problems. Just determining what language an entry is in can be difficult, and records may contain multiple languages. Also, such languages as Chinese,
Changes in the proportion of contributors who create records in non-MARC formats such as Dublin Core can affect the completeness of bibliographic entries. The use of such formats, meant to facilitate the entry of bibliographic materials, does come with a cost.
Group cataloging is a process whereby smaller libraries can join a larger set of institutions in order to reduce costs and facilitate cataloging. This larger group then contributes to OCLC's database as an entity. The growth of group cataloging has resulted in the addition of more records from smaller libraries, which may in the future have an effect on searching/matching in XWC WorldCat overall.
Internationalization may be a factor as well. The MARC format is an Anglo-based format with English-language-based documentation. Rapid international growth thrusts a broader range of traditions into a MARC/OCLC world. The role of character sets is heightened as the database grows. A Cyrillic record may not be confidently matched to a transliterated record for the same item. Although WorldCat has a long history with CJK records, MARC and WorldCat are not yet accustomed to a wide repertoire of character sets. Now, however, XWC is an environment in which expanding character coverage is possible, and likely.
How does the conversion from new metadata schemes affect matching to MARC records?
Does it help to know in what format a record arrived, or under what rules it was created?
How can we address sparseness in vendor records or legal citations?
How can we deal with other advance publication issues?
How do changes in philosophy of the database affect the integrity of the matching process?
We need more systematic study of the types of errors/omissions encountered in MARC record creation.
How can the process of matching accommodate objects that change over time?