Automatic Format Recognition of MARC Bibliographic Elements: A Review and Projection

A review and discussion of the technique of automatic format recognition (AFR) of bibliographic data are presented. A comparison is made of the record-building facilities of the Library of Congress, the University of California (both AFR techniques), and the Ohio College Library Center (non-AFR). A projection of a next logical generation is described.


INTRODUCTION
The technique commonly identified as "format recognition" has more potential for radically changing the automation programs of libraries than any other technical issue today.
While the development of MARC has provided an international standard, and various computer developments provide increasingly lower operating costs, the investment in converting a catalog into machine-readable form has kept most libraries from integrating automated systems into their operations.
The most expensive part of the conversion to machine-readable form has been the human editing required (generally by a cataloger) to identify the many variable portions of the MARC-format cataloging record. A full cataloging record contains several hundred possible sections (or fields) in the MARC format. Research at the Library of Congress (LC) into this problem resulted in the concept of "format recognition" to reduce cataloging input costs.
With the automatic format recognition (AFR) approach, an unedited cataloging entry is prepared (keypunched or otherwise converted to machine-readable form). Then the AFR computer program identifies the various elements of the catalog record through sophisticated computer editing. A degree of human post-editing is generally assumed, but the computer basically is assigned the responsibility of editing an unidentified block of text into a MARC-format cataloging record. The pioneering AFR work at the Library of Congress is presently in use for original cataloging input to the MARC Distribution Service. This system is quite sophisticated because its output goal is a complete MARC record with all fields, subfields, tags, and delimiters identified almost entirely through computer editing. The Institute of Library Research (ILR) at the University of California, faced with the need to convert 800,000 catalog records to MARC format, has developed a less ambitious AFR program which provides a level of identification sufficient to produce the desired book catalog bibliographic output, or to print catalog cards.
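The recognition step itself can be sketched in miniature. The following hypothetical Python fragment (modern notation, used here purely for illustration) tags a few MARC fields from surface patterns alone: a four-digit year suggesting an imprint (field 260), a paging statement suggesting a collation (field 300). The rules are invented for this sketch and are far cruder than either production program.

```python
import re

# Hypothetical miniature of the AFR idea: assign MARC tags to untagged
# input lines from surface patterns alone.  The patterns and the choice
# of fields (245 title, 260 imprint, 300 collation) are illustrative.
def recognize(lines):
    record = {}
    for line in lines:
        text = line.strip()
        # A four-digit year suggests the imprint (place, publisher, date).
        if re.search(r"\b1[5-9]\d\d\b", text) and "260" not in record:
            record["260"] = text
        # A paging statement such as "340 p." suggests the collation.
        elif re.search(r"\d+\s*p\.", text):
            record["300"] = text
        # Otherwise assume the first unrecognized line is the title.
        elif "245" not in record:
            record["245"] = text
    return record
```

A real AFR program, as the paper notes, works with several hundred possible fields and resolves them with far more elaborate search strategies and human post-editing.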
The aim of this paper is to examine these two AFR strategies and consider their implications for input of two major classes of cataloging records: (1) LC or other cataloging records in standard card format; and (2) original cataloging not yet in card format.
Comparing the two AFR strategies to an essentially non-AFR format used at the Ohio College Library Center for on-line cataloging input, we will propose a median strategy for original cataloging: original format recognition (OFR). The thesis is that differing strategies of input should be used for records already formatted into catalog card images and for those original cataloging items being input prior to card production.

AUTOMATIC FORMAT RECOGNITION
An examination of the Library of Congress (LC), University of California (UC), Ohio College Library Center (OCLC), and original format recognition (OFR) strategies will show the operating differences. A detailed field-by-field comparison of the nearly 500 distinct codes which can be identified in creation of a MARC record is attached as Appendix I. General comparisons can be made in several areas: input documents, manual coding, level of identification, input and processing costs, error correction, and flexibility in use.
Input Documents-The LC/AFR program operates from an uncoded typescript converted to a machine-readable record through MT/ST magnetic tape input. This typescript is, however, prepared from an LC cataloger's Manuscript Worksheet, in which there is some inherent bibliographic order. The LC/AFR program does not rely on this inherent order, although its design takes advantage of the probable order in search strategies. LC/AFR could operate with keying of catalog cards, book catalog entries, or any structure of bibliographic data.
The UC program is designed more specifically to handle input of formatted catalog cards, and some of its AFR strategy is based on the sequence and physical indentation pattern on standard catalog cards. It would not work effectively on noncard format input without special recognition of some tagging conventions.
The OCLC program allows direct input to a CRT screen from any input document; it requires complete identification of each cataloging field or subelement input.

Automatic Format Recognition/BUTLER 29
Manual Coding-LC/AFR requires minimal input coding. Within the title paragraph, the title proper, the edition statement, and imprint are explicitly separated at input. Series, subject, and other added entries are recognized initially from the Roman and Arabic numerals preceding them. Aside from these items, virtually all MARC fields are recognized by the computer editing program.
UC/AFR inserts a code after the call number input, thus providing explicit identification at input. It also identifies each new indentation on the catalog card explicitly, thus implicitly identifying main entry, title, and certain other major cataloging blocks on the card.
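A minimal sketch of this indentation convention, assuming (hypothetically) that indentation depth maps directly onto a cataloging block; the actual UC coding is more elaborate, and each new indentation is keyed explicitly at input rather than inferred:

```python
# Hypothetical sketch: derive cataloging blocks from catalog-card
# indentation.  The depth-to-block mapping is invented for
# illustration; UC/AFR keys each new indentation explicitly.
LEVELS = {0: "main entry", 1: "title paragraph", 2: "notes"}

def blocks_by_indent(card_lines):
    blocks = []
    for line in card_lines:
        depth = (len(line) - len(line.lstrip())) // 2  # two spaces per level
        label = LEVELS[min(depth, 2)]
        blocks.append((label, line.strip()))
    return blocks
```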
The OCLC input specifications require explicit coding, some of which is prompted by the CRT screen.
Level of Identification-LC/AFR provides the highest possible level of MARC record identification, deriving practically every field, subfield, and other code if it is present in an individual cataloging record.* In evaluation of this element of LC/AFR it should be realized that the needs of the Library of Congress in creating original MARC records for nationwide distribution (and its own use) are much more sophisticated and complex than those of any individual user library or system.
The UC/AFR program reflects a more task-oriented approach, deriving a sufficient level of identification to separate major bibliographic elements. This technique is clearly sufficient to produce computer-generated catalog cards or similar output in a standard manner. However, UC/AFR lacks several identifiers, such as specific delimitation of information in the imprint field, which would make feasible the use of its records for further computer-generated processes.
The OCLC input format is of variable level; many elements are optional and are noted with an asterisk in Appendix I. At its most complete, the OCLC format specifically excludes only a very few MARC fields, most notably Geographic Area and Bibliographic Price.

Input and Processing Costs-Direct cost information has not been published for production costs of any of the format recognition systems. The Library of Congress has reported that ". . . the format recognition technique is of considerable value in speeding up input and lowering the cost per record for processing."3 While formal reports have not been published, informed opinion has placed the cost of creation of a MARC record at a level of $3.00 ± $.50. Format recognition is credited with an increase in productivity of about one-third on input keying and an increase of over 80 percent in human editing/proofreading, and actual computer processing times approximate those achieved with earlier Library of Congress MARC processing programs.4 It would seem that AFR may have lowered Library of Congress MARC processing costs to the level of $2.00 ± $.50. In the final report of the RECON Pilot Project, cost simulation projections for full editing and format recognition editing were given as $3.46 and $3.06 per record, respectively.6

While full cost information has not been derived for the UC/AFR program itself, figures have been informally reported at library automation meetings indicating that the cost of record creation was approximately $1.00 per entry. Included in this figure is computer editing of name and other entries against a computerized authority file, which is done manually in the LC/AFR system. This program is undeniably the least-cost effort to date providing a MARC-format bibliographic record.

* A number of standard subdivisions of various fields were first announced as part of the MARC format in the 5th edition of Books: A MARC Format, which was published in 1972.1 Consequently they are not specified in Format Recognition Process for MARC Records, published in 1970, which was used as the reference for this paper.2 They are, however, clearly subfields which could be identified by expansions of AFR. These elements are marked with a lower-case "r" in Appendix I.
No cost data are provided on the OCLC on-line input system. It can be observed that the coding required is quite similar to the pre-AFR system in use at the Library of Congress itself, and that on-line CRT input had been evaluated at LC as a higher-cost input technique than the magnetic-tape typewriters currently providing MARC input. LC is considering, though, on-line CRT access for subsequent human editing of the MARC record created through off-line input and AFR editing.
Error Rate and Correction-Any AFR strategy, with the present state of the art, generates some error above the normal keying rate observed with edited records. The strategy aims for lowest overall cost by catching these errors in a postprocessing edit which must be performed even for records edited prior to input. The Library of Congress reports, "The format recognition production rate of 8.4 records per hour (proofing only) . . . is slightly less than that (about 9.2 per hour) for proofing edited records. With format recognition records, the editors must be aware of the errors made by the program . . . as well as keying mistakes."6 The savings in prekeyboard editing and increased keying rates more than make up for this slight decrease in postprocessing editing.
At the Library of Congress, where AFR is used for production of MARC records, a full editing process aims at 100 percent accuracy of input. While such a goal is statistically unreachable, considerable effort is expended by the MARC Distribution Service to provide the most accurate output possible. From a systems perspective, errors existing in MARC records are perhaps less reprehensible than errors in printed bibliographic output, simply because the distributed MARC record can be updated by subsequent distribution of a "correction" record. It should be noted that some MARC subscribers have voiced concern about the increased percentage of "correction" records, which the Library of Congress indicates come primarily from cataloging changes rather than input edit errors.
The UC/AFR program clearly takes a statistical approach to bibliographic element input and processing. Shoffner has indicated that the scale of the 1,000,000 record input project caused a reevaluation of the feasibility of traditional procedures.7 The result is, in the UC/AFR implementation, a MARC record essentially devoid of human editing. For a smaller scale of production, the UC approach could be combined with post-editing such as that used at LC to increase overall file accuracy. In passing, however, it should be noted that rather sophisticated verification techniques are used in the UC/AFR approach which could be of value in future approaches. These include, for instance, comparison of all words against a machine-readable English-language dictionary; words not found in the dictionary are output for manual editing as suspected keypunch errors.
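The dictionary-comparison pass lends itself to a short sketch. In this illustrative Python fragment the tiny word list merely stands in for the machine-readable English-language dictionary; everything else is a plain filter.

```python
# Sketch of the UC-style verification pass: words absent from the
# dictionary are reported as suspected keypunch errors for manual
# review.  The tiny word list stands in for a full dictionary.
DICTIONARY = {"the", "green", "hills", "of", "earth"}

def suspected_errors(text):
    # Normalize each word: strip trailing punctuation, lower-case it.
    words = [w.strip(".,;:").lower() for w in text.split()]
    return [w for w in words if w and w not in DICTIONARY]
```

Proper names and legitimate rare words would also be flagged by such a pass, which is why the suspected errors go to manual editing rather than automatic correction.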
Little information is available on the error rates and corrections in the OCLC system. However, most records keyed to the OCLC system are for a local member's catalog card production, so feedback is provided and presumably errors are corrected through re-inputting to obtain a proper set of catalog cards. There is no central control on the quality of locally entered OCLC records at present, except for the encoding standards developed by OCLC.
Flexibility in Use-A number of considerations are appropriate here: how many types of format (catalog cards, worksheets, etc.) can be used as input, how many possible outputs can be developed from the derived MARC format, how adaptable is the system to remote and multiple input locations, and how many special equipment restrictions are there?
The LC/AFR program is clearly the most flexible in ability to accept varying inputs and provide a flexible output. It is, however, not capable of any authority-file editing at present (this is done manually against LC's master files before input). While the input form could be used rather easily at remote locations, the MARC AFR programs themselves are not available for use outside the Library of Congress.

The UC/AFR program provides a rather minimal set of cataloging element subfields but does provide more sophisticated textual editing within the program. It is quite adaptable to remote input as long as the original "worksheet" is in catalog card format, a restriction which in effect requires a preinput human editing step for original cataloging input. The MARC format provided would not be sufficient for some currently operating programs using the full MARC format, but is quite sufficient for most bibliographic outputs.
The OCLC input program is dependent on visual editing at the time of CRT keying. Its flexibility in input is considerable, and outputs can approach a full MARC record if all optional fields are identified.

ORIGINAL FORMAT RECOGNITION
A working conclusion of this review is that an AFR program developed according to the strategy of the University of California will deliver a satisfactory MARC-format record at a lower cost than other AFR or non-AFR alternatives. However, much of the efficiency of the UC/AFR is based on the presence of an already existing LC-format catalog card from which to keyboard machine-readable data.
For original cataloging to be keyboarded from a cataloger's worksheet, an original format recognition strategy is proposed which provides a somewhat more detailed format than the UC/AFR MARC while retaining a generally flexible system and low input costs. Several system considerations also guide the design of an OFR system designed for relatively general-purpose user input and multiple output functions:

• no special equipment requirements for input keying;
• no special knowledge of the MARC format required;
• minimal table-lookup or text searching in processing;
• flexible options for depth of coding provided; and
• sufficient depth of format derived for most applications.
The OFR input strategy outlined in Appendix I provides a much greater degree of explicit field coding at input than the AFR programs outlined above. The basis for this decision is the judgment that this cataloging, being done originally by a professional, can readily be coded by element name prior to input.
No effort is made to identify MARC field elements which occur with very low frequency, or which are of limited utility for most applications. For instance, the "MEETING" type of entry occurs, in all combinations, in only 1.8 percent of all records studied by the Library of Congress in its format recognition study.8 MARC elements requiring either extensive human editing or complex computer processing are likewise excluded from input, on a cost-utility basis. An example is the Geographic Area Code, which must either be assigned by a knowledgeable editor or derived through extensive computer searching for the city/county of publication.
However, where little penalty is attached to allowing input of coded information, the OFR format allows input for inclusion in the derived MARC-format record.

CONCLUSION
It is clear that the AFR programs developed for specific needs by the Library of Congress and the University of California can be great factors for change in library automation strategies over the next decade. Striking benefits in cost savings, ease of input, and subsequent processing are to be gained.
The abbreviated outline of an original cataloging (OFR) input strategy is simply a suggestion of a second generation of format recognition programs which will undoubtedly develop to serve more general needs for MARC-format bibliographic input.


APPENDIX I

Code Outline

FIELD TAG-The number listed is the field tag number of that bibliographic element in the MARC format. Each general field is listed first. Following it are notes indicating areas within the field. Fixed-field indicators within the field are listed first; each one's code number follows a slash after the field code (041/1 = field 041, indicator code 1). If there is more than one group of indicators, an additional code describes group 1 (I1) or group 2 (I2). Subfields within the field are alphabetic codes following a "+" sign after the field code (070+b = field 070, subfield b).
FIELD NAME-The overall field name is listed first. Fixed-field indicator names are listed at the first indention under the Field Name. Subfield names are listed at the second indention under the Field Name.
TREATMENT BY PROGRAM-These codes indicate the processing provided for each field and subelement by the four computer processing systems considered. Codes are slightly different for each column considered:

LC-The Library of Congress system. "R" indicates that the element described is recognized by the program, rather than explicitly identified at input. "I" indicates the element is keyed and not recognized by the format recognition process. A small "r" denotes elements introduced to the MARC format since AFR documentation was published, but presumably treated by the AFR program just as "R" elements. "O" indicates that the element marked is omitted from input altogether.

UC-The University of California system. Codes are identical to those above, but the "r" code is not used.

OCLC-The Ohio College Library Center system. In addition to the above codes, an asterisk ("*") following any item denotes that input is optional. The "I" code is used wherever an element is tagged, even though the OCLC programs create the MARC format from these tags.
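For illustration only, the code notation described above is regular enough to parse mechanically. This Python sketch assumes just the three shapes the legend names (bare tag, slash-indicator, plus-subfield); the appendix itself remains the authority for the notation.

```python
# Parse the Appendix I code notation: "041/1" is field 041, indicator
# code 1; "070+b" is field 070, subfield b; a bare "245" is just a
# field tag.  Illustrative sketch only.
def parse_code(code):
    if "/" in code:
        field, indicator = code.split("/", 1)
        return {"field": field.strip(), "indicator": indicator.strip()}
    if "+" in code:
        field, subfield = code.split("+", 1)
        return {"field": field.strip(), "subfield": subfield.strip()}
    return {"field": code.strip()}
```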