Corporate Author Entry Records Retrieved by Use of Derived Truncated Search Keys

An experiment was conducted to design a corporate author index to a large bibliographic file. The nature of corporate entries necessitates a different search key construction from that of personal names or titles. Derivation of a search key to select distinct corporate entry records is discussed.


INTRODUCTION
This paper describes the findings of an experiment conducted to design a corporate author index to entries in a large file of catalog records at the Ohio College Library Center; a companion paper describes findings of a similar investigation into retrieval employing a personal author index.1 The center has operated an on-line, shared cataloging system since August 1971. In addition to a Library of Congress card number index, the system maintains truncated name-title and title index files. The user is thus able to retrieve entries employing truncated search keys. Three previous papers report results of experiments which led to the design of the name-title and title indexes.2-4 For monographs having personal names as main entries, a truncated 3,3 search key consisting of the first three letters of the author's name plus the first three letters of the first word of the title that is not an English article was judged to be satisfactory, in that this key yielded five or fewer entries per query in more than 99 percent of the cases when keys were selected at random.5 However, a recent study by Guthrie and Slifko reveals that a model which employs random selection of entries yields results closer to actual experience, and with a higher average number of entries per reply.6 For personal names, a search key composed of the first five or four characters of the surname and the first or first and second initials makes efficient retrieval possible. However, the situation is different in the case of corporate entries because many corporate names begin with the same or similar words. For example, in the records examined, the initial words of more than 1,300 publications are "U.S. Congress, House Committee On ..." Obviously a type of search key different from that which proved efficient for retrieving personal authors is required for retrieval of corporate entries.

MATERIAL AND METHODS
The experiment used a file of approximately 200,000 MARC II records having a total of 68,169 corporate name entries. Corporate entries were extracted from the 110, 111, 410, 411, 710, 711, 810, and 811 fields in the records. A program edited the file to extract keys; initial English language articles were removed from each entry, and the words "United States," "U.S.," "U.S.," "Great Brit.," and "Great Britain" appearing anywhere in the entry were replaced with "US" and "Gt Brit" respectively. A blank was substituted for each subfield delimiter and associated code, and unwanted characters such as punctuation, diacritics, and special symbols were removed; the program also closed up the space that the unwanted character had occupied. One blank replaced multiple blanks. The elements extracted consisted of five segments of eight characters each, representing the initial eight characters of the first five words of the corporate entry. Segments containing fewer than eight characters were padded out with blanks. If a corporate name had fewer than five words, the remaining segments were blank.
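The editing steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used in the experiment: the function name `extract_segments`, the use of "$" plus one character to stand for a subfield delimiter and code, the list of articles, and the folding of keys to upper case are all assumptions made here.

```python
import re
import unicodedata

# Assumptions for illustration: "$" plus one character stands in for a MARC
# subfield delimiter and code, and keys are folded to upper case.
SUBFIELD = re.compile(r"\$[a-z0-9]")
US = re.compile(r"\bUnited States\b|\bU\.\s?S\.", re.IGNORECASE)
GT_BRIT = re.compile(r"\bGreat Britain\b|\bGreat Brit\.", re.IGNORECASE)
ARTICLES = {"A", "AN", "THE"}  # assumed list of English articles


def extract_segments(entry, n_segments=5, width=8):
    """Return n_segments blank-padded segments of `width` characters each."""
    text = SUBFIELD.sub(" ", entry)      # delimiter and code become one blank
    text = US.sub("US", text)
    text = GT_BRIT.sub("Gt Brit", text)
    # Drop diacritics by decomposing and discarding combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove unwanted characters and close up the space they occupied.
    text = re.sub(r"[^A-Za-z0-9 ]", "", text)
    words = text.upper().split()         # split() also collapses multiple blanks
    if words and words[0] in ARTICLES:
        words = words[1:]                # drop an initial English article
    words = words[:n_segments]
    words += [""] * (n_segments - len(words))  # blank segments for short names
    return tuple(w[:width].ljust(width) for w in words)


print(extract_segments("$aUnited States.$bCongress.$bHouse.$bCommittee on Agriculture."))
```

Each entry is thereby reduced to a fixed-length record of forty characters, from which any shorter key structure can be derived by truncating the individual segments.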
To study a given type of key, the file was sorted on a specified number of initial characters of each segment; these initial characters were then employed as search keys by a program which sequentially compared the characters in the key, counting distinct and identical keys.
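The counting step can be illustrated as follows. The experiment sorted the file and compared adjacent keys sequentially; the sketch below substitutes a hash-based tally (`collections.Counter`), which gives the same counts. The names `derive_key` and `key_statistics` and the four sample entries are hypothetical, and "distinct keys" is taken here to mean the number of different key values.

```python
from collections import Counter


def derive_key(segments, structure):
    """Keep the first n characters of each segment named in `structure`."""
    return tuple(seg[:n] for seg, n in zip(segments, structure))


def key_statistics(entries, structure):
    """Count distinct keys and the largest number of entries sharing one key."""
    counts = Counter(derive_key(e, structure) for e in entries)
    distinct = len(counts)
    return {
        "distinct": distinct,
        "percent_distinct": 100.0 * distinct / len(entries),
        "max_occurrences": max(counts.values()),
    }


# Illustrative entries only, already split into five 8-character segments.
entries = [
    ("US      ", "CONGRESS", "HOUSE   ", "COMMITTE", "ON      "),
    ("US      ", "CONGRESS", "HOUSE   ", "COMMITTE", "ON      "),
    ("US      ", "CONGRESS", "SENATE  ", "COMMITTE", "ON      "),
    ("OHIO    ", "COLLEGE ", "LIBRARY ", "CENTER  ", "        "),
]

for structure in [(8, 8, 8, 8, 8), (2, 2, 2, 2, 2), (2, 2)]:
    print(structure, key_statistics(entries, structure))
```

Passing a shorter tuple such as `(2, 2)` drops the trailing segments, which is how a structure with fewer segments would be examined in this sketch.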

RESULTS AND DISCUSSION
Table 1 presents the number of distinct keys and the maximum number of occurrences of identical keys for the structures studied in the experiment. The larger the number of distinct keys for a fixed number of entries in the file, the better the key will be for retrieval purposes. Given two search keys which are more or less equally specific, the one which is simpler to use is preferable.
The peculiarity of corporate-entry keys can be observed from Table 1. Even for the 8,8,8,8,8 key structure the percentage of distinct keys (33.7 percent) is low, and the maximum number of occurrences of an identical key (1,304) is high. Another observation revealed by Table 1 is that as the key structure goes from five to three segments, there is a steady decrease in the percentage of distinct keys and consequently an increase in the maximum number of entries per key. However, a reduction in the number of characters in a segment does not cause a great deal of deterioration. For example, for 8,8,8,8,8 keys, the percentage of unique keys and the maximum number of entries per key are respectively 33.7 percent and 1,304, while for 2,2,2,2,2 keys, the corresponding figures are 32.3 percent and 1,307.
Thus, the 2,2,2,2,2 key structure seemed a good candidate for a corporate entries index and therefore the number of entries per reply for this key structure was more intensely studied.
On the average it is desirable that the number of replies per query be such that information by which the user can choose among the possible replies can be displayed on a single CRT screen. This maximizes the utility of a computer system, since it minimizes the amount of system activity needed to satisfy a user's request promptly. Since some query keys produce but one reply while others produce hundreds of candidate records, it is necessary to use the mathematics of probability to determine the likely long-term effect of a given choice of system parameters. Using the approach indicated as useful by Guthrie and Slifko, the analysis of the effect of various choices of search key becomes the following. Assume that every entry has an equal probability of being accessed. Then, in attempting to retrieve each entry once, each key having $i$ entries will cause a total of $i^2$ entries to be accessed. If $f_i$ denotes the frequency of keys having $i$ entries and $M$ denotes the maximum allowable occurrences of any key in the file, the average number of entries per reply, $y$, is given by:

$$y = \frac{\sum_{i=1}^{M} i^{2} f_i}{\sum_{i=1}^{M} i f_i}$$

where $\sum_{i=1}^{M} i f_i$ is the number of entries in the file whose derived keys have a frequency of $M$ or less.
The above formula yields an average number of entries per reply much larger than 20 for the 2,2,2,2,2 key when M > 100, and some 2,2,2,2,2 keys corresponded to more than 500 file entries. A typical CRT display terminal can accommodate only ten or fewer entries per screen. Therefore, if the average number of entries per reply is desired to be ten or fewer, it is necessary either to ignore entries with high multiplicity or to adopt a different scheme of storing and retrieving such items, in which case the mathematical result would be the same as ignoring high-frequency items.
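As a minimal sketch of this calculation, the function below computes the average number of entries per reply as the sum of i squared times the number of keys with i entries, divided by the sum of i times that number, with both sums restricted to keys having M or fewer entries. The function name and the frequency table are hypothetical and are not taken from the experiment's data.

```python
def average_entries_per_reply(freq, max_occurrences):
    """Average reply size when every retained entry is equally likely to be sought.

    freq maps i (entries sharing one key) to the number of keys with exactly
    i entries; keys with more than max_occurrences entries are ignored.
    """
    kept = {i: f for i, f in freq.items() if i <= max_occurrences}
    entries = sum(i * f for i, f in kept.items())        # entries retained
    accessed = sum(i * i * f for i, f in kept.items())   # entries shown in total
    if entries == 0:
        raise ValueError("no keys at or below the cutoff")
    return accessed / entries


# Hypothetical frequencies: ten keys with 1 entry, five with 2, one with 50.
toy = {1: 10, 2: 5, 50: 1}
print(average_entries_per_reply(toy, 2))    # the 50-entry key is ignored
print(average_entries_per_reply(toy, 50))   # the 50-entry key dominates
```

Even in this toy table a single heavily shared key raises the average sharply, because each of its entries brings back all of its siblings.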
The average number of entries per reply was computed for five different values of M (19, 29, 39, 49, and 59); the results of these computations are in Table 2, which reveals that if keys in the file are allowed a maximum recurrence of 39 entries per key, it would be possible to have keys in the main index for about 75 percent of total records, while entries for only 142 high-frequency keys would have to be shunted to a secondary index. In this case, the average number of entries per reply would be about eight.
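The division of keys between a main and a secondary index at a cutoff M might be sketched as follows. The function name `split_index`, the string form of the keys, and the counts are illustrative assumptions; the paper does not describe how the secondary index would be organized.

```python
from collections import Counter


def split_index(key_counts, max_occurrences):
    """Partition keys by frequency and report the share of records kept."""
    main = {k: c for k, c in key_counts.items() if c <= max_occurrences}
    secondary = {k: c for k, c in key_counts.items() if c > max_occurrences}
    total = sum(key_counts.values())
    main_share = sum(main.values()) / total   # fraction of records in main index
    return main, secondary, main_share


# Hypothetical key counts (key -> number of entries sharing that key).
key_counts = Counter({"USCOHOCOON": 5, "OHCOLICE": 1, "GTBRPA": 2})

main, secondary, main_share = split_index(key_counts, max_occurrences=2)
print(len(main), "keys in main index;", len(secondary), "shunted;",
      f"{100 * main_share:.1f} percent of records retained")
```

Running the same partition for several cutoffs shows the trade-off the text describes: a lower M shortens replies but sends more keys to the secondary index.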
Table 3 gives the probability of number of entries per reply for the index file consisting of 50,854 (out of a total of 68,169) records with the maximum frequency of any key in the file being 39. For preparing this table the assumption is made that each entry in the file has an equal probability of being accessed. Thus the probability of obtaining $i$ entries per reply is given by:

$$p_i = \frac{i f_i}{\sum_{j=1}^{M} j f_j}$$

where $f_i$ is the frequency of keys occurring exactly $i$ times in the index file and the denominator is the number of entries in the index file. An inspection of this table shows that 87.7 percent of the time there would be 20 or fewer replies. This represents two screensful of information on a typical CRT display.
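The probabilities in such a table follow directly from the key frequencies: a reply of i entries results whenever the entry sought belongs to a key with i entries, so its probability is i times the number of such keys, divided by the total number of entries in the index file. The sketch below uses hypothetical function names and a made-up frequency table, not the data behind Table 3.

```python
def reply_size_probabilities(freq):
    """Probability of a reply of i entries, with every entry equally likely.

    freq maps i to the number of keys occurring exactly i times.
    """
    total_entries = sum(i * f for i, f in freq.items())
    return {i: i * f / total_entries for i, f in sorted(freq.items())}


def probability_at_most(freq, limit):
    """Probability that a reply holds `limit` or fewer entries."""
    probs = reply_size_probabilities(freq)
    return sum(p for i, p in probs.items() if i <= limit)


# Hypothetical frequencies: ten keys with 1 entry, five with 2, one with 30.
toy = {1: 10, 2: 5, 30: 1}
print(reply_size_probabilities(toy))
print(probability_at_most(toy, 20))
```

The cumulative figure is the quantity to compare against screen capacity, for example the chance that a reply fits on two screens of ten entries each.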

CONCLUSION
A file containing only those entries for which the frequencies of 2,2,2,2,2 search keys is 39 or fewer would produce 20 or fewer entries per

Table 1.
Number of Distinct Keys and Maximum Number of Identical Entries Per Key for Different Key Structures in 68,169 MARC II Records.

Table 3.
Probability of Number of Entries Per Reply for an Index File Using 2,2,2,2,2 Key.