Application of the Variety-Generator Approach to Searches of Personal Names in Bibliographic Data Bases - Part 1. Microstructure of Personal

Conventional approaches to processing records of linguistic origin for storage and retrieval tend to regard the data as immutable. The data generally exhibit great variety and disparate frequency distributions, which are largely ignored and which entail either the storage of extensive lists of items or the use of complex numerical algorithms such as hash coding. The results in each case are far from ideal. The variety-generator approach seeks to reflect the microstructure of data elements in their description for storage and search, and takes advantage of the consistency of statistical characteristics of data elements in homogeneous data bases. In this paper, the application of the variety-generator approach to the description of personal author names from the INSPEC data base by means of small sets of keys is detailed. It is shown that high degrees of partitioning of names can be obtained by key-sets generated from the initial characters of surnames, from the terminal characters of surnames, and from the initials. The implications of the findings for computer-based bibliographical information systems are discussed.


INTRODUCTION
The application of computer technology to the storage of bibliographic data bases and to the selection of items from them on the basis of the content of specified data elements poses considerable problems. Among the most important of these, from the viewpoint of the efficiency of computer use, is the fact that many of the individual data elements exhibit great variety (i.e., lists of their contents are extensive) and show relatively disparate distributions. 1-4 Such distributions have been extensively studied in various contexts by Bradford and Zipf, among others. In general, the distributions are approximately hyperbolic, so that a small proportion of items may account for a substantial proportion of occurrences, while the majority of items occur only infrequently. The studies have been well reviewed by Fairthorne. 7 Of all the data elements, personal author names exhibit a distribution which is at its most extreme in one direction. As is shown later in this paper, the most frequent author name in a file of 50,000 names occurred only sixteen times, while over 35,000 of the names, or over 70 percent of the file, occurred once only.
The variety-generator approach, 9 based on information-theoretic principles, involves a two-stage search procedure. In the first, rapid stage, the majority of items which cannot possibly fulfill the search criteria are eliminated; in the second stage, those which meet the criteria are examined for an exact match. The criteria (or attributes) are selected on the basis of an examination of the microstructure of the items in the data base, and are chosen so that their frequencies are approximately equal. The number of criteria or attributes chosen for description of the items is variable within a wide range; with their aid, the variety of items can be described so as to facilitate discrimination among them.
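The two-stage procedure can be sketched in Python as follows. This is a minimal illustration only, not the authors' implementation: the function names (`key_of`, `build_index`, `search`), the toy key-set, and the rule of describing a name by the longest key that prefixes it are all assumptions made for the example.

```python
from collections import defaultdict

def key_of(name, keys):
    """Return the longest key that is a prefix of `name`, or None.

    Assumption: a name is described by the longest matching key.
    """
    longest = max(len(k) for k in keys)
    for length in range(min(len(name), longest), 0, -1):
        if name[:length] in keys:
            return name[:length]
    return None

def build_index(names, keys):
    """Partition the file: each name is filed under its key."""
    index = defaultdict(list)
    for name in names:
        index[key_of(name, keys)].append(name)
    return index

def search(query, index, keys):
    # Stage 1: discard every partition that cannot contain the query.
    candidates = index.get(key_of(query, keys), [])
    # Stage 2: examine the surviving candidates for an exact match.
    return [name for name in candidates if name == query]
```

The more nearly equal the key frequencies, the more evenly the file is partitioned, and the fewer candidates reach the exact-match stage on average.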
In the context of substructure searching, the attributes are representations of fragments of chemical structures, 10 while in the case of text, they are strings of characters which are variable in length. These strings are long when the characters comprising them represent frequent combinations, and short when the characters are infrequent. 11 Since the sets of attributes can generate, in an approximate manner, the variety of items encountered in the data base, they are termed variety generators. They are intermediate in number between the primitive set of symbols (alphanumeric characters in the case of text, atoms and bonds in that of chemical structures) and the actual variety of items in the collection (words or word fragments in text in the first instance, and molecules in the second).
The variety-generator approach involves recognition of the fact that the statistical properties of specific data elements within homogeneous data bases are relatively constant, and that the primitive symbols of the data elements themselves usually show hyperbolic distributions. New symbol sets can therefore be defined, consisting of sequences of primitive symbols such that their frequencies of occurrence become comparable. The new symbol sets then constitute the attributes which are employed, singly or in combination, to represent the items within a search file. These symbol sets approximate to the ideal of equifrequency postulated by Shannon for optimal efficiency in communication. 12 Only an approximation can be obtained, however, since the distributions of the newly defined symbols still cover a relatively wide range, and since they are seldom entirely independent of one another in statistical terms, and may often be strongly associated.
The variety-generator concept is not entirely novel. 14 However, the greater flexibility of computer techniques would appear to make its use today even more attractive.
This paper thus describes a study of a large file of authors' names with a view to identifying attributes of the names which can be used for efficient retrieval purposes. Assessment of the effectiveness of the attributes in retrieval is described in Part 2 of this series. The main terms used here are n-gram, key, and key-set, where an n-gram is a string of n adjacent characters. A key consists of an n-gram, and keys are chosen so that the frequencies of a set of keys (or key-set) are approximately equivalent in a given file.
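The measures used in assessing frequency distributions are Shannon's expressions for the entropy of a sequence of symbols:

$$H = -\sum_{i=1}^{n} p_i \log_2 p_i$$

and relative entropy:

$$H_{\text{rel}} = \frac{H}{H_{\max}}$$

where $p_i$ is the probability of occurrence of the $i$-th of $n$ different symbols. $H_{\max}$ is reached when the probabilities of occurrence of the symbols of the sequence are equal; its value is the binary logarithm of the variety of symbols, since

$$H_{\max} = -\sum_{i=1}^{n} \frac{1}{n} \log_2 \frac{1}{n} = \log_2 n$$

The value of the relative entropy is thus a measure of the degree of equifrequency of a set of symbols, and is independent of their variety.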
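The two measures are simple to compute from a table of frequencies. The following Python sketch is illustrative only; the function names and the optional `variety` argument are assumptions of the example. The argument allows the maximum entropy to be taken with respect to a stated variety rather than the number of symbols actually observed.

```python
import math

def entropy(counts):
    """Shannon entropy H, in bits, of a list of occurrence counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def relative_entropy(counts, variety=None):
    """H / H_max, where H_max = log2(variety).

    By default the variety is the number of symbols with nonzero counts.
    """
    if variety is None:
        variety = sum(1 for c in counts if c > 0)
    if variety < 2:
        raise ValueError("relative entropy needs a variety of at least 2")
    return entropy(counts) / math.log2(variety)
```

A perfectly equifrequent set of symbols gives a relative entropy of one; the more skewed the distribution, the lower the value.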

CHARACTERISTICS OF NAME FILE
The file studied was a collection of 100,000 personal names taken from ten issues of the INSPEC data base dating from the period 1969 to 1972. The names are represented in variable-length format: surname followed by a comma, a space, and initials, each followed by a period. For the present purpose, case and diacritic shift symbols were ignored.
Subsets of the file were first sorted into sequence on the basis of the full names, and distributions determined both for surnames and initials, and for surnames alone, as shown in Table 1 for the subset of 50,000 names. Since the great majority of full names occur once only, the relative entropy of this distribution, at 0.975 (computed with respect to the 50,000 names, i.e., $H_{\max} = \log_2 50{,}000$), is high, while that for surnames alone is lower, at 0.904. An analysis of the ratio of unique surnames to the total number of entries in files of 25,000, 50,000, 75,000 and 100,000 names showed that the proportion of different surnames added to the file as it increases in size is predictable. The relationship between the number of different surnames ($D$) and the total number of entries ($N$) conforms to the expression $D = aN^{\beta}$, where $a = 5.89$ and $\beta = 0.78$.
Next, the frequencies of characters at different positions in the surnames and of the initials were determined. The most important positions in the surname are the first and last characters, as will be seen shortly. The distributions of these characters and of the first and second initials are shown in Table 2. The relative entropy of the first initial is, interestingly, the highest of the four; the highest ranking initial is J, which is one of the least frequent characters in English text. Thereafter follow the first and last letters of the surname, and the second initial. The low relative entropy of the last is partly accounted for by the fact that a single initial occurred in 37 percent of the entries. Distributions were also obtained for the second and subsequent characters of the surname. 16 However, due to the variable lengths of names, the dominant character at the sixth and subsequent positions of the surname is the space character.
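The growth relation between different surnames and file size, D = aN^β with a = 5.89 and β = 0.78, can be evaluated directly for the four file sizes studied. The short Python sketch below uses only the constants quoted in the text; the function name is an assumption of the example, and the values it prints are those of the fitted curve, not observed counts.

```python
A = 5.89     # coefficient a quoted in the text
BETA = 0.78  # exponent beta quoted in the text

def distinct_surnames(n_entries, a=A, beta=BETA):
    """Number of different surnames predicted by D = a * N**beta."""
    return a * n_entries ** beta

for n in (25_000, 50_000, 75_000, 100_000):
    d = distinct_surnames(n)
    print(f"N = {n:>7,}  D = {d:>9,.0f}  D/N = {d / n:.3f}")
```

Because the exponent is less than one, the ratio D/N falls as the file grows: doubling the number of entries multiplies the predicted number of different surnames by 2^0.78, or about 1.72, rather than by 2.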

KEY-SET GENERATION TECHNIQUE
The basic key-set generation technique involves creating fixed-length n-grams from some point or points of reference within each record, the strings generated being initially of length greater than those anticipated within the key-set. These strings are sorted into lexicographic order and counted. (The resultant distribution of the fixed-length strings is again hyperbolic.) The frequencies are compared with a predetermined threshold frequency; at the first stage none of the string frequencies should exceed this value. The strings are then shortened by truncation of the right-hand character, and the frequencies of the strings which have become identical through truncation are accumulated. The new n-gram frequencies are compared with the threshold value; any strings which exceed the value are noted. The procedure is repeated until the single characters are reached. Two types of analysis are possible, redundant and nonredundant. In the latter, any string exceeding the threshold value is removed from the list and not processed further, while in the former such strings continue to the next processing stage. While redundant analysis is valuable at the exploratory stage, the nonredundant type is preferred for key-set generation.
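The nonredundant procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the original program: the function name and parameters are hypothetical, the threshold is an absolute count, a `Counter` stands in for the sort-and-count step, and every single character that still has occurrences left at the final stage is kept as a key, so that each name remains covered by some key.

```python
from collections import Counter

def generate_key_set(surnames, threshold, max_len=8):
    """Nonredundant key-set generation by right-hand truncation.

    Returns a dict mapping each selected key to its count.
    """
    # Fixed-length n-grams from the first character of each surname,
    # left-justified and space-filled if shorter than max_len.
    counts = Counter(s.upper().ljust(max_len)[:max_len] for s in surnames)
    keys = {}
    for length in range(max_len, 0, -1):
        shorter = Counter()
        for ngram, count in counts.items():
            if count > threshold or length == 1:
                # Selected as a key and not processed further.
                keys[ngram] = count
            else:
                # Drop the right-hand character and accumulate the
                # counts of strings that have become identical.
                shorter[ngram[:-1]] += count
        counts = shorter
    return keys
```

With the toy file FOREMAN (3 entries), FORD (4), FOSTER (2), FINCH (1) and SMITH (5), a threshold of 4 and `max_len=4`, the sketch selects SMIT (5), FOR (7) and F (3). The counts sum to the fifteen entries of the file because each name is counted under exactly one key.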
The procedure was first applied to strings of characters starting with the first character of each surname, as illustrated in Figure 1. Here the frequency of the surname FOREMAN in a file of 50,000 names is eleven. When successively shortened, other surnames with the same initial n-gram are included in the count. Comparison of the count with a threshold value results in selection of a key. Here, if the threshold were 100, the key selected would be FOR. Application of the procedure to the surnames of the 50,000 name file (the name records had a maximum of eighteen characters, left-justified and space-filled if less than this length), with a threshold frequency of 300 (i.e., a probability of 0.006), gave a key-set consisting of eighty-seven keys, including all the alphabetic characters. The key-set is shown, in alphabetic order, together with the probabilities, in Table 3. It is clear that the most frequent characters at the beginning of the surname have produced most keys: S and M with eight keys each, B with seven, K with six, and H, G, P, and R each with five keys. Whereas the relative entropy of the initial surname letter was 0.917, that of the key-set is 0.977. The probabilities of no less than seventy of the eighty-seven keys now lie between 0.005 and 0.015. The key-set itself consists of the twenty-six alphabetic characters (one of these, X, is not represented in the collection), fifty-eight digram keys, and the three trigram keys BAR, MAR, and SCH. The predominance of vowels as the second character of keys is noticeable; forty-nine of the sixty-one n-grams have a vowel in the second position.
The size of the key-set produced from a given data base can be varied arbitrarily by changing the threshold value. An approximately hyperbolic relation obtains between the value of the threshold and the number of keys selected. As the size of the key-set increases, the length of the longest n-gram in the key-set increases, and the distribution of n-grams shifts toward higher values, as shown in Figure 2.
Stability of the key-sets with increase in file size is clearly an important factor. To determine the extent of this, successive portions of the entire file of 100,000 surnames were subjected to the analysis at a threshold value of 0.005. As illustrated in Table 4, the key-sets are remarkably stable in regard to total key-set size, the number of keys of each length, and to the actual keys. As the size of the key-set increases, the range of probabilities represented among the keys narrows, and the relative entropy of the distribution increases, becoming eventually asymptotic with the value of one. This is illustrated in Figure 3, for the surnames in a file of 50,000 entries. Beyond a key-set size of about 100, increases in the relative entropy of the resultant distribution are marginal. Furthermore, with increasing key-set size, the shorter and more frequent surnames begin to appear in their entirety as keys.
As an alternative to increasing the variety of the keys, the production of keys from character positions after the first letter of the surname was considered. The problem of variations in name length, as well as the very different distributions of the characters at these positions, was not encouraging, and instead the production of key-sets from the last letter of the surname was investigated, and proved much more attractive, since it is largely independent of surname length.

Fig. 3. Increase in relative entropy with increase in key-set size; keys generated from 50,000 surnames (horizontal axis: total number of keys for the front of surnames).

KEY-SETS FROM THE END OF THE SURNAME
For this purpose, each surname in the file was reversed within a record and subjected to key-generation. The relative entropy of the last character of the surname is substantially lower than that of the first character, at 0.860. Accordingly, the key-sets have a higher proportion of longer keys than those produced from the front of the surname, as shown in Table 5. This key-set consists of the twenty-six characters, seventy-eight digrams, forty trigrams, ten tetragrams, and a single pentagram. The breakdown of the individual terminal characters of the surname is also more extreme, since the distribution is more skewed. Thus N, the most frequent last character, has no fewer than nineteen different keys in this set, closely followed by R, with seventeen keys. The relative entropy of the distribution is again high, at 0.970 for this key-set. Figure 4 shows the relation between key-set size and relative entropy, and indicates that a larger number of keys from the last character of the surname is required to reach the same relative entropy as keys from the first character. There is an anomalous section of the curve, which may well derive from the much greater prevalence of suffixes than prefixes in personal names.
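The reversal step is what allows the same left-to-right machinery to be applied to the end of the surname. The Python sketch below illustrates it by counting terminal n-grams of a chosen length. The function name is hypothetical, and padding names shorter than n with spaces is an assumption carried over from the space-filled records described earlier.

```python
from collections import Counter

def terminal_ngram_counts(surnames, n):
    """Count the last n characters of each surname via reversal."""
    counts = Counter()
    for s in surnames:
        reversed_name = s.upper()[::-1]        # last character comes first
        prefix = reversed_name.ljust(n)[:n]    # space-filled if too short
        counts[prefix[::-1]] += 1              # restore reading order
    return counts
```

Applied to JOHNSON, WILSON, MILLER and NG with n = 3, the sketch counts SON twice, and LER and " NG" (space-filled) once each. Feeding the reversed names to a truncation procedure then shortens these strings from the left of the suffix, so that SON would become ON and then N.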

CONCLUSIONS
This study has demonstrated the feasibility of devising partial representations of author names by applying the variety-generator approach to overcome the substantial frequency variations encountered in their distributions. It has also been shown that within a homogeneous file, i.e., one of consistent provenance, there exists a substantial level of consistency in terms of character distributions, as illustrated in Table 4. The characteristics may vary substantially between data bases of different provenance, e.g., as between INSPEC and MARC files. 17
Conventional approaches to processing records comprising linguistic data tend to disregard the statistical properties of the items, and attempt to overcome the resultant problems either by storage of extensive lists of items or by using complex numerical algorithms. Typical of this latter approach, in the present context, is the use of truncated search keys for access to bibliographical files in direct access stores, in which fixed-length character strings are the keys, as, for instance, in the system in operation at the Ohio College Library Center. 18 The problems encountered in the use of fixed-length truncated author and title search keys for monograph data are indicated by the fact that the search files using hash-addressing are operated, on average, at a density of only 62.5 percent. Once the density reaches 75 percent, the proportion of collisions and the resultant degradation in performance are such that the files are recreated at a density of only 50 percent.
Fixed-length keys from author and title entries are demonstrably inefficient in performance since the information content is low. The distribution of the initial trigrams of 50,000 names from the INSPEC file provides corroboration of this fact. The number of possible combinations of three characters is 17,576 ($26^3$), yet only 3,285 trigrams were represented in the file, or 18.7 percent of the total variety. Moreover, the relative entropy of the trigrams is much lower than that of the initial characters of the surnames, at 0.73. Performance figures for precision illustrate this point. 19
The present work, together with other studies of the scope for application of the variety-generator approach, thus stands in considerable contrast to prior work, and must be viewed as a means whereby the microstructure of particular data elements is fully reflected in their manipulation, affording substantial advantages. 20 Part 2 of this paper illustrates this in regard to searches of personal names.
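The trigram coverage figure quoted above can be checked by simple arithmetic. The bit counts that follow are an added illustration, not figures from the study; they are upper bounds that would be reached only if every trigram were equally frequent.

```latex
26^3 = 17{,}576, \qquad \frac{3{,}285}{17{,}576} \approx 0.187

\log_2 17{,}576 = 3 \log_2 26 \approx 14.1 \ \text{bits}, \qquad
\log_2 3{,}285 \approx 11.7 \ \text{bits}
```

A fixed-length trigram field therefore has room for about 14.1 bits, but the trigrams actually present could convey at most about 11.7 bits even if they were equifrequent.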

Fig. 1. Successive right-hand truncations of a surname during key-set generation.

Table 1. Distribution of full names and surnames alone in a file of 50,000 INSPEC names.

Table 2. Distributions of first and last characters of surname and of initials in 50,000 INSPEC name file.

Table 3. Key-set of 87 keys produced from 50,000 surnames from INSPEC files.

Table 4. Stability of size and composition of keys with increasing file size.

Table 5. Key-set of 155 n-grams produced from last letter of 50,000 INSPEC surnames at threshold of 0.003.