The Efficient Storage of Text Documents in Digital Libraries

In this paper we investigate the possibility of improving the efficiency of data compression, and thus reducing storage requirements, for seven widely used text document formats. We propose an open-source text compression software library, featuring an advanced word-substitution scheme with static and semidynamic word dictionaries. The empirical results show an average storage space reduction as high as 78 percent compared to uncompressed documents, and as high as 30 percent compared to documents compressed with the free compression software gzip.



It is hard to expect the continuing rapid growth of global information volume not to affect digital libraries.1 The growth of stored information volume means growth in storage requirements, which poses a problem in both technological and economic terms. Fortunately, the digital libraries' hunger for resources can be tamed with data compression.2
The primary motivation for our research was to limit the data storage requirements of the student thesis electronic archive in the Institute of Information Technology in Management at the University of Szczecin. The current regulations state that every thesis should be submitted in both printed and electronic form. The latter facilitates automated processing of the documents for purposes such as plagiarism detection or statistical language analysis. Considering the introduction of the three-cycle higher education system (bachelor/master/doctorate), there are several hundred theses added to the archive every year.
Although students are asked to submit Microsoft Word-compatible documents such as DOC, DOCX, and RTF, other popular formats such as TeX script (TEX), HTML, PS, and PDF are also accepted. This applies both to the main thesis document, containing the thesis and any appendixes that were included in the printed version, and to the additional appendixes, comprising materials that were left out of the printed version (such as detailed data tables, the full source code of programs, program manuals, etc.). Some of the appendixes may be multimedia, in formats such as PNG, JPEG, or MPEG.3 Notice that this paper deals with text-document compression only. Although the size of individual text documents is often significantly smaller than the size of individual multimedia objects, their collective volume is large enough to make the compression effort worthwhile. The reason for focusing on text-document compression is that most multimedia formats have efficient compression schemes embedded, whereas text document formats usually either are uncompressed or use schemes with efficiency far worse than the current state of the art in text compression.
Although the student thesis electronic archive was our motivation, we propose a solution that can be applied to any digital library containing text documents. As the recent survey by Kahl and Williams revealed, 57.5 percent of the examined 1,117 digital library projects consisted of text content, so there are numerous libraries that could benefit from implementation of the proposed scheme.4 In this paper, we describe a state-of-the-art approach to text-document compression and present an open-source software library implementing the scheme that can be freely used in digital library projects.
In the case of text documents, improvement in compression effectiveness may be obtained in two ways: with or without regard to their format. The more nontextual content in a document (e.g., formatting instructions, structure description, or embedded images), the more it requires format-specific processing to improve its compression ratio. This is because most document formats have their own ways of describing their formatting, structure, and nontextual inclusions (plain text files have no inclusions).
For this reason, we have developed a compound scheme that consists of several subschemes that can be turned on and off or run with different parameters. The most suitable solution for a given document format can be obtained by merely choosing the right schemes and adequate parameter values. Experimentally, we have found the optimal subscheme combinations for the following formats used in digital libraries: plain text, TEX, RTF, text annotated with XML, HTML, as well as the device-independent rendering formats PS and PDF.5
First we discuss related work in text compression, then describe the basis of the proposed scheme and how it should be adapted for particular document formats. The section "Using the scheme in a digital library project" discusses how to use the free software library that implements the scheme. Then we cover the results of experiments involving the proposed scheme and a corpus of test files in each of the tested formats.

Text compression
There are two basic principles of general-purpose data compression. The first one works on the level of character sequences, the second one works on the level of individual characters. In the first case, the idea is to look for matching character sequences in the past buffer of the file being compressed and replace such sequences with shorter code words; this principle underlies the algorithms derived from the concepts of Abraham Lempel and Jacob Ziv (LZ-type).6 In the second case, the idea is to gather frequency statistics for characters in the file being compressed and then assign shorter code words to frequent characters and longer ones to rare characters (this is exactly how Huffman coding works; what arithmetic coding assigns are value ranges rather than individual code words).7
As the characters form words, and words form phrases, there is high correlation between subsequent characters. To produce shorter code words, a compression algorithm either has to observe the context (understood as several preceding characters) in which the character appeared and maintain separate frequency models for different contexts, or has to first decorrelate the characters (by sorting them according to their contexts) and then use an adaptive frequency model when compressing the output (as the characters' dependence on context becomes dependence on position). Whereas the former solution is the foundation of Prediction by Partial Match (PPM) algorithms, Burrows-Wheeler Transform (BWT) compression algorithms are based on the latter.8
Witten et al., in their seminal work Managing Gigabytes, emphasize the role of data compression in text storage and retrieval systems, stating three requirements for the compression process: good compression, fast decoding, and feasibility of decoding individual documents with minimum overhead.9 The choice of compression algorithm should depend on what is more important for a specific application: better compression or faster decoding.
An early work of Jon Louis Bentley and others showed that a significant improvement in text compression can be achieved by treating a text document as a stream of space-delimited words rather than individual characters.10 This technique can be combined with any general-purpose compression method in two ways: by redesigning character-based algorithms as word-based ones or by implementing a two-stage scheme whose first step is a transform replacing words with dictionary indices and whose second step is passing the transformed text through any general-purpose compressor.11 From the designer's point of view, although the first approach provides more control over how the text is modeled, the second approach is much easier to implement and upgrade to future general-purpose compressors.12 Notice that the separation of the word-replacement stage from the compression stage does not imply that two distinct programs have to be used: if an appropriate general-purpose compression software library is available, a single utility can use it to compress the output of the transform it first performed.
An important element of every word-based scheme is the dictionary of words that lists character sequences that should be treated as single entities. The dictionary can be dynamic (i.e., constructed on-line during the compression of every document),13 static (i.e., constructed off-line before the compression stage and once for every document of a given class; typically, the language of the document determines its class),14 or semidynamic (i.e., constructed off-line before the compression stage but individually for every document).15 Semidynamic dictionaries must be stored along with the compressed document. Dynamic dictionaries are reconstructed during decompression (which makes the decoding slower than in the other cases). When the static dictionary is used, it must be distributed with the decoder; since a single dictionary is used to compress multiple files, it usually attains the best compression ratios, but it is only effective with documents of the class it was originally prepared for.

The basic compression scheme
The basis of our approach is a word-based, lossless text compression scheme, dubbed Compression for Textual Digital Libraries (CTDL). The scheme consists of up to four stages:
1. document decompression
2. dictionary composition
3. text transform
4. compression
Stages 1-2 are optional. The first is for retrieving textual content from files compressed poorly with general-purpose methods. It is only executed for compressed input documents. It uses an embedded decompressor for files compressed using the Deflate algorithm,16 but an external tool, Precomp, is used to decode natively compressed PDF documents.17
The second stage is for constructing the dictionary of the most frequent words in the processed document. Doing so is a good idea when the compressed documents have no common set of words. If there are many documents in the same language, a common dictionary fares better; it usually does not pay off to store an individual dictionary with each file because they all contain similar lists of words. For this reason we have developed two variants of the scheme. The basic CTDL includes stage 2; therefore it can use a document-specific semidynamic dictionary in the third stage. The CTDL+ variant uses a static dictionary common for all files in the same language; therefore it can omit stage 2.
During stage 2, all the potential dictionary items that meet the word requirements are extracted from the document and then sorted according to their frequency to form a dictionary. The requirements define the minimum length and frequency of a word in the document (by default, 2 and 6 respectively) as well as its content. Only the following kinds of strings are accepted into the dictionary:
- a sequence of lowercase and uppercase letters ("a"-"z", "A"-"Z") and characters with code values from the range 128-255 (thus it supports any typical 8-bit text encoding and also UTF-8)
- URL address prefixes of the form "http://domain/," where domain is any combination of letters, digits, dots, and dashes
- e-mail addresses, that is, patterns of the form "login@domain," where login and domain are any combination of letters, digits, dots, and dashes
- runs of spaces
Stage 3 begins with parsing the text into tokens. The tokens are defined by their content; as four types of content are distinguished, there are also four classes of tokens: words, numbers, special tokens, and characters. Every token is then encoded in a way that depends on the class it belongs to.
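The dictionary composition of stage 2 can be illustrated with a minimal C++ sketch. It is not the CTDL source code: it handles only the first kind of dictionary string (letter sequences), leaves out URL prefixes, e-mail addresses, and runs of spaces, and the function names and the alphabetical tie-breaking rule are assumptions made for the illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// True for the bytes accepted inside a plain word:
// ASCII letters and any byte with a value in the range 128-255.
inline bool is_word_byte(unsigned char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c >= 128;
}

// Builds a frequency-sorted dictionary from `text`.
// min_len and min_freq default to the values quoted above (2 and 6).
std::vector<std::string> build_dictionary(const std::string& text,
                                          std::size_t min_len = 2,
                                          std::size_t min_freq = 6) {
    std::unordered_map<std::string, std::size_t> freq;
    std::size_t i = 0;
    while (i < text.size()) {
        if (!is_word_byte(static_cast<unsigned char>(text[i]))) { ++i; continue; }
        std::size_t start = i;
        while (i < text.size() && is_word_byte(static_cast<unsigned char>(text[i]))) ++i;
        if (i - start >= min_len) ++freq[text.substr(start, i - start)];
    }
    // Keep only the words that are frequent enough.
    std::vector<std::pair<std::string, std::size_t>> items;
    for (const auto& kv : freq)
        if (kv.second >= min_freq) items.push_back(kv);
    // Most frequent first; ties broken alphabetically (an assumption).
    std::sort(items.begin(), items.end(), [](const auto& a, const auto& b) {
        if (a.second != b.second) return a.second > b.second;
        return a.first < b.first;
    });
    std::vector<std::string> dict;
    for (const auto& it : items) dict.push_back(it.first);
    return dict;
}
```

Placing the most frequent words first matters because the position of a word in the dictionary is the index that the transform later encodes.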
The words are those character sequences that are listed in the dictionary.Every word is replaced with its dictionary index, which is then encoded using symbols that are rare or nonexistent in the input document.Indexes are encoded with code words that are between one and four bytes long, with lower indexes (denoting more frequent words) being assigned shorter code words.
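The paper does not spell out the byte layout of the code words, so the following C++ sketch uses a hypothetical one to show how lower indexes can receive shorter code words of one to four bytes. The byte ranges 0x80-0xBF and 0xC0-0xFF and both function names are assumptions, not the actual CTDL encoding.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical layout: every code word byte carries 6 payload bits.
// Bytes 0x80-0xBF mean "more bytes follow"; bytes 0xC0-0xFF end the code word.
std::vector<std::uint8_t> encode_index(std::uint32_t index) {
    if (index >= (1u << 24)) throw std::out_of_range("index needs more than 4 bytes");
    // Count base-64 digits so that small indexes get short code words.
    int digits = 1;
    for (std::uint32_t v = index >> 6; v != 0; v >>= 6) ++digits;
    std::vector<std::uint8_t> out;
    for (int d = digits - 1; d >= 0; --d) {
        std::uint8_t payload = static_cast<std::uint8_t>((index >> (6 * d)) & 0x3F);
        out.push_back(static_cast<std::uint8_t>((d == 0 ? 0xC0 : 0x80) | payload));
    }
    return out;
}

// Decodes one code word starting at `pos` and advances `pos` past it.
std::uint32_t decode_index(const std::vector<std::uint8_t>& in, std::size_t& pos) {
    std::uint32_t index = 0;
    while (pos < in.size()) {
        std::uint8_t b = in[pos++];
        if (b < 0x80) throw std::invalid_argument("not a code word byte");
        index = (index << 6) | (b & 0x3F);
        if (b >= 0xC0) return index;
    }
    throw std::invalid_argument("truncated code word");
}
```

Under this assumed layout the 64 most frequent words take one byte each, the next 4,032 take two bytes, and so on. Input bytes that collide with the code alphabet would have to be escaped.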
The numbers are sequences of decimal digits, which are encoded with a dense binary code, and, similarly to letters, placed in a separate location in the output file.
The special tokens can be decimal fractions, IP numerical addresses, dates, times, and numerical ranges. As they have a strict format and differ only in numerical values, they are encoded as sequences of numbers.18
Finally, the characters are the tokens that do not belong to any of the aforementioned groups. They are simply copied to the output file, with the exception of those rare characters that were used to construct code words; they are copied as well, but have to be preceded with a special escape symbol.
The specialized transform variants (see the next section) distinguish three additional classes from the character class: letters (words not in the dictionary), single white spaces, and multiple white spaces.
Stage 4 could use any general-purpose compression method to encode the output of stage 3. For this role, we have investigated several open-licensed, general-purpose compression algorithms that differ in speed and efficiency. As we believe that document access speed is important to textual digital libraries, we have decided to focus on LZ-type algorithms because they offer the best decompression times. CTDL has two embedded back-end compressors: the standard Deflate and LZMA, well-known for its ability to attain high compression ratios.19

Adapting the transform for individual text document formats
The text document formats have individual characteristics; therefore the compression ratio can be improved by adapting the transform for a particular format. As we noted in the introduction, we propose a set of subschemes (modifications of the original processing steps or additional processing steps) that can help compression, provided the issue that a given subscheme addresses is valid for the document format being compressed. There are two groups of subschemes. The first consists of solutions that can be applied to more than one document format. It includes
- changing the minimum word frequency threshold (the "MinFr" column in table 1) that a word must pass to be included in the semidynamic dictionary (notice that no word can be added to a static dictionary);
- using the spaceless word model ("WdSpc" column in table 1), in which a single space between two words is not encoded at all; instead, a flag is used to mark two neighboring words that are not separated by a space;
- run-length encoding of multiple spaces ("SpRuns" column in table 1);
- letter containers ("LetCnt" column in table 1), that is, moving sequences of letters (belonging to words that are not included in the dictionary) to a separate location in the output file (and leaving a flag at their original position).
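Of the subschemes listed above, run-length encoding of multiple spaces is the easiest to show in isolation. The C++ sketch below is a generic run-length coder, not the CTDL implementation; the flag byte value, the one-byte run length, and the function names are assumptions.

```cpp
#include <cstddef>
#include <string>

// Hypothetical flag byte; assumed not to occur in the input text.
const char kSpaceRunFlag = '\x01';

// Replaces every run of 2..255 spaces with the flag byte followed by a
// length byte. Single spaces are copied unchanged; longer runs are split.
std::string encode_space_runs(const std::string& text) {
    std::string out;
    std::size_t i = 0;
    while (i < text.size()) {
        if (text[i] != ' ') { out += text[i++]; continue; }
        std::size_t run = 0;
        while (i + run < text.size() && text[i + run] == ' ' && run < 255) ++run;
        if (run == 1) {
            out += ' ';
        } else {
            out += kSpaceRunFlag;
            out += static_cast<char>(static_cast<unsigned char>(run));
        }
        i += run;
    }
    return out;
}

// Inverse of encode_space_runs: expands each flag/length pair into spaces.
std::string decode_space_runs(const std::string& coded) {
    std::string out;
    for (std::size_t i = 0; i < coded.size(); ++i) {
        if (coded[i] == kSpaceRunFlag && i + 1 < coded.size()) {
            unsigned char run = static_cast<unsigned char>(coded[++i]);
            out.append(static_cast<std::size_t>(run), ' ');
        } else {
            out += coded[i];
        }
    }
    return out;
}
```

Such a step pays off mainly for formats that indent heavily with spaces, which is presumably why it is switched on per format rather than globally.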
Table 1 shows the assignment of the mentioned subschemes to document formats, with "+" denoting that a given subscheme should be applied when processing a given document format. Notice that we use different subschemes for the same format depending on whether a semidynamic (CTDL) or static (CTDL+) dictionary is used.
The remaining subschemes are applied for only one document format. They attain an improvement in compression performance by changing the definition of acceptable dictionary words, and, in one case (PS), by changing the definition of number strings.
The encoder for the simplest of the examined formats, plain text files, performs no additional format-specific processing.
The first such modification is in the TEX encoder. The difference is that words beginning with "\" (TEX instructions) are now accepted in the dictionary.
The modification for PDF documents is similar. In this case, bracketed words (PDF entities), for example "(abc)", are acceptable as dictionary entries. Notice that PDF files are internally compressed by default; the transform can be applied after decompressing them into textual format. The Precomp tool is used for this purpose.
The subscheme for PS files features two modifications: its dictionary accepts words beginning with "/" and "\" or ending with "(", and its number tokens can contain not only decimal but also hexadecimal digits (though a single number must have at least one decimal digit). The hexadecimal number must be at least 6 digits long, and is encoded as a flag, a byte containing its length (numbers with more than 261 digits are split into parts), and a sequence of bytes, each containing two digits from the number (if the number of digits is odd, the last byte contains only one digit).
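The hexadecimal number encoding can be sketched in C++ as follows. The flag value and the function names are hypothetical, and the sketch assumes that the length byte stores the digit count minus 6, which would explain why 261 is the longest run that fits into one byte. The sketch also ignores letter case, which a lossless implementation would have to preserve separately.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical flag value marking a packed hexadecimal number.
const std::uint8_t kHexFlag = 0x02;

// Value (0-15) of one hexadecimal digit character.
inline std::uint8_t hex_value(char c) {
    if (c >= '0' && c <= '9') return static_cast<std::uint8_t>(c - '0');
    return static_cast<std::uint8_t>(
        std::tolower(static_cast<unsigned char>(c)) - 'a' + 10);
}

// Packs a run of 6..261 hex digits as: flag, length byte, two digits per byte.
// Assumption: the length byte stores (digit count - 6).
std::vector<std::uint8_t> pack_hex(const std::string& digits) {
    assert(digits.size() >= 6 && digits.size() <= 261);
    std::vector<std::uint8_t> out;
    out.push_back(kHexFlag);
    out.push_back(static_cast<std::uint8_t>(digits.size() - 6));
    for (std::size_t i = 0; i < digits.size(); i += 2) {
        // First digit goes into the high nibble.
        std::uint8_t byte = static_cast<std::uint8_t>(hex_value(digits[i]) << 4);
        // With an odd digit count the last byte carries a single digit.
        if (i + 1 < digits.size()) byte |= hex_value(digits[i + 1]);
        out.push_back(byte);
    }
    return out;
}
```

Packing halves the size of long hexadecimal runs before the back-end compressor sees them, which is where embedded image data in PS and RTF files benefits.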
For RTF documents, the dictionary accepts the "\"-preceded words, as it does for TEX files. Moreover, the hexadecimal numbers are encoded in the same way as in the PS subscheme so that RTF documents containing images can be significantly reduced in size.
Specialization for XML is roughly the transform described in our earlier article, "Revisiting Dictionary-Based Compression."20 It allows for XML start tags and entities to be added to the dictionary, and it replaces every end tag respecting the XML well-formedness rule (i.e., closing the element opened most recently) with a single flag. It also uses a single flag to denote XML attribute value begin and end marks.
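The end-tag replacement can be illustrated with a stack of open element names. The following C++ sketch is a simplified stand-in for the transform, not the authors' code: the flag byte and function name are hypothetical, and it does not handle a ">" character inside attribute values, comments, or CDATA sections.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical flag byte standing for "close the most recently opened element".
const char kEndTagFlag = '\x03';

std::string replace_end_tags(const std::string& xml) {
    std::string out;
    std::vector<std::string> open;  // names of the currently open elements
    std::size_t i = 0;
    while (i < xml.size()) {
        if (xml[i] != '<') { out += xml[i++]; continue; }
        std::size_t close = xml.find('>', i);
        if (close == std::string::npos) {   // unterminated tag: copy the rest
            out.append(xml, i, std::string::npos);
            break;
        }
        std::string tag = xml.substr(i, close - i + 1);  // includes '<' and '>'
        if (tag.size() > 2 && tag[1] == '/') {
            std::string name = tag.substr(2, tag.size() - 3);
            if (!open.empty() && open.back() == name) {
                open.pop_back();
                out += kEndTagFlag;   // well-formed end tag becomes one flag byte
            } else {
                out += tag;           // unexpected end tag is copied verbatim
            }
        } else {
            bool special = tag[1] == '?' || tag[1] == '!';
            bool self_closing = tag[tag.size() - 2] == '/';
            if (!special && !self_closing) {
                // The element name ends at the first whitespace or at '>'.
                std::size_t end = tag.find_first_of(" \t\r\n>", 1);
                open.push_back(tag.substr(1, end - 1));
            }
            out += tag;
        }
        i = close + 1;
    }
    return out;
}
```

A decoder can restore the original end tags by maintaining the same stack while reading the start tags, so no element name needs to be stored twice.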
HTML documents are handled similarly. The only difference is that the tags that, according to the HTML 4.01 specification, are not expected to be followed by an end tag (BASE, LINK, XBASEHREF, BR, META, HR, IMG, AREA, INPUT, EMBED, PARAM and COL) are ignored by the mechanism replacing closing tags (so that it can guess the correct closing tag even after the singular tags were encountered).21

Using the scheme in a digital library project
Many textual digital libraries seriously lack text compression capabilities, and popular digital library systems, such as Greenstone, have no embedded efficient text compression.22 Therefore we have decided to develop CTDL as an open-source software library. The library is free to use and can be downloaded from www.ii.uni.wroc.pl/~inikep/research/CTDL/CTDL09.zip.
The library does not require any additional nonstandard libraries. It has both the text transform and back-end compressors embedded. However, compressing PDF documents requires them to be decompressed first with the free Precomp tool.
The compression routines are wrapped in code that selects the best algorithm depending on the chosen compression mode and the input document format. The interface of the library consists of only two functions, CTDL_encode and CTDL_decode, for compressing and decompressing documents, respectively.
CTDL_encode takes the following parameters:
- char* filename: name of the input (uncompressed) document
- char* filename_out: name of the output (compressed) document
- EFileType ftype: format of the input document, defined as: enum EFileType { HTML, PDF, PS, RTF, TEX, TXT, XML };
- EDictionaryType dtype: dictionary type, defined as: enum EDictionaryType { Static, SemiDynamic };
CTDL_decode takes the following parameters:
- char* filename: name of the input (compressed) document
- char* filename_out: name of the output (decompressed) document
The library was written in the C++ programming language, but a compiled static library is also distributed; thus it can be used in any language that can link such libraries. Currently, the library is compatible with two platforms: Microsoft Windows and Linux.
To use static dictionaries, the respective dictionary file must be available. The library is supplied with an English dictionary trained on a 3 GB text corpus from Project Gutenberg.23 Seven other dictionaries (German, Spanish, Finnish, French, Italian, Polish, and Russian) can be freely downloaded from www.ii.uni.wroc.pl/~inikep/research/dicts. There is also a tool that helps create a new dictionary from any given corpus of documents, available from Skibiński upon request via e-mail (inikep@ii.uni.wroc.pl).
The library can be used either to reduce the storage requirements or to reduce the time of delivering a requested document to the library user. In the first case, the decompression must be done on the server side. In the second case, it must be done on the client side, which is possible because stand-alone decompressors are available for Microsoft Windows and Linux. Obviously, a library can support both options by providing the user with a choice whether a document should be delivered compressed or not. If documents are to be decompressed client-side, the basic CTDL, using a semidynamic dictionary, seems handier, since it does not require the user to obtain the static dictionary that was used to compress the downloaded document. Still, the size of such a dictionary is usually small, so it does not disqualify CTDL+ from this kind of use.

Experimental results
We tested CTDL experimentally on a benchmark set of text documents. The purpose of the tests was to compare the storage requirements of different document formats in compressed and uncompressed form.
In selecting the test files we wanted to achieve the following goals:
- test all the formats listed in table 1 (therefore we decided to choose documents that produced no errors during document format conversion)
- obtain verifiable results (therefore we decided to use documents that can be easily obtained from the Internet)
- measure the actual compression improvement from applying the proposed scheme (apart from the RTF format, the scheme is neutral to the images embedded in documents; therefore we decided to use documents that have no embedded images)
For these reasons, we used the following procedure for selecting documents for the test set. First, we searched the Project Gutenberg library for TEX documents, as this format can most reliably be transformed into the other formats. From the fifty-one retrieved documents, we removed all those containing images as well as those that the htlatex tool failed to convert to HTML. In the eleven remaining documents, there were four Jane Austen books; this overrepresentation was handled by removing three of them. The resulting eight documents are given in table 2.
From the TEX files we generated HTML, PDF, and PS documents. Then we used Word 2007 to transform HTML documents into RTF, DOC, and XML (thus this is the Microsoft Word XML format, not the Project Gutenberg XML format). The TXT files were downloaded from Project Gutenberg.
The tests were conducted on a low-end AMD Sempron 3000+ 1.80 GHz system with 512 MB RAM and a Seagate 80 GB ATA drive, running Windows XP SP2.
For comparison purposes, we used three general-purpose compression programs:
- gzip implementing Deflate
- bzip2 implementing a BWT-based compression algorithm
Bitrates are given in output bits per character of an uncompressed document in a given format.
Looking at the results obtained for TXT documents (table 3), we can see an average improvement of 17 percent for CTDL and 27 percent for CTDL+ compared to the baseline Deflate implementation. Compared to the baseline LZMA implementation, the improvement is 10 percent for CTDL and 20 percent for CTDL+. Also, CTDL+ combined with LZMA compresses TXT documents 31 percent better than gzip, 11 percent better than bzip2, and slightly better than the state-of-the-art PPMVC implementation.
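As a reading aid for the figures in this section, the C++ sketch below computes a bitrate in bits per character and a relative improvement between two bitrates. The function names are hypothetical, and the definition of "X percent better" as the relative reduction of the bitrate is an assumption; the paper does not state the exact formula.

```cpp
#include <cstddef>

// Bits of compressed output per character (byte) of the uncompressed document.
double bitrate(std::size_t compressed_bytes, std::size_t original_bytes) {
    return 8.0 * static_cast<double>(compressed_bytes) /
           static_cast<double>(original_bytes);
}

// Assumed definition of "X percent better": relative reduction of the bitrate.
double improvement_percent(double baseline_bitrate, double new_bitrate) {
    return 100.0 * (baseline_bitrate - new_bitrate) / baseline_bitrate;
}
```

For instance, a document of 4,000 characters compressed to 1,000 bytes has a bitrate of 2 bits per character, and lowering a bitrate from 2.0 to 1.5 counts as a 25 percent improvement under this definition.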
In the case of TEX documents (table 4), the gzip results were improved, on average, by 16 percent using CTDL and by 26 percent using CTDL+; the numbers for LZMA are 10 percent for CTDL and 19 percent for CTDL+. In a cross-method comparison, CTDL+ with LZMA beats gzip by 31 percent, bzip2 by 10 percent, and attains results very close to PPMVC.
On average, Deflate-based CTDL compressed XML documents 20 percent better than the baseline algorithm (table 5), and with CTDL+ the improvement rises to 26 percent. CTDL improves LZMA compression by 11 percent, and CTDL+ improves it by 18 percent. CTDL+ with LZMA beats gzip by 33 percent, bzip2 by 8 percent, and loses only 4 percent to PPMVC.
Similar results were obtained for HTML documents (table 6): they were compressed with CTDL and Deflate 18 percent better than with the Deflate algorithm alone, and 27 percent better with CTDL+. LZMA compression efficiency is improved by 11 percent with CTDL and 20 percent with CTDL+. CTDL+ with LZMA beats gzip by 33 percent, bzip2 by 9 percent, and loses only 2 percent to PPMVC.
For RTF documents (table 7), the gzip results were improved, on average, by 18 percent using CTDL and by 25 percent using CTDL+; the numbers for LZMA are 9 percent for CTDL and 17 percent for CTDL+. In a cross-method comparison, CTDL+ with LZMA beats gzip by 34 percent, bzip2 by 7 percent, and loses 5 percent to PPMVC.
CTDL has no mode designed especially for DOC documents, so the basic TXT mode was used for them (table 8), as it was found experimentally to be the best choice available. The results show it managed to improve Deflate-based compression by 9 percent using CTDL and by 21 percent using CTDL+, whereas LZMA-based compression was improved by 4 percent for CTDL and 14 percent for CTDL+. Combined with LZMA, CTDL+ compresses DOC documents 30 percent better than gzip, 13 percent better than bzip2, and 1 percent better than PPMVC.
In the case of PS documents (table 9), the gzip results were improved, on average, by 5 percent using CTDL and by 8 percent using CTDL+; the numbers for LZMA are 3 percent for CTDL and 5 percent for CTDL+. In a cross-method comparison, CTDL+ with LZMA beats gzip by 8 percent, losing 5 percent to bzip2 and 7 percent to PPMVC.
Finally, CTDL improved Deflate-based compression of PDF documents (table 10) by 9 percent using CTDL and 10 percent using CTDL+ (compared to gzip; the numbers are
The results presented in tables 3-10 show that CTDL manages to improve compression efficiency of the general-purpose algorithms it is based on. The scale of improvement varies between document types, but for most of them it is more than 20 percent for CTDL+ and 10 percent for CTDL. The smallest improvement is achieved in the case of PS (about 5 percent). Figure 1 shows the same results from another perspective: the bars show how much better compression ratios were obtained for the same documents using different compression schemes compared to gzip with default options (0 percent means no improvement).
Compared to gzip, CTDL offers a significantly better compression ratio at the expense of longer processing time. The relative difference is especially high in the case of decompression. However, in absolute terms, even in the worst case of PDF, the average additional delay of CTDL+ relative to gzip is below 180 ms for compression and 90 ms for decompression per file. Taking into consideration the low-end specification of the test computer, these results
Although text documents are often compressed with general-purpose methods such as Deflate, much better compression can be obtained with a scheme specialized for text, and even better if the scheme is additionally specialized for individual document formats. We have developed such a scheme (CTDL), beginning with a text transform designed earlier for XML documents and
The improvement in compression efficiency, which can be observed in the experimental results, amounts to a significant reduction of data storage requirements, which gives good reason to use the library in both new and existing digital library projects instead of general-purpose compression programs. To facilitate this process, we implemented the scheme as an open-source software library under the same name, freely available at http://www.ii.uni.wroc.pl/~inikep/research/CTDL/CTDL09.zip.
Although the scheme and the library are now complete, we plan future extensions aiming both to increase the level of specialization for currently handled document formats and to extend the list of handled document formats.

Figure 1. Compression improvement relative to gzip

Table 2. Test set documents specification

- the bitrate attained on each test file by the Deflate-based gzip in default mode, the proposed compression scheme in the semidynamic and static variants with Deflate as the back-end compression algorithm, 7-zip in LZMA mode, the proposed compression scheme in the semidynamic and static
- the total compression and decompression times (in seconds) for the whole test corpus, measured on the test platform (they are total elapsed times including program initialization and disk operations).

Table 3. Compression efficiency and times for the TXT documents

Table 4. Compression efficiency and times for the TEX documents

A smaller bitrate (of, e.g., RTF documents compared to the plain text) does not mean the file is smaller, only that the compression was better. Uncompressed files have a bitrate of 8 bits per character.

Table 5. Compression efficiency and times for the XML documents

Table 6. Compression efficiency and times for the HTML documents

Table 7. Compression efficiency and times for the RTF documents

Table 8. Compression efficiency and times for the DOC documents

Table 9. Compression efficiency and times for the PS documents

Table 10. Compression efficiency and times for the (uncompressed) PDF documents