From Paper To Collection: Greenstone Digital Library

GREENSTONE DIGITAL LIBRARY
FROM PAPER TO COLLECTION
Dr Michel Loots, Dan Camarzan and Ian H. Witten
Human Info NGO, Belgium

Simple Words, Romania
University of Waikato, New Zealand
Greenstone is a suite of software for building and distributing digital library

collections. It provides a new way of organizing information and publishing it
on the Internet or on CD-ROM. Greenstone is produced by the New Zealand
Digital Library Project at the University of Waikato, and developed and
distributed in cooperation with UNESCO and the Human Info NGO. It is
open-source software, available from http://greenstone.org under the terms
of the G nu General Public License.
We want to ensure that this software works well for you. Please report any
problems to greenstone@cs.waikato.ac.nz
Greenstone gsdl-2.50 March 2004

About this manual
This document explains how to create CD-ROM collections from paper documents. It
describes in full detail the procedures and economics involved in the scanning and
optical character recognition (OCR) processes, so that you end up with text in the right
format to apply the Greenstone software. It also describes how to create and edit the
material associated with a collection.
We have tried to be as plain as possible in our explanation. Reference to any trade

mark or company product is purely for illustrative purposes, and does not imply that we
endorse or favor this product over any other.
Companion documents
The complete set of Greenstone documents include five volumes:
• Greenstone Digital Library Installer's Guide

• Greenstone Digital Library User's Guide
• Greenstone Digital Library Developer's Guide
• Greenstone Digital Library: From Paper to Collection (this document)
• Greenstone Digital Library: Using the Organizer
Copyright
Copyright © 2002 2003 2004 2005 2006 2007 by the New Zealand Digital Library
Project at the University of Waikato, New Zealand.
Permission is granted to copy, distribute and/or modify this document under the terms
of the GNU Free Documentation License, Version 1.2 or any later version published by
the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no
Back-Cover Texts. A copy of the license is included in the section entitled “GNU Free
Documentation License.”
Acknowledgements
The scanning operation and other know-how relating to the creation of collaborative
non-profit collections have been developed by Dr Michel Loots, MD, of Human Info
NGO and HumanityCD, Dan Camarzan of Simple Words, and their team of
collaborators in Brasov, Romania.
The Greenstone software is a collaborative effort between many people. Rodger McNab
and Stefan Boddie are the principal architects and implementors. Contributions have
been made by David Bainbridge, George Buchanan, Hong Chen, Michael Dewsnip,
Katherine Don, Elke Duncker, Carl Gutwin, Geoff Holmes, Dana McKay, John
McPherson, Craig Nevill-Manning, Dynal Patel, Gordon Paynter, Bernhard Pfahringer,
Todd Reed, Bill Rogers, John Thompson, and Stuart Yeates. Other members of the
New Zealand Digital Library project provided advice and inspiration in the design of the
system: Mark Apperley, Sally Jo Cunningham, Matt Jones, Steve Jones, Te Taka
Keegan, Michel Loots, Malika Mahoui, Gary Marsden, Dave Nichols and Lloyd Smith.
We would also like to acknowledge all those who have contributed to the GNU-licensed
packages included in this distribution: MG, GDBM, PDFTOHTML, PERL, WGET,
WVWARE and XLHTML.
Contents
1 Introduction 1
2 Scanners and scanning 3

2.1 Scanners 3
Low-cost flat-bed scanner
Low-end scanner with sheet feeder
Color scanners
Professional duplex scanners
Scanning programs
2.2 Preparing the documents 4
2.3 The scanning process 5

Quality control
Filename conventions
2.4 Productivity and resources 6

Scanning costs
3 OCR: Optical Character Recognition 8

3.1 The OCR process 8
Quality control
Tables
Images
Specialized material
3.2 Productivity and resources 10

Intensive OCR
Achievable productivity
3.3 Alternatives to OCR 12

Manual retyping
Image files
3.4 Combining scanning and OCR 13
4 Three examples: 1000 to 100,000 pages 14

4.1 Typical small collection: 500 to 1000 pages 14
4.2 All publications from an organization: 5000 pages 14

4.3 A small library: 100,000 pages 15
5 Creating an electronic collection 17

5.1 Methods of collection building 17
5.2 Getting started in seven steps and 15 minutes 18
GNU Free Documentation License 20

1 Introduction
1
Introduction
One goal of the Greenstone Digital Library software is to empower organizations such
as universities, United Nations agencies, non-governmental organizations, non-profit
organizations and governments to create varied collections of information that can be
delivered online or on CD-ROM.
Typical steps that have to be implemented are:
1. Selecting the documents to be included

2. Securing copyrights permissions to use these documents in the digital
library
3. Scanning and OCR of the hard-copy documents which are not available
in to digital form to have a perfect digital format
4. Converting all documents to a format (integrating text and images)
which can be imported into Greenstone (preferably HTML or Microsoft
Word, but others are also covered at varying levels of precision by a
“plugin” (see the Greenstone User’s Manual)
5. Tagging the chapters, paragraphs and images of the digital documents
6. Organising the collection into a optimally structured digital library
7. Building the digital library using the Greenstone software
8. Printing and distributing the collection on CD-ROM and/or distributing it
over the Internet
In order to create a digital collection, the publications must be available in digital format.
If books, newsletters or other documents are only available on paper, they will need to
be scanned and processed into machine-readable form (step iii). Usually this is done
using optical character recognition (OCR), but sometimes by manual retyping. This
process is covered in Chapters 2-4 of this manual.
Step v. enables the different parts of a document to be independently selected and

displayed by readers in the final library, while step vi. involves assigning attributes to the
documents such as subject categories, keywords and bibliographic data for ordering
and searching the library. These steps are covered in Chapter 5 of this manual.
This manual introduces many issues that affect the editorial process of creating a
collection from paper. Before reading on, you should consider these questions:
• What is the goal of your collection?

• What is your target group?
• How big is it—local, regional, or global?
• How many documents are you making available?
• How many pages?
• How much graphics content?
• Does the material split into parts that will be consulted by a limited
audience and parts that need to be disseminated widely?
• Are the documents already available electronically?
• If so, in which formats? (Note incidentally that PDF files are not
automatically equivalent to digital full-text form, as they often contain
2 Introduction
only page images.)

• What is the copyright status of the documents?
• Who owns the copyright?
• Are there other organizations with the same target audience?
• Are you willing to collaborate with other groups?
• What budget is available for the whole project?
• What human resources are available (in person-months) for
co-ordination, editing, scanning and programming?
• How many computers are available for this project?
• How many CD-ROMs do you want to distribute?
• Will they be free, or for sale?
3 Scanners and scanning
2
Scanners and scanning
The first step in converting paper documents into a digital library collection is to obtain
images of all pages of all publications in digital format. The next stage is optical
character recognition (OCR), and clean, high-quality images are essential for successful
OCR. The digitization process requires a scanner capable of working at a resolution of
300 dpi (dots per inch). Most scanning can be done in black-and-white, but if color
illustrations are included they must be scanned with a color scanner. In most cases the
covers of the book contain colors and will have to be scanned as a color photographic
image.
2.1 Scanners
Scanners are available in all price ranges, and all shapes and sizes. They range from
$100 for flat-bed scanners to upwards of $50,000 for large industrial scanners from
manufacturers such as Bell & Howell.[1]There are many websites that offer a wide
range of scanners for sale. To locate them, just search for “scanners” in search engines
like Google, Altavista, or Yahoo.
The output format of a scanned page is a computer file that is usually stored in TIFF or
Bitmap format. Compressed TIFF IV is the best format to use. An average page
scanned and converted to this format occupies only 50 Kb, compared to perhaps 2 Mb
for the equivalent page in uncompressed Bitmap form.
Low-cost flat-bed scanner
Low-cost flat-bed units are the cheapest and most widely available type of scanner.
There are many brands: HP, Agfa, Acer, etc. Prices range from $100 to $300. Both
black-and-white and color images can be scanned. The low price allows each computer
to have its own scanner.
Disadvantages of these scanners include the medium quality of the result, the slow rate
of scanning, unreliability in warm environments, and relatively frequent breakdown.
Pages must be scanned manually, one by one. Each page must be positioned carefully
on the scanning plate to ensure that it is aligned correctly. Productivity of these
scanners is low. Despite manufacturers' claims that each page can be scanned in less
than a minute, the fact is that rates exceeding twelve pages per hour are rarely
achieved. The scanning process monopolizes the computer on which the work is being
performed.
Consequently these scanners are useful only for small jobs with limited numbers of
pages—no more than 200 to 400 pages a month on a regular basis, or one-time jobs of
[1] All sums of money mentioned in this document are in US dollars, and were current
in 2001.
up to 1000 or 2000 pages.
Low-end scanner with sheet feeder
Low-end scanners with sheet feeders typically cost between $500 and $1200. Ten to
fifty pages can be inserted, scanned and processed at once: thus the operator does not
have to attend constantly to the machine. This increases capacity up to 150 to 200
pages per day. These scanners are more robust, and have a larger lifespan before
repair—usually in the range 30,000 to 50,000 pages.
A disadvantage is that only one side of the page is scanned at a time—the stack of
pages must be reversed and rescanned in order to obtain an image of both sides. This
often creates problems because sheet feeders are never without problems and
sometimes pages get blocked.
These scanners are useful for up to 1500 to 3000 pages a month.
Color scanners
Any scanning operation invariably involves some color images, so a color scanner will
always be required. Generally speaking, less than 5% of any publication contains color
images, plus the cover. Thus a low cost flat-bed scanner as described above suffices. It
is advisable to select one capable of scanning up to 600 dpi resolution.
Professional duplex scanners
Professional scanners are reliable, heavy-duty machines capable of processing a large

volume of pages—typically from 2000 pages to 10,000 pages per day. They have an
automatic sheet-feeder tray system that processes batches of about 50 to 200 pages.
The best and fastest are duplex machines that scan both sides of the page at once.
Professional duplex scanners require a powerful computer with a hard disk of at least
10 to 20 Gb. Prices range from $5000 to $50,000. For example, the Canon DR-6020
duplex scanner costs $5000 and works with double-sided documents. It has a capacity
of about 2000 pages per day and a lifespan of 600,000 to 800,000 pages. Bell & Howell
and Fujitsu scanners range from $10,000 to $50,000 and have a lifespan of many
millions of pages.
Micro-fiche scanners cost from $15,000 for a semi-manual unit to $80,000 for one that
operates fully automatically.
Scanning programs
Every scanner comes with its own software, which means that the program must be
installed on the computer that manages the scanner. Some have a computer card that
needs to be installed in your computer to speed up the scanning operation.
2.2 Preparing the documents

Before being scanned, documents must be properly prepared. Dusty documents must
be cleaned, humid documents dried, clips removed, pages unfolded.
The spine of each book should be removed by cutting it off, straight and precisely.
Books provided by libraries must often be rebound, and if so you should be particularly
careful when removing spines in order to facilitate smooth rebinding.
If there are just a few documents, cutting can be done manually with a ruler and cutters.
Be careful with your hands! For more documents, special manual cutting machines are
available.
For high volumes—more than 20 documents—we recommend asking a printer or

copy-shop if you can use their professional cutting machine. Do not forget to remove
metal clips which could damage the cutting blades.
2.3 The scanning process
Using software provided with the scanner, a digital image of each paper page is
scanned and transformed into a Bitmap or TIFF image. These images should be stored
on hard disk with standard filenames. The OCR process starts once some or all of a
batch of documents have been scanned. It can be undertaken by the person who
operates the scanner, or by someone else.
Typically a scanning resolution of 300 dpi is needed, although sometimes 200 dpi is
acceptable.
Quality control
The final goal of scanning is either to OCR the pages to obtain perfect word processor
or HTML versions of the publications, or to produce enhanced image files such as PDF
image files. In either case the quality of the image is very important. If quality is
sub-standard, image files will not look good and will consume more memory. Image
quality seriously affects the OCR process: with sub-standard quality, productivity
deteriorates by up to 40%. OCR typically represents more than 90% of the total cost, so
scanning quality can have a very substantial effect on the final cost.
The quality of the TIFF file can be enhanced by adjusting the scanning process to each
type of paper, using settings provided by the scanner software. Relatively transparent
kind of paper will require a lighter setting; the contrast must be adjusted depending on
the quality of printing, and so on.
First divide the material into batches with similar paper and print qualities. Perform OCR
tests on a sample from the first batch to determine the optimal settings. Then scan all
material in this batch before proceeding to the next one.
Filename conventions
Give each book or document a job number or unique code, which will become the name
of the folder that contains all TIFF images in the document. Depending on the computer
system (DOS, Windows, UNIX, LINUX, etc) from 8 characters to 128 characters can be
used in a filename. We recommend restricting this unique document identifier to 8 to 16

characters. The first five characters might identify the document, the following letter
might contain a language code, and the remaining characters might identify the
particular page. For example, the identifier u7548e12.tif might identify the TIFF image of
page 12 of a book written in English with code u7548e.
Allocate one directory on the hard disk for scanning jobs, say scanjobs. Then make a
subdirectory for each job. Within this make a subdirectory for each publication—say
u7548e for the above document. Store all the TIFF images of the publication, including
color images, in this folder.
2.4 Productivity and resources
You should not underestimate the magnitude of the scanning operation—and

particularly the OCR process that follows. It is best to consider scanning and OCR as
completely separate activities. The optimal choice from an economic and practical point
of view should be madeindividually for each one.
Some points to consider are the investment in scanners and computers that is
necessary; the availability of appropriate space and human resources; training the
workforce; salary costs; the initial and total number of pages to be scanned; deadlines;
and whether documents can be outsourced to third parties.
Scanning costs
An important decision is whether to invest in scanning equipment and perform all

scanning oneself, or outsource it to a scanning company. The main considerations are:
• pressure of time for the scanning job;

• total number of pages;
• salary costs of those who perform the scanning.
The people who perform the scanning must be highly motivated, technically skilled, and
quality-oriented.
The typical cost of scanning by a professional company is $0.06 per page. To this must
be added the cost of shipment, which can be up to $0.03 per page for transport from
developing countries to developed countries, and $0.015 per page for transport within
countries.
Table 1 estimates the cost of doing it yourself, using various scanner types. Note that all
figures are approximate. They are provided as rough guidelines based on the authors'
experience. The first three columns concern labor costs. The first is the capacity in
pages/month, assuming full-time work. The resources required in person-hours per
page is obtained by dividing the number of working hours per month by the
pages/month capacity in the second column. It is shown in the second column, which
assumes 180 working hours per month.
Table 1 Scanning cost

Capacity Hours/page Cost/page Scanner Scanner Outsourced
(pages/month)(180-hour (assuming acquisit- lifespan pages for
month) $4/hour) ion (pages) scanner cost
(at $.06 each)
Flat bed scanner 2,500 0.072 $0.288 $300 7,000 5,000
Scanner with 8,000 0.0225 $0.09 $800 30,000 13,000
sheet-feeder
Professional: 40,000 0.0045 $0.018 $6,000 600,000 100,000
low-end duplex
Professional: 150,000 0.0012 $0.0048 $50,000 8,000,000 833,000
high-end duplex
To determine the price per page, multiply the total hourly salary costs in your situation
by the second column of Table 1.As an example, the third column gives the price of
in-house scanning at a salaryrate of $4/hour—not including investment costs.
These calculations assume that the scanner is used for a sufficient volume to justify the
investment. The final three columns of Table 1 give more information about the cost of
the scanner itself. The first of these shows the acquisition cost of the scanner, and the
next gives its expected lifetime. The last shows the number of pages that could be
scannedcommercially, at a cost of $0.06/page, for the price of the scanner alone.
Of course, many other factors affect the choice of scanner: availability of funds, need to
minimize dependence on others, desire to build local capacity, obligations to libraries to
scan books locally and not transport them, and so on.
The above figures give some idea of the volume of pages needed to justify different
levels of investment. Rarely will an institute or organization need to scan 800,000
pages. At such levels more complex issues arise—such as maintenance and the
possibility of recouping costs by offering scanning services to others—that we will not
discuss here.
It is tempting to regard the development of scanning capacity asa commercial venture,

particularly in developing countries. But one shouldalways bear in mind that scanning is
not a repetitive business. Oncedocuments have been scanned, clients never place new
orders for the same documents—no matter how good the relationship with the scanning
company. From a commercial point of view, intensive marketing efforts are needed. We
do not advise NGOs or other non-profit organizations to venture into this realm without
thorough initial trials and a carefully-considered business plan.
In conclusion, if 10,000 to 50,000 pages are to be scanned, one should consider

outsourcing the job. A low-end professional scanner costing about $6000 can only be
justified if more than 100,000 pages have to be scanned. You might consider banding
together with a few other institutions—perhaps NGOs or libraries—to purchase such a
scanner.
8 OCR: Optical Character Recognition
3
OCR: Optical Character
Recognition
An optical character recognition or OCR system transforms a scanned image into text.
The input is a digitized image in TIFF or Bitmap format—preferably a clean, high-quality
image. The output is a word-processor or web file, typically in RTF, Word, or HTML
format.
The following steps are involved in converting paper documents to computer form:
• scanning;
• page layout analysis;
• recognition;
• scanning images and tables.
Following these, you must perform quality checks on the resulting files, and save them
in the appropriate format.
On the market are many good OCR programs, with prices ranging from $100 to
$400.[2]For example, among many others are:
• Read-Iris(http://www.readiris.com/)
• Omnipage(http://www.omnipage.com/)
• Fine-Reader(http://www.finereader.com/)
All information, including lists of local distributors, can be found on the manufacturers'
websites. Among these, in the authors' experience the most user-friendly are
Fine-Reader and Omnipage. Fine-Reader is cheapest, costing about $100. It offers a
great deal of flexibility, and the widest range of different language options.
A choice must be made between undertaking the scanning and OCR in-house or
outsourcing it to a commercial organization. To do it in-house requires a scanner, OCR
software program, OCR skill development, and a quality-conscious, highly motivated
workforce.
3.1 The OCR process
The OCR process differs from one OCR program to another, and each one requires a
considerable amount of learning. The program's manual will explain this process in
detail. Four points deserve particular attention: quality control, tables, images, and
specialized material such as formulas, foreign characters etc.
[2] Recall that all sums of money are expressed in 2001 US dollars.
Quality control
We cannot place enough emphasis on quality control. Quality checks are best
performed by native speakers, or people with an excellent command of the language to
check. The best people are at the university or high-school level. We should also note
that young people tend to sustain higher concentration than older people for this kind of
work.
Normally there are four quality checks.
The first is performed at the same time as OCR. Every OCR program has a built-in
spell-checker that highlights every suspect letter. At the same time the image of the
word appears too, making it easy to check and correct the error.
The second is a general check of the text once the OCR process is finished. Common
errors are to miss a page, a paragraph, chapter titles, and so on. A general overview is
necessary to check if pages are missing. It is essential to check titles, chapter headings,
paragraphs, and tables.
The third is a spelling check using Microsoft Word. This program has a dictionary that is
often more sophisticated than the one embedded in OCR programs. By importing the
book into Word and performing a spelling check there, more errors can be found and
corrected. Be sure to add to the spell-checker any particularly difficult or error-prone
words, or scientific and technical terms common in that type of publication.
Finally, the completed document should be checked by an independent person who

samples the complete book and checks for errors, problems with tables and images,
tagging, and the general look of the resulting text. Only after this final check can a book
be considered ready for digital dissemination.
Tables
OCR programs do not cope well with tables. Moreover, tables are hard to check. They
contain many digits, sometimes with points and commas, and entries are easily
misplaced into the wrong row or column. They require concentrated effort, dedicated
work, intensive proof-reading, careful checking, and good quality control. They can be
handled in three basically different ways.
First, tables can be treated as images. This involves scanning them as black-and-white
images and placing them in this form at the appropriate point in the document. This is
the easiest solution. There are no errors, and the only time taken is that involved in
creating the image. However, this solution consumes more memory than others. Also,
the resolution is not always sufficient when large tables are displayed on a computer
screen. If you make the complete table fit, the resolution is too small. If you make the
table over-wide, the user must scroll to see all columns and rows, and cannot get an
overview of the contents.
Second, tables can be recreated manually by making a table with the same number of
rows and columns and filling the entries by typing them in, character by character.
Third, the table can be OCR'd. This saves time compared to the manual process, but
has a potential for more errors. Columns sometimes get merged, and commas and
points are not recognized.
Images
Publications contain three different general types of image:
• black and white line art;

• black and white photographs;
• color photographs.
Black and white line art should be scanned in line art mode and saved as GIF or PNG
files. Black and white photographs should be scanned in greyscale mode and saved as
GIF or JPEG files. Color photographs should be scanned in color mode and saved as
JPEG files. Generally speaking, medium-quality JPEG provides adequate resolution.
For most collections, images consume the bulk of the space required on a hard-disk or
CD-ROM. This makes it important to optimize each image for clarity and visibility, while
minimizing its size. To save space you might drop some or all of the images if they are
not relevant to the text.
Images should be scanned separately, one by one. We recommend giving the image
files a name that consists of the first five or six characters used to denote the document
followed by the number of the page on which the image was found. An alternative,
assuming each document is in its own directory, is to simply use the letter p followed by
the page containing the image. If there are several images on a single page, append an
additional letter a, b, c… to the filename. For example, if a JPEG image appeared on
page 36 of the publication u7548e discussed earlier, it would be placed in a file named
u7548e36.jpg or p36.jpg.
Once the images have been scanned, you can put batch-processing programs to work
to resize or enhance all the images at once.
Specialized material
Many documents contain specialized material such as special characters, formulas, and
difficult pages. Special characters generally relate to different languages and diacritical
marks. The language option for the OCR program should be set for the specific
language being read. Formulas will have to be recreated manually. Sometimes this is
not possible in the OCR program, but only in a word processor like MICROSOFT Word.
Difficult pages that contain complex material or are damaged so that a clear image
cannot be obtained might have to be retyped manually.
3.2 Productivity and resources
As mentioned earlier, you should not underestimate the difficulty of OCR. Although the
economic and practical options for OCR should be considered separately from
scanning, similar points arise: the necessary investment in computers; the availability of
human resources and management skills; training the workforce; salary costs; the total
number of pages to be processed; and whether documents can be outsourced to third
parties.
In this section we share our experience of OCR operations in Belgium, Romania and
India. All case studies, calculations and figures assume average situations, documents
of standard difficulty (including tables and images) such as are found in most archives
or libraries, very high-quality results, and a medium- to long-term operation.
Intensive OCR
OCR is difficult. It demands great concentration and much skill. Before attaining peak
productivity level and quality, a learning period of about six weeks is needed.
Typically, best results and productivity are achieved during the first hours of each day.
After three hours of OCR work, productivity declines very rapidly, perhaps to 50% of the
initial level. After six hours most people become very tired.
The same kind of evolution occurs over the initial weeks. In the first few weeks
everyone achieves fairly high productivity, but after that up to two-thirds of people
become bored and frustrated. These people either quit or perform poorly in terms of
quality and productivity. Even those who pass the first three to five critical weeks and
become part of the regular work team often leave in search of a better position after 6 to
12 months.
The remarks made in Section 3.1 about personnel apply particularly to intensive OCR.
Quality checks are best undertaken by native speakers or people with a good command
of the language being checked. Young people generally sustain higher concentration
than older people for OCR work. As a rule-of-the-thumb, people aged between 18 and
23 years tend to be better suited than those over 25.
Finally, OCR can be a boring job, which makes motivation and sustained commitment
to quality exceptionally important.
These facts about OCR lead to the following guidelines:
• Young people between 18 and 25 are best suited for this job.
• Because the first hours are always the most productive, the work should
either be organized on a part-time basis or only the most motivated and
concentrated people should be selected for full-time work.
• Two-thirds of people tend to quit or get bored after about three to five
weeks. This translates into poorer quality and low productivity in the last
weeks.
• A regular supply of work is needed to justify the necessary training, to
maintain concentration, and to keep spirits high.
Achievable productivity
Table 2 OCR productivity

Working hours/day Pages/day Pages/month
Initial training (6 weeks) 3 6 120
Optimal productivity level 3 9 150 to 200
7 28 500 to 600
Table 2 gives typical OCR productivity figures. Documents come in all sizes and
qualities, and these figures assume that the mix of documents contains an average
number of images or tables—say one image and one table of five rows by five columns
every 8 pages. They also assume that the page images are of medium to high
quality—note that, as discussed above, this depends on the quality of scanning—and
that the OCR workers have a good command of the language.
Table 2 gives separate figures for people undergoing training and for those who have
reached their optimal productivity level. If a member of the administrative staff were to
allocate three hours a day to OCR, they could achieve 180 to 200 pages OCR per
month. For full-time staff with proper training, high concentration and dedication to
quality, 500 to 600 pages a month can be achieved.
However, the rates that are achieved on difficult pages of low quality, with many
columns or many tables, are far lower—perhaps 300 to 400 pages per month for
full-time work.
Assume that the salary cost for dedicated and motivated full-time OCR workers is $400
per month, and the overhead—including management costs, computers, office space,
utilities, etc.—comes to another $300 to $400 per person per month. Then the cost of
OCR comes to about $1.2 to $1.6 per page. Taking into account the training period,
total volume, time-span, and layoff costs should the operation close down for lack of
work, these figures rise to $1.5 to $2.5 per page.
The cost of in-house OCR should be weighed against the cost of outsourcing the work
to a professional OCR company. These typically charge from $1.5 to $4 per page,
including images and tables. Human Info NGO/Simple Words has such a unit in
Romania, and charges humanitarian non-profit organizations a special price that ranges
from $1.2 to $2 per page. Please contact us at scanning@humaninfo.org for further
information and advice.
3.3 Alternatives to OCR
There are two alternatives to OCR that we discuss here.
Manual retyping
One, which eliminates most scanning as well, is to retype the documents manually,
using a word processor. This still requires the images and front cover to be scanned,
but the remaining pages need not be scanned—thus one can dispense with both
powerful scanners and OCR software.
The people who do this work do not have to understand the text. They must be accurate
typists and re-key exactly what they see. Retyping does introduce errors, and
double-keying is often used to find and correct these. This method involves two people
who independently re-key the same document, after which both digital versions are
compared word for word using a special software program by an operator who has the
original document in front of them. The assumption is that if the same word has been
typed independently twice in the same way, it is correct. However, this is not always
true, and for extremely high precision, triple-keying is performed.
The advantage of rekeying is that cost is saved because an OCR program is not
needed and so the computers can be older, lower-range, or second-hand
models—whereas powerful computers are needed for OCR. Also, the work can be
performed by people with a lower level of skill. The disadvantages are that a training
period of at least two months is needed. Single keying usually produces too many
errors, and double or triple keying is needed.
The cost depends entirely on salary level. Typically, re-keyers in developing countries
are paid on the order of $150/month. Their productivity could be twenty to thirty pages
per day—corresponding to 400 pages per month, images included. With double-keying,
this makes the total salary costs around $300 per month, plus overheads.
Image files
A very low cost alternative to OCR is simply to use a PDF image version of the
document pages. The cost is only a fraction of OCR's—about $0.1 per page.
Once scanning has been completed and TIFF files are available, an automatic
converter (usually Adobe Acrobat or Adobe Photoshop) converts all TIFF files of book
pages into PDF files.
The downside is that these files are not searchable. Also, they are quite large—usually
50 Kb per page, plus or minus 20% depending on the quality of the original TIFF file.
PDF image files are slow—sometimes, in developing countries, impossible or

prohibitively expensive—to download. They rarely fit on a floppy disk, and do not
support text manipulation functions such as cut-and-paste.
The PDF image file method should only be used if no OCR budget is available, and for
documents that are likely to be used by a small number of people who have high-speed
low-cost Internet access.
3.4 Combining scanning and OCR
If a scanner is connected directly to the computer that runs the OCR software, most
OCR programs can scan a page and perform OCR immediately. Page-by-page
scanning and OCR is a reasonable strategy for low volumes, but will prove
time-consuming for bigger and more continuous jobs.
For up to 100 to 150 pages per month, this solution may suffice. For higher volumes it is
faster and more efficient to scan the document first, then perfom OCR on all the pages
as a separate step.
14 Three examples: 1000 to 100,000 pages
4
Three examples: 1000 to 100,000
pages
4.1 Typical small collection: 500 to 1000 pages
Most NGOs have 500 to 1000 pages to scan. This volume can be OCRed in-house if
motivated volunteers are available.
Scanning
The first step is to scan the publications to generate a high-quality TIFF file of each
page, and a separate line-art, grey-scale or color bitmap image for each illustration.
Assuming that 1000 pages have to be scanned, this might represent a part-time job of
about one month—just for scanning. The TIFF files would consume 60 to 80 Mb of
hard-disk space, and a good policy is to create a CD-R containing these files. A
low-cost flatbed scanner of $100 to $300 will be sufficient for the job. Scanning can be
done after working hours or during the weekends by a volunteer in the office or at
home.
OCR
The second step is OCR by another volunteer, or team of volunteers, skilled in

language and correction. The TIFF files can either be shared between computers, or
one computer can be used for the entire job. Typically, it will take five or six months of
part-time labor (e.g. 20 hours a week) to convert 1000 pages into perfect Word or HTML
documents.
Outsourcing
An alternative is to outsource the scanning and OCR process. It would probably cost
$1500 to $2000 to convert everything into perfect Word and HTML files.
4.2 All publications from an organization: 5000 pages
Many larger organizations have archives of around 5000 pages of currrent or out-of
print books, journals, newsletters, grey literature, etc.
Scanning
This is too much for a flat-bed scanner. Scanning should either be outsourced
(approximately $400 for 5000 pages) or a sheet-feeder scanner purchased
(approximately $900). Alternatively, a more expensive scanner could be bought
together with a few other institutions or NGOs ($6000 costs divided by the number of
participants). All 5000 pages in TIFF format will take about 300 to 400 Mb of hard-disk
space. Again, a good policy is to create a CD-R containing these files.
OCR
The second step is OCR by another volunteer, or team of volunteers, skilled in OCR
and correction. Again, several computers might be used, or one computer for the whole
job. It would take 25 to 30 months of half-time labor (assuming 20 hours a week) to
convert 5000 pages into perfect Word or HTML. In practice this is too long and too
computer-intensive to manage on a volunteer basis. One would have to pay volunteers,
monitor them for performance and quality, provide adequate space, etc, in order to have
the job finished within reasonable time at a high level of quality.
Alternatively one could create image PDF files, which would take 300 to 400 Mb of
space and would be harder to download over the Internet.
Outsourcing
An alternative is to outsource the scanning and OCR processes. It would probably cost
$7500 to $10,000 to convert everything into perfect Word and HTML files.
4.3 A small library: 100,000 pages
Larger organizations, universities, governments, and specialized libraries might have a

whole library to digitize—say 100,000 pages. The first issue to consider is the copyright
status of the publications. If they are not in the public domain, explicit permission to
digitize them must be obtained from the copyright holders. You should also check
whether the files are already available digitally.
Scanning
The volume is too high for a sheet-feed scanner. Scanning should either be outsourced
($8000 for 100,000 pages), or a more expensive scanner purchased together with a few
other institutions or NGOs ($6000 shared between the participants). 100,000 pages in
TIFF format will take 6 to 8 Gb of hard-disk space. The best plan is to create a set of
CD-R copies containing these files.
OCR
The second step is OCR (or creation of PDF files for less widely used documents). It
would take 500 to 700 months of half-time labor to convert 100,000 pages into perfect
Word or HTML. This is impossible to realize with volunteers, and the job must be done
on a professional basis.
To save cost, some of the less-frequently-used pages—say 80% or 80,000

pages—could be transformed into PDF, and the other 20,000 pages into Word and
HTML. The PDFs would take 4 to 6 Gb space and be harder to download on the
Internet, but would cost only $0.2 per page to create by a professional organization
(total of $16,000). If 80,000 PDF files were created from TIFF files by volunteers using
PDF conversion programs like Adobe Acrobat, 10 to 20 months of part-time work would
be necessary on a powerful computer.
Outsourcing
An alternative is to outsource the work. If the 80% PDF and 20% HTML mix were
maintained, the PDF would cost around $16,000 and the HTML $30,000 to $40,000—a
total budget of around $50,000. If everything were OCRed, it would cost $150,000 to
$200,000 to convert the entire collection into perfect Word and /HTML files.
17 Creating an electronic collection
5
Creating an electronic collection
Three important aspects should be kept in mind when deciding to create digital
collections. First, the collection must be organized. The more content there is, the
greater the need for indexes and powerful search systems. For collections of 3000 to
5000 pages or more, indexes and search systems are essential. Second, the needs of
end-users must prevail. The target groups that will use the collection should be
identified, and a process of regular consultation set up. Third, the available budget will
determine how much can be done.
5.1 Methods of collection building
There are many examples of excellent CD-ROMs that are created on the web-page
model. HTML, PDF or Word documents are added and linked using hyperlinks.
Navigation is made simple and attractive by the use of hyperlinks, frames, keywords,
indexes and so on. Such systems work well up to a few thousand pages, but from 3000
to 5000 pages onwards it is important to have a well-structured collection and a
powerful search facility. This is where the Greenstone software can help.
The Greenstone Digital Library software creates a structured digital library including a
very powerful search and retrieval engine. Up to 150,000 pages can be indexed on a
single CD-ROM. Every CD-ROM can become an Internet server. Greenstone is
open-source software, and is freely available under the GNU license.
The companion manuals describe how to build Greenstone collections. There are
essentially three different ways of building collections:
• The librarian interface

• The Collector
• Building from the command line.
The first method is the “librarian” interface, described in the Greenstone Digital Library
User's Guide(Chapter 3, “Making Greenstone Collections”). This is a comprehensive
interactive facility for collection-building. With it, you can collect sets of documents,
import or assign metadata, and build them into a Greenstone collection. The second
method is the “Collector” subsystem, described in Chapter 4 of the User's Guide. This is
an older facility that provides an alternative way of building collections of web pages or
other documents. It guides you through a sequence of interactive web pages that
request the information needed. However, it does not provide any way of adding
metadata to the documents, and—because it is a web interface—it is not really suitable
for collections that take more than a few minutes to build. The third method is to run the
programs for collection-building directly from the command line; this is in the
Greenstone Digital Library Developer's Guide(Chapter 1). This gives more flexibility in
running programs individually and saving intermediate results, which may be desirable
for collections that take many hours to build. You will also need to read Chapter 2 of the
Developer's Guide in order to harness the full power of Greenstone to build advanced
collections.
There is a fourth method for creating and editing the material associated with a
collection, a program called the Collection Organizer. However, its functionality has
been superseded by the librarian interface mentioned above. It is described in a legacy

document entitled Using the Organizer.
5.2 Getting started in seven steps and 15 minutes
The best way of getting the look and feel of the librarian interface is to actually create a
small test library. If you have 15 minutes please follow these steps and you will
understand this program much better.
Before getting started, first install Greenstone (see the Greenstone Installer's Guide)
which includes the Demo collection in DLS format and its source files. Note, if you
wish to be able to add to your collection any of the 140 documents in the DLS
collection (instead of just the 11 of these documents in the Greenstone Demo
collection), you should install DLS as one of the sample Greenstone libraries. The
Demo and DLS collections will be installed in C:\Program Files\gsdl\collect, in
subdirectories demo and dls respectively. If you previously installed Greenstone without
DLS and wish to install it, then you may re-insert your Greenstone CD-ROM and add
this collection. It is not necessary to uninstall Greenstone first.
We suggest that you print the instructions below and follow them step by step:
1. Launch the librarian interface under Windows by selecting Greenstone

Digital Library from the Programs section of the Start menu and
choosing Librarian InterfaceIf you are using Unix, instead type
cd ~/gsdl
cd gli
./gli.sh
where ~/gsdl is the directory containing your Greenstone system.
2. Select New from the File menu in the horizontal menu bar at the top of
the window. Give it a title, for example “My First Collection,” and fill out
your email address and a brief description of the collection. In the “Base
this collection on” menu, choose “greenstone demo” or “Development
Library Subset” (the effect is the same because these two collections
have the same structure).
3. Add some documents from the Demo collection (or the DLS collection if
it is installed) to your new collection. To do this, double-click the
Greenstone Collections folder in the left-hand panel, then double-click
the collection you desire. The documents in it are displayed underneath.
Select one of these, drag it, and drop into the right-hand panel. This
panel represents the collection you are building. Choose several
documents and drag them into it one by one, or using multiple selection
in the standard way.
4. Add some of your own documents that are not in the Demo or DLS
collections. Close the Greenstone Collections folder in the left-hand
panel and double-click the Local Filespace folder. Navigate to a
directory that contains some documents (e.g. small Word or HTML files).
Drag a few of these into the right-hand panel to include them in your
collection.
5. Add metadata to the documents in your collection. So far you have been
operating under the Gather panel, indicated by the Gather tab
underneath the horizontal menu bar at the top of the window. Click the
Enrich tab beside it. The documents in your collection now appear in the
left-hand panel: click one and examine the metadata associated with it
in the “Element … Value” table at the top right. Use the panel
underneath to change individual values by selecting the desired
Element and either choosing an existing value from the list or typing a
new value into the box near the bottom. Add Title, Organization, and
Keyword metadata to each of your own documents that you put in the
collection. After you type each value you need to click “ Appendix to add
that value to the metadata.
6. Click the Create tab to leave the Enrich mode and create your new
collection. Click the Build Collection button at the bottom. While the
computer is building the collection you will receive some feedback on
what it is doing.
7. When it has finished, click the Preview tab to view the collection from
within the librarian interface. Check the titles a-z, organisations and how
to lists to ensure that your documents have been included in the
collection. You will also find when you visit your Greenstone home page
that the collection has been installed as one of the regular collections.
20 GNU Free Documentation License
GNU Free Documentation License

Version 1.2, November 2002
Copyright (C) 2000,2001,2002 Free Software Foundation, Inc. 51 Franklin St, Fifth
Floor, Boston, MA 02110-1301 USA Everyone is permitted to copy and distribute
verbatim copies of this license document, but changing it is not allowed.
0. PREAMBLE
The purpose of this License is to make a manual, textbook, or other functional and
useful document "free" in the sense of freedom: to assure everyone the effective
freedom to copy and redistribute it, with or without modifying it, either commercially or
noncommercially. Secondarily, this License preserves for the author and publisher a
way to get credit for their work, while not being considered responsible for modifications
made by others.
This License is a kind of "copyleft", which means that derivative works of the document
must themselves be free in the same sense. It complements the GNU General Public
License, which is a copyleft license designed for free software.
We have designed this License in order to use it for manuals for free software, because
free software needs free documentation: a free program should come with manuals
providing the same freedoms that the software does. But this License is not limited to
software manuals; it can be used for any textual work, regardless of subject matter or
whether it is published as a printed book. We recommend this License principally for
works whose purpose is instruction or reference.
1. APPLICABILITY AND DEFINITIONS
This License applies to any manual or other work, in any medium, that contains a notice
placed by the copyright holder saying it can be distributed under the terms of this
License. Such a notice grants a world-wide, royalty-free license, unlimited in duration, to
use that work under the conditions stated herein. The "Document", below, refers to any
such manual or work. Any member of the public is a licensee, and is addressed as
"you". You accept the license if you copy, modify or distribute the work in a way
requiring permission under copyright law.
A "Modified Version" of the Document means any work containing the Document or a
portion of it, either copied verbatim, or with modifications and/or translated into another
language.
A "Secondary Section" is a named appendix or a front-matter section of the Document

that deals exclusively with the relationship of the publishers or authors of the Document
to the Document's overall subject (or to related matters) and contains nothing that could
fall directly within that overall subject. (Thus, if the Document is in part a textbook of
mathematics, a Secondary Section may not explain any mathematics.) The relationship
could be a matter of historical connection with the subject or with related matters, or of
legal, commercial, philosophical, ethical or political position regarding them.
The "Invariant Sections" are certain Secondary Sections whose titles are designated, as
being those of Invariant Sections, in the notice that says that the Document is released
under this License. If a section does not fit the above definition of Secondary then it is
not allowed to be designated as Invariant. The Document may contain zero Invariant
Sections. If the Document does not identify any Invariant Sections then there are none.
The "Cover Texts" are certain short passages of text that are listed, as Front-Cover
Texts or Back-Cover Texts, in the notice that says that the Document is released under
this License. A Front-Cover Text may be at most 5 words, and a Back-Cover Text may
be at most 25 words.
A "Transparent" copy of the Document means a machine-readable copy, represented in

a format whose specification is available to the general public, that is suitable for
revising the document straightforwardly with generic text editors or (for images
composed of pixels) generic paint programs or (for drawings) some widely available
drawing editor, and that is suitable for input to text formatters or for automatic
translation to a variety of formats suitable for input to text formatters. A copy made in an
otherwise Transparent file format whose markup, or absence of markup, has been
arranged to thwart or discourage subsequent modification by readers is not
Transparent. An image format is not Transparent if used for any substantial amount of
text. A copy that is not "Transparent" is called "Opaque".
Examples of suitable formats for Transparent copies include plain ASCII without
markup, Texinfo input format, LaTeX input format, SGML or XML using a publicly
available DTD, and standard-conforming simple HTML, PostScript or PDF designed for
human modification. Examples of transparent image formats include PNG, XCF and
JPG. Opaque formats include proprietary formats that can be read and edited only by
proprietary word processors, SGML or XML for which the DTD and/or processing tools
are not generally available, and the machine-generated HTML, PostScript or PDF
produced by some word processors for output purposes only.
The "Title Page" means, for a printed book, the title page itself, plus such following
pages as are needed to hold, legibly, the material this License requires to appear in the
title page. For works in formats which do not have any title page as such, "Title Page"
means the text near the most prominent appearance of the work's title, preceding the
beginning of the body of the text.
A section "Entitled XYZ" means a named subunit of the Document whose title either is
precisely XYZ or contains XYZ in parentheses following text that translates XYZ in
another language. (Here XYZ stands for a specific section name mentioned below, such
as "Acknowledgements", "Dedications", "Endorsements", or "History".) To "Preserve the
Title" of such a section when you modify the Document means that it remains a section
"Entitled XYZ" according to this definition.
The Document may include Warranty Disclaimers next to the notice which states that
this License applies to the Document. These Warranty Disclaimers are considered to be
included by reference in this License, but only as regards disclaiming warranties: any
other implication that these Warranty Disclaimers may have is void and has no effect on
the meaning of this License.
2. VERBATIM COPYING
You may copy and distribute the Document in any medium, either commercially or
noncommercially, provided that this License, the copyright notices, and the license
notice saying this License applies to the Document are reproduced in all copies, and
that you add no other conditions whatsoever to those of this License. You may not use
technical measures to obstruct or control the reading or further copying of the copies
you make or distribute. However, you may accept compensation in exchange for
copies. If you distribute a large enough number of copies you must also follow the
conditions in section 3.
You may also lend copies, under the same conditions stated above, and you may
publicly display copies.
3. COPYING IN QUANTITY
If you publish printed copies (or copies in media that commonly have printed covers) of
the Document, numbering more than 100, and the Document's license notice requires
Cover Texts, you must enclose the copies in covers that carry, clearly and legibly, all
these Cover Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on the
back cover. Both covers must also clearly and legibly identify you as the publisher of
these copies. The front cover must present the full title with all words of the title equally
prominent and visible. You may add other material on the covers in addition. Copying
with changes limited to the covers, as long as they preserve the title of the Document
and satisfy these conditions, can be treated as verbatim copying in other respects.
If the required texts for either cover are too voluminous to fit legibly, you should put the
first ones listed (as many as fit reasonably) on the actual cover, and continue the rest
onto adjacent pages.
If you publish or distribute Opaque copies of the Document numbering more than 100,
you must either include a machine-readable Transparent copy along with each Opaque
copy, or state in or with each Opaque copy a computer-network location from which the
general network-using public has access to download using public-standard network
protocols a complete Transparent copy of the Document, free of added material. If you
use the latter option, you must take reasonably prudent steps, when you begin
distribution of Opaque copies in quantity, to ensure that this Transparent copy will
remain thus accessible at the stated location until at least one year after the last time
you distribute an Opaque copy (directly or through your agents or retailers) of that
edition to the public.
It is requested, but not required, that you contact the authors of the Document well
before redistributing any large number of copies, to give them a chance to provide you
with an updated version of the Document.
4. MODIFICATIONS
You may copy and distribute a Modified Version of the Document under the conditions
of sections 2 and 3 above, provided that you release the Modified Version under
precisely this License, with the Modified Version filling the role of the Document, thus
licensing distribution and modification of the Modified Version to whoever possesses a
copy of it. In addition, you must do these things in the Modified Version:
A. Use in the Title Page (and on the covers, if any) a title distinct from
that of the Document, and from those of previous versions (which
should, if there were any, be listed in the History section of the
Document). You may use the same title as a previous version if the
original publisher of that version gives permission.
B. List on the Title Page, as authors, one or more persons or entities
responsible for authorship of the modifications in the Modified Version,

together with at least five of the principal authors of the Document (all of
its principal authors, if it has fewer than five), unless they release you
from this requirement.
C. State on the Title page the name of the publisher of the Modified
Version, as the publisher.
D. Preserve all the copyright notices of the Document.
E. Add an appropriate copyright notice for your modifications adjacent to
the other copyright notices.
F. Include, immediately after the copyright notices, a license notice
giving the public permission to use the Modified Version under the terms
of this License, in the form shown in the Addendum below.
G. Preserve in that license notice the full lists of Invariant Sections and
required Cover Texts given in the Document's license notice.
H. Include an unaltered copy of this License.
I. Preserve the section Entitled "History", Preserve its Title, and add to it
an item stating at least the title, year, new authors, and publisher of the
Modified Version as given on the Title Page. If there is no section
Entitled "History" in the Document, create one stating the title, year,
authors, and publisher of the Document as given on its Title Page, then
add an item describing the Modified Version as stated in the previous
sentence.
J. Preserve the network location, if any, given in the Document for public
access to a Transparent copy of the Document, and likewise the
network locations given in the Document for previous versions it was
based on. These may be placed in the "History" section. You may omit a
network location for a work that was published at least four years before
the Document itself, or if the original publisher of the version it refers to
gives permission.
K. For any section Entitled "Acknowledgements" or "Dedications",
Preserve the Title of the section, and preserve in the section all the
substance and tone of each of the contributor acknowledgements and/or
dedications given therein.
L. Preserve all the Invariant Sections of the Document, unaltered in their
text and in their titles. Section numbers or the equivalent are not
considered part of the section titles.
M. Delete any section Entitled "Endorsements". Such a section may not
be included in the Modified Version.
N. Do not retitle any existing section to be Entitled "Endorsements" or to
conflict in title with any Invariant Section.
O. Preserve any Warranty Disclaimers.
If the Modified Version includes new front-matter sections or appendices that qualify as
Secondary Sections and contain no material copied from the Document, you may at
your option designate some or all of these sections as invariant. To do this, add their
titles to the list of Invariant Sections in the Modified Version's license notice. These titles
must be distinct from any other section titles.
You may add a section Entitled "Endorsements", provided it contains nothing but
endorsements of your Modified Version by various parties--for example, statements of
peer review or that the text has been approved by an organization as the authoritative
definition of a standard.
You may add a passage of up to five words as a Front-Cover Text, and a passage of up
to 25 words as a Back-Cover Text, to the end of the list of Cover Texts in the Modified
Version. Only one passage of Front-Cover Text and one of Back-Cover Text may be
added by (or through arrangements made by) any one entity. If the Document already
includes a cover text for the same cover, previously added by you or by arrangement
made by the same entity you are acting on behalf of, you may not add another; but you
may replace the old one, on explicit permission from the previous publisher that added
the old one.
The author(s) and publisher(s) of the Document do not by this License give permission
to use their names for publicity for or to assert or imply endorsement of any Modified
Version.
5. COMBINING DOCUMENTS
You may combine the Document with other documents released under this License,
under the terms defined in section 4 above for modified versions, provided that you
include in the combination all of the Invariant Sections of all of the original documents,
unmodified, and list them all as Invariant Sections of your combined work in its license
notice, and that you preserve all their Warranty Disclaimers.
The combined work need only contain one copy of this License, and multiple identical
Invariant Sections may be replaced with a single copy. If there are multiple Invariant
Sections with the same name but different contents, make the title of each such section
unique by adding at the end of it, in parentheses, the name of the original author or
publisher of that section if known, or else a unique number. Make the same adjustment
to the section titles in the list of Invariant Sections in the license notice of the combined
work.
In the combination, you must combine any sections Entitled "History" in the various
original documents, forming one section Entitled "History"; likewise combine any
sections Entitled "Acknowledgements", and any sections Entitled "Dedications". You
must delete all sections Entitled "Endorsements."
6. COLLECTIONS OF DOCUMENTS
You may make a collection consisting of the Document and other documents released
under this License, and replace the individual copies of this License in the various
documents with a single copy that is included in the collection, provided that you follow
the rules of this License for verbatim copying of each of the documents in all other
respects.
You may extract a single document from such a collection, and distribute it individually
under this License, provided you insert a copy of this License into the extracted
document, and follow this License in all other respects regarding verbatim copying of
that document.
7. AGGREGATION WITH INDEPENDENT WORKS
A compilation of the Document or its derivatives with other separate and independent
documents or works, in or on a volume of a storage or distribution medium, is called an
"aggregate" if the copyright resulting from the compilation is not used to limit the legal
rights of the compilation's users beyond what the individual works permit. When the
Document is included in an aggregate, this License does not apply to the other works in
the aggregate which are not themselves derivative works of the Document.
If the Cover Text requirement of section 3 is applicable to these copies of the

Document, then if the Document is less than one half of the entire aggregate, the
Document's Cover Texts may be placed on covers that bracket the Document within the
aggregate, or the electronic equivalent of covers if the Document is in electronic form.
Otherwise they must appear on printed covers that bracket the whole aggregate.
8. TRANSLATION
Translation is considered a kind of modification, so you may distribute translations of

the Document under the terms of section 4. Replacing Invariant Sections with
translations requires special permission from their copyright holders, but you may
include translations of some or all Invariant Sections in addition to the original versions
of these Invariant Sections. You may include a translation of this License, and all the
license notices in the Document, and any Warranty Disclaimers, provided that you also
include the original English version of this License and the original versions of those
notices and disclaimers. In case of a disagreement between the translation and the
original version of this License or a notice or disclaimer, the original version will prevail.
If a section in the Document is Entitled "Acknowledgements", "Dedications", or

"History", the requirement (section 4) to Preserve its Title (section 1) will typically
require changing the actual title.
9. TERMINATION
You may not copy, modify, sublicense, or distribute the Document except as expressly
provided for under this License. Any other attempt to copy, modify, sublicense or
distribute the Document is void, and will automatically terminate your rights under this
License. However, parties who have received copies, or rights, from you under this
License will not have their licenses terminated so long as such parties remain in full
compliance.
10. FUTURE REVISIONS OF THIS LICENSE
The Free Software Foundation may publish new, revised versions of the GNU Free
Documentation License from time to time. Such new versions will be similar in spirit to
the present version, but may differ in detail to address new problems or concerns. See
http://www.gnu.org/copyleft/.
Each version of the License is given a distinguishing version number. If the Document
specifies that a particular numbered version of this License "or any later version" applies
to it, you have the option of following the terms and conditions either of that specified
version or of any later version that has been published (not as a draft) by the Free
Software Foundation. If the Document does not specify a version number of this
License, you may choose any version ever published (not as a draft) by the Free
Software Foundation.
How to use this License for your documents

To use this License in a document you have written, include a copy of the License in the
document and put the following copyright and license notices just after the title page:
Copyright (c) YEAR YOUR NAME. Permission is granted to copy, distribute and/or
modify this document under the terms of the GNU Free Documentation License, Version
1.2 or any later version published by the Free Software Foundation; with no
Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the
license is included in the section entitled "GNU Free Documentation License".
If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts, replace the
"with...Texts." line with this:
with the Invariant Sections being LIST THEIR TITLES, with the Front-Cover Texts
being LIST, and with the Back-Cover Texts being LIST.
If you have Invariant Sections without Cover Texts, or some other combination of the
three, merge those two alternatives to suit the situation.
If your document contains nontrivial examples of program code, we recommend

releasing these examples in parallel under your choice of free software license, such as
the GNU General Public License, to permit their use in free software.

From Paper To Collection: Greenstone Digital Library

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

From Paper To Collection: Greenstone Digital Library

Uploaded by

Copyright:

Available Formats

GREENSTONE DIGITAL LIBRARY

FROM PAPER TO COLLECTION

Dr Michel Loots, Dan Camarzan and Ian H. Witten

Human Info NGO, Belgium

Greenstone is a suite of software for building and distributing digital library

Greenstone gsdl-2.50 March 2004

We have tried to be as plain as possible in our explanation. Reference to any trade

The complete set of Greenstone documents include five volumes:

• Greenstone Digital Library Installer's Guide

2 Scanners and scanning 3

2.2 Preparing the documents 4

2.3 The scanning process 5

2.4 Productivity and resources 6

3 OCR: Optical Character Recognition 8

3.2 Productivity and resources 10

3.3 Alternatives to OCR 12

3.4 Combining scanning and OCR 13

4 Three examples: 1000 to 100,000 pages 14

4.2 All publications from an organization: 5000 pages 14

5 Creating an electronic collection 17

5.2 Getting started in seven steps and 15 minutes 18

GNU Free Documentation License 20

Typical steps that have to be implemented are:

1. Selecting the documents to be included

Step v. enables the different parts of a document to be independently selected and

• What is the goal of your collection?

only page images.)

Low-cost flat-bed scanner

up to 1000 or 2000 pages.

Low-end scanner with sheet feeder

These scanners are useful for up to 1500 to 3000 pages a month.

Professional duplex scanners

Professional scanners are reliable, heavy-duty machines capable of processing a large

2.2 Preparing the documents

For high volumes—more than 20 documents—we recommend asking a printer or

2.3 The scanning process

used in a filename. We recommend restricting this unique document identifier to 8 to 16

2.4 Productivity and resources

You should not underestimate the magnitude of the scanning operation—and

An important decision is whether to invest in scanning equipment and perform all

• pressure of time for the scanning job;

Table 1 Scanning cost

It is tempting to regard the development of scanning capacity asa commercial venture,

In conclusion, if 10,000 to 50,000 pages are to be scanned, one should consider

3.1 The OCR process

Normally there are four quality checks.

Finally, the completed document should be checked by an independent person who

Publications contain three different general types of image:

• black and white line art;

3.2 Productivity and resources

These facts about OCR lead to the following guidelines:

Table 2 OCR productivity

3.3 Alternatives to OCR

There are two alternatives to OCR that we discuss here.

PDF image files are slow—sometimes, in developing countries, impossible or

3.4 Combining scanning and OCR

The second step is OCR by another volunteer, or team of volunteers, skilled in

4.2 All publications from an organization: 5000 pages

4.3 A small library: 100,000 pages

Larger organizations, universities, governments, and specialized libraries might have a

To save cost, some of the less-frequently-used pages—say 80% or 80,000

5.1 Methods of collection building