Name

htdigconfig - script to create fuzzy databases for ht://Dig


Synopsis
htdigconfig
Description
htdigconfig is a script to create fuzzy databases, such as the word2root,
root2word and synonyms databases, for the ht://Dig search engine.
Name
htpurge - remove unused documents from the database (general maintenance script)
Synopsis
htpurge [-][-a][-c configfile][-u URL][-v]
Description
Htpurge removes specified URLs from the databases, as well as bad URLs,
unretrieved URLs, obsolete documents, etc. It is recommended that htpurge be
run after htdig to clean out any documents of this sort.
Options
-
Take URL list from standard input (rather than specified with -u). Format of
input file is one URL per line.
-a
Use alternate work files. Tells htpurge to append .work to database files,
causing a second copy of the database to be built. This allows the original
files to be used by htsearch during the run.
-c configfile
Use the specified configfile instead of the default.
-u URL
Add this URL to the list of documents to remove. Must be specified multiple
times if more than one URL is to be removed. Should not be used together with -.
-v
Verbose mode. This increases the verbosity of the program. Using more than 2
is probably only useful for debugging purposes. The default verbose mode (using
only one -v) gives a nice progress report while digging.
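The option rules above can be sketched as a small Python helper that assembles an htpurge command line. The function name build_htpurge_argv and its parameters are hypothetical and not part of ht://Dig; only the flags themselves come from the option list above. The helper builds the argument list and does not run anything.

```python
from typing import List, Optional, Sequence


def build_htpurge_argv(
    urls: Sequence[str] = (),
    from_stdin: bool = False,
    config: Optional[str] = None,
    alternate: bool = False,
    verbosity: int = 0,
) -> List[str]:
    """Assemble an htpurge argument list (hypothetical helper, not part of ht://Dig)."""
    if from_stdin and urls:
        # The option text advises against combining -u with "-".
        raise ValueError("use either -u URLs or '-' (standard input), not both")
    argv = ["htpurge"]
    if alternate:
        argv.append("-a")           # work on the .work copies of the databases
    if config is not None:
        argv += ["-c", config]      # non-default configuration file
    argv += ["-v"] * verbosity      # one -v per verbosity level
    for url in urls:
        argv += ["-u", url]         # -u is repeated once per URL
    if from_stdin:
        argv.append("-")            # URLs are then read from stdin, one per line
    return argv
```

The resulting list could be passed to subprocess.run, with the URL list supplied on standard input when from_stdin is set.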
Files
/etc/htdig/htdig.conf
The default configuration file.
Name
htstat - returns statistics on the document and word databases, much like the -s
option to htdig or htmerge.
Synopsis
htstat [-v][-a][-c configfile][-u]
Description
Htstat returns statistics on the document and word databases, much like the -s
option to htdig or htmerge.
Options
-a
Use alternate work files. Tells htstat to append .work to database files,
causing a second copy of the database to be built. This allows the original
files to be used by htsearch during the run.
-c configfile
Use the specified configfile instead of the default.
-u
Give a list of URLs in the document database.
-v
Verbose mode. This increases the verbosity of the program. Using more than 2
is probably only useful for debugging purposes. The default verbose mode (using
only one -v) gives a nice progress report while digging.
Name
htnotify - sends email notifications about out-dated web pages discovered by
htmerge
Synopsis
htnotify [options]
Description
Htnotify scans the document database created by htmerge and sends an email
message for every page that is out of date. Please have a look at the ht://Dig
notification manual for instructions on how to set up this service.
Options
-b database
Specifies an alternative database to the one specified in the configuration
file.
-c configfile
Use the specified configfile instead of the default.
-v
Verbose mode. This increases the verbosity of the program. Used once, it
displays a log of which email messages were sent. Used more than once, it
displays information about each document that has email notification set.
Files
/etc/htdig/htdig.conf
The default configuration file.
Name
htload - reads in an ASCII-text version of the document database
Synopsis
htload [options]
Description
Htload reads in an ASCII-text version of the document database in the same form
as the -t option of htdig and htdump. Note that this will overwrite data in your
databases, so this should be used with great care.
Options
-a
Use alternate work files. Tells htload to append .work to database files,
allowing it to operate on a second set of databases.
-c configfile
Use the specified configfile instead of the default.
-i
Initial. Do not use any old databases. This is accomplished by first erasing
the databases.
-v
Verbose mode. This doesn't have much effect.
File Formats
Document Database
Each line in the file starts with the document id followed by a list of
fieldname : value pairs separated by tabs. The fields always appear in the
order listed below:
u
URL
t
Title
a
State (0 = normal, 1 = not found, 2 = not indexed, 3 = obsolete)
m
Last modification time as reported by the server
s
Size in bytes
H
Excerpt
h
Meta description
l
Time of last retrieval
L
Count of the links in the document (outgoing links)
b
Count of the links to the document (incoming links or backlinks)
c
HopCount of this document
g
Signature of the document used for duplicate-detection
e
E-mail address to use for a notification message from htnotify
n
Date to send out a notification e-mail message
S
Subject for a notification e-mail message
d
The text of links pointing to this document. (e.g. <a
href="docURL">description</a>)
A
Anchors in the document (i.e. <A NAME=...)
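As an illustration of the line layout described above, here is a minimal Python sketch that splits one document-database line into its id and fields. It assumes each tab-separated item looks like name:value (any spaces around the colon are stripped) and that a field name may occur more than once on a line, so values are collected in lists. The function name parse_doc_line is hypothetical; the state names come from the "a" field description above.

```python
from typing import Dict, List, Tuple

# State codes as listed for the "a" field above.
STATE_NAMES = {0: "normal", 1: "not found", 2: "not indexed", 3: "obsolete"}


def parse_doc_line(line: str) -> Tuple[str, Dict[str, List[str]]]:
    """Split one document-database line into (document id, {field: [values]})."""
    parts = line.rstrip("\n").split("\t")
    doc_id = parts[0]
    fields: Dict[str, List[str]] = {}
    for part in parts[1:]:
        # Split on the first colon only, since values such as URLs contain colons.
        name, sep, value = part.partition(":")
        if not sep:
            continue  # skip anything that is not a name:value pair
        fields.setdefault(name.strip(), []).append(value.strip())
    return doc_id, fields
```

For example, a line holding the fields u, t, a and s would yield the document id plus a dictionary keyed by those letters, and STATE_NAMES[int(fields["a"][0])] would give the readable state.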
Word Database
While htdump and htload don't deal with the word database directly, it's
worth mentioning it here because you need to deal with it when copying the
ASCII databases from one system to another. The initial word database produced
by htdig is already in ASCII format, and a binary version of it is produced by
htmerge, for use by htsearch. So, when you copy over the ASCII version of the
document database produced by htdump, you need to copy over the wordlist as
well, then run htload to make the binary document database on the target
system, followed by running htmerge to make the word index.
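The order of operations on the target system can be sketched in Python as follows. The helper target_system_commands is hypothetical and only returns the commands to run, once the ASCII document database and the wordlist have been copied over. Passing -c to htmerge is an assumption, since this section does not list htmerge's options.

```python
from typing import List, Optional


def target_system_commands(config: Optional[str] = None) -> List[List[str]]:
    """Commands to run on the target system after copying the ASCII files.

    Hypothetical helper: it returns the argument lists, it does not run them.
    """
    # Assumption: htmerge accepts -c configfile in the same way htload does.
    extra = ["-c", config] if config is not None else []
    return [
        ["htload"] + extra,   # rebuild the binary document database from the ASCII dump
        ["htmerge"] + extra,  # then build the word index used by htsearch
    ]
```

Each returned list could be handed to subprocess.run in order, stopping if htload fails so that htmerge never runs against a partial document database.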
Each line in the word list file starts with the word followed by a list of
fieldname : value pairs separated by tabs. The fields always appear in the
order listed below, with the last two being optional:
i
Document ID
l
Location of word in document (1 to 1000)
w
Weight of word based on scoring factors
c
Count of word's appearances in document, if more than 1
a
Anchor number if word occurred after a named anchor
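The word list layout above can be read with a short Python sketch like the one below. It assumes each tab-separated item looks like name:value and that every value is an integer, which the text does not state explicitly. A missing c field is treated as a count of 1 and a missing a field as no anchor. The names WordEntry and parse_word_line are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class WordEntry:
    word: str
    doc_id: int                   # field i
    location: int                 # field l
    weight: int                   # field w
    count: int = 1                # field c, only present when more than 1
    anchor: Optional[int] = None  # field a, only present after a named anchor


def parse_word_line(line: str) -> WordEntry:
    """Parse one word-list line (assumes integer values for all fields)."""
    parts = line.rstrip("\n").split("\t")
    fields: Dict[str, int] = {}
    for part in parts[1:]:
        name, sep, value = part.partition(":")
        if sep:
            fields[name.strip()] = int(value.strip())
    return WordEntry(
        word=parts[0],
        doc_id=fields["i"],
        location=fields["l"],
        weight=fields["w"],
        count=fields.get("c", 1),
        anchor=fields.get("a"),
    )
```

A line that lacks one of the three mandatory fields raises KeyError, which seems a reasonable way to surface a malformed dump rather than silently guessing.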
Files
/etc/htdig/htdig.conf
The default configuration file.
/var/lib/htdig/db.docs
The default ASCII document database file.
/var/lib/htdig/db.wordlist
The default ASCII word database file.