Problem
The explosive growth of the Internet has become an overused cliche, yet the problems of
information overload remain as real as ever. Web search engines provide one way to manage the
deluge of information on the Internet, but they have some serious drawbacks for many
applications. Common search engines do not index dynamic content; any URL with a '?' is
ignored. Neither do search engines provide finer granularity than a single HTML page. Their
design makes them unsuitable for comparison shopping or data integration.
The DISL group has constructed a powerful set of information extraction tools to address
some of these problems. Several research challenges remain, however. The
following figure presents a simple architecture for a dynamic search engine.
Within this framework, there are several possible short projects suitable for a 7001 mini project
or an extended Special Problems project.
1. Design and implement a robot crawler that discovers new dynamic search engine
interfaces.
2. Design a technique to categorize a search engine by its contents (the pages that it
dynamically generates), the types of queries it responds to (query interface), or the
context of the search interface.
3. In conjunction with the categorization system, develop a user interface that helps users
select the types of sources that are applicable to their query (see the
AQR project for an example of a static system).
4. Improve the automated object extraction system. This task may itself be broken down into
individual projects.
Currently, the automated object extraction system works in two phases: (1) identify the
region of a dynamically generated web page that contains data objects; (2) discover how
the objects are separated (e.g. is there a single tag that separates objects?), and use the
separator to split the data region into objects.
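
To make the second phase concrete, here is a minimal Java sketch of one way separator discovery and splitting could work. It does not reproduce the DISL group's actual extraction system. The class name SeparatorSketch, both method names, and the "most frequent opening tag" heuristic are assumptions chosen for illustration only. Phase 1 (locating the data region) is not shown; the sketch assumes the region is already available as a string.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of phase 2; not the DISL implementation.
public class SeparatorSketch {
    // Matches an opening tag such as <p> or <td class="x"> and captures its name.
    // Closing tags (</p>) do not match because '/' is not a letter.
    private static final Pattern OPEN_TAG =
            Pattern.compile("<([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*>");

    /**
     * Assumed heuristic: the most frequent opening tag in the data region
     * is taken as the object separator. Returns null if the region has no tags.
     */
    public static String guessSeparator(String region) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = OPEN_TAG.matcher(region);
        while (m.find()) {
            counts.merge(m.group(1).toLowerCase(), 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) { // strict '>' keeps the earliest tag on ties
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** Splits the region so that each piece begins at an occurrence of the separator tag. */
    public static List<String> splitObjects(String region, String separator) {
        List<String> objects = new ArrayList<>();
        if (separator == null) {
            return objects;
        }
        Pattern sep = Pattern.compile(
                "<" + Pattern.quote(separator) + "\\b[^>]*>", Pattern.CASE_INSENSITIVE);
        Matcher m = sep.matcher(region);
        int start = -1;
        while (m.find()) {
            if (start >= 0) {
                objects.add(region.substring(start, m.start()).trim());
            }
            start = m.start();
        }
        if (start >= 0) {
            objects.add(region.substring(start).trim());
        }
        return objects;
    }
}
```

For a region consisting of three <p>...</p> blocks, two of which contain a <b> price, guessSeparator picks "p" and splitObjects returns one string per block. The frequency heuristic is deliberately naive: in a table it would prefer <td> over <tr> because cells outnumber rows, which is one reason separator discovery is listed as an open project.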
Background
You are expected to have a solid grasp of Java programming. Familiarity with XML is useful but
not required.
Deliverables
A report describing the work you did and how you evaluated your results, along with any source
code you produced to obtain them.
Evaluation
You will be graded on the novelty and quality of your report and implementation.