TEXT: Automatic Template Extraction from Heterogeneous Webpages

1. INTRODUCTION
Nowadays most information is stored in text databases, which consist of large collections of
documents drawn from heterogeneous web pages. This project extracts templates from such
heterogeneous pages: it uses algorithms that measure the similarity of the underlying template
structures in the documents and clusters the web documents by that similarity, so that a template
is extracted for each cluster. The Web is the largest data repository in the history of humankind,
and major efforts have been made to provide efficient access to the relevant information within
it. Although several techniques have been developed for Web data extraction, they are still not
widely used, mostly because they need a high degree of human intervention and produce
low-quality extraction results.

This project presents a domain-oriented approach to Web data extraction and discusses its
application to automatically extracting news from Web sites. Our approach is based on a highly
efficient tree structure analysis that produces very effective results. The HTML DOM follows a
naming convention for properties, methods, events, collections, and data types. All names are
defined as one or more English words concatenated together to form a single string. For
properties and methods, the name starts with the initial keyword in lowercase, and each
subsequent word starts with a capital letter. For example, a property that returns document meta
information such as the date the file was created might be named "fileDateCreated". In the
ECMAScript binding, properties are exposed as properties of a given object.

In Java, properties are exposed with get and set methods.

Non-HTML 4.0 interfaces and attributes: while most of the DOM interfaces can be mapped directly
to elements defined in the HTML 4.0 Recommendation, some of them cannot. Similarly, not all DOM
attributes have counterparts in the HTML 4.0 specification (and some do, but have been renamed
to avoid conflicts with scripting languages). Interfaces and attribute definitions that have links
to the HTML 4.0 specification have corresponding element and attribute definitions there; all
others are added by the DOM specification, either for convenience or for backwards compatibility
with "DOM Level 0" implementations.
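
The naming convention described above can be illustrated with a minimal Python sketch. The helper names `dom_name` and `java_accessors` are hypothetical and are not part of any DOM binding; they only show how a lowerCamelCase property name and the matching Java-style get and set method names could be derived from a list of English words.

```python
def dom_name(words):
    """Join English words into a lowerCamelCase, DOM-style name."""
    if not words:
        raise ValueError("at least one word is required")
    first, *rest = [w.lower() for w in words]
    # The first word stays lowercase; each later word starts with a capital.
    return first + "".join(w.capitalize() for w in rest)


def java_accessors(prop):
    """Hypothetical helper: getter and setter names for a property."""
    suffix = prop[0].upper() + prop[1:]
    return "get" + suffix, "set" + suffix


print(dom_name(["file", "date", "created"]))
print(java_accessors("fileDateCreated"))
```

In a script binding the property would be read directly under the first name, while a Java binding would expose it through the two accessor names.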

The steps of building a rule-based metadata extraction system are typically as follows:
first, some experts examine samples of the document collection and define rules for metadata
extraction; then, software developers implement these rules either as part of an expert system or
as part of an ad hoc rule engine. The accuracy, inventiveness, and appropriateness of the rules
that experts defined play a critical role in building a system with high accuracy.
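
As a rough illustration of the rule-based workflow described above, the following Python sketch applies a small set of hand-written rules to a document. The rules, the field names and the sample text are all hypothetical; in a real system they would come from experts who have inspected the document collection.

```python
import re

# Hypothetical rules, as an expert might define them after examining samples.
RULES = {
    "title": re.compile(r"^Title:\s*(.+)$", re.MULTILINE),
    "author": re.compile(r"^Author:\s*(.+)$", re.MULTILINE),
    "date": re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
}


def extract_metadata(text, rules=RULES):
    """Apply each rule; keep the first match per field, or None if no rule fires."""
    result = {}
    for field, pattern in rules.items():
        match = pattern.search(text)
        result[field] = match.group(1).strip() if match else None
    return result


sample = "Title: Template Extraction\nAuthor: A. Writer\nPublished 2012-03-15\n"
print(extract_metadata(sample))
```

The quality of the output depends entirely on the rules: a document that uses a different label or date format simply yields `None` for that field, which is why the accuracy of the expert-defined rules is critical.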

2. COMPANY PROFILE
Comdex Infotech, Chennai
Preview

Comdex Infotech is a leading global IT services provider, delivering technology-driven
business solutions that meet the strategic objectives of our clients. We provide optimal business
solutions in the fields of Banking, Financial, Health Care and Retail services. We also focus on
innovative, attractive and efficient Web Designing.
We believe in exploring and implementing simple solutions which are easy to understand,
develop, deploy and maintain. We always try to keep the client's investment cost as low as
possible while still delivering the best service.

Technology
Comdex Infotech works predominantly in the software development area and develops software
for practically all platforms using technologies currently used in the IT industry.
Comdex Infotech has been delivering engineering solutions across diverse industries,
enabling customers to foster product innovation, improve operational efficiencies, and decrease
time-to-market for their products.
Web: ASP.NET, ASP, JSP, XML/XSLT, SOAP, HTML/DHTML, JavaScript, Flash, Active X

Programming Languages: C, C++, OOAD Paradigms, JAVA, J2ME, J2EE, Dot NET
Technologies, PHP, Spring, Struts, JSF

Databases: Oracle, SQL Server, MySQL

Development Technologies: MS Project, Visio, CVS

Workflow Automation: Lotus Notes/Domino (Lotus Script, Formula, JavaScript)

Products
Comdex Infotech is experienced in financial services, systems integration and outsourcing, which
enables us to deliver innovative, results-driven solutions to government and commercial clients
around the world.

Customer Support provides timely, reliable, and cost-efficient assistance to our clients. Our
support organization is staffed with highly motivated, trained professionals dedicated to
providing quality support as quickly as possible.

Our range of IT services is offered in these areas:

Banking

Web Designing

Hospital Management

Financial

Enterprise Resource Planning

Retail

Student Management

Sales
Our quality assurance team tests every PC system for a minimum of 12 hours. Our
systems are guaranteed and come with our on-site and phone technical support.
Assembled Desktops
Branded Desktops
Laptops
Computer Accessories
Computer Peripherals
Software

Services
Comdex PC supports various services, such as:

On-Call Service
Annual Maintenance Contract
Software / Hardware Troubleshooting
Upgrading an Existing PC
Debugging Computer Networks
Computer Rental

Contacts
Comdex Infotech
No.24/19, Bharathi Street,
West Mambalam
Chennai - 600 033, INDIA
Phone : + 91 44 42136340
Mobile : + 91 94441 18194
Email : info@comdexinfotech.com
Website : www.comdexinfotech.com

3. SYSTEM ANALYSIS

System analysis is the process of thoroughly understanding the current system by gathering and
interpreting facts, diagnosing problems and using the facts to improve the current system. It is
the detailed study of the various operations performed by a system and their relationships within
and outside of the system.

System analysis is done in order to understand the problem and to emphasize what is
needed from the system. This part of the system development life cycle is a crucial phase in
which the information needs of the user are determined.

3.1 Existing System


An HTML document can be naturally represented as a Document Object Model (DOM) tree. Web
documents are therefore treated as trees, and many existing similarity measures for trees have
been investigated for clustering. However, clustering is very expensive with tree-related
distance measures. For instance, tree-edit distance has at least O(n1 n2) time complexity,
where n1 and n2 are the sizes of the two DOM trees, and the sizes of the trees are usually more
than a thousand. Thus, in practice, clustering is performed on a sample of the web documents in
order to handle a large number of them.
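
To give a feel for these numbers, the following Python sketch estimates a lower bound on the work involved. It assumes, purely for illustration, that every pair of documents is compared once and that each comparison costs n1 x n2 basic operations; the function name `pairwise_cost` and the collection sizes are hypothetical.

```python
def pairwise_cost(num_docs, avg_tree_size):
    """Lower-bound operation count for all-pairs tree-edit distance.

    Assumes one comparison per unordered document pair, each costing
    avg_tree_size * avg_tree_size operations (the O(n1 n2) bound).
    """
    pairs = num_docs * (num_docs - 1) // 2
    return pairs * avg_tree_size * avg_tree_size


full = pairwise_cost(10_000, 1_000)    # whole collection
sampled = pairwise_cost(100, 1_000)    # a 1% sample
print(full, sampled, full // sampled)
```

Under these assumptions, clustering a 1% sample is roughly ten thousand times cheaper than clustering the whole collection, which is why sampling is used despite the information it discards.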

The problem of extracting a template from web documents that conform to a common
template has been studied in earlier work. Because those solutions assume that all documents are
generated from a single common template, they are applicable only when all documents are
guaranteed to conform to a common template. However, in real applications, it is not trivial to
classify massively crawled documents into homogeneous partitions in order to use these
techniques.

3.2 Proposed System:


A related problem area is page-level template detection, where the template is computed
within a single document. Lerman et al. proposed systems to identify data records in a document
and extract data items from them. Zhai and Liu proposed an algorithm to extract a template using
not only structural information but also visual layout information. Chakrabarti et al. solved this
problem by using an isotonic smoothing score assigned by a classifier. Since the problem
formulation of this area is far from ours, we do not discuss it in detail. The algorithms
presented later represent web documents as a matrix and find clusters within the matrix.
Biclustering, or coclustering, is another clustering technique that deals with a matrix.
Coclustering algorithms find simultaneous clusterings of the rows and columns of a matrix
and require the numbers of row and column clusters as input parameters. However, the proposed
system clusters only documents, not paths, and moreover the numbers of row and column clusters
are unknown.
The proposed system represents a web document and a template as a set of paths in a DOM
tree. As validated by XPath, the most popular XML query language, paths are sufficient to
express tree structures and are useful for querying. By considering only paths, the overhead of
measuring the similarity between documents becomes small without significant loss of
information.
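
The path-based representation can be sketched in a few lines of Python using the standard `html.parser` module. This is an illustrative sketch, not the project's implementation: the class and function names are hypothetical, text nodes are treated as path leaves by assumption, and Jaccard similarity is used only as a simple stand-in for whatever similarity measure the clustering actually applies.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-node paths seen while parsing."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))
        if tag in VOID_TAGS:
            self.stack.pop()  # no end tag will follow

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: record the path, keep the stack.
        self.paths.add("/".join(self.stack + [tag]))

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths.add("/".join(self.stack + [text]))


def extract_paths(html):
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.paths


def jaccard(a, b):
    """Jaccard similarity of two path sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def path_matrix(docs):
    """Rows are distinct paths, columns are documents; 1 marks occurrence."""
    path_sets = [extract_paths(d) for d in docs]
    all_paths = sorted(set().union(*path_sets))
    matrix = [[1 if p in s else 0 for s in path_sets] for p in all_paths]
    return all_paths, matrix


d1 = "<html><body><h1>Tech</h1><br>World</body></html>"
d2 = "<html><body><h1>Tech</h1><br>Local</body></html>"
print(jaccard(extract_paths(d1), extract_paths(d2)))
paths, matrix = path_matrix([d1, d2])
for path, row in zip(paths, matrix):
    print(row, path)
```

Paths shared by both documents (here the `html/body/h1/Tech` branch) are candidates for template content, while paths that occur in only one document are more likely to be page-specific content. Comparing two path sets takes time proportional to their sizes, in contrast to the O(n1 n2) cost of tree-edit distance.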

3.3 Feasibility Study


The feasibility of the project is analyzed in this phase, and a business proposal is put forth
with a very general plan for the project and some cost estimates. During system analysis, the
feasibility study of the proposed system is carried out. This is to ensure that the proposed
system is not a burden to the company. For feasibility analysis, some understanding of the major
requirements for the system is essential.
Three key considerations involved in the feasibility analysis are
1. Economic Feasibility
2. Technical Feasibility
3. Social Feasibility
1. Economic Feasibility
This study is carried out to check the economic impact that the system will have on the
organization. The amount of funds that the company can pour into the research and development
of the system is limited. The expenditures must be justified. The developed system is well
within the budget; this was achieved because most of the technologies used are freely
available. Only the customized products had to be purchased.

2. Technical Feasibility
This study is carried out to check the technical feasibility, that is, the technical
requirements of the system. Any system developed must not place a high demand on the available
technical resources, since this would in turn place high demands on the client. The developed
system must have modest requirements, as only minimal or no changes are required to implement
this system.

3. Social Feasibility
This aspect of the study checks the level of acceptance of the system by the user. This
includes the process of training the user to use the system efficiently. The user must not feel
threatened by the system, but must instead accept it as a necessity.

The level of acceptance by the users depends solely on the methods that are employed to
educate users about the system and to make them familiar with it. Their level of confidence must
be raised so that they are also able to offer constructive criticism, which is welcomed, as they
are the final users of the system.

4. SOFTWARE ENGINEERING PARADIGM APPLIED


Software engineering is comprised of a set of steps that encompass methods, tools, and
procedures. These steps are often called software engineering paradigm. A paradigm for software
engineering is chosen based on the nature of the project and application, the methods and tools to
be used and the controls and deliverables that are needed.

There are several different software engineering paradigms (SEPs). They include

The Classic Life Cycle

The Prototyping Model

The RAD Model

The Incremental Model

The Spiral Model

The WINWIN Spiral Model

Of these SEPs, we chose the first and simplest paradigm, the classic life cycle (CLC),
for the Automatic Template Extraction from Heterogeneous Web Pages project. This paradigm is
also known as the Waterfall Model or Linear Sequential Model. The CLC model has the
following phases:

Analysis

Design

Code

Test

Support

[Figure: the classic life cycle (waterfall) model: System / Information Engineering, followed by Analysis, Design, Code, and Test]

The above figure illustrates the CLC model. This model was applied to develop the
system for Automatic Template Extraction from Heterogeneous web Pages. The five
phases of the paradigm are explained below.

Software Requirements Analysis


The requirements gathering process is intensified and focused specifically on software.
To understand the nature of the program to be built, the software engineer must understand the
information domain for the software, as well as the required function, behaviour, performance,
and interface. Requirements for the system and the software are documented and reviewed with
the customer.
Design
Software design is actually a multi-step process that focuses on four distinct attributes of a
program: data structure, software architecture, interface representations and procedural detail.
The design process translates requirements into a representation of the software that can be
assessed for quality before coding begins. Like requirements, the design is documented and
becomes part of the software configuration.

Code Generation

The design must be translated into a machine-readable form. The code generation step
performs this task. If design is performed in a detailed manner, code generation can be
accomplished mechanistically.

Testing

Once code has been generated, program testing begins. The testing process focuses on the
logical internals of the software, ensuring that all statements have been tested, and on the
functional externals; that is, conducting tests to uncover errors and ensure that defined input
will produce actual results that agree with required results.

Support
Software will undoubtedly undergo change after it is delivered to the customer. Change
will occur because errors have been encountered, because the software must be adapted to
accommodate changes in its external environment, or because the customer requires functional
or performance enhancements. Software support/maintenance reapplies each of the preceding
phases to an existing program rather than a new one.

5. HARDWARE & SOFTWARE REQUIREMENT SPECIFICATIONS


Hardware Configuration

Processor        : Intel Dual Core or later
CPU speed        : 1.2 GHz or higher
RAM              : 2 GB or more
Hard Disk        : 320 GB or more

Software Configuration

Operating System : Windows XP or later
Web Framework    : ASP.NET 3.5 (2008) or later
Database Server  : SQL Server 2008
Web Server       : ASP.NET Development Server
Web Browser      : Internet Explorer / Firefox / Chrome / Opera / Safari

6. SOFTWARE PROFILE
6.1 Windows XP
Windows XP is a family of 32-bit and 64-bit operating systems produced by
Microsoft for use on personal computers, including home and business desktops, notebook
computers, and media centers. The name "XP" stands for experience. Windows XP is the
successor to both Windows 2000 Professional and Windows Me, and is the first consumer-oriented
operating system produced by Microsoft to be built on the Windows NT kernel (version
5.1) and architecture. Windows XP was first released on 25 October 2001, and over 400 million
copies were in use in January 2006, according to an estimate in that month by an IDC analyst.

The most common editions of the operating system are Windows XP Home Edition,
which is targeted at home users, and Windows XP Professional, which offers additional features
such as support for Windows Server domains and two physical processors, and is targeted at
power users, business and enterprise clients.
Windows XP is known for its improved stability and efficiency over the 9x versions of
Microsoft Windows. It presents a significantly redesigned graphical user interface, a change
Microsoft promoted as more user-friendly than previous versions of Windows.
Windows XP has also been criticized by some users for security vulnerabilities, tight
integration of applications such as Internet Explorer 6 and Windows Media Player, and for
aspects of its default user interface. Later versions with Service Pack 2, and Internet Explorer 7
addressed some of these concerns.

As of the end of September 2008, Windows XP was the most widely used operating
system in the world, with a 69% market share, having peaked at 85% in December 2006.

Features
Improved Device Support
Windows XP provides new and/or improved drivers and user interfaces for devices compared to
Windows Me and 98. Windows Image Acquisition (WIA), originally introduced in Windows Me,
replaced the traditional TWAIN support for scanners and digital cameras. As TWAIN does not
separate the user interface from the driver of a device, it is difficult to provide transparent
network access; whenever an application loads a TWAIN driver, the driver cannot be separated
from the manufacturer-supplied GUI.

On old versions of Windows, when users upgrade a device driver, there is a chance the new
driver is less efficient or functional than the original. Reinstalling the old driver can be a major
hassle and to avoid this quandary, Windows XP keeps a copy of an old driver when a new
version is installed. If the new driver has problems, the user can return to the previous version.
This feature does not work with printer drivers.
Improved Interface
Windows XP includes a new set of visual themes, known by its codename, Luna. Available in
three schemes, the interface is more task-based than the basic one included since Windows 95,
with options available in Explorer windows to interact with each file. It also includes other
modifications, such as grouping of related programs, hiding of taskbar icons, and many other
elements.
Fast User Switching
Fast User Switching allows another user to log in and use the system without having to log out
the previous user and quit his or her applications. Previously (on both Windows Me and
Windows 2000) only one user at a time could be logged in (except through Terminal Services),
which was a serious drawback to multi-user activity. Fast User Switching, like Terminal
Services, requires more system resources than having only a single user logged in at a time and
although more than one user can be logged in, only one user can be actively using their account
at a time. This feature is not available when the Welcome Screen is turned off, such as when
joined to a Windows Server Domain or with Novell Client installed.
Remote Assistance
Remote Assistance allows a Windows XP user to temporarily take over a remote Windows XP
computer over a network or the internet to resolve issues. As it can be a hassle for system
administrators to personally visit the affected computer, Remote Assistance allows them to
diagnose and possibly even repair problems with a computer without ever personally visiting it.
CD Burning
Windows XP includes technology from Roxio which allows users to directly burn files to a
compact disc through Windows Explorer. Previously, end users had to install CD burning
software, such as Nero Burning ROM. Now, CD and DVD-RAM burning has been directly
integrated into the Windows interface; users burn files to a CD in the same way they write files
to a floppy disk or to the hard drive. The burning functionality is also exposed as an API called
the Image Mastering API. Windows XP's CD burning support does not do disk to disk copying or
disk images although the API can be used programmatically to do these tasks. Creation of audio
CDs is integrated into Windows Media Player.
ClearType
Windows XP includes ClearType sub-pixel font anti-aliasing, which makes onscreen fonts
smoother and more readable on liquid crystal display (LCD) screens, although this causes a
minor performance hit. Although ClearType has an effect on cathode ray tube (CRT) monitors,
its primary use is for LCD/TFT-based (laptop, notebook and modern 'flat screen') displays.

Remote Desktop
Users can log into Windows XP Professional remotely through the Remote Desktop service. It is
built on Terminal Services technology (RDP), and is similar to Remote Assistance, but allows
remote users to access local resources such as printers. Any Terminal Services client, a special
"Remote Desktop Connection" client, or a web-based client using an ActiveX control may be
used to connect to the Remote Desktop. (Microsoft has made Remote Desktop clients available for
earlier versions of Windows: Windows 95, Windows 98 and 98 Second Edition, Windows Me,
Windows NT 4.0, and Windows 2000. This permits earlier versions of Windows to connect to a
Windows XP system running Remote Desktop, but not vice versa.)

There are several resources that users can redirect from the remote server machine to the local
client, depending upon the capabilities of the client software used. For instance, File System
Redirection allows users to use their local files on a remote desktop within the terminal session,
while Printer Redirection allows users to use their local printer within the terminal session as
they would with a locally or network shared printer. Port Redirection allows applications running
within the terminal session to access local serial and parallel ports directly, and Audio allows
users to run an audio program on the remote desktop and have the sound redirected to their local
computer. The clipboard can also be shared between the remote computer and the local
computer.
Power Management
Before Windows 98, power management was based on the Advanced Power Management
architecture. It was of limited use to most users and the feature was easily broken by the addition
of hardware devices or software. Windows XP's power management architecture is based on the
ACPI standard and still supports APM. (In Windows 98 ACPI was supported but disabled by
default. Windows Me enabled ACPI by default.) It supports multiple levels of sleep states,
including critical sleep states when a mobile (or UPS connected) computer is running out of
battery power, processor power control (the ability to adjust the speed of the computer's
processor on-the-fly to save energy), selective suspend of externally attached (such as USB)
devices, and turning off the power to the screen of a laptop when the lid is closed. In addition, it
also dims the screen when the laptop has low battery power.
Hibernate Mode
When Windows XP hibernates it dumps the entire contents of the RAM to disk and
powers down the entire machine. On start-up it quickly reloads the data back to RAM. This
allows the system to be completely powered off while in hibernate mode. This requires a file the
size of the installed RAM to be placed in the system's root directory, using up space even when
not in hibernation. Hibernation is enabled by default and can be disabled in order to recover disk
space.
Standby (Sleep) Mode
When Windows enters standby mode, it turns off all nonessential hardware, including the
monitor, hard drives, and removable drives. This means that the system reactivates itself very
quickly when "woken up". This does not power down the system. In order to save power without
user intervention, a system can be configured to go to standby when idle and then hibernate if not
re-activated.

The Windows Standby feature conforms to the S1 and S3 Sleep States in the ACPI standards.
Kernel Improvements
The Windows XP kernel is completely different from the kernel of the Windows 9x/Me line of
operating systems. As an upgrade of the Windows 2000 kernel, the improvements are major,
albeit transparent to the end user. They include some enhancements to the scalability and
performance of the system.

Windows XP includes Simultaneous Multithreading support, that is, the ability to utilize the
Hyper-Threading feature of newer Intel Pentium 4 processors. Simultaneous Multithreading is a
processor's ability to process more than one thread at a time. Intel has described the effect
as giving roughly 70% of the processing power of having two processors.

The ability to boot in 30 seconds was a design goal for Windows XP, and Microsoft's developers
made efforts to streamline the system as much as possible; many people have found that without
extra services Windows XP can boot from the PC's power-on self-test (POST) to the Windows
GUI in about 30 seconds. The Prefetcher is a significant part of this; it monitors which files
are loaded during boot, and optimizes the locations of these files on disk so that less time is
spent waiting for the hard drive's heads to move.

Application Compatibility
As Windows XP merged the consumer and enterprise versions of Windows into one, it folded the
user-friendly interface of Windows Me onto the kernel of Windows 2000. A drawback of this is
that older software designed for previous versions of Windows may not function. Microsoft
addressed this by going to great lengths to improve compatibility with application-specific
tweaks and shims, and by providing tools to allow users to try these tweaks and shims on their
own applications.
Application Isolation & Side-by-Side Assemblies
A common issue in previous versions of Windows was that users frequently suffered from DLL
hell, where more than one version of the same Dynamic-Link Library (DLL) was installed
on the computer. As software relies on DLLs, using the wrong version could result in
non-functional applications, or worse. Windows XP solved this problem by introducing
side-by-side assemblies.

The technology keeps multiple versions of a DLL in the WinSxS folder and loads them on
demand for the appropriate application, keeping applications isolated from each other and not
using common dependencies.

Windows XP also introduced a new mode of COM object registration called Registration-free
COM. This makes it possible for applications that need to install COM objects to store all the
required COM registry information in the application's directory, instead of in the global registry,
where, strictly speaking, only a single application will ever use it.

DLL hell can be substantially avoided using Registration-free COM, the only limitation being it
requires at least Windows XP or later Windows versions and that it must not be used for EXE
COM servers or system-wide components such as MDAC, MSXML, DirectX or Internet
Explorer.

6.2 ASP.NET

The .NET Framework is the next evolution of the Microsoft development platform. It
supports multiple languages and allows them to interact seamlessly with each other via common
standards. One of these is the Common Language Runtime (CLR), which provides a set of services
common to all applications, such as memory management, cross-language integration, access
security, debugging and more. Another important common standard is the Common Type System
(CTS), which provides a standard set of data types used by all .NET languages. A powerful
versioning system is part of the .NET Framework; it avoids the DLL collision problems of the
past. Powerful object-oriented features have been introduced to all .NET languages, and the
framework also provides a rich library of well-defined base classes. It is intended for
highly distributed software, making internet functionality and interoperability easier and more
transparent than ever before.

Visual Studio .NET is a complete set of development tools for building ASP web
applications, XML web services, desktop applications, and mobile applications. Visual Basic
.NET, Visual C++ .NET and Visual C# .NET, all use the same IDE, which allows them to share
tools and facilities in the creation of mixed-language solutions. In addition, these languages
leverage the functionality of the .NET Framework, which provides access to key technologies
that simplify the development of ASP web applications and XML web services.

The .NET Framework is a multi-language environment for building, deploying and
running XML web services and applications. It consists of three main parts: the Common
Language Runtime, a set of unified programming classes, and ASP.NET. The Common Language
Runtime is the engine that drives key functionality; it provides memory management features
such as automatic garbage collection, as well as type-safety checking.
Despite its name, the runtime actually has a role in both a component's runtime and
development-time experiences. While the component is running, the runtime is responsible for
managing memory allocation, starting up and stopping threads and processes, and enforcing
security policy, as well as satisfying any dependencies that the component might have on other
components. At development time, the runtime's role changes slightly: because it automates so
much, the runtime makes the developer's experience very simple, especially when compared to COM
as it is today. In particular, features such as reflection dramatically reduce the amount of
code a developer must write in order to turn business logic into a reusable component.

Unified Programming Classes


The framework provides developers with a unified, object-oriented, hierarchical,
and extensible set of class libraries (APIs). Currently, C++ developers use the Microsoft
Foundation Classes and Java developers use the Windows Foundation Classes. The framework
unifies these disparate models and gives Visual Basic and JScript programmers access to class
libraries as well. By creating a common set of APIs across all programming languages, the CLR
enables cross-language inheritance, error handling, and debugging. All programming languages
have similar access to the framework, and developers are free to choose the language that they
want to use.
ASP .NET
ASP.NET builds on the programming classes of the .NET Framework, providing a
web application model with a set of controls and infrastructure that make it simple to build ASP
web applications. ASP.NET includes a set of controls that encapsulate common HTML user
interface elements, such as textboxes and drop-down menus. These controls run on the web
server, however, and push their user interface as HTML to the browser. On the server, the
controls expose an object-oriented programming model that brings the richness of
object-oriented programming to the web developer. ASP.NET also provides infrastructure
services, such as session state management and process recycling, that further reduce the
amount of code a developer must write and increase application reliability. In addition,
ASP.NET uses these same concepts to enable developers to deliver software as a service.

Objectives of .NET Framework


The .NET framework is a new computing platform that simplifies application
development in the highly distributed environment of the Internet.
Its objectives are:

To provide a consistent object-oriented programming environment, whether object code is
stored and executed locally, executed locally but Internet-distributed, or executed remotely.

To provide a code-execution environment that minimizes software deployment and versioning
conflicts.

To provide a code-execution environment that guarantees safe execution of code, including
code created by an unknown or semi-trusted third party.

To provide a code-execution environment that eliminates the performance problems of
scripted or interpreted environments.

To make the developer experience consistent across widely varying types of applications,
such as Windows-based and web-based applications.

To build all communication on industry standards to ensure that code based on the .NET
framework can integrate with any other code.

.NET Framework Class Library


The .NET framework class library is a collection of reusable types that tightly integrate
with the common language runtime. The class library is object-oriented, providing types from
which your own managed code can derive functionality. This not only makes the .NET
framework types easy to use, but also reduces the time associated with learning new features of
the .NET framework. The .NET framework can be used to develop the following types of
applications and services:

Console Applications

Scripted or Hosted Applications

Windows GUI Applications

ASP.NET Applications

XML Web Services

Windows Services

ASP.NET Web Applications in Visual Studio


Visual Studio .NET allows you to create applications that leverage the power of the
World Wide Web. This includes everything from a traditional web site that serves HTML pages, to
fully featured business applications that run on an intranet or the Internet, to sophisticated
business-to-business applications providing web-based components that can exchange data using XML.

ASP.NET builds on the programming classes of the .NET framework, providing a web
application model with a set of controls and infrastructure that make it simple to build ASP
web applications. ASP.NET includes a set of controls that encapsulate common HTML user
interface elements, such as text boxes and drop-down menus. These controls run on the web
server, however, and push their user interface as HTML to the browser.

On the server, the controls expose an OOP model that brings the richness of OOP to
the web developer. ASP.NET also provides infrastructure services, such as session state
management and process recycling, that further reduce the amount of code a developer must write
and increase application reliability. In addition, ASP.NET uses these same concepts to enable
developers to deliver software as a service. Using XML web services features, ASP.NET developers
can write their business logic and use the ASP.NET infrastructure to deliver that service via SOAP.

Visual Studio ASP.NET Web Applications
A Visual Studio web application is built around ASP.NET. ASP.NET is a platform,
including design-time objects and controls and a run-time execution context, for developing and
running applications on a web server.

ASP.NET in turn is part of the .NET framework, so it provides access to all of the
features of that framework. For example, you can create ASP.NET web applications using any
.NET programming language (Visual Basic, C#, Managed Extensions for C++, and many others)
and the .NET debugging facilities. You access data using .NET framework classes, and so on.
ASP.NET web applications run on a web server configured with Microsoft Internet
Information Services (IIS). However, you do not need to work directly with IIS. You can
program IIS facilities using ASP.NET classes, and Visual Studio handles file management tasks
such as creating IIS applications when needed and providing ways for you to deploy your web
applications to IIS.
Features of ASP.NET
ASP.NET Page Framework and Web Forms Page

The ASP.NET page framework is a programming framework that runs on a web
server to dynamically produce and manage Web Forms pages. In Visual Studio, Web Forms
provides a form designer, editor, controls, and debugging, which together allow you to rapidly
build server-based, programmable user interfaces for browsers and web client devices.

Web Forms pages run on any browser or client device. However, you can design
your Web Forms page to target a specific browser, such as Microsoft Internet Explorer 5.0, and
take advantage of the features of a specific browser or client device. ASP.NET supports mobile
controls for web-enabled devices such as cellular phones, handheld computers, and PDAs.

The ASP.NET page framework creates an abstraction of the traditional client-server
web interaction so that you can program your application using traditional methods and
tools that support Rapid Application Development (RAD) and OOP.

Within Web Forms pages you can work with HTML elements using properties,
methods, and events. The ASP.NET page framework removes the implementation details of the
separation of client and server inherent in web-based applications by presenting a unified model
for responding to client events in code that runs on the server. The framework also automatically
maintains the state of the page and the controls on that page during the page processing life
cycle.
XML Web Services

ASP.NET supports XML web services. An XML web service is a component
containing business functionality exposed through Internet protocols. XML web services
enable web-based applications to exchange information using standards
like HTTP and XML messaging to move data across firewalls. XML web services are not tied to
a particular component technology or object-calling convention. As a result, programs written in
any language, using any component model, and running on any OS can access XML web
services.
State Management Facilities

ASP.NET provides intrinsic state management functionality that allows you to save
and manage application-specific, session-specific, and developer-defined information. This
information can be independent of any controls on the page. It can be shared between pages,
such as customer information. ASP.NET offers distributed state facilities. You can create multiple
instances of the same application on one computer or on several computers.

ASP.NET Optimization
In this day of business-to-business and business-to-consumer, e-commerce, slow web
applications can waste resources and drive customers away from your company. Web site
performance is an extremely important issue for the developers writing code and for the system
administrator maintaining applications.
ASP.NET incorporates a variety of features and tools that allow you to design and
implement high-performance web applications.
The features include:

An improved process model

Compilation of requested pages and automatic storage on the server

ASP.NET-specific performance counters

Web application testing tools

ASP.NET gives you the ability to create web applications that meet the demands that
arise when they must process large numbers of requests simultaneously.

Application Events
ASP.NET allows you to include application event-handling code in the optional
Global.asax file. You can use application events to manage application-wide information and
perform orderly application start-up and clean-up tasks.

Compilation

All ASP.NET code, including server scripts, is compiled, which allows for
strong typing, performance optimizations, and early binding, among other benefits. Once the code
has been compiled, the runtime further compiles it to native code, providing improved
performance.
Configuration
ASP.NET configuration settings are stored in XML based files. Since these
XML files are ASCII text files, you can read and modify them, so it is simple to make
configuration changes to your web application. Each of your applications can have its own
configuration file and you can extend the configuration scheme to suit your requirements.
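
As a rough illustration of why XML-based settings are easy to read and change, the sketch below parses a small configuration document and updates one value. It is written in Python purely for illustration rather than in a .NET language. The element names (configuration, appSettings, add) follow the common Web.config layout, but the keys, values, and helper function names are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration text; the keys and values are invented.
CONFIG_XML = """
<configuration>
  <appSettings>
    <add key="SiteTitle" value="Template Extraction Demo" />
    <add key="MaxClusters" value="8" />
  </appSettings>
</configuration>
"""


def read_app_settings(xml_text):
    """Return the <appSettings> entries as a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {node.get("key"): node.get("value")
            for node in root.findall("./appSettings/add")}


def update_setting(xml_text, key, value):
    """Return new XML text in which one setting has been changed."""
    root = ET.fromstring(xml_text)
    for node in root.findall("./appSettings/add"):
        if node.get("key") == key:
            node.set("value", value)
    return ET.tostring(root, encoding="unicode")


settings = read_app_settings(CONFIG_XML)
updated = read_app_settings(update_setting(CONFIG_XML, "MaxClusters", "12"))
```

Because the file is plain text with a regular structure, the same change could equally be made by hand in any text editor.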

Security
ASP.NET provides default authorization and authentication schemes for web
applications. You can easily remove, add to, or replace these schemes depending upon the needs
of your application.

Debugging Support
ASP.NET takes advantage of the runtime debugging infrastructure to provide
cross-language and cross-compiler debugging support, both locally and remotely from a web
server. In addition, the ASP.NET page framework provides a trace mode that enables you to insert
instrumentation messages into your forms.

Extensibility
Mobile Web Forms and mobile Web Forms controls offer the same extensibility
features available in ASP.NET and add support for working with multiple devices. Specifically,
the following kinds of extensibility are provided:

New mobile controls can be written and used in a mobile web forms page.

ASP.NET user controls can be used to write simple mobile controls declaratively.

The output of any control can be customized on a device-specific basis by adding a new
adapter for the control.

Support for an entirely new device can be added by using adapter extensibility, without
any changes to individual applications.

Web Server Controls


A developer can create mobile web forms pages consisting of these controls either
in the Mobile Internet Designer or with any text editor. The following controls are available in
the Mobile Internet Controls Runtime:

Calendar
The Calendar control offers date-picking functionality and is displayed on mobile
devices. If a user selects a date, browses to another page, and browses back to the page with the
Calendar control, the user has no indication of which date they selected. Usability is improved if
the page displays the selected date.

Command
The command control provides a way to invoke Microsoft ASP.NET event
handlers from UI elements.

CompareValidator

The CompareValidator control compares one control to another by using a
specified comparison operator.

CustomValidator
The CustomValidator control allows the developer to provide a custom method to
validate another control's field.

Form

The form control represents the outermost grouping of controls within a mobile
page object.

Image
The Image control provides the capability to specify the image that you want to
display on a wireless device.

RangeValidator
The RangeValidator control validates that the values of another control fall within
an allowable range, where the minimum and the maximum are provided either directly or by
reference to another control.

RegularExpressionValidator
The RegularExpressionValidator control validates that the values of another control match a
specified expression.

RequiredFieldValidator
The RequiredFieldValidator control validates that the value of another control is something
other than its initial value.

Textbox

The TextBox control generates a single-line text box.

Label

The label control creates a text-based control that displays output-only text on a
mobile device.

Hyperlink

The Hyperlink control creates a text-based, output-only control that represents a
hyperlink to another form control on a mobile page or an arbitrary URL.

Panel

The Panel control provides a grouping mechanism for organizing controls. Panel
controls can be recursively nested within a Form control.

Accessing Data with ADO.NET


ActiveX Data Objects for the .NET framework (ADO.NET) is a set of classes
that expose data access services to the .NET programmer. ADO.NET provides a rich set of
components for creating distributed, data sharing applications. It is an integral part of the .NET
framework, providing access to relational data, XML and application data. ADO.NET supports a
wide variety of development needs, including the creation of front end database clients and
middle-tier business objects used by applications, tools, languages or Internet browsers.
ASP.NET includes data access tools that make it easier for you to
design sites that allow your users to interact with databases through web pages. The .NET
framework includes two data providers for accessing enterprise databases: the OLE DB .NET
data provider and the SQL Server .NET data provider. SQL databases can be accessed from
ASP.NET using classes such as SqlConnection, DataSet, and SqlDataReader.
The .NET framework includes three controls that make the display of large amounts of
data easier:

The Repeater control

The DataList control

The DataGrid control

Overview of ADO.NET

ADO.NET provides consistent access to data sources such as Microsoft SQL Server,
as well as data sources exposed via OLE DB and XML. Data-sharing consumer applications can
use ADO.NET to connect to these data sources and retrieve, manipulate, and update data.

ADO.NET cleanly factors data access from data manipulation into discrete
components that can be used separately or in tandem. It includes .NET data providers for
connecting to a database, executing commands, and retrieving results. Those results are either
processed directly, or placed in an ADO.NET DataSet object in order to be exposed to the user in
an ad hoc manner, combined with data from multiple sources, or remoted between tiers. The
ADO.NET DataSet object can also be used independently of a .NET data provider to manage data
local to the application or sourced from XML.

ADO.NET Components
The ADO.NET components have been designed to factor data access from
data manipulation. Two central components of ADO.NET accomplish this: the
DataSet, and the .NET data provider, which is a set of components including the Connection,
Command, DataReader, and DataAdapter objects. The ADO.NET DataSet is the core component
of the disconnected architecture of ADO.NET.

Internet Information Server (IIS)
Internet Information Server is a web server developed by Microsoft that runs on
the Windows NT/Windows 2000 platform. Because Internet Information Server 5.0 (IIS) is fully
integrated at the operating system level, Windows 2000 Server lets organizations add Internet
capabilities that weave directly into the rest of their computing infrastructure.

Application Protection
Internet Information Server 5.0 offers improved protection and increased
reliability for web applications. By default, IIS runs all applications in a common or pooled
process that is separate from core IIS processes. In addition, administrators can still isolate
mission-critical applications that should be run outside of both core IIS and pooled
processes.
Integrated setup and upgrade

Internet Information Server 5.0 installs as a networking service of Windows
2000 Server. Customers with any existing version of Windows NT Server 3.5 or 4.0 will
automatically be upgraded to the new features and services of Windows 2000 Server and IIS.

Remote Administration

Internet Information Server 5.0 has web-based administration tools that allow
remote management of a server from almost any browser on any platform. With IIS 5.0,
administrators can set up administration accounts with limited privileges on web sites, to help
distribute administrative tasks.

Certificate Storage

IIS certificate storage is now integrated with Windows CryptoAPI storage. The
Windows certificate manager provides a single point of entry that lets administrators store, back
up, and configure server certificates.

Kerberos Version 5 Authentication Protocol Compliance
IIS is fully integrated with the Kerberos v5 authentication protocol implemented in
Microsoft Windows 2000. This means administrators can pass authentication credentials among
connected computers running Windows.

6.3 HTML

HTML, an initialism of Hypertext Markup Language, is the predominant markup
language for Web pages. It provides a means to describe the structure of text-based information
in a document by denoting certain text as links, headings, paragraphs, lists, and so on, and
to supplement that text with interactive forms, embedded images, and other objects.

HTML is written in the form of tags, surrounded by angle brackets. HTML can also
describe, to some degree, the appearance and semantics of a document, and can include
embedded scripting language code (such as JavaScript) which can affect the behavior of Web
browsers and other HTML processors.

Elements
Elements are the basic structure for HTML markup. Elements have two basic properties:
attributes and content. Each attribute and each element's content has certain restrictions that must
be followed for an HTML document to be considered valid. An element usually has a start tag
(e.g. <element-name>) and an end tag (e.g. </element-name>). The element's attributes
are contained in the start tag and content is located between the tags
(e.g. <element-name attribute="value">Content</element-name>). Some elements, such as
<br>, do not have any content and must not have a closing tag. Listed below are several types of
markup elements used in HTML.
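
The start tag, attribute, content, and end tag structure described above can be observed with a small parser. The following is a minimal sketch in Python using the standard library's html.parser module; the class name ElementLogger and the sample markup are assumptions chosen for illustration and are not part of the TEXT system.

```python
from html.parser import HTMLParser


class ElementLogger(HTMLParser):
    """Record start tags (with attributes), text content, and end tags."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs from the start tag.
        self.events.append(("start", tag, dict(attrs)))

    def handle_data(self, data):
        # Content is the text located between the tags.
        if data.strip():
            self.events.append(("content", data.strip()))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


parser = ElementLogger()
parser.feed('<p class="notation">First line<br>second line</p>')
parser.close()
events = parser.events
```

In this sample the p element produces a start event carrying its class attribute and a matching end event, while br produces only a start event because it has no content and no closing tag.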

Structural markup describes the purpose of text. For example, <h2>Golf</h2>
establishes "Golf" as a second-level heading, which would be rendered in a browser in a manner
similar to a section subheading. Structural markup does not denote
any specific rendering, but most Web browsers have standardized on how elements should be
formatted. Text may be further styled with Cascading Style Sheets (CSS).

Presentational markup describes the appearance of the text, regardless of its function. For
example <b>boldface</b> indicates that visual output devices should render "boldface" in
bold text, but gives no indication what devices which are unable to do this (such as aural devices
that read the text aloud) should do. In the case of both <b>bold</b> and <i>italic</i>,
there are elements which usually have an equivalent visual rendering but are more semantic in
nature, namely <strong>strong emphasis</strong> and <em>emphasis</em>
respectively. It is easier to see how an aural user agent should interpret the latter two elements.
However, they are not equivalent to their presentational counterparts: it would be
undesirable for a screen reader to emphasize the name of a book, for instance, but on a screen
such a name would be italicized. Most presentational markup elements have become deprecated
under the HTML 4.0 specification, in favor of CSS-based style design.

Attributes
Most of the attributes of an element are name-value pairs, separated by "=", and written
within the start tag of an element, after the element's name. The value may be enclosed in single
or double quotes, although values consisting of certain characters can be left unquoted in HTML
(but not XHTML). Leaving attribute values unquoted is considered unsafe. Most elements can
take any of several common attributes:

The id attribute provides a document-wide unique identifier for an element. This can be
used by style sheets to provide presentational properties, by browsers to focus attention on
the specific element, or by scripts to alter the contents or presentation of an element.

The class attribute provides a way of classifying similar elements for presentation
purposes. For example, an HTML document might use the designation
class="notation" to indicate that all elements with this class value are subordinate to
the main text of the document. Such elements might be gathered together and presented as
footnotes on a page instead of appearing in the place where they occur in the HTML source.

An author may use the style attribute to assign presentational properties to a
particular element. It is considered better practice to use an element's id or class attributes to
select the element with a style sheet, though sometimes this can be too cumbersome for a simple ad
hoc application of styled properties.

The title attribute is used to attach subtextual explanation to an element. In most
browsers this attribute is displayed as what is often referred to as a tooltip.

Character and entity references

As of version 4.0, HTML defines a set of 252 character entity references and a set of
1,114,050 numeric character references, both of which allow individual characters to be written
via simple markup, rather than literally. A literal character and its markup counterpart are
considered equivalent and are rendered identically.

The ability to "escape" characters in this way allows for the characters < and & (when
written as &lt; and &amp;, respectively) to be interpreted as character data, rather than
markup. For example, a literal < normally indicates the start of a tag, and & normally indicates
the start of a character entity reference or numeric character reference; writing it as &amp; or
&#x26; or &#38; allows & to be included in the content of elements or the values of attributes.
The double-quote character ("), when used to quote an attribute value, must also be escaped as
&quot; or &#x22; or &#34; when it appears within the attribute value itself. The single-quote
character ('), when used to quote an attribute value, must also be escaped as &#x27; or
&#39; (should NOT be escaped as &apos; except in XHTML documents) when it appears
within the attribute value itself.
However, since document authors often overlook the need to escape these characters,
browsers tend to be very forgiving, treating them as markup only when subsequent text appears
to confirm that intent.

Escaping also allows for characters that are not easily typed or that aren't even available
in the document's character encoding to be represented within the element and attribute content.
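
The escaping rules above can be tried out directly. The sketch below uses Python's standard html module; the sample strings are invented for illustration, and the choice of Python is an assumption of this example rather than part of the project's implementation.

```python
import html

# A string containing the characters that would otherwise be read as markup.
raw = 'Tom & Jerry <b> said "hi"'

# quote=True also escapes the double-quote and single-quote characters,
# which matters when the text is placed inside an attribute value.
escaped = html.escape(raw, quote=True)

# Unescaping converts the references back to the literal characters.
round_trip = html.unescape(escaped)

# Entity, decimal, and hexadecimal references for & are equivalent.
numeric = html.unescape("&#38; &#x26; &amp;")
```

Here escaped holds the reference form of the string, and round_trip is identical to the original text, which reflects the statement that a literal character and its markup counterpart are considered equivalent.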
Data Types
HTML defines several data types for element content, such as script data and style sheet
data, and a plethora of types for attribute values, including IDs, names, URIs, numbers, units of
length, languages, media descriptors, colors, character encodings, dates and times, and so on. All
of these data types are specializations of character data.
The Document Type Declaration
HTML documents are required to start with a Document Type Declaration (informally, a
doctype). In browsers, the function of the doctype is to select the rendering mode,
particularly to avoid quirks mode.

The original purpose of the document type is to enable validation based on Document
Type Definition (DTD) with SGML tools. The DTD to which the DOCTYPE refers contains
machine-readable grammar specifying the permitted and prohibited content for a document
conforming to such a DTD. Browsers do not read the DTD, however. HTML5 validation is not
DTD-based, so in HTML5 the document type does not refer to a DTD.

HTTP
The World Wide Web is composed primarily of HTML documents transmitted from a
Web server to a Web browser using the Hypertext Transfer Protocol (HTTP). However, HTTP
can be used to serve images, sound, and other content in addition to HTML.

To allow the Web browser to know how to handle the document it received, an indication
of the file format of the document must be transmitted along with the document. This vital
metadata includes the MIME type (text/html for HTML 4.01 and earlier,
application/xhtml+xml for XHTML 1.0 and later) and the character encoding.
In modern browsers, the MIME type that is sent with the HTML document affects how
the document is interpreted. A document sent with an XHTML MIME type, or served as
application/xhtml+xml, is expected to be well-formed XML, and a syntax error causes the
browser to fail to render the document.

The same document sent with an HTML MIME type, or served as text/html, might be
displayed successfully, since Web browsers are more lenient with HTML. However, XHTML
parsed in this way is not considered either proper XHTML or HTML, but so-called tag soup.
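
The difference between strict XML handling and lenient HTML handling can be sketched as follows. This example is only an approximation: Python's standard xml.etree.ElementTree and html.parser modules stand in for a browser's two parsing modes, and the function names (parses_as_xml, tags_seen_as_html, choose_parser) and the sample markup are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Markup with an unclosed <p> and a bare <br>: common tag soup.
MARKUP = "<p>Unclosed paragraph<br>with a bare line break"


def parses_as_xml(text):
    """Strict mode: any well-formedness error rejects the whole document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Lenient mode: collect whatever start tags can be recognised."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tags_seen_as_html(text):
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags


def choose_parser(content_type):
    """Pick a parsing mode from a Content-Type header value."""
    media_type = content_type.split(";")[0].strip().lower()
    return "xml" if media_type == "application/xhtml+xml" else "html"


xml_ok = parses_as_xml(MARKUP)
html_tags = tags_seen_as_html(MARKUP)
```

The strict parser rejects the sample outright, while the lenient parser still reports the p and br tags, which mirrors the behaviour described for the two MIME types.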
HTML E-Mail
Most graphical e-mail clients allow the use of a subset of HTML (often ill-defined) to
provide formatting and semantic markup capabilities not available with plain text, like
emphasized text, block quotations for replies, and diagrams or mathematical formulas that could
not easily be described otherwise.

Many of these clients include both a GUI editor for composing HTML e-mail messages
and a rendering engine for displaying received HTML messages. Use of HTML in e-mail is
controversial because of compatibility issues, because it can be used in privacy attacks, because
it can confuse spam filters, and because the message size is larger than plain text.

6.4 SQL Server


Microsoft SQL Server 2005 extends the performance, reliability, quality, and ease of use of
Microsoft SQL Server 2000. SQL Server 2005 includes several new features that make it an
excellent database platform for large-scale online transaction processing (OLTP), data
warehousing, and e-commerce applications.

The Repository component available in SQL Server version 7.0 is now called Microsoft
SQL Server 2000 Meta Data Services. References to the component now use the term Meta Data
Services. The term repository is used only in reference to the repository engine within Meta Data
Services.
Microsoft SQL Server 2005 provides a scalable database that combines ease of use with
complex analysis and data warehousing tools. SQL Server includes a rich Graphical User
Interface (GUI) along with a complete development environment for creating data-driven
applications.
SQL Server Architecture
Microsoft SQL Server 2005 is designed to work effectively as:

A central database on a server shared by many users who connect to it over a
network.

A desktop database that services only applications running on the same
desktop.

Features in SQL Server 2005

Internet Integration
The SQL Server 2000 database engine includes integrated XML support. It also has the
scalability, availability, and security features required to operate as the data storage component of
the largest Web sites.
The SQL Server 2000 programming model is integrated with the Windows DNA
architecture for developing Web applications, and SQL Server 2000 supports features such as
English Query and Microsoft Search Service to incorporate user-friendly queries and powerful
search capabilities in Web applications.

Scalability and Availability


The same database engine can be used across platforms ranging from laptop computers
running Microsoft Windows 98 through large, multiprocessor servers running Microsoft
Windows 2000 Data Center Edition.
SQL Server 2000 Enterprise Edition supports features such as federated servers, indexed
views, and large memory support that allow it to scale to the performance levels required by the
largest Web sites.
Enterprise-Level Database Features

The SQL Server 2005 relational database engine supports the features required to
support demanding data processing environments. The database engine protects data integrity
while minimizing the overhead of managing thousands of users concurrently modifying the
database.
SQL Server 2000 distributed queries allow you to reference data from multiple sources as
if it were a part of a SQL Server 2000 database, while at the same time, the distributed
transaction support protects the integrity of any updates of the distributed data.
Ease of installation, deployment, and use
SQL Server 2000 includes a set of administrative and development tools that improve
upon the process of installing, deploying, managing, and using SQL Server across several sites.

SQL Server 2000 also supports a standards-based programming model integrated with the
Windows DNA, making the use of SQL Server database and data warehouses a seamless part of
building powerful and scalable systems.

These features allow you to rapidly deliver SQL Server applications that customers can
implement with a minimum of installation and administrative overhead.

Data warehousing
SQL Server 2000 includes tools for extracting and analyzing summary data for online
analytical processing. SQL Server also includes tools for visually designing databases and
analyzing data using English-based questions.

7. SYSTEM DESIGN
7.1 Data Flow Diagram

Level 0:
[Figure: Level 0 data flow diagram. Entities: Admin, User, Db. Flow labels: Add WebPages, Store,
Webpage Details, Template Extraction, User Details, Registration, Select Compartment,
View Template, Get Templates for Websites, View Details.]

Level 1:
[Figure: Level 1 data flow diagram. Entities: Admin, Db. Flow labels: Get Clarify Details,
Admin Entry, Login from Database, Login, Store, Add Template, Check User Details,
Metadata Extraction, View Templates.]

Level 2:
[Figure: Level 2 data flow diagram. Entities: User, Db. Flow labels: User Entry,
Login From Database, Registration, Store, Login, Final Templates, View Websites Information,
Select Templates, Document Classification, View Information.]

7.2 Database Design


AdminLogin

Column Name     Data Type       Description
U_Name          Varchar(30)     User Name
Password        Varchar(30)     Password

Document

Column Name     Data Type       Description
Doc_Id          Int             Document Id
Doc_Name        Varchar(30)     Document Name
Author          Varchar(30)     Author
Title           Varchar(30)     Title
Description     Text            Description
Files           Text            Files
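
The two tables above can be expressed as a schema along the following lines. This is a sketch under stated assumptions: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for SQL Server, the column names and types are taken from the tables, no keys or constraints are declared because the design does not specify any, and the inserted row is placeholder data.

```python
import sqlite3

# Schema mirroring the AdminLogin and Document tables of the design.
SCHEMA = """
CREATE TABLE AdminLogin (
    U_Name   VARCHAR(30),
    Password VARCHAR(30)
);
CREATE TABLE Document (
    Doc_Id      INTEGER,
    Doc_Name    VARCHAR(30),
    Author      VARCHAR(30),
    Title       VARCHAR(30),
    Description TEXT,
    Files       TEXT
);
"""

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript(SCHEMA)

# Placeholder row; parameter markers keep the values out of the SQL text.
conn.execute(
    "INSERT INTO Document (Doc_Id, Doc_Name, Author, Title, Description, Files) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    (1, "sample.html", "A. Author", "Sample page", "Placeholder row", "sample.html"),
)

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(Document)")]
titles = [row[0] for row in
          conn.execute("SELECT Title FROM Document WHERE Doc_Id = ?", (1,))]
conn.close()
```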

8. PROJECT ELUCIDATION
The TEXT system after careful analysis has been identified to be presented with the
following modules:
1. Metadata Extraction
2. Rule-based Approach
3. Machine Learning Approach
4. Document Classification for Metadata Extraction
5. Knowledge-based Classifications

1. Metadata Extraction
In this module, metadata extraction has the following meaning. Metadata refers to
information about a document that is used to catalogue a document and later to allow users to
search and locate it. It is commonly clustered on one or more pages; examples include title,
creators, affiliation, publisher, language, ID number, and date. Extraction refers to the process of
automatically locating the pages that contain metadata, extracting the metadata and tagging them
as the appropriate type. We classify approaches to building a metadata extraction system into
rule-based approaches and machine-learning approaches.

2. Rule-based Approach
The steps of building a rule-based metadata extraction system are typically as follows:
first, experts examine samples of the document collection and define rules for metadata
extraction; then, software developers implement these rules either as part of an expert system or
as part of an ad hoc rule engine. The accuracy, inventiveness, and appropriateness of the rules
that the experts define play a critical role in building a system with high accuracy.
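
The rule-based steps described above can be sketched in a few lines. This is a minimal illustration, not the project's actual rule engine: the class names ExtractionRule and RuleBasedExtractor, the three sample rules, and the "Label: value" page layout are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical rule: a metadata type paired with a pattern that an expert
// might write after examining sample documents.
public sealed class ExtractionRule
{
    public string FieldName { get; }
    public Regex Pattern { get; }

    public ExtractionRule(string fieldName, string pattern)
    {
        FieldName = fieldName;
        Pattern = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Multiline);
    }
}

public static class RuleBasedExtractor
{
    // Sample rules only (assumed layout "Label: value", one item per line).
    private static readonly List<ExtractionRule> Rules = new List<ExtractionRule>
    {
        new ExtractionRule("title",  @"^Title:[ \t]*(.+)$"),
        new ExtractionRule("author", @"^Author:[ \t]*(.+)$"),
        new ExtractionRule("date",   @"^Date:[ \t]*(\d{4}-\d{2}-\d{2})\s*$"),
    };

    // Applies every rule to the page text and tags the first match of each
    // rule with that rule's metadata type.
    public static Dictionary<string, string> Extract(string pageText)
    {
        var metadata = new Dictionary<string, string>();
        foreach (ExtractionRule rule in Rules)
        {
            Match match = rule.Pattern.Match(pageText);
            if (match.Success)
            {
                metadata[rule.FieldName] = match.Groups[1].Value.Trim();
            }
        }
        return metadata;
    }
}
```

Keeping the rules in a list separate from the matching loop mirrors the division of labour in the text: experts maintain the rules, while developers maintain the engine that applies them.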

3. Machine Learning Approach


Machine learning tasks can be classified into two categories: empirical learning and
analytical learning. Empirical learning requires external inputs, while analytical learning does
not. Based on whether the input data are labelled samples or not, empirical
learning can be further classified into supervised and unsupervised learning tasks. A
supervised learning task is one that analyzes a given set of objects with class labels, while an
unsupervised learning task is one that analyzes a given set of objects without class labels.

4. Document Classification for Metadata Extraction


In this module, the objective of classifying documents into groups is to ease the task of
metadata extraction for a heterogeneous collection. Documents are classified into groups based
on the similarity of their metadata pages so that we can develop a simple template to extract
metadata from documents in a group. We define two kinds of similarity for document
classification in our research: visual similarity and content similarity. The visual similarity is the
similarity of the geometrical arrangement of blocks (both text and graphics) on the metadata
page as well as the typographic features of the text. Some examples of the typographic features
are font size, text alignment, text height, and line spacing. The content similarity is the similarity
of the occurrences of special labels (e.g. ABSTRACT, Title, and Subject), the occurrences
of special text patterns (e.g. three letters followed by nine digits), the occurrences of words
from special databases (e.g. a word from a dictionary of English last names) in the text, and the
statistical features of the text (e.g. a text with more than 50% of its letters in upper case).
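
The four content-similarity features listed above can each be computed with a short function. The sketch below is illustrative only: the class name ContentFeatures, the method names, and the way each feature is measured are assumptions, and the label list simply reuses the examples given in the text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ContentFeatures
{
    // Special labels taken from the examples in the text.
    private static readonly string[] SpecialLabels = { "ABSTRACT", "Title", "Subject" };

    // Special text pattern from the text: three letters followed by nine digits.
    private static readonly Regex IdPattern = new Regex(@"\b[A-Za-z]{3}\d{9}\b");

    // Number of distinct special labels that occur in the text.
    public static int CountSpecialLabels(string text)
    {
        return SpecialLabels.Count(label =>
            text.IndexOf(label, StringComparison.OrdinalIgnoreCase) >= 0);
    }

    // True if the text contains the special text pattern.
    public static bool ContainsIdPattern(string text)
    {
        return IdPattern.IsMatch(text);
    }

    // Number of words in the text found in a special database, for example
    // a dictionary of last names supplied by the caller.
    public static int CountDictionaryWords(string text, ISet<string> dictionary)
    {
        return Regex.Matches(text, @"[A-Za-z]+")
            .Cast<Match>()
            .Count(m => dictionary.Contains(m.Value));
    }

    // Fraction of letters that are upper case (0 when there are no letters).
    public static double UpperCaseRatio(string text)
    {
        int letters = text.Count(c => char.IsLetter(c));
        if (letters == 0) return 0.0;
        return (double)text.Count(c => char.IsUpper(c)) / letters;
    }
}
```

Two pages could then be compared on content by comparing these feature values, for instance by checking whether both have an upper-case ratio above 0.5.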

5. Knowledge-based Classification
In this module, we define a set of page formats in advance, and classify a document into a
group based on which page format it matches. A page format includes information about the
features of the blocks, the relationships among the blocks, the sample pages, and the threshold
values of the similarity. Basically, a page format consists of a set of criteria (i.e. features of
blocks, block relations, similarity threshold values). Only the pages that meet all these
requirements are classified into the group associated with this page format. The module provides
several ways to measure the similarity of a page and a sample page.
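
The matching step described above can be sketched as follows. This is a simplified model under stated assumptions: a page is reduced to a set of block feature names, and Jaccard similarity stands in for the similarity measure, since the text does not say which measures the module uses. The names PageFormat and KnowledgeBasedClassifier are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical page format: required block features, the features of a
// sample page, and a similarity threshold.
public sealed class PageFormat
{
    public string GroupName { get; }
    public HashSet<string> RequiredBlockFeatures { get; }
    public HashSet<string> SamplePageFeatures { get; }
    public double SimilarityThreshold { get; }

    public PageFormat(string groupName, IEnumerable<string> requiredBlockFeatures,
                      IEnumerable<string> samplePageFeatures, double similarityThreshold)
    {
        GroupName = groupName;
        RequiredBlockFeatures = new HashSet<string>(requiredBlockFeatures);
        SamplePageFeatures = new HashSet<string>(samplePageFeatures);
        SimilarityThreshold = similarityThreshold;
    }

    // Jaccard similarity: shared features divided by all features.
    // Used here only as a stand-in for the module's own measures.
    public static double Jaccard(ISet<string> a, ISet<string> b)
    {
        int union = a.Union(b).Count();
        if (union == 0) return 1.0;
        return (double)a.Intersect(b).Count() / union;
    }

    // A page matches only if it meets every criterion of the format.
    public bool Matches(ISet<string> pageFeatures)
    {
        bool hasRequired = RequiredBlockFeatures.All(f => pageFeatures.Contains(f));
        return hasRequired
            && Jaccard(pageFeatures, SamplePageFeatures) >= SimilarityThreshold;
    }
}

public static class KnowledgeBasedClassifier
{
    // Returns the group of the first matching format, or null if none match.
    public static string Classify(ISet<string> pageFeatures, IEnumerable<PageFormat> formats)
    {
        foreach (PageFormat format in formats)
        {
            if (format.Matches(pageFeatures)) return format.GroupName;
        }
        return null;
    }
}
```

A page that lacks any required block feature is rejected regardless of its similarity score, which reflects the rule that only pages meeting all requirements join the group.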

9. CODING
AdminLogin.aspx.cs
using System;
using System.Collections;
using System.Configuration;
using System.Data;
using System.Linq;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.HtmlControls;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Xml.Linq;
using System.Data.SqlClient;
public partial class AdminLogin : System.Web.UI.Page
{
    ClsDbLayer _objDb = new ClsDbLayer();
    SqlDataReader dr;

    protected void Page_Load(object sender, EventArgs e)
    {
    }

    // Checks the submitted credentials against the AdminLogin table.
    protected void Button1_Click(object sender, EventArgs e)
    {
        // NOTE: the query is built by concatenating user input, which leaves
        // it open to SQL injection; a parameterized query avoids this.
        string Query = "select * from AdminLogin where UserName='" +
            TextBox1.Text + "' and Password='" + TextBox2.Text + "'";
        dr = _objDb.Select(Query);
        // Select returns null when the query fails, so guard before reading.
        if (dr != null && dr.Read())
        {
            TextBox1.Text = dr[0].ToString();
            TextBox2.Text = dr[1].ToString();
            Response.Redirect("AdminPage.aspx");
        }
        else
        {
            Response.Write("No Values");
        }
    }
}
AddDocumentDetails.aspx.cs
using System;
using System.Collections;
using System.Configuration;
using System.Data;
using System.Linq;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.HtmlControls;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Xml.Linq;
using System.Data.SqlClient;
public partial class AddDocumentDetails : System.Web.UI.Page
{
    ClsDbLayer _objDb = new ClsDbLayer();
    int i;

    protected void Page_Load(object sender, EventArgs e)
    {
    }

    // Stores the document details and saves the uploaded file on the server.
    protected void Button1_Click(object sender, EventArgs e)
    {
        // NOTE: the query is built by concatenating user input, which leaves
        // it open to SQL injection; a parameterized query avoids this.
        string Query = "insert into T_Document values('" + TextBox1.Text + "','"
            + TextBox2.Text + "','" + TextBox3.Text + "','" + TextBox4.Text + "','" +
            TextBox5.Text + "','" + FileUpload1.PostedFile.FileName + "')";
        i = _objDb.Save(Query);
        // Save returns -1 when the insert fails.
        if (i != -1)
        {
            FileUpload1.SaveAs(Server.MapPath(FileUpload1.FileName));
            Response.Write("Submitted Successfully");
        }
        else
        {
            Response.Write("Not Submitted");
        }
    }
}

Extraction.aspx.cs
using System;
using System.Collections;
using System.Configuration;
using System.Data;
using System.Linq;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.HtmlControls;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Xml.Linq;
using System.Data.SqlClient;
using System.IO;
public partial class Extraction : System.Web.UI.Page
{
    ClsDbLayer _objDb = new ClsDbLayer();
    int i;
    DataSet ds;

    // Fills the drop-down list with document ids on the first page load.
    protected void Page_Load(object sender, EventArgs e)
    {
        if (!IsPostBack)
        {
            string Query = "select Doc_Id from T_Document";
            ds = _objDb.Display(Query);
            DropDownList1.DataTextField = "Doc_Id";
            DropDownList1.DataValueField = "Doc_Id";
            DropDownList1.DataSource = ds;
            DropDownList1.DataBind();
        }
    }

    // Saves the extracted metadata for the selected document.
    protected void Button1_Click(object sender, EventArgs e)
    {
        // NOTE: the query is built by concatenating user input, which leaves
        // it open to SQL injection; a parameterized query avoids this.
        string Query = "insert into Extract values('" +
            DropDownList1.SelectedItem.ToString() + "','" + TextBox1.Text + "','" +
            TextBox2.Text + "','" + TextBox3.Text + "','" + TextBox4.Text + "','" +
            TextBox5.Text + "','" + FileUpload1.PostedFile.FileName + "')";
        i = _objDb.Save(Query);
        // Save returns -1 when the insert fails.
        if (i != -1)
        {
            Response.Write("Saved Successfully");
        }
        else
        {
            Response.Write("Not Saved");
        }
    }

    // Shows the text of the uploaded file in TextBox7.
    protected void Button2_Click(object sender, EventArgs e)
    {
        if (!FileUpload1.HasFile)
        {
            return;
        }
        // Read the uploaded content from the request stream; the posted file
        // name is a client-side name, not a readable path on the server.
        using (StreamReader Str = new StreamReader(FileUpload1.PostedFile.InputStream))
        {
            TextBox7.Text = Str.ReadToEnd();
        }
    }

    // Loads the name and title of the selected document.
    protected void DropDownList1_SelectedIndexChanged(object sender, EventArgs e)
    {
        string Query = "select Doc_Name,Title from T_Document where Doc_Id ='" +
            DropDownList1.SelectedItem.ToString() + "'";
        SqlDataReader dr = _objDb.Select(Query);
        // Select returns null when the query fails, so guard before reading.
        if (dr != null && dr.Read())
        {
            TextBox1.Text = dr[0].ToString();
            TextBox2.Text = dr[1].ToString();
        }
    }
}
ViewDocuments.aspx.cs

using System;
using System.Collections;
using System.Configuration;
using System.Data;
using System.Linq;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.HtmlControls;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Xml.Linq;
using System.Data.SqlClient;
public partial class ViewDocuments : System.Web.UI.Page
{
    ClsDbLayer _objDb = new ClsDbLayer();
    DataSet ds;

    // Binds all stored documents to the grid.
    protected void Page_Load(object sender, EventArgs e)
    {
        string Query = "select * from T_Document ";
        ds = _objDb.Display(Query);
        GridView1.DataSource = ds;
        GridView1.DataBind();
    }
}

UserRegistration.aspx.cs
using System;
using System.Collections;
using System.Configuration;
using System.Data;
using System.Linq;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.HtmlControls;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Xml.Linq;
using System.Data.SqlClient;
public partial class UserRegistration : System.Web.UI.Page
{
    ClsDbLayer _objDb = new ClsDbLayer();
    int i;

    protected void Page_Load(object sender, EventArgs e)
    {
    }

    // Stores the registration details of a new user in U_Reg.
    protected void ImageButton1_Click1(object sender, ImageClickEventArgs e)
    {
        // NOTE: the query is built by concatenating user input, which leaves
        // it open to SQL injection; a parameterized query avoids this.
        string Query = "insert into U_Reg values('" + TextBox1.Text + "','" +
            TextBox2.Text + "','" + TextBox3.Text + "','" + TextBox4.Text + "','" +
            TextBox5.Text + "','" + TextBox6.Text + "','" + TextBox7.Text + "','" +
            TextBox8.Text + "')";
        i = _objDb.Save(Query);
        // Save returns -1 when the insert fails.
        if (i != -1)
        {
            Response.Write("Submitted");
        }
        else
        {
            Response.Write("Not Submitted");
        }
    }
}

UserLogin.aspx.cs
using System;
using System.Collections;
using System.Configuration;
using System.Data;
using System.Linq;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.HtmlControls;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Xml.Linq;
using System.Data.SqlClient;
public partial class UserLogin : System.Web.UI.Page
{
    ClsDbLayer _objDb = new ClsDbLayer();
    SqlDataReader dr;

    protected void Page_Load(object sender, EventArgs e)
    {
    }

    // Checks the submitted credentials against the U_Reg table.
    protected void Button1_Click(object sender, EventArgs e)
    {
        // '=' is used instead of 'like' so that wildcard input such as '%'
        // cannot match every user.
        // NOTE: the query is built by concatenating user input, which leaves
        // it open to SQL injection; a parameterized query avoids this.
        string Query = "select U_Name,Password from U_Reg where U_Name = '" +
            TextBox1.Text + "' and Password = '" + TextBox2.Text + "'";
        dr = _objDb.Select(Query);
        // Select returns null when the query fails, so guard before reading.
        if (dr != null && dr.Read())
        {
            TextBox1.Text = dr[0].ToString();
            TextBox2.Text = dr[1].ToString();
            Response.Redirect("UserPage.aspx");
        }
        else
        {
            Response.Write("No Values");
        }
    }

    protected void ImageButton1_Click(object sender, ImageClickEventArgs e)
    {
        Response.Redirect("Default.aspx");
    }
}
Books.aspx.cs
using System;
using System.Collections;
using System.Configuration;
using System.Data;
using System.Linq;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.HtmlControls;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Xml.Linq;
public partial class Books : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
    }

    // Switches the two option links to the book titles.
    protected void LinkButton1_Click(object sender, EventArgs e)
    {
        LinkButton4.Text = "Asp .Net";
        LinkButton5.Text = "C Sharp";
    }

    // Switches the two option links to the template formats.
    protected void LinkButton2_Click(object sender, EventArgs e)
    {
        LinkButton4.Text = "Report Format";
        LinkButton5.Text = "Book Format";
    }

    // Opens the page that corresponds to the current text of the first link.
    protected void LinkButton4_Click(object sender, EventArgs e)
    {
        if (LinkButton4.Text == "Report Format")
        {
            Response.Redirect("OriginalTemp.html");
        }
        if (LinkButton4.Text == "Asp .Net")
        {
            Response.Redirect("Design.aspx");
        }
    }

    // Opens the page that corresponds to the current text of the second link.
    protected void LinkButton5_Click(object sender, EventArgs e)
    {
        if (LinkButton5.Text == "Book Format")
        {
            Response.Redirect("Contents.aspx");
        }
        if (LinkButton5.Text == "C Sharp")
        {
            Response.Redirect("TemplateBook.aspx");
        }
    }
}
SelectBooks.aspx.cs
using System;
using System.Collections;
using System.Configuration;
using System.Data;
using System.Linq;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.HtmlControls;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Xml.Linq;
using System.Data.SqlClient;
public partial class Books : System.Web.UI.Page
{
    ClsDbLayer _objDb = new ClsDbLayer();
    SqlDataReader dr;
    DataSet ds;

    // Fills the drop-down list with document names on the first page load.
    protected void Page_Load(object sender, EventArgs e)
    {
        Panel1.Visible = false;
        if (!IsPostBack)
        {
            string Query = "select Doc_Name from T_Document";
            ds = _objDb.Display(Query);
            DropDownList1.DataTextField = "Doc_Name";
            DropDownList1.DataValueField = "Doc_Name";
            DropDownList1.DataSource = ds;
            DropDownList1.DataBind();
        }
    }

    // Shows the details of the selected document.
    protected void DropDownList1_SelectedIndexChanged(object sender, EventArgs e)
    {
        Panel1.Visible = true;
        string Query = "select * from T_Document where Doc_Name like '" +
            DropDownList1.SelectedItem.ToString() + "'";
        dr = _objDb.Select(Query);
        // Select returns null when the query fails, so guard before reading.
        if (dr != null && dr.Read())
        {
            Label1.Text = "Document Id :" + dr[0].ToString();
            Label2.Text = "Document Name :" + dr[1].ToString();
            Label3.Text = "Author Name :" + dr[2].ToString();
            Label4.Text = "Title :" + dr[3].ToString();
        }
        else
        {
            Response.Write("No Books");
        }
    }

    // Opens the template page of the selected document.
    protected void LinkButton1_Click(object sender, EventArgs e)
    {
        if (DropDownList1.SelectedItem.Text == "C Sharp")
        {
            Response.Redirect("TemplateBook.aspx");
        }
        if (DropDownList1.SelectedItem.Text == "Asp .Net")
        {
            Response.Redirect("Asp.aspx");
        }
    }
}

ClsDbLayer.cs
using System;
using System.Data;
using System.Configuration;
using System.Linq;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.HtmlControls;
using System.Web.UI.WebControls;

using System.Web.UI.WebControls.WebParts;
using System.Xml.Linq;
using System.Data.SqlClient;
/// <summary>
/// Summary description for ClsDbLayer
/// </summary>
public class ClsDbLayer
{
    // Connection string shared by all three methods.
    const string ConnectionString =
        "Data Source=.;Initial Catalog=TTEXT;Integrated Security=True";

    SqlConnection con;
    SqlCommand cmd;
    SqlDataReader dr;
    SqlDataAdapter da;
    DataSet ds;

    // Runs an INSERT, UPDATE or DELETE statement.
    // Returns the number of rows affected, or -1 if the statement fails.
    public int Save(string Query)
    {
        int i;
        try
        {
            con = new SqlConnection();
            con.ConnectionString = ConnectionString;
            con.Open();
            cmd = new SqlCommand();
            cmd.CommandText = Query;
            cmd.Connection = con;
            i = cmd.ExecuteNonQuery();
        }
        catch (Exception)
        {
            i = -1;
        }
        finally
        {
            // Either object may still be null if its construction failed.
            if (cmd != null) cmd.Dispose();
            if (con != null) con.Close();
        }
        return i;
    }

    // Runs a SELECT statement and returns an open reader, or null on failure.
    // The caller should close the reader when it has finished with it.
    public SqlDataReader Select(string Query)
    {
        try
        {
            con = new SqlConnection();
            con.ConnectionString = ConnectionString;
            con.Open();
            cmd = new SqlCommand();
            cmd.CommandText = Query;
            cmd.Connection = con;
            // CloseConnection makes closing the reader close the connection too.
            dr = cmd.ExecuteReader(CommandBehavior.CloseConnection);
        }
        catch (Exception)
        {
            dr = null;
            if (con != null) con.Close();
        }
        return dr;
    }

    // Runs a SELECT statement and returns the result as a DataSet,
    // or null on failure.
    public DataSet Display(string Query)
    {
        try
        {
            con = new SqlConnection();
            con.ConnectionString = ConnectionString;
            con.Open();
            cmd = new SqlCommand();
            cmd.CommandText = Query;
            cmd.Connection = con;
            da = new SqlDataAdapter();
            ds = new DataSet();
            da.SelectCommand = cmd;
            da.Fill(ds);
        }
        catch (Exception)
        {
            ds = null;
        }
        finally
        {
            // The connection was opened explicitly, so close it here.
            if (con != null) con.Close();
        }
        return ds;
    }
}
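
The page listings above pass complete SQL strings, built from text box values, to ClsDbLayer. A parameterized command keeps user input out of the SQL text. The sketch below is not part of the project's code: the class ParameterizedQueries and the method BuildAdminLoginCommand are hypothetical names, while the table and column names come from the AdminLogin listing and the Varchar(30) sizes from the database design.

```csharp
using System.Data;
using System.Data.SqlClient;

// Hypothetical helper, not part of the project's ClsDbLayer.
public static class ParameterizedQueries
{
    // Builds the admin login lookup with parameters instead of string
    // concatenation, so the supplied values are never parsed as SQL.
    public static SqlCommand BuildAdminLoginCommand(
        SqlConnection connection, string userName, string password)
    {
        var command = new SqlCommand(
            "select UserName, Password from AdminLogin " +
            "where UserName = @userName and Password = @password",
            connection);
        command.Parameters.Add("@userName", SqlDbType.VarChar, 30).Value = userName;
        command.Parameters.Add("@password", SqlDbType.VarChar, 30).Value = password;
        return command;
    }
}
```

The returned command can be executed with ExecuteReader on an open connection in the same way as the Select method. Input such as `' or '1'='1` is then compared as a literal value rather than altering the query.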


10. CODING EFFICIENCY

The TEXT: Automatic Template Extraction from Heterogeneous Webpages project has been
coded with attention to efficiency, by applying code efficiency techniques during the
coding phase of the SDLC (Software Development Life Cycle).

The project has been coded using ASP.NET on the Microsoft .NET Framework. The back end
is MS SQL Server.

The ASP.NET page framework is a programming framework that runs on a web server to
dynamically produce and manage Web Forms pages. In Visual Studio, Web Forms provides a form
designer, editor, controls and debugging, which together allow you to rapidly build server-based,
programmable user interfaces for browsers and web client devices.

The ASP.NET page framework creates an abstraction of the traditional client-server web
interaction so that you can program your application using traditional methods and tools that
support Rapid Application Development (RAD) and OOP.

Hence, the TEXT: Automatic Template Extraction from Heterogeneous Webpages
project has been efficiently coded.

11. TEST METHODS EMPLOYED


Software testing is any activity aimed at evaluating an attribute or capability of a program
or system and determining that it meets its required results. Although crucial to software quality
and widely deployed by programmers and testers, software testing still remains an art, due to
limited understanding of the principles of software. The difficulty in software testing stems from
the complexity of software: we cannot completely test a program of even moderate complexity.
Testing is more than just debugging. The purpose of testing can be quality assurance,
verification and validation, or reliability estimation. Testing can be used as a generic metric as
well. Correctness testing and reliability testing are two major areas of testing. Software testing is
a trade-off between budget, time and quality.

Software testing is the process of executing a program or system with the intent of
finding errors. It is the major quality measure employed during software development.
Its basic function is to detect errors in the software. Testing is necessary for the
proper functioning of the system.

Testing is usually performed for the following purposes:

To improve quality.
For Verification & Validation (V&V).
For reliability estimation.

The purpose of software testing is to assess and evaluate the quality of work performed at
each step of the software development process. Although it sometimes seems that way, the
purpose of testing is NOT to use up all the remaining budget or schedule resources at the end of a
development effort.
The goal of testing is to ensure that the software performs as intended, and to improve
software quality, reliability and maintainability. There are several rules that can serve as testing
objectives.
They are:
Testing is a process of executing a program with the intent of finding an error.
A good test case is one that has a high probability of finding an undiscovered error.
A successful test is one that uncovers an as-yet-undiscovered error.

If testing is conducted successfully according to the objectives stated above, it will
uncover errors in the software. Testing also demonstrates that software functions appear to be
working according to specification and that performance requirements appear to have been met.

A strategy for software testing integrates software test case design techniques into
a well-planned series of steps that result in the successful construction of software. Any testing
strategy must incorporate test planning, test case design, test execution and the resultant data
collection and evaluation. There are various software testing strategies. The test methods
employed during the development of TEXT: Automatic Template Extraction from
Heterogeneous Webpages include:

Unit Testing

Functional Testing

Stress Testing

Integration testing

User Acceptance testing.

Unit Testing:
Unit testing focuses the verification effort on the smallest unit of each module
in the system. It comprises the set of tests performed by an individual programmer prior to the
integration of the unit into a larger system. It involves various tests that a programmer will
perform on a program unit.
Using the unit test plans prepared during the design phase of system development as a
guide, the control paths are tested to uncover errors within the boundary of the module. In this
testing, each module was tested and found to be working satisfactorily as per the expected
output from the module.
The unit test considerations that were taken into account are,
Interfacing errors
Integrity of local data structure
Boundary Condition
Independent Paths
Error Handling Paths

Functional Testing:
Functional testing involves exercising the code with correct input values for which
the expected results are known. The system being developed is tested with all the nominal input
values, so that the expected results are received. The system is also tested with the boundary
values.

Stress Testing:
Stress tests are designed to overload a system in various ways. The system being
developed is tested by, for example, attempting to sign on more than the maximum number of
allowed terminals, inputting mismatched data types, and processing more than the allowed
number of identifiers.
Integration testing:
This is a systematic technique for constructing the program structure while at the
same time conducting tests to uncover errors associated with the interface. In this testing all the
modules are integrated and the entire system is tested as a whole. Errors are unlikely at this
stage, since each module has already been unit tested. If any error does occur, it is found in
this step and rectified before passing to the next step.

User Acceptance testing:


The user acceptance is the key factor in the development of a successful system.
The system under consideration is tested for acceptance by constantly keeping in touch with the
prospective users at the time of development and the changes are made as and when required.

12. IMPLEMENTATION AND MAINTENANCE

Implementation is a process involved in the complete development of software. The
implementation process is based upon the design methodology. The system is implemented soon
after it is tested and found to satisfy all the requirements of the clients, upon whose
consent the system is implemented. There are several procedures to be followed while
implementing the software.

Any system can be implemented using one of four methods.

Direct Method
In the direct method, a new system is designed and implemented.

Cut-over Method
In the cut-over method, the existing system is discontinued and the new system is
implemented.

Segment Method
In the segment method, the existing system process is divided into a number of
segments. In the new system they are implemented one by one.

Parallel Method
In the parallel method, the new system is developed in parallel, independent of the existing
system.

The proposed TEXT: Automatic Template Extraction from Heterogeneous Webpages is
implemented offline by the Parallel method. It is developed in ASP.NET. The
implementation plan consists of

Testing the developed system with sample data.

Detection and correction of errors.

Making necessary changes in the system.

Checking of reports against those of the existing system.

Before installing the proposed system we have to supply the client with a detailed report
covering the complete details and functions of the developed system. During installation, we
confirm whether the hardware configuration required for smooth and reliable running of the
software is present. The next step is entering the data.
During the implementation stages the client will not leave the existing system until
fully satisfied with the proposed system. During the initial stages both the
existing and the proposed systems will run hand in hand, since some time is required for
updating the transactions.

Finally, the TEXT: Automatic Template Extraction from Heterogeneous Webpages has
been successfully implemented.

The term maintenance is used to describe the software engineering activities that occur
following delivery of a software product to the customer. The maintenance phase of the software
life cycle is the time period in which a software product performs useful work.

Maintenance activities involve making enhancements to software products, adapting
products to new environments, and correcting problems. Software product enhancement may
involve providing new functional capabilities, improving user displays and modes of interaction,
upgrading external documents and internal documentation, or upgrading the performance
characteristics of a system. Adaptation of software to a new environment may involve
moving the software to a different machine or, for instance, modifying the software to
accommodate a new telecommunications protocol or an additional disk drive. Problem correction
involves modifying the software to correct errors. Some errors require immediate attention, some
can be corrected on a scheduled, periodic basis, and others are known but never corrected.

13. SYSTEM SECURITY MEASURES

System security is a vital area of software development that demands care. Merely
developing and installing software does not complete the job. We must adopt some
mechanisms to maintain system security; otherwise the software will be of little value.

A number of precautionary measures have been taken in the TEXT: Automatic Template
Extraction from Heterogeneous Webpages software so that the system is
properly secured. Only authorised users/employees of the organisation can use the project. The
user has to supply proper parameters and database details directly.

A number of security measures are also adopted in the design of the TEXT project. No
information or data can be altered in the database during visualization.

The extraction and display of datasets take place after full scrutiny and implementation
of the security policy. There is no possibility of accidental modification of data in the
datasets. Hence the TEXT: Automatic Template Extraction from Heterogeneous Webpages project
is sufficiently secured.

14. COST ESTIMATION OF THE PROJECT


The process of estimating the total cost of a software project is termed cost estimation.
This is one of the most difficult and error-prone tasks in software engineering. It is
very difficult to make an accurate estimate during the planning phase because of the large
number of factors that are unknown at that time.

The major factors that influence the software cost are:


a) Programmer Ability
b) Product Complexity
c) Product Size
d) Available Time
e) Required Reliability
f) Level of Technology.

To estimate the cost of a software project at the time of planning, we need to use a cost
estimation technique. Cost estimation techniques are broadly classified as
1. Top-down Techniques
2. Bottom-up Techniques
Top-down approaches first focus on system costs such as computer resources as well as
costs for configuration management, quality assurance, system integration and training.
Bottom-up cost estimates first focus on the cost to develop each module or sub-system. Those
costs are combined to arrive at an overall estimate.

The most common cost estimation techniques are listed below:


i. Expert Judgement
ii. Delphi Cost Estimation
iii. Work Breakdown Structures
iv. COCOMO Model

Of these four techniques, we selected the first and simplest method, known as Expert
Judgement, to estimate the total cost of the project titled TEXT: Automatic Template
Extraction from Heterogeneous Webpages. It is widely used for cost estimation and is a
Top-down technique. This approach relies on the experience, background and business sense of
one or more key people in the organization.
In this method, the cost estimate for the current project is based on the cost spent
on previous projects. Here the experts available in the organization are the decision-makers.
They try to estimate the cost of the software project with the help of their experience and skill.
Sometimes it may not be possible to arrive at a definite estimate because of variations that
arise during the present project.

15. GANTT CHART

Scheduling of a software project does not differ greatly from scheduling of any multi-task
engineering effort. Therefore, generalized scheduling tools and techniques can be applied with
little modification to software projects.
Program Evaluation and Review Technique (PERT) and Critical Path Method (CPM) are
two project scheduling methods that can be applied to software development.
When creating a software project schedule, the planner begins with a set of tasks. In this
context, the inputs include Work Breakdown Structures (WBS). From the work breakdown
inputs, a Time-line Chart or Gantt Chart can be developed. The figure below illustrates the Gantt
Chart for TEXT: Automatic Template Extraction from Heterogeneous Webpages.
[Figure: Gantt chart. Columns: Month 1 to Month 4, each divided into weeks W1 to W4. Rows:]

Task
Project Mgt.
a) Plan
b) Review
c) Problem defined
Development
a) Design
b) Code
c) Debug
d) Coding finished
Testing
a) Unit Test
b) Integration Test
c) Acceptance Test
d) Testing over
Support
a) Installation
b) Maintenance
c) Project Complete

16. CONCLUSION
This project takes less time for template extraction than existing
algorithms such as RTDM, Text-Hash and Text-Max. In this system we used the WaveK-Means
algorithm to find the similarity between web pages. This algorithm provides better performance
than the previous algorithms in terms of both space and time: it consumes less of each
than RTDM, Text-Hash and Text-Max. Our experimental results
with real-life data sets confirm the effectiveness and robustness of our algorithm.

Moreover, in future implementations our extraction approach will be resilient to changes
in source document formats. For example, changes in HTML formatting codes will not affect our
ability to extract and structure information from a given Web page. Finally, the model will
contribute effectively to the emergence of the semantic web by providing methodology, tools and
both global and generic solutions.

17. ANNEXURE SCREEN LAYOUTS

[Figures: screenshots of the following screens, in order.]

Home Page
Admin Login
Admin Page
Document Classification
Extraction Details
View Documents
Before Extraction
After Extraction
Before Extraction
User Registration
User Login
User Page
View Documents
Document Selection
Page with Design Content
Page with Introduction
Report Document
Page with Book Content

18. GLOSSARY
API       Application Programming Interface
ASP       Active Server Pages
CLC       Classic Life Cycle
COCOMO    Constructive Cost Model
CPM       Critical Path Method
DFD       Data Flow Diagram
GUI       Graphical User Interface
HTML      Hyper-Text Mark-up Language
IDE       Integrated Development Environment
ISAM      Indexed Sequential Access Method
PERT      Program Evaluation and Review Technique
RAD       Rapid Application Development
SEP       Software Engineering Paradigm
SOA       Service Oriented Architecture
TEXT      Template Extraction
VSAM      Virtual Storage Access Method
WBS       Work Breakdown Structures
WWW       World Wide Web
XML       Extensible Mark-up Language
