1. INTRODUCTION
Nowadays most information is stored in text databases, much of it as large collections of
documents drawn from heterogeneous web pages. This project extracts templates from such
heterogeneous pages: we use different algorithms to measure the similarity of the underlying
template structures in the documents, cluster the web documents by that similarity, and extract a
template from each cluster. The Web is the largest data repository ever available in the history of
humankind, and major efforts have been made to provide efficient access to relevant information
within it. Although several techniques have been developed for Web data extraction, they are
still not widely used, mostly because they need heavy human intervention and yield
low-quality extraction results.
In this project, we present a domain-oriented approach to Web data extraction and discuss its
application to automatically extracting news from Web sites. Our approach is based on a highly
efficient tree structure analysis that produces very effective results. The HTML DOM follows a
naming convention for properties, methods, events, collections, and data types. All names are
defined as one or more English words concatenated together to form a single string. For
properties and methods, the name starts with the initial keyword in lowercase, and each
subsequent word starts with a capital letter. For example, a property that returns document meta
information such as the date the file was created might be named "fileDateCreated". In the
ECMAScript binding, properties are exposed as properties of a given object.
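The naming convention for properties and methods can be illustrated with a minimal Python sketch. The helper name dom_name is hypothetical and is not part of any DOM binding; it only shows how a list of English words becomes a single name.

```python
def dom_name(words):
    """Join English words into a DOM-style name: first word lowercase,
    each later word capitalized (for example "fileDateCreated")."""
    if not words:
        return ""
    first, rest = words[0].lower(), words[1:]
    return first + "".join(w[:1].upper() + w[1:].lower() for w in rest)


print(dom_name(["file", "date", "created"]))  # fileDateCreated
```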
In Java, properties are exposed with get and set methods.
Non-HTML 4.0 interfaces and attributes
While most of the interfaces defined in the HTML DOM can be mapped directly to elements defined
in the HTML 4.0 Recommendation, some of them cannot. Similarly, not all attributes have
counterparts in the HTML 4.0 specification (and some do, but have been renamed to avoid
conflicts with scripting languages). Interfaces and attribute definitions that have links to the
HTML 4.0 specification have corresponding element and attribute definitions there; all others are
added by the DOM specification, either for convenience or backwards compatibility with "DOM
Level 0" implementations.
The steps of building a rule-based metadata extraction system are typically as follows:
first, some experts examine samples of the document collection and define rules for metadata
extraction; then, software developers implement these rules either as part of an expert system or
as part of an ad hoc rule engine. The accuracy, inventiveness, and appropriateness of the rules
that experts defined play a critical role in building a system with high accuracy.
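The two steps above (experts define rules, developers implement them in a rule engine) can be sketched as follows. This is only an illustration: the field names, the regular expressions and the sample text are assumptions made for the example, not rules taken from this project.

```python
import re

# Hypothetical rules an expert might write after inspecting sample documents.
# Each rule maps a metadata field to a regular expression with one capture group.
RULES = {
    "title": re.compile(r"^Title:[ \t]*(.+)$", re.MULTILINE),
    "author": re.compile(r"^Author:[ \t]*(.+)$", re.MULTILINE),
    "date": re.compile(r"^Date:[ \t]*(\d{4}-\d{2}-\d{2})$", re.MULTILINE),
}


def extract_metadata(text, rules=RULES):
    """Apply every rule to the text; keep the first match for each field."""
    found = {}
    for field, pattern in rules.items():
        match = pattern.search(text)
        if match:
            found[field] = match.group(1).strip()
    return found


sample = "Title: Template Extraction\nAuthor: A. Writer\nDate: 2012-03-01\nBody text..."
print(extract_metadata(sample))
```

Because the rules live in a table rather than in the control flow, an expert can refine a rule without the developer touching the engine, which is why the quality of the rules dominates the accuracy of such a system.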
2. COMPANY PROFILE
Comdex Infotech, Chennai
Preview
Technology
Comdex Infotech works predominantly in software development and builds software for
practically all firmware platforms using technologies currently used in the IT industry. Comdex
Infotech has been delivering engineering solutions across diverse industries, enabling customers
to foster product innovation, improve operational efficiencies, and decrease time-to-market for
their products.
Web: ASP.NET, ASP, JSP, XML/XSLT, SOAP, HTML/DHTML, JavaScript, Flash, Active X
Programming Languages: C, C++, OOAD Paradigms, JAVA, J2ME, J2EE, Dot NET
Technologies, PHP, Spring, Struts, JSF
Products
Comdex Infotech has experience in finance, systems integration and outsourcing, which enables
us to deliver innovative, results-driven solutions to government and commercial clients around
the world.
Customer Support provides timely, reliable, and cost-efficient assistance to our clients. Our
support organization is staffed with highly motivated, trained professionals dedicated to
providing quality support as quickly as possible.
Banking
Web Designing
Hospital Management
Financial
Retail
Student Management
Sales
Our quality assurance performs tests on every PC system for a minimum of 12 hours. Our
systems are guaranteed and come with our on-site and phone technical support.
Assembled Desktops
Branded Desktops
Laptops
Computer Accessories
Computer Peripherals
Software
Services
Comdex PC shall support various services like:
On-Call Service
Annual Maintenance Contract
Contacts
Comdex Infotech
No.24/19, Bharathi Street,
West Mambalam
Chennai - 600 033, INDIA
Phone : + 91 44 42136340
Mobile : + 91 94441 18194
Email : info@comdexinfotech.com
Website : www.comdexinfotech.com
3. SYSTEM ANALYSIS
System analysis is the process of thoroughly understanding the current system by gathering and
interpreting facts, diagnosing problems and using the facts to improve the current system. It is
the detailed study of the various operations performed by a system and their relationships within
and outside of the system.
System analysis is done in order to understand the problem and to emphasize what is
needed from the system. This part of the system development life cycle is a crucial phase in
which the information needs of the user are determined.
related distance measures. For instance, tree-edit distance has at least $O(n_1 n_2)$ time complexity,
where $n_1$ and $n_2$ are the sizes of two DOM trees and the sizes of the trees are usually more than a
thousand. Thus, clustering on sampled web documents is used to practically handle a
large number of web documents.
The problem of extracting a template from web documents conforming to a common
template has been studied in earlier work. Because these solutions assume that all documents are
generated from a single common template, they are applicable only when all documents are
guaranteed to conform to a common template. However, in real applications, it is not trivial to
classify massively crawled documents into homogeneous partitions in order to use these
techniques.
Coclustering algorithms find a simultaneous clustering of the rows and columns of a matrix
and require the numbers of row and column clusters as input parameters. However, here only
documents are clustered, not paths, and moreover, the numbers of row and column clusters
are unknown.
We propose to represent a web document and a template as a set of paths in a DOM
tree. As validated by XPath, the most popular XML query language, paths are sufficient to
express tree structures and are useful for querying. By considering only paths, the overhead of
measuring the similarity between documents becomes small without significant loss of
information.
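The path-set representation can be sketched in Python with the standard html.parser module. This is a minimal illustration under stated assumptions: paths are plain tag sequences (text nodes and attributes are ignored), and the Jaccard coefficient is used only as a stand-in similarity measure; it is not claimed to be the measure used in this project. The names PathCollector, dom_paths and jaccard are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag in HTML, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathCollector(HTMLParser):
    """Collect the root-to-node tag path of every element in a document."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/" + "/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def dom_paths(html):
    """Return the set of tag paths found in an HTML string."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def jaccard(a, b):
    """Share of paths two documents have in common (1.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


doc1 = "<html><body><h1>News</h1><p>Story</p></body></html>"
doc2 = "<html><body><h1>Sport</h1><div><p>Score</p></div></body></html>"
print(sorted(dom_paths(doc1)))
print(jaccard(dom_paths(doc1), dom_paths(doc2)))  # 0.5
```

The two sample documents share three of their six distinct paths, so the score is 0.5. Comparing sets takes time proportional to the number of paths, which is far cheaper than a tree-edit distance over the full DOM trees.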
of the system is limited. The expenditures must be justified. Thus the developed system was well
within the budget, and this was achieved because most of the technologies used are freely
available. Only the customized products had to be purchased.
2. Technical Feasibility
This study is carried out to check the technical feasibility, that is, the technical
requirements of the system. Any system developed must not place a high demand on the available
technical resources, since that would in turn place high demands on the client. The developed
system must have modest requirements, as only minimal or no changes are required to
implement this system.
3. Social Feasibility
This aspect of the study checks the level of acceptance of the system by the user. This
includes the process of training the user to use the system efficiently. The user must not feel
threatened by the system, but must instead accept it as a necessity.
The level of acceptance by the users depends solely on the methods that are employed to
educate the user about the system and to make the user familiar with it. The user's level of
confidence must be raised so that the user is also able to offer constructive criticism, which is
welcomed, as the user is the one who will ultimately work with the system.
There are several different software engineering paradigms (SEPs). They include
Of these SEPs, we chose the first and simplest paradigm, called the classic life cycle (CLC),
for the Automatic Template Extraction from Heterogeneous Web Pages project. This paradigm is
also known as the Waterfall Model or Linear Sequential Model. The CLC model has the
following phases:
Analysis
Design
Code
Test
Support
[Figure: the classic life cycle model. System / Information Engineering, followed by Analysis, Design, Code, Test]
The above figure illustrates the CLC model. This model was applied to develop the
system for Automatic Template Extraction from Heterogeneous web Pages. The five
phases of the paradigm are explained below.
Code Generation
The design must be translated into a machine-readable form. The code generation step
performs this task. If design is performed in a detailed manner, code generation can be
accomplished mechanistically.
Testing
Once code has been generated, program testing begins. The testing process focuses on the
logical internals of the software, ensuring that all statements have been tested, and on the
functional externals; that is, conducting tests to uncover errors and ensure that defined input will
produce actual results that agree with required results.
Support
Software will undoubtedly undergo change after it is delivered to the customer. Change
will occur because errors have been encountered, because the software must be adapted to
accommodate changes in its external environment, or because the customer requires functional
or performance enhancements. Software support/maintenance reapplies each of the preceding
phases to an existing program rather than a new one.
Processor
CPU speed
RAM: 2 GB or more
Hard Disk: 320 GB or more
Software Configuration
Operating System: Windows XP or later
Web Framework
Database Server
Web Server
Web browser
6. SOFTWARE PROFILE
6.1 Windows XP
Windows XP is a family of 32-bit and 64-bit operating systems produced by
Microsoft for use on personal computers, including home and business desktops, notebook
computers, and media centers. The name "XP" stands for experience. Windows XP is the
successor to both Windows 2000 Professional and Windows Me, and is the first consumer-oriented operating system produced by Microsoft to be built on the Windows NT kernel (version
5.1) and architecture. Windows XP was first released on 25 October 2001, and over 400 million
copies were in use in January 2006, according to an estimate in that month by an IDC analyst.
The most common editions of the operating system are Windows XP Home Edition,
which is targeted at home users, and Windows XP Professional, which offers additional features
such as support for Windows Server domains and two physical processors, and is targeted at
power users, business and enterprise clients.
Windows XP is known for its improved stability and efficiency over the 9x versions of
Microsoft Windows. It presents a significantly redesigned graphical user interface, a change
Microsoft promoted as more user-friendly than previous versions of Windows.
Windows XP has also been criticized by some users for security vulnerabilities, tight
integration of applications such as Internet Explorer 6 and Windows Media Player, and for
aspects of its default user interface. Later versions with Service Pack 2, and Internet Explorer 7
addressed some of these concerns.
As of the end of September 2008, Windows XP is the most widely used operating
system in the world with a 69% market share, having peaked at 85% in December 2006.
Features
Improved Device Support
Windows XP provides new and/or improved drivers and user interfaces for devices compared to
Windows Me and 98. Windows Image Acquisition (WIA), originally introduced in Windows Me,
replaced the traditional TWAIN support for scanners and digital cameras. As TWAIN does not
separate the user interface from the driver of a device, it is difficult to provide transparent
network access; whenever an application loads a TWAIN driver, the driver is inseparable
from the manufacturer-supplied GUI.
On old versions of Windows, when users upgrade a device driver, there is a chance the new
driver is less efficient or functional than the original. Reinstalling the old driver can be a major
hassle and to avoid this quandary, Windows XP keeps a copy of an old driver when a new
version is installed. If the new driver has problems, the user can return to the previous version.
This feature does not work with printer drivers.
Improved Interface
Windows XP includes a new set of visual themes, known by its codename, Luna. Available in
three schemes, the interface is more task-based than the basic one included since Windows 95,
with options available in Explorer windows to interact with each file. It also includes other
modifications, such as grouping of related programs, hiding of taskbar icons, and many other
elements.
Fast User Switching
Fast User Switching allows another user to log in and use the system without having to log out
the previous user and quit his or her applications. Previously (on both Windows Me and
Windows 2000) only one user at a time could be logged in (except through Terminal Services),
which was a serious drawback to multi-user activity. Fast User Switching, like Terminal
Services, requires more system resources than having only a single user logged in at a time and
although more than one user can be logged in, only one user can be actively using their account
at a time. This feature is not available when the Welcome Screen is turned off, such as when
joined to a Windows Server Domain or with Novell Client installed.
Remote Assistance
Remote Assistance allows a Windows XP user to temporarily take over a remote Windows XP
computer over a network or the internet to resolve issues. As it can be a hassle for system
administrators to personally visit the affected computer, Remote Assistance allows them to
diagnose and possibly even repair problems with a computer without ever personally visiting it.
CD Burning
Windows XP includes technology from Roxio which allows users to directly burn files to a
compact disc through Windows Explorer. Previously, end users had to install CD burning
software, such as Nero Burning ROM. Now, CD and DVD-RAM burning has been directly
integrated into the Windows interface; users burn files to a CD in the same way they write files
to a floppy disk or to the hard drive. The burning functionality is also exposed as an API called
the Image Mastering API. Windows XP's CD burning support does not do disc-to-disc copying or
disc images, although the API can be used programmatically to do these tasks. Creation of audio
CDs is integrated into Windows Media Player.
ClearType
Windows XP includes ClearType sub-pixel font anti-aliasing, which makes onscreen fonts
smoother and more readable on liquid crystal display (LCD) screens, although this causes a
minor performance hit. Although ClearType has an effect on cathode ray tube (CRT) monitors,
its primary use is for LCD/TFT-based (laptop, notebook and modern 'flat screen') displays.
Remote Desktop
Users can log into Windows XP Professional remotely through the Remote Desktop service. It is
built on Terminal Services technology (RDP), and is similar to Remote Assistance, but allows
remote users to access local resources such as printers. Any Terminal Services client, a special
"Remote Desktop Connection" client, or a web-based client using an ActiveX control may be
used to connect to the Remote Desktop. (Remote Desktop clients for earlier versions of
Windows, Windows 95, Windows 98 and 98 Second Edition, Windows Me, Windows NT 4.0, or
Windows 2000, have been made available by Microsoft. This permits earlier versions of
Windows to connect to a Windows XP system running Remote Desktop, but not vice-versa.)
There are several resources that users can redirect from the remote server machine to the local
client, depending upon the capabilities of the client software used. For instance, File System
Redirection allows users to use their local files on a remote desktop within the terminal session,
while Printer Redirection allows users to use their local printer within the terminal session as
they would with a locally or network shared printer. Port Redirection allows applications running
within the terminal session to access local serial and parallel ports directly, and Audio allows
users to run an audio program on the remote desktop and have the sound redirected to their local
computer. The clipboard can also be shared between the remote computer and the local
computer.
Power Management
Before Windows 98, power management was based on the Advanced Power Management
architecture. It was of limited use to most users and the feature was easily broken by the addition
of hardware devices or software. Windows XP's power management architecture is based on the
ACPI standard and still supports APM. (In Windows 98 ACPI was supported but disabled by
default. Windows Me enabled ACPI by default.) It supports multiple levels of sleep states,
including critical sleep states when a mobile (or UPS connected) computer is running out of
battery power, processor power control (the ability to adjust the speed of the computer's
processor on-the-fly to save energy), selective suspend of externally attached (such as USB)
devices, and turning off the power to the screen of a laptop when the lid is closed. In addition, it
also dims the screen when the laptop has low battery power.
Hibernate Mode
When Windows XP hibernates it dumps the entire contents of the RAM to disk and
powers down the entire machine. On start-up it quickly reloads the data back to RAM. This
allows the system to be completely powered off while in hibernate mode. This requires a file the
size of the installed RAM to be placed in the system's root directory, using up space even when
not in hibernation. Hibernation is enabled by default and can be disabled in order to recover disk
space.
Standby (Sleep) Mode
When Windows enters standby mode, it turns off all nonessential hardware, including the
monitor, hard drives, and removable drives. This means that the system reactivates itself very
quickly when "woken up". This does not power down the system. In order to save power without
user intervention, a system can be configured to go to standby when idle and then hibernate if not
re-activated.
The Windows Standby feature conforms to the S1 and S3 Sleep States in the ACPI standards.
Kernel Improvements
The Windows XP kernel is completely different from the kernel of the Windows 9x/Me line of
operating systems. As an upgrade of the Windows 2000 kernel, the improvements are major,
albeit transparent to the end user. They include some enhancements to the scalability and
performance of the system.
Windows XP includes Simultaneous Multithreading support, that is, the ability to use the
Hyper-Threading feature of newer Intel Pentium 4 processors. Simultaneous Multithreading is a
processor's ability to process more than one data thread at a time. Intel has described the effect as
being more or less 70% that of having the processing power of two processors.
The ability to boot in 30 seconds was a design goal for Windows XP, and Microsoft's developers
made efforts to streamline the system as much as possible; many people have found that without
extra services Windows XP can boot from the PC's power on self-test (POST) to the Windows
GUI in about 30 seconds. The Prefetcher is a significant part of this; it monitors what files are
loaded during boot, and optimizes the locations of these files on disk so that less time is spent
waiting for the hard drive's heads to move.
Application Compatibility
As Windows XP merged the consumer and enterprise versions of Windows into one, it folded the
user-friendly interface of Windows Me onto the kernel of Windows 2000. A drawback of this is
that older software designed for previous versions of Windows may not function. Microsoft
addressed this by going to great lengths to improve compatibility with application specific
tweaks and shims and providing tools to allow users to try these tweaks and shims on their own
applications.
Application Isolation & Side-by-Side Assemblies
A common issue in previous versions of Windows was that users frequently suffered from DLL
hell, where more than one version of the same Dynamically Linked Library (DLL) was installed
on the computer. As software relies on DLLs, using the wrong version could result in nonfunctional applications, or worse. Windows XP solved this problem by introducing side-by-side
assemblies.
demand to the appropriate application keeping applications isolated from each other and not
using common dependencies.
Windows XP also introduced a new mode of COM object registration called Registration-free
COM. This makes it possible for applications that need to install COM objects to store all the
required COM registry information in the application's directory, instead of in the global registry,
where, strictly speaking, only a single application will ever use it.
DLL hell can be substantially avoided using Registration-free COM, the only limitation being it
requires at least Windows XP or later Windows versions and that it must not be used for EXE
COM servers or system-wide components such as MDAC, MSXML, DirectX or Internet
Explorer.
6.2 ASP.NET
The .NET Framework is the next evolution of the Microsoft development platform. It
supports multiple languages and allows them to seamlessly interact with each other via common
standards. One of these standards is the Common Language Runtime (CLR), which provides a
set of services that are common to all applications such as memory management, cross language
integration, access security, debugging and more. The Common Type System (CTS), another important
standard, provides a standard set of data types that are used by all .NET languages. A
powerful versioning system is a part of the .NET framework. It avoids the occurrence of DLL
collision problems of the past. Powerful object-oriented features have been introduced to all
.NET languages. It also provides a rich library of well-defined base classes. It is intended for
highly distributed software, making internet functionality and interoperability easier and more
transparent than ever before.
Visual Studio .NET is a complete set of development tools for building ASP web
applications, XML web services, desktop applications, and mobile applications. Visual Basic
.NET, Visual C++ .NET and Visual C# .NET, all use the same IDE, which allows them to share
tools and facilities in the creation of mixed-language solutions. In addition, these languages
leverage the functionality of the .NET Framework, which provides access to key technologies
that simplify the development of ASP web applications and XML web services.
Whether object code is stored and executed locally, but internet-distributed or executed
remotely.
including the code created by an unknown or semi-trusted third party and an environment that
eliminates the performance problems of scripted or interpreted environments.
To build all communication on industry standards to ensure that code based on .NET
Console Applications
ASP.NET Applications
Windows Services
ASP.NET in turn is part of the .NET framework, so that it provides access to all of the
features of that framework. For example, you can create ASP.NET web applications using any
.NET programming language (Visual Basic, C#, managed extensions for C++, and many others)
and .NET debugging facilities. You access data using .NET framework classes, and so on.
ASP.NET web applications run on a web server configured with Microsoft Internet
Information Services (IIS). However, you do not need to work directly with IIS. You can
program IIS facilities using ASP.NET classes, and Visual Studio handles file management tasks
such as creating IIS applications when needed and providing ways for you to deploy your web
applications to IIS.
Features of ASP.NET
ASP.NET Page Framework and Web Forms Page
Web forms pages run on any browser or client device. However you can design
your web forms page to target a specific browser such as MS IE5.0 and take advantage of the
features of a specific browser or client device. ASP.NET supports mobile controls for web
enabled devices such as cellular phones, handheld computers and PDAs.
The ASP.NET page framework creates an abstraction of the traditional client-server
web interaction so that you can program your application using traditional methods and
tools that support Rapid Application Development (RAD) and OOP.
Within web forms pages you can work with HTML elements using properties,
methods and events. The ASP.NET page framework removes the implementation details of the
separation of client and server inherent in web based applications by presenting a unified model
for responding to client events in code that runs at the server. The framework also automatically
maintains the state of the page and the controls on that page during the page processing life
cycle.
XML Web Services
ASP.NET provides intrinsic state management functionality that allows you to save
and manage application-specific, session-specific, and developer-defined information. This
information can be independent of any controls on the page. It can be shared between pages,
such as customer information. ASP.NET offers distributed state facilities. You can create multiple
instances of the same application on one computer or on several computers.
ASP.NET Optimization
In this day of business-to-business and business-to-consumer e-commerce, slow web
applications can waste resources and drive customers away from your company. Web site
performance is an extremely important issue for the developers writing code and for the system
administrators maintaining applications.
ASP.NET incorporates a variety of features and tools that allow you to design and
implement high-performance web applications.
The features include:
ASP.NET gives you the ability to create web applications that meet the demands that
arise when they must process large numbers of requests simultaneously.
Application Events
ASP.NET allows you to include application event handling code in the optional
Global.asax file. You can use application events to manage application-wide information and
perform orderly application start-up and clean-up tasks.
Compilation
All ASP.NET code, including server scripts, is compiled, which allows for
strong typing, performance optimizations and early binding, among other benefits. Once the code
has been compiled, the runtime further compiles ASP.NET to native code, providing improved
performance.
Configuration
ASP.NET configuration settings are stored in XML based files. Since these
XML files are ASCII text files, you can read and modify them, so it is simple to make
configuration changes to your web application. Each of your applications can have its own
configuration file and you can extend the configuration scheme to suit your requirements.
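Because the settings are plain XML text, any XML parser can read them. The sketch below uses Python's standard xml.etree module on a made-up fragment modeled on the appSettings section commonly found in web.config files. The keys, the values and the helper name read_app_settings are illustrative assumptions; inside an ASP.NET application you would normally use the framework's own configuration classes instead.

```python
import xml.etree.ElementTree as ET

# A made-up configuration fragment; the keys and values are illustrative only.
CONFIG_XML = """<configuration>
  <appSettings>
    <add key="SiteName" value="Template Extractor" />
    <add key="PageSize" value="20" />
  </appSettings>
</configuration>"""


def read_app_settings(xml_text):
    """Return the key/value pairs found under <appSettings>."""
    root = ET.fromstring(xml_text)
    return {node.get("key"): node.get("value")
            for node in root.findall("./appSettings/add")}


settings = read_app_settings(CONFIG_XML)
print(settings["PageSize"])  # 20
```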
Security
ASP.NET provides default authorization and authentication schemes for web
applications. You can easily remove, add to, or replace these schemes depending upon the needs
of your application.
Debugging Support
ASP.NET takes advantage of the runtime debugging infrastructure to provide
cross-language and cross-compiler debugging support, used both locally and remotely from a web
server. In addition, the ASP.NET page framework provides a trace mode that enables you to insert
instrumentation messages into your forms.
Extensibility
Mobile web forms and mobile web forms controls offer the same extensibility
features available in ASP.NET and add support for working with multiple devices. Specifically,
the following kinds of extensibility are provided.
New mobile controls can be written and used in a mobile web forms page.
ASP.NET user controls can be used to write simple mobile controls declaratively.
The output of any control can be customized on a device-specific basis by adding a new
adapter for the control.
Support for an entirely new device can be added by using adapter extensibility, without
any changes to individual applications.
Calendar
The calendar control offers date-picking functionality and is displayed on mobile
devices. If a user selects a date, browses to another page and browses back to the page with the
calendar control, the user has no indication of which date they selected. Usability is improved if
the page displays the selected date.
Command
The command control provides a way to invoke Microsoft ASP.NET event
handlers from UI elements.
CompareValidator
CustomValidator
The CustomValidator control allows the developer to provide a custom method to
validate another control's field.
Form
The form control represents the outermost grouping of controls within a mobile
page object.
Image
The Image control provides the capability to specify the image that you want to
display on a wireless device.
RangeValidator
The RangeValidator control validates that the value of another control falls within
an allowable range, where the minimum and the maximum are provided either directly or by
reference to another control.
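The range check itself is simple to express. The following Python sketch is not the ASP.NET control; it only mimics the idea that each bound can be given directly or looked up from another control. The dictionary of controls and the parameter names are hypothetical.

```python
def in_range(value, minimum, maximum):
    """Return True when minimum <= value <= maximum."""
    return minimum <= value <= maximum


def validate_range(controls, target, minimum=None, maximum=None,
                   minimum_control=None, maximum_control=None):
    """Check controls[target] against bounds given directly or by control name."""
    low = controls[minimum_control] if minimum_control else minimum
    high = controls[maximum_control] if maximum_control else maximum
    return in_range(controls[target], low, high)


form = {"age": 34, "min_age": 18, "max_age": 65}
print(validate_range(form, "age", minimum=18, maximum=65))                 # True
print(validate_range(form, "age", minimum_control="min_age", maximum=30))  # False
```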
RegularExpressionValidator
Textbox
Label
The label control creates a text-based control that displays output-only text on a
mobile device.
Hyperlink
Panel
The panel control provides a grouping mechanism for organizing controls. Panel
control can be recursively nested within a form control.
components for creating distributed, data sharing applications. It is an integral part of the .NET
framework, providing access to relational data, XML and application data. ADO.NET supports a
wide variety of development needs, including the creation of front end database clients and
middle-tier business objects used by applications, tools, languages or Internet browsers.
ASP.NET includes data access tools that make it easier for you to
design sites that allow your users to interact with databases through web pages. The .NET
framework includes two data providers for accessing enterprise databases, the OLE DB .NET
data provider and the SQL Server .NET data provider. SQL databases can be accessed from
ASP.NET using classes such as SqlConnection, DataSet and SqlDataReader.
The .NET framework includes three controls that make the display of large amounts of
data easier.
Overview of ADO.NET
ADO.NET cleanly factors data access from data manipulation into discrete
components that can be used separately or in tandem. It includes .NET data providers for
connecting to a database, executing commands, and retrieving results. Those results are either
processed directly, or placed in an ADO.NET DataSet object in order to be exposed to the user in
an ad hoc manner, combined with data from multiple sources, or remoted between tiers. The
ADO.NET DataSet object can also be used independently of a .NET data provider to manage data
local to the application or sourced from XML.
ADO.NET Components
The ADO.NET components have been designed to factor data access from
data manipulation. Two central components of ADO.NET accomplish this: the
DataSet, and the .NET data provider, which is a set of components including the Connection,
Command, DataReader and DataAdapter objects. The ADO.NET DataSet is the core component
of the disconnected architecture of ADO.NET.
Internet Information Server (IIS)
Internet Information Server is a web server developed by Microsoft that runs on the
Windows NT/Windows 2000 platform. Internet Information Server 5.0 (IIS) is fully integrated at
the operating system level, so Windows 2000 Server lets organizations add Internet capabilities
that weave directly into the rest of their computing infrastructure.
Application Protection
Internet Information Server 5.0 offers improved protection and increased
reliability for web applications. By default, IIS runs all applications in a common or pooled
process that is separate from core IIS processes. In addition, administrators can still isolate
mission-critical applications that should be run outside of both core IIS and pooled
processes.
Integrated setup and upgrade
Remote Administration
Internet Information Server 5.0 has web-based administration tools that allow
remote management of a server from almost any browser on any platform. With IIS 5.0,
administrators can set up administration accounts with limited privileges on web sites, to help
distribute administrative tasks.
Certificate Storage
IIS certificate storage is now integrated with Windows CryptoAPI storage. The
Windows certificate manager provides a single point of entry that lets administrators store, back
up and configure server certificates.
Protocol Compliance
IIS is fully integrated with the Kerberos v5 authentication protocol implemented in
Microsoft Windows 2000. This means administrators can pass authentication credentials among
connected computers running Windows.
6.3 HTML
HTML is written in the form of tags, surrounded by angle brackets. HTML can also
describe, to some degree, the appearance and semantics of a document, and can include
embedded scripting language code (such as JavaScript) which can affect the behavior of Web
browsers and other HTML processors.
Elements
Elements are the basic structure for HTML markup. Elements have two basic properties:
attributes and content. Each attribute and each element's content has certain restrictions that must
be followed for an HTML document to be considered valid. An element usually has a start tag
(e.g. <element-name>) and an end tag (e.g. </element-name>). The element's attributes
are contained in the start tag and content is located between the tags (e.g.
<element-name attribute="value">Content</element-name>). Some elements, such as
<br>, do not have any content and must not have a closing tag. Listed below are several types of
markup elements used in HTML.
Presentational markup describes the appearance of the text, regardless of its function. For
example <b>boldface</b> indicates that visual output devices should render "boldface" in
bold text, but gives no indication what devices which are unable to do this (such as aural devices
that read the text aloud) should do. In the case of both <b>bold</b> and <i>italic</i>,
there are elements which usually have an equivalent visual rendering but are more semantic in
Attributes
Most of the attributes of an element are name-value pairs, separated by "=", and written
within the start tag of an element, after the element's name. The value may be enclosed in single
or double quotes, although values consisting of certain characters can be left unquoted in HTML
(but not XHTML). Leaving attribute values unquoted is considered unsafe. Most elements can
take any of several common attributes:
The id attribute provides a document-wide unique identifier for an element. This can be
used by style sheets to provide presentational properties, by browsers to focus attention on
the specific element, or by scripts to alter the contents or presentation of an element.
The class attribute provides a way of classifying similar elements for presentation
purposes. For example, an HTML document might use the designation
class="notation" to indicate that all elements with this class value are subordinate to
the main text of the document. Such elements might be gathered together and presented as
footnotes on a page instead of appearing in the place where they occur in the HTML source.
The title attribute is used to attach a subtextual explanation to an element. In most
browsers this attribute is displayed as what is often referred to as a tooltip.
As of version 4.0, HTML defines a set of 252 character entity references and a set of
1,114,050 numeric character references, both of which allow individual characters to be written
via simple markup, rather than literally. A literal character and its markup counterpart are
considered equivalent and are rendered identically.
The ability to "escape" characters in this way allows for the characters < and & (when
written as &lt; and &amp;, respectively) to be interpreted as character data, rather than
markup. For example, a literal < normally indicates the start of a tag, and & normally indicates
the start of a character entity reference or numeric character reference; writing it as &amp; or
&#x26; or &#38; allows & to be included in the content of elements or the values of attributes.
The double-quote character ("), when used to quote an attribute value, must also be escaped as
&quot; or &#x22; or &#34; when it appears within the attribute value itself. The single-quote
character ('), when used to quote an attribute value, must also be escaped as &#x27; or
&#39; (should NOT be escaped as &apos; except in XHTML documents) when it appears
within the attribute value itself.
However, since document authors often overlook the need to escape these characters,
browsers tend to be very forgiving, treating them as markup only when subsequent text appears
to confirm that intent.
Escaping also allows for characters that are not easily typed or that aren't even available
in the document's character encoding to be represented within the element and attribute content.
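The escaping rules above can be illustrated with a short C# sketch. It assumes .NET Framework
4.0 or later, where System.Net.WebUtility is available; the class name EscapeSketch is
hypothetical and is not part of the project code.

```csharp
using System;
using System.Net;

public static class EscapeSketch
{
    // Escapes markup-significant characters so the text is treated as character data.
    public static string Escape(string text)
    {
        return WebUtility.HtmlEncode(text);
    }

    // Reverses the escaping, turning character references back into literal characters.
    public static string Unescape(string text)
    {
        return WebUtility.HtmlDecode(text);
    }
}
```

For example, EscapeSketch.Escape("a < b & c") returns "a &lt; b &amp; c", and passing that
result to EscapeSketch.Unescape gives back the original string.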
Data Types
HTML defines several data types for element content, such as script data and style sheet
data, and a plethora of types for attribute values, including IDs, names, URIs, numbers, units of
length, languages, media descriptors, colors, character encodings, dates and times, and so on. All
of these data types are specializations of character data.
The Document Type Declaration
HTML documents are required to start with a Document Type Declaration (informally, a
doctype). In browsers, the function of the doctype is to select the rendering mode,
particularly to avoid quirks mode.
The original purpose of the document type is to enable validation based on Document
Type Definition (DTD) with SGML tools. The DTD to which the DOCTYPE refers contains
machine-readable grammar specifying the permitted and prohibited content for a document
conforming to such a DTD. Browsers do not read the DTD, however. HTML5 validation is not
DTD-based, so in HTML5 the document type does not refer to a DTD.
HTTP
The World Wide Web is composed primarily of HTML documents transmitted from a
Web server to a Web browser using the Hypertext Transfer Protocol (HTTP). However, HTTP
can be used to serve images, sound, and other content in addition to HTML.
To allow the Web browser to know how to handle the document it received, an indication
of the file format of the document must be transmitted along with the document. This vital
metadata includes the MIME type (text/html for HTML 4.01 and earlier,
application/xhtml+xml for XHTML 1.0 and later) and the character encoding.
In modern browsers, the MIME type that is sent with the HTML document affects how
the document is interpreted. A document sent with an XHTML MIME type, or served as
application/xhtml+xml, is expected to be well-formed XML, and a syntax error causes the
browser to fail to render the document.
The same document sent with an HTML MIME type, or served as text/html, might be
displayed successfully, since Web browsers are more lenient with HTML. However, XHTML
parsed in this way is not considered either proper XHTML or HTML, but so-called tag soup.
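The difference between strict XML parsing and lenient HTML parsing can be sketched in C# with a
well-formedness check. This is only an illustration of the requirement described above, not of
how any browser is implemented; the class and method names are assumptions.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class WellFormedSketch
{
    // Returns true when the markup parses as well-formed XML, as a document served as
    // application/xhtml+xml must; any syntax error makes the parser throw XmlException.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            XDocument.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

With this helper, "<p>Hello<br/></p>" is accepted, while "<p>Hello<br></p>" is rejected because
the br element is never closed, even though an HTML parser would display it without complaint.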
HTML E-Mail
Most graphical e-mail clients allow the use of a subset of HTML (often ill-defined) to
provide formatting and semantic markup capabilities not available with plain text, like
emphasized text, block quotations for replies, and diagrams or mathematical formulas that could
not easily be described otherwise.
Many of these clients include both a GUI editor for composing HTML e-mail messages
and a rendering engine for displaying received HTML messages. Use of HTML in e-mail is
controversial because of compatibility issues, because it can be used in privacy attacks, because
it can confuse spam filters, and because the message size is larger than plain text.
Services. The term repository is used only in reference to the repository engine within Meta Data
Services.
Microsoft SQL Server 2005 provides a scalable database that combines ease of use with
complex analysis and data warehousing tools. SQL Server includes a rich Graphical User
Interface (GUI) along with a complete development environment for creating data-driven
applications.
SQL Server Architecture
Microsoft SQL Server 2005 is designed to work effectively as:
Internet Integration
The SQL Server 2000 database engine includes integrated XML support. It also has the
scalability, availability, and security features required to operate as the data storage component of
the largest Web sites.
The SQL Server 2000 programming model is integrated with the Windows DNA
architecture for developing Web applications, and SQL Server 2000 supports features such as
English Query and Microsoft Search Service to incorporate user-friendly queries and powerful
search capabilities in Web applications.
The SQL Server 2005 relational database engine supports the features required to
support demanding data processing environments. The database engine protects data integrity
while minimizing the overhead of managing thousands of users concurrently modifying the
database.
SQL Server 2000 distributed queries allow you to reference data from multiple sources as
if it were a part of a SQL Server 2000 database, while at the same time, the distributed
transaction support protects the integrity of any updates of the distributed data.
Ease of installation, deployment, and use
SQL Server 2000 includes a set of administrative and development tools that improve
upon the process of installing, deploying, managing, and using SQL Server across several sites.
SQL Server 2000 also supports a standards-based programming model integrated with the
Windows DNA, making the use of SQL Server database and data warehouses a seamless part of
building powerful and scalable systems.
These features allow you to rapidly deliver SQL Server applications that customers can
implement with a minimum of installation and administrative overhead.
Data warehousing
SQL Server 2000 includes tools for extracting and analyzing summary data for online
analytical processing. SQL Server also includes tools for visually designing databases and
analyzing data using English-based questions.
7. SYSTEM DESIGN
7.1 Data Flow Diagram
Level 0:
[Figure: Level 0 data flow diagram. Labels: Admin, User, Db, Store, Add WebPages, Webpage
Details, Template Extraction, User Details, Registration, Select Compartment, View Template,
Get Templates for Websites, View Details]
Level 1:
[Figure: Level 1 data flow diagram. Labels: Admin, Admin Entry, Get Clarify Details, Login from
Database, Db, Login, Store, Add Template, Check User Details, Metadata Extraction, View
Templates]
Level 2:
[Figure: Level 2 data flow diagram. Labels: User, User Entry, Login From Database, Db,
Registration, Store, Login, Final Templates, View Websites Information, Select Templates,
Document Classification, View Information]
Column Name     Data Type       Description
U_Name          Varchar(30)     User Name
Password        Varchar(30)     Password

Document
Column Name     Data Type       Description
Doc_Id          Int             Document Id
Doc_Name        Varchar(30)     Document Name
Author          Varchar(30)     Author
Title           Varchar(30)     Title
Description     Text            Description
Files           Text            Files
8. PROJECT ELUCIDATION
After careful analysis, the TEXT system has been divided into the
following modules:
1. Metadata Extraction
2. Rule-based Approach
3. Machine Learning Approach
4. Document Classification for Metadata Extraction
5. Knowledge-based Classifications
1. Metadata Extraction
In this module, metadata extraction has the following meaning. Metadata refers to
information about a document that is used to catalogue the document and later to allow users to
search for and locate it. It is commonly clustered on one or more pages; examples include title,
creators, affiliation, publisher, language, ID number, and date. Extraction refers to the process of
automatically locating the pages that contain metadata, extracting the metadata and tagging each
item with the appropriate type. We classify approaches to building a metadata extraction system
into rule-based approaches and machine-learning approaches.
2. Rule-based Approach
The steps of building a rule-based metadata extraction system are typically as follows:
first, some experts examine samples of the document collection and define rules for metadata
extraction; then, software developers implement these rules either as part of an expert system or
as part of an ad hoc rule engine. The accuracy, inventiveness, and appropriateness of the rules
that the experts define play a critical role in building a system with high accuracy.
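A very small ad hoc rule engine of this kind might look like the following C# sketch. The two
rules (a "Title:" line and a "Date:" line in yyyy-mm-dd form) are hypothetical examples of what
an expert might define for one document collection; they are not the rules used in this project.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RuleBasedSketch
{
    // Each rule pairs a metadata type with a pattern whose first group captures the value.
    // Both rules are hypothetical and would normally come from the experts' analysis.
    static readonly KeyValuePair<string, Regex>[] Rules =
    {
        new KeyValuePair<string, Regex>("title",
            new Regex(@"^Title:\s*(.+)$", RegexOptions.Multiline)),
        new KeyValuePair<string, Regex>("date",
            new Regex(@"^Date:\s*(\d{4}-\d{2}-\d{2})\s*$", RegexOptions.Multiline)),
    };

    // Applies every rule to the page text and tags the first match with the rule's type.
    public static Dictionary<string, string> Extract(string pageText)
    {
        Dictionary<string, string> found = new Dictionary<string, string>();
        foreach (KeyValuePair<string, Regex> rule in Rules)
        {
            Match m = rule.Value.Match(pageText);
            if (m.Success)
            {
                found[rule.Key] = m.Groups[1].Value.Trim();
            }
        }
        return found;
    }
}
```

Keeping the rules in a table rather than in the control flow means an expert can add or adjust a
rule without touching the extraction loop.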
of special text patterns (e.g. three letters followed by nine digits), the occurrences of the words
from special databases (e.g., a word from a dictionary of English last names) in the text, and the
statistical features of the text (e.g. a text with more than 50% letters in upper case).
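The three kinds of feature mentioned above can be sketched in C# as follows. The pattern and the
50% threshold come from the examples in the text; the three-entry name list is a hypothetical
stand-in for a real dictionary of English last names, and all identifiers are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TextFeatureSketch
{
    // Special text pattern: three letters followed by nine digits.
    static readonly Regex IdPattern = new Regex(@"\b[A-Za-z]{3}\d{9}\b");

    // Hypothetical stand-in for a dictionary of English last names.
    static readonly HashSet<string> LastNames = new HashSet<string>(
        new[] { "Smith", "Jones", "Taylor" }, StringComparer.OrdinalIgnoreCase);

    public static bool HasIdPattern(string text)
    {
        return IdPattern.IsMatch(text);
    }

    // True when any word of the text occurs in the name dictionary.
    public static bool ContainsLastName(string text)
    {
        return Regex.Split(text, @"\W+").Any(word => LastNames.Contains(word));
    }

    // Statistical feature: more than 50% of the letters are upper case.
    public static bool IsMostlyUpperCase(string text)
    {
        int letters = text.Count(c => char.IsLetter(c));
        int upper = text.Count(c => char.IsUpper(c));
        return letters > 0 && upper * 2 > letters;
    }
}
```

Each method returns a single boolean, so the results can be combined into a feature vector for a
block of text.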
5. Knowledge-based Classification
In this module, we define a set of page formats in advance, and classify a document into a
group based on which page format it matches. A page format includes information about the
features of the blocks, the relationships among the blocks, the sample pages, and the threshold
values of the similarity. Basically, a page format consists of a set of criteria (i.e. features of
blocks, block relations, similarity threshold values). Only the pages that meet all these
requirements are classified into the group associated with this page format. The module provides
several ways to measure the similarity of a page and a sample page.
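The classification rule can be sketched in C# under some simplifying assumptions: a page is
reduced to a set of block-feature names, a page format lists the features it requires plus the
features of one sample page, and similarity is measured with the Jaccard coefficient. The text
does not say which similarity measures the module uses, so Jaccard is only a stand-in, and all
type and member names here are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class PageFormat
{
    public string Group;                      // group assigned to matching pages
    public HashSet<string> RequiredFeatures;  // block features every matching page must have
    public HashSet<string> SampleFeatures;    // features of the sample page
    public double Threshold;                  // minimum similarity to the sample page
}

public static class FormatClassifierSketch
{
    // Jaccard coefficient: size of the intersection divided by size of the union.
    public static double Jaccard(HashSet<string> a, HashSet<string> b)
    {
        int union = a.Union(b).Count();
        return union == 0 ? 1.0 : (double)a.Intersect(b).Count() / union;
    }

    // Returns the group of the first format whose criteria are all met, or null if none match.
    public static string Classify(HashSet<string> page, IEnumerable<PageFormat> formats)
    {
        foreach (PageFormat format in formats)
        {
            bool meetsAll = format.RequiredFeatures.IsSubsetOf(page)
                && Jaccard(page, format.SampleFeatures) >= format.Threshold;
            if (meetsAll)
            {
                return format.Group;
            }
        }
        return null;
    }
}
```

A page that lacks a required feature, or whose similarity to the sample falls below the threshold,
is left unclassified, which mirrors the "meet all these requirements" condition above.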
9. CODING
AdminLogin.aspx.cs
using System;
using System.Collections;
using System.Configuration;
using System.Data;
using System.Linq;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.HtmlControls;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Xml.Linq;
using System.Data.SqlClient;
public partial class AdminLogin : System.Web.UI.Page
{
ClsDbLayer _objDb = new ClsDbLayer();
SqlDataReader dr;
protected void Page_Load(object sender, EventArgs e)
{
}
protected void Button1_Click(object sender, EventArgs e)
{
Extraction.aspx.cs
EXTRACTION
using System;
using System.Collections;
using System.Configuration;
using System.Data;
using System.Linq;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.HtmlControls;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Xml.Linq;
using System.Data.SqlClient;
using System.IO;
public partial class Extraction : System.Web.UI.Page
{
    ClsDbLayer _objDb = new ClsDbLayer();
    int i;
    DataSet ds;

    protected void Page_Load(object sender, EventArgs e)
    {
        // Populate the document list only on the first request, not on postbacks.
        if (!IsPostBack)
        {
            string Query = "select Doc_Id from T_Document";
            ds = _objDb.Display(Query);
            DropDownList1.DataTextField = "Doc_Id";
            DropDownList1.DataValueField = "Doc_Id";
            DropDownList1.DataSource = ds;
            DropDownList1.DataBind();
        }
    }

    protected void Button1_Click(object sender, EventArgs e)
    {
        // NOTE: the statement is built by string concatenation, so user input reaches
        // the SQL text unescaped; a parameterized command would be safer.
        string Query = "insert into Extract values('" +
            DropDownList1.SelectedItem.ToString() + "','" + TextBox1.Text + "','" +
            TextBox2.Text + "','" + TextBox3.Text + "','" + TextBox4.Text + "','" +
            TextBox5.Text + "','" + FileUpload1.PostedFile.FileName + "')";
        i = _objDb.Save(Query);
        if (i != -1)
        {
            Response.Write("Saved Successfully");
        }
        else
        {
            // Save returns -1 when the insert fails.
            Response.Write("Not Saved");
        }
    }

    protected void Button2_Click(object sender, EventArgs e)
    {
        // Read the uploaded file from the request stream; PostedFile.FileName is the
        // client-side name and need not exist on the server.
        if (FileUpload1.HasFile)
        {
            using (StreamReader Str = new StreamReader(FileUpload1.PostedFile.InputStream))
            {
                TextBox7.Text = Str.ReadToEnd();
            }
        }
    }

    protected void DropDownList1_SelectedIndexChanged(object sender, EventArgs e)
    {
        string Query = "select Doc_Name,Title from T_Document where Doc_Id ='" +
            DropDownList1.SelectedItem.ToString() + "'";
        SqlDataReader dr = _objDb.Select(Query);
        // Select returns null when the query fails.
        if (dr != null)
        {
            if (dr.Read())
            {
                TextBox1.Text = dr[0].ToString();
                TextBox2.Text = dr[1].ToString();
            }
            dr.Close();
        }
    }
}
ViewDocuments.aspx.cs
using System;
using System.Collections;
using System.Configuration;
using System.Data;
using System.Linq;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.HtmlControls;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Xml.Linq;
using System.Data.SqlClient;
public partial class ViewDocuments : System.Web.UI.Page
{
    ClsDbLayer _objDb = new ClsDbLayer();
    DataSet ds;

    protected void Page_Load(object sender, EventArgs e)
    {
        string Query = "select * from T_Document ";
        ds = _objDb.Display(Query);
        GridView1.DataSource = ds;
        GridView1.DataBind();
    }
}
UserRegistration.aspx.cs
using System;
using System.Collections;
using System.Configuration;
using System.Data;
using System.Linq;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.HtmlControls;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Xml.Linq;
using System.Data.SqlClient;
public partial class UserRegistration : System.Web.UI.Page
{
    ClsDbLayer _objDb = new ClsDbLayer();
    int i;

    protected void Page_Load(object sender, EventArgs e)
    {
    }

    protected void ImageButton1_Click1(object sender, ImageClickEventArgs e)
    {
        // NOTE: the statement is built by string concatenation, so user input reaches
        // the SQL text unescaped; a parameterized command would be safer.
        string Query = "insert into U_Reg values('" + TextBox1.Text + "','" +
            TextBox2.Text + "','" + TextBox3.Text + "','" + TextBox4.Text + "','" +
            TextBox5.Text + "','" + TextBox6.Text + "','" + TextBox7.Text + "','" +
            TextBox8.Text + "')";
        i = _objDb.Save(Query);
        // Save returns -1 when the insert fails.
        if (i != -1)
        {
            Response.Write("Submitted");
        }
        else
        {
            Response.Write("Not Submitted");
        }
    }
}
UserLogin.aspx.cs
using System;
using System.Collections;
using System.Configuration;
using System.Data;
using System.Linq;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.HtmlControls;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Xml.Linq;
using System.Data.SqlClient;
public partial class UserLogin : System.Web.UI.Page
{
    ClsDbLayer _objDb = new ClsDbLayer();
    SqlDataReader dr;

    protected void Page_Load(object sender, EventArgs e)
    {
    }

    protected void Button1_Click(object sender, EventArgs e)
    {
        // Compare with "=" rather than "like", so that wildcard characters such as "%"
        // in the input cannot match every account.
        // NOTE: the query is still built by string concatenation; a parameterized
        // command would be safer.
        string Query = "select U_Name,Password from U_Reg where U_Name = '" +
            TextBox1.Text + "' and Password = '" + TextBox2.Text + "'";
        dr = _objDb.Select(Query);
        // Select returns null when the query fails.
        if (dr != null && dr.Read())
        {
            TextBox1.Text = dr[0].ToString();
            TextBox2.Text = dr[1].ToString();
            dr.Close();
            Response.Redirect("UserPage.aspx");
        }
        else
        {
            Response.Write("No Values");
        }
    }

    protected void ImageButton1_Click(object sender, ImageClickEventArgs e)
    {
        Response.Redirect("Default.aspx");
    }
}
Books.aspx.cs
using System;
using System.Collections;
using System.Configuration;
using System.Data;
using System.Linq;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.HtmlControls;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Xml.Linq;
{
if (LinkButton5.Text == "Book Format")
{
Response.Redirect("Contents.aspx");
}
if (LinkButton5.Text == "C Sharp")
{
Response.Redirect("TemplateBook.aspx");
}
}
}
SelectBooks.aspx.cs
DOCUMENTS SELECT
using System;
using System.Collections;
using System.Configuration;
using System.Data;
using System.Linq;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.HtmlControls;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Xml.Linq;
using System.Data.SqlClient;
public partial class Books : System.Web.UI.Page
{
    ClsDbLayer _objDb = new ClsDbLayer();
    SqlDataReader dr;
    DataSet ds;

    protected void Page_Load(object sender, EventArgs e)
    {
        Panel1.Visible = false;
        // Populate the document list only on the first request, not on postbacks.
        if (!IsPostBack)
        {
            string Query = "select Doc_Name from T_Document";
            ds = _objDb.Display(Query);
            DropDownList1.DataTextField = "Doc_Name";
            DropDownList1.DataValueField = "Doc_Name";
            DropDownList1.DataSource = ds;
            DropDownList1.DataBind();
        }
    }

    protected void DropDownList1_SelectedIndexChanged(object sender, EventArgs e)
    {
        Panel1.Visible = true;
        string Query = "select * from T_Document where Doc_Name like '" +
            DropDownList1.SelectedItem.ToString() + "'";
        dr = _objDb.Select(Query);
        // Select returns null when the query fails.
        if (dr != null && dr.Read())
        {
            Label1.Text = "Document Id :" + dr[0].ToString();
            Label2.Text = "Document Name :" + dr[1].ToString();
            Label3.Text = "Author Name :" + dr[2].ToString();
            Label4.Text = "Title :" + dr[3].ToString();
        }
        else
        {
            Response.Write("No Books");
        }
        if (dr != null)
        {
            dr.Close();
        }
    }

    protected void LinkButton1_Click(object sender, EventArgs e)
    {
        if (DropDownList1.SelectedItem.Text == "C Sharp")
        {
            Response.Redirect("TemplateBook.aspx");
        }
        if (DropDownList1.SelectedItem.Text == "Asp .Net")
        {
            Response.Redirect("Asp.aspx");
        }
    }
}
ClsDbLayer.cs
using System;
using System.Data;
using System.Configuration;
using System.Linq;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.HtmlControls;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Xml.Linq;
using System.Data.SqlClient;
/// <summary>
/// Summary description for ClsDbLayer
/// </summary>
public class ClsDbLayer
{
    // Connection string shared by all three methods.
    const string ConnectionString =
        "Data Source=.;Initial Catalog=TTEXT;Integrated Security=True";

    SqlConnection con;
    SqlCommand cmd;
    SqlDataReader dr;
    SqlDataAdapter da;
    DataSet ds;

    // Executes an insert, update or delete statement.
    // Returns the number of affected rows, or -1 on failure.
    public int Save(string Query)
    {
        int i;
        con = null;
        cmd = null;
        try
        {
            con = new SqlConnection(ConnectionString);
            con.Open();
            cmd = new SqlCommand(Query, con);
            i = cmd.ExecuteNonQuery();
        }
        catch (Exception)
        {
            i = -1;
        }
        finally
        {
            // Either object is still null if its constructor threw.
            if (cmd != null) cmd.Dispose();
            if (con != null) con.Close();
        }
        return i;
    }

    // Runs a query and returns an open reader, or null on failure.
    // Closing the reader also closes the connection (CommandBehavior.CloseConnection).
    public SqlDataReader Select(string Query)
    {
        con = null;
        try
        {
            con = new SqlConnection(ConnectionString);
            con.Open();
            cmd = new SqlCommand(Query, con);
            dr = cmd.ExecuteReader(CommandBehavior.CloseConnection);
        }
        catch (Exception)
        {
            if (con != null) con.Close();
            dr = null;
        }
        return dr;
    }

    // Runs a query and returns the result as a DataSet, or null on failure.
    public DataSet Display(string Query)
    {
        try
        {
            con = new SqlConnection(ConnectionString);
            cmd = new SqlCommand(Query, con);
            da = new SqlDataAdapter(cmd);
            ds = new DataSet();
            // Fill opens the connection if it is closed and closes it again afterwards.
            da.Fill(ds);
        }
        catch (Exception)
        {
            ds = null;
        }
        return ds;
    }
}
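The page classes above pass fully assembled SQL strings to ClsDbLayer, which leaves the queries
open to SQL injection. As a hedged sketch of an alternative, the login query could be expressed
as a parameterized command. The class and method names below are hypothetical and are not part
of the project; the Varchar(30) sizes are taken from the U_Name and Password columns in the
database design, on the assumption that those columns belong to the U_Reg table.

```csharp
using System;
using System.Data;
using System.Data.SqlClient;

public static class ParameterizedQuerySketch
{
    // Builds the login query with parameters instead of concatenated input, so the values
    // are sent to the server as data and never become part of the SQL text.
    // The caller is expected to assign an open SqlConnection before executing the command.
    public static SqlCommand BuildLoginCommand(string userName, string password)
    {
        SqlCommand cmd = new SqlCommand(
            "select U_Name from U_Reg where U_Name = @name and Password = @password");
        cmd.Parameters.Add("@name", SqlDbType.VarChar, 30).Value = userName;
        cmd.Parameters.Add("@password", SqlDbType.VarChar, 30).Value = password;
        return cmd;
    }
}
```

Because the command text is a constant, input such as x' or '1'='1 is compared literally with
the stored user name rather than altering the query.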
The TEXT: Automatic Template Extraction from Heterogeneous Webpages has been
coded using ASP.NET of Microsoft .NET Framework. The Backend is MS SQL Server.
The ASP.NET page framework is a programming framework that runs on a web server to
dynamically produce and manage Web Forms pages. In Visual Studio, Web Forms provides a form
designer, editor, controls and debugging, which together allow you to rapidly build server-based,
programmable user interfaces for browsers and web client devices.
The ASP.NET page framework creates an abstraction of the traditional client-server web
interaction so that you can program your application using traditional methods and tools that
support Rapid Application Development (RAD) and OOP.
Software testing is the process of executing a program or system with the intent of
finding errors. It is the major quality measure employed during software engineering
development. Its basic function is to detect errors in the software. Testing is necessary for the
proper functioning of the system.
To improve quality.
For Verification & Validation (V&V).
For reliability estimation.
The purpose of software testing is to assess and evaluate the quality of work performed at
each step of the software development process. Although it sometimes seems that way, the
purpose of testing is NOT to use up all the remaining budget or schedule resources at the end of a
development effort.
The goal of testing is to ensure that the software performs as intended, and to improve
software quality, reliability and maintainability. There are several rules that can serve as testing
objectives.
They are:
Testing is a process of executing a program with the intent of finding an error.
A good test case is one that has a high probability of finding an undiscovered
error.
A successful test is one that uncovers an as-yet undiscovered error.
A strategy for software testing integrates software test case design techniques into
a well-planned series of steps that result in the successful construction of software. Any testing
strategy must incorporate test planning, test case design, test execution and the resultant data
collection and evaluation. There are various software testing strategies. The test methods
employed during the development of TEXT: Automatic Template Extraction from
Heterogeneous Webpages include:
Unit Testing
Functional Testing
Stress Testing
Integration testing
Unit Testing:
Unit testing focuses verification effort on the smallest unit of each module
in the system. It comprises the set of tests performed by an individual programmer prior to the
integration of the unit into a larger system. It involves the various tests that a programmer
performs on a program unit.
Using the unit test plans prepared during the design phase of system development as a
guide, the control paths are tested to uncover errors within the boundary of the module. In this
testing, each module was tested and found to be working satisfactorily as per the expected
output from the module.
The unit test considerations that were taken into account are,
Interfacing errors
Integrity of local data structure
Boundary Condition
Independent Paths
Error Handling Paths
Functional Testing:
Functional testing involves exercising the code with correct input values for which
the expected results are known. The system being developed is tested with all the nominal input
values, so that the expected results are received. The system is also tested with the boundary
values.
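A functional test with nominal and boundary values can be sketched in C# as follows. The
validator UserNameRules is hypothetical and does not appear in the project code; its 30-character
limit is taken from the Varchar(30) U_Name column in the database design.

```csharp
using System;

public static class UserNameRules
{
    public const int MaxLength = 30; // matches the Varchar(30) U_Name column

    // A user name is valid when it is non-empty and fits into the column.
    public static bool IsValid(string name)
    {
        return !string.IsNullOrEmpty(name) && name.Length <= MaxLength;
    }
}

public static class UserNameTests
{
    // Returns true when every nominal and boundary case gives the expected result.
    public static bool RunAll()
    {
        return UserNameRules.IsValid("alice")                 // nominal value
            && UserNameRules.IsValid(new string('a', 30))     // exactly at the upper boundary
            && !UserNameRules.IsValid(new string('a', 31))    // just past the upper boundary
            && !UserNameRules.IsValid("");                    // lower boundary: empty input
    }
}
```

Each case pairs an input with a result that is known in advance, which is what distinguishes a
functional test from simply running the program.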
Stress Testing:
Stress tests are designed to overload a system in various ways. The system being
developed is tested by, for example, attempting to sign on more than the maximum number of
allowed terminals, inputting mismatched data types, and processing more than the allowed
number of identifiers.
Integration testing:
This is a systematic technique for constructing the program structure while at the
same time conducting tests to uncover errors associated with the interface. In this testing all the
modules are integrated and the entire system is tested as a whole. Errors are unlikely at this
stage, since each module has already been unit tested. If any error does occur, it is found in
this step and rectified before passing to the next step.
Direct Method
In direct method, a new system will be designed and implemented.
Cut-over Method
In Cut-over method, the existing system will be cut down and the new system will be
implemented.
Segment Method
In segment method, the existing system process is divided into a number of
segments. In the new system, they are implemented one by one.
Parallel Method
In parallel method, the new system is developed in parallel, independent of the existing
system.
Before installing the proposed system, we have to supply the client with a detailed report
covering the complete details and functions of the developed system. We also confirm whether
the hardware configuration required for smooth and reliable running of the software is present.
The next step is entering the data.
During the implementation stages, the client will not leave the existing system until
fully satisfied with the proposed system. During the initial stages, the existing and the
proposed systems will run hand in hand, since some time is required for
updating the transactions.
Finally, the TEXT: Automatic Template Extraction from Heterogeneous Webpages has
been successfully implemented.
The term maintenance is used to describe the software engineering activities that occur
following delivery of a software product to the customer. The maintenance phase of the software
life cycle is the time period in which a software product performs useful work.
involve providing new functional capabilities, improving user displays and modes of interaction,
upgrading external documents and internal documentation, or upgrading the performance
characteristics of a system. Adaptation of software to a new environment may involve
moving the software to a different machine or, for instance, modifying the software to
accommodate a new telecommunications protocol or an additional disk drive. Problem correction
involves correcting errors. Some errors require immediate attention, some can be corrected on a
scheduled, periodic basis, and others are known but never corrected.
System security is a vital area of software development that demands caution. Merely
developing and installing software does not complete the job. We must adopt some
mechanisms to maintain system security; otherwise the software will be of little value.
A number of security measures are adopted in the design of the TEXT project as well. No
information or data can be altered in the database during visualization.
The extraction and display of data sets take place only after full scrutiny and the
implementation of security policies. There is no possibility of accidental modification of data
in the data sets. Hence the TEXT: Automatic Template Extraction from Heterogeneous Webpages
project is sufficiently secured.
To estimate the cost of a software project at the time of planning, we need to use a cost
estimation technique. Cost estimation techniques are broadly classified as
1. Top-down Techniques
2. Bottom-up Techniques
Top-down approaches first focus on system costs such as computer resources as well as
costs for configuration management, quality assurance, system integration and training.
Bottom-up cost estimates first focus on the cost to develop each module or sub-system. Those
costs are combined to arrive at an overall estimate.
Of these four techniques, we selected the first and simplest method, known as Expert
Judgement, in order to estimate the total cost of the project titled TEXT: Automatic Template
Extraction from Heterogeneous Webpages. It is widely used for cost estimation and is a
top-down technique. This approach relies on the experience, background and business sense of
one or more key people in the organization.
In this method, the cost estimation for the current project is based on the cost spent
on previous projects. Here the experts available in the organization are the decision-makers.
They try to estimate the cost of the software project with the help of their experience and skill.
Sometimes it may not be possible to arrive at a definite estimate because of variations that
arise during the present project.
Scheduling of a software project does not differ greatly from scheduling of any multi-task
engineering effort. Therefore, generalized scheduling tools and techniques can be applied with
little modification to software projects.
Program Evaluation and Review Technique (PERT) and Critical Path Method (CPM) are
two project scheduling methods that can be applied to software development.
When creating a software project schedule, the planner begins with a set of tasks. In this
context, the inputs include the Work Breakdown Structure (WBS). From the work breakdown
inputs, a timeline chart, or Gantt chart, can be developed. The figure below illustrates the Gantt
chart for TEXT: Automatic Template Extraction from Heterogeneous Webpages.
[Figure: Gantt chart. Columns: Month 1 to Month 4, each divided into weeks W1 to W4. Task rows:]
Project Mgt.: a) Plan, b) Review, c) Problem defined
Development: a) Design, b) Code, c) Debug, d) Coding finished
Testing: a) Unit Test, b) Integration Test, c) Acceptance Test, d) Testing over
Support: a) Installation, b) Maintenance, c) Project Complete
16. CONCLUSION
This project takes less time for template extraction compared to existing
algorithms such as RTDM, Text-Hash and Text-Max. In this system we used the WaveK-Means
algorithm to find similarity between the web pages. This algorithm performs better than
the previous algorithms in terms of both space and time: it consumes less of each
than RTDM, Text-Hash and Text-Max. Our experimental results
with real-life data sets confirm the effectiveness and robustness of our algorithm.
[Screenshots, in order: Home Page; Admin Login; Admin Page; Document Classification; Extraction
Details; View Documents; Before Extraction; After Extraction; Before Extraction; User
Registration; User Login; User Page; View Documents; Document Selection; Report Document]
18. GLOSSARY
API
ASP
CLC
COCOMO
CPM
DFD
GUI
HTML
IDE
ISAM
PERT
RAD
SEP
SOA
TEXT
Template Extraction
VSAM
WBS
WWW
XML