
Enhanced Deduplication Technique in Cloud Computing

ABSTRACT:
Storing data is a key concern in cloud computing, and sending data from one device to another is an important task. Data duplication is a major problem when data is transferred from one place to another and from one device to another. Duplicated data wastes storage space in the cloud. Duplication means copying the data of one file into another file, either under a changed file name or the same name. A number of deduplication techniques are available to identify duplicated data. This paper focuses on deduplication of data and on reducing data storage by using compression techniques. To improve the results, two methods are adopted: unique segment and map segment. The results show the efficiency of deduplication and compression.

CONTENTS

1. INTRODUCTION
1.1. INTRODUCTION TO PROJECT
1.2. PURPOSE OF THE PROJECT
1.3. EXISTING SYSTEM & ITS DISADVANTAGES
1.4. PROPOSED SYSTEM & ITS ADVANTAGES

2. SYSTEM ANALYSIS
2.1. STUDY OF THE SYSTEM
2.2. INPUT & OUTPUT REPRESENTATION
2.3. PROCESS MODELS USED WITH JUSTIFICATION
2.4. SYSTEM ARCHITECTURE

3. FEASIBILITY STUDY
3.1. TECHNICAL FEASIBILITY
3.2. OPERATIONAL FEASIBILITY
3.3. ECONOMIC FEASIBILITY

4. SOFTWARE REQUIREMENT SPECIFICATIONS
4.1. FUNCTIONAL REQUIREMENTS
4.2. PERFORMANCE REQUIREMENTS
4.3. SOFTWARE REQUIREMENTS
4.4. HARDWARE REQUIREMENTS
4.5. FEATURES OF C#.NET
4.6. THE .NET FRAMEWORK
4.7. FEATURES OF SQL SERVER

5. SYSTEM DESIGN
5.1. INTRODUCTION
5.2. UML DIAGRAMS

6. OUTPUT SCREENS

7. SYSTEM TESTING
7.1. INTRODUCTION TO TESTING
7.2. TESTING STRATEGIES

8. SYSTEM SECURITY
8.1. INTRODUCTION
8.2. SECURITY IN SOFTWARE

9. BIBLIOGRAPHY

1.1. INTRODUCTION & OBJECTIVE

As many businesses, government agencies and research projects collect increasingly large amounts of data, techniques that allow efficient processing, analyzing and mining of such massive databases have in recent years attracted interest from both academia and industry. One task that has been recognized to be of increasing importance in many application domains is the matching of records that relate to the same entities across several databases. Often, information from multiple sources needs to be integrated and combined in order to improve data quality, or to enrich data to facilitate more detailed data analysis. The records to be matched frequently correspond to entities that refer to people, such as clients or customers, patients, employees, tax payers, students, or travelers.
1.2. PURPOSE OF THE PROJECT

Cloud computing already has a number of applications. For example, Infosys is using Microsoft's Windows Azure cloud services, including SQL Data Services, to build cloud-based software capabilities that let vehicle dealers provide data on inventories and other assets. Best Buy's Giftag application uses Google App Engine to let shoppers create and share lists of items they want from web pages they visit. The Wang Fu Jing department store, a retailer in China, uses IBM cloud services, including supply chain management software, for its network of retail stores.
Data Mining Inception
Unfortunately, we don't have a clear date on which to celebrate data mining's birthday every year. The field emerged in the early 80s, when the amount of data generated and stored in databases became overwhelming and there was a strong need for tools and methods to extract useful, task-oriented knowledge. However, some people say that the actual birth of data mining happened in London in 1662, when John Graunt wrote Natural and Political Observations Made upon the Bills of Mortality. It was an impressive work for those times. He did a thorough analysis of mortality in those years and tried to build a model to predict the next bubonic plague in the city. Whether that was a data mining project or not is still a debatable subject even now. The data used was definitely not big and the model was never actually built. Some conclusions were drawn, though, such as correlations between plague years and new kings, and population dynamics in London. I would rather say this was a statistical project with large implications in demography. As a matter of fact, some statisticians consider him the founder of the science of demography.
Influences
Over time, various large internet-based organizations (Amazon, Google) have come to realize that only a small share of their data storage capacity is actually being used. This has prompted the leasing of space and the storage of data on remote servers, or "clouds". Data is then temporarily cached on desktop PCs, mobile phones or other internet-connected devices. Amazon Elastic Compute Cloud (EC2) and Simple Storage Service (S3) are currently the best-known facilities. Cloud storage is a model where the data is stored on different dedicated storage servers [3]. These storage servers may also be hosted by third parties. In such a computing environment, extensive amounts of redundant copies of data exist. The deduplication procedure aims at reducing these redundant copies of data [4]. When a backup application makes a backup, which is scheduled fortnightly or weekly depending on the criticality of the data, it creates a huge file or a set of individual files.
Data Mining vs Statistics
Now, since we know what data mining is, let's see what the science of statistics is and how it is related to
data mining. A very good definition of statistics says that "Statistics is the science and practice of
developing knowledge through the use of empirical data expressed in quantitative form. It is based on
statistical theory which is a branch of applied mathematics. Within statistical theory, randomness and
uncertainty are modeled by probability theory." Please note that everything is expressed in quantitative
form. Statistics science is primarily oriented towards the extraction of quantitative and statistical data
characteristics. Let's stick to the example above. When we determine the covariance of two variables, we actually see whether these variables vary together and measure the strength of the relationship. But we'll never be able to characterize this dependency at a conceptual level, or produce a causal explanation and a qualitative description of this relationship. You cannot "see" the reasons for this dependency because they are related to factors that are not explicitly provided in the data. Furthermore, the data mining process is interactive, iterative and exploratory. Not to mention the data pre-processing, which is essential for any data mining project. Data reduction and compression, data cleaning and transformation are very important for data mining, but they definitely don't go with statistics.
Verdict: We can definitely say they are different.
Data mining vs Machine Learning
Machine learning represents a sub-field of artificial intelligence. It was conceived in the early 60s with the clear objective of designing and developing algorithms and techniques that implement various types of learning: mechanisms capable of inducing knowledge from examples of data. Machine learning has a wide spectrum of applications, including natural language processing, search engines, medical diagnosis, bio-informatics, speech and handwriting recognition, object recognition in computer vision, game playing and robot locomotion. The general framework for machine learning is as follows: the learning system aims at determining a description of a given concept from a set of concept examples provided by the teacher and from background knowledge. Concept examples can be positive (iron, when teaching the concept of metals) or negative (marble). Background knowledge contains the information about the language used to describe the examples and concepts: possible values of variables (domain), hierarchies, predicates, rules, etc. The learning algorithm then builds on the type of examples and on the size and relevance of the background knowledge (there are other factors involved, but we won't discuss them here). The main types of learning systems are supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, transduction and learning to learn. We'll see later in this article how the first two influenced data mining to a great extent. The machine learning process emphasizes the development of the algorithms and usually assumes the data is already residing in main memory. On the other hand, the first condition for a data mining project to succeed is to have data, large amounts of data. Think about a chess computer game. It doesn't require a huge database to play against you. It only needs the examples and the knowledge. Teach the system what a counter-gambit is and it will know how to respond.
Verdict: They are different.
Data Mining vs KDD
What is KDD? It is not a syndrome (as I first thought when I heard about it) and it is not the name of a DJ either. And don't you dare associate it with the annual conference on Knowledge Discovery and Data Mining organized by SIGKDD. It actually means Knowledge Discovery in Databases, and the concept emerged in 1989 to refer to the broad process of finding knowledge in data. It refers to the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. Knowledge discovery differs from machine learning in that the task is more general and is concerned with issues specific to databases. Oops! Does it sound familiar to you? I bet it does. Then what the bleep is data mining? Actually, these two terms were used interchangeably for several years. No distinction was made, until a kind of consensus was reached within the community. We still have two terms, but with slightly different meanings. The term KDD is now viewed as the overall process of discovering useful knowledge from data, while data mining is viewed as the application of particular algorithms for extracting patterns from data, without the additional steps of the KDD process, like data cleaning, data reduction and concept hierarchy generation, which can even extend to the infrastructure of the project. To me, it sounds a bit fishy. I mean, how on earth would a newcomer know this difference between KDD and data mining? The name doesn't tell you anything.
Verdict: Well, I would say they are quite the same.
Deduplication is done in the following steps:
Step 1: Similarity computation for all pairs of records:
In this step, the similarity computation is carried out by applying similarity functions to each record field. Each function compares the similarity of a field with the corresponding field of the other record and assigns a similarity value to that field. Accurate similarity functions are very important for calculating the distance between the records and thus for better duplicate detection. Levenshtein distance and cosine similarity are the two similarity measures used in our proposed approach. Here, the input records are partitioned into two parts and the two measures are computed for the two parts of the record pairs.
(1) Levenshtein distance: The chosen name fields of the records are Record 1 and Record 2. The Levenshtein distance is computed as the minimum number of operations that have to be made to transform one string into the other; usually these operations are the replacement, insertion or deletion of a character. The Levenshtein distances between the records are found by considering each record as a whole.
(2) Cosine similarity: The cosine similarity between the name fields of Record 1 and Record 2 is calculated as follows. First, the dimensions of the two strings are obtained by taking the union of the string elements in Record 1 and Record 2; then the frequency-of-occurrence vectors of the two elements are calculated, and the cosine of the angle between these vectors gives the similarity.
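As an illustration of Step 1, the following is a minimal C# sketch of the two similarity measures described above. The class and method names are our own, and the whitespace tokenisation used for the cosine measure is an assumption for illustration, not necessarily the exact scheme used in the project.

using System;
using System.Linq;

public static class SimilarityMeasures
{
    // Levenshtein distance: minimum number of replace, insert or delete
    // operations needed to transform one string into the other.
    public static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + cost);
            }
        return d[a.Length, b.Length];
    }

    // Cosine similarity over frequency-of-occurrence vectors built from
    // the union of the tokens of the two field values.
    public static double Cosine(string a, string b)
    {
        var tokensA = a.ToLower().Split(' ');
        var tokensB = b.ToLower().Split(' ');
        var dimensions = tokensA.Union(tokensB).ToList();
        var va = dimensions.Select(t => (double)tokensA.Count(x => x == t)).ToArray();
        var vb = dimensions.Select(t => (double)tokensB.Count(x => x == t)).ToArray();
        double dot = va.Zip(vb, (x, y) => x * y).Sum();
        double norm = Math.Sqrt(va.Sum(x => x * x)) * Math.Sqrt(vb.Sum(x => x * x));
        return norm == 0 ? 0 : dot / norm;
    }
}

For example, Levenshtein("John Smith", "Jon Smyth") returns 2, while Cosine("John Smith", "Smith John") returns 1.0 because the two token frequency vectors are identical.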
Step 2: Computing feature vectors:
Feature vectors represent the set of elements that are required for the detection of duplicate elements in the data repository. The vectors are obtained by processing the values of the two similarity measures. In general, a single similarity function may fail to capture similarity correctly, because the computation of similarity between fields can vary significantly depending on the domain and the specific field under consideration. Therefore, it is necessary to adapt the similarity measures for each field of the database with respect to the particular data domain in order to attain accurate similarity computations. Consequently, we combine the similarity values obtained from the different similarity measures to compute the distance between any two records. Here, we represent the similarity between any pair of records by a feature vector in which each component holds the similarity value between the two records under one of the similarity measures.
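Building on the idea above, the sketch below shows one way a feature vector could be assembled for a pair of records; the record type, its fields and the delegate-based plumbing are hypothetical illustrations rather than the project's actual data model.

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record type; in the project the fields would come from the database tables.
public record PersonRecord(string FirstName, string LastName);

public static class FeatureVectors
{
    // Builds one feature-vector component per (field, similarity measure) pair.
    // The similarity measures (for example normalised Levenshtein and cosine
    // similarity from Step 1) are passed in as delegates so that each field can
    // use measures adapted to its data domain.
    public static double[] Build(PersonRecord r1, PersonRecord r2,
                                 IList<Func<string, string, double>> measures)
    {
        var fieldPairs = new[]
        {
            (r1.FirstName, r2.FirstName),
            (r1.LastName, r2.LastName)
        };
        return fieldPairs
            .SelectMany(pair => measures.Select(m => m(pair.Item1, pair.Item2)))
            .ToArray();
    }
}

With two fields and two measures, Build returns a four-component feature vector for the record pair.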
Step 3: New similarity formula generation:
In this step, we consider new formulae for evaluating the extracted feature vectors. An expression is derived to calculate the fitness of the corresponding data. In order to obtain more precise output, i.e., to find the near-duplicates better, we process a number of expressions. The expressions that we subject to this process are used for the calculation of duplicates. A set of similar expressions is supplied as input, and the better ones among the supplied inputs are identified. In this step, we find the best among the input expressions, that is, the one capable of providing the better solution to the problem.
Step 4: Duplicate detection using the new similarity formulae:
Once the optimal similarity formulae are generated in the above step, the generated formulae are used to decide whether record pairs are duplicates or non-duplicates. Here, we fix a threshold, T, to define the margin between duplicate and non-duplicate pairs.
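A minimal sketch of this threshold step is shown below; averaging the feature-vector components into a single score and the example threshold value are illustrative assumptions, not the project's exact formula.

using System.Linq;

public static class DuplicateDetector
{
    // Threshold T separating duplicate from non-duplicate pairs.
    // The value here is only an example; in practice it is tuned per data set.
    public const double T = 0.85;

    // Aggregates the per-field similarities into one score and compares it
    // against the threshold T.
    public static bool IsDuplicate(double[] featureVector)
    {
        double score = featureVector.Average();
        return score >= T;
    }
}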

1.3. EXISTING SYSTEM

The existing system presents a survey of indexing techniques. The number of candidate record pairs generated by these techniques has been estimated theoretically, and their efficiency and scalability have been evaluated using various data sets. These experiments highlight that one of the most important factors for efficient and accurate indexing for record linkage and deduplication is the proper definition of blocking keys. Because training data in the form of known true matches and non-matches is often not available in real-world applications, it is commonly up to domain and linkage experts to decide how such blocking keys are defined.
The indexing techniques presented in this survey are heuristic approaches that aim to split the
records in a database into (possibly overlapping) blocks such that matches are inserted into the
same block and non-matches into different blocks. While future work in the area of indexing for
record linkage and deduplication should include the development of more efficient and more
scalable new indexing techniques, the ultimate goal of such research will be to develop
techniques that generate blocks such that it can be proven that (a) all comparisons between
records within a block will have a certain minimum similarity with each other, and (b) the
similarity between records in different blocks is below this minimum similarity.
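As a rough illustration of blocking, the sketch below groups records by a simple blocking key, here the first letter of the surname concatenated with the postcode; this key definition is only a hypothetical example, since, as noted above, the choice of blocking keys is left to domain and linkage experts. Candidate pairs are then generated only within each block.

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record type used only for this illustration.
public record CustomerRecord(string Surname, string Postcode);

public static class Blocking
{
    // Example blocking key: first letter of the surname plus the postcode.
    static string BlockingKey(CustomerRecord r) =>
        (r.Surname.Length > 0 ? char.ToUpper(r.Surname[0]).ToString() : "") + r.Postcode;

    // Generates candidate pairs only within each block instead of comparing
    // every record against every other record in the database.
    public static IEnumerable<(CustomerRecord, CustomerRecord)> CandidatePairs(
        IEnumerable<CustomerRecord> records)
    {
        foreach (var block in records.GroupBy(BlockingKey))
        {
            var members = block.ToList();
            for (int i = 0; i < members.Count; i++)
                for (int j = i + 1; j < members.Count; j++)
                    yield return (members[i], members[j]);
        }
    }
}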

1.4. PROPOSED SYSTEM

Deduplication is an effective technique for optimizing the number of instances of data stored in cloud storage [4]. Deduplication can be classified into unique segment level and map segment level deduplication. The unique segment level method stores only unique data segments by checking every incoming piece of data for duplicates. This method achieves better deduplication efficiency because it performs exact deduplication [5]. However, its throughput is low, as it checks every incoming piece of data for duplication. The map segment level method identifies the data of similar files to be compared for duplicates; files with the same data are called duplicates. This method achieves better throughput, as it compares every incoming data piece only with pieces of similar files. However, its deduplication efficiency is comparatively low, as some duplicate data pieces spread across different groups may be missed. Hence, the proposed system adopts both the unique segment and map segment methods to perform better deduplication.
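As a rough illustration of segment-level deduplication, the sketch below hashes each incoming data segment and stores only segments whose hash has not been seen before. The use of SHA-256 fingerprints and the in-memory store are simplifying assumptions for illustration, not the project's exact implementation.

using System;
using System.Collections.Generic;
using System.Security.Cryptography;

public class SegmentDeduplicator
{
    private readonly HashSet<string> _seenHashes = new HashSet<string>();
    private readonly List<byte[]> _store = new List<byte[]>();

    // Returns true if the segment was new and has been stored,
    // false if an identical segment already exists in the store.
    public bool AddSegment(byte[] segment)
    {
        string hash;
        using (var sha = SHA256.Create())
            hash = Convert.ToBase64String(sha.ComputeHash(segment));

        if (_seenHashes.Contains(hash))
            return false;        // duplicate segment: do not store it again

        _seenHashes.Add(hash);
        _store.Add(segment);     // unique segment: keep it
        return true;
    }
}

In the unique segment level variant, every incoming segment would pass through AddSegment, while the map segment level variant would apply the same check only to segments of files identified as similar.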

2.1 STUDY OF THE SYSTEM


The Record Linkage Process
These are the general steps involved in linking two databases. Because most real-world data are dirty and contain noisy, incomplete and incorrectly formatted information, a crucial first step in any record linkage or deduplication project is data cleaning and standardisation. It has been recognised that a lack of good quality data can be one of the biggest obstacles to successful record linkage. The main task of data cleaning and standardisation is the conversion of the raw input data into well defined, consistent forms, as well as the resolution of inconsistencies in the way information is represented and encoded.
The second step (indexing) is the topic of this survey. The indexing step generates pairs of candidate records that are compared in detail in the comparison step, using a variety of comparison functions appropriate to the content of the record fields (attributes). Functions specific to dates, ages and numerical values are used for fields that contain such data. Several fields are normally compared for each record pair, resulting in a vector that contains the numerical similarity values calculated for that pair.
Using these similarity values, the next step in the record linkage process is to classify the compared candidate record pairs into matches, non-matches and possible matches, depending upon the decision model used. Record pairs that were removed in the indexing step are classified as non-matches without being compared explicitly. The majority of recent research into record linkage has concentrated on improving the classification step, and various classification techniques have been developed, many of them based on machine learning approaches. If record pairs are classified as possible matches, a clerical review process is required in which these pairs are manually assessed and classified into matches or non-matches. This is usually a time-consuming, cumbersome and error-prone process, especially when large databases are being linked or deduplicated. Measuring and evaluating the quality and complexity of a record linkage project is the final step in the record linkage process.
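As an illustration of this classification step, the sketch below applies a simple two-threshold rule to an aggregated similarity score; the threshold values and the averaging are illustrative assumptions, and, as noted above, real systems often use machine learning classifiers instead.

using System.Linq;

public enum LinkStatus { Match, PossibleMatch, NonMatch }

public static class PairClassifier
{
    // Illustrative thresholds; pairs scoring in between go to clerical review.
    private const double Upper = 0.85;
    private const double Lower = 0.60;

    public static LinkStatus Classify(double[] similarityVector)
    {
        double score = similarityVector.Average();
        if (score >= Upper) return LinkStatus.Match;
        if (score <= Lower) return LinkStatus.NonMatch;
        return LinkStatus.PossibleMatch;   // requires manual clerical review
    }
}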

Data extraction from dynamic transactions and their integration in applications requires precision and
accuracy. In particular, our goal is to find a way to extract reliable data, and to convert them in a
standard user understandable form.
For developing an application prototype, we consider a client-server architecture. User interfaces have been developed using ASP.NET pages that are accessible like any native ASP.NET application and can be deployed on a web server.
For operational flexibility in demonstrating SaaS clouds, the interfaces have been developed using server pages (ASP pages). Coding ASP pages is easier than hand-coding the equivalent output logic, because with ASP pages one can place static text by coding HTML tags rather than emitting a plentitude of print statements.
For our implementation the components are categorized in the following way. Both the ASP pages and the code-behind are part of the MVC (Model-View-Controller) architecture. The ASP page can be thought of as the front end, or user interface, of the MVC architecture, while the code-behind acts as the controller, which processes the input given by the ASP forms and produces output according to the needs.
For our demonstration we construct different specifics for implementing the client for fetching and
organizing extracted results.
The interfaces at the top level have been categorized as follows.
For an effective representation we construct a client interface, a web GUI in admin mode, to perform the daily transactions involved in a virtual shopping system. This ranges from seeking inventory data for extraction, and applying a number of ontology trainers for understanding the knowledge base, to selecting the destination item sets for the rule mining extraction process.
A web interface dedicated entirely to showing results. The results gathered and processed for the client are limited to a dozen or fewer, because displaying more sometimes involves parsing a large number of unmanageable rules. This issue arises when the system is applied to large web resources.

2.2 INPUT & OUTPUT REPRESENTATION


Input design is a part of overall system design. The main objective during the input design is as given
below:

To achieve the highest possible level of efficiency while demonstrating the rule mining process.

To achieve the highest possible level of simplicity while using the system.

To ensure that the input is acceptable and understood by the user.

INPUT STAGES:
The main input stages can be listed as below:
Data recording
Data transcription
Data conversion
Data verification
Data control
Data transmission
Data validation
Data correction
INPUT TYPES:
It is necessary to determine the various types of inputs. Inputs can be categorized as follows:
External inputs, which are the prime inputs for the system (system training with OWL repositories as part of knowledge-based learning for implementing the extraction process).
Internal inputs, which are user communications with the system (loading item sets, loading trainers, specifying inventories management).
Operational inputs, which are communications within the system's modules (client-server communication for the virtual shopping implementation and displaying results using ontologies).
Interactive inputs, which are entered during a dialogue (logging in using admin credentials).
INPUT MEDIA:
At this stage a choice has to be made about the input media. To decide on the input media, consideration has to be given to:
Type of input

Flexibility of format
Speed
Accuracy
Verification methods
Rejection rates
Ease of correction
Storage and handling requirements
Security
Easy to use
Portability(real devices or virtual devices like on screen keyboard)
Keeping in view the above description of the input types and input media, it can be said that most of the inputs are internal and interactive, as the input data is directly keyed in by the user. The input should be independent of the nature of the device and depend only on actions such as the click of a button (Extract) for responding.
OUTPUT DESIGN:
When the demonstration starts, the theme of the project is in its initial stages. The demonstration's output data will only produce a likely estimate of real-world events. Methods to increase the accuracy of the output data include: repeatedly performing simulations and comparing results, dividing events into batches and processing them individually, and checking that the results of demonstrations conducted in adjacent time periods connect to produce a coherent, holistic view of the system. The results of the rule mining process speak for themselves.
OUTPUT DEFINITION
Inaccurately extracted knowledge may reduce the quality of the system's output. For this reason, our extraction rules were designed to have low risk levels, to ensure higher extraction precision.

The outputs should be defined in terms of the following points:

Content of the output (processed web meta data results for the specified number of frequently used item sets)

Format of the output (HTML meta data combined with extracted statistical data)

Location of the output (web page console)

Frequency of the output (as long as extraction operations are involved)

Volume of the output (depends on the user's specifications while extracting)

Sequence of the output (web data is processed and extracted in a first-come, first-served manner up to the defined link threshold value, based on popularity and item value)

OUTPUT MEDIA:
In the next stage it has to be decided which medium is the most appropriate for the output. The main considerations when deciding about the output media are:

The suitability for the device to the particular application.

The need for a hard copy.

The response time required.

The location of the users

The software and hardware available.


Keeping in view the above description, this project produces outputs mainly from modules such as joining the network, authentication and operations, for demonstrative purposes only and not for later review; hence there is no need for a central repository for storage.

2.3 PROCESS MODEL USED WITH JUSTIFICATION


SDLC (Umbrella Model):
[Figure: SDLC umbrella model. Core stages: Requirements Gathering (business requirement documentation, feasibility study, team formation, project specification preparation), Analysis & Design, Code, Unit Test, Integration & System Testing, Acceptance Test, Training and Delivery/Installation, followed by Assessment, with Document Control running as an umbrella activity across all stages.]
SDLC stands for Software Development Life Cycle. It is a standard used by the software industry to develop good software.
Stages in SDLC:
Requirement Gathering

Analysis
Designing
Coding
Testing
Maintenance

Requirements Gathering stage:


The requirements gathering process takes as its input the goals identified in the high-level
requirements section of the project plan. Each goal will be refined into a set of one or more
requirements. These requirements define the major functions of the intended application, define
operational data areas and reference data areas, and define the initial data entities. Major functions
include critical processes to be managed, as well as mission critical inputs, outputs and reports. A user
class hierarchy is developed and associated with these major functions, data areas, and data entities.
Each of these definitions is termed a requirement. Requirements are identified by unique requirement identifiers and, at a minimum, contain a requirement title and a textual description.

These requirements are fully described in the primary deliverables for this stage: the Requirements
Document and the Requirements Traceability Matrix (RTM). The requirements document contains
complete descriptions of each requirement, including diagrams and references to external documents as
necessary. Note that detailed listings of database tables and fields are not included in the requirements
document.
The title of each requirement is also placed into the first version of the RTM, along with the title of
each goal from the project plan. The purpose of the RTM is to show that the product components
developed during each stage of the software development lifecycle are formally connected to the
components developed in prior stages.
In the requirements stage, the RTM consists of a list of high-level requirements, or goals, by title,
with a listing of associated requirements for each goal, listed by requirement title. In this hierarchical
listing, the RTM shows that each requirement developed during this stage is formally linked to a specific
product goal. In this format, each requirement can be traced to a specific product goal, hence the term
requirements traceability.
The outputs of the requirements definition stage include the requirements document, the RTM, and
an updated project plan.
A feasibility study is all about the identification of problems in a project.
The number of staff required to handle a project is represented in team formation; in this case modules or individual tasks will be assigned to the employees who are working on that project.
Project specifications are all about representing the various possible inputs submitted to the server and the corresponding outputs, along with the reports maintained by the administrator.
Analysis Stage:
The planning stage establishes a bird's eye view of the intended software product, and uses this to
establish the basic project structure, evaluate feasibility and risks associated with the project, and
describe appropriate management and technical approaches.

The most critical section of the project plan is a listing of high-level product requirements, also referred
to as goals. All of the software product requirements to be developed during the requirements definition
stage flow from one or more of these goals. The minimum information for each goal consists of a title
and textual description, although additional information and references to external documents may be
included. The outputs of the project planning stage are the configuration management plan, the quality
assurance plan, and the project plan and schedule, with a detailed listing of scheduled activities for the
upcoming Requirements stage, and high level estimates of effort for the out stages.
Designing Stage:
The design stage takes as its initial input the requirements identified in the approved requirements
document. For each requirement, a set of one or more design elements will be produced as a result of
interviews, workshops, and/or prototype efforts. Design elements describe the desired software features
in detail, and generally include functional hierarchy diagrams, screen layout diagrams, tables of business
rules, business process diagrams, pseudo code, and a complete entity-relationship diagram with a full
data dictionary. These design elements are intended to describe the software in sufficient detail that
skilled programmers may develop the software with minimal additional input.

When the design document is finalized and accepted, the RTM is updated to show that each design
element is formally associated with a specific requirement. The outputs of the design stage are the
design document, an updated RTM, and an updated project plan.
Development (Coding) Stage:
The development stage takes as its primary input the design elements described in the approved
design document. For each design element, a set of one or more software artifacts will be produced.
Software artifacts include but are not limited to menus, dialogs, data management forms, data reporting
formats, and specialized procedures and functions. Appropriate test cases will be developed for each set
of functionally related software artifacts, and an online help system will be developed to guide users in
their interactions with the software.

The RTM will be updated to show that each developed artifact is linked to a specific design
element, and that each developed artifact has one or more corresponding test case items. At this point,
the RTM is in its final configuration. The outputs of the development stage include a fully functional set
of software that satisfies the requirements and design elements previously documented, an online help
system that describes the operation of the software, an implementation map that identifies the primary
code entry points for all major system functions, a test plan that describes the test cases to be used to
validate the correctness and completeness of the software, an updated RTM, and an updated project plan.
Integration & Test Stage:
During the integration and test stage, the software artifacts, online help, and test data are migrated
from the development environment to a separate test environment. At this point, all test cases are run to
verify the correctness and completeness of the software. Successful execution of the test suite confirms a
robust and complete migration capability. During this stage, reference data is finalized for production
use and production users are identified and linked to their appropriate roles. The final reference data (or
links to reference data source files) and production user list are compiled into the Production Initiation
Plan.

The outputs of the integration and test stage include an integrated set of software, an online help
system, an implementation map, a production initiation plan that describes reference data and production
users, an acceptance plan which contains the final suite of test cases, and an updated project plan.
Installation & Acceptance Test:
During the installation and acceptance stage, the software artifacts, online help, and initial
production data are loaded onto the device or emulator. At this point, all test cases are run to verify the
correctness and completeness of the software. Successful execution of the test suite is a prerequisite to
acceptance of the software by the customer.
After customer personnel have verified that the initial production data load is correct and the test
suite has been executed with satisfactory results, the customer formally accepts the delivery of the
software.

The primary outputs of the installation and acceptance stage include a production application, a
completed acceptance test suite, and a memorandum of customer acceptance of the software. Finally, the
PDR enters the last of the actual labor data into the project schedule and locks the project as a permanent
project record. At this point the PDR "locks" the project by archiving all software items, the
implementation map, the source code, and the documentation for future reference.
Maintenance:
The outer rectangle represents maintenance of a project. The maintenance team will start with a requirement study and an understanding of the documentation; later, employees will be assigned work and will undergo training in that particular assigned category.
This life cycle has no end; it continues on like an umbrella (there is no ending point to the umbrella's sticks).

2.4 SYSTEM ARCHITECTURE


Architecture flow:

[Figure: Architecture diagram showing a record-matching-driven itemset learning module built with .NET, an indexing-driven system module built with .NET and deduplication, and a third-party web shop.]

Flow Pattern:

The flow pattern represents how requests flow from one layer to another and how the responses are returned by the other layers to the presentation layer through ASP.NET sources in the architecture diagram.

Feasibility Study:
Preliminary investigation examines project feasibility, i.e., the likelihood that the application will be useful to the user. The main objective of the feasibility study is to test the technical, operational and economic feasibility of adding new modules and debugging traditional desktop-centric applications and porting them to mobile devices. All systems are feasible if they are given unlimited resources and infinite time. There are three aspects in the feasibility study portion of the preliminary investigation:

Technical Feasibility

Operational Feasibility

Economic Feasibility

3.1 TECHNICAL FEASIBILITY


The technical issue usually raised during the feasibility stage of the investigation includes the following:

Does the necessary technology exist to do what is suggested?

Does the proposed equipment have the technical capacity to hold the data required to use the new system?

Will the proposed system provide adequate response to inquiries, regardless of the number or
location of users?

Can the system be upgraded if developed?

As part of achieving technical competency for developing the proposed system, users are required to acquire a skill set in ASP.NET, SQL client programming, session-based and cookie-based authentication techniques, and ontology (OWL) parsing, learning and training implementations in ASP.NET.

3.2 OPERATIONAL FEASIBILITY


User-friendly

The client will use the ASP page resources for the screens of their various transactions, i.e. for starting, initiating and using the ontology-driven rule mining system. The users are also notified of each successful operation. Basic familiarity with any client-server centric application is good enough for this task. These screens and notifications are generated in a user-friendly manner.
Reliability

The usage of the OWL API, which follows the W3C standard, ensures and enforces an accurate ontology-based rule mining system.
Security
In this project context the main idea is to demonstrate the functionalities of Ontology driven
rulemining system. Our demonstrational prototype validates our claim so it requires no further
security based schemes.
Portability
The application will be developed using standard open source technologies like ASP.NET, OWL API
sources and HTML parsing schemes. These technologies will work on any standard system capable
of running ASP.NET.
Hence portability problems will not arise.
Availability
This software will be available always since it is maintained at one place.
Maintainability
The system, called Knowledge-Based Interactive Postmining of Association Rules Using Ontologies, uses a 3-tier architecture, specifically three layers: the client layer, the server layer provided through ASP.NET SqlClient sources, and the SQL Server database as the third layer. The 1st tier is the web-centric client application, which is the front end for the learning and extraction process, and the 2nd tier is the OWL metadata-oriented third-party web site, which is the key domain of this project. Various third-party libraries, namely OWL APIs, are used for the OWL implementations in the system. All the application modules are maintained in one place, and hence there are no maintenance issues, thanks to the free and open-source nature of the OWL API.

3.3 ECONOMIC FEASIBILITY

The demonstration prototype takes care of the present existing system's data flow and procedures completely, generates all the reports in web GUI mode for viewing, and can easily be integrated into any complete web entity extraction system.
The use of technologies like ASP.NET and SQL Server minimizes the cost for the developer.

4.1 FUNCTIONAL REQUIREMENTS SPECIFICATION

The present application has been divided in to four modules.


Admin
Users
Key Generation
Deduplication

1. The administrator should be able to

Log in to the site with his specific username and password; the admin is the owner of this site.

After login, accept the new users who have newly registered with the site, i.e. grant login permissions to the users.

Generate the key based on the user's first name. This key will be generated for the two databases (Ration Card and Bank A/c). Based on this key, the admin can compare the two databases and finally find the matching and non-matching records.

2. Users should be able to

Log in to the system through the initial screen of the system.

Change the password after logging into the system.

After login, apply for a ration card and a bank account, based on his requirement.

Search based on first name or last name; the admin will compare both databases and give the matching records to the users.

3. The Key Generation should be able to

Generate a key from the user's first name using the Soundex function on the user names. This key is used for comparing the two databases (a sketch of Soundex key generation is shown after this list).

4. The Deduplication should be able to

Create a new database from the existing two databases in this process.
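As an illustration of the key generation step, below is a minimal C# sketch of a standard four-character Soundex code computed from a name; this is a simplified variant (H and W are treated like vowels), and the exact Soundex rules used by the project may differ. SQL Server also offers a built-in SOUNDEX() function that could be used directly in queries.

using System;
using System.Text;

public static class KeyGenerator
{
    // Simplified Soundex: first letter followed by three digits.
    public static string Soundex(string name)
    {
        if (string.IsNullOrEmpty(name)) return "0000";

        // Digit codes for the letters A..Z (0 = vowel or ignored letter).
        const string codes = "01230120022455012623010202";

        var sb = new StringBuilder();
        char first = char.ToUpper(name[0]);
        sb.Append(first);

        char lastCode = (first >= 'A' && first <= 'Z') ? codes[first - 'A'] : '0';
        foreach (char raw in name.Substring(1).ToUpper())
        {
            if (raw < 'A' || raw > 'Z') continue;   // skip non-letters
            char code = codes[raw - 'A'];
            if (code != '0' && code != lastCode)
                sb.Append(code);                    // keep new consonant codes
            lastCode = code;
        }
        return sb.ToString().PadRight(4, '0').Substring(0, 4);
    }
}

For example, Soundex("Krishna") and Soundex("Krishnaa") both yield K625, so the corresponding records would receive the same key and be compared against each other during deduplication.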

4.2 PERFORMANCE REQUIREMENTS


Performance is measured in terms of the output provided by the application. Requirement specification
plays an important part in the analysis of a system. Only when the requirement specifications are
properly given, it is possible to design a system, which will fit into required environment. It rests largely
with the users of the existing system to give the requirement specifications because they are the people
who finally use the system. This is because the requirements have to be known during the initial stages
so that the system can be designed according to those requirements. It is very difficult to change the
system once it has been designed and on the other hand designing a system, which does not cater to the
requirements of the user, is of no use.
The requirement specification for any system can be broadly stated as given below:

The system should be able to interface with the existing system

The system should be accurate

The system should be better than the existing system

The existing system is completely dependent on the user to perform all the duties.

4.3 SOFTWARE REQUIREMENTS:

Operating system : Windows XP
Coding language  : ASP.NET with C#
Database         : SQL Server 2005/08

4.4 HARDWARE REQUIREMENTS:

System       : Pentium IV, 2.4 GHz
Hard disk    : 40 GB
Floppy drive : 1.44 MB
Monitor      : 15" VGA colour
Mouse        : Logitech
RAM          : 512 MB

4.4.1. INTRODUCTION TO JAVA


About Java:

Initially the language was called Oak, but it was renamed Java in 1995. The primary motivation for this language was the need for a platform-independent (i.e. architecture-neutral) language that could be used to create software to be embedded in various consumer electronic devices.
Java is a programmer's language.
Java is cohesive and consistent.
Except for those constraints imposed by the Internet environment, Java gives the programmer full control.
Finally, Java is to Internet programming what C was to systems programming.
Importance of Java to the Internet

Java has had a profound effect on the Internet. This is because Java expands the universe of objects that can move about freely in cyberspace. In a network, two categories of objects are transmitted between the server and the personal computer: passive information and dynamic, active programs. The dynamic, self-executing programs raise serious concerns in the areas of security and portability, but Java addresses these concerns and, by doing so, has opened the door to an exciting new form of program called the applet.

Applications and applets. An application is a program that runs on our computer under the operating system of that computer. It is more or less like one created using C or C++. Java's ability to create applets makes it important. An applet is an application designed to be transmitted over the Internet and executed by a Java-compatible web browser. An applet is actually a tiny Java program, dynamically downloaded across the network, just like an image. But the difference is that it is an intelligent program, not just a media file. It can react to user input and change dynamically.

Java Architecture
Java architecture provides a portable, robust, high performing environment for development.
Java provides portability by compiling the byte codes for the Java Virtual Machine, which is then
interpreted on each platform by the run-time environment. Java is a dynamic system, able to load
code when needed from a machine in the same room or across the planet.
Compilation of code
When you compile the code, the Java compiler creates machine code (called byte code)for a hypothetical
machine called Java Virtual Machine(JVM). The JVM is supposed t executed the byte code. The JVM is created
for the overcoming the issue of probability. The code is written and compiled for one machine and interpreted
on all machines .This machine is called Java Virtual Machine.
Compiling and interpreting java source code.

[Figure: Java source code is compiled on any platform (PC, Macintosh, SPARC) into platform-independent Java byte code, which is then executed by the Java interpreter on each platform.]

During run-time, the Java interpreter tricks the byte code file into thinking that it is running on a Java Virtual Machine. In reality this could be an Intel Pentium running Windows 95, a Sun SPARCstation running Solaris, or an Apple Macintosh, and all could receive code from any computer through the Internet and run the applets.
Simple:

Java was designed to be easy for the professional programmer to learn and to use effectively. If you are an experienced C++ programmer, learning Java will take little effort, because Java inherits many of the object-oriented features of C++. Most of the confusing concepts from C++ are either left out of Java or implemented in a cleaner, more approachable manner. In Java there are a small number of clearly defined ways to accomplish a given task.
Object oriented
Java was not designed to be source-code compatible with any other language. This allowed the Java team the freedom to design with a blank slate. One outcome of this was a clean, usable, pragmatic approach to objects. The object model in Java is simple and easy to extend, while simple types, such as integers, are kept as high-performance non-objects.

Robust
The multi-platform environment of the web places extraordinary demands on a program, because the program must execute reliably in a variety of systems. The ability to create robust programs was given a high priority in the design of Java. Java is a strictly typed language; it checks your code at compile time and at runtime.
Java virtually eliminates the problems of memory management and deallocation, which is completely automatic. In a well-written Java program, all run-time errors can and should be managed by your program.

TABLE:

A database is a collection of data about a specific topic.

VIEWS OF TABLE:

We can work with a table in two views:

1. Design View
2. Datasheet View
DESIGN VIEW:
To build or modify the structure of a table, we work in the table design view. We can specify what kind of data will be held.
DATASHEET VIEW:
To add, edit or analyze the data itself, we work in the table's datasheet view mode.
QUERY:
A query is a question that has to be asked of the data. Access gathers the data that answers the question from one or more tables. The data that makes up the answer is either a dynaset (if you edit it) or a snapshot (which cannot be edited). Each time we run the query, we get the latest information in the dynaset. Access either displays the dynaset or snapshot for us to view, or performs an action on it, such as deleting or updating.

5.1 INTRODUCTION

Systems design is the process or art of defining the architecture, components, modules, interfaces, and data for a system to satisfy specified requirements. One could see it as the application of systems theory to product development. There is some overlap and synergy with the disciplines of systems analysis, systems architecture and systems engineering.

5.2 UML DIAGRAMS


Unified Modeling Language:

The Unified Modeling Language allows the software engineer to express an analysis model using a modeling notation that is governed by a set of syntactic, semantic and pragmatic rules.
A UML system is represented using five different views that describe the system from distinctly different perspectives. Each view is defined by a set of diagrams, as follows.

User Model View


i. This view represents the system from the user's perspective.
ii. The analysis representation describes a usage scenario from the end-user's perspective.

Structural model view


i. In this model the data and functionality are viewed from inside the system.
ii. This model view models the static structures.

Behavioral Model View


It represents the dynamic (behavioral) parts of the system, depicting the interactions or collaborations between the various structural elements described in the user model and structural model views.

Implementation Model View

In this view the structural and behavioral parts of the system are represented as they are to be built.

Environmental Model View


In this view the structural and behavioral aspects of the environment in which the system is to be implemented are represented.

UML is specifically constructed through two different domains:

UML analysis modeling, which focuses on the user model and structural model views of the system.
UML design modeling, which focuses on the behavioral modeling, implementation modeling and environmental model views.

Use case diagrams represent the functionality of the system from a user's point of view. Use cases are used during requirements elicitation and analysis to represent the functionality of the system. Use cases focus on the behavior of the system from an external point of view.
Actors are external entities that interact with the system. Examples of actors include users like the administrator or a bank customer, or another system like a central database.

UML DIAGRAMS

ER DIAGRAM

Data Dictionary (SQL Server equivalent queries)


sqlserver sa
12345

6. OUTPUT SCREENS

Home Page:

Sign up for new user:

Choose File:

File Uploaded Successfully after checking the duplication:

All Files:

Download File:

Now run the duplication results monitor:

Admin id: Krishna@gmail.com


Password: 123456

C://dropstorage

Deduplication Records:

7.1 INTRODUCTION TO TESTING

Testing is a process which reveals errors in the program. It is the major quality measure employed during software development. During testing, the program is executed with a set of test cases and the output of the program for the test cases is evaluated to determine whether the program is performing as it is expected to perform.

7.2 TESTING STRATEGIES


In order to make sure that the system does not have errors, the different levels of testing
strategies that are applied at differing phases of software development are:

Unit Testing:

Unit Testing is done on individual modules as they are completed and become executable. It is
confined only to the designer's requirements.
Each module can be tested using the following two Strategies:
Black Box Testing:

In this strategy, test cases are generated from input conditions that fully exercise all the functional requirements of the program. This testing has been used to find errors in the following categories:
Incorrect or missing functions
Interface errors
Errors in data structure or external database access
Performance errors
Initialization and termination errors.
In this testing only the output is checked for correctness.
The logical flow of the data is not checked.
White Box testing :

In this strategy, the test cases are generated from the logic of each module by drawing flow graphs of that module, and logical decisions are tested for all cases. It has been used to generate test cases in the following situations:

Guarantee that all independent paths have been executed.

Execute all logical decisions on their true and false sides.
Execute all loops at their boundaries and within their operational bounds.
Exercise internal data structures to ensure their validity.
Integration Testing:

Integration testing ensures that the software and subsystems work together as a whole. It tests the interfaces of all the modules to make sure that the modules behave properly when integrated together. In this case this covers the communication between the client application and the server.
System Testing:

This involves in-house testing of the entire system before delivery to the user. Its aim is to satisfy the user that the system meets all requirements of the client's specifications.

Acceptance Testing:

This is pre-delivery testing in which the entire system is tested with real-world data and usage to find errors.
Test Approach :
Testing can be done in two ways:
Bottom up approach
Top down approach
Bottom up Approach:

Testing can be performed starting from the smallest and lowest-level modules and proceeding one at a time. In bottom-up testing, a short driver program executes each module and provides the needed data, so that the module is asked to perform the way it will when embedded within the larger system. When the bottom-level modules are tested, attention turns to those on the next level that use the lower-level ones; they are tested individually and then linked with the previously examined lower-level modules.
Top down approach:

This type of testing starts from the upper-level modules. Since the detailed activities usually performed in the lower-level routines are not provided, stubs are written. A stub is a module shell called by an upper-level module that, when reached properly, returns a message to the calling module indicating that proper interaction occurred. No attempt is made to verify the correctness of the lower-level module.
Validation:

The system has been tested and implemented successfully, thus ensuring that all the requirements listed in the software requirements specification are completely fulfilled. In case of erroneous input, corresponding error messages are displayed.

Test cases:

Test No. | Module Name    | Test Name                                  | Test Case (expected behaviour)                              | Result
1        | Authentication | Login with invalid credentials             | Should display "Invalid Credentials / User does not exist" | P
2        | Authentication | Login with valid credentials               | Should provide access                                       | P
3        | View Data      | View key generation                        | Should not work                                             |
4        | View Data      | View matched records and unmatched records | Should display matching records & unmatched records         |

8.1 INTRODUCTION

System Security:

In this project's context, the main idea is to demonstrate the functionalities of indexing techniques for scalable record linkage and deduplication. Our demonstration prototype validates our claim, so it requires no further security-based schemes.

8.2 SECURITY IN SOFTWARE

No login credentials were required to use the system, apart from the admin panel used to initiate the rule mining process.

9. BIBLIOGRAPHY
References for the project development were taken from the following books and web sites.
ASP.NET Technologies
ASP.NET Complete Reference
ASP.NET Script Programming by Yehuda Shiran
Mastering ASP.NET Security
ASP.NET 2 Networking by Pistoia
ASP.NET Security by Scott Oaks
Head First EJB by Sierra and Bates
.NET Professional by Shadab Siddiqui
ASP.NET Server Pages by Larne Pekowsky
ASP.NET Server Pages by Nick Todd
HTML
HTML Black Book by Holzner
JDBC
ASP.NET Database Programming with JDBC by Patel and Moss
