An Effective Approach For Detecting Code Clones

International Journal of Computer Trends and Technology (IJCTT), volume 4, Issue 9, Sep 2013
ISSN: 2231-2803 http://www.ijcttjournal.org Page 3236

Girija Gupta #1, Indu Singh *2

# M.Tech Student (CSE), JCD College of Engineering, affiliated to Guru Jambheshwar University, Hisar, India
* Assistant Professor (CSE), JCD College of Engineering, affiliated to Guru Jambheshwar University, Hisar, India

Abstract: Software systems are becoming more complex, and their
understandability and maintainability degrade as a result.
Maintaining a system has become one of the most expensive
activities in the software industry.
Copying and duplicating source code is a common activity, but it
undermines the benefits of reuse by creating code clones.
Detecting code clones therefore plays a vital role in the software
industry and is an active research area. Detecting duplicated
code makes the software maintenance process more efficient,
lowers the cost of maintenance and improves the
understandability of the system. Various techniques have been
developed for detecting clones, but text-based and token-based
approaches take a large amount of time and are expensive, while
tree-based and PDG-based approaches are very complex. The aim
of the presented work is to develop a tool, based on the
metric-based approach, that detects clones in Java source files
with ease.

Keywords: Software cloning, metric approach, clone detection
tool

I. INTRODUCTION
Cloning at the design level as well as at the code level
hinders software development activities and therefore needs
to be removed. It has grown into an active area in the software
engineering research community, resulting in the development
of various techniques, tools and other methods for
clone detection and removal.
Reusing code reduces software development and
maintenance costs in the process of creating software systems.
Copying a segment of source code that can add new
functionality with slight or no modification is a common
activity. The reasons for reuse are reduced cost, time, effort
and risk, and increased quality and efficiency [1]. The
most common form of reuse is to copy and paste code, which
results in duplication of code.
Software reuse is defined as the process of creating
software systems from existing software systems [3]. It is
easier to modify existing software than to develop
programs from scratch.
The major shortcoming of such duplicated fragments is
that if a bug is detected in one code fragment, all the other
fragments similar to it must be investigated to check whether
the same bug is present. Cloning not only produces code that is
difficult to maintain, but may also introduce errors [6]. Code
clones are considered an obstacle in the software industry, and
cloned code is believed to have several adverse effects on
the maintenance of software systems. That is why it is
beneficial to remove clones and to prevent their introduction by
constantly checking the source code [9].
Clones are often the result of copy-paste activities.
These activities are very easy and can significantly reduce
programming effort and time, as they reuse an existing
fragment of code rather than rewriting similar code from
scratch, especially in device drivers of operating systems,
where the algorithms are similar.
Code cloning is a more serious problem in industrial
software systems. If a clone is present, the normal
functioning of the system may not be affected, but unless the
maintenance team takes action, further development
may become very expensive. Clones are believed to have
a negative impact on evolution [5]. Code clones may
adversely affect the quality of software systems, especially their
maintainability and comprehensibility. The cost of
maintaining clones over a system's lifetime has not been
estimated yet, but it is at least agreed that the financial impact
on maintenance is very high. Maintenance costs are estimated
at 40% - 70% of the total costs during a system's lifetime.
Research shows that a significant amount of the code of a
software system is cloned code, and this amount may vary
depending on the domain and origin of the software system.
Baker [1] found that in large systems between 13% and 20% of
the source code can be cloned code. Lague et al. [9] studied only
function clones and reported that between 6.4% and 7.5% of
code is cloned, and Baxter et al. reported that
12.7% of the code of a software system is cloned. Mayrand et
al. estimated that normal industrial source code
contains 5% - 20% duplicated code. Kapser and Godfrey
[10] observed that as much as 10% - 15% of the source
code of large systems is cloned. For one COBOL system, the
rate of duplicated code was found to be more than
50% [11].
Given the amount of duplicated code in large software
systems and its maintenance cost, it is crucial to detect
code clones in large systems in order to perform the respective
maintenance tasks. In the presented work, a tool is designed
that helps detect clones in Java source files using a
metric-based approach.

II. RELATED WORK

Code clones are one of the main reasons that
software maintenance becomes more difficult. Code clones are
code fragments in source files that are similar to another code
fragment. If a fault is found in one code fragment, then
all the cloned fragments need modification, and it becomes
more difficult to maintain a system as the system grows
large.
Various research studies have reported that large software
companies spend a lot of money maintaining existing
systems. Researchers consider clones to be harmful. Brenda
Baker [1] concluded that clones are harmful because
inconsistent changes increase maintenance effort and
introduce errors. Fowler [2] suggests that duplication
of code is a major reason for poor maintainability and that
clones not detected in time can create a lot of problems.
Reto Geiger et al. [5] reported that clones are generally
considered harmful to the quality of source code and that one of
the main drawbacks of code clones is that changes to one code
fragment may need to be propagated to several other similar
ones. Chanchal K. Roy [4] presented a survey on clones
that covers various techniques for clone detection,
reasons for code cloning and types of clones.

A. Textual Based Comparison
In this approach the code is not transformed into any
intermediate form before comparison; it is given directly
to the clone detection process. This approach is efficient
but cannot detect structural clones, which have
different code but the same logic.
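
The textual comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular tool: the function names and the `min_lines` threshold are hypothetical, and the only preprocessing is stripping surrounding whitespace and blank lines.

```python
def normalized_lines(source):
    """Strip surrounding whitespace and drop blank lines; nothing else changes."""
    return [line.strip() for line in source.splitlines() if line.strip()]


def common_line_runs(source_a, source_b, min_lines=3):
    """Return (start_a, start_b, length) for each maximal run of identical lines.

    Start positions index the normalized (non-blank) lines of each input.
    """
    a, b = normalized_lines(source_a), normalized_lines(source_b)
    runs = []
    for i in range(len(a)):
        for j in range(len(b)):
            # Skip positions that only continue a run that started earlier.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length >= min_lines:
                runs.append((i, j, length))
    return runs
```

Because lines are compared literally, a copy whose variables were renamed shares almost no lines with the original, which is the limitation noted above.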

B. Token Based Comparison
In this approach a parser or lexer is required to convert
the code into tokens, i.e. an intermediate form, before applying
the comparison. It copes better than the text-based
approach when blank spaces and comments are present in the
source code. However, converting source code into a token
sequence can introduce various false positives into the
results.
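
A rough sketch of token-based comparison is given below. It assumes a very small hand-written lexer (the keyword set, the `ID` and `LIT` placeholders and the function names are all hypothetical) and ignores string literals; a real tool would use a full Java lexer.

```python
import re

# Small, illustrative subset of Java keywords (an assumption, not the full list).
JAVA_KEYWORDS = {
    "class", "else", "for", "if", "int", "long", "private", "public",
    "return", "static", "void", "while",
}

# Comments first, then numbers, identifiers or keywords, then any single symbol.
TOKEN_PATTERN = re.compile(r"//[^\n]*|/\*.*?\*/|\d+|[A-Za-z_]\w*|\S", re.DOTALL)


def tokenize(source):
    """Turn Java-like source text into a normalized token sequence."""
    tokens = []
    for text in TOKEN_PATTERN.findall(source):
        if text.startswith(("//", "/*")):
            continue  # comments contribute no tokens
        if text[0].isdigit():
            tokens.append("LIT")  # every numeric literal looks the same
        elif text[0].isalpha() or text[0] == "_":
            tokens.append(text if text in JAVA_KEYWORDS else "ID")
        else:
            tokens.append(text)  # operators and punctuation, one character each
    return tokens


def same_token_sequence(source_a, source_b):
    """True when both fragments normalize to the same token sequence."""
    return tokenize(source_a) == tokenize(source_b)
```

With this normalization, whitespace, comments and renamed identifiers no longer matter. The same normalization is also a source of false positives: `a = b + c;` and `x = y + y;` produce identical token sequences although they compute different things.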

C. Abstract Syntax Tree Based Comparison
In this approach the source code is first converted into an
abstract syntax tree, and the tree is then traversed to
find similar subtrees. If a similarity is found, the
code for that subtree is termed a clone. This approach is
quite efficient, but it is very difficult and complex to create an
abstract syntax tree.
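
The subtree search can be illustrated with a small sketch. As an assumption made purely for illustration, it uses Python's standard `ast` module on Python snippets, since no Java parser is described in the text; the class and function names and the `min_nodes` threshold are hypothetical. Identifiers are blanked out so that only the tree structure is compared.

```python
import ast


class _Anonymize(ast.NodeTransformer):
    """Replace every identifier with "_" so only the tree structure remains."""

    def visit_Name(self, node):
        node.id = "_"
        return node

    def visit_arg(self, node):
        node.arg = "_"
        return node

    def visit_FunctionDef(self, node):
        self.generic_visit(node)  # anonymize parameters and body first
        node.name = "_"
        return node


def subtree_shapes(source, min_nodes=5):
    """Collect a structural fingerprint for every statement subtree that is large enough."""
    tree = _Anonymize().visit(ast.parse(source))
    shapes = set()
    for node in ast.walk(tree):
        size = sum(1 for _ in ast.walk(node))
        if isinstance(node, ast.stmt) and size >= min_nodes:
            shapes.add(ast.dump(node))  # dump omits line numbers by default
    return shapes


def common_subtrees(source_a, source_b, min_nodes=5):
    """Fingerprints of statement subtrees that occur in both sources."""
    return subtree_shapes(source_a, min_nodes) & subtree_shapes(source_b, min_nodes)
```

Two functions that differ only in their names and variable names share all of their large subtrees, while structurally unrelated code shares none.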

D. Program Dependency Graph Comparison
In this approach a program dependency graph (PDG) is first
obtained, and isomorphic subgraph comparison is then applied to
detect code clones [7]. The source code slices represented
by the matching subgraphs are returned as clones. The approach
is more effective because it detects both semantic and
syntactic clones, but it becomes very complex and costly for
large software systems.

E. Metric Based Comparison
In this approach metrics are calculated from the source
code, and these metrics are used to measure clones in the
software. The approach does not work directly on the source
code but uses the metrics to detect clones [8]. Various
tools are available for calculating metrics. Columbus
calculates metrics that are useful for the detection of
clones, but it does not work for Java programs. Source Monitor
calculates metrics for Java code, but the metrics it provides
are not very useful for the detection of clones. Other tools
that calculate Java code metrics, such as Datrix, which was
designed to extend software quality assessment to
Java code [12], are very complex. In the presented work, code
clones are detected with a tool that follows the metric-based
approach.
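
The core idea of the metric-based approach, comparing metric values instead of code, can be sketched as below. The fragment names and metric tuples are made-up illustration data, and grouping by exact equality is an assumption; the text does not specify how fragments are matched.

```python
from collections import defaultdict


def group_by_metrics(fragments):
    """Group fragment names whose metric tuples are identical.

    `fragments` maps a fragment name to a tuple of metric values.
    Only groups with more than one member (clone candidates) are returned.
    """
    buckets = defaultdict(list)
    for name, metrics in fragments.items():
        buckets[tuple(metrics)].append(name)
    return [sorted(names) for names in buckets.values() if len(names) > 1]


# Hypothetical metric tuples: (arguments, lines, return statements).
candidates = group_by_metrics({
    "Factorial1.fact": (1, 5, 1),
    "Factorial2.fact": (1, 5, 1),
    "Util.max": (2, 4, 1),
})
```

Fragments that land in the same bucket are only potential clones, since different code can produce the same metric values.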

Code clones can be introduced into the code for
various reasons:

Time Limit: The main reason for code cloning is that the
developer is given a fixed time limit to finish a project, and to
meet it developers simply copy and paste existing code.

Difficulty in Understanding Large Systems: The difficulty of
understanding a large software system forces
developers to use example-oriented programming, adapting
previously written code.

Reuse: One of the major reasons for code duplication is reusing
code by copying and pasting the existing code.

By Accident: Code cloning may happen accidentally. Such
clones arise unintentionally when two software developers
come up with the same solution. Technically, these are not clones.

Developer's Performance: If the productivity of a developer
is measured by the number of lines produced per hour,
the developer's focus is on increasing the number of
lines in the system, so the developer tries to reuse the same
code again and again by copying and pasting.

Risk in New Code: There is a high risk of software errors in
new code fragments, whereas existing code is already tested and
carries less risk of error. A developer therefore finds the
existing code more reliable than creating new code.

Language Limitation: Clones can be introduced because of
the limitations of the language. Sometimes developers
are also forced to copy because of the limits of their knowledge
of that particular programming language.

Detecting code clones is therefore of major concern
these days in order to make systems more efficient.

III. PROPOSED WORK
The presented work detects clones using a
metric-based approach. Input in the form of
byte code is given to the proposed tool, metrics are
computed for the given input, and clones are detected by
comparing the computed metrics.



A. Giving Input to the Tool
In this step, input is given to the tool in the form of byte
code, i.e. a .class file is given as input, because byte code
is a unified representation and is platform
independent, which makes clone detection more
efficient. Fig. 1 shows the first page that opens when the
FileChooserDemo.class file is run. The startup page consists of
two buttons: the first, Open a File, is used to select
the .class file for which metrics are to be calculated, and the
second, Delete Previous Data, is used to delete the previous
data stored in the database.

Fig. 1 Startup page

Fig. 2 Selecting Factorial.class (for loop) and giving it as input to the tool

Fig. 3 Selecting Factorial2.class (while loop) and giving it as input to the tool
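
Before computing metrics, a tool that accepts byte code can check that the selected file really is a .class file. The sketch below is an illustration only and is not taken from the presented tool; the function name is hypothetical. It relies on the documented class file layout, which starts with the 4-byte magic number 0xCAFEBABE followed by the 2-byte minor and major version numbers in big-endian order.

```python
import struct

CLASS_MAGIC = 0xCAFEBABE  # first four bytes of every Java class file


def read_class_header(data):
    """Return (major, minor) class-file version, or raise ValueError."""
    if len(data) < 8:
        raise ValueError("too short to be a class file")
    # Big-endian: u4 magic, u2 minor_version, u2 major_version.
    magic, minor, major = struct.unpack(">IHH", data[:8])
    if magic != CLASS_MAGIC:
        raise ValueError("missing 0xCAFEBABE magic number")
    return major, minor
```

A caller would typically read the selected file in binary mode and pass its first bytes to `read_class_header` before going on to the metric computation.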

B. Computation of Metrics
The second step is the computation of metrics, i.e. the
metrics are calculated in order to detect the clones.
Bruno Lague et al. [9] provided the metrics that are useful
in the detection of clones, and this tool calculates
only those metrics that help in the detection of
potential clones. Apart from these metrics, the tool also
calculates some object-oriented metrics that help in
identifying potential clones. The metrics listed by
Bruno Lague et al. [9], in the form of class metrics and
function metrics, are:

Class Metrics
1. Number of functions present in the class
2. Number of if statements present in the class
3. Number of lines of code (LOC)
4. Number of variables present in the class
5. Number of public variables present in the class
6. Number of private variables present in the class
7. Number of protected variables present in the class
8. Number of friend variables present in the class

Function Metrics
1. The names of the functions present in a class
2. Total number of variables present at function level
3. Number of lines in a function
4. The return type of the function
5. Total number of arguments passed to the function
6. The number of times a function is called
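
To make a few of the class metrics above concrete, the following sketch computes them from Java source text with regular expressions. This is a deliberate simplification and an assumption on our part: the presented tool works on byte code, whereas the hypothetical `class_metrics` function below only approximates the counts from text (it counts constructors as functions, ignores fields without an access modifier, and does not skip comments or strings).

```python
import re

MODIFIERS = r"(?:(?:public|private|protected|static|final)\s+)*"
# A method header: optional modifiers, a return type, a name, parameters, then "{".
METHOD = re.compile(
    r"^\s*" + MODIFIERS
    + r"(?!(?:else|return|new)\b)[\w<>\[\]]+\s+(?!if\b)\w+\s*\([^)]*\)\s*\{",
    re.MULTILINE,
)
# A field with an explicit access modifier, e.g. "private int calls;".
FIELD = re.compile(
    r"^\s*(public|private|protected)\s+(?:static\s+)?(?:final\s+)?"
    r"[\w<>\[\]]+\s+\w+\s*(?:=[^;]*)?;",
    re.MULTILINE,
)
IF_STATEMENT = re.compile(r"\bif\s*\(")


def class_metrics(source):
    """Approximate a subset of the class metrics from Java source text."""
    visibility = [match.group(1) for match in FIELD.finditer(source)]
    return {
        "functions": len(METHOD.findall(source)),
        "if_statements": len(IF_STATEMENT.findall(source)),
        "loc": sum(1 for line in source.splitlines() if line.strip()),
        "public_variables": visibility.count("public"),
        "private_variables": visibility.count("private"),
        "protected_variables": visibility.count("protected"),
    }


# Hypothetical input used only to exercise the sketch.
SAMPLE = """public class Factorial {
    private int calls;
    public int limit = 10;

    public int fact(int n) {
        int f = 1;
        for (int i = 1; i <= n; i++) {
            f = f * i;
        }
        if (n < 0) {
            return -1;
        }
        return f;
    }
}
"""
sample_metrics = class_metrics(SAMPLE)
```

The resulting dictionary plays the role of one row of class-level metrics; a byte-code based implementation would obtain the same kind of record from the .class file instead.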

The calculated metrics are stored in the database and
then exported to Excel sheets so that the metrics of both
files can be compared at both the class and the function level.
Fig. 4 and Fig. 5 show the metrics calculated for
Factorial1.class, and Fig. 6 and Fig. 7 show the metrics
calculated for Factorial2.class, at both the class and the
function level, so that clones can be detected more easily.


Fig. 4 Metrics of Factorial1.class (for loop) at class level

Fig. 5 Metrics of Factorial1.class (for loop) at function level

Fig. 6 Metrics of Factorial2.class (while loop) at class level

Fig. 7 Metrics of Factorial2.class (while loop) at function level

C. Detection of Clones
In the third step, once the computation of metrics is
complete, the metrics are mapped into Excel sheets so that they
can be compared to detect clones. Clones are detected on the
basis of the similarity of the metric values in both files.
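
The comparison step can be sketched as follows. The text does not state how similar the metric values must be, so the match ratio and the `threshold` parameter below are assumptions, and the function names and metric values are hypothetical.

```python
def compare_metrics(metrics_a, metrics_b):
    """Return (names of matching metrics, ratio of matches) over the shared metrics."""
    shared = sorted(set(metrics_a) & set(metrics_b))
    matching = [name for name in shared if metrics_a[name] == metrics_b[name]]
    ratio = len(matching) / len(shared) if shared else 0.0
    return matching, ratio


def is_potential_clone(metrics_a, metrics_b, threshold=1.0):
    """Flag two files as potential clones when enough metric values agree."""
    return compare_metrics(metrics_a, metrics_b)[1] >= threshold


# Made-up class-level metric records for two input files.
for_loop_version = {"functions": 1, "if_statements": 0, "loc": 9, "variables": 2}
while_loop_version = {"functions": 1, "if_statements": 0, "loc": 9, "variables": 2}
verdict = is_potential_clone(for_loop_version, while_loop_version)
```

Lowering `threshold` trades precision for recall: more pairs are reported, including pairs that merely happen to share some metric values.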

Fig. 8 Browsing the two excel files for which comparison is performed

Fig. 9 Comparison of Metrics of both files at class level

The two files to be compared by the tool are now
selected. Fig. 9 shows the
comparison of metrics at class level. The function-level
metrics are compared in the same way to find the clones. In
this way clones are detected by the tool.

Fig. 10 Comparison of Metrics of both files at function level

D. Architecture of the tool
Fig. 11 shows the architecture of the tool. First,
two .class files (byte code) are given as
input to the tool. Metric calculation is then performed, and
finally the clones are detected on the basis of metric match,
i.e. by comparing the metrics of the two files.

Fig. 11 Architecture of the tool

IV. RESULTS AND DISCUSSION
The .class files are given as input to the tool. The
reason for using byte code is that it is platform independent and
provides a unified representation of the code, which allows
clones to be detected more efficiently. After the input is given,
metrics are calculated, and clones are detected by comparing
the files on the basis of these metrics. Table I shows the class
metric values for various files and Table II shows the function
metric values for various files.

TABLE I CLASS METRICS VALUE OF VARIOUS FILES

TABLE II FUNCTION METRICS VALUE OF VARIOUS FILES

V. CONCLUSION AND FUTURE SCOPE


The tool designed is a clone detection tool used to detect
potential clones present in Java files. The tool calculates the
metrics of the input files, compares them on
the basis of metric match, and finally reports the code
clones. The tool is more efficient and easier to use than
approaches based on abstract syntax trees and program
dependence graphs, as those techniques are very complex to
use. The tool makes use of byte code (.class files) rather
than .java files, which makes it more efficient, as byte code is
platform independent and is a unified representation of the
code.
In the future this approach can be combined with other
approaches to form a hybrid approach that detects clones
more efficiently, and the tool can be extended to detect clones
for other languages as well.

ACKNOWLEDGMENT
We would like to express our gratitude to all who gave
their support to complete this paper. We thank
JCD College for providing us with labs to complete our work, and
all the faculty members who supported us.

REFERENCES

[1] Brenda Baker, On Finding Duplication and Near-
Duplication in Large Software Systems, In
Proceedings of the Second Working Conference on
Reverse Engineering, pp. 86-95.
[2] M. Fowler, Refactoring: Improving the Design of
Existing Code, Addison-Wesley, 2000.
[3] C. W. Krueger, Software Reuse, ACM Computing
Surveys (CSUR), vol. 24, no. 2, pp. 131-183, June 1992.
[4] Chanchal Kumar Roy and James R. Cordy, A Survey
on Software Clone Detection Research, Technical
Report No. 2007-541, School of Computing, Queen's
University at Kingston, Ontario, Canada, September
26, 2007.
[5] Reto Geiger, Beat Fluri, Harald C. Gall and Martin
Pinzger, Relation of Code Clones & Change Couplings,
In Proceedings of the 9th International Conference on
Fundamental Approaches to Software Engineering
(FASE'06), pp. 411-425, Vienna, Austria, March 2006.
[6] Zhenmin Li, Shan Lu, Suvda Myagmar, Yuanyuan
Zhou, CP-Miner: A Tool for Finding Copy-paste and
Related Bugs in Operating System Code, In
Proceedings of the 6th Symposium on Operating
Systems Design and Implementation (OSDI'04), pp. 289-302,
San Francisco, CA, USA, December 2004.
[7] C. Liu, C. Chen, J. Han, P. S. Yu, GPLAG: Detection
of Software Plagiarism by Program Dependence
Graph Analysis, Conf. on Knowledge Discovery and
Data Mining, pp. 872-881, 2006.
[8] G. Anil Kumar, Dr. C. R. K. Reddy, Dr. A. Govardhan,
An Efficient Method-level Code Clone Detection
Scheme Through Textual Analysis Using Metrics,
International Journal of Computer Engineering and
Technology (IJCET), Volume 3, Issue 1, pp. 273-288,
January-June 2012.
[9] Bruno Lague, Daniel Proulx, Jean Mayrand, Ettore M.
Merlo and John Hudepohl, Assessing the Benefits of
Incorporating Function Clone Detection in a Development
Process, In Proceedings of the 13th International
Conference on Software Maintenance (ICSM'97), pp.
314-321, Bari, Italy, October 1997.
[10] Cory J. Kapser and Michael W. Godfrey, Supporting the
Analysis of Clones in Software Systems: A Case Study,
Journal of Software Maintenance and Evolution:
Research and Practice, Vol. 18(2): 61-82, March 2006.
[11] Stephane Ducasse, Matthias Rieger, Serge Demeyer, A
Language Independent Approach for Detecting
Duplicated Code, In Proceedings of the 15th International
Conference on Software Maintenance (ICSM'99), pp.
109-118, Oxford, England, September 1999.
[12] Jean-Francois Patenaude, Bruno Lague, Extending the
Software Quality Assessment Techniques to Java Object
Systems, Seventh International Workshop on the Digital
Identifier, pp. 45-56, 1999.
