An Effective Approach for Detecting Code Clones
Girija Gupta #1, Indu Singh *2
# M.Tech Student (CSE), JCD College of Engineering, Affiliated to Guru Jambheshwar University, Hisar, India
* Assistant Professor (CSE), JCD College of Engineering, Affiliated to Guru Jambheshwar University, Hisar, India
Abstract
Software systems are becoming more complex, and as a result their understandability and maintainability degrade steadily. Maintaining a system has become one of the most expensive activities in the software industry. Copying and duplicating source code is a common practice, but it undermines the benefits of reuse by creating code clones. Detection of code clones therefore plays a vital role in the software industry and is an active research area. Detecting duplicate code makes the software maintenance process more efficient and lowers its cost by improving the understandability of the system. Various techniques have been developed for detecting clones, but most of them, such as the text-based and token-based approaches, take a large amount of time and are expensive. Other approaches, such as the tree-based and PDG-based ones, are very complex. The aim of the presented work is to develop a tool that uses a metric-based approach to detect clones in Java source files with ease.
Keywords
Software cloning, metric approach, clone detection tool
I. INTRODUCTION
Cloning at the design level as well as at the code level is a hindrance to software development and therefore needs to be removed. It has grown into an active area of the software engineering research community, resulting in the development of various techniques, tools and other methods for clone detection and removal. Reuse of code reduces software development and maintenance costs in the process of creating software systems. Copying a segment of source code that can be used to add new functionality with slight or no modification is a common activity. The reasons for reuse are reduced cost, time, effort and risk, and increased quality and efficiency [1]. The most common form of reuse is to copy and paste code, which results in duplication of code. Software reuse has been defined as the process of creating software systems from existing software systems [3]. It is easier to modify existing software than to develop programs from scratch. The major shortcoming of such duplicated fragments is that if a bug is detected in one code fragment, all other fragments similar to it must be investigated to check whether the bug is also present in the clones. Cloning not only produces code that is difficult to maintain, but may also introduce errors [6]. Code clones are considered an obstacle in the software industry, and cloned code is believed to have several adverse effects on the maintenance of software systems. That is why it is beneficial to remove clones and to prevent their introduction by constantly checking the source code [9]. Clones are often the result of copy-paste activities. These activities are very easy and can significantly reduce programming effort and time, because they reuse an existing fragment of code rather than rewriting similar code from scratch, especially in the device drivers of operating systems, where the algorithms are similar.
Code cloning is a more serious problem in industrial software systems. If a clone is present, the normal functioning of the system may not be affected, but unless the maintenance team takes action, further development may become very expensive. Clones are believed to have a negative impact on evolution [5]. Code clones may adversely affect the quality of software systems, especially their maintainability and comprehensibility. The cost of maintaining clones over a system's lifetime has not been estimated yet, but it is at least agreed that the financial impact on maintenance is very high. Maintenance costs in general are estimated at 40%-70% of the total costs during a system's lifetime. Research shows that a significant amount of the code of a software system is cloned, and this amount may vary depending on the domain and origin of the software system. Baker [1] found that in large systems between 13% and 20% of the source code can be cloned. B. Lague et al. [9] studied only function clones and reported that between 6.4% and 7.5% of the code is cloned, and Baxter et al. reported that 12.7% of the code of a software system consists of clones. Mayrand et al. also estimated that normal industrial source code contains 5%-20% duplicated code. Kapser and Godfrey [10] observed that as much as 10%-15% of the source code of a large system is cloned. For COBOL systems that are object oriented, the rate of duplicated code was found to be more than 50% [11]. Given the amount of duplicated code in large software systems and its maintenance cost, it is crucial to detect code clones in such systems in order to perform the respective maintenance tasks. In the presented work, a tool is designed that helps detect clones in Java source files using a metric-based approach.
II. RELATED WORK
Code clones are one of the main reasons that software maintenance becomes more difficult. Code clones are code fragments in source files that are similar to another code fragment. If a fault is found in one code fragment, then all of the cloned fragments need modification, and the system becomes more difficult to maintain as it grows larger. Various research studies have reported that large software companies spend a lot of money to maintain existing systems. Researchers consider clones to be harmful. Brenda Baker [1] concluded that clones are harmful because inconsistent changes increase maintenance effort and introduce various errors. Fowler [2] suggests that duplication of code is a major reason for poor maintainability and that clones can create many problems if they are not detected in time. Reto Geiger [5] reported that clones are generally considered harmful to the quality of source code and that one of their main drawbacks is that changes to one code fragment may need to be propagated to several other similar ones. Chanchal K. Roy [4] presented a survey of clones that covers various techniques for clone detection, the reasons for code cloning and the types of clones.
International Journal of Computer Trends and Technology (IJCTT), volume 4, Issue 9, Sep 2013
A. Textual based comparison
In this approach the code is not transformed into any intermediate form before comparison; it is given directly to the clone detection process. This approach is efficient, but it cannot detect structural clones, which have different code but the same logic.
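As a rough illustration of the textual approach, the following Java sketch compares two fragments line by line without building any intermediate representation. The class and method names (TextualComparison, sharedLines) and the whitespace trimming are assumptions made for this example; they are not taken from any tool discussed in this paper.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of text-based clone detection: raw lines are compared directly.
class TextualComparison {

    // Returns the non-blank lines of fragment b that also occur verbatim in fragment a
    // (leading and trailing whitespace is ignored).
    static List<String> sharedLines(String a, String b) {
        Set<String> linesOfA = new HashSet<>();
        for (String line : a.split("\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                linesOfA.add(trimmed);
            }
        }
        List<String> shared = new ArrayList<>();
        for (String line : b.split("\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && linesOfA.contains(trimmed)) {
                shared.add(trimmed);
            }
        }
        return shared;
    }

    public static void main(String[] args) {
        String forVersion = "int f = 1;\nfor (int i = 1; i <= n; i++) {\n    f = f * i;\n}\nreturn f;";
        String whileVersion = "int f = 1;\nint i = 1;\nwhile (i <= n) {\n    f = f * i;\n    i++;\n}\nreturn f;";
        // Only the lines that are textually identical are reported.
        System.out.println(sharedLines(forVersion, whileVersion));
    }
}
```

The two loop headers differ textually, so a purely textual comparison does not recognise that both fragments compute the same result, which is the limitation noted above.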
B. Token based comparison
In this approach a parser or lexer is required to convert the code into tokens, i.e. an intermediate form, before the comparison is applied. It is more efficient than the text-based approach when blank spaces and comments are present in the source code. However, the conversion of source code into a token sequence is not always efficient, because it may introduce various false positives.
C. Abstract Syntax Tree Based Comparison
In this approach the source code is first converted into an abstract syntax tree, and the tree is then traversed to find similar subtrees. If a similarity is found, the code of that subtree is termed a clone. This approach is quite efficient, but creating an abstract syntax tree is difficult and complex.
D. Program Dependency Graph Comparison
In this approach a PDG is obtained first, and isomorphic graph comparison is then applied to detect the code clones [7]. The source code slices represented by the matching subgraphs are returned as clones. This approach is more effective because it detects both semantic and syntactic clones, but it becomes very complex for large software systems and is costly too.
E. Metric Based Comparison
In this approach metrics are calculated from the source code and are then used to measure clones in the software. The approach does not work directly on the source code but uses the metrics to detect the clones [8]. Various tools are available for calculating metrics. Columbus calculates metrics that are useful for detecting clones, but it does not work for Java programs. Source Monitor calculates metrics for Java code, but the metrics it provides are not very useful for detecting clones. Some other tools for calculating Java code metrics, such as Datrix, which is designed for assessing the quality of Java code [12], are very complex.
In the presented work, code clones are detected with the help of a tool that follows the metric-based approach.
The various reasons for which code clones can be introduced into the code are:
Time Limit : The main reason for code cloning is that developers are given a fixed time limit to finish a project, and to meet it they simply copy and paste existing code.
Difficulty in Understanding Large System : The difficulty of understanding a large software system forces developers to use example-oriented programming, adapting previously written code.
Reuse : One of the major reasons for code duplication is reusing code by copying and pasting the existing code.
By Accident : Code cloning may happen accidentally, when two software developers unintentionally arrive at the same solution. Technically, these are not clones.
Developers Performance : If the productivity of a developer is measured by the number of lines produced per hour, the developer focuses on increasing the number of lines of the system and therefore tries to reuse the same code again and again by copying and pasting.
Risk in New Code : There is a high risk of software errors in new code fragments, whereas existing code has already been tested and carries less risk of error. A developer therefore finds the existing code more reliable than creating new code.
Language Limitation : Clones can be introduced because of the limitations of the language. Sometimes developers are also forced to copy because of the limits of their knowledge of that particular programming language.
Therefore detecting code clones is of major concern these days in order to make the system more efficient.
III. PROPOSED WORK
The presented work is based on the detection of clones using a metric-based approach: input in the form of byte code is given to the proposed tool, metrics are computed for the given input, and clones are then detected by comparing the metrics.
A. Giving Input to the tool
In this step, input is given to the tool in the form of byte code, i.e. a .class file is given as the input, because byte code is a unified, platform-independent representation, which makes clone detection more efficient. Fig. 1 shows the first page that opens when the FileChooserDemo.class file is run. The startup page consists of two buttons: the Open a File button, used to select the .class file for which metrics are to be calculated, and the Delete Previous Data button, used to delete the previous data stored in the database.
Fig. 1 Startup page
Fig. 2 Selecting Factorial.class (for loop) and giving it as input to the tool
Fig. 3 Selecting Factorial2.class (while loop) and giving it as input to the tool
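To illustrate why byte code is a convenient unified input, the sketch below reads the fixed header with which every .class file begins: the magic number 0xCAFEBABE followed by the minor and major version numbers. The class name ClassFileHeader and its method are hypothetical; this paper does not state how the tool actually reads its input.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: checks that a stream looks like a Java class file.
class ClassFileHeader {

    // Reads the header and returns the major version, or throws if the magic number is wrong.
    static int readMajorVersion(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int magic = data.readInt();          // first 4 bytes of every class file
        if (magic != 0xCAFEBABE) {
            throw new IOException("Not a .class file");
        }
        data.readUnsignedShort();            // minor version (ignored here)
        return data.readUnsignedShort();     // major version
    }

    public static void main(String[] args) throws IOException {
        // In-memory header standing in for a real file selected in the tool.
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 52};
        System.out.println(readMajorVersion(new ByteArrayInputStream(header)));
    }
}
```

Because this header and the structures that follow it are the same on every platform, a tool that works on .class files does not need to deal with differences in source formatting.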
B. Computation of Metrics
The second step is the computation of the metrics that are needed to detect the clones. Bruno Lague et al. [9] provided metrics that are useful for the detection of clones, and this tool calculates only those metrics that help in the detection of potential clones. Apart from these metrics, some object-oriented metrics that help in identifying potential clones are also calculated by the tool. The metrics listed by Bruno Lague et al. [9], in the form of class metrics and function metrics, are:
Class Metrics
1. Number of functions present in the class
2. Number of if statements present in the class
3. Number of lines of code (LOC)
4. Number of variables present in the class
5. Number of public variables present in the class
6. Number of private variables present in the class
7. Number of protected variables present in the class
8. Number of friend variables present in the class
Function Metrics
1. Names of the functions present in a class
2. Total number of variables present at function level
3. Number of lines in a function
4. Return type of the function
5. Total number of arguments passed to the function
6. Number of times a function is called
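A few of the class and function metrics listed above can be approximated from a compiled class with Java's standard reflection API, as in the sketch below. This is only an illustration under stated assumptions: the paper does not say how the tool extracts its metrics, the names ReflectionMetrics, classMetrics, functionMetrics and Sample are hypothetical, and counts such as lines of code or if statements are not available through reflection and are left out.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: derives some class-level and function-level metrics via reflection.
class ReflectionMetrics {

    // Small example class standing in for a compiled input such as a factorial program.
    static class Sample {
        public int result;
        private int counter;
        protected int limit;

        int factorial(int n) {
            int f = 1;
            for (int i = 1; i <= n; i++) {
                f = f * i;
            }
            return f;
        }
    }

    // Class metrics: number of functions and number of variables by visibility.
    static Map<String, Integer> classMetrics(Class<?> type) {
        int functions = 0, variables = 0, pub = 0, priv = 0, prot = 0;
        for (Method m : type.getDeclaredMethods()) {
            if (!m.isSynthetic()) {          // skip compiler-generated methods
                functions++;
            }
        }
        for (Field f : type.getDeclaredFields()) {
            if (f.isSynthetic()) {           // skip compiler-generated fields
                continue;
            }
            variables++;
            int mod = f.getModifiers();
            if (Modifier.isPublic(mod)) pub++;
            if (Modifier.isPrivate(mod)) priv++;
            if (Modifier.isProtected(mod)) prot++;
        }
        Map<String, Integer> metrics = new LinkedHashMap<>();
        metrics.put("functions", functions);
        metrics.put("variables", variables);
        metrics.put("publicVariables", pub);
        metrics.put("privateVariables", priv);
        metrics.put("protectedVariables", prot);
        return metrics;
    }

    // Function metrics: name mapped to "returnType/argumentCount".
    // Overloaded methods share a name, so later entries overwrite earlier ones in this sketch.
    static Map<String, String> functionMetrics(Class<?> type) {
        Map<String, String> metrics = new LinkedHashMap<>();
        for (Method m : type.getDeclaredMethods()) {
            if (!m.isSynthetic()) {
                metrics.put(m.getName(),
                        m.getReturnType().getSimpleName() + "/" + m.getParameterCount());
            }
        }
        return metrics;
    }

    public static void main(String[] args) {
        System.out.println(classMetrics(Sample.class));
        System.out.println(functionMetrics(Sample.class));
    }
}
```

Reflection is used here only because it ships with the JDK; a tool that also needs statement-level counts would have to inspect the byte code itself.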
The calculated metrics are stored in the database and then in Excel sheets, so that the metrics of both files can be compared at both the class and the function level. Fig. 4 and Fig. 5 show the metrics calculated for Factorial1.class, and Fig. 6 and Fig. 7 show the metrics calculated for Factorial2.class, at both the class and the function level, so that clones can be detected more easily.
Fig. 4 Metrics of Factorial1.class (for loop) at class level
Fig. 5 Metrics of Factorial1.class (for loop) at function level
Fig. 6 Metrics of Factorial2.class (while loop) at class level
Fig. 7 Metrics of Factorial2.class (while loop) at function level
C. Detection of Clones
In the third step, after the computation of the metrics is complete, the metrics are mapped into Excel sheets so that they can be compared to detect clones. Clones are detected on the basis of the similarity of the metric values in both files.
Fig. 8 Browsing the two excel files for which comparison is performed
Fig. 9 Comparison of Metrics of both files at class level
Now the two files for which the comparison is to be performed by the tool are selected. Fig. 9 shows the comparison of metrics at the class level. In the same way, the function-level metrics are compared to find the clones.
Fig. 10 Comparison of Metrics of both files at function level
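The comparison step can be sketched as a match over metric values: two files are reported as potential clones when their metrics agree. The paper does not specify the matching rule, so the exact-equality check, the similarity ratio and all names below (MetricComparison, similarity) are assumptions for illustration only, and the sample values are made up rather than measured results.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of metric matching between two files.
class MetricComparison {

    // Fraction of the metrics in 'first' that have an equal value in 'second'.
    static double similarity(Map<String, Integer> first, Map<String, Integer> second) {
        if (first.isEmpty()) {
            return 0.0;
        }
        int matches = 0;
        for (Map.Entry<String, Integer> entry : first.entrySet()) {
            if (entry.getValue().equals(second.get(entry.getKey()))) {
                matches++;
            }
        }
        return (double) matches / first.size();
    }

    public static void main(String[] args) {
        // Made-up class metrics for a for-loop and a while-loop factorial.
        Map<String, Integer> forLoop = new LinkedHashMap<>();
        forLoop.put("functions", 1);
        forLoop.put("ifStatements", 0);
        forLoop.put("variables", 2);
        forLoop.put("linesOfCode", 8);

        Map<String, Integer> whileLoop = new LinkedHashMap<>(forLoop);
        whileLoop.put("linesOfCode", 10);

        // Three of the four metrics agree.
        System.out.println(similarity(forLoop, whileLoop)); // prints 0.75
    }
}
```

A threshold on this ratio, or a tolerance per metric, would decide whether a pair is flagged as a potential clone; choosing such a threshold is outside what the paper describes.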
D. Architecture of the tool
Fig. 11 shows the architecture of the tool: first, two .class files (byte code) are given as input to the tool, then the metrics are calculated, and finally the clones are detected on the basis of a metric match, i.e. by comparing the metrics of the two files.
Fig. 11 Architecture of the tool
IV. RESULTS AND DISCUSSION
The .class files are given as input to the tool. Byte code is used because it is platform independent and provides a unified representation of the code, so that clones are detected more efficiently. After the input is given, the metrics are calculated and the files are compared on the basis of these metrics to detect clones. Table I shows the class metric values for various files and Table II shows the function metric values for various files.
TABLE I CLASS METRICS VALUE OF VARIOUS FILES
TABLE II FUNCTION METRICS VALUE OF VARIOUS FILES
V. CONCLUSION AND FUTURE SCOPE
The tool designed is a clone detection tool used to detect potential clones present in a Java file. The tool calculates the metrics of the input Java files, compares the files on the basis of a metric match, and finally detects the code clones. The tool is efficient and easier to use than other approaches, such as abstract syntax trees and program dependence graphs, because those techniques are very complex to use. The tool makes use of byte code (.class files) rather than .java files, which makes it more efficient, as byte code is platform independent and is a unified representation of the code. In the future, this approach can be combined with other approaches to form a hybrid approach that detects clones more efficiently, and the tool can be extended to detect clones for other languages as well.
ACKNOWLEDGMENT
We would like to express our gratitude to all who gave their support to complete this paper. We express our gratitude to JCD College for providing us with labs to complete our work and to all the faculty members who supported us.
REFERENCES
[1] Brenda Baker, On Finding Duplication and Near Duplication in Large Software Systems, In Proceedings of the Second Working Conference on Reverse Engineering, pp. 86-95.
[2] M. Fowler, Refactoring: Improving the Design of Existing Code, Addison-Wesley, 2000.
[3] Krueger C. W., Software Reuse, ACM Computing Surveys (CSUR), vol. 24, no. 2, pp. 131-183, June 1992.
[4] Chanchal Kumar Roy and James R. Cordy, A Survey on Software Clone Detection Research, Technical Report No. 2007-541, School of Computing, Queen's University at Kingston, Ontario, Canada, September 26, 2007.
[5] Reto Geiger, Beat Fluri, Harald C. Gall and Martin Pinzger, Relation of Code Clones & Change Couplings, In Proceedings of the 9th International Conference of Fundamental Approaches to Software Engineering (FASE'06), pp. 411-425, Vienna, Austria, March 2006.
[6] Zhenmin Li, Shan Lu, Suvda Myagmar, Yuanyuan Zhou, CP-Miner: A Tool for Finding Copy-paste and Related Bugs in Operating System Code, In Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI'04), pp. 289-302, San Francisco, CA, USA, December 2004.
[7] C. Liu, C. Chen, J. Han, P. S. Yu, GPLAG: Detection of Software Plagiarism by Program Dependence Graph Analysis, Conf. on Knowledge Discovery and Data Mining, pp. 872-881, 2006.
[8] G. Anil Kumar, Dr. C. R. K. Reddy, Dr. A. Govardhan, An Efficient Method-level Code Clone Detection Scheme Through Textual Analysis Using Metrics, International Journal of Computer Engineering and Technology (IJCET), Volume 3, Issue 1, pp. 273-288, January-June 2012.
[9] Bruno Lague, Daniel Proulx, Jean Mayrand, Ettore M. Merlo and John Hudepohl, Assessing the Benefits of Incorporating Function Clone Detection in a Development Process, In Proceedings of the 13th International Conference on Software Maintenance (ICSM'97), pp. 314-321, Bari, Italy, October 1997.
[10] Cory J. Kapser and Michael W. Godfrey, Supporting the Analysis of Clones in Software Systems: A Case Study, Journal of Software Maintenance and Evolution: Research and Practice, Vol. 18(2): 61-82, March 2006.
[11] Stephane Ducasse, Matthias Rieger, Serge Demeyer, A Language Independent Approach for Detecting Duplicated Code, In Proceedings of the 15th International Conference on Software Maintenance (ICSM'99), pp. 109-118, Oxford, England, September 1999.
[12] Jean-Francois Patenaude, Bruno Lague, Extending the Software Quality Assessment Techniques to Java Object Systems, Seventh International Workshop on the Digital Identifier, pp. 45-56, 1999.