
Team 2

TRACEABILITY WITH LSI
Traceability
 The ability to link between different artifacts
Example artifacts: code, user manuals, design
documentation, development wikis, etc.
 In particular, link code to:
Relevant requirements
Sections in design documents
Test-cases
Other structured and free-text artifacts
 Also, link from requirements, design
documents, etc. to code

A Comparison of Traceability Techniques for Specifications 3


What’s Traceability Good For?
 Program Comprehension
Which code segment implements which specific
requirement and vice-versa
 Impact analysis
Keeping non-code artifacts up-to-date
 Requirement Tracing
Discover what code needs to change to handle
a new requirement
Aid in determining whether a specification is
completely implemented and covered by tests



Challenges
 Scalability
 Large # of artifacts
 Heterogeneity
 Large # of different document formats and programming languages
 Noise
 Free-text information (natural language): conjunctions, prepositions,
abbreviations, etc.
 Some information may be outdated, or just plain wrong
 Prior work:
 Recovering Traceability Links in Software Artifact Management
Systems using information retrieval methods [Lucia et al., 2007]
 Recovering Traceability Links between Code and Documentation
[Antoniol et al., 2002, Deerwester et al., 1990, Marcus and Maletic,
2003]



Example
/*
 * The File interface provides…
 */
public class FileImpl extends FilePOA {
    private String nativefileName;

    /**
     * Creates a new File…
     */
    public FileImpl(String nativePath ...) {
    }

    /**
     * …
     */
    private String f(..) {…}
}



Traceability Link Recovery
The indexer builds two representations in parallel:

Code path: Source Code → Identifier Extraction → Identifier Separation → Text Normalization → List of Identifiers. Identifier separation finds the basic parts of identifiers (e.g., Auto_due into Auto and due).

Document path (the three steps of query extraction): Software Documents → Letter Transformation (capital to small letters) → Stopwords Removal (articles, punctuation, etc.) → Morphological Analysis (plural to singular, infinitive, etc.) → Query.

A classifier compares each document’s query against the identifier index and outputs a scored document list.
 Pre-processing

Text preprocessing example:

Before: “Copyright owners grant member companies of the OMG permission to make a limited …”
After: “copyright owner grant member compani omg permiss make limit …”

• Lower-casing, stop-word and number removal, stemming, etc.
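The normalization steps above can be sketched in a few lines of Python. This is a toy pipeline: the stop-word list and the suffix stripper are illustrative stand-ins for a real stop list and a real stemmer (e.g., Porter’s), which is why its output differs slightly from the slide’s.

```python
import re

# Illustrative stop-word list (a real system uses a much longer one).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for"}

def toy_stem(word):
    # Crude suffix stripper standing in for real morphological analysis.
    for suffix in ("ies", "ion", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(text):
    # Lower-case, keep only alphabetic tokens (drops numbers/punctuation),
    # remove stop words, then stem what remains.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]

result = normalize("Copyright owners grant member companies of the OMG permission")
print(result)
```

Running it shows the same kind of collapse as the slide: inflected forms and stop words disappear, leaving compact index terms.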


Words Extraction
/*
 * The File interface provides…
 */
public class FileImpl extends FilePOA {
    private String nativefileName;

    /**
     * Creates a new File…
     */
    public FileImpl(String nativePath ...) {
    }

    /**
     * …
     */
    private String f(..) {…}
}

↓ words extraction

•Class name
•Public function names
•Public function arguments and return types
•Comments
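A minimal sketch of this extraction step in Python. The regexes and the embedded source string are illustrative assumptions; a real tool would use a proper Java parser rather than regular expressions.

```python
import re

# Java-like source text, abbreviated from the slide's example.
SOURCE = '''
/* The File interface provides... */
public class FileImpl extends FilePOA {
    private String nativefileName;

    /** Creates a new File... */
    public FileImpl(String nativePath) { }
}
'''

# Class names: the word following the `class` keyword.
class_names = re.findall(r"\bclass\s+(\w+)", SOURCE)

# Public method/constructor names: the identifier just before "(" in a
# declaration that starts with `public`.
public_sigs = re.findall(r"\bpublic\s+(?:\w+\s+)*(\w+)\s*\(", SOURCE)

# Comments: everything between /* ... */ (including /** ... */ Javadoc).
comments = re.findall(r"/\*+(.*?)\*/", SOURCE, flags=re.DOTALL)
```

Together these yield the word sources the slide lists: class names, public function names and signatures, and comment text.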
Words Expansion

Before: … NativePath, fileName, delete_all_elements, …
After: … NativePath, Native, Path, fileName, File, Name, …

•Use well-known coding standards to split identifiers into sub-words
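The expansion step can be sketched as a small splitter, assuming the usual coding conventions (CamelCase and underscore separators) the slide refers to:

```python
import re

def expand(identifier):
    # Split on underscores first: "delete_all_elements" -> three parts.
    parts = identifier.split("_")
    subwords = []
    for part in parts:
        # Then split CamelCase boundaries: "NativePath" -> "Native", "Path".
        subwords += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part)
    # Keep the original identifier as a term alongside its sub-words.
    return [identifier] + subwords

camel = expand("NativePath")
snake = expand("delete_all_elements")
```

Both the compound identifier and its parts are indexed, so a query mentioning “path” can still match code that only uses `NativePath`.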


Vector Space Model
 Vector Space Model (VSM) [Salton et al., 1975]
 Each document, d, is represented by a vector of ranks
of the terms in the vocabulary:
vd = [rd(w1), rd(w2), …, rd(w|V|)]
 The query is similarly represented by a vector
 The similarity between the query and document is the
cosine of the angle between their respective vectors

 Assumes terms are independent


 Some terms are likely to appear together
 Terms can have different meanings depending on
context

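As a concrete sketch, VSM ranking with raw term counts. The toy documents are made up; real systems weight terms (e.g., with tf-idf) rather than using counts directly.

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine of the angle between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in set(a) | set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Each document is a vector of term counts over the vocabulary.
docs = {
    "d1": Counter("human machine interface for computer applications".split()),
    "d2": Counter("graph minors and trees".split()),
}
query = Counter("computer interface".split())

# Rank documents by cosine similarity to the query.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
```

Note the limitation the slides point out: `d2` scores exactly zero against this query even if it were conceptually relevant, because it shares no index terms with it.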
Vector Space Model
■ Classic IR might lead to poor retrieval due to:
◆ unrelated documents might be included in the
answer set
◆ relevant documents that do not contain at least
one index term are not retrieved
◆ Reasoning: retrieval based on index terms is
vague and noisy

 The term-document matrix has a very high
dimensionality
Are there really that many important features
for each document and term?

Example
 from Lillian Lee

Document 1: auto, engine, bonnet, tires, lorry, boot
Document 2: car, emissions, hood, make, model, trunk
Document 3: make, hidden, Markov, model, emissions, normalize

Synonymy: documents 1 and 2 share no terms, so their cosine is small, yet they are related.
Polysemy: documents 2 and 3 share terms (make, model, emissions), so their cosine is large, yet they are not truly related.
Latent Semantic Indexing
■ The user information need is more related to concepts
and ideas than to index terms
■ A document that shares concepts with another
document known to be relevant might be of interest

 LSI [Deerwester et al., 1990]


 Enhance the semantics of long descriptions.
 Dimensionality reduction can improve retrieval effectiveness
 Dimensionality reduction can uncover surprising relationships!

■ The key idea is to map documents and queries into a lower


dimensional space (i.e., composed of higher level concepts
which are in fewer number than the index terms)
■ Retrieval in this reduced concept space might be superior to
retrieval in the space of index terms

Lecture 12 Information Retrieval 17


Latent Semantic Indexing

Singular Value Decomposition
 unique mathematical decomposition of a
matrix into the product of three matrices:
two with orthonormal columns
one with singular values on the diagonal
 tool for dimension reduction
 finds optimal projection into low-dimensional
space



Singular Value Decomposition

The t × d term-document matrix w_td factors as M = T · S · Dᵀ, where T is t × r, S is the r × r diagonal matrix of singular values, and Dᵀ is r × d (r = rank of M).

 Compute the singular value decomposition of a
term-document matrix
D, a representation of M in r dimensions
T, a matrix for transforming new documents
S gives the relative importance of the dimensions
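A sketch of this decomposition with NumPy; the small term-document matrix is made up for illustration.

```python
import numpy as np

# A 4-term x 3-document matrix of term counts (illustrative values).
M = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0]])

# M = T @ diag(S) @ Dt, with T and Dt.T having orthonormal columns and
# S holding the singular values in decreasing order.
T, S, Dt = np.linalg.svd(M, full_matrices=False)
reconstructed = T @ np.diag(S) @ Dt

# Dimension reduction: keep only the k largest singular values, giving
# the best rank-k approximation of M.
k = 2
M_k = T[:, :k] @ np.diag(S[:k]) @ Dt[:k, :]
```

Keeping only the top-k dimensions is exactly the “optimal projection into low-dimensional space” the slide mentions (optimal in the least-squares sense).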
LSI Term matrix T
 T matrix
gives a vector for each term in LSI space
multiply by a new document vector to “fold in”
new documents into LSI space
 LSI is a rotation of the term-space
original matrix: terms are d-dimensional
new space has lower dimensionality
dimensions are groups of terms that tend to co-occur in the same documents
○ synonyms, contextually-related words, variant endings

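“Folding in” a new document can be sketched as projecting its raw term vector q into the k-dimensional space via q̂ = qᵀ · T_k · S_k⁻¹, reusing the existing T and S (the example matrix below is made up):

```python
import numpy as np

# Existing collection: 4 terms x 3 documents (illustrative values).
M = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0]])
T, S, Dt = np.linalg.svd(M, full_matrices=False)

k = 2
T_k, S_k = T[:, :k], S[:k]

# A new document arrives as a raw term-count vector over the same 4 terms.
q = np.array([1.0, 1.0, 0.0, 0.0])

# Fold it into LSI space without recomputing the SVD.
q_hat = q @ T_k @ np.diag(1.0 / S_k)
```

The folded-in vector lives in the same k-dimensional space as the columns of D, so it can be compared against existing documents by dot-product or cosine.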


Document matrix D
D matrix
coordinates of documents in LSI space
same dimensionality as T vectors
can compute the similarity between a term
and a document

http://lsi.research.telcordia.com/



Dimension Reduction



Improved Retrieval with LSI
 New documents and queries are "folded in"
multiply vector by T
 Compute similarity for ranking as in VSM
compare queries and documents by dot-product
 Improvements come from
reduction of noise
no need to stem terms (variants will co-occur)
no need for stop list
○ stop words are used uniformly throughout collection, so
they tend to appear in the first dimension
No speed or space gains, though…



Computing an Example
Technical Memo Titles
c1: Human machine interface for ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user perceived response time to error measurement

m1: The generation of random, binary, ordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey
Computing an Example

The 12 index terms (those occurring in at least two titles) and the nine titles form a 12 × 9 term-document matrix M. (The slides show M, its SVD factors, and the rank-2 reconstruction as figures.)

With the raw matrix M:
r(human, user) = -.38    r(human, minors) = -.29

With the rank-2 reconstruction (new M):
r(human, user) = .94    r(human, minors) = -.83
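The example can be reproduced end-to-end in NumPy, assuming the standard 12 × 9 term-document matrix from Deerwester et al. (rows are the index terms occurring in at least two titles, columns are c1–c5, m1–m4):

```python
import numpy as np

terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
M = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)

def term_corr(A, t1, t2):
    # Pearson correlation between two term rows across the documents.
    i, j = terms.index(t1), terms.index(t2)
    return np.corrcoef(A[i], A[j])[0, 1]

T, S, Dt = np.linalg.svd(M, full_matrices=False)
M2 = T[:, :2] @ np.diag(S[:2]) @ Dt[:2, :]   # rank-2 approximation

before = term_corr(M, "human", "user")   # negative: never co-occur
after = term_corr(M2, "human", "user")   # positive after reduction
```

Even though “human” and “user” never appear in the same title, the rank-2 space pulls them together through shared context, which is the point of the slide’s numbers.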


Incremental LSI
 Allows fast, low-cost computation of
traceability links by reusing the results of a
previous LSI computation.
 Avoids the full cost of LSI computation for
traceability link recovery by analyzing the changes to
documentation and source code between versions
of the system, and then deriving the changes to
the set of documentation-to-source-code
traceability links.
 Evaluation
Experiment of LEDA
 Trace links from the manual to the source
code
 Find out which parts of the source code are
described by a given manual section.
IR Quality Measures
 Precision @ n: the fraction of the top n retrieved
documents that are relevant
 Recall @ n: the fraction of all relevant documents
that appear in the top n results
 Average precision: the mean of precision @ k over
the ranks k at which relevant documents are retrieved
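The three measures can be sketched directly; the ranked list and relevant set below are made-up examples.

```python
def precision_at(ranked, relevant, n):
    # Fraction of the top-n retrieved documents that are relevant.
    hits = sum(1 for d in ranked[:n] if d in relevant)
    return hits / n

def recall_at(ranked, relevant, n):
    # Fraction of all relevant documents appearing in the top n.
    hits = sum(1 for d in ranked[:n] if d in relevant)
    return hits / len(relevant)

def average_precision(ranked, relevant):
    # Mean of precision@k taken at each rank k where a relevant
    # document is retrieved, averaged over all relevant documents.
    precisions = [precision_at(ranked, relevant, k + 1)
                  for k, d in enumerate(ranked) if d in relevant]
    return sum(precisions) / len(relevant) if relevant else 0.0

ranked = ["d3", "d1", "d7", "d2"]   # retrieval order (best first)
relevant = {"d1", "d2"}             # ground-truth relevant set
```

Here the relevant documents sit at ranks 2 and 4, so precision@2 and recall@2 are both 0.5, and average precision is (1/2 + 2/4) / 2 = 0.5.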


Experiment of LEDA

Figure 2. Time Cost with iLSI, threshold=0.7
Conclusions
 Latent semantic indexing provides an
interesting conceptualization of the IR
problem
 It reduces the complexity of the
underlying representational framework,
which might be explored, for instance,
for interfacing with the user
Conclusions
 Traceability between code and
documentation in real-world systems can be
recovered effectively with IR techniques.
 On realistic datasets, the Vector Space
Model, which does not perform
dimensionality reduction, was shown to
be the most effective.



References
 Hsinyi Jiang, Tien N. Nguyen, Ing-Xiang Chen, Hojun Jaygarl, Carl K. Chang:
Incremental Latent Semantic Indexing for Automatic Traceability Link Evolution
Management. ASE 2008: 59-68
 G. Antoniol, G. Canfora, G. Casazza, A.D. Lucia, and E. Merlo. Recovering
Traceability Links Between Code and Documentation. IEEE Trans. Softw. Eng. ,
28(10):970-983, 2002.
 Dumais, S. T., Furnas, G. W., Landauer, T. K. and Deerwester, S. (1988), "Using
latent semantic analysis to improve information retrieval." In Proceedings of CHI'88:
Conference on Human Factors in Computing, New York: ACM, 281-285.
 Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R.A.
(1990) "Indexing by latent semantic analysis." Journal of the Society for Information
Science, 41(6), 391-407.
 Foltz, P. W. (1990) "Using Latent Semantic Indexing for Information Filtering". In R.
B. Allen (Ed.) Proceedings of the Conference on Office Information Systems,
Cambridge, MA, 40-47.
 A. De Lucia, F. Fasano, R. Oliveto, and G. Tortora. Recovering Traceability Links in
Software Artifact Management Systems Using Information Retrieval Methods. ACM
Trans. Softw. Eng. Methodol., 16(4), 2007.
 G. Salton, A. Wong, and C. S. Yang. A Vector Space Model for Automatic Indexing.
Commun. ACM, 18(11):613-620, 1975.
