
Team 2

TRACEABILITY WITH LSI
Traceability
 The ability to link between different artifacts
Example artifacts: code, user manuals, design
documentation, development wikis, etc.
 In particular, link code to:
Relevant requirements
Sections in design documents
Test-cases
Other structured and free-text artifacts
 Also, link from requirements, design
documents, etc. to code

A Comparison of Traceability Techniques for Specifications 3


What’s Traceability Good For?
 Program Comprehension
Which code segment implements which specific
requirement and vice-versa
 Impact analysis
Keeping non-code artifacts up-to-date
 Requirement Tracing
Discover what code needs to change to handle
a new requirement
Aid in determining whether a specification is
completely implemented and covered by tests



Challenges
 Scalability
 Large # of artifacts
 Heterogeneity
 Large # of different document formats and programming languages
 Noise
 Free-text information (natural language): conjunctions, prepositions,
abbreviations, etc.
 Some information may be outdated, or just plain wrong
 Prior work:
 Recovering Traceability Links in Software Artifact Management
Systems using information retrieval methods [Lucia et al., 2007]
 Recovering Traceability Links between Code and Documentation
[Antoniol et al., 2002, Deerwester et al., 1990, Marcus and Maletic,
2003]



Example
/*
 * The File interface provides…
 */
public class FileImpl extends FilePOA {
    private String nativefileName;

    /**
     * Creates a new File…
     */
    public FileImpl(String nativePath ...) {
    }

    /**
     * …
     */
    private String f(..) {…}
}



Traceability Link Recovery
The indexer builds two representations in parallel:

Code path: Source Code → Identifier Extraction → Identifier Separation → Text Normalization → List of Identifiers. Identifier separation finds the basic parts of identifiers (e.g., Auto_due into Auto and due).

Document path (the three steps of query extraction): Software Documents → Letter Transformation (capital to small letters) → Stopwords Removal (articles, punctuation, etc.) → Morphological Analysis (plural to singular, infinitive, etc.) → Query.

A classifier compares each document’s query against the identifier index and outputs a scored document list.
 Pre-processing

Text preprocessing example:

Before: “Copyright owners grant member companies of the OMG permission to make a limited …”
After: “copyright owner grant member compani omg permiss make limit …”

• Lower-casing, stop-word and number removal, stemming, etc.
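The normalization steps above can be sketched in a few lines of Python. This is a toy pipeline: the stop-word list and the suffix stripper are illustrative stand-ins for a real stop list and a real stemmer (e.g., Porter’s), which is why its output differs slightly from the slide’s.

```python
import re

# Illustrative stop-word list (a real system uses a much longer one).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for"}

def toy_stem(word):
    # Crude suffix stripper standing in for real morphological analysis.
    for suffix in ("ies", "ion", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(text):
    # Lower-case, keep only alphabetic tokens (drops numbers/punctuation),
    # remove stop words, then stem what remains.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]

result = normalize("Copyright owners grant member companies of the OMG permission")
print(result)
```

Running it shows the same kind of collapse as the slide: inflected forms and stop words disappear, leaving compact index terms.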


Words Extraction
/*
 * The File interface provides…
 */
public class FileImpl extends FilePOA {
    private String nativefileName;

    /**
     * Creates a new File…
     */
    public FileImpl(String nativePath ...) {
    }

    /**
     * …
     */
    private String f(..) {…}
}

↓ words extraction

•Class name
•Public function names
•Public function arguments and return types
•Comments
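A minimal sketch of this extraction step in Python. The regexes and the embedded source string are illustrative assumptions; a real tool would use a proper Java parser rather than regular expressions.

```python
import re

# Java-like source text, abbreviated from the slide's example.
SOURCE = '''
/* The File interface provides... */
public class FileImpl extends FilePOA {
    private String nativefileName;

    /** Creates a new File... */
    public FileImpl(String nativePath) { }
}
'''

# Class names: the word following the `class` keyword.
class_names = re.findall(r"\bclass\s+(\w+)", SOURCE)

# Public method/constructor names: the identifier just before "(" in a
# declaration that starts with `public`.
public_sigs = re.findall(r"\bpublic\s+(?:\w+\s+)*(\w+)\s*\(", SOURCE)

# Comments: everything between /* ... */ (including /** ... */ Javadoc).
comments = re.findall(r"/\*+(.*?)\*/", SOURCE, flags=re.DOTALL)
```

Together these yield the word sources the slide lists: class names, public function names and signatures, and comment text.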
Words Expansion

Before: … NativePath, fileName, delete_all_elements, …
After: … NativePath, Native, Path, fileName, File, Name, …

•Use well-known coding standards to split identifiers into sub-words
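The expansion step can be sketched as a small splitter, assuming the usual coding conventions (CamelCase and underscore separators) the slide refers to:

```python
import re

def expand(identifier):
    # Split on underscores first: "delete_all_elements" -> three parts.
    parts = identifier.split("_")
    subwords = []
    for part in parts:
        # Then split CamelCase boundaries: "NativePath" -> "Native", "Path".
        subwords += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part)
    # Keep the original identifier as a term alongside its sub-words.
    return [identifier] + subwords

camel = expand("NativePath")
snake = expand("delete_all_elements")
```

Both the compound identifier and its parts are indexed, so a query mentioning “path” can still match code that only uses `NativePath`.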


Vector Space Model
 Vector Space Model (VSM) [Salton et al., 1975]
 Each document, d, is represented by a vector of ranks
of the terms in the vocabulary:
vd = [rd(w1), rd(w2), …, rd(w|V|)]
 The query is similarly represented by a vector
 The similarity between the query and document is the
cosine of the angle between their respective vectors

 Assumes terms are independent


 Some terms are likely to appear together
 Terms can have different meanings depending on
context

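As a concrete sketch, VSM ranking with raw term counts. The toy documents are made up; real systems weight terms (e.g., with tf-idf) rather than using counts directly.

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine of the angle between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in set(a) | set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Each document is a vector of term counts over the vocabulary.
docs = {
    "d1": Counter("human machine interface for computer applications".split()),
    "d2": Counter("graph minors and trees".split()),
}
query = Counter("computer interface".split())

# Rank documents by cosine similarity to the query.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
```

Note the limitation the slides point out: `d2` scores exactly zero against this query even if it were conceptually relevant, because it shares no index terms with it.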
Vector Space Model
■ Classic IR might lead to poor retrieval due to:
◆ unrelated documents might be included in the
answer set
◆ relevant documents that do not contain at least
one index term are not retrieved
◆ Reasoning: retrieval based on index terms is
vague and noisy

 The term-document matrix has a very high
dimensionality
Are there really that many important features
for each document and term?

Example
 from Lillian Lee

Document 1: auto, engine, bonnet, tires, lorry, boot
Document 2: car, emissions, hood, make, model, trunk
Document 3: make, hidden, Markov, model, emissions, normalize

Synonymy: documents 1 and 2 share no terms, so their cosine is small, yet they are related.
Polysemy: documents 2 and 3 share terms (make, model, emissions), so their cosine is large, yet they are not truly related.
Latent Semantic Indexing
■ The user information need is more related to concepts
and ideas than to index terms
■ A document that shares concepts with another
document known to be relevant might be of interest

 LSI [Deerwester et al., 1990]


 Enhance the semantics of long descriptions.
 Dimensionality reduction can improve retrieval effectiveness
 Dimensionality reduction can uncover surprising relationships!

■ The key idea is to map documents and queries into a lower


dimensional space (i.e., composed of higher level concepts
which are in fewer number than the index terms)
■ Retrieval in this reduced concept space might be superior to
retrieval in the space of index terms

Lecture 12 Information Retrieval 17


Latent Semantic Indexing

Singular Value Decomposition
 unique mathematical decomposition of a
matrix into the product of three matrices:
two with orthonormal columns
one with singular values on the diagonal
 tool for dimension reduction
 finds optimal projection into low-dimensional
space



Singular Value Decomposition

The t × d term-document matrix w_td factors as M = T · S · Dᵀ, where T is t × r, S is the r × r diagonal matrix of singular values, and Dᵀ is r × d (r = rank of M).

 Compute the singular value decomposition of a
term-document matrix
D, a representation of M in r dimensions
T, a matrix for transforming new documents
S gives the relative importance of the dimensions
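A sketch of this decomposition with NumPy; the small term-document matrix is made up for illustration.

```python
import numpy as np

# A 4-term x 3-document matrix of term counts (illustrative values).
M = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0]])

# M = T @ diag(S) @ Dt, with T and Dt.T having orthonormal columns and
# S holding the singular values in decreasing order.
T, S, Dt = np.linalg.svd(M, full_matrices=False)
reconstructed = T @ np.diag(S) @ Dt

# Dimension reduction: keep only the k largest singular values, giving
# the best rank-k approximation of M.
k = 2
M_k = T[:, :k] @ np.diag(S[:k]) @ Dt[:k, :]
```

Keeping only the top-k dimensions is exactly the “optimal projection into low-dimensional space” the slide mentions (optimal in the least-squares sense).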
LSI Term matrix T
 T matrix
gives a vector for each term in LSI space
multiply by a new document vector to “fold in”
new documents into LSI space
 LSI is a rotation of the term-space
original matrix: terms are d-dimensional
new space has lower dimensionality
dimensions are groups of terms that tend to co-occur in the same documents
○ synonyms, contextually-related words, variant endings

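“Folding in” a new document can be sketched as projecting its raw term vector q into the k-dimensional space via q̂ = qᵀ · T_k · S_k⁻¹, reusing the existing T and S (the example matrix below is made up):

```python
import numpy as np

# Existing collection: 4 terms x 3 documents (illustrative values).
M = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0]])
T, S, Dt = np.linalg.svd(M, full_matrices=False)

k = 2
T_k, S_k = T[:, :k], S[:k]

# A new document arrives as a raw term-count vector over the same 4 terms.
q = np.array([1.0, 1.0, 0.0, 0.0])

# Fold it into LSI space without recomputing the SVD.
q_hat = q @ T_k @ np.diag(1.0 / S_k)
```

The folded-in vector lives in the same k-dimensional space as the columns of D, so it can be compared against existing documents by dot-product or cosine.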


Document matrix D
D matrix
coordinates of documents in LSI space
same dimensionality as T vectors
can compute the similarity between a term
and a document

http://lsi.research.telcordia.com/



Dimension Reduction



Improved Retrieval with LSI
 New documents and queries are "folded in"
multiply vector by T
 Compute similarity for ranking as in VSM
compare queries and documents by dot-product
 Improvements come from
reduction of noise
no need to stem terms (variants will co-occur)
no need for stop list
○ stop words are used uniformly throughout collection, so
they tend to appear in the first dimension
No speed or space gains, though…



Computing an Example
Technical Memo Titles
c1: Human machine interface for ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user perceived response time to error measurement

m1: The generation of random, binary, ordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey
Computing an Example

The 12 index terms (those occurring in at least two titles) and the nine titles form a 12 × 9 term-document matrix M. (The slides show M, its SVD factors, and the rank-2 reconstruction as figures.)

With the raw matrix M:
r(human, user) = -.38    r(human, minors) = -.29

With the rank-2 reconstruction (new M):
r(human, user) = .94    r(human, minors) = -.83
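The example can be reproduced end-to-end in NumPy, assuming the standard 12 × 9 term-document matrix from Deerwester et al. (rows are the index terms occurring in at least two titles, columns are c1–c5, m1–m4):

```python
import numpy as np

terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
M = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)

def term_corr(A, t1, t2):
    # Pearson correlation between two term rows across the documents.
    i, j = terms.index(t1), terms.index(t2)
    return np.corrcoef(A[i], A[j])[0, 1]

T, S, Dt = np.linalg.svd(M, full_matrices=False)
M2 = T[:, :2] @ np.diag(S[:2]) @ Dt[:2, :]   # rank-2 approximation

before = term_corr(M, "human", "user")   # negative: never co-occur
after = term_corr(M2, "human", "user")   # positive after reduction
```

Even though “human” and “user” never appear in the same title, the rank-2 space pulls them together through shared context, which is the point of the slide’s numbers.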


Incremental LSI
 Allows fast, low-cost computation of
traceability links by reusing the results of a
previous LSI computation.
 Avoids the full cost of LSI computation for
traceability link recovery by analyzing the changes to
documentation and source code between versions
of the system, and then deriving the changes to
the set of documentation-to-source-code
traceability links.
 Evaluation
Experiment of LEDA
 Trace links from the manual to the source
code
 Find out which parts of the source code are
described by a given manual section.
IR Quality Measures
 Precision @ n: the fraction of the top n retrieved
documents that are relevant
 Recall @ n: the fraction of all relevant documents
that appear in the top n results
 Average precision: the mean of precision @ k over
the ranks k at which relevant documents are retrieved
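The three measures can be sketched directly; the ranked list and relevant set below are made-up examples.

```python
def precision_at(ranked, relevant, n):
    # Fraction of the top-n retrieved documents that are relevant.
    hits = sum(1 for d in ranked[:n] if d in relevant)
    return hits / n

def recall_at(ranked, relevant, n):
    # Fraction of all relevant documents appearing in the top n.
    hits = sum(1 for d in ranked[:n] if d in relevant)
    return hits / len(relevant)

def average_precision(ranked, relevant):
    # Mean of precision@k taken at each rank k where a relevant
    # document is retrieved, averaged over all relevant documents.
    precisions = [precision_at(ranked, relevant, k + 1)
                  for k, d in enumerate(ranked) if d in relevant]
    return sum(precisions) / len(relevant) if relevant else 0.0

ranked = ["d3", "d1", "d7", "d2"]   # retrieval order (best first)
relevant = {"d1", "d2"}             # ground-truth relevant set
```

Here the relevant documents sit at ranks 2 and 4, so precision@2 and recall@2 are both 0.5, and average precision is (1/2 + 2/4) / 2 = 0.5.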


Experiment of LEDA

Figure 2. Time Cost with iLSI, threshold=0.7
Conclusions
 Latent semantic indexing provides an
interesting conceptualization of the IR
problem
 It reduces the complexity of the
underlying representational framework,
which might be explored, for instance,
for interfacing with the user
Conclusions
 Traceability between code and
documentation in real-world systems can be
recovered effectively with IR techniques.
 On realistic datasets, the Vector Space
Model, which does not perform
dimensionality reduction, was shown to
be the most effective.



References
 Hsinyi Jiang, Tien N. Nguyen, Ing-Xiang Chen, Hojun Jaygarl, Carl K. Chang:
Incremental Latent Semantic Indexing for Automatic Traceability Link Evolution
Management. ASE 2008: 59-68
 G. Antoniol, G. Canfora, G. Casazza, A.D. Lucia, and E. Merlo. Recovering
Traceability Links Between Code and Documentation. IEEE Trans. Softw. Eng. ,
28(10):970-983, 2002.
 Dumais, S. T., Furnas, G. W., Landauer, T. K. and Deerwester, S. (1988), "Using
latent semantic analysis to improve information retrieval." In Proceedings of CHI'88:
Conference on Human Factors in Computing, New York: ACM, 281-285.
 Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R.A.
(1990) "Indexing by latent semantic analysis." Journal of the Society for Information
Science, 41(6), 391-407.
 Foltz, P. W. (1990) "Using Latent Semantic Indexing for Information Filtering". In R.
B. Allen (Ed.) Proceedings of the Conference on Office Information Systems,
Cambridge, MA, 40-47.
 A. De Lucia, F. Fasano, R. Oliveto, and G. Tortora. Recovering Traceability Links in
Software Artifact Management Systems Using Information Retrieval Methods. ACM
Trans. Softw. Eng. Methodol., 16(4), 2007.
 G. Salton, A. Wong, and C. S. Yang. A Vector Space Model for Automatic Indexing.
Commun. ACM, 18(11):613-620, 1975.
