You are on page 1of 2

International Journal of Scientific Research Engineering & Technology (IJSRET), ISSN 2278 0882

Volume 5, Issue 1, January 2016

Document Clustering for Forensic Analysis


Nita Gawali1, Smita Saravade2, Sharwari Dadali3
1

(Department of Computer Engineering, Savitribai Phule Pune University, Pune


Email: gawalinita28@gmail.com)
2
(Department of Computer Engineering, Savitribai Phule Pune University, Pune
Email: smitasaravade123@gmail.com)
3
(Department of Computer Engineering, Savitribai Phule Pune University, Pune
Email : sherwaridadali12@gmail.com)

ABSTRACT
In computer forensic analysis, thousands of files are
usually examined. These files consist of unstructured
text, whose analysis by computer examiners is difficult
to be performed. In particular, algorithm for clustering
documents can facilitate the discovery of new and useful
knowledge from the document under analysis. We
propose the system that applies document clustering
algorithm to forensic analysis of computers sized in
police investigation. In this proposed system we will use
k-means algorithm
Keywords Clustering, Forensic computing, Text
mining

I.

INTRODUCTION

There is large amount of data that has a direct impact in


Computer Forensics, which can be broadly defined as
the discipline that combines elements of law and
computer science to collect and analyze data from
computer systems in a way that is admissible as
evidence in a court of law. It usually involves examining
thousands of files per computer. This activity exceeds
the experts ability of analysis and interpretation of data.
Clustering algorithms are typically used for analysis of
very large data set, where there is no prior or little
knowledge about the data. This is precisely the case in
several applications of Computer Forensics, including
the one used in our work. Our datasets consist of
unlabeled objectsthe classes or categories of
documents that can be found are a priori unknown. Even
assuming that labeled datasets could be available from
previous analyses which can help us in clustering.
Clustering algorithms are used which are capable of
finding patterns from computers for enhancing our
analysis.
Clustering is nothing but grouping of similar data objects
in one cluster that is objects in one cluster are similar to

each other than the objects in another cluster. Domain


experts have limited time for performing examinations
or doing analysis of data. Clustering prioritize the
relevant documents required for analysis. Clustering
helps to speed up the computer forensic process.

II.

PREPROCESSING STEPS

This step includes removal of stop words (prepositions,


pronouns, articles, and irrelevant documents metadata).
Using statistical approach for text mining; documents
are represented in vector space model.
In this model, each document is represented by a vector,
containing the frequencies of occurrences of words. We
use term variance (TV) which is dimensionality
reduction technique use to increase efficiency and
effectiveness of clustering algorithm.TV selects number
of attributes having greatest variances over the
documents using cosine-based distance measures. These
later used to calculate distances between files
(documents) names only.

III.

CLUSTERING ALGORITHM

In our case we are using k-means algorithm for


clustering purpose.
Algorithm:
Randomly selects numbers of cluster k
Repeat
Allocate or reallocate the number of objects to the
cluster with minimum distance
Calculate the mean
Until the number of objects in the cluster remain same.
K-means algorithm is sensitive to the initialization of
number of clusters.

IV.

CONCLUSION

We present an approach that applies document clustering


method to forensic analysis of computers seized in
police investigations. There are several practical results

www.ijsret.org

29

International Journal of Scientific Research Engineering & Technology (IJSRET), ISSN 2278 0882
Volume 5, Issue 1, January 2016

based on our work which are extremely useful for the


experts working in forensic computing department.

REFERENCES
[1] S. Decherchi, S. Tacconi, J. Redi, A. Leoncini, F.
Sangiacomo, and R. Zunino, Text clustering for digital
forensics analysis, Computat.
Intel. Security Inf. Syst., vol. 63, pp. 2936, 2009.
[2] J. F. Gantz, D. Reinsel, C. Chute, W. Schlichting, J.
McArthur, S. Minton, I. Xheneti, A. Toncheva, and A.
Manfrediz, The expanding digital universe: A forecast
of worldwide information growth through 2010, Inf.
Data, vol. 1, pp. 121, 2007.
[3] B. S. Everitt, S. Landau, and M. Leese, Cluster
Analysis. London, U.K.: Arnold, 2001.
[4] A. K. Jain and R. C. Dubes, Algorithms for
Clustering Data. Englewood Cliffs, NJ: Prentice-Hall,
1988.
[5] L. Kaufman and P. Rousseeuw, Finding Groups in
Gata: An Introduction to Cluster Analysis. Hoboken, NJ:
Wiley-Interscience, 1990.
[6] R. Xu and D. C.Wunsch, II, Clustering. Hoboken,
NJ: Wiley/IEEE Press, 2009.
[7] A. Strehl and J. Ghosh, Cluster ensembles: A
knowledge reuse framework for combining multiple
partitions, J. Mach. Learning Res., vol. 3, pp. 583617,
2002.
[8] E. R. Hruschka, R. J. G. B. Campello, and L. N. de
Castro, Evolving clusters in gene-expression data, Inf.
Sci., vol. 176, pp. 18981927,
2006.
[9] B. K. L. Fei, J. H. P. Eloff, H. S. Venter, and M. S.
Oliver, Exploring forensic data with self-organizing
maps, in Proc. IFIP Int. Conf. Digital Forensics, 2005,
pp. 113123.
[10] N. L. Beebe and J. G. Clark, Digital forensic text
string searching: Improving information retrieval
effectiveness by thematically clustering search results,
Digital Investigation, Elsevier, vol. 4, no. 1, pp. 4954,
2007.

www.ijsret.org

30

You might also like