You are on page 1of 7

Available online at www.sciencedirect.

com

ScienceDirect
Procedia Technology 11 (2013) 1181 1187

The 4th International Conference on Electrical Engineering and Informatics (ICEEI 2013)

An Intelligent Document Clustering Approach to Detect


Crime Patterns
Qusay Bsoul, Juhana Salim *, Lailatul Qadri Zakaria
Knowledge Technology Research Group, Universiti Kebangsaan Malaysia, 43600 Bangi Selangor, Malaysia

Abstract

A wide number of reports and news on crimes, increasingly conducted almost every day, have resulted in making detection of
such crimes more difficult if not complex. Therefore, the need for detecting and identifying such crimes emerges as a necessary
way of detecting and identifying such crime patterns on the news. Document Clustering have been increasingly becoming an
important task for obtaining good results with the unsupervised learning methods. It aims to automatically group similar
documents in one cluster using different types of extractions and cluster algorithms. There are ongoing works done to improve
Document Clustering techniques such as Extractions and Clustering approaches to overcome the difficulty in designing a general
purpose document clustering for crime investigation and the ill posed problem of extraction and clustering. This paper discusses
two major sequential stages in Document Clustering Extraction Features and Clustering Algorithms as well as the major
challenges and the key issues in designing extraction features and clustering algorithms. In addition, the following approach
assists the law enforcement officers and detectives to enhance performance and speed up the process of solving crimes.

Authors. Published
2013 The Authors. Publishedby
byElsevier
ElsevierB.V.
Ltd. Open access under CC BY-NC-ND license.
Selection
Selection and
and peer-review
peer-reviewunder
underresponsibility
responsibilityofofthe
theFaculty
Facultyofof
Information Science
Information & Technology,
Science Universiti
& Technology, Kebangsaan
Universiti Kebangsaan
Malaysia.
Malaysia.

Keywords: Intelligent Clustering; K-means; Affinity Propagation algorithm; Lemmatization Algorithm; Crime Detection

1. Introduction

Due to its increasing social importance, the crime domain has been selected as an area for application in this
work. Tracing back the history of tracking and solving crimes, it is evident that such important domain has been

*
Corresponding author.
E-mail address: js@ftsm.ukm.my

2212-0173 2013 The Authors. Published by Elsevier Ltd. Open access under CC BY-NC-ND license.
Selection and peer-review under responsibility of the Faculty of Information Science & Technology, Universiti Kebangsaan Malaysia.
doi:10.1016/j.protcy.2013.12.311
1182 Qusay Bsoul et al. / Procedia Technology 11 (2013) 1181 1187

privileged by a wide number of experts and specialists in criminal justice and law enforcement. Recently, due to the
rapid technological advances including the increasing application of the computerized systems to trace and track
crimes, computer data analysts have taken practical steps in assisting the law enforcement officers and detectives to
enhance the process of solving crimes and increase its performance.
For the last few recent years, there has been an increasing interest on the part of researchers in detecting
and tracking crime news stories based on clustering methods. Such an increasingly emerging interest is attributed to
the social dilemma and epidemic disease represented and reflected by the occurrence of social crimes which
poses tremendous threats to societies [1-2]. Since large amount of news concerning stories in general and others
related to crimes news in particular are being increasingly accumulated like flood over the web. Many challenges
are increasingly encountered by the decision makers in law enforcement departments in detecting, identifying and
tracing or tracking the crimes events [1-3]. Thus, tracking the social crimes or events according to their time line is
becoming a tedious task. Such difficult challenges and complexities in organizing the news of crime stories are
generated from a huge dimensionality of crimes data, which usually refers to the highly diverse embedded modalities
such as criminal data and weapon data [4]. In other words, law enforcement officers and detectives are provided by
these modalities with justified explanations about the international or world view for crime patterns by carrying out
identification of the relations between local patterns [4-5].
Document Clustering is considered as one of the most commonly used methods in detecting topics/events or
types of crime documents [6], and a method in which document clustering has three main processes [1-7]. The first
process is preprocessing of documents to remove unimportant words and symbols from the document of crime. The
second process is representation of documents crimes to extract the most important information from the document
of crime and show the similarity among these documents. The last process of document clustering consists in
applying the clustering algorithm to the groups of documents of topics/events or types crime based on the
similarities among the documents.
This study looks forwards to providing the restraints of Document Clustering in two stages (Extraction of
Document and cluster algorithms) with regard to the crime domain. Document Clustering has been studied for ages,
however it is still a hot domain in the research area, and therefore it requires more improvements [8]. As to section 2,
it examines two out of the three main processes of document clustering. In section 3, we describe our suggested
system of Crime Document Clustering and lastly, section 4 presents the conclusion.

2. Process of document clustering

This section presents a comprehensive review of the previous related work on the extraction terms and
cluster algorithms of Document Clustering. A detailed review of previous work related to extraction terms and
cluster algorithms on Document Clustering is provided in the first and second sub-sections. In addition, in this
section, the latter has limitations in two out of the three processes of Document Clustering.

2.1. Extraction terms

Each crime document in this stage is represented by a set of terms which are called features. Selecting the
features from the indexed words is considered as a real challenge. This is because more than one feature or
combinations of features are considered in this stage. A wide body of research has been carried out in this particular
extraction domain. The focus of some researchers has been placed on extracting information from the terms or
features indicating a certain crime by employing name entity [9], bag of word [10-11], n-gram [12], Frequent
Word and Frequent Word Meaning [13], Concept Weighting ontology [14], Word Net Lexical ontology [15] and
using core semantic features [16] to make the document clustering better and more effective. Concerning this,
Zhiwei et.al [8] conducted a study in which they compared between bag of words and name entity. Their findings
revealed that the results gained through using the name entity approach were better and more effective than those
results generated from the data when using bag of word. In addition, yanjunli et.al [13] used Frequent Word and
Frequent Word Meaning to compare them with bag of word. Their result showed that Frequent Word and Frequent
Word Meaning were better than bag of word. On the other hand, Masnizah et.al [6] distinguished between Name
Entity and bag of word by users. Their findings revealed that the name entity approach is better than bag of word.
Qusay Bsoul et al. / Procedia Technology 11 (2013) 1181 1187 1183

Samah Fodeh et.al [16] carried out an experiment using ontology and semantic features in which all the polysemous
and synonymous nouns were extracted from the documents and a unique approach that was capable of permitting
them to measure the information gain in disambiguating these nouns in an unsupervised learning setting was
developed by them. The purpose of developing this approach was to identify the core subset of semantic features
representing a text corpus. Thus, based on this experiment, the results revealed that by employing core semantic
features for clustering, it is more possible to reduce the number of features by 90% or more. At the same time, it is
possible to produce clusters that capture the major themes in a text corpus. According to (Hmway Hmway and Thi
Thi [14] and Tarek F. Gharib et.al [15]), when compered Concept Weighting ontology and Word Net Lexical
ontology with bag of word, the performance of Concept Weighting ontology and Word Net Lexical ontology was
much better than bag of word.
Table 1. Summary of previous work on extraction
Authors Dataset used Stat-of-art outperform method

[8] Financial news BOW NER


[13] news BOW Frequent word, frequent word meaning
[6] TDT BOW NER
[16] Divers dataset Ontology and semantic Core semantic features
[14] news BOW Weighting ontology
[15] news BOW Word-net ontology

Lemmatization algorithm is an idea to extract information from sentences based on certain rules of the
sentence. For instance, Eiman Al-shamari and Lin [17] generated a new technique which was published as patent
application publication called lemmatization algorithm for Arabic language. It aims to extract nouns and verbs from
Arabic documents based on the preposition words, in addition to some rules related to other linguistic elements such
as the definite article the. This algorithm catches important words from the two lists of prepositions; the first one
includes proceeding verbs and the other one involves proceeding nouns. Based on our knowledge, it is very
important to extract words such as nouns and verbs related to topics/events of crimes, but this importance is more
stressed or emphasized especially in Document Clustering.

2.2. Cluster Algorithms

Many researchers have carried out empirical work on algorithms of cluster. For example, Dai et.al [7]
improved Agglomerative Hierarchical Clustering by taking into account the importance of the title part in a story.
That was in cases when the occurrence of the term was found in the title, and it was assigned as higher weight. The
findings showed that the proposed method was effective in clustering documents in financial news. However,
some researchers have focused on the clustering of topic or events. For Fernando Bao et.al [18], they used self-
organizing map (SOM) to cluster IRES image of eye and compared it with k-means, where the performance of
SOM was better than k-means. Until Sheng-Tun et.al [19], they used a fuzzy self-organizing map (FSOM) network
to detect and analyze the patterns of crime trends from temporal crime activity data. Other researchers such as
Fredrik and James [20], compared between scalable k-means, complete k-means and k-means for knowledge and
data discovery (KDD) of data set. Their results proved that k-means is better than other algorithms. Bouras and
Tsogkas [21] used the clustering methodologies including: single, maximum, linkage and centroid linkage
hierarchical clustering, as well as regular k-means, k-medians and k-means++. Based on their findings, it was found
that using the k-means not only generating the best result at the level of internal measurement of clustering index
function, but also leading to the best results on a real users experimentation. In addition, when comparing between
k-means and single pass algorithms and other algorithms of clustering on topics of news, Taeho Jo [22] revealed that
the k-means was better than single pass clustering. As suggested by Zhiwei Li et al. [8], the estimation of initial
number of events depends or is based on the article count-time distribution in their probabilistic model where the
estimation of events number represents the initial (K) clusters. Masnizah et.al [6] tried to produce new ways in
1184 Qusay Bsoul et al. / Procedia Technology 11 (2013) 1181 1187

choosing optimal initial centroid for each cluster, where their results achieved higher F- measure. In examining the
hybrid algorithm of clustering, Xiangying Dai, et.al [23] proposed a two-layer text clustering approach for detecting
retrospective news events by using the Affinity Propagation clustering (AP) as a first layer. Then, the researchers
conducted a second feature selection on the generated clusters of (AP) cluster, and they also adopted the usual
agglomerative hierarchical clustering for the purpose of generating the ultimate news events. Finally, the researchers
chose the traditional agglomerative hierarchical clustering (AHC) and the classic K-Means as comparative methods.
The findings revealed that the proposed method achieved the highest precision measure, moreover,it was followed
by the Affinity Propagation (AP) clustering and the k-means and (AHC) clustering respectively in their ranks. As far
as the recall measure is concerned, it was found that the proposed method and k-means obtained the highest result
followed by the (AHC) clustering and (AP) clustering. Velmurugan and Santhanan [24], compared between three
algorithms of cluster for geographic map data set. Their result showed that k-means was good with small data, k-
medoids was good with large data and fuzzy c-means was between k-means and k-medoids.
Table 2. Summary of previous work on cluster algorithms
Authors Dataset used Stat-of-art algorithm outperform clustering algorithm

[7] Financial news AHC Enhance AHC


[17] IRES k-means SOM
[18] Data series of crimes N/A FSOM
[19] KDD Scalable, complete k-means k-means
[20] web Single, maximum, centroid AHC and k-means
k-medians, k-means++
[21] news Single pass k-means
[8] News Probabilistic model cluster Enhance Probabilistic model
[6] Crimes Single pass, k-means Enhance k-means
[22] News k-means, AHC, AP APAHC
[23] Geographic map Fuzzy c-means k-means, k-medoids

2.3. Limitations of feature extraction and cluster algorithm

In general, the main limitations of the previous work which are summarized in Table 1, 2 are presented as follows:

x The researchers assumed that the name entity is good for extracting information from crime documents without
detecting the weakness of extracting the specific important information from crime documents such as extracting
nouns and verbs, and simultaneously avoiding extracting the unimportant features [25].
x The researchers assumed that k-means has the best algorithm [26], in carrying out a comparison between the k-
means with another algorithm without a hybrid and other algorithms of cluster without detecting the weakness of
k-means for crime documents. They, furthermore, provided a solution for these weaknesses by hybrid [6] with
other algorithms and compared such applied algorithms with another one to increase their performance.
x In general, all the proposed works solved the weakness of clustering document in terms of enhancing part of the
process, where the output of each stage affected on the accuracy of the next process [25]. So, that needs to detect
the problems from down to up which means from algorithm of clustering which is used to extracting the terms. As
mentioned above, there are three processes of clustering documents, from down cluster algorithm up to
extracting the terms from crime documents. Thus, the idea of down to up guides us to describe or identify the
problem of extracting information from crime documents when solving the weakness of k-means cluster
algorithm. However, it is so easy to detect the problem of extracting information and know the information that
should be avoided.
Qusay Bsoul et al. / Procedia Technology 11 (2013) 1181 1187 1185

2.4. Need

As far as we know, all the existing methods of detecting and identifying for document clustering have assumed
that all the crime documents are proved, however, there are two problems related to cluster algorithm, in which the
performance of k-means clustering highly depends on selecting the number of clusters and the initial seed centroids.
Therefore, it is expected that the result of this method is often suboptimal [27], as well as a problem related to
extraction features [25]. All in, each of these two problems impacts on other processes [25]. And therefore to
overcome these obstacles, we provide alternatives to help detect and identify the groups of topics/events or types of
crimes in Document Clustering with high performance and more effectiveness by using our proposed as shown in
next section.

3. Data collection and performance measure

3.1. Data collection and performance measure

For this research, the corpora have been collected from Bernama news*, and the dataset test of six categories of
topics, including Canny Ong, mona fandy, noritta samsudin, nurin jazlin, sharlinie mohd nashar and sosilawati
articles . In addition , the dataset test of ten types of crimes, including Traffic Violation, theft, sex crime, murder,
kidnap, fraud, drugs, cybercrime and arson gang articles. These topics and types consist of 247, 2400 documents that
are used as the testing dataset. In addition, this study employs an overall purity measure and an overall F-measure in
order to measure the external quality. These two measures are popular document clustering measures [28-29]. The
existence of higher overall purity and F-measures gives the best cluster.

3.2. Proposed of extraction features

As far as it is mentioned in section 2.3 the weakness of extraction, Name Entity is suggested to be used [30] to
extract information based on three questions namely; who, where and when, these questions play an important role
to be extracted, because most of these features are usually used to describe crime in general, we also would like to
do nouns and verbs extraction by using Word-Net to provide the most important features that could be used to
describe specific types or events of crimes (as show in Fig.1). Events or types of crimes could be identified by
looking at nouns which could be used to identify location, names, topics, date, etc..., as well as verbs to identify the
events and types of event. In addition will be remove redundant features extracted from NE with Verbs and Nouns.

3.3. Proposed of Affinity Propagation algorithm

In order to overcome the weakness of k-means clustering, we would used Affinity propagation algorithm [31] to
generate number of clusters as illustrated in Fig.2, also to create local optimal initial centroid for each clusters. In
addition, we will be detected unimportant features by studying the shared terms between wrong documents in wrong
cluster and correct document as shown in Fig.2.

Name Entity
Nuons

Verbs

Most Important
rtant
tan Feature of
Crime

Fig. 1. Our proposed of extract most important features of crime.


1186 Qusay Bsoul et al. / Procedia Technology 11 (2013) 1181 1187

detect
wronge
document by
extractio study share
n nouns all terms
and with their
verbs wrong
Decum also (AP) centroid and
ent of Name algorit corect
crime Entity hm centroid

pre- represen k- avoid


proces tation by means unimpo
sing used cluste rtant
TFIDF ring terms
and
redo
from
TFIDF

Fig. 2. Our proposed crime document clustering of enhances clustering and avoids unimportant features.

4. Conclusion

This paper detects and identifies some limitations in Crime Document Clustering. First off, it addresses the
fault detection and identification in the k-means algorithm, then, it examines the weakness of extraction terms from
documents as it is stated above. Therefore, this study aims to enhance the reliability of Document Clustering of
crime document by efficient k-means as well as the extraction features of crime document. Furthermore , it is used
for crime document clustering, and its results are the best testimony for its efficiency as it aims to enhance the k-
means algorithm for Document Clustering as well as the extracting of information which group topics/events of
crimes can outperform the original Document Clustering and other Document Clustering based on two criteria of
time and performance . Besides, we look forward that our suggestion of crime document clustering enhances the
performance and the effectiveness.

References

[1] Chandra. A, Gupta. B and Gupta. C, A multivariate time series clustering approach for crime trends prediction. Proceeding of International
Conference on Systems, Man and Cybernetics, SMC; 2008. p. 892-896.
[2] Alruily, Ayesh, Al-Marghilani. Using Self Organizing Map to Cluster Arabic Crime Documents, Proceedings of the International
Multiconference on Computer Science and Information Technology; 2010. p. 357363.
[3] Chen A, Zeng B, Atabakhsh C, Wyzga D, Schroeder, E., COPLINK: managing law enforcement data and knowledge, Communications of
the ACM 2003; 46(1): 28-34.
[4] Boo A, Alahakoon B.,Mining Multi-modal Crime Patterns at Different Levels of Granularity Using Hierarchical Clustering. CIMCA 2008,
IAWTIC 2008, and ISE 2008. p. 1268-1273.
[5] Bache A, Crestani B., Estimating real-valued characteristics of criminals from their recorded crimes; 2008. p. 1385-1386.
[6] Mohd, Bsoul, Ali, Noah, Saad, Omar And Ab, Aziz. Optimal Initialcentroid In K-Means For Crime Topic. Journal of Theoretical and
Applied Information Technology 2010; 45(1): 19-26.
Qusay Bsoul et al. / Procedia Technology 11 (2013) 1181 1187 1187

[7] Dai X, Chen Q, Wang X, Xu J. Online topic detection and tracking of financial news based on hierarchical clustering. Proceeding of
International Conference on Machine Learning and Cybernetics; 2010. p. 3341-3346.
[8] Zhiwei Li, Wang, Mingjing Li, Wei-Ying. A Probabilistic Model for Retrospective News Event Detection. Proceedings of the 28th annual
international ACM SIGIR conference on Research and development in information retrieval; 2005. p.106-113.
[9] Kumaran. G. and Allan. J. Text classification and named entities for new event detection. Proceedings of the 27th annual international
ACM SIGIR conference on Research and development in information retrieval; 2004.
[10] Can. F, , New event detection and topic tracking in Turkish, Journal of the American Society for Information Science and Technology2010;
61: 802-819.
[11] Chen H, Zeng D, Atabakhsh H, Wyzga W, Schroeder J. COPLINK: managing law enforcement data and knowledge, Communications of
the ACM 2003; 46(1):28-34.
[12] Mustafa, H. and Al-Radaideh.Q, Using N-grams for Arabic text searching, J. Am. Soc. Inf. Sci. Technol. 2004; 55(11): 1002-1007.
[13] Lia, Chungb, Holtb, Text document clustering based on frequent word meaning sequences, 8th International Conference on Enterprise
Information Systems (ICEIS' 2006); 2006. p. 381404.
[14] Tar and Nyunt. Enhancing Traditional Text Documents Clustering based on Ontology. International Journal of Computer Applications 2011;
33(10) : 38-42.
[15] Gharib, Fouad, Mashat, Bidawi. Self Organizing Map -based Document Clustering Using WordNet Ontologies, International Journal of
Computer Science Issues 2012; 9(1-2) : 88-95.
[16] Fodeh, Punch and Ning Tan. On ontology-driven document clustering using core semantic features, Journal of Knowl Inf Syst, Springer-
Verlag London; 2011.
[17] Al-Shammari, patent application publication of lemmatizing, stemming, and query expansion method and system,
Pub.no.:US 2010/0082333 A1, Pub.Date:
Apr.1, http://www.google.com/patents/US20100082333?printsec=abstract&source=gbs_overview_r&cad=0#v=onepage&q&f=false, 2010.
[18] Bao, Lobo, Painho , Self-organizing Maps as Substitutes for K-Means Clustering, Computational Science ICCS Lecture Notes in
Computer Science 2005; 3516: 476-483.
[19] Li, Kuo, Tsai. An intelligent decision-support model using FSOM and rule extraction for crime prevention, Expert Systems with
Applications, Elsevier 2010; 37(10): 71087119.
[20] Farnstrom and Lewis, Fast. Single-pass K-means algorithms, www.citeulike.org/user/zador/article/1772993 ; 2007.
[21] Bouras C, Tsogkas V. Assigning Web News to Clusters. Proceedings of Conference on Internet and Web Applications and Services; 2010.
p. 1-6.
[22] Taeho J, Clustering News Groups using Inverted Index based NTSO, NDT. First International Conference on Networked Digital
Technologies; 2009. p. 1-7.
[23] Dai, He, Sun. A Two-layer Text Clustering Approach for Retrospective News Event Detection. International Conference on Artificial
Intelligence and Computational Intelligence, IEEE computer security; 2010. p. 364-368.
[24] Velmurugan .T. and Santhanan .T. A survey of partition based clustering algorithms in data mining: An experimental approach, Information
Technology Journal 2011; 10 (3): 487-484.
[25] Ramadan and Mohd. A Review of Retrospective News Event Detection. International Conference on Semantic Technology and
Information Retrieval, Putrajaya, Malaysia; 2011. p. 209-214.
[26] Jain. Data clustering: 50 years beyond K-means, Journal of Pattern Recognition Letters, 2009; 136-713.
[27] Aouf M, Lyanage L, Hansen S. Review of data mining clustering techniques to analyze data with high dimensionality as applied in gene
expression data. Proceeding of International Conference on Service Systems and Service Management; 2008. p. 1-5.
[28] Jain.K. and Dubes.C. Algorithms for clustering data, Ed;1988.
[29] Steinbach, Karypis,M, Kumar,G. A comparison of document clustering techniques, In: KDD Workshop on Text Mining; 2000. p. 525-526.
[30] Chau,M. Xu.J, and Chen,H. Extracting Meaningful Entities from Police Narrative Reports.Proceedings of the 2002 Annual National
Conference on Digital Government Research; 2002. p. 1-5.
[31] Dueck and Frey. Non-PHWULFDIQLW\SURSDJDWLRQIRUXQVXSHUYLVHGimage categorization. IEEE 11th International Conference of Computer
Vision, ICCV; 2007. p.1-8.

You might also like