
Intelligent Data Mining in Autonomous Heterogeneous Distributed Bio Databases

Azra Shamim
Computer Science Department
COMSATS Institute of Information
Technology
Islamabad, Pakistan
azrashamim@comsats.edu.pk
Maqbool Uddin Shaikh
Computer Science Department
COMSATS Institute of Information
Technology
Islamabad, Pakistan
maqboolshaikh@comsats.edu.pk
Saif Ur Rehman Malik
Computer Science Department
COMSATS Institute of Information
Technology
Islamabad, Pakistan
saifurrehman@comsats.edu.pk


Abstract- In the past few years there has been a revolutionary
change in data mining and bioinformatics. Data mining
techniques and tools play an important role in the field of
bioinformatics, where they are very useful for evaluating and
analyzing biomedical data. In this paper the authors propose a
framework for an intelligent data mining system for bio
databases. The system allows scientists to apply bio data
analysis and data mining techniques intelligently to bio data
as well as to their proprietary data. The system first extracts
the relevant, useful, valid and actionable data from bio
databases, which are often autonomous, heterogeneous and
distributed in nature. The extracted data is preprocessed.
After preprocessing, data mining techniques are applied to the
preprocessed data, local data and proprietary data. The mined
knowledge is then integrated with expert system knowledge to
assist researchers and scientists in their research work and
crucial decision making.

Keywords- Autonomous heterogeneous distributed bio
databases; Data Mining; Expert System; Knowledge base

I. INTRODUCTION
In the past few years there has been a revolutionary change in
biomedicine, bioinformatics, bioengineering and
biotechnology. With the development of information
technology, more and more bio data has been collected through
experiments in the fields of bioinformatics, bioengineering
and biotechnology. This huge amount of bio data is stored in
distributed databases (bio databases) and data warehouses (bio
warehouses). These bio databases are located in different
geographical areas and are thus distributed in nature. Given
this explosive growth of bio data, there is a need to develop
new methods, techniques, algorithms and tools to intelligently
extract valid, relevant, correct, previously unknown and
actionable information, as well as knowledge, from bio
databases in order to make crucial decisions.
A distributed database is a collection of multiple,
logically interrelated databases distributed over a computer
network [1]. Heterogeneous distributed databases are databases
with no homogeneity among them, either in terms of the way
data is logically structured (data model) or in terms of the
mechanisms provided for accessing it (data language).
Autonomy refers to the distribution of control, not of data. It
indicates the degree to which individual databases can operate
independently [1]. Autonomous heterogeneous distributed
databases are non-homogeneous, independent databases
stored at multiple locations and linked together via a network.
Data mining is the process of extracting valid, previously
unknown, comprehensible and actionable information from
large databases and using it to make crucial business
decisions [2]. Intelligent data mining is a hot research issue
nowadays and many researchers are working on it.
Researchers focus on two things: first, calling knowledge
intelligently [3], and second, intelligent query answering [4].
Query answering mechanisms in a knowledge-rich database can
be classified, based on their responses to queries, into two
categories: direct query answering and intelligent (or
cooperative) query answering [3]. Direct query answering is
a direct, simple retrieval of data or knowledge from the
knowledge-rich database, whereas Intelligent Query
Answering (IQA) consists of analyzing the intent of the
query and providing generalized, neighborhood or
associated information relevant to the query. Most
commercial data mining products provide a large number of
models and tools for performing various data mining tasks,
but few provide intelligent assistance for addressing the many
important decisions that must be considered during the
mining process [6].
Bio data is discovered in different research institutes
and laboratories and stored in distributed bio databases. These
research institutes and laboratories are scattered all over the
world. Researchers may need data and knowledge discovered
by others for comparison, analysis, modeling, classification
and clustering purposes. Thus, a system is necessary that
provides access to the required data, which is distributed in
nature, and assists in the evaluation and analysis of bio data.
The proposed framework accesses valid, relevant, previously
unknown data from autonomous heterogeneous distributed bio
databases, preprocesses it and loads it into a local bio
database. The system takes user queries as input, and the
Expert System (ES) then searches for the knowledge in its
knowledge base. If the knowledge exists, it is returned to the
user with expert assistance; otherwise the ES selects an
appropriate bio data analysis or data mining
technique/algorithm and passes the query to the data mining
engine. The data mining engine performs data mining on the bio
data and returns the results to the ES. The ES analyzes and
evaluates the mined knowledge and integrates it with
expert knowledge to assist researchers in the decision making
process.
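The query flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the dictionary-based knowledge base, the keyword rule for choosing a technique and the placeholder mining engine are all assumptions made for illustration.

```python
from typing import Callable, Dict


def answer_query(query: str,
                 knowledge_base: Dict[str, str],
                 select_technique: Callable[[str], str],
                 mining_engine: Callable[[str, str], str]) -> str:
    """Answer from the knowledge base if possible, otherwise mine."""
    key = query.strip().lower()
    if key in knowledge_base:
        # Knowledge already exists: reply directly with expert knowledge
        return f"[expert] {knowledge_base[key]}"
    # Otherwise pick a technique and hand the query to the mining engine
    technique = select_technique(key)
    mined = mining_engine(key, technique)
    return f"[mined via {technique}] {mined}"


def select_technique(query: str) -> str:
    # Toy selection rule (assumption): choose by keyword
    return "clustering" if "group" in query else "association analysis"


def mining_engine(query: str, technique: str) -> str:
    # Placeholder standing in for the real data mining engine
    return f"result of {technique} for '{query}'"


# Hypothetical knowledge base with one stored answer
kb = {"function of gene x": "stored answer about gene X"}
known = answer_query("Function of gene X", kb, select_technique, mining_engine)
mined = answer_query("group similar proteins", kb, select_technique, mining_engine)
print(known)
print(mined)
```

In a fuller system the last step would also integrate the mined result with expert knowledge before replying; the sketch only shows the branch between direct answering and mining.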


2010 Second International Conference on Computer Engineering and Applications
978-0-7695-3982-9/10 $26.00 2010 IEEE
DOI 10.1109/ICCEA.2010.9

A. Problem Statement
Due to tremendous advances and achievements in
biomedicine, bioinformatics, bioengineering and
biotechnology, biological data is being generated at
tremendous speed. The volume and complexity of biological
data increase exponentially. Because new biological
discoveries are made in different research institutes and
laboratories around the globe, biological data is stored at
different geographical locations in different formats, and is
owned and maintained autonomously.
Data sets of interest to researchers are large, diverse in
structure and content, and typically autonomously
maintained [5]. However, the highly distributed and
heterogeneous characteristics of biological databases make it
inconvenient to retrieve the needed information from
different data sources [6]. Due to the highly distributed,
uncontrolled generation and use of a wide variety of
biomedical data, data cleaning, data preprocessing and the
semantic integration of such heterogeneous and widely
distributed biomedical databases have become important
tasks for the systematic and coordinated analysis of
biomedical databases [7]. Biological data analysis and
integration become very difficult due to the heterogeneity,
distribution, volume and complexity of the data. Most
commercial data mining products provide a large number of
models and tools for performing various data mining tasks,
but few provide intelligent assistance for addressing the many
important decisions that must be considered during the mining
process [8].
Traditional data analysis techniques cannot support huge
and complex biological data. New data analysis techniques
such as data mining can be helpful in the analysis of such
data. Researchers may need data and knowledge that were
discovered by other researchers and are distributed over the
world in diverse formats. New systems are needed to manage,
integrate and analyze large and complex biological data from
distributed, heterogeneous and autonomously maintained bio
databases. Not only is the evaluation and analysis of data
important; providing intelligent assistance is equally
important. Most analysis products do not provide intelligent
assistance in the decision making process. Systems are needed
that assist in the evaluation and analysis of huge and complex
data and can help researchers make the decisions needed to
proceed with their research work.
The rest of the paper is organized as follows: Section II
reviews the literature, Section III presents an overview and
structure of the system, Section IV describes future work and
Section V gives concluding remarks.

II. LITERATURE REVIEW

A. Data Mining
There are various definitions of data mining, ranging from
the broadest to more specific ones. Ramakrishnan and Gehrke
define data mining as "the exploration and analysis of large
quantities of data in order to discover valid, novel,
potentially useful, and ultimately understandable patterns in
data" [13].
Valid. The patterns hold in general.
Novel. We did not know the pattern beforehand.
Useful. We can devise actions from the patterns.
Understandable. We can interpret and comprehend
the patterns [13].

B. Data Mining and Knowledge Discovery
Knowledge discovery in databases (KDD) is a very
popular term in the context of databases. The terms data mining
and knowledge discovery have been receiving researchers'
attention for the last two decades. Knowledge discovery in
databases is defined as "the non-trivial extraction of implicit,
previously unknown, and potentially useful information from
data" [14]. Knowledge discovery is a process for extracting
useful information from data, whereas data mining is an
essential step in the knowledge discovery process. Many people
nevertheless use the terms data mining and knowledge discovery
interchangeably, and some refer to data mining as knowledge
discovery. One class of researchers [15] believes that data
mining is one step in knowledge discovery in databases (KDD).
Data mining, or knowledge discovery in databases, uses tools
and techniques for the exploration of databases to extract
relevant and interesting hidden relationships between
variables [16] [14].
Figure 2 shows the knowledge discovery process, in
which data mining is an essential step. Different data
mining techniques are applied to extract valuable and hidden
information. The result of data mining is further carefully
and accurately analyzed in the knowledge discovery process to
provide the user with valid, accurate and actionable
information.


C. Knowledge Discovery Process
The knowledge discovery process contains five or more
steps. Each step of knowledge discovery is briefly discussed
below.
1) Databases and Flat Files
Databases and flat files are the repositories of data. Large
volumes of data are stored in huge and numerous databases and
flat files.
2) Data Cleaning
Data cleaning deals with noisy, missing and inconsistent
data.
3) Data Integration
The aim of data integration is to combine different data
sources, i.e. databases, data warehouses and flat files.
4) Data Selection
Data selection is the task of selecting and retrieving related
and relevant data from massive data.
5) Data Transformation
Data transformation converts data from different
formats into a unified format.
6) Data Mining
Data mining is a necessary step in which different data
mining techniques are applied to search for valuable data,
information and knowledge.
7) Pattern Evaluation
Pattern evaluation assesses the output of the data mining
step to discover patterns, behaviors, data trends and
associations.
8) Graphical User Interface (GUI)
The GUI uses various presentation and visualization tools and
techniques to present data in an appropriate and
understandable format.
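The steps listed above can be sketched as a chain of small Python functions. This is an illustrative sketch only: the record layout, the two in-memory "sources", the made-up expression values and the use of a per-gene average as the mining step are assumptions, not part of the paper.

```python
from statistics import mean


def clean(records):
    """2) Data cleaning: drop records that contain missing values."""
    return [r for r in records if None not in r.values()]


def integrate(sources):
    """3) Data integration: merge all sources, remembering the origin."""
    return [dict(r, source=name) for name, recs in sources.items() for r in recs]


def select(records, wanted):
    """4) Data selection: keep only records about the genes of interest."""
    return [r for r in records if r["gene"].lower() in wanted]


def transform(records):
    """5) Data transformation: unify gene naming across sources."""
    return [dict(r, gene=r["gene"].lower()) for r in records]


def mine(records):
    """6) Data mining (stand-in): average expression level per gene."""
    grouped = {}
    for r in records:
        grouped.setdefault(r["gene"], []).append(r["expr"])
    return {gene: mean(values) for gene, values in grouped.items()}


def evaluate(patterns, threshold):
    """7) Pattern evaluation: keep genes whose average exceeds a threshold."""
    return {gene: avg for gene, avg in patterns.items() if avg > threshold}


# 1) Databases and flat files, here two in-memory lists with made-up values
sources = {
    "lab_a": [{"gene": "g1", "expr": 2.0}, {"gene": "g2", "expr": None}],
    "lab_b": [{"gene": "G1", "expr": 4.0}, {"gene": "G3", "expr": 1.0}],
}
cleaned = {name: clean(recs) for name, recs in sources.items()}
records = transform(select(integrate(cleaned), wanted={"g1", "g3"}))
result = evaluate(mine(records), threshold=2.0)
print(result)  # 8) presentation, here plain text instead of a GUI
```

Each function corresponds to one numbered step, so a real system could replace any stage (for example the mining stand-in) without touching the others.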

D. Data Mining Models
There are two types of data mining models: predictive and
descriptive (knowledge discovery oriented).
1) Predictive Models
Predictive models use historical data to predict
the future. They explore massive data sets and
identify hidden patterns, behaviors and associations.
After analyzing the uncovered patterns, behaviors and
associations, a predictive model predicts what may happen in
the future.
2) Descriptive / Knowledge Discovery Models
Unlike predictive models, descriptive models do not
predict anything; instead they search for
interesting, valuable relationships, patterns and behaviors in
the underlying data. Descriptive models are divided into
clustering, association, deviation detection, summarization
and text mining.


III. PROPOSED FRAMEWORK
The proposed framework consists of three layers. Each layer is
discussed below, and the framework is shown in Figure 1 at the
end of the paper.

A. Data Extraction Layer
The preprocessing/preparation of data is done by this
layer. Due to the highly distributed, uncontrolled generation
and use of a wide variety of biomedical data, data cleaning,
data preprocessing and the semantic integration of such
heterogeneous and widely distributed biomedical databases,
such as genome databases and proteome databases, have
become important tasks for the systematic and coordinated
analysis of biomedical databases [8]. This layer locates and
accesses relevant bio data from autonomous heterogeneous
distributed bio databases and then preprocesses it.
Preprocessing includes cleansing, transformation and
loading. The following functions are performed by this layer.

It locates and accesses relevant bio data from
autonomous heterogeneous distributed bio
databases.
Cleaning: Noise and missing values are removed
from the data in this process.
Transformation: Data in different formats is
transformed into a unified format.
Data optimization/reduction: In this process
unwanted and unnecessary data is removed to
reduce the data to a reasonable size.
Loading: After cleaning, the transformed bio data is
loaded into a local database.

The data extraction layer contains knowledge about the
autonomous heterogeneous distributed bio databases, which
helps it extract valid, relevant data.
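The functions of this layer can be sketched in Python as follows, using the standard `sqlite3` module as a stand-in for the local bio database. The table name `bio_data`, the field names (`accession`/`id`, `sequence`/`seq`) and the sample records are hypothetical and chosen only to show heterogeneous inputs being unified.

```python
import sqlite3


def extract_and_load(raw_records, connection):
    """Clean, transform, reduce and load records into a local database."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS bio_data "
        "(accession TEXT PRIMARY KEY, sequence TEXT)"
    )
    for record in raw_records:
        # Heterogeneous sources may use different field names (assumed here)
        accession = record.get("accession") or record.get("id")
        sequence = record.get("sequence") or record.get("seq")
        # Cleaning: skip records with missing values
        if not accession or not sequence:
            continue
        # Transformation: one unified format (upper case, no whitespace)
        sequence = "".join(sequence.split()).upper()
        # Reduction: INSERT OR IGNORE drops duplicate accessions
        connection.execute(
            "INSERT OR IGNORE INTO bio_data VALUES (?, ?)",
            (accession, sequence),
        )
    # Loading: make the inserted rows durable
    connection.commit()


raw = [
    {"accession": "A1", "sequence": "acgt acgt"},
    {"id": "B2", "seq": "ttga"},                   # different field names
    {"accession": "A1", "sequence": "ACGTACGT"},   # duplicate
    {"accession": "C3", "sequence": None},         # missing value
]
conn = sqlite3.connect(":memory:")
extract_and_load(raw, conn)
rows = conn.execute(
    "SELECT accession, sequence FROM bio_data ORDER BY accession"
).fetchall()
print(rows)
```

Locating and accessing the remote databases is left out of the sketch, since it depends on the access mechanism each autonomous source exposes.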
B. Expert System
The expert system (ES) is the main part of the intelligent data
mining system. It provides expert/intelligent assistance to the
users and interacts with the user as well as with the data
mining system for effective, efficient data mining. The ES
gets the user query, transforms it into a low-level query and
selects the appropriate data mining technique/algorithm. The
functionality of the data extraction layer is controlled by the
ES. In most cases, the knowledge from the data mining system
is not sufficient to support the decision-maker. It should be
linked with expert knowledge to show its real intelligence,
because the mined knowledge is not far from conventional
statistical analysis [5]. After getting the results from the
data mining layer, the ES evaluates and analyzes them. The
expert system has the ability to learn from past experience
using case-based reasoning. The case-based reasoning paradigm
provides a good basis for the efficient acquisition of data
mining knowledge [6].
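As a rough illustration of how case-based reasoning could let the ES learn from past experience, the Python sketch below retrieves the most similar past case and retains the newly solved one. The case attributes, the attribute-matching similarity measure and the stored techniques are assumptions for illustration; the paper does not specify them.

```python
def similarity(case_a, case_b):
    """Fraction of attributes on which two case descriptions agree."""
    keys = set(case_a) | set(case_b)
    if not keys:
        return 0.0
    return sum(case_a.get(k) == case_b.get(k) for k in keys) / len(keys)


def retrieve(case_base, problem):
    """Retrieve step: the stored case most similar to the new problem."""
    return max(case_base, key=lambda case: similarity(case["problem"], problem))


def retain(case_base, problem, technique):
    """Retain step: store the solved problem for future reuse."""
    case_base.append({"problem": problem, "technique": technique})


# Hypothetical past cases: problem description -> technique that worked
case_base = [
    {"problem": {"data": "sequence", "goal": "compare"},
     "technique": "similarity search"},
    {"problem": {"data": "expression", "goal": "group"},
     "technique": "clustering"},
]
new_problem = {"data": "expression", "goal": "group", "size": "large"}
best = retrieve(case_base, new_problem)
retain(case_base, new_problem, best["technique"])
print(best["technique"])
```

A complete CBR cycle would also adapt the retrieved solution and revise it after evaluation; only retrieval and retention are shown here.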

The ES has the following parts:
Query handler: The query handler gets the user query and
translates the high-level query into low-level,
optimized queries.
Knowledge base: It contains domain knowledge,
knowledge about knowledge, factual data and
procedural rules.
Inference engine: The inference engine is an
important processing component of the expert system,
which infers new knowledge and utilizes existing
knowledge for decision making and problem
solving. It analyzes and evaluates the mined
knowledge.
Explanation/Reasoning mechanism: This
mechanism provides the justification/reasoning process
that leads to the final conclusion.
Knowledge integrator: It integrates the knowledge of the
expert system with the results of the data mining engine.

C. Data Mining Engines
The data mining engine performs the task of data mining. It
gets the low-level user query and other parameters from the
upper layer (Expert System). The data mining engine performs
the following bio data mining tasks.

1) Analysis of frequent, sequential and structured patterns
One of the most important search problems in bio data
analysis is similarity search and comparison among
biosequences and structures. For example, gene sequences
isolated from diseased and healthy tissues can be compared
to identify critical differences between the two classes of
genes [7].
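A minimal Python sketch of such a comparison is given below. It assumes the two sequences are already aligned and of equal length, and the sequences themselves are made up for illustration; real similarity search would normally involve an alignment step first.

```python
def differences(healthy, diseased):
    """Positions where two aligned, equal-length sequences differ."""
    if len(healthy) != len(diseased):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, h, d)
            for i, (h, d) in enumerate(zip(healthy, diseased))
            if h != d]


# Made-up example sequences
healthy = "ATGGCCATTG"
diseased = "ATGGCGATTA"
diffs = differences(healthy, diseased)
identity = 1 - len(diffs) / len(healthy)
print(diffs)
print(identity)
```

Each tuple reports the zero-based position, the base in the healthy sequence and the base in the diseased sequence.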
2) Association analysis: identification of co-occurring or
correlated
Currently, many studies have focused on the comparison
of one gene to another. However, most diseases are not
triggered by a single gene but by a combination of genes
acting together. Association and correlation analysis methods
can be used to help determine the kinds of genes or proteins
that are likely to co-occur in target samples. Such analysis
would facilitate the discovery of groups of genes or proteins
and the study of interactions and relationships among them
[8].
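Co-occurrence of genes in target samples can be sketched in Python by counting how often each pair appears together. The sample contents, gene names and the support threshold below are assumptions for illustration, and the sketch covers only pairs rather than larger gene combinations.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(samples, min_support):
    """Gene pairs that co-occur in at least min_support of the samples."""
    counts = Counter()
    for genes in samples:
        # Sorting makes (a, b) and (b, a) count as the same pair
        counts.update(combinations(sorted(set(genes)), 2))
    total = len(samples)
    return {pair: count / total
            for pair, count in counts.items()
            if count / total >= min_support}


# Made-up samples: the set of genes observed as active in each sample
samples = [
    {"geneA", "geneB", "geneC"},
    {"geneA", "geneB"},
    {"geneA", "geneC"},
    {"geneB", "geneD"},
]
pairs = frequent_pairs(samples, min_support=0.5)
print(pairs)
```

The returned value for each pair is its support, that is, the fraction of samples in which both genes occur.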
3) Effective classification and comparison
A critical problem in bio data analysis is to classify
biosequences or structures based on their critical features
and functions [8].
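One simple way to classify sequences by their features is a nearest-neighbour rule over k-mer sets, sketched below in Python. The choice of k-mers as features, Jaccard similarity as the measure, the labels and the training sequences are all assumptions for illustration; the paper does not prescribe a particular classifier.

```python
def kmers(sequence, k=2):
    """Feature set: all substrings of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(sequence, labelled, k=2):
    """1-nearest-neighbour: label of the most similar training sequence."""
    features = kmers(sequence, k)
    best = max(labelled, key=lambda item: jaccard(features, kmers(item[0], k)))
    return best[1]


# Made-up labelled training sequences
training = [("ATATATAT", "AT-rich"), ("GCGCGCGC", "GC-rich")]
label = classify("ATATGTAT", training)
print(label)
```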
4) Cluster analysis methods
It is crucial to discover pairwise frequent patterns and
cluster bio data based on frequent patterns [8].
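Clustering on frequent patterns can be illustrated with a simple single-pass ("leader") scheme in Python, where each item is described by the set of frequent patterns it contains. The algorithm choice, the Jaccard threshold and the pattern sets are assumptions for this sketch, not the method of the paper or of [8].

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def leader_cluster(items, threshold):
    """Assign each item to the first cluster whose leader is similar enough."""
    clusters = []  # list of (leader_patterns, member_names)
    for name, patterns in items.items():
        for leader, members in clusters:
            if jaccard(leader, patterns) >= threshold:
                members.append(name)
                break
        else:
            # No sufficiently similar leader found: start a new cluster
            clusters.append((patterns, [name]))
    return [members for _, members in clusters]


# Made-up items: sequence name -> frequent patterns found in it
items = {
    "seq1": {"ATG", "GCC"},
    "seq2": {"ATG", "GCC", "TTA"},
    "seq3": {"CGA", "TGC"},
    "seq4": {"CGA"},
}
groups = leader_cluster(items, threshold=0.5)
print(groups)
```

The result of a leader scheme depends on the order in which items are processed, which is one reason more robust clustering methods are usually preferred in practice.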
5) Modeling of biological networks.
The large amount of data generated from microarray and
proteomics studies provides rich resources for the theoretical
study of complex biological systems by computational modeling
of biological networks [8].
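As a small illustration of computational network modeling, the Python sketch below builds an undirected interaction network as an adjacency map and lists highly connected nodes. The protein names, the interaction list and the degree threshold are made up for illustration.

```python
from collections import defaultdict


def build_network(interactions):
    """Undirected interaction network as an adjacency map."""
    network = defaultdict(set)
    for a, b in interactions:
        network[a].add(b)
        network[b].add(a)
    return dict(network)


def hubs(network, min_degree):
    """Nodes with at least min_degree interaction partners."""
    return sorted(node for node, neighbours in network.items()
                  if len(neighbours) >= min_degree)


# Made-up pairwise interactions between proteins
interactions = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
network = build_network(interactions)
hub_nodes = hubs(network, min_degree=3)
print(hub_nodes)
```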
6) Data visualization and visual data mining
Complex structures and sequencing patterns of genes and
proteins are most effectively presented in graphs, trees,
cubes and chains by various kinds of visualization tools.
Such visually appealing structures and patterns facilitate
pattern understanding, knowledge discovery and interactive
data exploration. Visualization and visual data mining
therefore play an important role in biomedical data mining
[8].

IV. FUTURE WORK
Data and information exchange is useful from the research
point of view in the bioinformatics field. Research institutes
and laboratories may still be reluctant to give others their
own bio data due to confidentiality, privacy and other
reasons. A privacy-preserving mechanism should exist to
satisfy these research institutes and laboratories. Due to the
distributed nature of bio databases, security is a very
important issue. Privacy mechanisms and security issues should
be taken into consideration in future work. The development of
new mining algorithms capable of extracting frequent,
sequential and structured patterns, and of classifying and
clustering bio data, should also be included in future work.

V. CONCLUSION
Data mining techniques and tools play an important role in
the field of bioinformatics. The intelligent data mining system
for bio database analysis helps researchers and scientists
in the evaluation and analysis of bio data and in the decision
making process. It collects data from distributed databases and
provides an integrated, uniform view of the data. It enables
researchers to integrate their proprietary data (which may be
incomplete) with other data for analysis purposes. It then
extracts valid, relevant, correct and actionable information,
as well as knowledge, from the bio databases. This mined
knowledge is then combined with expert knowledge to assist
researchers and scientists in their research and decision
making process.

REFERENCES
[1] M. Tamer Ozsu, Patrick Valduriez. (2003). Principles of
Distributed Database Systems. 2nd Ed.
[2] Thomas Connolly, Carolyn Begg. (2003). Database Systems:
A Practical Approach to Design, Implementation and Management.
4th Ed. Addison-Wesley.
[3] Jian Liang, Xiao Li, Hongshuo Liu et al. (2002). "The Design
and Realization of Intellectualized Data Mining System". In:
Application Research of Computer, pp. 89-91.
[4] Chen Yi-ming. (2002). "Data mining technique and intelligent
query answering associated analyzing". In: Journal of Northwest
Normal University (Natural Science). (38) pp. 41-43, pp. 67.
[5] Junhua Hu, Yongmei Liu. (2006). "Designing and Realization
of Intelligent Data Mining System Based on Expert Knowledge".
IEEE International Conference on Management of Innovation and
Technology, June 2006, pp. 380-383.
[6] Michel Charest, Sylvain Delisle, Ofelia Cervantes, Yanfen Shen.
(2006). "Intelligent Data Mining Assistance via CBR
and Ontologies". 17th International Conference on Database and
Expert Systems Applications, 04-08 Sept. 2006, pp. 593-597.
[7] J. Yang, P. S. Yu, W. Wang, and J. Han. (2002). "Mining
long sequential patterns in a noisy environment". In SIGMOD,
pp. 406-417.
[8] Jason T. L. Wang, Mohammed J. Zaki, Hannu T. T. Toivonen,
Dennis Shasha, Eds. (2004). "Data Mining in Bioinformatics".
[9] Fasman, K., "Restructuring the Genome Data Base: A model for
a federation of biological databases", Journal of
Computational Biology, Volume 1, Number 2, pp. 165-171.
[10] Davidson S.B., Overton C., Buneman P., "Challenges in
integrating biological data sources", Journal of Computational
Biology, 1995, Volume 2, Number 4, pp. 557-572.
[11] Jason T. L. Wang, Mohammed J. Zaki, Hannu T. T.
Toivonen, Dennis Shasha, Data Mining in Bioinformatics,
Springer, 2005.
[12] Michel Charest, Sylvain Delisle, Ofelia Cervantes, Yanfen Shen.
"Intelligent Data Mining Assistance via CBR and Ontologies". 17th
International Conference on Database and Expert
Systems Applications, 04-08 Sept. 2006, pp. 593-597.
[13] Raghu Ramakrishnan, Johannes Gehrke, Database Management
Systems, 3rd Edition, McGraw-Hill Professional, 2002.
[14] Frawley W.J., Piatetsky-Shapiro G., Matheus C.J.,
Knowledge Discovery in Databases: An Overview, AAAI
Press/MIT Press, Cambridge, MA, 1991, pp. 1-30.
[15] Jiawei Han, Micheline Kamber, Data Mining: Concepts and
Techniques, New York: Morgan Kaufmann Publishers, 2006.
[16] Fan Zhang, Bingru Yang, Wei Song, Linna Li, "Intelligent Decision
Support System Based on Data Mining: Foreign Trading
Case Study", IEEE International Conference on Control and
Automation, Guangzhou, China, May 30 to June 1, 2007.
[17] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, "From Data Mining
to Knowledge Discovery in Databases", AI Magazine, Volume
17, Number 3, pp. 37-54, 1996.
List of Figures


Figure 1: Proposed Framework

Figure 2: Knowledge Discovery Process






Figure 3: Types of Descriptive and Predictive Models