
INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)

ISSN 0976-6367 (Print), ISSN 0976-6375 (Online)
Volume 4, Issue 6, November - December (2013), pp. 284-289
IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI) www.jifactor.com

APPLICATIONS OF DATA MINING IN MEDICAL DATABASES


P.N. Santosh Kumar (1), Dr. C. Venugopal (2), Dr. C. Sunil Kumar (3)

(1) Assistant Professor in ECM, SNIST, Hyderabad, A.P., India
(2) Professor in ECM, SNIST, Hyderabad, A.P., India
(3) Professor in ECM, SNIST, Hyderabad, A.P., India

ABSTRACT

With the spread of information systems (ISs), an enormous quantity of data has been collected in these systems up to the present. Because strategically important information may be concealed in this mass of data, these pieces of information can be very valuable. With the aid of data mining (DM) and knowledge discovery (KD) techniques, the hidden information can be extracted from these huge amounts of data. These techniques can be applied to several areas, e.g. commerce, telecommunication, and healthcare. Hospital information systems (HIS) are well known around the world [5]. These systems store a great deal of data pertaining to the patients' physical parameters, laboratory values, treatment modality, and case history. By applying DM techniques to medical and healthcare data, unknown relationships among these parameters in the examined population can be discovered. This procedure includes forming clusters that characterize the patients from the point of view of clinical outcome, identifying risk factors, analyzing trends in the changes of clinical parameters, etc. In this work, the preparation steps that must be taken before analyzing medical data are discussed. The data mining methods that are practical to use for different purposes are also dealt with. Finally, the applicability of these tools in a particular area of healthcare is discussed.

Keywords: DM, Healthcare, KDD, DW, DB

I. INTRODUCTION

From the late 1980s to the present, a principal research area in information technology (IT) has been KD, including DM techniques and data warehouses (DWs). DM itself can be viewed as a result of the natural evolution of ISs. After solving the problem of
creation and design of databases (DBs), an enormous amount of collected data has accumulated in these DB systems up to the present. The same has happened in the medical sciences. Around the world there are many research projects based on the application of DM techniques in various disciplines. Indian medical DB systems are large and diverse enough for valuable hidden information to be extracted from them.

Research based on DM in the healthcare field has two kinds of goals. On the one hand, these projects can examine the applicability of DM techniques in healthcare [1]: how the common algorithms can be improved by building in expert knowledge, and which typical difficulties and mistakes need attention during this work [3]. On the other hand, by applying these techniques to healthcare DB systems, some concealed pieces of information might be revealed that can be used in medical practice, e.g. for improving treatment or analyzing risk factors.

IT is broadly adopted in current medical practice, especially for supporting digitized equipment, administrative jobs, and data organization, but less has been achieved in the use of computational methods to exploit medical data in research or practice. There is a growing demand for the integration and exploitation of diverse medical information for improved medical practice, medical research, and tailored healthcare. Some of the tasks suitable for the application of DM are classification, estimation, prediction, affinity grouping, clustering, and description. Some of them are best approached in a top-down manner, called hypothesis testing, while others are best approached in a bottom-up manner, called KD, which may be directed or undirected. The goal of DM is to discover knowledge from data and present it in a form that is easily understandable to people [6].
There are several DM methods, such as Cluster Detection (CD), Decision Trees (DTs), Artificial Neural Networks (ANNs), Genetic Algorithms (GAs), and On-Line Analytical Processing (OLAP). DTs may be used for classification, clustering, prediction, or estimation. There are different approaches in DM: hypothesis testing, where a DB recording past behavior is used to verify or disprove preconceived notions, ideas, and hunches concerning relationships in the data, and KD, where no prior hypotheses are made and the data is allowed to speak for itself. KD may be directed or undirected. Directed KD tries to explain or classify some particular data field, while undirected KD aims at finding patterns or similarities among groups of records without the use of a particular target field or set of predefined classes [2].

II. DATA MINING TECHNIQUES

Some of the frequently used techniques are the following.

A. Neural Networks

A neural network (NN) may be defined as a model of reasoning based on the human brain. It is perhaps the most common DM method, since it is a simple model of neural interconnections in brains, adapted for use on digital computers. It learns from a training set, generalizing patterns inside it for classification and prediction. Neural networks can also be applied to undirected DM and time-series forecasting.

B. Decision Trees

DTs are a way of representing a series of rules that lead to a class or value. Therefore, they are used for directed DM, particularly classification. One of the significant advantages of DTs is that the model is quite understandable, since it takes the form of explicit rules. This allows the evaluation of results and the identification of key attributes in the process. The
rules, which can be expressed easily as logic statements in a language such as SQL, can be applied directly to new records.

C. Cluster Detection

CD consists of building models that find data records similar to each other. This is inherently undirected DM, since the objective is to find previously unknown similarities in the data. Clustering may be considered a very good way to start any analysis of the data. Self-similar clusters can provide the starting point for knowing what is in the data and for figuring out how best to make use of it.

D. Genetic Algorithms

GAs, which apply the mechanics of genetics and natural selection to a search, are used for finding the optimal set of parameters that describe a predictive function. Hence, they are mainly used for directed DM. GAs use operators such as selection, crossover, and mutation to evolve successive generations of solutions. As these generations evolve, only the most predictive survive, until the functions converge on an optimal solution.

III. KNOWLEDGE DISCOVERY IN DATABASES

The term DM is often used as a synonym for knowledge discovery in databases (KDD); however, DM is only one crucial step of the KDD process. This process includes the following key steps: learning the application domain; creating the target data set (DS); choosing the DM functionalities and the appropriate algorithms; pattern evaluation; knowledge presentation; and assessing the discovered knowledge. DM work in an unfamiliar domain always starts with understanding the application domain and specifying the problem. In the healthcare domain this means becoming familiar with the major medical terms; then the available data must be preprocessed, which includes data selection and integration, data cleaning, and data reduction and transformation.
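As a minimal sketch of the preprocessing stage just described (selection, cleaning, and reduction and transformation), the following Python fragment chains three small functions. The record layout, the field names, and the choice of min-max rescaling are assumptions made for illustration; the paper does not prescribe any of them.

```python
# Hypothetical raw records; field names are illustrative, not from the paper.
RAW = [
    {"patient_id": 1, "age": 61, "weight_kg": 58.0, "ward": "A"},
    {"patient_id": 1, "age": 61, "weight_kg": 58.0, "ward": "A"},  # repeated entry
    {"patient_id": 2, "age": 47, "weight_kg": None, "ward": "B"},  # missing value
    {"patient_id": 3, "age": 72, "weight_kg": 66.5, "ward": "A"},
]


def select(records, fields):
    """Data selection: keep only the attributes relevant to the question."""
    return [{f: r[f] for f in fields} for r in records]


def clean(records, key):
    """Data cleaning: drop repeated records with the same key, keeping the first."""
    seen, out = set(), []
    for r in records:
        if r[key] not in seen:
            seen.add(r[key])
            out.append(r)
    return out


def rescale(records, field):
    """Transformation: min-max rescale one numeric field to [0, 1].

    Missing values stay None instead of being guessed.
    """
    values = [r[field] for r in records if r[field] is not None]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return [
        {**r, field: None if r[field] is None else (r[field] - lo) / span}
        for r in records
    ]


def preprocess(records):
    """Chain the steps in the order the text lists them."""
    selected = select(records, ["patient_id", "age", "weight_kg"])
    cleaned = clean(selected, "patient_id")
    return rescale(cleaned, "weight_kg")


if __name__ == "__main__":
    for row in preprocess(RAW):
        print(row)
```

Keeping each step as a separate function mirrors the way the KDD process separates the stages, so that a step can be replaced (for example, a different cleaning rule) without touching the others.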
After collecting and preprocessing the data according to the objectives of the application, the functionality of the DM activity must be chosen and the most suitable DM algorithms identified. DM functionalities include creating concept or class descriptions, clustering, classification, evolution analysis, and association analysis. The choice among these possibilities is mainly influenced by the limitations of the DM system. After executing the DM algorithms, the discovered patterns need to be presented visually to the experts for analysis. For this purpose, charts, tables, diagrams, decision trees, rules, etc. are used. By evaluating these results the desired new knowledge is obtained, which the end users can utilize in their research work.

IV. SOURCE DATA

Medical data (MD) arises from diverse sources. There are two types of DBs available in the medical domain. The first type of MD comes from medical experts, e.g. medical diagnoses, drugs, and so on. It is typical of this type of data that the number of records is small, but the number of attributes for each record is relatively large compared with the number of records, and missing values are not found frequently in this kind of data. The other type of MD comes mainly from HIS. This data is automatically stored in DBs without any specific purpose; laboratory test data, for example, belongs to this group. The source systems of MD are mostly HIS and flat files, but in some special cases DWs can
also provide this kind of information [4]. Regrettably, in many cases important data is stored in paper format only. The data needed for DM must be integrated before analysis. Most often the target of the integrated data is a relational DB or a DW. When examining MD, it can often be seen that the base data is not in the appropriate form and/or is dirty, and some data transformation actions may need to be performed on it. So before running the DM algorithm, the data must be selected, cleaned, integrated, and transformed into an appropriate format.

V. DATA PREPROCESSING

Analyzing dirty, wrong data never provides useful information. Before starting any DM job, the data must be preprocessed. This activity includes solving the problem of dirty data, handling missing values, managing redundant data, dealing with unstructured information, and other data preprocessing actions such as the creation of new features and data normalization or generalization techniques.

In medical information systems (MISs) it often occurs that some fields are Null. The reason may be that the data is not available or was not stored. To improve the discovery process it is recommended to obtain and fill in the missing information. There are some other options, e.g. using a global constant, the attribute mean, or the most probable value to fill in the missing values. Replacing the missing value with a global constant (e.g. "unknown") is not a good option, because the DM algorithms may treat this value as a new concept. In the medical domain it is also not recommended to replace missing values with the attribute mean or the most probable value of the field, as it can happen that this parameter would predict an illness or an adverse event that is the subject of the analysis. Noisy data can occur for a number of reasons, such as random errors during recording or differing units of measure for laboratory values.
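The warning about global constants can be made concrete with a small Python sketch. It contrasts filling a nominal field with one constant against leaving the gap as None and recording an explicit flag. The field (smoking status), the constant, and the function names are hypothetical and serve only as an illustration.

```python
# The kind of global constant the text warns against (illustrative value).
MISSING_CONSTANT = "unknown"


def fill_with_constant(values):
    """Replace missing entries with one global constant (shown for contrast)."""
    return [MISSING_CONSTANT if v is None else v for v in values]


def keep_missing_with_flag(values):
    """Leave missing entries as None and pair each value with a 'was missing' flag."""
    return [(v, v is None) for v in values]


def distinct_categories(values):
    """Categories a mining algorithm would see, ignoring true gaps (None)."""
    return {v for v in values if v is not None}


if __name__ == "__main__":
    smoking_status = ["yes", "no", None, "no"]  # hypothetical nominal field

    filled = fill_with_constant(smoking_status)
    print(sorted(distinct_categories(filled)))  # the constant appears as a category

    flagged = keep_missing_with_flag(smoking_status)
    print(sorted(distinct_categories(v for v, _ in flagged)))  # only real categories
    print([was_missing for _, was_missing in flagged])
```

With the constant, the algorithm is handed a third category that no patient actually reported; with the flag, the gap stays visible as a gap and the analyst can decide later how to treat it.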
Default values can in many cases cause difficulties because, e.g., on seeing a 0 value in the field of a laboratory parameter, one cannot decide whether it is a measured result or means that the examination was not performed. Values outside the valid domain can be corrected manually or deleted. Outliers (values far from most others in a set of data) can be detected, e.g., by clustering. Outliers in medical databases (MDBs) may deserve the attention of the analyst.

Redundant data is mostly generated by the integration of several different DBs. For example, the physical parameters of patients are usually stored in more than one database, which calls for integration. When comparing the corresponding data of the different DBs, inconsistencies may be found. The difference among the values may also derive from a change over time. In this case new information can be obtained from time-series data. For this purpose each piece of information can be placed in a new database or a DW accompanied by a timestamp (TS), and then evolution analysis can be executed to find hidden patterns.

The major difficulty of DM in the healthcare field is that an enormous amount of data is stored as plain, unstructured text. The analysis of textual data requires considerably different algorithms from the ones used for the analysis of continuous, binary, ordinal, or nominal data. So it is recommended to transform this data into some structured form. This can only be achieved with the help of medical experts, because of the difficulty of the terminology used.

Like other data preprocessing activities, DM applications working on medical data also require some data conversion procedures. In situations where the exact value does not matter and only its general character does, the data needs to be generalized. Typical examples are blood-pressure values and laboratory values.
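Generalization of this kind can be sketched as mapping each numeric reading to an ordinal band. The Python fragment below does this with the standard-library bisect module. The cut-points and band names are placeholders chosen for illustration only; they are not clinical thresholds and do not come from the paper.

```python
from bisect import bisect_right

# Illustrative cut-points and labels only; not clinical guidance.
SYSTOLIC_CUTS = [120, 140]
SYSTOLIC_LABELS = ["low-band", "mid-band", "high-band"]


def generalize(value, cuts, labels):
    """Map a numeric value to an ordinal label.

    cuts must be sorted ascending and labels must have len(cuts) + 1 entries.
    A value equal to a cut-point falls into the higher band.
    Missing values (None) are passed through unchanged.
    """
    if len(labels) != len(cuts) + 1:
        raise ValueError("need exactly one more label than cut-points")
    if value is None:
        return None
    return labels[bisect_right(cuts, value)]


if __name__ == "__main__":
    readings = [118, 120, 139, 165, None]
    print([generalize(v, SYSTOLIC_CUTS, SYSTOLIC_LABELS) for v in readings])
```

Passing the cut-points in as data rather than hard-coding them lets the medical experts supply the bands appropriate to each parameter.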

VI. RESEARCH ON BONE DISEASE: CASE STUDY ON OSTEOPOROSIS

The researchers are carrying out work in the area of osteoporosis (a bone disease). So far, data on 25,000 persons has been gathered; these persons were referred for assessment on suspicion of osteoporosis. Naturally, a number of these patients later proved to be healthy. The examined persons came from different parts of the country. From this large amount of data, 1200 representative patients were selected for further research. The available patient data is heterogeneous, because it includes data pertaining to the personal and family history of the patients, e.g. birth weight, bone fractures, medication, prior illnesses, and illnesses of relatives. Possibly the most valuable data is the results of densitometry examinations going back years. In the future, the researchers would like to examine the DNA of the patients in connection with osteoporosis.

All this data was earlier stored in paper format. So after becoming familiar with the application field, our first task was to provide a means of recording this data in a DB system. In parallel with this transcription, the preprocessing stage of the pattern discovery (PD) process has also started. As for the process itself, association analysis (AA) is performed to find the associations between osteoporosis and the probable risk factors, and the link between densitometry values and fractures. With classification algorithms, the patients are grouped into three categories, namely osteoporosis, osteopenia, and healthy status. In this work, the clustering techniques are based on phenotype, genotype, and examination results. A number of evolution analyses, including the examination of the change of bone density over time, are planned, along with a search for other regularities and trends.
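The paper does not say which rule separates the three categories. One widely used convention, assumed here purely for illustration, is the WHO densitometric definition based on the T-score of a bone density measurement: a T-score of -2.5 or lower indicates osteoporosis, a T-score between -2.5 and -1.0 indicates osteopenia, and a T-score of -1.0 or higher is considered normal. A sketch of such a labeling function, with a hypothetical function name:

```python
def classify_t_score(t_score):
    """Label a densitometry T-score using the WHO thresholds.

    Assumption: the three groups in the text correspond to these thresholds;
    the paper itself does not state the rule it uses.
    """
    if t_score <= -2.5:
        return "osteoporosis"
    if t_score < -1.0:
        return "osteopenia"
    return "healthy"


if __name__ == "__main__":
    for t in (-3.1, -2.5, -1.7, -1.0, 0.3):
        print(t, classify_t_score(t))
```

Labels produced this way could serve as the target field for directed DM, with the history and risk-factor attributes as inputs.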
Osteoporosis is a bone disease that causes a decrease in bone density and quality, leading to weakness of the skeleton and increased risk of fracture, particularly of the spine, wrist, hip, pelvis, and upper arm. Osteoporosis and the associated fractures are an important cause of mortality and morbidity. Bone loss is gradual and shows no obvious symptoms or warning signs until the disease has advanced to its late stage. Osteoporosis is a global problem: 1 in 4 women and at least 1 in 15 men will develop osteoporosis during their lifetime. For these reasons, osteoporosis is often referred to as the "silent epidemic". The World Health Organization (WHO) has identified it as a priority health issue. The costs to national healthcare systems from osteoporosis-related hospitalization are staggering. In the UK, according to estimates made by the National Osteoporosis Society,

there are an estimated 3.5 million people in the UK suffering from osteoporosis;
osteoporosis is responsible for nearly 220,000 fractures per year;
osteoporosis costs the NHS and government over £1.5 billion each year.

Although there are some treatments, there is presently no cure for osteoporosis, but it can be effectively prevented. Early detection of bone loss is key to preventing suffering and rising healthcare costs. However, screening facilities and qualified technical personnel remain insufficient in most countries. The UK has only about two DXA bone mass densitometers per million residents, and fewer than 10% of patients receive treatment. Research on osteoporosis has been conducted since 1997 with the aim of examining and developing an approach to identify the associated risk factors and to predict the likelihood of developing osteoporosis. The research has produced some very encouraging initial findings, which have been published in journals and at major conferences in both the medical and computing fields.

VII. CONCLUSION

In recent decades, the amount of data stored in various ISs has increased significantly. DM is one of the most popular techniques for analyzing this enormous amount of data. HIS and other MDBs also store valuable data, raising a need for KD. MD is diverse and offers numerous possibilities for analysis: mining class or concept descriptions of medical terms, searching for association rules (ARs), classifying patients and forecasting medical events for new patients, searching for clusters from diverse points of view, and carrying out evolution analysis based on time-series data.

REFERENCES

1. Jiawei Han and Micheline Kamber: Data Mining: Concepts and Techniques, The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann Publishers, August 2000. ISBN 1-55860-489-8.
2. H. Galhardas, D. Florescu, D. Shasha, E. Simon, C.-A. Saita: Declarative Data Cleaning: Language, Model, and Algorithms, Proc. of the 27th VLDB, pp. 307-316, Rome, Italy, 2001.
3. J.C. Prather, D.F. Lobach, L.K. Goodwin, J.W. Hales, M.L. Hage, W.E. Hammond: Medical Data Mining: Knowledge Discovery in a Clinical Data Warehouse, Proc. AMIA Annu. Fall Symp., 1997.
4. S. Tsumoto: Knowledge Discovery in Clinical Databases, Proceedings of the 11th International Symposium on Foundations of Intelligent Systems, 1999.
5. S. Tsumoto: Clinical Knowledge Discovery in Hospital Information Systems: Two Case Studies, PKDD 2000, Springer Verlag, pp. 652-656, 2000.
6. M. Last, O. Maimon, A. Kandel: Knowledge Discovery in Mortality Records: An Info-Fuzzy Approach, Medical Data Mining and Knowledge Discovery, Vol. 60, 2001.
7. P. Fazi, D. Luzi, F.L. Ricci, M. Vignetti: The Conceptual Basis of WITH, a Collaborative Writer System of Clinical Trials, ISMDA 2002, pp. 86-97.
8. Jameelah H. Suad and Wurood A. Jbara: Subjective Quality Assessment of New Medical Image Database, International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 5, 2013, pp. 155-164, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
9. R. Manickam, D. Boominath and V. Bhuvaneswari: An Analysis of Data Mining: Past, Present and Future, International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 1-9, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
10. R. Lakshman Naik, D. Ramesh and B. Manjula: Instances Selection using Advance Data Mining Techniques, International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 2, 2012, pp. 47-53, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
11. Rinal H. Doshi, Harshad B. Bhadka and Richa Mehta: Development of Pattern Knowledge Discovery Framework using Clustering Data Mining Algorithm, International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 3, 2013, pp. 101-112, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
