Data Mining

What is data mining?
The non-trivial extraction of implicit, previously unknown, and potentially useful information from
data.
Data mining encompasses a number of different technical approaches, such as:
o clustering,
o data summarization,
o learning classification rules,
o finding dependency networks,
o analysing changes, and
o detecting anomalies
Comparison of Data Mining and DBMS
DBMS - queries based on the data held e.g.
o last month's sales for each product
o sales grouped by customer age etc.
o list of customers who lapsed their policy
Data Mining - infer knowledge from the data held to answer queries e.g.
o what characteristics do customers share who lapsed their policies and how do they
differ from those who renewed their policies?
o why is the Cleveland division so profitable?
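The contrast above can be sketched in a few lines of Python. This is only an illustration: the customer records, the field names (age_band, status) and the helper lapse_rate_by are invented here and are not part of the original notes.

```python
from collections import Counter

# Hypothetical policy records; names and values are invented for illustration.
customers = [
    {"name": "Ann", "age_band": "18-30", "status": "lapsed"},
    {"name": "Bob", "age_band": "18-30", "status": "lapsed"},
    {"name": "Cat", "age_band": "31-50", "status": "renewed"},
    {"name": "Dan", "age_band": "31-50", "status": "renewed"},
    {"name": "Eve", "age_band": "18-30", "status": "renewed"},
]

# DBMS-style query: report exactly what is stored.
lapsed_names = [c["name"] for c in customers if c["status"] == "lapsed"]

# Mining-style question: how do lapsed customers differ from renewed ones?
def lapse_rate_by(attribute, records):
    """Fraction of customers who lapsed, for each value of the attribute."""
    totals = Counter(r[attribute] for r in records)
    lapsed = Counter(r[attribute] for r in records if r["status"] == "lapsed")
    return {value: lapsed[value] / totals[value] for value in totals}

rates = lapse_rate_by("age_band", customers)
```

The query returns a list of names, while the second step summarises a characteristic shared by the lapsed customers, which is closer to what a data mining tool is asked to infer.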
Who needs data mining?
Who(ever) has information fastest and uses it wins
Businesses are looking for new ways to let end users find the data they need to:
o make decisions
o serve customers and
o gain the competitive edge
Applications
Medicine - drug side effects, hospital cost analysis, genetic sequence analysis, prediction etc.
Finance - stock market prediction, credit assessment, fraud detection etc.
Marketing/sales - product analysis, buying patterns, sales prediction, target mailing,
identifying `unusual behaviour' etc.
Knowledge Acquisition
Scientific discovery - superconductivity research, etc.
Engineering - automotive diagnostic expert systems, fault detection etc.
Data Mining Goals
Classification
DM system learns from examples or the data how to partition or classify the data i.e. it
formulates classification rules
Example - customer database in a bank
o Question - Is a new customer applying for a loan a good investment or not?
o Typical rule formulated -
if STATUS = married and INCOME > 10000
and HOUSE_OWNER = yes
then INVESTMENT_TYPE = good
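As a rough sketch, the rule above could be written as a Python function. The function name and the choice to return None when the rule does not fire are assumptions made for this example; the notes do not say what happens to customers the rule does not cover.

```python
def classify_investment(status, income, house_owner):
    """Apply the example rule; return 'good' when it fires, otherwise None."""
    if status == "married" and income > 10000 and house_owner:
        return "good"
    return None  # the rule says nothing about other customers

applicant = classify_investment("married", 12000, True)
```

A real system would learn many such rules from the data and would also need rules (or a default) for the customers this one leaves unclassified.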
Association
Rules that associate one attribute of a relation to another
Set oriented approaches are the most efficient means of discovering such rules
Example - supermarket database
o 72% of all the records that contain items A and B also contain item C
o the specific percentage of occurrences, 72%, is the confidence factor of the rule
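The confidence factor can be computed directly from a set of records. The following is a minimal sketch; the baskets are invented, so the resulting figure is 75% rather than the 72% quoted in the example.

```python
def confidence(transactions, antecedent, consequent):
    """Share of transactions containing the antecedent that also contain the consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    if not with_antecedent:
        return 0.0
    with_both = [t for t in with_antecedent if consequent <= set(t)]
    return len(with_both) / len(with_antecedent)

# Hypothetical supermarket baskets.
baskets = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B"},
]
conf = confidence(baskets, {"A", "B"}, {"C"})
```

Four baskets contain both A and B, and three of those also contain C, so the rule "A and B implies C" has a confidence of 3/4.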
Sequence/Temporal
Sequential pattern functions analyse collections of related records and detect frequently
occurring patterns over a period of time
Difference between sequence rules and other rules is the temporal factor
Example - retailers database
o Can be used to discover the set of purchases that frequently precedes the purchase
of a microwave oven
Example - natural disasters database
o Discovery could be that when there is an earthquake in Los Angeles the next day
Mount Kilimanjaro erupts
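A very simple form of the retailer example can be sketched as follows. The purchase histories and the function items_preceding are hypothetical; real sequential pattern miners also consider ordered sets of items and time windows, which this sketch ignores.

```python
from collections import Counter

def items_preceding(histories, target):
    """Count, per item, how many customers bought it before first buying target."""
    counts = Counter()
    for purchases in histories.values():
        # Each purchase is (time, item); sort so that order reflects time.
        ordered = [item for _, item in sorted(purchases)]
        if target in ordered:
            before = set(ordered[:ordered.index(target)])
            counts.update(before)
    return counts

# Hypothetical purchase histories keyed by customer id.
histories = {
    "c1": [(1, "kettle"), (3, "microwave oven"), (2, "toaster")],
    "c2": [(1, "toaster"), (2, "microwave oven")],
    "c3": [(1, "kettle"), (2, "toaster")],
    "c4": [(1, "microwave oven"), (2, "toaster")],
}
preceding = items_preceding(histories, "microwave oven")
```

The temporal factor is what matters here: customer c4 also bought a toaster, but after the microwave oven, so that purchase is not counted.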
Techniques
Set oriented database methods
Statistics
Clustering
Visualisation
Neural networks
Rule Induction
Set oriented approaches/Databases
o make use of DBMSs to discover knowledge, although SQL is limiting
Statistics
o can be used in several data mining stages
data cleansing i.e. the removal of erroneous or irrelevant data known as
outliers
EDA, exploratory data analysis e.g. frequency counts, histograms etc.
data selection - sampling facilities and so reduce the scale of
computation
attribute re-definition e.g. Body Mass Index, BMI, which is
Weight/Height^2
data analysis - measures of association and relationships between
attributes, interestingness of rules, classification etc.
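Two of the stages above, attribute re-definition and frequency counts, can be sketched together. The sample records and the band boundaries used in bmi_band are assumptions made for this example only.

```python
from collections import Counter

# Hypothetical records with weight in kilograms and height in metres.
people = [
    {"weight_kg": 70.0, "height_m": 1.75},
    {"weight_kg": 95.0, "height_m": 1.80},
    {"weight_kg": 50.0, "height_m": 1.60},
]

def bmi(weight_kg, height_m):
    """Body Mass Index: weight divided by height squared."""
    return weight_kg / height_m ** 2

def bmi_band(value):
    # Band boundaries are illustrative assumptions, not taken from the notes.
    if value < 18.5:
        return "low"
    if value < 25:
        return "medium"
    return "high"

# Attribute re-definition: derive one new attribute from two existing ones.
for person in people:
    person["bmi"] = bmi(person["weight_kg"], person["height_m"])

# Exploratory data analysis: a frequency count over the derived attribute.
band_counts = Counter(bmi_band(person["bmi"]) for person in people)
```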
Visualisation
o enhances EDA, makes patterns more visible e.g. NETMAP a commercial data
mining tool uses this technique
Clustering i.e. Cluster Analysis
o Clustering and segmentation is basically partitioning the database so that each
partition or group is similar according to some criteria or metric
o Clustering according to similarity is a concept which appears in many
disciplines e.g. in chemistry the clustering of molecules
o Data mining applications make use of clustering according to similarity e.g. to
segment a client/customer base
o It provides sub-groups of a population for further analysis or action - very
important when dealing with very large databases
o Can be used for profile generation for target marketing i.e. where previous
response to mailing campaigns can be used to generate a profile of people who
responded and this can be used to predict response and filter mailing lists to
achieve the best response
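One common way to partition data by similarity is k-means clustering. The notes do not name a specific algorithm, so the choice of k-means, the (age, yearly spend) customer data and the starting centroids below are all assumptions for the sake of a small sketch.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical customers described by (age, yearly spend).
customers = [(25, 200), (27, 220), (30, 210), (55, 900), (60, 950), (58, 920)]
centroids, clusters = kmeans(customers, centroids=[(25, 200), (60, 950)])
```

Each resulting cluster is a customer segment that can be analysed or targeted separately. In practice the attributes would usually be rescaled first so that one attribute does not dominate the distance.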
Knowledge acquisition using data mining
Expert systems are models of real world processes
Much of the information is available straight from the process e.g.
o in production systems, data is collected for monitoring the system
o knowledge can be extracted using data mining tools
o experts can verify the knowledge

Multimedia Data Mining in Digital Libraries:
Standards and Features
Sanjeevkumar R. Jadhav*, and Praveenkumar Kumbargoudar*
Abstract
The digital library retrieves, collects, stores and preserves digital data. For this purpose,
information in different formats, such as text, images, video and audio, needs to be converted. Data mining
techniques are widely used when converting multimedia files in libraries. This paper
defines the term data mining and covers different data mining features and standards. It also explains
the architecture of data mining, which comprises the following stages: (1) domain
understanding; (2) data selection; (3) cleaning and preprocessing; (4) discovering patterns; (5) interpretation;
and (6) reporting and using discovered knowledge. It emphasizes the need to develop multimedia
data mining techniques and standards in the library for the conversion of multimedia information.
1. INTRODUCTION
Over the past few decades, rapid changes in information technology have drastically
changed the functions and activities of libraries. Information and Communication
Technology has created a new type of work culture, new forms of information storage, and new
means of communication and dissemination of information. The advent of electronic
resources and their increased use in libraries has brought about significant changes in the storage
and communication of information.
As a result, conventional libraries are transforming into digital libraries.
The majority of libraries have already been computerized and are digitizing their printed collections. In
India, the process of digitization is slow compared to developed countries. This is
because only 21% of the Indian population is computer literate and only 14% of the Indian
population uses the Internet. With developments in digitization, many libraries
are converting their printed materials into digital form.
A fully developed digital library environment involves the following elements [1]:
1. Initial conversion of content from physical to digital form.
2. The extraction or creation of metadata or indexing information describing the content
to facilitate searching and discovery, as well as administrative and structural metadata
to assist in object viewing, management and preservation.
3. Storage of digital content and metadata in an appropriate multimedia repository. The
repository will include rights management capabilities to enforce Intellectual Property
Rights, if required. E-commerce functionality may also be present if needed to handle
accounting and billing.
4. Client services for the browser, including repository querying and workflow.
5. Content delivery via file transfer or streaming media.
6. Patron access through a browser or dedicated client.
7. A private or public network.
* Gulbarga University, Gulbarga 585 106, Karnataka. E-mail: kumbargoudar@rediffmail.com
2. DIGITIZATION AND DATA MINING
Digitization refers to the conversion of an item, be it printed text, manuscript, image,
sound, film or video recording, from one format (usually print or analogue) into digital form.
The process basically involves taking a physical object and making an electronic
photograph of it. An image of the physical object is captured using a scanner or digital
camera and converted to a digital format that can be stored electronically and accessed via a
computer [2].
Data and information are available in different formats, including text, images, video,
audio, pictures and maps. For text information, the printed text needs to be scanned and
links provided to access it. For multimedia formats such as images, audio, pictures, maps
and video, however, conversion and systematic presentation are not easy. Further, automatic
search is needed for easy accessibility. Easy search and effective, systematic
presentation of the data are essential for multimedia information. For this purpose,
libraries need to adopt data mining techniques. Data mining techniques are drawn mainly
from logic, multimedia and artificial intelligence techniques.
Data mining is the automatic extraction of patterns of information from historical
data, enabling companies to focus on the most important aspects of their business, telling
them what they did not know and had not even thought of asking [3]. Another definition is that it is
the process of automating information discovery [4], which improves decision making and
gives a company advantages in the market. A third definition is that it is the exploration
and analysis, by automatic or semiautomatic means, of large quantities of data in order to
discover meaningful patterns and rules [5]. Data mining is an applied discipline, which grew
out of statistical pattern recognition, machine learning, and artificial intelligence and is
coupled with business decision making to optimize and enhance it. Initially, data mining
techniques were applied to structured data from databases.
Recently two branches of data mining, text data mining and Web data mining, have
emerged [6, 7]. They have their own research agendas, communities of researchers, and
supporting companies that develop technologies and tools. Multimedia data mining, by
contrast, is still at an early stage, and further development is needed to present
multimedia information effectively.
There are four types of multimedia data: audio data, which includes sound, speech,
and music; image data (black-and-white and colour images); video data, which includes
time-aligned sequences of images; and electronic or digital ink, which is sequences of
time-aligned 2D or 3D coordinates of a stylus, a light pen, data glove sensors, or a similar
device. All this data is generated by specific kinds of sensors.
The concept of mining in multimedia is also referred to as automatic annotation or
annotation mining. There appear to be three main pattern discovery approaches that have
been used for automatic annotation in multimedia data mining. These approaches differ
primarily in how external knowledge is provided to mine concepts. The first approach
involves assigning keywords or classifying the data. The second approach is through
clustering: multimedia documents are clustered first and then
the resulting clusters are assigned keywords by an annotator. The third approach does not rely
on a manual annotator and tries to mine concepts from contextual information.
Multimedia Data Mining (MDM) is a part of multimedia technology, which
covers the following areas [8]:
o Media compression and storage.
o Delivering streaming media over networks with required quality of service.
o Media restoration, transformation, and editing.
o Media indexing, summarization, search, and retrieval.
o Creating interactive multimedia systems for learning/training and creative art
production.
o Creating multimodal user interfaces.
3. MULTIMEDIA DATA MINING ARCHITECTURE
The data mining process consists of several interrelated and interactive stages.
The main stages of the data mining process are (1) domain
understanding; (2) data selection; (3) cleaning and preprocessing; (4) discovering patterns; (5)
interpretation; and (6) reporting and using discovered knowledge. The domain understanding
stage requires learning how the results of data mining will be used so as to gather all relevant
prior knowledge before mining [9].
Figure: Multimedia Data Mining Architecture
The data selection stage requires the user to target a database or select a subset of
fields or data records to be used for data mining. A proper domain understanding at this stage
helps in the identification of useful data. This is the most time-consuming stage of the entire
data mining process for business applications; data are never clean and in a form suitable
for data mining. For multimedia data mining, this stage is generally not an issue, because the
data are not in relational form and there are no subsets of fields to choose from.
The next stage in a typical data mining process is the preprocessing step that involves
integrating data from different sources and making choices about representing or coding
certain data fields that serve as inputs to the pattern discovery stage. Such representation
choices are needed because certain fields may contain data at levels of detail not considered
suitable for the pattern discovery stage. The preprocessing stage is of considerable
importance in multimedia data mining, given the unstructured nature of multimedia data.
The pattern discovery stage is the heart of the entire data mining process. It is the
stage where the hidden patterns and trends in the data are actually uncovered. There are
several approaches to the pattern discovery stage. These include association, classification,
clustering, regression, time-series analysis and visualization. Each of these approaches can
be implemented through one of several competing methodologies, such as statistical data
analysis, machine learning, neural networks and pattern recognition. It is because of the use
of methodologies from several disciplines that data mining is often viewed as a
multidisciplinary field.
The interpretation stage of the data mining process is used to evaluate the quality of
the discovery and its value, to determine whether a previous stage should be revisited.
Proper domain understanding is crucial at this stage to put a value on discovered patterns.
The final stage of the data mining process consists of reporting and putting to use the
discovered knowledge to generate new actions or products and services or marketing
strategies as the case may be.
According to Myatt [10], any exploratory data mining project should include the
following steps:
1. Problem Definition: The problem to be solved along with the projected deliverables
(information products) should be clearly defined, an appropriate team should be put
together, and a plan generated for executing the analysis.
2. Data Preparation: Prior to starting any data analysis or data mining project, the data
should be collected, characterized, cleaned, transformed, and partitioned into an
appropriate form for further processing.
3. Implementation of the Analysis: On the basis of the information from steps 1 & 2,
appropriate analysis techniques should be selected, and often these methods need to be
optimized.
4. Deployment of Results: The results from step 3 should be communicated and/or
deployed into a pre-existing process.
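The cleaning and partitioning part of step 2 can be sketched as below. This is one possible reading only: the function prepare, the rule "drop records with a missing required field", the 25% hold-out fraction and the sample records are all assumptions, not taken from Myatt.

```python
import random

def prepare(records, required_fields, test_fraction=0.25, seed=0):
    """Drop incomplete records, then split the rest into two partitions."""
    cleaned = [r for r in records if all(r.get(f) is not None for f in required_fields)]
    shuffled = list(cleaned)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_held_out = int(len(shuffled) * test_fraction)
    return shuffled[n_held_out:], shuffled[:n_held_out]

# Hypothetical records; two of them have a missing value.
records = [{"id": i, "age": 20 + i, "income": 1000 * i} for i in range(10)]
records[3]["income"] = None
records[7]["age"] = None

train, held_out = prepare(records, ["age", "income"])
```

The first partition would be used to build a model in step 3 and the held-out partition to check it before the results are deployed in step 4.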
4. FEATURES AND STANDARDS FOR MULTIMEDIA DATA MINING
It is noted that different image attributes such as colour, edges, shape, and texture are
used to extract features for mining. Feature extraction based on these attributes may be
performed at the global or local level. For example, colour histograms may be used as
features to characterize the spatial distribution of colour in an image. Similarly, the shape of
a segmented region may be represented as a feature vector of Fourier descriptors to capture
the global shape property of the segmented region, or a shape could be described in terms of
salient points or segments to provide localized descriptions. Global descriptors are generally
easy to compute, provide a compact representation, and are less prone to segmentation errors.
However, such descriptors may fail to uncover subtle patterns or changes in shape because
global descriptors tend to integrate the underlying information. Local descriptors, on the
other hand, tend to generate more elaborate representations and can yield useful results
even when part of the underlying attribute, for example the shape of a region, is occluded or
missing. In the case of video, additional attributes resulting from object and camera motion
are used.
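The difference between global and local colour features can be illustrated with a small sketch. The tiny two-colour image, the quantisation into two levels per channel and the split into left and right halves are assumptions chosen to keep the example short.

```python
from collections import Counter

def colour_histogram(pixels, levels=2):
    """Normalised histogram of quantised RGB colours for a list of pixels."""
    step = 256 // levels
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {colour: n / total for colour, n in counts.items()}

RED, BLUE = (255, 0, 0), (0, 0, 255)
# Hypothetical 2 x 4 image: left half red, right half blue.
image = [
    [RED, RED, BLUE, BLUE],
    [RED, RED, BLUE, BLUE],
]

# Global feature: one histogram for the whole image.
global_hist = colour_histogram([p for row in image for p in row])

# Local features: one histogram per region (here, the two halves).
left_hist = colour_histogram([p for row in image for p in row[:2]])
right_hist = colour_histogram([p for row in image for p in row[2:]])
```

The global histogram only records that the image is half red and half blue. The two local histograms additionally show where each colour lies, at the cost of a larger representation.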
In the case of audio, both temporal and spectral domain features have been
employed. Examples of the features used include short-time energy, pause rate,
zero-crossing rate, normalized harmonicity, fundamental frequency, frequency spectrum,
bandwidth, spectral centroid, spectral roll-off frequency and band energy ratio. Many
researchers have found the cepstral-based features, Mel-Frequency Cepstral Coefficients
(MFCC) and Linear Predictive Coefficients (LPC), very useful, especially in mining tasks
involving speech recognition. The MPEG-7 standard provides a good representative set of
features for multimedia data. The features are referred to as descriptors in MPEG-7. The
MPEG-7 Visual description tools describe visual data such as images and videos, while the
Audio description tools account for audio data. The MPEG-7 visual description defines the
following main features for color attributes: Color Layout Descriptor, Color Structure
Descriptor, Dominant Color Descriptor and Scalable Color Descriptor. The Color Layout
Descriptor is a compact and resolution-invariant descriptor that is defined in the YCbCr color
space to capture the spatial distribution of color over major image regions. The Color
Structure Descriptor captures both color content and information about its spatial
arrangement using a structuring element that is moved over the image. The Dominant Color
Descriptor characterizes an image or an arbitrarily shaped region by a small number of
representative colors. The Scalable Color Descriptor is a color histogram in the HSV color
space encoded by a Haar transform to yield a scalable representation. While the above
features are defined with respect to an image or its part, the feature Group of Frames-Group
of Pictures Color (GoFGoPColor) describes the color histogram aggregated over multiple
frames of a video [9].
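Two of the temporal audio features listed at the start of this passage, short-time energy and zero-crossing rate, are simple enough to sketch directly. The frame size, the hop size and the toy signal are assumptions for the example, and the definitions used are the common textbook ones rather than those of any particular standard.

```python
def frames(signal, size, hop):
    """Split a signal into frames of the given size, advancing by hop samples."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]

def short_time_energy(frame):
    """Mean of the squared sample values in the frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

# Toy signal: a loud, rapidly alternating part followed by a quiet, steady part.
signal = [1, -1, 1, -1, 0.1, 0.1, 0.1, 0.1]
features = [(short_time_energy(f), zero_crossing_rate(f)) for f in frames(signal, 4, 4)]
```

The first frame has high energy and a high zero-crossing rate, the second has low values for both, which is the kind of contrast used to separate, for example, noisy from quiet segments.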
MPEG-7 provides two main shape descriptors; others are based on these and
additional semantic information. The Region Shape Descriptor describes the shape of a
region using the Angular Radial Transform (ART). The description is provided in terms of 40
coefficients and is suitable for complex objects consisting of multiple disconnected regions
and for simple objects with or without holes. The Contour Shape Descriptor describes the
shape of an object based on its outline. The descriptor uses the curvature scale space
representation of the contour.
The motion descriptors in MPEG-7 are defined to cover a broad range of applications.
The Motion Activity Descriptor captures the intuitive notion of intensity or pace of action in a
video clip. The descriptor provides information on intensity, direction, and spatial and
temporal distribution of activity in a video segment. The spatial distribution of activity
indicates whether the activity is spatially limited or not. Similarly, the temporal distribution
of activity indicates how the level of activity varies over the entire segment. The Camera
Motion Descriptor specifies the camera motion types and their quantitative characterization
over the entire video segment. The Motion Trajectory Descriptor describes the motion trajectory
of a moving object based on spatiotemporal localization of trajectory points. The description
provided is at a fairly high level, as each moving object is indicated by one representative
point at any time instant. The Parametric Motion Descriptor describes motion, both global and
object motion, in a video segment by describing the evolution of arbitrarily shaped regions
over time using a two-dimensional geometric transform.
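The idea of one representative point per time instant can be sketched as follows. This is not the MPEG-7 encoding: taking the centre of a bounding box as the representative point, and the box coordinates themselves, are assumptions made for illustration.

```python
def trajectory(boxes):
    """Turn (time, x_min, y_min, x_max, y_max) boxes into (time, cx, cy) points."""
    return [(t, (x0 + x1) / 2, (y0 + y1) / 2) for t, x0, y0, x1, y1 in boxes]

def speeds(points):
    """Average speed of the representative point between consecutive instants."""
    result = []
    for (t0, x0, y0), (t1, x1, y1) in zip(points, points[1:]):
        distance = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
        result.append(distance / (t1 - t0))
    return result

# Hypothetical bounding boxes of one tracked object in three frames.
boxes = [(0, 0, 0, 2, 2), (1, 3, 4, 5, 6), (2, 3, 4, 5, 6)]
points = trajectory(boxes)
object_speeds = speeds(points)
```

Because the object is reduced to a single point, the trajectory says nothing about its shape or size, which is why the description is called fairly high level.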
The MPEG-7 Audio standard defines two sets of audio descriptors. The first set is of
low-level features, which are meant for a wide range of applications. The descriptors in this
set include Silence, Power, Spectrum, and Harmonicity. The Silence Descriptor simply
indicates that there is no significant sound in the audio segment. The Power Descriptor
measures temporally smoothed instantaneous signal power. The Spectrum Descriptor
captures properties such as the audio spectrum envelope, spectrum centroid, spectrum spread,
spectrum flatness, and fundamental frequency. The second set of audio descriptors is of
high-level features, which are meant for specific applications. The features in this set include
Audio Signature, Timbre, and Melody. The Signature Descriptor is designed to generate a
unique identifier for identifying audio content. The Timbre Descriptor captures perceptual
features of instrument sound. The Melody Descriptor captures monophonic melodic
information and is useful for matching of melodies. In addition, the high-level descriptors in
MPEG-7 Audio include descriptors for automatic speech recognition, sound classification
and indexing.
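As an illustration of a spectrum-based feature, the sketch below computes a plain spectral centroid: the magnitude-weighted mean frequency of a frame. It is an assumption-laden simplification, using a naive discrete Fourier transform on a linear frequency scale, and should not be read as the normative MPEG-7 definition of the spectrum centroid.

```python
import cmath
import math

def magnitude_spectrum(samples):
    """Magnitudes of the DFT bins from 0 Hz up to half the sample rate."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]

def spectral_centroid(samples, sample_rate):
    """Magnitude-weighted mean frequency, in Hz."""
    mags = magnitude_spectrum(samples)
    total = sum(mags)
    if total == 0:
        return 0.0
    n = len(samples)
    return sum(k * sample_rate / n * m for k, m in enumerate(mags)) / total

# Test tone: 64 samples of a 1000 Hz sine sampled at 8000 Hz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(64)]
centroid = spectral_centroid(tone, sample_rate)
```

For a pure tone the centroid sits at the tone's frequency. For real audio it gives a rough measure of how "bright" a segment sounds.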
5. MULTIMEDIA DATA MINING IN DIGITAL LIBRARIES
Quan Liu [11] observed that the standards and guidelines associated with library
digitization practices vary from project to project. Over the years, university, public, school,
and special libraries have adopted their own policies with regard to digitization. Some older
standards, as well as more recent ones, are widely accepted and practiced in library digitization
projects. Metadata standards and image quality standards and guidelines are commonly
sought when planning digitization projects. Common metadata standards used to date are
Dublin Core, RDF, EAD, TEI, and SGML and its descendants XML and HTML. The MARC
standard has been used as the standard interchange format for representing catalog records
electronically.
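To make the metadata standards above more concrete, the sketch below builds a small Dublin Core description of a digitized audio item as XML. The item itself, its field values and the wrapping "record" element are hypothetical; only the element names and the Dublin Core element namespace come from the standard.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

def dublin_core_record(fields):
    """Build an XML element holding one dc:* child per metadata field."""
    record = ET.Element("record")  # wrapper element name is an assumption
    for name, value in fields.items():
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record

# Hypothetical digitized item.
record = dublin_core_record({
    "title": "Campus convocation address",
    "creator": "Unknown",
    "type": "Sound",
    "format": "audio/mpeg",
    "language": "en",
})
xml_text = ET.tostring(record, encoding="unicode")
```

Features extracted by multimedia data mining, such as keywords assigned to a cluster, could be stored in the same record, for instance in a subject element.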
In India, only a few university and college libraries have started digitization, and
the majority are yet to begin digitizing and converting their collections. Further,
experts in library and information science have, to a large extent, provided guidelines
only for the conversion of text documents. Hence, there is a need to understand the standards and
processes of data mining and the storage of multimedia data through data mining techniques.
6. CONCLUSION
Multimedia data mining techniques are an active and growing area of research. In
digital library projects, multimedia data mining is needed for the conversion and
preservation of multimedia information. Libraries need a data mining strategy for the
conversion of multimedia files. Digital libraries, which are to a large extent
accessible through the web, must present multimedia information effectively if they are
to serve their purpose properly. To this end, a data mining strategy needs to be formed
that considers standards, features and available techniques.
