Intro To Data Mining

Introduction to
Software Engineering
NOTES
Self-Instructional Material 3
UNIT 1 INTRODUCTION TO DATA MINING &
WAREHOUSING (FULL TEXT)
Structure
1.0 Introduction
During the early nineties, industries realised that they were not getting
the promised returns on their investment in IT infrastructure. A major
focus of industrial leaders was to utilise IT as a strategic tool to
maximise profits. They expected IT to strengthen their decision-making
capability, not merely to produce MIS reports, which were primarily
routine in nature and did not help them sift through voluminous data to
identify hidden, camouflaged or implied information. The emphasis
therefore shifted from generating MIS reports to what was termed KDD
(Knowledge Discovery in Databases). KDD involves a large number of
components, of which the most sought after was the extraction of useful
information or patterns from massive corporate data, termed “Data
Mining”. The following text is a systematic presentation of Data Mining
techniques and of the Data Warehouse framework, which is the repository
of the vast, integrated, time-variant, historical and subject-oriented
data to be operated upon.
1.1 Unit Objectives

• To explain the evolution of Data Mining & Warehousing.
• To understand the components of a Data Mining & Warehousing system.
• To study the complementary relationship of Data Mining & Warehousing.
• To learn the steps involved in implementing a Data Mining &
Warehousing architecture.
• To identify problems associated with the Data Mining & Data Warehousing
framework.
• To know the role of Data Mining & Warehousing in strategic decision
making and in giving a competitive advantage to a business activity.
1.2 Emergence of Data Mining & Warehousing

1.2.1 Business needs drive technology
Data Mining & Warehousing have now become familiar terms not only for
computer professionals but for most decision makers. The technology surrounding
them has grown rapidly, with most of the leading companies of the world creating
products and services to exploit its potential. This brings out an underlying fact:
any technology results as an outcome of business needs. In this case, the business
community had an inescapable and urgent requirement for a powerful, online
decision-making tool to support and substantiate its own intuitive thought
processes. This resulted in the development of Data Mining & Warehousing
technology.
1.2.2 IT solutions for Strategic Decision making
Information Technology has emerged as a powerful business driver and an essential
component that gives companies a competitive advantage in a challenging market.
Each and every aspect of IT is being closely examined and integrated into the
business activity. Above all, the very existence of an industry depends on the
strategic decisions taken by its top echelon, because these are the ones which are
translated into actions by its middle-level and operational staff.
Data Mining & Warehousing IT solutions have taken strategic decision making
beyond the boundaries of conventional MIS (Management Information System) and
OIS (Operational Information System).
Information Systems
   Operational Information System: ERP, CRM, SCM etc.
   Management Information System: DSS, EIS, Expert Systems etc.
1.2.3 Focus on the End Users

Data Mining & Warehousing, as we have seen, has been developed for managers
and top-level executives, to assist them in reaching decisions based not only on
facts and figures seen superficially, but also on inferences drawn from hidden,
widely dispersed and uncorrelated data. The focus of the system is on End Users
and on meeting their requirements through simple interfaces. It is essential to keep
the Front End tools simple and user friendly. The experts have to create the system
keeping in view the technical limitations of the End Users and their lack of
understanding of system capabilities in the initial stages, and must support them in
achieving the desired result.
1.3 Evolution of Data mining & Warehousing

Data mining & Warehousing has grown to its present potential through a number
of clearly demarcated stages. Prior to the 1970s we had flat files and databases.
The emphasis then was on Data Collection & Database Creation. A manager
was able to use them only for simplifying his/her day-to-day tasks.
These were further developed into DBMS from the 1970s to the early 1980s,
offering advantages like concurrent, shared or distributed access, ensuring
consistency and providing information security. DBMS provided the backbone
for reports and supported the conventional MIS and OIS features.
Later on, DBMS followed three growth paths. These were:

(a) Advanced Database Systems – Developed from the 1980s till present.
They included Object-relational, Spatial, Temporal and Biological
Database systems.
(b) Data Warehousing & Data Mining – Evolved from the 1980s till
present as Knowledge Discovery & Data Mining, Data warehouse &
OLAP technologies. Their full potential was not realised initially due
to the limitations of the hardware, networking and software of the
time.
(c) Web Based Database Systems – Developed from the 1990s till
present. They were based on the phenomenal reach of the Internet,
XML based systems, Web technology and Web mining.
[Figure: Evolution of database technology. Flat Files, Database (1960-1970)
-> DBMS (1970-1980) -> Advanced Databases, Data Mining and Web Database
(1980 till today) -> Integrated Database (2000 onwards)]
The latest trend of the decade is Integrated Information Systems based on
the above three, paving the way for revolutionary developments in decision
making in the corporate world.
1.4 INTRODUCTION TO DATA MINING
1.4.1 What is Data mining
Data Mining means locating, identifying and finding unforeseen information
in a large database. The information sought is that which is interesting to the
end user. It can also be understood as data analysis based on searching or
learning dependent on deduction.
1.4.2 What is Interestingness of Pattern?
A data pattern discovered through a database search is considered
interesting if it is easily understood, is valid on new or test data with some
degree of certainty, is potentially useful and is novel. Interesting patterns are
identified by objective parameters, which are combined with subjective
requirements to reflect the needs and interests of a particular user.
1.4.3 Data mining & Knowledge Discovery

Difference between Data Mining & Knowledge Discovery in Databases
How are they different? Data Mining is devoted specifically to the processes
involved in the extraction of useful information by applying specific
techniques based on certain knowledge domains, such as statistics or
Artificial Intelligence. Knowledge Discovery is a wider term and covers the
entire range of activities, right from deciding business objectives and
capturing the desired data, through preparing, processing and arranging it and
applying predefined techniques, to presenting the results in an understandable
form to the user. Specifically, Knowledge Discovery can be sub-divided into
five steps, which are performed repetitively till the desired result is reached.
One of them is Data Mining.
1. Data Processing, comprising Data Selection, Data Cleaning and
Data Integration.
2. Data Transformation and organising in a form ready for fast access.
3. Data Mining (DM Engine) and other techniques like OLAP/
OLTP for searching and extraction.
4. Knowledge presentation methods through a Graphical User
Interface (GUI).
5. Analysing the result and assimilating it in a knowledge domain.
Following diagram refers:

[Figure: Data Processing -> Data Transformation -> Data Mining Engine ->
Knowledge Presentation through GUI -> Result Analysis]
We can thus consider Data mining as a subset of Knowledge Discovery.
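The steps above can be pictured as a chain of small functions. The following is only a minimal sketch in Python: the records, the function names and the "mining" step (a simple count of how many customers bought each item) are assumptions made for illustration, not part of any particular KDD product.

```python
from collections import Counter

# Hypothetical raw records: (customer, item); None marks a missing value.
RAW = [
    ("c1", "Bread"), ("c1", "butter"), ("c2", "bread"),
    ("c2", None), ("c3", "BUTTER "), ("c3", "bread"), ("c3", "bread"),
]

def process(records):
    """Step 1: data selection and cleaning - drop rows with a missing item."""
    return [(c, i) for c, i in records if i is not None]

def transform(records):
    """Step 2: normalise item names and group them per customer."""
    baskets = {}
    for customer, item in records:
        baskets.setdefault(customer, set()).add(item.strip().lower())
    return baskets

def mine(baskets):
    """Step 3: a trivial stand-in for the DM engine - count customers per item."""
    return Counter(item for items in baskets.values() for item in items)

def present(counts):
    """Step 4: present the pattern in readable form (stand-in for a GUI)."""
    return [f"{item}: {n} customer(s)" for item, n in counts.most_common()]

def run_kdd(records):
    """Run steps 1 to 4 in sequence; step 5 (analysis) is left to the user."""
    return present(mine(transform(process(records))))

report = run_kdd(RAW)
```

Data Mining proper is only the `mine` step here; everything around it belongs to the wider Knowledge Discovery process.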
1.4.4 Nature of Data to be mined – Operational & Analytical

Data Mining is an essential step towards the creation of Information Systems. These
are either Operational Information Systems, like Enterprise Resource Planning (ERP),
or Management Information Systems, including Decision Support Systems (DSS).
DSS assist managers in taking decisions based on available unstructured data and in
validating their intuitive judgements. OIS and DSS each have their own
requirements for data structures and databases.
The data in turn is categorised as Operational data, which is dynamic in nature and
meets short-term goals, and Analytical data, which has a longer time span and
supports intuitive decisions. An Operational Database supports transaction
processing through On Line Transaction Processing (OLTP) queries. An Analytical
Database meets the On Line Analytical Processing (OLAP) requirements of
Decision Support Systems (DSS).
Differences in DB requirements for OLTP & DSS
Characteristic                   DB for OLTP                  DB for OLAP Needs
1. Nature of content             Dynamic                      Static
2. Time span                     Current                      Historical
3. Time measured                 Implicit, implied            Explicit & mentioned
4. Level of detail/granularity   Primitive/detailed           Detailed & derived
5. Update cycle                  Real time                    Periodic, planned
6. Tasks                         Known pattern, repetitive    Unpredictable
7. Response                      Time bound                   Flexible
1.4.5 OLTP & OLAP
Having seen the database requirements of OIS & DSS, let us differentiate
the query systems associated with each. These are OLTP & OLAP. OLTP
fulfils the requirements of OIS well, as the queries are simple in nature.
OLAP, on the other hand, addresses the need to define more complex
queries and requires novel databases, in the form of Multi Dimensional &
Multi Relational Databases (MDDB & MRDB respectively), to provide
the back end.
Features of both OLAP & OLTP are compared below
Feature                          OLTP                               OLAP
1. Meant for                     OIS                                MIS/DSS
2. Purpose                       Supports transactions              For analysis
3. End user                      Operations level, DB specialists   Knowledge worker
4. Function                      Daily operations                   Long term needs
5. DB design                     ER based,                          Star/snowflake schemas,
                                 application oriented               subject oriented
6. Data                          Current, up-to-date                Historical
7. Summarization                 Primitive, highly detailed         Aggregated
8. View                          Relational                         Multidimensional,
                                                                    multi-relational
9. Work unit                     Short, simple transaction          Complex query
10. Access mode                  Both read/write                    Mostly read
11. Based on                     Data inputs                        Derived information
12. Operations                   Operation on primary key           Multiple scans
13. Number of records accessed   Few                                Many
14. Number of users              Large number                       Selected
15. DB size                      In MB/GB                           Over 100 GB to TB
16. Priority                     High performance & availability    High flexibility,
                                                                    end user autonomy
17. Measure                      Transaction throughput             Specific query
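The contrast in work unit, access mode and number of records accessed can be seen in a small sketch. It uses Python's built-in `sqlite3` module purely for illustration; the `sales` table and its rows are invented for this example, and a real OLAP system would normally run against a separate analytical database rather than the operational one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (order_id INTEGER PRIMARY KEY, region TEXT, "
    "year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [(1, "North", 2022, 100.0), (2, "South", 2022, 250.0),
     (3, "North", 2023, 175.0), (4, "South", 2023, 300.0),
     (5, "North", 2023, 25.0)],
)

# OLTP-style work unit: a short read/write transaction on one primary key.
with conn:
    conn.execute("UPDATE sales SET amount = 120.0 WHERE order_id = 1")
oltp_row = conn.execute(
    "SELECT amount FROM sales WHERE order_id = 1").fetchone()

# OLAP-style work unit: a read-only query that scans and aggregates many rows.
olap_rows = conn.execute(
    "SELECT region, year, SUM(amount) FROM sales "
    "GROUP BY region, year ORDER BY region, year").fetchall()
```

The first query touches a single record; the second reads every record to produce derived, summarised information.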
Comparison Database & DWH

DB (Transaction Processing)                          DWH (DSS)
1. Systematic data stored in a prescribed format     Collection of unstructured data
2. DB contains operational data                      Non-operational data
3. Uses structured language for searching            DM tools for extracting patterns
4. Keeps normalized data                             Does not store normalized data
Dr E.F. Codd’s guidelines for OLAP
OLAP is an essential ingredient of Data Mining. It is therefore essential to
understand the relevance of the guidelines of Dr E.F. Codd (a well known
authority on RDBMS). An interpretation of each of them is given below,
relating them to the issues involved:
1. Multidimensional conceptual view- Business problems are complex
and can be solved only through a multidimensional concept, as
normal queries cannot address them effectively. Multidimensional
schemas are therefore essential to create relevant databases.
2. Transparency- An end user must be presented with a cohesive,
unambiguous version of the data and must not be exposed to the
complexity and diversity of data sources.
3. Accessibility- Essential data must be identified and accessed.
4. Consistent reporting performance- Reporting must remain dependable
and reliable even as the database size increases.
5. Client/server architecture- DM and Warehousing systems are created
to meet growing business needs and financial constraints.
6. Generic dimensionality- Every data dimension has the same
importance.
7. Dynamic sparse matrix handling- The system must keep the
database size within limits by adopting suitable methods of handling
sparse matrices.
8. Multiuser support- The system must permit a large number of users to
access it at the same time.
9. Unrestricted cross-dimensional operations- Multidimensional schemas
must be well understood and well designed, and must permit cross
references.
10. Intuitive data manipulation- OLAP is created for decision makers to
make intuitive decisions. They are not computer experts and must be
provided with user friendly, uncomplicated access to generate
queries.
11. Flexible reporting- The system must be capable of providing the
reports desired by the end user.
12. Unlimited dimensions and aggregation levels- The system must remain
flexible and expandable, so that extra dimensions and additional
aggregations can be added.
1.4.6 Data Mining a multidisciplinary area
Data mining is a confluence or combination of multiple disciplines. Some of
these are:
1. Information science
2. Database technology
3. Statistics
4. Machine Learning
5. Visualization
6. Other Disciplines
[Figure: Data Mining at the centre, drawing on Statistics, Information
Science, Database Technology, Machine Learning, Visualisation and Other
Sciences]
Successful development of a Data Mining system would thus require
joint efforts from experts of different domains.
1.4.7 Classification of Data mining systems

Data mining involves the development of special algorithms to answer the queries of
various users. The procedure is to evolve a number of models and to match one of
them to the data stored in the database. Three steps are involved in this process:
creating a model, finding the criteria for preferring one model over others, and
identifying the search technique.
Data Mining models, being mathematical in nature, are classified as Predictive and
Descriptive.
(a) A Predictive model states in advance the values that data may assume,
based on known results from other data stored in the database.
A predictive model performs the data mining tasks of Classification,
Regression, Time Series Analysis and Prediction.
(b) A Descriptive model is based on the identification of patterns and
relationships in data. It aims to discover, rather than predict, the properties
of data.
A descriptive model performs the data mining tasks of Clustering,
Summarisation, Association Rules and Sequence Discovery.
Data Mining Models
   Predictive Model
   Descriptive Model
1.4.8 Data Mining Tasks

The Basic tasks under Predictive and Descriptive models are:
Predictive Model
(a) Classification- Data is mapped into predefined groups or classes. It is
also termed supervised learning, as the classes are established prior to
examination of the data.
(b) Regression- A data item is mapped onto a known type of function. These
may be linear, logistic functions etc.
(c) Time Series Analysis- The value of an attribute is examined at evenly
spaced times, as it varies with time.
(d) Prediction- It means foretelling future data states based on past and
current data.
Descriptive Model
(a) Clustering- It is also referred to as unsupervised learning or
segmentation/partitioning. In clustering, the groups are not predefined.
(b) Summarisation- Data is mapped into subsets with simple descriptions.
It is also termed characterisation or generalisation.
(c) Sequence Discovery- Sequential analysis or sequence discovery is used
to find sequential patterns in data. It is similar to association, but the
relationship is based on time.
(d) Association Rules- A model which identifies specific types of data
associations.
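The difference between a predictive task with predefined classes and a descriptive task without them can be illustrated with a small sketch. The two functions below are assumptions chosen for brevity (a one-nearest-neighbour classifier and a two-group clustering on a single numeric attribute), and the marks data is invented; practical systems use far richer algorithms.

```python
# Hypothetical labelled marks: the classes "pass"/"fail" are known in advance.
MARKS = [(35, "fail"), (42, "fail"), (68, "pass"), (80, "pass")]

def classify_1nn(labelled, x):
    """Classification (supervised): give x the class of its nearest example."""
    nearest = min(labelled, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def cluster_2means(values, iterations=10):
    """Clustering (unsupervised): split values into two groups around two
    centres. Assumes the values contain at least two distinct numbers."""
    lo, hi = min(values), max(values)  # initial centres
    for _ in range(iterations):
        a = [v for v in values if abs(v - lo) <= abs(v - hi)]
        b = [v for v in values if abs(v - lo) > abs(v - hi)]
        lo, hi = sum(a) / len(a), sum(b) / len(b)  # move centres to the means
    return sorted(a), sorted(b)

predicted = classify_1nn(MARKS, 60)
groups = cluster_2means([35, 42, 68, 80, 38, 75])
```

In the first call the class labels come from the data supplied; in the second, the two groups emerge from the values alone.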
1.4.9 Data Mining Primitives

A data mining task is expressed in the form of a DMQL statement and requires
certain primitives to be stated. These are:
(a) Task-relevant data- It specifies the part of the database to be examined.
(b) Nature of knowledge to be mined- It defines the tasks or functions to be
performed on the data. Examples are Characterisation, Association and Clustering.
(c) Background knowledge- Here it means the concept hierarchies, as they indicate
the level of abstraction at which data is to be mined.
(d) Interestingness measures- These are defined for the task or function to be
performed. For example, for an Association rule the Support and Confidence
factors are measured against threshold levels specified by the user, as a measure
of interestingness.
(e) Presentation & visualisation of discovered patterns- They refer to the ways in
which the result obtained can be displayed for the convenience of the user.
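As an illustration of interestingness measures, the usual definitions of Support and Confidence for a rule A => B are: support is the fraction of transactions containing both A and B, and confidence is support(A and B) divided by support(A). The sketch below applies them in Python; the transactions and the threshold values are invented for this example.

```python
# Hypothetical market-basket transactions.
TRANSACTIONS = [
    {"bread", "butter"},
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"milk"},
]

def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def is_interesting(transactions, antecedent, consequent,
                   min_support=0.4, min_confidence=0.6):
    """Compare the rule against user-specified thresholds (assumed values)."""
    s = support(transactions, set(antecedent) | set(consequent))
    c = confidence(transactions, antecedent, consequent)
    return s >= min_support and c >= min_confidence

rule_support = support(TRANSACTIONS, {"bread", "butter"})
rule_confidence = confidence(TRANSACTIONS, {"bread"}, {"butter"})
```

Here the rule bread => butter appears in two of four transactions, and two of the three transactions containing bread also contain butter.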
1.4.10 Data Mining Query Language (DMQL)

Data mining systems are required to support ad hoc and interactive knowledge
discovery from relational databases at multiple levels of abstraction. Data
mining query languages are designed to meet this requirement. They help us
formulate a query that defines the data mining task primitives.
The primitives require:
(a) Set of task-relevant data to be mined.
(b) Nature of knowledge to be mined.
(c) Background knowledge required for the discovery.
(d) Measures of interestingness.
(e) Visualisation representation.
DMQL follows an SQL-like syntax, which is amenable to linking with
relational query languages and simplifies the user's task of knowledge
extraction.
1.4.11 Integration of Data mining systems

A diverse number of Data mining tools may be available in an
organisation. It is essential to identify and categorise them to understand which
models and tasks they support, and to find out whether any data mining tools
are being developed in-house. They should be displayed on the GUI of the client
desktop, so that the right data mining tool can be selected for the problem in hand.
1.4.12 Major issues of Data mining

These are mentioned below:
(a) Human Interaction.
(b) Over-fitting
(c) Outliers.
(d) Interpretation of Results
(e) Visualisation of Results.
(f) Large Datasets
(g) High Dimensionality.
(h) Multimedia data.
(i) Missing Data.
(j) Irrelevant Data.
(k) Noisy Data.
(l) Changing data.
(m) Integration.
(n) Application
1.5 Introduction to Data Warehousing( DWH)

1.5.1 What is Data Warehouse
W.H. Inmon, well known as the father of the data warehouse concept, defines it
thus: “A Data warehouse is a subject oriented, integrated, non-volatile and
time-variant collection of data in support of management’s decisions”.
Subject oriented means that data in a data warehouse is organised subject-wise,
even at the expense of redundancy. Thus every manager has access to the desired
information in the shortest possible time, notwithstanding the extra space
occupied by it.
Integrated implies that related database tables, created in the form of Fact &
Dimension tables, can be linked to each other and are not stored as stand-alone
data resources.
Non-volatile means that data is stored on a permanent, non-destructible basis. It
can be purged or removed only as an exception, to meet an organisational need.
Time variant requires all data entered in the data warehouse to be time stamped,
that is, associated with its time of entry. The time element introduced may not be
the actual time when the data entered the operational system.
Data Warehouse (DWH) Block Diagram
[Figure: Components shown are External Data, Operational Data Store,
Transformation Tools, Data warehouse DBMS & Data Repository, Metadata,
MRDB, MDDB, DATA MARTS, Data mining Tools, OLAP Tools, Report/Query
Tools, Admin & System Mgt. tools and Information Delivery]
1.5.2 Data Warehouse Building Blocks

Data Pre-processing Tools
It means the sourcing, acquisition, cleaning and transformation of data prior to
its entry into the Data warehouse repository. The data is received from legacy
systems, the Web or other external sources. The data, and the databases from
which it is received, are heterogeneous, and it therefore requires:
(a) Removal of unwanted data
(b) Converting to common data & definition names
(c) Summarising the data
(d) Completing missing data
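The four requirements above can be sketched on a handful of records. This is only an illustration in Python: the field names, the name mapping and the choice of filling a missing amount with the mean of the known amounts are assumptions, and real pre-processing tools apply rules recorded in the metadata.

```python
# Hypothetical records from two sources that name their fields differently.
RAW = [
    {"cust": "A", "amt": 100.0, "debug": "x"},
    {"customer_id": "A", "amount": None},
    {"customer_id": "B", "amount": 40.0},
    {"cust": "B", "amt": 70.0},
]
NAME_MAP = {"cust": "customer_id", "amt": "amount"}  # assumed common names
WANTED = {"customer_id", "amount"}

def preprocess(rows):
    # (b) Convert to common data and definition names.
    renamed = [{NAME_MAP.get(k, k): v for k, v in r.items()} for r in rows]
    # (a) Remove unwanted data (fields the warehouse does not need).
    kept = [{k: v for k, v in r.items() if k in WANTED} for r in renamed]
    # (d) Complete missing data: fill with the mean of known amounts
    #     (one possible policy, assumed here).
    known = [r["amount"] for r in kept if r["amount"] is not None]
    mean = sum(known) / len(known)
    for r in kept:
        if r["amount"] is None:
            r["amount"] = mean
    # (c) Summarise the data: total amount per customer.
    totals = {}
    for r in kept:
        totals[r["customer_id"]] = totals.get(r["customer_id"], 0.0) + r["amount"]
    return totals

summary = preprocess(RAW)
```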
Operational Data Store

The data is transformed and loaded into the Operational Data Store (ODS) in real
time. From the ODS it is loaded into the Data warehouse after extraction and
cleaning operations, at regular intervals rather than as and when it is received
from external sources. As such, a time of entry is attached to it. The data thus
available is loaded under the control of Metadata.
Metadata
Metadata is data about data and keeps information as:
(a) Technical metadata- Contains sources of data, data structures,
transformation descriptions, rules specified during data processing,
access authorisations and backup history.
(b) Business metadata- Contains information about subject areas,
information object types, Internet home pages, Information Delivery System
details (that is, when to despatch information and to whom), Data warehouse
operational information and ownership details.
Data Warehouse Database
It is the central database, consisting of the Data Warehouse RDBMS, a large
repository and supporting databases like the Multi Relational Database, Multi
Dimensional Database & Data marts.
Data Mart
A Data mart is another important component of the Data warehouse and is a data
store that is subsidiary to a data warehouse. It is created to meet the specific
information needs of different functional area managers. Data marts are a part
of the data warehouse database and cannot be taken as an alternative to a data
warehouse.
Management & Administration Tools
They are provided for:
1. Managing & updating of metadata.
2. Backup & recovery.
3. Removal of unwanted data.
4. Security & assigning priorities.
5. Quality checks.
6. Distribution of data.
Access Tools
They are categorised as:
1. Query & reporting Tools.
2. Application Tools- To meet specific user requirements.
3. Data Mining Tools- To discover knowledge, visualise data and
correct data when the input data is incomplete.
4. OLAP Tools- These are associated with multidimensional databases to
provide elaborate, complex views for analysis.
Information Delivery System
It provides an external interface to deliver Data Warehouse reports and
information objects to external users as per a specified schedule.
1.5.3 Granularity of Data
It means the level of detail or summarisation at which data is stored in a data
warehouse. The higher the level of granularity, the less detail is available for a
data item, and vice versa. A data warehouse manager is required to identify the
granularity of data for the organisation so that reports of the requisite detail
are available.
An example is the maintenance, by Telecom Operators, of the details of each and
every call made by a mobile user. This provides a high level of detail (a low
level of granularity) to meet legal requirements at a later stage.
Granular data offers the advantage of reusability by other users and also
helps in optimising storage space.
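The telecom example can be sketched as follows. The call records and the `summarise` helper are invented for illustration; the point is only that summaries at a higher level of granularity can always be derived from the detailed records, while the reverse is not possible.

```python
from collections import defaultdict

# Hypothetical call detail records: (subscriber, date, minutes).
CALLS = [
    ("S1", "2024-01-01", 3), ("S1", "2024-01-01", 12),
    ("S1", "2024-01-02", 5), ("S2", "2024-01-01", 7),
]

def summarise(calls, key):
    """Total the minutes of all calls that share the same key."""
    totals = defaultdict(int)
    for call in calls:
        totals[key(call)] += call[2]
    return dict(totals)

# Less detail: one row per subscriber per day.
per_day = summarise(CALLS, key=lambda c: (c[0], c[1]))
# Even less detail: one row per subscriber.
per_subscriber = summarise(CALLS, key=lambda c: c[0])
```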
1.5.4 Multidimensional Data Models & Schemas
Data warehouses and OLAP tools are based on what is known as a
Multidimensional model. In such a model, data is visualised as a Data cube
defined by Fact & Dimension tables.
Facts are the numerical measures of a central theme. For example, for the
theme Student the measures may be Marks_obtained and Division_Scored.
Dimensions are the entities with respect to which the organisation keeps its
records, for example Teacher, Subject, Class, College, University etc.
Concept Hierarchies
A concept hierarchy defines a sequence of levels for each entity. An
example is City, District, State and Country.
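A concept hierarchy lets the same facts be viewed at different levels of abstraction. The sketch below rolls city-level figures up the City, District, State, Country hierarchy; the place names and figures are placeholders invented for this example.

```python
LEVELS = ["city", "district", "state", "country"]

# Hypothetical hierarchy stored as child -> parent links.
PARENT = {
    "CityA": "DistrictX", "CityB": "DistrictX", "CityC": "DistrictY",
    "DistrictX": "State1", "DistrictY": "State1",
    "State1": "Country1",
}

def ancestor(city, level):
    """Walk up from a city to its ancestor at the requested level."""
    node = city
    for _ in range(LEVELS.index(level)):
        node = PARENT[node]
    return node

def roll_up(sales_by_city, level):
    """Aggregate city-level sales to a higher level of the hierarchy."""
    totals = {}
    for city, amount in sales_by_city.items():
        key = ancestor(city, level)
        totals[key] = totals.get(key, 0) + amount
    return totals

SALES = {"CityA": 10, "CityB": 20, "CityC": 5}
by_district = roll_up(SALES, "district")
by_state = roll_up(SALES, "state")
```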
Schemas
While Entity- Relationship model was found adequate in the design of
Relational Databases, a data warehouse requires a subject oriented schema for better
analysis and handling more complex queries.
Three schemas are therefore created to meet the data warehouse requirements. These
are:
Star schema
This is the most common model. A large central Fact table stores the bulk of the
data without duplication or redundancy. It refers to a number of Dimension tables,
each of which handles one dimension. Refer to the example given:
[Figure: Star schema. A central Fact table, containing keys (say A, B, C, D) and
measures, is surrounded by Dimension tables, each linked to it through its own
key (Key A, Key B and so on)]
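A star schema can be sketched with Python's built-in `sqlite3` module. The table and column names below are assumptions that follow the Student example given earlier (a fact table of marks, with Subject and College as dimensions); the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one per dimension, each with its own key.
CREATE TABLE dim_subject (subject_key INTEGER PRIMARY KEY, subject TEXT);
CREATE TABLE dim_college (college_key INTEGER PRIMARY KEY, college TEXT);
-- Fact table: keys pointing to the dimensions, plus the numerical measure.
CREATE TABLE fact_marks (
    subject_key INTEGER REFERENCES dim_subject(subject_key),
    college_key INTEGER REFERENCES dim_college(college_key),
    marks_obtained INTEGER
);
INSERT INTO dim_subject VALUES (1, 'Maths'), (2, 'Physics');
INSERT INTO dim_college VALUES (1, 'College A'), (2, 'College B');
INSERT INTO fact_marks VALUES (1, 1, 80), (1, 2, 60), (2, 1, 70), (2, 1, 90);
""")

# A typical analytical query: join the fact table to one dimension
# and aggregate the measure.
avg_by_subject = conn.execute("""
    SELECT s.subject, AVG(f.marks_obtained)
    FROM fact_marks f JOIN dim_subject s ON f.subject_key = s.subject_key
    GROUP BY s.subject ORDER BY s.subject
""").fetchall()
```

In a snowflake variant of the same design, a dimension such as College would itself refer to a further table (for example University) through an additional key.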
Snowflake schema
It is an extension of the Star schema. The dimension tables are further extended
to extra tables. The diagram below gives an example:
[Figure: Snowflake schema. A central Fact table, containing keys (say A, B, C, D)
and measures, is linked to Dimension tables with Key A, Key B, Key C and Key D.
The Dimension tables for Key C and Key D also carry Key E and Key F, which link
to further Dimension tables with Key E and Key F]
Constellation Schema
It has multiple Fact tables to meet the requirements of more advanced
applications. The Fact tables are permitted to share Dimension tables. The example
given below refers:
[Figure: Constellation schema. Fact table 1 (Keys A, C) and Fact Table 2
(Keys B, D) share a Dimension table (Keys A, B, C). Further Dimension tables
carry Key C, Key D and Key E]
1.5.5 Data Warehouse design

A Data Warehouse design consists of:
1. Choosing a business process to model, for example Orders, Invoices,
Shipments etc.
2. Choosing a DWH for a large organisation, or a Datamart for a
departmental implementation.
3. Choosing the grain of the business process: the fundamental, atomic level
of data to be represented in the Fact table.
4. Choosing the Dimensions to be applied to each Fact table.
5. Choosing the measures that will populate each Fact table record, e.g.
Units_sold, Rs_sold.
Based on these principles, a nine-step method is evolved as under:
1. Choosing the subject matter.
2. Deciding what the Fact table represents.
3. Identifying and conforming the dimensions.
4. Choosing the Facts.
5. Storing pre-calculations in the Fact table.
6. Rounding out the dimension tables.
7. Choosing the duration of the databases.
8. The need to track slowly changing dimensions.
9. Deciding the Query priorities.
1.5.6 Data Warehouse Architecture
Data Warehouse architecture is based on an RDBMS server. It has a
massive central repository for storage of data, subsidiary databases and front
end tools.
The architecture consists of:
1. Bottom Tier- An RDBMS & a DWH Server
2. Middle Tier- OLAP Server
3. Top Tier- Front End Tools
DATA SERVERS              APPLICATION SERVERS    CLIENTS
Data Warehouse Database   MDDB                   GUI
Metadata                  MRDB                   Presentation Logic
Data Logic                Metadata               Query Specification
Virtual Warehouse
Another commonly used term is a Virtual Warehouse. It is a set of views over
operational databases. For efficient query processing, only some of the
possible summary views are materialised. It is easy to build but requires
excess capacity on the operational database servers.
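The idea of views over an operational database, with only some summaries materialised, can be sketched with Python's built-in `sqlite3` module. The `orders` table is invented for this example. SQLite has no materialised views, so a snapshot table created from the view stands in for one here; that substitution is an assumption of the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'North', 100.0), (2, 'South', 50.0),
                          (3, 'North', 25.0);

-- Virtual warehouse: a view recomputed from operational data on every query.
CREATE VIEW v_sales_by_region AS
    SELECT region, SUM(amount) AS total FROM orders GROUP BY region;

-- Stand-in for a materialised summary: a snapshot table built from the view.
CREATE TABLE m_sales_by_region AS SELECT * FROM v_sales_by_region;

-- A new operational record arrives after the snapshot was taken.
INSERT INTO orders VALUES (4, 'South', 10.0);
""")

live = conn.execute(
    "SELECT region, total FROM v_sales_by_region ORDER BY region").fetchall()
snapshot = conn.execute(
    "SELECT region, total FROM m_sales_by_region ORDER BY region").fetchall()
```

The view always reflects the current operational data but places the query load on the operational server; the snapshot answers quickly but must be refreshed.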
Developing a Data Warehouse
It consists of:
1. Defining a high-level corporate data model.
2. Developing an Enterprise Data Warehouse and continuing to refine it to
meet user requirements.
3. In parallel, developing data marts and refining these models.
1.5.7 ROLAP,MOLAP & HOLAP
These tools utilise specialised data structures to organise, navigate and
analyse data, typically in an aggregated form. They require a tight coupling
with the application and the presentation layer.
1. MOLAP architecture stores data in a structure that mirrors the way it will
finally be utilised, in order to enhance performance. It is particularly well
suited for iterative and time series analysis. It provides tools to access data
maintained in the DWH repository (RDBMS) when the MDDB does not have
the desired data. MOLAP tools give the user high performance and better
understanding, due to specialised indexing and storage optimisations. They
require less space due to the use of compression.
MOLAP ARCHITECTURE
[Figure: The MOLAP Server (data cube, MDDB) sends a query to the Data
Warehouse Database Server (with Metadata), and new data is loaded into it. The
Front End Tool requests information from the MOLAP Server and receives the
required information as the result of the search]
2. ROLAP works directly with the RDBMS and is more scalable. It depends
on the database for calculations, and therefore its performance suffers. ROLAP
servers contain both numeric and textual data and serve broader needs. They
support large databases and parallel processing, offer good security and
employ known technologies.
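ROLAP ARCHITECTURE
[Figure: The Front End Tools send an information request to the ROLAP Server
(MRDB & Metadata), which sends an SQL request to the Data Warehouse Database
(with Metadata). The requested dataset is returned to the ROLAP Server, which
gives the result of the search to the Front End Tools]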
3. HOLAP uses the best features of both, i.e. the flexibility of the ROLAP RDBMS
and the optimised multidimensional structure of MOLAP. Users are given the
ability to perform limited analysis either directly against RDBMS products
or through an intermediate MOLAP server. A user can send a query to
select data from the DBMS, which then delivers the requested data to the
desktop, where it is placed in a data cube. The desired information is
maintained locally and need not be created each time a query is given.
4. Salient differences to be noted are:
(a) In MOLAP, no query is given directly by the user to the DWH
Server. The desired multidimensional data is positioned in the
MOLAP server after the MOLAP server sends an SQL query to
the DWH Server.
(b) The ROLAP server does not store the intermediate result in a cube
but in a relational table. The user gets his/her query serviced by the
ROLAP server.
(c) In HOLAP, SQL is sent by the user to the DWH server. The result
is then either received directly or through an intermediate MOLAP
server, where a data cube is created and accessed by the user.
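The difference between answering a query from relational rows and answering it from a pre-built cube can be sketched in Python. The facts, the two dimensions and all function names are assumptions made for illustration; real MOLAP servers use specialised storage and compression rather than a dictionary.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("region", "year")
FACTS = [  # hypothetical fact rows
    {"region": "North", "year": 2022, "amount": 100},
    {"region": "North", "year": 2023, "amount": 200},
    {"region": "South", "year": 2023, "amount": 50},
]

def rolap_query(facts, **filters):
    """ROLAP style: scan the relational rows at query time."""
    return sum(f["amount"] for f in facts
               if all(f[d] == v for d, v in filters.items()))

def build_cube(facts):
    """MOLAP style: pre-aggregate every combination of dimensions once."""
    cube = defaultdict(int)
    for f in facts:
        for r in range(len(DIMS) + 1):
            for dims in combinations(DIMS, r):
                cell = tuple((d, f[d]) for d in dims)
                cube[cell] += f["amount"]
    return dict(cube)

def molap_query(cube, **filters):
    """Look up a pre-computed cell; the fact rows are not scanned again."""
    cell = tuple((d, filters[d]) for d in DIMS if d in filters)
    return cube.get(cell, 0)

CUBE = build_cube(FACTS)
north_total = molap_query(CUBE, region="North")
```

Both routes give the same answers; the cube trades storage and load time for faster responses. A HOLAP arrangement would, in this picture, keep some cells in the cube and fall back to the relational scan for the rest.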
1.6 Developing a Data Mining & Warehousing framework
1.6.1 Evolving System.

Data Mining & Warehousing systems are not only massive but also extremely
critical for the entire organisation. They are not built in one stroke but evolved
gradually, as they require a substantial investment from the organisation
in the form of human and financial resources. It is therefore common practice
to identify one subject area, build a system for it and then gradually move to
other functional areas. Several measures are used to curtail the investment.
One of them is to implement only a Datamart to begin with and then create the
data warehouse repository. It is also essential to select systems which are
upward scalable, keeping the evolution factor in mind.
1.6.2 SDLC & CLDS
Systems Development Life Cycle (SDLC) is a requirements-driven life
cycle and supports the operational environment. A well known model is
the Waterfall model of development.
[Figure: Waterfall model stages in sequence: Feasibility, Analysis, Design,
Coding, Testing, Integration, Implementation]
The CLDS (SDLC spelt in reverse) is the methodology followed for
developing a data mining application. It runs in the reverse direction because the
end user, being a manager and not a technocrat, does not realise at the outset the
potential of the system and its decision support capability. The user expects the
technical experts to present the available data, identify suitable algorithms and
test the results.
[Figure: CLDS stages in sequence: Implement Warehouse, Integrate data, Test for
information bias, Develop program for data, Design DSS System, Analyze result,
Understand requirements]
1.6.3 Selection of Hardware of Data Warehouse

Decision to select hardware is based on:
(a) Existing & Future Business growth plan of the organisation
(b) Data expected to be stored.
(c) Performance expected from system.
(d) Identifying suitable SMP or MPP system
(e) Network Bandwidth for external connectivity
(f) Nature of legacy systems and their interfaces.
(g) Nature of Information Delivery System
(h) Data & Application Servers
(i) Client desktop systems with adequate storage & processing power
1.6.4 Selection or Development of Data Mining Tools

(a) Identify the User requirement.
(b) Categorise the model applicable and the associated tasks.
(c) Make or Buy suitable tools
(d) Find out the available data warehouse capabilities and develop an
application based on these.
1.7 Data Mining & Warehousing Challenges

1.7.1 Major issues/challenges to Data mining
These are:
1. Mining different kinds of knowledge in databases.
2. Interactive mining of knowledge at multiple levels of abstraction.
3. Incorporation of background knowledge.
4. Data mining Query languages and ad hoc data mining.
5. Presentation and visualisation of data mining results.
6. Handling noisy or incomplete data.
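Item 6 can be illustrated with a minimal sketch of one simple treatment: missing values are filled with the median of the known values, and implausible values are clipped to an allowed range. The function name `clean`, the sample figures and the range limits are assumptions made for this example. Real systems choose the treatment to suit the data.

```python
from statistics import median


def clean(values, low, high):
    """Fill missing readings with the median and clip values to [low, high]."""
    known = [v for v in values if v is not None]
    fill = median(known)  # median is less sensitive to outliers than the mean
    return [min(max(fill if v is None else v, low), high) for v in values]


# Hypothetical monthly sales figures: one missing entry, one implausible spike.
raw = [120, 135, None, 128, 9999, 131]
print(clean(raw, low=0, high=500))
```

The sketch assumes at least one known value is present. A list consisting only of missing entries would need separate handling.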

1.7.2 Data Warehouse - Open issues & research problems
Data Warehouse being an evolving system has many issues which are
open for further study & research. These are in the areas of:
1. Security- The data warehouse contains full details of organisational
information, some of which is highly confidential, and many employees
are given access to its contents. Special efforts are required to
maintain the safety, security and confidentiality of the valuable and
sensitive information residing in it.
2. Performance- Very large databases require storage on multiprocessor
systems rather than conventional computers. These fall under the
categories of Massively Parallel Processors (MPP) and Symmetric
Multiprocessors (SMP). Both are upgradable and permit a very large
database to be partitioned and accessed at high speed. Features like
shared memory protection and dynamic load balancing are built into them.
Designing scalable systems is an important factor.
3. Functionality- Modification of access methods is a major challenge
to data warehouse managers. Developing novel data mining
algorithms & OLAP operations to meet users' latest requirements is
essential for extracting interesting data from the warehouse.
4. Presentation- Data visualisation techniques have to be constantly
improved so that end users can utilise the full capability of the system,
draw inferences and analyse the results.
1.8 Impact of Data Mining & Warehousing
1. Collecting, cleaning and transforming data from heterogeneous sources.
2. Integrating data from dissimilar sources. Examples of heterogeneous
& dissimilar sources are legacy systems, the World Wide Web and the
organisation's ERPs.
3. Storing data in a systematic but easily accessible manner.
Examples are creating appropriate Fact & relation tables, organising
data marts and partitioning of the database.
4. Organising data to facilitate manipulation and processing of very large
data. A good metadata design is a prime factor in attaining this
objective.
5. Simplifying exploration of massive data. Creation of data marts is
essential for meeting this goal.
6. Arranging data for interpretation & pattern recognition. Efficient data
mining algorithms are necessary to achieve it.
7. Presenting data in a readable manner for the managers by designing a
proper GUI.
8. Business decision making based on data mining & other access tools.
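Points 1 and 2 can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: two made-up sources (a legacy CSV export and an ERP extract) use different field names and units, and each is mapped to one common layout before the records are combined. All field names, values and function names are hypothetical.

```python
import csv
import io

# Source 1: a legacy system exporting CSV with its own column names.
legacy_csv = "CUST_ID,AMT_PAISE\n101,250000\n102,99900\n"
# Source 2: an ERP extract already held as records, amounts in rupees.
erp_rows = [{"customer": "103", "amount": "1200.50"},
            {"customer": "101", "amount": " 300 "}]


def from_legacy(text):
    """Map legacy columns to the common layout and convert paise to rupees."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"customer_id": int(row["CUST_ID"]),
               "amount": int(row["AMT_PAISE"]) / 100,
               "source": "legacy"}


def from_erp(rows):
    """Strip stray whitespace and cast ERP strings to numbers."""
    for row in rows:
        yield {"customer_id": int(row["customer"]),
               "amount": float(row["amount"].strip()),
               "source": "erp"}


# Integration step: one list of records in a single, uniform layout.
integrated = list(from_legacy(legacy_csv)) + list(from_erp(erp_rows))

# With a uniform layout, totals per customer can span both sources.
totals = {}
for rec in integrated:
    totals[rec["customer_id"]] = totals.get(rec["customer_id"], 0) + rec["amount"]
print(totals)
```

Keeping one small transform function per source means a new source can be added without touching the integration step.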
1.9 Summary- Overview of Data Mining & Warehousing
1. Data mining & data warehousing principles were identified during the early 1990s
but were implemented only after a decade due to non-availability of:
(a) Suitable hardware at an affordable cost to support parallel processing, provide
fast speeds and store massive databases.
(b) Operating systems to support parallel architectures.
(c) DBMS to manage very large databases.
(d) Network bandwidth for interconnectivity.
(e) Suitable data mining algorithms.
(f) Visualisation & presentation tools.
2. The data warehouse provides a platform for capturing, refining, integrating and
transforming data received from diverse sources. The data is then stored subject-wise
in a central repository accessed through a metadata mechanism.
3. The data is stored in relational form by creating Fact & Dimension tables
connected through specialised schemas based on concept hierarchies. These are the
Star, Snowflake & Fact Constellation schemas.
4. The data is further organised as MRDB & MDDB, which may also be considered
part of the data marts created for different users to meet their specific requirements.
5. Data mining, OLAP, query and application tools are used to produce useful
reports.
6. Any result requires an effective presentation. GUI and visualisation tools are made
available to managers so that they can assimilate and analyse the results for speedy
decision making.
7. Data mining & warehousing is constantly evolving and has to adopt a framework
which is flexible and can absorb organisational and technological changes.
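The Star schema mentioned in point 3 can be sketched with an in-memory SQLite database. This is a minimal illustration only: the table names (`fact_sales`, `dim_product`, `dim_time`), columns and rows are invented for the example, and a real warehouse would have many more dimensions and attributes.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
-- The fact table holds the measure plus one foreign key per dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    time_id    INTEGER REFERENCES dim_time(time_id),
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'Pen', 'Stationery'), (2, 'Lamp', 'Electrical');
INSERT INTO dim_time VALUES (1, 2008, 'Q1'), (2, 2008, 'Q2');
INSERT INTO fact_sales VALUES (1, 1, 100.0), (1, 2, 150.0), (2, 1, 400.0);
""")

# Roll the measure up along the product dimension's category level.
rows = db.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.category ORDER BY p.category
""").fetchall()
print(rows)
```

The central fact table joined to each dimension table gives the star shape. Grouping by `category` rather than `name` is a simple use of a concept hierarchy (product to category).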
1.10 Answer to ‘Check your Progress’

Question 1: In what way do Data Warehouse and Data Mining complement each other?
Question 2: Data Warehouse & Data Mining principles were formulated in the early
1990s but implemented only after a decade. Why?
Question 3: Differentiate between Operational Data Base, Operational Data
Store & Data Warehouse Repository.
Question 4: What is the function of the “Information Delivery System”?
Question 5: What are the differences between Data Mining, OLAP & Query Tools?
1.11 Exercises and Questions

1.11.1 Short-Answer Type Questions
Q1: Explain how the evolution of database technology led to Data
Mining.
Q2: Describe the steps involved in Data Mining when viewed as a process
of knowledge discovery.
Q3: What are Data Mining models and the tasks associated with them?
Q4: What are the Data Mining issues? In what way do they affect
implementation of a Data Mining system?
Q5: What is meant by interestingness of data as related to Data Mining?
1.11.2 Long-Answer type Questions
Q1: Define the data mining functionalities.
Q2: What is the difference between:
(a) Discrimination & Classification
(b) Characterization & Clustering
(c) Classification & Prediction
(d) ROLAP & MOLAP
(e) MDDB & MRDB
(f) Data Warehouse & Data Mart
Q3: Describe three challenges to data mining regarding data mining
methodology and user interaction issues.
Q4: Describe two challenges to data mining regarding performance
issues.
Tip: Performance issues are:
(a) Efficiency & scalability of data mining algorithms.
(b) Parallel, distributed and incremental mining algorithms.
Q5: Describe issues related to diversity of database types.
Tip: (a) Handling of relational and complex types of data.
(b) Mining information from heterogeneous databases and
global information systems.
1.12 Further Reading
1. Data Mining: Concepts & Techniques - Jiawei Han and Micheline
Kamber, Second Edition, Elsevier, 2006
2. Data Warehousing, Data Mining & OLAP - Alex Berson, Stephen J.
Smith, Tata McGraw Hill, 2004
3. Building the Data Warehouse - W.H. Inmon, Third Edition,
Wiley, 2009
