Introduction to Software Engineering
Self-Instructional Material
UNIT 1 INTRODUCTION TO DATA MINING & WAREHOUSING
Structure
1.0 Introduction
During the early nineties, industries realised that they were not getting
the promised returns on their investment in IT infrastructure. Industrial
leaders wanted to use IT as a strategic tool to maximise profits. They
expected IT to strengthen their decision-making capability, not merely to
produce MIS reports, which were primarily routine in nature and did not
help them sift through voluminous data to identify hidden, camouflaged or
implied information. The emphasis therefore shifted from generating MIS
reports to what was termed KDD (Knowledge Discovery in Databases). KDD
involves a large number of components, of which the most sought-after is
the extraction of useful information or patterns from massive corporate
data, termed "Data Mining". The following text is a systematic
presentation of Data Mining techniques and of the Data Warehouse
framework, which is the repository of the vast, integrated, time-variant,
historical and subject-oriented data to be operated upon.
[Figure: Information Systems, divided into Operational Information System and Management Information System; database labels: Advanced Databases, Web Database]
1.4 INTRODUCTION TO DATA MINING
1.4.1 What is Data mining
Data Mining means locating, identifying and finding unforeseen information
in a large database. The information must be of interest to the end user.
It can also be understood as data analysis based on searching or learning
dependent on deduction.
[Figure: stages of the knowledge discovery process. Labels shown: Data Integration, Data Processing, Data Transformation, Result Analysis, Knowledge Presentation through GUI]
Having seen the database requirements of OIS and DSS, let us differentiate
the query systems associated with each: OLTP and OLAP. OLTP fulfils the
requirements of OIS well, as its queries are simple in nature. OLAP, on
the other hand, addresses the need to define more complex queries and
requires novel databases, in the form of Multi-Dimensional and
Multi-Relational Databases (MDDB and MRDB respectively), to provide
the back end.
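As a rough illustration of the difference between the two query styles, the sketch below uses plain Python. The `sales` records, the field names and the helper functions `oltp_lookup` and `olap_rollup` are assumptions made for this example only. The first function fetches a single record by key (OLTP style); the second totals a measure over any chosen combination of dimensions (OLAP style).

```python
from collections import defaultdict

# Hypothetical sales records (all names and figures are made up for illustration).
sales = [
    {"order_id": 1, "region": "North", "quarter": "Q1", "amount": 120.0},
    {"order_id": 2, "region": "North", "quarter": "Q2", "amount": 80.0},
    {"order_id": 3, "region": "South", "quarter": "Q1", "amount": 200.0},
    {"order_id": 4, "region": "South", "quarter": "Q1", "amount": 50.0},
]


def oltp_lookup(records, order_id):
    """OLTP-style query: fetch one record by its key."""
    for row in records:
        if row["order_id"] == order_id:
            return row
    return None


def olap_rollup(records, dimensions):
    """OLAP-style query: total the amount for every combination of the given dimensions."""
    totals = defaultdict(float)
    for row in records:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row["amount"]
    return dict(totals)


single_order = oltp_lookup(sales, 3)
by_region_quarter = olap_rollup(sales, ["region", "quarter"])
by_region = olap_rollup(sales, ["region"])
```

The OLTP call touches one row, while each OLAP call scans all rows and groups them, which is why OLAP workloads benefit from a back end organised around dimensions.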
3. Uses structured language for searching DM tools for
extracting pattern.
12. Unlimited dimensions and aggregation levels: The system must remain
flexible and expandable, so that extra dimensions can be added and
additional aggregations permitted.
2. Database technology
3. Statistics
4. Machine Learning
5. Visualization
6. Other Disciplines
[Figure: Data Mining shown at the centre of contributing disciplines: Statistics, Information Science, Database Technology, Machine Learning]
Data Mining Models
(c) Time Series Analysis- The value of an attribute is examined at evenly
spaced times, as it varies with time.
(d) Prediction- It means foretelling future data states based on past and
current data.
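A minimal sketch of prediction on a time series follows. It assumes a simple moving average as the forecasting rule, which is only one of many possible techniques and is not prescribed by the text; the function name `moving_average_forecast` and the sample figures are hypothetical.

```python
def moving_average_forecast(series, window=3):
    """Predict the next value as the mean of the last `window` observations."""
    if window <= 0 or len(series) < window:
        raise ValueError("need at least `window` observations")
    recent = series[-window:]
    return sum(recent) / window


# Hypothetical attribute values recorded at evenly spaced times (e.g. monthly units sold).
monthly_units = [10, 12, 14, 16, 18, 20]
next_month = moving_average_forecast(monthly_units, window=3)
```

Here the forecast is based only on the three most recent observations; a larger window smooths out fluctuations at the cost of reacting more slowly to a trend.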
Descriptive Model
(a) Clustering- It is also referred to as unsupervised learning or
segmentation/partitioning. In clustering, the groups are not pre-defined.
(b) Summarisation- Data is mapped into subsets with simple descriptions.
Also termed characterisation or generalisation.
(c) Sequence Discovery- Sequential analysis or sequence discovery is used
to find sequential patterns in data. It is similar to association, but
the relationship is based on time.
(d) Association Rules- A model which identifies specific types of data
associations.
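To make the idea of an association rule concrete, the sketch below computes two commonly used measures, support and confidence, for a rule of the form "bread implies butter". These measures are standard in the data mining literature but are not defined in the text above; the basket data and function names are assumptions for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base else 0.0


# Hypothetical market baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]
rule_support = support(baskets, {"bread", "butter"})
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
```

In this toy data, bread and butter occur together in two of the four baskets, and two of the three baskets containing bread also contain butter.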
DMQL follows an SQL-like syntax, which makes it amenable to linking with
relational query languages and simplifies the user's task of knowledge
extraction.
[Figure: Data warehouse framework. Components shown: Operational Data Store, External Data, Transformation Tools, Data Warehouse DBMS & Data Repository, Metadata, MRDB, MDDB, Data Marts, Data Mining Tools, OLAP Tools, Report/Query Tools, Information Delivery, Admin & System Mgt. tools]
[Figure: Star schema. A central Fact table contains keys (say A, B, C, D) and measures; Dimension tables with Key A, Key B, Key C and Key D are linked to it.]
Snowflake schema
It is an extension of the Star schema. The dimension tables are further
extended to extra tables. The diagram below gives an example:
[Figure: Snowflake schema. The Fact table contains keys (say A, B, C, D) and measures and is linked to Dimension tables with Key A, Key B, Key C and Key D. The Dimension tables holding Key C and Key D also hold Key E and Key F, which link to further Dimension tables with Key E and Key F.]
Constellation Schema
It has multiple fact tables to meet the requirements of more advanced
applications. The fact tables are permitted to share dimension tables. The
example given below refers:
[Figure: Fact constellation schema. Labels shown: Fact table 1 (Keys A, C), Fact table 2 (Keys B, D), a shared Dimension table (Keys A, B, C), Dimension tables with Key C and Key E, with Key D, and a further Dimension table with Key E.]
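The fact and dimension tables of the schemas above can be sketched with Python's built-in `sqlite3` module. The table names (`fact_sales`, `dim_product`, `dim_store`), their columns and the sample rows are all hypothetical; the sketch only shows the general shape of a star schema, with a fact table that holds keys plus a measure, and a typical query that joins it to a dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: one row per product / per store (hypothetical columns).
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_store   (store_key   INTEGER PRIMARY KEY, city TEXT);
    -- Fact table: keys pointing at the dimensions plus a numeric measure.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        store_key   INTEGER REFERENCES dim_store(store_key),
        units_sold  INTEGER
    );
    INSERT INTO dim_product VALUES (1, 'Pen'), (2, 'Notebook');
    INSERT INTO dim_store   VALUES (1, 'Delhi'), (2, 'Mumbai');
    INSERT INTO fact_sales  VALUES (1, 1, 10), (1, 2, 5), (2, 1, 7), (2, 2, 3);
""")

# A typical star-schema query: join the fact table to a dimension and aggregate the measure.
rows = conn.execute("""
    SELECT s.city, SUM(f.units_sold)
    FROM fact_sales AS f
    JOIN dim_store AS s ON s.store_key = f.store_key
    GROUP BY s.city
    ORDER BY s.city
""").fetchall()
conn.close()
```

A snowflake variant would move an attribute such as `city` into its own table referenced from `dim_store`, and a constellation would add a second fact table that reuses the same dimension tables.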
3. Choose the grain of the business, i.e. the fundamental, atomic level of
data to be represented in the Fact table.
5. Choose the measures that will populate each fact table record, e.g.
Units_sold, Rs_sold.
Based on these four principles, a nine-step method is evolved as under:
1. Choosing the subject matter.
2. Deciding what the Fact table represents.
3. Identifying and conforming the dimensions.
4. Choosing the Facts.
5. Storing pre-calculations in the fact table.
6. Rounding out the dimension tables.
7. Choosing the duration of the database.
8. Tracking slowly changing dimensions.
9. Deciding the Query priorities.
1.5.6 Data Warehouse Architecture
Data Warehouse architecture is based on an RDBMS server. It has a massive
central repository for storage of data, subsidiary databases and front-end
tools.
The architecture consists of:
1. Bottom Tier- An RDBMS & a DWH Server
2. Middle Tier- OLAP Server
3. Top Tier- Front End Tools
Virtual Warehouse
Another commonly used term is the Virtual Warehouse. It is a set of views
over operational databases. For efficient query processing, only some of
the possible summary views are materialised. It is easy to build but
requires excess capacity on operational database servers.
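The difference between a plain view and a materialised summary can be sketched with Python's built-in `sqlite3` module. SQLite has no built-in materialised views, so this sketch imitates one with `CREATE TABLE ... AS SELECT`; the `orders` table, the view and table names, and the data are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical operational table.
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'North', 100.0), (2, 'North', 50.0), (3, 'South', 75.0);

    -- Virtual warehouse: a summary view recomputed from the operational table on every query.
    CREATE VIEW v_sales_by_region AS
        SELECT region, SUM(amount) AS total FROM orders GROUP BY region;

    -- Imitated materialised summary: the same result stored as a table (a snapshot).
    CREATE TABLE m_sales_by_region AS SELECT * FROM v_sales_by_region;

    -- A new operational row arrives after the snapshot was taken.
    INSERT INTO orders VALUES (4, 'South', 25.0);
""")

live = dict(conn.execute("SELECT region, total FROM v_sales_by_region").fetchall())
snapshot = dict(conn.execute("SELECT region, total FROM m_sales_by_region").fetchall())
conn.close()
```

The view always reflects the current operational data but places its computation on the operational server each time it is queried, whereas the stored summary answers quickly and must be refreshed to pick up new rows.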
Developing a Data Warehouse
It consists of:
1. Defining a high-level corporate data model.
2. Developing an Enterprise Data Warehouse and continuing to refine it to
meet user requirements.
3. In parallel, developing data marts and refining these models.
1.5.7 ROLAP, MOLAP & HOLAP
These tools utilise specialised data structures to organise, navigate and
analyse data, typically in an aggregated form. They require a tight
coupling with the application and the presentation layer.
1. MOLAP architecture creates a data structure that stores data in the way
it will finally be utilised, so as to enhance performance. It is
particularly well suited for iterative and time series analysis. It
provides tools to access data maintained in the DWH repository (RDBMS) and
permits such access when the MDDB does not have the desired data. MOLAP
tools give the user high performance and better understanding, due to
specialised indexing and storage optimisations. They require less space
due to the use of compression.
MOLAP ARCHITECTURE
[Figure: MOLAP architecture. Components shown: Data Warehouse Database Server, Metadata, MOLAP Server with MDDB data cube, Front End Tool. Flow labels shown: Load, New data loaded, Information requested, Required information, Query sent, Result of processing, SQL request, Information request, Metadata request, Requested dataset returned, Result of search given.]
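The MOLAP behaviour described above, answering from a precomputed multidimensional structure and going back to the warehouse when the MDDB lacks the data, can be sketched in plain Python. Everything named here (`WAREHOUSE`, `CUBE`, `build_cube`, `query`, the sample rows) is a hypothetical stand-in, not the interface of any real MOLAP product.

```python
from itertools import combinations

# Hypothetical warehouse rows; in practice these would live in the RDBMS repository.
WAREHOUSE = [
    {"product": "Pen", "city": "Delhi", "year": 2023, "units": 10},
    {"product": "Pen", "city": "Mumbai", "year": 2023, "units": 5},
    {"product": "Notebook", "city": "Delhi", "year": 2024, "units": 7},
]


def aggregate(rows, dims):
    """Sum `units` for each combination of values of the dimensions in `dims`."""
    totals = {}
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] = totals.get(key, 0) + row["units"]
    return totals


def build_cube(rows, dims):
    """Precompute the aggregate for every subset of `dims` (the cuboids of a data cube)."""
    cube = {}
    for size in range(len(dims) + 1):
        for subset in combinations(dims, size):
            cube[subset] = aggregate(rows, subset)
    return cube


# The cube is loaded over two dimensions only; "year" is deliberately left out.
CUBE = build_cube(WAREHOUSE, ("product", "city"))


def query(dims):
    """Answer from the cube when possible, otherwise fall back to the warehouse rows."""
    if dims in CUBE:
        return CUBE[dims], "cube"
    return aggregate(WAREHOUSE, dims), "warehouse"


by_city, source_1 = query(("city",))
by_year, source_2 = query(("year",))
```

Queries over the loaded dimensions are simple lookups in the precomputed cube, which is where the performance gain comes from; a query over `year` is not in the cube and is computed from the warehouse rows instead.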
3. HOLAP uses the best features of both, i.e. the flexibility of the ROLAP
RDBMS and the optimised multidimensional structure of MOLAP. Users can
perform limited analysis either directly against RDBMS products or through
an intermediate MOLAP server. A user can send a query to select data from
the DBMS, which then delivers the requested data to the desktop, where it
is placed in a data cube. The desired information is maintained locally
and need not be created each time a query is given.
4. The salient differences to be noted are:
(b) The ROLAP server does not store the intermediate result in a cube but
in a relational table. The user gets his query serviced by the ROLAP
server.
(c) In HOLAP, SQL is sent by the user to the DWH server; the result is
then either received directly or through an intermediate MOLAP
server, where a data cube is created and accessed by the user.
[Figure: SDLC stages: Feasibility, Analysis, Design, Coding, Testing, Integration, Implementation]
The CLDS (SDLC spelt in reverse) is the methodology followed for developing a
data mining application. It runs in reverse because the end user, being a manager
and not a technocrat, does not realise at the outset the potential of the system and
its decision support capability. The user expects the technical experts to present
the available data, identify suitable algorithms and test the results.
[Figure: CLDS stages: Implement Warehouse, Integrate data, Test for information bias, Develop program for data, Design DSS System, Analyze result, Understand requirements]
8. Business decision making based on Data mining & other access tools
1.9 Summary- Overview of Data Mining & Warehousing
1. Data Mining & Data Warehousing principles were identified during the early 1990s
but were implemented only after a decade, owing to the non-availability of:
(a) Suitable hardware at an affordable cost, to support parallel processing, provide
fast speeds and store massive databases.
(b) Operating systems to support parallel architectures.
(c) DBMS to manage very large databases.
(d) Network bandwidth for interconnectivity.
(e) Suitable data mining algorithms.
(f) Visualisation & presentation tools.
2. The Data Warehouse provides a platform for capturing, refining, integrating and
transforming data received from diverse sources. The data is then stored in
subject-wise form in a central repository accessed through a metadata mechanism.
3. The data is stored in relational form by creating Fact & Dimension tables
connected through specialised schemas based on concept hierarchies. These are the
Star, Snowflake & Fact Constellation schemas.
4. The data is further organised as MRDB & MDDB, which may also be considered
part of the Data Marts created for different users to meet their specific requirements.
5. For producing useful reports, Data Mining, OLAP & Query tools and Application
tools are utilised.
6. Any result requires an effective presentation. GUI and Visualisation tools are made
available to managers so that they can assimilate and analyse the results for speedy
decision making.
7. Data Mining & Warehousing is constantly evolving and has to adopt a framework
which is flexible and can absorb organisational and technological changes.
Question 2: Data Warehouse & Data Mining principles were formulated in early
1990s but implemented only after a decade. Why?
Q3: What are Data Mining models and the tasks associated with them?
Q4: What are the Data Mining issues? In what way do they affect the
implementation of a Data Mining system?