
[Winter 2014] ASSIGNMENT

PROGRAM Master of Science in Information Technology (MSc IT) Revised Fall 2011
SEMESTER 4
SUBJECT CODE & NAME MIT401 Data Warehousing and Data Mining
CREDIT 4
BK ID B1633 MAX. MARKS 60

Q.No 1 Explain the Top-Down and Bottom-Up Data Warehouse development methodologies. 10
Answer:
Top-Down and Bottom-Up Development Methodology
Although Data Warehouses can be designed in a number of different ways, they all
share a number of important characteristics. Most Data Warehouses are Subject Oriented. This
means that the information in the Data Warehouse is stored in a way that allows it to be
connected to objects or events that occur in reality.
Another characteristic that is frequently seen in Data Warehouses is called Time Variant. A time
variant Data Warehouse allows changes in the information to be monitored and recorded
over time. Data Warehouses are also integrated: data from all the programs used by a particular
institution is stored in the Data Warehouse and integrated together. The first Data Warehouses
were developed in the 1980s. As societies entered the information age, there was a large demand
for efficient methods of storing information.
Many of the systems that existed in the 1980s were not powerful enough to store and manage
large amounts of data. There were a number of reasons for this. The systems that existed at the
time took too long to report and process information. Many of these systems were not designed
to analyze or report information. In addition to this, the computer programs that were necessary
for reporting information were both costly and slow. To solve these problems, companies began
designing computer databases that placed an emphasis on managing and analyzing information.
These were the first Data Warehouses, and they could obtain data from a variety of different
sources, including PCs and mainframes.
Spreadsheet programs have also played an important role in the development of Data
Warehouses. By the end of the 1990s, the technology had greatly advanced, and was much lower
in cost. The technology has continued to evolve to meet the demands of those who are looking
for more functions and speed. There are four advances in Data Warehouse technology that have
allowed it to evolve. These advances are offline operational databases, real time Data
Warehouses, offline Data Warehouses, and integrated Data Warehouses.
The offline operational database is a system in which the information within the database of an
operational system is copied to a server that is offline. When this is done, the operational system
will perform at a much higher level. As the name implies, a real time Data Warehouse system
will be updated every time an event occurs. For example, if a customer orders a product, a real
time Data Warehouse will automatically update the information in real time.
With the integrated Data Warehouse, transactions will be transferred back to the operational
systems each day, and this will allow the data to easily be analyzed by companies and
organizations. There are a number of components that will be present in the typical Data Warehouse.
Some of these components are the source data layer, reporting layer, Data Warehouse layer, and
transformation layer. There are a number of different data sources for Data Warehouses.
Popular data sources include Teradata, Oracle Database, and Microsoft SQL Server.
Another important concept that is related to Data Warehouses is called data transformation. As
the name suggests, data transformation is a process in which information transferred from
specific sources is cleaned and loaded into a repository.
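
The clean-and-load step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the row layout, the two date formats, and the names `RAW_ROWS`, `transform`, and `load` are all hypothetical, and a plain list stands in for the repository.

```python
from datetime import datetime

# Hypothetical raw rows from two source systems with inconsistent formats.
RAW_ROWS = [
    {"customer": " Alice ", "amount": "120.50", "date": "2014-01-15"},
    {"customer": "BOB", "amount": "80", "date": "15/01/2014"},
    {"customer": "", "amount": "n/a", "date": "2014-01-16"},  # unusable row
]

# Assumed date formats used by the source systems.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(text):
    """Try each known source format; return an ISO date string or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def transform(rows):
    """Clean raw rows: normalize names, parse amounts and dates, drop bad rows."""
    cleaned = []
    for row in rows:
        name = row["customer"].strip().title()
        date = parse_date(row["date"])
        try:
            amount = float(row["amount"])
        except ValueError:
            amount = None
        if name and date and amount is not None:
            cleaned.append({"customer": name, "amount": amount, "date": date})
    return cleaned


def load(rows, repository):
    """Append cleaned rows to the repository (a list stands in for the warehouse)."""
    repository.extend(rows)
    return len(rows)


warehouse = []
loaded = load(transform(RAW_ROWS), warehouse)
print(loaded, warehouse)
```

Keeping the cleaning rules in one `transform` function means inconsistencies are resolved before anything reaches the repository, which is the point the paragraph makes.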

2 Explain the Functionalities and advantages of Data Warehouses 5+5=10


Answer:
Functionality of Data Warehouses

Data Warehouses exist to facilitate complex, data-intensive and frequent ad hoc queries. Data
Warehouses must provide far greater and more efficient query support than is demanded of
transactional databases. Data Warehouses provide the following functionality:
Roll-up: Data is summarized with increased generalization.
Drill-down: Increasing levels of detail are revealed.
Pivot: Cross tabulation (that is, rotation) is performed.
Slice and Dice: Performing projection operations on the dimensions.
Sorting: Data is sorted by ordinal value.
Selection: Data is available by value or range.
Derived or Computed Attributes: Attributes are computed by operations on stored data and
values are derived.
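
Several of the operations listed above can be illustrated on a tiny in-memory fact table. The sketch below is only an illustration, not how a warehouse engine implements them: the sales rows, the dimension names, and the helper functions `aggregate`, `slice_cube`, and `pivot` are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, region, product, sales)
FACTS = [
    (2013, "Q1", "North", "Laptop", 100),
    (2013, "Q2", "North", "Laptop", 150),
    (2013, "Q1", "South", "Phone", 80),
    (2014, "Q1", "North", "Phone", 120),
    (2014, "Q1", "South", "Laptop", 90),
]
# Position of each dimension inside a fact row.
DIMS = {"year": 0, "quarter": 1, "region": 2, "product": 3}


def aggregate(facts, dims):
    """Sum sales grouped by the named dimensions.

    Grouping by fewer dimensions is a roll-up; by more, a drill-down.
    """
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[DIMS[d]] for d in dims)
        totals[key] += row[4]
    return dict(totals)


def slice_cube(facts, dim, value):
    """Slice: keep only the rows where one dimension has a fixed value."""
    return [row for row in facts if row[DIMS[dim]] == value]


def pivot(facts, row_dim, col_dim):
    """Pivot: cross-tabulate sales with row_dim as rows and col_dim as columns."""
    table = defaultdict(dict)
    for (r, c), total in aggregate(facts, [row_dim, col_dim]).items():
        table[r][c] = total
    return dict(table)


rollup = aggregate(FACTS, ["year"])                # sales per year
drilldown = aggregate(FACTS, ["year", "quarter"])  # sales per year and quarter
north = slice_cube(FACTS, "region", "North")       # only the North region
crosstab = pivot(FACTS, "region", "product")       # region by product table
print(rollup, drilldown, crosstab, sep="\n")
```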

Advantages of Data Warehouse


A Data Warehouse provides a common data model for data, regardless of the data source. This
makes it easier to report and analyze information than it would be if multiple data models from
disparate sources were used to retrieve information such as sales invoices, order receipts,
general ledger charges, etc.
Prior to loading data into the Data Warehouse, inconsistencies are identified and resolved.
This greatly simplifies reporting and analysis.
Information in the Data Warehouse is under the control of Data Warehouse users so that,
even if the source system data is purged over time, the information in the warehouse can be
stored safely for extended periods of time.
Because they are separate from operational systems, Data Warehouses provide fast retrieval
of data without slowing down operational systems.
Data Warehouses facilitate Decision Support System applications such as trend reports (e.g.,
the items with the most sales in a particular area within the last two years), exception reports,
and reports that show actual performance versus goals.

3 Describe about Hyper Cube and Multicube 5+5=10


Answer:
Hypercubes and Multicubes
Multidimensional databases can present their data to an application using two types of cubes:
hypercubes and multicubes.
Hypercube: In the hypercube model, all data appears logically as a single cube. The term
hypercube is used because this representation can accommodate more than three dimensions,
although in simple cases a hypercube may have only three. A hypercube is a general metaphor
for representing multidimensional data. Often, Multi Dimensional Structures (MDS) are used to
represent such data.
Multicube: In the multicube model, data is segmented into a set of smaller cubes, each of
which is composed of a subset of the available dimensions. This means the data can be viewed
through several cubes, each covering a different combination of dimensions.

Fig.: Multicube
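
The difference between the two models can be sketched with plain dictionaries. This is a toy illustration under assumptions: the four dimensions, the sample cells, and the `project` helper are hypothetical, and the smaller cubes are derived from the hypercube here only to keep the example short.

```python
# Hypercube: one logical cube keyed by every dimension at once.
# Hypothetical dimensions: (year, region, product, channel).
hypercube = {
    (2013, "North", "Laptop", "Online"): 100,
    (2013, "South", "Phone", "Retail"): 80,
    (2014, "North", "Phone", "Online"): 120,
}


def project(cube, keep):
    """Build a smaller cube over a subset of dimension positions, summing the rest."""
    smaller = {}
    for key, value in cube.items():
        sub_key = tuple(key[i] for i in keep)
        smaller[sub_key] = smaller.get(sub_key, 0) + value
    return smaller


# Multicube: a set of smaller cubes, each over a subset of the dimensions.
multicube = {
    "sales_by_year_region": project(hypercube, (0, 1)),
    "sales_by_product_channel": project(hypercube, (2, 3)),
}
print(multicube)
```

A query that only needs year and region touches the small two-dimensional cube instead of the full four-dimensional one.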

4 List and explain the Strategies for data reduction. 5*2=10


Answer:
Strategies for data reduction include the following:
1) Data cube aggregation, where aggregation operations are applied to the data in the
construction of a data cube.
2) Dimension reduction, where irrelevant, weakly relevant, or redundant attributes or
dimensions may be detected and removed.
3) Data compression, where encoding mechanisms are used to reduce the data set size.
4) Numerosity reduction, where the data are replaced or estimated by alternative, smaller
data representations such as parametric models (which need to store only the model parameters
instead of the actual data), or nonparametric methods such as clustering, sampling, and the use
of histograms.
5) Discretization and concept hierarchy generation, where raw data values for attributes
are replaced by ranges or higher conceptual levels. Concept hierarchies allow the mining of data
at multiple levels of abstraction and are a powerful tool for data mining.
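
Two of the strategies above, numerosity reduction and discretization, are easy to show on a small list of values. The sketch below is an illustration only: the `ages` data, the cut points in `discretize`, and the function names are assumptions, and the histogram helper assumes the values are not all identical.

```python
import random

# Hypothetical raw attribute values.
ages = [13, 15, 16, 19, 20, 21, 22, 25, 30, 33, 35, 36, 40, 45, 46, 52, 70]


def equal_width_histogram(values, bins):
    """Numerosity reduction: replace raw values with per-bucket counts."""
    low, high = min(values), max(values)
    width = (high - low) / bins  # assumes high > low
    counts = [0] * bins
    for v in values:
        # The maximum value would land one past the end, so clamp it.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts


def sample_without_replacement(values, n, seed=0):
    """Numerosity reduction: keep a simple random sample of n values."""
    return random.Random(seed).sample(values, n)


def discretize(age):
    """Discretization: map a raw age to a higher-level concept (assumed cut points)."""
    if age < 20:
        return "youth"
    if age < 50:
        return "middle_aged"
    return "senior"


histogram = equal_width_histogram(ages, 3)
sample = sample_without_replacement(ages, 5)
concepts = [discretize(a) for a in ages]
print(histogram, sample, concepts[:3])
```

The histogram stores three counts instead of seventeen values, and the concept labels form the lowest level of a concept hierarchy that could be generalized further.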

5 Describe K-means method for clustering. List its advantages and drawbacks. 5+5=10
Answer:
K-means
K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve
the well-known clustering problem. The procedure classifies a given data set into a certain
number of clusters (assume k clusters) fixed a priori. The main
idea is to define k centroids, one for each cluster. The basic step of k-means clustering is simple.
In the beginning we determine the number of clusters K and assume the centroid or center of
each cluster. We can take any random objects as the initial centroids, or the first K objects in
sequence can also serve as the initial centroids. The K-means algorithm then repeats the three
steps given below until convergence, that is, until the grouping is stable (no object moves to
a different group):
1. Determine the centroid coordinate
2. Determine the distance of each object to the centroids
3. Group the object based on minimum distance

These steps are given in the form of a flow chart. (See fig. below)

Fig.: Flow chart representation of K-means
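
The three steps can also be written as a short Python function. This is a minimal sketch under assumptions, not a production implementation: it works on 2-D tuples, uses Euclidean distance, seeds the centroids with the first K objects (one of the options mentioned above), and the sample points are hypothetical.

```python
import math


def kmeans(points, k, max_iter=100):
    """Cluster points into k groups; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    assignment = None
    for _ in range(max_iter):
        # Steps 2 and 3: measure distances and put each object in the nearest group.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:  # stable: no object changed group
            break
        assignment = new_assignment
        # Step 1 for the next round: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment


# Hypothetical data: four objects described by two attributes.
points = [(1, 1), (2, 1), (4, 3), (5, 4)]
centroids, assignment = kmeans(points, k=2)
print(centroids, assignment)
```

On this data the loop stops after three passes, with the first two objects in one cluster and the last two in the other.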


Advantages:
With a large number of variables, K-Means may be computationally faster than hierarchical
clustering (if K is small).
K-Means may produce tighter clusters than hierarchical clustering, especially if the clusters
are globular.
The K-means method as described has the following drawbacks:
It does not do well with overlapping clusters.
The clusters are easily pulled off-center by outliers.
Membership is all-or-nothing: each record is either inside or outside of a given cluster.

6 Describe about Multilevel Databases and Web Query Systems 5+5=10


Answer:
Multilevel Databases

Several researchers have proposed a multilevel database approach to organizing Web-based
information. The main idea behind these proposals is that the lowest level of the database
contains primitive semi-structured information stored in various web repositories, such as
hypertext documents. At the higher level(s), metadata or generalizations are extracted from
lower levels and organized in structured collections such as relational or object-oriented
databases.
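
A two-level version of this idea can be sketched with Python's standard library. This is an illustration under assumptions only: the sample documents, the `MetadataExtractor` class, and the `page` table layout are hypothetical, with `html.parser` reading the lowest level and an in-memory SQLite table holding the extracted metadata.

```python
import sqlite3
from html.parser import HTMLParser

# Level 0: hypothetical semi-structured hypertext documents keyed by URL.
DOCUMENTS = {
    "http://example.org/a": (
        "<html><head><title>Data Mining</title></head>"
        "<body><a href='/b'>next</a><a href='/c'>more</a></body></html>"
    ),
    "http://example.org/b": (
        "<html><head><title>Warehousing</title></head><body></body></html>"
    ),
}


class MetadataExtractor(HTMLParser):
    """Collect the page title and count outgoing links."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.link_count = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            self.link_count += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def build_metadata_level(documents):
    """Level 1: store the extracted metadata in a relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page (url TEXT PRIMARY KEY, title TEXT, links INTEGER)")
    for url, html in documents.items():
        parser = MetadataExtractor()
        parser.feed(html)
        parser.close()
        conn.execute(
            "INSERT INTO page VALUES (?, ?, ?)", (url, parser.title, parser.link_count)
        )
    return conn


conn = build_metadata_level(DOCUMENTS)
rows = conn.execute("SELECT title, links FROM page ORDER BY url").fetchall()
print(rows)
```

Once the metadata is in a table, ordinary SQL can be used at the higher level without reparsing the documents.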
Web Query Systems
There have been many web-based query systems and languages developed recently that attempt
to utilize standard database query languages such as SQL, structural information about web
documents, and even natural language processing for accommodating the types of queries that
are used in World Wide Web searches. We mention a few examples of these web-based query
systems here. W3QL combines structure queries, based on the organization of hypertext
documents, and content queries, based on information retrieval techniques. WebLog is a
logic-based query language for restructuring extracted information from Web information sources.
Lorel and UnQL support querying of heterogeneous and semi-structured information on the
Web using a labeled graph data model. TSIMMIS helps to extract data from heterogeneous and
semi-structured information sources and correlates them to generate an integrated database
representation of the extracted information.
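
To make the labeled graph data model more concrete, the sketch below stores semi-structured data as labeled edges and answers a simple path query. It is only an illustration of the data model: the `GRAPH` contents and the `follow_path` helper are hypothetical, and the code does not reproduce the syntax or semantics of Lorel, UnQL, or any other system named above.

```python
# Hypothetical labeled graph: node -> list of (edge_label, target) pairs.
# Targets are either other node identifiers in GRAPH or plain leaf values.
GRAPH = {
    "root": [("book", "b1"), ("book", "b2"), ("article", "a1")],
    "b1": [("title", "Data Mining"), ("year", 2001)],
    "b2": [("title", "Web Databases")],  # no year: fields may be missing
    "a1": [("title", "Query Languages"), ("year", 1998)],
}


def follow_path(graph, start, labels):
    """Return every node or value reached from start by following labels in order."""
    frontier = [start]
    for label in labels:
        next_frontier = []
        for node in frontier:
            # Leaf values have no outgoing edges, so they yield nothing.
            for edge_label, target in graph.get(node, []):
                if edge_label == label:
                    next_frontier.append(target)
        frontier = next_frontier
    return frontier


book_titles = follow_path(GRAPH, "root", ["book", "title"])
book_years = follow_path(GRAPH, "root", ["book", "year"])
print(book_titles, book_years)
```

Because the second book has no `year` edge, the year query simply returns fewer results instead of failing, which is the kind of irregularity a semi-structured model has to tolerate.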
