for reporting information were both costly and slow. To solve these problems, companies began
designing computer databases that placed an emphasis on managing and analyzing information.
These were the first Data Warehouses, and they could obtain data from a variety of sources,
including PCs and mainframes.
Spreadsheet programs have also played an important role in the development of Data
Warehouses. By the end of the 1990s, the technology had greatly advanced, and was much lower
in cost. The technology has continued to evolve to meet the demands of those who are looking
for more functions and speed. Four advances in Data Warehouse technology have allowed it to
evolve: offline operational databases, offline Data Warehouses, real-time Data Warehouses, and
integrated Data Warehouses.
The offline operational database is a system in which the information within the database of an
operational system is copied to a server that is offline. Because reporting queries then run
against the copy, the operational system is relieved of that load and performs better. As the
name implies, a real-time Data Warehouse is updated every time an event occurs. For example, if
a customer orders a product, a real-time Data Warehouse records that order as soon as it happens.
With the integrated Data Warehouse, transactions are transferred back to the operational
systems each day, which allows companies and organizations to analyze the data easily.
A typical Data Warehouse is made up of a number of layers: the source data layer, the
transformation layer, the Data Warehouse layer, and the reporting layer. There are a number of
different data sources for Data Warehouses. Some popular ones are Teradata, Oracle Database,
and Microsoft SQL Server.
Another important concept that is related to Data Warehouses is called data transformation. As
the name suggests, data transformation is a process in which information transferred from
specific sources is cleaned and loaded into a repository.
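
The clean-and-load idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the sample rows, the cleaning rules (trim names, reject bad amounts, drop duplicates), the function names, and the use of an in-memory SQLite database as a stand-in for the repository.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from a source system.
RAW_ORDERS = [
    {"order_id": "1", "customer": " Alice ", "amount": "19.99"},
    {"order_id": "2", "customer": "bob", "amount": "5.00"},
    {"order_id": "2", "customer": "bob", "amount": "5.00"},   # duplicate
    {"order_id": "3", "customer": "", "amount": "12.50"},     # missing customer
    {"order_id": "4", "customer": "Carol", "amount": "n/a"},  # invalid amount
]


def clean(rows):
    """Normalize names and types; drop duplicate and invalid rows."""
    seen = set()
    cleaned = []
    for row in rows:
        name = row["customer"].strip().title()
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # amount is not a number
        if not name or row["order_id"] in seen:
            continue  # missing customer or repeated order
        seen.add(row["order_id"])
        cleaned.append((int(row["order_id"]), name, amount))
    return cleaned


def load(conn, rows):
    """Load cleaned rows into the repository table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


# In-memory SQLite stands in for the warehouse repository.
warehouse = sqlite3.connect(":memory:")
load(warehouse, clean(RAW_ORDERS))
loaded = warehouse.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

Only the two valid, distinct orders reach the repository; the duplicate and the two malformed rows are filtered out during cleaning.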
Data Warehouses exist to facilitate complex, data-intensive and frequent ad hoc queries. Data
Warehouses must provide far greater and more efficient query support than is demanded of
transactional databases. Data Warehouses provide the following functionality:
Roll-up: Data is summarized with increased generalization.
Drill-down: Increasing levels of detail are revealed.
Pivot: Cross tabulation (that is, rotation) is performed.
Slice and Dice: Performing projection operations on the dimensions.
Sorting: Data is sorted by ordinal value.
Selection: Data is available by value or range.
Derived or Computed Attributes: Attributes are computed by operations on stored data and
values are derived.
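
Several of the operations listed above can be sketched on a tiny fact table. The data, the dimension names, and the helper functions are hypothetical, and the slice shown here follows the common OLAP usage of fixing one dimension to a single value; a real warehouse would express these as queries against its own schema.

```python
from collections import defaultdict

# Hypothetical fact table: (year, quarter, region, product, sales)
FACTS = [
    (2023, "Q1", "East", "Widget", 100),
    (2023, "Q1", "West", "Widget", 80),
    (2023, "Q2", "East", "Gadget", 50),
    (2023, "Q2", "West", "Widget", 70),
    (2024, "Q1", "East", "Gadget", 120),
    (2024, "Q1", "West", "Gadget", 60),
]
DIMS = {"year": 0, "quarter": 1, "region": 2, "product": 3}
SALES = 4  # column index of the measure


def aggregate(facts, dims):
    """Sum sales grouped by the given dimensions."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[DIMS[d]] for d in dims)
        totals[key] += row[SALES]
    return dict(totals)


def slice_(facts, dim, value):
    """Keep only the rows where one dimension equals a fixed value."""
    return [row for row in facts if row[DIMS[dim]] == value]


def pivot(facts, row_dim, col_dim):
    """Cross-tabulate sales with one dimension as rows, another as columns."""
    table = defaultdict(lambda: defaultdict(int))
    for row in facts:
        table[row[DIMS[row_dim]]][row[DIMS[col_dim]]] += row[SALES]
    return {r: dict(cols) for r, cols in table.items()}


by_year = aggregate(FACTS, ["year"])                     # roll-up
by_year_quarter = aggregate(FACTS, ["year", "quarter"])  # drill-down
east_only = slice_(FACTS, "region", "East")              # slice
region_by_product = pivot(FACTS, "region", "product")    # pivot
in_range = [r for r in FACTS if 60 <= r[SALES] <= 100]   # selection by range
ranked = sorted(FACTS, key=lambda r: r[SALES], reverse=True)  # sorting

print(by_year)
print(region_by_product)
```

Roll-up and drill-down are the same aggregation run at coarser or finer granularity, which is why one `aggregate` helper covers both.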
Fig.: Multicube
5. Describe K-means method for clustering. List its advantages and drawbacks.
5+5=10
Answer:
K-means
K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve
the well-known clustering problem. The procedure classifies a given data set into a certain
number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids,
one for each cluster. In the beginning, we determine the number of clusters K and assume the
centroids, or centers, of these clusters. We can take any random objects as the initial
centroids, or the first K objects in sequence can serve as the initial centroids. The K-means
algorithm then repeats the three steps below until convergence, that is, until the grouping is
stable (no object moves to a different group):
1. Determine the centroid coordinates
2. Determine the distance of each object to the centroids
3. Group each object based on minimum distance
These steps are shown in the form of a flow chart (see fig. below).
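
The three steps can also be sketched in code. This is a minimal illustration under stated assumptions, not a reference implementation: it uses Euclidean distance, takes the first K objects as initial centroids (one of the two options mentioned above), and the function name, the `max_iter` safeguard, and the sample points are all hypothetical.

```python
import math


def kmeans(points, k, max_iter=100):
    """Cluster points into k groups; returns (centroids, assignment)."""
    # Assumption: the first k objects serve as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignment = [None] * len(points)

    for _ in range(max_iter):
        # Steps 2 and 3: compute the distance of each object to every
        # centroid and group the object with the nearest one.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # stable: no object moved to a different group
        assignment = new_assignment

        # Step 1: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centroids, assignment


# Hypothetical 2-D objects, clustered into K = 2 groups.
points = [(1, 1), (2, 1), (4, 3), (5, 4)]
centroids, assignment = kmeans(points, k=2)
print(centroids)   # [(1.5, 1.0), (4.5, 3.5)]
print(assignment)  # [0, 0, 1, 1]
```

On this sample the second object starts as its own centroid, is pulled into the first group on the second pass, and the grouping stops changing on the third pass.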