
Homework Title/No.: 4
Course Code: CSE551
Course Instructor: Babita Pandey
Student's Roll No.: RA27T1A35
Section No.: 27T1

Declaration:
I declare that this assignment is my individual work. I have not
copied from any other student’s work or from any other source
except where due acknowledgment is made explicitly in the text,
nor has any part been written for me by another person.

Student’s Signature:

Ramandeep Kaur
Part A:
Q1. What is the difference between ROLAP and MOLAP?

Answer: -

Unlike a ROLAP system, a MOLAP system is based on
an ad hoc logical model that can be used to represent
multidimensional data and operations directly.
The greatest advantage of MOLAP systems in comparison with
ROLAP is that multidimensional operations can be performed
in an easy, natural way with MOLAP without any need for
complex join operations. For this reason, MOLAP system
performance is excellent. However, MOLAP system
implementations have very little in common, because no
multidimensional logical model standard has yet been set.
Generally, they simply share the usage of optimization
techniques specifically designed for sparsity management.
The lack of a common standard is a problem that is
progressively being solved, and MOLAP tools are becoming
more and more successful after years of limited adoption.
This success is also shown by the investments in this
technology by major vendors, such as Microsoft (Analysis
Services) and Oracle (Hyperion).
The intermediate architecture type, HOLAP, aims at mixing
the advantages of both basic solutions. It takes advantage
of the standardization level and the ability to manage
large amounts of data from ROLAP implementations, and the
query speed typical of MOLAP systems. HOLAP implies that
the largest amount of data should be stored in an RDBMS to
avoid the problems caused by sparsity, and that a
multidimensional system stores only the information users
most frequently need to access. If that information is not
enough to solve queries, the system will transparently
access the part of the data managed by the relational
system.
(Figure: MOLAP architecture)

(Figure: ROLAP architecture)
Q2. In data warehouse technology, a multiple dimensional view can be implemented by
a relational database technique (ROLAP), or by a multidimensional database technique
(MOLAP). Briefly describe each implementation technique.

Answer: - The two implementation techniques used in data warehouse technology are
described below: -

1. ROLAP: - ROLAP servers are intermediate servers that sit between a relational back-end
server and client front-end tools. They use RDBMSs or extended RDBMSs to store and
manage warehouse data. ROLAP servers include optimization for each back-end DBMS, as
well as additional tools and services. ROLAP tends to be more scalable than MOLAP.

Implementation of ROLAP functions.

I. Generation of data warehouse with implementation of aggregation: - ROLAP relies
on relational tables as its basic data structure, with the relational table storing data at
the abstracted level. The aggregated data can then be stored inside the fact tables.
Various techniques are used for implementing aggregation, such as: -

• Sorting.
• Hashing.
• Grouping operations.

II. Roll-up: - In ROLAP, this means the relational tables are aggregated from more
specific to less specific levels.
III. Drill-down: - In this we introduce additional dimensions into the relational tables.
IV. Incremental updating: - Here the database is broken down into segments, and
updates are applied to the data warehouse segment by segment.
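As an illustrative sketch of points I and II above, the following uses SQLite to stand in for the warehouse RDBMS; the table name, column names (sales, region, quarter, amount), and all figures are invented:

```python
import sqlite3

# In-memory database standing in for the relational back end of a ROLAP server.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("North", "Q1", 100.0), ("North", "Q2", 150.0),
     ("South", "Q1", 80.0), ("South", "Q2", 120.0)],
)

# Aggregation: group detailed fact rows by region -- the kind of sort- or
# hash-based grouping described above.
by_region = con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Roll-up: aggregate away the region dimension entirely (a less specific level).
total = con.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

Drill-down would be the reverse: grouping by both region and quarter to reintroduce the more specific dimension.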

2. MOLAP: - These servers allow for multidimensional views of data through array-based
multidimensional engines. They can map multidimensional views onto data cube arrays. The
advantage to this is quicker indexing to pre-computed summarized data. MOLAPs may have
a two-level storage system in order to handle sparse and dense data. Dense sub-cubes are
identified and stored as array structures, whereas sparse sub-cubes use compression to make
storage more efficient.

Implementation of MOLAP functions.

I. Generation of data warehouse with implementation of aggregation: - In MOLAP,
the whole of the data present in the data warehouse is aggregated into a single
array-based cube of data, known as the aggregated cube.
II. Roll-up: - It is similar to roll-up in ROLAP, except that here we roll up the
sub-cubes that form the array-based cubes.
III. Drill-down: - Here we introduce additional dimensions or sub-cubes into the array.
IV. Incremental updating: - Incremental updating in MOLAP is somewhat more
sophisticated than in ROLAP, because of the additional complexity of the sub-cubes
and arrays.
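A minimal sketch of the array-based storage idea using plain Python lists (a real MOLAP engine would use compressed sub-cube structures for sparse regions; all names and figures here are made up):

```python
# A tiny dense data cube stored as a nested array, as a MOLAP engine might:
# axis 0 = region (North, South), axis 1 = quarter (Q1, Q2).
cube = [[100.0, 150.0],   # North: Q1, Q2
        [80.0, 120.0]]    # South: Q1, Q2

# Direct array indexing replaces relational joins: a cell lookup is O(1).
north_q2 = cube[0][1]

# Roll-up over the quarter axis: collapse each row into a region total.
by_region = [sum(row) for row in cube]

# Roll-up over both axes: the apex of the cube.
grand_total = sum(by_region)
```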
Q3: How does the association technique apply to data mining?

Answer: - By association we mean that we are given a set of items and a large collection of
transactions, each of which is a subset of these items. The task is to find relationships
between the presence of various items within these subsets.

The association technique can be applied to data mining using the two steps which are
as given below: -

• Find all frequent sets of items: by definition, each of these sets of items will occur at
least as frequently as a pre-determined minimum support count.
• Generate strong association rules from the frequent sets of items: by definition, these
rules must satisfy minimum support and minimum confidence.

In this way we can apply the association technique to data mining.
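The two steps can be sketched by brute force in Python; the transactions and thresholds below are invented for illustration, and a real implementation would use Apriori-style candidate pruning rather than enumerating every subset:

```python
from itertools import combinations

# Toy transactions; min_support = 2 transactions, min_confidence = 0.7.
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]
min_support, min_conf = 2, 0.7

def support(itemset):
    # Number of transactions containing every item of the itemset.
    return sum(1 for t in transactions if itemset <= t)

# Step 1: find all frequent itemsets (brute force over all candidate sizes).
items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        s = support(set(cand))
        if s >= min_support:
            frequent[frozenset(cand)] = s

# Step 2: generate strong rules body -> head with confidence >= min_conf.
rules = []
for itemset, s in frequent.items():
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for body in combinations(sorted(itemset), r):
            conf = s / frequent[frozenset(body)]
            if conf >= min_conf:
                rules.append((set(body), set(itemset) - set(body), conf))
```

On this toy data the only strong rule is {butter} -> {bread} with confidence 1.0; {bread} -> {milk} fails because its confidence (2/3) is below the threshold.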

The association technique can also be applied to data mining based upon one of the
following criteria: -

• Based on the type of values handled in the rule: if a rule concerns associations between
the presence and absence of items, it is a Boolean association rule.
• If a rule describes associations between quantitative items or attributes, then it is a
quantitative association rule. In these rules, quantitative values for items or attributes
are partitioned into intervals.
• Based on the dimensions of data involved in the rule: if the items in an association rule
reference only one dimension, then it is a single-dimensional association rule.
• Based on the levels of abstraction involved in the rule set: some methods for association
rule mining can find rules at differing levels of abstraction.

In order for the rules to be useful there are two pieces of information that must be supplied as
well as the actual rule:

1. Support: - How often does the rule apply?
2. Confidence: - How often is the rule correct?
Part B:

Q4: List the KDD process and briefly describe the steps of the process.

Answer: - KDD stands for Knowledge Discovery in Databases. It is defined as the broad
process of finding knowledge in data, and it emphasizes the high-level application of
particular data mining methods. It is of interest to researchers in machine learning, pattern
recognition, databases, statistics, artificial intelligence, knowledge acquisition for expert
systems, and data visualization.

The primary goal of the KDD process is to extract knowledge from data in the context of
large databases. It does this by using data mining methods to extract what is deemed
knowledge, according to the specifications of measures and thresholds, using a database
along with any required preprocessing, subsampling, and transformations of that database.

The steps of the KDD process are as illustrated below: -

1. Developing an understanding of the application domain, the relevant prior
knowledge, and the goals of the end-user.
2. Creating a target data set, selecting a data set, or focusing on a subset of variables, or
data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing includes removal of noise or outliers, collecting
necessary information to model or account for noise, and strategies for handling
missing data fields and accounting for time-sequence information and known
changes.
4. Data reduction and projection includes finding useful features to represent the data
depending on the goal of the task using dimensionality reduction or transformation
methods to reduce the effective number of variables under consideration or to find
invariant representations for the data.
5. Choosing the data mining task involves deciding whether the goal of the KDD
process is classification, regression, clustering, etc.
6. Choosing the data mining algorithms includes selecting methods to be used for
searching for patterns in the data and deciding which models and parameters may
be appropriate and matching a particular data mining method with the overall criteria
of the KDD process.
7. Data mining includes searching for patterns of interest in a particular
representational form or a set of such representations, such as classification rules or
trees, regression, clustering, and so forth.
8. Interpreting mined patterns.
9. Consolidating discovered knowledge.
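A toy sketch of steps 3-7 on invented records, with a trivial frequency count standing in for a real mining algorithm (the attribute names "age" and "buys" are hypothetical):

```python
# Invented raw records; None marks a missing field.
raw = [{"age": 25, "buys": "yes"}, {"age": None, "buys": "no"},
       {"age": 40, "buys": "yes"}, {"age": 31, "buys": None}]

# Step 3 (cleaning/preprocessing): drop records with missing fields.
clean = [r for r in raw if None not in r.values()]

# Step 4 (reduction/projection): keep only the attribute of interest.
target = [r["buys"] for r in clean]

# Steps 5-7 (choose task, choose algorithm, mine): a simple frequency count.
counts = {v: target.count(v) for v in set(target)}
```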
Q5: Describe three measures of association used in market basket analysis.

Answer: - The three measures used in market basket analysis are as follows: -

1. Support: - The support of an association rule measures the fraction of baskets for
which the rule is true.
2. Confidence: - The confidence in an association rule is a percentage value that shows
how frequently the rule head occurs among all the groups that contain the rule body.
The higher the value, the more often this set of items is associated together.
3. Lift: - The lift value for the association is the ratio of the rule confidence to the
expected confidence of finding the rule in any basket.
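As a sketch, the three measures can be computed for the hypothetical rule {diaper} -> {beer} over five transactions modeled on the supermarket example below:

```python
transactions = [
    {"beer", "diaper", "powder", "bread", "umbrella"},
    {"diaper", "powder"},
    {"beer", "diaper", "milk"},
    {"diaper", "beer", "detergent"},
    {"beer", "milk", "cola"},
]
n = len(transactions)
body, head = {"diaper"}, {"beer"}

# Fraction of baskets for which the whole rule (body and head) holds.
support_rule = sum(1 for t in transactions if body | head <= t) / n
support_body = sum(1 for t in transactions if body <= t) / n
support_head = sum(1 for t in transactions if head <= t) / n

confidence = support_rule / support_body  # how often the head occurs given the body
lift = confidence / support_head          # confidence relative to expected confidence
```

Here the lift is below 1, meaning that in this toy data diaper buyers are slightly less likely than an average basket to contain beer.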
Assume min_support = 40% = 2/5 and min_confidence = 70%. Five transactions
are recorded in a supermarket:

#  Transaction                                  Code
1  Beer, diaper, baby powder, bread, umbrella   B D P R U
2  Diaper, baby powder                          D P
3  Beer, diaper, milk                           B D M
4  Diaper, beer, detergent                      D B G
5  Beer, milk, cola                             B M C

1. Find frequent itemsets

F1 = {B, D, P, M}

k = 2: C2 = {BD, BP, BM, DP, DM, PM} -> eliminate infrequent BP, DM, PM

F2 = {BD, BM, DP}

k = 3: joining BD and BM yields the candidate BDM, but its subset DM is not
frequent, so BDM is pruned (it also occurs in only one transaction, below
min_support).

F3 = { }
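A brute-force support count over the five transactions confirms the frequent 1- and 2-itemsets, and shows that no 3-itemset reaches min_support (BDM occurs in only one transaction):

```python
from itertools import combinations

# Transactions from the table above, items abbreviated as in the codes.
transactions = [{"B", "D", "P", "R", "U"}, {"D", "P"}, {"B", "D", "M"},
                {"D", "B", "G"}, {"B", "M", "C"}]
min_count = 2  # 40% of 5 transactions

def frequent_of_size(k):
    # All k-itemsets whose support count meets the minimum.
    items = sorted(set().union(*transactions))
    return {frozenset(c) for c in combinations(items, k)
            if sum(1 for t in transactions if set(c) <= t) >= min_count}

F1 = frequent_of_size(1)  # {B}, {D}, {P}, {M}
F2 = frequent_of_size(2)  # {B,D}, {B,M}, {D,P}
F3 = frequent_of_size(3)  # empty set
```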
Q6: A database has five transactions. Let min_sup = 60% and min_conf = 80%.

TID   Items bought
T100  A, B, C, D, E
T200  B, E, A, C, H
T300  P, A, G, E
T400  U, R, B, A, N, A
T500  G, A, U, D, I

1. Find all frequent itemsets using Apriori and FP-growth, respectively.
Compare the efficiency of the two mining processes.
2. List all the association rules (with support s and confidence c) matching
the following metarule, where X is a variable representing customers, and
itemi denotes variables representing items (e.g., "A", "B", etc.):
