
Que 1 - Differentiate between Data Mining and Data Warehousing.

Ans: Data Mining: Data Mining is essentially the analysis of data. It is the computer-assisted process of digging through and analyzing enormous sets of data that have either been compiled by the computer or entered into it. In data mining, the computer analyzes the data and extracts meaning from it. It also looks for hidden patterns within the data and tries to predict future behavior. Data mining is mainly used to find and show relationships among the data.
The purpose of data mining, also known as knowledge discovery, is to allow businesses to view
these behaviors, trends and/or relationships and to be able to factor them within their
decisions. This allows the businesses to make proactive, knowledge-driven decisions.
The term data mining comes from the fact that the process of data mining, i.e. searching for
relationships between data, is similar to mining and searching for precious materials. Data
mining tools use artificial intelligence, machine learning, statistics, and database systems to
find correlations between the data. These tools can help answer business questions that
traditionally were too time consuming to resolve.
Data Mining includes various steps, including the raw analysis step, database and data
management aspects, data preprocessing, model and inference considerations, interestingness
metrics, complexity considerations, post-processing of discovered structures, visualization, and
online updating.
Example: Credit card companies have a history of your past purchases and know geographically where those purchases have been made. If all of a sudden some purchases are made in a city far from where you live, the credit card companies are put on alert to possible fraud, since their data mining shows that you don't normally make purchases in that city. Then, the credit card company can disable your card for that transaction or just put a flag on your card for suspicious activity.
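As a rough illustration of the idea (not how any real card issuer implements it), the following Python sketch flags transactions made in cities that do not appear in a customer's purchase history; the customer IDs, cities, and amounts are invented for the example.

# Toy illustration of location-based fraud flagging (hypothetical data).
purchase_history = {
    "cust_001": ["Mumbai", "Mumbai", "Pune", "Mumbai"],
}

new_transactions = [
    ("cust_001", "Mumbai", 1200.0),
    ("cust_001", "Oslo", 900.0),   # far from the usual cities
]

def flag_suspicious(customer_id, city, history):
    """Flag a transaction if the customer has never purchased in that city."""
    usual_cities = set(history.get(customer_id, []))
    return city not in usual_cities

for cust, city, amount in new_transactions:
    if flag_suspicious(cust, city, purchase_history):
        print(f"Flag for review: {cust} spent {amount} in {city}")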
Data Warehousing: Data warehousing, in contrast, is a different process, although the two are closely interrelated. Data warehousing is the process of compiling
information or data into a data warehouse. A data warehouse is a database used to store data.
It is a central repository of data in which data from various sources is stored. This data
warehouse is then used for reporting and data analysis. It can be used for creating trending
reports for senior management reporting such as annual and quarterly comparisons.
The purpose of a data warehouse is to provide flexible access to the data to the user. Data
warehousing generally refers to the combination of many different databases across an entire
enterprise.
The main difference between data warehousing and data mining is that data warehousing is
the process of compiling and organizing data into one common database, whereas data mining
is the process of extracting meaningful data from that database. Data mining can only be done
once data warehousing is complete.
Example: Facebook gathers all of your data (your friends, your likes, whose profiles you view, etc.) and then stores that data in one central repository. Even though Facebook most likely stores your friends, your likes, etc., in separate databases, it wants to take the most relevant and important information and put it into one central aggregated database.

Que 2 - Explain briefly about Business Intelligence.


Ans: Every business intelligence (BI) deployment has an underlying architecture. The BI architecture is much like the engine of a car: a necessary component, often powerful, but one that users, like drivers, don't always understand. For some companies new to business intelligence, the BI architecture may primarily be the operational systems and the BI front-end tools. For more mature BI deployments, and particularly for enterprise customers, it will involve ETL (extract, transform, and load) tools, a data warehouse, data marts, BI front-end tools, and other such components.
When IT discusses BI with users, we readily fall into technobabble, and senseless acronyms abound. Most car drivers know that cars have a battery, a transmission, and a fuel tank: an adequate level of knowledge for having a conversation with a mechanic or salesperson, but arguably not enough expertise to begin rebuilding an engine. In this chapter, then, I'll present the major architectural technical components that make up BI and that business users should have at least a high-level understanding of to participate in discussions about building and leveraging a BI solution. If you are a technical expert, you might find this chapter to be overly simplified, and it is. If you are looking for a reference on any one of these components, consult the list of resources in Appendix B of Successful Business Intelligence.
Operational and Source Systems
Operational systems are the starting point for most quantitative data in a company. Operational
systems may also be referred to as transaction processing systems, source systems, and
enterprise resource planning (ERP) systems.
Manufacturing system
When a product is produced, the production order is entered in the manufacturing system.
The quantity of raw material used and the finished product produced are recorded.
Sales system
When a customer places an order, the order details are entered in an order entry system.
Supply chain system
When the product is available, the product is shipped and order fulfillment details are entered.
Accounting system
Accounting then invoices the customer and collects payment. The invoices and payments may
be recorded in an operational system that is different from the order entry system.
In each step in this process, users are creating data that can eventually be used for business
intelligence. As well, to complete a task, operational users may need business intelligence.
Perhaps in order to accept an order, the product must be available in inventory. As is the case
with many online retailers, customers cannot place an order for a product combination (color,
size) that is not available; a report immediately appears with a list of alternative sizes or colors.
A better approach is to systematically transfer data between the systems or modules. However, even when data is systematically transferred, the Customer ID entered in the order system may not, for example, be the same Customer ID entered in the accounting system, even though both IDs refer to the same customer!
Ideally, consistent information flows through the process seamlessly. Enterprise resource planning (ERP) systems ensure adherence to standard processes and are broader in scope than custom operational systems of the past. From a data perspective, ERPs reduce duplicate data entry and thus improve data quality (see Chapter 7 of Successful Business Intelligence). With an integrated ERP, a common set of reference tables with consistent customer IDs, product codes, and chart of accounts is shared across the modules or applications.
Within the business intelligence life cycle, the operational systems are the starting point for
data you will later want to analyze. If you do not capture the data in the operational system,
you can't analyze it. If the operational system contains errors, those errors will only get compounded when you later aggregate the data and combine it with other data.

While much of the data warehouse is populated by operational systems, data may also come
from additional data sources such as:
Distributors who supply sales and inventory information.
Click-stream data from web logs that show the most frequently viewed products or online
shopping cart analysis for partially completed orders.
Whether this additional data gets loaded into a central data warehouse will depend on how
consistently it can be merged with corporate data, how common the requirement is, and
politics. If the data is not physically stored in the data warehouse, it may be integrated with
corporate data in a specific data mart. Disparate data sources may, in some cases, also be
accessed or combined within the BI front-end tool.

Que 3 - Explain the concepts of Data Integration and Transformation


Ans:
Data Integration
Data Integration is the process of combining heterogeneous data sources into a single queryable schema so as to get a unified view of the data.
Often large companies and enterprises maintain separate departmental databases to store the data pertaining to each department. Although such separation of data provides better manageability and security, performing any cross-departmental analysis on these datasets becomes impossible.
For example, if the marketing and sales departments maintain two separate databases, it might not be possible to analyze the effect of a certain advertising campaign by the marketing department on the sales of a product. Similarly, if the HR and production departments maintain their own individual databases, it might not be possible to analyze the correlation between yearly incentives and employee productivity.
Data integration provides a mechanism to integrate the data from different departments into a single queryable schema.
Below is a list of examples where data integration is required. The list, however, is not comprehensive:

Cross-functional analysis - as discussed in the above example
Finding correlations - statistical intelligence / scientific applications
Sharing information - legal or regulatory requirements, e.g. sharing customers' credit information among banks
Maintaining a single point of truth - higher management spanning several departments may need to see a single picture of the business
Merger of businesses - after a merger, two companies want to aggregate their individual data assets

Data integration can be done by two major approaches:


Tight Coupling: Data Warehousing
In the tight coupling approach, which is often implemented through data warehousing, data is pulled from disparate sources into a single physical location through the process of ETL (Extraction, Transformation and Loading). The single physical location provides a uniform interface for querying the data. The ETL layer helps to map the data from the sources so as to provide a semantically uniform data warehouse.
This approach is called tight coupling since the data is tightly coupled with the physical repository at the time of query.
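A minimal sketch of the ETL idea in Python, assuming two hypothetical departmental sources that use different customer-ID conventions; an in-memory SQLite database stands in for the physical warehouse, and the ID mapping is invented for the example.

import sqlite3

# Hypothetical departmental sources with inconsistent customer IDs.
sales_rows = [{"cust": "C-101", "amount": 250.0}, {"cust": "C-102", "amount": 80.0}]
marketing_rows = [{"customer_id": "101", "campaign": "spring_promo"}]

# Transformation step: map both ID conventions onto one canonical key.
def canonical_id(raw_id):
    return raw_id.replace("C-", "")

warehouse = sqlite3.connect(":memory:")  # stands in for the physical warehouse
warehouse.execute("CREATE TABLE sales (customer_id TEXT, amount REAL)")
warehouse.execute("CREATE TABLE campaigns (customer_id TEXT, campaign TEXT)")

# Load step: push the transformed rows into the single physical location.
warehouse.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(canonical_id(r["cust"]), r["amount"]) for r in sales_rows],
)
warehouse.executemany(
    "INSERT INTO campaigns VALUES (?, ?)",
    [(r["customer_id"], r["campaign"]) for r in marketing_rows],
)

# Cross-departmental query against the uniform schema.
for row in warehouse.execute(
    "SELECT s.customer_id, s.amount, c.campaign "
    "FROM sales s JOIN campaigns c ON s.customer_id = c.customer_id"
):
    print(row)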

Loose Coupling: Virtual Mediated Schema


In contrast to the tight coupling approach, a virtual mediated schema provides an interface that takes the query from the user, transforms it into a form the source databases can understand, and then sends the query directly to the source databases to obtain the result. In this approach, the data does not actually reside in the mediated schema; it remains only in the actual source databases. However, the mediated schema contains several "adapters" or "wrappers" that can connect back to the source systems in order to bring the data to the front end. This approach is often implemented through a middleware architecture (Enterprise Application Integration, EAI).
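The loose-coupling idea can be sketched as follows; the "wrappers" here are hypothetical Python functions that delegate one mediated query to each source at query time and combine the results, with plain dictionaries standing in for the real departmental databases (no particular middleware product is implied).

# Minimal mediated-schema sketch: wrappers fetch from the sources at query
# time; the data stays in the sources and is only combined on demand.

hr_source = {"E1": {"name": "Asha", "incentive": 5000}}
production_source = {"E1": {"units_produced": 420}}

def hr_wrapper(employee_id):
    # Adapter for the HR database (a dict stands in for it here).
    return hr_source.get(employee_id, {})

def production_wrapper(employee_id):
    # Adapter for the production database.
    return production_source.get(employee_id, {})

def mediated_query(employee_id):
    """Answer a query against the virtual schema by delegating to the sources."""
    result = {"employee_id": employee_id}
    result.update(hr_wrapper(employee_id))
    result.update(production_wrapper(employee_id))
    return result

print(mediated_query("E1"))
# {'employee_id': 'E1', 'name': 'Asha', 'incentive': 5000, 'units_produced': 420}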

Data Transformation
In data transformation, the data are transformed or consolidated into forms appropriate for
mining. Data transformation can involve the following:
1. Smoothing, which works to remove the noise from data. Such techniques include binning,
clustering, and regression.
2. Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual total
amounts. This step is typically used in constructing a data cube for analysis of the data at
multiple granularities.
3. Generalization of the data, where low-level or "primitive" (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or county. Similarly, values for numeric attributes, like age, may be mapped to higher-level concepts, like young, middle-aged, and senior (aggregation and generalization are illustrated in the sketch after this list).
4. Normalization, where the attribute data are scaled so as to fall within a small specified
range, such as -1.0 to 1.0, or 0 to 1.0.
5. Attribute construction (or feature construction), where new attributes are constructed and
added from the given set of attributes to help the mining process.
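A small sketch of steps 2 and 3 using pandas (one reasonable tool choice); the daily sales table, age values, and bin boundaries are invented for the example.

import pandas as pd

# Hypothetical daily sales records.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-03"]),
    "amount": [100.0, 150.0, 80.0],
})

# Aggregation: roll daily sales up to monthly totals.
monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()
print(monthly)

# Generalization: map a numeric age to higher-level concepts via binning.
customers = pd.DataFrame({"age": [23, 41, 67]})
customers["age_group"] = pd.cut(
    customers["age"],
    bins=[0, 30, 55, 120],                     # illustrative boundaries only
    labels=["young", "middle-aged", "senior"],
)
print(customers)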
Smoothing is a form of data cleaning. Aggregation and generalization also serve as forms of
data reduction. In this section, we therefore discuss normalization and attribute construction.
An attribute is normalized by scaling its values so that they fall within a small specified range,
such as 0 to 1.0.
Normalization is particularly useful for classification algorithms involving neural networks, or for distance measurements such as nearest-neighbor classification and clustering. If using the
neural network back-propagation algorithm for classification mining, normalizing the input
values for each attribute measured in the training samples will help speed up the learning
phase. For distance-based methods, normalization helps prevent attributes with initially large
ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary
attributes).
There are many methods for data normalization. We study three: min-max normalization, z-score normalization, and normalization by decimal scaling.
Min-max normalization performs a linear transformation on the original data. Suppose that min_A and max_A are the minimum and maximum values of an attribute A. Min-max normalization maps a value v of A to v' in the range [new_min_A, new_max_A] by computing:

v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

Min-max normalization preserves the relationships among the original data values. It will encounter an "out of bounds" error if a future input case for normalization falls outside of the original data range for A.
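The following NumPy sketch implements min-max normalization as defined above, together with z-score normalization for comparison; the income values are made up for the example.

import numpy as np

income = np.array([12000.0, 35000.0, 58000.0, 99000.0])

def min_max_normalize(v, new_min=0.0, new_max=1.0):
    """Map values linearly from [min, max] to [new_min, new_max]."""
    v_min, v_max = v.min(), v.max()
    return (v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min

def z_score_normalize(v):
    """Scale values by how many standard deviations they lie from the mean."""
    return (v - v.mean()) / v.std()

print(min_max_normalize(income))          # values now fall in [0, 1]
print(min_max_normalize(income, -1, 1))   # or in [-1, 1]
print(z_score_normalize(income))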

Que 4 - Differentiate between database management systems (DBMS) and data mining.
Ans: A DBMS (Database Management System) is a complete system used for managing digital
databases that allows storage of database content, creation/maintenance of data, search and
other functionalities. On the other hand, Data Mining is a field in computer science, which deals
with the extraction of previously unknown and interesting information from raw data. Usually,
the data used as the input for the Data mining process is stored in databases. Users who are
inclined toward statistics use Data Mining. They utilize statistical models to look for hidden
patterns in data. Data miners are interested in finding useful relationships between different
data elements, which is ultimately profitable for businesses.
DBMS
DBMS, sometimes just called a database manager, is a collection of computer programs dedicated to the management (i.e. organization, storage and retrieval) of all the databases that are installed on a system (i.e. a hard drive or network). There are different types of Database
Management Systems existing in the world, and some of them are designed for the proper
management of databases configured for specific purposes. Most popular commercial Database
Management Systems are Oracle, DB2 and Microsoft Access. All these products provide means
of allocation of different levels of privileges for different users, making it possible for a DBMS to
be controlled centrally by a single administrator or to be allocated to several different people.
There are four important elements in any Database Management System. They are the
modeling language, data structures, query language and mechanism for transactions. The
modeling language defines the language of each database hosted in the DBMS. Currently
several popular approaches like hierarchical, network, relational and object are in practice. Data
structures help organize the data such as individual records, files, fields and their definitions
and objects such as visual media. Data query language maintains the security of the database
by monitoring login data, access rights to different users, and protocols to add data to the
system. SQL is a popular query language that is used in Relational Database Management
Systems. Finally, the mechanism that allows for transactions helps with concurrency and multiplicity. That mechanism will make sure that the same record will not be modified by multiple users at the same time, thus keeping data integrity intact. Additionally, DBMSs provide backup and other facilities as well.
Data Mining
Data mining is also known as Knowledge Discovery in Data (KDD). As mentioned above, it is a field of computer science which deals with the extraction of previously unknown and interesting information from raw data. Due to the exponential growth of data, especially in areas such as business, data mining has become a very important tool to convert this large wealth of data into business intelligence, as manual extraction of patterns has become seemingly impossible in the past few decades. For example, it is currently being used for various applications such as social network analysis, fraud detection and marketing. Data mining usually deals with the following four tasks: clustering, classification, regression, and association. Clustering is identifying similar groups from unstructured data. Classification is learning rules that can be applied to new data and will typically include the following steps: preprocessing of data, model design, learning/feature selection and evaluation/validation. Regression is finding functions with minimal error to model data. And association is looking for relationships between variables. Data mining is usually used to answer questions such as which products might help Wal-Mart obtain the highest profit next year.
Difference Summary
DBMS is a full-fledged system for housing and managing a set of digital databases, whereas Data Mining is a technique or concept in computer science which deals with extracting useful and previously unknown information from raw data. Most of the time, these raw data are stored in very large databases. Therefore, data miners use the existing functionalities of a DBMS to handle, manage and even preprocess raw data before and during the Data mining process. However, a DBMS alone cannot be used to analyze data, although some DBMSs now have built-in data analysis tools or capabilities.
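The distinction can be illustrated with a short Python sketch: SQLite plays the role of the DBMS (storage and querying), while the lines after the query do a very simple "mining" step, counting item pairs that are bought together in the stored transactions. The basket data are invented for the example.

import sqlite3
from collections import Counter
from itertools import combinations

# DBMS role: store and retrieve the raw transactions.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE purchases (basket_id INTEGER, item TEXT)")
db.executemany("INSERT INTO purchases VALUES (?, ?)", [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "milk"), (2, "eggs"),
    (3, "milk"), (3, "eggs"),
])

# Pull baskets back out of the DBMS.
baskets = {}
for basket_id, item in db.execute("SELECT basket_id, item FROM purchases"):
    baskets.setdefault(basket_id, set()).add(item)

# Mining role: count item pairs bought together (a toy association step).
pair_counts = Counter()
for items in baskets.values():
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(2))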

Que 5 - Differentiate between K-means and Hierarchical clustering


Ans: K-means clustering
Algorithm
Split the data into k random clusters
Repeat:
  calculate the centroid of each cluster
  (re-)assign each gene/experiment to the closest centroid
  stop if no new assignments are made

Features
Low memory usage
Running time: O(n) per iteration
Improves iteratively: not trapped in previous mistakes
Non-deterministic: will in general produce different clusters with different initializations
Number of clusters must be decided in advance
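A compact NumPy sketch of the algorithm above (random initial assignment, then alternating centroid computation and reassignment until nothing changes); the data are random and k is fixed at 3 purely for illustration.

import numpy as np

def k_means(data, k, max_iter=100, seed=0):
    """Basic k-means: random initial clusters, then alternate centroid
    computation and reassignment until no point changes cluster."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(data))   # split data into k random clusters
    for _ in range(max_iter):
        # Calculate the centroid of each cluster (re-seed any empty cluster).
        centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j)
            else data[rng.integers(len(data))]
            for j in range(k)
        ])
        # (Re-)assign each point to the closest centroid.
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        if np.array_equal(new_labels, labels):    # stop if no new assignments are made
            break
        labels = new_labels
    return labels, centroids

points = np.random.default_rng(1).normal(size=(60, 2))
labels, centroids = k_means(points, k=3)
print(centroids)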

Hierarchical Clustering
Algorithm
INPUT: n genes/experiments
Consider each gene/experiment as an individual cluster and initialize an n x n distance matrix d
Repeat:
  identify the two most similar clusters in d (i.e. the smallest entry in d)
  merge the two most similar clusters and update the matrix (i.e. substitute the two clusters with the new cluster)
OUTPUT: a tree of merged genes/experiments (called a dendrogram)

Features
Huge memory requirements: stores the n x n matrix
Running time: O(n^3)
Deterministic: produces the same clustering each time
Nice visualization: dendrogram
Number of clusters can be selected using the dendrogram
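Using SciPy (one possible tool choice), the agglomerative procedure and its dendrogram can be sketched as follows; the data are random and "average" linkage is just one of several ways of measuring cluster similarity.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

points = np.random.default_rng(2).normal(size=(20, 2))

# Agglomerative clustering: repeatedly merge the two most similar clusters.
merge_tree = linkage(points, method="average")

# Cut the tree to obtain a chosen number of clusters (selected from the dendrogram).
labels = fcluster(merge_tree, t=3, criterion="maxclust")
print(labels)

# The dendrogram visualizes the full sequence of merges.
dendrogram(merge_tree)
plt.show()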

Hierarchical vs. k-means
Hierarchical clustering:
  computationally expensive -> relatively small data sets
  nice visualization, number of clusters can be selected
  deterministic -> cannot correct early mistakes
K-means:
  computationally efficient -> large data sets
  predefined number of clusters
  non-deterministic -> should be run several times
  iterative improvement

Hierarchical k-means: top-down hierarchical clustering using k-means iteratively with k=2 -> best of both worlds.
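A rough sketch of that hybrid, using scikit-learn's KMeans with k=2 to split the largest cluster repeatedly until a desired number of clusters is reached; "split the largest cluster" is just one simple splitting policy among several possible ones, and the data are random.

import numpy as np
from sklearn.cluster import KMeans

def bisecting_k_means(data, n_clusters, seed=0):
    """Top-down hierarchical clustering: repeatedly split the largest cluster with k=2."""
    clusters = [np.arange(len(data))]           # start with one cluster of all indices
    while len(clusters) < n_clusters:
        clusters.sort(key=len)
        largest = clusters.pop()                # pick the largest cluster to split
        km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(data[largest])
        clusters.append(largest[km.labels_ == 0])
        clusters.append(largest[km.labels_ == 1])
    labels = np.empty(len(data), dtype=int)
    for cluster_id, idx in enumerate(clusters):
        labels[idx] = cluster_id
    return labels

points = np.random.default_rng(3).normal(size=(40, 2))
print(bisecting_k_means(points, n_clusters=4))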

Que 6 - Differentiate between Web content mining and Web usage mining.
Ans: Web Content Mining
Web content mining targets knowledge discovery in which the main objects are the traditional collections of multimedia documents, such as images, video, and audio, which are embedded in or linked to web pages.
It is also quite different from Data mining because Web data are mainly semi-structured and/or unstructured, while Data mining deals primarily with structured data. Web content mining is also different from Text mining because of the semi-structured nature of the Web, while Text mining focuses on unstructured texts. Web content mining thus requires creative applications of Data mining and/or Text mining techniques as well as its own unique approaches. In the past few years, there has been a rapid expansion of activities in the Web content mining area. This is not surprising because of the phenomenal growth of Web content and the significant economic benefit of such mining. However, due to the heterogeneity and the lack of structure of Web data, automated discovery of targeted or unexpected knowledge still presents many challenging research problems.
Web content mining can be approached from two points of view: the agent-based approach or the database approach. The first approach aims at improving information finding and filtering. The second approach aims at modeling the data on the Web in a more structured form in order to apply standard database querying mechanisms and data mining applications to analyze it.
Web Content Mining Problems/Challenges
Data/Information Extraction: Extraction of structured data from Web pages, such as products and search results, is a difficult task. Extracting such data allows one to provide services. Two main types of techniques, machine learning and automatic extraction, are used to solve this problem.
Web Information Integration and Schema Matching: Although the Web contains a huge amount
of data, each web site (or even page) represents similar information differently. Identifying or
matching semantically similar data is a very important problem with many practical
applications.
Opinion extraction from online sources: There are many online opinion sources, e.g., customer
reviews of products, forums, blogs and chat rooms. Mining opinions (especially consumer
opinions) is of great importance for marketing intelligence and product benchmarking.
Knowledge synthesis: Concept hierarchies or ontologies are useful in many applications. However, generating them manually is very time consuming. A few existing methods that explore the information redundancy of the Web will be presented. The main application is to synthesize and organize the pieces of information on the Web to give the user a coherent picture of the topic domain.
Segmenting Web pages and detecting noise: In many Web applications, one only wants the main content of the Web page, without advertisements, navigation links, or copyright notices. Automatically segmenting Web pages to extract their main content is an interesting problem.
All of these tasks present major research challenges.

Web Usage Mining


Web Usage Mining focuses on techniques that can predict the behavior of users while they are interacting with the WWW. Web usage mining discovers user navigation patterns from web data; it tries to extract useful information from the secondary data derived from users' interactions while surfing the Web. Web usage mining collects data from Web log records to discover user access patterns of web pages. There are several available research projects and commercial tools that analyze those patterns for different purposes. The resulting knowledge can be utilized in personalization, system improvement, site modification, business intelligence and usage characterization.
The only information left behind by many users visiting a Web site is the path through the
pages they have accessed. Most of the Web information retrieval tools only use the textual
information, while they ignore the link information that could be very valuable. In general, there
are mainly four kinds of data mining techniques applied to the web mining domain to discover
user navigation patterns:

Association Rule mining

Sequential pattern

Clustering

Classification

The features of sequential mining and classification mining are given below:
Sequential Pattern Mining
In Web server logs, a visit by a client is recorded over a period of time. The time stamp associated with a transaction in this case will be a time interval, which is determined and attached to the transaction during data preprocessing. The discovery of sequential patterns
in Web server access logs allows Web-based organizations to predict user navigation patterns
and helps in targeting advertising aimed at groups of users based on these patterns. By
analyzing this information, the Web mining system can determine temporal relationships
among data items such as the following:

30% of clients who visited /company/products had done a search in Google on a given keyword within the past week.

60% of clients who placed an online order in /company/product1 also placed an online order in /company/product4 within 15 days.
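A toy Python sketch of this kind of analysis on preprocessed session data: each session is an ordered list of pages, and the code counts how often one page is followed (not necessarily immediately) by another within the same session, giving crude sequential patterns like the examples above. The session data and page names are invented.

from collections import Counter

# Hypothetical preprocessed sessions: ordered page views per visit.
sessions = [
    ["/home", "/company/products", "/company/product1", "/checkout"],
    ["/home", "/company/product1", "/company/product4", "/checkout"],
    ["/company/products", "/company/product1", "/company/product4"],
]

# Count ordered page pairs: page A viewed before page B in the same session.
pair_counts = Counter()
for pages in sessions:
    seen = []
    for page in pages:
        for earlier in seen:
            pair_counts[(earlier, page)] += 1
        seen.append(page)

total = len(sessions)
for (a, b), count in pair_counts.most_common(5):
    print(f"{100 * count / total:.0f}% of sessions visited {a} and later {b}")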

Classification Mining
Discovering classification rules allows one to develop a profile of items belonging to a particular
group according to their common attributes. This profile can then be used to classify new data
items that are added to the database.
In Web mining, classification techniques allow one to develop a profile for clients who access particular server files based on demographic information available about those clients, or based on their navigation patterns. For example, classification on Web access logs may lead to the discovery of relationships such as the following:

Clients from state or government agencies who visit the site tend to be interested in the
page /company/product.

50% of clients who placed an online order in /company/product2 were in the 20-25 age group and lived on the West Coast.
