Ans: Data Mining: Data mining is the analysis of data. It is the computer-assisted
process of digging through and analyzing enormous sets of data that have either been
compiled by the computer or entered into it. In data mining, the computer analyzes the
data and extracts meaning from it. It also looks for hidden patterns within the data and
tries to predict future behavior. Data mining is mainly used to find and show relationships
among the data.
The purpose of data mining, also known as knowledge discovery, is to allow businesses to view
these behaviors, trends and/or relationships and to be able to factor them within their
decisions. This allows the businesses to make proactive, knowledge-driven decisions.
The term data mining comes from the fact that the process of data mining, i.e. searching for
relationships between data, is similar to mining and searching for precious materials. Data
mining tools use artificial intelligence, machine learning, statistics, and database systems to
find correlations between the data. These tools can help answer business questions that
traditionally were too time consuming to resolve.
Data Mining includes various steps, including the raw analysis step, database and data
management aspects, data preprocessing, model and inference considerations, interestingness
metrics, complexity considerations, post-processing of discovered structures, visualization, and
online updating.
Example: Credit card companies have a history of your past purchases and know
geographically where those purchases were made. If some purchases are suddenly made in a
city far from where you live, the credit card company is alerted to possible fraud, since its
data mining shows that you don't normally make purchases in that city. The company can
then decline that transaction or flag your card for suspicious activity.
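The fraud example above can be reduced to a toy rule: flag a purchase made in a city that never appears in the cardholder's history. The cities and the rule itself are illustrative assumptions, not a real card issuer's model.

```python
# Minimal sketch of the fraud-alert idea: flag a purchase in a city absent
# from the cardholder's purchase history. Data and rule are illustrative.

def flag_suspicious(history_cities, new_city):
    """Return True if new_city never appears in the purchase history."""
    return new_city not in set(history_cities)

history = ["Springfield", "Springfield", "Shelbyville", "Springfield"]
print(flag_suspicious(history, "Springfield"))  # False: a usual city
print(flag_suspicious(history, "Las Vegas"))    # True: triggers a fraud flag
```

A real system would score many attributes (amount, merchant type, time of day) rather than a single city test, but the shape is the same: compare new behavior against patterns mined from history.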
Data Warehousing: Data warehousing is a different process, although it is closely
interrelated with data mining. Data warehousing is the process of compiling
information or data into a data warehouse. A data warehouse is a database used to store data.
It is a central repository of data in which data from various sources is stored. This data
warehouse is then used for reporting and data analysis. It can be used for creating trending
reports for senior management reporting such as annual and quarterly comparisons.
The purpose of a data warehouse is to provide the user with flexible access to the data.
Data warehousing generally refers to the combination of many different databases across an
entire enterprise.
The main difference between data warehousing and data mining is that data warehousing is
the process of compiling and organizing data into one common database, whereas data mining
is the process of extracting meaningful data from that database. Data mining can only be done
once data warehousing is complete.
Example: Facebook gathers all of your data (your friends, your likes, the profiles you visit,
and so on) and then stores it in one central repository. Even though Facebook most likely
stores your friends, your likes, and other details in separate databases, it wants to take the
most relevant and important information and put it into one central, aggregated database.
While much of the data warehouse is populated by operational systems, data may also come
from additional data sources such as:
Distributors who supply sales and inventory information.
Click-stream data from web logs that show the most frequently viewed products or online
shopping cart analysis for partially completed orders.
Whether this additional data gets loaded into a central data warehouse will depend on how
consistently it can be merged with corporate data, how common the requirement is, and
politics. If the data is not physically stored in the data warehouse, it may be integrated with
corporate data in a specific data mart. Disparate data sources may, in some cases, also be
accessed or combined within the BI front-end tool.
Maintaining a single point of truth: higher management spanning several departments
may need to see a single picture of the business.
Merger of businesses: after a merger, two companies want to aggregate their individual
data assets.
Data Transformation
In data transformation, the data are transformed or consolidated into forms appropriate for
mining. Data transformation can involve the following:
1. Smoothing, which works to remove the noise from data. Such techniques include binning,
clustering, and regression.
2. Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual total
amounts. This step is typically used in constructing a data cube for analysis of the data at
multiple granularities.
3. Generalization of the data, where low-level or "primitive" (raw) data are replaced by higher
level concepts through the use of concept hierarchies. For example, categorical attributes, like
street, can be generalized to higher level concepts, like city or county. Similarly, values for
numeric attributes, like age, may be mapped to higher level concepts, like young, middle-aged,
and senior.
4. Normalization, where the attribute data are scaled so as to fall within a small specified
range, such as -1.0 to 1.0, or 0 to 1.0.
5. Attribute construction (or feature construction), where new attributes are constructed and
added from the given set of attributes to help the mining process.
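Steps 2 and 3 above (aggregation and generalization) can be sketched in a few lines of code; the field names, dates, and age cut-offs are illustrative assumptions, not fixed conventions.

```python
# Sketch of two transformation steps: aggregating daily sales into monthly
# totals, and generalizing a numeric attribute (age) via a concept hierarchy.
from collections import defaultdict

daily_sales = {"2024-01-05": 120.0, "2024-01-20": 80.0, "2024-02-02": 200.0}

# Aggregation: roll daily amounts up to the month ("YYYY-MM" prefix).
monthly = defaultdict(float)
for day, amount in daily_sales.items():
    monthly[day[:7]] += amount
print(dict(monthly))  # {'2024-01': 200.0, '2024-02': 200.0}

def generalize_age(age):
    """Map a numeric age to a higher-level concept (cut-offs are illustrative)."""
    if age < 35:
        return "young"
    elif age < 60:
        return "middle-aged"
    return "senior"

print([generalize_age(a) for a in (22, 47, 70)])  # ['young', 'middle-aged', 'senior']
```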
Smoothing is a form of data cleaning. Aggregation and generalization also serve as forms of
data reduction. In this section, we therefore discuss normalization and attribute construction.
An attribute is normalized by scaling its values so that they fall within a small specified range,
such as 0 to 1.0.
Normalization is particularly useful for classification algorithms involving neural networks, or
distance measurements such as nearest-neighbor classification and clustering. If using the
neural network back-propagation algorithm for classification mining, normalizing the input
values for each attribute measured in the training samples will help speed up the learning
phase. For distance-based methods, normalization helps prevent attributes with initially large
ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary
attributes).
There are many methods for data normalization. We study three: min-max normalization,
z-score normalization, and normalization by decimal scaling.
Min-max normalization performs a linear transformation on the original data. Suppose that
min_A and max_A are the minimum and maximum values of an attribute A. Min-max
normalization maps a value v of A to v' in the range [new_min_A, new_max_A] by computing:

v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

Min-max normalization preserves the relationships among the original data values. It will
encounter an "out-of-bounds" error if a future input case for normalization falls outside of the
original data range for A.
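The three normalization methods named above can be sketched directly from their textbook definitions; the attribute values are made up for illustration, and the z-score variant here uses the population standard deviation.

```python
# Sketch of min-max, z-score, and decimal-scaling normalization.
import statistics

values = [200.0, 300.0, 400.0, 600.0, 1000.0]  # illustrative attribute values

def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    """Linear map of v from [lo, hi] to [new_lo, new_hi]."""
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mean, stdev):
    """Center on the mean and scale by the standard deviation."""
    return (v - mean) / stdev

def decimal_scale(v, j):
    """Divide by 10^j, where j is the smallest integer with max(|v'|) < 1."""
    return v / (10 ** j)

lo, hi = min(values), max(values)
print([round(min_max(v, lo, hi), 3) for v in values])   # [0.0, 0.125, 0.25, 0.5, 1.0]
mu, sd = statistics.mean(values), statistics.pstdev(values)
print([round(z_score(v, mu, sd), 3) for v in values])
print([decimal_scale(v, 4) for v in values])            # j=4 since max value is 1000
```

Note how min-max keeps every value inside the new range, which is exactly why a future value outside [min_A, max_A] triggers the out-of-bounds problem described above.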
Que 4 - Differentiate between database management systems (DBMS) and data mining.
Ans: A DBMS (Database Management System) is a complete system used for managing digital
databases that allows storage of database content, creation and maintenance of data, search,
and other functionalities. Data mining, on the other hand, is a field of computer science which
deals with the extraction of previously unknown and interesting information from raw data.
Usually, the data used as input for the data mining process is stored in databases.
Statistically inclined users employ data mining, utilizing statistical models to look for hidden
patterns in data. Data miners are interested in finding useful relationships between different
data elements, which is ultimately profitable for businesses.
DBMS
DBMS, sometimes just called a database manager, is a collection of computer programs
dedicated to the management (i.e. organization, storage and retrieval) of all databases that
are installed in a system (i.e. a hard drive or network). There are different types of database
management systems, some of which are designed for the proper management of databases
configured for specific purposes. The most popular commercial database management
systems are Oracle, DB2 and Microsoft Access. All these products provide means of allocating
different levels of privileges to different users, making it possible for a DBMS to be controlled
centrally by a single administrator or to be allocated among several different people.
There are four important elements in any database management system: the modeling
language, data structures, the query language, and the mechanism for transactions. The
modeling language defines the language of each database hosted in the DBMS; several
popular approaches, such as the hierarchical, network, relational and object models, are
currently in practice. Data structures help organize the data, such as individual records,
files, fields and their definitions, and objects such as visual media. The data query language
maintains the security of the database by monitoring login data, access rights of different
users, and protocols for adding data to the system. SQL is a popular query language used in
relational database management systems. Finally, the mechanism that allows for transactions
helps with concurrency and multiplicity. That mechanism ensures that the same record will
not be modified by multiple users at the same time, keeping data integrity intact.
Additionally, DBMSs provide backup and other facilities as well.
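As a small illustration of the transaction mechanism, Python's built-in sqlite3 module can serve as a stand-in DBMS: either both updates of a transfer commit together, or neither does. The table and amounts are made up for illustration.

```python
# Sketch of a DBMS transaction: an all-or-nothing transfer between accounts,
# using the sqlite3 module bundled with Python as a minimal stand-in DBMS.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100.0), ('bob', 50.0)")
conn.commit()

# "with conn" opens a transaction: it commits on success, rolls back on error.
with conn:
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")

print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# [('alice', 70.0), ('bob', 80.0)]
```

Had the second UPDATE raised an error, the rollback would have restored alice's balance too, which is the data-integrity guarantee described above.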
Data Mining
Data mining is also known as Knowledge Discovery in Data (KDD). As mentioned above, it is a
field of computer science which deals with the extraction of previously unknown and
interesting information from raw data. Due to the exponential growth of data, especially in
areas such as business, data mining has become a very important tool for converting this
large wealth of data into business intelligence, as manual extraction of patterns has become
practically impossible. For example, it is currently being used for applications such as social
network analysis, fraud detection and marketing. Data mining usually deals with four tasks:
clustering, classification, regression, and association. Clustering is identifying similar groups
in unstructured data. Classification is learning rules that can be applied to new data, and
typically includes the following steps: preprocessing of data, model design, learning/feature
selection, and evaluation/validation. Regression is finding functions with minimal error to
model data. Association is looking for relationships between variables. Data mining is usually
used to answer questions such as which products might help Wal-Mart obtain the highest
profit next year.
Difference Summary
DBMS is a full-fledged system for housing and managing a set of digital databases, whereas
data mining is a technique or concept in computer science which deals with extracting useful
and previously unknown information from raw data. Most of the time, these raw data are
stored in very large databases. Therefore, data miners use the existing functionalities of a
DBMS to handle, manage and even preprocess raw data before and during the data mining
process. However, a DBMS alone cannot be used to analyze data, although some present-day
DBMSs have built-in data analysis tools or capabilities.
Features: running time O(n).
Hierarchical Clustering
Algorithm:
INPUT: n genes/experiments
Repeat: merge the two most similar clusters and update the matrix (i.e. substitute the two
clusters with the new cluster).
K-means: iterative improvement.
Hierarchical k-means: top-down hierarchical clustering using k-means iteratively with
k=2 -> best of both worlds.
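The agglomerative merge loop described above can be written as a short runnable routine. The 1-D points, the single-linkage distance, and the target cluster count k=2 are illustrative assumptions; real gene-expression data would use vectors and a proper distance matrix.

```python
# Sketch of agglomerative hierarchical clustering: every point starts as its
# own cluster, and the two closest clusters (single linkage) are merged
# repeatedly until only k clusters remain.

def closest_pair(clusters):
    """Indices of the two clusters with the smallest single-link distance."""
    best = (None, None, float("inf"))
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
            if d < best[2]:
                best = (i, j, d)
    return best

def agglomerate(points, k):
    clusters = [[p] for p in points]              # INPUT: each point is a cluster
    while len(clusters) > k:                      # Repeat ...
        i, j, _ = closest_pair(clusters)
        clusters[i] = clusters[i] + clusters[j]   # merge the two most similar
        del clusters[j]                           # ... and update the cluster list
    return clusters

print(agglomerate([1.0, 1.2, 5.0, 5.1, 9.0], k=2))
# [[1.0, 1.2, 5.0, 5.1], [9.0]]
```

The naive pairwise search makes this O(n^3) overall; maintaining an explicit distance matrix, as the notes suggest, is the standard way to speed up the update step.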
Que 6 - Differentiate between Web content mining and Web usage mining.
Ans: Web Content Mining
Web content mining targets the knowledge discovery, in which the main objects are the
traditional collections of multimedia documents such as images, video, and audio, which are
embedded in or linked to the web pages.
It is also quite different from data mining because Web data are mainly semi-structured
and/or unstructured, while data mining deals primarily with structured data. Web content
mining is also different from text mining because of the semi-structured nature of the Web,
whereas text mining focuses on unstructured texts. Web content mining thus requires
creative applications of data mining and/or text mining techniques, as well as its own unique
approaches. In the past few years, there has been a rapid expansion of activities in the Web
content mining area. This is not surprising given the phenomenal growth of Web content and
the significant economic benefit of such mining. However, due to the heterogeneity and the
lack of structure of Web data, automated discovery of targeted or unexpected knowledge
still presents many challenging research problems.
Web content mining can be differentiated from two points of view: the agent-based approach
and the database approach. The first approach aims at improving information finding and
filtering. The second aims at modeling the data on the Web into a more structured form, in
order to apply standard database querying mechanisms and data mining applications to
analyze it.
Web Content Mining Problems/Challenges
Data/Information Extraction: Extraction of structured data from Web pages, such as products
and search results, is a difficult task. Extracting such data allows one to provide value-added
services. Two main types of techniques, machine learning and automatic extraction, are used
to solve this problem.
Web Information Integration and Schema Matching: Although the Web contains a huge amount
of data, each web site (or even page) represents similar information differently. Identifying or
matching semantically similar data is a very important problem with many practical
applications.
Opinion extraction from online sources: There are many online opinion sources, e.g., customer
reviews of products, forums, blogs and chat rooms. Mining opinions (especially consumer
opinions) is of great importance for marketing intelligence and product benchmarking.
Knowledge synthesis: Concept hierarchies or ontologies are useful in many applications.
However, generating them manually is very time consuming. A few existing methods explore
the information redundancy of the Web for this task. The main application is to synthesize
and organize the pieces of information on the Web to give the user a coherent picture of the
topic domain.
Segmenting Web pages and detecting noise: In many Web applications, one only wants the
main content of a Web page, without advertisements, navigation links, or copyright notices.
Automatically segmenting a Web page to extract its main content is an interesting problem.
All of these tasks present major research challenges.
Web usage mining techniques include sequential pattern mining, clustering, and
classification. The features of sequential pattern mining and classification mining are given
below:
Sequential Pattern Mining
In Web server logs, a visit by a client is recorded over a period of time. The time stamp
associated with a transaction in this case will be a time interval, which is determined and
attached to the transaction during data preprocessing. The discovery of sequential patterns
in Web server access logs allows Web-based organizations to predict user navigation patterns
and helps in targeting advertising at groups of users based on these patterns. By analyzing
this information, the Web mining system can determine temporal relationships among data
items, such as the following:
30% of clients who visited /company/products had done a search in Google on a keyword
within the past week.
60% of clients who placed an online order in /company/product1 also placed an online
order in /company/product4 within 15 days.
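A statistic like the second one above can be computed directly from a preprocessed log. The (client, page, day) records, the page paths reused from the example, and the 15-day window are illustrative assumptions.

```python
# Sketch of a sequential-pattern statistic: the share of clients who ordered
# one page's product and then another's within a time window (in days).

logs = [
    ("c1", "/company/product1", 1), ("c1", "/company/product4", 10),
    ("c2", "/company/product1", 3), ("c2", "/company/product4", 30),
    ("c3", "/company/product1", 5),
]

def followed_within(logs, first, second, window):
    """Fraction of clients with a 'second' event after 'first' within 'window' days."""
    buyers = {c for c, page, _ in logs if page == first}
    hits = set()
    for c in buyers:
        t1 = [d for cc, p, d in logs if cc == c and p == first]
        t2 = [d for cc, p, d in logs if cc == c and p == second]
        if any(0 < b - a <= window for a in t1 for b in t2):
            hits.add(c)
    return len(hits) / len(buyers)

print(followed_within(logs, "/company/product1", "/company/product4", 15))
# 1/3: only c1's follow-up order falls inside the 15-day window
```

Full sequential pattern mining algorithms (e.g. the Apriori family) search over all such page sequences rather than testing one pair, but each candidate is scored by exactly this kind of support count.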
Classification Mining
Discovering classification rules allows one to develop a profile of items belonging to a particular
group according to their common attributes. This profile can then be used to classify new data
items that are added to the database.
In Web mining, classification techniques allow one to develop a profile for clients who access
particular server files based on demographic information available on those clients, or based on
their navigation patterns. For example, classification on Web access logs may lead to the
discovery of relationships such as the following:
Clients from state or government agencies who visit the site tend to be interested in the
page /company/product.
50% of clients who placed an online order in /company/product2 were in the 20-25 age
group and lived on the West Coast.
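The profile idea above can be sketched as a tiny rule learner: summarize the positive class by the attribute ranges it covers, then apply that profile to new clients. The records, attributes, and the learned rule are all illustrative assumptions.

```python
# Sketch of classification mining: learn a simple profile (a rule) from
# labeled client records, then classify new clients against it.

training = [
    {"age": 22, "region": "West Coast", "ordered_product2": True},
    {"age": 24, "region": "West Coast", "ordered_product2": True},
    {"age": 40, "region": "East Coast", "ordered_product2": False},
    {"age": 55, "region": "Midwest",    "ordered_product2": False},
]

def learn_rule(records):
    """Profile the positive class by the age range and regions it covers."""
    pos = [r for r in records if r["ordered_product2"]]
    ages = [r["age"] for r in pos]
    return {"age_min": min(ages), "age_max": max(ages),
            "regions": {r["region"] for r in pos}}

def classify(rule, client):
    """Predict membership: client matches the learned profile."""
    return (rule["age_min"] <= client["age"] <= rule["age_max"]
            and client["region"] in rule["regions"])

rule = learn_rule(training)
print(classify(rule, {"age": 23, "region": "West Coast"}))  # True
print(classify(rule, {"age": 50, "region": "Midwest"}))     # False
```

Real classifiers (decision trees, rule induction) learn such profiles from many attributes at once and handle noisy labels, but the output is the same kind of human-readable rule the notes describe, e.g. "20-25 age group and West Coast".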