DATA MINING
Data mining is the process of extracting useful information and patterns from large volumes of data. It covers the collection, extraction, analysis and statistical summarisation of data, and is also known as the knowledge discovery process, knowledge mining from data, or data/pattern analysis. Once the information and patterns are found, they can be used to make decisions for developing the business. Data mining tools can answer business questions that were previously too difficult to resolve, and they can forecast future trends, letting business people make proactive decisions.
Exploration – In this step the data is cleaned and converted into another form. The nature of the data is also determined.
Pattern Identification – The next step is to choose the pattern that will make the best prediction.
Deployment – The identified patterns are used to obtain the desired outcome.
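As a concrete illustration, the three steps above can be sketched on a toy set of monthly sales records. Everything here, the data, the field names and the growth-rate "pattern", is invented for the example and not taken from any particular tool.

```python
def explore(records):
    # Exploration: clean the raw data -- drop incomplete rows and
    # convert the revenue field from strings to numbers.
    cleaned = []
    for rec in records:
        if rec.get("month") is None or rec.get("revenue") is None:
            continue
        cleaned.append({"month": rec["month"], "revenue": float(rec["revenue"])})
    return cleaned

def identify_pattern(cleaned):
    # Pattern identification: estimate an average month-over-month growth factor.
    revenues = [r["revenue"] for r in sorted(cleaned, key=lambda r: r["month"])]
    growths = [b / a for a, b in zip(revenues, revenues[1:])]
    return sum(growths) / len(growths)

def deploy(cleaned, growth):
    # Deployment: use the identified pattern to forecast the next month.
    last = max(cleaned, key=lambda r: r["month"])["revenue"]
    return last * growth

raw = [
    {"month": 1, "revenue": "100"},
    {"month": 2, "revenue": "110"},
    {"month": 3, "revenue": None},   # incomplete row, removed in exploration
    {"month": 3, "revenue": "121"},
]
cleaned = explore(raw)
growth = identify_pattern(cleaned)
forecast = deploy(cleaned, growth)   # roughly 133.1 for this toy data
```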
Some of the most important data mining techniques are:
Statistics
Clustering
Visualization
Decision Tree
Association Rules
Neural Networks
Classification
1. Statistical Techniques
Statistics is a branch of mathematics concerned with the collection and description of data. Many analysts do not consider statistics a data mining technique in its own right, yet it helps to discover patterns and build predictive models, so a data analyst should possess some knowledge of the different statistical techniques. People today have to deal with large amounts of data and derive important patterns from it, and statistics can go a long way toward answering questions about that data.
Statistics not only answers such questions; it also summarises and counts the data, and provides information about the data with ease. Through statistical reports, people can make smart decisions. Among the different forms of statistics, the most important and useful techniques concern the collection and counting of data, such as
Histogram
Mean
Median
Mode
Variance
Max
Min
Linear Regression
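Most of the summary statistics listed above are available directly in Python's standard-library `statistics` module, and linear regression can be done by hand with the least-squares formulas. The customer ages and spending figures below are made up for illustration.

```python
import statistics

ages = [23, 27, 27, 31, 35, 40, 44]

print(statistics.mean(ages))       # about 32.43
print(statistics.median(ages))     # 31
print(statistics.mode(ages))       # 27
print(statistics.pvariance(ages))  # population variance
print(min(ages), max(ages))

# Simple linear regression (least squares) of spending against age;
# the spending figures are invented for the example.
spending = [210, 240, 250, 300, 340, 390, 430]
mean_x, mean_y = statistics.mean(ages), statistics.mean(spending)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, spending)) \
        / sum((x - mean_x) ** 2 for x in ages)
intercept = mean_y - slope * mean_x
# Predicted spending for a hypothetical 50-year-old customer:
prediction = intercept + slope * 50
```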
2. Clustering Technique
Clustering is one of the oldest techniques used in data mining. Cluster analysis is the process of identifying data items that are similar to each other, which helps in understanding the differences and similarities between the data. Sometimes called segmentation, it helps users understand what is going on within the database. For example, an insurance company can group its customers based on their income, age, nature of policy and type of claims. There are different types of clustering methods:
Partitioning Methods
Hierarchical Agglomerative methods
Density Based Methods
Grid Based Methods
Model Based Methods
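As a minimal sketch of a partitioning method, the following pure-Python two-means clustering (k-means with k = 2) groups customers by income, echoing the insurance example above. The incomes are invented, and the code assumes the data really does contain two separated groups.

```python
def two_means(values, iters=20):
    # Partitioning method: start the two centroids at the extremes,
    # then alternate between assigning points and moving the centroids.
    c_low, c_high = min(values), max(values)
    for _ in range(iters):
        low = [v for v in values if abs(v - c_low) <= abs(v - c_high)]
        high = [v for v in values if abs(v - c_low) > abs(v - c_high)]
        # Move each centroid to the mean of its cluster (assumes neither
        # cluster ever becomes empty, which holds for two-sided data).
        c_low = sum(low) / len(low)
        c_high = sum(high) / len(high)
    return (c_low, low), (c_high, high)

# Made-up annual incomes (in thousands) for the insurance example.
incomes = [28, 30, 33, 35, 90, 95, 100, 110]
(c_low, low_group), (c_high, high_group) = two_means(incomes)
# low_group collects the low-income customers, high_group the rest.
```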
The most popular clustering-related algorithm is Nearest Neighbour. The nearest neighbour technique is very similar to clustering: it is a prediction technique in which, to predict an estimated value for one record, you look for records with similar values in the historical database and use the prediction value from the record nearest to the unclassified one. The technique rests on the assumption that objects close to each other have similar prediction values, so the values of nearby objects can be predicted easily. Nearest Neighbour is among the easiest techniques to use because it works the way people naturally think. It also automates well, supports complex ROI calculations with ease, and its level of accuracy is as good as that of other data mining techniques. In business, the nearest neighbour technique is most often used in text retrieval, to find documents that share important characteristics with a main document that has been marked as interesting.
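The prediction idea described above can be sketched in a few lines: find the k closest historical records and average their known values. The (age, income) to claim-amount data is entirely made up for illustration.

```python
def knn_predict(history, query, k=3):
    # history: list of (feature_vector, known_value) pairs.
    def dist(a, b):
        # Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(history, key=lambda rec: dist(rec[0], query))[:k]
    # Average the values of the k closest historical records.
    return sum(value for _, value in nearest) / k

# Invented historical data: (age, income) -> insurance claim amount.
history = [
    ((25, 30), 400), ((27, 32), 420), ((26, 31), 410),
    ((55, 90), 900), ((60, 95), 950), ((58, 92), 930),
]
estimate = knn_predict(history, (26, 31))  # falls in the first group
```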
3. Visualization - Visualization is a very useful technique for discovering data patterns and is used at the beginning of the data mining process. Much current research aims at producing interesting projections of databases, an approach called Projection Pursuit. Many data mining techniques produce useful patterns from good data, but visualization is a technique that turns poor data into good data, allowing different kinds of data mining methods to be used to discover hidden patterns.
4. Induction Decision Tree Technique - A decision tree is a predictive model which, as the name implies, looks like a tree. Each branch of the tree is viewed as a classification question, and the leaves of the tree are partitions of the dataset related to that particular classification. The technique can be used for exploratory analysis, data pre-processing and prediction. A decision tree can be considered a segmentation of the original dataset, where the segmentation is done for a particular purpose and the records in each segment share similarities in the information being predicted. Decision trees provide results that are easily understood by the user, and the technique is often used by statisticians to find out which part of a database is most related to the business problem. The first and foremost step in this technique is growing the tree, which comes down to finding the best possible question to ask at each branch. The tree stops growing when a stopping condition is met, for example when a segment contains only one record or all of its records share the same prediction value.
CART, which stands for Classification and Regression Trees, is a data exploration and prediction algorithm that picks its questions in a more sophisticated way: it tries them all and then selects the best one, which is used to split the data into two or more segments. After deciding on the segments, it again asks questions of each new segment individually. Another popular decision tree technology is CHAID (Chi-Square Automatic Interaction Detector). It is similar to CART but differs in how it splits: CART searches for the single best question, whereas CHAID chooses its splits using a chi-square test.
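The tree-growing step, finding the best question to ask at a branch, can be sketched CART-style for a single numeric attribute: try every binary split and keep the one with the purest resulting segments, measured here by Gini impurity. The age/claim data is invented for the example.

```python
def gini(labels):
    # Gini impurity of a set of class labels: 1 - sum of squared class shares.
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    # Return the threshold for the question "is value <= t?" with the
    # lowest weighted impurity of the two resulting segments.
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Invented data: customer age vs. whether they filed a claim.
ages   = [22, 25, 28, 40, 45, 50]
claims = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_split(ages, claims)
# The best question is "age <= 28?", which separates the classes perfectly.
```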
5. Neural Network - The neural network is another important technique in use today, most often applied in the early stages of data mining technology. Artificial neural networks grew out of the artificial intelligence community. They are easy to use because they are automated to a large extent, so the user is not expected to have much knowledge about the work or the database. But to make a neural network work efficiently, you need to know:
How are the nodes connected?
How many processing units should be used?
When should the training process be stopped?
There are two main parts to this technique - the node and the link:
The node - which loosely corresponds to the neuron in the human brain
The link - which loosely corresponds to the connections between neurons in the human brain
A neural network is a collection of interconnected neurons, which may form a single layer or multiple layers. The arrangement of neurons and their interconnections is called the architecture of the network. There is a wide variety of neural network models, each with its own advantages and disadvantages, its own architecture and its own learning procedure. Neural networks are a very strong predictive modelling technique, but they are not easy to understand even for experts: they create very complex models that are impossible to interpret fully. To make the technique more approachable, companies are seeking new solutions, and two have already been suggested:
The first is to package the neural network into a complete solution that lets it be used for a single application.
The second is to bundle it with expert consulting services.
Neural networks have been used in various kinds of applications, for example to detect fraud taking place in a business.
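A minimal sketch of the node-and-link idea is a single artificial neuron trained with the classic perceptron rule. The fraud-detection features and labels below are entirely invented, and a real fraud model would be far larger; this only shows how the weights on the links are adjusted during training.

```python
def step(x):
    # Activation of the node: fire (1) if the weighted input is non-negative.
    return 1 if x >= 0 else 0

def train_perceptron(samples, epochs=20, lr=0.1):
    # One node with one weight per input link, plus a bias term.
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for features, target in samples:
            out = step(sum(w * f for w, f in zip(weights, features)) + bias)
            err = target - out
            # Perceptron rule: nudge each link's weight toward the target.
            weights = [w + lr * err * f for w, f in zip(weights, features)]
            bias += lr * err
    return weights, bias

# Features: (transaction amount in $1000s, transactions in the last hour).
# Label 1 = fraudulent. Entirely invented data for illustration.
samples = [
    ((0.1, 1), 0), ((0.3, 2), 0), ((0.2, 1), 0),
    ((5.0, 9), 1), ((6.0, 8), 1), ((7.0, 10), 1),
]
weights, bias = train_perceptron(samples)
flag = step(sum(w * f for w, f in zip(weights, (6.5, 9))) + bias)
```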
6. Association Rules
This technique helps to find associations between two or more items and reveals relations between different variables in databases. It discovers hidden patterns in data sets and identifies the variables that appear together most frequently. An association rule offers two major pieces of information: its support and its confidence.
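The two measures can be computed directly from a transaction list: support is the fraction of transactions containing an itemset, and confidence is, of the transactions containing the rule's left-hand side, the fraction that also contain its right-hand side. The shopping baskets below are made up.

```python
# Invented shopping transactions, each a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # Of the transactions containing lhs, the fraction also containing rhs.
    return support(lhs | rhs) / support(lhs)

s = support({"bread", "milk"})       # 3 of 5 baskets -> 0.6
c = confidence({"bread"}, {"milk"})  # 3 of the 4 bread baskets -> 0.75
```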
7. Classification
Classification is the most commonly used data mining technique; it employs a set of pre-classified samples to create a model which can then classify a large set of data. The technique helps in deriving important information about data and metadata (data about data). It is closely related to cluster analysis and typically uses a decision tree or neural network system. There are two main processes involved: learning, in which the model is built from the pre-classified training data, and classification, in which the model assigns classes to new data.
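The two processes can be sketched with a deliberately simple "model": the per-class average of the pre-classified samples (a nearest-centroid classifier). The risk labels and feature values are invented for illustration.

```python
def learn(samples):
    # Step 1 (learning): average the feature vectors of each class.
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        sums[label] = [a + f for a, f in zip(acc, features)]
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in vec]
            for label, vec in sums.items()}

def classify(model, features):
    # Step 2 (classification): pick the class whose centroid is closest.
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(model[label], features))
    return min(model, key=dist)

# Invented pre-classified samples: (feature vector, class label).
samples = [((1.0, 1.0), "low-risk"), ((1.2, 0.8), "low-risk"),
           ((8.0, 9.0), "high-risk"), ((9.0, 8.5), "high-risk")]
model = learn(samples)
label = classify(model, (8.5, 9.2))  # near the high-risk centroid
```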
Market basket analysis (MBA) only uses transactions with more than one item, as no associations can be made from single purchases. Item association does not necessarily suggest cause and effect, but simply a measure of co-occurrence: the fact that energy drinks and video games are frequently bought together does not mean that one causes the purchase of the other, but it can be inferred that such a purchase is most probably made by (or for) a gamer. Such hypotheses must be tested and should not be taken as truth unless the item sales data supports them. There are two main types of MBA:
1. Predictive MBA is used to classify cliques of item purchases, events and services that
largely occur in sequence.
2. Differential MBA removes a high volume of insignificant results and can lead to very in-depth results. It compares information between different stores, demographics, seasons of the year, days of the week and other factors.
MBA is commonly used by online retailers to make purchase suggestions to consumers.
For example, when a person buys a particular model of smartphone, the retailer may
suggest other products such as phone cases, screen protectors, memory cards or other
accessories for that particular phone. This is due to the frequency with which other
consumers bought these items in the same transaction as the phone. MBA is also used in
physical retail locations. Due to the increasing sophistication of point of sale systems
coupled with big data analytics, stores are using purchase data and MBA to help improve
store layouts so that consumers can more easily find items that are frequently purchased
together.
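The purchase-suggestion idea can be sketched by counting how often other items share a transaction with a chosen item and suggesting the most frequent companions. The transactions below are invented.

```python
from collections import Counter

# Invented point-of-sale transactions, each a set of purchased items.
transactions = [
    {"phone", "case", "screen protector"},
    {"phone", "case"},
    {"phone", "memory card"},
    {"case", "screen protector"},
    {"phone", "case", "memory card"},
]

def suggest(item, top=2):
    # Count co-purchases of every other item with the chosen one,
    # then return the most frequent companions.
    companions = Counter()
    for t in transactions:
        if item in t:
            companions.update(t - {item})
    return [name for name, _ in companions.most_common(top)]

print(suggest("phone"))  # the case co-occurs with the phone most often
```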
Financial Data Analysis
Data mining is widely applied in the banking and finance industry, for example in −
Design and construction of data warehouses for multidimensional data analysis and data mining.
Loan payment prediction and customer credit policy analysis.
Classification and clustering of customers for targeted marketing.
Detection of money laundering and other financial crimes.
Retail Industry
Data mining has great application in the retail industry because it collects large amounts of data on sales, customer purchasing history, goods transportation, consumption and services. The quantity of data collected will naturally continue to expand rapidly because of the increasing ease, availability and popularity of the web. Data mining in the retail industry helps identify customer buying patterns and trends, which leads to improved quality of customer service and better customer retention and satisfaction. Here is a list of examples of data mining in the retail industry −
Design and Construction of data warehouses based on the benefits of data mining.
Multidimensional analysis of sales, customers, products, time and region.
Analysis of effectiveness of sales campaigns.
Customer Retention.
Product recommendation and cross-referencing of items.
Telecommunication Industry
Today the telecommunication industry is one of the fastest-growing industries, providing various services such as fax, pager, cellular phone, internet messenger, images, e-mail and web data transmission. Owing to the development of new computer and communication technologies, the industry is expanding rapidly, which is why data mining has become very important for understanding the business. Data mining in the telecommunication industry helps identify telecommunication patterns, catch fraudulent activities, make better use of resources and improve quality of service. Here is a list of examples for which data mining improves telecommunication services −
Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality or availability of network resources. In today's world of connectivity, security has become a major issue, and the increased usage of the internet, together with the availability of tools and tricks for intruding on and attacking networks, has made intrusion detection a critical component of network administration. Here is a list of areas in which data mining technology may be applied for intrusion detection −
Development of data mining algorithm for intrusion detection.
Association and correlation analysis, aggregation to help select and build discriminating
attributes.
Analysis of Stream data.
Distributed data mining.
Visualization and query tools.
Choosing a Data Mining System
Data Types− The data mining system may handle formatted text, record-based data and relational data. The data could also be ASCII text, relational database data or data warehouse data. Therefore, we should check what exact formats the data mining system can handle.
System Issues− We must consider the compatibility of a data mining system with different
operating systems. One data mining system may run on only one operating system or on
several. There are also data mining systems that provide web-based user interfaces and
allow XML data as input.
Data Sources− Data sources refer to the data formats on which the data mining system will operate. Some systems may work only on ASCII text files while others work on multiple relational sources. The data mining system should also support ODBC or OLE DB connections.
Data Mining functions and methodologies− Some data mining systems provide only one data mining function, such as classification, while others provide multiple functions such as concept description, discovery-driven OLAP analysis, association mining, linkage analysis, statistical analysis, classification, prediction, clustering, outlier analysis and similarity search.
Coupling data mining with databases or data warehouse systems− Data mining systems
need to be coupled with a database or a data warehouse system. The coupled components
are integrated into a uniform information processing environment. Here are the types of
coupling listed below -
o No coupling
o Loose Coupling
o Semi tight Coupling
o Tight Coupling
Data Mining query language and graphical user interface− An easy-to-use graphical user interface is important to promote user-guided, interactive data mining. Unlike relational database systems, which share the standard query language SQL, data mining systems do not share a common underlying query language.
Trends in Data Mining
Application Exploration.
Scalable and interactive data mining methods.
Integration of data mining with database systems, data warehouse systems and web
database systems.
Standardization of data mining query language.
Visual data mining.
New methods for mining complex types of data.
Biological data mining.
Data mining and software engineering.
Web mining.
Distributed data mining.
Real time data mining.
Multi database data mining.
Privacy protection and information security in data mining.
UNIT-4
Types of knowledge
Knowledge management is an activity practiced by enterprises all over the world. In the process of knowledge management, these enterprises comprehensively gather information using many methods and tools. The gathered information is then organized, stored, shared and analysed using defined techniques. The analysis of such information is based on resources, documents, people and their skills. Properly analysed information is then stored as the ‘knowledge’ of the enterprise, and this knowledge is later used for activities such as organizational decision making and training new staff members.
There have been many approaches to knowledge management since the early days, most of them manual storing and analysis of information. With the introduction of computers, most organizational knowledge and management processes have been automated, so information storing, retrieval and sharing have become convenient. Nowadays, most enterprises have their own knowledge management framework in place.
The framework defines the knowledge gathering points, gathering techniques, tools
used, data storing tools and techniques and analyzing mechanism.
1. A Priori -- A priori and a posteriori are two of the original terms in epistemology (the study of
knowledge). A priori literally means “from before” or “from earlier.” This is because a
priori knowledge depends upon what a person can derive from the world without needing to
experience it. This is better known as reasoning. Of course, a degree of experience is necessary
upon which a priori knowledge can take shape. Let’s look at an example. If you were in a
closed room with no windows and someone asked you what the weather was like, you would
not be able to answer them with any degree of truth. If you did, then you certainly would not be
in possession of a priori knowledge. It would simply be impossible to use reasoning to produce
a knowledgeable answer. On the other hand, if there were a chalkboard in the room and
someone wrote the equation 4 + 6 = ? on the board, then you could find the answer without
physically finding four objects and adding six more objects to them and then counting them.
You would know the answer is 10 without needing a real world experience to understand it. In
fact, mathematical equations are one of the most popular examples of a priori knowledge.
2. A Posteriori -- Naturally, then, a posteriori literally means “from what comes later” or “from
what comes after.” This is a reference to experience and using a different kind of reasoning
(inductive) to gain knowledge. This kind of knowledge is gained by first having an experience
(and the important idea in philosophy is that it is acquired through the five senses) and then
using logic and reflection to derive understanding from it. In philosophy, this term is sometimes
used interchangeably with empirical knowledge, which is knowledge based on observation. It is believed that a priori knowledge is more reliable than a posteriori knowledge. This might seem counter-intuitive, since in the former case someone can just sit inside a room and derive knowledge through reasoning alone, while in the latter case someone is having real experiences in the world. But the problem lies in this very fact: everyone's experiences are subjective and open to interpretation. A mathematical equation, on the other hand, is law.
3. Explicit Knowledge --Now we are entering the realm of explicit and tacit knowledge. As you
have noticed by now, types of knowledge tend to come in pairs and are often antitheses of each
other. Explicit knowledge is similar to a priori knowledge in that it is more formal or perhaps
more reliable. Explicit knowledge is knowledge that is recorded and communicated through
mediums. It is our libraries and databases. The specifics of what is contained are less important than how it is contained. Anything from the sciences to the arts can have elements that can be expressed in explicit knowledge. The defining feature of explicit knowledge is that it can be easily and quickly transmitted from one individual to another, or to another ten thousand or ten billion. It also tends to be organized systematically. For example, a history textbook on the founding of America would take a chronological approach, as this allows knowledge to build upon itself through a progressive system; in this case, time.
4. Tacit Knowledge
It should be noted that tacit knowledge is a relatively new theory, introduced only as recently as the 1950s. Whereas explicit knowledge is very easy to communicate and transfer from one individual to another, tacit knowledge is precisely the opposite: it is extremely difficult, if not impossible, to communicate through any medium. For example, a textbook on the founding of America can teach facts (or things we believe to be facts), but an expert musician cannot truly communicate their knowledge; in other words, they cannot simply tell someone how to play the instrument so that the person immediately possesses that knowledge. Such knowledge must be acquired to a degree that goes far beyond theory. In this sense, tacit knowledge most closely resembles a posteriori knowledge, as it can only be achieved through experience.
The biggest difficulty with tacit knowledge is knowing when it is useful and figuring out how to make it usable. Tacit knowledge can only be communicated through consistent and extensive relationships or contact (such as taking lessons from a professional musician). But even in these cases there will not be a true transfer of knowledge; usually two forms of knowledge are born, as each person must fill in certain blanks (such as skills, short-cuts, rhythms, etc.).
Institutionalized knowledge
To date, four models have been selected based on their ability to meet the growing demands: the Zack (from Meyer and Zack, 1996), the Bukowitz and Williams (2000), the McElroy (2003), and the Wiig (1993) KM cycles.
1. Knowledge Creation – The actual process of conducting research and producing new knowledge.
Five modes of knowledge generation
• Acquisition
• Dedicated resources
• Fusion
• Adaptation
• Knowledge networking
2. Knowledge codification
The aim of knowledge codification is to put organizational knowledge into a form that makes it
accessible to those who need it.
• Documented knowledge
• Mapped knowledge
• Modeled knowledge
3. Knowledge transfer –
Knowledge transfer is the process of passing available knowledge to specified
audiences
Functionalities
1. Channel identification and choice,
2. Scheduling, and
3. Sending
Aspects of knowledge transfer
Hard aspects – focus on improved access to knowledge (information), electronic
communication, document repositories, and so forth;
Soft aspects – focus on human face-to-face communication (meetings, talk rooms etc.).
The Nonaka and Takeuchi model of KM is based on a universal model of knowledge creation and the management of coincidence. There are four modes of knowledge conversion in the Nonaka and Takeuchi model:
Socialization (tacit to tacit) i.e. Indirect way,
Externalization (tacit to explicit) i.e. Indirect to Direct way,
Combination (explicit to explicit) i.e. Direct way, and
Internalization (explicit to tacit) i.e. Direct to indirect way.
1. Socialization is the technique of sharing tacit knowledge through observation, imitation, practice, and participation in formal and informal communities and groups. This process is predicated on the creation of a physical or virtual space where a given community can interact on a social level.
2. Externalization is the technique of expressing tacit knowledge into explicit concepts. As
tacit knowledge is highly internalized, this process is the key to knowledge sharing and
creation.
3. Combination is the technique of integrating concepts into a knowledge system. Some
examples or cases would be a synthesis in the form of a review report, a trend analysis, a brief
executive summary, or a new database to organize content.
4. Internalization is the technique of embodying explicit knowledge into tacit knowledge.
The key technologies are web-based communication and collaboration technologies for internet and intranet usage, as well as mobile technologies such as PDAs, PCs, telephone and videoconferencing. New technologies are rapidly emerging that act as intelligent agents and assistants to search, summarise, conceptualise and recognise patterns of information and knowledge.
For an effective KM initiative across the organisation, there needs to be in place, at least:
▪ Knowledge Portal
There is often confusion between the terms ‘information portal’ and ‘knowledge portal’.
An information portal is often described as a gateway to information to enable the user to have
one, more simplified way of navigating towards the desired information.
However, a ‘knowledge portal’ is far more than an information portal: as well as information navigation and access, it contains software technologies to support the processes of virtual team communication and collaboration, and to support the 9 step process of managing knowledge. Furthermore, it contains intelligent agent software to identify and automatically distribute information and knowledge effectively to knowledge workers based on knowledge profiling.
▪ Knowledge Profiles
Within the knowledge portal, each knowledge worker can update and maintain a personal
‘knowledge profile’ which identifies his/her specific knowledge needs, areas of interest and
frequency of distribution.
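Profile-based distribution of the kind described above can be sketched as a simple matching of document topic tags against each worker's declared interests; real portals use far richer profiles and intelligent agents, and all names here are hypothetical.

```python
# Each knowledge worker's profile lists areas of interest (invented data).
profiles = {
    "alice": {"data mining", "statistics"},
    "bob": {"neural networks"},
}

def distribute(doc_tags):
    # Route a new document to every worker whose profile shares
    # at least one topic tag with it.
    return sorted(name for name, interests in profiles.items()
                  if interests & doc_tags)

recipients = distribute({"statistics", "forecasting"})  # matches alice only
```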
▪ Collaborative workspaces
Within the knowledge portal, shared work spaces can be set up for each new team or project.
These will become knowledge repositories from which new knowledge will be distilled
regularly and systematically and shared across other teams in the organisation. Within the
shared and collaborative workspace, at least, the following communication and collaboration
functions could be performed:
▪ Shared vision and mission
▪ Specific team objectives
▪ Knowledge Plan
▪ Team members' roles and responsibilities
▪ Team contract
▪ Best Knowledge Bases or Banks
▪ Expert locator
▪ Task management
▪ Shared Calendar management
▪ Meeting management
▪ Document libraries
▪ Discussion forums
▪ Centralised email
▪ Capturing of new learnings and ideas
▪ Peer reviews, learning reviews, after action reviews
▪ New knowledge nominations
▪ Urgent requests
Within the knowledge portal, it is very useful to have a facility and underlying process to enter any ‘Urgent Request’ into the portal and receive back responses from across the organisation. Rather than needing to know ‘who might know’, the request is entered blindly, and responses will be made if the answer is known in the organisation and people are willing to support and respond. This is a very effective way of better leveraging the knowledge across the organisation.
▪ Document Libraries
The document library is typically the location where all documents are stored. The library should be context-relative and allow easy control over any document type. Many organisations now employ an Electronic Document and Records Management System (EDRMS) for this requirement, but integration of the EDRMS with all other relevant information and knowledge sources is imperative.
In order to foster knowledge networking across the entire organisation and support knowledge processes for creating, retaining, leveraging, reusing, measuring and optimising the use of the organisational knowledge assets, a centralised knowledge server is required that will provide:
▪ a knowledge portal interface designed around a knowledge asset schema (see KM consulting section) as a gateway to user access, security and applications
▪ Knowledge banks
▪ Advanced search capabilities
▪ Collaboration services
▪ Search and discovery services
▪ Publishing services based on user knowledge needs and knowledge profiling
▪ A knowledge map (taxonomy)
▪ A knowledge repository for information and process management
▪ Text summarising and conceptualising
▪ Intelligent agentware
▪ An Intranet infrastructure for integrated email, file servers, internet/intranet services
For each key knowledge area identified, there needs to be a Knowledge Base.