
Web intelligence is the area of scientific research and development that explores the role of artificial intelligence and information technology in creating new products, services, and frameworks that are empowered by the World Wide Web.

An intelligent web application consists of services and frameworks that are empowered by the World Wide Web and built using new techniques from artificial intelligence and information technology.

Mashups are an exciting genre of interactive web applications that draw upon content
retrieved from external data sources to create entirely new and innovative services.

Structural classification algorithms


As shown in figure 5.2, the branch of rule-based structural algorithms consists of production rules (if-then clauses) and decision-tree (DT)-based algorithms. The production rules can be collected manually by human experts or deduced from decision trees. Rule-based algorithms are typically implemented as forward-chaining production systems.
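For example, a (purely hypothetical) production rule in if-then form might read: IF the subject line of an email contains a known spam phrase AND the sender is not in the recipient's address book, THEN classify the email as spam.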

Decision-tree algorithms have several advantages, such as ease of use and computational efficiency. Their main disadvantage is that, in general, decision-tree algorithms don't have good generalization properties, and as a result they don't perform well with unseen data.
A commonly used algorithm in this category is C5.0 (on Unix machines) or See5 (on Microsoft Windows machines). It can be found in a number of commercial products, such as Clementine (http://www.spss.com/clementine/) and RuleQuest (http://www.rulequest.com/).
k-Means: Step-By-Step Example
The k-means algorithm randomly picks (see method pickInitialMeanValues) k points that represent the initial centroids of the candidate clusters. Subsequently, the distances between these centroids and each point of the set are calculated, and each point is assigned to the cluster whose centroid is closest to it. As a result of these assignments, the locations of the centroids for each cluster change, so we recompute the centroids and repeat the assignment step until their locations stop changing.
As a simple illustration of a k-means algorithm, consider the following data set consisting of
the scores of two variables on each of seven individuals:
Subject    A      B
   1      1.0    1.0
   2      1.5    2.0
   3      3.0    4.0
   4      5.0    7.0
   5      3.5    5.0
   6      4.5    5.0
   7      3.5    4.5

This data set is to be grouped into two clusters. As a first step in finding a sensible initial partition, let the A and B values of the two individuals farthest apart (using the Euclidean distance measure) define the initial cluster means, giving:
          Individual   Mean vector (centroid)
Group 1   1            (1.0, 1.0)
Group 2   4            (5.0, 7.0)

The remaining individuals are now examined in sequence and allocated to the cluster to
which they are closest, in terms of Euclidean distance to the cluster mean. The mean vector is
recalculated each time a new member is added. This leads to the following series of steps:
        Cluster 1                                Cluster 2
Step    Individuals   Mean vector (centroid)     Individuals   Mean vector (centroid)
1       1             (1.0, 1.0)                 4             (5.0, 7.0)
2       1, 2          (1.2, 1.5)                 4             (5.0, 7.0)
3       1, 2, 3       (1.8, 2.3)                 4             (5.0, 7.0)
4       1, 2, 3       (1.8, 2.3)                 4, 5          (4.2, 6.0)
5       1, 2, 3       (1.8, 2.3)                 4, 5, 6       (4.3, 5.7)
6       1, 2, 3       (1.8, 2.3)                 4, 5, 6, 7    (4.1, 5.4)

Now the initial partition has changed, and the two clusters at this stage have the following characteristics:
            Individuals   Mean vector (centroid)
Cluster 1   1, 2, 3       (1.8, 2.3)
Cluster 2   4, 5, 6, 7    (4.1, 5.4)

But we cannot yet be sure that each individual has been assigned to the right cluster. So, we compare each individual's distance to its own cluster mean and to that of the opposite cluster. And we find:
Individual   Distance to centroid of Cluster 1   Distance to centroid of Cluster 2
1            1.5                                 5.4
2            0.4                                 4.3
3            2.1                                 1.8
4            5.7                                 1.8
5            3.2                                 0.7
6            3.8                                 0.6
7            2.8                                 1.1
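For example, individual 3 is at (3.0, 4.0); its distance to the Cluster 1 centroid (1.8, 2.3) is sqrt((3.0 - 1.8)^2 + (4.0 - 2.3)^2) ≈ 2.1, while its distance to the Cluster 2 centroid (4.1, 5.4) is sqrt((3.0 - 4.1)^2 + (4.0 - 5.4)^2) ≈ 1.8.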

Only individual 3 is nearer to the mean of the opposite cluster (Cluster 2) than to that of its own (Cluster 1). In other words, each individual's distance to its own cluster mean should be smaller than its distance to the other cluster's mean (which is not the case for individual 3). Thus, individual 3 is relocated to Cluster 2, resulting in the new partition:
            Individuals     Mean vector (centroid)
Cluster 1   1, 2            (1.3, 1.5)
Cluster 2   3, 4, 5, 6, 7   (3.9, 5.1)

The iterative relocation would now continue from this new partition until no more relocations
occur. However, in this example each individual is now nearer its own cluster mean than that
of the other cluster and the iteration stops, choosing the latest partitioning as the final cluster
solution.
Also, it is possible that the k-means algorithm won't converge to a final solution. In that case, it is a good idea to stop the algorithm after a pre-chosen maximum number of iterations.
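As a minimal sketch of the procedure just described, the following plain-Java implementation clusters 2-D points. The method name pickInitialMeanValues is taken from the description above; the class name, the array-based representation, and the maxIterations safeguard are illustrative assumptions rather than code from the source.

import java.util.Arrays;
import java.util.Random;

/** Minimal k-means sketch for 2-D points (illustrative names, not a library API). */
public class SimpleKMeans {

    // Randomly pick k of the points as the initial centroids.
    static double[][] pickInitialMeanValues(double[][] points, int k, Random rnd) {
        double[][] centroids = new double[k][];
        for (int i = 0; i < k; i++) {
            centroids[i] = points[rnd.nextInt(points.length)].clone();
        }
        return centroids;
    }

    static double euclidean(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return Math.sqrt(dx * dx + dy * dy);
    }

    /** Returns the cluster index assigned to each point. */
    static int[] cluster(double[][] points, int k, int maxIterations) {
        Random rnd = new Random();
        double[][] centroids = pickInitialMeanValues(points, k, rnd);
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;

            // Assignment step: attach each point to the nearest centroid.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (euclidean(points[p], centroids[c]) < euclidean(points[p], centroids[best])) {
                        best = c;
                    }
                }
                if (assignment[p] != best) {
                    assignment[p] = best;
                    changed = true;
                }
            }

            // Update step: recompute each centroid as the mean of its members.
            double[][] sums = new double[k][2];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                sums[assignment[p]][0] += points[p][0];
                sums[assignment[p]][1] += points[p][1];
                counts[assignment[p]]++;
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    centroids[c][0] = sums[c][0] / counts[c];
                    centroids[c][1] = sums[c][1] / counts[c];
                }
            }

            // Stop when no point changed cluster in this iteration.
            if (!changed) break;
        }
        return assignment;
    }

    public static void main(String[] args) {
        // The seven two-variable subjects from the worked example above.
        double[][] data = {
            {1.0, 1.0}, {1.5, 2.0}, {3.0, 4.0}, {5.0, 7.0},
            {3.5, 5.0}, {4.5, 5.0}, {3.5, 4.5}
        };
        System.out.println(Arrays.toString(cluster(data, 2, 100)));
    }
}

Running this on the seven subjects from the worked example with k = 2 typically reproduces the two clusters found above, although the random choice of initial centroids means individual runs can differ.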
Advantages:
It works well with many metrics.
It's easy to derive versions of the algorithm that are executed in parallel, when the data are divided into, say, N sets and each separate data set is clustered, in parallel, on N different computational units.
It's insensitive with respect to data ordering.

The k-means algorithm is fast, especially compared to other clustering algorithms. Its computational complexity is typically O(N), where N is the number of data points that we want to cluster.

ROCK (RObust Clustering using linKs) is a clustering algorithm that belongs to the class of agglomerative hierarchical clustering algorithms.

The steps involved in clustering using ROCK are described in Figure 2. After drawing a
random sample from the database, a hierarchical clustering algorithm that employs links is
applied to the sampled points. Finally, the clusters involving only the sampled points are used
to assign the remaining data points on disk to the appropriate clusters.
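In ROCK, the closeness of two points is measured by the number of "links" between them, that is, the number of neighbors they have in common, where two points count as neighbors if their similarity (for example, Jaccard similarity on categorical data) is at least a threshold theta. The following is a minimal sketch of that link computation; the class and method names are illustrative assumptions, not part of any library.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative sketch of ROCK's link computation over set-valued (categorical) data. */
public class RockLinks {

    // Jaccard similarity between two points represented as sets of categorical attributes.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    // links[i][j] = number of points that are neighbors of both i and j,
    // where "neighbor" means Jaccard similarity >= theta.
    static int[][] computeLinks(List<Set<String>> points, double theta) {
        int n = points.size();
        boolean[][] neighbor = new boolean[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                neighbor[i][j] = (i != j) && jaccard(points.get(i), points.get(j)) >= theta;
            }
        }
        int[][] links = new int[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                int common = 0;
                for (int m = 0; m < n; m++) {
                    if (neighbor[i][m] && neighbor[j][m]) common++;
                }
                links[i][j] = links[j][i] = common;
            }
        }
        return links;
    }
}

The agglomerative phase of ROCK then repeatedly merges the pair of clusters with the best link-based goodness measure, which is what the hierarchical step applied to the sampled points refers to.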

Machine learning refers to the capability of a software system to generalize based on past experience, and to use these generalizations to provide answers to questions that relate both to data that it has encountered in the past and to new data that the system has never encountered before. Some learning algorithms are transparent to humans: a human can follow the reasoning behind the generalization. Examples of transparent learning algorithms are decision trees and, more generally, any rule-based learning method. Other algorithms, though, aren't transparent to humans; neural networks and support vector machines (SVMs) fall in this category.

4 Clustering

Definition: The term clustering refers to the process of grouping similar things together.

4.1 Need for clustering:

A typical need for clustering is the identification of user groups in a web application. You could use it to perform targeted advertising, enhance the user experience by displaying posts from like-minded individuals to each user, facilitate the creation of social networks on your site, and so on.

4.2 An overview of clustering algorithms

Clustering algorithms that deal with large databases for online applications should have the following properties:
If possible, you should scan the database only once.
You should allow for online behavior: a good answer is available at any time.
The algorithm should be able to suspend, stop, and resume its activity.
You should support incremental updates to account for new data.
You should respect RAM limitations, if any.
You should utilize various scan modes, such as sequential, index-based, and sampling, if they're available.
You should prefer algorithms that can work with a forward-only cursor over a view of the database, because these views are typically the result of computationally expensive joins.

Classifier

Given a population whose members can potentially be separated into a number of different sets or classes, a classification rule is a procedure by which each element of the population is assigned to one of the classes.

Classification can be thought of as two separate problems: binary classification and multiclass classification. In binary classification, a better understood task, only two classes are involved, whereas multiclass classification involves assigning an object to one of several classes. [2] Since many classification methods have been developed specifically for binary classification, multiclass classification often requires the combined use of multiple binary classifiers.
For example, in the automatic categorization of emails and spam filtering, a binary classification algorithm is applied first, and then multiclass classification is applied.
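As a sketch of how several binary classifiers can be combined for a multiclass task, the following one-vs-rest scheme asks each binary classifier for a confidence score and picks the class with the highest one; the interface and class names are illustrative assumptions, not taken from the source.

import java.util.List;
import java.util.Map;

/** One-vs-rest combination of binary classifiers (illustrative, not a library API). */
public class OneVsRestClassifier {

    /** A binary classifier that returns a confidence score for "belongs to my class". */
    public interface BinaryClassifier {
        double score(Map<String, Double> features);
        String targetClass();
    }

    private final List<BinaryClassifier> binaryClassifiers;

    public OneVsRestClassifier(List<BinaryClassifier> binaryClassifiers) {
        this.binaryClassifiers = binaryClassifiers;
    }

    /** Ask every binary classifier and return the class with the highest score. */
    public String classify(Map<String, Double> features) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (BinaryClassifier c : binaryClassifiers) {
            double s = c.score(features);
            if (s > bestScore) {
                bestScore = s;
                best = c.targetClass();
            }
        }
        return best;
    }
}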

Rule-based classification

Drools is a robust rule engine with ample documentation and a fairly liberal open source license.

The basic elements of the Drools rule engine system can be illustrated with the example of the automatic categorization of emails and spam filtering.

Building a rule engine based on the Drools library:

The creation of a Drools rule engine has two parts: authoring and runtime. The authoring part begins with the parsing of the Drools file (the file with the .drl extension). The parser checks the grammatical consistency of the Drools file and creates an intermediate abstract syntax tree (AST).

The following are the steps to create and use a Drools rule engine for automatic categorization of emails and spam filtering.

Determine runtime compiler

After creating a reference to the file that contains the rules, we create a Properties instance and give a value to the property drools.dialect.java.compiler. This property determines the runtime compiler that Drools uses in order to compile Java code.
Contains our rules

With the PackageBuilder at our disposal, we can build packages that contain the rules. We pass the reference of the file to the addPackageFromDrl method and immediately call the getPackage method of our builder.

Runtime container for rules


This step involves building the runtime part of the engine.

Stateful WorkingMemory

The StatefulSession class extends the WorkingMemory class; it adds asynchronous methods for inserting, updating, and firing rules, as well as a dispose() method.

Insert fact in working memory


We use the insert method to add facts into the WorkingMemory instance. After inserting a fact, the Drools engine matches it against all the rules.
Execute all rules

After creating a reference to the file that contains the rules, we create a Properties instance and give a value to the property drools.dialect.java.compiler. What's this property? And what does the value JANINO mean? You can incorporate Java code straight into the Drools rule files. This property determines the runtime compiler that we want Drools to use in order to compile Java code. Janino is the name of an embedded Java compiler that's included in the Drools distribution under the BSD license (http://www.janino.net/).
To complete the authoring part, we need to create an instance of the PackageBuilder class, which in turn will create instances of the class Package. We use the auxiliary PackageBuilderConfiguration class for the configuration of our package builder. This class has default values, which you can change through the appropriate set methods or, as we do here, on first use via property settings. In this case, we pass only a single property, but we could've provided much more information. At the heart of the settings is the ChainedProperties class, which searches a number of locations looking for drools.packagebuilder.conf files. In order of precedence, those locations are system properties, a user-defined file in system properties, the user's home directory, the working directory, and various META-INF locations. The PackageBuilderConfiguration handles the registry of AccumulateFunctions, the registry of Dialects, and the main ClassLoader.

With the PackageBuilder at our disposal, we can build packages that contain the rules. We pass the reference of the file to the addPackageFromDrl method and immediately call the getPackage method of our builder. Our rules are ready to use now!
This is our first step in building the runtime part of the engine. A RuleBase can have one or more Packages. A RuleBase can instantiate one or more WorkingMemory instances at any time; a weak reference is maintained unless configured otherwise. The WorkingMemory class consists of a number of subcomponents; for details, consult the Drools online documentation.
The class StatefulSession extends the WorkingMemory class. It adds asynchronous methods for inserting, updating, and firing rules, as well as a dispose() method. The RuleBase retains a reference to each StatefulSession instance that it creates, in order to update them when new rules are added. The dispose() method is needed to release the StatefulSession reference from the RuleBase in order to avoid memory leaks.
We use the insert method to add facts into the WorkingMemory instance. When we insert a fact, the Drools engine will match it against all the rules. This means that all the work is done during insertion, but no rules are executed until you call fireAllRules(), which we do in the next step.
This invokes the rule execution. You shouldn't call fireAllRules() before you've finished inserting all your facts. The crucial matching phase happens during the insertion of the facts, as mentioned previously. So, you don't want to execute the rules without matching all the facts against the rules first.
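Putting the authoring and runtime parts together, a minimal sketch of the whole sequence might look like the following. It assumes the legacy org.drools API (PackageBuilder, RuleBase, StatefulSession) that this discussion is based on; exact package names differ across Drools versions, and the file name spamRules.drl and the Email fact class are hypothetical placeholders.

import java.io.FileReader;
import java.io.Reader;
import java.util.Properties;

import org.drools.RuleBase;
import org.drools.RuleBaseFactory;
import org.drools.StatefulSession;
import org.drools.compiler.PackageBuilder;
import org.drools.compiler.PackageBuilderConfiguration;
import org.drools.rule.Package;

public class SpamRulesRunner {

    public static void main(String[] args) throws Exception {
        // --- Authoring part ---

        // Tell Drools which embedded compiler to use for Java code inside the .drl file.
        Properties properties = new Properties();
        properties.setProperty("drools.dialect.java.compiler", "JANINO");
        PackageBuilderConfiguration configuration =
                new PackageBuilderConfiguration(properties);

        // Parse the rule file and build a Package that contains our rules.
        PackageBuilder builder = new PackageBuilder(configuration);
        Reader drlReader = new FileReader("spamRules.drl");   // hypothetical rule file
        builder.addPackageFromDrl(drlReader);
        Package rulesPackage = builder.getPackage();

        // --- Runtime part ---

        // The RuleBase is the runtime container for one or more packages of rules.
        RuleBase ruleBase = RuleBaseFactory.newRuleBase();
        ruleBase.addPackage(rulesPackage);

        // A StatefulSession is a WorkingMemory that keeps its facts between calls.
        StatefulSession session = ruleBase.newStatefulSession();
        try {
            // Insert a fact; matching against the rules happens at insertion time.
            Email email = new Email("Cheap offers!!!", "unknown@sender.example");
            session.insert(email);

            // Execute the rules whose conditions matched the inserted facts.
            session.fireAllRules();
        } finally {
            // Release the session reference held by the RuleBase to avoid memory leaks.
            session.dispose();
        }
    }

    /** Hypothetical fact class; in this example the facts are email objects. */
    public static class Email {
        private final String subject;
        private final String from;
        private boolean spam;

        public Email(String subject, String from) {
            this.subject = subject;
            this.from = from;
        }
        public String getSubject() { return subject; }
        public String getFrom()    { return from; }
        public boolean isSpam()    { return spam; }
        public void setSpam(boolean spam) { this.spam = spam; }
    }
}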

A Drools-based rule engine that detects spam email


Steps:

1. First we need to create a RuleEngine instance, so that we can delegate the application of the rules. We pass the name of the file that contains the rules and let the RuleEngine do the heavy lifting.

2. These are two auxiliary variables used by the classify method. Since they're constant in our case, no matter what the rules or the emails are, we treat them as instance variables. It's possible that your implementation of the ClassificationResult class is responsible for identifying the right concept from a more elaborate data structure of concepts (for example, an ontology).

3. This class encapsulates two important things. It includes the tests of the rule conditions (through the isSimilar method) as well as the actions of the rules (through the setSpamEmail method). We could have created different objects to encapsulate the conditions and the actions. If your conditions or actions involve algorithmically difficult implementations, it's better to separate the implementations and create a clean separation of these two parts.

4. This is where we delegate the application of the rules to the RuleEngine. We reviewed
this method in listing 5.9.

5. We again use the ClassificationResult instance to obtain the information that was created as a result of the (fired) rules' actions. That information could have been recorded in a persistent medium (such as a database record or a file); in our simple case, we use the ClassificationResult class as the carrier of all related activity.

6. This method helps us classify all the emails in our dataset at once. Note that we could have passed the dataset itself to the classify method and overridden the executeRule method in the RuleEngine class, so that we load all the emails into working memory at once. But note that, in the context of a rule-based system, the classification of an email as spam doesn't depend on whether other emails are spam.
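A compact sketch of how steps 1 through 6 might fit together is shown below; RuleEngine, ClassificationResult, and the email objects mirror the names used in the description above, but the interfaces and method signatures shown here are assumptions for illustration only.

/** Illustrative sketch of the spam-detection flow described in steps 1-6 above. */
public class EmailClassifier {

    /** Placeholder for the rule-engine wrapper described in step 1. */
    interface RuleEngine {
        void executeRules(ClassificationResult result, Email email);
    }

    /** Placeholder for the carrier of the fired rules' results (step 5). */
    interface ClassificationResult {
        boolean isSpamEmail();
    }

    /** Placeholder email fact with only the fields needed here. */
    interface Email {
        String getSubject();
    }

    private final RuleEngine ruleEngine;
    private final ClassificationResult result;

    EmailClassifier(RuleEngine ruleEngine, ClassificationResult result) {
        this.ruleEngine = ruleEngine;   // step 1: delegate rule application
        this.result = result;           // step 2: auxiliary, constant across emails
    }

    /** Steps 4-5: apply the rules to one email and read back the recorded outcome. */
    boolean isSpam(Email email) {
        ruleEngine.executeRules(result, email);
        return result.isSpamEmail();
    }

    /** Step 6: classify all the emails in the dataset, one at a time. */
    void classifyAll(Iterable<Email> emails) {
        for (Email email : emails) {
            System.out.println(email.getSubject() + " -> spam: " + isSpam(email));
        }
    }
}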

NOTE: Only the algorithms/pseudocode are important, not the source code.
