DATA MINING
Group 3
Created by:
Adnan Khawarizwi
Amin Saputra
Candra Bayu Permana
Dimas Febry
M. Rizky Thariq
Maruf Abdullah
Noval Yazid
Rival Rinofiansyah
Ryan Hidayatullah
Taufrino Cahyadi
Yano Yahya
Zufar Attaqi
2018
PREFACE
First of all, thanks to Allah SWT, because with the help of Allah the writer
finished writing the paper entitled "Data Mining" on time.
The purpose of writing this paper is to fulfill the assignment given by Mr.
Bambang Irfani as lecturer in the Semantics major.
In arranging this paper, the writer truly faced many challenges and
obstructions, but with the help of many individuals those obstructions were
overcome. The writer also realizes that there are still many mistakes in the
process of writing this paper.
Because of that, the writer thanks all the individuals who helped in the
process of writing this paper; hopefully Allah repays all of their help and
blesses them all. The writer realizes that this paper is still imperfect in
arrangement and content, and therefore hopes that criticism from the readers
can help the writer in perfecting the next paper. Last but not least, hopefully
this paper can help the readers gain more knowledge about the Semantics major.
The Author
TABLE of CONTENTS
PREFACE ................................................................................................................
ABSTRACT .............................................................................................................
CHAPTER I INTRODUCTION
CHAPTER II DISCUSSION
2.13 Data Mining and Campaign Management in the real world ...................
2.14 The Benefits of integrating Data Mining and Campaign Management ..
2.15 The ten Steps of Data Mining .................................................................
2.16 Evaluating the Benefits of a Data Mining Model ...................................
2.17 The data mining suite ..............................................................................
2.18 The Data Mining Suite is Unique ............................................................
REFERENCES
LIST of TABLES
LIST of FIGURES
ABSTRACT
CHAPTER I
INTRODUCTION
1.1 Background
Data mining is the process of extracting useful information. Basically, it is
the process of discovering hidden patterns and information in existing data.
In data mining, one needs to concentrate primarily on cleansing the data so
as to make it feasible for further processing. The process of cleansing the
data is also called noise elimination, noise reduction, or feature
elimination [1]. This can be done using the various tools available, which
support various techniques. An important consideration in data mining is
whether the data to be handled is static or dynamic.
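As an illustration of the cleansing step described above, the following minimal Python sketch drops values that lie far from the mean. The readings and the two-standard-deviation cutoff are illustrative assumptions, not part of any particular tool.

```python
# Minimal sketch of noise elimination: drop readings that lie more than
# two standard deviations from the mean. Values and cutoff are illustrative.
from statistics import mean, stdev

readings = [98, 102, 95, 110, 105, 99, 101, 97, 103, 1000]  # 1000 is likely noise

mu, sigma = mean(readings), stdev(readings)
cleaned = [x for x in readings if abs(x - mu) <= 2 * sigma]
print(cleaned)  # the extreme reading is removed
```

In practice this step is performed by the mining tool itself, often with more robust statistics, but the principle is the same.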
CHAPTER II
DISCUSSION
2.19 Definition
Data mining techniques are the result of a long process of research and
product development. This evolution began when business data was first
stored on computers, continued with improvements in data access, and more
recently, generated technologies that allow users to navigate through their
data in real time. Data mining takes this evolutionary process beyond
retrospective data access and navigation to prospective and proactive
information delivery. Data mining is ready for application in the business
community because it is supported by three technologies that are now
sufficiently mature:
• Massive data collection
• Powerful multiprocessor computers
• Data mining algorithms
In the evolution from business data to business information, each new step
has built upon the previous one. For example, dynamic data access is critical
for drill-through in data navigation applications, and the ability to store large
databases is critical to data mining. From the user's point of view, the four
steps listed in Table 1 were revolutionary because they allowed new business
questions to be answered accurately and quickly.
Evolutionary Step: Data Mining (Emerging Today)
Business Question: "What's likely to happen to Boston unit sales next month? Why?"
Enabling Technologies: Advanced algorithms, multiprocessor computers, massive databases
Product Providers: Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry)
Characteristics: Prospective, proactive information delivery
Table 1. Steps in the Evolution of Data Mining (most recent step).
A data warehouse is a central repository for data. There are three different
basic architectures for constructing a data warehouse. In the first type there
is only a central location to store data, which we call the data warehouse's
physical storage media. In this type of construction, data is gathered from
heterogeneous data sources, such as different types of files, local database
systems, and other external sources.
As the data is stored in a central place, access to it is very easy and
simple. The disadvantage of this construction, however, is a loss of
performance.
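As a hedged sketch of this first architecture, the Python fragment below gathers records from two heterogeneous sources into one central collection with a shared schema: a flat file with positional columns and a local database with named columns. All field names and values are illustrative assumptions.

```python
# Consolidating heterogeneous sources into one central collection.
csv_rows = [["101", "Alice", "Jakarta"], ["102", "Budi", "Bandung"]]
db_rows = [{"cust_id": 103, "cust_name": "Citra", "city": "Surabaya"}]

warehouse = []
for row in csv_rows:  # flat-file source: positional columns
    warehouse.append({"id": int(row[0]), "name": row[1], "city": row[2]})
for row in db_rows:   # local database source: named columns
    warehouse.append({"id": row["cust_id"], "name": row["cust_name"],
                      "city": row["city"]})

print(len(warehouse))  # all data is now accessed from one place
```

Once the records share one schema in one place, queries no longer need to know where each record originally came from, which is exactly the ease of access described above.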
The technique used to perform these feats in data mining is called modeling.
Modeling is simply the act of building a model in one situation where you
know the answer and then applying it to another situation where you don't.
For instance, if you were looking for a sunken Spanish galleon on the high
seas, the first thing you might do is research the times when Spanish
treasure had been found by others in the past. You might note that these
ships often tend to be found off the coast of Bermuda, that there are certain
characteristics to the ocean currents, and that certain routes were likely
taken by the ships' captains in that era. You note these similarities and
build a model that includes the characteristics common to the locations of
these sunken treasures. With this model in hand, you sail off looking for
treasure where your model indicates it is most likely to be, given a similar
situation in the past. Hopefully, if you've got a good model, you find your
treasure.
This act of model building is thus something that people have been doing
for a long time, certainly before the advent of computers or data mining
technology. What happens on computers, however, is not much different than
the way people build models. Computers are loaded up with lots of
information about a variety of situations where an answer is known and then
the data mining software on the computer must run through that data and
distill the characteristics of the data that should go into the model. Once the
model is built it can then be used in similar situations where you don't know
the answer. For example, say that you are the director of marketing for a
telecommunications company and you'd like to acquire some new long
distance phone customers. You could just randomly go out and mail coupons
to the general population - just as you could randomly sail the seas looking
for sunken treasure. In neither case would you achieve the results you
desired, but of course you have the opportunity to do much better than
random: you could use the business experience stored in your database to
build a model.
As the marketing director you have access to a lot of information about all
of your customers: their age, sex, credit history and long distance calling
usage. The good news is that you also have a lot of information about your
prospective customers: their age, sex, credit history etc. Your problem is that
you don't know the long distance calling usage of these prospects (since they
are most likely now customers of your competition). You'd like to concentrate
on those prospects who have large amounts of long distance usage. You can
accomplish this by building a model. Table 2 illustrates the data used for
building a model for new customer prospecting in a data warehouse.
Table 2. Data used for building a model for new customer prospecting
(known customers vs. prospects).
This model could then be applied to the prospect data to try to tell
something about the proprietary information that this telecommunications
company does not currently have access to. With this model in hand new
customers can be selectively targeted.
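A deliberately tiny Python sketch of this idea follows: a one-attribute "model" (average long distance usage per age band) is built from customers whose usage is known and then applied to prospects whose usage is unknown. The age bands and figures are invented for illustration.

```python
# Build a model where the answer is known, apply it where it is not.
customers = [
    {"age": 27, "usage": 180.0},
    {"age": 29, "usage": 160.0},
    {"age": 52, "usage": 40.0},
    {"age": 55, "usage": 60.0},
]
prospects = [{"age": 28}, {"age": 53}]  # usage unknown

def band(age):
    return "young" if age < 40 else "older"

# "Train": mean usage observed in each age band.
totals, counts = {}, {}
for c in customers:
    b = band(c["age"])
    totals[b] = totals.get(b, 0.0) + c["usage"]
    counts[b] = counts.get(b, 0) + 1
model = {b: totals[b] / counts[b] for b in totals}

# "Score": apply the model to prospects, whose usage is unknown.
scores = [model[band(p["age"])] for p in prospects]
print(scores)
```

Real data mining tools build far richer models from many attributes at once, but the train-then-score structure is the same.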
If someone told you that he had a model that could predict customer usage
how would you know if he really had a good model? The first thing you
might try would be to ask him to apply his model to your customer base -
where you already knew the answer. With data mining, the best way to
accomplish this is by setting aside some of your data in a vault to isolate it
from the mining process. Once the mining is complete, the results can be
tested against the data held in the vault to confirm the model’s validity. If the
model works, its observations should hold for the vaulted data.
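The vault idea can be sketched in a few lines of Python: part of the data is set aside before mining, a rule is "mined" from the rest, and the rule is then tested against the held-out records. The toy data and the trivial rule are assumptions made purely for illustration.

```python
import random

random.seed(0)
data = [(age, age < 40) for age in range(20, 70)]  # toy (age, responded?) records
random.shuffle(data)

vault = data[:10]        # set aside in the "vault", isolated from mining
mining_set = data[10:]   # only this part is mined

# "Mine" a trivial rule from the mining set: predict a response iff age < 40.
def rule(age):
    return age < 40

# Once mining is complete, test the rule against the vaulted data.
correct = sum(1 for age, responded in vault if rule(age) == responded)
accuracy = correct / len(vault)
print(accuracy)
```

If the model's observations hold on the vaulted records, as they do here by construction, the model is more likely to generalize to genuinely new data.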
• The large numbers of campaigns that run on a daily or weekly basis can
be difficult to schedule and can swamp the available resources.
• The process is error prone; it is easy to score the wrong database or the
wrong fields in a database.
• Scoring is typically very inefficient. Entire databases are usually scored,
not just the segments defined for the campaign. Not only is effort wasted,
but the manual process may also be too slow to keep up with campaigns
run weekly or daily.
The solution to these problems is the tight integration of Data Mining and
Campaign Management technologies. Integration is crucial in two areas:
First, the Campaign Management software must share the definition of the
defined campaign segment with the Data Mining application to avoid
modeling the entire database. For example, a marketer may define a campaign
segment of high-income males between the ages of 25 and 35 living in the
northeast. Through the integration of the two applications, the Data Mining
application can automatically restrict its analysis to database records
containing just those characteristics.
Second, selected scores from the resulting predictive model must flow
seamlessly into the campaign segment in order to form targets with the
highest profit potential.
• Scoring only the relevant customer subset and eliminating the manual
process shrinks cycle times. Scoring data only when needed assures
"fresh," up-to-date results.
Ideally, marketers who build campaigns should be able to apply any model
logged in the Campaign Management system to a defined target segment. For
example, a marketing manager at a cellular telephone company might be
interested in high-value customers likely to switch to another carrier. This
segment might be defined as customers who are nine months into a twelve-
month contract, and whose average monthly balance is more than $150.
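In code, the shared segment definition might look like the following Python sketch: the analysis is restricted to the defined segment, and only those records are scored. The customer records and the scoring formula are illustrative assumptions.

```python
customers = [
    {"id": 1, "months_into_contract": 9, "avg_balance": 210.0},
    {"id": 2, "months_into_contract": 9, "avg_balance": 90.0},
    {"id": 3, "months_into_contract": 4, "avg_balance": 300.0},
]

# Restrict the analysis to the campaign segment, not the whole database.
segment = [c for c in customers
           if c["months_into_contract"] == 9 and c["avg_balance"] > 150]

# Score only the segment (toy churn score; the formula is an assumption).
for c in segment:
    c["churn_score"] = min(1.0, c["avg_balance"] / 300.0)

print([c["id"] for c in segment])
```

Because only one of the three records satisfies the segment definition, only that record is scored, which is precisely the efficiency gain the integration promises.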
The easiest approach to retaining these customers is to offer all of them a
new high-tech telephone. However, this is expensive and wasteful, since many
customers would remain loyal without any incentive.
For marketers:
• Improved campaign results through the use of model scores that further
refine customer and prospect segments.
• Records can be scored when campaigns are ready to run, allowing the use
of the most recent data. "Fresh" data and the selection of "high" scores
within defined market segments improve direct marketing results.
• Accelerated marketing cycle times that reduce costs and increase the
likelihood of reaching customers and prospects before competitors.
• Scoring takes place only for records defined by the customer segment,
eliminating the need to score an entire database. This is important to keep
For statisticians:
2. Select The Data
Once you have defined your goal, your next step is to select the data to
meet this goal. This may be a subset of your data warehouse or a data mart
that contains specific product information, or it may be your customer
information file. Limit the scope of the data to be mined as much as
possible.
Here are some key issues.
- Are the data adequate to describe the phenomena the data mining
analysis is attempting to model?
- Can you enhance internal customer records with external lifestyle and
demographic data?
- Are the data stable? Will the mined attributes be the same after the
analysis?
- If you are merging databases, can you find a common field for linking
them?
- How current and relevant are the data to the business goal?
3. Prepare The Data
Once you’ve assembled the data, you must decide which attributes to
convert into usable formats. Consider the input of domain experts, the
creators and users of the data.
- Establish strategies for handling missing data, extraneous noise, and
outliers
- Identify redundant variables in the dataset and decide which fields to
exclude
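The two preparation strategies above can be sketched as follows in Python; the records, the mean-fill strategy for missing ages, and the redundant field name are all illustrative assumptions.

```python
rows = [
    {"age": 34, "income": 50000, "income_usd": 50000},
    {"age": None, "income": 72000, "income_usd": 72000},
    {"age": 41, "income": 61000, "income_usd": 61000},
]

# Strategy for missing data: fill missing ages with the mean of observed ages.
ages = [r["age"] for r in rows if r["age"] is not None]
mean_age = sum(ages) / len(ages)
for r in rows:
    if r["age"] is None:
        r["age"] = mean_age

# Redundant variable: "income_usd" duplicates "income", so exclude the field.
prepared = [{k: v for k, v in r.items() if k != "income_usd"} for r in rows]
print(prepared)
```

Which fill strategy and which fields to drop are exactly the decisions the domain experts should weigh in on.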
For a closer look at how the use of model scores can improve
profitability, consider an example campaign with the following
assumptions:
The Data Mining Suite™ is truly unique, providing the most powerful, complete and
comprehensive solution for enterprise-wide, large scale decision support. It leads the
world of discovery with the exceptional ability to directly mine large multi-table
SQL databases.
The Data Mining Suite works directly on large SQL repositories with no need for
sampling or extract files. It accesses large volumes of multi-table relational data on
the server, incrementally discovers powerful patterns and delivers automatically
generated English text and graphs as explainable documents on the intranet.
The Data Mining Suite is based on a solid foundation with a total vision for decision
support. The three-tiered, server-based implementation provides highly scalable
discovery on huge SQL databases with well over 90% of the computations
performed directly on the server, in parallel if desired.
With server-based discovery, the Data Mining Suite performs over 90% of the
analyses on the server, with SQL, C programs and Java. Discovery takes place
simultaneously along multiple dimensions on the server, and is not limited by the
power of the client. The system analyzes both relational and multi-dimensional data,
discovering highly refined patterns that reveal the real nature of the dataset. Using
built-in advanced mathematical techniques, these findings are carefully merged by
the system and the results are delivered to the user in plain English, accompanied by
tables and graphs that highlight the key patterns.
The Data Mining Suite pioneered multi-dimensional data mining. Before this,
OLAP had usually been a multi-dimensional manual endeavor, while data mining
The Data Mining Suite also pioneered the use of incremental pattern-base
population. With incremental data mining, the system automatically discovers
changes in patterns as well as the patterns of change. For instance, each month sales
data is mined and the changes in the sales trends as well as the trends of change in
how products sell together are added to the pattern-base. Over time, this knowledge
becomes a key strategic asset to the corporation.
These truly unique products are all designed to work together, and in concert
with the Knowledge Access Suite™.
either OLAP or data mining alone. The system performs multi-table, dimensional
data mining at the server level, providing the best possible results. The Rule-
based Influence Discovery System is not a multi-dimensional repository, but a
data mining system. It accesses granular data in a large database via standard
SQL and reaches for multi-dimensional data via a ROLAP approach of the user's
choosing.
figures. Patterns in the data snap-shot are found on a monthly basis and are added
to the pattern-base. As new data becomes available (say once a month) the
system automatically finds new patterns, merges them with the previous patterns,
stores them in the pattern-base and notes the differences from the previous time-
periods.
Trend Discovery
Trend Discovery with the Data Mining Suite uncovers time-related patterns that
deal with change and variation of quantities and measures. The system expresses
trends in terms of time-grains, time-windows, slopes and shapes. The time-grain
defines the smallest grain of time to be considered, e.g. a day, a week or a month.
Time-windows define how time grains are grouped together, e.g. we may look at
daily trends with weekly windows, or we may look at weekly grains with
monthly windows. Slopes define how quickly a measure is increasing or
decreasing, while shapes give us various categories of trend behavior, e.g.
smoothly increasing vs. erratically changing.
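The grain/window/slope vocabulary can be illustrated with a short Python sketch: daily sales (the time-grain) are grouped into weekly windows, and a least-squares slope is computed per window. The sales figures are invented for illustration.

```python
# Daily sales: the time-grain is a day, the time-window is a week.
daily_sales = [10, 12, 11, 13, 14, 15, 16,   # week 1
               18, 17, 19, 21, 20, 22, 23]   # week 2

def slope(ys):
    # Least-squares slope of ys against positions 0..n-1.
    n = len(ys)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den

windows = [daily_sales[i:i + 7] for i in range(0, len(daily_sales), 7)]
slopes = [slope(w) for w in windows]
print(slopes)  # both positive: a smoothly increasing trend
```

A shape classification, such as "smoothly increasing," could then be assigned by comparing the slope against how erratically the values vary inside each window.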
Forensic Discovery
Forensic Discovery with the Data Mining Suite relies on automatic anomaly
detection. The system first identifies what is usual and establishes a set of norms
through pattern discovery. The transactions or activities that deviate from the
norm are then identified as unusual. Business users can discover where unusual
activities may be originating and the proper steps can be taken to remedy and
control the problems. The automatic discovery of anomalies is essential in that
the ingenious tactics used to spread activities within multiple transactions can
usually not be guessed beforehand.
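A minimal Python sketch of this two-step idea: first establish the norm from historical transactions, then flag new activity that deviates from it. The figures and the three-standard-deviation threshold are illustrative assumptions.

```python
from statistics import mean, stdev

# Step 1: establish the norm from historical transactions.
history = [100, 105, 98, 102, 99, 101, 103, 97, 100, 104]
mu, sigma = mean(history), stdev(history)

# Step 2: flag activities that deviate from the norm.
new_transactions = [101, 99, 480]
anomalies = [t for t in new_transactions if abs(t - mu) > 3 * sigma]
print(anomalies)
```

The 480 transaction stands far outside the established norm and is flagged, without anyone having had to guess beforehand what "unusual" would look like.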
Predictive Modeler
The Data Mining Suite™ Predictive Modeler makes predictions and forecasts by
using the rules and patterns which the data mining process generates. While
induction performs pattern discovery to generate rules, the Predictive Modeler
performs pattern matching to make predictions based on the application of these
rules. The predictive models produced by the system have higher accuracy
because the discovery process works on the entire dataset and need not rely on
sampling.
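The induction/matching division of labor can be sketched in Python as follows: previously discovered rules are applied by pattern matching to score a new record. The rules, fields, and confidence values are invented for illustration.

```python
# Rules produced by a (hypothetical) discovery step:
# (condition, predicted outcome, confidence).
rules = [
    (lambda r: r["months_into_contract"] >= 9 and r["avg_balance"] > 150,
     "likely_to_churn", 0.82),
    (lambda r: r["avg_balance"] <= 150, "likely_to_stay", 0.67),
]

def predict(record):
    # Pattern matching: return the matching rule with the highest confidence.
    matches = [(outcome, conf) for cond, outcome, conf in rules if cond(record)]
    return max(matches, key=lambda m: m[1]) if matches else ("unknown", 0.0)

print(predict({"months_into_contract": 10, "avg_balance": 200.0}))
```

Induction produces the rule set; prediction is then just a lookup of which rules a new record matches.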
The output from the seven component products of the Data Mining Suite is stored
within the pattern-base and is accessible with PQL: The Pattern Query
Language. Readable English text and graphs are automatically generated in
ASCII and HTML formats for delivery on the intranet and Internet.
The products in the Data Mining Suite™ deliver the most advanced and
scalable technologies within a user-friendly environment. The specific
reasons draw on the solid mathematical foundation pioneered by Information
Discovery, Inc. and on a highly scalable implementation.
Multi-Table Discovery
The Data Mining Suite discovers patterns in multi-table SQL databases
without having to join and build an extract file. This is a key issue in
mining large databases. The world is full of multi-table databases which
cannot be joined and meshed into a single view. In fact, the theory of
normalization came about because data needs to be in more than one
table. Using single tables is an affront to all the work of E.F. Codd on
database design. If you challenge the DBA in a really large database to
put things in a single table you will either get a laugh or a blank stare --
in many cases the database size will balloon beyond control. In fact,
there are many cases where no single view can correctly represent the
semantics of influence because the ratios will always be off regardless
of how you join. The Data Mining Suite leads the world of discovery
with the unique ability to mine large multi-table databases.
No Sampling or Extracts
Sampling theory was invented because one could not have access to the
underlying population being analyzed. But a warehouse is there to
provide such access.
General and Powerful Patterns
The format of the patterns discovered by the Data Mining Suite is very
general and goes far beyond decision trees or simple affinities. The
advantage to this is that the general rules discovered are far more
powerful than decision trees. Decision trees are very limited in that they
cannot find all the information in the database. Being rule-based keeps
the Data Mining Suite from being constrained to one part of a search
space and makes sure that many more clusters and patterns are found --
allowing the Data Mining Suite to provide more information and better
predictions.
Language of Expression
The Data Mining Suite has a powerful language of expression, going
far beyond what most other systems can handle.
These unique features and benefits make the Data Mining Suite the ideal
solution for large-scale Data Mining in business and industry.
respond AND make a large purchase. The patterns data mining finds for
those two goals may be very different.
Although a good data mining tool shelters you from the intricacies of
statistical techniques, it requires you to understand the workings of the
tools you choose and the algorithms on which they are based. The choices
you make in setting up your data mining tool and the optimizations you
choose will affect the accuracy and speed of your models.
CHAPTER III
CLOSING
3.1 CONCLUSION
Data warehouse systems enable us to store large volumes of data from a
variety of interrelated databases and process them together. A data warehouse
thus answers the complex OLAP queries made by the analyst and gives the
required information. Hence, a data warehousing system provides the right way
to access large databases in a fraction of the time. Data mining is the
extraction of hidden predictive information from large databases. It is a
powerful new technology with great potential to help companies focus on the
most important information in their data warehouses.
REFERENCES