
An Oracle White Paper

May 2014

Create a Spend Classification Knowledge Base


Best Practices


Executive Overview
Terminology
Introduction
  Key Challenges
  Oracle Spend Classification
Spend Classification Process Overview
Guidelines on Creating an Effective Training Dataset
  Points to Remember
Managing Taxonomy Over Time
Implementation Considerations
Conclusion
Appendix A: Deep Dive into the Training and Classification Process
  The Knowledge Base Creation Process
  Classification Process
Appendix B: Frequently Asked Questions


Executive Overview
Spend Classification is the process of transforming raw data from payment and purchasing
systems by cleansing, enriching, and classifying it into common, meaningful categories. A
good Spend Classification and Analysis tool helps organizations identify savings
opportunities and monitor compliance with corporate policies and negotiated contracts.
Statistics show that organizations with good spend visibility easily save between 0.25% and 1%
of total spend every year.
One of the unique selling propositions of Oracle Spend Classification is on-premise installation
and classification. This presents a challenge to implementers, as they need to build a
Knowledge Base that is specific to the business requirements and adapts well to changing
business needs. The application makes use of powerful Data Mining algorithms to provide
predictive capabilities to the Knowledge Base. However, history has shown that business
users frequently do not understand how the Knowledge Base is used to classify transactions.
This white paper sheds light on how the Knowledge Base functions and provides guidelines
on creating an effective Training Dataset and on using Spend Classification optimally.

Terminology
Apply operation: Refers to the classification (scoring) process
Build operation: Refers to the Knowledge Base creation/enrichment process
Class: The Commodity/Category code. An example could be HARDWARE.LAPTOP for a two-level
taxonomy.
Classification Batch: The group of records submitted for a classification run using a Knowledge
Base.
Document: Concatenation of text fields of each input row, either for a Training Dataset or
Classification Dataset.
Enrichment of Knowledge Base: A process by which the learning acquired by the Spend engine
during the previous training process is completely erased, and new learning is gained with the help
of a revised training dataset. It is not an incremental process, but a complete refresh of the
Knowledge Base.


Knowledge Base: An Oracle Data Mining (ODM) Model that is created using sophisticated
mathematical algorithms to segment data and discover patterns. This, in turn, aids in the
prediction of likely outcomes.
Taxonomy: A method of categorizing or classifying any data.
TF-IDF: Term Frequency-Inverse Document Frequency is a numerical statistic that indicates
how important a word is to a document in a collection or corpus of documents. It is often used as
a weighting factor in information retrieval and text mining. The statistic increases
proportionally to the number of times a word appears in the document; however, it is offset by
the frequency of the word in the corpus, which helps to control for the fact that some words are
generally more common than others.
Training Dataset: A representative sample of the spend data brought in from disparate legacy
systems into Oracle BI Applications.
Unclassified Data: All of the spend data brought in from disparate legacy systems into Oracle BI
Applications that has never undergone the classification process, even once, using a Knowledge
Base.
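The TF-IDF weighting defined above can be illustrated with a short sketch. This is a simplified textbook formulation in Python; Oracle Text computes its own internal variant, so treat the function and the toy corpus below as illustrative assumptions only:

```python
import math

def tf_idf(term, document, corpus):
    """Simplified TF-IDF: frequency of the term in one document,
    offset by how common the term is across the whole corpus."""
    tf = document.count(term) / len(document)
    docs_with_term = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf

# Toy corpus: three tokenized spend "documents".
corpus = [
    ["laptop", "dell", "usd"],
    ["laptop", "lenovo", "usd"],
    ["catering", "service", "usd"],
]

# "usd" occurs in every document, so its weight collapses to zero;
# "dell" is distinctive to one document, so it gets a positive weight.
print(tf_idf("usd", corpus[0], corpus))   # 0.0
print(tf_idf("dell", corpus[0], corpus))  # > 0
```

Note how a keyword that appears on every line carries no classification signal, which is exactly why ubiquitous keywords are poor predictors of category codes.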

Introduction
In today's dynamic and intensely competitive business environment, global organizations
face tremendous pressure to manage spend effectively. Organizations need to achieve their
savings goals and ensure that the quality of supply improves over time. The need of the hour
is to consolidate spend with fewer, more strategic suppliers and to report spending at a more
granular level. In this era of globalization, the focus is on identifying sources of contract
leakage globally and ensuring spend visibility across systems. Organizations will not be able
to get an accurate picture of their overall spend unless spend data is classified accurately.
Hence, organizations need an effective spend classification and analysis tool to improve
strategic sourcing, meet performance targets, and ensure compliance with internal policies
and government regulations. According to Forrester research data, enterprises in multiple
markets across multiple industries with multiple ERPs benefit the most from using spend
classification tools.


Figure 1: Do you need a Spend Classification solution?

Key Challenges
The key challenges faced by the organizations with spend analysis and classification is as
follows:

No Categorization: The presence of disparate ERP systems, including home-grown solutions,
often results in spend transactions without associated category codes.

Incorrect Classification and Miscellaneous Classification: Users often categorize
most, or all, of their spend as Miscellaneous. Even when they try to figure out the category
into which a purchase might fall, they frequently select the wrong category code. A
combination of such factors results in spend being classified in incorrect categories, thereby
portraying an inaccurate picture of overall spend.

Taxonomy Errors: Frequently, the Taxonomy definition is incomplete, or it meets only the
financial reporting requirements. Sometimes, the Taxonomy is not defined to the granularity
level needed to get an accurate picture of actual spend.

Different Taxonomies across Geographies: The presence of disparate source systems and
definitions in operating units across the globe results in multiple item coding structures and
standards. Consolidating global spend information using a global taxonomy is an enormous
task.


Oracle Spend Classification


Traditionally, spend data is extracted from multiple source systems and shipped to third-party
service providers who perform manual or semi-manual classification of the data. These service
providers use a generic Knowledge Base to categorize spend data, which does not fulfill the
organization's reporting requirements. Post-classification, the sorted data is sent back to the
organization. This data is then loaded into the analytics applications after sanity checks are
performed on the classified data. The entire process usually takes between 4 and 8 weeks,
and the service providers charge high rates, even for offshore operations. Using this approach,
organizations can classify data only at long, infrequent intervals (quarterly or bi-annually).
Additionally, sending sensitive spend data outside the organization poses data privacy issues.

Figure 2: Traditional Spend Classification Approach

On the other hand, Oracle Spend Classification provides a fully integrated end-to-end
classification solution with Oracle Procurement and Spend Analytics and Oracle iProcurement.
Unclassified data is seamlessly read from Procurement and Spend Analytics and classified
using Oracle Data Mining engine. When you are satisfied with the accuracy of the
classification, approve the classification run. After approval, classified data is immediately
available in Procurement and Spend Analytics for further analysis.


Figure 3: Oracle Spend Classification: The Integrated Solution

Following are the advantages of using Oracle Spend Classification over third party
classification service providers:

Intelligence-based Learning System: The application learns from spend data using
powerful Data Mining algorithms and improves its capability with your feedback. You can
enrich the Knowledge Base to increase classification accuracy.

On-Premise Installation: You need not transfer spend data to service providers, thereby
eliminating data privacy concerns. You can classify data as and when required. Out-of-the-box
integration with Procurement and Spend Analytics ensures immediate availability of
classified spend data after the classification process.

On-Demand Services: Spend Classification On-Demand services result in faster
deployment and quicker return on investment.

Assisted Classification: Users have an option to override category codes predicted by the
system.

Inline Classification: The application helps classify Non-Catalog Requests in
Oracle iProcurement, thereby correcting the classification code at the point of entry. Users
need not worry about trying to figure out the category code of their transactions.


Spend Classification Process Overview


The following process diagram depicts the process flow of a Spend Classification implementation:
(Flowchart summary: Start → Analyze Spend Data → Create Taxonomy → Create Training
Dataset → Training Dataset satisfactory? If no, Modify Training Dataset; if yes, Create
Knowledge Base → Unclassified data exists? If yes, Classify Data → Classification results
satisfactory? If no, Perform Manual Corrections as needed; if yes, Approve Classification
Batch. If a large volume of manual corrections is required, or the taxonomy changes or new
data sources are added, Modify Training Dataset → Enrich Knowledge Base and classify
again. When no unclassified data remains, End.)

Figure 4: Oracle Spend Classification - Implementation Overview

Analyze Spend Data: Prior to implementing the application, it is very important that you
know and understand the data that needs to be classified. This helps in identifying the transactions that
need to be part of the Training Dataset. A good Training Dataset represents the entire population of
spend data that needs to be classified by the application. For a new implementation, a good starting
point is the classified dataset the organization has received from third-party classification service
providers in the past, if available. Otherwise, you will need to create the Training Dataset manually
using sample transactions for each spend category.


Create Taxonomy: The next step is to create the taxonomy, or modify the existing taxonomy, to fit
the reporting requirements of the global organization. Spend Data is generally captured and stored in
multiple systems. Frequently, these systems utilize different classification schemes (that is, taxonomies)
to categorize expenditure. Usually classification schemes are designed to aid financial reporting, and are
not adequate for procurement and spend analysis and reporting. Hence, it is essential to create (or use)
a standard taxonomy across the enterprise to categorize spend data. Most organizations end up
creating custom taxonomies that address their business processes and reporting requirements. Oracle
Spend Classification is well equipped to handle custom taxonomies.

Create Training Dataset: Next, you need to provide the correct category codes for all the
transactions that are part of the Training Dataset, in the auto code column of the taxonomy being
used. This is a very important step, as the application relies completely on the richness and accuracy of
the Training Dataset in order to predict classification codes accurately. For more information, please
refer to the section Guidelines on Creating an Effective Training Dataset.

Create Knowledge Base: Once the Training Dataset is in place, create the Knowledge Base. When a
Knowledge Base is built, it learns the patterns present in the Training Dataset through inspection.
During the Classification process, the Knowledge Base applies this learning to predict the
classification code. A Knowledge Base can be created using a single Taxonomy. Users
have the following options for creating a Knowledge Base:

Standard Knowledge Base: A Knowledge Base created using the standard Oracle Spend
Classification application. It utilizes the hierarchical Support Vector Machine algorithm for
creating the Model.

Advanced Knowledge Base: A Knowledge Base that is created outside the standard
Oracle Spend Classification application. The available algorithms are linear Support
Vector Machine, Naïve Bayes, and Generalized Linear Model. For more information
about these algorithms, please refer to the Oracle Advanced Analytics user guide.

For more information on Knowledge Base creation process, please refer to Appendices A and B.
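The standard flow (text features feeding a Support Vector Machine) can be approximated with scikit-learn. This is an illustrative analogy under stated assumptions, not the Oracle implementation: the sample Item Text strings, the category codes, and the flat (non-hierarchical) linear SVM are all stand-ins for the product's internal hierarchical model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy Training Dataset: Item Text -> category code (the auto code column).
item_text = [
    "DELL LATITUDE LAPTOP 14 INCH",
    "LENOVO THINKPAD NOTEBOOK LAPTOP",
    "OFFICE CATERING LUNCH SERVICE",
    "CAFETERIA FOOD SERVICE WEEKLY",
]
auto_code = ["HARDWARE.LAPTOP", "HARDWARE.LAPTOP",
             "SERVICES.CATERING", "SERVICES.CATERING"]

# TF-IDF features feeding a linear SVM, mirroring the
# text-processing-then-build flow described in this paper.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(item_text, auto_code)

# Classify an unseen transaction description.
print(model.predict(["HP PROBOOK LAPTOP"])[0])  # HARDWARE.LAPTOP
```

The pipeline makes the two-phase nature of the build explicit: the vectorizer plays the role of the Oracle Text pre-processing step, and the SVM plays the role of the Knowledge Base build.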

Classify Spend Data: Once the Knowledge Base is in place, classify the spend data. Essentially, all
spend data is treated as unclassified at the time of the first classification run. Thereafter, the addition
of new source transaction systems or incremental data loads for existing source systems result in
unclassified data. For more information on the Classification process, please refer to Appendices A and B.

Perform Manual Corrections: If the predicted category codes of some transactions are not as
expected, you can perform manual corrections and overwrite the predicted category codes prior to
approving the classification batch. Please note that users cannot enter classification codes manually
for unclassified data; this is possible in Oracle Spend Classification only after a Classification run.

Approve Classification Batch: If the classification results are satisfactory, approve the Classification
Batch. This, in turn, updates the Oracle Business Intelligence Applications (OBIA) base tables with the
classification codes of the concerned taxonomy. Please note that it doesn't override the existing
commodity code value present in unclassified data.


Enrich the Knowledge Base: In situations where many manual corrections must be made to the
predicted category codes, or new transaction data sources are added over time, users need to
scrutinize the existing Training Dataset and make the required modifications. The existing
Knowledge Base can then be enriched using the modified Training Dataset.
Users can enrich the Knowledge Base in the following scenarios:

To improve the accuracy of classification, in scenarios where the Knowledge Base is not
picking up certain keywords correctly or is giving unwanted weightage to other
keywords.

To support changes in taxonomy (addition or removal of category codes, or changes in
existing category codes), or the addition of new legacy systems whose spend data is
required to be included.

Reclassify Spend Data: In scenarios such as classification runs with very low classification accuracy,
reset the predicted category codes of classified transactions and reclassify them against the enriched
Knowledge Base.

Guidelines on Creating an Effective Training Dataset


The Knowledge Base is nothing but a manifestation of the Training Dataset used for its creation. The
accuracy with which the Knowledge Base predicts the correct category code depends on the richness
and accuracy of the Training Dataset used to build it. During an implementation cycle, implementers
or business users usually underestimate the amount of time needed to put together a good Training
Dataset. This activity is further complicated if the persons in charge of putting together the Training
Dataset don't know the data they are looking at: the number of distinct commodity codes, sample
transactions covering most of the variations in the way a particular purchase or invoice may manifest,
knowledge of the keywords influencing the category code, and so on. Sometimes, under-representation
of transactions of a particular commodity code can lead to classification with lower confidence
(although the Knowledge Base may still predict the code accurately during the classification process).
Over-representation of transactions of a particular commodity code can sometimes affect the accurate
prediction of another commodity. Hence, creating a good Training Dataset is more of an art than a
science for the implementers.
Below we provide some guidelines on creating a good Training Dataset.

Understand the Template: You need to download the Template file from the application, and
repurpose the training data into the format specified in the Template. The Template contains the
following kinds of attributes:

Mandatory columns, where the application expects a value to be present:
  - Dataset Identifier: name of the Dataset being uploaded
  - Transaction Number: system-generated alphanumeric code for the transaction
  - Datasource Id: signifies the source system to which the transaction belongs

Columns used by the Data Mining engine for both Knowledge Base creation and Classification:
  - Transaction Description, Line Description, Item Code, Item Description, Supplier Name,
    Supplier Site, Operating Unit, UOM, Currency, Cost Center, Line Amount

Columns used by the application to learn the correct category code (populate one or more of
these, depending on the Taxonomies being used):
  - EBS Auto Code, UNSPSC Auto Code, Custom Auto Code1, Custom Auto Code2, Custom Auto Code3

Columns used by the application only for populating the predicted category code after
classification completes (depending on the Taxonomy being used):
  - EBS Category Code, UNSPSC Category Code, Custom Category Code1, Custom Category Code2,
    Custom Category Code3

The remaining columns are used simply for tracking purposes.
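To make the grouping concrete, a single training row can be pictured as follows. The snake_case field names here are abridged assumptions for the sketch; the authoritative column names are those in the downloaded Template:

```python
# One illustrative training row, grouped as in the Template description.
training_row = {
    # Mandatory columns
    "dataset_identifier": "FY14_TRAINING",   # name of the Dataset
    "transaction_number": "TXN-000123",      # system-generated code
    "datasource_id": 999,                    # source system identifier
    # Columns read by the Data Mining engine (subset shown)
    "line_description": "DELL LATITUDE LAPTOP",
    "supplier_name": "DELL INC",
    "line_amount": 1199.00,
    # Auto code column: the correct category the engine learns from
    "custom_auto_code1": "HARDWARE.LAPTOP",
    # Category code column: ignored during KB creation; populated
    # by the application only after classification
    "custom_category_code1": None,
}

# At least one auto code column must be populated, or the build fails.
assert training_row["custom_auto_code1"] is not None
```

The key distinction the sketch captures is between the auto code column (supplied by you, consumed during training) and the category code column (left empty, filled in by classification).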

Creation of Item Text: During the Knowledge Base creation and classification processes, the
application constructs a string object by concatenating all the text-based attributes used by the Oracle
Data Mining (ODM) engine, except for Line Amount. This string object, along with Line Amount, is
what gets passed to the ODM engine. Hence, the association between attribute-value pairs is lost,
and the ODM engine looks at these text inputs simply as keywords. For example, a sample transaction
is listed below:
Line Description: SUPERIOR KITCHEN SERVICE INC
Supplier Name: CBR REAL ESTATE SERVICES
Supplier Site: SSTRMBOSTRACROSER001
Currency: USD
Line Amount: 276.41
Cost Center: 7097068

The application concatenates the text-based attributes into a single attribute called Item Text, and
passes the Item Text and Line Amount to the ODM engine. In the above sample, the Item Text is
constructed as: SUPERIOR KITCHEN SERVICE INC CBR REAL ESTATE SERVICES
SSTRMBOSTRACROSER001 7097068. Therefore, the ODM engine cannot distinguish what
constitutes the supplier name, line description, and so on.
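The Item Text construction can be sketched in a few lines. The field names below are assumptions for the sketch; what matters is that the attribute labels are dropped and only the values survive as keywords:

```python
def build_item_text(row):
    """Concatenate the text-based attributes into a single Item Text
    string; the attribute names are discarded, so the ODM engine sees
    only keywords. Line Amount is passed alongside, not concatenated."""
    text_fields = ["line_description", "supplier_name",
                   "supplier_site", "cost_center"]
    return " ".join(str(row[f]) for f in text_fields if row.get(f))

row = {
    "line_description": "SUPERIOR KITCHEN SERVICE INC",
    "supplier_name": "CBR REAL ESTATE SERVICES",
    "supplier_site": "SSTRMBOSTRACROSER001",
    "cost_center": "7097068",
    "line_amount": 276.41,
}
print(build_item_text(row))
# SUPERIOR KITCHEN SERVICE INC CBR REAL ESTATE SERVICES SSTRMBOSTRACROSER001 7097068
```

Once concatenated, a cost center number and a supplier name are indistinguishable to the engine; both are just keywords with TF-IDF weights.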

Stop List: Words that are part of the Stop List are not sent to the ODM engine either during
Knowledge Base creation or Classification. A stop list is a list of words that do not get indexed. These
are usually common words in a language such as this, that, and can in English. There are different Stop
Lists for different languages, such as English, Chinese, Danish, etc.

Data Cleansing: Ensure that you clean the text strings in the data to remove acronyms and spelling
errors. Standardize and normalize the information on item, item description and supplier, if required.

Attribute Overloading: Sometimes you might feel that certain additional attributes could help the
ODM engine better predict the commodity code (e.g., GL Code). However, the application only
allows a fixed set of columns to be considered for Classification and Knowledge Base creation. In
such cases, you can add such attributes to an existing text-based attribute that is not being used, or,
if all the attributes considered by the ODM engine have already been utilized, overload an existing
text-based attribute by appending the additional attribute values to it. Since the ODM engine
doesn't know about the relationship between an attribute-value pair and works based only on
keywords, this can be a workaround to introduce additional attributes. However, you would need to
do some customization so that the information about these additional attributes is passed to the
ODM engine during the Classification process.
The aim is to have a minimum of 35 records covering all possible variations of words that can occur at
each leaf node of the taxonomy.
At the end of a classification run, transactions classified with high confidence are typically expected
to be more accurate than those classified with medium or low confidence. However, if transactions
are classified with low confidence, it doesn't imply that the predicted category codes will be
incorrect.
It is not possible for users to specify classification rules manually. The system utilizes Oracle Data
Mining classification algorithms to learn/identify the trends present in the Training Dataset, and
applies this knowledge to unclassified data. This learning resides in the Knowledge Base and cannot
be viewed in a user interface. The only way to ascertain whether the Knowledge Base can classify the
transactions (or concerned commodity codes) correctly is after a classification run.
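The high/medium/low confidence bands can be thought of as thresholds over the classifier's predicted probability. The cut-off values in this sketch are illustrative assumptions, not the product's actual thresholds:

```python
def confidence_band(probability, high=0.8, medium=0.5):
    """Map a predicted probability to a confidence band. As noted
    above, LOW confidence does not mean the prediction is wrong,
    only that the model is less certain about it."""
    if probability >= high:
        return "HIGH"
    if probability >= medium:
        return "MEDIUM"
    return "LOW"

print(confidence_band(0.93))  # HIGH
print(confidence_band(0.62))  # MEDIUM
print(confidence_band(0.41))  # LOW
```

In practice, this is why low-confidence transactions are the ones to review first when deciding whether the Knowledge Base needs enrichment.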

Points to Remember

Ensure that the Dataset name is different from the seeded ones: AP Invoice, Purchase
Requisitions, and Purchase Orders.


If the Training Dataset consists of transactions for only a single commodity code, the Knowledge
Base creation process will fail. The ODM engine requires at least two distinct commodity codes in
order to create the Knowledge Base.

It is mandatory to enter values into one or more of the Auto Code columns (depending on the
Taxonomy); otherwise, the Knowledge Base creation process will fail.

The category code columns, such as EBS Category Code etc., are ignored during Knowledge Base
creation process.
To analyze why a transaction is not getting classified accurately, note the keywords that are
assigned the maximum weightage, and analyze the transactions in the Training Dataset belonging
both to the category code predicted by the application and to the correct category code. This will
help in discovering why the keywords in the training data for the correct category code are not
getting picked up.

Managing Taxonomy Over Time


The taxonomy may keep changing over time for numerous reasons:

Adding new categories to the existing taxonomy


Enhance the existing Training Dataset by adding transaction details pertaining to the newly added
category codes. Enrich the Knowledge Base using the modified Training Dataset. This would help the
Knowledge Base in predicting the newly added category codes. Reclassify the historical transactions, as
per business needs.

Deleting categories from the existing taxonomy


Modify the existing Training Dataset by removing transaction details pertaining to the deleted category
codes. Enrich the Knowledge Base using the modified Training Dataset. This would ensure that the
Knowledge Base never predicts the deleted category codes. Reclassify the historical transactions, as
your business requires.

Modification of an existing category code description in the existing taxonomy

Consider a situation where, for example, the description of UNSPSC category code 13111207 is
changed from Metalized Films to Non-Metalized Films. In such cases, delete the old category data
from the Oracle Spend Classification internal tables; otherwise, the taxonomy processing might fail
due to the presence of old records for the relevant category codes (for more information, please refer
to the latest Oracle Spend Classification Process Guide). Make the necessary corrections in the existing
Training Dataset. Enrich the Knowledge Base using the modified Training Dataset. Reclassify the
historical transactions, as per business needs.

Switching over to an entirely new taxonomy


Consider scenarios where the business is entirely changing the taxonomy from a custom taxonomy to
UNSPSC taxonomy. In such situations, create a new Training Dataset and associate the training


records with category codes of the new Taxonomy. Create a new Knowledge Base using the new
Training Dataset. Reclassify all historical transactions, based on your business requirements.

Reload existing Taxonomy in Oracle Business Intelligence Applications (OBIA)


If reloading the taxonomy leads to internal changes in the category IDs, you can opt to create
a new Knowledge Base or enrich an existing one. Reclassification of historical transactions
might not be required in this situation.

Implementation Considerations
The following guidelines should be kept in mind while setting up Oracle Spend Classification:

Understand Your Data: This is the most important aspect of any implementation, and is helpful in
creating a rich and comprehensive Training Dataset. You need to understand and explore the nature of
spend in the organization, major areas of spend (by amount as well as volume), supplier data etc. In
most situations, direct spend data is relatively easier to train and classify than indirect spend.

Perform a Keyword Analysis: You need to analyze and associate the keywords that users would use
for spend categories. Keywords could range from supplier names to transaction or item specific words,
or a combination of other attributes, such as Operating Unit. If you can associate a set of unique
keywords with a particular category code, the classification accuracy for such transactions will be very
high.

Eliminate Noise: It is important to identify and eliminate superfluous attributes and keywords that do
not influence the prediction of category codes. Keywords that occur across all transactions in a
Training Dataset should be removed from the Training Dataset. For example, if the currency USD
occurs in most of the transactions in a Training Dataset, it is very likely that it would get a very high
weightage during the classification process, and this may not be a desired outcome.
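The noise-elimination step can be automated with a simple document-frequency check: keywords that appear in nearly every training record (such as a ubiquitous currency code) carry little category signal and are candidates for removal. A sketch, with the 90% threshold as an assumption:

```python
from collections import Counter

def ubiquitous_terms(documents, threshold=0.9):
    """Return keywords occurring in at least `threshold` of the
    documents; such terms carry little category signal and are
    candidates for removal from the Training Dataset."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.upper().split()))
    cutoff = threshold * len(documents)
    return {term for term, count in doc_freq.items() if count >= cutoff}

docs = ["DELL LAPTOP USD", "CATERING SERVICE USD", "LENOVO NOTEBOOK USD"]
print(ubiquitous_terms(docs))  # {'USD'}
```

Running a check like this before building the Knowledge Base gives a concrete list of keywords to review, rather than relying on eyeballing the data.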

Update the Training Dataset: If certain category codes are being classified incorrectly, you need to
revisit the Training Dataset and modify the training data to reflect the intended behaviour. You should
use only those keywords in these transactions that are deemed important from a classification
perspective. As explained earlier, weightage is given to keywords, depending on the interaction of these
keywords across category codes in the Training Dataset. This activity becomes especially important
when the organization decides to introduce new category codes and no corresponding transaction data
exists in the ERP/ legacy systems.

Decide on the Taxonomy Structure: Before creating the Training Dataset, it is important that the
business identifies the Taxonomy to be used. The number of levels and the kind of Taxonomy (ERP/
UNSPSC/ Custom) should be decided and should be in place (which in turn would depend on the
business and reporting requirements of the organization). Also, it is of utmost importance that the
Taxonomy is well-defined and detailed enough, without any duplication or any other factors that could
bring in inaccuracies. For example, using category codes such as HARDWARE.HARDWARE, along
with other sub-categories under the HARDWARE parent category, is not a judicious decision. This is
because anything that can be classified under a different sub-category under the HARDWARE parent


category can also be classified under HARDWARE.HARDWARE. This would lead to irrelevant
results from the system and also leave room for misinterpretation by users.

Conclusion
In this white paper, we saw how Oracle Spend Classification compares to third-party
classification service providers. Oracle Spend Classification helps global organizations organize
spend into logical categories, automating an expensive data management process and adapting to
changing spend patterns over time.
Setup considerations for implementing the application were also described, along with
pre-implementation and post-implementation steps. Please refer to Metalink Note 1450275.1 to
access the latest Oracle Spend Classification Process Guide. Additionally, you can refer to the Oracle
Advanced Analytics guides and Oracle Text guides to better understand the Data Mining process.

Appendix A: Deep Dive into the Training and Classification Process


When a Knowledge Base is built, one of the operations involved is processing the text columns of the
input dataset by concatenating the various text columns using Oracle Text. Throughout this discussion,
the concatenation of text fields of each input row will be referred to as a document discrete to that
row. How Oracle Text assigns weights (popularly known as term frequency and inverse document
frequency, TF-IDF) to the document keywords, is not covered in this paper, because there are many
factors involved in this process. While Oracle Spend Classification does provide some control over this
process, calculation of TF-IDF is based on keyword frequency counting within the document as well
as the corpus of documents. Oracle Text assigns a TF-IDF value to each keyword in a document,
provided it is not included in the stop word list. The maximum number of keywords that could be
assigned such a value is limited by the MAX_DOCTERMS option (upper limit on the number of
keywords that can represent one document) and MAX_FEATURES option (upper limit on the
number of distinct keywords for the corpus of documents). The TF-IDFs for a single document are
assigned based on the examination and analysis of all documents presented to the Data Mining engine.
By using the Oracle Text pre-processing step prior to the build operation, the set of attributes (or
predictors) for each document is configured to include the keyword/value pairs from the item text, in
addition to any other columns present in the input dataset, such as Line Amount. Oracle Spend
Classification then proceeds to build the Knowledge Base using the data provided by the text
processing phase as input. The build process takes the populated target column (auto category code
column) and given predictors and assigns coefficients to the predictors in a training process that
observes the relationships between the target and predictors in the existing data. What follows below are


the various steps involved in the build and apply operations of an Oracle Spend Classification
Knowledge Base creation. Data from a sample dataset is used as an example.
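The MAX_DOCTERMS cap described above (an upper limit on the keywords retained per document) can be mimicked with a small sketch. The real option is an Oracle Text setting, so this pure-Python version is only an analogy:

```python
def cap_docterms(keyword_weights, max_docterms):
    """Keep only the highest-weighted TF-IDF keywords for one
    document, analogous to the MAX_DOCTERMS limit. A corpus-wide
    MAX_FEATURES cap would similarly bound the number of distinct
    keywords kept across all documents."""
    ranked = sorted(keyword_weights.items(),
                    key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:max_docterms])

# Keyword/weight pairs for one document (values from the sample below).
doc = {"VUNELLI": 0.374162, "LENGTH": 0.346139,
       "FAMOUS": 0.0, "SIZE": 0.0}
print(cap_docterms(doc, max_docterms=2))
# {'VUNELLI': 0.374162, 'LENGTH': 0.346139}
```

The effect is that low-weight keywords (often the least discriminative ones) simply never reach the build process as predictors.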

The Knowledge Base Creation Process


1. Assign TF-IDF: As part of the pre-processing step, Oracle Text indexes the text column for each document in the input dataset. In this process, the keywords in each document are assigned a weight (TF-IDF). This is repeated across the corpus of documents, which constitutes the input Training Dataset. A few of the parameters that can influence the indexing process include the Lexer and the Stop List. The Lexer preference specifies the language of the text to be indexed, whereas Stop Lists identify the words in your language that are not to be indexed. For more information on these entities, please refer to the Oracle Text documentation.
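To make the weighting idea concrete, here is a minimal, self-contained sketch of TF-IDF scoring. It is not Oracle Text's actual algorithm (Oracle Text's weighting and normalization are more sophisticated and its numbers will not match), but it shows why a keyword that appears in every document, like FAMOUS or COLOR in the example below, receives a weight of zero:

```python
import math
from collections import Counter

STOP_WORDS = {"this", "is", "the", "in", "of"}

def tokenize(text):
    # Split on whitespace and drop stop words; Oracle Text's Lexer is far richer.
    return [w.upper() for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf(docs):
    """Assign a simplified TF-IDF weight to each keyword of each document."""
    n = len(docs)
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents does each keyword appear?
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        # Keywords present in every document get IDF = log(n/n) = 0,
        # mirroring the zero weights seen later for FAMOUS, SIZE and COLOR.
        weights.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weights

docs = [
    "this is the famous Vunelli trouser in color red",
    "the famous Bcbg cargo trouser in color blue",
    "this is the famous skirt in color green",
]
w = tf_idf(docs)
# FAMOUS and COLOR appear in all three documents, so their weight is 0.
print(w[0]["FAMOUS"], w[0]["COLOR"])  # → 0.0 0.0
```

The sample sentences and the exact formula are invented for illustration; only the qualitative behavior (ubiquitous keywords score zero, rare keywords score high) carries over to the product.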
2. Factors affecting TF-IDF Calculation: The keywords and their weights (TF-IDFs) for each document, derived in the previous step, are extracted and then fed as input to the Knowledge Base build process. Each keyword/weight pair is considered an independent predictor. Consider the following example, where the PROD_DESC field is treated as the text or document field. In this example, we build a Knowledge Base with a single-level hierarchy, such that PROD_CATEGORY is the target column.

PROD_ID: 44775
PROD_NAME: Vunelli Bb-956
PROD_LIST_PRICE: 79.95
PROD_DESC: this is the famous Vunelli Bb-956 in color 69.90 of size Bcbg 3/4 Length Cargo Trouser
PROD_CATEGORY: Women
PROD_SUBCATEGORY: Shoes Women

Actual TF-IDF Values: The keyword/weight (TF-IDF) pairs that Oracle Text computes from the
PROD_DESC field for this row are:
KEYWORD      VALUE
-----------  ---------
VUNELLI      .374162
956          .374162
BB           .374162
LENGTH       .346139
69.90        .346139
BCBG         .326257
4            .270213
CARGO        .270213
3            .256344
TROUSER      .148989
FAMOUS       0
SIZE         0
COLOR        0

Note that the words this, is, the, in and of present in the PROD_DESC column are not considered keywords because they are part of the Stop List. Notice also that the characters hyphen (-) and forward slash (/) are treated as keyword delimiters, so that the appearance of 3/4 in the document is translated into two different keywords, 3 and 4; similarly, Bb-956 is interpreted as two different keywords. The same is not true for the string 69.90, which is indexed as a single keyword. This behavior may or may not be desirable; you need to make the decision based on your business requirements. In this case, since Bb-956 is most likely a model number for the item and 3/4 appears to be a size indication, it is logical not to have these characters treated as keyword delimiters, while leaving the treatment of 69.90 (a color) as is. This can be achieved by specifying these two delimiter characters in the PRINTJOIN attribute of the Lexer that is passed to Oracle Text during indexing. Please refer to the Oracle Text documentation for more information on the Lexer attributes.

Effect on TF-IDF Values after modifying PRINTJOIN Lexer Attribute: Once the modification
to the PRINTJOIN attribute is complete, the following keyword/weight pairs are produced by Oracle
Text:
KEYWORD      VALUE
-----------  ---------
BB-956       .398528
VUNELLI      .398528
3/4          .398528
LENGTH       .398528
69.90        .368681
BCBG         .347504
CARGO        .287809
TROUSER      .158692
COLOR        0
SIZE         0
FAMOUS       0

As a word of caution, even though the hyphen (-) and forward slash (/) characters are no longer treated as keyword delimiters in this document, it may not be correct to assume that the same treatment suits all documents: the PRINTJOIN attribute passed to the Lexer affects every document that is indexed. For example, here are some other hyphenated keywords from the input dataset that would be affected by this delimiter change:

RELAXED-FIT
SATIN-TRIM
SCOOP-NECK
SEAM-DETAILED
SHADOW-PLAID
SHIRT-JACKET
SHORT-SLEEVE
SHORTS-4
STRAIGHT-COLLAR
SWEATER-2T

For most of these keywords, changing the hyphen character to a non-delimiter clearly seems appropriate. But for others, like SHIRT-JACKET, SHORTS-4 and SWEATER-2T, the correct treatment is not always clear.
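The effect of the delimiter change can be sketched with a toy tokenizer. This is a rough stand-in for the Lexer attribute, not Oracle Text's implementation; the function name and the regex approach are invented for illustration:

```python
import re

def keywords(text, printjoins=""):
    """Split text into uppercased keywords.

    Characters listed in `printjoins` are kept inside a token instead of
    acting as delimiters -- a rough stand-in for the Lexer PRINTJOIN
    attribute (the real Oracle Text Lexer is far more sophisticated).
    """
    # '.' is always kept inside tokens here so that 69.90 stays whole.
    token_chars = r"\w." + re.escape(printjoins)
    return [t.strip(".").upper() for t in re.findall(f"[{token_chars}]+", text)]

desc = "famous Vunelli Bb-956 3/4 Length Cargo Trouser in color 69.90"

print(keywords(desc))                   # BB-956 -> BB, 956 and 3/4 -> 3, 4
print(keywords(desc, printjoins="-/"))  # BB-956 and 3/4 survive intact
```

As in the paper's example, declaring `-` and `/` as join characters keeps BB-956 and 3/4 whole, while 69.90 is a single keyword either way.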

Impact of Adding Keywords having TF-IDF Value Zero to the Stop List: As mentioned earlier, there is a mathematical algorithm behind the assignment of TF-IDF values to each keyword. The fact that the keywords FAMOUS, SIZE and COLOR received a value of 0 indicates that these keywords have no discriminatory value between documents. To validate this, an examination of the input dataset reveals that all three of these keywords appear in 100% of the input documents. You might conclude that it would therefore be better to include these terms in the Stop List. The following keyword/value pairs result after this is done; notice that the TF-IDF values for the non-zero keywords do not change.
KEYWORD      VALUE
-----------  ---------
VUNELLI      .398528
3/4          .398528
BB-956       .398528
LENGTH       .398528
69.90        .368681
BCBG         .347504
CARGO        .287809
TROUSER      .158692

Not only do the assigned weight values remain unchanged; even if keywords with a TF-IDF value of zero are not removed and are passed to the SVM model build as predictors, the build process ignores them as non-attributes and does not assign them coefficients. Therefore, they do not affect the classification process. In cases where the text column in the input dataset is large enough that the number of resulting keywords approaches the MAX_DOCTERMS limit, it might make sense to either add these as stop words or increase the value of MAX_DOCTERMS.

The keywords and weights extracted from Oracle Text in this step, together with PROD_NAME and PROD_LIST_PRICE, are the mining attributes (or predictors) input to the Knowledge Base build operation. Up to this point, the values in the keyword/value pairs are specific to the individual document. Once the SVM model is built in the next step, the coefficients in the keyword/coefficient pairs represent global values that are used in a linear computation during prediction to compute the highest-probability target, or class. Each keyword has one corresponding coefficient for each of the predicted classes in which it was seen. In our example dataset, there are 4 different classes (PROD_CATEGORY):

CLASS
-------------------
Women
Men
Boys
Girls

3. Calculation of Global Keyword/Coefficient Pairs: The system extracts the keyword/coefficient pairs from the Knowledge Base and sets them aside in a separate details table for use by the apply operation. Notice that the keyword/coefficient pairs are now constant for this model and are used whenever a prediction operation is performed on a data item.

If we look in the details table at the keyword VUNELLI found in our example row, we see that a coefficient is present for only two classes, Women and Boys. Going back to the keyword/weights generated in the previous step for all input rows, it can be observed that the keyword VUNELLI was seen only in rows where the target class was Women or Boys. Another keyword, LENGTH, was seen for all target classes and is therefore represented by keyword/coefficient pairs for all target classes in the model details table.

SA$MODEL_ID  CLASS   KEYWORD  COEFFICIENT
-----------  ------  -------  -----------
1            Women   LENGTH   .096640336
1            Men     LENGTH   -.00717371
1            Girls   LENGTH   -.05882683
1            Boys    LENGTH   -.01990017

Classification process
In an apply operation, steps 1 and 2 of the previous section are repeated on the input dataset. However, the global keyword/coefficient pairs generated during the build process are reused rather than regenerated. The operation first generates the keyword/weight pairs (TF-IDF values) specific to each row. These are then fed into the data mining prediction operation, where a linear expression is calculated for each class by multiplying each attribute value by its corresponding coefficient. The class with the greatest value for the linear expression is considered the top prediction.
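The scoring rule can be sketched in a few lines. This is a hedged illustration of the linear computation, not the product's code: the bias terms and the coefficients for the Men class below are invented, while the row's TF-IDF values and the Women coefficients reuse numbers from the example in this paper:

```python
def predict(attributes, model):
    """Score each class as bias + sum(value * coefficient) over the row's
    attributes, and return the class with the greatest linear expression."""
    scores = {
        cls: bias + sum(v * coeffs.get(a, 0.0) for a, v in attributes.items())
        for cls, (bias, coeffs) in model.items()
    }
    return max(scores, key=scores.get), scores

# TF-IDF values for the row being classified (step 1 of the apply operation)
row = {"VUNELLI": 0.532484, "LENGTH": 0.492604, "CARGO": 0.38455}

# Per-class (bias, keyword/coefficient) sets standing in for the Knowledge
# Base details table; bias values and the Men coefficients are made up.
model = {
    "Women": (0.1, {"VUNELLI": 0.02028074, "LENGTH": 0.096640336,
                    "CARGO": 0.096394381}),
    "Men":   (0.1, {"LENGTH": -0.00717371, "CARGO": 0.01}),
}

cls, scores = predict(row, model)
print(cls)  # → Women: its linear expression has the greatest value
```

Keywords a class never saw during training simply contribute 0 to that class's score, which is why the details table holds no coefficient for them.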
Now consider the apply operation on the row shown below:
PROD_ID: 45
PROD_NAME: Vunelli Bb-956
PROD_LIST_PRICE: 79.95
PROD_DESC: famous Vunelli 3/4 Length Cargo Trouser
PROD_CATEGORY: Women
PROD_SUBCATEGORY: Shoes Women

The predicted class in this case is Women; the keyword coefficients in the keyword mapping table are therefore generated as follows:

1. Assign TF-IDF Values: The keyword/weights generated are:

KEYWORD      VALUE
-----------  ---------
VUNELLI      .532484
LENGTH       .492604
4            .38455
CARGO        .38455
3            .364813
TROUSER      .212032
FAMOUS       0

2. Global Coefficient Values and Related Keywords: Since Women is the predicted class, the global coefficient values for the keywords are read from the Knowledge Base details table for this class. These are shown below:

SA$MODEL_ID  CLASS  KEYWORD  COEFFICIENT
-----------  -----  -------  -----------
1            Women  3        -.12492003
1            Women  4        -.10921244
1            Women  CARGO    .096394381
1            Women  LENGTH   .096640336
1            Women  TROUSER  .154780699
1            Women  VUNELLI  .02028074

3. Calculate and Rank Multiplication Products: The keyword weight (TF-IDF) from text (step 1) is multiplied by its corresponding coefficient (step 2) from the Knowledge Base for the predicted class, and the multiplication products are ranked in descending order, with only the top 5 (the default) retained.

SA$MODEL_ID  SA$CASE_ID  CLASS  KEYWORD  MULTIPLICATION PRODUCT  RNK
-----------  ----------  -----  -------  ----------------------  ---
1            45          Women  LENGTH   .047605416              1
1            45          Women  CARGO    .037068459              2
1            45          Women  TROUSER  .032818461              3
1            45          Women  VUNELLI  .01079917               4
1            45          Women  4        -.04199765              5

4. Calculate Normalization Number: The sum of the absolute values of the top 5 multiplication products is calculated as a normalization number. The result is:

SA$CASE_ID  CLASS  SUM_COEFFICIENTS
----------  -----  ----------------
45          Women  .170289152

5. Calculate Normalized Keyword Coefficient Values: Finally, each multiplication product from step 3 is divided by the normalization number to produce the final keyword weighting coefficient. These will not necessarily sum to 1, because the sign of each multiplication product is retained, but their absolute values will sum to 1. So in the example, the final keyword coefficients are:

SA$CASE_ID  TLEVEL  PREDICTION  KEYWORD  COEFFICIENT
----------  ------  ----------  -------  -----------
45          0       Women       LENGTH   .279556364
45          0       Women       CARGO    .217679511
45          0       Women       TROUSER  .192721972
45          0       Women       VUNELLI  .063416663
45          0       Women       4        -.24662549
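Steps 3 through 5 above can be reproduced with a short sketch. The TF-IDF values and coefficients below are taken from the worked example in this paper; the function itself is an illustration, not Oracle's implementation:

```python
def keyword_contributions(tfidf, coeffs, top_n=5):
    """Multiply each keyword's TF-IDF by its coefficient for the predicted
    class (step 3), keep the top `top_n` products by value, then divide by
    the sum of their absolute values (steps 4-5) so that the magnitudes of
    the final coefficients sum to 1 while signs are retained."""
    products = {k: tfidf[k] * coeffs.get(k, 0.0) for k in tfidf}
    top = sorted(products.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    norm = sum(abs(p) for _, p in top)
    return [(k, p / norm) for k, p in top]

# Keyword weights and Women-class coefficients from the example row
tfidf = {"VUNELLI": .532484, "LENGTH": .492604, "4": .38455,
         "CARGO": .38455, "3": .364813, "TROUSER": .212032}
coeffs = {"3": -.12492003, "4": -.10921244, "CARGO": .096394381,
          "LENGTH": .096640336, "TROUSER": .154780699, "VUNELLI": .02028074}

for kw, c in keyword_contributions(tfidf, coeffs):
    print(f"{kw:8s} {c: .6f}")
```

Running this ranks LENGTH, CARGO, TROUSER, VUNELLI and 4 (dropping 3, whose product is the most negative) and yields the normalized coefficients shown in the table above, to within floating-point rounding.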

Appendix B: Frequently Asked Questions

1. Why do some keywords have negative or zero weightage? What do these weightages signify? Should you remove such keywords?

Oracle Text does not assign negative TF-IDF values to keywords, only values greater than or equal to 0. A value of 0 typically means that the keyword is not useful for distinguishing between documents, so it may be considered a candidate for the Stop List. However, the fact that a keyword's weight is 0 is, to a certain extent, informative in that it can be concluded that the keyword did not affect the class prediction.
Negative coefficients, on the other hand, are introduced by the build operation. During prediction, all the attribute/value pairs (TF-IDFs) for the document are multiplied by their corresponding global linear coefficients from the Knowledge Base and summed up. This is done for each target class in the model, and the target class with the greatest sum is taken as the predicted class. A predictor that has a negative linear coefficient for a particular target class in the Knowledge Base can be interpreted as an attribute that, when present in a transaction being classified, reduces the chances of that transaction being predicted as the target class.

As an example, consider a Knowledge Base that is built to predict the biological classification of an animal. Suppose our taxonomy (target classes) consists of mammals, reptiles and birds. For illustration purposes, consider just 3 predictors: "has 2 legs", "is warm blooded", and "has feathers". In this case, for the mammal class, "has 2 legs" and "is warm blooded" would carry positive weightage, but "has feathers" would carry negative weightage. That is, for a new animal being classified, a positive value for "has feathers" works against the conclusion, or prediction, that the animal is a mammal.

The above example shows a clear case for a negative coefficient. In text data, however, negative coefficients can also arise from the interaction between different keywords in documents where two or more keywords are correlated. In such situations, it is more difficult to analyze why a coefficient is negative.
2. Can a keyword have different weightages in the context of different transactions?

The keyword weightages are the product of the global SVM coefficients (which are constant) and the TF-IDF values assigned by Oracle Text (which are transaction-specific), so they will almost always differ between transactions. Additionally, the coefficients in the keyword map are normalized, which can also lead to different weights across transactions.
3. Can you change the number of keywords to be considered by the application for build and apply operations?

The application allows you to specify MAX_DOCTERMS and MAX_FEATURES in the options parameter passed to the build procedure. MAX_DOCTERMS is the upper limit on the number of keywords that can represent one document; MAX_FEATURES is the upper limit on the number of distinct keywords for the corpus of documents. Oracle Spend Classification uses the defaults (30, 5000). The lower number for MAX_DOCTERMS was chosen because the description field (document) for a spend item is typically much smaller than a multi-page literary document. The larger MAX_FEATURES was chosen to provide a much richer set of unique keywords, as could be expected across spend item descriptions; it also allows a larger training set to be provided to the build process.
4. How do the calculations work out in the case of a multi-level Taxonomy?

The linear equations in an SVM model (or Knowledge Base) are not really equations, but rather sets of attribute/coefficient pairs plus a bias factor, with one set per target class. The prediction function assembles the linear equation by taking the attribute/values of the incoming row, matching them with the attribute/coefficients in the model, and calculating the sum of the multiplication products for each attribute present in the row, once for each target class. The target class with the largest final value is the predicted class.

Also, each SVM model has its own set of coefficients. In the Oracle Spend Classification hierarchical model, there is an SVM model at each node in the tree (Taxonomy), and that model predicts the class at that level. Based on the predicted class, the tree is descended to the child node for that class, where there is another SVM model. This is repeated until a leaf node is reached.
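The descent through the taxonomy can be sketched as follows. The tree structure, the lambda "models" and the class names are invented for illustration; in the product each node holds a full SVM model rather than a toy rule:

```python
def classify(row, node):
    """Walk the taxonomy: score the model at each node, then descend into
    the child node for the predicted class until a leaf is reached."""
    path = []
    while node is not None:
        cls = node["model"](row)          # per-node model predicts this level
        path.append(cls)
        node = node["children"].get(cls)  # None at a leaf: stop descending
    return path

# A toy two-level taxonomy: the root predicts the category, each child
# predicts the sub-category within that category.
tree = {
    "model": lambda r: "Women" if "TROUSER" in r else "Men",
    "children": {
        "Women": {"model": lambda r: "Trousers - Women", "children": {}},
        "Men":   {"model": lambda r: "Shirts - Men", "children": {}},
    },
}

print(classify({"TROUSER": 0.21}, tree))  # → ['Women', 'Trousers - Women']
```

The key point this illustrates is that each level's prediction selects which child model is consulted next, so a misprediction at the root level cannot be corrected lower down.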
5. Can there be scenarios where no keywords in a document get any weightage? How are such situations handled?

The only scenarios in which there could be no keywords associated with a classification are those where the item description is null or contains only words that are in the Stop List. When no keywords are generated for a particular spend item, the prediction is made using the non-text attribute/values. Where a non-text attribute value is also null for the given spend item, the scoring function uses a missing value replacement mechanism instead: for numerical data types, the replacement is the mean of all values for that attribute seen during training; for categorical data types, it is the mode (the most frequently seen value) during training. Missing value replacement is not done for nested types, which is how the document keywords are represented; thus, the absence of keywords implies that the prediction will be made solely on the non-text attributes (Line Amount in our case).
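The mean/mode replacement rule can be sketched in a few lines. The column names and helper functions are hypothetical; this only illustrates the statistics described above, not the scoring function's internals:

```python
from statistics import mean, mode

def fit_replacements(training_rows):
    """Record the mean of each numeric attribute and the mode of each
    categorical attribute seen during training."""
    cols = {}
    for row in training_rows:
        for k, v in row.items():
            cols.setdefault(k, []).append(v)
    return {k: mean(v) if isinstance(v[0], (int, float)) else mode(v)
            for k, v in cols.items()}

def impute(row, replacements):
    # Fill any attribute that is missing (None) with its training-time replacement.
    return {k: (row.get(k) if row.get(k) is not None else r)
            for k, r in replacements.items()}

training = [
    {"LINE_AMOUNT": 100.0, "UOM": "Each"},
    {"LINE_AMOUNT": 300.0, "UOM": "Each"},
    {"LINE_AMOUNT": 200.0, "UOM": "Box"},
]
repl = fit_replacements(training)
print(impute({"LINE_AMOUNT": None, "UOM": None}, repl))
# → {'LINE_AMOUNT': 200.0, 'UOM': 'Each'}
```

The numeric attribute falls back to the training mean (200.0) and the categorical one to the training mode ("Each"), mirroring the behavior described for the scoring function.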

6. What is a measurable method to fine-tune the Knowledge Base?

To fine-tune your Knowledge Base, take a training dataset with populated categories and run a Spend Classification TRAIN/TEST operation by specifying (SPLIT, TEST) in the build options. This operation first performs a stratified split of the dataset, allocating 60% to train and 40% to test. The system then builds the Knowledge Base using the training subset and runs an apply operation on the test subset. Ensure that you specify a value for the test_result_ds parameter to build, because the test results are stored there. When the operation completes, examine the test result table. The contents of the test result table for a sample run (for data with a 2-level Taxonomy) are listed below.

The class values where TLEVEL=0 are the PROD_CATEGORY predictions (the Parent Category). The class values below these, where TLEVEL=1, are the PROD_SUBCATEGORY predictions (the Sub Category). The Training Dataset was populated with the correct category codes for the Parent as well as the Sub Category under each Parent. The TEST operation first performs an apply operation and then cross-checks the predictions in the apply results table against the actual values in the test dataset; this check is how the CORRECT and INCORRECT numbers are calculated. From line 1, we know that in the test dataset there were 61 cases where the Parent Category was Boys, and the apply operation got 60 correct and 1 incorrect. The ID column is the model id; hence the rows for ID=1 are the root-level predictions on PROD_CATEGORY (which is why the FILTER column is NULL). The rows for ID=2, 3, 4 and 5 are second-level predictions, one for each of the 4 classes from the root level. The more underrepresented a particular class is in the training data, the lower the percentage of correct predictions. If you look at the cases where the percentage is zero or low, you observe that the total number of occurrences being scored is also low. Because the input dataset had a stratified split (meaning the split attempts to ensure that a proportionately equal number of each target class ends up in the training dataset as in the test dataset), you can surmise that these sub-classes are underrepresented in the training dataset.
ID  PARENT  FILTER                      TLEVEL  CLASS                       CORRECT  INCORRECT  PERCENT
--  ------  --------------------------  ------  --------------------------  -------  ---------  -------
1                                       0       Boys                        60       1          98.4
1                                       0       Men                         81       6          93.1
1                                       0       Women                       101      9          91.8
1                                       0       Girls                       34       6          85
2   1       "SA$PROD_CATEGORY"='Boys'   1       Shorts - Boys               9        0          100
2   1       "SA$PROD_CATEGORY"='Boys'   1       Outerwear - Boys            10       0          100
2   1       "SA$PROD_CATEGORY"='Boys'   1       Sleepwear - Boys            8        0          100
2   1       "SA$PROD_CATEGORY"='Boys'   1       Shirts - Boys               8        0          100
2   1       "SA$PROD_CATEGORY"='Boys'   1       Trousers And Jeans - Boys   10       0          100
2   1       "SA$PROD_CATEGORY"='Boys'   1       Shoes - Boys                6        0          100
2   1       "SA$PROD_CATEGORY"='Boys'   1       Underwear - Boys            6        0          100
2   1       "SA$PROD_CATEGORY"='Boys'   1       Sweaters - Boys             2        1          66.7
2   1       "SA$PROD_CATEGORY"='Boys'   1       Casual Shirts - Men         0        1          0
2   1       "SA$PROD_CATEGORY"='Boys'   1       Outerwear - Men             0        1          0
2   1       "SA$PROD_CATEGORY"='Boys'   1       Shirts - Girls              0        1          0
2   1       "SA$PROD_CATEGORY"='Boys'   1       Outerwear - Girls           0        2          0
2   1       "SA$PROD_CATEGORY"='Boys'   1       Sleepwear - Girls           0        2          0
2   1       "SA$PROD_CATEGORY"='Boys'   1       Trousers - Men              0        2          0
3   1       "SA$PROD_CATEGORY"='Girls'  1       Outerwear - Girls           1        0          100
3   1       "SA$PROD_CATEGORY"='Girls'  1       Trousers And Jeans - Girls  4        0          100
3   1       "SA$PROD_CATEGORY"='Girls'  1       Dresses - Girls             5        0          100
3   1       "SA$PROD_CATEGORY"='Girls'  1       Shorts - Girls              5        0          100
3   1       "SA$PROD_CATEGORY"='Girls'  1       Skirts - Girls              6        0          100
3   1       "SA$PROD_CATEGORY"='Girls'  1       Sleepwear - Girls           3        1          75
3   1       "SA$PROD_CATEGORY"='Girls'  1       Shirts - Girls              2        1          66.7
3   1       "SA$PROD_CATEGORY"='Girls'  1       Underwear - Girls           4        2          66.7
3   1       "SA$PROD_CATEGORY"='Girls'  1       Shirts And Jackets - Women  0        1          0
4   1       "SA$PROD_CATEGORY"='Men'    1       Casual Shirts - Men         5        0          100
4   1       "SA$PROD_CATEGORY"='Men'    1       Dress Shirts - Men          6        0          100
4   1       "SA$PROD_CATEGORY"='Men'    1       Sportcoats - Men            4        0          100
4   1       "SA$PROD_CATEGORY"='Men'    1       Shorts - Men                6        0          100
4   1       "SA$PROD_CATEGORY"='Men'    1       Trousers - Men              22       0          100
4   1       "SA$PROD_CATEGORY"='Men'    1       Jeans - Men                 12       0          100
4   1       "SA$PROD_CATEGORY"='Men'    1       Outerwear - Men             11       1          91.7
4   1       "SA$PROD_CATEGORY"='Men'    1       Sweaters - Men              7        1          87.5
4   1       "SA$PROD_CATEGORY"='Men'    1       Underwear And Socks - Men   5        1          83.3
4   1       "SA$PROD_CATEGORY"='Men'    1       Shirts And Jackets - Women  0        2          0
4   1       "SA$PROD_CATEGORY"='Men'    1       Shorts - Boys               0        1          0
4   1       "SA$PROD_CATEGORY"='Men'    1       Shoes - Women               0        6          0
5   1       "SA$PROD_CATEGORY"='Women'  1       Sweaters - Women            1        0          100
5   1       "SA$PROD_CATEGORY"='Women'  1       Knit Outfits - Women        12       0          100
5   1       "SA$PROD_CATEGORY"='Women'  1       Tees - Women                9        1          90
5   1       "SA$PROD_CATEGORY"='Women'  1       Dresses - Women             16       2          88.9
5   1       "SA$PROD_CATEGORY"='Women'  1       Shoes - Women               14       2          87.5
5   1       "SA$PROD_CATEGORY"='Women'  1       Skirts And Shorts - Women   5        1          83.3
5   1       "SA$PROD_CATEGORY"='Women'  1       Shirts And Jackets - Women  9        2          81.8
5   1       "SA$PROD_CATEGORY"='Women'  1       Easy Shapes - Women         8        2          80
5   1       "SA$PROD_CATEGORY"='Women'  1       Trousers - Women            12       4          75
5   1       "SA$PROD_CATEGORY"='Women'  1       Dresses - Girls             0        1          0
5   1       "SA$PROD_CATEGORY"='Women'  1       Outerwear - Women           0        1          0
5   1       "SA$PROD_CATEGORY"='Women'  1       Jeans - Men                 0        2          0
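The way the CORRECT, INCORRECT and PERCENT columns arise can be reconstructed with a short sketch. The function and the toy actual/predicted vectors are invented for illustration (chosen so the Boys and Girls rows match the table above); the product computes these numbers internally during the TEST operation:

```python
from collections import defaultdict

def per_class_accuracy(actual, predicted):
    """Tally correct/incorrect counts and a percentage per actual class, as
    the TEST operation does when cross-checking apply-operation predictions
    against the actual categories in the test subset."""
    correct = defaultdict(int)
    incorrect = defaultdict(int)
    for a, p in zip(actual, predicted):
        if a == p:
            correct[a] += 1
        else:
            incorrect[a] += 1
    report = {}
    for cls in set(actual):
        total = correct[cls] + incorrect[cls]
        report[cls] = (correct[cls], incorrect[cls],
                       round(100 * correct[cls] / total, 1))
    return report

# 61 Boys cases (60 predicted correctly) and 40 Girls cases (34 correct)
actual    = ["Boys"] * 61 + ["Girls"] * 40
predicted = ["Boys"] * 60 + ["Men"] + ["Girls"] * 34 + ["Women"] * 6

print(per_class_accuracy(actual, predicted))
```

This reproduces the Boys root-level row (60 correct, 1 incorrect, 98.4%) and the Girls row (34 correct, 6 incorrect, 85%) from the sample output.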

Best Practices for Creating Spend Classification Knowledge Base
May 2014
Author: Amit Jha
Contributing Authors: Mark McCraken and Sandeep Sood

Oracle Corporation
World Headquarters
500 Oracle Parkway
Redwood Shores, CA 94065
U.S.A.

Worldwide Inquiries:
Phone: +1.650.506.7000
Fax: +1.650.506.7200
oracle.com

Copyright © 2014, Oracle and/or its affiliates. All rights reserved.

This document is provided for information purposes only, and the contents hereof are subject to change without notice. This document is not warranted to be error-free, nor subject to any other warranties or conditions, whether expressed orally or implied in law, including implied warranties and conditions of merchantability or fitness for a particular purpose. We specifically disclaim any liability with respect to this document, and no contractual obligations are formed either directly or indirectly by this document. This document may not be reproduced or transmitted in any form or by any means, electronic or mechanical, for any purpose, without our prior written permission.

Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. AMD, Opteron, the AMD logo, and the AMD Opteron logo are trademarks or registered trademarks of Advanced Micro Devices. UNIX is a registered trademark of The Open Group. 0114