
Text Classification: When Not to Use Machine Learning

Machine learning is a great approach for many text classification problems: for example, classifying an email as spam or not spam based on its textual content.
The following is not one of them.
Consider the problem of classifying a job title into one of the ranks C-level, VP-level, Director-level, Manager-level, or Staff. Below are some examples of job titles and the ranks we'd like them classified to.

Chief Information Officer → C-level
Vice President → VP-level
Admin to Chief Information Officer → Staff
E-Commerce Project Manager → Manager-level
Director, Software Engineering → Director-level
General Manager → VP-level
Assistant to Vice President → Staff
Assistant Vice President → VP-level
At first glance, one might be tempted to use keywords/regular
expressions mapped to ranks for this purpose. For example, if the title
contained the phrase chief <word> officer, we might rank it C-level.
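As a minimal sketch of this idea (the patterns and their priority order are only illustrative, not from any real system), such rules might look like:

```python
import re

# Illustrative keyword/regex rules mapped to ranks, tried in order.
RULES = [
    (re.compile(r"\bchief \w+ officer\b", re.I), "C-level"),
    (re.compile(r"\bvice president\b", re.I), "VP-level"),
    (re.compile(r"\bdirector\b", re.I), "Director-level"),
    (re.compile(r"\bmanager\b", re.I), "Manager-level"),
]

def rank(title):
    for pattern, level in RULES:
        if pattern.search(title):
            return level
    return "Staff"  # fallback rank

print(rank("Chief Information Officer"))           # C-level
print(rank("Admin to Chief Information Officer"))  # also C-level: wrong!
```

Notice that this first cut already misclassifies Admin to Chief Information Officer, one of the subtleties in the examples above.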
On further thought, after noting some subtleties witnessed in the examples above, we might prefer a more sophisticated approach. If we are (or aspire to be) data scientists, we might imagine that machine learning would work better for this problem. Specifically: use a training set of job titles classified to ranks to automatically learn, via machine learning, to classify a title to its appropriate rank.
Sounds very appealing.
As we'll see below, this doesn't work out as well as we might imagine.

A Machine Learning Solution

For a machine learning solution to this problem, one needs the following:
1. A training set of (job title, rank) pairs
2. Features to be extracted from a title that the machine learning algorithm will use

We might choose, for example, every word and every two-word phrase in the title to be a feature. Below are some examples of job titles and their word-level and two-word-level features.
General Manager: general, manager, general manager
Director, Software Engineering: director, software, engineering, director software, software engineering
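Concretely, extracting these features might look like the following sketch (the tokenization choices here, lowercasing and dropping commas, are assumptions, not from the post):

```python
def features(title):
    # Word-level (unigram) and two-word-level (bigram) features.
    words = title.lower().replace(",", "").split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams

print(features("Director, Software Engineering"))
# ['director', 'software', 'engineering',
#  'director software', 'software engineering']
```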

The difficulty with this approach is that, unless the training set is very large and sufficiently diverse, a machine learning solution can significantly overfit it.
The term overfit means the learned model fits the training set but does not work adequately on titles not seen during training.
Below is a simple example that illustrates this. Imagine that the training set has an entry chief medical officer → C-level. Also imagine that no other title in the training set has the word medical in it. In view of this, a machine learning algorithm is likely to learn the association that medical predicts C-level, which is clearly wrong.
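To see this concretely, here is a hedged sketch using scikit-learn (the tiny training set is made up purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A made-up, tiny training set: "medical" occurs only in a C-level title.
titles = ["chief medical officer", "vice president",
          "project manager", "assistant to vice president"]
ranks = ["C-level", "VP-level", "Manager-level", "Staff"]

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(titles, ranks)

# The only feature of this title the model has ever seen is "medical",
# which co-occurred only with C-level, so it predicts C-level: wrong.
print(model.predict(["medical records clerk"]))  # ['C-level']
```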
Why is this happening? We are expecting the machine learning algorithm to automatically figure out which words, and which two-word phrases, predict specific ranks and which don't. Hundreds of thousands of different words can occur in the imagined universe of titles. (The contacts database at Data.com has more than ten million distinct titles.) The number of possible two-word phrases is that squared. For the machine learning solution to automatically discover which of these words and two-word phrases predict specific ranks requires a very large training set.
Can we alleviate this issue by limiting our features to words? The reasoning: limiting features to individual words drastically reduces the universe of feature values, so a significantly smaller training set suffices to learn associations between words and ranks.
Yes, but we pay a price for it in reduced accuracy. Certain two-word phrases, for instance vice president, predict ranks more accurately than the independent combination of the words in them (president predicts C-level; vice by itself does not strongly predict VP-level).
Moreover, the number of distinct words in the universe of titles is still rather large, so the requisite training set will still remain large.
If a very large training set is available, great. If not, as is often the case, what to do? Let's revisit the keyword → rank rule-based approach.

A Rules-Based Solution

Consider the rule

manager → Manager-level

Interpret this as: if the title contains the word manager, classify it to the rank Manager-level.
This single rule correctly classifies most (but not all) titles that contain the word manager. In the parlance of machine learning, this one rule generalizes massively (albeit not perfectly).
To improve on this, the following mechanism helps:

If two rules fire on a particular title, and the antecedent of one is a subphrase of the antecedent of the other, the more specific rule overrides the more general one.
Let's see an example. Add the following rule:

general manager → VP-level

Consider the title General Manager, data.com. Both rules fire on this title. The general manager rule wins because manager is a subphrase of general manager. This results in the title getting classified to VP-level.
How do we ensure that Assistant to Vice President gets classified to Staff, whereas Assistant Vice President gets classified to VP-level?
We need to add a simple mechanism: a numeric strength for each rule. To illustrate this, imagine that the rule set is as follows:
1. assistant → Staff (strength 1)
2. vice president → VP-level (strength 2)
3. assistant to → Staff (strength 3)

Consider the title Assistant to Vice President. Rules 1, 2, and 3 all fire on this title. Rule 3 overrides rule 1. Next, rule 3 predicts Staff more strongly than rule 2 predicts VP-level, so Staff wins. Next, consider the title Assistant Vice President. Rules 1 and 2 fire. Rule 2 predicts VP-level more strongly than rule 1 predicts Staff, so VP-level wins.
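Putting the pieces together, here is a minimal sketch of such a rule engine (the two manager rules and their strengths are illustrative assumptions, added to cover the earlier example):

```python
import re

# Each rule: (antecedent phrase, rank, strength).
RULES = [
    ("assistant", "Staff", 1),
    ("vice president", "VP-level", 2),
    ("assistant to", "Staff", 3),
    ("manager", "Manager-level", 1),     # illustrative strength
    ("general manager", "VP-level", 2),  # illustrative strength
]

def contains_phrase(text, phrase):
    # Word-boundary match, so "manager" does not fire on "managerial".
    return re.search(r"\b" + re.escape(phrase) + r"\b", text) is not None

def classify(title):
    title = title.lower()
    fired = [r for r in RULES if contains_phrase(title, r[0])]
    # Subphrase override: discard a fired rule if another fired rule's
    # antecedent contains its antecedent as a subphrase.
    survivors = [r for r in fired
                 if not any(s is not r and contains_phrase(s[0], r[0])
                            for s in fired)]
    if not survivors:
        return None
    # Among the surviving rules, the strongest prediction wins.
    return max(survivors, key=lambda r: r[2])[1]

print(classify("Assistant to Vice President"))  # Staff
print(classify("Assistant Vice President"))     # VP-level
print(classify("General Manager, data.com"))    # VP-level
```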
It turns out that by hand-crafting a couple of hundred such rules, one can achieve high classification accuracy on a large test set of titles.
Sure, hand-crafting a few hundred rules takes work. Putting together a training set of tens to hundreds of thousands of titles pre-classified to ranks might take a whole lot more work.
Combining Rules and Machine Learning
The rules-based approach gives us massive generalization from a small set of rules. However, it doesn't automatically learn from its mistakes. If feedback is expected to arrive continually (even if at a low rate), automated learning from such feedback to improve classification accuracy is very attractive. The alternative of manually adjusting the rules from such feedback is more laborious, and injects humans into the loop. (Humans are intelligent, but don't scale.)
So a sensible combination would be to use the rule-based approach to
quickly get a decent classifier off the ground; then use machine learning
to automatically adjust the rules from feedback.

For instance, machine learning can be used to automatically adjust the strengths of the various rules from feedback.
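As one hedged sketch of what this might look like (the simple additive update rule is an assumption, not the post's method), strengths could be nudged whenever a human supplies the correct rank for a title:

```python
import re

def fired(rules, title):
    # Rules whose antecedent occurs (word-bounded) in the title.
    t = title.lower()
    return [r for r in rules
            if re.search(r"\b" + re.escape(r["phrase"]) + r"\b", t)]

def learn_from_feedback(rules, title, correct_rank, step=0.1):
    # Strengthen fired rules that agree with the corrected rank,
    # weaken fired rules that disagree.
    for rule in fired(rules, title):
        if rule["rank"] == correct_rank:
            rule["strength"] += step
        else:
            rule["strength"] = max(0.0, rule["strength"] - step)

rules = [
    {"phrase": "assistant", "rank": "Staff", "strength": 1.0},
    {"phrase": "vice president", "rank": "VP-level", "strength": 2.0},
]

# A human confirms the correct rank for a title; the strengths shift.
learn_from_feedback(rules, "Assistant Vice President", "VP-level")
print(rules)
```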
How do you solve such issues? We'd love to hear from you.
If you're interested in these sorts of problems, Salesforce is hiring!
Visit http://www.salesforce.com/tech to find your #dreamjob.
