
Word Syllabification with Linear-Chain Conditional Random Fields

CSE 250B, Winter 2013, Project 2

Clifford Champion
Computer Science and Engineering UC San Diego, CA 92037 cchampio@cs.ucsd.edu

Malathi Raghavan
Computer Science and Engineering UC San Diego, CA 92037 m1raghav@eng.ucsd.edu

Abstract
In this paper we consider the application of machine learning to orthographic syllabification of English words. We detail our choice of linear-chain conditional random fields for this task, and our choice of specific feature function classes. For numerical solutions we employ two gradient-based optimization solvers, stochastic gradient ascent and the closely related Collins Perceptron, and explore parameter values for regularization and learning rate. We apply our methods to a labeled data set of English words and report our challenges and findings.

1 Introduction

Automatic orthographic syllabification is an important and interesting problem. It is important because of its direct use in column-boundary hyphenation for print communications. Hyphenation used in this way, when combined with other print techniques such as justification and anti-river algorithms, helps immensely in readability, thus enabling information to be consumed more efficiently. It is an interesting problem because from the outside it appears non-trivial, yet possibly tractable, to apply machine learning successfully. The reason it is non-trivial is that, as with most natural languages, English word formation is neither perfectly regular nor constant over time, and the same holds for the rules for choosing the most appropriate place to insert hyphenation. Lastly, one must either inspect the word on a letter-by-letter basis, or rely on higher-level knowledge such as phonetics, etymology, or other aspects of natural language, and neither approach is trivial in scope. In this paper we limit our consideration to letter-based learning. The length of words and the number of possible characters are quite large. For a very rough sense of the number of possible, sensical (present or future) English words, consider the number of six-letter words formed from 26 possible letters without repetition, which amounts to
$$\frac{26!}{(26-6)!} = 26 \cdot 25 \cdot 24 \cdot 23 \cdot 22 \cdot 21 \approx 165.7 \text{ million}$$

possible words, and of course not all words are precisely six letters long. One key point here is that we cannot train on future words; thus it is important that, given the limited number of current words available to train on, the knowledge acquired by the learning algorithm be portable enough to apply to unseen and future words.

1.1 Observable Characteristics of Word Formation and Spelling

Nevertheless, tractability seems possible for a number of reasons. Firstly, many English words are actually compounds of Latin, Greek, and Germanic roots and affixes, thus the entropy of the true space of words is much smaller. Secondly and similarly, many English words are merely inflections of a base word (e.g. walk versus walked), and these inflections often form with regularity in spelling and pronunciation. Thirdly, given the nature of vowels and consonants, there are certain vowel pairs and consonant pairs that are unlikely to be divided by syllabification. Fourthly and finally, there is a degree of correlation (to the untrained eye) between word formation and both spelling and syllabification. For example, consider nonplussed with Latin roots non and plus(s), and English inflectional suffix ed, which has a standard syllabification of non - plussed. The last two points above give us the most cause for hope in our purely letter-based approach for finding accurate and tractable automatic syllabification.

1.2 Choosing the Right Model

Because most words are formed from smaller units (one or more roots, prefixes, suffixes), and because each unit tends not to affect the others in the word, there is a locality of influence among contiguous letter groups. For instance, consider in - sti - tute and in - sti - tu - tion. The hyphenation for the letter group institu is unchanged between these two words, which is to say, the change of -te to -tion had no effect on the hyphenation for the first half of the words. Thus, because of the strong isolation between letter groups, a linear-chain graphical model is believed to be appropriate here. Further, we can only expect to train a model correctly if our training examples are fully labeled, thus we also need a conditional probability model. The two obvious choices are a directed HMM model, or an undirected random field model. The HMM model probably has more complexity than is necessary, and other work has already shown strong results using an undirected model [1]. Thus we choose a linear-chain conditional random field (CRF) with the assumption that for our goals it is both necessary and sufficient.

1.3 Applying Linear-Chain CRFs to Letter Tagging

Conceptually, the inputs and outputs of a fully-trained linear-chain CRF are simple. For an input word $x$ of length $n$, our output should be a tag sequence $y$, also of length $n$, representing the most appropriate syllabification for $x$. Note that our tag sequence $y$ is also called our label of $x$. Each tag position in $y$ corresponds naturally to a position in $x$. Tag encoding mechanisms are described in further detail later in this paper. As we will see later in more detail, querying our linear-chain CRF for the most likely $y$ for a given $x$ will mean constructing $y$ tag by tag in sequence. The likelihood of a certain next tag in the sequence should depend at most on its neighboring tags (this is the linear-chain restriction) and on $x$ in some way; otherwise it is not linear-chain. Thus learning the best linear-chain CRF means learning the best predictors for a next tag $y_i$ given a previous tag $y_{i-1}$ (and $x$). Beyond the linear-chain restriction on the structure of predicting $y$, we introduce a further restriction on how wide our search net is within the elements of input $x$. Again by assuming locality of influence between letters and tags, we assume the next tag $y_i$ in the construction of $y$ is influenced only by the previous letter $x_{i-1}$ in $x$, and by zero or more following letters (we designate the total number of considered elements $c$, short for context size). We explore different values (and combinations) of $c$ later in our experiments, basically amounting to using different sized n-grams (c-grams) to help choose the next tag.
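To make the context-window idea concrete, the following minimal Java sketch (illustrative only; `cgramAt` is a hypothetical helper, and positions are 0-based here rather than the 1-based positions used in the text) extracts the c letters of x, starting at the letter preceding position i, that a feature function is allowed to inspect.

```java
// Minimal sketch (not our production code): extract the c-gram of x that is
// aligned with position i-1, i.e. the letters a feature function may inspect
// when scoring the transition from tag y_{i-1} to tag y_i.
public class CgramExample {

    // Returns the c letters of x starting at position i-1 (0-based),
    // or null if the window would run past the end of the word.
    static String cgramAt(String x, int i, int c) {
        int start = i - 1;
        if (start < 0 || start + c > x.length()) {
            return null;
        }
        return x.substring(start, start + c);
    }

    public static void main(String[] args) {
        String word = "golf";
        // With c = 2, position i = 2 looks at the letters "ol".
        System.out.println(cgramAt(word, 2, 2)); // prints "ol"
        // With c = 3, position i = 1 looks at the letters "gol".
        System.out.println(cgramAt(word, 1, 3)); // prints "gol"
    }
}
```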

1.4 As a Linear Model

General linear models are simple and effective [2], and a log-linear model in particular is appropriate and intuitive for representing our linear-chain CRF. There are two important and equivalent ways of looking at the log-linear model representation here: through the lens of linear-chain CRFs, and as a simple conditional likelihood. Through the lens of linear-chain CRFs, our model is composed of many feature functions $f_j(y_{i-1}, y_i, x, i)$ that quantify relationships between letters and tagging; many weights, stored as a vector $w$, that quantify the relative importance of different feature functions; and finally a score function $g_i(y_{i-1}, y_i)$ that is simply the following weighted sum.

$$g_i(y_{i-1}, y_i \mid x; w) = \sum_{j=1}^{J} w_j \, f_j(y_{i-1}, y_i, x, i) \qquad \text{for letter position } i$$

Intuitively, this linear equation gives us a score of how likely a tag value ($y_i$) at letter position $i$ is to follow the previous tag value ($y_{i-1}$) at position $i-1$. It should be noted that the above notation for $f_j$ is actually more general than what our c-gram based approach needs (as discussed in section 1.3). A more precise form would be $f_j(y_{i-1}, y_i, x_{i-1}, \ldots, x_{i-1+c(j)-1})$, to remind us that we are only considering c-grams taken from $x$ and aligned with position $i-1$ ($c$ is a function of $j$ in this form, since we may use different size c-grams for different subsets of our feature function set). We will continue to use the first form given, for consistency with existing literature.
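The following Java sketch shows this weighted sum directly. It is a minimal illustration, not our implementation: `FeatureFunction` is an assumed interface, tags are encoded as ints, and the toy feature in `main` is invented purely for the example.

```java
import java.util.List;

// Minimal sketch of the score g_i(y_{i-1}, y_i) = sum_j w_j * f_j(y_{i-1}, y_i, x, i).
// The FeatureFunction interface and the int encoding of tags are assumptions for
// illustration; the real feature set is described in section 3.2.
public class ScoreExample {

    interface FeatureFunction {
        // Returns 0 or 1 for indicator features.
        double value(int yPrev, int yCur, String x, int i);
    }

    // g_i for one tag pair at letter position i.
    static double g(List<FeatureFunction> features, double[] w,
                    int yPrev, int yCur, String x, int i) {
        double score = 0.0;
        for (int j = 0; j < features.size(); j++) {
            double fj = features.get(j).value(yPrev, yCur, x, i);
            if (fj != 0.0) {          // most features are 0 (see section 2.3)
                score += w[j] * fj;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        // One toy indicator feature: fires when the letters at i, i+1 are "ol" and yCur == 1.
        FeatureFunction f = (yPrev, yCur, x, i) ->
                (x.startsWith("ol", i) && yCur == 1) ? 1.0 : 0.0;
        double[] w = { 0.7 };
        System.out.println(g(List.of(f), w, 0, 1, "golf", 1)); // prints 0.7
    }
}
```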

In contrast, we can view our problem in terms of simple conditional likelihood, where the general linear model outputs the entire label (tag sequence) rather than individual tags one position at a time. The resulting equation is very similar.
$$p(y \mid x; w) = \frac{\exp \sum_{j=1}^{J} w_j F_j(x, y)}{\sum_{y'} \exp \sum_{j=1}^{J} w_j F_j(x, y')}$$

where

$$F_j(x, y) = \sum_{i=1}^{n} f_j(y_{i-1}, y_i, x, i)$$

This is basically the familiar form for multi-class logistic regression. Here $p(y \mid x; w)$ is a formal probability distribution due to its denominator, also denoted $Z(x, w)$, which ensures that the sum of $p$ over all possible $y$ is 1. However, if we consider only the expression inside the exponent of the numerator, we see that $w$ and $F_j$ are very similar to $g_i$: the only difference is that the former measures the likelihood of pairs of entire $x$ and $y$, while the latter measures the (unnormalized) likelihood of pairs of aligned tag (letter) groups of $x$ and $y$. This relationship between $F_j$ and $f_j$ is captured explicitly by the equation for $F_j$ given in terms of $f_j$ summed over every position $i$.
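A sketch of the corresponding global quantities follows: $F_j(x, y)$ as the per-position feature summed over the word, and the unnormalized numerator $\exp \sum_j w_j F_j(x, y)$. Dividing by $Z(x, w)$ (computed with the forward vectors of section 2.1) would give $p(y \mid x; w)$; the class names, interface, and toy feature here are assumptions for illustration, not our actual code.

```java
import java.util.List;

// Sketch of the global feature F_j(x, y) = sum_i f_j(y_{i-1}, y_i, x, i) and the
// unnormalized score exp(sum_j w_j F_j(x, y)). Dividing by Z(x, w) would give
// p(y | x; w); Z is omitted here. Tag sequences are int arrays and
// FeatureFunction is an illustrative interface.
public class GlobalFeatureExample {

    interface FeatureFunction {
        double value(int yPrev, int yCur, String x, int i);
    }

    // F_j(x, y): the low-level feature summed over every letter position.
    static double bigF(FeatureFunction fj, String x, int[] y) {
        double total = 0.0;
        for (int i = 1; i < y.length; i++) {
            total += fj.value(y[i - 1], y[i], x, i);
        }
        return total;
    }

    // Numerator of p(y | x; w): exp of the weighted sum of all F_j.
    static double unnormalized(List<FeatureFunction> features, double[] w,
                               String x, int[] y) {
        double sum = 0.0;
        for (int j = 0; j < features.size(); j++) {
            sum += w[j] * bigF(features.get(j), x, y);
        }
        return Math.exp(sum);
    }

    public static void main(String[] args) {
        // Toy feature that fires on the tag transition 0 -> 1, ignoring x.
        FeatureFunction f = (yPrev, yCur, x, i) -> (yPrev == 0 && yCur == 1) ? 1.0 : 0.0;
        int[] toyTags = { 0, 1, 1, 0 };   // one arbitrary tag sequence for "golf"
        // Prints exp(0.5 * 1) since the transition 0 -> 1 occurs once.
        System.out.println(unnormalized(List.of(f), new double[] { 0.5 }, "golf", toyTags));
    }
}
```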

1.5 Learning and the Log Conditional Likelihood

As with most learning problems, we start by formalizing the learning problem in terms of an objective function that must be optimized. Here our goal is to optimize the log conditional likelihood (LCL) of our model according to the training sample set we wish to learn from. For a training sample of size $T$ our objective function is the following.

$$\text{LCL}(w) = \sum_{t=1}^{T} \log p(\bar{y}_t \mid \bar{x}_t; w)$$

where $p(y \mid x; w)$ is as given above, and each $\bar{x}_t$ and $\bar{y}_t$ are the $t$-th human-provided word and syllabification tagging from the training sample set. Our free parameters are the weights $w$ and the inclusion/exclusion of feature functions. All else being equal, learning weights is a continuous optimization problem and thus we can use gradient methods. Learning the inclusion/exclusion of feature functions, strictly speaking, is not continuous; thus the feature function set must be decided upon by a human being. However, we can still in some sense find an optimal set of feature functions by simply including many feature functions very liberally, and letting our optimization over the parameter $w$ effectively exclude feature functions which are inconsequential. As we will see in section 3, we can in fact filter our feature function set significantly even before we begin optimizing over $w$. Formally this leaves us with $w$ as our only free parameter. We seek an optimal value $\hat{w}$ defined as follows.

$$\hat{w} = \operatorname*{argmax}_{w} \, \text{LCL}(w)$$

As we will see in section 2, there are ways to locate this optimum efficiently, by exploiting the structure and assumptions of linear-chain CRFs.

2 Design and Analysis of Algorithms

We utilize two different approaches to solving the log conditional likelihood (LCL) optimization. Both are forms of gradient following; however, each estimates the gradient in a very different way. As with gradient methods in general, the goal is to find parameter values such that the gradient of the objective function is zero. Because there is no missing data, the objective function is convex and any local optimum will in fact be the global optimum [3].

2.1 Stochastic Gradient Ascent

The gradient of the log conditional likelihood as defined in section 1 can be shown to be as follows.
$$\frac{\partial}{\partial w_j} \text{LCL}(w) = \sum_{t=1}^{T} \left( F_j(\bar{x}_t, \bar{y}_t) - E_{y \sim p(y \mid \bar{x}_t; w)}\!\left[ F_j(\bar{x}_t, y) \right] \right)$$

In standard gradient ascent our update rule would be $w_j \leftarrow w_j + \lambda \frac{\partial}{\partial w_j} \text{LCL}(w)$ for some learning rate $\lambda$; however, for stochastic gradient ascent (SGA), we drop the sum over $t$ and instead update each $w_j$ after evaluating the summand expression for just one, randomly drawn example, and repeat this process until convergence. Proof of the convergence of gradient ascent is beyond the scope of this paper. We can assume the randomization has taken place once before starting SGA, and so we will continue to use subscript $t$ to refer to individual training examples. Our SGA update rule for a component $w_j$ of $w$ then becomes the following.
$$w_j \leftarrow w_j + \lambda \left( F_j(\bar{x}_t, \bar{y}_t) - E_{y \sim p(y \mid \bar{x}_t; w)}\!\left[ F_j(\bar{x}_t, y) \right] \right)$$

where for each training example $(\bar{x}_t, \bar{y}_t)$ we update all $J$ weight values by computing the contribution of the training example to the total gradient, scaled by the learning rate $\lambda$. Note that computing the value of the feature function $F_j$ is constant time, but computing the expectation $E[F_j]$ is not. In order to compute the expectation quickly, we rely on the so-called forward and backward vectors, $\alpha$ and $\beta$. It can be shown [2] that the expectation can be rewritten as follows.

$$E_{y \sim p(y \mid x; w)}\!\left[ F_j(x, y) \right] = \sum_{i=1}^{n} \sum_{y_{i-1}} \sum_{y_i} f_j(y_{i-1}, y_i, x, i)\, \frac{\alpha(i-1, y_{i-1})\, \exp\!\left( g_i(y_{i-1}, y_i) \right)\, \beta(y_i, i)}{Z(x, w)}$$

In the above equation, $\alpha$ and $\beta$ are look-ups in the forward and backward matrices, and the two innermost sums are over all $m$ possible tag values for elements in $y$. The forward and backward vectors are well documented in other sources [2], so we will not go into detail here other than to remind the reader that they capture unnormalized marginal probabilities over tag sequences ending (or beginning) with specific tag values, and that precomputing $\alpha$ and $\beta$ takes O(nm²) time for each new $x$ or $w$. The outer three summations take O(nm²) time, times the cost of the inner expression, which is constant after $\alpha$, $\beta$, $Z(x, w)$, and $g_i$ have been computed. $Z(x, w)$ can be computed in constant time from $\alpha$. Computing $g_i$ for all tag value pairs takes O(Jnm²) time but can be reused for all values of $j$ in that update. Thus computing the expectation $E[F_j]$ for a single training example $(\bar{x}, \bar{y})$, for all $J$ weight updates in one iteration of SGA, is simply O(Jnm² + Jnm²), or just O(Jnm²). As we will discuss in section 3, we are able to substantially reduce the size of $J$ used by SGA without affecting the accuracy of our training.
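A minimal sketch of the forward recurrence is given below, assuming the transition scores $g_i(u, v)$ have already been computed and stored in an array `g[i][u][v]`; boundary handling and numerical scaling are simplified, so this is illustrative rather than our actual implementation.

```java
// Sketch of the forward-vector recurrence used to compute the expectations (and
// Z(x, w)) efficiently: alpha[i][v] = sum_u alpha[i-1][u] * exp(g[i][u][v]).
// Here g[i][u][v] is assumed precomputed for every position i and tag pair (u, v);
// m is the number of tags and n the word length.
public class ForwardExample {

    // g has shape [n+1][m][m]; g[i] scores the transition into position i.
    static double[][] forward(double[][][] g, int n, int m, int startTag) {
        double[][] alpha = new double[n + 1][m];
        alpha[0][startTag] = 1.0;                 // chain starts in the ^ tag
        for (int i = 1; i <= n; i++) {
            for (int v = 0; v < m; v++) {
                double sum = 0.0;
                for (int u = 0; u < m; u++) {
                    sum += alpha[i - 1][u] * Math.exp(g[i][u][v]);
                }
                alpha[i][v] = sum;
            }
        }
        return alpha;                             // Z(x, w) = sum_v alpha[n][v]
    }

    public static void main(String[] args) {
        int n = 2, m = 2;
        double[][][] g = new double[n + 1][m][m]; // all-zero scores, just to exercise the code
        double[][] alpha = forward(g, n, m, 0);
        System.out.println(alpha[n][0] + alpha[n][1]); // Z(x, w) = 4.0 for this toy case
    }
}
```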

2.2 Collins Perceptron

A close cousin to stochastic gradient ascent is the Collins Perceptron. The basic argument against (stochastic) gradient ascent is that computing the expectation is expensive and unnecessary. The Collins Perceptron proposes a reasonable approximation to the true expectation, by assuming $E[F_j(\bar{x}, y)] \approx F_j(\bar{x}, \hat{y})$, where $\hat{y} = \operatorname{argmax}_y p(y \mid \bar{x}; w)$. Proof of convergence is beyond the scope of this paper. The update step for one element of $w$ using the Collins Perceptron instead of SGA is below.
$$w_j \leftarrow w_j + \lambda \left( F_j(\bar{x}, \bar{y}) - F_j(\bar{x}, \hat{y}) \right)$$

Thus, in place of computing the expectation $E[F_j]$ we compute an argmax to find $\hat{y}$. Solving the argmax is essentially the inference problem for linear-chain CRFs, and it is computed efficiently using the Viterbi algorithm. We will not go into much detail on the Viterbi algorithm [2] except to remind the reader of the recursive equation for finding $\hat{y}$.
$$U(k, v) = \max_{u} \left[ U(k-1, u) + g_k(u, v) \right]$$

Function $U$ above provides the score of the best path through tag values at each position and is computable in O(m²) time per position; it depends directly on $g_k$, which is computable in O(Jnm²) time as described earlier but depends only on $x$ and $w$. Thus, computing the optimal $\hat{y}$ takes O(Jnm² + nm²) time. The value $\hat{y}$ does not change while iterating through $j$ in the update step, thus computing $\hat{y}$ is a one-time cost per iteration. All other computations for a single $w_j$ update are constant, thus the complexity of one iteration of the Collins Perceptron is O(J + Jnm²), or simply O(Jnm²). As with SGA, we can greatly reduce the size of $J$ for greater efficiency in practice.
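Below is a minimal Java sketch of Viterbi decoding under the same assumptions as the forward-vector sketch (precomputed `g[k][u][v]`, simplified boundary handling); back-pointers recover $\hat{y}$.

```java
// Sketch of Viterbi decoding for y-hat = argmax_y p(y | x; w), using the
// recurrence U(k, v) = max_u [ U(k-1, u) + g_k(u, v) ].
public class ViterbiExample {

    static int[] viterbi(double[][][] g, int n, int m, int startTag) {
        double[][] U = new double[n + 1][m];
        int[][] back = new int[n + 1][m];
        for (int v = 0; v < m; v++) {
            U[0][v] = (v == startTag) ? 0.0 : Double.NEGATIVE_INFINITY;
        }
        for (int k = 1; k <= n; k++) {
            for (int v = 0; v < m; v++) {
                double best = Double.NEGATIVE_INFINITY;
                int bestU = 0;
                for (int u = 0; u < m; u++) {
                    double score = U[k - 1][u] + g[k][u][v];
                    if (score > best) { best = score; bestU = u; }
                }
                U[k][v] = best;
                back[k][v] = bestU;
            }
        }
        // Trace back from the best final tag.
        int[] yHat = new int[n + 1];
        int bestLast = 0;
        for (int v = 1; v < m; v++) {
            if (U[n][v] > U[n][bestLast]) bestLast = v;
        }
        yHat[n] = bestLast;
        for (int k = n; k >= 1; k--) {
            yHat[k - 1] = back[k][yHat[k]];
        }
        return yHat;                              // yHat[0] is the start tag
    }

    public static void main(String[] args) {
        int n = 3, m = 2;
        double[][][] g = new double[n + 1][m][m];
        g[1][0][1] = 2.0; g[2][1][1] = 1.0; g[3][1][0] = 3.0; // scores favoring 0,1,1,0
        System.out.println(java.util.Arrays.toString(viterbi(g, n, m, 0))); // [0, 1, 1, 0]
    }
}
```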

2.3 Sparsity of Feature Function Output

As alluded to earlier and described in more detail in section 3, each feature function is based on inspecting two sequential tags $y_{i-1}$ and $y_i$, and a c-gram from $x$. In fact each feature function will be defined as a conjunction of indicator functions, based on the presence (or absence) of specific character sequences, such as gol for a 3-gram. The space of feature functions therefore is quite large. However, for any given word $x$, most feature functions will return 0, simply because each word cannot contain more than a handful of specific c-grams. Further, for any given training set of words, some c-grams will not appear at all (such as qqq). Thus we introduce two optimizations that both effectively reduce $J$. The first is to remove any feature functions that depend on c-grams not seen in the training set. Such feature functions would end up with a weight of 0, thus simply removing them from consideration has no effect on correctness. The second optimization we perform is to associate with each example $\bar{x}$ a set of feature function indices corresponding to those feature functions which depend on c-grams contained in $\bar{x}$. For example, if $\bar{x}$ is golf and $c$ is strictly 2, then the feature function indices for $\bar{x}$ should involve only those feature functions depending on go, ol, and lf. Our setup is described in more detail in section 3, and includes some other nuance not specific to the algorithm analysis here. Because all of the above algorithms depend on computing $g_i$, and $g_i$ depends on both $J$ and each $f_j$, the above two optimizations have a significant effect on the overall computing time.
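A sketch of the second optimization, under the assumption that we already have a map from c-grams to the indices of the feature functions built on them (the map layout is hypothetical, for illustration only), might look like the following.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of the per-example feature-index list described above: collect the
// indices of feature functions whose c-grams occur in one training word, so that
// later updates only touch those features.
public class FeatureIndexExample {

    static List<Integer> featureIndicesFor(String x, int c,
                                           Map<String, List<Integer>> cgramToFeatures) {
        List<Integer> indices = new ArrayList<>();
        for (int start = 0; start + c <= x.length(); start++) {
            String cgram = x.substring(start, start + c);
            List<Integer> js = cgramToFeatures.get(cgram);
            if (js != null) {
                indices.addAll(js);               // all features built on this c-gram
            }
        }
        return indices;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> dict = Map.of(
                "go", List.of(0, 1), "ol", List.of(2), "lf", List.of(3, 4));
        // For "golf" with c = 2, only features over "go", "ol" and "lf" are kept.
        System.out.println(featureIndicesFor("golf", 2, dict)); // [0, 1, 2, 3, 4]
    }
}
```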

3 Design of Experiments

Our sample set comes from a modified Celex dataset of hyphenated English words, as prepared and used by Trogkanis and Elkan [1] for their paper on the same problem. The dataset contains about 66,000 words; it excludes proper nouns, words with numbers, punctuation, or accent markings, and it excludes multiple alternative hyphenations for a single word where they exist.

3.1 Label Space and Word Preprocessing

For our experiment we tested two different tag spaces for representing the given hyphenated words. Our first scheme, BIO, has three possible tags for any letter of the input word. B denotes the beginning of a syllable, I denotes an intermediate letter in the syllable, and O denotes the end of a syllable. As a convention, single-letter syllables are encoded as B, and 2-letter syllables as BO. For instance, the hyphenated word a-base will be encoded as BBIIO. The second encoding scheme we use is our OX scheme, inspired by prior work [1], which has two possible tags for any letter. X denotes a letter at the end of a syllable, while O denotes any other letter. As a convention, single-letter syllables are encoded as O. For instance, the hyphenated word a-base will be encoded as XOOOO.

Further, we wrap each word and each label with special starting and ending tokens, ^ and $ respectively (borrowing a convention from regular expressions), to make it more convenient to compute our feature functions at word boundaries, and because we also want to be able to learn any patterns related to the start or end of a word. The final step of our preprocessing is to map all characters into a contiguous space of 8-bit integer values. For our BIO experiments the mappings are given below.
{ ^, $, B, I, O } → { 0, 1, 2, 3, 4 }
{ ^, $, A, B, C, …, Z } → { 0, 1, 2, 3, 4, …, 28 }
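A minimal sketch of the BIO encoding convention described above (boundary tokens omitted; not our actual preprocessing code) is given below.

```java
// Sketch of the BIO encoding: B begins a syllable, I marks an intermediate
// letter, O ends a syllable; single-letter syllables are B and two-letter
// syllables are BO. The ^/$ boundary tokens are added in a separate step.
public class BioEncodeExample {

    static String bioEncode(String hyphenated) {
        StringBuilder tags = new StringBuilder();
        for (String syllable : hyphenated.split("-")) {
            int len = syllable.length();
            if (len == 1) {
                tags.append('B');
            } else {
                tags.append('B');
                for (int i = 0; i < len - 2; i++) tags.append('I');
                tags.append('O');
            }
        }
        return tags.toString();
    }

    public static void main(String[] args) {
        System.out.println(bioEncode("a-base"));  // prints BBIIO
    }
}
```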

3.2 Feature Function Design

As described earlier, we consider feature functions that depend on c-gram instances within $x$. We devised templates for 2-grams, 3-grams, and 4-grams, and we can invoke these templates for every possible c-gram permutation of letters and special symbols. One example of a low-level feature function we used can be defined as: all 3-letter sequences 'ase' with the corresponding tag label BI*, where * can be any tag.

$$f_j(y_{i-1}, y_i, x, i) = \mathbb{1}(x_{i-1} x_i x_{i+1} = \text{'ase'}) \cdot \mathbb{1}(y_{i-1} y_i = \text{'BI'})$$

For a BIO scheme using 2-grams, 3-grams and 4-grams, the total number of feature functions possible using this template is $9 \cdot 26^2 + 9 \cdot 26^3 + 9 \cdot 26^4 = 4{,}277{,}052$. We notice that not all of these feature functions will appear in the input set, and those that do not would maintain a zero weight anyway. Therefore, instead of enumerating all the possible feature function instantiations, we parse the input set and define in memory only those feature functions that appear at least once. Performing this step reduces our feature function set size to 47,599 in the OX scheme and 53,590 in the BIO scheme. We show our reduction of $J$ in greater detail (per n-gram choice) below.

[Figure: Decrease in J after preprocessing optimization (percent decrease in J, per n-gram choice).]
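As a concrete illustration of this pruning step, the following sketch collects the set of c-grams that actually occur in the (boundary-wrapped) training words; only these would generate feature function instantiations. The data layout is an assumption for illustration, not our actual implementation.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the pruning step: rather than instantiating a feature function for
// every possible c-gram / tag-pair combination, first collect the c-grams that
// actually occur in the training words and instantiate features only for those.
public class FeaturePruningExample {

    static Set<String> observedCgrams(List<String> trainingWords, int c) {
        Set<String> seen = new HashSet<>();
        for (String word : trainingWords) {
            for (int start = 0; start + c <= word.length(); start++) {
                seen.add(word.substring(start, start + c));
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> words = List.of("^golf$", "^gold$");
        // Only c-grams that appear at least once generate feature functions.
        System.out.println(observedCgrams(words, 3));
    }
}
```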

3.3 Programming Environment and Code Optimizations

Our initial efforts were focused on leveraging the R programming language, as used in our most recent paper. However, for the goals of this paper, R soon became unwieldy, primarily due to poorer development tools and the absence of static typing. Thus, halfway into our efforts we ported our existing code to Java, which immediately showed performance gains. Any remaining bottlenecks we were able to identify and optimize using the VisualVM profiler included in the JDK.

Owing to the large number of feature functions and input samples, initial performance (in Java) was worrisome. Over a much smaller subset totaling only 7,500 examples, our initial implementation of the Collins Perceptron with 2-gram feature functions took 20 minutes for one epoch, while SGA took 2 minutes for just one example. Through profiling we ultimately introduced a number of optimizations at key points in the code. These optimizations included temporary memoization of $g_i$, better multiplicative short-circuiting inside loops as soon as any zero term was detected, and the introduction of a feature function index list per example, as described in section 2.3, which greatly improved speed in general. Other minor optimizations included using 32-bit integers instead of 64-bit floating point numbers for the Collins Perceptron where correctness is not affected, and rewriting code to avoid unnecessary copying when performing subsequence comparisons. The above optimizations reduced our runtime for one epoch of the Collins Perceptron down to 45 seconds, and one update of SGA down to 1-2 seconds.

Individual experiments were divided and executed over different machines in order to ensure we could include results for every permutation, including two multi-core machines with 6 GB and 12 GB of RAM, and one cloud instance with two cores and a dedicated 1.6 GB of RAM. Casual observation never indicated our experiments utilizing more than 400 MB of allocated memory.

3.4 Regularization

For SGA we include a regularization factor to prevent overfitting and weight explosion. This modifies our SGA update rule slightly to include a $-2\mu w_j$ term [4].

$$w_j \leftarrow w_j + \lambda \left( F_j(\bar{x}, \bar{y}) - E_{y \sim p(y \mid \bar{x}; w)}\!\left[ F_j(\bar{x}, y) \right] - 2\mu w_j \right)$$

The optimal choice of $\mu$ is determined via grid search.
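A minimal sketch of this regularized update, assuming $F_j(\bar{x}, \bar{y})$ and the expectation $E[F_j]$ for the current example have already been computed (e.g. via the forward and backward vectors of section 2.1), is shown below.

```java
// Sketch of the regularized SGA update for a single training example: each weight
// moves by lambda * (F_j(x, y) - E[F_j] - 2 * mu * w_j). The arrays fObserved and
// fExpected are assumed to hold F_j(x-bar, y-bar) and the model expectation E[F_j].
public class RegularizedSgaUpdateExample {

    static void sgaUpdate(double[] w, double[] fObserved, double[] fExpected,
                          double lambda, double mu) {
        for (int j = 0; j < w.length; j++) {
            w[j] += lambda * (fObserved[j] - fExpected[j] - 2.0 * mu * w[j]);
        }
    }

    public static void main(String[] args) {
        double[] w = { 0.0, 1.0 };
        sgaUpdate(w, new double[] { 1.0, 0.0 }, new double[] { 0.25, 0.5 }, 0.01, 0.25);
        System.out.println(w[0] + " " + w[1]);   // 0.0075 and 0.99
    }
}
```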

3.5 Hyper-parameter Search

We performed a randomized 70%/30% split of the sample set for use in training and validation respectively. Within our training set (70% of the original data set), we perform 7-fold rotating cross-validation for two separate grid searches [5] over the hyperparameters of SGA: the learning rate $\lambda$ and the regularization constant $\mu$. After sufficiently expanding our search limits, our grid search candidates were as follows.
$$\lambda \in \{10^{-7}, 10^{-6}, \ldots, 10^{0}\} \qquad \mu \in \{2^{-7}, 2^{-6}, \ldots, 2^{-1}\}$$

We first grid search over $\lambda$ using $\mu = 0.125$, and limit SGA to 2000 iterations per candidate value. The point of grid search is not to fully converge, but to get a quick estimate of the best $\lambda$ for convergence during the full training. After the best $\lambda$ is selected, we use it during a grid search over $\mu$ to find the best regularization constant. We again limit SGA to 2000 iterations per candidate value.
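The two-stage search can be sketched as follows; `trainAndValidate` is a hypothetical stand-in for running SGA for 2000 iterations with a given $(\lambda, \mu)$ and returning cross-validation accuracy, and the candidate grids are those listed above.

```java
// Sketch of the two-stage grid search: search lambda with mu fixed at 0.125,
// then search mu with the best lambda fixed. Trainer is an assumed interface.
public class GridSearchExample {

    interface Trainer {
        double trainAndValidate(double lambda, double mu);
    }

    static double[] gridSearch(Trainer trainer) {
        double fixedMu = 0.125;
        double bestLambda = 0, bestAcc = -1;
        for (int e = -7; e <= 0; e++) {                 // lambda in {1e-7, ..., 1e0}
            double lambda = Math.pow(10, e);
            double acc = trainer.trainAndValidate(lambda, fixedMu);
            if (acc > bestAcc) { bestAcc = acc; bestLambda = lambda; }
        }
        double bestMu = fixedMu;
        bestAcc = -1;
        for (int e = -7; e <= -1; e++) {                // mu in {2^-7, ..., 2^-1}
            double mu = Math.pow(2, e);
            double acc = trainer.trainAndValidate(bestLambda, mu);
            if (acc > bestAcc) { bestAcc = acc; bestMu = mu; }
        }
        return new double[] { bestLambda, bestMu };
    }

    public static void main(String[] args) {
        // Dummy trainer that happens to prefer lambda = 0.01 and mu = 0.25, for illustration.
        Trainer dummy = (lambda, mu) -> -Math.abs(Math.log10(lambda) + 2) - Math.abs(mu - 0.25);
        double[] best = gridSearch(dummy);
        System.out.println(best[0] + " " + best[1]);    // 0.01 0.25
    }
}
```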

3.6 Full Training Stopping Conditions

For the Collins Perceptron we determine convergence by comparing the trailing average of validation set accuracy between two consecutive epochs. The trailing average is computed by taking the average of the accuracy percentages of the last three epochs. We stop the process when the trailing average starts decreasing, or after hitting a predetermined limit on the number of epochs. As a convenience in the code, instead of computing the accuracy as a percentage, we count the number of correct predictions on the fly while performing the perceptron update.


Collins Perceptron
while prev_trailing_avg < cur_trailing_avg and epoch number < max epochs do
    set num_correct = 0
    foreach sample x do
        set ŷ = Viterbi(x, weights)
        if ŷ ≠ y then
            foreach weight w_j do
                w_j ← w_j + λ(F_j(x, y) − F_j(x, ŷ))
            end foreach
        else
            set num_correct = num_correct + 1
    end foreach
    compute cur_trailing_avg for last 3 trials
    set prev_trailing_avg = cur_trailing_avg
end while

For SGA we simply halt after a hard limit on the number of epochs. As we wanted to explore a breadth of experiment permutations (different tag schemes, feature function schemes) within our time limits, we opted for a hard limit of 1 epoch for SGA and 40 epochs for the Collins Perceptron. In both cases, we would ideally use a much higher hard limit on epoch count to allow convergence to happen on its own, under some measure (e.g. the trailing average as above, or a threshold difference in magnitude).
Stochastic Gradient Ascent
for t = 1 to T do
    for j = 1 to J do
        w_j ← w_j + λ(F_j(x̄_t, ȳ_t) − E_{y∼p(y|x̄_t;w)}[F_j(x̄_t, y)])
    end for
end for

4 Results of Experiments

4.1 Grid Search Results

The results of grid search for the learning rate $\lambda$ and regularization constant $\mu$ are shown below, based on the 2-gram, OX scheme experiment configuration. The optimal value of the learning rate we found is 0.01.


[Figure: Learning rate vs. accuracy — validation accuracy (%) for λ candidates from 1 down to 10⁻⁶.]

Grid search for the regularization constant indicated an optimal value of 0.25.

[Figure: Regularization constant vs. accuracy — validation accuracy (%) for μ candidates.]

4.2 Full Training Results

For final training over our training set (70% of the entire data set), we performed and compared 16 different experiment permutations, spanning both training methods, both tagging schemes, and various choices and combinations of c for our c-gram feature functions. We estimate the success of each permutation by measuring per-letter accuracy over the remaining 30% of the data set. Our letter-level accuracy results are shown below.


[Figure: Letter-level accuracy (%) of the Collins Perceptron and SGA for the 2-gram OX, 2-gram BIO, 3-gram OX, 3-gram BIO, 2- and 3-gram OX, and 2- and 3-gram BIO configurations.]

We find that the overall accuracy of the SGA solver is lower than that of the Collins Perceptron. This could possibly be due to our hard stopping condition for SGA, which may not give it sufficient time to converge. Within SGA, we see that the accuracy for the BIO tag set is lower than that of the OX tag set. One reason for this could be that we used the same hyperparameters for both tag sets. Otherwise, the BIO tag encoding performed about as well as OX with the Collins Perceptron, and in some cases slightly better. Because of the faster running time of the Collins Perceptron, we were also able to compare 4-grams and 2-through-4-grams. We found that using 4-gram feature functions increases accuracy compared to 2-grams and 3-grams, but the combination of 2-gram, 3-gram, and 4-gram feature functions does not cause any major change in accuracy over simply using 4-grams only.
Tag set    4-gram     2, 3 and 4-gram
OX         96.77%     96.34%
BIO        95.96%     96.32%

Computed using Collins Perceptron only.

5 Findings and Lessons Learned

We were very pleased with the speed and efficiency of the Collins Perceptron. We spent less total processing time per experiment than with SGA, yet received better final accuracy than SGA. The nature of the code optimizations employed opened up a new way of thinking about machine learning algorithms in general for us. Where before a training example would have been thought of as simply an input on a conveyor belt, we now look at each example as pairable with meta-information (e.g. the per-example feature function index list), solely for the purpose of using it in the most efficient way possible during learning. This perspective shift will continue to influence how we implement our object-oriented or functional approaches to machine learning going forward.

For the Collins Perceptron we find that using c-gram feature functions for a single value of c gives better results compared to a combination of multiple c-grams. For instance, the accuracy for 3-gram OX is 93.48%, whereas 2- and 3-gram OX is 92.83%. One explanation for this might be that when 2-gram and 3-gram feature functions are both available, the weight for a particular 2-gram feature function is diluted over two weights, one for the 2-gram and another for related 3-grams.

We under-estimated the importance of performing grid search for each individual permutation of the experiments, which we feel is the most likely explanation for why SGA performed very competitively for the 2-gram OX experiment (on which we executed grid search), while comparatively poorly for all other experiment permutations. In the future we see the importance of being much more thorough about our use of grid search.

References

[1] N. Trogkanis and C. Elkan, "Conditional Random Fields for Word Hyphenation," in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 2010.

[2] C. Elkan, "Log-linear models and conditional random fields," 2013. [Online]. Available: http://cseweb.ucsd.edu/~elkan/250B/loglinearCRFs.pdf. [Accessed February 2013].

[3] C. Sutton and A. McCallum, "An Introduction to Conditional Random Fields," 2010.

[4] C. Elkan, "Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training," 17 January 2013. [Online]. Available: http://cseweb.ucsd.edu/~elkan/250B/logreg.pdf. [Accessed February 2013].

[5] C.-W. Hsu, C.-C. Chang and C.-J. Lin, "A Practical Guide to Support Vector Classification," 15 April 2010. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf. [Accessed January 2013].

