
FOREST-BASED ALGORITHMS IN NATURAL LANGUAGE

PROCESSING
Liang Huang
A DISSERTATION
in
Computer and Information Science
Presented to the Faculties of the University of Pennsylvania in Partial
Fulfillment of the Requirements for the Degree of Doctor of Philosophy
2008
Aravind K. Joshi and Kevin Knight
Supervisors of Dissertation
Rajeev Alur
Graduate Group Chairperson
COPYRIGHT
Liang Huang
2008
Dedication
To my parents, for their love, teaching, and sacrifice.
Acknowledgements
First and foremost, I thank my advisors, Aravind Joshi at Penn and Kevin Knight at
USC/ISI. Aravind was personally responsible for bringing me to Penn, and offered exceptional academic freedom during my study while at the same time making sure I was on the right track. After two internships at ISI, Aravind asked Kevin to serve as my external co-advisor, a rare privilege that I could never have imagined. Kevin brought me to the fascinating world of machine translation, and generously hosted my numerous visits to ISI, where many parts of this thesis were done.
Next, I thank my terrific committee, consisting of Mitch Marcus, Fernando Pereira and Ben Taskar at Penn, and Mark Johnson from Brown. Mitch was always extremely enthusiastic about my work, ever since my first year, and I thank him for his encouragement. Fernando was like an informal second advisor at Penn, and I am glad that I can hear his sharp comments again after graduation. Although arriving late, Ben served as the Committee Chair and spent so much time brainstorming with me in my last year. He was so patient with my disorganization and I am excited to continue our collaboration. Mark
kindly served as the external member, and offered some of the best suggestions, including the point on the (virtual) ∞-best list, which greatly improved the quality of my work.
I am also indebted to many external collaborators during my PhD study, especially
David Chiang (ISI), Dan Gildea and Hao Zhang (Rochester). David was my mentor
when we overlapped for a year at Penn, and collaborated with me on our initial k-best
paper (Chapter 3) as well as a later forest rescoring paper (Chapter 4), which laid down
the foundation of the whole thesis. Dan and Hao guided my exploration of synchronous
grammars, and invited me to visit Rochester regularly. Dan inspired many aspects of this
work, especially on forest reranking (Chapter 5) which is based on his suggestion to apply
forest rescoring back to parsing. I learned a ton from all of them.
During my last year of study, I was also very lucky to be hosted by Prof. Qun Liu
and his lab at the CAS Institute of Computing Technology (ICT), Beijing. It was at ICT where I did most of the forest reranking experiments, and initiated a collaboration with Haitao Mi on forest-based translation which became an integral part of this thesis (Chapter 6). Needless to say, I also thank the US Government for making it possible by delaying my visa for five months.
In addition, I thank my undergraduate advisors Ruzhan Lu and Yuquan Chen at Shang-
hai Jiao Tong University for supervising my senior project on a probabilistic Earley parser for Chinese. I wouldn't have studied the problem of k-best parsing (Chapter 3) had they
not requested k-best output from my parser. Algorithm 1 (Section 3.5) was developed
there, which (rather unexpectedly) became the cornerstone of this thesis.
Furthermore, I want to thank people at Penn. My (mainly NLP) peers not only helped
me much in research but also buoyed my spirits, especially during difficult times. They
include Libin Shen and Julia Hockenmaier at IRCS, and Yuan Ding, Ryan McDonald,
John Blitzer, Axel Bernal, Qian Liu and Nikhil Dinesh at CIS. I also thank our faculty
members that I interacted with: Jean Gallier, Sampath Kannan, Sanjeev Khanna, Sudipto
Guha, and Stephanie Weirich. Jean deserves a special note here for his French humor and
Chinese support. Benjamin Pierce and Lawrence Saul guided me on technical writing
during their writing seminars; Benjamin is also the best instructor I ever had. In terms of
my own teaching, I was very lucky to have TAed for Jean and Sanjeev, and I also thank
Steve Zdancewic for encouraging me to develop and instruct a new course on Python
Programming. I cherish my teaching experiences at Penn as some of my best memories.
I am also grateful to other people at ISI and Language Weaver (besides Kevin and
David): Daniel Marcu, Jonathan Graehl, Victoria Fossum, Jon May, Steve Deneefe, and
Wei Wang. They taught me many aspects of machine translation and the ISI environment
is so supportive that I never felt like a visitor there. Among them, Jonathan Graehl deserves special acknowledgement: he shares with me a theoretical interest and was always ready to offer sharp comments. The forest pruning algorithm (Section 4.3) was due to him.
This thesis also benefited from many discussions with Jason Eisner (Hopkins), who suggested the space-efficient variant of Algorithm 3 (Section 3.7.1), Chris Quirk (MSR), who shared with me his independent findings on forest-to-string decoding (Section 6.2.1), Giorgio Satta (Padua), who helped me with formalizations, and Dekai Wu (HKUST), who also hosted my last few weeks of revision. Owen Rambow (Columbia) offered good advice before I came to America and during my first few years in the PhD program.
I also thank Ken Dill and Adam Lucas at UCSF for teaching me much about biology
and thermodynamics when we worked on protein folding during my first three years.
Outside academics, I was fortunate enough to be surrounded by many good Chinese
friends in Philadelphia: Jing Chen, Di Liu, Tingting Sha, Gang Song, Jinsong Tan, Stephen
Tse, Bei Xiao, Meng Yang, Qihui Zhu, and many many others. I look forward to joining
my friends Junfeng Pan, Zheng Shao, Jiqing Tang and Qin Iris Wang in the Bay Area.
Last, but most of all, I thank my parents for their unconditional love and endless support throughout all these years, without which I could never have gone this far. Back in the old days, my father led me into the fascinating world of programming with BASIC and LOGO on Apple II computers when I was in elementary school. My mother took me to weekend classes to learn C and data structures at the other end of the city during my middle school years; ever since then I have been intrigued by the beauty of recursion (which is still evident in much of this work). I dedicate this thesis to them.
This research was mainly supported by NSF ITR EIA-0205456 (at Penn), and also
by NSF ITR IIS-0428020 and IIS-09325646 (at USC/ISI and Rochester), and by China
National NSF contracts 60736014 and 60573188 (at CAS/ICT).
ABSTRACT
FOREST-BASED ALGORITHMS IN NATURAL LANGUAGE PROCESSING
Liang Huang
Supervisors: Aravind K. Joshi and Kevin Knight
Many problems in Natural Language Processing (NLP) involve an efficient search for the best derivation over (exponentially) many candidates. For example, a parser aims to find the best syntactic tree for a given sentence among all derivations under a grammar, and a machine translation (MT) decoder explores the space of all possible translations of the source-language sentence. In these cases, the concept of packed forest provides a compact representation of huge search spaces by sharing common sub-derivations, on which efficient algorithms based on Dynamic Programming (DP) are possible.
Building upon the hypergraph formulation of forests and well-known 1-best DP algorithms, this dissertation develops fast and exact k-best DP algorithms on forests, which are orders of magnitude faster than previously used methods on state-of-the-art parsers.
We also show empirically how the improved output of our algorithms has the potential to
improve results from parse reranking systems and other applications.
We then extend these algorithms to approximate search when the forests are too big for exact inference. We discuss two particular instances of this new method: forest rescoring for MT decoding, and forest reranking for parsing. In both cases, our methods perform orders of magnitude faster than conventional approaches. In the latter, faster search also leads to better learning, where our approximate decoding makes whole-Treebank discriminative training practical and results in an accuracy better than that of any previously reported system trained on the Treebank.
Finally, we apply the above material to the problem of syntax-based translation and propose a new paradigm, forest-based translation. This scheme translates a packed forest of the source sentence into a target sentence, rather than just using 1-best or k-best parses as in usual practice. By considering exponentially many alternatives, it alleviates the propagation of parsing errors into translation, yet comes with only a fractional overhead in running time. We also push this direction further to extract translation rules from packed forests. The combined results of forest-based decoding and rule extraction show significant improvements in translation quality in large-scale experiments, and consistently outperform the hierarchical system Hiero, one of the best-performing systems to date.
Contents
Dedication iii
Acknowledgements iv
1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Thesis Outline and Contributions . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background: Packed Forests and the Hypergraph Framework 6
2.1 Packed Forests as Hypergraphs . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Examples: Forests in Machine Translation . . . . . . . . . . . . . . . . . . . 10
2.2.1 Translation as Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.2 Adding a Language Model . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Dynamic Programming on Hypergraphs . . . . . . . . . . . . . . . . . . . . 15
2.3.1 Monotonicity, Acyclicity, and Superiority . . . . . . . . . . . . . . . 16
2.3.2 Generalized Viterbi Algorithm . . . . . . . . . . . . . . . . . . . . . 17
2.3.3 Knuth 1977 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Exact k-best Dynamic Programming on Forests 22
3.1 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.1 Excursion: Derivations vs. Hyperpaths . . . . . . . . . . . . . . . . . 27
3.4 Algorithm 0 (Naive) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.5 Algorithm 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.6 Algorithm 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.7 Algorithm 3 (Lazy) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.7.1 Extension 1: Space-Efficient Algorithm 3 . . . . . . . . . . . . . 35
3.7.2 Extension 2: the Unique k-best Algorithm . . . . . . . . . . . . . . . 36
3.8 k-best Parsing and Decoding Experiments . . . . . . . . . . . . . . . . . . . 37
3.8.1 Experiment 1: Bikel Parser . . . . . . . . . . . . . . . . . . . . . . . 38
3.8.2 Experiment 2: Hiero decoder . . . . . . . . . . . . . . . . . . . . . . 42
3.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4 Approximate Dynamic Programming I: Forest Rescoring 44
4.1 Cube Pruning based on Algorithm 2 . . . . . . . . . . . . . . . . . . . . . . 46
4.2 Cube Growing based on Algorithm 3 . . . . . . . . . . . . . . . . . . . . . . 48
4.3 Forest Pruning Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4.1 Phrase-based Decoding . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4.2 Tree-to-string Decoding . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5 Approximate Dynamic Programming II: Forest Reranking 57
5.1 Generic Reranking with the Perceptron . . . . . . . . . . . . . . . . . . . . 58
5.2 Factorization of Local and Non-Local Features . . . . . . . . . . . . . . . . 61
5.3 Approximate Decoding via Cube Pruning . . . . . . . . . . . . . . . . . . . 63
5.4 Forest Oracle Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.5.1 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.5.2 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.6 Discussion and Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6 Application: Forest-based Translation 74
6.1 Tree-based Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.1.1 Tree-to-String System . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.1.2 Tree-to-String Rule Extraction . . . . . . . . . . . . . . . . . . . . . 78
6.2 Forest-based Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.2.1 From Parse Forest to Translation Forest . . . . . . . . . . . . . . . . 80
6.2.2 Decoding on the Translation Forest with Language Models . . . . . 83
6.3 Forest-based Rule Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.3.1 Generalized Rule Extraction Algorithm . . . . . . . . . . . . . . . . 84
6.3.2 Fractional Counts and Rule Probabilities . . . . . . . . . . . . . . . 87
6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.4.1 Small Data: Forest-based Decoding . . . . . . . . . . . . . . . . . . . 89
6.4.2 Small Data: Forest-based Rule Extraction . . . . . . . . . . . . . . . 91
6.4.3 Large Data: Combined Results . . . . . . . . . . . . . . . . . . . . . 93
6.5 Discussion and Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7 Conclusions and Future Work 95
References 98
List of Tables
2.1 Correspondence between hypergraphs and related formalisms. . . . . . . . . 10
2.2 Classification and examples of major syntax-based MT models. . . . . . . 11
2.3 Summary of Viterbi and Knuth Algorithms. . . . . . . . . . . . . . . . . . . 20
3.1 Summary of k-best Algorithms. See Section 3.7.1 for Algorithm 3. . . . . . 34
4.1 Comparison of the three methods for decoding with n-gram LMs. . . . . . . 45
5.1 Comparison of various approaches for incorporating local and non-local fea-
tures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2 Features used in this work. Those marked are from (Collins, 2000), and others are from (Charniak and Johnson, 2005), with simplifications. . . . 69
5.3 Forest reranking compared to n-best reranking on sec. 23. The pre-comp.
column is for feature extraction, and the training column shows the number of
perceptron iterations that achieved best results on the dev set, and average
time per iteration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.4 Comparison of our final results with other best-performing systems on the
whole Section 23. Types D, G, and S denote discriminative, generative, and
semi-supervised approaches, respectively. . . . . . . . . . . . . . . . . . . . 72
6.1 Results with different rule extraction methods (trained on small data). Ex-
traction and decoding times are secs per 1000 sentences and per sentence,
resp. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.2 Statistics of rules extracted from small data. . . . . . . . . . . . . . . . . . 92
6.3 BLEU score results trained on large data. . . . . . . . . . . . . . . . . . . . 93
List of Figures
1.1 Two possible parse trees for an ambiguous phrase in the Penn Treebank
style: (a) PP attached to the verb; (b) PP attached to the noun phrase. . . 2
2.1 A partial forest of the example sentence. . . . . . . . . . . . . . . . . . . . . 7
2.2 Illustration of tree-to-string deduction (translation rule on the right). . . . . 13
3.1 Examples of hypergraph, hyperpath, and derivation: (a) a hypergraph H, with t as the target vertex and p, q as source vertices, (b) a hyperpath π_t in H, and (c) a derivation of t in H, where vertex u appears twice with two different (sub-)derivations. This would be impossible in a hyperpath. . . . . 28
3.2 An Earley derivation where the item (A → α . B β, i, j) appears twice (predict and complete). . . . . 28
3.3 An illustration of Algorithm 1 in |e| = 2 dimensions. Here k = 3, the ordering ⪯ is the usual numerical ordering, and the monotonic function f is defined as f(a, b) = a + b. Italic numbers on the x and y axes are the a_i's and b_j's, respectively. We want to compute the top 3 results from f(a_i, b_j). In each iteration the current frontier is shown in shades, with the bold face denoting the best element among them. That element will be extracted and replaced by its two neighbors in the next iteration. . . . . 31
3.4 Efficiency results of the k-best Algorithms, compared to Jimenez and Marzal's algorithm. . . . . 39
3.5 Absolute and relative F-scores of oracle reranking for the top k (up to 100) parses for section 23, compared to (Charniak and Johnson, 2005), (Collins, 2000) and (Ratnaparkhi, 1997). . . . . 41
3.6 Average number of parses for each sentence length in section 23, using k = 100, with beam widths 10^{-4} and 10^{-3}, compared to (Collins, 2000). . . . . 41
3.7 Algorithm 2 compared with Algorithm 3 (offline) on the MT decoding task. Average time (both excluding the initial 1-best phase) vs. k (log-log). . . . . 42
4.1 Cube pruning along one hyperedge. (a): the numbers in the grid denote the
score of the resulting +LM item, including the combination cost; (b)-(d):
the best-first enumeration of the top three items. Notice that the items
popped in (b) and (c) are out of order due to the non-monotonicity of the
combination cost. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2 Example of cube growing along one hyperedge. (a): the h(x) scores for the grid in Figure 4.1(a), assuming h_combo(e) = 0.1 for this hyperedge; (b) cube growing prevents early ranking of the top-left cell (2.5) as the best item in this grid. . . . . 49
4.3 (a) Pharaoh expands the hypotheses in the current bin (#2) into longer ones.
(b) In Cubit, hypotheses in previous bins are fed via hyperedge bundles
(solid arrows) into a priority queue (shaded triangle), which empties into
the current bin (#5). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4 A hyperedge bundle represents all +LM deductions that derive an item in
the current bin from the same coverage vector (see Figure 4.3). The phrases
on the top denote the target-sides of applicable phrase-pairs sharing the
same source-side. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.5 Cube pruning vs. full-integration (with beam search) on phrase-based de-
coding. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.6 Cube growing vs. cube pruning vs. full-integration (with beam search) on
tree-to-string decoding. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.1 Illustration of some example features. Shaded nodes denote information
included in the feature. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2 Example of the unit NGramTree feature at node A_{i,k}: ⟨A (B . . . w_{j-1}) (C . . . w_j)⟩. . . . . 63
5.3 Forests (shown with various pruning thresholds) enjoy higher oracle scores
than k-best lists. For the latter, the number of hyperedges is the (average)
number of brackets in the k-best parses per sentence. . . . . . . . . . . . . 70
6.1 Example translation rule r_1. The Chinese conjunction "yǔ" ("and") is translated into the English preposition "with". . . . . 76
6.2 An example derivation of tree-to-string translation. Each shaded region
denotes a tree fragment that is pattern-matched with the rule being applied. 77
6.3 Tree-based rule extraction, à la Galley et al. (2004). Each non-leaf node in the tree is annotated with its target span (below the node), where gaps are marked, and non-faithful spans are crossed out. Nodes with contiguous and faithful spans form the frontier set, shown in shadow. The first two rules extracted can be composed to form rule r_1 in Figure 6.1; other rules (r_2 . . . r_5) are omitted. . . . . 78
6.4 (a) the parse forest of the example sentence; solid hyperedges denote the 1-best parse in Figure 6.2(b) while dashed hyperedges denote the alternative parse. (b) the resulting translation forest after applying the translation rules (lexical rules not shown); the derivation shown in bold solid lines (e^t_1 and e^t_4) corresponds to the derivation in Figure 6.2, while the one in dashed lines (e^t_2 and e^t_3) uses the alternative parse. (c) the correspondence between translation hyperedges and translation rules. . . . . 81
6.5 Forest-based rule extraction on the parse forest in Figure 6.4(a). . . . . . . 85
6.6 Comparison of decoding on forests with decoding on k-best trees. . . . . . . 90
6.7 Percentage of the i-th best parse tree being picked in decoding. 32% of the
distribution for forest decoding is beyond top-100 and is not shown on this
plot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.8 Comparison of extraction time: forest-based vs. 1-best and 30-best. . . . . 92
Chapter 1
Introduction
The best science, like the best engineering, often comes from understanding
not just how things are, but how else they could have been.
Gary F. Marcus (b. 1970), Kluge, 2008
In some sense this thesis centers on the notion of alternatives: alternative derivations, how to represent them compactly, how to enumerate them efficiently, and how to use them effectively for integrating more contextual information in human language processing. From a more philosophical standpoint, this thesis also explores alternative, but more principled, search algorithms than the standard approaches, and takes a more theoretical viewpoint than the current empirical dominance in this field. So in short, it is all about how else things could have been done.
1.1 Background
Many problems in Natural Language Processing (NLP) involve an efficient search for the best derivation or interpretation over exponentially many candidates. For example, a parser aims to find the best syntactic tree for a given sentence among all derivations under a grammar, and a machine translation (MT) decoder explores the space of all possible translations of the source-language sentence. In many cases, the concept of packed forest (Earley, 1970; Billot and Lang, 1989) provides an elegant way to compactly represent the huge search spaces by sharing equivalent sub-derivations. When the objective function of the search problem is compatible with the packed representation, i.e., decomposes nicely into local configurations with independence assumptions, it can be optimized efficiently by Dynamic Programming (DP). For example, the highest-probability tree under a Probabilistic Context-Free Grammar (PCFG) can be easily extracted by the CKY Algorithm, which runs in time cubic in the sentence length.

Figure 1.1: Two possible parse trees for an ambiguous phrase in the Penn Treebank style: (a) PP attached to the verb; (b) PP attached to the noun phrase.
However, in practice the independence assumptions are often too strong for modeling human languages, and we would always like to incorporate some non-local or long-distance information. For example, consider Figure 1.1, which shows two possible interpretations of the (famously) ambiguous phrase "saw a man with a telescope". We as human beings would most likely favor parse (a) over parse (b), because "saw ... with a telescope" describes a more sensible real-world situation, though (b) might be more likely in some other contexts. This intuition requires a dependency among three CFG productions (VP, PP, and the right NP), and is thus non-local in the simple PCFG model. In other words, the objective function
is no longer compatible with the CFG forest, and we cannot apply the same dynamic programming algorithm. How would we solve this problem then? So far there have been at least three approaches, each of which involves an open algorithmic problem:
1. First, we can enlarge the packed representation so that the new objective function with non-local information still factors. For example, we can split the nodes (i.e., states) in a CFG forest by annotating the parent nonterminal (Johnson, 1998), so that our rule probabilities can now condition on derivation history. However, this method often blows up the search space, sometimes by an exponential factor if the annotation requires unbounded information. On the other hand, we need a unified theoretical framework to represent these annotations and their objective functions so that efficient dynamic programming can be adapted naturally. In practice this framework should cover not only parsing but also machine translation, including various syntax-based models and their decoding processes with integrated language models (Koehn, 2004; Galley et al., 2006).
2. An alternative approach, without blowing up the search space, is to do k-best search under the small forest (with the simple decomposable model), and then rerank or rescore the k-best list with a more powerful model, as in parse reranking (Collins, 2000; Charniak and Johnson, 2005) and translation (Shen et al., 2004). Many other applications such as minimum error rate training (Och, 2003) also use large k-best lists. Unfortunately this problem has not received enough attention in the literature, and the k-best algorithms used in practice are often reported to be prohibitively slow (Collins, 2000; Gildea and Jurafsky, 2002).
3. A third approach, as a compromise between the first two, is to approximately integrate non-local information on the original forest. This method is more principled than the second approach, which often suffers from the limited scope of the k-best list, and is also considerably faster than the first approach since it avoids explicitly annotating the whole forest. Although least explored among the three, it has the potential of being almost as fast as k-best search yet almost as accurate as exact search. So we devote much of this thesis to this largely open area.
1.2 Thesis Outline and Contributions
This dissertation addresses the above algorithmic problems, and makes the following con-
tributions:
1. In Chapter 2, we first formalize packed forests as directed hypergraphs, on top of which we provide a generic framework for DP on forests based on monotonic weight functions. Parsers and various machine translation models are provided as examples of this framework. Well-known algorithms such as the CKY algorithm mentioned above, as well as the Dijkstra and Knuth Algorithms, can be seen as special instances of this framework. This chapter provides the background formalisms and algorithms for the remainder of this thesis.
2. Chapter 3 develops a series of exact k-best DP algorithms on forests by extending
the 1-best Viterbi Algorithm from Chapter 2. These algorithms are orders of mag-
nitude faster than previously used methods on state-of-the-art parsers, and have the
potential to improve results from parse reranking systems and other applications.
3. In the next two Chapters (4 and 5), we extend these algorithms to approximate
search when the forests are too big for exact inference. We discuss two particular
instances of this new method, forest rescoring for MT decoding (Chapter 4), and
forest reranking for parsing (Chapter 5). In both cases, our methods perform orders
of magnitude faster than conventional approaches. In the latter, faster search also
leads to better learning, where our approximate decoding makes whole-treebank dis-
criminative training practical and results in an accuracy better than that of any previously reported system trained on the Penn Treebank.
4. Chapter 6 is an application of the material developed in all previous chapters to the problem of syntax-based translation. Here we propose a new paradigm, forest-based translation, which translates a packed forest of the source sentence into a target sentence, as opposed to just using 1-best or k-best parses as in usual practice. By considering exponentially many alternatives, this scheme alleviates the propagation of parsing errors into translation, yet comes with only a fractional overhead in running time. We also push this direction further to extract translation rules from packed forests. The combined results of forest-based decoding and rule extraction show significant improvements in translation quality in large-scale experiments, and consistently outperform the hierarchical system Hiero, one of the best-performing systems to date.
On the theoretical side, this dissertation focuses on variants of dynamic programming algorithms. We start by putting together a general forest framework in Chapter 2, based on monotonic weight functions on directed hypergraphs, where 1-best search is easy and efficient, and then extend it to the k-best case in Chapter 3. Chapters 4 and 5 use these k-best algorithms for approximate search in huge forests without explicitly representing them.
On the practical side, we show by large-scale experiments that the forest representation and our forest-based search algorithms improve not only the efficiency but also the accuracy of state-of-the-art NLP systems. This is due to the fact that, first of all, the forest represents exponentially many alternatives in a very compact space, and secondly, our algorithms can effectively guide the search to more meaningful parts of the prohibitively large search space without blowing up the forest. We believe that the forest representation and these principled search algorithms will have many other applications in Natural Language Processing.
Chapter 2
Background: Packed Forests and
the Hypergraph Framework
I could be bounded in a nutshell, and count myself a king of infinite space...
William Shakespeare (1564-1616), Hamlet, c. 1601
In this chapter, we first give the intuition behind packed forests and then formalize them as directed hypergraphs (Gallo et al., 1993). The search space of decoding under many popular models in machine translation can also be formalized as a packed forest. This framework provides a convenient tool to capture many instances of Dynamic Programming on hierarchically branching search spaces such as forests in parsing, where we solve a big problem by dividing it into several sub-problems. Classical examples of such problems also include matrix-chain multiplication, optimal polygon triangulation, and optimal binary search trees (Cormen et al., 2001). The well-known CKY Algorithm and the Dijkstra and Knuth Algorithms will be presented as special cases under this unified framework.

Much of this chapter is based on my WPE II report (Huang, 2006) (which later became a COLING tutorial (Huang, 2008a)), with Section 2.2 drawn from Huang (2007) and Huang and Chiang (2007).
Figure 2.1: A partial forest of the example sentence.
2.1 Packed Forests as Hypergraphs
Informally, a packed parse forest, or forest for short, is a compact representation of all the
derivations (i.e., parse trees) for a given sentence under a context-free grammar (Billot and
Lang, 1989). For example, consider the following sentence
    0 I 1 saw 2 him 3 with 4 a 5 mirror 6
where the numbers between words denote string positions. As shown in Figure 2.1, this sentence has (at least) two derivations depending on the attachment of the prepositional phrase PP_{3,6} "with a mirror": it can either be attached to the verb "saw",

    VP_{1,6} → VBD_{1,2} NP_{2,3} PP_{3,6},    (*)
or be attached to the noun "him", which will be further combined with the verb to form the same VP_{1,6} as above. These two derivations can be represented as a single forest by sharing common sub-derivations. As first noted by Klein and Manning (2001), such a forest has the structure of a directed hypergraph, where items like PP_{3,6} are called nodes, and deductive steps like (*) correspond to hyperedges, which connect a set of antecedent nodes to a consequent node.
Definition 2.1 (Hypergraph)
A (weighted) directed hypergraph is a pair H = ⟨V, E⟩, where V is the set of vertices and E is the set of hyperedges. Each hyperedge e ∈ E is a tuple e = ⟨tails(e), head(e)⟩, where head(e) ∈ V is its head vertex and tails(e) ∈ V* is an ordered list of tail vertices. There is also a weight function f_e associated with each hyperedge e, mapping from R^{|tails(e)|} to R, where R is the set of weights. There is also a distinguished root node TOP ∈ V in each forest.
For a context-free grammar G = ⟨N, T, P, S⟩ and a given sentence w_{1:n} = w_1 . . . w_n, the forest H = ⟨V, E⟩ induced by G on w_{1:n} is constructed as follows: each node v ∈ V is in the form of X_{i,j}, where X ∈ N ∪ T is a nonterminal or terminal in the context-free grammar, which denotes the recognition of X spanning the substring from positions i through j (that is, w_{i+1} . . . w_j). The root node TOP represents the goal item in parsing, which is S_{0,n} with S being the start symbol of the grammar. For each terminal rule A → w, we have a set of hyperedges

    { ⟨(), A_{i,i+1}⟩  |  w_{i+1} = w }.

For each nonterminal rule A → B_1 . . . B_m, where B_1 . . . B_m ∈ N are also nonterminals,¹ we have a set of hyperedges

    { ⟨((B_1)_{i,i_1}, (B_2)_{i_1,i_2}, . . . , (B_m)_{i_{m-1},j}), A_{i,j}⟩  |  i_1 < i_2 < . . . < i_{m-1} < j }.

For example, the hyperedge for Deduction (*) is notated:

    e_1 = ⟨(VBD_{1,2}, NP_{2,3}, PP_{3,6}), VP_{1,6}⟩
Definition 2.2 (Arity)
We denote |e| = |tails(e)| to be the arity of the hyperedge. If |e| = 0, then f_e() ∈ R is a constant (f_e is a nullary function) and we call head(e) a source vertex. We define the arity of a hypergraph to be the maximum arity of its hyperedges.
A CKY forest has an arity of 2, since the input grammar is required to be binary
branching (cf. Chomsky Normal Form) to ensure cubic time parsing complexity (see for
example (Klein and Manning, 2001; Huang and Chiang, 2005) for details). However, this
does not hold in general and throughout this dissertation we will see many examples of
forests with arity larger than 2. This situation mainly arises from two aspects:
¹ Without loss of generality, we assume there is no terminal symbol on the right-hand side of a nonterminal rule, since we can always convert any CFG to an equivalent one which meets this restriction. This is a simple generalization of Chomsky Normal Form.
First, some of the context-free grammars we use are directly induced from a Treebank (Marcus et al., 1993), where there are many flat productions. For example, using the Treebank grammar, the arity of the forest in Figure 2.1 is 3. Such a Treebank-style forest is easier to work with for reranking (see Chapter 5), since many features can be directly expressed in it.

Second, in syntax-based machine translation, for example the systems described in Section 2.2, we often have synchronous context-free rules with more than two variables due to the structural divergences between languages. Theoretically, an unpruned forest for a grammar of arity m and a sentence of length n has size O(n^{m+1}), which is exponential in the arity. Zhang et al. (2006) and Huang (2007) discuss binarization algorithms so that efficient search becomes possible on the binarized forests.
Definition 2.3 (Backward-star and forward-star)
The backward-star BS(v) of a vertex v is the set of incoming hyperedges {e ∈ E | head(e) = v}. The in-degree of v is |BS(v)|. The forward-star FS(v) of a vertex v is the set of outgoing hyperedges {e ∈ E | v ∈ tails(e)}. The out-degree of v is |FS(v)|.

For example, in the forest in Figure 2.1, the backward-star BS(VP_{1,6}) = {e_1, e_2}, with the second hyperedge being e_2 = ⟨(VBD_{1,2}, NP_{2,6}), VP_{1,6}⟩. The forward-star of another node, VBD_{1,2}, is coincidentally also {e_1, e_2}.
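To make these definitions concrete, here is a minimal Python sketch of the hypergraph data structure (the class and method names are illustrative assumptions, not code from the thesis): each hyperedge stores its ordered tails, its head, and its own weight function, and the backward- and forward-stars are kept as adjacency lists.

    from collections import defaultdict

    class Hyperedge:
        def __init__(self, tails, head, weight_fn):
            self.tails = tuple(tails)    # ordered list of tail (antecedent) vertices
            self.head = head             # head (consequent) vertex
            self.weight_fn = weight_fn   # maps |tails| argument weights to one weight

        def arity(self):
            return len(self.tails)

    class Hypergraph:
        def __init__(self, root="TOP"):
            self.edges = []
            self.bs = defaultdict(list)  # backward-star: head -> incoming hyperedges
            self.fs = defaultdict(list)  # forward-star: tail -> outgoing hyperedges
            self.root = root             # the distinguished root node

        def add_edge(self, tails, head, weight_fn):
            e = Hyperedge(tails, head, weight_fn)
            self.edges.append(e)
            self.bs[head].append(e)
            for u in e.tails:
                self.fs[u].append(e)
            return e

    # the hyperedge e1 of Deduction (*), with a hypothetical additive cost function
    forest = Hypergraph()
    e1 = forest.add_edge(("VBD[1,2]", "NP[2,3]", "PP[3,6]"), "VP[1,6]",
                         weight_fn=lambda a, b, c: 1.0 + a + b + c)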
Hypergraphs are closely related to other formalisms like AND/OR graphs, context-free grammars, and deductive systems (Shieber et al., 1995; Nederhof, 2003).

In an AND/OR graph, the OR-nodes correspond to vertices in a hypergraph, and the AND-nodes, each of which links several OR-nodes to another OR-node, correspond to hyperedges. Similarly, in context-free grammars, nonterminals are vertices and productions are hyperedges; in deductive systems, items are vertices and instantiated deductions are hyperedges. Table 2.1 summarizes these correspondences. Obviously one can construct a corresponding hypergraph for any given AND/OR graph, context-free grammar, or deductive system. However, the hypergraph formulation provides greater modeling flexibility than the weighted deductive systems of Nederhof (2003): in the former we can have a separate weight function for each hyperedge, whereas in the latter the weight function is defined for a deductive (template) rule, which corresponds to many hyperedges.

    hypergraph              AND/OR graph     context-free grammar    deductive system
    vertex                  OR-node          symbol                  item
    source-vertex           leaf OR-node     terminal                axiom
    target-vertex           root OR-node     start symbol            goal item
    hyperedge               AND-node         production              instantiated deduction
    ⟨(u_1, u_2), v, f⟩      --               v →_f u_1 u_2           u_1 : a,  u_2 : b  ⊢  v : f(a, b)

Table 2.1: Correspondence between hypergraphs and related formalisms.
2.2 Examples: Forests in Machine Translation
Besides parsing, many instances of decoding in machine translation can also be cast as a
search problem in packed forests. State-of-the-art statistical MT models can be classified into two broad categories: phrase-based models (Koehn et al., 2003; Och and Ney, 2004), and syntax-based models. Depending on whether the input and output are strings or trees, the latter are often further divided into four sub-categories, which are summarized in Table 2.2.²
Section 2.2.1 will establish a unified framework for decoding under the phrase-based and two popular syntax-based translation models, namely the string-to-string and tree-to-string models, while Section 2.2.2 extends it to integrated decoding with an n-gram language
model, which is essential for achieving good translation quality and has been a standard
practice in the research community.
We will use the following example from Chinese to English for all translation systems
described in this section:
² Note that the difference between outputting a string or a tree (column 2 in Table 2.2) in syntax-based MT may seem a little blurred, since we can always read off the yield of the output tree as a string. So the major distinction is on the input side: whether to use a source-language string or parse tree (column 1 in Table 2.2).
    source    target    examples (partial)
    string    string    (Wu, 1997; Yamada and Knight, 2001; Chiang, 2005)
    string    tree      (Galley et al., 2006)
    tree      string    (Liu et al., 2006; Huang et al., 2006)
    tree      tree      (Quirk et al., 2005; Ding and Palmer, 2005; Zhang et al., 2008)

Table 2.2: Classification and examples of major syntax-based MT models.
    Bàowēiěr    yǔ      Shālóng    jǔxíng    le        huìtán
    Powell      with    Sharon     hold      [past]    meeting

    "Powell held a meeting with Sharon"
In order to do translation, we need to extend context-free grammar to the bilingual
case, which becomes synchronous context-free grammar (Lewis and Stearns, 1968; Aho
and Ullman, 1972).
Definition 2.4 (Synchronous Context-Free Grammar (SCFG))
An SCFG is a context-free rewriting system for generating string pairs. Each rule (synchronous production)

    A → ⟨α, β⟩

rewrites a pair of nonterminals in both languages, where α and β are the source-side and target-side components, and there is a one-to-one correspondence between the nonterminal occurrences in α and the nonterminal occurrences in β.
For example, the following SCFG can generate the sentence pair in the above example:
    S  → ⟨NP(1) VP(2),  NP(1) VP(2)⟩
    VP → ⟨PP(1) VP(2),  VP(2) PP(1)⟩
    NP → ⟨Bàowēiěr,  Powell⟩
    VP → ⟨jǔxíng le huìtán,  held a meeting⟩
    PP → ⟨yǔ Shālóng,  with Sharon⟩
Note that the second rule is an interesting reordering rule, which captures the swapping
of VP and PP between Chinese (source) and English (target).
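To see how these rules generate the example, the following worked derivation (added here as an illustration; it is not part of the original text) rewrites paired nonterminals synchronously, with the superscript indices indicating which source and target nonterminals are linked:

    ⟨S, S⟩ ⇒ ⟨NP(1) VP(2),  NP(1) VP(2)⟩
           ⇒ ⟨Bàowēiěr VP(2),  Powell VP(2)⟩
           ⇒ ⟨Bàowēiěr PP(3) VP(4),  Powell VP(4) PP(3)⟩
           ⇒ ⟨Bàowēiěr yǔ Shālóng VP(4),  Powell VP(4) with Sharon⟩
           ⇒ ⟨Bàowēiěr yǔ Shālóng jǔxíng le huìtán,  Powell held a meeting with Sharon⟩

The third step applies the reordering rule, which is what places "with Sharon" after "held a meeting" on the English side.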
2.2.1 Translation as Parsing
We will now consider three particular MT models:
  • the typical phrase-based model à la Pharaoh (Koehn, 2004),
  • the SCFG-based string-to-string model (Yamada and Knight, 2001), and
  • the tree-to-string model (Liu et al., 2006; Huang et al., 2006),
and show how decoding under these models can all be cast as a search problem in packed
forests under the hypergraph framework.
A typical phrase-based decoder generates partial target-language outputs in left-to-
right order in the form of hypotheses (Koehn, 2004). Each hypothesis has a coverage vector
capturing the source-language words translated so far, and can be extended into a longer
hypothesis by a phrase-pair translating an uncovered segment.
This process can be formalized as a deductive system. For example, the following
deduction step grows a hypothesis by the phrase-pair ⟨yǔ Shālóng, with Sharon⟩:

    ( _  _  _  •  •  • ) : (w, "held a talk")
    ───────────────────────────────────────────────────────────    (2.1)
    ( _  •  •  •  •  • ) : (w + c, "held a talk with Sharon")

where a • in the coverage vector indicates that the source word at this position is covered (for simplicity we omit here the ending position of the last phrase, which is needed for distortion costs), and where w and w + c are the weights of the two hypotheses, respectively, with c being the cost of the phrase-pair.
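For concreteness, the following Python sketch mirrors this deduction step; the representation (a frozen set of covered source positions plus the growing target string) and all names are illustrative assumptions, not the thesis's decoder:

    from typing import FrozenSet, Tuple

    Hypothesis = Tuple[FrozenSet[int], str, float]   # (covered positions, translation, weight)

    def extend(hyp: Hypothesis, src_span: Tuple[int, int],
               tgt_phrase: str, cost: float) -> Hypothesis:
        """Grow a hypothesis by a phrase pair translating the uncovered span [i, j)."""
        covered, translation, w = hyp
        i, j = src_span
        new_positions = frozenset(range(i, j))
        assert not (covered & new_positions), "the phrase must cover an uncovered segment"
        return (covered | new_positions, translation + " " + tgt_phrase, w + cost)

    # source positions (0-based): 0 Baoweier, 1 yu, 2 Shalong, 3 juxing, 4 le, 5 huitan.
    # the antecedent covers 3-5 ("held a talk"); applying <yu Shalong, with Sharon>
    # over positions 1-2 yields the consequent item of Deduction (2.1).
    h0 = (frozenset({3, 4, 5}), "held a talk", 2.0)
    h1 = extend(h0, (1, 3), "with Sharon", 0.5)      # now also covers {1, 2}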
Similarly, the decoding problem with SCFGs can also be cast as a deductive (parsing)
system (Shieber et al., 1995). Basically, we parse the input string using the source projec-
tion of the SCFG while building the corresponding subtranslations in parallel. A possible
deduction of the above example is notated:
    PP_{1,3} : (w_1, t_1)        VP_{3,6} : (w_2, t_2)
    ─────────────────────────────────────────────────────    (2.2)
    VP_{1,6} : (w_1 + w_2 + c', t_2 t_1)

where the subscripts denote indices in the input sentence just as in CKY parsing, w_1 and w_2 are the scores of the two antecedent items, and t_1 and t_2 are the corresponding subtranslations.
Figure 2.2: Illustration of tree-to-string deduction (translation rule on the right).
The resulting translation t_2 t_1 is the inverted concatenation as specified by the target side of the SCFG rule, with the additional cost c' being the cost of this rule.
Finally, in the tree-to-string or syntax-directed approach (Liu et al., 2006; Huang et al., 2006), the decoder takes a source-language tree as input and tries to recursively rewrite the tree by matching it to SCFG rules.³ Shown in Figure 2.2, the deduction corresponding to an application of the same VP rule would be:

    PP_{η.1} : (w_1, t_1)        VP_{η.2} : (w_2, t_2)
    ──────────────────────────────────────────────────
    VP_η : (w_1 + w_2, t_2 t_1)

where η, η.1, and η.2 are Gorn addresses (Shieber et al., 1995), η.1 and η.2 being the first and second child of η, respectively. For pattern-matching, the nonterminal labels at these tree nodes must match those in the SCFG rule, e.g., the input tree must have a PP at node η.1 and a VP at node η.2.
These three deductive systems represent the search space of decoding without a language model (henceforth -LM decoding). When one is instantiated for a particular input string, it defines a set of derivations, called a translation forest, which has the same compact structure as the packed parse forest we saw in Section 2.1. Accordingly we call items like ( _ • • • • • ), VP_{1,6}, and VP_η nodes in the forest. The hyperedge for Deduction 2.1 is notated

    e_1 = ⟨(( _  _  _  •  •  • )), ( _  •  •  •  •  • )⟩

³ Actually both Liu et al. (2006) and Huang et al. (2006) use synchronous tree-substitution grammars (STSG) instead of SCFGs, but the difference is immaterial to the algorithms discussed in this dissertation. See Section 6.1 for details.
with its cost function being

    f_{e_1}((w, t)) = (w + c, t + "with Sharon").

Similarly, the hyperedge for Deduction 2.2 is notated

    e_2 = ⟨(PP_{1,3}, VP_{3,6}), VP_{1,6}⟩
with its cost function being
    f_{e_2}((w_1, t_1), (w_2, t_2)) = (w_1 + w_2 + c', t_2 t_1).
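Such a cost function is easy to state in code. The sketch below (Python, with hypothetical names; it is not the thesis's implementation) builds the weight function of a translation hyperedge from a rule cost and the rule's target-side reordering, matching the behavior of f_{e_2} above:

    # each -LM item carries (score, translation); the hyperedge's weight function
    # adds the rule cost and concatenates subtranslations in target-side order.
    def make_scfg_weight_fn(rule_cost, target_order):
        """target_order lists antecedent indices in target-side order, e.g. (1, 0)
        for the reordering rule VP -> <PP(1) VP(2), VP(2) PP(1)>."""
        def weight_fn(*items):
            score = rule_cost + sum(w for (w, _) in items)
            translation = " ".join(items[i][1] for i in target_order)
            return (score, translation)
        return weight_fn

    # the hyperedge e2 = <(PP[1,3], VP[3,6]), VP[1,6]> of Deduction (2.2):
    f_e2 = make_scfg_weight_fn(rule_cost=0.5, target_order=(1, 0))
    item = f_e2((1.0, "with Sharon"), (2.0, "held a meeting"))
    # item == (3.5, "held a meeting with Sharon")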
The goal of -LM decoding is simply to find the best derivation of the root node in the hypergraph, i.e., the fully covered item ( • • • • • • ), S_{0,n}, or S_ε, where n is the length of the input string and ε is the tree address of the root node of the input tree. Its running time is proportional to the size of the forest, |F|, i.e., the number of hyperedges, which is O(2^n n) for phrase-based models, O(n^3) for string-to-string, and O(n) for tree-to-string approaches.
2.2.2 Adding a Language Model
To integrate with a bigram language model, we can use the dynamic-programming algorithms of Och and Ney (2004) and Wu (1996) for phrase-based and SCFG-based systems, respectively, which we may think of as doing a search in a finer-grained version of the forests above. Each node v in the -LM forest will be split into a set of augmented items, which we call +LM items. For phrase-based decoding, a +LM item has the form (v, a), where a is the last word of the hypothesis. Thus a +LM version of Deduction (2.1) might be:
    ( ( _  _  _  •  •  • ), talk ) : (w, "held a talk")
    ────────────────────────────────────────────────────────────────
    ( ( _  •  •  •  •  • ), Sharon ) : (w', "held a talk with Sharon")

where the score of the resulting +LM item

    w' = w + c − log P_lm(with | talk)

now includes a combination cost due to the bigrams formed when applying the phrase-pair.
Similarly, a +LM item in SCFG-based models has the form (v^{a ⋆ b}), where a and b are boundary words of the hypothesis string, and ⋆ is a placeholder symbol for an elided part of that string, indicating that a possible translation of the part of the input spanned by v starts with a and ends with b. An example +LM version of Deduction (2.2) is:
    (PP^{with ⋆ Sharon}_{(1,3)}) : (w_1, t_1)        (VP^{held ⋆ talk}_{(3,6)}) : (w_2, t_2)
    ─────────────────────────────────────────────────────────────────────────────────
    (VP^{held ⋆ Sharon}_{(1,6)}) : (w, t_2 t_1)

where

    w = w_1 + w_2 + c' − log P_lm(with | talk)

with a similar combination cost formed in combining adjacent boundary words of the antecedents. The case for tree-to-string decoding is similar and thus omitted here.
To make sure the translation begins with a start symbol (<s>) and ends with a stop symbol (</s>), we also need to construct special hyperedges leading into the new root node, of the form

    e_i = ⟨((S^{a ⋆ b}_{(0,n)})), TOP⟩

for all a, b in the target-language vocabulary, with the cost function being

    f_{e_i}((w, t)) = (w − log P_lm(a | <s>) − log P_lm(</s> | b), <s> t </s>).
This scheme can be easily extended to work with a general m-gram model (Chiang, 2007).
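As a toy illustration of this node-splitting scheme (the names and the bigram probabilities below are made up for the example, and this is not the thesis's implementation), a +LM item pairs a -LM node with its boundary words, and combining two items adds the language-model cost of the bigram formed at their seam:

    import math

    # a toy bigram model: P_lm(next | prev); unseen bigrams get a small floor probability
    BIGRAM = {("talk", "with"): 0.2, ("a", "talk"): 0.3}

    def lm_cost(prev, word):
        return -math.log(BIGRAM.get((prev, word), 1e-4))

    # a +LM item for an SCFG-based system: (-LM node, leftmost word, rightmost word, cost)
    def combine(head, item1, item2, rule_cost):
        """+LM version of Deduction (2.2): target-side order is item2 followed by item1."""
        v1, a1, b1, w1 = item1   # e.g. ("PP[1,3]", "with", "Sharon", w1)
        v2, a2, b2, w2 = item2   # e.g. ("VP[3,6]", "held", "talk", w2)
        seam = lm_cost(b2, a1)   # bigram formed at the seam: P_lm(with | talk)
        return (head, a2, b1, w1 + w2 + rule_cost + seam)

    item = combine("VP[1,6]",
                   ("PP[1,3]", "with", "Sharon", 1.0),
                   ("VP[3,6]", "held", "talk", 2.0), rule_cost=0.5)
    # item == ("VP[1,6]", "held", "Sharon", 3.5 + lm_cost("talk", "with"))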
The running time for +LM decoding is the size of the +LM forest, which is O(|F| |T|^{m-1}) for phrase-based models, and O(|F| |T|^{4(m-1)}) for binary-branching SCFG-based models, where |F| is the size of the -LM forest, and |T| is the number of possible target-side words, because each hyperedge combines two +LM items, which each have a pair of (m-1) boundary words (Huang et al., 2005). Even if we assume a constant number of translations for each word in the input, with a standard trigram model, this still amounts to O(2^n n^3) for phrase-based models, O(n^{11}) for SCFG-based models, and O(n^9) for tree-to-string models.
As these huge-sized +LM forests make exact +LM decoding practically infeasible, we will
study approximate search algorithms in Chapter 4 and Chapter 5.
2.3 Dynamic Programming on Hypergraphs
In previous sections we used the hypergraph framework to formalize the forest search spaces
in parsing and machine translation. Now we turn to the search problem itself, which aims
to find the best derivation (e.g. parse tree or POS tag sequence) in a forest. These problems
are also called optimization problems. But before discussing the algorithms, we will need
some algebraic and structural properties of hypergraphs.
2.3.1 Monotonicity, Acyclicity, and Superiority
Defined below, the crucial property for doing optimization is monotonicity, which corre-
sponds to the optimal substructure property in dynamic programming (Cormen et al.,
2001).
Definition 2.5 (Monotonicity)
A function f : R^m → R is monotonic with regard to ⪯ if, for all i ∈ 1..m,

    (a_i ⪯ a'_i)  ⇒  f(a_1, · · · , a_i, · · · , a_m) ⪯ f(a_1, · · · , a'_i, · · · , a_m).

A hypergraph H with costs in R is monotonic if there is a total ordering ⪯ on R such that every weight function f in H is monotonic with regard to ⪯.
We can borrow the additive operator ⊕ from semirings (Mohri, 2002) to define a comparison operator:

    a ⊕ b = a  if a ⪯ b,  and  b  otherwise.
Besides the required property of monotonicity, we next define two optional properties related to the orders of node traversal in hypergraphs. First, the following structural
property generalizes the acyclicity of graphs to hypergraphs so that we can traverse them
in a topological order.
Definition 2.6 (Graph Projection, Acyclicity and Topological Order)
The graph projection of a hypergraph H = ⟨V, E, t, R⟩ is a directed graph G = ⟨V, E'⟩ where

    E' = { (u, v) | ∃e ∈ BS(v), s.t. u ∈ tails(e) }.

A hypergraph H is acyclic if its graph projection G is acyclic; then a topological ordering of H is an ordering of V that is a topological ordering in G.
Second, the algebraic property superiority corresponds to the non-negative edge
weights requirement in the Dijkstra Algorithm so that we can explore the hypergraph
in a best-first order.
Definition 2.7 (Superiority)
A function f : R^m → R is superior if the result of function application is worse than each of its arguments:

    ∀i ∈ 1..m,  a_i ⪯ f(a_1, · · · , a_i, · · · , a_m).

A hypergraph H is superior if every weight function f in H is superior.
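For intuition (an illustration added here, not part of the original text), take R to be the non-negative costs ordered by ≤ (smaller is better) and let each weight function be additive:

    f_e(a_1, . . . , a_m) = c_e + a_1 + · · · + a_m,    with c_e ≥ 0.

Then f_e is monotonic, since increasing any argument cannot decrease the result, and superior, since a_i ≤ f_e(a_1, . . . , a_m) for every i. This is exactly the setting of probabilistic parsing with negative log-probabilities, where c_e = − log Pr(rule).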
We also need a formal definition of derivation as a recursive structure in the hypergraph.
Definition 2.8 (Derivation)
A derivation D of a vertex v in a hypergraph H, its size |D| and its weight w(D) are recursively defined as follows:

  • If e ∈ BS(v) with |e| = 0, then D = ⟨e, ()⟩ is a derivation of v, its size |D| = 1, and its weight w(D) = f_e().

  • If e ∈ BS(v) where |e| > 0 and D_i is a derivation of tails_i(e) for 1 ≤ i ≤ |e|, then D = ⟨e, (D_1, . . . , D_{|e|})⟩ is a derivation of v, its size |D| = 1 + Σ_{i=1}^{|e|} |D_i|, and its weight w(D) = f_e(w(D_1), . . . , w(D_{|e|})).

The ordering on weights in R induces an ordering on derivations: D ⪯ D' iff w(D) ⪯ w(D'). We denote D(v) to be the set of derivations of v, and the best cost of a node v is defined as:

    δ(v) = 1                           if |BS(v)| = 0
           ⊕_{D ∈ D(v)} w(D)           otherwise                    (2.3)
2.3.2 Generalized Viterbi Algorithm
The well-known Viterbi Algorithm (Viterbi, 1967) was originally defined on a lattice or directed acyclic graph (DAG). We now extend it to directed acyclic hypergraphs (DAHs) with only marginal modifications (see Code 2.1).
Code 2.1 Generalized Viterbi Algorithm.
 1: procedure General-Viterbi(H = ⟨V, E⟩)
 2:     topologically sort the vertices of H
 3:     Initialize(H)
 4:     for each vertex v in topological order do
 5:         for each hyperedge e in BS(v) do
 6:             ⊲ e is ⟨(u_1, u_2, . . . , u_{|e|}), v, f_e⟩
 7:             d(v) ⊕= f_e(d(u_1), d(u_2), . . . , d(u_{|e|}))
The correctness of this algorithm can be proved by a simple induction on the topolog-
ically sorted sequence of nodes. Its time complexity is O(|V| + |E|), since every hyperedge
is visited exactly once (assuming the arity of the hypergraph is a constant).
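As a minimal executable companion to Code 2.1 (a sketch under the assumption of additive costs with min as the choice operator; it is not the thesis's implementation), the following Python function computes d(v) for every vertex of an acyclic hypergraph given as a list of hyperedges:

    from graphlib import TopologicalSorter

    def general_viterbi(edges, leaves):
        """1-best dynamic programming on an acyclic hypergraph (cf. Code 2.1).

        edges:  list of (tails, head, weight_fn) with tails a tuple of vertices;
        leaves: dict mapping each source vertex to its (constant) weight.
        Weights are costs: smaller is better, and each weight_fn is monotonic.
        """
        # graph projection for topological sorting: a head depends on its tails
        deps = {}
        for tails, head, _ in edges:
            deps.setdefault(head, set()).update(tails)
        order = TopologicalSorter(deps).static_order()

        d = dict(leaves)                       # best cost found so far for each vertex
        bs = {}                                # backward-star: head -> incoming edges
        for e in edges:
            bs.setdefault(e[1], []).append(e)
        for v in order:                        # visit vertices bottom-up
            for tails, head, weight_fn in bs.get(v, []):
                cand = weight_fn(*(d[u] for u in tails))
                if v not in d or cand < d[v]:  # the comparison operator a (+) b
                    d[v] = cand
        return d

    # a tiny CKY-like example with additive costs (hyperedge cost plus tail costs):
    edges = [(("VBD[1,2]", "NP[2,3]", "PP[3,6]"), "VP[1,6]", lambda a, b, c: 1.0 + a + b + c),
             (("VBD[1,2]", "NP[2,6]"), "VP[1,6]", lambda a, b: 2.0 + a + b)]
    leaves = {"VBD[1,2]": 0.5, "NP[2,3]": 0.7, "PP[3,6]": 0.9, "NP[2,6]": 2.0}
    best = general_viterbi(edges, leaves)      # best["VP[1,6]"] is 3.1 (first hyperedge wins)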
The above algorithm uses hyperedges in the backward-star BS(v) of node v to update
the best derivation of v, but we can also have a symmetric version, shown in Code 2.2,
using the best derivation of v to update along hyperedges in the forward-star FS(v). This algorithm is sometimes easier to implement in practice, for example in Treebank parsers (Collins, 1999; Charniak, 2000). To ensure that a hyperedge e is fired only when all of its tail vertices have been fixed to their best weights, we maintain a counter r[e] of the remaining tail vertices yet to be fixed (line 5) and fire the update rule for e when r[e] = 0 (line 9). This method is also used in the Knuth algorithm (Section 2.3.3).
CKY Algorithm
The most widely used algorithm for parsing in NLP, the CKY algorithm (Kasami, 1965), is a specific instance of the Viterbi algorithm for hypergraphs. The CKY algorithm takes a context-free grammar G in Chomsky Normal Form (CNF) and essentially intersects G with a DFA D representing the input sentence to be parsed. The search space resulting from this intersection is an acyclic hypergraph whose vertices are items like X_{i,j} and whose hyperedges are instantiated deductions like ⟨(Y_{i,k}, Z_{k,j}), X_{i,j}⟩ for all i < k < j, if there is a
Code 2.2 Forward-update variant of the Generalized Viterbi Algorithm (2.1).
 1: procedure General-Viterbi-Forward(H)
 2:     topologically sort the vertices of H
 3:     Initialize(H)
 4:     for each hyperedge e do
 5:         r[e] ← |e|                        ⊲ counter of remaining tails to be fixed
 6:     for each vertex v in topological order do
 7:         for each hyperedge e in FS(v) do
 8:             r[e] ← r[e] − 1
 9:             if r[e] == 0 then             ⊲ all tails have been fixed
10:                 ⊲ e is ⟨(u_1, u_2, . . . , u_{|e|}), h(e), f_e⟩
11:                 d(h(e)) ⊕= f_e(d(u_1), d(u_2), . . . , d(u_{|e|}))
production X → Y Z. The weight function f is simply

    f(a, b) = a · b · Pr(X → Y Z).

The Chomsky Normal Form ensures acyclicity of the hypergraph, but there are multiple topological orderings, which result in different variants of the CKY algorithm such as the standard bottom-up CKY, left-to-right CKY, etc.
Most treebank parsers, including (Collins, 1999; Charniak, 2000), use the forward-update version of the Viterbi algorithm (Code 2.2), because the context-free grammars used in those parsers are in some sense dynamic, where enumerating all rules X → · · · rewriting a nonterminal is much harder than dynamically combining nonterminals Y and Z to form X.
2.3.3 Knuth 1977 Algorithm
Knuth (1977) generalizes the Dijkstra algorithm to what he calls the grammar problem,
which essentially corresponds to the search problem in a monotonic superior hypergraph
(see Table 2.1 for the correspondence). However, he does not provide an efficient implementation or an analysis of complexity. Graehl and Knight (2004) present an implementation
    Algorithm       Requirements                    Complexity                 Example
    Gen. Viterbi    monotonicity and acyclicity     O(|V| + |E|)               CKY
    Knuth 77        monotonicity and superiority    O(|V| log |V| + |E|)       A* parsing

Table 2.3: Summary of Viterbi and Knuth Algorithms.
that runs in time O(|V| log |V| + |E|), using the method described in the forward-update version of the generalized Viterbi Algorithm (Code 2.2) to ensure that every hyperedge is visited only once (assuming the priority queue is implemented as a Fibonacci heap; with a binary heap, it runs in O((|V| + |E|) log |V|)). Table 2.3 compares the Viterbi and Knuth Algorithms.
Code 2.3 The Knuth 1977 Algorithm.
 1: procedure Knuth(H)
 2:     Initialize(H)
 3:     Q ← V[H]                              ⊲ prioritized by d-values
 4:     for each hyperedge e do
 5:         r[e] ← |e|
 6:     while Q ≠ ∅ do
 7:         v ← Extract-Min(Q)
 8:         for each edge e in FS(v) do
 9:             ⊲ e is ⟨(u_1, u_2, . . . , u_{|e|}), h(e), f_e⟩
10:             r[e] ← r[e] − 1
11:             if r[e] == 0 then
12:                 d(h(e)) ⊕= f_e(d(u_1), d(u_2), . . . , d(u_{|e|}))
13:                 Decrease-Key(Q, h(e))
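For comparison, here is a small Python sketch in the spirit of Code 2.3 (again illustrative, not the thesis's code); it replaces Decrease-Key with heapq plus lazy deletion, and relies on superiority to finalize a vertex the first time it is popped:

    import heapq

    def knuth_best(edges, leaves):
        """Dijkstra/Knuth-style 1-best search on a monotonic superior hypergraph.

        edges:  list of (tails, head, weight_fn); leaves: dict of source vertex -> cost.
        Returns the best cost d(v) for every reachable vertex.
        """
        fs = {}                                   # forward-star: tail -> outgoing edges
        remaining = {}                            # r[e]: tails of e not yet finalized
        for idx, (tails, head, fn) in enumerate(edges):
            remaining[idx] = len(tails)
            for u in tails:
                fs.setdefault(u, []).append(idx)

        d = dict(leaves)
        heap = [(w, v) for v, w in leaves.items()]
        heapq.heapify(heap)
        done = set()
        while heap:
            w, v = heapq.heappop(heap)            # Extract-Min
            if v in done:                         # stale entry (lazy Decrease-Key)
                continue
            done.add(v)                           # d(v) is now fixed, by superiority
            for idx in fs.get(v, []):
                remaining[idx] -= 1
                if remaining[idx] == 0:           # all tails fixed: fire the hyperedge
                    tails, head, fn = edges[idx]
                    cand = fn(*(d[u] for u in tails))
                    if head not in d or cand < d[head]:
                        d[head] = cand
                        heapq.heappush(heap, (cand, head))
        return d

    # the same toy forest as before; the result agrees with the Viterbi sketch
    edges = [(("VBD[1,2]", "NP[2,3]", "PP[3,6]"), "VP[1,6]", lambda a, b, c: 1.0 + a + b + c)]
    leaves = {"VBD[1,2]": 0.5, "NP[2,3]": 0.7, "PP[3,6]": 0.9}
    print(knuth_best(edges, leaves)["VP[1,6]"])   # 3.1, the cost of the only derivation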
The A* Algorithm
The Knuth algorithm can be extended to the A* algorithm (Hart et al., 1968) on hypergraphs when the weight functions factor into semiring operations. A specific case of this algorithm is the A* parsing of Klein and Manning (2003), where they achieve significant speed-ups
using some carefully designed heuristic functions.
2.4 Summary
We have presented a general framework of directed hypergraphs to represent the packed forests in parsing and machine translation. The concept of monotonic weight functions captures the optimal substructure property in dynamic programming (DP). We also presented two classical DP algorithms for 1-best inference in the forest under this framework.
In the next chapter, we will extend the 1-best Viterbi algorithm to the k-best case, which
will be further developed for approximate search in the forest in Chapters 4 and 5.
Chapter 3
Exact k-best Dynamic
Programming on Forests
The best [now] is not necessarily the best in a larger scale, or in the future.
Yu, Minghong (b. 1962), New Oriental School Addresses
This chapter develops fast and exact k-best dynamic programming algorithms on forests
by extending the generalized Viterbi Algorithm in the previous Chapter. In practice they
are shown to be orders of magnitude faster than previously used methods on state-of-the-
art parsers and MT decoders. We also show empirically how the improved output of our
algorithms has the potential to improve results from parse reranking systems and other
applications.
Much of this chapter is based on Huang and Chiang (2005), with Section 3.7.1 due to
a suggestion by Jason Eisner (p.c.), and Section 3.7.2 drawn from Huang et al. (2006).
3.1 Motivations
As discussed in Chapter 1, many problems in natural language processing (NLP) involve
optimizing some objective function over a set of possible analyses of an input string. This
set is often of exponential size, but it can be compactly represented as a forest by merging equivalent subanalyses. If the objective function f is compatible with the forest, i.e., if f decomposes into monotonic weight functions on each hyperedge, then it can be optimized efficiently by dynamic programming (Chapter 2).
However, when the objective function f has no compatible packed representation, exact inference would be intractable. To alleviate this problem, a common approach in NLP is to split the computation into two phases: in the first phase, use some compatible objective function f′ to produce a k-best list (the top k candidates under f′), which serves as an approximation to the full set. Then, in the second phase, optimize f over all the analyses in the k-best list. A typical example is discriminative reranking on k-best lists from a generative module, such as Collins (2000) for parsing and Shen et al. (2004) for translation, where the reranking model has nonlocal features that cannot be computed during parsing proper. Another example is minimum-Bayes-risk decoding (Kumar and Byrne, 2004; Goodman, 1998), where, assuming f′ defines a probability distribution over all candidates, one seeks the candidate with the highest expected score according to an arbitrary metric (e.g., PARSEVAL or BLEU); since in general the metric will not be compatible with the parsing algorithm, the k-best lists can be used to approximate the full distribution f′. A similar situation occurs when the parser can produce multiple derivations that are regarded as equivalent (e.g., multiple lexicalized parse trees corresponding to the same unlexicalized parse tree); if we want the maximum a posteriori parse, we have to sum over equivalent derivations. Again, the equivalence relation will in general not be compatible with the parsing algorithm, so the k-best lists can be used to approximate f′, as in Data Oriented Parsing (Bod, 1992) and in speech recognition (Mohri and Riley, 2002).
Another instance of this k-best approach is cascaded optimization. NLP systems are often cascades of modules whose objective functions we want to optimize jointly. However, a module is often incompatible with the packed representation of the previous module due to factors like non-local dependencies. So we might want to postpone some disambiguation by propagating k-best lists to subsequent phases, as in joint parsing and semantic role labeling (Gildea and Jurafsky, 2002; Sutton and McCallum, 2005), information extraction and coreference resolution (Wellner et al., 2004), and the formal semantics of TAG (Joshi and Vijay-Shanker, 1999).
Moreover, much recent work on discriminative training uses k-best lists; they are sometimes used to approximate the normalization constant or partition function (which would otherwise be intractable), or to train a model by optimizing some metric that is incompatible with the packed representation. For example, Och (2003) shows how to train a log-linear translation model not by maximizing the likelihood of the training data, but by maximizing the BLEU score (among other metrics) of the model on that data.
3.2 Related Work
For algorithms whose packed representations are graphs, such as Hidden Markov Models and other finite-state methods, Ratnaparkhi's MXPARSE parser (Ratnaparkhi, 1997), and many stack-based machine translation decoders (Brown et al., 1995; Och and Ney, 2004), the k-best paths problem is well studied both in the pure algorithmic literature (see Eppstein (2001) and Brander and Sinclair (1995) for surveys) and in the NLP/speech community (Mohri, 2002; Mohri and Riley, 2002). This chapter, however, aims at k-best tree algorithms whose packed representations are hypergraphs (Gallo et al., 1993; Klein and Manning, 2001) (equivalently, and/or graphs or packed forests), which include most parsers and parsing-based MT decoders. Any algorithm expressible as a weighted deductive system (Shieber et al., 1995; Goodman, 1998; Nederhof, 2003) falls into this class. In our experiments, we apply the algorithms to the lexicalized PCFG parser of Bikel (2004), which is very similar to Collins' Model 2 (Collins, 2003), and to a synchronous-CFG-based machine translation system (Chiang, 2005).
As pointed out by Charniak and Johnson (2005), the major difficulty in k-best parsing is dynamic programming. The simplest method is to abandon dynamic programming and rely on aggressive pruning to maintain tractability, as is done in Collins (2000) and Bikel (2004). But this approach is prohibitively slow and produces rather low-quality k-best lists (see Sec. 3.8.1). Gildea and Jurafsky (2002) described an O(k^2)-overhead extension of the CKY algorithm and reimplemented Collins' Model 1 to obtain k-best parses, with an average of 14.9 parses per sentence. Their algorithm turns out to be a special case of our Algorithm 0 (Sec. 3.4), and is also reported to be prohibitively slow.
Since the original design of the algorithm described below, we have become aware of two efforts that are very closely related to ours: one by Jimenez and Marzal (2000) and another carried out in parallel with ours by Charniak and Johnson (2005). Jimenez and Marzal present an algorithm very similar to our Algorithm 3 (Sec. 3.7), while Charniak and Johnson propose an algorithm similar to our Algorithm 0, but with multiple passes to improve efficiency. They apply this method to the Charniak (2000) parser to get 50-best lists for reranking, yielding an improvement in parsing accuracy.
Our work differs from Jimenez and Marzal's in the following three respects. First, we formulate the parsing problem in the more general framework of hypergraphs (Klein and Manning, 2001), making it applicable to a very wide variety of parsing algorithms, whereas Jimenez and Marzal define their algorithm as an extension of CKY, for CFGs in Chomsky Normal Form (CNF) only. This generalization is not only of theoretical importance, but also critical in the application to state-of-the-art parsers such as Collins (2003) and Charniak (2000). In Collins' parsing model, for instance, the rules are dynamically generated and include unary productions, making it very hard to convert the grammar to CNF by preprocessing, whereas our algorithms can be applied to these parsers directly. Second, our Algorithm 3 improves on Jimenez and Marzal's, which leads to a slight theoretical and empirical speedup. Third, we have implemented our algorithms on top of state-of-the-art, large-scale statistical parsers and decoders and report extensive experimental results, while Jimenez and Marzal's was tested only on relatively small grammars.
On the other hand, our algorithms are more scalable and much more general than the coarse-to-fine approach of Charniak and Johnson. In our experiments, we can obtain 10000-best lists nearly as fast as 1-best parsing, with very modest use of memory. Indeed, Charniak (p.c.) has adopted our Algorithm 3 into his own parser implementation (available at ftp://ftp.cs.brown.edu/pub/nlparser/) and confirmed our findings.
In the literature of k shortest-path problems, Minieka (1974) generalized the Floyd algorithm in a way very similar to our Algorithm 0, and Lawler (1977) improved it using an idea similar to, but slightly slower than, the binary-branching case of our Algorithm 1. For
hypergraphs, Gallo et al. (1993) study the shortest hyperpath problem and Nielsen et al.
(2005) extend it to the k shortest hyperpaths problem. Our work differs from Nielsen et al. (2005) in two respects. First, we solve the problem of k-best derivations (i.e., trees), not k-best hyperpaths, although in some cases the two coincide (see Sec. 3.3.1 for further discussion). Second, their work assumes non-negative costs (or, equivalently, probabilities ≤ 1) so that they can apply Dijkstra-like algorithms. Although generative models, being probability-based, do not suffer from this problem, more general models (e.g., log-linear models) may require negative edge costs (McDonald et al., 2005; Taskar et al., 2004). Our work, based on the Viterbi algorithm, is still applicable as long as the hypergraph is acyclic, and is used by McDonald et al. (2005) to get the k-best parses.
3.3 Preliminaries
We will start off from the 1-best Viterbi Algorithm (Code 2.1), which traverses the hypergraph in topological order and, for each vertex v, calculates its 1-best derivation D_1(v) using all incoming hyperarcs e ∈ BS(v).
In order to extend it to the k-best case, we need to compute the top-k derivations,

    D_1(v) ≤ D_2(v) ≤ · · · ≤ D_k(v),

which we shall denote by a vector D(v).
With the derivations thus ranked, we can introduce a nonrecursive representation for
derivations that is analogous to the use of back-pointers in parser implementation.
Definition 3.1 (Derivation with back-pointers (dbp))
A derivation with back-pointers (dbp) D̂ of v is a tuple ⟨e, j⟩ such that e = ⟨v, (u_1 . . . u_|e|)⟩ ∈ BS(v) and j ∈ {1, 2, . . . , k}^|e|. There is a one-to-one correspondence between dbps of v and derivations of v:

    ⟨e, (j_1 . . . j_|e|)⟩  ↔  ⟨e, (D_{j_1}(u_1(e)), . . . , D_{j_|e|}(u_|e|(e)))⟩.

Accordingly, we extend the cost function w to dbps: w(D̂) = w(D) if D̂ ↔ D. This in turn induces an ordering on dbps: D̂ ≤ D̂′ iff w(D̂) ≤ w(D̂′). Let D̂_i(v) ↔ D_i(v) denote the i-th best dbp of v.
For example, for a hyperedge e = ⟨(u_1, u_2), v⟩, the dbp ⟨e, (2, 3)⟩ denotes the derivation of v that combines the second-best derivation of u_1 with the third-best derivation of u_2, whose cost (or weight) is f_e(w(D_2(u_1)), w(D_3(u_2))). For simplicity of presentation, in the following we will no longer make the notational distinction between derivations and dbps. Code 3.1 shows the 1-best Viterbi Algorithm under the dbp notation.
Code 3.1 Generalized Viterbi Algorithm (2.1) in the dbp notation.
1: procedure Viterbi(V, E)
2:   for v ∈ V in topological order do
3:     for e ∈ BS(v) do                       ▷ for all incoming hyperarcs
4:       D_1(v) ← min(D_1(v), ⟨e, 1⟩)         ▷ update
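As an illustration, here is a minimal Python sketch of Code 3.1, assuming the forest is given as a dictionary BS mapping each vertex to its incoming hyperedges (each a pair of a tail-vertex tuple and a monotonic weight function) and that source vertices carry zero cost; these encoding details are assumptions of the sketch, not part of the pseudocode.

    def viterbi_1best(topo_order, BS):
        """Generalized 1-best Viterbi over an acyclic hypergraph (cf. Code 3.1).

        topo_order -- vertices in topological order (tails before heads)
        BS         -- dict: vertex v -> list of incoming hyperedges (tails, f),
                      where f combines the costs of the tails into a cost for v
        Returns the 1-best cost and the 1-best dbp of every vertex.
        """
        cost, dbp = {}, {}
        for v in topo_order:
            if not BS[v]:                        # source vertex (assumed cost 0)
                cost[v], dbp[v] = 0.0, None
                continue
            cost[v] = float('inf')
            for e in BS[v]:
                tails, f = e
                c = f(*[cost[u] for u in tails])             # monotonic weight function
                if c < cost[v]:                              # the min-update of line 4
                    cost[v] = c
                    dbp[v] = (e, (1,) * len(tails))          # <e, 1>: best sub-derivations
        return cost, dbp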
Before moving on to the k-best algorithms, let us elaborate on the difference between two important concepts, derivations and hyperpaths, which distinguishes our work from that of Nielsen et al. (2005).
3.3.1 Excursion: Derivations vs. Hyperpaths
The work of (Klein and Manning, 2001) introduces a correspondence between hyperpaths
and derivations. When extended to the k-best case, however, that correspondence no
longer holds.
Definition 3.2 (Hyperpath (Nielsen et al., 2005))
Given a hypergraph H = ⟨V, E⟩, a hyperpath of destination v ∈ V is an acyclic minimal hypergraph H′ = ⟨V′, E′⟩ such that
1. E′ ⊆ E,
2. v ∈ V′ = ⋃_{e ∈ E′} (tails(e) ∪ {head(e)}), and
3. ∀u ∈ V′, u is either a source vertex or connected to a source vertex in H′.
As illustrated by Figure 3.1, derivations (as trees) are different from hyperpaths (as minimal hypergraphs) in the sense that in a derivation the same vertex can appear more than once, with possibly different sub-derivations, while it is represented at most once in a hyperpath.
Figure 3.1: Examples of a hypergraph, a hyperpath, and a derivation: (a) a hypergraph H, with t as the target vertex and p, q as source vertices; (b) a hyperpath in H with destination t; and (c) a derivation of t in H, in which vertex u appears twice with two different (sub-)derivations; this would be impossible in a hyperpath.
Figure 3.2: An Earley derivation in which the item (A → α.Bβ, i, j) appears twice (once from predict and once from complete).
Thus, the k-best derivations problem we solve in this chapter is very different in nature from the k-shortest-hyperpaths problem of Nielsen et al. (2005).
However, the two problems do coincide when k = 1 (since all the sub-derivations must
be optimal) and for this reason the 1-best hyperpath algorithm by Klein and Manning
(2001) is very similar to the 1-best tree algorithm of Knuth (1977) (see Section 2.3.3).
For the k-best case (k > 1), they also coincide when a node can appear at most once in any derivation, i.e., when the subderivations of sibling nodes are mutually disjoint. In this case, the hypergraph is isomorphic to a Case-Factor Diagram (CFD) (McAllester et al., 2004) (proof omitted). For example, CKY derivations can obviously be represented as CFDs since the spans of subtrees are disjoint, while in an Earley derivation (Earley, 1970) an item can appear twice because of the prediction rule (see Figure 3.2).
The k-best derivations problem has potentially more applications, for example in tree generation (Knight and Graehl, 2005), which cannot be modeled by hyperpaths. But detailed discussion along this line is beyond the scope of this thesis.
3.4 Algorithm 0 (Naive)
Following (Goodman, 1998; Mohri, 2002), we isolate two basic operations in line 4 of the 1-best algorithm that can be generalized in order to extend the algorithm: first, the formation of the derivation ⟨e, 1⟩ out of the |e| best sub-derivations (this is a generalization of the binary operator ⊗ in a semiring); second, min, which chooses the better of two derivations (the same as the ⊕ operator in an idempotent semiring (Mohri, 2002)). We now generalize these two operations to operate on k-best lists.
Let r = |e| be the arity of hyperedge e. The new multiplication operation, mult_k(e), is performed in three steps:

1. Enumerate the k^r derivations {⟨e, j_1 . . . j_r⟩ | ∀i, 1 ≤ j_i ≤ k}. Time: O(k^r).
2. Sort these k^r derivations (according to cost). Time: O(k^r log(k^r)) = O(r k^r log k).
3. Select the first k elements from the sorted list of k^r elements. Time: O(k).

So the overall time complexity of mult_k is O(r k^r log k).
We also have to extend min to merge_k, which takes two vectors of length k (or fewer) as input and outputs the top k (in sorted order) of the 2k elements. This is similar to merge-sort (Cormen et al., 2001) and can be done in linear time O(k). Then, we only need to rewrite line 4 of the 1-best Viterbi algorithm to extend it to the k-best case:

4:   D(v) ← merge_k(D(v), mult_k(e))

The time complexity of this line is O(|e| k^|e| log k), making the overall complexity O(|E| k^a log k) if we consider the arity a of the hypergraph to be constant.² The overall space complexity is O(k|V|), since for each vertex we need to store a vector of length k.

²Actually, we do not need to sort all k^|e| elements in order to extract the top k among them; there are efficient linear-time selection algorithms (both randomized and deterministic) (Cormen et al., 2001) that can select the k-th best element from the k^|e| elements in time O(k^|e|). So we can improve the overhead to O(k^a).
In the context of CKY parsing for CFGs, for instance, the 1-best Viterbi algorithm has complexity O(n^3 |P|) while the k-best version is O(n^3 |P| k^2 log k), which is slower by a factor of O(k^2 log k).
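The two generalized operations can be written down directly; the following Python sketch assumes each k-best list is simply a list of costs in ascending order and that f is a monotonic weight function, so it only illustrates the cost bookkeeping, not the back-pointers.

    import heapq
    from itertools import islice, product

    def mult_k(tail_lists, f, k):
        """Naive mult_k (Algorithm 0): enumerate all k^r combinations of the sorted
        tail lists, score each with the monotonic function f, sort, keep the top k."""
        combos = (f(*costs) for costs in product(*tail_lists))    # k^r combinations
        return sorted(combos)[:k]

    def merge_k(a, b, k):
        """merge_k: top k (in sorted order) of two sorted lists, in linear time."""
        return list(islice(heapq.merge(a, b), k))

    # Example: two tails with 3-best costs each, f = addition, k = 3
    # mult_k([[1, 2, 4], [2, 3, 4]], lambda x, y: x + y, 3)  ->  [3, 4, 4]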
3.5 Algorithm 1
First we seek to exploit the fact that the input vectors are all sorted and the function f is monotonic; moreover, we are only interested in the top k elements of the k^|e| possibilities. Define 1 to be the vector whose elements are all 1; its dimension should be clear from context. Also define b_i to be the vector whose elements are all 0 except the i-th, which is 1.
As we compute p_e = mult_k(e), we maintain a candidate set cand of derivations that have the potential to be the next-best derivation in the list. If we picture the input as an |e|-dimensional space, cand contains those derivations that have not yet been included in p_e but are on the boundary with those that have. It is initialized to {⟨e, 1⟩}. At each step, we extract the best derivation from cand, call it ⟨e, j⟩, and append it to p_e. Then ⟨e, j⟩ must be replaced in cand by its neighbors,

    {⟨e, j + b_l⟩ | 1 ≤ l ≤ |e|}

(see Figure 3.3 for an illustration). We implement cand as a priority queue (Cormen et al., 2001) to make the extraction of its best derivation efficient. At each iteration, there is one Extract-Min and there are |e| Insert operations. If we use a binary-heap implementation for priority queues, we get O(|e| log(k|e|)) time per iteration, because the size of the heap is bounded by 1 + |e|(k − 1) = O(k|e|), and both Extract-Min and Insert cost O(log(k|e|)) time.³ Since we are only interested in the top k elements, there are k iterations, and the time complexity of a single mult_k is O(k|e| log(k|e|)), yielding an overall time complexity of O(|E| k log k) and reducing the multiplicative overhead by a factor of O(k^{a−1}) (again, assuming a is constant). In the context of CKY parsing, this reduces the overhead to O(k log k). Code 3.2 shows the additional pseudocode needed for this algorithm. It is integrated into the Viterbi algorithm (Code 3.1) simply by rewriting line 4 to invoke the function Mult(e, k):

4:   D(v) ← merge_k(D(v), Mult(e, k))

³With a Fibonacci heap we can improve this per-iteration cost to O(|e| + log(k|e|)). But this will not change the overall complexity when the arity a is constant, as we will see below.

Figure 3.3: An illustration of Algorithm 1 in |e| = 2 dimensions. Here k = 3, the ordering on derivations is the numerical ≤, and the monotonic function f is defined as f(a, b) = a + b. The italic numbers on the x and y axes are the a_i's and b_j's, respectively. We want to compute the top 3 results of f(a_i, b_j). In each iteration the current frontier is shown shaded, with boldface denoting the best element among them; that element is then extracted and replaced by its two neighbors in the next iteration.
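The following Python sketch illustrates the frontier-based Mult of Algorithm 1 on cost lists, using a binary heap; as in the previous sketch, back-pointers are omitted and the sorted-cost-list encoding is an assumption.

    import heapq

    def mult_k_lazy(tail_lists, f, k):
        """Algorithm 1: lazily enumerate the k best combinations of sorted tail lists
        under a monotonic f, exploring the |e|-dimensional grid from its corner."""
        dims = len(tail_lists)
        start = (0,) * dims                                   # 0-based version of <e, 1>
        score = lambda j: f(*[tail_lists[i][j[i]] for i in range(dims)])
        cand = [(score(start), start)]                        # frontier, kept as a heap
        seen = {start}
        result = []
        while cand and len(result) < k:
            cost, j = heapq.heappop(cand)                     # Extract-Min
            result.append(cost)
            for i in range(dims):                             # push the |e| neighbors j + b_i
                jn = j[:i] + (j[i] + 1,) + j[i + 1:]
                if jn[i] < len(tail_lists[i]) and jn not in seen:
                    seen.add(jn)
                    heapq.heappush(cand, (score(jn), jn))
        return result

    # mult_k_lazy([[1, 2, 4], [2, 3, 4]], lambda x, y: x + y, 3)  ->  [3, 4, 4]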
3.6 Algorithm 2
We can further speed up both merge_k and mult_k by a similar idea. Instead of letting each mult_k generate a full k derivations for each hyperarc e and only then applying merge_k to the results, we can combine the candidate sets for all the hyperarcs into a single candidate set. That is, we initialize cand to {⟨e, 1⟩ | e ∈ BS(v)}, the set of all the top parses from each incoming hyperarc (cf. Algorithm 1). Indeed, it suffices to keep only the top k out of the |BS(v)| candidates in cand, which leads to a significant speedup in the case where |BS(v)| ≫ k.⁴ Now the top derivation in cand is the top derivation for v. Then, whenever we remove an element ⟨e, j⟩ from cand, we replace it with the |e| elements {⟨e, j + b_l⟩ | 1 ≤ l ≤ |e|} (again, as in Algorithm 1). The full pseudocode for this algorithm is shown in Code 3.3.

⁴This top-k selection is implemented by the linear-time randomized selection algorithm (a.k.a. quick-select), which in practice performs better than its deterministic counterpart (Cormen et al., 2001).
Code 3.2 k-best Algorithm 1.
1: function Mult(e, k)
2:   cand ← {⟨e, 1⟩}                         ▷ initialize the heap
3:   p ← empty list                          ▷ the result of mult_k
4:   while |p| < k and |cand| > 0 do
5:     AppendNext(cand, p)
6:   return p
7: procedure AppendNext(cand, p)
8:   ⟨e, j⟩ ← Extract-Min(cand)
9:   append ⟨e, j⟩ to p
10:  for i ← 1 . . . |e| do                  ▷ add the |e| neighbors
11:    j′ ← j + b_i
12:    if j′_i ≤ |D(tails_i(e))| and ⟨e, j′⟩ ∉ cand then
13:      Insert(cand, ⟨e, j′⟩)               ▷ add to heap
Code 3.3 k-best Algorithm 2.
1: procedure FindAllKBest(k)
2:   for v ∈ V in topological order do
3:     FindKBest(v, k)
4: procedure FindKBest(v, k)
5:   GetCandidates(v, k)                     ▷ initialize the heap
6:   while |D(v)| < k and |cand[v]| > 0 do
7:     AppendNext(cand[v], D(v))
8: procedure GetCandidates(v, k)
9:   temp ← {⟨e, 1⟩ | e ∈ BS(v)}
10:  cand[v] ← the top k elements of temp    ▷ (optional) prune useless candidates
11:  Heapify(cand[v])
3.7 Algorithm 3 (Lazy)
Algorithm 2 exploited the idea of lazy computation: performing mult_k only as many times as necessary. But this algorithm still calculates a full k-best list for every vertex in the hypergraph, whereas we are only interested in the k-best derivations of the target vertex (the goal item). We can therefore take laziness to an extreme by delaying the whole k-best calculation until after parsing. Algorithm 3 assumes an initial parsing phase that generates the hypergraph and finds the 1-best derivation of each item; then, in the second phase, it proceeds as in Algorithm 2, but starts at the goal item and calls itself recursively only as necessary. The pseudocode for this algorithm is shown in Code 3.4.⁵ As a side note, this second phase should also be applicable to a cyclic hypergraph as long as its derivation weights are bounded.
Algorithm 2 has an overall complexity of O(|E| + |V| k log k), and Algorithm 3 is O(|E| + |D_m| k log k), where |D_m| = max_D |D| is the size of the longest among all top-k derivations (for a CFG in CNF, |D| = 2n − 1 for all D, so |D_m| is O(n)). These are significant improvements over Algorithms 0 and 1, since they turn the multiplicative overhead into an additive one. In practice |E| usually dominates, as in CKY parsing of CFGs, so theoretically the running times grow very slowly as k increases, which is exactly what our experiments below demonstrate.
Table 3.1 summarizes the time and space complexities of our four k-best algorithms, along with the 1-best Viterbi algorithm and the generalized Jimenez and Marzal algorithm. Algorithm 3′ (to be covered in Section 3.7.1) is a space-efficient variant of Algorithm 3 (Jason Eisner, p.c.) for the case when the forest (of size |E|) is too big to store in memory. The key difference between our Algorithm 3 and Jimenez and Marzal's algorithm is the restriction to the top k candidates before making heaps (line 10 in Code 3.3; see also Sec. 3.6). Without this line, Algorithm 3 could be considered a generalization of the Jimenez and Marzal algorithm to the case of acyclic monotonic hypergraphs.

⁵This version corrects the behavior of the previously published version in the case where a vertex has only one incoming hyperarc but more than one derivation. Furthermore, it improves efficiency by eliminating possible extra work in LazyNext.
Code 3.4 k-best Algorithm 3.
1: procedure LazyKthBest(v, k, k′)           ▷ k′ is the global k
2:   if cand[v] is not defined then          ▷ first visit of vertex v?
3:     GetCandidates(v, k′)                  ▷ initialize the heap
4:   while |D(v)| < k do
5:     if |D(v)| > 0 then                    ▷ last derivation already extracted?
6:       ⟨e, j⟩ ← D_{|D(v)|}(v)              ▷ get the last derivation
7:       LazyNext(cand[v], e, j, k′)         ▷ insert its successors
8:     if |cand[v]| > 0 then
9:       append Extract-Min(cand[v]) to D(v) ▷ extract the next best
10:    else
11:      break                               ▷ no more derivations
12: procedure LazyNext(cand, e, j, k′)
13:  for i ← 1 . . . |e| do                  ▷ add the |e| neighbors
14:    j′ ← j + b_i
15:    LazyKthBest(tails_i(e), j′_i, k′)     ▷ recursively solve a sub-problem
16:    if j′_i ≤ |D(tails_i(e))| and ⟨e, j′⟩ ∉ cand then   ▷ exists and not yet in heap?
17:      Insert(cand, ⟨e, j′⟩)               ▷ add to heap
Algorithm        Time Complexity                   CKY case                           Space
1-best Viterbi   O(|E|)                            n^3 |P|                            O(|V|)
Algorithm 0      O(|E| k^a log k)                  n^3 |P| k^2 log k                  O(k|V|)
Algorithm 1      O(|E| k log k)                    n^3 |P| k log k                    O(k|V|)
Algorithm 2      O(|E| + |V| k log k)              n^3 |P| + n^2 |N| k log k          O(k|V|)
Algorithm 3      O(|E| + |D_m| k log k)            n^3 |P| + n k log k                O(|E| + k|D_m|)
gen. J&M         O(|E| + |D_m| k log(d + k))       n^3 |P| + n k log(n|P_N| + k)      O(|E| + k|D_m| + d)
Algorithm 3′     O(|E| + |D_m| (d + k log k))      n^3 |P| + n^2 |P_N| + n k log k    O(|V| + k|D_m|)

Table 3.1: Summary of k-best algorithms. See Section 3.7.1 for Algorithm 3′.
This line is also responsible for improving the time complexity from O(|E| + |D_m| k log(d + k)) (the generalized Jimenez and Marzal algorithm) to O(|E| + |D_m| k log k), where d = max_v |BS(v)| is the maximum in-degree among all vertices. For the CKY case, d = nP_N, where P_N is the maximal number of productions a nonterminal can have in the grammar. So our algorithm outperforms Jimenez and Marzal's for small values of k < d.
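To make the lazy control flow concrete, here is a small Python sketch in the spirit of Code 3.4. It recurses directly from the goal vertex rather than running a separate 1-best forward pass, omits the optional top-k restriction of GetCandidates, and returns costs only (recovering the trees from the stored back-pointers is routine); the forest encoding is the same assumed dictionary form as in the earlier sketches.

    import heapq

    class LazyKBest:
        """Lazy k-best costs over an acyclic hypergraph, in the spirit of Algorithm 3.

        BS[v]     -- incoming hyperedges of v, each a pair (tails, f); leaves have BS[v] == []
        leaf_cost -- dict giving the cost of each leaf (axiom) vertex
        k         -- global limit on the number of derivations kept per vertex
        """
        def __init__(self, BS, leaf_cost, k):
            self.BS, self.leaf_cost, self.k = BS, leaf_cost, k
            self.D = {}      # D[v]: best costs found so far, with their back-pointers
            self.cand = {}   # cand[v]: frontier heap of candidate dbps
            self.seen = {}   # seen[v]: dbps already pushed, to avoid duplicates

        def _score(self, v, ei, j):
            tails, f = self.BS[v][ei]
            return f(*[self.kth(u, ji) for u, ji in zip(tails, j)])

        def kth(self, v, i):
            """Cost of the (i+1)-th best derivation of v (0-based), computed lazily."""
            if not self.BS[v]:                               # leaf: a single derivation
                return self.leaf_cost[v] if i == 0 else None
            if v not in self.D:                              # first visit: seed <e, 1> per edge
                self.D[v], self.cand[v], self.seen[v] = [], [], set()
                for ei, (tails, _) in enumerate(self.BS[v]):
                    j = (0,) * len(tails)
                    self.seen[v].add((ei, j))
                    heapq.heappush(self.cand[v], (self._score(v, ei, j), ei, j))
            while len(self.D[v]) <= i and len(self.D[v]) < self.k and self.cand[v]:
                cost, ei, j = heapq.heappop(self.cand[v])    # extract the next best
                self.D[v].append((cost, ei, j))
                tails, _ = self.BS[v][ei]
                for axis in range(len(tails)):               # successors <e, j + b_i>
                    jn = j[:axis] + (j[axis] + 1,) + j[axis + 1:]
                    if (ei, jn) in self.seen[v]:
                        continue
                    if self.kth(tails[axis], jn[axis]) is not None:   # solve sub-problem lazily
                        self.seen[v].add((ei, jn))
                        heapq.heappush(self.cand[v], (self._score(v, ei, jn), ei, jn))
            return self.D[v][i][0] if i < len(self.D[v]) else None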
3.7.1 Extension 1: Space-Efficient Algorithm 3′
In terms of space complexity, Algorithms 0-2 all require O(k|V|) space, since they store the k-best list of each vertex. Algorithm 3, on the other hand, stores the whole forest of size |E| in the forward phase but needs much less space in the backward phase, since it only queries a small number of vertices for their k-best lists. So it is more space-efficient than Algorithms 0-2 for big values of k = Ω(d) = Ω(nP_N). However, in practice there are cases where |E| is too big to fit in memory; if k is small we can still use Algorithm 2, but what if k is big as well?
Here we develop a space-efficient variant of Algorithm 3, trading time for space, so that with a little overhead in time complexity we can compute the k-best lists without storing either the whole forest or the k-best list of each vertex. In other words, this new algorithm runs slightly slower than Algorithm 3, but uses much less space than both Algorithm 3 and Algorithm 2. This variant was suggested by Jason Eisner (p.c.).
The basic idea is that, instead of storing the whole forest in the forward phase, we only store the 1-best hyperedge of each vertex, as in normal parsing, but reconstruct the relevant alternative hyperedges on the fly during the backward phase. When a vertex v is queried for its 2nd-best derivation for the first time, before constructing the heap of candidates (line 9 in Code 3.3), we now reconstruct its backpointers, i.e., BS(v), just as during the forward phase. This is obviously a waste of time, since we have already enumerated these hyperedges in the forward phase but could not afford to store all of them (except the best one) in memory. Since Algorithm 3 only queries a small number of vertices in the backward phase, the reconstructed space is only O(k|D_m|) (again, if d > k we only need to store the top k out of the d backpointers); so the overall space complexity is O(|V| + k|D_m|), which is the most efficient among all the k-best algorithms in Table 3.1. The time complexity is of course worse than that of Algorithm 3, paying an O(|D_m| d) price for the on-the-fly reconstruction.
3.7.2 Extension 2: The Unique k-best Algorithm
In the above we have assumed that the derivations in the forest are exactly the output of the search algorithms. However, in many real-world applications we may need to map these derivations to some other form, where multiple derivations D_1 ≠ D_2 ≠ . . . map to the same output, i.e., φ(D_1) = φ(D_2) = . . .. This situation is often called spurious ambiguity, since it introduces ambiguities that are not directly useful in the output domain. For example, in Tree-Adjoining Grammars (Joshi and Schabes, 1997), different derivation trees can map to the same derived tree (whereas in CFGs there is a one-to-one correspondence between the two notions). This is also the case with latently annotated Treebank grammars (Petrov et al., 2006). A more interesting example is in machine translation: we often need very large k-best translation lists to rescore with a language model or other features (Huang et al., 2006). However, many different derivations share the same yield, which results in a very small ratio of unique strings among the top-k derivations. To alleviate this problem, determinization techniques have been proposed by Mohri and Riley (2002) for finite-state automata and extended to tree automata by May and Knight (2006). These methods eliminate spurious ambiguity by effectively transforming the grammar into an equivalent deterministic form. However, this transformation often leads to a blow-up in forest size, which is exponential in the original size in the worst case.
So instead of determinization, here we present a simple yet effective variant of Algorithm 3 that guarantees unique mapped outputs for any compositional mapping on derivations. By compositional we intuitively mean that if two different derivations of a node map to the same result, then they also look equivalent inside any larger derivation:

Definition 3.3 (Compositional Mapping on Derivations)
A mapping φ : D(H) → S from derivations to some output domain S is said to be compositional if for every hyperedge e = ⟨(u_1 . . . u_i . . . u_|e|), v⟩ and every 1 ≤ i ≤ |e|, whenever φ(D_i) = φ(D′_i) for D_i, D′_i ∈ D(u_i), we have φ(⟨e, D_1 . . . D_i . . . D_|e|⟩) = φ(⟨e, D_1 . . . D′_i . . . D_|e|⟩).
Apparently, all of the above-mentioned mappings are compositional, and we are not aware of any non-compositional ones in practice. Our modification of Algorithm 3 for compositional mappings is as simple as this:

• keep a hash table of unique outputs at each node in the hypergraph, and
• when asked for the next-best derivation of a node (LazyNext), keep asking until we get a derivation with an as-yet-unseen output, and then add that output to the set of unique outputs.
The idea behind this method is that the exponentially many duplicates arise partly from combinatorial explosion: duplicates at the children nodes translate into more duplicates at the parent node (if the mapping is compositional), and combining the duplicates of two children introduces a cross-product of duplicates at the parent. So we should eliminate duplicates as early (and as much) as possible in the recursion, so that every node reports only unique outputs to higher nodes, although combining unique outputs of children nodes might still lead to duplicates at the parent. In practice this approach works very well in the machine translation rescoring experiments of Huang et al. (2006), where the duplicate ratio is about 1:40 (one unique string out of every 40 derivations), yet getting unique strings with our method is only twice as slow as the original Algorithm 3, which translates into a 20-fold speedup.
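The modification itself amounts to a small filter around the next-best enumeration. The following Python sketch shows the filter at a single node, with next_best and phi as hypothetical stand-ins for the node's lazy enumerator and the compositional mapping; in the full algorithm the same filter is applied inside LazyNext at every node, not just at the root.

    def unique_kbest(next_best, phi, k):
        """Filter a lazy stream of derivations down to k with unique mapped outputs.

        next_best -- callable returning the next-best derivation of a node,
                     or None when the node's derivations are exhausted (hypothetical)
        phi       -- a compositional mapping from derivations to outputs, e.g. the
                     target-language string of an MT derivation (hypothetical)
        """
        seen, results = set(), []
        while len(results) < k:
            d = next_best()                  # keep asking for the next-best derivation
            if d is None:
                break
            out = phi(d)
            if out not in seen:              # report only as-yet-unseen outputs
                seen.add(out)
                results.append(d)
        return results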
However, this method has two limitations. First, there is no worst-case guarantee of polynomial time complexity under an arbitrary compositional mapping; the empirical efficiency depends on the degree of spurious ambiguity. Second, since the method discards duplicate subderivations at all intermediate stages, it is impossible to compute sums or expected values under the mapping (e.g., the total probability of all derivations of the same string, which is useful in modeling). Rather, it only works for the max (i.e., Viterbi) approximation, although in many cases that is sufficient. The determinization techniques of Mohri and Riley (2002) and May and Knight (2006), on the other hand, can in principle handle both the max and the sum scenarios.
3.8 k-best Parsing and Decoding Experiments
We report results from two sets of experiments. For probabilistic parsing, we implemented Algorithms 0, 1, and 3 on top of a widely used parser (Bikel, 2004) and conducted experiments on parsing efficiency and the quality of the k-best lists. We also implemented Algorithms 2 and 3 in the hierarchical phrase-based decoder of Chiang (2005) and report results on decoding speed.
3.8.1 Experiment 1: Bikel Parser
Bikel's parser (Bikel, 2004) is a state-of-the-art multilingual parser based on lexicalized context-free models (Collins, 1999; Eisner, 2000). It does support k-best parsing, but, following Collins' parse-reranking work (Collins, 2000), it accomplishes this by simply abandoning dynamic programming, i.e., no items are considered equivalent (Charniak and Johnson, 2005). Theoretically, the time complexity is exponential in n (the input sentence length) and constant in k, since, without merging of equivalent items, there is no limit on the number of items in the chart. In practice, beam search is used to reduce the observed time.⁶ But with the standard beam width of 10^-4, this method becomes prohibitively expensive for n ≥ 25 on Bikel's parser. Collins (2000) used a narrower 10^-3 beam and further applied a cell limit of 100,⁷ but, as we will show below, this has a detrimental effect on the quality of the output. We therefore omit this method from our speed comparisons and use our implementation of Algorithm 0 (naive) as the baseline.
We implemented our k-best Algorithms 0, 1, and 3 on top of Bikel's parser and conducted experiments on a 2.4 GHz 64-bit AMD Opteron with 32 GB of memory. The program is written in Java 1.5, running on the Sun JVM in server mode with a maximum heap size of 5 GB. For this experiment, we used sections 02-21 of the Penn Treebank (PTB) (Marcus et al., 1993) as the training data and section 23 (2416 sentences) for evaluation, as is now standard. We ran Bikel's parser using its settings to emulate Model 2 of (Collins, 1999).
Efficiency
We tested our algorithms under various conditions. We first compared the average parsing time per sentence of Algorithms 0, 1, and 3 on section 23, with k ≤ 10000, for the standard beam width of 10^-4.

⁶In beam search, or threshold pruning, each cell in the chart (typically containing all the items corresponding to a span [i, j]) is reduced by discarding all items that are worse than β times the score of the best item in the cell; β is known as the beam width.
⁷In this type of pruning, also known as histogram pruning, only the best b items are kept in each cell; b is called the cell limit.
Figure 3.4: Efficiency results of the k-best algorithms, compared to Jimenez and Marzal's algorithm: (a) average parsing time in seconds vs. k (log-log) for Algorithms 0, 1, and 3; (b) average heap size vs. k for Algorithm 3 and the J&M algorithm, under beam widths of 10^-4 and 10^-5.
Figure 3.4(a) shows that the parsing speed of Algorithm 3 improves dramatically over the other algorithms and is nearly constant in k, which exactly matches the complexity analysis. Algorithm 1 (with O(k log k) overhead) also significantly outperforms the baseline naive algorithm (with O(k^2 log k) overhead).
We also compared our Algorithm 3 with the Jimenez and Marzal algorithm in terms of average heap size. Figure 3.4(b) shows that for larger k the two algorithms have the same average heap size, but for smaller k our Algorithm 3 has a considerably smaller average heap size. This difference is useful in applications where only short k-best lists are needed; for example, McDonald et al. (2005) find that k = 5 gives optimal parsing accuracy.
Accuracy
Our efficient k-best algorithms enable us to search over a larger portion of the whole search space (e.g., by pruning less aggressively), thus producing k-best lists of better quality than previous methods. We demonstrate this by comparing our k-best lists to those of Ratnaparkhi (1997), Collins (2000), and the parallel work of Charniak and Johnson (2005) in several ways, including oracle reranking and the average number of parses found.
Ratnaparkhi (1997) introduced the idea of oracle reranking: suppose there exists a perfect reranking scheme that magically picks, among the top k parses of each sentence, the parse with the highest F-score. The performance of this oracle reranking scheme is then an upper bound on any actual reranking system like (Collins, 2000). As k increases, the oracle F-score is nondecreasing, and there is some k (which might be very large) at which it converges.
Ratnaparkhi reports experiments using oracle reranking with his statistical parser MXPARSE, which can compute its own k-best parses (in his experiments, k = 20). Collins (2000), in his parse-reranking experiments, used his Model 2 parser (Collins, 1999) with a beam width of 10^-3 together with a cell limit of 100 to obtain k-best lists; the average number of parses obtained per sentence was 29.2, and the maximum was 101.⁸ Charniak and Johnson (2005) use coarse-to-fine parsing on top of the Charniak (2000) parser to get 50-best lists for section 23.
Figure 3.5(a) compares the results of oracle reranking. Collins' curve converges at around k = 50 while ours continues to increase. With a beam width of 10^-4 and k = 100, our parser plus the oracle reaches an F-score of 96.4%, compared to Collins' 94.9%. Charniak and Johnson's work, however, is based on a completely different parser whose 1-best F-score is 1.5 points higher than the 1-best scores of ours and Collins', making absolute numbers difficult to compare, so we instead compare the relative improvement over 1-best. Figure 3.5(b) shows that our work has the largest percentage improvement in F-score for k > 20.
To further explore the impact of Collins' cell limit on the quality of the k-best lists, we plotted the average number of parses against sentence length (Figure 3.6). Generally speaking, as input sentences get longer, the number of parses grows (exponentially). But we see that the curve for Collins' k-best lists goes down for large k (> 40). We suspect this is due to the cell limit of 100 pruning away potentially good parses too early in the chart: as sentences get longer, it becomes more likely that a lower-probability parse would eventually have contributed to the k-best list. So we infer that Collins' k-best lists have limited quality for large k, and this is demonstrated by the early convergence of their oracle-reranking score. By comparison, our curves for both beam widths continue to grow at k = 100.
All these experiments suggest that our k-best parses are of better quality than those from previous k-best parsers, and of similar quality to those of Charniak and Johnson (2005), which has so far the highest F-score after reranking; this might lead to better results in real parse reranking.

⁸The reason the maximum is 101 rather than 100 is that Collins merged the 100-best list obtained with a beam of 10^-3 with the 1-best list obtained with a beam of 10^-4 (Collins, p.c.).
Figure 3.5: Absolute and relative F-scores of oracle reranking for the top k (≤ 100) parses on section 23, compared to (Charniak and Johnson, 2005), (Collins, 2000), and (Ratnaparkhi, 1997): (a) oracle F-score; (b) percentage of improvement over 1-best.
Figure 3.6: Average number of parses for each sentence length in section 23, using k = 100 with beam widths of 10^-4 and 10^-3, compared to (Collins, 2000) with beam width 10^-3.
Figure 3.7: Algorithm 2 compared with Algorithm 3 (offline) on the MT decoding task: average time to compute the k-best list (excluding the initial 1-best phase) vs. k (log-log).
3.8.2 Experiment 2: Hiero decoder
Our second experiment was on a CKY-based decoder for a machine translation system (Chiang, 2005), implemented in Python 2.4 and accelerated with Psyco 1.3 (Rigo, 2004). We implemented Algorithms 2 and 3 to compute k-best English translations of Mandarin sentences. Because the CFG used in this system is large to begin with (millions of rules), and is then effectively intersected with a finite-state machine on the English side (the language model), the grammar constant for this system is quite large. The decoder uses a relatively narrow beam search for efficiency.
We ran the decoder on a 2.8 GHz Xeon with 4 GB of memory, on 331 sentences from the 2002 NIST MTEval test set. We tested Algorithm 2 for k = 2^i, 3 ≤ i ≤ 10, and Algorithm 3 (the offline algorithm) for k = 2^i, 3 ≤ i ≤ 20. For each sentence, we measured the time to calculate the k-best list, not including the initial 1-best parsing phase. We then averaged the times over our test set to produce the graph in Figure 3.7, which shows that Algorithm 3 runs on average about 300 times faster than Algorithm 2. Furthermore, we were able to test Algorithm 3 up to k = 10^6 in a reasonable amount of time.⁹

⁹The curvature of the plot for Algorithm 3 for k < 1000 may be due to a lack of resolution in the timing function for short times.
3.9 Summary
We have presented a series of general-purpose algorithms for k-best inference in packed forests and applied them to two state-of-the-art, large-scale NLP systems: Bikel's implementation of Collins' lexicalized PCFG model (Bikel, 2004; Collins, 1999) and Chiang's synchronous-CFG-based Hiero decoder (Chiang, 2005) for machine translation. These algorithms lie at the core of the approximate search algorithms of Chapters 4 and 5.
Chapter 4
Approximate Dynamic Programming I: Forest Rescoring

There are three difficulties with translation: fidelity, fluency and elegance. (xìn, dá, yǎ)
Yan, Fu (1854-1921), Translation of Huxley's Evolution and Ethics, 1898.
Efficient decoding has been a fundamental problem in machine translation, especially with an integrated language model, which is essential for achieving good translation quality (+LM decoding). However, as analyzed in Section 2.2, exact decoding under both phrase-based and syntax-based models is often intractable due to the huge sizes of +LM forests. In practice, one must apply aggressive pruning techniques to reduce them to a reasonable size. This chapter thus develops faster and more principled approaches to this problem by extending the exact k-best parsing algorithms of Chapter 3 to approximate search over these forests.
Let us first consider a much simpler alternative method for LM integration, called rescoring, which first decodes without the LM (−LM decoding) to produce a k-best list of candidate translations using variants of Algorithm 3 (Section 3.7), and then reranks the k-best list with the LM. This method runs much faster in practice, since −LM forests are considerably smaller (see Section 2.2.1), but it often produces a considerable number of search errors, since the true best translation (taking the LM into account) is often outside the k-best list.

method          k-best algorithm    +LM rescoring
rescoring       Algorithm 3         only at the root node
cube pruning    Algorithm 2         on-the-fly at each node
cube growing    Algorithm 3         on-the-fly at each node

Table 4.1: Comparison of the three methods for decoding with n-gram LMs.
Cube pruning (Chiang, 2007), based on Algorithm 2 (Section 3.6), is a compromise between rescoring and full integration: it rescores the k best subtranslations at each node of the −LM forest, rather than only at the root node as in pure rescoring, and achieves a significant speed-up over full integration on the hierarchical phrase-based system Hiero (Chiang, 2005).
We push the idea behind this method further and make the following contributions in this chapter:

• We generalize cube pruning into a generic method for LM-integrated decoding and adapt it to two systems very different from Hiero: a phrase-based system similar to Pharaoh (Koehn, 2004) and a tree-to-string system (Huang et al., 2006), both described in Section 2.2.
• We also devise a faster variant of cube pruning, called cube growing, which uses the k-best Algorithm 3 to reduce k to the minimum needed at each node to obtain the desired number of hypotheses at the root. Table 4.1 compares the three methods for incorporating the LM.
Cube pruning and cube growing are collectively called forest rescoring, since they both approximately rescore the −LM forest, which is a coarser-grained version of the +LM forest, and can thus be viewed as an instance of coarse-to-fine search (Charniak and Johnson, 2005). In practice they run an order of magnitude faster than full integration with beam search, at the same level of search errors and translation accuracy as measured by BLEU. This chapter is based on the material in (Huang and Chiang, 2007).
Figure 4.1: Cube pruning along one hyperedge. (a) The numbers in the grid denote the score of the resulting +LM item, including the combination cost; (b)-(d) the best-first enumeration of the top three items. Notice that the items popped in (b) and (c) are out of order due to the non-monotonicity of the combination cost.
4.1 Cube Pruning based on Algorithm 2
Cube pruning (Chiang, 2007) reduces the search space significantly, based on the observation that when the above method is combined with beam search, only a small fraction of the possible +LM items at a node will escape being pruned, and moreover we can select with reasonable accuracy those top k items without computing all possible items first. In a nutshell, cube pruning works on the −LM forest, keeping at most k +LM items at each node, and uses the k-best parsing Algorithm 2 of Huang and Chiang (2005) to speed up the computation. For simplicity of presentation, we will use concrete SCFG-based examples, but the method applies to the general hypergraph framework.
Consider Figure 4.1(a). Here k = 3 and we use D(v) to denote the top-k +LM items (in sorted order) of node v. Suppose we have computed D(u_1) and D(u_2) for the two antecedent nodes u_1 = VP_{3,6} and u_2 = PP_{1,3}, respectively. Then for the consequent node v = VP_{1,6} we just need to derive the top 3 of the 9 combinations (D_i(u_1), D_j(u_2)) with i, j ∈ [1, 3]. Since the antecedent items are sorted, it is very likely that the best consequent items in this grid lie towards its upper-left corner. This situation is very similar to k-best parsing, and we can adapt Algorithm 2 of Huang and Chiang (2005) to explore this grid in best-first order.
Suppose that the combination costs are negligible, and therefore the weight of a consequent item is just the product of the weights of the antecedent items. Then we know that D_1(v) = (D_1(u_1), D_1(u_2)), the upper-left corner of the grid. Moreover, we know that D_2(v) is the better of (D_1(u_1), D_2(u_2)) and (D_2(u_1), D_1(u_2)), the two neighbors of the upper-left corner. We continue in this way (see Figure 4.1(b)-(d)), enumerating the consequent items best-first while keeping track of a relatively small number of candidates (the shaded cells in Figure 4.1(b); cand in Code 4.1) for the next-best item.
Code 4.1 Cube pruning based on k-best Algorithm 2.
1: function Cube(F)                          ▷ the input is a forest F
2:   for v ∈ F in (bottom-up) topological order do
3:     KBest(v)
4:   return D_1(TOP)
5: procedure KBest(v)
6:   cand ← {⟨e, 1⟩ | e ∈ IN(v)}             ▷ for each incoming e
7:   Heapify(cand)                           ▷ a priority queue of candidates
8:   buf ← ∅
9:   while |cand| > 0 and |buf| < k do
10:    item ← Pop-Min(cand)
11:    append item to buf
12:    PushSucc(item, cand)
13:  sort buf into D(v)
14: procedure PushSucc(⟨e, j⟩, cand)
15:   ▷ e is v → u_1 . . . u_|e|
16:   for i in 1 . . . |e| do
17:     j′ ← j + b_i
18:     if |D(u_i)| ≥ j′_i then
19:       Push(⟨e, j′⟩, cand)
However, when we take into account the combination costs, this grid is no longer monotonic in general, and the above algorithm will not always enumerate items in best-first order. We can see this in the first iteration in Figure 4.1(b), where an item with score 2.5 has been enumerated even though there is an item with score 2.4 still to come. Thus we risk making more search errors than the full-integration method, but in practice the loss is much less significant than the speedup. Because of this disordering, we do not put the enumerated items directly into D(v); instead, we collect items in a buffer (buf in Code 4.1) and re-sort the buffer into D(v) after it has accumulated k items.¹
In general the grammar may have multiple rules that share the same source side but have different target sides, which we have treated here as separate hyperedges in the −LM forest. In Hiero, these hyperedges are processed as a single unit which we call a hyperedge bundle. The different target sides then constitute a third dimension of the grid, forming a cube of possible combinations (Chiang, 2007).

¹Notice that different combinations might yield the same resulting item, in which case we only keep the one with the better score (sometimes called hypothesis recombination in the MT literature), so the number of items in D(v) might be less than k.
Now consider that there are many hyperedges that derive v, and that we are only interested in the top +LM items of v over all incoming hyperedges. Following Algorithm 2, we initialize the priority queue cand with the upper-left corner item of each hyperedge and proceed as above. See Code 4.1 for the pseudocode of cube pruning. We use the notation ⟨e, j⟩ to identify the derivation of v via the hyperedge e and the j_i-th best subderivation of each antecedent u_i (1 ≤ i ≤ |j|). Also, we let 1 stand for the vector whose elements are all 1, and b_i for the vector whose elements are all 0 except for the i-th, whose value is 1 (the dimensionality of either should be evident from the context). The heart of the algorithm is lines 10-12: lines 10-11 move the best derivation ⟨e, j⟩ from cand to buf, and then line 12 pushes its successors {⟨e, j + b_i⟩ | i ∈ 1 . . . |e|} into cand.
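The per-node step of cube pruning can be sketched in Python as follows, assuming each antecedent's +LM items are already sorted lists of costs and each incoming hyperedge carries a scoring function that includes the combination (language-model) cost; hypothesis recombination of identical +LM items is omitted for brevity.

    import heapq

    def cube_prune_node(in_edges, D, k):
        """One KBest(v) step of cube pruning: pop roughly the k best +LM items of a
        node over the grids of all its incoming hyperedges, then re-sort the buffer.

        in_edges -- list of (tails, score): score(costs) returns the +LM cost of
                    combining one chosen +LM item per tail (combination cost included)
        D        -- dict: antecedent node -> its sorted list of +LM item costs
        """
        def entry(ei, j):
            tails, score = in_edges[ei]
            return (score([D[u][ji] for u, ji in zip(tails, j)]), ei, j)

        cand = [entry(ei, (0,) * len(tails)) for ei, (tails, _) in enumerate(in_edges)]
        heapq.heapify(cand)                           # seeded with each grid's corner <e, 1>
        seen = {(ei, j) for _, ei, j in cand}
        buf = []
        while cand and len(buf) < k:
            cost, ei, j = heapq.heappop(cand)         # best-first, though only approximately so
            buf.append(cost)
            tails, _ = in_edges[ei]
            for i in range(len(tails)):               # push the successors <e, j + b_i>
                jn = j[:i] + (j[i] + 1,) + j[i + 1:]
                if jn[i] < len(D[tails[i]]) and (ei, jn) not in seen:
                    seen.add((ei, jn))
                    heapq.heappush(cand, entry(ei, jn))
        return sorted(buf)                            # re-sort: pops may be out of order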
4.2 Cube Growing based on Algorithm 3
Although much faster than full integration, cube pruning still computes a fixed number of +LM items at each node, many of which will not be useful for arriving at the 1-best hypothesis at the root. It would be more efficient to compute as few +LM items at each node as are needed to obtain the 1-best hypothesis at the root. This new method, called cube growing, is a lazy version of cube pruning, just as Algorithm 3 of Huang and Chiang (2005) is a lazy version of Algorithm 2 (see Table 4.1).
Figure 4.2: Example of cube growing along one hyperedge: (a) the h(x) scores for the grid in Figure 4.1(a), assuming h_combo(e) = 0.1 for this hyperedge; (b) cube growing prevents the early ranking of the top-left cell (2.5) as the best item in this grid.
Instead of traversing the forest bottom-up, cube growing visits nodes recursively in depth-first order from the root node (Code 4.2). First we call LazyJthBest(TOP, 1), which uses the same algorithm as cube pruning to find the 1-best +LM item of the root node using the best +LM items of the antecedent nodes. However, in this case the best +LM items of the antecedent nodes are not known, because we have not visited them yet. So we recursively invoke LazyJthBest on the antecedent nodes to obtain them as needed. Each invocation of LazyJthBest(v, j) will recursively call itself on the antecedents of v until it is confident that the j-th best +LM item of node v has been found.
Consider again the case of one hyperedge e. Because of the non-monotonicity caused by combination costs, the first +LM item ⟨e, 1⟩ popped from cand is not guaranteed to be the best of all combinations along this hyperedge (for example, the top-left cell with score 2.5 in Figure 4.1 is not the best in the grid). So we cannot simply enumerate items just as they come off of cand.² Instead, we need to store popped items in a buffer buf, just as in cube pruning, and enumerate an item only when we are confident that it will never be surpassed in the future. In other words, we would like an estimate of the best item not yet explored (analogous to the heuristic function in A* search). If we can establish a lower bound h_combo(e) on the combination cost of any +LM deduction via hyperedge e, then we can form a monotonic grid (see Figure 4.2(a)) of lower bounds on the grid of combinations, by using h_combo(e) in place of the true combination cost for each +LM item x in the grid; call this lower bound h(x).

²If we did, then the out-of-order enumeration of +LM items at an antecedent node would cause an entire row or column in the grid to be disordered at the consequent node, potentially leading to a multiplication of search errors.
Code 4.2 Cube growing based on k-best Algorithm 3.
1: procedure LazyJthBest(v, j)
2:   if cand[v] is undefined then
3:     cand[v] ← ∅
4:     Fire(e, 1, cand) for each e ∈ IN(v)
5:     buf[v] ← ∅
6:   while |D(v)| < j and |buf[v]| + |D(v)| < k and |cand[v]| > 0 do
7:     item ← Pop-Min(cand[v])
8:     Push(item, buf[v])
9:     PushSucc(item, cand[v])
10:    bound ← min{h(x) | x ∈ cand[v]}
11:    Enum(buf[v], D(v), bound)
12:  Enum(buf[v], D(v), +∞)
13: procedure Fire(e, j, cand)
14:   ▷ e is v → u_1 . . . u_|e|
15:   for i in 1 . . . |e| do
16:     LazyJthBest(u_i, j_i)
17:     if |D(u_i)| < j_i then return
18:   Push(⟨e, j⟩, cand)
19: procedure PushSucc(⟨e, j⟩, cand)
20:   Fire(e, j + b_i, cand) for each i in 1 . . . |e|
21: procedure Enum(buf, D, bound)
22:   while |buf| > 0 and Min(buf) < bound do
23:     append Pop-Min(buf) to D
Now suppose that the gray-shaded cells in Figure 4.2(a) are the members of cand. Then the minimum of h(x) over the items in cand, in this example min{2.2, 5.1} = 2.2, is a lower bound on the cost of any future item along the hyperedge e. Indeed, if cand contains items from multiple hyperedges for a single consequent node, this is still a valid lower bound. More formally:
Lemma 1. For each node v in the forest, the term

    bound = min_{x ∈ cand[v]} h(x)                                    (4.1)

is a lower bound on the true cost of any future item that is yet to be explored for v.

Proof. For any item x that is not yet explored, the true cost c(x) ≥ h(x), by the definition of h. There also exists an item y ∈ cand[v] along the same hyperedge such that h(x) ≥ h(y), due to the monotonicity of h within the grid along one hyperedge. Finally, h(y) ≥ bound by the definition of bound. Therefore c(x) ≥ bound.
Now we can safely pop the best item from buf if its true cost Min(buf) is better than bound and pass it up to the consequent node (lines 21-23); otherwise, we have to wait for more items to accumulate in buf in order to prevent a potential search error, as in the case of Figure 4.2(b), where the top-left cell (2.5) is worse than the current bound of 2.2. The update of bound in each iteration (line 10) can be implemented efficiently by maintaining a second heap with the same contents as cand but prioritized by h instead. In practice this is a negligible overhead on top of cube pruning.
We now turn to the problem of estimating the heuristic function h_combo. In practice, computing true lower bounds on the combination costs is too slow and would compromise the speed-up gained from cube growing. So we instead use a much simpler method that just records the minimum combination cost of each hyperedge among the top-i derivations of the root node in −LM decoding. This is only an approximation of the true lower bound, and bad estimates can lead to search errors. However, the hope is that by choosing the right value of i, these estimates will be accurate enough to affect the search quality only slightly, which is analogous to the almost-admissible heuristics in A* search (Soricut, 2006). Using Algorithm 3, this preprocessing step runs very fast in practice.
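A sketch of this estimation step, under the assumption that we can enumerate, for each of the top-i −LM derivations, the hyperedges it uses together with the +LM combination cost incurred at each one (the accessor combo_costs below is hypothetical):

    def estimate_h_combo(top_i_derivations, combo_costs):
        """Estimate h_combo(e) as the minimum +LM combination cost observed for each
        hyperedge among the top-i -LM derivations of the root node."""
        h_combo = {}
        for d in top_i_derivations:
            for e, c in combo_costs(d):          # hypothetical accessor: (hyperedge, cost) pairs
                if e not in h_combo or c < h_combo[e]:
                    h_combo[e] = c
        # Hyperedges never seen among the top-i derivations fall back to 0.0, which is
        # only a valid lower bound if combination costs are non-negative (an assumption).
        return lambda e: h_combo.get(e, 0.0)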
4.3 Forest Pruning Algorithm
We use the relatively-useless pruning algorithm of Graehl (2005), which is very similar to the method based on marginal probability (Charniak and Johnson, 2005), except that ours prunes hyperedges as well as nodes. Basically, we use an inside-outside algorithm to compute the Viterbi inside cost β(v) and the Viterbi outside cost α(v) of each node v, and then compute the merit αβ(e) of each hyperedge:

    αβ(e) = α(head(e)) + Σ_{u_i ∈ tails(e)} β(u_i)                    (4.2)

Intuitively, this merit is the cost of the best derivation that traverses e, and the difference δ(e) = αβ(e) − αβ(TOP) can be seen as its distance from the globally best derivation. We prune away all hyperedges with δ(e) > p for a threshold p. Nodes with all incoming hyperedges pruned are also pruned. The key difference from (Charniak and Johnson, 2005) is that in our algorithm a node can partially survive the beam, with only a subset of its hyperedges pruned. In practice, this method prunes on average 15% more hyperedges than theirs.
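The following Python sketch realizes this pruning procedure under a few assumptions: costs are non-negative, each hyperedge carries an explicit local cost that enters both the inside/outside recursions and the merit (so that the merit is exactly the cost of the best derivation traversing the hyperedge), and the forest is given as incoming hyperedges per node in topological order.

    def prune_forest(topo_order, BS, TOP, p):
        """Inside-outside forest pruning: drop hyperedges whose best derivation is
        worse than the globally best derivation by more than the threshold p.

        topo_order -- nodes with tails before heads
        BS         -- dict: node -> incoming hyperedges, each a pair (tails, cost)
                      where cost is the hyperedge's own local cost (an assumption)
        """
        INF = float('inf')
        beta = {}                                            # Viterbi inside costs
        for v in topo_order:
            edges = BS[v]
            beta[v] = 0.0 if not edges else min(
                c + sum(beta[u] for u in tails) for tails, c in edges)

        alpha = {v: INF for v in topo_order}                 # Viterbi outside costs
        alpha[TOP] = 0.0
        for v in reversed(topo_order):                       # top-down pass
            for tails, c in BS[v]:
                for i, u in enumerate(tails):
                    rest = sum(beta[w] for jdx, w in enumerate(tails) if jdx != i)
                    alpha[u] = min(alpha[u], alpha[v] + c + rest)

        survivors = {}
        for v in topo_order:
            kept = []
            for tails, c in BS[v]:
                merit = alpha[v] + c + sum(beta[u] for u in tails)   # best derivation through e
                if merit - beta[TOP] <= p:                           # delta(e) <= p: keep it
                    kept.append((tails, c))
            survivors[v] = kept          # a node is pruned only if all its incoming edges are
        return survivors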
4.4 Experiments
We test our methods on two large-scale English-to-Chinese translation systems: a phrase-
based system and our tree-to-string system (Huang et al., 2006).
4.4.1 Phrase-based Decoding
We implemented Cubit, a Python clone of the Pharaoh decoder (Koehn, 2004),³ and adapted cube pruning to it as follows. As in Pharaoh, each bin i contains hypotheses (i.e., +LM items) covering i words on the source side. But at each bin (see Figure 4.3), all +LM items from previous bins are first partitioned into −LM items; the hyperedges leading from those −LM items are then further grouped into hyperedge bundles (Figure 4.4), which are placed into the priority queue of the current bin.
Our data preparation follows Huang et al. (2006): the training data is a parallel corpus with 28.3M words on the English side, and a trigram language model is trained on the Chinese side. We use the same test set as (Huang et al., 2006), a 140-sentence subset of the NIST 2003 test set with 9-36 words on the English side. The weights of the log-linear model are tuned on a separate development set. We set the decoder's phrase-table limit to 100, as suggested in (Koehn, 2004), and the distortion limit to 4.

³In our tests, Cubit always obtains a BLEU score within 0.004 of Pharaoh's (Figure 4.5(b)). Source code is available at http://www.cis.upenn.edu/~lhuang3/cubit/
Figure 4.3: (a) Pharaoh expands the hypotheses in the current bin (#2) into longer ones.
(b) In Cubit, hypotheses in previous bins are fed via hyperedge bundles (solid arrows) into
a priority queue (shaded triangle), which empties into the current bin (#5).
[Figure 4.4 shows a grid of combination costs between antecedent hypotheses (ending in "with Sharon", "and Sharon", and "with Ariel Sharon") and the applicable target-side phrases (meeting), (talk), (conference).]
Figure 4.4: A hyperedge bundle represents all +LM deductions that derive an item in the current bin from the same coverage vector (see Figure 4.3). The phrases on the top denote the target-sides of applicable phrase-pairs sharing the same source-side.
[Figure 4.5: two plots against the average number of hypotheses per sentence: (a) average model cost, comparing full-integration (Cubit) and cube pruning (Cubit); (b) BLEU score, comparing Pharaoh, full-integration (Cubit), and cube pruning (Cubit).]
Figure 4.5: Cube pruning vs. full-integration (with beam search) on phrase-based decoding.
We set the decoder phrase-table limit to 100, as suggested in (Koehn, 2004), and the distortion limit to 4.
Figure 4.5(a) compares cube pruning against full-integration in terms of search quality vs. search efficiency, under various pruning settings (threshold beam set to 0.0001, stack size varying from 1 to 200). Search quality is measured by average model cost per sentence (lower is better), and search efficiency is measured by the average number of hypotheses generated (smaller is faster). At each level of search quality, the speed-up is always better than a factor of 10. The speed-up at the lowest search-error level is a factor of 32. Figure 4.5(b) makes a similar comparison but measures search quality by BLEU, which shows an even larger relative speed-up for a given BLEU score, because translations with very different model costs might have similar BLEU scores. It also shows that our full-integration implementation in Cubit faithfully reproduces Pharaoh's performance. Fixing the stack size to 100 and varying the threshold yielded a similar result.
4.4.2 Tree-to-string Decoding
In tree-to-string or syntax-directed decoding (Huang et al., 2006; Liu et al., 2006) (see
Chapter 6 for details), the source string is first parsed into a tree, which is then recursively
converted into a target string according to transfer rules in a synchronous grammar (Galley
et al., 2006). For instance, the following rule translates an English passive construction
into Chinese:
    VP(VBD(was) VP-C(x_1:VBN PP(IN(by) x_2:NP-C)))  →  bèi x_2 x_1

[Figure 4.6: two plots against the average number of +LM items explored per sentence: (a) average model cost and (b) BLEU score, each comparing full-integration, cube pruning, and cube growing.]
Figure 4.6: Cube growing vs. cube pruning vs. full-integration (with beam search) on tree-to-string decoding.
Our tree-to-string system performs slightly better than the state-of-the-art phrase-based system Pharaoh on the above data set. Although different from the SCFG-based systems in Section 2.2, its derivation trees remain context-free and the search space is still a hypergraph, where we can adapt the methods presented in Sections 4.1 and 4.2.
The data set is the same as in Section 4.4.1, except that we also parsed the English side using a variant of the Collins (1999) parser, and then extracted 24.7M tree-to-string rules using the algorithm of (Galley et al., 2006). Since our tree-to-string rules may have many variables, we first binarize each hyperedge in the forest on the target projection (Huang, 2007). All three +LM decoding methods to be compared below take these binarized forests as input. For cube growing, we use a non-duplicate k-best method (Huang et al., 2006) to get 100-best unique translations according to −LM to estimate the lower-bound heuristics.⁴
This preprocessing step takes on average 0.12 seconds per sentence, which is negligible in comparison to the +LM decoding time.

⁴ If a hyperedge is not represented at all in the 100-best −LM derivations at the root node, we use the 1-best −LM derivation of this hyperedge instead. Here, rules that share the same source side but have different target sides are treated as separate hyperedges, not collected into hyperedge bundles, since grouping becomes difficult after binarization.
Figure 4.6(a) compares cube growing and cube pruning against full-integration under various beam settings, in the same fashion as Figure 4.5(a). At the lowest level of search error, the relative speed-up from cube growing and cube pruning compared with full-integration is by a factor of 9.8 and 4.1, respectively. Figure 4.6(b) is a similar comparison in terms of BLEU scores, and shows an even bigger advantage of cube growing and cube pruning over the baseline.
4.5 Summary
We have presented a novel extension of cube pruning called cube growing, and shown how
both can be seen as general forest rescoring techniques applicable to both phrase-based
and syntax-based decoding. We evaluated these methods on large-scale translation tasks
and observed considerable speed improvements, often by more than a factor of ten.
These forest rescoring algorithms have potential applications to other computationally
intensive tasks involving combinations of different models, or refinements of the state space
of a model. Examples of such tasks abound in natural language processing: for example,
head-lexicalized parsing (Collins, 1999); joint parsing and semantic role labeling (Sutton
and McCallum, 2005); joint word-segmentation and part-of-speech tagging for Chinese
(Ng and Low, 2004). Thus we envision forest rescoring as being of general applicability
for reducing complicated search spaces, as an alternative to simulated annealing methods
(Kirkpatrick et al., 1983). For instance, the next chapter (Chapter 5) adapts this technique
to discriminative parsing with non-local features.
Chapter 5
Approximate Dynamic
Programming II: Forest Reranking
We now turn back to parsing. Chapter 3 provides very efficient k-best parsing algorithms and shows that they have the potential to improve results from parse reranking systems like those of Collins (2000) and Charniak and Johnson (2005). Typically such a system reranks the k-best list with arbitrary features that are either not computable or intractable to compute within the baseline system. However, although popular in recent years (Collins, 2000; Shen et al., 2004), this pipelined approach has a fundamental drawback: it suffers from the limited scope of the k-best list, which rules out many potentially good alternatives. For example, 41% of the correct parses were not among the 30-best candidates in (Collins, 2000). This situation becomes worse with longer sentences because the number of possible interpretations usually grows exponentially with the sentence length. As a result, we often see very few variations among the k-best trees; for example, 50-best trees typically just represent a combination of 5 to 6 binary ambiguities (since 2⁵ < 50 < 2⁶).
Alternatively, discriminative parsing is tractable with exact and efficient search based on dynamic programming (DP) if all features are restricted to be local, that is, only looking at a local window within the factored search space (Taskar et al., 2004; McDonald et al., 2005; Petrov and Klein, 2008; Finkel et al., 2008; Carreras et al., 2008). However, we miss the benefit of non-local features that are not representable in this model (Johnson, 1998; Miyao and Tsujii, 2002).
                                          local              non-local
conventional reranking                    only at the root
DP-based discriminative parsing           exact              N/A
forest annotation by splitting states     exact but intractable
this work: forest-reranking               exact              on-the-fly

Table 5.1: Comparison of various approaches for incorporating local and non-local features.
Ideally, we would wish to combine the merits of both approaches, where an efficient inference algorithm could integrate both local and non-local features. Although exact search is intractable (at least in theory) for features with unbounded scope, fortunately, this situation resembles the huge +LM forests in MT decoding, where forest rescoring techniques can effectively reduce the search space. So in this chapter, we adapt forest rescoring to forest reranking, an approximation technique that reranks the packed forest of exponentially many parses. The key idea is to compute non-local features incrementally from the bottom up, so that we can rerank the k-best subtrees at all internal nodes, instead of only at the root node as in conventional reranking (see Table 5.1).
Although previous work on discriminative parsing has mainly focused on short sentences (≤ 15 words) (Taskar et al., 2004; Turian and Melamed, 2007), our work scales to the whole Treebank, where we achieved an F-score of 91.7, which is a 19% error reduction from the 1-best baseline, and outperforms both 50-best and 100-best reranking. This result is also better than that of any previously reported system trained on the Treebank.
This chapter is based on material in (Huang, 2008b).
5.1 Generic Reranking with the Perceptron
We first establish a unified framework for parse reranking with both k-best lists and packed forests. For a given sentence s, a generic reranker selects the best parse ŷ among the set of candidates cand(s) according to some scoring function:

    ŷ = argmax_{y ∈ cand(s)} score(y)    (5.1)

In k-best reranking, cand(s) is simply a set of k-best parses from the baseline parser, that is, cand(s) = {y_1, y_2, ..., y_k}. Whereas in forest reranking, cand(s) is a forest implicitly representing the set of exponentially many parses.
As usual, we define the score of a parse y to be the dot product between a high-dimensional feature representation and a weight vector w:

    score(y) = w · f(y)    (5.2)

where the feature extractor f is a vector of d functions f = (f_1, ..., f_d), and each feature f_j maps a parse y to a real number f_j(y). Following (Charniak and Johnson, 2005), the first feature f_1(y) = log Pr(y) is the log probability of a parse from the baseline generative parser, while the remaining features are all integer-valued, and each of them counts the number of times that a particular configuration occurs in parse y. For example, one such feature f_1000 might be a question

    "how many times does the rule VP → VBD NP PP appear in parse y?"

which is an instance of the Rule feature (Figure 5.1(a)), while another feature f_2000 might be

    "how many times is a VP constituent of length 5 surrounded by the word 'has' on the left and the period on the right?"

which is an instance of the WordEdges feature (see Figure 5.1(c) and Section 5.2 for details).
Using a machine learning algorithm, the weight vector w can be estimated from the training data, where each sentence s_i is labelled with its correct (gold-standard) parse y*_i. As for the learner, Collins (2000) uses the boosting algorithm and Charniak and Johnson (2005) use the maximum entropy estimator. In this work we use the averaged perceptron algorithm (Collins, 2002), since it is an online algorithm that is much simpler and orders of magnitude faster than the boosting and MaxEnt methods. Previous applications of the perceptron algorithm to reranking include (Collins and Duffy, 2002) and (Shen et al., 2004), both on k-best reranking, while this work applies it to forest reranking.
Code 5.1 Generic Reranking with the Perceptron.
1: Input: Training examples {⟨cand(s_i), y_i^+⟩}_{i=1}^N    ▷ y_i^+ is the oracle tree for sentence s_i among cand(s_i)
2: w ← 0    ▷ initial weights
3: for t ← 1 ... T do    ▷ T iterations
4:   for i ← 1 ... N do
5:     ŷ ← argmax_{y ∈ cand(s_i)} w · f(y)
6:     if ŷ ≠ y_i^+ then
7:       w ← w + f(y_i^+) − f(ŷ)
8: return w
Shown in Code 5.1, the perceptron algorithm makes several passes over the whole training data, and in each iteration, for each sentence s_i, it tries to predict a best parse ŷ_i among the candidates cand(s_i) using the current weight setting. Intuitively, we want the gold parse y*_i to be picked, but in general it is not guaranteed to be within cand(s_i), because the grammar may fail to cover the gold parse, and because the gold parse may be pruned away due to the limited scope of cand(s_i). Following Charniak and Johnson (2005) and Collins (2000), we define an oracle parse y_i^+ to be the candidate that has the highest Parseval F-score with respect to the gold tree y*_i:¹
    y_i^+ ≜ argmax_{y ∈ cand(s_i)} F(y, y*_i)    (5.3)
where function F returns the F-score. Now we train the reranker to pick the oracle parses as often as possible, and in case an error is made (line 6), we do an update on the weight vector (line 7), by adding the difference between the two feature representations.
In k-best reranking, since all parses are explicitly enumerated, it is trivial to compute the oracle tree.² However, in forest reranking, it becomes less obvious and indeed highly non-trivial how to identify the forest oracle. We will present a dynamic programming algorithm for this problem in Section 5.4.

¹ If one uses the gold y*_i for the oracle y_i^+, the perceptron will continue to make updates towards something unreachable even when the decoder has picked the best possible candidate.
² In case multiple candidates get the same highest F-score, we choose the parse with the highest log probability from the baseline parser to be the oracle parse (Collins, 2000).
[Figure 5.1 shows four example feature instances: (a) Rule (local); (b) ParentRule (non-local); (c) WordEdges (local); (d) NGramTree (non-local). The instance lines read: VP → VBD NP PP; VP → VBD NP PP | S; NP 5 has .; VP (VBD saw) (NP (DT the)).]
Figure 5.1: Illustration of some example features. Shaded nodes denote information included in the feature.
We also use a refinement called averaged parameters where the final weight vector is the average of weight vectors after each sentence in each iteration over the training data. This averaging effect has been shown to reduce overfitting and produce much more stable results (Collins, 2002).
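As an illustration of Code 5.1 with averaging, here is a small Python sketch (not the thesis implementation); it assumes each training example is given as a list of candidate feature-count dictionaries plus the index of its oracle candidate, and it keeps a running sum of the weight vector to produce the averaged parameters.

from collections import defaultdict

def dot(w, feats):
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def averaged_perceptron(train, T=5):
    """train: list of (candidates, oracle_index) pairs, where candidates is a list of
    feature dicts {feature_name: value} and oracle_index points at the oracle parse."""
    w = defaultdict(float)        # current weights
    w_sum = defaultdict(float)    # running sum for the averaged parameters
    steps = 0
    for t in range(T):
        for candidates, oracle in train:
            # predict the best candidate under the current weights
            pred = max(range(len(candidates)), key=lambda i: dot(w, candidates[i]))
            if pred != oracle:    # mistake: update towards the oracle, away from the prediction
                for f, v in candidates[oracle].items():
                    w[f] += v
                for f, v in candidates[pred].items():
                    w[f] -= v
            steps += 1            # accumulate after every example
            for f, v in w.items():
                w_sum[f] += v
    return {f: v / steps for f, v in w_sum.items()}

# toy usage: two sentences, two candidates each; "logp" plays the role of the log-probability feature
toy = [
    ([{"logp": -10.0, "rule:VP->VBD NP PP": 1}, {"logp": -9.5}], 0),
    ([{"logp": -8.0}, {"logp": -8.2, "rule:VP->VBD NP PP": 1}], 1),
]
print(averaged_perceptron(toy, T=3))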
5.2 Factorization of Local and Non-Local Features
A key difference between k-best and forest reranking is the computation of features. In k-best reranking, all features are treated equivalently by the decoder, which simply computes the value of each feature one by one on each candidate parse. However, for forest reranking, since the trees are not explicitly enumerated, many features are not directly computable. So we first classify the feature set into local and non-local subsets, which the decoder will process in very different fashions.
We define a feature f to be local if and only if it can be factored among the local productions in a tree, and non-local otherwise. For example, the Rule feature in Fig. 5.1(a) is local, while the ParentRule feature in Fig. 5.1(b) is non-local. It is worth noting that some features which seem complicated at first sight are indeed local. For example, the WordEdges feature in Fig. 5.1(c), which classifies a node by its label, span length, and surrounding words, is still local, since all this information is encoded either in the node itself or in the input sentence. In contrast, it would become non-local if we replaced the surrounding words by surrounding POS tags, which are generated dynamically.
More formally, we split the feature extractor f = (f_1, ..., f_d) into f = (f_L; f_N) where f_L and f_N are the local and non-local features, respectively. For the former, we extend their domains from parses to hyperedges, where f(e) returns the value of a local feature f ∈ f_L on hyperedge e, and its value on a parse y factors across the hyperedges (local productions):

    f_L(y) = Σ_{e ∈ y} f_L(e)    (5.4)

For a forest, we can pre-compute f_L(e) for each hyperedge e.
Non-local features, however, cannot be pre-computed, but we still prefer to compute them as early as possible, which we call on-the-fly computation, so that our decoder can be sensitive to them at internal nodes. For instance, the NGramTree feature in Fig. 5.1(d) returns the minimum tree fragment containing two consecutive words, in this case "saw" and "the", and should thus be computed at the smallest common ancestor of the two, which is the VP node in this example. Similarly, the ParentRule feature in Fig. 5.1(b) can be computed when the S subtree is formed. In doing so, we essentially factor non-local features across subtrees, where for each subtree y′ in a parse y, we define a unit feature f̂(y′) to be the part of f(y) that is computable within y′, but not computable in any (proper) subtree of y′. Then we have:

    f_N(y) = Σ_{y′ ∈ y} f̂_N(y′)    (5.5)
[Figure 5.2 depicts a binary-branching node A_{i,k} with children B_{i,j} spanning w_i ... w_{j−1} and C_{j,k} spanning w_j ... w_{k−1}.]
Figure 5.2: Example of the unit NGramTree feature at node A_{i,k}: ⟨ A (B ... w_{j−1}) (C ... w_j) ⟩.
Intuitively, we compute the unit non-local features at each subtree from the bottom up. For example, for the binary-branching node A_{i,k} in Fig. 5.2, the unit NGramTree instance is for the pair ⟨w_{j−1}, w_j⟩ on the boundary between the two subtrees, whose smallest common ancestor is the current node. Other unit NGramTree instances within this span have already been computed in the subtrees, except those for the boundary words of the whole node, w_i and w_{k−1}, which will be computed when this node is further combined with other nodes in the future.
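The following Python sketch illustrates this bottom-up computation for the NGramTree case only, under simplified assumptions (trees as nested tuples, and an ad hoc string encoding of the feature instance that is not the actual feature template): at a binary node, the unit instance is built from the right spine of the left child down to w_{j−1} and the left spine of the right child down to w_j.

# Trees are nested tuples: (label, children_list) for internal nodes, (tag, word) for leaves.
def right_spine(tree):
    """Labels on the path from the root of `tree` down to its rightmost word."""
    label, rest = tree
    if isinstance(rest, str):
        return [(label, rest)]
    return [(label, None)] + right_spine(rest[-1])

def left_spine(tree):
    label, rest = tree
    if isinstance(rest, str):
        return [(label, rest)]
    return [(label, None)] + left_spine(rest[0])

def unit_ngram_tree(parent_label, left_child, right_child):
    """Unit NGramTree instance for the boundary bigram <w_{j-1}, w_j> at this node."""
    def render(spine):
        return " ".join(l if w is None else f"({l} {w})" for l, w in spine)
    return f"NGramTree: {parent_label} ( {render(right_spine(left_child))} | {render(left_spine(right_child))} )"

# toy example in the spirit of Figure 5.1(d): VP -> (VBD saw) (NP (DT the) (NN dog))
vbd = ("VBD", "saw")
np = ("NP", [("DT", "the"), ("NN", "dog")])
print(unit_ngram_tree("VP", vbd, np))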
5.3 Approximate Decoding via Cube Pruning
Before moving on to approximate decoding with non-local features, we first describe the algorithm for exact decoding when only local features are present. We will follow the notations from Chapter 3, where D(v) denotes the top derivations of node v, with D_1(v) being its 1-best derivation. We also use the derivation-with-backpointers notation (see Definition 3.1) ⟨e, j⟩ to denote the derivation along hyperedge e, using the j_i-th (sub-)derivation for tail u_i, so ⟨e, 1⟩ is the best derivation along e. The exact decoding algorithm, shown in Code 5.2, is an instance of the bottom-up Viterbi algorithm, which traverses the hypergraph in a topological order, and at each node v, calculates its 1-best derivation using each incoming hyperedge e ∈ BS(v). The cost of e is the score of its (pre-computed) local features w · f_L(e). This algorithm has a time complexity of O(E), and is almost identical to traditional chart parsing, except that the forest might be more than binary-branching.
For non-local features, we adapt cube pruning from forest rescoring (Chiang, 2007; Huang and Chiang, 2007), since the situation here is analogous to machine translation decoding with integrated language models: we can view the scores of unit non-local features as the language model cost, computed on-the-fly when combining sub-constituents.
Code 5.2 Exact Decoding with Local Features: Generalized Viterbi (3.1).
1: function Viterbi(⟨V, E⟩)
2:   for v ∈ V in topological order do
3:     for e ∈ BS(v) do
4:       c(e) ← w · f_L(e) + Σ_{u_i ∈ tails(e)} c(D_1(u_i))
5:       if c(e) > c(D_1(v)) then    ▷ better derivation?
6:         D_1(v) ← ⟨e, 1⟩
7:         c(D_1(v)) ← c(e)
8: return D_1(TOP)
Shown in Code 5.3, cube pruning works bottom-up on the forest, and keeps a beam of at most k derivations at each node. When combining the sub-derivations along a hyperedge e to form a new subtree y′ = ⟨e, j⟩, we also compute its unit non-local feature values f̂_N(⟨e, j⟩) (line 26). A priority queue (heap in the pseudo-code) is used to hold the candidates for the next-best derivation, which is initialized to the set of best derivations along each hyperedge (lines 7 to 9). Then at each iteration, we pop the best derivation from heap (line 13), and push its successors into heap (line 15). Analogous to the language model cost in forest rescoring, the unit feature cost here is a non-monotonic score in the dynamic programming backbone, and the derivations may thus be extracted out of order. So a buffer buf is used to keep the extracted derivations, which is sorted at the end (line 16) to form the list of top-k derivations D(v) of node v. The complexity of this algorithm is O(E + V k log k Ĉ), where O(Ĉ) is the time for on-the-fly feature extraction for each subtree. As in forest rescoring, the use of the priority queue here brings a dramatic speed-up, while the feature extraction time is the bottleneck in practice.
Code 5.3 Approximate Decoding with Non-Local Features: Cube Pruning (4.1).
1: function Cube(⟨V, E⟩)
2:   for v ∈ V in topological order do
3:     KBest(v)
4: return D_1(TOP)

5: procedure KBest(v)
6:   heap ← ∅
7:   for e ∈ BS(v) do
8:     c(⟨e, 1⟩) ← Eval(e, 1)    ▷ extract unit features
9:     append ⟨e, 1⟩ to heap
10:  Heapify(heap)    ▷ prioritized frontier
11:  buf ← ∅
12:  while |heap| > 0 and |buf| < k do
13:    item ← Pop-Max(heap)    ▷ extract next-best
14:    append item to buf
15:    PushSucc(item, heap)
16:  sort buf to D(v)

17: procedure PushSucc(⟨e, j⟩, heap)
18:   e is v → u_1 ... u_|e|
19:   for i in 1 ... |e| do
20:     j′ ← j + b_i    ▷ b_i is 1 only on the i-th dim.
21:     if |D(u_i)| ≥ j′_i then    ▷ enough sub-derivations?
22:       c(⟨e, j′⟩) ← Eval(e, j′)    ▷ unit features
23:       Push(⟨e, j′⟩, heap)

24: function Eval(e, j)
25:   e is v → u_1 ... u_|e|
26:   return w · f_L(e) + w · f̂_N(⟨e, j⟩) + Σ_i c(D_{j_i}(u_i))
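A compact Python sketch of this cube-pruning loop is given below. It is not the thesis code: it assumes each node's incoming hyperedges are given as (tails, local_score) pairs, uses a caller-supplied unit_score function as a stand-in for the unit non-local feature score, uses 0-based indices for the j vectors, and negates scores because Python's heapq is a min-heap. The 1-best derivation of the root is then D[nodes[-1]][0].

import heapq

def cube_prune(nodes, incoming, unit_score, k=15):
    """nodes: topologically sorted node ids; incoming[v]: list of hyperedges (tails, local_score).
    Returns D[v]: list of up to k (score, (edge_index, j_vector)) derivations, best first."""
    D = {}
    for v in nodes:
        if not incoming[v]:
            D[v] = [(0.0, None)]                  # leaf: single trivial derivation
            continue
        def evaluate(ei, j):
            tails, local = incoming[v][ei]
            subs = [D[u][ji] for u, ji in zip(tails, j)]
            return local + unit_score(v, ei, j, subs) + sum(s for s, _ in subs)
        heap, seen, buf = [], set(), []
        for ei in range(len(incoming[v])):        # initialize with <e, 1> for every hyperedge
            j = (0,) * len(incoming[v][ei][0])
            heapq.heappush(heap, (-evaluate(ei, j), ei, j))
            seen.add((ei, j))
        while heap and len(buf) < k:
            neg, ei, j = heapq.heappop(heap)      # pop the next-best candidate
            buf.append((-neg, (ei, j)))
            tails = incoming[v][ei][0]
            for i in range(len(tails)):           # push successors along each dimension
                j2 = j[:i] + (j[i] + 1,) + j[i + 1:]
                if j2[i] < len(D[tails[i]]) and (ei, j2) not in seen:
                    seen.add((ei, j2))
                    heapq.heappush(heap, (-evaluate(ei, j2), ei, j2))
        buf.sort(key=lambda x: -x[0])             # re-sort: unit scores are non-monotonic
        D[v] = buf
    return D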
5.4 Forest Oracle Algorithm
We now turn to the problem of finding the oracle tree in a forest. Recall that the Parseval F-score is the harmonic mean of labelled precision P and labelled recall R:

    F(y, y*) ≜ 2PR / (P + R) = 2|y ∩ y*| / (|y| + |y*|)    (5.6)

where |y| and |y*| are the numbers of brackets in the test parse and the gold parse, respectively, and |y ∩ y*| is the number of matched brackets. Since the harmonic mean is a non-linear combination, we cannot optimize the F-scores on sub-forests independently with a greedy algorithm. In other words, the optimal F-score tree in a forest is not guaranteed to be composed of two optimal F-score subtrees.
We instead propose a dynamic programming algorithm which optimizes the number of matched brackets for a given number of test brackets. For example, our algorithm will ask questions like,

    "when a test parse has 5 brackets, what is the maximum number of matched brackets?"

More formally, at each node v, we compute an oracle function ora[v] : N → N, which maps an integer t to ora[v](t), the maximum number of matched brackets over all parses y_v of node v with exactly t brackets:

    ora[v](t) ≜ max_{y_v : |y_v| = t} |y_v ∩ y*|    (5.7)
When node v is combined with another node u along a hyperedge e = ⟨(v, u), w⟩, we have to combine the two oracle functions ora[v] and ora[u], for which we need to define a convolution operator ⊗ between two functions f and g:

    (f ⊗ g)(t) ≜ max_{t_1 + t_2 = t} f(t_1) + g(t_2)    (5.8)

Intuitively, this convolution distributes the number of test brackets between the two subtrees, and optimizes the number of matches. For instance:

    t  f(t)         t  g(t)         t  (f ⊗ g)(t)
    2   1      ⊗    4   4      =    6   5
    3   2           5   4           7   6
                                    8   6
Code 5.4 Forest Oracle Algorithm.
1: function Oracle(⟨V, E⟩, y*)
2:   for v ∈ V in topological order do
3:     for e ∈ BS(v) do
4:       e is v → u_1 u_2 ... u_|e|
5:       ora[v] ← ora[v] ⊕ (⊗_i ora[u_i])
6:     ora[v] ← ora[v] ⇀ (1, 1_{v ∈ y*})
7: return F(y+, y*) = max_t 2 · ora[TOP](t) / (t + |y*|)    ▷ oracle F_1
The oracle function for the head node w is then

    ora[w](t) = (ora[v] ⊗ ora[u])(t − 1) + 1_{w ∈ y*}    (5.9)

where 1 is the indicator function, returning 1 if node w is found in the gold tree y*, in which case we increment the number of matched brackets. We can also express Eq. 5.9 in a purely functional form

    ora[w] = (ora[v] ⊗ ora[u]) ⇀ (1, 1_{w ∈ y*})    (5.10)

where ⇀ is a translation operator which shifts a function along the axes:

    (f ⇀ (a, b))(t) ≜ f(t − a) + b    (5.11)

Above we discussed the case of one hyperedge. If there is another hyperedge e′ deriving node w, we also need to combine the resulting oracle functions from both hyperedges, for which we define a pointwise addition operator ⊕:

    (f ⊕ g)(t) ≜ max{f(t), g(t)}    (5.12)
Shown in Code 5.4, we perform these computations in a bottom-up topological order, and finally at the root node TOP, we can compute the best global F-score by maximizing over different numbers of test brackets (line 7). The oracle tree y+ can be recursively restored by keeping backpointers for each ora[v](t), which we omit in the pseudo-code.
The computational complexity of this algorithm for a sentence of l words is O(|E| · l^{2(a−1)}), where a is the arity of the forest. For a CKY forest, this complexity amounts to O(l³ · l²) = O(l⁵), but for general forests like those in our experiments the complexities are much higher. In practice it takes on average 0.05 seconds for forests pruned by p = 10 (see Section 4.3), but we can pre-compute and store the oracle for each forest before training starts.
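Here is a small Python sketch of this dynamic program (not the thesis code), representing each oracle function as a dictionary from the number of test brackets t to the maximum number of matches, with the ⊗, ⊕, and shift operators written out explicitly; it assumes forest nodes are identified by their labeled spans so that membership in the gold bracket set is a simple lookup, and that leaf nodes do not count as brackets.

def convolve(f, g):                    # the (f (x) g) operator of Eq. 5.8
    h = {}
    for t1, m1 in f.items():
        for t2, m2 in g.items():
            h[t1 + t2] = max(h.get(t1 + t2, -1), m1 + m2)
    return h

def shift(f, a, b):                    # the translation operator of Eq. 5.11
    return {t + a: m + b for t, m in f.items()}

def pointwise_max(f, g):               # the (+) operator of Eq. 5.12
    return {t: max(f.get(t, -1), g.get(t, -1)) for t in set(f) | set(g)}

def forest_oracle(nodes, incoming, gold_brackets):
    """nodes: topological order (root last); incoming[v]: list of hyperedges, each a list of tails;
    gold_brackets: set of node identities (labeled spans) appearing in the gold tree."""
    ora = {}
    for v in nodes:
        if not incoming[v]:
            ora[v] = {0: 0}            # words / leaves contribute no brackets
            continue
        best = None
        for tails in incoming[v]:
            f = {0: 0}
            for u in tails:
                f = convolve(f, ora[u])
            best = f if best is None else pointwise_max(best, f)
        # this constituent adds one test bracket, plus one match if it is in the gold tree
        ora[v] = shift(best, 1, 1 if v in gold_brackets else 0)
    root = nodes[-1]
    return max(2.0 * m / (t + len(gold_brackets)) for t, m in ora[root].items())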
5.5 Experiments
We compare the performance of our forest reranker against k-best reranking on the Penn
English Treebank (Marcus et al., 1993). The baseline parser is the Charniak parser, which
we modified to output a packed forest for each sentence.³
5.5.1 Data Preparation
We use the standard split of the Treebank: sections 02-21 as the training data (39832
sentences), section 22 as the development set (1700 sentences), and section 23 as the test
set (2416 sentences). Following (Charniak and Johnson, 2005), the training set is split into
20 folds, each containing about 1992 sentences, and is parsed by the Charniak parser with
a model trained on sentences from the remaining 19 folds. The development set and the
test set are parsed with a model trained on all 39832 training sentences.
We implemented both k-best and forest reranking systems in Python and ran our
experiments on a 64-bit Dual-Core Intel Xeon with 3.0GHz CPUs. Our feature set is
summarized in Table 5.2, which closely follows Charniak and Johnson (2005), except that
we excluded the non-local features Edges, NGram, and CoPar, and simplified the Rule and NGramTree features, since they were too complicated to compute.⁴ We also added four
unlexicalized local features from Collins (2000) to cope with data sparsity.
Following Charniak and Johnson (2005), we extracted the features from the 50-best parses on the training set (sec. 02-21), and used a cut-off of 5 to prune away low-count features. There are 0.8M features in our final set, considerably fewer than that of Charniak and Johnson, which has about 1.3M features in the updated version.⁵

³ This is a relatively minor change to the Charniak parser, since it implements k-best Algorithm 3 for efficient enumeration of k-best parses, which requires storing the forest. The modified parser and related scripts for handling forests (e.g. oracles) will be available on my homepage.
⁴ In fact, our Rule and ParentRule features are two special cases of the original Rule feature in (Charniak and Johnson, 2005). We also restricted NGramTree to be on bigrams only.
Local          instances        Non-Local      instances
Rule           10,851           ParentRule     18,019
Word           20,328           WProj          27,417
WordEdges      454,101          Heads          70,013
CoLenPar       22               HeadTree       67,836
Bigram†        10,292           Heavy          1,401
Trigram†       24,677           NGramTree      67,559
HeadMod†       12,047           RightBranch    2
DistMod†       16,017

Total Feature Instances: 800,582

Table 5.2: Features used in this work. Those with a † are from (Collins, 2000), and the others are from (Charniak and Johnson, 2005), with simplifications.
However, our initial experiments show that, even with this much simpler feature set, our 50-best reranker performed equally well as theirs (both with an F-score of 91.4; see Tables 5.3 and 5.4). This result confirms that our feature set design is appropriate, and that the averaged perceptron learner is a reasonable candidate for reranking.
The forests dumped from the Charniak parser are huge in size, so we use the forest pruning algorithm in Section 4.3 to prune them down to a reasonable size. In the following experiments we use a threshold of p = 10, which results in forests with an average number of 123.1 hyperedges per forest. Then for each forest, we annotate its forest oracle, and on each hyperedge, pre-compute its local features.⁶ Shown in Figure 5.3, these forests have a forest oracle of 97.8, which is 1.1% higher than the 50-best oracle (96.7), and are 8 times smaller in size.
[Figure 5.3: Parseval F-score (%) vs. average number of hyperedges or brackets per sentence, comparing the forest oracle (pruning thresholds p = 10, 20), the n-best oracle (n = 10, 50, 100), and the 1-best baseline.]
Figure 5.3: Forests (shown with various pruning thresholds) enjoy higher oracle scores than
k-best lists. For the latter, the number of hyperedges is the (average) number of brackets
in the k-best parses per sentence.
baseline: 1-best Charniak parser                                        89.72

n-best reranking
  features    n      pre-comp.       training       F_1 %
  local       50     1.7G / 16h      3 × 0.1h       91.28
  all         50     2.4G / 19h      4 × 0.3h       91.43
  all         100    5.3G / 44h      4 × 0.7h       91.49

forest reranking (p = 10)
  features    k      pre-comp.       training       F_1 %
  local       -      1.2G / 2.9h     3 × 0.8h       91.25
  all         15     1.2G / 2.9h     4 × 6.1h       91.69

Table 5.3: Forest reranking compared to n-best reranking on sec. 23. The pre-comp. column is for feature extraction, and the training column shows the number of perceptron iterations that achieved the best results on the dev set, and the average time per iteration.
5.5.2 Results and Analysis
Table 5.3 compares the performance of forest reranking against standard n-best reranking. For both systems, we first use only the local features, and then all the features. We use the development set to determine the optimal number of iterations for the averaged perceptron, and report the F_1 score on the test set. With only local features, our forest reranker achieves an F-score of 91.25, and with the addition of non-local features, the accuracy rises to 91.69 (with beam size k = 15), which is a 0.26% absolute improvement over 50-best reranking.⁷
This improvement might look relatively small, but it is much harder to make similar progress with n-best reranking. For example, even if we double the size of the n-best list to 100, the performance only goes up by 0.06% (Table 5.3). In fact, the 100-best oracle is only 0.5% higher than the 50-best one (see Fig. 5.3). In addition, the feature extraction step in 100-best reranking produces huge data files and takes 44 hours in total, though this part can be parallelized.⁸ On two CPUs, 100-best reranking takes 25 hours, while our forest reranker can also finish in 26 hours, with much smaller disk space. Indeed, this demonstrates the severe redundancies as another disadvantage of n-best lists, where many subtrees are repeated across different parses, while the packed forest reduces space dramatically by sharing common sub-derivations (see Fig. 5.3).
To put our results in perspective, we also compare them with other best-performing systems in Table 5.4. Our final result (91.7) is better than that of any previously reported system trained on the Treebank, although McClosky et al. (2006) achieved an even higher accuracy (92.1) by leveraging much larger unlabelled data. Moreover, their technique is orthogonal to ours, and we suspect that replacing their n-best reranker by our forest reranker might get an even better performance.

⁵ http://www.cog.brown.edu/mj/software.htm. We follow this version as it corrects some bugs from their 2005 paper, which leads to a 0.4% increase in performance (see Table 5.4).
⁶ A subset of local features, e.g. WordEdges, is independent of which hyperedge the node takes in a derivation, and can thus be annotated on nodes rather than hyperedges. We call these features node-local; they also include part of the Word features.
⁷ It is surprising that 50-best reranking with local features achieves an even higher F-score of 91.28, and we suspect this is due to the aggressive updates and instability of the perceptron, as we do observe the learning curves to be non-monotonic. We leave the use of more stable learning algorithms to future work.
⁸ The n-best feature extraction already uses relative counts (Johnson, 2006), which reduced file sizes by at least a factor of 4.
type    system                            F_1 %
 D      Collins (2000)                    89.7
        Henderson (2004)                  90.1
        Charniak and Johnson (2005)       91.0
          updated (Johnson, 2006)         91.4
        Petrov and Klein (2008)           88.3
        this work                         91.7
 G      Bod (2003)                        90.7
        Petrov and Klein (2007)           90.1
 S      McClosky et al. (2006)            92.1

Table 5.4: Comparison of our final results with other best-performing systems on the whole Section 23. Types D, G, and S denote discriminative, generative, and semi-supervised approaches, respectively.
Plus, except for n-best reranking, most discriminative methods require repeated parsing of the training set, which is generally impractical (Petrov and Klein, 2008). Therefore, previous work often resorted to extremely short sentences (≤ 15 words) or only looked at local features (Taskar et al., 2004; Henderson, 2004; Turian and Melamed, 2007). In comparison, thanks to the efficient decoding, our work not only scaled to the whole Treebank, but also successfully incorporated non-local features, which showed an absolute improvement of 0.44% over that of local features alone.
5.6 Discussion and Summary
There also exist other approaches to incorporating non-local features, mostly from outside parsing. McDonald and Pereira (2006) and Ding (2006) use an annealing method (Kirkpatrick et al., 1983), which makes a few rearrangements to the best solution from the base model with only local features, assuming that the globally best solution is only a few changes away from the base solution. However, this assumption is not guaranteed to hold in general. Finkel et al. (2005) use Gibbs sampling for approximate inference in sequence labeling, which is completely different from our work. Roth and Yih (2005) encode hard constraints by integer linear programming, but this is equivalent to only using non-local features whose weights are fixed to negative infinity. Kazama and Torisawa (2007) regenerate k-best lists after each iteration, which alleviates the problem of fixed scopes, but is still based on k-best reranking.
To summarize, we have presented a framework for dynamic programming-based reranking on packed forests, which compactly encode many more candidates than k-best lists. With the approximate decoding algorithm, perceptron training on the whole Treebank becomes practical, and can be done in 2 days even with a Python implementation. Our final result outperforms both the 50-best and 100-best reranking baselines, and is better than that of any previously reported system trained on the Treebank. We also devised a dynamic programming algorithm for computing forest oracles, which is a non-trivial problem by itself. We believe this general framework could also be applied to other problems involving packed forests, such as dependency parsing and sequence labelling.
Chapter 6
Application: Forest-based
Translation
Continuing the trend of alternating topics between parsing and machine translation, this
chapter is an application of all the materials developed in previous chapters to the problem
of syntax-based translation.
Syntax-based machine translation has witnessed promising improvements in recent years. Depending on the type of input, these efforts can be divided into two broad categories: the string-based systems, whose input is a string to be simultaneously parsed and translated by a synchronous grammar (Wu, 1997; Chiang, 2005; Galley et al., 2006), and the tree-based systems, whose input is already a parse tree to be directly converted into a target tree or string (Lin, 2004; Ding and Palmer, 2005; Quirk et al., 2005; Liu et al., 2006; Huang et al., 2006). Compared with their string-based counterparts, tree-based systems offer some attractive features: they are much faster in decoding (linear time vs. cubic time, see (Huang, 2006)), do not require a binary-branching grammar as in string-based models (Zhang et al., 2006), and can have separate grammars for parsing and translation, say, a context-free grammar for the former and a tree substitution grammar for the latter (Huang et al., 2006). However, despite these advantages, current tree-based systems suffer from a major drawback: they only use the 1-best parse tree to direct the translation, which potentially introduces translation mistakes due to parsing errors (Quirk and Corston-Oliver,
2006). This situation becomes worse with resource-poor source languages without enough
Treebank data to train a high-accuracy parser.
One obvious solution to this problem is to take as input k-best parses, instead of a single tree. This k-best list postpones some disambiguation to the decoder, which may recover from parsing errors by getting a better translation from a non-1-best parse. However, a k-best list, with its limited scope, often has too few variations and too many redundancies; for example, a 50-best list typically encodes a combination of 5 or 6 binary ambiguities (since 2⁵ < 50 < 2⁶), and many subtrees are repeated across different parses (Huang, 2008b). It is thus inefficient to decode separately with each of these very similar trees. Longer sentences will also aggravate this situation, as the number of parses grows exponentially with the sentence length.
We instead propose a new approach, forest-based translation, where the decoder translates a packed forest of exponentially many parses, which compactly encodes many more alternatives than k-best parses. This scheme can be seen as a compromise between the string-based and tree-based methods, while combining the advantages of both: decoding is still fast, yet does not commit to a single parse. Large-scale experiments show an improvement of 1 BLEU point over the 1-best baseline, which is also 0.7 points higher than decoding with 30-best trees, and takes even less time thanks to the sharing of common subtrees.
Pushing this direction further, we also propose to extract translation rules from packed forests, which helps alleviate the propagation of parsing errors into the rule set. It is well known that parsing errors often cause the rule extractor to resort to bigger chunks of rules, which decreases the generalizability of the extracted rules and aggravates the data sparsity problem (May and Knight, 2007). Experiments show further improvements with combined forest-based rule extraction and decoding, with a total gain of 2.5 BLEU points over the 1-best baseline, and our final result outperforms the hierarchical system Hiero (Chiang, 2007).
This chapter is based on materials in (Mi et al., 2008) and (Mi and Huang, 2008).
    IP(NP(x_1:NPB CC(yǔ) x_2:NPB) x_3:VPB)  →  x_1 x_3 with x_2

Figure 6.1: Example translation rule r_1. The Chinese conjunction yǔ ("and") is translated into the English preposition "with".
6.1 Tree-based Translation
We review in this section the tree-based approach to machine translation (Liu et al., 2006;
Huang et al., 2006), and its rule extraction algorithm (Galley et al., 2004; Galley et al.,
2006).
6.1.1 Tree-to-String System
Current tree-based systems perform translation in two separate steps: parsing and decod-
ing. The input string is first parsed by a parser into a 1-best tree, which will then be
converted to a target language string by applying a set of tree-to-string transformation
rules. For example, consider the following example translating from Chinese to English:
(6.1)   Bùshí    yǔ          Shālóng₁    jǔxíng    le       huìtán₂
        Bush     and/with    Sharon      hold      pass.    meeting

        "Bush held a meeting₂ with Sharon₁"
Figure 6.2 shows how this process works. The Chinese sentence (a) is first parsed into a parse tree (b), which will be converted into an English string in 5 steps. First, at the root node, we apply rule r_1 shown in Figure 6.1, which translates the Chinese coordination construction ("... and ...") into an English prepositional phrase. Then, from step (c) we continue applying rules to untranslated Chinese subtrees, until we get the complete English translation in (e).
More formally, a (tree-to-string) translation rule (Galley et al., 2004; Huang et al., 2006) is a tuple ⟨lhs(r), rhs(r), φ(r)⟩, where lhs(r) is the source-side tree fragment,
[Figure 6.2: (a) the Chinese input Bùshí yǔ Shālóng jǔxíng le huìtán; (b) its 1-best parse tree; (c) after applying rule r_1 at the root, the untranslated subtrees NPB(Bùshí), VPB(jǔxíng le huìtán), NPB(Shālóng) arranged as x_1 x_3 "with" x_2; (d) after rules r_2 and r_3: "Bush held" NPB(huìtán) "with" NPB(Shālóng); (e) the complete translation "Bush held a meeting with Sharon" after rules r_4 and r_5.]
Figure 6.2: An example derivation of tree-to-string translation. Each shaded region denotes
a tree fragment that is pattern-matched with the rule being applied.
whose internal nodes are labeled by nonterminal symbols (like NP and VP), and whose frontier nodes are labeled by source-language words (like yǔ) or variables from a set A = {x_1, x_2, ...}; rhs(r) is the target-side string expressed in target-language words (like "with") and variables; and φ(r) is a mapping from A to nonterminals. Each variable x_i ∈ A occurs exactly once in lhs(r) and exactly once in rhs(r). For example, for rule r_1 in Figure 6.1,

    lhs(r_1) = IP ( NP(x_1 CC(yǔ) x_2) x_3 ),
    rhs(r_1) = x_1 x_3 with x_2,
    φ(r_1) = {x_1 → NPB, x_2 → NPB, x_3 → VPB}.
[Figure 6.3 shows the source parse tree of example (6.1), with each node annotated with its target span (e.g. NPB 'Bush', CC 'with', NPB 'Sharon', VPB 'held .. meeting', and NP 'Bush with Sharon', a gapped span), aligned to the target sentence "Bush held a meeting with Sharon", together with the minimal rules extracted, including:
    IP (NP(x_1:NPB x_2:CC x_3:NPB) x_4:VPB) → x_1 x_4 x_2 x_3
    CC (yǔ) → with
    . . . ]
Figure 6.3: Tree-based rule extraction, à la Galley et al. (2004). Each non-leaf node in the tree is annotated with its target span (below the node), where '..' denotes a gap, and non-faithful spans are crossed out. Nodes with contiguous and faithful spans form the frontier set shown in shadow. The first two rules extracted can be composed to form rule r_1 in Figure 6.1; other rules (r_2 ... r_5) are omitted.
These rules are being used in the reverse direction of the string-to-tree transducers in
Galley et al. (2004).
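As an illustration only (hypothetical field names, not the data structures of the systems cited here), a tree-to-string rule can be stored as its source-side pattern, target-side template, and variable mapping, and applied by substituting subtranslations for the variables:

from collections import namedtuple

# A rule stores its source-side tree pattern (with variables), the target-side template,
# and the mapping phi from variables to the nonterminals they must match.
Rule = namedtuple("Rule", "lhs rhs phi")

r1 = Rule(lhs="IP(NP(x1:NPB CC(yu) x2:NPB) x3:VPB)",
          rhs=["x1", "x3", "with", "x2"],
          phi={"x1": "NPB", "x2": "NPB", "x3": "VPB"})

def apply_rule(rule, subtranslations):
    """Substitute each variable in the target template with the translation of its matched subtree."""
    return " ".join(subtranslations.get(tok, tok) for tok in rule.rhs)

# toy usage: assume the three matched subtrees have already been translated
print(apply_rule(r1, {"x1": "Bush", "x2": "Sharon", "x3": "held a meeting"}))
# -> Bush held a meeting with Sharon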
6.1.2 Tree-to-String Rule Extraction
We now briefly explain how to extract these translation rules from a word-aligned bitext
with a source-side parse tree using the algorithm of Galley et al. (2004).
Consider the example in Figure 6.3. The basic idea is to decompose the source (Chinese)
parse into a series of tree fragments, each of which will form a rule with its corresponding
English translation. However, not every fragmentation can be used for rule extraction,
since it may or may not respect the alignment and reordering between the two languages.
So we say a fragmentation is well-formed with respect to an alignment if the root node of
every tree fragment corresponds to a contiguous span on the target side. The intuition
is that there is a translational equivalence between the subtree rooted at the node and
the corresponding target span. For example, in Figure 6.3, every node in the parse tree
is annotated with its corresponding English span, where the NP node maps to a non-
contiguous span Bush with Sharon.
More formally, we need a precise formulation to handle the cases of one-to-many, many-to-one, and many-to-many alignment links. Given a source-target sentence pair (σ, τ) with alignment a, the span of node v is the set of target words aligned to leaf nodes yield(v) under node v:

    span(v) ≜ {τ_i | ∃ σ_j ∈ yield(v), (σ_j, τ_i) ∈ a}.

For example, in Figure 6.3, every node in the parse tree is annotated with its corresponding span below the node, where most nodes have contiguous spans except for the NP node, which maps to a gapped phrase "Bush .. with Sharon". But contiguity alone is not enough to ensure well-formedness, since there might be words within the span aligned to other unrelated nodes as well. So we also define a span s to be faithful to node v if every word in it is only aligned to leaf nodes of v, i.e.:

    ∀ τ_i ∈ s, (σ_j, τ_i) ∈ a ⟹ σ_j ∈ yield(v).
For example, sibling nodes VV and AS in the tree have non-faithful spans (crossed out in the figure), because they both map to "held"; thus neither of them can be translated to "held" alone. In this case, a larger tree fragment rooted at VPB has to be extracted. Nodes with (non-empty) contiguous and faithful spans form the admissible set (shaded nodes in the figure), which serve as potential cut-points for rule extraction.¹
With the admissible set computed, rule extraction is as simple as a depth-first traversal from the root: we cut the tree at all admissible nodes to form tree fragments and extract a rule for each fragment, with variables matching the admissible descendant nodes. For example, the tree in Figure 6.3 is cut into 6 pieces, each of which corresponds to a rule on the right.

¹ Admissible set (Wang et al., 2007a) is also known as frontier set in Galley et al. (2004). For simplicity of presentation, we assume every target word is aligned to at least one source word. See Galley et al. (2006) for details about handling unaligned target words.
These extracted rules are called minimal rules, which can be glued together to form composed rules with larger tree fragments (e.g. r_1 in Fig. 6.1) (Galley et al., 2006). Our experiments use composed rules.
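The span, faithfulness, and admissibility tests above translate directly into a few lines of Python. The sketch below is illustrative rather than the actual extractor; it assumes the alignment is a set of (source index, target index) pairs, a node is represented by the set of source leaf indices it covers, and the toy indices mirror example (6.1).

def target_span(node_leaves, alignment):
    """Target indices aligned to any source leaf under the node (the node's span)."""
    return {i for (j, i) in alignment if j in node_leaves}

def is_faithful(span, node_leaves, alignment):
    """A span is faithful to a node if every target word in it aligns only to leaves of that node."""
    return all(j in node_leaves for (j, i) in alignment if i in span)

def is_admissible(node_leaves, alignment):
    span = target_span(node_leaves, alignment)
    if not span:
        return False
    contiguous = max(span) - min(span) + 1 == len(span)
    return contiguous and is_faithful(span, node_leaves, alignment)

# toy check in the spirit of Figure 6.3: source words 0..5 = Bushi, yu, Shalong, juxing, le, huitan;
# target words 0..5 = Bush, held, a, meeting, with, Sharon. VV (index 3) and AS (index 4) both
# align to "held", so neither is admissible alone, but VPB covering {3, 4, 5} is.
align = {(0, 0), (1, 4), (2, 5), (3, 1), (4, 1), (5, 2), (5, 3)}
print(is_admissible({3}, align), is_admissible({3, 4, 5}, align))   # -> False True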
6.2 Forest-based Decoding
We now extend tree-based decoding from Section 6.1.1 to the case of forest-based transla-
tion. Again, there are two steps, parsing and decoding. In the former, a (modified) parser
described in Chapter 5 will parse the input sentence and output a packed forest rather
than just the 1-best tree. Such a forest is usually huge in size, so we use the forest pruning
algorithm from Section 4.3 to reduce it to a reasonable size. The pruned parse forest will
then be used to direct the translation.
In the decoding step, we first convert the parse forest into a translation forest using the
translation rule set, by similar techniques of pattern-matching from tree-based decoding
(see Section 6.2.1). Then the decoder searches for the best derivation on the translation
forest and outputs the target string (see Section 6.2.2).
We will continue with the running example (6.1). The source Chinese sentence has two
readings depending on the part-of-speech of the word yǔ: it can either be a conjunction (CC "and") as shown in Figure 6.2(b), or a preposition (P "with"), in which case the sentence would still translate into the same English sentence (but with different rules).
These two parses are packed in a forest in Figure 6.4(a).
6.2.1 From Parse Forest to Translation Forest
Given a parse forest and a translation rule set R, we can generate a translation forest which has a similar hypergraph structure. Basically, just as in the depth-first traversal procedure in tree-based decoding (Figure 6.2), we visit in top-down order each node v in the parse forest, and try to pattern-match each translation rule r against the local sub-forest under node v.
[Figure 6.4(a) shows the parse forest of the example sentence, with nodes such as IP_{0,6}, NP_{0,3}, NPB_{0,1}, CC_{1,2}, VP_{1,6}, PP_{1,3}, P_{1,2}, NPB_{2,3}, VPB_{3,6}, VV_{3,4}, AS_{4,5}, and NPB_{5,6}, and parse hyperedges e^p_1, e^p_2, e^p_3; (b) shows the resulting translation forest with translation hyperedges e^t_1 ... e^t_4; (c) lists the correspondence between translation hyperedges and translation rules:]

    e^t_1    r_1: IP(NP(x_1:NPB CC(yǔ) x_2:NPB) x_3:VPB) → x_1 x_3 with x_2
    e^t_2    r_6: IP(x_1:NPB x_2:VP) → x_1 x_2
    e^t_3    r_7: VP(PP(P(yǔ) x_1:NPB) x_2:VPB) → x_2 with x_1
    e^t_4    r_3: VPB(VV(jǔxíng) AS(le) x_1:NPB) → held x_1

Figure 6.4: (a) the parse forest of the example sentence; solid hyperedges denote the 1-best parse in Figure 6.2(b) while dashed hyperedges denote the alternative parse. (b) the resulting translation forest after applying the translation rules (lexical rules not shown); the derivation shown in bold solid lines (e^t_1 and e^t_4) corresponds to the derivation in Figure 6.2, while the one in dashed lines (e^t_2 and e^t_3) uses the alternative parse. (c) the correspondence between translation hyperedges and translation rules.
For example, in Figure 6.4(a), at the root node IP_{0,6}, two rules r_1 and r_6 from Figure 6.4(c) both match the local sub-forest, and will thus generate two translation hyperedges e^t_1 and e^t_2 in Figure 6.4(b). We use e^t to distinguish translation hyperedges from parse hyperedges, denoted by e^p.
More formally, we define a function match(r, v) which attempts to pattern-match the rule r at node v in the parse forest, and in case of success, returns a list of pairs, or an empty list otherwise. Each such pair ⟨frag, vars⟩ represents a matching scenario, where frag is the list of parse hyperedges involved, and vars is the list of descendant nodes of v that are matched to the variables in rule r. For example, in Figure 6.4(a),

    match(r_1, IP_{0,6}) = {⟨(e^p_1, e^p_3), (NPB_{0,1}, NPB_{2,3}, VPB_{3,6})⟩},    (6.2)

returns a single matching scenario, which involves two parse hyperedges and three descendant nodes. The nodes in gray fail to pattern-match any rule (although they are involved in the matching of their ancestor nodes, where they correspond to interior nodes of the source-side tree fragments). This example looks trivial, but in general this procedure is non-deterministic and can return multiple matching scenarios. For example, if we merge NPB with NP and VPB with VP, then a rule

    r′_6: IP (x_1:NP x_2:VP) → x_1 x_2

would match both hyperedges e^p_1 and e^p_2 and return two matching scenarios:

    match(r′_6, IP_{0,6}) = {⟨(e^p_1), (NP_{0,3}, VP_{3,6})⟩, ⟨(e^p_2), (NP_{0,1}, VP_{1,6})⟩}.
For each matching scenario ⟨frag, vars⟩, we can construct a translation hyperedge from the matched descendant nodes vars to v. In addition, we also need to keep track of the target string rhs(r) specified by rule r, which includes target-language terminals and variables, e.g. rhs(r_1) = x_1 x_3 with x_2. The variables will be substituted by the subtranslations of the matched variable nodes to get a complete translation for node v. So informally, a translation hyperedge e^t is a triple ⟨tails(e^t), head(e^t), rhs⟩; for example, from Match (6.2) we have

    e^t_1 = ⟨(NPB_{0,1}, NPB_{2,3}, VPB_{3,6}), IP_{0,6}, "x_1 x_3 with x_2"⟩.
Code 6.1 Forest-based Decoding: from Parse Forest to Translation Forest.
Input: parse forest H_p and rule set R
Output: translation forest H_t
1: for each node v ∈ V_p in top-down order do
2:   for each translation rule r ∈ R do
3:     mlist ← match(r, v)    ▷ matching scenarios
4:     for each ⟨frag, vars⟩ in mlist do
5:       e^t ← ⟨vars, v, rhs(r)⟩
6:       add translation hyperedge e^t to H_t
More formally, in the strict definition of a hypergraph in Chapter 2, a hyperedge would have a weight function (or output function) which computes an output value from the inputs of the tail nodes. In this case, the output function f(e^t) for a translation hyperedge based on rule r is notated:

    f(e^t)(⟨t_1, w_1⟩, ..., ⟨t_n, w_n⟩) = ⟨ rhs(r)[x_i ↦ t_i], (∏_i w_i) · P(e^t) ⟩,

where n = |e^t| is the arity of the hyperedge (the number of variables in the rule), and t_i and w_i are the subtranslations and corresponding weights of the tail nodes, respectively. The probability of the translation hyperedge, P(e^t), is the (weighted) product of the rule probability P(r) and the probabilities of the parse hyperedges involved in the fragment:

    P(e^t) = P(r) · ∏_{e^p_i ∈ frag} P(e^p_i).    (6.3)

For example, P(e^t_1) = P(r_1) · P(e^p_1) · P(e^p_3), and

    f(e^t_1)(⟨t_1, w_1⟩, ⟨t_2, w_2⟩, ⟨t_3, w_3⟩) = ⟨ "t_1 t_3 with t_2", w_1 · w_2 · w_3 · P(e^t_1) ⟩.

This procedure is summarized in Code 6.1.
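A small Python sketch of how one matching scenario could be turned into a translation hyperedge, working in the log domain (hypothetical structures, not the actual decoder; the i-th variable x_{i+1} is assumed to correspond to the i-th matched node, and the feature weights of the weighted product are omitted):

import math
from collections import namedtuple

TransHyperedge = namedtuple("TransHyperedge", "tails head rhs logprob")

def make_trans_hyperedge(head, rule_rhs, rule_logprob, matched_vars, parse_edge_logprobs):
    """Build e^t from one matching scenario: tails are the matched variable nodes, and the
    hyperedge log-probability is log P(r) plus the log-probs of the parse hyperedges in the fragment."""
    return TransHyperedge(tails=tuple(matched_vars), head=head, rhs=rule_rhs,
                          logprob=rule_logprob + sum(parse_edge_logprobs))

def apply_output(edge, sub_outputs):
    """Output function f(e^t): substitute the i-th subtranslation for variable x_{i+1}
    and add up the log-weights; sub_outputs is a list of (string, log-weight) pairs."""
    strings = {f"x{i+1}": s for i, (s, _) in enumerate(sub_outputs)}
    text = " ".join(strings.get(tok, tok) for tok in edge.rhs)
    return text, sum(w for _, w in sub_outputs) + edge.logprob

# toy usage mirroring e^t_1: rule r_1 matched at IP[0,6] over parse hyperedges e^p_1 and e^p_3
e = make_trans_hyperedge("IP[0,6]", ["x1", "x3", "with", "x2"],
                         rule_logprob=math.log(0.5),
                         matched_vars=["NPB[0,1]", "NPB[2,3]", "VPB[3,6]"],
                         parse_edge_logprobs=[math.log(0.8), math.log(0.9)])
print(apply_output(e, [("Bush", -0.1), ("Sharon", -0.2), ("held a meeting", -0.5)]))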
6.2.2 Decoding on the Translation Forest with Language Models
The decoder performs two tasks on the translation forest: 1-best search with an integrated language model (LM), and k-best search with LM to be used in minimum error rate training. Both tasks can be done efficiently by forest-based algorithms based on k-best parsing (Chapters 3 and 4).
For 1-best search, we use the cube pruning technique (Chapter 4), which approximately intersects the translation forest with the LM. Basically, cube pruning works bottom-up in a forest, keeping at most k +LM items at each node, and uses the best-first expansion idea from the k-best Algorithm 2 to speed up the computation. A +LM item of node v has the form v^{a ⋆ b}, where a and b are the target-language boundary words. For example, VP_{1,6}^{held ⋆ Sharon} is a +LM item with its translation starting with "held" and ending with "Sharon". This scheme can be easily extended to work with a general n-gram by storing n − 1 words at both ends (Chiang, 2007).
For k-best search after getting the 1-best derivation, we use the lazy Algorithm 3 (Chapter 3) that works backwards from the root node, incrementally computing the second, third, through the k-th best alternatives. However, this time we work on a finer-grained forest, called the translation+LM forest, resulting from the intersection of the translation forest and the LM, with its nodes being the +LM items during cube pruning. Although this new forest is prohibitively large, Algorithm 3 is very efficient with minimal overhead on top of 1-best.
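For illustration, a +LM item and its combination under a bigram LM can be sketched as follows (hypothetical structures, not the actual decoder; a real implementation would keep n−1 boundary words at each end and handle hyperedges of arbitrary arity):

from collections import namedtuple

# A +LM item records the forest node plus the target-language boundary words
# (one word at each end for a bigram LM; n-1 words at each end for a general n-gram LM).
LMItem = namedtuple("LMItem", "node left right score")

def combine_bigram(head, left_item, right_item, edge_score, lm_logprob):
    """Combine two +LM items left-to-right under a bigram LM: the only new LM score
    is for the bigram spanning the boundary (left_item.right, right_item.left)."""
    boundary = lm_logprob(left_item.right, right_item.left)
    return LMItem(node=head, left=left_item.left, right=right_item.right,
                  score=left_item.score + right_item.score + edge_score + boundary)

# toy usage: VP[1,6] spanning "held ... Sharon" built from "held a meeting" and "with Sharon"
toy_lm = lambda a, b: -0.7                         # stand-in for a real LM lookup
a = LMItem("VPB[3,6]", "held", "meeting", -2.0)
b = LMItem("PP[1,3]", "with", "Sharon", -1.5)
print(combine_bigram("VP[1,6]", a, b, edge_score=-0.3, lm_logprob=toy_lm))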
6.3 Forest-based Rule Extraction
We now extend the tree-based extraction algorithm described in Section 6.1.2 to work with a
packed forest.
6.3.1 Generalized Rule Extraction Algorithm
Like in tree-based extraction, we extract rules from a packed forest F in two steps:
(1) admissible set computation (where to cut), and
(2) fragmentation (how to cut).
It turns out that the exact formulation developed for the admissible set in the tree-based case
can be applied to a forest without any change. The fragmentation step, however, becomes
[Figure 6.5 shows the parse forest of Figure 6.4(a), with each node annotated with its target span (e.g. IP_{0,6} 'Bush .. Sharon', NP_{0,3} 'Bush with Sharon', VP_{1,6} 'held .. Sharon', PP_{1,3} 'with Sharon', VPB_{3,6} 'held .. meeting'), aligned to the target sentence "Bush held a meeting with Sharon". The extra (minimal) rules extracted are:
    IP (x_1:NPB x_2:VP) → x_1 x_2
    VP (x_1:PP x_2:VPB) → x_2 x_1
    PP (x_1:P x_2:NPB) → x_1 x_2
    P (yǔ) → with ]
Figure 6.5: Forest-based rule extraction on the parse forest in Figure 6.4(a).
much more involved, since we now face a choice of multiple parse hyperedges at each node. In other words, it becomes non-deterministic how to cut a forest into tree fragments, which is analogous to the non-deterministic pattern-matching in forest-based decoding. For example, there are two parse hyperedges e^p_1 and e^p_2 at the root node in Figure 6.5. When we follow one of them to grow a fragment, there again will be multiple choices at each of its tail nodes. Like in the tree-based case, a fragment is said to be complete if all its leaf nodes are in the admissible set. Otherwise, an incomplete fragment can grow at any leaf node v not in the admissible set, where following each parse hyperedge at v will spin off a new fragment. For example, following e^p_1 at the root node will immediately lead us to two nodes in the admissible set, NPB_{0,1} and VP_{1,6} (we will highlight admissible-set nodes by gray shades in this section, as in Figures 6.3 and 6.5). So this fragment, frag_1 = {e^p_2}, is now complete and we can extract a rule,

    IP (x_1:NPB x_2:VP) → x_1 x_2.
However, following the other hyperedge e^p_2,

    IP_{0,6} → NP_{0,3} VPB_{3,6},

will leave the new fragment frag_2 = {e^p_1} incomplete, with one node, NP_{0,3}, not in the admissible set. We then grow frag_2 at this node by choosing hyperedge e^p_3,

    NP_{0,3} → NPB_{0,1} CC_{1,2} NPB_{2,3},

and spin off a new fragment frag_3 = {e^p_1, e^p_3}, which is now complete since all its four leaf nodes are in the admissible set. We then extract a rule with four variables:

    IP (NP(x_1:NPB x_2:CC x_3:NPB) x_4:VPB) → x_1 x_4 x_2 x_3.
This procedure is formalized as a breadth-first search (BFS) in Code 6.2. The basic idea is to visit each frontier node v, and keep a queue open of actively growing fragments rooted at v. We keep expanding incomplete fragments from open, and extract a rule if a complete fragment is found (line 7). Each fragment is associated with a frontier (the variable front in the pseudocode), being the subset of its leaf nodes not in the admissible set (recall that expansion stops at the admissible set). So the initial frontier is just {v} (line 3). A fragment is complete if its frontier is empty (line 6); otherwise we pop one frontier node u to grow, and spin off new fragments by following the hyperedges of u, updating the frontier (lines 11–13), until all active fragments are complete and the open queue is empty (line 4).
A single parse tree can also be viewed as a trivial forest, where each node has only one incoming hyperedge. So the Galley et al. (2004) algorithm for tree-based rule extraction (Section 6.1.2) can be considered a special case of our algorithm, where the queue open always contains one single active fragment.
Code 6.2 Forest-based Rule Extraction.
Input: forest F, target sentence τ, and alignment a
Output: minimal rule set R
 1: admset ← AdmSet(F, τ, a)                      ▹ compute the admissible set
 2: for each v ∈ admset do
 3:     open ← {⟨∅, {v}⟩}                          ▹ initial queue (with empty fragment)
 4:     while open ≠ ∅ do
 5:         ⟨frag, front⟩ ← open.pop()             ▹ extract a fragment
 6:         if front = ∅ then                      ▹ nothing to expand?
 7:             generate a rule r using fragment frag
 8:             R.append(r)
 9:         else                                   ▹ incomplete: further expand
10:             u ← front.pop()                    ▹ expand a frontier node
11:             for each e^p ∈ BS(u) do
12:                 front′ ← front ∪ (tails(e^p) \ admset)    ▹ update the frontier
13:                 open.append(⟨frag ∪ {e^p}, front′⟩)
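For concreteness, here is a minimal Python sketch of this BFS. The forest representation, the accessor names, and the make_rule callback are illustrative assumptions, not the thesis implementation; terminal words (nodes with no incoming hyperedges) are simply kept as lexical leaves of the fragment.

    from collections import deque

    def extract_minimal_rules(forest, admset, make_rule):
        # forest:    maps each node to its incoming parse hyperedges, each a (head, tails) pair
        # admset:    the admissible set of nodes
        # make_rule: hypothetical callback turning a complete fragment (tuple of
        #            hyperedges) rooted at v into a translation rule
        rules = []
        for v in admset:
            open_q = deque([((), (v,))])          # queue of (fragment, frontier) pairs
            while open_q:
                frag, front = open_q.popleft()
                if not front:                      # frontier empty: fragment is complete
                    rules.append(make_rule(v, frag))
                    continue
                u, rest = front[0], front[1:]      # expand one frontier node
                edges = forest.get(u, [])
                if not edges:                      # terminal word: nothing to expand
                    open_q.append((frag, rest))
                    continue
                for edge in edges:                 # each parse hyperedge entering u
                    _, tails = edge
                    new_front = rest + tuple(t for t in tails if t not in admset)
                    open_q.append((frag + (edge,), new_front))
        return rules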
6.3.2 Fractional Counts and Rule Probabilities
In tree-based extraction, for each sentence pair, each rule extracted naturally has a count of
one, which will be used in maximum-likelihood estimation of rule probabilities. However,
a forest is an implicit collection of many more trees, each of which, when enumerated, has its own probability accumulated from the parse hyperedges involved. In other words, a
forest can be viewed as a virtual weighted k-best list with a huge k. So a rule extracted
from a non 1-best parse, i.e., using non 1-best hyperedges, should be penalized accordingly
and should have a fractional count instead of a unit one, similar to the E-step in EM
algorithms.
Inspired by the parsing literature on pruning (Charniak and Johnson, 2005; Huang, 2008b) (described in Section 4.3), we penalize a rule r by the posterior probability of its tree fragment frag = lhs(r). This posterior probability, denoted αβ(frag), can be computed in an inside-outside fashion as the product of the outside probability of its root node, the probabilities of the parse hyperedges involved in the fragment, and the inside probabilities of
its leaf nodes:

    αβ(frag) = α(root(frag)) · ∏_{e^p ∈ frag} P(e^p) · ∏_{v ∈ leaves(frag)} β(v)    (6.4)

where α(·) and β(·) denote the outside and inside probabilities of tree nodes, respectively.
For example, in Figure 6.5,

    αβ({e^p_2, e^p_3}) = α(IP_{0,6}) · P(e^p_2) · P(e^p_3) · β(NPB_{0,1}) · β(CC_{1,2}) · β(NPB_{2,3}) · β(VPB_{3,6}).
Now the fractional count of rule r is simply

    c(r) = αβ(lhs(r)) / αβ(TOP)    (6.5)

where TOP denotes the root node of the forest.
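As a minimal Python sketch of Equations (6.4) and (6.5), assuming the outside (α) and inside (β) probabilities have already been computed over the parse forest; the accessor names root_of, leaves_of, and edge_prob are illustrative assumptions:

    import math

    def frag_posterior(frag, alpha, beta, root_of, leaves_of, edge_prob):
        # alpha-beta(frag), Eq. (6.4): outside probability of the fragment's root,
        # times the probabilities of its parse hyperedges, times the inside
        # probabilities of its leaf nodes
        return (alpha[root_of(frag)]
                * math.prod(edge_prob[e] for e in frag)
                * math.prod(beta[v] for v in leaves_of(frag)))

    def fractional_count(frag, alpha, beta, root_of, leaves_of, edge_prob, top):
        # c(r), Eq. (6.5): normalize by alpha-beta(TOP), which equals beta(TOP)
        return frag_posterior(frag, alpha, beta, root_of, leaves_of, edge_prob) / beta[top]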
As in the M-step of the EM algorithm, we now extend maximum likelihood estimation to fractional counts for three conditional probabilities of a rule, which will be used in the experiments:

    P(r | lhs(r)) = c(r) / Σ_{r′: lhs(r′) = lhs(r)} c(r′),    (6.6)

    P(r | rhs(r)) = c(r) / Σ_{r′: rhs(r′) = rhs(r)} c(r′),    (6.7)

    P(r | root(lhs(r))) = c(r) / Σ_{r′: root(lhs(r′)) = root(lhs(r))} c(r′).    (6.8)
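A sketch of these three relative-frequency estimates over fractional counts (Eqs. 6.6-6.8); here counts maps each rule to its accumulated fractional count c(r), and the key functions stand in for lhs, rhs, and root(lhs), all hypothetical accessors:

    from collections import defaultdict

    def conditional_probs(counts, key):
        # relative-frequency estimate P(r | key(r)) from fractional counts
        totals = defaultdict(float)
        for r, c in counts.items():
            totals[key(r)] += c
        return {r: c / totals[key(r)] for r, c in counts.items()}

    # the three estimates of Eqs. (6.6)-(6.8), with hypothetical accessors lhs, rhs, root:
    # p_lhs  = conditional_probs(counts, lhs)
    # p_rhs  = conditional_probs(counts, rhs)
    # p_root = conditional_probs(counts, lambda r: root(lhs(r)))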
6.4 Experiments
Our experiments are on Chinese-to-English translation based on a tree-to-string system
similar to (Huang et al., 2006; Liu et al., 2006). Given a 1-best tree T, the decoder searches
for the best derivation d* among the set of all possible derivations D:

    d* = arg max_{d ∈ D}  λ_0 log P(d | T) + λ_1 log P_lm(τ(d)) + λ_2 |d| + λ_3 |τ(d)|    (6.9)

where the first two terms are the translation and language model probabilities, τ(d) is the target string (English sentence) for derivation d, and the last two terms are derivation and
translation length penalties, respectively. The conditional probability P(d | T) decomposes into the product of rule probabilities:

    P(d | T) = ∏_{r ∈ d} P(r)    (6.10)

where each P(r) is itself a product of five probabilities:

    P(r) = P(r | lhs(r))^{λ_4} · P(r | rhs(r))^{λ_5} · P(r | root(lhs(r)))^{λ_6} · P_lex(lhs(r) | rhs(r))^{λ_7} · P_lex(rhs(r) | lhs(r))^{λ_8}    (6.11)

where the first three are conditional probabilities based on the fractional counts of rules defined in the previous section, and the last two are lexical probabilities. These parameters λ_1 . . . λ_8 are tuned by minimum error rate training (Och, 2003) on the dev sets.
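As an illustration only (the feature names and data layout below are assumptions, not the thesis implementation), the log-linear score of Equations (6.9)-(6.11) for a single derivation could be computed as:

    import math

    def derivation_score(rule_feats, target_words, lm_logprob, w):
        # rule_feats:   one dict per rule in the derivation, holding the five
        #               rule-level probabilities of Eq. (6.11)
        # target_words: the target-side word sequence tau(d)
        # lm_logprob:   function giving the LM log-probability of a word sequence
        # w:            weights lambda_0 ... lambda_8 (MERT-tuned), indexable by 0..8
        #
        # log P(d | T) = sum over rules of weighted log rule probabilities (Eqs. 6.10, 6.11)
        log_p_d_given_t = sum(
            w[4] * math.log(f["p_r_given_lhs"]) +
            w[5] * math.log(f["p_r_given_rhs"]) +
            w[6] * math.log(f["p_r_given_root"]) +
            w[7] * math.log(f["p_lex_lhs_given_rhs"]) +
            w[8] * math.log(f["p_lex_rhs_given_lhs"])
            for f in rule_feats)
        return (w[0] * log_p_d_given_t             # translation model
                + w[1] * lm_logprob(target_words)  # language model
                + w[2] * len(rule_feats)           # derivation length |d|
                + w[3] * len(target_words))        # translation length |tau(d)|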
We use the Chinese parser of Xiong et al. (2005) to parse the source side of the bitext.
Following Chapter 5 we also modify this parser to output a packed forest for each sentence.
We will first report results trained on a small-scale dataset with detailed analysis, and then scale to a larger one.
6.4.1 Small Data: Forest-based Decoding
Our training corpus consists of 31,011 sentence pairs with 835K Chinese words and 941K English words. We first word-align them with GIZA++ using the refinement option diag-and (Koehn et al., 2003), and then extract translation rules after parsing the Chinese side (into 1-best trees). We use the SRI Language Modeling Toolkit (Stolcke, 2002) to train a trigram language model with Kneser-Ney smoothing on the English side of the bitext.

Our development set is the 2002 NIST MT Evaluation test set (878 sentences) and our test set is the 2005 NIST MT Evaluation test set (1082 sentences), with on average 28.28 and 26.31 words per sentence, respectively. We evaluate translation quality using the BLEU-4 metric (Papineni et al., 2002) with the default case-insensitive setting.
To test the effect of forest-based decoding, we parse the dev and test sets into forests, prune them with threshold p_d = 12, and then convert the pruned parse forests into translation forests using the algorithm in Section 6.2.1.
[Figure 6.6 plots BLEU score against average decoding time (secs/sentence), comparing decoding on the forest (p_d = 5, 12) with decoding on k-best trees (k = 10, 30, 100) and the 1-best baseline.]
Figure 6.6: Comparison of decoding on forests with decoding on k-best trees.
To increase the coverage of the rule set, we also introduce a default translation hyperedge for each parse hyperedge, obtained by monotonically translating each of its tail nodes, so that we can always get at least a complete translation in the end.
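A minimal sketch of this fallback, under the assumption that a parse hyperedge is a (head, tails) pair and that a translation hyperedge additionally records the target-side order of its tail nodes (monotonic here, i.e. the identity permutation); the representation is an illustration, not the thesis data structure:

    def add_default_hyperedges(parse_hyperedges):
        # for each parse hyperedge, add a default translation hyperedge that
        # translates the tail nodes monotonically (left to right), so every
        # sentence is guaranteed at least one complete translation
        default_edges = []
        for head, tails in parse_hyperedges:
            target_order = list(range(len(tails)))   # x1 x2 ... xn in source order
            default_edges.append((head, tails, target_order))
        return default_edges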
Figure 6.6 compares forest decoding with decoding on k-best trees in terms of speed and quality. Using more than one parse tree apparently improves the BLEU score, but at the cost of much slower decoding, since each of the top-k trees has to be decoded individually even though they share many common subtrees. Forest decoding, by contrast, is much faster and produces consistently better BLEU scores. With pruning threshold p_d = 12, it achieves a BLEU score of 0.2602, an absolute improvement of 1.7 points over the 1-best baseline of 0.2430, which is statistically significant under the sign test of Collins et al. (2005) (p < 0.01).
We also investigate how often the i-th best parse tree is picked to direct the translation (i = 1, 2, . . .), in both the k-best and forest decoding schemes. A packed forest can be roughly viewed as a (virtual) ∞-best list, and we can thus ask how often a parse beyond the top k is used by a forest, which relates to the fundamental limitation of k-best lists. Figure 6.7 shows that the 1-best parse is still preferred 25% of the time among 30-best trees, and 23% of the time by the forest decoder. These ratios decrease dramatically as i increases, but the forest curve has a much longer tail for large i.
[Figure 6.7 plots the percentage of sentences (%) against the rank of the parse tree picked in the n-best list, for decoding on forests vs. on 30-best trees.]
Figure 6.7: Percentage of the i-th best parse tree being picked in decoding. 32% of the distribution for forest decoding is beyond top-100 and is not shown on this plot.
Indeed, 40% of the trees preferred by a forest are beyond the top 30, 32% beyond the top 100, and even 20% beyond the top 1000. This confirms that we would need k-best lists with exponentially large k to cover the explosion of alternatives, whereas a forest can encode this information compactly.
6.4.2 Small Data: Forest-based Rule Extraction
To test the effect of forest-based rule extraction, we parse the training set into parse forests and use three levels of pruning thresholds: p_e = 2, 5, 8.
rules from ...       extraction time   decoding time   BLEU
1-best trees         0.24              1.74            0.2430
30-best trees        5.56              3.31            0.2488
forest (p_e = 8)     2.36              3.40            0.2533
Pharaoh              -                 -               0.2297

Table 6.1: Results with different rule extraction methods (trained on small data). Extraction and decoding times are secs per 1000 sentences and per sentence, respectively.
[Figure 6.8 plots BLEU score against average extraction time (secs/1000 sentences), comparing forest-based extraction (p_e = 2, 5, 8) with 1-best and 30-best (k = 30) extraction.]
Figure 6.8: Comparison of extraction time: forest-based vs. 1-best and 30-best.
Figure 6.8 plots the extraction speed and translation quality of forest-based extraction with various pruning thresholds. To control the number of variables, we use only 1-best decoding in this comparison. Following the same trend as in the decoding plot (Figure 6.6), extracting rules from a forest is also faster than from 30-best trees, and produces consistently better BLEU scores. With pruning threshold p_e = 8, forest-based extraction achieves a BLEU score of 0.2533, an absolute improvement of 1.0 points over the 1-best baseline, which is statistically significant (p < 0.01). These BLEU results are summarized in Table 6.1, which also shows that decoding with forest-extracted rules is less than twice as slow as with 1-best rules, and only fractionally slower than with 30-best rules.
Table 6.2: Statistics of rules extracted from small data.

extraction from ...            1-best   30-best   forest (p_e = 8)
total rules extracted          440k     1.2M      3.3M
after filtering on dev         90k      130k      188k
non 1-best rules in decoding   -        8.71%     16.3%
We also investigate how often rules extracted from non 1-best parses are used by the decoder. Table 6.2 shows the numbers of rules extracted from both 1-best and forest-based extraction, and how many among them survived after filtering on the
dev set. Basically, in the forest-based case we can use about twice as many rules as in the 1-best case, or about 1.5 times as many as in the 30-best extraction. But the real question is: are these extra rules really useful in generating the final translation? The last row shows that 16.3% of the rules used in the 1-best derivations are indeed extracted from non 1-best parses in the forests, which confirms that forest-extracted rules do play an important role in decoding. Note that this is a stronger condition than merely changing the distribution of rules by considering more parses; here we introduce new rules never seen in any 1-best parse.
6.4.3 Large Data: Combined Results
We also conduct experiments on a larger training dataset, FBIS, which contains 239K
sentence pairs with about 6.9M/8.9M words in Chinese/English, respectively. We also use
a bigger trigram model trained on the first 1/3 of the Xinhua portion of the Gigaword corpus. During both the rule extraction and decoding phases, we use both 1-best trees and forests (with pruning thresholds p_e = 5 for extraction and p_d = 10 for decoding).
Table 6.3: BLEU score results trained on large data.

extraction \ decoding   1-best tree   forest (p_d = 10)
1-best tree             0.2560        0.2674
30-best trees           0.2634        0.2767
forest (p_e = 5)        0.2679        0.2816
Hiero                   0.2738
The final BLEU score results are shown in Table 6.3. With both tree-based and forest-based decoding, rules extracted from forests significantly outperform those extracted from 1-best trees (p < 0.01). The final result with both forest-based extraction and decoding reaches a BLEU score of 0.2816, outperforming that of Hiero (Chiang, 2005), one of the best-performing systems to date. These results confirm that our novel forest-based translation approach is a promising direction for syntax-based translation.
6.5 Discussion and Summary
Some related work deserves discussion here. The concept of packed forests has previously been used in translation rule extraction, for example in rule composition (Galley et al., 2006) and tree binarization (Wang et al., 2007b). However, both of these efforts only use 1-best parses, with the second packing different binarizations of the same tree into a forest. Nevertheless, we suspect that their extraction algorithm is in principle similar to ours, although they do not provide details of forest-based fragmentation (Code 6.2). Moreover, in practice they use EM to learn the best binarization for each sentence, so in essence they do not extract alternative rules for real decoding.

The forest concept is also used in machine translation decoding, for example to characterize the search space of decoding with integrated language models (Huang and Chiang, 2007), the topic of Chapter 4.

There is also parallel work on extracting rules from k-best parses and k-best alignments (Venugopal et al., 2008), but both their experiments and our own above confirm that extraction on k-best parses is neither efficient nor effective.
To conclude, we have presented a novel forest-based translation framework which uses a packed forest, rather than the 1-best or k-best parse trees, to direct the translation and to extract translation rules. The forest provides a compact data structure for efficient handling of exponentially many tree structures, and is shown to be a promising direction, with state-of-the-art translation results and reasonable decoding and extraction speeds. This work can thus be viewed as a compromise between the string-based and tree-based paradigms, with a good trade-off between speed and accuracy. For future work we would like to apply this approach to other types of syntax-based translation systems, especially string-to-tree (Galley et al., 2006) and tree-to-tree systems.
Chapter 7
Conclusions and Future Work
The trade-off between expressiveness and efficiency has been a fundamental question not only in Computational Linguistics, but also in almost all areas of Computer Science. In this thesis, we have explored this question in the context of exact and approximate dynamic programming on packed forests. The underlying motivation is to effectively incorporate more non-local information for better disambiguation and generation in Natural Language Processing systems. We have made the following contributions in the above chapters:
1. We developed fast and exact k-best dynamic programming algorithms (Chapter 3) which are generally applicable to all problems that can be modeled in the monotonic hypergraph framework. Applications include statistical parsing and machine translation decoding, where in both cases our algorithms lead to orders-of-magnitude speedups. Since the initial publication of (Huang and Chiang, 2005), these algorithms have been implemented in most state-of-the-art parsers (Charniak and Johnson, 2005; McDonald et al., 2005) and syntax-based MT decoders (Chiang, 2005; Galley et al., 2006; Zollmann and Venugopal, 2007).
2. We then extended these algorithms to approximate dynamic programming for the case when the forests are too big for exact inference. We discussed two particular instances of this new method: forest rescoring for MT decoding (Chapter 4), and forest reranking for parsing (Chapter 5). In both cases, our methods perform orders of magnitude faster than conventional approaches. In the latter, faster search also leads to better learning, where our approximate decoding makes whole-treebank discriminative training practical and results in an accuracy better than that of any previously reported system trained on the Penn Treebank.
3. Finally, we applied all of the above ideas to the problem of syntax-based translation. Here we proposed a new paradigm, forest-based translation (Chapter 6), which translates a packed forest of the source sentence into a target sentence, as opposed to using just the 1-best or k-best parses as in usual practice. By considering exponentially many alternatives, this scheme alleviates the propagation of parsing errors into translation, yet comes with only a fractional overhead in running time. We also pushed this direction further to extract translation rules from packed forests. The combined results of forest-based decoding and rule extraction show significant improvements in translation quality in large-scale experiments, and consistently outperform the hierarchical system Hiero, one of the best-performing systems to date.
For future work, we would like to pursue the following directions:
1. Theoretical analysis of search quality in forest rescoring and reranking algorithms,
and more principled pruning algorithms. Although the approximate dynamic programming algorithms presented in Chapters 4 and 5 have been shown to be quite effective in terms of empirical search quality, there have been no theoretical bounds on the search errors. On the other hand, the forest pruning algorithm (Section 4.3) works very well in practice, but is not guaranteed to be optimality-preserving. We need to develop provably correct alternatives, and approximate versions of them with theoretical guarantees.
2. Faster approximate decoding for forest reranking, potentially incorporating even more non-local features on larger forests. The current forest decoding algorithm (Section 5.3), although the fastest to date among non-trivial discriminative parsing systems, is still too slow compared to its (trivial) n-best reranking counterpart. We need much faster algorithms for discriminative training to become practical.
3. Syntactic language models. Our forest-based translation work (Chapter 6) makes pretty good use of source-side syntax, but is rather agnostic about the grammaticality of the target side. As a result, the translation outputs often have grammatical problems such as a missing main verb. There should be some interesting way of incorporating grammatical information on the target side as well as the source side.
Overall, we aim to continue this direction of developing faster algorithms for more
expressive formalisms, with the hope of improving large-scale NLP systems.
References
Aho, Alfred V. and Jeffrey D. Ullman. 1972. The Theory of Parsing, Translation, and Compiling, volume I: Parsing of Series in Automatic Computation. Prentice Hall, Englewood Cliffs, New Jersey.

Bikel, Daniel M. 2004. Intricacies of Collins' parsing model. Computational Linguistics, 30(4):479–511, December.

Billot, Sylvie and Bernard Lang. 1989. The structure of shared forests in ambiguous parsing. In Proceedings of ACL '89, pages 143–151.

Bod, Rens. 1992. A computational model of language performance: Data Oriented Parsing. In Proceedings of COLING, pages 855–859.

Bod, Rens. 2003. An efficient implementation of a new DOP model. In Proceedings of EACL.

Brander, A. and M. Sinclair. 1995. A comparative study of k-shortest path algorithms. In Proc. 11th UK Performance Engineering Workshop for Computer and Telecommunications Systems.

Brown, Peter F., John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Frederick Jelinek, Jennifer C. Lai, and Robert L. Mercer. 1995. Method and system for natural language translation. U.S. Patent 5,477,451.

Carreras, Xavier, Michael Collins, and Terry Koo. 2008. TAG, dynamic programming and the perceptron for efficient, feature-rich parsing. In Proceedings of CoNLL, Manchester, UK, August.

Charniak, Eugene. 2000. A maximum-entropy-inspired parser. In Proceedings of NAACL.

Charniak, Eugene and Mark Johnson. 2005. Coarse-to-fine-grained n-best parsing and discriminative reranking. In Proceedings of the 43rd ACL.

Chiang, David. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the ACL.

Chiang, David. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–208.

Collins, Michael. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

Collins, Michael. 2000. Discriminative reranking for natural language parsing. In Proceedings of ICML, pages 175–182.

Collins, Michael. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of EMNLP.
Collins, Michael. 2003. Head-driven statistical models for natural language parsing. Computational Linguistics, 29:589–637.

Collins, Michael, Philipp Koehn, and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of ACL, pages 531–540, Ann Arbor, Michigan, June.

Collins, Michael and Nigel Duffy. 2002. New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Proceedings of the ACL.

Cormen, Thomas, Charles Leiserson, Ronald Rivest, and Clifford Stein. 2001. Introduction to Algorithms. MIT Press, second edition.

Ding, Yuan. 2006. Machine Translation Using Probabilistic Synchronous Dependency Insertion Grammars. Ph.D. thesis, University of Pennsylvania.

Ding, Yuan and Martha Palmer. 2005. Machine translation using probabilistic synchronous dependency insertion grammars. In Proceedings of the 43rd ACL.

Earley, J. 1970. An efficient context-free parsing algorithm. Communications of the ACM, 13(2):94–102.

Eisner, Jason. 2000. Bilexical grammars and their cubic-time parsing algorithms. In Harry Bunt and Anton Nijholt, editors, Advances in Probabilistic and Other Parsing Technologies. Kluwer Academic Publishers, October, pages 29–62.

Eppstein, David. 2001. Bibliography on k shortest paths and other k best solutions problems. http://www.ics.uci.edu/eppstein/bibs/kpath.bib.

Finkel, Jenny, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of ACL.

Finkel, Jenny Rose, Alex Kleeman, and Christopher D. Manning. 2008. Efficient, feature-based, conditional random field parsing. In Proceedings of ACL, Columbus, OH, June.

Galley, Michel, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. 2006. Scalable inference and training of context-rich syntactic translation models. In Proceedings of COLING-ACL.

Galley, Michel, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What's in a translation rule? In HLT-NAACL, pages 273–280.

Gallo, Giorgio, Giustino Longo, and Stefano Pallottino. 1993. Directed hypergraphs and applications. Discrete Applied Mathematics, 42(2):177–201.

Gildea, Daniel and Daniel Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288.
Goodman, Joshua. 1998. Parsing Inside-Out. Ph.D. thesis, Harvard University.

Graehl, Jonathan. 2005. Relatively useless pruning. Unpublished manuscript. Technical report, USC Information Sciences Institute.

Graehl, Jonathan and Kevin Knight. 2004. Training tree transducers. In HLT-NAACL, pages 105–112.

Hart, P. E., N. J. Nilsson, and B. Raphael. 1968. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107.

Henderson, James. 2004. Discriminative training of a neural network statistical parser. In Proceedings of ACL.

Huang, Liang. 2006. Dynamic programming in semiring and hypergraph frameworks. Technical report, University of Pennsylvania. WPE II report.

Huang, Liang. 2007. Binarization, synchronous binarization, and target-side binarization. In Proc. NAACL Workshop on Syntax and Structure in Statistical Translation.

Huang, Liang. 2008a. Advanced dynamic programming in semiring and hypergraph frameworks. In Proceedings of COLING. Survey paper to accompany the conference tutorial, Manchester, UK, August.

Huang, Liang. 2008b. Forest reranking: Discriminative parsing with non-local features. In Proceedings of ACL, Columbus, OH, June.

Huang, Liang and David Chiang. 2005. Better k-best parsing. In Proceedings of the Ninth International Workshop on Parsing Technologies (IWPT-2005).

Huang, Liang and David Chiang. 2007. Forest rescoring: Fast decoding with integrated language models. In Proceedings of ACL, Prague, Czech Rep., June.

Huang, Liang, Kevin Knight, and Aravind Joshi. 2006. Statistical syntax-directed translation with extended domain of locality. In Proceedings of AMTA, Boston, MA, August.

Huang, Liang, Hao Zhang, and Daniel Gildea. 2005. Machine translation as lexicalized parsing with hooks. In Proceedings of the Ninth International Workshop on Parsing Technologies (IWPT-2005).

Jiménez, Víctor M. and Andrés Marzal. 2000. Computation of the n best parse trees for weighted and stochastic context-free grammars. In Proc. of the Joint IAPR International Workshops on Advances in Pattern Recognition.

Johnson, Mark. 1998. PCFG models of linguistic tree representations. Computational Linguistics, 24:613–632.
Johnson, Mark. 2006. Features of statistical parsers. Talk given at the Joint Microsoft Research and Univ. of Washington Computational Linguistics Colloquium. http://www.cog.brown.edu/mj/papers/ms-uw06talk.pdf.

Joshi, Aravind K. and Yves Schabes. 1997. Handbook of Formal Languages, chapter Tree-Adjoining Grammars, pages 69–123. Springer, Berlin.

Joshi, Aravind K. and K. Vijay-Shanker. 1999. Compositional semantics with lexicalized tree-adjoining grammar (LTAG): How much underspecification is necessary? In H. C. Bunt and E. G. C. Thijsse, editors, Proc. IWCS-3, pages 131–145.

Kasami, T. 1965. An efficient recognition and syntax analysis algorithm for context-free languages. Technical Report AFCRL-65-758, Air Force Cambridge Research Laboratory, Bedford, MA.

Kazama, Junichi and Kentaro Torisawa. 2007. A new perceptron algorithm for sequence labeling with non-local features. In Proceedings of EMNLP.

Kirkpatrick, S., C. D. Gelatt, and M. P. Vecchi. 1983. Optimization by simulated annealing. Science, 220(4598):671–680.

Klein, Dan and Chris Manning. 2003. A* parsing: Fast exact Viterbi parse selection. In Proceedings of HLT-NAACL.

Klein, Dan and Christopher D. Manning. 2001. Parsing and Hypergraphs. In Proceedings of the Seventh International Workshop on Parsing Technologies (IWPT-2001), 17-19 October 2001, Beijing, China.

Knight, Kevin and Jonathan Graehl. 2005. An overview of probabilistic tree transducers for natural language processing. In Proc. of the Sixth International Conference on Intelligent Text Processing and Computational Linguistics (CICLing), LNCS.

Knuth, Donald. 1977. A generalization of Dijkstra's algorithm. Information Processing Letters, 6(1).

Koehn, Philipp. 2004. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In Proceedings of AMTA, pages 115–124.

Koehn, Philipp, Franz Joseph Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL, pages 127–133.

Kumar, Shankar and William Byrne. 2004. Minimum Bayes-risk decoding for statistical machine translation. In HLT-NAACL.

Lawler, Eugene L. 1977. Comment on computing the k shortest paths in a graph. Comm. of the ACM, 20(8):603–604.
Lewis, P. M. and R. E. Stearns. 1968. Syntax-directed transduction. Journal of the ACM, 15(3):465–488.

Lin, Dekang. 2004. A path-based transfer model for machine translation. In Proceedings of the 20th COLING.

Liu, Yang, Qun Liu, and Shouxun Lin. 2006. Tree-to-string alignment template for statistical machine translation. In Proceedings of COLING-ACL, pages 609–616.

Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19:313–330.

May, Jonathan and Kevin Knight. 2006. A better n-best list: Practical determinization of weighted finite tree automata. Submitted to HLT-NAACL 2006.

May, Jonathan and Kevin Knight. 2007. Syntactic re-alignment models for machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 360–368, Prague, Czech Rep., June.

McAllester, David, Michael Collins, and Fernando Pereira. 2004. Case-factor diagrams for structured probabilistic modeling. In Proc. UAI 2004.

McClosky, David, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of the HLT-NAACL, New York City, USA, June.

McDonald, Ryan, Koby Crammer, and Fernando Pereira. 2005. Online large-margin training of dependency parsers. In Proceedings of the 43rd ACL.

McDonald, Ryan and Fernando Pereira. 2006. Online learning of approximate dependency parsing algorithms. In Proceedings of EACL.

Mi, Haitao and Liang Huang. 2008. Forest-based translation rule extraction. In Proceedings of EMNLP, Honolulu, Hawaii, October.

Mi, Haitao, Liang Huang, and Qun Liu. 2008. Forest-based translation. In Proceedings of ACL, Columbus, OH, June.

Minieka, Edward. 1974. On computing sets of shortest paths in a graph. Comm. of the ACM, 17(6):351–353.

Miyao, Yusuke and Junichi Tsujii. 2002. Maximum entropy estimation for feature forests. In Proceedings of HLT, March.

Mohri, Mehryar. 2002. Semiring frameworks and algorithms for shortest-distance problems. Journal of Automata, Languages and Combinatorics, 7(3):321–350.
Mohri, Mehryar and Michael Riley. 2002. An efficient algorithm for the n-best-strings problem. In Proceedings of the International Conference on Spoken Language Processing 2002 (ICSLP '02), Denver, Colorado, September.

Nederhof, Mark-Jan. 2003. Weighted deductive parsing and Knuth's algorithm. Computational Linguistics, pages 135–143.

Ng, Hwee Tou and Jin Kiat Low. 2004. Chinese part-of-speech tagging: One-at-a-time or all-at-once? Word-based or character-based? In Proceedings of EMNLP 2004.

Nielsen, Lars Relund, Kim Allan Andersen, and Daniele Pretolani. 2005. Finding the k shortest hyperpaths. Computers and Operations Research.

Och, Franz Joseph. 2003. Minimum error rate training in statistical machine translation. In Proceedings of ACL, pages 160–167.

Och, Franz Joseph and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30:417–449.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of ACL, pages 311–318, Philadelphia, USA, July.

Petrov, Slav, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440, Sydney, Australia, July. Association for Computational Linguistics.

Petrov, Slav and Dan Klein. 2007. Improved inference for unlexicalized parsing. In Proceedings of HLT-NAACL.

Petrov, Slav and Dan Klein. 2008. Discriminative log-linear grammars with latent variables. In Proceedings of NIPS 20.

Quirk, Chris and Simon Corston-Oliver. 2006. The impact of parse quality on syntactically-informed statistical machine translation. In Proceedings of EMNLP.

Quirk, Chris, Arul Menezes, and Colin Cherry. 2005. Dependency treelet translation: Syntactically informed phrasal SMT. In Proceedings of the 43rd ACL.

Ratnaparkhi, Adwait. 1997. A linear observed time statistical parser based on maximum entropy models. In Proceedings of EMNLP, pages 1–10.

Rigo, Armin. 2004. Representation-based just-in-time specialization and the Psyco prototype for Python. In Nevin Heintze and Peter Sestoft, editors, Proceedings of the 2004 ACM SIGPLAN Workshop on Partial Evaluation and Semantics-based Program Manipulation, pages 15–26.
Roth, Dan and Scott Yih. 2005. Integer linear programming inference for conditional random fields. In Proceedings of ICML.

Shen, Libin, Anoop Sarkar, and Franz Josef Och. 2004. Discriminative reranking for machine translation. In Proceedings of HLT-NAACL.

Shieber, Stuart, Yves Schabes, and Fernando Pereira. 1995. Principles and implementation of deductive parsing. Journal of Logic Programming, 24:3–36.

Soricut, Radu. 2006. Natural Language Generation using an Information-Slim Representation. Ph.D. thesis, University of Southern California.

Stolcke, Andreas. 2002. SRILM - an extensible language modeling toolkit. In Proceedings of ICSLP, volume 30, pages 901–904.

Sutton, Charles and Andrew McCallum. 2005. Joint parsing and semantic role labeling. In Proc. CoNLL 2005.

Taskar, Ben, Dan Klein, Michael Collins, Daphne Koller, and Chris Manning. 2004. Max-margin parsing. In Proceedings of EMNLP.

Turian, Joseph and I. Dan Melamed. 2007. Scalable discriminative learning for natural language parsing and translation. In Proceedings of NIPS 19.

Venugopal, Ashish, Andreas Zollmann, Noah A. Smith, and Stephan Vogel. 2008. Wider pipelines: N-best alignments and parses in MT training. In Proceedings of AMTA, Honolulu, Hawaii, October.

Viterbi, Andrew J. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, IT-13(2):260–269, April.

Wang, Wei, Kevin Knight, and Daniel Marcu. 2007a. Binarizing syntax trees to improve syntax-based machine translation accuracy. In Proceedings of EMNLP, Prague, Czech Rep., July.

Wang, Wei, Kevin Knight, and Daniel Marcu. 2007b. Binarizing syntax trees to improve syntax-based machine translation accuracy. In Proceedings of EMNLP, Prague, Czech Rep., July.

Wellner, Ben, Andrew McCallum, Fuchun Peng, and Michael Hay. 2004. An integrated, conditional model of information extraction and coreference with application to citation matching. In Proc. UAI 2004.

Wu, Dekai. 1996. A polynomial-time algorithm for statistical machine translation. In Proceedings of the 34th ACL.
Wu, Dekai. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–404.

Xiong, Deyi, Shuanglong Li, Qun Liu, and Shouxun Lin. 2005. Parsing the Penn Chinese Treebank with semantic knowledge. In Proceedings of IJCNLP 2005, pages 70–81.

Yamada, Kenji and Kevin Knight. 2001. A syntax-based statistical translation model. In Proceedings of ACL, pages 523–530.

Zhang, Hao, Liang Huang, Daniel Gildea, and Kevin Knight. 2006. Synchronous binarization for machine translation. In Proc. of HLT-NAACL.

Zhang, Min, Hongfei Jiang, Aiti Aw, Haizhou Li, Chew Lim Tan, and Sheng Li. 2008. A tree sequence alignment-based tree-to-tree translation model. In Proceedings of ACL, Columbus, OH, June.

Zollmann, Andreas and Ashish Venugopal. 2007. An efficient two-pass approach to synchronous-CFG driven statistical MT. In Proc. of HLT-NAACL (to appear).