
Confidence Driven Unsupervised Semantic Parsing

Dan Goldwasser Roi Reichart James Clarke Dan Roth



Department of Computer Science, University of Illinois at Urbana-Champaign
{goldwas1,clarkeje,danr}@illinois.edu

Computer Science and Artificial Intelligence Laboratory, MIT
roiri@csail.mit.edu

Abstract

Current approaches for semantic parsing take a supervised approach requiring a considerable amount of training data, which is expensive and difficult to obtain. This supervision bottleneck is one of the major difficulties in scaling up semantic parsing.

We argue that a semantic parser can be trained effectively without annotated data, and introduce an unsupervised learning algorithm. The algorithm takes a self-training approach driven by confidence estimation. Evaluated over Geoquery, a standard dataset for this task, our system achieved 66% accuracy, compared to 80% for its fully supervised counterpart, demonstrating the promise of unsupervised approaches for this task.

1 Introduction

Semantic parsing, the ability to transform Natural Language (NL) input into a formal Meaning Representation (MR), is one of the longest standing goals of natural language processing. The importance of the problem stems from both theoretical and practical reasons, as the ability to convert NL into a formal MR has countless applications.

The term semantic parsing has been used ambiguously to refer to several semantic tasks (e.g., semantic role labeling). We follow the most common definition of this task: finding a mapping between NL input and its interpretation expressed in a well-defined formal MR language. Unlike shallow semantic analysis tasks, the output of a semantic parser is complete and unambiguous to the extent that it can be understood or even executed by a computer system.

Current approaches for this task take a data driven approach (Zettlemoyer and Collins, 2007; Wong and Mooney, 2007), in which the learning algorithm is given a set of NL sentences as input and their corresponding MR, and learns a statistical semantic parser: a set of parameterized rules mapping lexical items and syntactic patterns to their MR. Given a sentence, these rules are applied recursively to derive the most probable interpretation.

Since semantic interpretation is limited to the syntactic patterns observed in the training data, these approaches require considerable amounts of annotated data in order to work well. Unfortunately, annotating sentences with their MR is a time consuming task which requires specialized domain knowledge, and therefore minimizing the supervision effort is one of the key challenges in scaling semantic parsers.

In this work we present the first unsupervised approach for this task. Our model compensates for the lack of training data by employing a self-training protocol based on identifying high confidence self-labeled examples and using them to retrain the model. We base our approach on a simple observation: semantic parsing is a difficult structured prediction task which requires learning a complex model; however, identifying good predictions can be done with a far simpler model capturing repeating patterns in the predicted data. We present several simple, yet highly effective confidence measures capturing such patterns, and show how to use them to train a semantic parser without manually annotated sentences.

Our basic premise, that predictions with a high confidence score are of high quality, is further used to improve the performance of the unsupervised training procedure. Our learning algorithm takes an EM-like iterative approach, in which the predictions of the previous stage are used to bias the model. While this basic scheme has been successfully applied to many unsupervised tasks, it is known to converge to a sub-optimal point. We show that by using confidence estimation as a proxy for the model's prediction quality, the learning algorithm can identify a better model compared to the default convergence criterion.

We evaluate our learning approach and model on the well studied Geoquery domain (Zelle and Mooney, 1996; Tang and Mooney, 2001), consisting of natural language questions and their Prolog interpretations, used to query a database of U.S. geographical information. Our experimental results show that using our approach we are able to train a good semantic parser without annotated data, and that using a confidence score to identify good models results in a significant performance improvement.
2 Semantic Parsing

We formulate semantic parsing as a structured prediction problem, mapping a NL input sentence (denoted x) to its highest ranking MR (denoted z). In order to correctly parametrize and weight the possible outputs, the decision relies on an intermediate representation: an alignment between textual fragments and their meaning representation (denoted y). Fig. 1 describes a concrete example of this terminology. In our experiments the input sentences x are natural language queries about U.S. geography taken from the Geoquery dataset. The meaning representation z is a formal language database query; this output representation language is described in Sec. 2.1.

    x: How many states does the Colorado river run through?
    z: count( state( traverse( river( const(colorado)))))

Figure 1: Example of an input sentence (x), meaning representation (z) and the alignment between the two (y) for the Geoquery domain.

The prediction function, mapping a sentence to its corresponding MR, is formalized as follows:

    z = F_w(x) = \arg\max_{y \in Y, z \in Z} w^T \Phi(x, y, z)    (1)

where Φ is a feature function defined over an input sentence x, alignment y and output z. The weight vector w contains the model's parameters, whose values are determined by the learning process.

We refer to the arg max above as the inference problem. Given an input sentence, solving this inference problem based on Φ and w is what comprises our semantic parser. In practice the parsing decision is decomposed into smaller decisions (Sec. 2.2). Sec. 4 provides more details about the feature representation and inference procedure used.

Current approaches obtain w using annotated data, typically consisting of (x, z) pairs. In Sec. 3 we describe our unsupervised learning procedure, that is, how to obtain w without annotated data.
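To make the prediction step concrete, the following is a minimal sketch of Eq. 1 as exhaustive search; `candidates` and `features` are hypothetical placeholders for the structure enumeration and the feature function Φ, and the paper itself solves this arg max with ILP inference (Sec. 4) rather than enumeration.

```python
# Minimal sketch of Eq. 1, assuming hypothetical helpers: `candidates`
# enumerates (alignment y, formula z) pairs for a sentence, and
# `features` computes the feature vector Phi(x, y, z). The paper solves
# this arg max with ILP inference (Sec. 4) instead of enumeration.
import numpy as np

def predict(w, x, candidates, features):
    """Return the top scoring (y, z) pair: arg max of w^T Phi(x, y, z)."""
    best, best_score = None, float('-inf')
    for y, z in candidates(x):
        score = float(np.dot(w, features(x, y, z)))
        if score > best_score:
            best, best_score = (y, z), score
    return best
```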
2.1 Target Meaning Representation

The output of the semantic parser is a logical formula, grounding the semantics of the input sentence in the domain language (i.e., the Geoquery domain). We use a subset of first order logic consisting of typed constants (corresponding to specific states, etc.) and functions, which capture relations between domain entities and properties of entities (e.g., population : E → N). The semantics of the input sentence is constructed via functional composition, done by the substitution operator. For example, given the function next_to(x) and the expression const(texas), substitution replaces the occurrence of the free variable x with the expression, resulting in a new formula: next_to(const(texas)). For further details we refer the reader to (Zelle and Mooney, 1996).
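As an illustration of the substitution operator, the sketch below composes next_to(const(texas)); representing terms as nested tuples is our own illustrative choice, not the paper's implementation.

```python
# Illustrative sketch of functional composition via substitution.
# Terms are nested tuples: ('next_to', 'x') is a function with free
# variable 'x'; ('const', 'texas') is a typed constant. This encoding
# is an assumption made for illustration only.

def substitute(func, arg, var='x'):
    """Replace the free variable `var` in `func` with the term `arg`."""
    name, body = func
    if body == var:
        return (name, arg)
    return (name, substitute(body, arg, var))

print(substitute(('next_to', 'x'), ('const', 'texas')))
# ('next_to', ('const', 'texas')), i.e. next_to(const(texas))
```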
2.2 Semantic Parsing Decisions

The inference problem described in Eq. 1 selects the top ranking output formula. In practice this decision is decomposed into smaller decisions, capturing local mappings of input tokens to logical fragments and their composition into larger fragments. These decisions are further decomposed into a feature representation, described in Sec. 4.

The first type of decisions are encoded directly by the alignment (y) between the input tokens and their corresponding predicates. We refer to these as first order decisions. The pairs connected by the alignment (y) in Fig. 1 are examples of such decisions.

The final output structure z is constructed by composing individual predicates into a complete formula. For example, consider the formula presented in Fig. 1: river( const(colorado)) is a composition of the two predicates river and const(colorado). We refer to the composition of two predicates, associated with their respective input tokens, as second order decisions.

In order to formulate these decisions, we introduce the following notation. c is a constituent in the input sentence x and D is the set of all function and constant symbols in the domain. The alignment y is a set of mappings between constituents and symbols in the domain: y = {(c, s)} where s ∈ D. We denote by s_i the i-th output predicate composition in z, by s_{i-1}(s_i) the composition of the (i-1)-th predicate on the i-th predicate, and by y(s_i) the input word corresponding to that predicate according to the alignment y.

3 Unsupervised Semantic Parsing

Our learning framework takes a self-training approach in which the learner is iteratively trained over its own predictions. Successful application of this approach depends heavily on two important factors: how to select high quality examples to train the model on, and how to define the learning objective so that learning can halt once a good model is found.

Both of these questions are trivially answered when working in a supervised setting: by using the labeled data for training the model, and defining the learning objective with respect to the annotated data (for example, loss minimization in the supervised version of our system).

In this work we suggest to address both of the above concerns by approximating the quality of the model's predictions using a confidence measure computed over the statistics of the self-generated predictions. Output structures which fall close to the center of mass of these statistics will receive a high confidence score.

The first issue is addressed by using examples assigned a high confidence score to train the model, acting as labeled examples.

We also note that since the confidence score provides a good indication of the model's prediction performance, it can be used to approximate the overall model performance, by observing the model's total confidence score over all its predictions. This allows us to set a performance driven goal for our learning process: return the model maximizing the confidence score over all predictions. We describe the details of integrating the confidence score into the learning framework in Sec. 3.1.

Although using the model's prediction score (i.e., w^T Φ(x, y, z)) as an indication of correctness is a natural choice, we argue, and show empirically, that unsupervised learning driven by confidence estimation results in a better performing model. This empirical behavior also has a theoretical justification: training the model using examples selected according to the model's parameters (i.e., the top ranking structures) may not generalize much further beyond the existing model, as the training examples will simply reinforce the existing model. The statistics used for confidence estimation are different than those used by the model to create the output structures, and can therefore capture additional information unobserved by the prediction model. This assumption is based on the well established idea of multi-view learning, applied successfully to many NL applications (Blum and Mitchell, 1998; Collins and Singer, 1999). According to this idea, if two models use different views of the data, each of them can enhance the learning process of the other.

The success of our learning procedure hinges on finding good confidence measures, whose confidence prediction correlates well with the true quality of the prediction. The ability of unsupervised confidence estimation to provide high quality confidence predictions can be explained by the observation that prominent prediction patterns are more likely to be correct. If a non-random model produces a prediction pattern multiple times, it is likely to be an indication of an underlying phenomenon in the data, and therefore more likely to be correct. Our specific choice of confidence measures is guided by the intuition that unlike structure prediction (i.e., solving the inference problem), which requires taking statistics over complex and intricate patterns, identifying high quality predictions can be done using much simpler patterns that are significantly easier to capture.

In the remainder of this section we describe our learning approach. We begin by introducing the overall learning framework (Sec. 3.1); we then explain the rationale behind confidence estimation over self-generated data and introduce the confidence measures used in our experiments (Sec. 3.2). We conclude with a description of the specific learning algorithms used for updating the model (Sec. 3.3).

3.1 Unsupervised Confidence-Driven Learning

Our learning framework works in an EM-like manner, iterating between two stages: making predictions based on its current set of parameters, and then retraining the model using a subset of the predictions assigned high confidence. The learning process discovers new high confidence training examples to add to its training set over multiple iterations, and converges when the model no longer adds new training examples.

While this is a natural convergence criterion, it provides no performance guarantees, and in practice it is very likely that the quality of the model (i.e., its performance) fluctuates during the learning process. We follow the observation that confidence estimation can be used to approximate the performance of the entire model, and return the model with the highest overall prediction confidence.

We describe this algorithmic framework in detail in Alg. 1. Our algorithm takes as input a set of natural language sentences and a set of parameters used for making the initial predictions.¹ The algorithm then iterates between the two stages: predicting the output structure for each sentence (line 4), and updating the set of parameters (line 9). The specific learning algorithms used are discussed in Sec. 3.3. The training examples required for learning are obtained by selecting high confidence examples: the algorithm first takes statistics over the current predicted set of output structures (line 7), and then based on these statistics computes a confidence score for each structure, selecting the top ranked ones as positive training examples and, if needed, the bottom ones as negative examples (line 8). The set of top confidence examples (for either correct or incorrect prediction) at iteration i of the algorithm is denoted S_i^conf. The exact nature of the confidence computation is discussed in Sec. 3.2.

¹ Since we commit to the max-score output prediction, rather than summing over all possibilities, we require a reasonable initialization point. We initialized the weight vector using simple, straightforward heuristics described in Sec. 5.

Algorithm 1 Unsupervised Confidence-driven Learning
Input: Sentences {x_l}_{l=1}^N, initial weight vector w
 1: define Confidence : X × Y × Z → R, i = 0, S_i = ∅
 2: repeat
 3:   for l = 1, ..., N do
 4:     y, z = arg max_{y,z} w^T Φ(x_l, y, z)
 5:     S_i = S_i ∪ {(x_l, y, z)}
 6:   end for
 7:   Confidence = compute confidence statistics over S_i
 8:   S_i^conf = select from S_i using Confidence
 9:   w_i ← Learn(∪_i S_i^conf)
10:   i = i + 1
11: until S_i^conf has no new unique examples
12: best = arg max_i (Σ_{s ∈ S_i} Confidence(s)) / |S_i|
13: return w_best

The algorithm iterates between these two stages; at each iteration it adds more self-annotated examples to its training set, and learning therefore converges when no new examples are added (line 11). The algorithm keeps track of the models it trained at each stage throughout this process, and returns the one with the highest averaged overall confidence score (lines 12-13). At each stage, the overall confidence score is computed by averaging over all the confidence scores of the predictions made at that stage.
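A compact Python rendering of Alg. 1's outer loop may help; `predict`, `score_fn` and `learn` are hypothetical stand-ins for the inference step (Eq. 1), the confidence measures (Sec. 3.2) and the learners (Sec. 3.3), and the top-quarter selection rule is our own placeholder, not the paper's.

```python
# Sketch of Alg. 1's outer loop, under the assumptions stated above.

def confidence_driven_learning(sentences, w0, predict, score_fn, learn):
    w, models, seen = w0, [], set()
    while True:
        preds = [(x,) + tuple(predict(w, x)) for x in sentences]   # line 4
        scores = score_fn(preds)                                   # line 7
        cutoff = sorted(scores, reverse=True)[len(scores) // 4]    # placeholder rule
        conf = [p for p, s in zip(preds, scores) if s >= cutoff]   # line 8
        new = {repr(p) for p in conf} - seen
        if not new:                     # line 11: no new unique examples
            break
        seen |= new
        w = learn(conf)                                            # line 9
        models.append((sum(scores) / len(scores), w))
    # Lines 12-13: return the model with the highest averaged confidence.
    return max(models, key=lambda m: m[0])[1] if models else w0
```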
3.2 Unsupervised Confidence Estimation

Confidence estimation is calculated over a batch of input (x) - output (z) pairs. Each pair decomposes into smaller first order and second order decisions (defined in Sec. 2.2). Confidence estimation is done by computing the statistics of these decisions over the entire set of predicted structures. In the rest of this section we introduce the confidence measures used by our system.

Translation Model  The first approach essentially constructs a simplified translation model, capturing word-to-predicate mapping patterns. This can be considered as an abstraction of the prediction model: we collapse the intricate feature representation into high level decisions and take statistics over these decisions. Since it takes statistics over considerably fewer variables than the actual prediction model, we expect this model to make reliable confidence predictions. We consider two variations of this approach; the first constructs a unigram model over the first order decisions and the second a bigram model over the second order decisions. Formally, given a set of predicted structures we define the following confidence scores:

Unigram Score:

    p(z|x) = \prod_{i=1}^{|z|} p(s_i \mid y(s_i))

Bigram Score:

    p(z|x) = \prod_{i=1}^{|z|} p(s_{i-1}(s_i) \mid y(s_{i-1}), y(s_i))
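The unigram variant can be sketched as follows, with the batch itself supplying the probability estimates; representing each prediction as a flat list of (word, predicate) pairs is a simplification we make for illustration, and the bigram score is obtained analogously by counting adjacent predicate compositions.

```python
# Sketch of the unigram translation-model confidence (Sec. 3.2). Each
# prediction is simplified to a list of (word, predicate) first order
# decisions; p(s_i | y(s_i)) is estimated from the batch itself.
from collections import Counter
from math import prod

def unigram_confidence(batch):
    """Return p(z|x) = prod_i p(s_i | y(s_i)) for every prediction."""
    pair_counts = Counter(d for pred in batch for d in pred)
    word_counts = Counter(w for pred in batch for w, _ in pred)
    return [prod(pair_counts[w, s] / word_counts[w] for w, s in pred)
            for pred in batch]

# A predicate that a word maps to in most of the batch gets a high
# probability, so structures built from prominent patterns score high.
```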
Structural Proportion  Unlike the first approach, which decomposes the predicted structure into individual decisions, this approach approximates the model's performance by observing global properties of the structure. We take statistics over the proportion between the number of predicates in z and the number of words in x.

Given a set of structure predictions S, we compute this proportion for each structure (denoted Prop(x, z)) and calculate the average proportion over the entire set (denoted AvProp(S)). The confidence score assigned to a given structure (x, z) is simply the difference between its proportion and the averaged proportion, or formally:

    PropScore(S, (x, z)) = AvProp(S) - Prop(x, z)

This measure captures the global complexity of the predicted structure and penalizes structures which are too complex (high negative values) or too simplistic (high positive values).

Combined  The two approaches defined above capture different views of the data; a natural question is then: can these two measures be combined to provide a more powerful estimation? We suggest a third approach which combines the first two. It first uses the score produced by the latter approach to filter out unlikely candidates, and then ranks the remaining ones with the former approach, selecting those with the highest rank.
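A sketch of these two measures follows; `num_predicates` and `num_words` are hypothetical accessors for the predicted formula and input sentence, and the filtering tolerance is our own placeholder parameter, not a value from the paper.

```python
# Sketch of the proportion score and the combined measure (Sec. 3.2),
# under the assumptions stated above.

def prop_scores(batch, num_predicates, num_words):
    """PropScore(S,(x,z)) = AvProp(S) - Prop(x,z) for each prediction."""
    props = [num_predicates(z) / num_words(x) for x, z in batch]
    avg = sum(props) / len(props)
    return [avg - p for p in props]

def combined_rank(batch, bigram, num_predicates, num_words, tol=0.1):
    # Filter out structurally unlikely candidates (proportion far from
    # the batch average), then rank the rest by the bigram score.
    scores = prop_scores(batch, num_predicates, num_words)
    kept = [i for i, s in enumerate(scores) if abs(s) < tol]
    return sorted(kept, key=lambda i: bigram[i], reverse=True)
```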
3.3 Learning Algorithms

Given a set of self-generated structures, the parameter vector can be updated (line 9 in Alg. 1). We consider two learning algorithms for this purpose.

The first is a binary learning algorithm, which treats learning as a classification problem, that is, finding a set of weights w that can best separate correct from incorrect structures. The algorithm decomposes each predicted formula and its corresponding input sentence into a feature vector Φ(x, y, z) normalized by the size of the input sentence |x|, and assigns a binary label to this vector.² The learning process is defined over both positive and negative training examples. To accommodate that, we modify line 8 in Alg. 1 and use the confidence score to select the top ranking examples as positive examples and the bottom ranking examples as negative examples. We use a linear kernel SVM with squared-hinge loss as the underlying learning algorithm.

² Without normalization, longer sentences would have more influence on the binary learning problem. Normalization is therefore required to ensure that each sentence contributes equally to the binary learning problem regardless of its length.

The second is a structured learning algorithm which treats learning as a ranking problem, i.e., finding a set of weights w such that the gold structure will be ranked on top, preferably by a large margin to allow generalization. The structured learning algorithm can directly use the top ranking predictions of the model (line 8 in Alg. 1) as training data. In this case the underlying algorithm is a structural SVM with squared-hinge loss, using Hamming distance as the distance function. We use the cutting-plane method to efficiently optimize the learning process objective function.
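The binary variant can be sketched with scikit-learn's linear SVM, one possible stand-in for the solver (the paper does not name its implementation); the `features` extractor for Φ(x, y, z) is assumed.

```python
# Sketch of the binary learner (Sec. 3.3): a linear SVM with squared
# hinge loss over length-normalized feature vectors, under the
# assumptions stated above.
import numpy as np
from sklearn.svm import LinearSVC

def train_binary(pos, neg, features):
    # Normalize by sentence length |x| so long sentences do not dominate
    # the classification problem (see footnote 2).
    X = np.array([np.asarray(features(x, y, z)) / len(x)
                  for x, y, z in pos + neg])
    labels = [1] * len(pos) + [-1] * len(neg)
    return LinearSVC(loss='squared_hinge').fit(X, labels).coef_.ravel()
```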
4 Model

Semantic parsing as formulated in Eq. 1 is an inference procedure selecting the top ranked output logical formula. We follow the inference approach in (Roth and Yih, 2007; Clarke et al., 2010) and formalize this process as an Integer Linear Program (ILP). Due to space considerations we provide a brief description, and refer the reader to that paper for more details.

4.1 Inference

The inference decision (Eq. 1) is decomposed into smaller decisions, capturing the mapping of input tokens to logical fragments (first order) and their composition into larger fragments (second order). We encode a first-order decision as α_cs, a binary variable indicating that constituent c is aligned with the logical symbol s. A second-order decision β_cs,dt is encoded as a binary variable indicating that the symbol t (associated with constituent d) is an argument of a function s (associated with constituent c). We frame the inference problem over these decisions:

    F_w(x) = \arg\max_{\alpha,\beta} \sum_{c \in x} \sum_{s \in D} \alpha_{cs} \, w^T \Phi_1(x, c, s) + \sum_{c,d \in x} \sum_{s,t \in D} \beta_{cs,dt} \, w^T \Phi_2(x, c, s, d, t)    (2)

We restrict the possible assignments to the decision variables, forcing the resulting output formula to be syntactically legal, for example by restricting active β-variables to be type consistent, and forcing the resulting functional composition to be acyclic. We take advantage of the flexible ILP framework and encode these restrictions as global constraints over Eq. 2. We refer the reader to (Clarke et al., 2010) for a full description of the constraints used.
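The encoding in Eq. 2 can be sketched with the PuLP modeling library (our choice for illustration; the paper does not commit to a specific solver). Only one representative constraint is shown, linking second order variables to their first order decisions; the full constraint set is in (Clarke et al., 2010).

```python
# Sketch of the ILP in Eq. 2, under the assumptions stated above. It
# assumes every composition's parts also appear as first order decisions.
import pulp

def ilp_inference(first_order, second_order):
    """first_order: {(c, s): score}; second_order: {(c, s, d, t): score}."""
    prob = pulp.LpProblem('semantic_parse', pulp.LpMaximize)
    a = {k: pulp.LpVariable('a_%d' % i, cat='Binary')
         for i, k in enumerate(first_order)}
    b = {k: pulp.LpVariable('b_%d' % i, cat='Binary')
         for i, k in enumerate(second_order)}
    # Objective: weighted sum of first and second order decision scores.
    prob += (pulp.lpSum(first_order[k] * a[k] for k in a) +
             pulp.lpSum(second_order[k] * b[k] for k in b))
    # Consistency: a composition (c,s,d,t) may be active only if both of
    # its first order decisions are active.
    for (c, s, d, t), var in b.items():
        prob += var <= a[(c, s)]
        prob += var <= a[(d, t)]
    prob.solve()
    return {k for k, v in a.items() if v.value() == 1}
```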
scribing our experimental setup and then proceed to
4.2 Features describe the experiments and their results. For the
sake of clarity we focus on the best performing mod-
The inference problem defined in Eq. (2) uses two
els (C OMBINED using B IGRAM and P ROPORTION)
feature functions: 1 and 2 .
first and discuss other models later in the section.
First-order decision features 1 Determining if 5.1 Experimental Settings
a logical symbol is aligned with a specific con-
stituent depends mostly on lexical information. In all our experiments we used the Geoquery
Following previous work (e.g., (Zettlemoyer and dataset (Zelle and Mooney, 1996), consisting of U.S.
Collins, 2005)) we create a small lexicon, mapping geography NL questions and their corresponding
logical symbols to surface forms.3 Existing ap- Prolog logical MR. We used the data split described
proaches rely on annotated data to extend the lexi- in (Clarke et al., 2010), consisting of 250 queries for
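For instance, a surface-form similarity feature could be sketched as below; difflib's string overlap is a simple stand-in for the WordNet-derived similarity the paper builds on (Miller et al., 1990), and `surface_forms` is assumed to come from the small manual lexicon.

```python
# Sketch of a first order lexical similarity feature for Phi_1, under
# the assumptions stated above.
import difflib

def lexical_similarity(constituent, surface_forms):
    """Max similarity between a constituent and a symbol's surface forms."""
    return max(difflib.SequenceMatcher(None, constituent.lower(),
                                       f.lower()).ratio()
               for f in surface_forms)

# lexical_similarity('rivers', ['river']) -> ~0.9, supporting the
# alignment between the word 'rivers' and the logical symbol river.
```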
Second-order decision features Φ_2  Second order decisions rely on syntactic information. We use the dependency tree of the input sentence. Given a second-order decision β_cs,dt, the dependency feature takes the normalized distance between the head words in the constituents c and d. In addition, a set of features indicates which logical symbols are usually composed together, without considering their alignment to the text.
5 Experiments

In this section we describe our experimental evaluation. We compare several confidence measures and analyze their properties. Tab. 1 defines the naming conventions used throughout this section to refer to the different models we evaluated. We begin by describing our experimental setup and then proceed to describe the experiments and their results. For the sake of clarity we focus on the best performing models (COMBINED using BIGRAM and PROPORTION) first, and discuss other models later in the section.

Model           | Description
----------------|------------------------------------------
INITIAL MODEL   | Manually set weights (Sec. 5.1)
PRED. SCORE     | Normalized prediction score (Sec. 5.1)
ALL EXAMPLES    | All top structures (Sec. 5.1)
UNIGRAM         | Unigram score (Sec. 3.2)
BIGRAM          | Bigram score (Sec. 3.2)
PROPORTION      | Words-predicate proportion (Sec. 3.2)
COMBINED        | Combined estimators (Sec. 3.2)
RESPONSE BASED  | Supervised (binary) (Sec. 5.1)
SUPERVISED      | Fully supervised (Sec. 5.1)

Table 1: Compared systems and naming conventions.

5.1 Experimental Settings

In all our experiments we used the Geoquery dataset (Zelle and Mooney, 1996), consisting of U.S. geography NL questions and their corresponding Prolog logical MR. We used the data split described in (Clarke et al., 2010), consisting of 250 queries for evaluation purposes. We compared our system to several supervised models, which were trained using a disjoint set of queries. Our learning system had access only to the NL questions; the logical forms were only used to evaluate the system's performance. We report the proportion of correct structures (accuracy). Note that this evaluation corresponds to the 0/1 loss over the predicted structures.

Initialization  Our learning framework requires an initial weight vector as input. We use a straightforward heuristic and provide uniform positive weights to three features. This approach is similar in spirit to previous works (Clarke et al., 2010; Zettlemoyer and Collins, 2007). We refer to this system as INITIAL MODEL throughout this section.

Competing Systems  We compared our system to several other systems:

(1) PRED. SCORE: An unsupervised framework using the model's internal prediction score (w^T Φ(x, y, z)) for confidence estimation.

(2) ALL EXAMPLES: Treating all predicted structures as correct, i.e., at each iteration the model is trained over all the predictions it made. The reported score was obtained by selecting the model at the training iteration with the highest overall confidence score (see line 12 in Alg. 1).

(3) RESPONSE BASED: A natural upper bound for our framework is the approach used in (Clarke et al., 2010). While our approach is based on assessing the correctness of the model's predictions according to unsupervised confidence estimation, their framework is provided with external supervision for these decisions, indicating if the predicted structures are correct.

(4) SUPERVISED: A fully supervised framework trained over 250 (x, z) pairs using structured SVM.

5.2 Results

Our experiments aim to clarify three key points:

(1) Can a semantic parser indeed be trained without any form of external supervision? This is our key question, as this is the first attempt to approach this task with an unsupervised learning protocol.⁴ In order to answer it, we report the overall performance of our system in Tab. 2.

⁴ While unsupervised learning for various semantic tasks has been widely discussed, this is the first attempt to tackle this task. We refer the reader to Sec. 6 for further discussion of this point.

Algorithm                | Supervision   | Acc.
-------------------------|---------------|------
INITIAL MODEL            |               | 0.222
SELF-TRAIN (Structured): |               |
  PRED. SCORE            |               | 0.348
  ALL EXAMPLES           |               | 0.656
  COMBINED               |               | 0.664
SELF-TRAIN (Binary):     |               |
  PRED. SCORE            |               | 0.164
  COMBINED               |               | 0.536
RESPONSE BASED:          |               |
  BINARY                 | 250 (binary)  | 0.692
  STRUCTURED             | 250 (binary)  | 0.732
SUPERVISED:              |               |
  STRUCTURED             | 250 (struct.) | 0.804

Table 2: Comparing our self-trained systems with response-based and supervised models. Results show that our COMBINED approach outperforms all other unsupervised models.

The manually constructed model INITIAL MODEL achieves a performance of 0.22; we can expect learning to improve on this baseline. We compare three self-trained systems, ALL EXAMPLES, PRED. SCORE and COMBINED, which differ in their sample selection strategy, but all use confidence estimation for selecting the final semantic parsing model. The ALL EXAMPLES approach achieves an accuracy score of 0.656. PRED. SCORE only achieves a performance of 0.164 using the binary learning algorithm and 0.348 using the structured learning algorithm. Finally, our confidence-driven technique COMBINED achieved a score of 0.536 for the binary case and 0.664 for the structured case, the best performing models in both cases. As expected, the supervised systems RESPONSE BASED and SUPERVISED achieve the best performance.

These results show that training the model with carefully selected training examples improves learning, as the best performance is achieved with perfect knowledge of the predictions' correctness (RESPONSE BASED). Interestingly, the difference between the structured version of our system and that of RESPONSE BASED is only 0.07, suggesting that we can recover the binary feedback signal with high precision. The low performance of the PRED. SCORE model is also not surprising; it demonstrates one of the key principles of confidence estimation: the score should be comparable across predictions made over different inputs, not only over outputs for the same input, as is done in the PRED. SCORE model.

(2) How does confidence driven sample selection contribute to the learning process? Comparing the systems driven by confidence sample selection to the ALL EXAMPLES approach uncovers an interesting tradeoff between training with more (noisy) data and selectively training the system with higher quality examples. We argue that carefully selecting high quality training examples will result in better performance. The empirical results indeed support our argument, as the best performing model (RESPONSE BASED) is achieved by sample selection with perfect knowledge of prediction correctness. The confidence-based sample selection system (COMBINED) is the best performing system out of all the self-trained systems. Nonetheless, the ALL EXAMPLES strategy performs well when compared to COMBINED, justifying a closer look at that aspect of our system.

We argue that different confidence measures capture different properties of the data, and hypothesize that combining their scores will improve the resulting model. In Tab. 3 we compare the results of the COMBINED measure to the results of its individual components, PROPORTION and BIGRAM. We compare these results both when using the binary and the structured learning algorithms. Results show that using the COMBINED measure leads to improved performance, better than any of the individual measures, suggesting that it can effectively exploit the properties of each confidence measure. Furthermore, COMBINED is the only sample selection strategy that outperforms ALL EXAMPLES.

Algorithm                | Accuracy
-------------------------|---------
SELF-TRAIN (Structured): |
  PROPORTION             | 0.6
  BIGRAM                 | 0.644
  COMBINED               | 0.664
SELF-TRAIN (Binary):     |
  BIGRAM                 | 0.532
  PROPORTION             | 0.504
  COMBINED               | 0.536

Table 3: Comparing COMBINED to its components BIGRAM and PROPORTION. COMBINED results in a better score than any of its components, suggesting that it can exploit the properties of each measure effectively.

(3) Can confidence measures serve as a good proxy for the model's performance? In the unsupervised settings we study, the learning process may not converge to an optimal model. We argue that by selecting the model that maximizes the averaged confidence score, a better model can be found. We validate this claim empirically in Tab. 4. We compare the performance of the model selected using the confidence score to the performance of the final model considered by the learning algorithm (see Sec. 3.1 for details). We also compare it to the best model achieved in any of the learning iterations.

Since these experiments required running the learning algorithm many times, we focused on the binary learning algorithm, as it converges considerably faster. In order to focus the evaluation on the effects of learning, we ignore the manually generated initial model (INITIAL MODEL) in these experiments. In order to compare model performance across the different iterations fairly, a uniform scale, such as UNIGRAM or BIGRAM, is required. In the case of the COMBINED measure we used the BIGRAM measure for performance estimation, since it is one of its underlying components. In the PRED. SCORE and PROPORTION models we used both their own confidence prediction and the simple UNIGRAM confidence score to evaluate model performance (the latter appears in parentheses in Tab. 4).

Algorithm    | Best  | Conf. estim.  | Default
-------------|-------|---------------|--------
PRED. SCORE  | 0.164 | 0.128 (0.164) | 0.134
UNIGRAM      | 0.52  | 0.52          | 0.4
BIGRAM       | 0.532 | 0.532         | 0.472
PROPORTION   | 0.504 | 0.27 (0.504)  | 0.44
COMBINED     | 0.536 | 0.536         | 0.328

Table 4: Using confidence to approximate model performance. We compare the best result obtained in any of the learning algorithm's iterations (Best), the result obtained by approximating the best result using the averaged prediction confidence (Conf. estim.), and the result of using the default convergence criterion (Default). Results in parentheses are the result of using the UNIGRAM confidence to approximate the model's performance.

Results show that the overall confidence score serves as a reliable proxy for the model performance: using UNIGRAM and BIGRAM the framework can select the best performing model, far better than the performance of the default model to which the system converged.
6 Related Work

Semantic parsing has attracted considerable interest in recent years. Current approaches employ various machine learning techniques for this task, such as Inductive Logic Programming in earlier systems (Zelle and Mooney, 1996; Tang and Mooney, 2000) and statistical learning methods in modern ones (Ge and Mooney, 2005; Nguyen et al., 2006; Wong and Mooney, 2006; Kate and Mooney, 2006; Zettlemoyer and Collins, 2005; Zettlemoyer and Collins, 2007; Zettlemoyer and Collins, 2009).

The difficulty of providing the required supervision motivated learning approaches using weaker forms of supervision. (Chen and Mooney, 2008; Liang et al., 2009; Branavan et al., 2009; Titov and Kozhevnikov, 2010) ground NL in an external world state directly referenced by the text. The NL input in our setting is not restricted to such grounded settings, and therefore we cannot exploit this form of supervision. Recent work (Clarke et al., 2010; Liang et al., 2011) suggests using response-based learning protocols, which alleviate some of the supervision effort. This work takes an additional step in this direction and suggests an unsupervised protocol.

Other approaches to unsupervised semantic analysis (Poon and Domingos, 2009; Titov and Klementiev, 2011) take a different approach to semantic representation, clustering semantically equivalent dependency tree fragments and identifying their predicate-argument structure. While these approaches have been applied successfully to semantic tasks such as question answering, they do not ground the input in a well defined output language, an essential component in our task.

Our unsupervised approach follows a self-training protocol (Yarowsky, 1995; McClosky et al., 2006; Reichart and Rappoport, 2007b) enhanced with constraints restricting the output space (Chang et al., 2007; Chang et al., 2009). A self-training protocol uses its own predictions for training. We estimate the quality of the predictions and use only high confidence examples for training. This selection criterion provides an additional view, different than the one used by the prediction model. Multi-view learning is a well established idea, implemented in methods such as co-training (Blum and Mitchell, 1998).

Quality assessment of a learned model's output was explored by many previous works (see (Caruana and Niculescu-Mizil, 2006) for a survey), and applied to several NL processing tasks such as syntactic parsing (Reichart and Rappoport, 2007a; Yates et al., 2006), machine translation (Ueffing and Ney, 2007), speech (Koo et al., 2001), relation extraction (Rosenfeld and Feldman, 2007), IE (Culotta and McCallum, 2004), QA (Chu-Carroll et al., 2003) and dialog systems (Lin and Weng, 2008).

In addition to sample selection, we use confidence estimation as a way to approximate the overall quality of the model and use it for model selection. This use of confidence estimation was explored in (Reichart et al., 2010) to select between models trained with different random starting points. In this work we integrate this estimation deeper into the learning process, thus allowing our training procedure to return the best performing model.

7 Conclusions

We introduced an unsupervised learning algorithm for semantic parsing, the first for this task to the best of our knowledge. To compensate for the lack of training data we use a self-training protocol driven by unsupervised confidence estimation. We demonstrate empirically that our approach results in a high performing semantic parser, and show that confidence estimation plays a vital role in this success, both by identifying good training examples and by identifying good overall performance, used to improve the final model selection.

In future work we hope to further improve unsupervised semantic parsing performance. In particular, we intend to explore new approaches for confidence estimation and their usage in the unsupervised and semi-supervised versions of the task.

Acknowledgments  We thank the anonymous reviewers for their helpful feedback. This material is based upon work supported by DARPA under the Bootstrap Learning Program and the Machine Reading Program under Air Force Research Laboratory (AFRL) prime contract no. FA8750-09-C-0181. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the view of DARPA, AFRL, or the US government.
References

A. Blum and T. Mitchell. 1998. Combining labeled and unlabeled data with co-training. In COLT.

S.R.K. Branavan, H. Chen, L. Zettlemoyer, and R. Barzilay. 2009. Reinforcement learning for mapping instructions to actions. In ACL.

R. Caruana and A. Niculescu-Mizil. 2006. An empirical comparison of supervised learning algorithms. In ICML.

M. Chang, L. Ratinov, and D. Roth. 2007. Guiding semi-supervision with constraint-driven learning. In ACL.

M. Chang, D. Goldwasser, D. Roth, and Y. Tu. 2009. Unsupervised constraint driven learning for transliteration discovery. In NAACL.

D. Chen and R. Mooney. 2008. Learning to sportscast: a test of grounded language acquisition. In ICML.

J. Chu-Carroll, K. Czuba, J. Prager, and A. Ittycheriah. 2003. In question answering, two heads are better than one. In HLT-NAACL.

J. Clarke, D. Goldwasser, M. Chang, and D. Roth. 2010. Driving semantic parsing from the world's response. In CoNLL.

M. Collins and Y. Singer. 1999. Unsupervised models for named entity classification. In EMNLP-VLC.

A. Culotta and A. McCallum. 2004. Confidence estimation for information extraction. In HLT-NAACL.

R. Ge and R. Mooney. 2005. A statistical semantic parser that integrates syntax and semantics. In CoNLL.

R. Kate and R. Mooney. 2006. Using string-kernels for learning semantic parsers. In ACL.

Y. Koo, C. Lee, and B. Juang. 2001. Speech recognition and utterance verification based on a generalized confidence score. IEEE Transactions on Speech and Audio Processing, 9(8):821-832.

P. Liang, M. I. Jordan, and D. Klein. 2009. Learning semantic correspondences with less supervision. In ACL.

P. Liang, M. I. Jordan, and D. Klein. 2011. Deep compositional semantics from shallow supervision. In ACL.

F. Lin and F. Weng. 2008. Computing confidence scores for all sub parse trees. In ACL.

D. McClosky, E. Charniak, and M. Johnson. 2006. Effective self-training for parsing. In HLT-NAACL.

G. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. 1990. WordNet: An on-line lexical database. International Journal of Lexicography.

L. Nguyen, A. Shimazu, and X. Phan. 2006. Semantic parsing with structured SVM ensemble classification models. In ACL.

H. Poon and P. Domingos. 2009. Unsupervised semantic parsing. In EMNLP.

R. Reichart and A. Rappoport. 2007a. An ensemble method for selection of high quality parses. In ACL.

R. Reichart and A. Rappoport. 2007b. Self-training for enhancement and domain adaptation of statistical parsers trained on small datasets. In ACL.

R. Reichart, R. Fattal, and A. Rappoport. 2010. Improved unsupervised POS induction using intrinsic clustering quality and a Zipfian constraint. In CoNLL.

B. Rosenfeld and R. Feldman. 2007. Using corpus statistics on entities to improve semi-supervised relation extraction from the web. In ACL.

D. Roth and W. Yih. 2007. Global inference for entity and relation identification via a linear programming formulation. In Lise Getoor and Ben Taskar, editors, Introduction to Statistical Relational Learning.

L. Tang and R. Mooney. 2000. Automated construction of database interfaces: integrating statistical and relational learning for semantic parsing. In EMNLP.

L. R. Tang and R. J. Mooney. 2001. Using multiple clause constructors in inductive logic programming for semantic parsing. In ECML.

I. Titov and A. Klementiev. 2011. A Bayesian model for unsupervised semantic parsing. In ACL.

I. Titov and M. Kozhevnikov. 2010. Bootstrapping semantic analyzers from non-contradictory texts. In ACL.

N. Ueffing and H. Ney. 2007. Word-level confidence estimation for machine translation. Computational Linguistics, 33(1):9-40.

Y. W. Wong and R. Mooney. 2006. Learning for semantic parsing with statistical machine translation. In NAACL.

Y. W. Wong and R. Mooney. 2007. Learning synchronous grammars for semantic parsing with lambda calculus. In ACL.

D. Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In ACL.

A. Yates, S. Schoenmackers, and O. Etzioni. 2006. Detecting parser errors using web-based semantic filters. In EMNLP.

J. M. Zelle and R. J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In AAAI.

L. Zettlemoyer and M. Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI.

L. Zettlemoyer and M. Collins. 2007. Online learning of relaxed CCG grammars for parsing to logical form. In CoNLL.

L. Zettlemoyer and M. Collins. 2009. Learning context-dependent mappings from sentences to logical form. In ACL.
