Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1787–1796
Brussels, Belgium, October 31 - November 4, 2018.
© 2018 Association for Computational Linguistics
Figure 1: Our abstractive document summarization model, which mainly consists of three layers: the document encoder layer (the top part), the information selection layer (the middle part) and the summary decoder layer (the bottom part).

summary decoder layer, as shown in Figure 1. In our model, both the document and the summary are processed sentence by sentence, to better capture the inter-sentence relations. The information selection layer consists of two parts: gated global information filtering and local sentence selection. Unnecessary information in the original document is first globally filtered by a gated network; then important sentences are selected locally while generating each summary sentence sequentially. Moreover, we propose to optimize the information selection process with distantly-supervised training. Our proposed method combines the strengths of extractive methods and abstractive methods, and is able to tackle the factors of saliency, non-redundancy, coherence and fluency under a unified framework. We conduct extensive experiments on benchmark datasets, and the results demonstrate that explicit modeling and distantly-supervised optimization of the information selection process improves document summarization performance significantly, enabling our model to significantly outperform previous state-of-the-art neural abstractive methods.

A gated network first filters the sentence representations based on the global document representation. A sentence selection RNN is then used to select salient and relevant sentences while generating each summary sentence sequentially, based on the tailored sentence representations. At last, the summary decoder produces the output summary to paraphrase and generalize the selected sentences.

In the following, we denote h_i and h_{i,j} as the hidden state of the i-th sentence and of the j-th word of the i-th sentence in the document encoder part, respectively. In the information selection and summary decoder part, we denote h'_t and h'_{t,k} as the hidden state of the t-th summary sentence and of the k-th word in the t-th summary sentence, respectively.

2.1 Document Encoder

A document d is a sequence of sentences d = {s_i}, and each sentence is a sequence of words s_i = {w_{i,j}}. A hierarchical encoder, which consists of two levels (word level and sentence level) similar to (Nallapati et al., 2016), is used to encode the document at both the word and the sentence level. The word-level encoder is a bidirectional Gated Recurrent Unit (GRU) (Chung et al., 2014), which encodes the words of a sentence into a sentence representation. The word encoder sequentially updates its hidden state after receiving a word, which is formulated as:

    h_{i,j} = BiGRU(h_{i,j-1}, e_{i,j})    (1)

where h_{i,j} and e_{i,j} denote the hidden state and the embedding of word w_{i,j}, respectively.

The concatenation of the forward and backward final hidden states of the word-level encoder is taken as the vector representation x_i of the sentence s_i, which is used as input to the sentence-level encoder. The sentence encoder is also a bidirectional GRU, which updates its hidden state after receiving each sentence representation by:

    h_i = BiGRU(h_{i-1}, x_i)    (2)
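To make the hierarchical encoding of Eqs. (1)-(2) concrete, here is a minimal numpy sketch: a toy bidirectional GRU encodes each sentence word by word, and a second BiGRU runs over the resulting sentence vectors. All names, dimensions and the initialization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
H, E = 4, 3  # hidden size, word-embedding size (toy values)

def make_gru(in_dim, hid):
    # one (W, U, b) triple per gate: update z, reset r, candidate n
    return {g: (rng.normal(size=(hid, in_dim)) * 0.1,
                rng.normal(size=(hid, hid)) * 0.1,
                np.zeros(hid)) for g in "zrn"}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(p, h, x):
    Wz, Uz, bz = p["z"]; Wr, Ur, br = p["r"]; Wn, Un, bn = p["n"]
    z = sigmoid(Wz @ x + Uz @ h + bz)
    r = sigmoid(Wr @ x + Ur @ h + br)
    n = np.tanh(Wn @ x + Un @ (r * h) + bn)
    return (1 - z) * h + z * n

def bigru(p_fwd, p_bwd, xs, hid):
    # Eqs. (1)/(2): run a GRU in both directions, concatenate the final states
    hf = np.zeros(hid)
    for x in xs:
        hf = gru_step(p_fwd, hf, x)
    hb = np.zeros(hid)
    for x in reversed(xs):
        hb = gru_step(p_bwd, hb, x)
    return np.concatenate([hf, hb])

word_enc = (make_gru(E, H), make_gru(E, H))
sent_enc = (make_gru(2 * H, H), make_gru(2 * H, H))

doc = [rng.normal(size=(5, E)), rng.normal(size=(7, E))]  # 2 sentences of word embeddings
sent_reps = [bigru(*word_enc, s, H) for s in doc]          # word level -> x_i
doc_rep = bigru(*sent_enc, sent_reps, H)                   # sentence level
print(doc_rep.shape)  # (8,) = 2 * H
```

The concatenated forward/backward final states play the role of x_i at the word level and of the global document representation at the sentence level.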
information and select salient information from the input document to produce a condensed summary. However, it is difficult for the basic encoder-decoder framework to learn the process of salient information selection, which has also been noticed by several previous works (Tan et al., 2017a,b). To tackle this challenge, we extend the basic encoder-decoder framework by adding an information selection layer that models the information selection process explicitly. Our information selection layer consists of two parts: gated global information filtering, which is used to remove the unnecessary information of a document, and local sentence selection.

Local Sentence Selection

We explicitly model the local sentence selection process, which selects several target sentences to generate each summary sentence. Concretely, we apply an RNN layer to sequentially select target sentences for each summary sentence, as shown in Figure 1. The sentence-selection RNN uses the document representation d̂ as its initial state h'_0 and sequentially predicts the sentence selection vector α_t as follows:

    α_t^i = exp(ϕ(f_i, h'_t)) / Σ_l exp(ϕ(f_l, h'_t))    (5)
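As an illustration of Eq. (5), the sketch below computes one sentence-selection vector α_t over filtered sentence representations f_i with numpy. The score function ϕ is left unspecified in the text, so a bilinear form is assumed here purely for demonstration; all sizes and values are toy examples.

```python
import numpy as np

rng = np.random.default_rng(1)
H = 4
f = rng.normal(size=(6, H))        # filtered sentence representations f_i
h_t = rng.normal(size=H)           # sentence-selection RNN state h'_t
W = rng.normal(size=(H, H)) * 0.1  # bilinear parameter for phi (our assumption)

def select(f, h_t):
    scores = f @ (W @ h_t)                         # phi(f_i, h'_t)
    scores -= scores.max()                         # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # Eq. (5): softmax over sentences
    return alpha

alpha_t = select(f, h_t)
print(round(alpha_t.sum(), 6))  # 1.0
```

The softmax guarantees that α_t is a proper distribution over source sentences, which is what lets it later be supervised against the distant labels p_t via a KL-divergence.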
embedding e_{t,k-1} and context vector c_{t,k-1} as inputs to compute the new hidden state h'_{t,k} by:

    h'_{t,k} = GRU(h'_{t,k-1}, c_{t,k-1}, e_{t,k-1})    (9)

We import an attention mechanism to help locate relevant words to be copied or paraphrased within the selected source sentences at each word generation step. The attention distribution β_{t,k}^{i,j} of the k-th word of the t-th summary sentence over the words of the i-th document sentence can be computed as:

    β_{t,k}^{i,j} = α_t^i · exp(ϕ(h_{i,j}, h'_{t,k})) / Σ_l exp(ϕ(h_{i,l}, h'_{t,k}))    (10)

where α_t^i denotes the weight of the i-th source sentence, which is used to normalize the word attention distributions. The word-level context vector when generating the k-th word at the t-th sentence generation step can then be computed as c_{t,k} = Σ_i Σ_j β_{t,k}^{i,j} h_{i,j}, which is also incorporated into the word decoder.

At each word generation step, the vocabulary distribution is calculated from the context vector c_{t,k} and the decoder state h'_{t,k} by:

    P_vocab(w_{t,k}) = softmax(W_v(W_c[h'_{t,k}, c_{t,k}] + b_c) + b_v)    (11)

where W_v and W_c are learned parameters. The copy mechanism based on the word attention is also imported into the decoder to alleviate the OOV problem, as in (See et al., 2017).

2.4 Model Learning with Distant Supervision

Besides the end-to-end training for the quality of the generated summary, we also directly optimize the sentence selection decisions by importing supervision for the sentence selection vector α_t in Equation 5. While there is no explicit supervision for sentence selection, we define a simple approach for labeling sentences based on the reference summaries. To simulate the sentence selection process on human-written abstracts, we compute the word-matching similarities (based on TF-IDF cosine similarity) between a reference-summary sentence and the corresponding source document sentences, and normalize them into a distantly-labelled sentence selection vector p_t. The sentence selection loss is then defined as:

    loss_sel = Σ_t D_KL(α_t, p_t)    (12)

where D_KL(α_t, p_t) indicates the KL-divergence between the distributions α_t and p_t. The sentence selection loss is imported into the final loss function to be optimized together with the summary generation component.

The loss function L of the model is a mix of the negative log-likelihood of generating the summaries over the training set T and the sentence selection loss of distantly-supervised training:

    L = Σ_{(X,Y)∈T} -log P(Y|X; θ) + λ·loss_sel    (13)

where λ is a hyper-parameter tuned on the validation set, and (X, Y) denotes a document-summary pair in the training set.

3 Experiments

3.1 Dataset

We conduct our experiments on the large-scale CNN/DailyMail corpus, which has been widely used for the exploration of summarizing documents with multi-sentence summaries. The corpus was originally constructed in (Hermann et al., 2015) by collecting human-generated highlights from news stories on the CNN and DailyMail websites; it contains input documents of about 800 tokens on average and multi-sentence summaries of up to 200 tokens. We use the same version of the data as (See et al., 2017), which has 280,125 training pairs, 13,367 validation pairs and 11,489 test pairs in total after discarding the examples with empty article text. Some previous work (Nallapati et al., 2016, 2017; Paulus et al., 2017; Tan et al., 2017a) uses the anonymized version of the data, which has been pre-processed to replace each named entity with a unique identifier. By contrast, we use the non-anonymized data, as in (See et al., 2017), which is a more favorable and challenging setting because it requires no pre-processing.

3.2 Implementation Details

Model Parameters  For all experiments, the word-level encoder and the summary decoder both use 256-dimensional hidden states, and the sentence-level encoder and the sentence selection network both use 512-dimensional hidden states. We use pre-trained GloVe (Pennington et al., 2014) vectors to initialize the word embeddings. The dimension of the word embeddings is 100, and they are further trained with the model. We use a vocabulary of 50k words for both the encoder and the decoder.
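The hierarchical word attention of Eq. (10) and the context vector c_{t,k} can be sketched as follows. A dot-product score is assumed for ϕ, and all shapes and the selection weights are toy values, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(2)
n_sents, n_words, H = 3, 4, 5
h = rng.normal(size=(n_sents, n_words, H))  # word-encoder states h_{i,j}
h_tk = rng.normal(size=H)                   # word-decoder state h'_{t,k}
alpha = np.array([0.6, 0.3, 0.1])           # sentence selection weights alpha_t^i (toy)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Eq. (10): word attention within each sentence, scaled by its sentence weight
beta = np.stack([alpha[i] * softmax(h[i] @ h_tk) for i in range(n_sents)])

# context vector: c_{t,k} = sum_i sum_j beta_{t,k}^{i,j} * h_{i,j}
c_tk = (beta[..., None] * h).sum(axis=(0, 1))

print(round(beta.sum(), 6))  # 1.0: beta is globally normalized because alpha is
print(c_tk.shape)            # (5,)
```

Because each per-sentence softmax sums to one and α_t sums to one, the combined distribution β is itself a valid distribution over all source words, so c_{t,k} is a convex combination of word states concentrated on the selected sentences.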
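The distant-supervision labels p_t and the selection loss of Eq. (12) can be sketched as below, assuming a simplified TF-IDF weighting (raw count times idf) and toy sentences; the exact similarity computation in the paper may differ, and all names are illustrative:

```python
import numpy as np
from collections import Counter

doc_sents = ["faith and hope were born in may",
             "they died in hospital days later",
             "staff cleared the baby section"]
ref_sent = "they died in hospital less than a month later"

def tfidf_vecs(texts):
    docs = [Counter(t.split()) for t in texts]
    vocab = sorted(set(w for d in docs for w in d))
    df = Counter(w for d in docs for w in d)  # document frequency (keys once per doc)
    idf = {w: np.log(len(docs) / df[w]) + 1.0 for w in vocab}
    return np.array([[d[w] * idf[w] for w in vocab] for d in docs])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

vecs = tfidf_vecs(doc_sents + [ref_sent])
sims = np.array([max(cosine(v, vecs[-1]), 1e-12) for v in vecs[:-1]])
p_t = sims / sims.sum()  # normalized distant label vector p_t

alpha_t = np.array([0.2, 0.7, 0.1])  # model's predicted selection vector (toy)
loss_sel = np.sum(alpha_t * np.log(alpha_t / p_t))  # D_KL(alpha_t || p_t), Eq. (12)
print(p_t.argmax())  # -> 1: the second document sentence matches the reference best
```

The label vector rewards source sentences that share rare content words with the reference sentence, which is exactly the signal Eq. (12) pushes α_t toward during training.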
Method             Rouge-1  Rouge-2  Rouge-L
Lead-3             40.34    17.70    36.57
SummaRuNNer-abs    37.5     14.5     33.4
SummaRuNNer        39.6     16.2     35.3
Seq2seq-baseline   36.64    15.66    33.42
ABS-temp-attn      35.46    13.30    32.65
Graph-attention    38.1     13.9     34.0
Deep-reinforced    39.87    15.82    36.90
Coverage           39.53    17.28    36.38
Our Model          41.54    18.18    36.47

Table 1: Rouge F1 scores on the test set. All our ROUGE scores have a 95% confidence interval of at most ±0.25, as reported by the official ROUGE script.

Method      Informat.  Concise  Coherent  Fluent
Lead-3      3.49*      3.19*    3.86      4.07*
Seq2seq-b.  3.11*      2.95*    3.08*     3.51*
Coverage    3.41*      3.25*    3.37      3.72
Our Model   3.76       3.49     3.65      3.80

Table 2: Human evaluation results. * indicates that the difference between Our Model and the other model is statistically significant (p < 0.1) by a two-tailed t-test.

We use dropout (Srivastava et al., 2014) with probability p = 0.5. After tuning on the validation set, the parameter λ is set to 0.2.

Model Training  We use the Adagrad (Duchi et al., 2011) algorithm with a learning rate of 0.1 and an initial accumulator value of 0.1 to optimize the model parameters θ. During training, we use gradient clipping with a maximum gradient norm of 2. Our model is trained on a single Tesla K40m GPU with a batch size of 16, and an epoch is set to contain 10,000 randomly sampled documents. Convergence is reached within 300 epochs.

Hierarchical Beam Search  To improve information correctness and avoid redundancy during the summary decoding process, we use the hierarchical beam search algorithm with the reference mechanism (Tan et al., 2017a) to generate multi-sentence summaries. Similar to (Tan et al., 2017a), the beam sizes for the word decoder and the sentence decoder are 15 and 2, respectively.

3.3 Baselines

We compare our system with the results of state-of-the-art neural summarization approaches reported in recent papers, covering both abstractive models and extractive models. The extractive models include SummaRuNNer (Nallapati et al., 2017), while SummaRuNNer-abs is similar to SummaRuNNer but is trained directly on the abstractive summaries. Lead-3 is a strong extractive baseline which uses the first 3 sentences of the document as the summary. The abstractive models include:
1) Seq2seq-baseline, which uses the basic seq2seq encoder-decoder structure with an attention mechanism and incorporates the copy mechanism as in (See et al., 2017).
2) ABS-temp-attn (Nallapati et al., 2016), which uses temporal attention on the seq2seq architecture to overcome the repetition problem.
3) Graph-attention (Tan et al., 2017a), which uses a graph-ranking based attention mechanism on a hierarchical architecture to identify important source sentences.
4) Deep-reinforced (Paulus et al., 2017), which trains the seq2seq encoder-decoder model with reinforcement learning techniques.
5) Coverage (See et al., 2017), which is an extension of the Seq2seq-baseline model that imports a coverage mechanism to control repetitions in the summary.

3.4 Evaluation

ROUGE Evaluation

We evaluate our models with the standard ROUGE metric (Lin, 2004) and obtain ROUGE scores using the pyrouge package. The results in Table 1 show that our method achieves a significant improvement over state-of-the-art neural abstractive baselines as well as extractive baselines. Note that the Deep-reinforced model achieves the best ROUGE-L performance because it directly optimizes the ROUGE-L metric. Compared with the current state-of-the-art model Coverage, our model achieves significantly better performance on the ROUGE-1 and ROUGE-2 metrics and comparable performance on the ROUGE-L metric, which demonstrates that our model is more effective at selecting salient information from a document to produce an informative summary while keeping the ability to generate fluent and correct sentences.

Human Evaluation with Case Analysis

In addition to the ROUGE evaluation, we also conducted a human evaluation on 50 random samples from the CNN/DailyMail test set and compared the summaries generated by our method with the outputs of Lead-3, Seq2seq-baseline and Coverage. Three data annotators were asked to compare the generated summaries with the human summaries, and to assess each summary from four independent
perspectives: (1) Informative: How informative is the summary? (2) Concise: How concise is the summary? (3) Coherent: How coherent (between sentences) is the summary? (4) Fluent: How fluent and grammatical are the sentences of the summary? Each property is assessed with a score from 1 (worst) to 5 (best) by three annotators. The average results are presented in Table 2.

The results show that our model consistently outperforms the Seq2seq-baseline model and the previous state-of-the-art method Coverage. An example comparison of the summaries generated by our model and the two abstractive models (w.r.t. the reference summary) is shown in Table 3 [1]. The summary generated by Seq2Seq-baseline usually contains repeated sentences, which seriously affects its informativeness, conciseness as well as coherence. For example, the sentence "faith and hope howie were dubbed the miracle twins when they were born ..." is repeated three times in Table 3. The Coverage model effectively alleviates the information repetition problem; however, it loses some salient information that should be included in the summary. For example, the information about "mementos" and "family members visit the grave" is lost in the example shown in Table 3. The summary generated by our method obviously contains more salient information, which shows the effectiveness of the information selection component in our model. According to the results in Table 2, the sentence-level modeling of the document and summary in our model also makes the generated summaries achieve better inter-sentence coherence. Compared with the strong extractive baseline Lead-3, our model is able to generate more informative and concise summaries, which shows the advantage of abstractive methods. The fluency scores also show the good ability of our model to generate fluent and grammatical sentences. The human evaluation results demonstrate that our model is able to generate more informative, concise and coherent summaries than the baselines.

The visualization of the sentence selection vectors of the gold reference summary and the three abstractive models when generating the examples presented in Table 3 is shown in Figure 2 [2]. The figure shows that Seq2Seq-baseline fails to detect all important source sentences and attends to the same sentences repeatedly, which results in generating repeated summary sentences. Coverage learns to reduce repetitions, but fails to detect all the salient information. Obviously, our method is more effective in selecting salient and relevant source sentences from the document to generate a more informative summary. Furthermore, our

Gold Reference: faith and hope howie were born with one body and two faces on may 8. / they tragically died in hospital just 19 days after they were born. / parents simon howie and renee young visit their grave at pinegrove in western sydney fortnightly. / they arrived on thursday to find the grave bare of all the girls' mementos. / staff had cleared entire baby section and thrown belongings in rubbish.

Seq2Seq-baseline: faith and hope howie were dubbed the miracle twins when they were born on may 8 last year with one body and two faces due to an extremely rare condition known as disrosopus. (the same sentence is generated three times)

Coverage: faith and hope howie were dubbed the miracle twins when they were born on may 8 last year with one body and two faces due to an extremely rare condition known as disrosopus. / they died in hospital less than a month after they were born and their parents, simon howie and renee young, laid them to rest at pinegrove memorial park in sydney's west.

Our Model: faith and hope howie were dubbed the miracle twins when they were born on may 8 last year with one body and two faces due to an extremely rare condition known as disrosopus. / they died in hospital less than a month after they were born and their parents, simon howie and renee young, laid them to rest at pinegrove memorial park in sydney's west. / family members have visited the grave every week to leave mementos and flowers for faith and hope, but when mr howie and ms young arrived on thursday they found the site completely bare.

Table 3: Examples of generated summaries. The Seq2Seq-baseline model generates repeated sentences and loses salient information. The Coverage model reduces repetitions, but also loses salient information. Our model can select more salient information from the original document and generate a more informative summary.

Figure 2: Visualization of the sentence selection vectors over input sentences I_1 to I_20 for (a) the Gold Reference, (b) Seq2Seq-baseline, (c) Coverage and (d) Our Model. I_i and O_i indicate the i-th sentence of the input and output, respectively. Obviously, our model can detect more salient sentences that are included in the reference summary.

[1] More examples are shown in the supplementary material.
[2] The sentence selection vectors of the Seq2seq-baseline model and the Coverage model are computed by summing the attention weights of all words in each sentence and then normalizing across sentences.
Method           Rouge-1  Rouge-2  Rouge-L
Our Model        41.54    18.18    36.47
– distS          40.02    17.54    34.87
– distS&gateF    39.26    16.96    33.92
– infoSelection  36.64    15.66    33.42

Table 4: Comparison results of removing different components of our method.

Method           Rouge-1  Rouge-2  Rouge-L
SummaRuNNer-abs  37.5     14.5     33.4
SummaRuNNer      39.6     16.2     35.3
OurExtractive    40.41    18.30    36.30
– distS          37.06    16.55    33.23
– distS&gateF    36.25    16.22    32.59

Table 5: Comparison results of sentence selection.
with different lengths of golden reference summaries. The results are shown in Table 6, which demonstrates that our method is better at generating long summaries for long documents. As the golden summary becomes longer, our system obtains a larger advantage over the baseline (from +1.0 Rouge-1, +0.1 Rouge-2 and -0.63 Rouge-L for summaries of less than 75 words, rising to +10.68 Rouge-1, +6.05 Rouge-2 and +4.86 Rouge-L for summaries of more than 125 words). The results also verify that our method is more effective at selecting salient information from documents, especially long documents.

5 Related Work

Existing work on document summarization can mainly be categorized into extractive methods and abstractive methods.

5.1 Extractive Summarization Methods

Neural networks have been widely investigated for the extractive document summarization task. Earlier work attempts to use deep learning techniques to improve sentence ranking or scoring (Cao et al., 2015a,b; Yin and Pei, 2015). Some recent work solves sentence extraction and document modeling in an end-to-end framework. Cheng and Lapata (2016) propose an encoder-decoder approach where the encoder hierarchically learns the representations of sentences and documents, while an attention-based sentence extractor extracts salient sentences sequentially from the original document. Nallapati et al. (2017) propose a recurrent neural network-based sequence-to-sequence model for sequentially labelling each sentence in the document. Neural models are able to leverage large-scale corpora and achieve better performance than traditional methods.

5.2 Abstractive Summarization Methods

As seq2seq learning with neural networks has achieved huge success in sequence generation tasks like machine translation, it also shows great potential in the text summarization area, especially for abstractive methods. Some earlier research studied the use of seq2seq learning for abstractive sentence summarization (Takase et al., 2016; Rush et al., 2015; Chopra et al., 2016). These models are trained on a large corpus of news documents, which are usually shortened to their first one or two sentences, paired with their headlines. Later, some work explored seq2seq models for document summarization, which produce a multi-sentence summary for a document. Seq2seq models usually exhibit some undesirable behaviors, such as inaccurately reproducing factual details, being unable to deal with out-of-vocabulary (OOV) words, and repetition. To alleviate these issues, the copy mechanism (Gu et al., 2016; Gulcehre et al., 2016; Nallapati et al., 2016) has been incorporated into the encoder-decoder architecture. A distraction-based attention model (Chen et al., 2016) and the coverage mechanism (See et al., 2017) have also been investigated to alleviate the repetition problem. To better train seq2seq models on tasks with long documents and multi-sentence summaries, a deep reinforced model was proposed to combine standard word prediction with teacher-forcing learning and global sequence prediction training with reinforcement learning (Paulus et al., 2017). Recently, Tan et al. (2017a) proposed to leverage a hierarchical encoder-decoder architecture for generating multi-sentence summaries, and to incorporate sentence ranking into the summary generation process based on a graph-based attention mechanism. Different from this neural-based work, our model explicitly models the information selection process in document summarization by extending the encoder-decoder framework with an information selection layer. Our model captures both the global document information and the local inter-sentence relations, and optimizes the information selection process directly via distantly-supervised training, which effectively combines the strengths of extractive methods and abstractive methods.

6 Conclusion

In this paper, we have analyzed the necessity of explicitly modeling and optimizing the information selection process in document summarization, and verified its effectiveness by extending the basic neural encoder-decoder framework with an information selection layer and optimizing it with distantly-supervised training. Our information selection layer consists of a gated global information filtering network and a local RNN sentence selection network. Experimental results demonstrate that both of them are effective in helping to select salient information during the summary generation process, which significantly improves
the document summarization performance. Our model combines the strengths of extractive methods and abstractive methods, can generate more informative and concise summaries, and thus achieves state-of-the-art abstractive document summarization performance while also being competitive with state-of-the-art extractive models.

References

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.

Sho Takase, Jun Suzuki, Naoaki Okazaki, Tsutomu Hirao, and Masaaki Nagata. 2016. Neural headline generation on abstract meaning representation. In EMNLP, pages 1054–1059.