Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1787–1796
Brussels, Belgium, October 31 - November 4, 2018.
© 2018 Association for Computational Linguistics
Figure 1: Our abstractive document summarization model, which mainly consists of three layers: the document encoder layer (the top part), the information selection layer (the middle part) and the summary decoder layer (the bottom part).

summary decoder layer, as shown in Figure 1. In our model, both the document and the summary are processed sentence by sentence, to better capture the inter-sentence relations. The information selection layer consists of two parts: gated global information filtering and local sentence selection. Unnecessary information in the original document is first globally filtered by a gated network; then important sentences are selected locally while generating each summary sentence sequentially. Moreover, we propose to optimize the information selection process with distantly-supervised training. Our proposed method combines the strengths of extractive methods and abstractive methods, and is able to tackle the factors of saliency, non-redundancy, coherence and fluency under a unified framework. We conduct extensive experiments on benchmark datasets, and the results demonstrate that explicit modeling and distantly-supervised optimization of the information selection process improves document summarization performance significantly, enabling our model to significantly outperform previous state-of-the-art neural abstractive methods.

A gated network first filters the sentence representations based on the global document representation. A sentence selection RNN is then used to select salient and relevant sentences while generating each summary sentence sequentially, based on the tailored sentence representations. At last, the summary decoder produces the output summary to paraphrase and generalize the selected sentences.

In the following, we denote h_i and h_{i,j} as the hidden state of the i-th sentence and of the j-th word of the i-th sentence in the document encoder part, respectively. In the information selection and summary decoder part, we denote h'_t and h'_{t,k} as the hidden state of the t-th summary sentence and of the k-th word in the t-th summary sentence, respectively.

2.1 Document Encoder

A document d is a sequence of sentences d = {s_i}, and each sentence is a sequence of words s_i = {w_{i,j}}. A hierarchical encoder, which consists of two levels (word level and sentence level) similar to (Nallapati et al., 2016), is used to encode the document at both the word and the sentence level. The word-level encoder is a bidirectional Gated Recurrent Unit (GRU) (Chung et al., 2014), which encodes the words of a sentence into a sentence representation. The word encoder sequentially updates its hidden state after receiving a word, which is formulated as:

    h_{i,j} = BiGRU(h_{i,j-1}, e_{i,j})    (1)

where h_{i,j} and e_{i,j} denote the hidden state and the embedding of word w_{i,j}, respectively.

The concatenation of the forward and backward final hidden states of the word-level encoder is taken as the vector representation x_i of the sentence s_i, which is used as input to the sentence-level encoder. The sentence encoder is also a bidirectional GRU, which updates its hidden state after receiving each sentence representation by:

    h_i = BiGRU(h_{i-1}, x_i)    (2)
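To make the hierarchical encoding of Eqs. (1)-(2) concrete, here is a minimal numpy sketch: a toy bidirectional GRU encodes each sentence word by word, and a second BiGRU runs over the resulting sentence vectors. All names, dimensions and the initialization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
H, E = 4, 3  # hidden size, word-embedding size (toy values)

def make_gru(in_dim, hid):
    # one (W, U, b) triple per gate: update z, reset r, candidate n
    return {g: (rng.normal(size=(hid, in_dim)) * 0.1,
                rng.normal(size=(hid, hid)) * 0.1,
                np.zeros(hid)) for g in "zrn"}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(p, h, x):
    Wz, Uz, bz = p["z"]; Wr, Ur, br = p["r"]; Wn, Un, bn = p["n"]
    z = sigmoid(Wz @ x + Uz @ h + bz)
    r = sigmoid(Wr @ x + Ur @ h + br)
    n = np.tanh(Wn @ x + Un @ (r * h) + bn)
    return (1 - z) * h + z * n

def bigru(p_fwd, p_bwd, xs, hid):
    # Eqs. (1)/(2): run a GRU in both directions, concatenate the final states
    hf = np.zeros(hid)
    for x in xs:
        hf = gru_step(p_fwd, hf, x)
    hb = np.zeros(hid)
    for x in reversed(xs):
        hb = gru_step(p_bwd, hb, x)
    return np.concatenate([hf, hb])

word_enc = (make_gru(E, H), make_gru(E, H))
sent_enc = (make_gru(2 * H, H), make_gru(2 * H, H))

doc = [rng.normal(size=(5, E)), rng.normal(size=(7, E))]  # 2 sentences of word embeddings
sent_reps = [bigru(*word_enc, s, H) for s in doc]          # word level -> x_i
doc_rep = bigru(*sent_enc, sent_reps, H)                   # sentence level
print(doc_rep.shape)  # (8,) = 2 * H
```

The concatenated forward/backward final states play the role of x_i at the word level and of the global document representation at the sentence level.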
information and select salient information from the input document to produce a condensed summary. However, it is difficult for the basic encoder-decoder framework to learn the process of salient information selection, which has also been noticed by several previous works (Tan et al., 2017a,b). To tackle this challenge, we extend the basic encoder-decoder framework by adding an information selection layer that models the information selection process explicitly. Our information selection layer consists of two parts: gated global information filtering, which is used to remove the unnecessary information of a document, and local sentence selection.

Local Sentence Selection

We explicitly model the local sentence selection process, which selects several target sentences to generate each summary sentence. Concretely, we apply an RNN layer to sequentially select target sentences for each summary sentence, as shown in Figure 1. The sentence-selection RNN uses the document representation d̂ as its initial state h'_0 and sequentially predicts the sentence selection vector α_t as follows:

    α_t^i = exp(ϕ(f_i, h'_t)) / Σ_l exp(ϕ(f_l, h'_t))    (5)
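As an illustration of Eq. (5), the sketch below computes one sentence-selection vector α_t over filtered sentence representations f_i with numpy. The score function ϕ is left unspecified in the text, so a bilinear form is assumed here purely for demonstration; all sizes and values are toy examples.

```python
import numpy as np

rng = np.random.default_rng(1)
H = 4
f = rng.normal(size=(6, H))        # filtered sentence representations f_i
h_t = rng.normal(size=H)           # sentence-selection RNN state h'_t
W = rng.normal(size=(H, H)) * 0.1  # bilinear parameter for phi (our assumption)

def select(f, h_t):
    scores = f @ (W @ h_t)                         # phi(f_i, h'_t)
    scores -= scores.max()                         # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # Eq. (5): softmax over sentences
    return alpha

alpha_t = select(f, h_t)
print(round(alpha_t.sum(), 6))  # 1.0
```

The softmax guarantees that α_t is a proper distribution over source sentences, which is what lets it later be supervised against the distant labels p_t via a KL-divergence.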
embedding e_{t,k-1} and context vector c_{t,k-1} as inputs to compute the new hidden state h'_{t,k} by:

    h'_{t,k} = GRU(h'_{t,k-1}, c_{t,k-1}, e_{t,k-1})    (9)

We import an attention mechanism to help locate relevant words to be copied or paraphrased within the selected source sentences at each word generation step. The attention distribution β_{t,k}^{i,j} of the k-th word of the t-th summary sentence over the words of the i-th document sentence can be computed as:

    β_{t,k}^{i,j} = α_t^i · exp(ϕ(h_{i,j}, h'_{t,k})) / Σ_l exp(ϕ(h_{i,l}, h'_{t,k}))    (10)

where α_t^i denotes the weight of the i-th source sentence, which is used to normalize the word attention distributions. The word-level context vector when generating the k-th word at the t-th sentence generation step can then be computed as c_{t,k} = Σ_i Σ_j β_{t,k}^{i,j} h_{i,j}, which is also incorporated into the word decoder.

At each word generation step, the vocabulary distribution is calculated from the context vector c_{t,k} and the decoder state h'_{t,k} by:

    P_vocab(w_{t,k}) = softmax(W_v(W_c[h'_{t,k}, c_{t,k}] + b_c) + b_v)    (11)

where W_v and W_c are learned parameters. The copy mechanism based on the word attention is also imported into the decoder to alleviate the OOV problem, as in (See et al., 2017).

2.4 Model Learning with Distant Supervision

Besides the end-to-end training for the quality of the generated summary, we also directly optimize the sentence selection decisions by importing supervision for the sentence selection vector α_t in Equation 5. While there is no explicit supervision for sentence selection, we define a simple approach for labeling sentences based on the reference summaries. To simulate the sentence selection process on human-written abstracts, we compute the word-matching similarities (based on TF-IDF cosine similarity) between a reference-summary sentence and the corresponding source document sentences, and normalize them into a distantly-labelled sentence selection vector p_t. The sentence selection loss is then defined as:

    loss_sel = Σ_t D_KL(α_t, p_t)    (12)

where D_KL(α_t, p_t) indicates the KL-divergence between the distributions α_t and p_t. The sentence selection loss is imported into the final loss function to be optimized together with the summary generation component.

The loss function L of the model is a mix of the negative log-likelihood of generating the summaries over the training set T and the sentence selection loss of distantly-supervised training:

    L = Σ_{(X,Y)∈T} -log P(Y|X; θ) + λ·loss_sel    (13)

where λ is a hyper-parameter tuned on the validation set, and (X, Y) denotes a document-summary pair in the training set.

3 Experiments

3.1 Dataset

We conduct our experiments on the large-scale CNN/DailyMail corpus, which has been widely used for the exploration of summarizing documents with multi-sentence summaries. The corpus was originally constructed in (Hermann et al., 2015) by collecting human-generated highlights from news stories on the CNN and DailyMail websites; it contains input documents of about 800 tokens on average and multi-sentence summaries of up to 200 tokens. We use the same version of the data as (See et al., 2017), which has 280,125 training pairs, 13,367 validation pairs and 11,489 test pairs in total after discarding the examples with empty article text. Some previous work (Nallapati et al., 2016, 2017; Paulus et al., 2017; Tan et al., 2017a) uses the anonymized version of the data, which has been pre-processed to replace each named entity with a unique identifier. By contrast, we use the non-anonymized data, as in (See et al., 2017), which is a more favorable and challenging setting because it requires no pre-processing.

3.2 Implementation Details

Model Parameters  For all experiments, the word-level encoder and the summary decoder both use 256-dimensional hidden states, and the sentence-level encoder and the sentence selection network both use 512-dimensional hidden states. We use pre-trained GloVe (Pennington et al., 2014) vectors to initialize the word embeddings. The dimension of the word embeddings is 100, and they are further trained with the model. We use a vocabulary of 50k words for both the encoder and the decoder.
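The hierarchical word attention of Eq. (10) and the context vector c_{t,k} can be sketched as follows. A dot-product score is assumed for ϕ, and all shapes and the selection weights are toy values, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(2)
n_sents, n_words, H = 3, 4, 5
h = rng.normal(size=(n_sents, n_words, H))  # word-encoder states h_{i,j}
h_tk = rng.normal(size=H)                   # word-decoder state h'_{t,k}
alpha = np.array([0.6, 0.3, 0.1])           # sentence selection weights alpha_t^i (toy)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Eq. (10): word attention within each sentence, scaled by its sentence weight
beta = np.stack([alpha[i] * softmax(h[i] @ h_tk) for i in range(n_sents)])

# context vector: c_{t,k} = sum_i sum_j beta_{t,k}^{i,j} * h_{i,j}
c_tk = (beta[..., None] * h).sum(axis=(0, 1))

print(round(beta.sum(), 6))  # 1.0: beta is globally normalized because alpha is
print(c_tk.shape)            # (5,)
```

Because each per-sentence softmax sums to one and α_t sums to one, the combined distribution β is itself a valid distribution over all source words, so c_{t,k} is a convex combination of word states concentrated on the selected sentences.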
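The distant-supervision labels p_t and the selection loss of Eq. (12) can be sketched as below, assuming a simplified TF-IDF weighting (raw count times idf) and toy sentences; the exact similarity computation in the paper may differ, and all names are illustrative:

```python
import numpy as np
from collections import Counter

doc_sents = ["faith and hope were born in may",
             "they died in hospital days later",
             "staff cleared the baby section"]
ref_sent = "they died in hospital less than a month later"

def tfidf_vecs(texts):
    docs = [Counter(t.split()) for t in texts]
    vocab = sorted(set(w for d in docs for w in d))
    df = Counter(w for d in docs for w in d)  # document frequency (keys once per doc)
    idf = {w: np.log(len(docs) / df[w]) + 1.0 for w in vocab}
    return np.array([[d[w] * idf[w] for w in vocab] for d in docs])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

vecs = tfidf_vecs(doc_sents + [ref_sent])
sims = np.array([max(cosine(v, vecs[-1]), 1e-12) for v in vecs[:-1]])
p_t = sims / sims.sum()  # normalized distant label vector p_t

alpha_t = np.array([0.2, 0.7, 0.1])  # model's predicted selection vector (toy)
loss_sel = np.sum(alpha_t * np.log(alpha_t / p_t))  # D_KL(alpha_t || p_t), Eq. (12)
print(p_t.argmax())  # -> 1: the second document sentence matches the reference best
```

The label vector rewards source sentences that share rare content words with the reference sentence, which is exactly the signal Eq. (12) pushes α_t toward during training.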
Method             Rouge-1  Rouge-2  Rouge-L
Lead-3             40.34    17.70    36.57
SummaRuNNer-abs    37.5     14.5     33.4
SummaRuNNer        39.6     16.2     35.3
Seq2seq-baseline   36.64    15.66    33.42
ABS-temp-attn      35.46    13.30    32.65
Graph-attention    38.1     13.9     34.0
Deep-reinforced    39.87    15.82    36.90
Coverage           39.53    17.28    36.38
Our Model          41.54    18.18    36.47

Table 1: Rouge F1 scores on the test set. All our ROUGE scores have a 95% confidence interval of at most ±0.25, as reported by the official ROUGE script.

Method      Informat.  Concise  Coherent  Fluent
Lead-3      3.49*      3.19*    3.86      4.07*
Seq2seq-b.  3.11*      2.95*    3.08*     3.51*
Coverage    3.41*      3.25*    3.37      3.72
Our Model   3.76       3.49     3.65      3.80

Table 2: Human evaluation results. * indicates that the difference between Our Model and the other model is statistically significant (p < 0.1) by a two-tailed t-test.

We use dropout (Srivastava et al., 2014) with probability p = 0.5. After tuning on the validation set, the parameter λ is set to 0.2.

Model Training  We use the Adagrad (Duchi et al., 2011) algorithm with a learning rate of 0.1 and an initial accumulator value of 0.1 to optimize the model parameters θ. During training, we use gradient clipping with a maximum gradient norm of 2. Our model is trained on a single Tesla K40m GPU with a batch size of 16, and an epoch is set to contain 10,000 randomly sampled documents. Convergence is reached within 300 epochs.

Hierarchical Beam Search  To improve information correctness and avoid redundancy during the summary decoding process, we use the hierarchical beam search algorithm with the reference mechanism (Tan et al., 2017a) to generate multi-sentence summaries. Similar to (Tan et al., 2017a), the beam sizes for the word decoder and the sentence decoder are 15 and 2, respectively.

3.3 Baselines

We compare our system with the results of state-of-the-art neural summarization approaches reported in recent papers, covering both abstractive models and extractive models. The extractive models include SummaRuNNer (Nallapati et al., 2017), while SummaRuNNer-abs is similar to SummaRuNNer but is trained directly on the abstractive summaries. Lead-3 is a strong extractive baseline which uses the first 3 sentences of the document as the summary. The abstractive models include:
1) Seq2seq-baseline, which uses the basic seq2seq encoder-decoder structure with an attention mechanism and incorporates the copy mechanism as in (See et al., 2017).
2) ABS-temp-attn (Nallapati et al., 2016), which uses temporal attention on the seq2seq architecture to overcome the repetition problem.
3) Graph-attention (Tan et al., 2017a), which uses a graph-ranking based attention mechanism on a hierarchical architecture to identify important source sentences.
4) Deep-reinforced (Paulus et al., 2017), which trains the seq2seq encoder-decoder model with reinforcement learning techniques.
5) Coverage (See et al., 2017), which is an extension of the Seq2seq-baseline model that imports a coverage mechanism to control repetitions in the summary.

3.4 Evaluation

ROUGE Evaluation

We evaluate our models with the standard ROUGE metric (Lin, 2004) and obtain ROUGE scores using the pyrouge package. The results in Table 1 show that our method achieves a significant improvement over state-of-the-art neural abstractive baselines as well as extractive baselines. Note that the Deep-reinforced model achieves the best ROUGE-L performance because it directly optimizes the ROUGE-L metric. Compared with the current state-of-the-art model Coverage, our model achieves significantly better performance on the ROUGE-1 and ROUGE-2 metrics and comparable performance on the ROUGE-L metric, which demonstrates that our model is more effective at selecting salient information from a document to produce an informative summary while keeping the ability to generate fluent and correct sentences.

Human Evaluation with Case Analysis

In addition to the ROUGE evaluation, we also conducted a human evaluation on 50 random samples from the CNN/DailyMail test set and compared the summaries generated by our method with the outputs of Lead-3, Seq2seq-baseline and Coverage. Three data annotators were asked to compare the generated summaries with the human summaries, and to assess each summary from four independent
perspectives: (1) Informative: How informative is the summary? (2) Concise: How concise is the summary? (3) Coherent: How coherent (between sentences) is the summary? (4) Fluent: How fluent and grammatical are the sentences of the summary? Each property is assessed with a score from 1 (worst) to 5 (best) by three annotators. The average results are presented in Table 2.

The results show that our model consistently outperforms the Seq2seq-baseline model and the previous state-of-the-art method Coverage. An example comparison of the summaries generated by our model and the two abstractive models (w.r.t. the reference summary) is shown in Table 3 [1]. The summary generated by Seq2Seq-baseline usually contains repeated sentences, which seriously affects its informativeness, conciseness as well as coherence. For example, the sentence "faith and hope howie were dubbed the miracle twins when they were born ..." is repeated three times in Table 3. The Coverage model effectively alleviates the information repetition problem; however, it loses some salient information that should be included in the summary. For example, the information about "mementos" and "family members visit the grave" is lost in the example shown in Table 3. The summary generated by our method obviously contains more salient information, which shows the effectiveness of the information selection component in our model. According to the results in Table 2, the sentence-level modeling of the document and summary in our model also makes the generated summaries achieve better inter-sentence coherence. Compared with the strong extractive baseline Lead-3, our model is able to generate more informative and concise summaries, which shows the advantage of abstractive methods. The fluency scores also show the good ability of our model to generate fluent and grammatical sentences. The human evaluation results demonstrate that our model is able to generate more informative, concise and coherent summaries than the baselines.

The visualization of the sentence selection vectors of the gold reference summary and the three abstractive models when generating the examples presented in Table 3 is shown in Figure 2 [2]. The figure shows that Seq2Seq-baseline fails to detect all important source sentences and attends to the same sentences repeatedly, which results in generating repeated summary sentences. Coverage learns to reduce repetitions, but fails to detect all the salient information. Obviously, our method is more effective in selecting salient and relevant source sentences from the document to generate a more informative summary. Furthermore, our

Gold Reference: faith and hope howie were born with one body and two faces on may 8. / they tragically died in hospital just 19 days after they were born. / parents simon howie and renee young visit their grave at pinegrove in western sydney fortnightly. / they arrived on thursday to find the grave bare of all the girls' mementos. / staff had cleared entire baby section and thrown belongings in rubbish.

Seq2Seq-baseline: faith and hope howie were dubbed the miracle twins when they were born on may 8 last year with one body and two faces due to an extremely rare condition known as disrosopus. (the same sentence is generated three times)

Coverage: faith and hope howie were dubbed the miracle twins when they were born on may 8 last year with one body and two faces due to an extremely rare condition known as disrosopus. / they died in hospital less than a month after they were born and their parents, simon howie and renee young, laid them to rest at pinegrove memorial park in sydney's west.

Our Model: faith and hope howie were dubbed the miracle twins when they were born on may 8 last year with one body and two faces due to an extremely rare condition known as disrosopus. / they died in hospital less than a month after they were born and their parents, simon howie and renee young, laid them to rest at pinegrove memorial park in sydney's west. / family members have visited the grave every week to leave mementos and flowers for faith and hope, but when mr howie and ms young arrived on thursday they found the site completely bare.

Table 3: Examples of generated summaries. The Seq2Seq-baseline model generates repeated sentences and loses salient information. The Coverage model reduces repetitions, but also loses salient information. Our model can select more salient information from the original document and generate a more informative summary.

Figure 2: Visualization of the sentence selection vectors over input sentences I_1 to I_20 for (a) the Gold Reference, (b) Seq2Seq-baseline, (c) Coverage and (d) Our Model. I_i and O_i indicate the i-th sentence of the input and output, respectively. Obviously, our model can detect more salient sentences that are included in the reference summary.

[1] More examples are shown in the supplementary material.
[2] The sentence selection vectors of the Seq2seq-baseline model and the Coverage model are computed by summing the attention weights of all words in each sentence and then normalizing across sentences.
Method           Rouge-1  Rouge-2  Rouge-L
Our Model        41.54    18.18    36.47
– distS          40.02    17.54    34.87
– distS&gateF    39.26    16.96    33.92
– infoSelection  36.64    15.66    33.42

Table 4: Comparison results of removing different components of our method.

Method           Rouge-1  Rouge-2  Rouge-L
SummaRuNNer-abs  37.5     14.5     33.4
SummaRuNNer      39.6     16.2     35.3
OurExtractive    40.41    18.30    36.30
– distS          37.06    16.55    33.23
– distS&gateF    36.25    16.22    32.59

Table 5: Comparison results of sentence selection.
with different lengths of golden reference summaries. The results are shown in Table 6, which demonstrates that our method is better at generating long summaries for long documents. As the golden summary becomes longer, our system obtains a larger advantage over the baseline (from +1.0 Rouge-1, +0.1 Rouge-2 and -0.63 Rouge-L for summaries of less than 75 words, rising to +10.68 Rouge-1, +6.05 Rouge-2 and +4.86 Rouge-L for summaries of more than 125 words). The results also verify that our method is more effective at selecting salient information from documents, especially long documents.

5 Related Work

Existing work on document summarization can mainly be categorized into extractive methods and abstractive methods.

5.1 Extractive Summarization Methods

Neural networks have been widely investigated for the extractive document summarization task. Earlier work attempts to use deep learning techniques to improve sentence ranking or scoring (Cao et al., 2015a,b; Yin and Pei, 2015). Some recent work solves sentence extraction and document modeling in an end-to-end framework. Cheng and Lapata (2016) propose an encoder-decoder approach where the encoder hierarchically learns the representations of sentences and documents, while an attention-based sentence extractor extracts salient sentences sequentially from the original document. Nallapati et al. (2017) propose a recurrent neural network-based sequence-to-sequence model for sequentially labelling each sentence in the document. Neural models are able to leverage large-scale corpora and achieve better performance than traditional methods.

5.2 Abstractive Summarization Methods

As seq2seq learning with neural networks has achieved huge success in sequence generation tasks like machine translation, it also shows great potential in the text summarization area, especially for abstractive methods. Some earlier research studied the use of seq2seq learning for abstractive sentence summarization (Takase et al., 2016; Rush et al., 2015; Chopra et al., 2016). These models are trained on a large corpus of news documents, which are usually shortened to their first one or two sentences, paired with their headlines. Later, some work explored seq2seq models for document summarization, which produce a multi-sentence summary for a document. Seq2seq models usually exhibit some undesirable behaviors, such as inaccurately reproducing factual details, being unable to deal with out-of-vocabulary (OOV) words, and repetition. To alleviate these issues, the copy mechanism (Gu et al., 2016; Gulcehre et al., 2016; Nallapati et al., 2016) has been incorporated into the encoder-decoder architecture. A distraction-based attention model (Chen et al., 2016) and the coverage mechanism (See et al., 2017) have also been investigated to alleviate the repetition problem. To better train seq2seq models on tasks with long documents and multi-sentence summaries, a deep reinforced model was proposed to combine standard word prediction with teacher-forcing learning and global sequence prediction training with reinforcement learning (Paulus et al., 2017). Recently, Tan et al. (2017a) proposed to leverage a hierarchical encoder-decoder architecture for generating multi-sentence summaries, and to incorporate sentence ranking into the summary generation process based on a graph-based attention mechanism. Different from this neural-based work, our model explicitly models the information selection process in document summarization by extending the encoder-decoder framework with an information selection layer. Our model captures both the global document information and the local inter-sentence relations, and optimizes the information selection process directly via distantly-supervised training, which effectively combines the strengths of extractive methods and abstractive methods.

6 Conclusion

In this paper, we have analyzed the necessity of explicitly modeling and optimizing the information selection process in document summarization, and verified its effectiveness by extending the basic neural encoder-decoder framework with an information selection layer and optimizing it with distantly-supervised training. Our information selection layer consists of a gated global information filtering network and a local RNN sentence selection network. Experimental results demonstrate that both of them are effective in helping to select salient information during the summary generation process, which significantly improves
the document summarization performance. Our model combines the strengths of extractive methods and abstractive methods, can generate more informative and concise summaries, and thus achieves state-of-the-art abstractive document summarization performance while also being competitive with state-of-the-art extractive models.

References

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.

Sho Takase, Jun Suzuki, Naoaki Okazaki, Tsutomu Hirao, and Masaaki Nagata. 2016. Neural headline generation on abstract meaning representation. In EMNLP, pages 1054–1059.