Abstract

In this paper a comparative study of Automated Text Summarization (TS) systems is presented. It describes the factors to be taken into account for evaluating those systems and outlines three alternative classifications. The paper provides extensive examples of working TS systems according to their characterizing features, performance, and obtained results, with a special emphasis on the multilingual aspect of summarization.

Inteligencia Artificial, Revista Iberoamericana de Inteligencia Artificial, No. 20 (2003), pp. 34-52.
ISSN: 1137-3601. AEPIA (http://www.aepia.org/revista)
2 Summary Aspects

Summarization has traditionally been decomposed into three phases [147, 101, 58, 68, 98]:

- analyzing the input text to obtain a text representation,
- transforming it into a summary representation,
- and synthesizing an appropriate output form to generate the summary text.

Effective summarizing requires an explicit and detailed analysis of context factors, as is apparent when we recognize that what summaries should be like is defined by what they are wanted for. The parameters to be taken into account in summarization systems have been widely discussed [101, 68, 98]. We will follow Sparck Jones (1998) [147], who distinguishes three main aspects that affect the process of TS: input, purpose and output, with a special focus on multilinguality.

2.1 Input Aspects

The features of the text to be summarized crucially determine the way a summary can be obtained. The following aspects of input are relevant to the task of TS:

Document Structure. Besides textual content, heterogeneous documental information can be found in a source document, for example, labels that mark headers, chapters, sections, lists, tables, etc. If it is well systematized and exploited, this information can be of use to analyze the document. For example, Kan (2002) [74] exploits the organization of medical articles in sections to build a tree-like representation of the source. Teufel and Moens (2002) [156] systematize the structural properties of scientific articles to assess the contribution of each textual segment to the article, in order to build a summary from that enriched perspective. However, it can also be the case that the information it provides is not the target of the analysis. In this case, document structure has to be removed in order to isolate the textual component of the document.

Domain. Domain-sensitive systems are only capable of obtaining summaries of texts that belong to a pre-determined domain, with varying degrees of portability. Some considerations on degrees of portability: the restriction to a certain domain is usually compensated by the fact that specialized systems can apply knowledge-intensive techniques which are only feasible in controlled domains, as is the case of the multidocument summarizer SUMMONS [111], specialized in summaries in the terrorism domain applying complex Information Extraction techniques. In contrast, general-purpose systems are not dependent on information about domains, which usually results in a more shallow approach to the analysis of the input documents.

Nevertheless, some general-purpose systems are prepared to exploit domain-specific information. For example, the meta-summarizer developed at Columbia University [19, 18, 61, 60, 108] applies different summarizers for different kinds of documents: MULTIGEN [19, 109] is specialized in simple events, DEMS [143] (with the bio configuration) deals with biographies, and for the rest of documents, DEMS has a default configuration that can be resorted to.

Specialization level. A text may be broadly characterized as ordinary, specialized, or restricted, in relation to the presumed subject knowledge of the source text readers. This aspect can be considered the same as the domain aspect discussed above.

Restriction on the language. The language of the input can be general language or restricted to a sublanguage within a domain, purpose or audience. It may be necessary to preserve the sublanguage in the summary.

Scale. Different summarizing strategies have to be adopted to handle different text lengths. Indeed, the analysis of the input text can be performed at different granularities, for example, in determining meaning units. In the case of news articles, sentences or even clauses are usually considered the minimal meaning units, whereas for longer documents, like reports or books, paragraphs seem a more adequate unit of meaning. Also the techniques for segmenting the input text into these meaning units differ: for shorter texts, orthography and syntax, even discourse boundaries [103], indicate significant boundaries; for longer texts, topic segmentation [79, 63] is more usual.

Media. Although the main focus of summarization is textual summarization, summaries of non-textual documents, like videos, meeting records, images or tables, have also been undertaken in recent years. The complexity of multimedia summarization has prevented the development of wide-coverage systems, which means that most summarization systems that can handle multimedia information are limited to specific domains or textual genres [62, 104]. However, research efforts also consider the integration of information of different media [21], which allows a wider coverage of multimedia summarization systems by exploiting different kinds of documental information collaboratively, like metadata associated to video records [161].

Genre. Some systems exploit typical genre-determined characteristics of texts, such as the pyramidal organization of newspaper articles, or the argumentative development of a scientific article. Some summarizers are independent of the type of document to be summarized, while others are specialized on some type of documents: healthcare reports [48], medical articles [74], agency news [111], broadcast fragments [62], meeting recordings [169], e-mails [117, 3], web pages [132], etc.

Unit. The input to the summarization process can be a single document or multiple documents, either simple text or multimedia information such as imagery, audio, or video [150].

Language. Systems can be language-independent, exploiting characteristics of documents that hold cross-linguistically [129, 125], or else their architecture can be determined by the features of a concrete language. This means that some adaptations must be carried out in the system to deal with different languages. As an additional improvement, some multi-document systems are able to deal simultaneously with documents in different languages [33, 34], which will be developed in Section 2.4.

2.2 Purpose Aspects

Situation. TS systems can perform general summarization or else they can be embedded in larger systems, as an intermediate step for another NLP task, like Machine Translation, Information Retrieval or Question Answering. As the field evolves, more and more efforts are devoted to task-driven summarization, to the detriment of a more general approach to TS. This is due to the fact that underspecification of the information needs poses a major problem for the design and evaluation of the systems. As will be discussed in Section 5, evaluation is a major problem in TS. Task-driven summarization presents the advantage that systems can be evaluated with respect to the improvement they introduce in the final task they are applied to.

Audience. In case a user profile is accessible, summaries can be adapted to the needs of specific users, for example, the user's prior knowledge on a determined subject. Background summaries assume that the reader's prior knowledge is poor, and so extensive information is supplied, while just-the-news are those kinds of summaries conveying only the newest information on an already known subject. Briefings are a particular case of the latter, since they collect representative information from a set of related documents.

Usage. Summaries can be sensitive to determined uses: retrieving source text [75], previewing a text [88], refreshing the memory of an already read text, sorting...

2.3 Output Aspects

Content. A summary may try to represent all relevant features of a source text or it may focus on some specific ones, which can be determined by queries, subjects, etc. Generic summaries are text-driven, while user-focused (or query-driven) ones rely on a specification of the user's information need, like a question or key words.

Related to the kind of content that is to be extracted, different computational approaches are applied. The two basic approaches are top-down, using information extraction techniques, and bottom-up, more similar to information retrieval procedures. Top-down is used in query-driven summaries, when criteria of interest are encoded as a search specification, and this specification is used by the system to filter or analyze text portions. The strategies applied in this approach are similar to those of Question Answering. On the other hand, bottom-up is used in text-driven summaries, when generic importance metrics are encoded as strategies, which are then applied over a representation of the whole text.

Format. The output of a summarization system can be plain text, or else it can be formatted. Formatting can be targeted to many purposes: conforming to a pre-determined style (tags, organization in fields), improving readability (division in sections, highlighting), etc.

Style. A summary can be informative, if it covers the topics in the source text; indicative, if it provides a brief survey of the topics addressed in the original; aggregative, if it supplies information not present in the source text that completes some of its information or elicits some hidden information [156]; or critical, if it provides an additional evaluation of the summarized text.

Production Process. The resulting summary text can be an extract, if it is composed of literal fragments of text, or an abstract, if it is generated. The type of summary output desired can be relatively polished, for example, if text is well-formed and connected, or else more fragmentary in nature (e.g., a list of key words).

There are intermediate options, mostly concerning the nature of the fragments that compose extracts, which can range from topic-like passages, paragraph or multiparagraph long, to clauses or even phrases. In addition, some approaches perform editing operations in the summary, overcoming the incoherence and redundancy often found in extracts, but at the same time avoiding the high cost of a NL generation system. Jing and McKeown (2000) [73] apply six re-writing strategies to improve the general quality of an extract-based summary by edition operations like deletion, completion or substitution of clausal constituents.

Surrogation. Summaries can stand in place of the source as a surrogate, or they can be linked to the source [75, 88], or even be presented in the context of the source (e.g., by highlighting source text, [86]).

Length. The targeted length of the summary crucially affects the informativeness of the final result. This length can be determined by a compression rate, that is to say, a ratio of the summary length with respect to the length of the original text. Traditionally, compression rates range from 1% to 30%, with 10% as a preferred rate for article summarization. In the case of multidocument summarization, though, length cannot be determined as a ratio to the original text(s), so the summary always conforms to a pre-determined length. Summary length can also be determined by the physical context where the summary is to be displayed. For example, in the case of delivery of news summaries to handhelds [23, 28, 39], the size of the screen imposes severe restrictions to the length of the summary. Headline generation is another application where the length of summaries is clearly determined [165, 41]. In very short summaries, coherence is usually sacrificed to informativeness, so lists of words are considered acceptable [80, 167].
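As a quick worked illustration of the compression-rate arithmetic (the greedy sentence-level truncation policy here is our own simplifying assumption, not a method from the literature):

```python
def target_length(original_tokens: int, rate: float = 0.10) -> int:
    """Token budget for a summary at the given compression rate."""
    return max(1, round(original_tokens * rate))

def truncate_extract(sentences, budget):
    """Greedily keep whole sentences until the token budget is exhausted."""
    kept, used = [], 0
    for s in sentences:
        n = len(s.split())
        if used + n > budget:
            break
        kept.append(s)
        used += n
    return kept

# A 2,000-word article at the 10% rate preferred for article
# summarization yields a 200-word budget.
print(target_length(2000))  # -> 200
```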
2.4 Language coverage

As regards language coverage, systems can be classified as monolingual, multilingual, and crosslingual (a similar classification is commonly used in Information Retrieval systems). Monolingual summarization systems deal with only one language for both the input document and the summary. In the case of multilingual systems, input and output languages are also the same, but in this case the system can cover several languages. Crosslingual systems are able to process input documents in several languages, producing summaries in different languages.

Multilinguality does not imply additional difficulties. Most of the systems and techniques we will present below can be easily adapted to other languages, assuming, of course, the availability of the knowledge sources needed for the different methods. Roughly speaking, the more linguistic knowledge is needed by a system, the more difficult it is to transport it to another language.

A more complex challenge is crosslinguality. There are examples of single-document crosslingual summarizers, implying a certain amount of translation, either on the input text or on the summary, but most crosslingual summarizers are multidocument. In this case a lot of problems specific of translinguality arise. Measures of similarity between documents and passages in different languages, for identifying relations or for clustering, have to be envisaged. Similarity between lexical units (words, NEs, multiword terms) belonging to different languages has to be computed as well. Obviously, the more distant the involved languages are, the harder these problems turn to be, specially if the languages present different lexical units or character sets. Since this is a burning issue, it will be discussed at length in Section 5.

3 Approaches to Text Summarization

There are several ways in which one can characterize different approaches to text summarization. In this section, we present three possible classifications of text summarization systems, but many others can be found in the literature [70, 130, 105, 98]. The first classification, following Mani and Maybury (1999) [101], is based on the level of processing that each system performs; the second, proposed in Alonso and Castellón (2001) [4], is based on the kind of information exploited; the third follows Tucker (1999) [157].

3.1 Classification 1: Level of Processing

One useful way to classify summarization systems is to examine the level of processing of the text. Based on this, summarization can be characterized as approaching the problem at the surface, entity, or discourse level [101].

3.1.1 Surface level

Surface-level approaches tend to represent information in terms of shallow features that are then selectively combined together to yield a salience function used to extract information, following the approach of Edmundson (1969) [47]. These features include:

Term frequency statistics provide a thematic representation of text, assuming that important sentences are the ones that contain words that occur frequently. The score of sentences increases for each frequent word. Early summarization systems directly exploit word distribution in the source [96].
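A minimal sketch of this frequency-based scoring (the tokenization, stoplist and tie-breaking are our own assumptions; Luhn's original method [96] differs in detail):

```python
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "that", "for"}

def frequency_summary(text: str, n_sentences: int = 3) -> list[str]:
    """Score each sentence by the frequency of its content words and
    return the top-scoring sentences in document order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence: str) -> int:
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower())
                   if w not in STOPWORDS)

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return [s for s in sentences if s in top]  # restore document order
```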
Location relies on the intuition that important sentences are located at positions that are usually genre-dependent; however, some general rules are the lead method and the title-based method. The lead method consists of just taking the first sentences. The title-based method assumes that words in titles and headings are positively relevant to summarization. A generalization of these methods is the OPP used by Hovy and Lin in their SUMMARIST system [91], where they exploit Machine Learning techniques to identify the positions where relevant information is placed within different textual genres. Many of the current systems, specially those applying machine learning techniques, take into account the location of meaning units in a document to assess their relevance.

Bias. The relevance of meaning units is determined by the presence of terms from the title or headings, the initial part of the text, or the user's query. For example, [37, 36, 144] use as features the position in the sentence, the number of tokens and the number of pseudo-query terms.

Cue words and phrases are signals of relevance or irrelevance. They are typically meta-linguistic markers (e.g., cues: "in summary", "in conclusion", "our investigation", "the paper describes"; or emphasizers: "significantly", "important", "in particular", "hardly", "impossible"), as well as domain-specific bonus phrases and stigma terms. Although lists of these phrases are usually built manually [82, 154], they can also be detected automatically.

3.1.2 Entity-level

Entity-level approaches build an internal representation of the text by modeling text entities (simple words, compound nouns, named entities, etc.) and their relationships. These approaches tend to represent patterns of connectivity in the text (e.g., graph topology) to help determine saliency. Relations between entities include:

Similarity. Similar words are those whose form is similar, for example, those sharing a common stem (e.g., "similar" and "similarity"). Similarity can be calculated with linguistic knowledge or by character string overlap. Myaeng and Jang (1999) [118] use two similarity measures for determining if a sentence belongs to the major content: a similarity between the sentence and the rest of the document and a similarity between the sentence and the title of the document. Also, in NTT [65, 66] and CENTRIFUSER [75], several similarity measures are applied.

Proximity. The distance between the text units where entities occur is a determining factor for establishing relations between entities.

Cohesion. Cohesion can be defined in terms of connectivity. Connectivity accounts for the fact that important text units usually contain entities that are highly connected in some kind of semantic structure. Cohesion can be approached by:

- Word co-occurrence: words can be related if they occur in common contexts. Some applications are presented in Baldwin and Morton (1998), McKeown et al. (1999) [13, 109]. Salton et al. (1997), Mitra et al. (1997) [141, 113] apply IR methods at the document level, treating paragraphs in texts as documents are treated in a collection of documents. Using a traditional IR-based method, a word similarity measure is used to determine the set Si of paragraphs that each paragraph Pi is related to. After determining relatedness scores Si for each paragraph, paragraphs with the largest Si scores are extracted (see the sketch after this list).

- In SUMMAC [97], in the context of query-based summarization, Cornell's Smart-based approach expands the original query, compares the expanded query against paragraphs, and selects the top three paragraphs (max 25% of the original) that are most similar to the original query.

- Local salience: important phrasal expressions are given by a combination of grammatical, syntactic, and contextual parameters [24].

- Lexical similarity: words can be related by thesaural relationships (synonymy, hypernymy, meronymy relations). Barzilay (1997) [16] details a system where Lexical Chains are used, based on Morris and Hirst (1991) [116]. This line has also been applied to Spanish, relying on EuroWordNet relations between words, by Fuentes and Rodríguez (2002) [53]. The assumption is that important sentences are those that are crossed by strong chains¹. This approach provides a partial account of texts, since it focuses mostly on cohesive aspects. An integration of cohesion and coherence features of texts might contribute to overcome this, as Alonso and Fuentes (2002) [5] point out.

- Co-reference: referring expressions can be linked, and co-reference chains can be built with co-referring expressions. Both Lexical Chains and Co-reference Chains can be prioritized if they contain words in a query (for query-based summaries) or in the title. So, the preference imposed on chains is: query > title > document. Baga and Baldwin (1998), Azzam et al. (1999) [11, 10] use coreference chains for summarization. Baldwin and Morton (1998) [13] exploit co-reference chains specifically for query-sensitive summarization.

- Connectedness method [100]: maps text into graphs. Words in the text are the nodes, and arcs represent adjacency, grammatical, co-reference, and lexical similarity-based relations.

Logical relations such as agreement, contradiction, entailment, and consistency.

Meaning representation-based relations. Establishing relations, such as predicate-argument, between entities in the text. The system of Baldwin and Morton (1998) [13] uses argument detection in order to resolve co-reference between the query and the text for performing summarization.

¹ Lexical chains have also been used in other NLP tasks, such as automatic extraction of interdocument links [56].
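The paragraph-relatedness computation in the word co-occurrence item above can be sketched as follows (plain tf-idf vectors with a cosine threshold; the actual measures of Salton et al. [141] and Mitra et al. [113] differ in detail):

```python
import math
import re
from collections import Counter

def tfidf_vectors(paragraphs):
    """Treat each paragraph as a document and build tf-idf vectors."""
    tokenized = [re.findall(r"[a-z']+", p.lower()) for p in paragraphs]
    df = Counter(w for toks in tokenized for w in set(toks))
    n = len(paragraphs)
    return [{w: tf * math.log(n / df[w]) for w, tf in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def relatedness_scores(paragraphs, threshold=0.1):
    """S_i = number of paragraphs each paragraph P_i is related to;
    paragraphs with the largest S_i are candidates for extraction."""
    vecs = tfidf_vectors(paragraphs)
    return [sum(1 for j, v in enumerate(vecs)
                if j != i and cosine(u, v) > threshold)
            for i, u in enumerate(vecs)]
```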
3.1.3 Discourse-level

Discourse-level approaches model the global structure of the text, and its relation to communicative goals. At this level, the following information can be exploited:

Format of the document (e.g., hypertext markup, document outlines).

Threads of topics can be revealed in the text. An example of this is SUMMARIST, which applies Topic identification [69, 95]. Topic identification implies previous acquisition of Topic Signatures (that can be automatically learned) and then the identification of a text span as belonging to a topic characterized by its signature. Topic identification, then, includes text segmentation and comparison of text spans with existing Topic Signatures. The topics identified are fused during the interpretation step of the process. The fused topics are then expressed in new terms. Other systems are Boros et al. (2001) [25] and MEAD [133, 128, 121]. These systems assign a topic to the sentences in order to create clusters for selecting the sentences to appear in the summary.

Rhetorical structure of the text, representing argumentation or narrative structure. The main idea is that the coherence structure of a text can be constructed, so that the 'centrality' of the textual units in this structure will reflect their importance. A tree-like representation of texts is proposed by Rhetorical Structure Theory [102]. Ono et al. (1994) [120] and Marcu (1997) [103] attempt to use this kind of discourse representation in order to determine the most important textual units. They propose an approach to rhetorical parsing by discourse markers and semantic similarities in order to hypothesize rhetorical relations. These hypotheses are used to derive a valid discourse representation of the original text.

3.2 Classification 2: Kind of Information

The approaches to summarization in this direction try to grasp what a text is 'about' by identifying concepts that are in some sense central to the text, on the basis of the occurrence of the same or related concepts in different parts of the source representation. Aboutness is represented as the links between these occurrences.

Frequency-based approaches exploit the frequency with which the concepts occur in the representation. In systems based in word frequency, attentional networks are only represented implicitly. Some systems account for frequency significance by applying IR techniques, such as the tf*idf measure. Others apply corpus-based statistical natural language processing, such as collocation or proper noun identification. Still others try to abstract from individual words to achieve concept frequency, by using lexicons or thesauri [69].

On the other hand, some systems identify and exploit the cohesive links holding between parts of the source text. These links can be represented as graph-like structures [145] or as lexical chains; important parts of chains can be considered the most representative of the text.

3.3.2 Sentence by Sentence

Some summarizing systems decide for each sentence in the source text whether it is important for summarizing, rather independently of the text as a whole. To do that, they rely on relevance or irrelevance marks that can be found in sentences, for example, cue words.

3.3.4 Discourse Structure

Discourse structure is used by many systems in a limited way, for example, by trying to grasp a text's 'aboutness'. In contrast, some other methods apply discourse theories to the analysis of the source text in order to obtain a representation of their discourse structure. However, work in this area has been largely theoretical.

3.4 Combined Systems

The predominant tendency in current systems is to integrate some of the techniques mentioned so far. Integration is a complex matter, but it seems the appropriate way to deal with the complexity of textual objects. In this section, we are going to present some examples of combination of different techniques.

There are several systems where different methods are combined. Among the most interesting are [82, 156, 69, 100], where the title-based method is combined with cue-, location-, position-, and word-frequency-based methods.

As the field progresses, summarization systems tend to use more and deeper knowledge. For example, IE techniques are becoming widely used. Many systems do not rely any more on a single indicator of relevance or coherence, but take into account as many of them as possible. So, the tendency is that heterogeneous kinds of knowledge are merged in increasingly enriched representations of the source text(s).

These enriched representations allow for adaptability of the final summary to new summarization challenges, such as multidocument, multilingual and even multimedia summarization. In addition, such a rich representation of text is a step towards generation or, at least, pseudo-generation by combining fragments of the original text. Good examples of this are [108, 93, 41, 84, 59], among others.

4 Summarization Systems

Table 1 shows how existing summarization systems would be classified according to each of the classifications presented in the previous section. However, it must be taken into account that most current summarization systems are very complex, resorting to very heterogeneous information and applying varied techniques, so a classification will never be clear-cut. Moreover, systems tend to evolve with time, which makes their classification still more controversial.

Files with a more extensive description of some of these systems (marked with an asterisk) can be found in the Annex (in electronic version only). Additionally, Table 2 lists on-line or downloadable systems.

Multilinguality of the systems is one of the features in each describing file. It is stated whether the system can summarize only a single language, a definite set of languages, or whether its architecture permits unrestricted multilinguality. In this latter case, it is also stated whether experiments with different languages are reported.

As a concrete example of an approach to multilingual summarization, we present the systems developed within project HERMES². The target of project HERMES is to adapt and apply language technologies for Spanish, Catalan, Basque and English to improve access to textual information in digital libraries, Internet, documental Intranets, etc. Therefore, the HERMES summarization system should integrate multiple languages in a common architecture. Since the resources available for every language are uneven, this architecture has to be flexible enough to adapt to knowledge-poor representations of text but also to exploit rich representations when available.

² http://terral.ieec.uned.es/hermes/

EuroWordNet [160] is a general resource available for these four languages, so a first approach to summarization exploited this resource. A Lexical Chain summarizer was developed for Spanish [53]. As can be seen in Figure 1, the architecture of the summarizer permits easy adaptation to other languages, provided there is at least a morphological analyzer and a version of EuroWordNet available for the language. If other NLP tools are available, like Named Entity Recognizers or co-reference solvers, they can be easily integrated within the system. Once the text has been analyzed and Lexical Chains have been obtained, a summary is built by extracting candidate textual units from the text. Candidate units are chosen applying a certain heuristic, weighting some aspects of Lexical Chains.
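A toy illustration of chain-based candidate selection (the chains here are naively built from repeated lemmas rather than EuroWordNet relations, and the strength heuristic is our own assumption):

```python
from collections import defaultdict

def build_chains(sentences):
    """Naive lexical chains: group sentence indices by shared lemma.
    A real system would link words through EuroWordNet relations."""
    chains = defaultdict(set)
    for i, sent in enumerate(sentences):   # sentences as lists of lemmas
        for word in sent:
            chains[word].add(i)
    return chains

def chain_summary(sentences, n=2):
    """Score sentences by the strength (here: span size) of the chains
    crossing them, and extract the top-scoring candidates."""
    chains = build_chains(sentences)
    strength = {w: len(idx) for w, idx in chains.items() if len(idx) > 1}
    scores = [sum(strength.get(w, 0) for w in set(sent)) for sent in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n])  # indices of candidate units, in document order
```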
A second approach to the task of summarization, seen in Figure 2, [52] tries to overcome this dependency on lexical resources by applying Machine Learning techniques. The system is trained with a corpus of sentences described with a set of features, like position in the text, length, and also being crossed by a Lexical Chain. For each of these sentences, it is previously determined whether it belongs to a summary of the text or not, so that it can be learned which combinations of features characterize summary sentences. In a text to be summarized, each sentence is described with the same set of features, and it is determined whether these describing features characterize the sentence as a summary sentence or not. The summary is composed with sentences qualifying as summary sentences.
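A compact sketch of this train-then-classify scheme (the three features, the toy training data and the logistic-regression learner are illustrative assumptions; [52] defines its own feature set and learner):

```python
from sklearn.linear_model import LogisticRegression

def features(sent_idx, n_sents, sentence, crossed_by_chain):
    """Describe a sentence by position, length, and lexical-chain evidence."""
    return [sent_idx / n_sents, len(sentence.split()), int(crossed_by_chain)]

# Toy training corpus: feature vectors plus a label saying whether the
# sentence belongs to a summary of its text.
X_train = [[0.0, 24, 1], [0.9, 7, 0], [0.1, 19, 1], [0.5, 12, 0]]
y_train = [1, 0, 1, 0]

model = LogisticRegression().fit(X_train, y_train)

def summarize(sentences, chain_flags):
    """Keep the sentences the learned model qualifies as summary sentences."""
    n = len(sentences)
    X = [features(i, n, s, c)
         for i, (s, c) in enumerate(zip(sentences, chain_flags))]
    return [s for s, label in zip(sentences, model.predict(X)) if label == 1]
```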
This second system does not require any specific feature to produce a summary, not even Lexical Chains. However, the more information available, the more accurate the learning process will be, which will result in better summaries. This approach has been evaluated for English within the DUC 2003 contest, but it can be used straightforwardly for any other language, as long as there is a training corpus available.

5 Burning Issues

The field has experienced an exponential growth since its beginnings, but some crucial questions are still open.

5.1 Coherence of Summary texts

Paice (1990) [123] pointed out that the main shortcomings of summarization systems up to the 1990s were their low representativity of the content in the source text and their lack of coherence.

Much of the work in this area has treated the problem of text summarization from a predominant information-theoretic perspective. Therefore, texts have been modeled as mathematical objects, where relevance and redundancy could be defined in purely statistical terms. This approach seems specially valuable to produce a satisfactory representation of the content of a text. However, it fails in producing coherent texts, acceptable for human users.

The shortcomings of purely statistical approaches to text summarization on handling textual coherence are addressed from two different perspectives:

- Applying machine learning techniques. They have been used mainly for two purposes: classifying a sentence from a source text into relevant or non-relevant [82, 8, 99, 90, 65] and transforming a source sentence considered relevant into a summary sentence [73, 78, 59]. Inputs for learning algorithms are usually texts with their corresponding abstracts. Therefore, the main shortcoming of this approach is to obtain large quantities of <text, abstract> tuples for a variety of textual genres.

- Resorting to symbolic linguistic or world knowledge. Understanding of texts, mainly through IE techniques, seems a desirable way of producing quality summaries. Until recently, such techniques had only been applied for very restricted domains [111]. However, recent systems tend to incorporate IE modules that perform a partial understanding of text, either by modeling the typical context of relevant pieces of information [84, 76], or by applying general templates to find, organize and use the typical content of a kind of text or event [59, 41]. This use of IE techniques has produced very good results, as is reflected in the high ranking of Harabagiu and Lacatusu (2002) [59] in DUC 2002. A combination of deeper knowledge with surface clues seems to yield good results, too [93].

5.2 Multidocument summarization

Multidocument summarization is one of the major challenges in current summarization systems. It consists of producing a single summary of a collection of documents dealing with the same topic. The work has been mostly determined by the corresponding DUC task. Therefore, it has mainly focused in collections of news articles with a given topic. Remarkable progress has been achieved in avoiding redundancy, mainly based on the work in Carbonell and Goldstein (1998) [30].
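The redundancy-avoiding criterion of [30] is Maximal Marginal Relevance (MMR); the selection loop can be sketched as follows (the similarity function sim is left abstract, and lam is the usual relevance/novelty trade-off):

```python
def mmr_select(candidates, query, selected, sim, lam=0.7):
    """One MMR step [30]: prefer sentences relevant to the query but
    dissimilar to what is already in the summary."""
    def marginal_relevance(s):
        redundancy = max((sim(s, t) for t in selected), default=0.0)
        return lam * sim(s, query) - (1 - lam) * redundancy
    return max(candidates, key=marginal_relevance)

def mmr_summary(sentences, query, sim, k=3):
    """Build a k-sentence summary by repeated MMR selection."""
    selected, pool = [], list(sentences)
    for _ in range(min(k, len(pool))):
        best = mmr_select(pool, query, selected, sim)
        selected.append(best)
        pool.remove(best)
    return selected
```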
When dealing with MDS, new problems arise: lower compression factors implying a more aggressive condensation, anti-redundancy, the temporal dimension, a more challenging coreference task (inter-document), etc. Clustering of similar documents plays now a central role [30, 133, 60, 110]. Selecting the most relevant fragments from each cluster and assuring coherence of the summaries coming from different documents are other important problems, currently under development in MDS systems.

5.3 Multilingual summarization

As for multilingual summarization, not much work has been done yet, but the roadmap for the DUC contests [12] contemplates this challenge in the near future of the area.

The most well-known Multilingual Summarization System is SUMMARIST [69]. The system extracts sentences in a variety of languages (English, Spanish, Japanese, etc.) and translates the resulting summaries. SUMMARIST proceeds in three steps: Topic identification, Interpretation and Summary generation. Topic identification implies previous acquisition of Topic Signatures and then the identification of a text span as belonging to a topic characterized by its signature. Topic Signatures are tuples of the form <Topic, Signature>, where Signature is a list of weighted terms: {<t1, w1>, <t2, w2>, ..., <tn, wn>}. Topic signatures can be automatically learned [89, 95]. Topic identification, then, includes text segmentation (using Marti Hearst's TextTiling) and comparison of text spans with existing Topic Signatures. The identified topics are fused during interpretation, the second step of the process. The fused topics are then reformulated, that is to say, expressed in new terms. The last step is a conventional extractive task.
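A sketch of matching a text span against Topic Signatures (the signatures and weights below are toy values, not the automatically learned ones of [89, 95]):

```python
from collections import Counter

# Topic Signatures as <Topic, {term: weight}> with made-up weights.
SIGNATURES = {
    "earthquake": {"quake": 3.2, "magnitude": 2.5, "epicenter": 2.4},
    "election":   {"vote": 3.0, "ballot": 2.6, "candidate": 2.2},
}

def identify_topic(span_words):
    """Assign a text span to the topic whose signature terms it matches best."""
    counts = Counter(w.lower() for w in span_words)

    def match(signature):
        return sum(weight * counts[term] for term, weight in signature.items())

    return max(SIGNATURES, key=lambda topic: match(SIGNATURES[topic]))
```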
In order to face multilingual problems, the involved knowledge sources have to be as language-independent as possible. In the case of SUMMARIST, sets of Topic Signatures have to be obtained for all the languages involved using the same procedures. Also the segmentation procedure is language-independent. So, the accuracy of the resulting summaries depends heavily on the quality of the translators.

As has been said before, a more challenging issue is Crosslingual Multidocument Summarizers. Basically three main problems have to be addressed: 1) clustering of multilingual documents, 2) measuring the distance (or similarity) between multilingual units (documents, paragraphs, sentences, terms), and 3) automatic translation of documents or summaries. Most systems differ on the way they face these problems, the order of performance and the granularity of the units they deal with.

Evans and Klavans (2003) [49] present a platform for multilingual news summarization that extends Columbia's Newsblaster system [106]. The system adds a new component, translation, to the original six major modules: crawling, extraction, clustering, summarization, classification and web page generation, which have been, in turn, modified to allow multilinguality (language identification, different character encodings, language idiosyncrasies, etc.).

In this system multilingual documents are translated into English before clustering, so that clustering is performed only on English texts.

Translation is carried out at two levels. Because a low-quality translation is usually enough for clustering purposes and assessing the relevance of the sentences, a simple and fast technique is applied for glossing the input documents prior to clustering. Higher (relatively) quality translation (using Altavista's Babelfish interface to Systran) is performed in a second step only over fragments selected to be part of the summary.

The system takes as well into account the possible degradation of the input texts as a result of the translation process, since most of the sentences resulting from this process are simply not grammatically correct.

Chen et al. (2003) [34] consider three possibilities for scheduling the basic steps of document translation and clustering:

1. Translation before document clustering (as in Columbia's system), named the one-phase strategy. This model clusters the multilingual document set directly, resulting in multilingual clusters.

2. Translation after document clustering, named the two-phase strategy. This model clusters documents in each language separately and merges the clustering results.

3. Translation deferred to sentence clustering. First, monolingual clustering is performed at document level. All the documents in each cluster refer to the same event in a specific language. Then, for generating the extracted summary of an event, all the clusters referring to this event are taken into account. Similar sentences of these multilingual clusters are clustered together, now at sentence level. Finally a representative sentence is chosen from each cluster and translated if needed.

The accuracy of this process depends basically on the form of computing the similarity between different multilingual units. Several forms of such functions are presented and empirically evaluated by the authors.

These measures are multilingual extensions of a baseline monolingual similarity measure. Sentences are represented as bags of words (only nouns and verbs are taken into account). The similarity measure is a function of the number of (approximate) matches between words and of the size of the bags. The matching function in the baseline reduces, except for NEs, to identity. In the multilingual variants of the formula, a bilingual dictionary is used as knowledge source for computing this matching.

Despite its simplicity, the position-free measure (the simplest one) seems to be the most accurate among the studied alternatives. In this approach the translations of all the words of the bag are collected and the similarity is computed as in the baseline. All the other alternatives constrain in some ways the possible mappings between words, using different greedy strategies. The results are, however, worse.
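A sketch of the position-free variant as described above (the Dice-style normalization and the toy dictionary are our own assumptions; Chen et al. [34] give the exact formula):

```python
BILINGUAL_DICT = {            # toy bilingual lookup table
    "eleccion": {"election", "vote"},
    "voto": {"vote", "ballot"},
}

def position_free_similarity(bag_a, bag_b, dictionary=BILINGUAL_DICT):
    """Collect all translations of bag_a's words, then count matches
    against bag_b, normalizing by the sizes of the two bags."""
    translations = set()
    for word in bag_a:
        translations |= dictionary.get(word, {word})  # NEs etc. pass through
    matches = sum(1 for word in bag_b if word in translations)
    return 2 * matches / (len(bag_a) + len(bag_b)) if bag_a or bag_b else 0.0

print(position_free_similarity(["eleccion", "voto"], ["election", "ballot"]))
# -> 1.0
```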
The two-phase strategy outperforms in the experiments the one-phase strategy. The third strategy, deferring the translation to sentence clustering, seems to be the most promising.

A system covering English and Chinese following this approach is presented in Chen and Lin (2000) [35]. The main components of the system are a set of monolingual news clusterers, a unique multilingual news clusterer and a news summarizer. A central issue of the system is the definition and identification of meaningful units as the basis for comparison. For English these units can be reduced to sentences, but for Chinese the identification of units and the associated segmentation of the text can be a difficult task. Another important issue of the system (general for systems covering distant languages or different encoding schemata) is the need of a robust transliteration of names (or words not occurring in the bilingual dictionary) for assuring an accurate matching.

… (TIC2002-04447-C02-01), and it has also been partially funded by a grant associated to the X-TRACT project, PB98-1226 of the Spanish Research Department, and by the project INTERLINGUA (IN3-IR226).

References

[1] Enrique Alfonseca and Pilar Rodríguez. Description of the UAM system for generating very short summaries at DUC-2003. In HLT/NAACL Workshop on Text Summarization / DUC 2003, 2003.
[14] M. Banko, V. Mittal, and M. Witbrock. Headline generation based on statistical translation. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, ACL, 2000.

[15] Michele Banko, Vibhu Mittal, Mark Kantrowitz, and Jade Goldstein. Generating extraction-based summaries from hand-written summaries by aligning text spans. In Proceedings of PACLING-99, Waterloo, Ontario, July 1999.

[16] Regina Barzilay. Lexical chains for summarization. Master's thesis, Ben-Gurion University of the Negev, 1997.

[24] Branimir Boguraev and Christopher Kennedy. Salience-based content characterisation of text documents. In Proceedings of ACL'97 Workshop on Intelligent, Scalable Text Summarisation, pages 2-9, Madrid, Spain, 1997.

[25] E. Boros, P.B. Kantor, and D.J. Neu. A clustering based approach to creating multi-document summaries. In Workshop on Text Summarization in conjunction with the ACM SIGIR Conference 2001, New Orleans, 2001.

[26] Ronald Brandow, Karl Mitze, and Lisa F. Rau. Automatic condensation of electronic publications by sentence selection. Information Processing and Management, 31(5):675-685, 1995.

[27] M. Brunn, Y. Chali, and B. Dufou. The University of Lethbridge text summarizer at DUC 2002. In Workshop on Text Summarization (in conjunction with ACL 2002, including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization), Philadelphia, July 11-12, 2002.

[28] Orkut Buyukkokten, Hector Garcia-Molina, and Andreas Paepcke. Text summarization of web pages on handheld devices. In NAACL'01, 2001.

[29] N. H. M. Caldwell. An investigation into shallow processing for summarisation. Technical report, Computer Science Tripos Part II project, University of Cambridge Computer Laboratory, 1994.

[30] Jaime G. Carbonell and Jade Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of SIGIR, pages 335-336, 1998.

[31] J. Carroll, G. Minnen, Y. Canning, S. Devlin, and J. Tait. Practical simplification of English newspaper text to assist aphasic readers. In AAAI-98 Workshop on Integrating Artificial Intelligence and Assistive Technology, 1998.

[32] Y. Chali, M. Kolla, N. Singh, and Z. Zhang. The University of Lethbridge text summarizer at DUC 2003. In HLT/NAACL Workshop on Text Summarization / DUC 2003, 2003.

[33] Hsin-Hsi Chen. Multilingual summarization and question answering. In Workshop on Multilingual Summarization and Question Answering (COLING 2002), 2002.

[34] Hsin-Hsi Chen, June-Jei Kuo, and Tsei-Chun Su. Clustering and visualization in a multi-lingual multi-document summarization system. In Proceedings of the 25th European Conference on IR Research, pages 266-280, 2003.

[35] Hsin-Hsi Chen and Chuan-Jie Lin. A multilingual news summarizer. In Proceedings of the 18th International Conference on Computational Linguistics, COLING 2000, pages 159-165, 2000.

[36] John M. Conroy and Dianne P. O'Leary. Text summarization via Hidden Markov Models. In SIGIR 2001, 2001.

[37] John M. Conroy, Judith D. Schlesinger, Dianne P. O'Leary, and Mary Ellen Okurowski. Using HMM and Logistic Regression to generate extract summaries for DUC. In Workshop on Text Summarization in conjunction with the ACM SIGIR Conference 2001, New Orleans, Louisiana, 2001.

[38] T. Copeck, S. Szpakowicz, and N. Japkowicz. Learning how best to summarize. In Workshop on Text Summarization (in conjunction with ACL 2002, including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization), Philadelphia, July 11-12, 2002.

[39] Simon H. Corston-Oliver. Text compaction for display on very small screens. In NAACL'01, 2001.

[40] R. E. Cullingford. SAM. In Schank and Riesbeck, editors, Inside Computer Understanding. Lawrence Erlbaum Assoc., Hillsdale, NJ, 1981.

[41] H. Daume III, A. Echihabi, D. Marcu, D.S. Munteanu, and R. Soricut. GLEANS: A generator of logical extracts and abstracts for nice summaries. In Workshop on Text Summarization (in conjunction with ACL 2002, including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization), Philadelphia, July 11-12, 2002.

[42] Hal Daume III and Daniel Marcu. A noisy-channel model for document compression. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002.

[43] G. DeJong. An overview of the FRUMP system. In W. G. Lehnert and M. H. Ringle, editors, Strategies for natural language processing, pages 149-176. Hillsdale, NJ: Lawrence Erlbaum, 1982.

[44] J. Dersy. Producing summary content indicators for retrieved texts. Master's thesis, University of Cambridge Department of Engineering, 1996.

[45] DUC: Document Understanding Conference. http://duc.nist.gov/.

[46] Daniel M. Dunlavy, John M. Conroy, Judith D. Schlesinger, Sarah A. Goodman, Mary Ellen Okurowski, Dianne P. O'Leary, and Hans van Halteren. Performance of a three-stage system for multi-document summarization. In DUC03, Edmonton, Alberta, Canada, May 31 - June 1, 2003. Association for Computational Linguistics.

[47] H. P. Edmundson. New methods in automatic extracting. Journal of the Association for Computing Machinery, 16(2):264-285, April 1969.

[48] Noemie Elhadad and Kathleen R. McKeown. Towards generating patient-specific summaries of medical articles. In NAACL'01 Automatic Summarization Workshop, 2001.

[49] David Kirk Evans and Judith L. Klavans. A platform for multilingual news summarization. Technical Report CUCS-014-03, Computer Science, Columbia University, 2003.

[50] A. Farzindar, G. Lapalme, and H. Saggion. Summaries with SumUM and its expansion for document understanding conference (DUC 2002). In Workshop on Text Summarization (in conjunction with ACL 2002, including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization), Philadelphia, July 11-12, 2002.

[51] Atefeh Farzindar and Guy Lapalme. Using background information for multi-document summarization and summaries in response to a question. In DUC03, Edmonton, Alberta, Canada, May 31 - June 1, 2003. Association for Computational Linguistics.

[52] Maria Fuentes, Marc Massot, Horacio Rodríguez, and Laura Alonso. Mixed approach to headline extraction for DUC 2003. In HLT/NAACL Workshop on Text Summarization / DUC 2003, Edmonton, Canada, 2003.

[53] Maria Fuentes and Horacio Rodríguez. Using cohesive properties of text for automatic summarization. In JOTRI'02, 2002.

[54] P. Gladwin, S. Pulman, and K. Sparck-Jones. Shallow processing and automatic summarising: a first study. Technical Report 223, University of Cambridge Computer Laboratory, 1991.

[55] Jade Goldstein, Vibhu Mittal, Mark Kantrowitz, and Jaime Carbonell. Summarizing text documents: Sentence selection and evaluation metrics. In SIGIR-99, 1999.

[56] Stephen J. Green. Automatically generating hypertext by computing semantic similarity. PhD thesis, University of Toronto, 1997.

[57] U. Hahn. Topic parsing: Accounting for text macro structures in full-text analysis. Information Processing and Management, 26(1):135-170, 1990.

[58] Udo Hahn and Inderjeet Mani. The challenges of automatic summarization. IEEE Computer, 33(11):29-36, 2000.

[59] S.M. Harabagiu and F. Lacatusu. Generating single and multi-document summaries with GISTEXTER. In Workshop on Text Summarization (in conjunction with ACL 2002, including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization), Philadelphia, July 11-12, 2002.

[60] V. Hatzivassiloglou, J. Klavans, M. Holcombe, R. Barzilay, M.Y. Kan, and K.R. McKeown. SimFinder: A flexible clustering tool for summarization. In NAACL'01 Automatic Summarization Workshop, 2001.

[61] Vassileios Hatzivassiloglou, Judith Klavans, and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. In EMNLP/VLC'99, Maryland, 1999.

[62] A. G. Hauptmann and M. J. Witbrock. Informedia: News-on-demand multimedia information acquisition and retrieval. In M. Maybury, editor, Intelligent Multimedia Information Retrieval, pages 215-239. AAAI/MIT Press, 1997.

[63] Marti Hearst. Multi-paragraph segmentation of expository text. In 32nd Annual Meeting of the Association for Computational Linguistics, 1994.

[64] Ulf Hermjakob. Learning Parse and Translation Decisions From Examples With Rich Context. PhD thesis, University of Texas at Austin, 1997.

[65] T. Hirao, Y. Sasaki, H. Isozaki, and E. Maeda. NTT's Text Summarization system for DUC-2002. In Workshop on Text Summarization (in conjunction with ACL 2002, including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization), Philadelphia, July 11-12, 2002.

[66] T. Hirao, J. Suzuki, H. Isozaki, and E. Maeda. NTT's multiple document summarization system for DUC 2003. In HLT/NAACL Workshop on Text Summarization / DUC 2003, 2003.

[67] Michael Hoey. Patterns of Lexis in Text. Describing English Language. Oxford University Press, 1991.

[68] Eduard Hovy. Handbook of Computational Linguistics, chapter 28: Text Summarization. Oxford University Press, 2001.

[69] Eduard Hovy and Chin-Yew Lin. Automated Text Summarization in SUMMARIST. In Mani and Maybury, editors, Advances in Automatic Text Summarization, 1999.

[70] Eduard Hovy and Daniel Marcu. Automated Text Summarization. COLING-ACL, 1998. Tutorial.

[71] Hongyan Jing. Sentence simplification in automatic text summarization. In ANLP-2000, 2000.

[72] Hongyan Jing. Cut-and-Paste Text Summarization. PhD thesis, Graduate School of Arts and Sciences, Columbia University, 2001.

[73] Hongyan Jing and Kathleen McKeown. Cut and paste based text summarization. In 1st Conference of the North American Chapter of the Association for Computational Linguistics, 2000.

[74] Min-Yen Kan. Automatic text summarization as applied to information retrieval: Using indicative and informative summaries. PhD thesis, Columbia University, 2003.

[75] Min-Yen Kan, Judith L. Klavans, and Kathleen R. McKeown. Domain-specific informative and indicative summarization for information retrieval. In Workshop on Text Summarization in conjunction with the ACM SIGIR Conference 2001, New Orleans, 2001.

[76] Min-Yen Kan and Kathleen McKeown. Information extraction and summarization: Domain independence through focus types. Technical report, Computer Science Department, Columbia University, New York, 1999.

[77] M. Karamuftuoglu. An approach to summarization based on lexical bonds. In Workshop on Text Summarization (in conjunction with ACL 2002, including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization), Philadelphia, July 11-12, 2002.

[78] Kevin Knight and Daniel Marcu. Statistics-based summarization - step one: Sentence compression. In The 17th National Conference of the American Association for Artificial Intelligence, AAAI'2000, Austin, Texas, 2000.

[79] Hideki Kozima. Text segmentation based on similarity between words. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 286-288, 1993.

[80] W. Kraaij, M. Spitters, and A. Hulth. Headline extraction based on a combination of uni- and multidocument summarization techniques. In Workshop on Text Summarization (in conjunction with ACL 2002, including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization), Philadelphia, July 11-12, 2002.

[81] W. Kraaij, M. Spitters, and M. van der Heijden. Combining a mixture language model and naive Bayes for multi-document summarisation. In Workshop on Text Summarization in conjunction with the ACM SIGIR Conference 2001, New Orleans, Louisiana, 2001.

[82] Julian Kupiec, Jan Pedersen, and Francine Chen. A trainable document summarizer. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pages 68-73. ACM Press, 1995.

[83] Finley Lacatusu, Paul Parker, and Sanda Harabagiu. Lite-GISTexter: Generating short summaries with minimal resources. In DUC03, Edmonton, Alberta, Canada, May 31 - June 1, 2003. Association for Computational Linguistics.

[84] P. Lal and S. Rueger. Extract-based summarization with simplification. In Workshop on Text Summarization (in conjunction with ACL 2002, including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization), Philadelphia, July 11-12, 2002.

[85] Abderrah Lehmam. Text structuration leading to an automatic summary system: RAFI. Information Processing and Management, 35(2):181-191, 1999.

[86] Abderrah Lehmam and Philippe Bouvet. Evaluation, rectification et pertinence du résumé automatique de texte pour une utilisation en réseau. In S. Chaudiron and C. Fluhr, editors, III Colloque d'ISKO-France: Filtrage et résumé automatique de l'information sur les réseaux, 2001.

[87] W. G. Lehnert. Plot units: a narrative summarization strategy. In W. G. Lehnert and M. H. Ringle, editors, Strategies for natural language processing, pages 375-412. Hillsdale, NJ: Lawrence Erlbaum, 1982.

[88] Anton Leuski, Chin-Yew Lin, and Eduard Hovy. iNeATS: Interactive multi-document summarization. In ACL'03, 2003.

[89] C-Y. Lin. Robust Automated Topic Identification. PhD thesis, University of Southern California, 1997.

[90] Chin-Yew Lin. Training a selection function for extraction. In ACM-CIKM, pages 55-62, 1999.

[91] Chin-Yew Lin and Eduard Hovy. Identifying topics by position. In Proceedings of the Applied Natural Language Processing Conference (ANLP-97), pages 283-290, Washington, DC, 1997.

[92] Chin-Yew Lin and Eduard Hovy. NeATS: A multidocument summarizer. In Workshop on Text Summarization in conjunction with the ACM SIGIR Conference 2001, New Orleans, 2001.

[93] Chin-Yew Lin and Eduard Hovy. NeATS in DUC 2002. In Workshop on Text Summarization (in conjunction with ACL 2002, including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization), Philadelphia, July 11-12, 2002.

[94] Chin-Yew Lin and Eduard Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Marti Hearst and Mari Ostendorf, editors, HLT-NAACL 2003: Main Proceedings, pages 150-157, Edmonton, Alberta, Canada, May 27 - June 1, 2003. Association for Computational Linguistics.

[95] Chin-Yew Lin and Eduard H. Hovy. The automated acquisition of topic signatures for Text Summarization. In COLING-00, Saarbrücken, 2000.

[96] H. P. Luhn. The automatic creation of literature abstracts. IBM Journal of Research and Development, 2(2):159-165, 1958.

[97] I. Mani, D. House, G. Klein, L. Hirschman, L. Obrst, T. Firmin, M. Chrzanowski, and B. Sundheim. The TIPSTER SUMMAC text summarization evaluation: Final report. Technical report, DARPA, 1998.

[98] Inderjeet Mani. Automatic Summarization. Natural Language Processing. John Benjamins Publishing Company, 2001.

[99] Inderjeet Mani and Eric Bloedorn. Machine learning of generic and user-focused summarization. In AAAI, pages 821-826, 1998.

[100] Inderjeet Mani and Eric Bloedorn. Summarizing similarities and differences among related documents. Information Retrieval, 1(1-2):35-67, 1999.

[101] Inderjeet Mani and Mark T. Maybury, editors. Advances in automatic text summarisation. MIT Press, 1999.

[102] William C. Mann and Sandra A. Thompson. Rhetorical structure theory: Toward a functional theory of text organisation. Text, 8(3):243-281, 1988.

[103] Daniel Marcu. From discourse structures to text summaries. In Mani and Maybury, editors, Advances in Automatic Text Summarization, pages 82-88, 1997.

[104] M. Maybury and A. Merlino. Multimedia summaries of broadcast news. In International Conference on Intelligent Information Systems, 1997.

[105] Mark T. Maybury and Inderjeet Mani. Automatic summarization. ACL/EACL'01, 2001. Tutorial.

[106] K. McKeown, R. Barzilay, D. Evans, V. Hatzivassiloglou, J. Klavans, C. Sable, B. Schiffman, and S. Sigelman. Tracking and summarizing news on a daily basis with Columbia's Newsblaster. In Proceedings of the Human Language Technology Conference, 2002.

[107] K. McKeown, S.-F. Chang, J. Cimino, S. Feiner, C. Friedman, L. Gravano, V. Hatzivassiloglou, S. Johnson, D. Jordan, J. Klavans, A. Kushniruk, V. Patel, and S. Teufel. PERSIVAL, a system for personalized search and summarization over multimedia healthcare information. In ACM+IEEE Joint Conference on Digital Libraries (JCDL 2001), 2001.

[108] K. McKeown, D. Evans, A. Nenkova, R. Barzilay, V. Hatzivassiloglou, B. Schiffman, S. Blair-Goldensohn, J. Klavans, and S. Sigelman. The Columbia multi-document summarizer for DUC 2002. In Workshop on Text Summarization (in conjunction with ACL 2002, including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization), Philadelphia, July 11-12, 2002.

[109] Kathleen McKeown, Judith Klavans, Vassileios Hatzivassiloglou, Regina Barzilay, and Eleazar Eskin. Towards multidocument summarization by reformulation: Progress and prospects. In AAAI 99, 1999.

[110] Kathleen R. McKeown, Regina Barzilay, David Evans, Vasileios Hatzivassiloglou, Min-Yen Kan, Barry Schiffman, and Simone Teufel. Columbia multi-document summarization: Approach and evaluation. In Proceedings of the Workshop on Text Summarization, ACM SIGIR Conference, 2001.

[111] Kathleen R. McKeown and Dragomir R. Radev. Generating summaries of multiple news articles. In ACM Conference on Research and Development in Information Retrieval SIGIR'95, Seattle, WA, 1995.

[112] Jean-Luc Minel, Jean-Pierre Desclés, Emmanuel Cartier, Gustavo Crispino, Slim Ben Hazez, and Agata Jackiewicz. Résumé automatique par filtrage sémantique d'informations dans des textes. Présentation de la plate-forme FilText. Revue Technique et Science Informatiques, 2001.

[113] M. Mitra, A. Singhal, and C. Buckley. Automatic Text Summarization by paragraph extraction. In Inderjeet Mani and Mark Maybury, editors, Intelligent Scalable Text Summarization Workshop (ISTS'97), pages 39-46, Madrid, 1997. ACL/EACL.

[114] V. Mittal, M. Kantrowitz, J. Goldstein, and J. Carbonell. Selecting text spans for document summaries: Heuristics and metrics. In AAAI 1999, 1999.

[115] Vibhu Mittal and Adam Berger. Query-relevant summarization using FAQs. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL 2000), Hong Kong, 2000.

[116] Jane Morris and Graeme Hirst. Lexical cohesion, the thesaurus, and the structure of text. Computational Linguistics, 17(1):21-48, 1991.

[117] S. Muresan, E. Tzoukermann, and J. Klavans. Combining linguistic and machine learning techniques for email summarization. In ACL-EACL'01 CoNLL Workshop, 2001.

[118] Sung Hyon Myaeng and Myung-Gil Jang. Integrating digital libraries with cross-language IR. In Proceedings of the 2nd Conference on Digital Libraries, 1999.

[119] Ani Nenkova, Barry Schiffman, Andrew Schlaiker, Sasha Blair-Goldensohn, Regina Barzilay, Sergey Sigelman, Vasileios Hatzivassiloglou, and Kathleen McKeown. Columbia at the DUC 2003. In DUC03, Edmonton, Alberta, Canada, May 31 - June 1, 2003. Association for Computational Linguistics.

[120] K. Ono, K. Sumita, and S. Miike. Abstract generation based on rhetorical structure extraction. In Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), pages 344-348, Kyoto, Japan, 1994.

[121] J.C. Otterbacher, A.J. Winkel, and D.R. Radev. The Michigan single and multi-document summarizer for DUC 2002. In Workshop on Text Summarization (in conjunction with ACL 2002, including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization), Philadelphia, July 11-12, 2002.

[122] Chris D. Paice. The automatic generation of literature abstracts: an approach based on the identification of self-indicating phrases. In R. N. Oddy, C. J. Rijsbergen, and P. W. Williams, editors, Information Retrieval Research, pages 172-191. London: Butterworths, 1981.

[123] Chris D. Paice. Constructing literature abstracts by computer. Information Processing & Management, 26(1):171-186, 1990.

[124] T.A.S. Pardo and L.H.M. Rino. DMSumm: Review and assessment. In E. Ranchhod and N. J. Mamede, editors, Advances in Natural Language Processing, pages 263-273. Springer-Verlag, 2002.

[125] T.A.S. Pardo, L.H.M. Rino, and M.G.V. Nunes. GistSumm: A summarization tool based on a new extractive method. In N.J. Mamede, J. Baptista, I. Trancoso, and M.G.V. Nunes, editors, 6th Workshop on Computational Processing of the Portuguese Language - Written and Spoken, …

[131] … summarization of topically related news articles. In 5th European Conference on Research and Advanced Technology for Digital Libraries, Darmstadt, 2001.

[132] Dragomir R. Radev, Weiguo Fan, and Zhu Zhang. WebInEssence: A personalized web-based multi-document summarization and recommendation system. In NAACL Workshop on Automatic Summarization, Pittsburgh, 2001.

[133] Dragomir R. Radev, Hongyan Jing, and Malgorzata Budzikowska. Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies. In ANLP/NAACL Workshop on Summarization, Seattle, Washington, 2000.

[134] Dragomir R. Radev, Simone Teufel, Horacio Saggion, Wai Lam, John Blitzer, Arda Celebi, Hong Qi, Elliott Drabek, and Danyu Liu. Evaluation of Text Summarization in a Cross-lingual Information Retrieval …
number 2721 in Le
ture Notes in Arti- trieval Framework. Te
hni
al report, Cen-
ial Intelligen
e, pages 210{218. Springer- ter for Language and Spee
h Pro
essing,
Verlag, 2003. Johns Hopkins University, Baltimore, MD,
June 2002.
[126 J. J. Pollo
k and A. Zamora. Automati
abstra
ting resear
h at
hemi
al abstra
ts [135 Lisa F. Rau, Paul S. Ja
obs, and Uri Zernik.
servi
e. Journal of Information and Com- Information extra
tion and text summari-
puter S
ien
es, 15(4):226{23, 1975. sation using linguisti
knowledge a
quisi-
tion. Information Pro
essing & Manage-
[127 K. Preston and S. Williams. Managing the ment, 25(4):419 { 428, 1989.
information overload. physi
s in business.
Institute of Physi
s, 1994. [136 RIPTIDES. RIPTIDES: Rapidly Portable
Translingual Information Extra
tion and
[128 Dragomir Radev, Sasha Blair-Goldensohn, Intera
tive Multido
ument Summarization.
and Zhu Zhang. Experiments in single http://www.
s.
ornell.edu/Info/People/
and multi-do
ument summarization using
ardie/tides/, 2002.
MEAD. In First Do
ument Understanding
Conferen
e, New Orleans, LA, September [137 J. E. Rush and et al. Automati
abstra
t-
2001. ing and indexing. ii. produ
tion of abstra
ts
by appli
ation of
ontextual inferen
e and
[129 Dragomir Radev, Jahna Otterba
her, Hong synta
ti
oheren
e
riteria. Journal of the
Qi, and Daniel Tam. MEAD ReDUCs: Ameri
an So
iety for Information S
ien
e,
Mi
higan at DUC 2003. In DUC03, Ed- 22(4):260 { 274, 1971.
monton, Alberta, Canada, May 31 - June
1 2003. Asso
iation for Computational Lin- [138 Hora
io Saggion and Guy Lapalme. Gen-
guisti
s. erating Indi
ative-Informative Summaries
with SumUM. Computational Linguisti
s,
[130 Dragomir R. Radev. Text Summarization. 28(4), 2002.
ACM SIGIR, 2000. tutorial.
[139 Hora
io Saggion and Guy Lapalme. Gener-
[131 Dragomir R. Radev, Sasha Blair- ating informative and indi
ative summaries
Goldensohn, Zhu Zhang, and Re- with SumUM. Computational Linguisti
s,
vathi Sundara Raghavan. Intera
tive, 28(4), 2002. Spe
ial Issue on Automati
domain-independent identi
ation and Summarization.
[140 Gerard Salton, James Allan, and Chris [150 H. Sundaram. Segmentation, Stru
ture De-
Bu
kley. Automati
stru
turing and re- te
tion and Summarization of Multimedia
trival of large text les. CACM, 37(2):97{ Sequen
es. PhD thesis, Graduate S
hool
108, 1994. of Arts and S
ien
es, Columbia University,
2002.
[141 Gerard Salton, Amit Singhal, M. Mitra,
and C. Bu
kley. Automati
text stru
tur- [151 SweSum. http://www.nada.kth.se/ xmartin/
ing and summarization. Information Pro- swesum/index-eng.html, 2002.
essing and Management, 33(3):193 { 207,
1997. [152 J. L. Tait. Automati
summarizing of en-
[142 R. S
hank and R. Abelson. S
ripts, Plans, glish texts. Te
hni
al Report 47, University
Goals, and Understanding. Lawren
e Erl- of Cambridge Computer Laboratory, 1983.
baum, Hillsdale, NJ, 1977.
[153 S. L. Taylor. Automati
abstra
ting by ap-
[143 Barry S
himan, Inderjeet Mani, and Kris- plying graphi
al te
hniques to semanti
net-
tian J. Con
ep
ion. Produ
ing biographi- works. PhD thesis, Northwestern Univer-
al summaries: Combining linguisti
knowl- sity, 1975.
edge with
orpus statisti
s. In EACL'01,
2001. [154 Simone Teufel and Mar
Moens. Sen-
ten
e extra
tion as a
lassi
ation task. In
[144 J.D. S
hlesinger, J.M. Conroy, M.E. Inderjeet Mani and Mark Maybury, edi-
Okurowski, H.T. Wilson, D.P. O'Leary, tors, Intelligent S
alable Text Summariza-
A. Taylor, and J. Hobbs. Understand- tion Workshop (ISTS'97), pages 58 { 59,
ing ma
hine performan
e in the
ontext Madrid, 1997. ACL/EACL.
of human performan
e for multi-do
ument
summarization. In Workshop on Text [155 Simone Teufel and Mar
Moens. Senten
e
Summarization (In Conjun
tion with the extra
tion and rhetori
al
lassi
ation for
ACL 2002 and in
luding the DARPA/NIST
exible abstra
ts. In AAAI Spring Sym-
sponsored DUC 2002 Meeting on Text Sum- posium on Intelligent Text Summarisation,
marization), Philadelphia, July, 11-12 2002. pages 16 { 25, 1998.
[145 E. F. Skorokhod'ko. Adaptive method of
automati
abstra
ting and indexing. Infor- [156 Simone Teufel and Mar
Moens. Summa-
mation pro
essing, 71, 1971. rizing s
ienti
arti
les { experiments with
relevan
e and rhetori
al status. Computa-
[146 K. Spar
k Jones, S. Walker, and S. Robert- tional Linguisti
s, 28(4), 2002. Spe
ial Issue
son. A probabilisti
model of information on Automati
Summarization.
retrieval: Development and status. Te
hni-
al Report N 446, University of Cambridge [157 Ri
hard Tu
ker. Automati
Summarising
Computer Laboratory, 1998. and the CLASP system. PhD thesis, Uni-
versity of Cambridge, 1999.
[147 Karen Spar
k-Jones. Automati
summaris-
ing: fa
tors and dire
tions. In Inderjeet [158 E. Tzoukermann, S. Muresan, and J. Kla-
Mani and Mark Maybury, editors, Advan
es vans. Gist-it: Summarizing email using lin-
in Automati
Text Summarization. MIT guisti
knowledge and ma
hine learning. In
Press, 1999. ACL-EACL'01 HLT/KM Workshop, 2001.
[148 Tomek Strzalkowski, Jin Wang, and Bow-
den Wise. A robust pra
ti
al text summa- [159 H. van Halteren. Writing style re
ogni-
rization. In Eduard Hovy and Dragomir tion and senten
e extra
tion. In Work-
Radev, editors, AAAI Spring Symposium shop on Text Summarization (In Conjun
-
on Intelligent Text Summarisation, pages tion with the ACL 2002 and in
luding the
26 { 33, Stanford, California, Mar
h 23- DARPA/NIST sponsored DUC 2002 Meet-
25 1998. Ameri
an Asso
iation for Arti
ial ing on Text Summarization), Philadelphia,
Intelligen
e, AAAI Press. July, 11-12 2002.
[149 SUMMAC. SUMMAC, the nal report. [160 Piek Vossen, editor. Euro WordNet: a mul-
http://www.itl.nist.gov/iaui/894.02/ tilingual database with lexi
al semanti
net-
related proje
ts/tipster summa
/, 1998. works. Kluwer A
ademi
Publishers, 1998.
[161 H. Wa
tlar. Multi-do
ument summariza- 22nd International Conferen
e on Resear
h
tion and visualization in the informedia dig- and Development in Information Retrieval
ital video library, 2001. (SIGIR-99), 1999.
[162 M. White, D. M
Cullough, C. Cardie, [166 S. R. Young and P. J. Hayes. Automati
V. Ng, and K. Wagsta. Dete
ting dis
rep-
lassi
ation and summarisation of bank-
an
ies and improving intelligibility: Two ing telexes. In Se
ond Conferen
e on Arti-
preliminary evaluations of riptides. In
ial Intelligen
e Appli
ations, pages 402{
Workshop on Text Summarization in
on- 408, New York, 1985.
jun
tion with the ACM SIGIR Conferen
e
2001, New Orleans, 2001. [167 D. Zaji
, B. Door, and R. S
hwartz. Au-
tomati
headline generation for newspaper
[163 Mi
hael White and Claire Cardie. Sele
ting stories. In Workshop on Text Summariza-
senten
es for multido
ument summaries us- tion (In Conjun
tion with the ACL 2002
ing randomized lo
al sear
h. In ACL Work- and in
luding the DARPA/NIST sponsored
shop on Automati
Summarization, 2002. DUC 2002 Meeting on Text Summariza-
[164 Mi
hael White, Tanya Korelsky, Claire tion), Philadelphia, July, 11-12 2002.
Cardie, Vin
ent Ng, David Pier
e, and Kiri
Wagsta. Multi-do
ument summarization [168 Klaus Ze
hner. A literature survey on in-
via information extra
tion. In Pro
eed- formation extra
tion and Text Summariza-
ings of the First International Conferen
e tion. term paper, Carnegie Mellon Univer-
on Human Language Te
hnology Resear
h, sity, 1997.
2001. [169 Klaus Ze
hner. Automati
Summarisation
[165 M. Witbro
k and V. Mittal. Ultra- of Spoken Dialogues in Unrestri
ted Do-
summarization: A statisti
al approa
h to mains. PhD thesis, Carnegie Mellon Uni-
generating highly
ondensed nonextra
- versity, 2001.
tive summaries. In Pro
eedings of the
System | Processing Level | Information Kind | Tucker 1999
Adam [137, 126] | surface | structural | sentencewise
Alfonseca and Rodríguez [1] | surface | structural | sentencewise
* Anes [26] | surface | lexical | att. networks
Barzilay and Elhadad 1997 [17] | entity | lexical | att. networks
Boguraev and Kennedy 1997 [24] | entity | lexical | att. networks
Caldwell 1994 [29] | entity | lexical | att. networks
* CENTRIFUSER [48] | discourse | understanding | info. content
* Chen and Lin (2000) [35] | surface | lexical | info. content
* Columbia MDS [108, 38, 119] | entity/discourse | understanding/structural | info. content
Copeck et al. 2002 [38] | surface | lexical | att. networks
* Cut-and-Paste [72] | surface | structural | info. content
Darsy 1996 [44] | entity | lexical | att. networks
* DiaSumm [169] | surface | lexical | discourse structure
DimSum [9] | surface | lexical | att. networks
* DMSumm [124] | discourse | structural | disc. structure
Edmundson 1969 [47] | surface | structural | sentencewise
FilText [112] | surface | structural | info. content
* FociSum [76] | entity | understanding | att. networks
Frump [43] | entity | understanding | info. content
GISTexter [59, 83] | discourse/entity | understanding | info. content
GISTSumm [125] | surface | lexical | att. networks
Gladwin et al. 1991 [54] | entity | lexical | att. networks
* GLEANS [41] | entity/discourse | understanding | info. content
* NTT [65, 66] | surface | structural/lexical | att. networks
* Karamuftuoglu 2002 [77] | surface | structural | att. networks
* Kraaij et al. 2002 [80] | surface | lexical | att. networks
K. U. Leuven [6, 7] | entity | lexical | att. networks
* Lal and Rueger 2002 [84] | entity/discourse | understanding | info. content
Lehnert 1982 [87] | entity | understanding | info. content
* Univ. of Lethbridge [27, 32] | entity | structural/lexical | att. networks
Luhn 1958 [96] | surface | lexical | att. networks
Marcu 1997 [103] | discourse | structural | disc. structure
* MEAD [128, 129] | surface | lexical | att. networks
* MultiGen [109, 19] | entity | structural | info. content
* NeATS [92, 93, 88] | entity | structural | info. content
* Newsblaster [106] | entity/discourse | structural/understanding | info. content
NewsInEssence [131] | surface | lexical | att. networks
Ono et al. 1994 [120] | discourse | structural | disc. structure
NetSumm [127] | surface | lexical | att. networks
Paice 1981 [122] | surface | structural | sentencewise
* PERSIVAL [107] | entity | understanding | info. content
Ra [85] | surface | structural | att. networks
* RIPTIDES [136, 163] | entity/discourse | understanding | info. content
SAM [142, 40] | entity | understanding | info. content
Dunlavy et al. 2003 [144, 46] | surface | lexical | att. networks
Scisor [135] | entity | understanding | info. content
Scrabble [152] | entity | understanding | info. content
Skorokhod'ko 1971 [145] | entity | lexical | att. networks
Smart [140, 113] | entity | lexical | att. networks
* SUMMARIST [69] | surface | lexical | att. networks
SUMMONS [111] | entity | understanding | info. content
SumUM [50, 138, 51] | discourse | structural | discourse structure
* SweSum [151] | surface | lexical | att. networks
Taylor 1975 [153] | entity | understanding | info. content
Tele-Pattan [20] | entity | lexical | att. networks
Tess [166] | entity | understanding | info. content
Teufel and Moens [155, 156] | discourse | structural | disc. structure
TICC [2] | entity | understanding | info. content
TOPIC [57] | discourse | structural | disc. structure
van Halteren 2002 [159] | surface | lexical | att. networks
WebInEssence [132, 167] | surface | lexical | att. networks
System | Language(s) | Scope | Access
GISTexter | English | single- and multi-document | no straightforward access; form at http://www.languagecomputer.com/demos/summarization/index.html
GistSumm | multilingual | single-document | downloadable demo: http://www.nilc.icmc.usp.br/~thiago/Install_GistSum.zip
Newsblaster | multilingual | multi-document | on-line demo: http://www1.cs.columbia.edu/nlp/newsblaster/

Table 2: Some on-line demos of summarization systems, both commercial and academic.
[Figure: architecture of a lexical-chain summarizer. Pre-processing (text clean-up, textual-unit segmentation, morphological analysis with PN rules, lexical-unit segmentation, co-reference resolution driven by heuristics and co-reference rules, and semantic tagging with EuroWordNet and trigger words) yields the pre-processed text; the lexical chainer builds chains over textual units, and output ranking and selection, controlled by parameters, produces the summary.]

[Figure: the same pipeline as processing stages: pre-processing, lexical chainer, feature extraction (enriched textual units), classification by decision rules (ranked textual units), determination of summary content (chosen textual units), and simplification, which produces the summary.]
  - Tokenization
  - Sentence segmentation
  - POS tagging
  - Morphological analysis
  - Parsing
  - Co-reference resolution: Identity and Part-Whole, including nominal and verbal phrases, acronyms, events
- Language coverage: English
- Output facilities and constraints:
Evaluation:
Classification
- within classification 1 (level of processing): entity
- within classification 2 (kind of information): lexical
- within classification 3 (Tucker, 1999): sentence by sentence
Comments:
Banko et al. 1999, Mittal et al. 1999

Name:
Reference: [15], [114]
Short description: Extraction-based summarization from hand-written summaries, i.e. going from abstracts to extracts of single documents, by aligning text spans.
System Features
- Input:
- Architecture: A tl*tf (term length * term frequency) measure is used for weighting the relevance of terms and NEs (a toy sketch of this weighting follows the feature lists below). [114] focuses on the selection of spans for document summaries. Sentences from the original document are ranked according to their salience using two parameters for tuning the process: i) granularity, e.g. paragraph, sentence, etc., and ii) the metric for ranking. Features at discourse level include:
  - length of the span
  - density of NEs
  - complexity of NPs
  - punctuation
  - thematic phrases
  - anaphora density
There are also features at subdocument level (sentence, phrase and word). These include:
  - word length
  - thematic phrases
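As a rough illustration of the tl*tf measure described above, the following Python sketch ranks candidate spans by the summed tl*tf weight of their terms. The whitespace tokenization, the toy spans and the absence of NE handling are simplifications for exposition, not details of [15] or [114].

    from collections import Counter

    def tl_tf_weights(tokens):
        # tl*tf: weight of a term = term length * term frequency
        tf = Counter(tokens)
        return {term: len(term) * freq for term, freq in tf.items()}

    def rank_spans(spans):
        # Rank candidate spans (sentences, paragraphs, ...) by the sum
        # of the tl*tf weights of the terms they contain.
        tokens = [t for span in spans for t in span.lower().split()]
        w = tl_tf_weights(tokens)
        return sorted(spans, reverse=True,
                      key=lambda s: sum(w[t] for t in s.lower().split()))

    spans = ["The summarizer ranks text spans by salience.",
             "Term length times term frequency rewards long, repeated terms.",
             "Short words matter little."]
    print(rank_spans(spans)[0])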
  - Centroid
  - Centroid + outliers
[...] sentence, the number of tokens (sentence length), the distance to the query terms, and the position of the sentence.
[36] use pivoted QR matrix decomposition. A token-sentence matrix is built, and from it the columns giving good coverage of the tokens are selected. Two different approaches are used for this process: a greedy selection and a pivoted QR factorisation. [144] merged the LRM and the HMM by including all the features of the LRM in the HMM. An additional feature was the conditional probability that a sentence is a summary sentence given that the previous sentence is. A post-process is run on the extracted sentences to remove sentence-starting discourse markers and boilerplate, to improve cohesiveness. An extensive investigation was carried out to account for human performance in multi-document summarization. The conclusions were that single-document summaries could be used as a base for multi-document ones, but had to be enriched, possibly with discourse structure. Sentence pruning techniques were also found useful.
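The pivoted QR variant lends itself to a compact sketch, assuming a term-sentence matrix has already been built; scipy's QR factorisation with column pivoting orders the columns (sentences) so that each new pivot adds as much novel content as possible. The toy matrix and the selection size are illustrative, not the configuration of [36].

    import numpy as np
    from scipy.linalg import qr

    def select_sentences(term_sentence, k):
        # Column pivoting returns a permutation of the columns ordered by
        # how much orthogonal (novel) information each one contributes.
        _, _, pivots = qr(term_sentence, pivoting=True)
        return sorted(pivots[:k])

    # Toy matrix: rows = terms, columns = sentences (e.g. tf weights).
    A = np.array([[2., 0., 1., 0.],
                  [1., 1., 0., 0.],
                  [0., 2., 0., 1.],
                  [0., 0., 3., 1.]])
    print(select_sentences(A, 2))   # indices of two well-covering sentences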
- Language coverage: English, potentially multilingual
- Output facilities and constraints:
Evaluation: participated in DUC'01, DUC'02 and DUC'03. In DUC'02, it was ranked among the first systems, but did not beat the baselines. In DUC'03, it performed among the top systems.
Classification
- within classification 1 (level of processing): surface/entity
- within classification 2 (kind of information): lexical
- within classification 3 (Tucker, 1999): informational content
Comments:
Cut-and-Paste

Name: Cut-and-Paste
Reference: [71], [73]
Short description: Sentence Reduction for automatic text summarization. The system relates the phrases occurring in a summary written by a professional summarizer to the phrases occurring in the original document.
System Features
- Input:
- Architecture: 6 editing operations (learned from the performance of human summarizers) are used for sentence reduction:
  - removing extraneous phrases
  - syntactic transformations
  - reordering
[...] determined by the topic of the collection. Each template keeps a record of the text snippets the information has been extracted from. If one of these snippets contains an anaphoric element, its co-reference chain is also recorded. If no template is provided for a given topic, a template is generated ad hoc, based on the topical relations of the words in WordNet.
  - the dominant event of the collection is determined, and templates are classed depending on how central the dominant event is in the template and in the document the template is extracted from.
  - within each class, templates are ordered by their representativeness. Highly representative templates are those that have the same slot fillers in the same slots as the majority of templates. Templates related to text snippets crossed by co-reference chains are also more representative.
  - the summary is made from the text snippets recorded by the most representative template in the class of templates most closely related to the dominant event in the collection, in their order of appearance in the text. If they contain an anaphoric element, the sentences containing the antecedent are also included. If the summary is too long, the linguistic form of dates and locations is shortened, unimportant coordinated phrases are dropped or, finally, the last sentence is dropped until the targeted length is achieved. If the summary is too short, the same process is applied to the most representative templates of the other classes of templates, in order of closeness to the dominant event.
- Language coverage: English
- Output facilities and constraints:
Evaluation: participated in DUC 2002 and was ranked among the first. It obtained the best coverage rates for single- and multi-document summarization, and was only surpassed by one system in precision for multi-document summarization. In DUC 2003 they participated with Lite-GISTexter, which uses minimal lexico-semantic resources, obtaining good results for one of the four tasks.
Classification
- within classification 1 (level of processing): entity/discourse
- within classification 2 (kind of information): understanding
- within classification 3 (Tucker, 1999): informational content
Comments: the mentioned reference does not provide much detail on some of the modules of the system.
GISTSumm

Name:
Reference: [125]
Short description: an automatic text summarizer that tries to identify the main idea of the text, i.e. the gist, in order to generate the corresponding summary.
System Features
- Input:
- Architecture: It is based on the assumptions that it is possible to:
  - find a sentence that represents the main idea of a text, the gist;
  - find the gist by statistical methods;
  - produce coherent abstracts relating the gist to the other sentences of the original text.
It has two methods to summarize: via keywords or via a metric to find the most representative words of a text (tf*isf, term frequency * inverse sentence frequency).
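A minimal sketch of the tf*isf method, assuming whitespace tokenization and no stopword removal; it illustrates the metric, not GistSumm's actual implementation.

    import math
    from collections import Counter

    def gist_sentence(sentences):
        # tf*isf: term frequency over the whole text times the log-inverse
        # of the number of sentences containing the term; the best-scoring
        # sentence is taken as the gist.
        toks = [s.lower().split() for s in sentences]
        n = len(toks)
        tf = Counter(w for t in toks for w in t)
        sf = Counter(w for t in toks for w in set(t))
        score = lambda t: sum(tf[w] * math.log(n / sf[w]) for w in t) / max(len(t), 1)
        return max(range(n), key=lambda i: score(toks[i]))

    sents = ["The volcano erupted on Monday.",
             "Thousands fled as the volcano erupted near the coast.",
             "Officials commented briefly."]
    print(sents[gist_sentence(sents)])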
- Output facilities and constraints:
- Language coverage: multilingual
Classification
- within classification 1 (level of processing): surface
- within classification 2 (kind of information): lexical
- within classification 3 (Tucker, 1999): attentional networks
Evaluation:
Comments:
GLEANS

Name: GLEANS
Reference: [41]
Short description: IE-based multi-document summarizer; it makes explicit the main entities and relations in a document collection. It produces headlines, extracts and a reduced form of abstract.
System Features
- Input:
- Architecture: summarization in four steps:
  - documents are parsed [64], the main constituents of each sentence are identified, some anaphoric expressions are resolved, and finally everything is mapped into a canonical representation that makes the main entities and relations explicit.
  - each collection of documents is classified by its content into person, single event, multiple [...] and relations are extracted by choosing the most salient words in the collection.
  - a headline is created, based on the type of collection and the core entities and relations. For multiple-event collections, a short abstract can also be generated with the mechanisms used to generate headlines.
  - an abstract is generated by applying a library of canonical schemas obtained from manual [...] anaphoric expressions to use for each entity, and temporal expressions are represented in a canonical form.
- Language coverage: English
- Output facilities and constraints:
Evaluation: performance in DUC 2002 was not high: low coverage, though it improved when document collections were correctly classified. It was especially bad on headline generation.
Classification
- within classification 1 (level of processing): entity/discourse
- within classification 2 (kind of information): understanding
- within classification 3 (Tucker, 1999): informational content
Comments:
Knight and Marcu 2000

Name:
Reference: [78]
Short description: This system is not a full summarizer but a sentence compressor. Sentence compression is presented as a fundamental component of any high-quality non-extractive summarizer.
System Features
- Input:
- Architecture: The system follows a statistical approach. Sentence compression is considered a process of translation from a source language (full text) into a target language (summary). The process is accomplished following two different approaches: a conventional noisy-channel model and decision trees (using C4.5). The probabilistic models are trained on a corpus of <full text, summary> pairs.
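The noisy-channel formulation can be illustrated with a deliberately crude scorer: a candidate compression is ranked by log P(short) under a source model favouring fluent short strings, plus a channel term rewarding faithfulness to the long sentence. The unigram table and the overlap-based channel below are toy stand-ins for the trained models of [78].

    import math

    def channel_score(short, long_, lm, channel_weight=2.0):
        # log P(short) under a unigram source model of summaries ...
        source = sum(math.log(lm.get(w, 1e-6)) for w in short)
        # ... plus a toy channel model: fraction of the long sentence
        # that survives in the compression.
        kept = sum(1 for w in long_ if w in set(short)) / len(long_)
        return source + channel_weight * math.log(max(kept, 1e-6))

    lm = {"markets": 0.02, "fell": 0.03, "sharply": 0.01,
          "analysts": 0.005, "said": 0.02}        # toy source model
    sentence = "markets fell sharply analysts said".split()
    for cand in (["markets", "fell"], ["analysts", "said"]):
        print(cand, round(channel_score(cand, sentence, lm), 2))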
- Language coverage: English, potentially multilingual
- Output facilities and constraints:
Evaluation:
Classification
- within classification 1 (level of processing): surface
- within classification 2 (kind of information): lexical
- within classification 3 (Tucker, 1999): sentence by sentence
Comments: an enhancement of this approach was carried out later on, applying the same technique to rhetorical parse trees, with a scope beyond the sentence [42].
Kraaij et al. 2001

Name:
Reference: [81]
Short description: Probabilistic single-document extractive summarizer.
System Features
- Input:
- Architecture: The system follows a probabilistic approach. Two different statistical models are applied and their results are combined to select the sentences to be included in the summary. The former is a content-based language model (unigrams + smoothing) and the latter is based on non-content features (being or not the first sentence, containing cue phrases, sentence length, etc.).
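A hedged sketch of how such a combination might look: a unigram content model with linear-interpolation smoothing plus a few hand-weighted non-content features. The weights, the cue-word list and the smoothing constant are invented for illustration and are not the values of [81].

    import math

    def content_score(tokens, topic_lm, bg_lm, lam=0.7):
        # Smoothed unigram model: P(w) = lam*P_topic(w) + (1-lam)*P_bg(w)
        return sum(math.log(lam * topic_lm.get(w, 0.0)
                            + (1 - lam) * bg_lm.get(w, 1e-6))
                   for w in tokens)

    def sentence_score(tokens, position, topic_lm, bg_lm,
                       cues=frozenset({"conclusion", "significantly"})):
        s = content_score(tokens, topic_lm, bg_lm)
        s += 2.0 if position == 0 else 0.0        # first-sentence feature
        s += 1.0 if cues & set(tokens) else 0.0   # cue-phrase feature
        s += 0.1 * min(len(tokens), 30)           # length feature, capped
        return s

    topic = {"genome": 0.05, "sequencing": 0.04}
    bg = {"the": 0.06, "of": 0.04, "genome": 0.001}
    print(sentence_score("the genome sequencing effort".split(), 0, topic, bg))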
- Language coverage: English, potentially multilingual
- Output facilities and constraints:
Evaluation:
Classification
- within classification 1 (level of processing): surface
- within classification 2 (kind of information): lexical
- within classification 3 (Tucker, 1999): informational content
Comments:
Lal and Ruger 2002

Name:
Reference: [84]
Short description: single-document, extract-based summarizer; applies anaphora resolution and text simplification.
System Features
- Input:
- Architecture: following the approach of [82], it works as a Bayesian pattern classifier over sentences, trained on an annotated corpus. The features taken into account are: length of the sentence, position of the sentence within the paragraph and of the paragraph within the document, mean tf*idf of named entities, co-reference with named entities in the headline, and inclusion of highly co-referred named entities. Some dangling anaphors are replaced by their referent. Lexical simplification is performed with tools from the PSET project [31]. Background knowledge on people and places, taken from sources on the web, can also be included.
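For illustration, a Bayesian classifier over features of the kind listed above could be set up as follows; scikit-learn's GaussianNB is a stand-in, and the feature values and labels are invented toy data, not the trained model of [84].

    from sklearn.naive_bayes import GaussianNB

    # One row per sentence: [length, position_in_paragraph, paragraph_position,
    #  mean_tfidf_of_NEs, corefers_with_headline_NE, has_highly_coreferred_NE]
    X_train = [[24, 1, 1, 0.8, 1, 1],
               [ 9, 4, 3, 0.1, 0, 0],
               [31, 1, 2, 0.6, 1, 0],
               [12, 5, 4, 0.0, 0, 0]]
    y_train = [1, 0, 1, 0]            # 1 = include in summary

    clf = GaussianNB().fit(X_train, y_train)
    print(clf.predict([[20, 1, 1, 0.7, 1, 1]]))   # -> [1]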
- Language coverage: English
Evaluation: DUC 2002; performed well except for grammaticality and coherence.
Classification
- within classification 1 (level of processing): entity/discourse
- within classification 2 (kind of information): lexical/structural
- within classification 3 (Tucker, 1999): sentence by sentence
Comments: A demonstration can be found at http://km.doc.ic.ac.uk/pr-p.lal-2002/, and the system can be downloaded as a CREOLE Repository for GATE users.
Lethbridge, University of

Name: University of Lethbridge
Reference: [27, 32]
Short description: single- and multi-document lexical-chain summarizer by extraction. It filters out chain candidates in subordinate clauses.
System Features
- Input:
- Architecture: for multi-document summaries, the procedure is the same as for single documents (below), but all segments in the collection are pooled together, assigning a time stamp to each.
  - topic segmentation of the text
  - lexical chaining
  - surface repairs: add the previous sentence to a sentence containing a dangling anaphora, remove [...]
  - the topic of each sentence is determined by general topicality mechanisms of English (initial [...]) [...] by the targeted summary length, so that only sentences in higher levels are included.
  - for multiple-document summarization, headline-like summaries are produced by listing non-redundant topic terms. For longer summaries, open-class words of every sentence in the collection are clustered. Key terms are associated to each topic, and a tree-like table of contents is produced.
- Language coverage: English, potentially multilingual
- Output facilities and constraints: oriented to tables of contents; lacks cohesion for texts.
Evaluation: DUC 2002, average scores, bad for short abstracts. In DUC 2003, the strategy for very short abstracts (headlines) was significantly improved, combining the informativeness of topic terms with hand-crafted grammatical rules for sentence compression, which resulted in very good results for the task of headline generation. In the other tasks, results were average.
Classification
- within classification 1 (level of processing): entity
- within classification 2 (kind of information): lexical
- within classification 3 (Tucker, 1999): attentional networks
Comments:
Lexical Bonds

Name: Lexical Bonds
Reference: [77]
Short description: extractive single-document system based on the analysis of lexical bonds between the sentences in a text and a classification of sentences into important and unimportant using SVM.
System Features
- Input: single documents
- Architecture: the original design includes a transformation phase that should compact the text extracted in the first phase and resolve anaphoric references, but it is not yet developed. The current architecture is:
  - sentences are split and stopwords are removed
  - a record of features is built for every sentence: sentence position, number of words, number of backward, forward and total lexical bonds and lexical links, and information content
  - a lexical link between two sentences is found when a word stem occurs in both of them; a lexical bond is found when there are two or more lexical links between a pair of sentences [67].
  - the information content of a sentence is given by the IR function BM25 [146], which indicates the [...]
  - [...] are: only sentences in the upper half of the document and selected by the SVM are considered.
The system produces cohesive summaries, but they are very redundant.
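The link and bond definitions translate almost directly into code; the suffix-stripping "stemmer" below is deliberately crude, and the two-link threshold follows the definition of a bond given above [67].

    from itertools import combinations

    def links(a, b, stem=lambda w: w.lower().rstrip("s")):
        # A lexical link: a word stem occurring in both sentences.
        return {stem(w) for w in a.split()} & {stem(w) for w in b.split()}

    def bond_counts(sentences, min_links=2):
        # A lexical bond holds between two sentences with >= 2 lexical links.
        bonds = [0] * len(sentences)
        for i, j in combinations(range(len(sentences)), 2):
            if len(links(sentences[i], sentences[j])) >= min_links:
                bonds[i] += 1
                bonds[j] += 1
        return bonds

    sents = ["The court rejected the appeal",
             "The appeal reached the court on Monday",
             "Weather was mild"]
    print(bond_counts(sents))   # -> [1, 1, 0]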
- Language coverage: English, potentially multilingual
- Output facilities and constraints: the compaction process is under development.
Evaluation: participated in DUC 2002, with good results in quality.
Classification
- within classification 1 (level of processing): surface/entity
- within classification 2 (kind of information): discourse
- within classification 3 (Tucker, 1999): attentional networks
Comments:
MEAD

Name: MEAD
Reference: [133, 128, 121, 129]
Short description: Centroid-based multi-document summarization.
System Features
- Input:
- Architecture: MEAD begins by identifying all the articles related to an emerging event (using the CIDR Topic Detection and Tracking system). CIDR produces a set of clusters. From each cluster a centroid is built. Then the sentences closest to each of the centroids are selected for inclusion in the summary. CBSU (centroid-based sentence utility) scores the degree of relevance of a particular sentence to the general topic of the entire cluster. CSIS (cross-sentence informational subsumption) measures the overlap between the informational content of the sentences. CSIS is a measure similar to MMR; the difference is that CSIS is multi-document and query-independent, while MMR is single-document and query-based. More recent versions of MEAD use a linear combination of three features: the centroid score, a position score that favours sentences closer to the beginning of the document, and a length score that favours longer sentences.
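That linear combination can be sketched as follows, assuming tf*idf sentence vectors are already available; the three weights and the length cap are illustrative defaults, not MEAD's tuned parameters.

    import numpy as np

    def mead_scores(vectors, lengths, w_c=1.0, w_p=1.0, w_l=0.5):
        # Centroid similarity + position (earlier is better) + length.
        V = np.asarray(vectors, dtype=float)   # rows = tf*idf vectors
        c = V.mean(axis=0)                     # cluster centroid
        n = len(V)
        cos = lambda a, b: float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
        return [w_c * cos(v, c) + w_p * (n - i) / n + w_l * min(l / 25, 1.0)
                for i, (v, l) in enumerate(zip(V, lengths))]

    V = [[1, 0, 2], [0, 1, 2], [3, 0, 0]]      # toy tf*idf vectors
    print(mead_scores(V, lengths=[18, 25, 7]))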
- Language coverage: multilingual: English, Chinese, potentially any language
- Output facilities and constraints:
Evaluation: DUC 2001, 2002 and 2003. In DUC 2002 they had format problems (SGML tags). In DUC 2003 they obtained the best score for question-focused multi-document summaries, and performed among the top 3 systems in all multi-document summarization tasks.
Classification
- within classification 1 (level of processing): surface
- within classification 2 (kind of information): lexical
- within classification 3 (Tucker, 1999): informational content
Comments:
MULTIGEN

Name: MULTIGEN
Reference: [19], [109]
Short description: Multi-document Summarization using Information Fusion and Reformulation.
System Features
- Input: News articles presenting different descriptions of the same event.
- Architecture:
  - identify similarities and differences across documents by statistical techniques [111]
  - extract sets of similar sentences: THEMES
  - shallow syntactic analysis
  - order the sets of similar sentences (Reformulation). Two different forms of implementing the ordering [...] corpus analysis. MULTIGEN benefits from the experience of Columbia's group in NL Generation for building high-quality summaries (not extracts but abstracts).
- Language coverage: English
- Output facilities and constraints:
Evaluation:
Classification
- within classification 1 (level of processing): entity
- within classification 2 (kind of information): structural
- within classification 3 (Tucker, 1999): informational content
Comments: MULTIGEN has been extended in several directions. See Columbia MDS [18, 110], PERSIVAL [107] and CENTRIFUSER [75], among others.
Muresan et al. 2001, Tzoukermann et al. 2001

Name:
Reference: [117], [158]
Short description: e-mail summarization combining Machine Learning and linguistic information.
System Features
- Input:
- Architecture: The basic process consists of learning the salient NPs occurring in the text. The following features are used for the learning task:
  - for the head of the NP:
    - head-tf*idf (relevance)
  - np-tf*idf
  - np-foc
  - np-length-words
  - np-length-chars
  - sentence-position
  - paragraph-position
Different ML methods have been applied, including decision trees (C4.5) and rule induction (Ripper). The linguistic processes include:
  - inflectional morphology processing
2. Filtering for content: remove all sentences that are not within the first 10 sentences of a document; decrease the ranking score of sentences containing stigma words.
3. Enforcing cohesion and coherence by pairing each sentence with the lead sentence of the document.
4. Filtering for length: include sentences (paired with the corresponding lead sentence) that are most different from the already included ones, until the targeted length is satisfied.
5. Ensuring chronological coherence.
As an additional enhancement, Leuski et al. (2003) [88] provide a graphical interface to improve the navigation and modification of the summaries produced by NeATS.
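Steps 2 and 3 above reduce to a simple filter; the stigma-word list and the demotion factor below are invented for illustration and are not NeATS's actual resources.

    def filter_and_pair(ranked, lead,
                        stigma=frozenset({"he", "she", "they", "however"})):
        # Step 2: keep only sentences among a document's first 10, demoting
        # those that start with a stigma word. Step 3: pair each survivor
        # with the document's lead sentence for cohesion.
        kept = []
        for score, pos, sent in ranked:
            if pos >= 10:
                continue
            if sent.split()[0].lower().strip(",") in stigma:
                score *= 0.5                   # demotion factor: illustrative
            kept.append((score, lead, sent))
        return sorted(kept, reverse=True)

    ranked = [(0.9, 2, "However, the plan failed."),
              (0.8, 1, "The ministry announced a new plan."),
              (0.7, 14, "Background details followed.")]
    print(filter_and_pair(ranked, lead="The ministry unveiled reforms."))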
- Language coverage: English, potentially multilingual
- Output facilities and constraints:
Evaluation: in DUC 2002, it was the system with the highest precision and F1 measure, although it performed low in recall.
Classification
- within classification 1 (level of processing): entity
- within classification 2 (kind of information): structural
- within classification 3 (Tucker, 1999): informational content
Comments:
Newsblaster

Name: Multilingual Columbia's Newsblaster
Reference: [49]
Short description:
System Features
- Input: multidocument
- Architecture: A platform for multilingual news summarization that extends Columbia's Newsblaster system [106]. The system adds a new component, translation, to the original six major modules: crawling, extraction, clustering, summarization, classification and web page generation, which have been, in turn, modified to allow multilinguality (language identification, different character encodings, language idiosyncrasy, etc.).
In this system multilingual documents are translated into English before clustering, so clustering is performed only on English texts.
Translation is carried out at two levels. As a low-quality translation is usually enough for clustering purposes and for assessing the relevance of the sentences, a simple and fast technique is applied for glossing the input documents prior to clustering. Higher (relatively) quality translation (using Altavista's Babelfish interface to Systran) is performed in a second step, only over the fragments selected to be part of the summary.
The system also takes into account the possible degradation of the input texts as a result of the translation process (most of the sentences resulting from this process are simply not grammatically correct).
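The two-level translation strategy amounts to the control flow below; gloss, translate, cluster and summarize are hypothetical stand-ins for the real components (fast glossing, the Babelfish/Systran interface, the clustering module, and sentence selection), so this is a sketch of the data flow only.

    def summarize_multilingual(docs, gloss, translate, cluster, summarize):
        # 1. Cheaply gloss every non-English document: good enough for
        #    English-only clustering and sentence relevance assessment.
        glossed = [d if d["lang"] == "en" else gloss(d) for d in docs]
        # 2. Cluster, select sentences, then retranslate only the chosen
        #    fragments with the higher-quality engine.
        out = []
        for group in cluster(glossed):
            chosen = summarize(group)
            out.append([s if s["lang"] == "en" else translate(s)
                        for s in chosen])
        return out

    docs = [{"lang": "en", "text": "..."}, {"lang": "ru", "text": "..."}]
    print(summarize_multilingual(
        docs,
        gloss=lambda d: {**d, "text": "<gloss>"},            # hypothetical
        translate=lambda s: {**s, "text": "<translation>"},  # hypothetical
        cluster=lambda ds: [ds],
        summarize=lambda g: g[:2]))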
- Output facilities and constraints:
- Language coverage: crosslingual
Evaluation:
Classification
- within classification 1 (level of processing): entity
- within classification 2 (kind of information): structural
- within classification 3 (Tucker, 1999): informational content
Comments:
NTT

Name: NTT
Reference: [65, 66]
Short description: extractive summarizer based on classification of sentences by Support Vector Machines (SVM) and Maximal Marginal Relevance (MMR).
System Features
- Input:
- Architecture: each sentence in a document is described with the following features: position, length, weight (tf*idf score of the words in the sentence), similarity with the headline, and presence of certain prepositions or verbs.
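With such feature vectors, the classification step is a standard SVM training/prediction loop; the values and labels below are toy data, and the kernel choice is a guess rather than NTT's reported configuration.

    from sklearn.svm import SVC

    # One row per sentence: [position, length, tf*idf weight,
    #  headline similarity, has_cue_preposition_or_verb]
    X = [[0, 28, 4.1, 0.62, 1],
         [7, 11, 1.3, 0.05, 0],
         [1, 25, 3.2, 0.40, 1],
         [9,  8, 0.9, 0.02, 0]]
    y = [1, 0, 1, 0]                 # 1 = extract the sentence

    clf = SVC(kernel="rbf").fit(X, y)
    print(clf.predict([[2, 30, 3.8, 0.5, 1]]))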
- Language coverage: English, potentially multilingual
- Output facilities and constraints:
Evaluation: participated in DUC'02, with good results in coverage but low quality. For DUC 2003, NTT achieved the highest readability metrics in the two multidocument summarization tasks it took part in, and obtained average positions for coverage.
Classification
- within classification 1 (level of processing): surface
- within classification 2 (kind of information): lexical
- within classification 3 (Tucker, 1999): sentence by sentence
Comments:
OCELOT

Name: OCELOT
Reference: [115]
Short description: Summarization of Web pages. Gists of Web documents based on probabilistic models.
System Features
- Input:
- Architecture: OCELOT is one of the applications of a general probabilistic approach that models summarisation as a translation process between two languages, the language of full texts and the language of summaries. Berger, in his thesis, applies conventional stochastic translation methods to summarizing. Three different examples of application are provided, and OCELOT is one of them.
- Language coverage: English, potentially multilingual
- Output facilities and constraints:
Evaluation:
Classification
- within classification 1 (level of processing): surface
- within classification 2 (kind of information): lexical
- within classification 3 (Tucker, 1999): informational content
Comments:
PERSIVAL

Name: PERSIVAL
Reference: [107]
Short description: PERSIVAL (Personalized Retrieval and Summarization of Image, Video and Language). The system builds patient-specific (tailored access for both patients and physicians) summaries of medical articles contained in a distributed multimedia patient-care digital library. It is a Digital Library project.
System Features
- Input: Multimedia collections in the medical domain
- Architecture: Multimedia search triggered by a concept from the patient's data. The system includes the annotation and organization of large collections of video data. Video documents are segmented and a storyboard summary is produced. Videos are indexed at syntactic and semantic levels. A set of content-based video search tools has been developed. The system includes the use of the DEFINDER tool (for looking for definitions).
- Language coverage: English
- Output facilities and constraints:
Evaluation:
Classification
- within classification 1 (level of processing): entity
- within classification 2 (kind of information): understanding
- within classification 3 (Tucker, 1999): informational content
Comments:
RIPTIDES

Name: RIPTIDES
Reference: [164], [162]
Short description: user-directed document summarizer combining the application of techniques from Information Extraction, Extraction-based Summarization and Natural Language Generation. The former reference refers to single-document summarization, the latter to multi-document summarization.
System Features
- Input:
- Architecture: The system proceeds in the following steps:
1. User information needs are acquired by the system
2. Scenario templates are filled by an IE system
3. IE output templates are merged into an event-oriented structure where comparable facts are grouped. For doing so, SimFinder is used.
4. Importance scores are assigned to slots/sentences based on a combination of document position, document recency and group/cluster membership.
5. Content selection
6. Summary generation
- Language coverage: English
- Output facilities and constraints:
Evaluation:
Classification
- within classification 1 (level of processing): entity
- within classification 2 (kind of information): understanding
- within classification 3 (Tucker, 1999): informational content
Comments:
Schiffman et al. 2001

Name:
Reference: [143]
Short description: Multi-document summarizer producing biographical summaries, combining linguistic knowledge with corpus statistics.
System Features
- Input:
- Architecture: A number of modules co-operate in producing the summaries:
  - Sentence tokenizer
  - Nametag NER
  - Cass parser
  - Appositives [...]
[...] identification of a text span as belonging to a topic characterised by its signature. Topic Signatures are tuples of the form <Topic, Signature> where Signature is a list of weighted terms: <t1,w1>, <t2,w2>, ..., <tn,wn>. Topic signatures can be automatically learned ([Lin, 1997], [Lin, Hovy, 2000]). Topic identification, then, includes text segmentation (using TextTiling) and comparison of text spans with existing Topic Signatures.
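Matching a span against a topic signature follows directly from the <Topic, Signature> definition above; the "terrorism" signature and its weights are invented for illustration.

    def signature_score(span_tokens, signature):
        # Sum of the weights of signature terms present in the span,
        # normalised by span length.
        weights = dict(signature)
        return (sum(weights.get(t, 0.0) for t in span_tokens)
                / max(len(span_tokens), 1))

    terrorism = [("bomb", 4.2), ("attack", 3.1), ("casualties", 2.5),
                 ("police", 1.4)]
    span = "police reported casualties after the attack".split()
    print(round(signature_score(span, terrorism), 3))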
The topics identified are fused during the interpretation (2nd step) of the process. The fused [...]
[...] and linguistic transducers identify noun groups and verb groups.
  - concepts are tagged semantically, marking discourse domain relations
  - an indicative abstract is composed, by re-generation of text using pre-defined summary templates
  - based on the first, indicative abstract, an informative abstract can be composed, elaborating each sentence
  - each sentence is assigned values for some surface features: sentence position, length, presence [...]
  - to generate headlines, the most frequent word in the highest-ranked sentence of every document and in the titles is considered a trigger word. Then the sentences in the whole cluster are ranked according to their importance. The highest-ranked noun phrase that contains the trigger word is chosen as the headline.
- Language coverage: English, potentially multilingual
- Output facilities and constraints:
Evaluation: participated in DUC 2002 in the multi-document extract and abstract tracks, with "disappointing performance". In addition, a self-evaluation applying relative utility [133] was carried out, which reports better results. An investigation of the individual contribution of each feature was also performed, revealing that sentence position is highly indicative, while the negative cue-phrase feature was not well defined.
Classification
- within classification 1 (level of processing): surface
- within classification 2 (kind of information): lexical
- within classification 3 (Tucker, 1999): attentional networks / sentence by sentence
Comments:
van Halteren 2002

Name:
Reference: [159]
Short description: multi-document, extractive summarizer. Sentences are classified using feature sets originally devised for writing-style recognition.
System Features
- Input:
- Architecture: each sentence is described by a set of features: distance between occurrences of the same word, distribution of words, relative position of words, sentence length, sentence position, and context of POS tags. A classifier trained for a writing-style recognition task exploits these features for sentence scoring and extraction.
- Language coverage: English, potentially multilingual
- Output facilities and constraints:
Evaluation: participated in DUC 2002, but obtained rather poor results.
Classification
- within classification 1 (level of processing): surface
- within classification 2 (kind of information): lexical
- within classification 3 (Tucker, 1999): sentence by sentence
Comments: the system was trained on materials not oriented to the summarization task.