
Approaches to Text Summarization: Questions and Answers

Laura Alonso, Irene Castellón (Dept. de Lingüística General, Universitat de Barcelona, Gran Via de les Corts Catalanes 585, 08007 Barcelona; {lalonso,castel}@lingua.fil.ub.es)

Salvador Climent (Estudis d'Humanitats i Filologia, Universitat Oberta de Catalunya, Av. Tibidabo 39-43, 08035 Barcelona; scliment@uoc.edu)

Maria Fuentes (Dept. d'Informàtica i Matemàtica Aplicada, Universitat de Girona, Campus Montilivi, 17071 Girona; maria.fuentes@udg.es)

Lluís Padró, Horacio Rodríguez (TALP Research Center, Universitat Politècnica de Catalunya, Jordi Girona 1-3, 08034 Barcelona; {padro,horacio}@lsi.upc.es)

Inteligencia Artificial, Revista Iberoamericana de Inteligencia Artificial, No. 20 (2003), pp. 34-52. ISSN: 1137-3601. AEPIA (http://www.aepia.org/revista)

Abstract

In this paper a comparative study of Automated Text Summarization (TS) systems is presented. It describes the factors to be taken into account for evaluating those systems and outlines three alternative classifications. The paper provides extensive examples of working TS systems according to their characterizing features, performance, and obtained results, with a special emphasis on the multilingual aspect of summarization.

Key Words: Automated Text Summarization, Multilingual Systems

1 Introduction

The field of Text Summarization (TS) has experienced an exponential growth in recent years. That is why many comparative studies can be found in the literature, among the most comprehensive, Paice (1990) [123], Zechner (1997) [168], Sparck-Jones (1998) [147], Hovy and Marcu (1998) [70], Tucker (1999) [157], Radev (2000) [130], Mani (2001) [98] and Maybury and Mani (2001) [105]. Given that an upper bound of performance for TS systems is still far from being reached, task-based competitions are the main forum of discussion in the area. Accordingly, the SUMMAC (1998) [149] and especially DUC (2001, 2002, 2003) [45] contests provide a good overview of current working systems.

In this study, we provide an analysis of current work in TS, with special attention to the future developments of the field, like multilingual summarization. First, we present the factors affecting summarization in Section 2, and provide examples of how working systems handle each of these factors. In Section 3 three possible classifications of summarization systems are outlined, which are applied to concrete systems in Section 4, with a concrete example of multilingual summarization. To finish, we briefly discuss some burning issues in TS.
2 Some Considerations on Summary Aspects

Summarization has traditionally been decomposed into three phases [147, 101, 58, 68, 98]:

- analyzing the input text to obtain a text representation,
- transforming it into a summary representation,
- and synthesizing an appropriate output form to generate the summary text.
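Seen as an architecture, these three phases form a simple pipeline. The minimal sketch below (hypothetical function names and toy bodies, not any particular system's code) is only meant to make the data flow explicit:

    # A minimal sketch of the classical three-phase summarization pipeline.
    # The three functions are placeholders for whatever analysis,
    # transformation and synthesis methods a concrete system implements.

    def analyze(text: str) -> dict:
        # e.g., segment into sentences and annotate them with features
        return {"sentences": text.split(". ")}

    def transform(representation: dict) -> dict:
        # e.g., select or condense the most salient units
        return {"selected": representation["sentences"][:2]}

    def synthesize(summary_repr: dict) -> str:
        # e.g., smooth, reorder and render the selected units as text
        return ". ".join(summary_repr["selected"])

    def summarize(text: str) -> str:
        return synthesize(transform(analyze(text)))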
Effective summarizing requires an explicit and detailed analysis of context factors, as is apparent when we recognize that what summaries should be like is defined by what they are wanted for. The parameters to be taken into account in summarization systems have been widely discussed [101, 68, 98]. We will follow Sparck Jones (1998) [147], who distinguishes three main aspects that affect the process of TS: input, purpose and output, with a special focus on multilinguality.

2.1 Input Aspects

The features of the text to be summarized crucially determine the way a summary can be obtained. The following aspects of the input are relevant to the task of TS:

Document Structure. Besides textual content, heterogeneous documental information can be found in a source document, for example, labels that mark headers, chapters, sections, lists, tables, etc. If it is well systematized and exploited, this information can be of use to analyze the document. For example, Kan (2002) [74] exploits the organization of medical articles in sections to build a tree-like representation of the source. Teufel and Moens (2002) [156] systematize the structural properties of scientific articles to assess the contribution of each textual segment to the article, in order to build a summary from that enriched perspective.

However, it can also be the case that the information it provides is not the target of the analysis. In this case, document structure has to be removed in order to isolate the textual component of the document.

Domain. Domain-sensitive systems are only capable of obtaining summaries of texts that belong to a pre-determined domain, with varying degrees of portability. The restriction to a certain domain is usually compensated by the fact that specialized systems can apply knowledge-intensive techniques which are only feasible in controlled domains, as is the case of the multidocument summarizer SUMMONS [111], specialized in summaries in the terrorism domain applying complex Information Extraction techniques. In contrast, general purpose systems are not dependent on information about domains, which usually results in a more shallow approach to the analysis of the input documents.

Nevertheless, some general purpose systems are prepared to exploit domain-specific information. For example, the meta-summarizer developed at Columbia University [19, 18, 61, 60, 108] applies different summarizers to different kinds of documents: MULTIGEN [19, 109] is specialized in simple events, DEMS [143] (with the bio configuration) deals with biographies, and for the rest of documents, DEMS has a default configuration that can be resorted to.

Specialization level. A text may be broadly characterized as ordinary, specialized, or restricted, in relation to the presumed subject knowledge of the source text readers. This aspect can be considered the same as the domain aspect discussed above.

Restriction on the language. The language of the input can be general language or restricted to a sublanguage within a domain, purpose or audience. It may be necessary to preserve the sublanguage in the summary.

Scale. Different summarizing strategies have to be adopted to handle different text lengths. Indeed, the analysis of the input text can be performed at different granularities, for example, in determining meaning units. In the case of news articles, sentences or even clauses are usually considered the minimal meaning units, whereas for longer documents, like reports or books, paragraphs seem a more adequate unit of meaning. The techniques for segmenting the input text into these meaning units also differ: for shorter texts, orthography and syntax, even discourse boundaries [103], indicate significant boundaries; for longer texts, topic segmentation [79, 63] is more usual.

Media. Although the main focus of summarization is textual summarization, summaries of non-textual documents, like videos, meeting records, images or tables, have also been undertaken in recent years.
The complexity of multimedia summarization has prevented the development of wide-coverage systems, which means that most summarization systems that can handle multimedia information are limited to specific domains or textual genres [62, 104]. However, research efforts also consider the integration of information of different media [21], which allows a wider coverage of multimedia summarization systems by exploiting different kinds of documental information collaboratively, like metadata associated to video records [161].

Genre. Some systems exploit typical genre-determined characteristics of texts, such as the pyramidal organization of newspaper articles, or the argumentative development of a scientific article. Some summarizers are independent of the type of document to be summarized, while others are specialized on some type of documents: health care reports [48], medical articles [74], agency news [111], broadcast fragments [62], meeting recordings [169], e-mails [117, 3], web pages [132], etc.

Unit. The input to the summarization process can be a single document or multiple documents, either simple text or multimedia information such as imagery, audio, or video [150].

Language. Systems can be language-independent, exploiting characteristics of documents that hold cross-linguistically [129, 125], or else their architecture can be determined by the features of a concrete language. This means that some adaptations must be carried out in the system to deal with different languages. As an additional improvement, some multi-document systems are able to deal simultaneously with documents in different languages [33, 34], as will be developed in Section 2.4.

2.2 Purpose Aspects

Situation. TS systems can perform general summarization or else they can be embedded in larger systems, as an intermediate step for another NLP task, like Machine Translation, Information Retrieval or Question Answering. As the field evolves, more and more efforts are devoted to task-driven summarization, to the detriment of a more general approach to TS. This is due to the fact that underspecification of the information needs poses a major problem for the design and evaluation of the systems. As will be discussed in Section 5, evaluation is a major problem in TS. Task-driven summarization presents the advantage that systems can be evaluated with respect to the improvement they introduce in the final task they are applied to.

Audience. In case a user profile is accessible, summaries can be adapted to the needs of specific users, for example, to the user's prior knowledge on a determined subject. Background summaries assume that the reader's prior knowledge is poor, and so extensive information is supplied, while just-the-news summaries are those conveying only the newest information on an already known subject. Briefings are a particular case of the latter, since they collect representative information from a set of related documents.

Usage. Summaries can be sensitive to determined uses: retrieving source text [75], previewing a text [88], refreshing the memory of an already read text, sorting, etc.

2.3 Output Aspects

Content. A summary may try to represent all relevant features of a source text or it may focus on some specific ones, which can be determined by queries, subjects, etc. Generic summaries are text-driven, while user-focused (or query-driven) ones rely on a specification of the user's information need, like a question or key words.

Related to the kind of content that is to be extracted, different computational approaches are applied. The two basic approaches are top-down, using information extraction techniques, and bottom-up, more similar to information retrieval procedures. Top-down is used in query-driven summaries, when criteria of interest are encoded as a search specification, and this specification is used by the system to filter or analyze text portions. The strategies applied in this approach are similar to those of Question Answering. On the other hand, bottom-up is used in text-driven summaries, when generic importance metrics are encoded as strategies, which are then applied over a representation of the whole text.

Format. The output of a summarization system can be plain text, or else it can be formatted. Formatting can be targeted to many purposes: conforming to a pre-determined style (tags, organization in fields), improving readability (division in sections, highlighting), etc.
Style. A summary can be informative, if it covers the topics in the source text; indicative, if it provides a brief survey of the topics addressed in the original; aggregative, if it supplies information not present in the source text that completes some of its information or elicits some hidden information [156]; or critical, if it provides an additional assessment of the summarized text.

Production Process. The resulting summary text can be an extract, if it is composed of literal fragments of the text, or an abstract, if it is generated. The type of summary output desired can be relatively polished, for example, if the text is well-formed and connected, or else more fragmentary in nature (e.g., a list of key words).

There are intermediate options, mostly concerning the nature of the fragments that compose extracts, which can range from topic-like passages, a paragraph or several paragraphs long, to clauses or even phrases. In addition, some approaches perform editing operations on the summary, overcoming the incoherence and redundancy often found in extracts, but at the same time avoiding the high cost of a NL generation system. Jing and McKeown (2000) [73] apply six re-writing strategies to improve the general quality of an extract-based summary by editing operations like deletion, completion or substitution of clausal constituents.

Surrogation. Summaries can stand in place of the source as a surrogate, or they can be linked to the source [75, 88], or even be presented in the context of the source (e.g., by highlighting source text [86]).

Length. The targeted length of the summary crucially affects the informativeness of the final result. This length can be determined by a compression rate, that is to say, a ratio of the summary length with respect to the length of the original text. Traditionally, compression rates range from 1% to 30%, with 10% as a preferred rate for article summarization. In the case of multidocument summarization, though, length cannot be determined as a ratio to the original text(s), so the summary always conforms to a pre-determined length. Summary length can also be determined by the physical context where the summary is to be displayed. For example, in the case of delivery of news summaries to handhelds [23, 28, 39], the size of the screen imposes severe restrictions on the length of the summary. Headline generation is another application where the length of summaries is clearly determined [165, 41]. In very short summaries, coherence is usually sacrificed to informativeness, so lists of words are considered acceptable [80, 167].
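As a minimal illustration of how a compression rate translates into a selection budget, the sketch below greedily keeps ranked sentences until the word budget implied by the rate is exhausted; the ranking itself is left abstract, and the function name is ours, not taken from any system:

    def select_by_compression(ranked_sentences, source_word_count, rate=0.10):
        """Keep top-ranked sentences within the word budget implied by a
        compression rate (summary length / source length)."""
        budget = int(source_word_count * rate)
        summary, used = [], 0
        for sentence in ranked_sentences:
            n = len(sentence.split())
            if used + n <= budget:
                summary.append(sentence)
                used += n
        return summary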
2.4 Language Coverage

As regards language coverage, systems can be classified as monolingual, multilingual, and crosslingual (a similar classification is commonly used in Information Retrieval systems). Monolingual summarization systems deal with only one language for both the input document and the summary. In the case of multilingual systems, input and output languages are also the same, but in this case the system can cover several languages. Crosslingual systems are able to process input documents in several languages, producing summaries in different languages.

Multilinguality does not imply additional difficulties. Most of the systems and techniques we will present below can be easily adapted to other languages, assuming, of course, the availability of the knowledge sources needed for the different methods. Roughly speaking, the more linguistic knowledge is needed by a system, the more difficult it is to port it to another language.

A more complex challenge is crosslinguality. There are examples of single-document crosslingual summarizers, implying a certain amount of translation, either on the input text or on the summary, but most crosslingual summarizers are multidocument. In this case a lot of problems specific to translinguality arise. Measures of similarity between documents and passages in different languages, for identifying relations or for clustering, have to be envisaged. Similarity between lexical units (words, NEs, multiword terms) belonging to different languages has to be computed as well. Obviously, the more distant the involved languages are, the harder these problems turn out to be, specially if the languages present different lexical units or character sets. Since this is a burning issue, it will be discussed at length in Section 5.

3 Approaches to Text Summarization

There are several ways in which one can characterize different approaches to text summarization. In this section, we present three possible classifications of text summarization systems, but many others can be found in the literature [70, 130, 105, 98].
The first classification, following Mani and Maybury (1999) [101], is based on the level of processing that each system performs; the second, proposed in Alonso and Castellón (2001) [4], is based on the kind of information exploited; the third follows Tucker (1999) [157].

3.1 Classification 1: Level of Processing

One useful way to classify summarization systems is to examine the level of processing of the text. Based on this, summarization can be characterized as approaching the problem at the surface, entity, or discourse level [101].

3.1.1 Surface level

Surface-level approaches tend to represent information in terms of shallow features that are then selectively combined together to yield a salience function used to extract information, following the approach of Edmunson (1969) [47]. These features include:

Term frequency statistics provide a thematic representation of the text, assuming that important sentences are the ones that contain words that occur frequently. The score of a sentence increases for each frequent word it contains. Early summarization systems directly exploit word distribution in the source [96].

Location relies on the intuition that important sentences are located at positions that are usually genre-dependent; however, some general rules are the lead method and the title-based method. The lead method consists of just taking the first sentences. The title-based method assumes that words in titles and headings are positively relevant to summarization. A generalization of these methods is the OPP used by Hovy and Lin in their SUMMARIST system [91], where they exploit Machine Learning techniques to identify the positions where relevant information is placed within different textual genres. Many of the current systems, specially those applying machine learning techniques, take into account the location of meaning units in a document to assess their relevance.

Bias. The relevance of meaning units is determined by the presence of terms from the title or headings, the initial part of the text, or the user's query. For example, [37, 36, 144] use as features the position in the sentence, the number of tokens and the number of pseudo-query terms.

Cue words and phrases are signals of relevance or irrelevance. They are typically meta-linguistic markers (e.g., cues: "in summary", "in conclusion", "our investigation", "the paper describes"; or emphasizers: "significantly", "important", "in particular", "hardly", "impossible"), as well as domain-specific bonus phrases and stigma terms. Although lists of these phrases are usually built manually [82, 154], they can also be detected automatically.
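Taken together, these features suggest an Edmunson-style sentence scorer. The sketch below is a toy instantiation under stated assumptions: the cue lists and equal feature weights are illustrative, not taken from any of the systems above.

    # Toy surface-level sentence scorer combining frequency, location,
    # title and cue-phrase evidence. Cue lists and weights are invented.
    from collections import Counter

    BONUS = {"significantly", "important", "conclusion", "summary"}
    STIGMA = {"hardly", "impossible"}

    def score_sentences(sentences, title, weights=(1.0, 1.0, 1.0, 1.0)):
        w_freq, w_loc, w_title, w_cue = weights
        words = [w for s in sentences for w in s.lower().split()]
        freq = Counter(words)
        title_words = set(title.lower().split())
        scores = []
        for i, s in enumerate(sentences):
            toks = s.lower().split()
            f = sum(freq[t] for t in toks) / max(len(toks), 1)   # frequency
            loc = 1.0 if i < 3 else 0.0                          # lead method
            ttl = len(title_words.intersection(toks))            # title overlap
            cue = sum(t in BONUS for t in toks) - sum(t in STIGMA for t in toks)
            scores.append(w_freq * f + w_loc * loc + w_title * ttl + w_cue * cue)
        return scores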
3.1.2 Entity-level

Entity-level approaches build an internal representation of the text by modeling text entities (simple words, compound nouns, named entities, etc.) and their relationships. These approaches tend to represent patterns of connectivity in the text (e.g., graph topology) to help determine saliency. Relations between entities include:

Similarity. Similar words are those whose form is similar, for example, those sharing a common stem (e.g., "similar" and "similarity"). Similarity can be calculated with linguistic knowledge or by character string overlap. Myaeng and Jang (1999) [118] use two similarity measures for determining whether a sentence belongs to the major content: a similarity between the sentence and the rest of the document, and a similarity between the sentence and the title of the document. Also, in NTT [65, 66] and CENTRIFUSER [75], several similarity measures are applied.

Proximity. The distance between the text units where entities occur is a determining factor for establishing relations between entities.

Cohesion. Cohesion can be defined in terms of connectivity. Connectivity accounts for the fact that important text units usually contain entities that are highly connected in some kind of semantic structure. Cohesion can be approached by:

- Word co-occurrence: words can be related if they occur in common contexts. Some applications are presented in Baldwin and Morton (1998) and McKeown et al. (1999) [13, 109]. Salton et al. (1997) and Mitra et al. (1997) [141, 113] apply IR methods at the document level, treating paragraphs in texts as documents are treated in a collection of documents. Using a traditional IR-based method, a word similarity measure is used to determine the set Si of paragraphs that each paragraph Pi is related to. After determining the relatedness scores Si for each paragraph, the paragraphs with the largest Si scores are extracted. In SUMMAC [97], in the context of query-based summarization, Cornell's Smart-based approach expands the original query, compares the expanded query against paragraphs, and selects the top three paragraphs (max 25% of the original) that are most similar to the original query.

- Local salience: important phrasal expressions are given by a combination of grammatical, syntactic, and contextual parameters [24].

- Lexical similarity: words can be related by thesaural relationships (synonymy, hypernymy, meronymy). Barzilay (1997) [16] details a system where Lexical Chains are used, based on Morris and Hirst (1991) [116]. This line has also been applied to Spanish, relying on EuroWordNet relations between words, by Fuentes and Rodríguez (2002) [53]. The assumption is that important sentences are those that are crossed by strong chains¹. This approach provides a partial account of texts, since it focuses mostly on cohesive aspects. An integration of cohesion and coherence features of texts might contribute to overcome this, as Alonso and Fuentes (2002) [5] point out.

- Co-reference: referring expressions can be linked, and co-reference chains can be built with co-referring expressions. Both Lexical Chains and Co-reference Chains can be prioritized if they contain words from a query (for query-based summaries) or from the title; so the preference imposed on chains is: query > title > document. Bagga and Baldwin (1998) and Azzam et al. (1999) [11, 10] use coreference chains for summarization. Baldwin and Morton (1998) [13] exploit co-reference chains specifically for query-sensitive summarization.

The Connectedness method [100] represents texts as graphs: words in the text are the nodes, and arcs represent adjacency, grammatical, co-reference, and lexical similarity-based relations.

Logical relations, such as agreement, contradiction, entailment, and consistency.

Meaning representation-based relations, that is, establishing relations, such as predicate-argument, between entities in the text. The system of Baldwin and Morton (1998) [13] uses argument detection in order to resolve co-reference between the query and the text for performing summarization.
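Most of the similarity and relatedness computations above reduce to comparing bag-of-words term vectors. A rough sketch of the IR-style method (cosine similarity between paragraphs, then extraction of the most connected ones), assuming simple whitespace tokenization:

    # Sketch of IR-style paragraph relatedness: each paragraph is a term
    # vector; its score sums cosine similarities to all other paragraphs,
    # and the highest-scoring paragraphs are extracted.
    import math
    from collections import Counter

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[t] * b[t] for t in a)
        norm = math.sqrt(sum(v * v for v in a.values())) * \
               math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def extract_paragraphs(paragraphs, k=3):
        vectors = [Counter(p.lower().split()) for p in paragraphs]
        scores = [sum(cosine(v, w) for j, w in enumerate(vectors) if j != i)
                  for i, v in enumerate(vectors)]
        ranked = sorted(range(len(paragraphs)), key=scores.__getitem__,
                        reverse=True)
        return [paragraphs[i] for i in ranked[:k]]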
3.1.3 Discourse-level

Discourse-level approaches model the global structure of the text, and its relation to communicative goals. At this level, the following information can be exploited:

Format of the document (e.g., hypertext markup, document outlines).

Threads of topics, as they are revealed in the text. An example of this is SUMMARIST, which applies Topic Identification [69, 95]. Topic identification implies the previous acquisition of Topic Signatures (which can be automatically learned) and then the identification of a text span as belonging to a topic characterized by its signature. Topic identification, then, includes text segmentation and comparison of text spans with existing Topic Signatures. The identified topics are fused during the interpretation step of the process. The fused topics are then expressed in new terms. Other systems are Boros et al. (2001) [25] and MEAD [133, 128, 121]. These systems assign a topic to the sentences in order to create clusters for selecting the sentences to appear in the summary.

Rhetorical structure of the text, representing argumentation or narrative structure. The main idea is that the coherence structure of a text can be constructed, so that the 'centrality' of the textual units in this structure will reflect their importance. A tree-like representation of texts is proposed by Rhetorical Structure Theory [102]. Ono et al. (1994) [120] and Marcu (1997) [103] attempt to use this kind of discourse representation in order to determine the most important textual units. They propose an approach to rhetorical parsing by discourse markers and semantic similarities in order to hypothesize rhetorical relations. These hypotheses are used to derive a valid discourse representation of the original text.

¹ Lexical chains have also been used in other NLP tasks, such as the automatic extraction of interdocument links [56].
3.2 Classification 2: Kind of Information

Summarization systems can be classified by the kind of information they deal with [4]. According to this, we can distinguish between those exploiting lexical aspects of texts, those working with structural information, and those trying to achieve a deep understanding of texts.

3.2.1 Lexical

These approaches exploit the information associated to words in the texts. Some of them are very shallow, relying on the frequency of words, but some others apply lexical resources to obtain a deeper representation of texts. Beginning with the most shallow, the following main trends can be distinguished. A common assumption of these approaches is that repeated information could be a good indicator of importance:

Word Frequency approaches assume that the most frequent words in a text are the most representative of its content, and consequently fragments of text containing them are more relevant. Most systems apply some kind of filter to leave out of consideration those words that are very frequent but not indicative, for example, by the tf*idf metric or by excluding the so-called stop words, words with grammatical but no meaning content.

Domain Frequency tries to determine the relevance of words by first assigning the document to a particular domain. Domain-specific words have a previous relevance score, which serves as a comparison ground to adequately evaluate their frequency in a given text.

Concept Frequency abstracts from mere word-counting to concept-counting. By use of an electronic thesaurus or WordNet, each word in the text is associated to a more general concept, and frequency is computed on concepts instead of particular words.

Cue words and phrases can be considered as indicators of the relative relevance or non-relevance of fragments of text with respect to the others.

Chains can be built from lexical items which are related by conceptual similarity according to a lexical resource (lexical chains) or by identity, if they co-refer to the same entity (co-reference chains). The fragments of text crossed by most chains, by the most important chains or by the most important parts of chains can be considered the most representative of the text.
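A minimal sketch of the chain idea, using words that share a crudely stemmed form as a stand-in for the thesaurus or WordNet relations a real chainer would use; the suffix-stripping "stemmer" below is purely illustrative:

    # Naive chain builder: words sharing a (crudely stemmed) form are put
    # in the same chain; sentences crossed by the strongest chains score
    # highest. Real systems use WordNet/EuroWordNet relations instead.
    from collections import defaultdict

    def stem(word: str) -> str:
        for suffix in ("ities", "ity", "ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def chain_scores(sentences):
        chains = defaultdict(set)       # stem -> indices of crossed sentences
        for i, sentence in enumerate(sentences):
            for word in sentence.lower().split():
                chains[stem(word)].add(i)
        scores = [0] * len(sentences)
        for members in chains.values():
            if len(members) > 1:        # only chains spanning 2+ sentences
                for i in members:
                    scores[i] += len(members)   # strength = chain length
        return scores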
3.2.2 Structural Information

A second direction in TS tries to exploit information from texts as structured entities. Since texts are structured in different dimensions (documental, discursive, conceptual), different kinds of structural information can be exploited. Beginning with the most shallow:

Documental Structure exploits the information that texts carry in their format, for example, headings, sections, etc.

Textual Structure. Some positions in a text systematically contain the most relevant information, for example, the beginning paragraph of news stories. These positions are usually genre- or domain-dependent.

Conceptual Structure. The chains mentioned in lexical approaches can be considered as a kind of conceptual structure.

Discursive Structure can be divided in two main lines: linear or narrative, and hierarchical or rhetorical. The first tries to account for satisfaction-precedence-like relations among pieces of text; the second explains texts as trees where fragments of text are related with each other by virtue of a set of rhetorical relations, mostly asymmetric.

3.2.3 Deep Understanding

Some approaches try to achieve an understanding of the text in order to build a summary. Two main lines can be distinguished:

Top-down approaches try to recognize pre-defined knowledge structures in texts, for example, templates or frames.

Bottom-up approaches try to represent texts as highly conceptual constructs, such as scenes. Others apply fragmentary knowledge structures as clues to parts of the text, and then build a complete representation out of these small parts.
3.3 Classification 3: Richard Tucker (1999)

This classification is taken from Tucker (1999) [157]. It considers four main directions in TS: summarizing from attentional networks, sentence by sentence, from informational content, and from discourse structure.

The classes proposed here are even less disjunct than those in the two previous classifications; thus every system can be considered as an instance of more than one of the classes. This shows the inadequacy of a taxonomic perspective on summarization systems, due to the heterogeneous kinds of knowledge and techniques that systems tend to incorporate.

3.3.1 Attentional Networks

The approaches to summarization in this direction try to grasp what a text is 'about' by identifying concepts that are in some sense central to the text, on the basis of the occurrence of the same or related concepts in different parts of the source representation. Aboutness is represented as the links between these occurrences.

Frequency-based approaches exploit the frequency with which the concepts occur in the representation. In systems based on word frequency, attentional networks are only represented implicitly. Some systems account for frequency significance by applying IR techniques, such as the tf*idf measure. Others apply corpus-based statistical natural language processing, such as collocation or proper noun identification. Still others try to abstract from individual words to achieve concept frequency, by using lexicons or thesauri [69].

On the other hand, some systems identify and exploit the cohesive links holding between parts of the source text. These links can be represented as graph-like structures [145] or as lexical chains.

3.3.2 Sentence by Sentence

Some summarizing systems decide for each sentence in the source text whether it is important for summarizing, rather independently of the text as a whole. To do that, they rely on relevance or irrelevance marks that can be found in sentences, for example, cue words. However, it must be noted that most of the systems applying sentence-by-sentence relevance ranking do not rely entirely on this method, but use it in combination with other methods that tend to consider the text as a whole.

3.3.3 Informational Content

Some approaches to summarization have tried to understand the text, that is to say, to achieve a representation of some or all of its meaning whereupon reasoning can be applied. This approach requires a deeper analysis of the source text but allows the production of sophisticated summaries, for example, by applying NL generation techniques. However, these methods tend to be highly domain-dependent, because of the huge amount of information they require.

3.3.4 Discourse Structure

Discourse structure is used by many systems in a limited way, for example, by trying to grasp a text's 'aboutness'. In contrast, some other methods apply discourse theories to the analysis of the source text in order to obtain a representation of its discourse structure. However, work in this area has been largely theoretical.

3.4 Combined Systems

The predominant tendency in current systems is to integrate some of the techniques mentioned so far. Integration is a complex matter, but it seems the appropriate way to deal with the complexity of textual objects. In this section, we present some examples of combinations of different techniques.

There are several systems where different methods are combined. Among the most interesting are [82, 156, 69, 100], where the title-based method is combined with cue, location, position, and word-frequency based methods.

As the field progresses, summarization systems tend to use more and deeper knowledge. For example, IE techniques are becoming widely used. Many systems no longer rely on a single indicator of relevance or coherence, but take into account as many of them as possible.
So, the tendency is that heterogeneous kinds of knowledge are merged in increasingly enriched representations of the source text(s).

These enriched representations allow for adaptability of the final summary to new summarization challenges, such as multidocument, multilingual and even multimedia summarization. In addition, such a rich representation of text is a step towards generation or, at least, pseudo-generation by combining fragments of the original text. Good examples of this are [108, 93, 41, 84, 59], among others.

4 Summarization Systems

Table 1 shows how existing summarization systems would be classified according to each of the classifications presented in the previous section. However, it must be taken into account that most current summarization systems are very complex, resorting to very heterogeneous information and applying varied techniques, so a classification will never be clear-cut. Moreover, systems tend to evolve with time, which makes their classification still more controversial.

Files with a more extensive description of some of these systems (marked with an asterisk) can be found in the Annex (in the electronic version only). Additionally, Table 2 lists on-line or downloadable systems.

Multilinguality of the systems is one of the features in each describing file. It is stated whether the system can summarize only a single language, a definite set of languages, or whether its architecture permits unrestricted multilinguality. In this latter case, it is also stated whether experiments with different languages are reported.

As a concrete example of an approach to multilingual summarization, we present the systems developed within the project HERMES². The target of project HERMES is to adapt and apply language technologies for Spanish, Catalan, Basque and English to improve access to textual information in digital libraries, the Internet, documental Intranets, etc. Therefore, the HERMES summarization system should integrate multiple languages in a common architecture. Since the resources available for every language are uneven, this architecture has to be flexible enough to adapt to knowledge-poor representations of text but also to exploit rich representations when available.

² http://terral.ieec.uned.es/hermes/

EuroWordNet [160] is a general resource available for these four languages, so a first approach to summarization exploited this resource. A Lexical Chain summarizer was developed for Spanish [53]. As can be seen in Figure 1, the architecture of the summarizer permits easy adaptation to other languages, provided there is at least a morphological analyzer and a version of EuroWordNet available for the language. If other NLP tools are available, like Named Entity Recognizers or co-reference solvers, they can be easily integrated within the system. Once the text has been analyzed and Lexical Chains have been obtained, a summary is built by extracting candidate textual units from the text. Candidate units are chosen applying a certain heuristic, weighting some aspects of the Lexical Chains.

A second approach to the task of summarization, seen in Figure 2 [52], tries to overcome this dependency on the lexicon by applying Machine Learning techniques. The system is trained with a corpus of sentences described with a set of features, like position in the text, length, and also whether they are crossed by a Lexical Chain. For each of these sentences, it is previously determined whether it belongs to a summary of the text or not, so that it can be learned which combinations of features characterize summary sentences. In a text to be summarized, each sentence is described with the same set of features, and it is determined whether these describing features characterize the sentence as a summary sentence or not. The summary is composed of the sentences qualifying as summary sentences.

This second system does not require any specific feature to produce a summary, not even Lexical Chains. However, the more information available, the more accurate the learning process will be, which will result in better summaries. This approach has been evaluated for English within the DUC 2003 contest, but it can be used straightforwardly for any other language, as long as there is a training corpus available.
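A minimal sketch of this train-then-classify scheme, assuming scikit-learn is available and using three illustrative features (position, length, crossed-by-chain); the actual HERMES feature set and learner may differ:

    # Sketch: learn which feature combinations characterize summary
    # sentences, then label new sentences. Features are illustrative.
    from sklearn.tree import DecisionTreeClassifier

    def features(index, sentence, crossed_by_chain):
        return [index,                      # position in the text
                len(sentence.split()),      # length in words
                int(crossed_by_chain)]      # crossed by a Lexical Chain?

    # Training corpus: (feature vector, is-summary-sentence) pairs.
    X_train = [features(0, "First sentence of a news story.", True),
               features(7, "A minor aside late in the text.", False)]
    y_train = [1, 0]

    classifier = DecisionTreeClassifier().fit(X_train, y_train)

    def summarize(sentences, chain_flags):
        described = [features(i, s, c)
                     for i, (s, c) in enumerate(zip(sentences, chain_flags))]
        labels = classifier.predict(described)
        return [s for s, keep in zip(sentences, labels) if keep]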
5 Burning Issues

The field has experienced an exponential growth since its beginnings, but some crucial questions are still open.
5.1 Coherence of Summary Texts

Paice (1990) [123] pointed out that the main shortcomings of summarization systems up to the 1990s were their low representativity of the content in the source text and their lack of coherence.

Much of the work in this area has treated the problem of text summarization from a predominantly information-theoretic perspective. Therefore, texts have been modeled as mathematical objects, where relevance and redundancy can be defined in purely statistical terms. This approach seems specially valuable to produce a satisfactory representation of the content of a text. However, it fails in producing coherent texts, acceptable for human users.

The shortcomings of purely statistical approaches to text summarization in handling textual coherence are addressed from two different perspectives:

- Applying machine learning techniques. They have been used mainly for two purposes: classifying a sentence from a source text into relevant or non-relevant [82, 8, 99, 90, 65] and transforming a source sentence considered relevant into a summary sentence [73, 78, 59]. The input for learning algorithms is usually texts with their corresponding abstracts. Therefore, the main shortcoming of this approach is the need to obtain large quantities of <text, abstract> tuples for a variety of textual genres.

- Resorting to symbolic linguistic or world knowledge. Understanding of texts, mainly through IE techniques, seems a desirable way of producing quality summaries. Until recently, such techniques had only been applied to very restricted domains [111]. However, recent systems tend to incorporate IE modules that perform a partial understanding of the text, either by modeling the typical context of relevant pieces of information [84, 76], or by applying general templates to find, organize and use the typical content of a kind of text or event [59, 41]. This use of IE techniques has produced very good results, as is reflected in the high ranking of Harabagiu and Lacatusu (2002) [59] in DUC 2002. A combination of deeper knowledge with surface clues seems to yield good results, too [93].

5.2 Multidocument Summarization

Multidocument summarization is one of the major challenges in current summarization systems. It consists of producing a single summary of a collection of documents dealing with the same topic. The work has been mostly determined by the corresponding DUC task. Therefore, it has mainly focused on collections of news articles on a given topic. Remarkable progress has been achieved in avoiding redundancy, mainly based on the work of Carbonell and Goldstein (1998) [30].
When dealing with MDS, new problems arise: lower compression factors implying a more aggressive condensation, anti-redundancy, the temporal dimension, a more challenging coreference task (inter-document), etc. Clustering of similar documents now plays a central role [30, 133, 60, 110]. Selecting the most relevant fragments from each cluster and assuring the coherence of summaries coming from different documents are other important problems, currently under development in MDS systems.

5.3 Multilingual Summarization

As for multilingual summarization, not much work has been done yet, but the roadmap for the DUC contests [12] contemplates this challenge in the near future of the area.

The most well-known multilingual summarization system is SUMMARIST [69]. The system extracts sentences in a variety of languages (English, Spanish, Japanese, etc.) and translates the resulting summaries. SUMMARIST proceeds in three steps: Topic Identification, Interpretation and Summary Generation. Topic identification implies the previous acquisition of Topic Signatures and then the identification of a text span as belonging to a topic characterized by its signature. Topic Signatures are tuples of the form <Topic, Signature>, where Signature is a list of weighted terms: {<t1, w1>, <t2, w2>, ..., <tn, wn>}. Topic signatures can be automatically learned [89, 95]. Topic identification, then, includes text segmentation (using Marti Hearst's TextTiling) and comparison of text spans with existing Topic Signatures. The identified topics are fused during interpretation, the second step of the process. The fused topics are then reformulated, that is to say, expressed in new terms. The last step is a conventional extractive task.
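A sketch of the identification step: given signatures of the form above, score a text span by the summed weights of the signature terms it contains and assign the best-matching topic. The signature contents here are invented for illustration:

    # Sketch of topic identification with Topic Signatures: a signature
    # pairs a topic with weighted terms; a span is assigned the topic
    # whose signature it matches with the highest total weight.
    SIGNATURES = {
        "earthquake": {"quake": 3.2, "richter": 2.9, "epicenter": 2.5},
        "election":   {"ballot": 3.0, "candidate": 2.4, "poll": 2.1},
    }

    def identify_topic(span: str):
        tokens = span.lower().split()
        def score(topic):
            terms = SIGNATURES[topic]
            return sum(terms.get(t, 0.0) for t in tokens)
        best = max(SIGNATURES, key=score)
        return best, score(best)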
In order to face multilingual problems, the involved knowledge sources have to be as language-independent as possible. In the case of SUMMARIST, sets of Topic Signatures have to be obtained for all the languages involved using the same procedures. The segmentation procedure is also language-independent. So, the accuracy of the resulting summaries depends heavily on the quality of the translators.

As has been said before, a more challenging issue is crosslingual multidocument summarizers. Basically, three main problems have to be addressed: 1) clustering of multilingual documents, 2) measuring the distance (or similarity) between multilingual units (documents, paragraphs, sentences, terms), and 3) automatic translation of documents or summaries. Most systems differ on the way they face these problems, the order in which they are performed, and the granularity of the units they deal with.

Evans and Klavans (2003) [49] present a platform for multilingual news summarization that extends Columbia's Newsblaster system [106]. The system adds a new component, translation, to the original six major modules: crawling, extraction, clustering, summarization, classification and web page generation, which have been, in turn, modified to allow multilinguality (language identification, different character encodings, language idiosyncrasy, etc.).

In this system multilingual documents are translated into English before clustering, so that clustering is performed only on English texts.

Translation is carried out at two levels. Because a low-quality translation is usually enough for clustering purposes and for assessing the relevance of the sentences, a simple and fast technique is applied for glossing the input documents prior to clustering. A (relatively) higher-quality translation (using Altavista's Babelfish interface to Systran) is performed in a second step, only over the fragments selected to be part of the summary.

The system also takes into account the possible degradation of the input texts as a result of the translation process, since most of the sentences resulting from this process are simply not grammatically correct.

Chen et al. (2003) [34] consider three possibilities for scheduling the basic steps of document translation and clustering:

1. Translation before document clustering (as in Columbia's system), named the one-phase strategy. This model clusters the multilingual documents directly, resulting in multilingual clusters.

2. Translation after document clustering, named the two-phase strategy. This model clusters documents in each language separately and merges the clustering results.

3. Translation deferred to sentence clustering. First, monolingual clustering is performed at the document level. All the documents in each cluster refer to the same event in a specific language. Then, for generating the extracted summary of an event, all the clusters referring to this event are taken into account. Similar sentences of these multilingual clusters are clustered together, now at the sentence level. Finally, a representative sentence is chosen from each cluster and translated if needed.

The accuracy of this process depends basically on the way the similarity between different multilingual units is computed. Several forms of such functions are presented and empirically evaluated by the authors.

These measures are multilingual extensions of a baseline monolingual similarity measure. Sentences are represented as bags of words (only nouns and verbs are taken into account). The similarity measure is a function of the number of (approximate) matches between words and of the size of the bags. The matching function in the baseline reduces, except for NEs, to identity. In the multilingual variants of the formula, a bilingual dictionary is used as a knowledge source for computing this matching.
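A sketch of such a dictionary-mediated bag similarity; the Dice-style normalization and the tiny dictionary are assumptions for illustration, not the exact formula of [34]:

    # Sketch of a bilingual bag-of-words similarity: a word in bag_a
    # matches bag_b if it is identical to a word there (e.g., a named
    # entity) or if the dictionary translates it into one. Bags are sets.
    BILINGUAL_DICT = {"casa": {"house", "home"}, "eleccion": {"election"}}

    def crosslingual_similarity(bag_a, bag_b):
        matches = 0
        for word in bag_a:
            candidates = {word} | BILINGUAL_DICT.get(word, set())
            if candidates & bag_b:
                matches += 1
        # Dice-style normalization by the sizes of both bags.
        return 2.0 * matches / (len(bag_a) + len(bag_b)) if bag_a or bag_b else 0.0

Under this toy dictionary, crosslingual_similarity({"casa"}, {"house"}) returns 1.0.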
Despite its simplicity, the position-free measure (the simplest one) seems to be the most accurate among the studied alternatives. In this approach, the translations of all the words of the bag are collected and the similarity is computed as in the baseline. All the other alternatives constrain in some way the possible mappings between words, using different greedy strategies. Their results are, however, worse.

In the experiments, the two-phase strategy outperforms the one-phase strategy. The third strategy, deferring the translation to sentence clustering, seems to be the most promising.
A system covering English and Chinese and following this approach is presented in Chen and Lin (2000) [35]. The main components of the system are a set of monolingual news clusterers, a unique multilingual news clusterer and a news summarizer. A central issue of the system is the definition and identification of meaningful units as a base for comparison. For English these units can be reduced to sentences, but for Chinese the identification of units and the associated segmentation of the text can be a difficult task. Another important issue of the system (general for systems covering distant languages or different encoding schemata) is the need for a robust transliteration of names (or of words not occurring in the bilingual dictionary) to assure accurate matching.

5.4 Evaluation

Last but not least, evaluation of summaries is a major issue, because objective judgements are needed to assess the progress achieved by different approaches. Some contests have been carried out to evaluate summarization systems with common, public procedures: the SUMMAC contest and the series of DUC contests. Specially the latter has provided sets of criteria to evaluate summary quality in many different dimensions: informational coverage (precision and recall), suitability to length requirements, grammatical and discursive coherence, etc.

An extensive investigation on the automatic evaluation of automatic summaries was carried out in a six-week workshop at Johns Hopkins University [134], where different evaluation metrics were proposed, including the relative utility method. Mani (2001) [98] provides a clear picture of the current state of the art in evaluation, both with human judges and by automated metrics, with a special emphasis on content-based metrics. Hovy and Lin (2003) [94] show that the summaries produced by human judges are not reliable as a gold standard, because they strongly disagree with each other. A consensus summary obtained by applying content-based metrics, like unigram overlap, seems much more reliable as a gold standard against which summaries can be contrasted.
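Unigram overlap itself is easy to state precisely. The sketch below computes a recall-oriented overlap of a candidate summary against one or more reference summaries, with clipped counts so repetitions are not rewarded; the exact formula used in DUC-style evaluations may differ:

    # Sketch of a unigram-overlap score: the fraction of reference
    # unigrams covered by the candidate, with clipped counts.
    from collections import Counter

    def unigram_overlap(candidate: str, references: list) -> float:
        cand = Counter(candidate.lower().split())
        total_matched = total_ref = 0
        for ref in references:
            ref_counts = Counter(ref.lower().split())
            total_matched += sum(min(cand[t], n) for t, n in ref_counts.items())
            total_ref += sum(ref_counts.values())
        return total_matched / total_ref if total_ref else 0.0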
Acknowledgements

This research has been conducted thanks to the projects HERMES (TIC2000-0335-C03-02), PETRA (TIC2000-1735-C02-02), and ALIADO (TIC2002-04447-C02-01); it has also been partially funded by a grant associated to the X-TRACT project, PB98-1226, of the Spanish Research Department, and by the project INTERLINGUA (IN3-IR226).

References

[1] Enrique Alfonseca and Pilar Rodríguez. Description of the UAM system for generating very short summaries at DUC-2003. In HLT/NAACL Workshop on Text Summarization / DUC 2003, 2003.

[2] D. Allport. The TICC: parsing interesting text. In Proceedings of the Second Conference on Applied Natural Language Processing, pages 211-218, 1988.

[3] Laura Alonso, Bernardino Casas, Irene Castellón, Salvador Climent, and Lluís Padró. Carpanta eats words you don't need from e-mail. In SEPLN, XIX Congreso Anual de la Sociedad Española para el Procesamiento del Lenguaje Natural, 2003.

[4] Laura Alonso and Irene Castellón. Aproximació al resum automàtic per marcadors discursius. Technical report, CLiC, Universitat de Barcelona, Barcelona, 2001.

[5] Laura Alonso and Maria Fuentes. Collaborating discourse for text summarisation. In Proceedings of the Seventh ESSLLI Student Session, 2002.

[6] R. Angheluta, R. De Busser, and M.-F. Moens. The use of topic segmentation for automatic summarization. In Workshop on Text Summarization (in conjunction with ACL 2002, including the DARPA/NIST-sponsored DUC 2002 Meeting on Text Summarization), Philadelphia, July 11-12, 2002.

[7] Roxana Angheluta, Marie-Francine Moens, and Rik De Busser. K.U. Leuven summarization system. In DUC 2003, Edmonton, Alberta, Canada, May 31-June 1, 2003. Association for Computational Linguistics.

[8] C. Aone, M. Okurowski, and J. Gorlinsky. Trainable scalable summarization using robust NLP and machine learning. In COLING-ACL, pages 62-66, 1998.
[9] Chinatsu Aone, Mary Ellen Okurowski, James Gorlinsky, and Bjornar Larsen. A scalable summarization system using robust NLP. In Proceedings of the ACL'97/EACL'97 Workshop on Intelligent Scalable Text Summarization, pages 66-73, 1997.

[10] Saliha Azzam, Kevin Humphrey, and Robert Gaizauskas. Using coreference chains for text summarisation. In Amit Bagga, Breck Baldwin, and Sara Shelton, editors, Proceedings of the ACL'99 Workshop on Coreference and Its Applications, pages 77-84, University of Maryland, College Park, Maryland, USA, June 1999. ACL.

[11] Amit Bagga and Breck Baldwin. Algorithms for scoring coreference chains. In Proceedings of the Linguistic Coreference Workshop at the First International Conference on Language Resources and Evaluation (LREC'98), pages 536-566, Granada, 1998.

[12] Breck Baldwin, Robert Donaway, Eduard Hovy, Elizabeth Liddy, Inderjeet Mani, Daniel Marcu, Kathleen McKeown, Vibhu Mittal, Marc Moens, Dragomir Radev, Karen Sparck Jones, Beth Sundheim, Simone Teufel, Ralph Weischedel, and Michael White. An evaluation roadmap for summarization research. TIDES, 2000.

[13] Breck Baldwin and Thomas S. Morton. Dynamic coreference-based summarization. In Proceedings of the Third Conference on Empirical Methods in Natural Language Processing, Granada, Spain, June 1998.

[14] M. Banko, V. Mittal, and M. Witbrock. Headline generation based on statistical translation. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL), 2000.

[15] Michele Banko, Vibhu Mittal, Mark Kantrowitz, and Jade Goldstein. Generating extraction-based summaries from hand-written summaries by aligning text spans. In Proceedings of PACLING-99, Waterloo, Ontario, July 1999.

[16] Regina Barzilay. Lexical chains for summarization. Master's thesis, Ben-Gurion University of the Negev, 1997.

[17] Regina Barzilay and Michel Elhadad. Using lexical chains for text summarization. In Inderjeet Mani and Mark Maybury, editors, Intelligent Scalable Text Summarization Workshop (ISTS'97), pages 10-17, Madrid, 1997. ACL/EACL.

[18] Regina Barzilay, Noemie Elhadad, and Kathy McKeown. Sentence ordering in multidocument summarization. In HLT'01, 2001.

[19] Regina Barzilay, Kathy McKeown, and Michel Elhadad. Information fusion in the context of multi-document summarization. In Proceedings of ACL 1999, 1999.

[20] M. Benbrahim and K. Ahmad. Computer-aided lexical cohesion analysis and text abridgement. Technical Report CS-94-11, Computing Sciences, University of Surrey, 1994.

[21] A. B. Benitez and S.-F. Chang. Multimedia knowledge integration, summarization and evaluation. In Proceedings of the 2002 International Workshop on Multimedia Data Mining, in conjunction with the International Conference on Knowledge Discovery and Data Mining (MDM/KDD-2002), Edmonton, Alberta, 2002.

[22] Adam Berger and Vibhu Mittal. Ocelot: A system for summarizing web pages. In Proceedings of the 23rd Annual Conference on Research and Development in Information Retrieval (ACM SIGIR), Athens, 2001.

[23] Branimir Boguraev, Rachel Bellamy, and Calvin Swart. Summarisation miniaturisation: Delivery of news to hand-helds. In NAACL'01, 2001.

[24] Branimir Boguraev and Christopher Kennedy. Salience-based content characterisation of text documents. In Proceedings of the ACL'97 Workshop on Intelligent, Scalable Text Summarisation, pages 2-9, Madrid, Spain, 1997.

[25] E. Boros, P. B. Kantor, and D. J. Neu. A clustering based approach to creating multi-document summaries. In Workshop on Text Summarization, in conjunction with the ACM SIGIR Conference 2001, New Orleans, 2001.

[26] Ronald Brandow, Karl Mitze, and Lisa F. Rau. Automatic condensation of electronic publications by sentence selection. Information Processing and Management, 31(5):675-685, 1995.
[27] M. Brunn, Y. Chali, and B. Dufou. The University of Lethbridge text summarizer at DUC 2002. In Workshop on Text Summarization (in conjunction with ACL 2002, including the DARPA/NIST-sponsored DUC 2002 Meeting on Text Summarization), Philadelphia, July 11-12, 2002.

[28] Orkut Buyukkokten, Hector Garcia-Molina, and Andreas Paepcke. Text summarization of web pages on handheld devices. In NAACL'01, 2001.

[29] N. H. M. Caldwell. An investigation into shallow processing for summarisation. Technical report (Computer Science Tripos Part II project), University of Cambridge Computer Laboratory, 1994.

[30] Jaime G. Carbonell and Jade Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of SIGIR, pages 335-336, 1998.

[31] J. Carroll, G. Minnen, Y. Canning, S. Devlin, and J. Tait. Practical simplification of English newspaper text to assist aphasic readers. In AAAI-98 Workshop on Integrating Artificial Intelligence and Assistive Technology, 1998.

[32] Y. Chali, M. Kolla, N. Singh, and Z. Zhang. The University of Lethbridge text summarizer at DUC 2003. In HLT/NAACL Workshop on Text Summarization / DUC 2003, 2003.

[33] Hsin-Hsi Chen. Multilingual summarization and question answering. In Workshop on Multilingual Summarization and Question Answering (COLING'2002), 2002.

[34] Hsin-Hsi Chen, June-Jei Kuo, and Tsei-Chun Su. Clustering and visualization in a multi-lingual multi-document summarization system. In Proceedings of the 25th European Conference on IR Research, pages 266-280, 2003.

[35] Hsin-Hsi Chen and Chuan-Jie Lin. A multilingual news summarizer. In Proceedings of the 18th International Conference on Computational Linguistics, COLING 2000, pages 159-165, 2000.

[36] John M. Conroy and Dianne P. O'Leary. Text summarization via Hidden Markov Models. In SIGIR 2001, 2001.

[37] John M. Conroy, Judith D. Schlesinger, Dianne P. O'Leary, and Mary Ellen Okurowski. Using HMM and Logistic Regression to generate extract summaries for DUC. In Workshop on Text Summarization, in conjunction with the ACM SIGIR Conference 2001, New Orleans, Louisiana, 2001.

[38] T. Copeck, S. Szpakowicz, and N. Japkowicz. Learning how best to summarize. In Workshop on Text Summarization (in conjunction with ACL 2002, including the DARPA/NIST-sponsored DUC 2002 Meeting on Text Summarization), Philadelphia, July 11-12, 2002.

[39] Simon H. Corston-Oliver. Text compaction for display on very small screens. In NAACL'01, 2001.

[40] R. E. Cullingford. SAM. In Schank and Riesbeck, editors, Inside Computer Understanding. Lawrence Erlbaum Associates, Hillsdale, NJ, 1981.

[41] H. Daume III, A. Echihabi, D. Marcu, D. S. Munteanu, and R. Soricut. GLEANS: A generator of logical extracts and abstracts for nice summaries. In Workshop on Text Summarization (in conjunction with ACL 2002, including the DARPA/NIST-sponsored DUC 2002 Meeting on Text Summarization), Philadelphia, July 11-12, 2002.

[42] Hal Daume III and Daniel Marcu. A noisy-channel model for document compression. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002.

[43] G. DeJong. An overview of the FRUMP system. In W. G. Lehnert and M. H. Ringle, editors, Strategies for Natural Language Processing, pages 149-176. Lawrence Erlbaum, Hillsdale, NJ, 1982.

[44] J. Dersy. Producing summary content indicators for retrieved texts. Master's thesis, University of Cambridge Department of Engineering, 1996.

[45] DUC: Document Understanding Conference. http://duc.nist.gov/.
[46] Daniel M. Dunlavy, John M. Conroy, Judith D. Schlesinger, Sarah A. Goodman, Mary Ellen Okurowski, Dianne P. O'Leary, and Hans van Halteren. Performance of a three-stage system for multi-document summarization. In DUC 2003, Edmonton, Alberta, Canada, May 31-June 1, 2003. Association for Computational Linguistics.

[47] H. P. Edmunson. New methods in automatic extracting. Journal of the Association for Computing Machinery, 16(2):264-285, April 1969.

[48] Noemie Elhadad and Kathleen R. McKeown. Towards generating patient specific summaries of medical articles. In NAACL'01 Automatic Summarization Workshop, 2001.

[49] David Kirk Evans and Judith L. Klavans. A platform for multilingual news summarization. Technical Report CUCS-014-03, Computer Science, Columbia University, 2003.

[50] A. Farzindar, G. Lapalme, and H. Saggion. Summaries with SumUM and its expansion for the Document Understanding Conference (DUC 2002). In Workshop on Text Summarization (in conjunction with ACL 2002, including the DARPA/NIST-sponsored DUC 2002 Meeting on Text Summarization), Philadelphia, July 11-12, 2002.

[51] Atefeh Farzindar and Guy Lapalme. Using background information for multi-document summarization and summaries in response to a question. In DUC 2003, Edmonton, Alberta, Canada, May 31-June 1, 2003. Association for Computational Linguistics.

[52] Maria Fuentes, Marc Massot, Horacio Rodríguez, and Laura Alonso. Mixed approach to headline extraction for DUC 2003. In HLT/NAACL Workshop on Text Summarization / DUC 2003, Edmonton, Canada, 2003.

[53] Maria Fuentes and Horacio Rodríguez. Using cohesive properties of text for automatic summarization. In JOTRI'02, 2002.

[54] P. Gladwin, S. Pulman, and K. Sparck-Jones. Shallow processing and automatic summarising: a first study. Technical Report 223, University of Cambridge Computer Laboratory, 1991.

[55] Jade Goldstein, Vibhu Mittal, Mark Kantrowitz, and Jaime Carbonell. Summarizing text documents: Sentence selection and evaluation metrics. In SIGIR-99, 1999.

[56] Stephen J. Green. Automatically generating hypertext by computing semantic similarity. PhD thesis, University of Toronto, 1997.

[57] U. Hahn. Topic parsing: Accounting for text macro structures in full-text analysis. Information Processing and Management, 26(1):135-170, 1990.

[58] Udo Hahn and Inderjeet Mani. The challenges of automatic summarization. IEEE Computer, 33(11):29-36, 2000.

[59] S. M. Harabagiu and F. Lacatusu. Generating single and multi-document summaries with GISTEXTER. In Workshop on Text Summarization (in conjunction with ACL 2002, including the DARPA/NIST-sponsored DUC 2002 Meeting on Text Summarization), Philadelphia, July 11-12, 2002.

[60] V. Hatzivassiloglou, J. Klavans, M. Holcombe, R. Barzilay, M. Y. Kan, and K. R. McKeown. Simfinder: A flexible clustering tool for summarization. In NAACL'01 Automatic Summarization Workshop, 2001.

[61] Vassileios Hatzivassiloglou, Judith Klavans, and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. In EMNLP/VLC'99, Maryland, 1999.

[62] A. G. Hauptmann and M. J. Witbrock. Informedia: News-on-demand multimedia information acquisition and retrieval. In M. Maybury, editor, Intelligent Multimedia Information Retrieval, pages 215-239. AAAI/MIT Press, 1997.

[63] Marti Hearst. Multi-paragraph segmentation of expository text. In 32nd Annual Meeting of the Association for Computational Linguistics, 1994.

[64] Ulf Hermjakob. Learning Parse and Translation Decisions From Examples With Rich Context. PhD thesis, University of Texas at Austin, 1997.
[65 T. Hirao, Y. Sasaki, H. Isozaki, and [76 Min-Yen Kan and Kathleen M Keown. In-
E. Maeda. NTT's Text Summarization sys- formation extra tion and summarization:
tem for DUC-2002. In Workshop on Text Domain independen e through fo us types.
Summarization (In Conjun tion with the Te hni al report, Computer S ien e De-
ACL 2002 and in luding the DARPA/NIST partment, Columbia University, New York,
sponsored DUC 2002 Meeting on Text Sum- 1999.
marization), Philadelphia, July, 11-12 2002.
[77 M. Karamuftuoglu. An approa h to sum-
[66 T. Hirao, J. Suzuki, H. Isozake, and marization based on lexi al bonds. In
E. Maeda. NTT's multiple do ument Workshop on Text Summarization (In Con-
summarization system for DUC2003. In jun tion with the ACL 2002 and in luding
HLT/NAACL Workshop on Text Summa- the DARPA/NIST sponsored DUC 2002
rization / DUC 2003, 2003. Meeting on Text Summarization), Philadel-
phia, July, 11-12 2002.
[67 Mi hael Hoey. Patterns of Lexis in Text.
Des ribing English Language. Oxford Uni- [78 Kevin Knight and Daniel Mar u. Statisti s-
versity Press, 1991. based summarization - step one: Senten e
ompression. In The 17th National Confer-
[68 Eduard Hovy. Handbook of Computational en e of the Ameri an Asso iation for Arti -
Linguisti s, hapter 28: Text Summariza- ial Intelligen e AAAI'2000, Austin, Texas,
tion. Oxford University Press, 2001. 2000.
[69 Eduard Hovy and Chin-Yew Lin. Au- [79 Hideki Kozima. Text segmentation based
tomated Text Summarization in SUM- on similarity between words. In Pro eedings
MARIST. In Mani and Maybury, editors, of the 31th Annual Meeting of the Asso i-
Advan es in Automati Text Summariza- ation for Computational Linguisti s, pages
tion. 1999. 286{288, 1993.
[70 Eduard Hovy and Daniel Mar u. Auto- [80 W. Kraaij, M. Spitters, and A. Hulth.
mated Text Summarization. COLING- Headline extra tion based on a ombina-
ACL, 1998. tutorial. tion of uni- and multido ument summa-
[71 Hongyan Jing. Senten e simpli ation in rization te hniques. In Workshop on Text
automati text summarization. In ANLP- Summarization (In Conjun tion with the
2000, 2000. ACL 2002 and in luding the DARPA/NIST
sponsored DUC 2002 Meeting on Text Sum-
[72 Hongyan Jing. Cut-and-Paste Text Sum- marization), Philadelphia, July, 11-12 2002.
marization. PhD thesis, Graduate S hool
of Arts and S ien es, Columbia University, [81 W. Kraaij, M. Spitters, and M. van der Hei-
2001. jden. Combining a mixture language model
and naive bayes for multi-do ument sum-
[73 Hongyan Jing and Kathleen M Keown. Cut marisation. In Workshop on Text Sum-
and paste based text summarization. In 1st marization in onjun tion with the ACM
Conferen e of the North Ameri an Chapter SIGIR Conferen e 2001, New Orleans,
of the Asso iation for Computational Lin- Louisiana, 2001.
guisti s, 2000.
[82 Julian Kupie , Jan Pedersen, and Fran ine
[74 Min-Yen Kan. Automati text summariza- Chen. A trainable do ument summarizer.
tion as applied to information retrieval: Us- In Pro eedings of ACM SIGIR Conferen e
ing indi ative and informative summaries. on Resear h and Development in Informa-
PhD thesis, Columbia University, 2003. tion Retrieval, pages 68{73. ACM Press,
1995.
[75 Min-Yen Kan, Judith L. Klavans, and
Kathleen R. M Keown. Domain-spe i [83 Finley La atusu, Paul Parker, and Sanda
informative and indi ative summarization Harabagiu. Lite-GISTexter: Generating
for information retrieval. In Workshop on short summaries with minimal resour es. In
Text Summarization in onjun tion with DUC03, Edmonton, Alberta, Canada, May
the ACM SIGIR Conferen e 2001, New Or- 31 - June 1 2003. Asso iation for Computa-
leans, 2001. tional Linguisti s.
[84 P. Lal and S. Rueger. Extra t-based sum- [94 Chin-Yew Lin and Eduard Hovy. Auto-
marization with simpli ation. In Work- mati evaluation of summaries using n-
shop on Text Summarization (In Conjun - gram o-o urren e statisti s. In Marti
tion with the ACL 2002 and in luding the Hearst and Mari Ostendorf, editors, HLT-
DARPA/NIST sponsored DUC 2002 Meet- NAACL 2003: Main Pro eedings, pages
ing on Text Summarization), Philadelphia, 150{157, Edmonton, Alberta, Canada, May
July, 11-12 2002. 27 - June 1 2003. Asso iation for Computa-
tional Linguisti s.
[85 Abderra h Lehmam. Text stru turation
leading to an automati summary system: [95 Chin-Yew Lin and Eduard H. Hovy. The
Ra . Information Pro essing and Manage- automated a quisition of topi signatures
men, 35(2):181{191, 1999. for Text Summarization. In COLING-00,
Saarbru ken, 2000.
[86 Abderra h Lehmam and Philippe Bouvet.

Evaluation, re ti ation et pertinen e du [96 H. P. Luhn. The automati reation of lit-
resume automatique de texte pour une util- erature abstra ts. IBM Journal of resear h
isation en reseau. In S. Chaudiron and and development, 2(2):159 { 165, 1958.
C. Fluhr, editors, III Colloque d'ISKO- [97 I. Mani, D. House, G. Klein, L. Hirs hman,
Fran e: Filtrage et resume automatique de L. Obrst, T. Firmin, M. Chrzanowski, and
l'information sur les reseaux, 2001. B. Sundheim. The tipster SUMMAC text
summarization evaluation: Final report.
[87 W. G. Lehnert. Plot units: a narrative sum-
Te hni al report, DARPA, 1998.
marization strategy. In W. G. Lehnert and
M. H. Ringle, editors, Strategies for natural [98 Inderjeet Mani. Automati Summarization.
language pro essing, pages 375 { 412. Hills- Nautral Language Pro essing. John Ben-
dale, NJ: Lawren e Erlbaum, 1982. jamins Publishing Company, 2001.
[88 Anton Leuski, Chin-Yew Lin, and Eduard [99 Inderjeet Mani and Eri Bloedorn. Ma hine
Hovy. iNeATS: Intera tive multi-do ument learning of generi and user-fo used sum-
summarization. In ACL'03, 2003. marization. In AAAI, pages 821{826, 1998.
[89 C-Y. Lin. Robust Automated Topi Identi - [100 Inderjeet Mani and Eri Bloedorn. Sum-
ation. PhD thesis, University of Southern marizing similarities and di eren es among
California, 1997. related do uments. Information Retrieval,
1(1-2):35{67, 1999.
[90 Chin-Yew Lin. Training a sele tion fun -
tion for extra tion. In ACM-CIKM, pages [101 Inderjeet Mani and Mark T. Maybury, edi-
55{62, 1999. tors. Advan es in automati text summari-
sation. MIT Press, 1999.
[91 Chin-Yew Lin and Eduard Hovy. Identify-
ing topi s by position. In Pro eedings of the [102 William C. Mann and Sandra A. Thomp-
Applied Natural Language Pro essing Con- son. Rhetori al stru ture theory: Toward a
feren e (ANLP-97), pages 283{290, Wash- fun tional theory of text organisation. Text,
ington, DC, 1997. 3(8):234{281, 1988.
[103 Daniel Mar u. From dis ourse stru tures
[92 Chin-Yew Lin and Eduard Hovy. NeATS: A to text summaries. In Mani and Maybury,
multido ument summarizer. In Workshop editors, Advan es in Automati Text Sum-
on Text Summarization in onjun tion with marization, pages 82 { 88, 1997.
the ACM SIGIR Conferen e 2001, New Or-
leans, 2001. [104 M. Maybury and A. Merlino. Multimedia
summaries of broad ast news. In Interna-
[93 Chin-Yew Lin and Eduard Hovy. NeATS tional Conferen e on Intelligent Informa-
in DUC 2002. In Workshop on Text tion Systems, 1997.
Summarization (In Conjun tion with the
ACL 2002 and in luding the DARPA/NIST [105 Mark T. Maybury and Inderjeet Mani. Au-
sponsored DUC 2002 Meeting on Text Sum- tomati summarization. ACL/EACL'01,
marization), Philadelphia, July, 11-12 2002. 2001. tutorial.
[106 K. M Keown, R. Barzilay, D. Evans, [113 M. Mitra, A. Singhal, and C. Bu kley. Au-
V. Hatzivassiloglou, J. Klavans, C. Sable, tomati Text Summarization by paragraph
B. S hi man, and S. Sigelman. Tra king extra tion. In Inderjeet Mani and Mark
and summarizing news on a daily basis with Maybury, editors, Intelligent S alable Text
Columbia's Newsblaster. In Pro eedings of Summarization Workshop (ISTS'97), pages
the Human Language Te hnology Confer- 39 { 46, Madrid, 1997. ACL/EACL.
en e, 2002.
[114 V. Mittal, M. Kantrowitz, J. Goldstein, and
[107 K. M Keown, S.-F. Chang, J. Cimino, J. Carbonell. Sele ting text spans for do -
S. Feiner, C. Friedman, L. Gravano, ument summaries: Heuristi s and metri s.
V. Hatzivassiloglou, S. Johnson, D. Jor- In AAAI 1999, 1999.
dan, J. Klavans, A. Kushniruk, V. Pa-
tel, and S. Teufel. Persival, a system [115 Vibhu Mittal and Adam Berger. Query-
for personalized sear h and summarization relevant summarization using faqs. In Pro-
over multimedia health are information. In eedings of the 38th Annual Meeting of the
ACM+IEEE Joint Conferen e on Digital Asso iation for Computational Linguisti s
Libraries (JCDL 2001), 2001. (ACL 2000), Hong Kong, 2000.
[108 K. M Keown, D. Evans, A. Nenkova, [116 Jane Morris and Graeme Hirst. Lexi al o-
R. Barzilay, V. Hatzivassiloglou, B. S hi - hesion, the thesaurus, and the stru ture of
man, S. Blair-Goldensohn, J. Klavans, and text. Computational linguisti s, 17(1):21{
S. Sigelman. The olumbia multi-do ument 48, 1991.
summarizer for DUC 2002. In Work-
shop on Text Summarization (In Conjun - [117 S. Muresan, E. Tzoukermann, and J. Kla-
tion with the ACL 2002 and in luding the vans. Combining linguisti and ma hine
DARPA/NIST sponsored DUC 2002 Meet- learning te hniques for email summariza-
ing on Text Summarization), Philadelphia, tion. In ACL-EACL'01 CoNLL Workshop,
July, 11-12 2002. 2001.

[109 Kathleen M Keown, Judith Klavans, Vas- [118 Sung Hyon Myaeng and Myung-Gil Jang.
sileios Hatzivassiloglou, Regina Barzilay, Integrating digital libraries with ross-
and Eleazar Eskin. Towards multido ument language ir. In Pro eedings of the 2nd Con-
summarization by reformulation: Progress feren e on Digital Libraries, 1999.
and prospe ts. In AAAI 99, 1999.
[119 Ani Nenkova, Barry S hi man, An-
[110 Kathleen R. M Keown, Regina Barzilay, drew S hlaiker, Sasha Blair-Goldensohn,
David Evans, Vasileios Hatzivassiloglou, Regina Barzilay, Sergey Sigelman, Vasileios
Min-Yen Kan, Barry S hi man, and Si- Hatzivassiloglou, and Kathleen M Keown.
mone Teufel. Columbia multi-do ument Columbia at the du 2003. In DUC03, Ed-
summarization: Approa h and evaluation. monton, Alberta, Canada, May 31 - June
In Pro eedings of the Workshop on Text 1 2003. Asso iation for Computational Lin-
Summarization, ACM SIGIR Conferen e, guisti s.
2001.
[120 K. Ono, K. Sumita, and S. Miike. Abstra t
[111 Kathleen R. M Keown and Dragomir R. generation based on rhetori al stru ture ex-
Radev. Generating summaries of multiple tra tion. In Pro eedings of the 15th Inter-
news arti les. In ACM Conferen e on Re- national Conferen e on Computational Lin-
sear h and Development in Information Re- guisti s (COLING-94), pages 344 { 348,
trieval SIGIR'95, Seattle, WA, 1995. Kyoto, Japan, 1994.
[112 Jean-Lu Minel, Jean-Pierre Des les, [121 J.C. Otterba her, A.J. Winkel, and D.R.
Emmanuel Cartier, Gustavo Crispino, Radev. The mi higan single and multi-
Slim Ben Hazez, and Agata Ja k- do ument summarizer for DUC 2002. In
iewi z. Resume automatique par ltrage Workshop on Text Summarization (In Con-
semantique d'informations dans des textes. jun tion with the ACL 2002 and in luding
presentation de la plate-forme ltext. the DARPA/NIST sponsored DUC 2002
Revue Te hnique et S ien e Informatique, Meeting on Text Summarization), Philadel-
2001. phia, July, 11-12 2002.
[122 Chris D. Pai e. The automati genera- summarization of topi ally related news
tion of literature abstra ts: an approa h arti les. In 5th European Conferen e on
based on the identi ation of self-indi ating Resear h and Advan ed Te hnology for
phrases. In R. N. Oddy, C. J. Rijsbergen, Digital Libraries, Darmstadt, 2001.
and P. W. Williams, editors, Information
Retrieval Resear h, pages 172 { 191. Lon- [132 Dragomir R. Radev, Weiguo Fan, and Zhu
don: Butterworths, 1981. Zhang. Webinessen e: A personalized web-
based multi-do ument summarization and
[123 Chris D. Pai e. Constru ting literature ab- re ommendation system. In NAACL Work-
stra ts by omputer. Information Pro ess- shop on Automati Summarization, Pitts-
ing & Management, 26(1):171 { 186, 1990. burgh, 2001.
[124 T.A.S. Pardo and L.H.M. Rino. DMSumm: [133 Dragomir R. Radev, Hongyan Jing, and
Review and assessment. In E. Ran hhod Malgorzata Budzikowska. Centroid-based
and N. J. Mamede, editors, Advan es in summarization of multiple do uments: sen-
Natural Language Pro essing, pages 263{ ten e extra tion, utility-based evaluation,
273. Springer-Verlag, 2002. and user studies. In ANLP/NAACL Work-
shop on Summarization, Seattle, Washing-
[125 T.A.S. Pardo, L.H.M. Rino, and M.G.V. ton, 2000.
Nunes. GistSumm: A summarization tool
based on a new extra tive method. In [134 Dragomir R. Radev, Simone Teufel, Ho-
N.J. Mamede, J. Baptista, I. Tran oso, ra io Saggion, Wai Lam, John Blitzer,
and M.G.V. Nunes, editors, 6th Workshop Arda Celebi, Hong Qi, Elliott Drabek, and
on Computational Pro essing of the Por- Danyu Liu. Evaluation of Text Summa-
tuguese Language - Written and Spoken, rization in a Cross-lingual Information Re-
number 2721 in Le ture Notes in Arti - trieval Framework. Te hni al report, Cen-
ial Intelligen e, pages 210{218. Springer- ter for Language and Spee h Pro essing,
Verlag, 2003. Johns Hopkins University, Baltimore, MD,
June 2002.
[126 J. J. Pollo k and A. Zamora. Automati
abstra ting resear h at hemi al abstra ts [135 Lisa F. Rau, Paul S. Ja obs, and Uri Zernik.
servi e. Journal of Information and Com- Information extra tion and text summari-
puter S ien es, 15(4):226{23, 1975. sation using linguisti knowledge a quisi-
tion. Information Pro essing & Manage-
[127 K. Preston and S. Williams. Managing the ment, 25(4):419 { 428, 1989.
information overload. physi s in business.
Institute of Physi s, 1994. [136 RIPTIDES. RIPTIDES: Rapidly Portable
Translingual Information Extra tion and
[128 Dragomir Radev, Sasha Blair-Goldensohn, Intera tive Multido ument Summarization.
and Zhu Zhang. Experiments in single http://www. s. ornell.edu/Info/People/
and multi-do ument summarization using ardie/tides/, 2002.
MEAD. In First Do ument Understanding
Conferen e, New Orleans, LA, September [137 J. E. Rush and et al. Automati abstra t-
2001. ing and indexing. ii. produ tion of abstra ts
by appli ation of ontextual inferen e and
[129 Dragomir Radev, Jahna Otterba her, Hong synta ti oheren e riteria. Journal of the
Qi, and Daniel Tam. MEAD ReDUCs: Ameri an So iety for Information S ien e,
Mi higan at DUC 2003. In DUC03, Ed- 22(4):260 { 274, 1971.
monton, Alberta, Canada, May 31 - June
1 2003. Asso iation for Computational Lin- [138 Hora io Saggion and Guy Lapalme. Gen-
guisti s. erating Indi ative-Informative Summaries
with SumUM. Computational Linguisti s,
[130 Dragomir R. Radev. Text Summarization. 28(4), 2002.
ACM SIGIR, 2000. tutorial.
[139 Hora io Saggion and Guy Lapalme. Gener-
[131 Dragomir R. Radev, Sasha Blair- ating informative and indi ative summaries
Goldensohn, Zhu Zhang, and Re- with SumUM. Computational Linguisti s,
vathi Sundara Raghavan. Intera tive, 28(4), 2002. Spe ial Issue on Automati
domain-independent identi ation and Summarization.
[140 Gerard Salton, James Allan, and Chris [150 H. Sundaram. Segmentation, Stru ture De-
Bu kley. Automati stru turing and re- te tion and Summarization of Multimedia
trival of large text les. CACM, 37(2):97{ Sequen es. PhD thesis, Graduate S hool
108, 1994. of Arts and S ien es, Columbia University,
2002.
[141 Gerard Salton, Amit Singhal, M. Mitra,
and C. Bu kley. Automati text stru tur- [151 SweSum. http://www.nada.kth.se/ xmartin/
ing and summarization. Information Pro- swesum/index-eng.html, 2002.
essing and Management, 33(3):193 { 207,
1997. [152 J. L. Tait. Automati summarizing of en-
[142 R. S hank and R. Abelson. S ripts, Plans, glish texts. Te hni al Report 47, University
Goals, and Understanding. Lawren e Erl- of Cambridge Computer Laboratory, 1983.
baum, Hillsdale, NJ, 1977.
[153 S. L. Taylor. Automati abstra ting by ap-
[143 Barry S hi man, Inderjeet Mani, and Kris- plying graphi al te hniques to semanti net-
tian J. Con ep ion. Produ ing biographi- works. PhD thesis, Northwestern Univer-
al summaries: Combining linguisti knowl- sity, 1975.
edge with orpus statisti s. In EACL'01,
2001. [154 Simone Teufel and Mar Moens. Sen-
ten e extra tion as a lassi ation task. In
[144 J.D. S hlesinger, J.M. Conroy, M.E. Inderjeet Mani and Mark Maybury, edi-
Okurowski, H.T. Wilson, D.P. O'Leary, tors, Intelligent S alable Text Summariza-
A. Taylor, and J. Hobbs. Understand- tion Workshop (ISTS'97), pages 58 { 59,
ing ma hine performan e in the ontext Madrid, 1997. ACL/EACL.
of human performan e for multi-do ument
summarization. In Workshop on Text [155 Simone Teufel and Mar Moens. Senten e
Summarization (In Conjun tion with the extra tion and rhetori al lassi ation for
ACL 2002 and in luding the DARPA/NIST exible abstra ts. In AAAI Spring Sym-
sponsored DUC 2002 Meeting on Text Sum- posium on Intelligent Text Summarisation,
marization), Philadelphia, July, 11-12 2002. pages 16 { 25, 1998.
[145 E. F. Skorokhod'ko. Adaptive method of
automati abstra ting and indexing. Infor- [156 Simone Teufel and Mar Moens. Summa-
mation pro essing, 71, 1971. rizing s ienti arti les { experiments with
relevan e and rhetori al status. Computa-
[146 K. Spar k Jones, S. Walker, and S. Robert- tional Linguisti s, 28(4), 2002. Spe ial Issue
son. A probabilisti model of information on Automati Summarization.
retrieval: Development and status. Te hni-
al Report N 446, University of Cambridge [157 Ri hard Tu ker. Automati Summarising
Computer Laboratory, 1998. and the CLASP system. PhD thesis, Uni-
versity of Cambridge, 1999.
[147 Karen Spar k-Jones. Automati summaris-
ing: fa tors and dire tions. In Inderjeet [158 E. Tzoukermann, S. Muresan, and J. Kla-
Mani and Mark Maybury, editors, Advan es vans. Gist-it: Summarizing email using lin-
in Automati Text Summarization. MIT guisti knowledge and ma hine learning. In
Press, 1999. ACL-EACL'01 HLT/KM Workshop, 2001.
[148 Tomek Strzalkowski, Jin Wang, and Bow-
den Wise. A robust pra ti al text summa- [159 H. van Halteren. Writing style re ogni-
rization. In Eduard Hovy and Dragomir tion and senten e extra tion. In Work-
Radev, editors, AAAI Spring Symposium shop on Text Summarization (In Conjun -
on Intelligent Text Summarisation, pages tion with the ACL 2002 and in luding the
26 { 33, Stanford, California, Mar h 23- DARPA/NIST sponsored DUC 2002 Meet-
25 1998. Ameri an Asso iation for Arti ial ing on Text Summarization), Philadelphia,
Intelligen e, AAAI Press. July, 11-12 2002.

[149 SUMMAC. SUMMAC, the nal report. [160 Piek Vossen, editor. Euro WordNet: a mul-
http://www.itl.nist.gov/iaui/894.02/ tilingual database with lexi al semanti net-
related proje ts/tipster summa /, 1998. works. Kluwer A ademi Publishers, 1998.
[161 H. Wa tlar. Multi-do ument summariza- 22nd International Conferen e on Resear h
tion and visualization in the informedia dig- and Development in Information Retrieval
ital video library, 2001. (SIGIR-99), 1999.
[162 M. White, D. M Cullough, C. Cardie, [166 S. R. Young and P. J. Hayes. Automati
V. Ng, and K. Wagsta . Dete ting dis rep- lassi ation and summarisation of bank-
an ies and improving intelligibility: Two ing telexes. In Se ond Conferen e on Arti-
preliminary evaluations of riptides. In ial Intelligen e Appli ations, pages 402{
Workshop on Text Summarization in on- 408, New York, 1985.
jun tion with the ACM SIGIR Conferen e
2001, New Orleans, 2001. [167 D. Zaji , B. Door, and R. S hwartz. Au-
tomati headline generation for newspaper
[163 Mi hael White and Claire Cardie. Sele ting stories. In Workshop on Text Summariza-
senten es for multido ument summaries us- tion (In Conjun tion with the ACL 2002
ing randomized lo al sear h. In ACL Work- and in luding the DARPA/NIST sponsored
shop on Automati Summarization, 2002. DUC 2002 Meeting on Text Summariza-
[164 Mi hael White, Tanya Korelsky, Claire tion), Philadelphia, July, 11-12 2002.
Cardie, Vin ent Ng, David Pier e, and Kiri
Wagsta . Multi-do ument summarization [168 Klaus Ze hner. A literature survey on in-
via information extra tion. In Pro eed- formation extra tion and Text Summariza-
ings of the First International Conferen e tion. term paper, Carnegie Mellon Univer-
on Human Language Te hnology Resear h, sity, 1997.
2001. [169 Klaus Ze hner. Automati Summarisation
[165 M. Witbro k and V. Mittal. Ultra- of Spoken Dialogues in Unrestri ted Do-
summarization: A statisti al approa h to mains. PhD thesis, Carnegie Mellon Uni-
generating highly ondensed nonextra - versity, 2001.
tive summaries. In Pro eedings of the
System | Processing level | Information kind | Tucker 1999
Adam [137, 126] | surface | structural | sentencewise
Alfonseca and Rodríguez [1] | surface | structural | sentencewise
* Anes [26] | surface | lexical | att. networks
Barzilay and Elhadad 1997 [17] | entity | lexical | att. networks
Boguraev and Kennedy 1997 [24] | entity | lexical | att. networks
Caldwell 1994 [29] | entity | lexical | att. networks
* CENTRIFUSER [48] | discourse | understanding | info. content
* Chen and Lin (2000) [35] | surface | lexical | info. content
* Columbia MDS [108, 38, 119] | entity/discourse | understanding/structural | info. content
Copeck et al. 2002 [38] | surface | lexical | att. networks
* Cut-and-Paste [72] | surface | structural | info. content
Dersy 1996 [44] | entity | lexical | att. networks
* DiaSumm [169] | surface | lexical | discourse structure
DimSum [9] | surface | lexical | att. networks
* DMSumm [124] | discourse | structural | disc. structure
Edmundson 1969 [47] | surface | structural | sentencewise
FilText [112] | surface | structural | info. content
* FociSum [76] | entity | understanding | att. networks
Frump [43] | entity | understanding | info. content
GISTexter [59, 83] | discourse/entity | understanding | info. content
GistSumm [125] | surface | lexical | att. networks
Gladwin et al. 1991 [54] | entity | lexical | att. networks
* GLEANS [41] | entity/discourse | understanding | info. content
* NTT [65, 66] | surface | structural/lexical | att. networks
* Karamuftuoglu 2002 [77] | surface | structural | att. networks
* Kraaij et al. 2002 [80] | surface | lexical | att. networks
K. U. Leuven [6, 7] | entity | lexical | att. networks
* Lal and Rueger 2002 [84] | entity/discourse | understanding | info. content
Lehnert 1982 [87] | entity | understanding | info. content
* Univ. of Lethbridge [27, 32] | entity | structural/lexical | att. networks
Luhn 1958 [96] | surface | lexical | att. networks
Marcu 1997 [103] | discourse | structural | disc. structure
* MEAD [128, 129] | surface | lexical | att. networks
* MultiGen [109, 19] | entity | structural | info. content
* NeATS [92, 93, 88] | entity | structural | info. content
* Newsblaster [106] | entity/discourse | structural/understanding | info. content
NewsInEssence [131] | surface | lexical | att. networks
Ono et al. 1994 [120] | discourse | structural | disc. structure
NetSumm [127] | surface | lexical | att. networks
Paice 1981 [122] | surface | structural | sentencewise
* PERSIVAL [107] | - | understanding | info. content
RAFI [85] | surface | structural | att. networks
* RIPTIDES [136, 163] | entity/discourse | understanding | info. content
SAM [142, 40] | entity | understanding | info. content
Dunlavy et al. 2003 [144, 46] | surface | lexical | att. networks
Scisor [135] | entity | understanding | info. content
Scrabble [152] | entity | understanding | info. content
Skorokhod'ko 1971 [145] | entity | lexical | att. networks
Smart [140, 113] | entity | lexical | att. networks
* SUMMARIST [69] | surface | lexical | att. networks
SUMMONS [111] | entity | understanding | info. content
SumUM [50, 138, 51] | discourse | structural | discourse structure
* SweSum [151] | surface | lexical | att. networks
Taylor 1975 [153] | entity | understanding | info. content
Tele-Pattan [20] | entity | lexical | att. networks
Tess [166] | entity | understanding | info. content
Teufel and Moens [155, 156] | discourse | structural | disc. structure
TICC [2] | entity | understanding | info. content
TOPIC [57] | discourse | structural | disc. structure
van Halteren 2002 [159] | surface | lexical | att. networks
WebInEssence [132, 167] | surface | lexical | att. networks

Table 1: Classification of summarization systems


On-line or Downloadable Demos

Centrifuser | English
    multi-document (specific-topic: medical documents)
    on-line demo: http://centrifuser.cs.columbia.edu/centrifuser.cgi

Copernic | English, French, German
    single document (many formats)
    downloadable demo: http://www.copernic.com/desktop/products/summarizer/download.html

DMSumm | English, Brazilian Portuguese
    single document
    downloadable demo: http://www.nilc.icmc.usp.br/~thiago/DMSumm.zip

Extractor | English, French, Spanish, German, Japanese, Korean
    single document (many formats)
    downloadable demo: http://www.dbi-tech.com/dbi_extractor.asp

GISTexter | English
    single and multi-document
    no straightforward access; form at: http://www.languagecomputer.com/demos/summarization/index.html

GistSumm | multilingual
    single document
    downloadable demo: http://www.nilc.icmc.usp.br/~thiago/Install_GistSum.zip

Newsblaster | multilingual
    multi-document
    on-line demo: http://www1.cs.columbia.edu/nlp/newsblaster/

Island InText | English
    single document
    no straightforward downloading; form at: http://www.islandsoft.com/orderform.html

Inxight Summarizer / LinguistX / Xerox PARC | Chinese, Danish, Dutch, English, Finnish, French, German, Italian, Japanese, Korean, Norwegian, Portuguese, Spanish and Swedish
    single document
    no straightforward downloading; form at: http://www.inxight.com/products/oem/summarizer/contact_sales.php

Kmaritime | Korean
    on-line demo: http://nlplab.kmaritime.ac.kr/demo//fcats.html

Lal and Ruger (2002) | English
    single document
    on-line demo: http://rowan.doc.ic.ac.uk:8180/summarizer/demo.html

MEAD / NewsInEssence / CLAIR | English and Chinese
    multi-document, multi-lingual
    on-line and downloadable demo: http://www.clsp.jhu.edu/ws2001/groups/asmd/
    multiple news summarization demo at: http://www.newsinessence.com/nie.cgi

MS-Word AutoSummarize | supposedly any language
    single document
    included in MS-Word

Pertinence Summarizer | English, French, Spanish, German, Italian, Portuguese, Japanese, Chinese, Korean, Arabic, Greek, Dutch, Norwegian and Russian
    single document
    on-line demo: http://www.pertinence.net

Sinope Summarizer Personal Edition | English, Dutch and German
    single document
    30-day trial downloadable: http://www.sinope.nl/en/sinope/index.html

Summ-It | probably English only
    pasted text
    on-line demo: http://www.mcs.surrey.ac.uk/SystemQ/summary/

Surfboard | probably English only
    single web pages (Mac OS X.1 only)
    30-day trial downloadable demo: http://www.glu.com/binaries/surfboard/surfboard.dmg.gz

SweSum | Danish, English, French, German, Spanish, Swedish
    single document (Web pages or pasted text)
    on-line demo: http://www.nada.kth.se/~xmartin/swesum/index-eng.html

TextWise Content Repurposing Suite | probably English only
    single document or e-mail
    no straightforward access; form at: http://www.textwise.com/technology/crs/demo.html

Table 2: Some on-line demos of summarization systems, both commercial and academic
[Figure: text -> cleaning up -> textual unit segmentation -> morphological analysis (PN rules) -> lexical unit segmentation -> co-reference resolution (heuristics, co-reference rules) -> semantic tagging (EuroWN, trigger-words) -> pre-processed text -> lexical chainer -> chains and textual units -> ranking & selection (parameters) -> output summary]

Figure 1: Architecture of HERMES Lexical Chain Summarizer.


[Figure: text -> pre-processing and enrichment (lexical chainer, feature extraction) -> enriched textual units -> classification (decision rules) -> ranked textual units -> determination of summary content -> chosen textual units -> simplification -> summary]

Figure 2: Architecture of HERMES Machine Learning Summarizer.


Annex: Some Summarization Systems

Alfonseca and Rodríguez 2003

- Name:
- Reference: [1]
- Short description: produces very short summaries (headline-like) of single documents applying genetic algorithms
- System Features
  - Input:
  - Architecture: The processing has two steps. First, the most relevant sentences of a document are extracted, applying a genetic algorithm that selects sentences according to their values for a series of features indicative of their relevance: sentence length, position in the document, order of the sentences, representativity, syntactical structure, redundancy. The algorithm is trained on the data from past DUC contests (a schematic sketch of this kind of genetic selection follows this entry). Once the sentences are extracted, a headline is created by concatenating portions of these sentences. To determine which portions should be extracted and which can be left aside, sentences are parsed, and hand-crafted rules are applied to guarantee well-formedness (extracting the main verb and its arguments) and informativity (extracting highly connected lexical items).
  - Output facilities and constraints:
  - Language coverage: English, potentially multilingual
- Evaluation: obtained average results (ranked in the middle of all systems) in DUC 2003
- Classification
  - within classification 1 (level of processing): surface
  - within classification 2 (kind of information): structural
  - within classification 3 (Tucker, 1999): sentence by sentence
- Comments:
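A minimal sketch of this kind of genetic sentence selection is given below. It is not the authors' implementation: candidate extracts are modelled as bit masks over the document's sentences, the fitness function stands in for the learned combination of relevance features, and all parameter values are purely illustrative.

    import random

    def fitness(mask, scores, max_sents=3):
        """Stand-in for the learned feature combination: sum of per-sentence
        relevance scores, with over-long candidates penalised to zero."""
        chosen = [i for i, bit in enumerate(mask) if bit]
        if not chosen or len(chosen) > max_sents:
            return 0.0
        return sum(scores[i] for i in chosen)

    def evolve(scores, pop_size=40, generations=200, p_mut=0.05):
        n = len(scores)
        pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=lambda m: fitness(m, scores), reverse=True)
            parents = pop[: pop_size // 2]              # truncation selection
            children = []
            while len(parents) + len(children) < pop_size:
                a, b = random.sample(parents, 2)
                cut = random.randrange(1, n)            # one-point crossover
                children.append([bit ^ (random.random() < p_mut)  # bit-flip mutation
                                 for bit in a[:cut] + b[cut:]])
            pop = parents + children
        return max(pop, key=lambda m: fitness(m, scores))

    print(evolve([0.9, 0.1, 0.6, 0.2, 0.8]))  # toy per-sentence relevance scores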
Baldwin and Morton 1998

- Name:
- Reference: [13]
- Short description: uses co-reference between the query and the text for performing indicative, user-focused (query-sensitive) summarization
- System Features
  - Input:
  - Architecture: The system is based on a rich linguistic processing that includes the following tasks:
    - NER
    - Tokenization
    - Sentence segmentation
    - POS tagging
    - Morphological analysis
    - Parsing
    - Argument detection
    - Co-reference resolution: identity and part-whole, including nominal and verbal phrases, acronyms, events
  - Language coverage: English
  - Output facilities and constraints:
- Evaluation:
- Classification
  - within classification 1 (level of processing): entity
  - within classification 2 (kind of information): lexical
  - within classification 3 (Tucker, 1999): sentence by sentence
- Comments:
Banko et al. 1999, Mittal et al. 1999

- Name:
- Reference: [15], [114]
- Short description: extraction-based summarization learned from hand-written summaries, i.e. going from abstracts to extracts, of single documents, by aligning text spans.
- System Features
  - Input:
  - Architecture: A tl*tf (term length * term frequency) measure is used for weighting the relevance of terms and NEs (a sketch of this weighting follows this entry). [114] focuses on the selection of spans for document summaries. Sentences from the original document are ranked according to their salience using two parameters for tuning the process: i) granularity, e.g. paragraph, sentence, etc., and ii) the metric used for ranking. Features at discourse level include:
    - length of the span
    - density of NEs
    - complexity of NPs
    - punctuation
    - thematic phrases
    - anaphora density
    There are also features at subdocument level (sentence, phrase and word). These include:
    - word length
    - communicative actions
    - thematic phrases
    - use of honorifics, auxiliary verbs, negation, prepositions, etc.
    - type of sentence (interrogative, evaluative, etc.)
  - Language coverage: English
  - Output facilities and constraints:
- Evaluation:
- Classification
  - within classification 1 (level of processing): discourse
  - within classification 2 (kind of information): understanding
  - within classification 3 (Tucker, 1999): informational content
- Comments: Related work includes headline production [14, 22] and ultra-summarization [165].
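As an illustration, the tl*tf weight of a term can be computed as in the minimal sketch below; since [15] does not fully specify any normalisation, raw counts over a tokenised text are assumed.

    from collections import Counter

    def tl_tf(tokens):
        """Weight each term by term length * term frequency (tl*tf)."""
        tf = Counter(tokens)
        return {term: len(term) * freq for term, freq in tf.items()}

    tokens = "the tribunal upheld the appeal against the tribunal ruling".split()
    print(sorted(tl_tf(tokens).items(), key=lambda kv: -kv[1])[:3])

Longer, repeated content words (here "tribunal") thus outrank frequent but short function words.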
Boros et al. 2001

- Name:
- Reference: [25]
- Short description: multi-document summarization system
- System Features
  - Input:
  - Architecture: The system proceeds through the following steps: i) from a document set a finite number of topics are extracted, ii) topics are ordered by importance, iii) a unique sentence is extracted from the collection for covering each topic, where the salience of sentences is computed using tf*idf, iv) sentences are clustered (several clustering techniques, both hierarchical and non-hierarchical, are experimented with) and, finally, v) the summary is produced.
  - Language coverage: English, potentially multilingual
  - Output facilities and constraints:
- Evaluation:
- Classification
  - within classification 1 (level of processing): surface
  - within classification 2 (kind of information): lexical
  - within classification 3 (Tucker, 1999): informational content
- Comments:
Carbonell and Goldstein 1998, Goldstein et al. 1999

- Name:
- Reference: [30], [55]
- Short description: the CMU approach to both SDS and MDS combines criteria of query relevance and novelty.
- System Features
  - Input:
  - Architecture: The base of the system is the MMR (Maximal Marginal Relevance) metric, whose formulation is given after this entry. Important issues are the diversity-based re-ranking for reordering documents (in MDS), the relevant passage extraction, the anti-redundancy measures, and the way of combining criteria of relevance and novelty (relevant novelty vs. declining relevance to the user's query). In the case of SDS the system ranks sentences from the original document according to their salience or their likelihood of being part of the summary. For doing so, a weighted score of both linguistic and statistical features is used. The weights are optimised according to application genres. Among the linguistic features we can find: name, place, honorifics, quotations, thematic phrases, etc. Statistical features include cosine, tf*idf, pseudo-relevance feedback, query expansion, user interest profiles, etc. In the case of MDS, different types of summaries can be produced using:
    - common sections of documents
    - common sections + unique sections of documents
    - centroid
    - centroid + outliers
    - common sections + unique sections + time weighting factor
  - Language coverage: English, potentially multilingual
  - Output facilities and constraints:
- Evaluation:
- Classification
  - within classification 1 (level of processing): surface
  - within classification 2 (kind of information): lexical
  - within classification 3 (Tucker, 1999): informational content
- Comments:
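For reference, the MMR criterion as defined in [30] selects at each step the candidate that maximises a trade-off between relevance to the query and novelty with respect to what has already been selected:

    \mathrm{MMR} \;=\; \arg\max_{D_i \in R \setminus S} \Big[\, \lambda\, \mathrm{Sim}_1(D_i, Q) \;-\; (1 - \lambda) \max_{D_j \in S} \mathrm{Sim}_2(D_i, D_j) \Big]

where R is the set of retrieved candidates (documents or passages), S the set already selected, Q the query, Sim_1 and Sim_2 are (possibly different) similarity measures, and lambda tunes the relevance/novelty balance: lambda = 1 reduces to a standard relevance ranking, while lower values increasingly reward novelty.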
CENTRIFUSER

- Name: CENTRIFUSER
- Reference: [75]
- Short description: multi-document summarizer. CENTRIFUSER meets the needs of browsers and searchers in highly structured domains.
- System Features
  - Input:
  - Architecture: The system uses SIMFINDER [61, 60], a flexible clustering tool for summarization (used also in MULTIGEN). This tool detects text similarity over short passages, exploring combinations of linguistic features via Machine Learning techniques. Among the primitive linguistic features we can find word co-occurrence, shared proper nouns, linked noun phrases, WordNet synonyms and semantically similar verbs. Composite features consist of pairs of simple features. An automatic feature detection system is applied and then the well-known ILP system RIPPER is run. After clustering, the system uses key terms for selecting one sentence or paragraph from each cluster (using the centroid method of [133]). The selected sentences are finally reordered by reformulation (in a similar way as in MULTIGEN).
  - Language coverage: English, parts of it potentially multilingual
  - Output facilities and constraints:
- Evaluation:
- Classification
  - within classification 1 (level of processing): discourse
  - within classification 2 (kind of information): understanding
  - within classification 3 (Tucker, 1999): informational content
- Comments:
Chen and Lin (2000), Chen et al. 2003

- Name:
- Reference: [35], [34]
- Short description:
- System Features
  - Input: multidocument
  - Architecture: The main components of the system are a set of monolingual news clusterers, a unique multilingual news clusterer and a news summarizer. A central issue of the system is the definition and identification of meaningful units as the basis for comparison. For English these units can be reduced to sentences, but for Chinese the identification of units and the associated segmentation of the text can be a difficult task. Another important issue of the system (general for systems covering distant languages or different encoding schemata) is the need for a robust transliteration of names (or words not occurring in the bilingual dictionary) for ensuring an accurate matching.
  - Output facilities and constraints:
  - Language coverage: crosslingual: English and Chinese, potentially any language
- Evaluation:
- Classification
  - within classification 1 (level of processing): surface
  - within classification 2 (kind of information): lexical
  - within classification 3 (Tucker, 1999): informational content
- Comments:
Columbia MDS

- Name: Columbia MDS
- Reference: [19, 18, 110, 61, 60, 108, 119]
- Short description: enhanced version of MULTIGEN. A complex system that can be applied to different sources. It can be considered a sort of meta-summarizer.
- System Features
  - Input: four different types of input, which are identified so that the most appropriate summarizer is applied in each case. The system can deal with simple event, biography, multi-event and others.
  - Architecture: There is a pre-processing phase followed by a router that, depending on the kind of input, triggers the appropriate summarizer. For simple events the summarizer used is the conventional MULTIGEN; for biographies, DEMS [143] with the bio configuration; for multi-event and others, DEMS with the default configuration.
  - Language coverage: English
  - Output facilities and constraints:
- Evaluation: DUC 2002, consistently among the top three systems (second or third). For extracts, it ranked second precision-wise and third recall-wise. For abstracts, it ranked second coverage-wise and third precision-wise. It also participated in DUC 2003, and obtained good results for coverage and quality questions in some of the tasks.
- Classification
  - within classification 1 (level of processing): entity/discourse
  - within classification 2 (kind of information): structural/understanding
  - within classification 3 (Tucker, 1999): informational content
- Comments:
Conroy et al. 2001

- Name:
- Reference: [37, 36, 144, 46]
- Short description: statistical approaches to summarisation
- System Features
  - Input:
  - Architecture: Two different techniques are used in [37]:
    - HMM, using as features the position in the sentence, the number of tokens and the number of pseudo-query terms.
    - Logistic Regression (LRM), using as features the number of query terms occurring in the sentence, the number of tokens (sentence length), the distance to the query terms and the position of the sentence.
    [36] uses pivoted QR matrix decomposition. A token-sentence matrix is built and from it the columns giving good coverage of the tokens are selected. Two different approaches are used for this process: a greedy election and a pivoted QR factorisation (a sketch of the latter follows this entry). [144] merged the LRM and HMM by including all the features of the LRM in the HMM. An additional feature was the conditional probability that a sentence is a summary sentence given that the previous sentence is. A post-process is run on extracted sentences to remove sentence-starting discourse markers and boilerplate, to improve cohesiveness. An extensive investigation was carried out to account for human performance in multi-document summarization. The conclusions were that single-document summaries could be used as a base for multi-document ones, but had to be enriched, possibly with discourse structure. Sentence pruning techniques were also found useful.
  - Language coverage: English, potentially multilingual
  - Output facilities and constraints:
- Evaluation: participated in DUC'01, DUC'02 and DUC'03. In DUC'02, it was ranked among the first systems, but did not beat the baselines. In DUC'03, it performed among the top systems.
- Classification
  - within classification 1 (level of processing): surface/entity
  - within classification 2 (kind of information): lexical
  - within classification 3 (Tucker, 1999): informational content
- Comments:
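A minimal sketch of the pivoted QR idea is given below; it assumes a plain term-sentence count matrix and uses SciPy's column-pivoted QR, whereas [36] builds its matrix from weighted tokens.

    import numpy as np
    from scipy.linalg import qr

    def select_sentences(term_sentence, k):
        """Choose k columns (sentences) of a term-sentence matrix via pivoted QR:
        the pivot order ranks columns by how much not-yet-covered (orthogonal)
        token content each one contributes."""
        _, _, pivots = qr(term_sentence, pivoting=True)
        return sorted(int(i) for i in pivots[:k])

    A = np.array([[2., 0., 1., 0.],   # rows = terms, columns = sentences
                  [1., 0., 1., 0.],
                  [0., 3., 0., 1.]])
    print(select_sentences(A, 2))     # indices of the two chosen sentences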
Cut-and-Paste

- Name: Cut-and-Paste
- Reference: [71], [73]
- Short description: sentence reduction for automatic text summarization. The system relates the phrases occurring in a summary written by a professional summarizer to the phrases occurring in the original document.
- System Features
  - Input:
  - Architecture: six editing operations (learned from the performance of human summarizers) are used for sentence reduction:
    - removing extraneous phrases
    - combining a reduced sentence with other reduced sentences
    - syntactic transformations
    - substitution with paraphrases
    - substitution with more general or more specific descriptors
    - reordering
  - Language coverage: English
  - Output facilities and constraints:
- Evaluation:
- Classification
  - within classification 1 (level of processing): surface
  - within classification 2 (kind of information): structural
  - within classification 3 (Tucker, 1999): informational content
- Comments:
DiaSumm

- Name: DiaSumm
- Reference: [169]
- Short description: automatic summarization of spoken dialogues in unrestricted domains
- System Features: Dealing with non-textual documents implies that additional problems have to be faced. If the input comes from ASR (with or without confidence scores), speech disfluencies have to be detected and removed. Besides, sentence boundaries have to be detected and inserted. Topic segmentation plays a more important role in this situation. In addition, in the case of multi-party dialogues, relations between moves have to be identified (e.g. linking of question/answer pairs).
  - Input: spoken dialogues
  - Architecture: DiaSumm is organised in the following modules:
    1. speech disfluency detection and removal
    2. identification and insertion of sentence boundaries
    3. identification and linking of question-answer regions
    4. topical segmentation
    5. information condensation (using MMR)
  - Language coverage: English, German
  - Output facilities and constraints:
- Evaluation:
- Classification
  - within classification 1 (level of processing): surface
  - within classification 2 (kind of information): lexical
  - within classification 3 (Tucker, 1999): discourse structure
- Comments:
DMSumm

- Name: DMSumm (Discourse Modeling SUMMarizer)
- Reference: [124]
- Short description: a three-layered discourse-based summarizer
- System Features
  - Input: single document
  - Architecture: DMSumm is a deep approach to the summarization problem. It has three steps: content selection, text planning and linguistic realization. The content selection process selects the information to be communicated in the summaries; the text planning maps semantic and intentional relations onto rhetorical relations, building rhetorical text plans; the linguistic realization expresses the plans in the written summaries. It is based on a discourse model composed of three different knowledge sources, i.e., the semantic, intentional and rhetorical levels. Some basic generation restrictions are supposed to be verified: communicative goal satisfaction and central proposition preservation.
  - Output facilities and constraints:
  - Language coverage: English and Brazilian Portuguese
- Classification
  - within classification 1 (level of processing): discourse
  - within classification 2 (kind of information): structural
  - within classification 3 (Tucker, 1999): discourse structure
- Evaluation:
- Comments:
eSseNSe, NewsInESSence, WebInESSence

- Name: eSseNSe, NewsInESSence, WebInESSence
- Reference: [131], [132]
- Short description: eSseNSe is basically a system for clustering documents after/before retrieval, single/multi-document summarization, and personalization and recommendation of documents. From it, two systems applied respectively to news (NewsInESSence) and Web pages (WebInESSence) have been derived.
- System Features
  - Input:
  - Architecture: These systems are based on CST (Cross-Document Structure Theory). CST (which is related to RST for single documents) proposes a taxonomy of the informational relationships between documents in clusters of related documents. In NewsInESSence the aim is finding, visualizing and summarizing a topic-based cluster of news stories. A user selects a single news story from a news Web site. The system searches other live sources of news for other stories related to this one and presents summaries.
  - Language coverage: English, potentially multilingual
  - Output facilities and constraints:
- Evaluation:
- Classification
  - within classification 1 (level of processing): surface
  - within classification 2 (kind of information): lexical
  - within classification 3 (Tucker, 1999): attentional networks
- Comments:
FociSum

- Name: FociSum
- Reference: [76], [75]
- Short description: summarizing long documents. Domain-specific informative and indicative summarization for Information Retrieval. Closely related to CENTRIFUSER.
- System Features
  - Input:
  - Architecture: Summarization of long documents presents interesting characteristics that do not occur in conventional summarization systems (usually applied to summarize news, articles, Web pages and the like). In long-document summarization the sentences to be extracted occur in distant locations, so coherence properties are of less importance here. FociSum is a hybrid system that merges: i) Information Extraction techniques (template-based), ii) sentence extraction (including both sentence-based and lead-based strategies), and iii) the dynamically determined foci of the text (in this context the focus is the topic). Foci are built from NEs and multiword terms.
  - Language coverage: English
  - Output facilities and constraints:
- Evaluation:
- Classification
  - within classification 1 (level of processing): entity
  - within classification 2 (kind of information): understanding
  - within classification 3 (Tucker, 1999): attentional networks
- Comments:
GISTexter

- Name: GISTexter
- Reference: [59, 83]
- Short description: produces multidocument extracts and abstracts by template-driven IE. Templates are chosen by their adequacy to the topic of the document or collection of documents. Single-document summaries are produced by sentence extraction and compression.
- System Features
  - Input: collections of documents dealing with the same topic.
  - Architecture: for single documents, the most relevant sentences are extracted and compressed by rules that are learned from a corpus of human-written abstracts and their source texts (no further detail of these processes is given). For multi-document summarization, the system proceeds as follows:
    - the IE system CICERO extracts relevant information by applying templates that are determined by the topic of the collection. Each template keeps a record of the text snippets where the information has been extracted from. If one of these snippets contains an anaphoric element, its co-reference chain is also recorded. If no template is provided for a given topic, a template is generated ad hoc, based on the topical relations of the words in WordNet.
    - the dominant event of the collection is determined, and templates are classed depending on how central the dominant event is in the template and in the document the template is extracted from.
    - within each class, templates are ordered by their representativeness. Highly representative templates are those that have the same slot fillers in the same slots as the majority of templates. Also, templates related to text snippets crossed by co-reference chains are more representative.
    - the summary is made from the text snippets recorded by the most representative template in the class of templates most closely related to the dominant event in the collection, in their order of appearance in the text. If they contain an anaphoric element, sentences containing the antecedent are also included. If the summary is too long, the linguistic form of dates and locations is shortened, unimportant coordinated phrases are dropped or, finally, the last sentence is dropped until the targeted length is achieved. If the summary is too short, the same process is applied to the most representative templates of the other classes of templates, in order of closeness to the dominant event.
  - Language coverage: English
  - Output facilities and constraints:
- Evaluation: participated in DUC 2002 and was ranked among the first: the best coverage rates for single and multi-document summarization, only surpassed by one system as to precision in multi-document summarization. In DUC 2003 they participated with Lite-GISTexter, which uses minimal lexico-semantic resources, obtaining good results for one of the four tasks.
- Classification
  - within classification 1 (level of processing): entity/discourse
  - within classification 2 (kind of information): understanding
  - within classification 3 (Tucker, 1999): informational content
- Comments: the mentioned reference does not provide much detail on some of the modules of the system.
GISTSumm

- Name:
- Reference: [125]
- Short description: an automatic text summarizer that tries to identify the text's main idea, i.e. the gist, in order to generate the corresponding summary.
- System Features
  - Input:
  - Architecture: It is based on the assumptions that it is possible to:
    - find a sentence that represents the main idea of a text, the gist;
    - find the gist by statistical methods;
    - produce coherent abstracts relating the gist to other sentences of the original text.
    It has two methods to summarize: via keywords or via a metric to find the most representative words of a text (tf*isf, term frequency - inverse sentence frequency; a sketch follows this entry).
  - Output facilities and constraints:
  - Language coverage: multilingual
- Classification
  - within classification 1 (level of processing): surface
  - within classification 2 (kind of information): lexical
  - within classification 3 (Tucker, 1999): attentional networks
- Evaluation:
- Comments:
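A minimal sketch of gist selection with tf*isf follows. The exact weighting and normalisation used by GistSumm are not detailed in [125], so the inverse-sentence-frequency form below, analogous to tf*idf with sentences playing the role of documents, is an assumption.

    import math
    from collections import Counter

    def gist_sentence(sentences):
        """Return the index of the sentence with the highest summed tf*isf score."""
        docs = [s.lower().split() for s in sentences]
        n = len(docs)
        sf = Counter(w for d in docs for w in set(d))   # sentence frequency
        def score(d):
            tf = Counter(d)
            return sum(f * math.log(n / sf[w]) for w, f in tf.items())
        return max(range(n), key=lambda i: score(docs[i]))

    sents = ["The dam burst after heavy rain flooded the valley.",
             "Heavy rain had fallen for a week.",
             "Officials said the dam was fifty years old."]
    print(sents[gist_sentence(sents)])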
GLEANS

- Name: GLEANS
- Reference: [41]
- Short description: IE-based multi-document summarizer; makes explicit the main entities and relations in a document collection. It produces headlines, extracts and a reduced form of abstract.
- System Features
  - Input:
  - Architecture: summarization proceeds through the following steps:
    - documents are parsed [64], the main constituents of each sentence are identified, some anaphoric expressions are resolved, and the result is finally mapped into a canonical representation that makes explicit the main entities and relations
    - each collection of documents is classified by its content into person, single event, multiple event or natural disaster
    - given the collection type and the canonical representation of the documents, the core entities and relations are extracted, by choosing the most salient words in the collection
    - a headline is created, based on the type of collection and the core entities and relations. For multiple-event collections, a short abstract can also be generated with the mechanisms used to generate headlines.
    - an abstract is generated by applying a library of canonical schemas obtained from manual analysis of abstracts in a training corpus. These schemas determine which sentences of a source text fulfill the requirements of a canonical summary, and extract them. Chronological coherence, redundancy and dangling discourse references are treated.
    - in a post-process, dangling discourse markers are removed, decisions are made on which anaphoric expressions to use for each entity, and temporal expressions are represented in a canonical form.
  - Language coverage: English
  - Output facilities and constraints:
- Evaluation: performance in DUC 2002 was not high: low coverage, but improved when document collections were correctly classified. Especially bad on headline generation.
- Classification
  - within classification 1 (level of processing): entity/discourse
  - within classification 2 (kind of information): understanding
  - within classification 3 (Tucker, 1999): informational content
- Comments:
Knight and Marcu 2000

- Name:
- Reference: [78]
- Short description: This system is not a full summarizer but a sentence compressor. Sentence compression is presented as a fundamental component of any high-quality non-extractive summarizer.
- System Features
  - Input:
  - Architecture: The system follows a statistical approach. Sentence compression is considered as a process of translation from a source language (the full text) into a target language (the summary). The process is accomplished following two different approaches: a conventional noisy-channel model (formulated after this entry) and decision trees (using C4.5). The probabilistic models are trained on a corpus of <full text, summary> pairs.
  - Language coverage: English, potentially multilingual
  - Output facilities and constraints:
- Evaluation:
- Classification
  - within classification 1 (level of processing): surface
  - within classification 2 (kind of information): lexical
  - within classification 3 (Tucker, 1999): sentence by sentence
- Comments: an enhancement of this approach was carried out later on, applying the same technique to rhetorical parse trees, with a scope beyond the sentence [42].
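In noisy-channel terms, a long sentence t is modelled as the result of adding optional material to a short source string s, so the best compression is

    \hat{s} \;=\; \arg\max_{s} p(s \mid t) \;=\; \arg\max_{s}\, p(s)\, p(t \mid s)

where the source model p(s) favours well-formed short strings and the channel model p(t | s) estimates how plausibly t expands s; both are estimated from the <full text, summary> corpus.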
Kraaij et al. 2001

- Name:
- Reference: [81]
- Short description: probabilistic single-document extractive summarizer.
- System Features
  - Input:
  - Architecture: The system follows a probabilistic approach. Two different statistical models are applied and their results are combined for selecting the sentences that have to be included in the summary. The former is a content-based language model (unigrams + smoothing; a sketch follows this entry) and the latter is based on non-content features (being or not the first sentence, containing cue phrases, sentence length, etc.).
  - Language coverage: English, potentially multilingual
  - Output facilities and constraints:
- Evaluation:
- Classification
  - within classification 1 (level of processing): surface
  - within classification 2 (kind of information): lexical
  - within classification 3 (Tucker, 1999): informational content
- Comments:
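A minimal sketch of such a content-based model is shown below, assuming linear interpolation of a document unigram model with an add-one-smoothed background model; [81] does not commit to this exact smoothing scheme.

    import math
    from collections import Counter

    def sentence_loglik(sentence, doc_tokens, bg_tokens, lam=0.7):
        """Length-normalised log-likelihood of a sentence under a unigram
        mixture: lam * P(w|document) + (1 - lam) * P(w|background)."""
        doc, bg = Counter(doc_tokens), Counter(bg_tokens)
        n_doc, n_bg, vocab = len(doc_tokens), len(bg_tokens), len(bg)
        ll = sum(math.log(lam * doc[w] / n_doc
                          + (1 - lam) * (bg[w] + 1) / (n_bg + vocab))
                 for w in sentence)
        return ll / len(sentence)

    print(sentence_loglik("rain flooded the valley".split(),
                          doc_tokens="heavy rain flooded the valley overnight".split(),
                          bg_tokens="the the a of rain news report".split()))

Sentences whose words are typical of the document but not of general language score highest and are preferred for extraction.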
Lal and Ruger 2002
- Name:
- Reference: [84]
- Short description: single-document, extract-based summarizer; applies anaphora resolution and text simplification.
- System Features
  - Input:
  - Architecture: following the approach of [82], it works as a Bayesian pattern classifier over sentences, trained on an annotated corpus. The features that are taken into account are: length of the sentence, position of the sentence within the paragraph and of the paragraph within the document, mean tf*idf of named entities, co-reference with named entities in the headline, and inclusion of highly co-referred named entities. Some dangling anaphors are replaced by their referents. Lexical simplification is performed with tools from the PSET project [31]. Background knowledge on people and places, taken from sources on the web, can also be included. (A toy sketch of this kind of Naive Bayes sentence classifier follows this entry.)
  - Language coverage: English
  - Output facilities and constraints:
- Evaluation: DUC 2002; performed well except for grammaticality and coherence.
- Classification
  - within classification 1 (level of processing): entity/discourse
  - within classification 2 (kind of information): lexical/structural
  - within classification 3 (Tucker, 1999): sentence by sentence
- Comments: A demonstration can be found at http://km.doc.ic.ac.uk/pr-p.lal-2002/, and the system can be downloaded as a CREOLE Repository for GATE users.
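A Bayesian pattern classifier of this family (in the line of [82]) can be sketched as follows: discretized sentence features are combined naively to estimate the probability that a sentence belongs in the extract. The feature names, value spaces and training examples below are invented for illustration.

    from collections import defaultdict

    class NaiveBayesExtractor:
        def __init__(self):
            # counts[label][(feature, value)] -> frequency
            self.counts = defaultdict(lambda: defaultdict(int))
            self.label_counts = defaultdict(int)

        def train(self, examples):
            for features, label in examples:
                self.label_counts[label] += 1
                for fv in features.items():
                    self.counts[label][fv] += 1

        def prob_summary(self, features):
            total = sum(self.label_counts.values())
            scores = {}
            for label, n in self.label_counts.items():
                p = n / total
                for fv in features.items():
                    # add-one smoothing over a tiny invented value space
                    p *= (self.counts[label][fv] + 1) / (n + 2)
                scores[label] = p
            return scores["summary"] / sum(scores.values())

    train = [({"position": "first", "length": "long"}, "summary"),
             ({"position": "middle", "length": "short"}, "other"),
             ({"position": "first", "length": "short"}, "summary"),
             ({"position": "last", "length": "long"}, "other")]
    nb = NaiveBayesExtractor()
    nb.train(train)
    print(nb.prob_summary({"position": "first", "length": "long"}))  # -> 0.75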
Lethbridge, University of
- Name: University of Lethbridge
- Reference: [27, 32]
- Short description: single- and multi-document lexical chain summarizer by extraction. It filters out chain candidates in subordinate clauses.
- System Features
  - Input:
  - Architecture: for multi-document summaries, the procedure is the same as for single documents (below), but all segments in the collection are pooled together, assigning a time stamp to each.
    - topic segmentation of the text
    - removing unimportant nouns from the text (nouns in subordinate clauses)
    - lexical chaining (a toy sketch of chain building follows this entry)
    - sentence extraction
    - surface repairs: add the previous sentence to a sentence containing a dangling anaphor; remove short sentences or sentences with question or quotation marks
  - Language coverage: English, potentially multilingual
  - Output facilities and constraints:
- Evaluation: DUC 2002, but no results reported in the reference. They also participated in DUC 2003, obtaining "reasonable results" but admitting that "some improvements are still required when considering multi-document summarization".
- Classification
  - within classification 1 (level of processing): entity
  - within classification 2 (kind of information): lexical
  - within classification 3 (Tucker, 1999): attentional networks
- Comments:
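Lexical chaining groups occurrences of related words across a text; sentences touching the strongest chains are the extraction candidates. A minimal sketch, assuming a toy synonym table in place of WordNet and pre-extracted per-sentence noun lists as input.

    # Invented relations standing in for WordNet.
    SYNONYMS = {"car": {"automobile", "vehicle"},
                "automobile": {"car", "vehicle"},
                "vehicle": {"car", "automobile"}}

    def related(a, b):
        return a == b or b in SYNONYMS.get(a, set())

    def build_chains(sentences):
        chains = []  # each chain: list of (sentence_index, word)
        for i, sent in enumerate(sentences):
            for word in sent:
                for chain in chains:
                    if any(related(word, w) for _, w in chain):
                        chain.append((i, word))
                        break
                else:
                    chains.append([(i, word)])
        return chains

    def extract(sentences, n=1):
        # Keep sentences touched by the strongest (longest) chain.
        chains = sorted(build_chains(sentences), key=len, reverse=True)
        picked = sorted({i for i, _ in chains[0]})[:n]
        return [sentences[i] for i in picked]

    doc = [["car", "dealer"], ["weather", "report"], ["vehicle", "price"]]
    print(extract(doc, n=2))  # -> [['car', 'dealer'], ['vehicle', 'price']]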
K. U. Leuven
- Name:
- Reference: [6, 7]
- Short description: adapts a hierarchical topic segmentation algorithm to text summarization.
- System Features
  - Input:
  - Architecture: For multi-document summarization, a combination of topic segmentation and clustering techniques is applied, while for single-document headline generation, topic segmentation is combined with sentence scoring and compression.
    Thematic structures in texts are detected using generic text structure cues:
    - lexical chains are built following [17] but using only WordNet synonymy relations
    - the topic of each sentence is determined by general topicality mechanisms of English (initial position, persistency)
    - topics are distinguished from subtopics, because the former spread throughout the whole text, while the latter have local scope (a toy sketch of this spread criterion follows this entry)
    - for single-document summarization, the number of levels of the topic hierarchy is restricted by the targeted summary length, so that only sentences in higher levels are included
    - for multiple-document summarization, headline-like summaries are produced by listing non-redundant topic terms. For longer summaries, open-class words of every sentence in the collection are clustered.
    Key terms are associated to each topic, and a tree-like table of contents is produced.
  - Language coverage: English, potentially multilingual
  - Output facilities and constraints: oriented to tables of contents; lacks cohesion for texts.
- Evaluation: DUC 2002, average scores, bad for short abstracts. In DUC 2003, the strategy for very short abstracts (headlines) was significantly improved, combining the informativeness of topic terms with hand-crafted grammatical rules for sentence compression, which resulted in very good results for the task of headline generation. In the other tasks, results were average.
- Classification
  - within classification 1 (level of processing): entity
  - within classification 2 (kind of information): lexical
  - within classification 3 (Tucker, 1999): attentional networks
- Comments:
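One way to read the topic/subtopic criterion is as a measure of spread: a term whose occurrences cover most of the text is a topic, while one confined to a narrow region is a subtopic. A toy sketch under that reading; the 0.5 threshold is an invented parameter, not the system's.

    def spread(term, sentences):
        # Fraction of the text covered between the term's first and
        # last occurrences (sentences given as sets of words).
        hits = [i for i, s in enumerate(sentences) if term in s]
        if not hits:
            return 0.0
        return (hits[-1] - hits[0] + 1) / len(sentences)

    def classify_terms(terms, sentences, threshold=0.5):
        return {t: ("topic" if spread(t, sentences) >= threshold else "subtopic")
                for t in terms}

    doc = [{"election", "results"}, {"election", "turnout"},
           {"exit", "polls"}, {"election", "winner"}]
    print(classify_terms({"election", "polls"}, doc))
    # -> {'election': 'topic', 'polls': 'subtopic'}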
Lexical Bonds
- Name: Lexical Bonds
- Reference: [77]
- Short description: extractive single-document system based on the analysis of lexical bonds between sentences in a text and a classification of sentences into important and unimportant using SVM.
- System Features
  - Input: single documents
  - Architecture: the original design includes a transformation phase that should compact the text extracted in the first phase and resolve anaphoric references, but it is not yet developed. The current architecture is:
    - sentences are split and stopwords are removed
    - a record of features is built for every sentence: sentence position, number of words, number of backward, forward and total lexical bonds and lexical links, and information content
    - a lexical link between two sentences is found when a word stem occurs in both of them; a lexical bond is found when there are two or more lexical links between a pair of sentences [67] (a toy sketch of link and bond counting follows this entry)
    - the information content of a sentence is the IR function BM25 [146], which indicates the importance of the sentence with respect to the document
    - SVMs are used to select sentences according to these features (trained on DUC'02 manually selected extracts)
    - summaries are generated by following lexical bonds from a given sentence. Some constraints are: only sentences in the upper half of the document and selected by the SVM are considered.
    The system produces cohesive summaries, but they are very redundant.
  - Language coverage: English, potentially multilingual
  - Output facilities and constraints: the compaction process is under development.
- Evaluation: participated in DUC 2002, with good results in quality.
- Classification
  - within classification 1 (level of processing): surface/entity
  - within classification 2 (kind of information): discourse
  - within classification 3 (Tucker, 1999): attentional networks
- Comments:
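Link and bond counting as defined above is straightforward to sketch: a link is a word stem shared by two sentences, and a bond requires at least two links [67]. The naive suffix-stripping stemmer below is a stand-in for a real one.

    from itertools import combinations

    def stem(word):
        # Crude suffix stripping, for illustration only.
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def links(s1, s2):
        # Number of stems shared by the two sentences.
        return len({stem(w) for w in s1} & {stem(w) for w in s2})

    def bonds(sentences, min_links=2):
        return [(i, j) for (i, s1), (j, s2)
                in combinations(enumerate(sentences), 2)
                if links(s1, s2) >= min_links]

    doc = [["markets", "rallied", "after", "rate", "cut"],
           ["analysts", "expected", "the", "rate", "cuts"],
           ["weather", "delayed", "flights"]]
    print(bonds(doc))  # -> [(0, 1)]: 'rate' and 'cut(s)' are shared stems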
MEAD
- Name: MEAD
- Reference: [133, 128, 121, 129]
- Short description: Centroid-based multi-document summarization.
- System Features
  - Input:
  - Architecture: MEAD begins by identifying all the articles related to an emerging event (using the CIDR Topic Detection and Tracking system). CIDR produces a set of clusters. From each cluster a centroid is built. Then the sentences closest to each of the centroids are selected to be included in the summary. CBSU (Centroid-based sentence utility) scores the degree of relevance of a particular sentence to the general topic of the entire cluster. CSIS (Cross-sentence informational subsumption) measures the overlap between the informational content of the sentences. CSIS is a measure similar to MMR; the difference is that CSIS is multi-document and query-independent while MMR is single-document and query-based. More recent versions of MEAD use a linear combination of three features: the centroid score, a positional score that assigns higher values to sentences closer to the beginning of the document, and a length score that favours longer sentences. (A toy sketch of this combination follows this entry.)
  - Language coverage: multilingual: English, Chinese, potentially any language
  - Output facilities and constraints:
- Evaluation: DUC 2001, 2002 and 2003. In DUC 2002 they had format problems (SGML tags). In DUC 2003 they had the best score for question-focused multi-document summaries, and performed among the top 3 systems for all multi-document summarization tasks.
- Classification
  - within classification 1 (level of processing): surface
  - within classification 2 (kind of information): lexical
  - within classification 3 (Tucker, 1999): informational content
- Comments:
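The linear combination used by the more recent versions of MEAD can be sketched as follows; the weights and the simple tf-based centroid are illustrative defaults, not MEAD's exact formulation.

    from collections import Counter

    def mead_scores(sentences, w_centroid=1.0, w_position=1.0, w_length=0.5):
        words = [s.split() for s in sentences]
        centroid = Counter(w for s in words for w in s)
        max_len = max(len(s) for s in words)
        n = len(sentences)
        scores = []
        for i, s in enumerate(words):
            c = sum(centroid[w] for w in s) / sum(centroid.values())
            p = (n - i) / n                 # earlier sentences score higher
            l = len(s) / max_len            # longer sentences score higher
            scores.append(w_centroid * c + w_position * p + w_length * l)
        return scores

    cluster = ["quake hits coastal city overnight",
               "rescue teams reach the coastal city",
               "a film festival opened yesterday"]
    for sc, s in sorted(zip(mead_scores(cluster), cluster), reverse=True):
        print(f"{sc:.3f}  {s}")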
MULTIGEN
- Name: MULTIGEN
- Reference: [19, 109]
- Short description: Multi-document summarization using Information Fusion and Reformulation.
- System Features
  - Input: News articles presenting different descriptions of the same event.
  - Architecture:
    - identify similarities and differences across documents by statistical techniques [111]
    - extract sets of similar sentences: THEMES
    - shallow syntactic analysis
    - order sets of similar sentences (Reformulation). Two different forms of implementing ordering are included: majority ordering and chronological ordering.
    - generation: sentence generation begins with phrases, with paraphrase rules derived from corpus analysis. MULTIGEN profits from the experience of Columbia's group in NL Generation for building high-quality summaries (not extracts but abstracts).
  - Language coverage: English
  - Output facilities and constraints:
- Evaluation:
- Classification
  - within classification 1 (level of processing): entity
  - within classification 2 (kind of information): structural
  - within classification 3 (Tucker, 1999): informational content
- Comments: MULTIGEN has been extended in several directions. See Columbia MDS [18, 110], PERSIVAL [107] and CENTRIFUSER [75], among others.
Muresan et al. 2001, Tzoukermann et al. 2001
- Name:
- Reference: [117, 158]
- Short description: e-mail summarization combining Machine Learning and linguistic information.
- System Features
  - Input:
  - Architecture: The basic process consists in learning the salient NPs occurring in the text. The following features are used for the learning task (a toy sketch of two of them follows this entry):
    - for the head of the NP:
      - head-tf*idf (relevance)
      - head-fo (position of the first occurrence of the head)
    - for the whole NP:
      - np-tf*idf
      - np-fo
      - np-length-words
      - np-length-chars
      - sentence-position
      - paragraph-position
      - all constituents in the NP equally weighted
    Different ML methods have been applied, including decision trees (C4.5) and rule induction (Ripper). The linguistic process includes:
    - inflectional morphology processing
    - removing unimportant modifiers
    - removing common words
    - removing empty words
  - Language coverage: English, potentially multilingual
  - Output facilities and constraints:
- Evaluation:
- Classification
  - within classification 1 (level of processing): entity
  - within classification 2 (kind of information): understanding
  - within classification 3 (Tucker, 1999): attentional networks
- Comments:
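Two of the features above, head-tf*idf and head-fo, can be sketched as follows; the idf table is an invented placeholder for corpus-derived values.

    import math

    def head_features(head, tokens, idf):
        tf = tokens.count(head)
        first = tokens.index(head) if head in tokens else len(tokens)
        return {
            "head-tf*idf": tf * idf.get(head, math.log(1000)),  # rare-word default
            "head-fo": first / len(tokens),   # 0.0 = document-initial
        }

    tokens = "the budget committee approved the budget after debate".split()
    idf = {"budget": 2.3, "committee": 3.1, "the": 0.1}   # invented values
    print(head_features("budget", tokens, idf))
    # -> {'head-tf*idf': 4.6, 'head-fo': 0.125}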
Myaeng and Jang 1999
- Name:
- Reference: [118]
- Short description: Single-document summarizer based on statistical techniques.
- System Features
  - Input:
  - Architecture: The system uses two similarity measures for determining whether a sentence belongs to the major content: a similarity between the sentence and the rest of the document, and a similarity between the sentence and the title of the document (a toy sketch of these two measures follows this entry). Two statistical techniques are applied: a Bayesian model based on 14 features (signature terms and positional information) and the Dempster-Shafer combination rule.
  - Language coverage: English, potentially multilingual
  - Output facilities and constraints:
- Evaluation:
- Classification
  - within classification 1 (level of processing): surface
  - within classification 2 (kind of information): lexical
  - within classification 3 (Tucker, 1999): informational content
- Comments:
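The description above does not fix a particular similarity function, so the sketch below uses cosine over term-frequency vectors, a common stand-in, to compute both measures: sentence vs. rest of the document and sentence vs. title.

    import math
    from collections import Counter

    def cosine(a, b):
        num = sum(a[w] * b[w] for w in a.keys() & b.keys())
        den = math.sqrt(sum(v * v for v in a.values())) * \
              math.sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    def similarities(sentences, title):
        title_vec = Counter(title.split())
        out = []
        for i, s in enumerate(sentences):
            rest = Counter(w for j, t in enumerate(sentences)
                           if j != i for w in t.split())
            vec = Counter(s.split())
            out.append((cosine(vec, rest), cosine(vec, title_vec)))
        return out

    sents = ["storm damages the harbour", "repairs to the harbour begin"]
    print(similarities(sents, "harbour storm"))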
NeATS, iNeATS
- Name: NeATS
- Reference: [92, 93, 88]
- Short description: Multi-document summarizer presented in DUC'01 and DUC'02.
- System Features
  - Input:
  - Architecture: NeATS proceeds in the following steps (a toy sketch of the content filter in step 2 follows this entry):
    1. Extracting and ranking passages:
      - identification of key concepts for each topic group
      - computation of unigram, bigram and trigram Topic Signatures
      - removal of words or phrases occurring in less than half of the texts
      - saving signatures in a tree
      - Webclopedia query formation
      - sentence-level IR yielding a ranked list of sentences
    2. Filtering for content: remove all sentences that are not within the first 10 sentences of a document; decrease the ranking score of sentences containing stigma words.
    3. Enforcing cohesion and coherence by pairing each sentence with the lead sentence of the document.
    4. Filtering for length: include sentences (paired with the corresponding lead sentence) that are most different from the already included ones, until the targeted length is satisfied.
    5. Ensuring chronological coherence.
    As an additional enhancement, Leuski et al. (2003) [88] provide a graphical interface to improve the navigation and modification of the summaries produced by NeATS.
  - Language coverage: English, potentially multilingual
  - Output facilities and constraints:
- Evaluation: in DUC 2002, it was the system with the highest precision and F1 measure, although it performed low in recall.
- Classification
  - within classification 1 (level of processing): entity
  - within classification 2 (kind of information): structural
  - within classification 3 (Tucker, 1999): informational content
- Comments:
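The content filter of step 2 is easy to sketch: a hard cut on document position plus a soft penalty for stigma words. The stigma list and the penalty factor below are invented examples.

    STIGMA = {"said", "quote", "according"}

    def filter_for_content(ranked, penalty=0.5, cutoff=10):
        """ranked: list of (score, doc_position, sentence) tuples."""
        out = []
        for score, pos, sent in ranked:
            if pos >= cutoff:
                continue                          # hard positional filter
            if STIGMA & set(sent.lower().split()):
                score *= penalty                  # soft stigma-word penalty
            out.append((score, pos, sent))
        return sorted(out, reverse=True)

    ranked = [(0.9, 2, "Officials said talks had stalled"),
              (0.8, 0, "Talks on the treaty resumed in Geneva"),
              (0.7, 14, "The city has two airports")]
    print(filter_for_content(ranked))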
Newsblaster
- Name: Multilingual Columbia Newsblaster
- Reference: [49]
- Short description:
- System Features
  - Input: multi-document
  - Architecture: A platform for multilingual news summarization that extends Columbia's Newsblaster system [106]. The system adds a new component, translation, to the original six major modules: crawling, extraction, clustering, summarization, classification and web page generation, which have in turn been modified to allow multilinguality (language identification, different character encodings, language idiosyncrasy, etc.).
    In this system, multilingual documents are translated into English before clustering, so clustering is performed only on English texts.
    Translation is carried out at two levels. As a low-quality translation is usually enough for clustering purposes and for assessing the relevance of sentences, a simple and fast technique is applied for glossing the input documents prior to clustering. Higher (relatively) quality translation (using Altavista's Babelfish interface to Systran) is performed in a second step, only over the fragments selected to be part of the summary.
    The system also takes into account the possible degradation of the input texts as a result of the translation process (most of the sentences resulting from this process are simply not grammatically correct).
  - Output facilities and constraints:
  - Language coverage: crosslingual
- Evaluation:
- Classification
  - within classification 1 (level of processing): entity
  - within classification 2 (kind of information): structural
  - within classification 3 (Tucker, 1999): informational content
- Comments:
NTT
- Name: NTT
- Reference: [65, 66]
- Short description: extractive summarizer based on classification of sentences by Support Vector Machines (SVM) and Maximal Marginal Relevance (MMR).
- System Features
  - Input:
  - Architecture: each sentence in a document is described with the following features: position, length, weight (tf*idf score of the words in the sentence), similarity with the headline, and presence of certain prepositions or verbs. An SVM classifies sentences on these features, and MMR is used to keep redundancy down (a toy sketch of MMR follows this entry).
  - Language coverage: English, potentially multilingual
  - Output facilities and constraints:
- Evaluation: participated in DUC'02, with good results in coverage but low quality. For DUC 2003, NTT achieved the highest metrics for readability in the two multi-document summarization tasks it took part in, and obtained average positions for coverage.
- Classification
  - within classification 1 (level of processing): surface
  - within classification 2 (kind of information): lexical
  - within classification 3 (Tucker, 1999): sentence by sentence
- Comments:
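MMR greedily picks sentences that are relevant to the query or topic but dissimilar from those already selected. A minimal sketch; Jaccard similarity stands in for whatever similarity the actual system used, and lambda is a free parameter.

    def jaccard(a, b):
        a, b = set(a.split()), set(b.split())
        return len(a & b) / len(a | b) if a | b else 0.0

    def mmr(sentences, query, k=2, lam=0.7):
        selected = []
        candidates = list(sentences)
        while candidates and len(selected) < k:
            # relevance to the query minus similarity to what is chosen
            best = max(candidates, key=lambda s:
                       lam * jaccard(s, query)
                       - (1 - lam) * max((jaccard(s, t) for t in selected),
                                         default=0.0))
            selected.append(best)
            candidates.remove(best)
        return selected

    sents = ["the rover landed on mars",
             "the mars rover landed safely",
             "mission control celebrated the landing"]
    print(mmr(sents, "mars rover landing"))

Note how the second pick skips the near-duplicate second sentence in favour of the less similar third one: that is the redundancy-reduction effect MMR is used for.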
OCELOT
- Name: OCELOT
- Reference: [115]
- Short description: Summarization of Web pages; gisting of Web documents based on probabilistic models.
- System Features
  - Input:
  - Architecture: OCELOT is one of the applications of a general probabilistic approach that models summarization as a translation process between two languages, the language of full texts and the language of summaries. Berger, in his thesis, applies conventional stochastic translation methods to summarization. Three different examples of application are provided, and OCELOT is one of them.
  - Language coverage: English, potentially multilingual
  - Output facilities and constraints:
- Evaluation:
- Classification
  - within classification 1 (level of processing): surface
  - within classification 2 (kind of information): lexical
  - within classification 3 (Tucker, 1999): informational content
- Comments:
PERSIVAL
- Name: PERSIVAL
- Reference: [107]
- Short description: PERSIVAL (Personalized Retrieval and Summarization of Image, Video and Language). The system builds patient-specific (tailored access for both patients and physicians) summaries of medical articles contained in a distributed multimedia patient care digital library. It is a Digital Library project.
- System Features
  - Input: Multimedia collections in the medical domain
  - Architecture: Multimedia search triggered by a concept from the patient's data. The system includes the annotation and organization of large collections of video data. Video documents are segmented and a storyboard summary is produced. Videos are indexed at syntactic and semantic levels. A set of content-based video search tools has been developed. The system includes the use of the DEFINDER tool (for looking up definitions).
  - Language coverage: English
  - Output facilities and constraints:
- Evaluation:
- Classification
  - within classification 1 (level of processing): entity
  - within classification 2 (kind of information): understanding
  - within classification 3 (Tucker, 1999): informational content
- Comments:
RIPTIDES
- Name: RIPTIDES
- Reference: [164, 162]
- Short description: user-directed document summarizer combining techniques of Information Extraction, extraction-based summarization and Natural Language Generation. The former reference refers to single-document summarization, the latter to multi-document summarization.
- System Features
  - Input:
  - Architecture: The system proceeds in the following steps:
    1. User information needs are acquired by the system.
    2. Scenario templates are filled by an IE system.
    3. IE output templates are merged into an event-oriented structure where comparable facts are grouped. SimFinder is used for this purpose.
    4. Importance scores are assigned to slots/sentences based on a combination of document position, document recency and group/cluster membership.
    5. Content selection.
    6. Summary generation.
  - Language coverage: English
  - Output facilities and constraints:
- Evaluation:
- Classification
  - within classification 1 (level of processing): entity
  - within classification 2 (kind of information): understanding
  - within classification 3 (Tucker, 1999): informational content
- Comments:
Schiffman et al. 2001
- Name:
- Reference: [143]
- Short description: Multi-document summarizer producing biographical summaries, combining linguistic knowledge with corpus statistics.
- System Features
  - Input:
  - Architecture: A number of modules co-operate to produce the summaries:
    - sentence tokenizer
    - Alembic POS tagger
    - NameTag NER
    - Cass parser
    - cross-document co-reference
    - appositives
    - relative clause weighting
    - sentential description, following [Saggion, Lapalme, 2000]
  - Language coverage: English
  - Output facilities and constraints:
- Evaluation:
- Classification
  - within classification 1 (level of processing): entity
  - within classification 2 (kind of information): understanding
  - within classification 3 (Tucker, 1999): informational content
- Comments:
SUMMARIST
- Name: SUMMARIST
- Reference: [69, 95]
- Short description: Extractive single-document summarization system.
- System Features
  - Input:
  - Architecture: The system proceeds in three steps: topic identification, interpretation and summary generation.
    - Topic identification implies the previous acquisition of Topic Signatures and then the identification of a text span as belonging to a topic characterized by its signature. Topic Signatures are tuples of the form <Topic, Signature>, where Signature is a list of weighted terms: <t1,w1>, <t2,w2>, ..., <tn,wn>. Topic signatures can be automatically learned ([Lin, 1997], [Lin, Hovy, 2000]). Topic identification, then, includes text segmentation (using TextTiling) and comparison of text spans with existing Topic Signatures (a toy sketch of this comparison follows this entry).
    - The topics identified are fused during interpretation (the second step of the process). The fused topics are then reformulated (expressed in new terms).
    - The last step is a conventional extractive task.
  - Language coverage: multilingual: English, Japanese, Spanish, Arabic, Indonesian, Korean, potentially any language
  - Output facilities and constraints:
- Evaluation:
- Classification
  - within classification 1 (level of processing): surface
  - within classification 2 (kind of information): lexical
  - within classification 3 (Tucker, 1999): attentional networks
- Comments:
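Matching a text span against Topic Signatures of the form <Topic, <t1,w1>, ..., <tn,wn>> can be sketched as assigning the span the topic whose signature terms it covers with the greatest total weight. The signatures below are invented examples, not learned ones.

    SIGNATURES = {
        "aviation": [("airport", 2.1), ("flight", 1.8), ("runway", 1.5)],
        "finance":  [("market", 2.0), ("shares", 1.7), ("bank", 1.4)],
    }

    def identify_topic(span_words, signatures=SIGNATURES):
        span = set(span_words)
        scores = {topic: sum(w for term, w in sig if term in span)
                  for topic, sig in signatures.items()}
        return max(scores, key=scores.get), scores

    span = "the flight was diverted to another airport".split()
    print(identify_topic(span))   # -> ('aviation', {...})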
SumUM
- Name: SumUM
- Reference: [50, 139, 51]
- Short description: generates single-document abstracts of scientific papers, based on shallow syntactic and semantic analysis oriented to conceptual identification, and on hand-made templates for text regeneration. It interacts with the user. For DUC, an adaptation was made to obtain biased multi-document summaries.
- System Features
  - Input: single documents, scientific or technical articles with the following structure: title, author and affiliation, introduction, main section, references. There is also an adaptation for multi-document summarization.
  - Architecture:
    - transducers identify concepts in the text: domain transducers identify author, references, etc., and linguistic transducers identify noun groups and verb groups
    - concepts are tagged semantically, marking discourse domain relations
    - sentences of indicative and informative type are identified
    - an indicative abstract is composed, by re-generation of text using pre-defined summary templates
    - based on the first, indicative abstract, an informative abstract can be composed, elaborating a specific query of the user
  - Language coverage: English
  - Output facilities and constraints: an interactive system: the user is presented with a short indicative abstract and a list of topics available for expansion, and an informative abstract can be produced, focusing on the topics chosen by the user.
- Evaluation: it was formally adapted to participate in DUC 2002, but with no adaptation to the news domain. It was ranked among the first three in quality, and second in length-adjusted coverage, most probably due to the efficiency of the templates. In DUC 2003, SumUM was adapted for biased multi-document summarization, achieving good scores for coverage but with a decrease in the quality of the resulting summaries.
- Classification
  - within classification 1 (level of processing): entity
  - within classification 2 (kind of information): understanding
  - within classification 3 (Tucker, 1999): informational content
- Comments:
Strzalkowski 1998
- Name:
- Reference: [148]
- Short description: Query-based single-document non-extractive summarizer.
- System Features
  - Input:
  - Architecture: The system proceeds in two steps, analysis and generation. The analysis phase consists of three tasks: feature extraction, feature synthesis and rule induction. As a result, a set of themes is identified. The system uses both simple and composite features. Simple features include word co-occurrence, noun phrases (detected with LinkIT), WordNet synonyms and common semantic classes for verbs (following Levin's classes; see [Klavans, Kan, 1998]). The generation phase includes a content planner (based on the intersection of themes obtained in the previous phase and on a sentence planner) and a sentence generator.
  - Language coverage: English
  - Output facilities and constraints:
- Evaluation:
- Classification
  - within classification 1 (level of processing): entity
  - within classification 2 (kind of information): structural
  - within classification 3 (Tucker, 1999): informational content
- Comments:
Teufel and Moens
- Name:
- Reference: [155, 156]
- Short description: analyzes the rhetorical structure of scientific articles and produces extractive summaries with the main contributions.
- System Features
  - Input: scientific articles (specialized in the computational linguistics domain)
  - Architecture: Each sentence in an article is described with a number of features, like its length (in words) or its position in the document. But the main emphasis is put on describing the contribution of each sentence to the rhetorical structure of the document. To do that, a number of linguistic knowledge sources are exploited, among others: document layout, section titles, lexico-syntactic structures, citation procedures and cue phrases typical of the genre of scientific articles.
    Then, a machine learning algorithm is applied to classify each sentence as one of a number of rhetorical categories that account for the rhetorical status of the sentence with respect to the whole text. A parallel classification is carried out to determine the relevance of each sentence.
  - Output facilities and constraints:
  - Language coverage: English
- Evaluation: an evaluation by comparison with a human-made gold standard is presented in [156], with good results.
- Classification
  - within classification 1 (level of processing): discourse
  - within classification 2 (kind of information): structural
  - within classification 3 (Tucker, 1999): discourse structure / sentence by sentence
- Comments:
TNO-TPD summarizer
- Name: TNO-TPD summarizer
- Reference: [81, 80]
- Short description: extractive multi-document summarizer. Sentences are selected according to a statistical language model and by applying a Bayesian classifier.
- System Features
  - Input:
  - Architecture:
    - a unigram language model of a cluster of documents determines the content-based salience of each sentence
    - each sentence is assigned values for some surface features: sentence position, length, presence of positive or negative cue phrases, and the aforementioned content score
    - sentences are classified by a Naive Bayes classifier into summary and non-summary sentences
    - redundancy is reduced by applying MMR [30]
    - to generate headlines, the most frequent word in the highest-ranked sentence of every document and in the titles is considered a trigger word. Then, the sentences in the whole cluster are ranked according to their importance, and the highest-ranked noun phrase that contains the trigger word is chosen as the headline.
  - Language coverage: English, potentially multilingual
  - Output facilities and constraints:
- Evaluation: participated in DUC 2002 in the multi-document extract and abstract tracks, with "disappointing performance". In addition, a self-evaluation applying relative utility [133] was carried out, which reports better results. An investigation of the individual contribution of each feature was also performed, revealing that sentence position is highly indicative, while the negative cue phrase feature was not well defined.
- Classification
  - within classification 1 (level of processing): surface
  - within classification 2 (kind of information): lexical
  - within classification 3 (Tucker, 1999): attentional networks / sentence by sentence
- Comments:
van Halteren 2002
- Name:
- Reference: [159]
- Short description: multi-document, extractive summarizer. Sentences are classified using feature sets originally designed for writing style recognition.
- System Features
  - Input:
  - Architecture: each sentence is described by a set of features: distance between occurrences of the same word, distribution of words, relative position of words, sentence length, sentence position and context of POS tags. A classifier trained for a writing style recognition task exploits these features for sentence scoring and extraction.
  - Language coverage: English, potentially multilingual
  - Output facilities and constraints:
- Evaluation: participated in DUC 2002, but obtained rather poor results.
- Classification
  - within classification 1 (level of processing): surface
  - within classification 2 (kind of information): lexical
  - within classification 3 (Tucker, 1999): sentence by sentence
- Comments: the system was trained on materials not oriented to the summarization task.