
Automatic summarisation II - methods

Constantin Orăsan

March 15, 2010


Ideal summary processing model

Source text(s)

Interpretation

Source representation

Transformation

Summary representation

Generation

Summary text
How humans produce summaries
How humans summarise documents

• Determining how humans summarise documents is a difficult


task because it requires interdisciplinary research
• Endres-Niggemeyer (1998) breaks the process into three stages:
document exploration, relevance assessment and summary
production
• these stages were determined through interviews with
professional summarisers
• professional summarisers use a top-down approach
• the expert summarisers do not attempt to understand the
source in great detail; instead they are trained to identify
snippets which contain important information
• very few automatic summarisation methods use an approach
similar to humans
Document exploration

• it’s the first step


• the source’s title, outline, layout and table of contents are
examined
• the genre of the texts is investigated because very often each
genre dictates a certain structure
• For example expository texts are expected to have a
problem-solution structure
• the abstractor’s knowledge about the source is represented as
a schema.
• schema = an abstractor’s prior knowledge of document types
and their information structure
Relevance assessment

• at this stage summarisers identify the theme and the thematic


structure
• theme = a structured mental representation of what the
document is about
• this structure allows identification of relations between text
chunks
• it is used to identify important information and to delete
irrelevant and unnecessary information
• the schema is populated with elements from the thematic
structure, producing an extended structure of the theme
Summary production

• the summary is produced from the expanded structure of the


theme
• in order to avoid producing a distorted summary, summarisers
rely mainly on copy/paste operations
• the chunks which are copied are reorganised to fit the new
structure
• standard sentence patterns are also used
• summary production is a long process which requires several
iterations
• checklists can be used
Single-document summarisation
methods
Single document summarisation

• Produces summaries from a single document


• There are two main approaches:
• automatic text extraction → produces extracts also referred to
as extract and rearrange
• automatic text abstraction → produces abstracts also referred
to as understand and generate
• Automatic text extraction is the most used method to
produce summaries
Automatic text extraction

• Extracts important sentences from the text using different


methods and produces an extract by displaying the important
sentences (usually in order of appearance)
• A large proportion of the sentences used in human-produced
summaries are extracted directly from the text or contain only
minor modifications
• Uses different statistical, surface-based and machine learning
techniques to determine which sentences are important
• First attempts made in the 50s
Automatic text extraction

• These methods are quite robust


• The main drawback of this method is that it overlooks the
way in which relationships between concepts in the text are
realised by the use of anaphoric links and other discourse
devices
• Extracting paragraphs can solve some of these problems
• Some methods involve excluding the unimportant sentences
instead of extracting the important sentences
Surface-based summarisation
methods
Term-based summarisation

• It was the first method used to produce summaries, proposed
by Luhn (1958)
• Relies on the assumption that important sentences have a
large number of important words
• The importance of a word is calculated using statistical
measures
• Even though this method is very simple it is still used in
combination with other methods
• A demo summariser which relies on term frequency can be
found at:
http://clg.wlv.ac.uk/projects/CAST/demos.php
How to compute the importance of a word

• Different methods can be used:


• Term frequency: how frequent is a word in the document
• TF*IDF: relies on how frequent is a word in a document and in
how many documents from a collection the word appears
TF ∗ IDF (w) = TF (w) ∗ log(Number of documents / Number of documents with w)
• other statistical measures, for examples see (Orăsan, 2009)
• Issues:
• stoplists should be used
• what should be counted: words, lemmas, truncation, stems
• how to select the document collection
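As an illustration, the TF*IDF weighting can be sketched in a few lines of Python (a minimal sketch, not part of the original slides; the whitespace tokenisation and the tiny stoplist are simplifying assumptions, and the document is assumed to be part of the collection so the document frequency is never zero):

```python
import math
from collections import Counter

STOPLIST = {"the", "a", "of", "is", "in"}  # tiny illustrative stoplist

def tf_idf_scores(document, collection):
    """Score each word in `document` by TF(w) * log(N / df(w))."""
    words = [w for w in document.lower().split() if w not in STOPLIST]
    tf = Counter(words)
    n_docs = len(collection)
    scores = {}
    for word, freq in tf.items():
        # df = number of documents in the collection containing the word
        df = sum(1 for doc in collection if word in doc.lower().split())
        scores[word] = freq * math.log(n_docs / df)
    return scores
```

Words that occur in many documents of the collection receive low scores, while words frequent in this document but rare elsewhere score highest.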
Term-based summarisation: the algorithm

(and can be used for other types of summarisers)


1 Score all the words in the source according to the selected
measure
2 Score all the sentences in the text by adding the scores of the
words from these sentences
3 Extract the sentences with top N scores
4 Present the extracted sentences in the original order
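The four steps above can be sketched as follows (a minimal illustration using raw term frequency as the word-scoring measure; the regular-expression sentence splitter is a simplifying assumption):

```python
import re
from collections import Counter

def term_based_summary(text, n=2):
    """Extract the n top-scoring sentences, presented in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip())
                 if s.strip()]
    # Step 1: score all the words in the source (here: raw frequency)
    word_scores = Counter(re.findall(r"\w+", text.lower()))
    # Step 2: score each sentence as the sum of its word scores
    sent_scores = [sum(word_scores[w] for w in re.findall(r"\w+", s.lower()))
                   for s in sentences]
    # Step 3: select the indices of the n top-scoring sentences
    top = sorted(range(len(sentences)),
                 key=lambda i: sent_scores[i], reverse=True)[:n]
    # Step 4: present the extracted sentences in the original order
    return " ".join(sentences[i] for i in sorted(top))
```

Swapping the `Counter` for TF*IDF scores or any other word-weighting measure leaves the rest of the algorithm unchanged.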
Position method

• It was noticed that in some genres important sentences appear


in predefined positions
• First used by Edmundson (1969)
• Varies very much from one genre to another:
• newswire: lead summary (the first few sentences of the text)
• scientific papers: the first/last sentences in the paragraph are
relevant for the topic of the paragraph (Baxendale, 1958)
• scientific papers: important information occurs in specific
sections of the document (introduction/conclusion)
• Lin and Hovy (1997) use a corpus to determine where these
important sentences occur
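For newswire, the position method reduces to the classic lead baseline, which simply takes the first few sentences (a minimal sketch, not part of the original slides; the sentence splitter is a simplifying assumption):

```python
import re

def lead_summary(text, n=2):
    """Return the first n sentences: the lead baseline for newswire."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip())
                 if s.strip()]
    return " ".join(sentences[:n])
```

Despite its simplicity, this baseline is notoriously hard to beat on news text.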
Title method

• words in titles and headings are positively relevant to


summarisation
• Edmundson (1969) noticed that this can lead to an increase in
performance of up to 8% if the scores of sentences which
include such words are increased
Cue words/indicating phrases

• Makes use of words or phrases classified as "positive" or


"negative" which may indicate the topicality and thus the
value of the sentence for an abstract
• positive: significant, purpose, in this paper, we show
• negative: Figure 1, believe, hardly, impossible, pronouns
• Paice (1981) proposes indicating phrases which are basically
patterns (e.g. [In] this paper/report/article we/I show)
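A minimal sketch of cue-based scoring (the cue lists echo the examples above, but the particular weights, +1 and −1 per match, are assumptions for illustration):

```python
# Illustrative cue lists; the entries come from the slide, the
# +1 / -1 weights are assumptions
POSITIVE_CUES = {"significant", "purpose", "in this paper", "we show"}
NEGATIVE_CUES = {"figure", "believe", "hardly", "impossible"}

def cue_score(sentence):
    """Add 1 for each positive cue present, subtract 1 for each negative cue."""
    s = sentence.lower()
    score = sum(1 for cue in POSITIVE_CUES if cue in s)
    score -= sum(1 for cue in NEGATIVE_CUES if cue in s)
    return score
```

In a full system this score would be one component of a sentence's overall weight rather than the sole criterion.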
Methods inspired from IR (Salton et al.,
1997)

• decomposes a document into a set of paragraphs


• computes the similarity between paragraphs, which represents
the strength of the link between two paragraphs
• similar paragraphs are considered those which have a similarity
above a threshold
• paragraphs can be extracted according to different strategies
(e.g. the number of links they have, select connected
paragraphs)
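The paragraph-linking idea can be sketched as follows (a minimal illustration, not the original implementation: bag-of-words cosine similarity, an assumed threshold of 0.3, and the "most links" extraction strategy):

```python
import math
from collections import Counter

def cosine(p, q):
    """Cosine similarity between two bag-of-words paragraph vectors."""
    a, b = Counter(p.lower().split()), Counter(q.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def link_graph(paragraphs, threshold=0.3):
    """Link every pair of paragraphs whose similarity exceeds the threshold."""
    links = {i: set() for i in range(len(paragraphs))}
    for i in range(len(paragraphs)):
        for j in range(i + 1, len(paragraphs)):
            if cosine(paragraphs[i], paragraphs[j]) > threshold:
                links[i].add(j)
                links[j].add(i)
    return links

def extract_by_links(paragraphs, links, n=1):
    """One possible strategy: pick the n paragraphs with the most links."""
    top = sorted(links, key=lambda i: len(links[i]), reverse=True)[:n]
    return [paragraphs[i] for i in sorted(top)]
```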
How to combine different methods

• Edmundson (1969) used a linear combination of features:

Weight(S) = α∗Title(S)+β∗Cue(S)+γ∗Keyword(S)+δ∗Position(S)

• the weights were adjusted manually


• the best system was cue + title + position
• it is better to use machine learning methods to combine the
results of different modules
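The linear combination itself is straightforward to sketch (feature scores are taken as given here; how each of the four features is computed is described in the preceding slides):

```python
def edmundson_weight(features, alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """Weight(S) = alpha*Title(S) + beta*Cue(S) + gamma*Keyword(S)
    + delta*Position(S). `features` maps the four feature names to
    their scores for one sentence."""
    return (alpha * features["title"]
            + beta * features["cue"]
            + gamma * features["keyword"]
            + delta * features["position"])
```

Setting a weight to zero switches a feature off; Edmundson's best system (cue + title + position) corresponds to `gamma=0`.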
Machine learning methods
Kupiec et al. (1995)

• used a Bayesian classifier to combine different features


• the features were:
• if the length of a sentence is above a threshold (true/false)
• contains cue words (true/false)
• position in the paragraph (initial/middle/final)
• contains keywords (true/false)
• contains capitalised words (true/false)
• the training and testing corpus consisted of 188 documents
with summaries
• humans identified sentences from the full text which are used
in the summary
• the best combination was position + cue + length
• Teufel and Moens (1997) used a similar method for sentence
extraction
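The feature combination can be sketched as a small naive Bayes scorer trained from (features, used-in-summary) pairs (a minimal illustration with add-one smoothing, not the exact formulation of Kupiec et al.):

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """`examples` is a list of (features, label) pairs, where `features`
    maps feature names to discrete values and `label` is True when the
    sentence was used in the summary."""
    label_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)  # (feature, label) -> value counts
    for features, label in examples:
        for feat, value in features.items():
            value_counts[(feat, label)][value] += 1
    return label_counts, value_counts

def score(features, label_counts, value_counts):
    """Log-probability (up to a constant) that the sentence is a summary
    sentence: log P(summary) + sum of log P(feature value | summary),
    with add-one smoothing for unseen values."""
    total = sum(label_counts.values())
    s = math.log(label_counts[True] / total)
    for feat, value in features.items():
        counts = value_counts[(feat, True)]
        s += math.log((counts[value] + 1) / (sum(counts.values()) + 2))
    return s
```

Sentences are then ranked by this score and the top ones extracted.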
Mani and Bloedorn (1998)

• learn rules about how to classify sentences


• features used:
• location features: location of sentence in paragraph, sentence
in special section, etc.
• thematic features: tf score, tf*idf score, number of section
heading words
• cohesion features: number of sentences with a synonym link to
sentence
• user focused features: number of terms relevant to the topic
• Example of rule learnt: IF sentence in conclusion & tf*idf high
& compression = 20% THEN summary sentence
Other ML methods

• Osborne (2002) used maximum entropy with features such as


word pairs, sentence length, sentence position, discourse
features (e.g., whether sentence follows the “Introduction”,
etc.)
• Knight and Marcu (2000) use a noisy channel model for
sentence compression
• Conroy et al. (2001) use HMMs
• Most of the methods these days try to use machine learning
Methods which exploit the discourse
structure
Methods which exploit discourse cohesion

• summarisation methods which use discourse structure usually


produce better quality summaries because they consider the
relations between the extracted chunks
• they rely on global discourse structure
• they are more difficult to implement because very often the
theories on which they are based are difficult and not fully
understood
• there are methods which use text cohesion and text coherence
• very often it is difficult to control the length of summaries
produced in this way
Methods which exploit text cohesion

• text cohesion involves relations between words, word senses,


referring expressions which determine how tightly connected
the text is
• (S13) "All we want is justice in our own country," aboriginal
activist Charles Perkins told Tuesday's rally. ... (S14) "We
don't want budget cuts - it's hard enough as it is," said
Perkins
• there are methods which exploit lexical chains and
coreferential chains
Lexical chains for text summarisation

• Telepattan system: Benbrahim and Ahmad (1995)


• two sentences are linked if the words are related by repetition,
synonymy, class/superclass, paraphrase
• sentences which have a number of links above a threshold
form a bond
• on the basis of the bonds a sentence has with the previous and
following sentences, it is possible to classify sentences as topic
start, topic middle or topic end
• sentences are extracted on the basis of open-continue-end
topic
• Barzilay and Elhadad (1997) implemented a more refined
version of the algorithm which includes ambiguity resolution
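The linking and bonding steps can be sketched as follows (a minimal illustration restricted to repetition links; synonymy, class/superclass and paraphrase links would require a thesaurus such as WordNet, and the bond threshold is an assumption):

```python
STOPLIST = frozenset({"the", "a", "is"})  # tiny illustrative stoplist

def repetition_links(sent_a, sent_b, stoplist=STOPLIST):
    """Count word-repetition links between two sentences."""
    a = {w for w in sent_a.lower().split() if w not in stoplist}
    b = {w for w in sent_b.lower().split() if w not in stoplist}
    return len(a & b)

def bonded(sent_a, sent_b, threshold=2):
    """Two sentences form a bond when their link count reaches the threshold."""
    return repetition_links(sent_a, sent_b) >= threshold
```

Counting the bonds each sentence has with its neighbours then drives the topic start / middle / end classification described above.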
Using coreferential chains for text
summarisation

• method presented in (Azzam, Humphreys, Gaizauskas, 1999)


• the underlying idea is that it is possible to capture the most
important topic of a document by using a principal
coreferential chain
• The LaSIE system was used to produce the coreferential
chains extended with a focus-based algorithm for resolution of
pronominal anaphora
Coreference chain selection

The summarisation module implements several selection criteria:


• Length of chain: prefers the chain which contains the most
entries, i.e. the most mentioned instance in the text
• Spread of the chain: the distance between the earliest and the
latest entry in each chain
• Start of Chain: the chain which starts in the title or in the
first paragraph of the text (this criterion could be very useful
for some genres such as newswire)
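The three selection criteria can be sketched as follows, representing a chain as a list of (position, mention) pairs (an assumed representation chosen for illustration, not the one used in the original system):

```python
def chain_length(chain):
    """Number of entries: prefers the most mentioned instance."""
    return len(chain)

def chain_spread(chain):
    """Distance between the earliest and the latest entry in the chain."""
    positions = [pos for pos, _ in chain]
    return max(positions) - min(positions)

def starts_early(chain, first_para_end):
    """True when the chain starts in the title or the first paragraph
    (positions up to `first_para_end` are taken to be title/first
    paragraph)."""
    return min(pos for pos, _ in chain) <= first_para_end
```

A summariser can then rank chains by any of these criteria (or a combination) and extract the sentences covered by the winning, "principal" chain.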
Summarisation methods which use
rhetorical structure of texts
• it is based on the Rhetorical Structure Theory (RST) (Mann
and Thompson, 1988)
• according to this theory text is organised in non-overlapping
spans which are linked by rhetorical relations and can be
organised in a tree structure
• there are two types of spans: nuclei and satellites
• a nucleus can be understood without satellites, but not the
other way around
• satellites can be removed in order to obtain a summary
• the most difficult part is to build the rhetorical structure of a
text
• Ono, Sumita and Miike (1994), Marcu (1997) and
Corston-Oliver (1998) present summarisation methods which
use the rhetorical structure of the text
[Figure from (Marcu, 2000)]
Summarisation using argumentative
zoning

• Teufel and Moens (2002) exploit the structure of scientific


documents in order to produce summaries
• the summarisation process is split into two parts
1 identification of important sentences using an approach similar
to the one proposed by Kupiec, Pederson, and Chen (1995)
2 recognition of the rhetorical roles of the extracted sentences

• for rhetorical roles the following classes are used: Aim,


Textual, Own, Background, Contrast, Basis, Other
Knowledge-rich methods
Knowledge rich methods

• Produce abstracts
• Most of them try to "understand" (at least partially) a text
and to make inferences before generating the summary
• The systems do not really understand the contents of the
documents, but they use different techniques to extract the
meaning
• Since this process involves a huge amount of world knowledge,
the application is restricted to a specific domain only
Knowledge-rich methods

• The abstracts obtained in this way are better in terms of


cohesion and coherence
• The abstracts produced in this way tend to be more
informative
• This method is also known as the understand and generate
approach
• This method extracts the information from the text and holds
it in some intermediate form
• The representation is then used as the input for a natural
language generator to produce an abstract
FRUMP (deJong, 1982)

• uses sketchy scripts to understand a situation


• these scripts only keep the information relevant to the event
and discard the rest
• 50 scripts were manually created
• words from the source activate scripts and heuristics are used
to decide which script is used in case more than one script is
activated
Example of script used by FRUMP

1 The demonstrators arrive at the demonstration location


2 The demonstrators march
3 The police arrive on the scene
4 The demonstrators communicate with the target of the
demonstration
5 The demonstrators attack the target of the demonstration
6 The demonstrators attack the police
7 The police attack the demonstrators
Example of script used by FRUMP

1 The demonstrators arrive at the demonstration location


2 The demonstrators march
3 The police arrive on the scene
4 The demonstrators communicate with the target of the
demonstration
5 The demonstrators attack the target of the demonstration
6 The demonstrators attack the police
7 The police attack the demonstrators
8 The police arrest the demonstrators
FRUMP

• the evaluation of the system revealed that it could not process


a large number of stories because it did not have the
appropriate scripts
• the system is very difficult to port to a different domain
• sometimes it misinterprets stories: Vatican City. The death of
the Pope shakes the world. He passed away → Earthquake in
the Vatican. One dead.
• the advantage of this method is that the output can be in any
language
Concept-based abstracting (Paice and
Jones, 1993)

• Also referred to as extract and generate


• Summaries in the field of agriculture
• Relies on predefined text patterns such as this paper studies
the effect of [AGENT] on the [HLP] of [SPECIES] → This
paper studies the effect of G. pallida on the yield of potato.
• The summarisation process involves instantiation of patterns
with concepts from the source
• Each pattern has a weight which is used to decide whether the
generated sentence is included in the output
• This method is well suited to producing informative summaries
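Pattern instantiation can be sketched as simple slot filling (a minimal illustration; identifying the concept fillers in the source text is the hard part of the method and is not shown here):

```python
def instantiate(pattern, fillers):
    """Fill the slots of a predefined text pattern with concepts
    extracted from the source. Slots are written as [SLOT_NAME]."""
    text = pattern
    for slot, value in fillers.items():
        text = text.replace("[" + slot + "]", value)
    return text
```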
Other knowledge-rich methods

• Rumelhart (1975) developed a system to understand and


summarise simple stories, using a grammar which generated
semantic interpretations of the story on the basis of
hand-coded rules.
• Alterman (1986) used local understanding
• Fum, Guida, and Tasso (1985) try to replicate the human
summarisation process
• Rau, Jacobs, and Zernik (1989) integrate a bottom-up
linguistic analyser and a top-down conceptual interpretation
Multi-document summarisation
methods
Multi-document summarisation

• multi-document summarisation is the extension of


single-document summarisation to collections of related
documents
• methods from single-document summarisation can very rarely
be used directly
• it is not possible to produce single-document summaries from
every single document in the collection and then to
concatenate them
• normally they are user-focused summaries
Issues with multi-document summaries

• the collections to be summarised can vary a lot in size, so


different methods might need to be used
• a much higher compression rate is needed
• redundancy
• ordering of sentences (usually the date of publication is used)
• similarities and differences between different texts need to be
considered
• contradiction between information
• fragmentary information
IR inspired methods

• the method of Salton et al. (1997) can be adapted to


multi-document summarisation
• instead of using paragraphs from one document, paragraphs
from all the documents are used
• the extraction strategies are kept
Maximal Marginal Relevance

• proposed by (Goldstein et al., 2000)


• addresses the redundancy among multiple documents
• allows a balance between the diversity of the information and
relevance to a user query
• MMR(Q, R, S) =
argmax_{Di∈R\S} [λ · Sim1(Di, Q) − (1 − λ) · max_{Dj∈S} Sim2(Di, Dj)]
where Q is the user query, R the set of candidate sentences and
S the set of already selected sentences
• can be used also for single document summarisation
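Greedy MMR selection can be sketched as follows (a minimal illustration; Sim1 and Sim2 are taken to be the same user-supplied similarity function, and λ = 0.7 is an assumed default):

```python
def mmr_select(candidates, query, sim, lam=0.7, k=2):
    """Greedy MMR: repeatedly pick the candidate that balances relevance
    to the query against similarity to the already selected sentences."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda d: lam * sim(d, query)
                          - (1 - lam) * max((sim(d, s) for s in selected),
                                            default=0.0),
        )
        selected.append(best)
        remaining.remove(best)
    return selected
```

With λ = 1 this reduces to pure relevance ranking; lowering λ penalises sentences that repeat already selected content.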
Cohesion text maps

• use knowledge based on lexical cohesion (Mani and Bloedorn,


1999)
• good to compare pairs of documents and tell what’s common,
what’s different
• builds a graph from the texts: the nodes of the graph are the
words of the text. Arcs represent adjacency, grammatical,
co-reference, and lexical similarity-based relations.
• sentences are scored using tf.idf metric.
• user query is used to traverse the graph (a spread activation is
used)
• to minimize redundancy in extracts, extraction can be greedy
to cover as many different terms as possible
Theme fusion (Barzilay et al., 1999)

• used to avoid redundancy in multi-document summaries


• Theme = collection of similar sentences drawn from one or
more related documents
• Computes theme intersection: phrases which are common to
all sentences in a theme
• paraphrasing rules are used (active vs. passive, different orders
of adjuncts, classifier vs. apposition, ignoring certain
premodifiers in NPs, synonymy)
• generation is used to put the theme intersection together
Centroid based summarisation

• a centroid = a set of words that are statistically important to


a cluster of documents
• each document is represented as a weighted vector of TF*IDF
scores
• each sentence receives a score equal to the sum of the
individual centroid values
• sentence salience (Boguraev and Kennedy, 1999)
• centroid score (Radev, Jing, and Budzikowska, 2000)
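Centroid-based scoring can be sketched as follows (a minimal illustration using raw frequency instead of the TF*IDF weights of the original work, with an assumed centroid size):

```python
from collections import Counter

def centroid(docs, top_n=3):
    """A centroid: the words with the highest summed weight across the
    document cluster (plain frequency here; TF*IDF in the original work)."""
    totals = Counter()
    for doc in docs:
        totals.update(doc.lower().split())
    return {w for w, _ in totals.most_common(top_n)}

def centroid_score(sentence, cent):
    """Sentence score = sum of the individual centroid values
    (each centroid word counts 1 in this simplified version)."""
    return sum(1 for w in sentence.lower().split() if w in cent)
```

Sentences from the cluster are then ranked by this score and the top ones selected for the summary.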
Cross-document Structure Theory

• Cross-document Structure Theory (CST) provides a theoretical


model for issues that arise when trying to summarise multiple
texts (Radev, Otterbacher, and Zhang, 2004).
• it describes relationships between two or more sentences from
different source documents related to the same topic.
• similar to RST but at cross-document level
• 18 domain-independent relations such as identity, equivalence,
subsumption, contradiction, overlap, fulfilment and
elaboration between texts spans
• can be used to extract sentences and avoid redundancy
Automatic summarisation and the
Internet
• New research topics have emerged at the confluence of
summarisation with other disciplines (e.g. question answering
and opinion mining)
• Many of these fields appeared as a result of the expansion of
the Internet
• The Internet is probably the largest source of information, but
it is largely unstructured and heterogeneous
• Multi-document summarisation is more necessary than ever
• Web content mining = extraction of useful information from
the Web
Challenges posed by the Web

• Huge amount of information


• Wide and diverse
• Information of all types e.g. structured data, texts, videos, etc.
• Semi-structured
• Linked
• Redundant
• Noisy
Summarisation of news on the Web

• Newsblaster (McKeown et al., 2002) summarises news from


the Web (http://newsblaster.cs.columbia.edu/)
• it is mainly statistical, but with symbolic elements
• it crawls the Web to identify stories (e.g. filters out ads),
clusters them on specific topics and produces a
multidocument summary
• theme sentences are analysed and fused together to produce
the summary
• summaries also contain images using high precision rules
• similar services: newsinessence, Google News, News Explorer
• tracking and updating are important features of such systems
Email summarisation

• email summarisation is more difficult because emails have a


dialogue structure
• Muresan et al. (2001) use machine learning to learn rules for
salient NP extraction
• Nenkova and Bagga (2003) developed a set of rules to
extract important sentences
• Newman and Blitzer (2003) use clustering to group messages
together and then extract a summary from each cluster
• Rambow et al. (2004) automatically learn rules to extract
sentences from emails
• these methods do not use many email-specific features, but in
general the subject of the first email is used as a query
Blog summarisation

• Zhou et al. (2006) see a blog entry as a summary of a news


story with personal opinions added. They produce a
summary by deleting sentences not related to the story
• Hu et al. (2007) use a blog's comments to identify words that
can be used to extract sentences from blogs
• Conrad et al. (2009) developed query-based opinion
summarisation for legal blog entries based on the TAC 2008
system
Opinion mining and summarisation

• find what reviewers liked and disliked about a product


• usually there is a large number of reviews, so an opinion
summary should be produced
• visualisation of the result is important and it may not be text
• analogous to, but different from, multi-document summarisation
Producing the opinion summary

A three stage process:


1 Extract object features that have been commented on in each
review.
2 Classify each opinion
3 Group feature synonyms and produce the summary (pros vs.
cons, detailed review, graphical representation)
Opinion summaries

• Mao and Lebanon (2007) suggest producing summaries that


track the sentiment flow within a document, i.e. how
sentiment orientation changes from one sentence to the next
• Pang and Lee (2008) suggest creating "subjectivity extracts"
• sometimes graph-based output is much more appropriate
or useful than text-based output
• in traditional summarisation redundant information is often
discarded; in opinion summarisation one wants to track and
report the degree of redundancy, since in the opinion-oriented
setting the user is typically interested in the (relative) number
of times a given sentiment is expressed in the corpus
• there is much more contradictory information
Opinion summarisation at TAC

• the Text Analysis Conference 2008 (TAC) included an


opinion summarisation task on blogs
• http://www.nist.gov/tac/
• generate summaries of opinions about targets
• What features do people dislike about Vista?
• a question answering system is used to extract snippets that
are passed to the summariser
QA and Summarisation at INEX2009

• the QA track at INEX 2009 requires participants to answer


factual and complex questions
• the complex questions require aggregating the answer
from several documents
• What are the main applications of bayesian networks in the
field of bioinformatics?
• for complex questions evaluators will mark syntactic
incoherence, unresolved anaphora, redundancy and not
answering the question
• Wikipedia will be used as document collection
Conclusions

• research in automatic summarisation is still very active, but


in many cases it merges with other fields
• evaluation is still a problem in summarisation
• the current state-of-the-art is still sentence extraction
• more language understanding needs to be added to the
systems
References
Alterman, Richard. 1986. Summarisation in small. In N. Sharkey, editor, Advances in
cognitive science. Chichester, England, Ellis Horwood.
Baxendale, Phyllis B. 1958. Man-made index for technical literature - an experiment.
I.B.M. Journal of Research and Development, 2(4):354 – 361.
Boguraev, Branimir and Christopher Kennedy. 1999. Salience-based content
characterisation of text documents. In Inderjeet Mani and Mark T. Maybury, editors,
Advances in Automated Text Summarization. The MIT Press, pages 99 – 110.
Conroy, John M., Judith D. Schlesinger, Dianne P. O’Leary, and Mary E. Okurowski.
2001. Using HMM and logistic regression to generate extract summaries for DUC. In
Proceedings of the 1st Document Understanding Conference, New Orleans, Louisiana
USA, September 13-14.
DeJong, G. 1982. An overview of the FRUMP system. In W. G. Lehnert and M. H.
Ringle, editors, Strategies for natural language processing. Hillsdale, NJ: Lawrence
Erlbaum, pages 149 – 176.
Edmundson, H. P. 1969. New methods in automatic extracting. Journal of the
Association for Computing Machinery, 16(2):264 – 285, April.
Endres-Niggemeyer, Brigitte. 1998. Summarizing information. Springer.
Fum, Danilo, Giovanni Guida, and Carlo Tasso. 1985. Evaluating importance: a step
towards text summarisation. In Proceedings of the 9th International Joint Conference
on Artificial Intelligence, pages 840 – 844, Los Altos CA, August.
Goldstein, Jade, Vibhu O. Mittal, Jamie Carbonell, and Mark Kantrowitz. 2000.
Multi-Document Summarization by Sentence Extraction. In Udo Hahn, Chin-Yew Lin,
Inderjeet Mani, and Dragomir R. Radev, editors, Proceedings of the Workshop on
Automatic Summarization at the 6th Applied Natural Language Processing
Conference and the 1st Conference of the North American Chapter of the Association
for Computational Linguistics, Seattle, WA, April.
Knight, Kevin and Daniel Marcu. 2000. Statistics-based summarization – step one:
Sentence compression. In Proceedings of the 17th National Conference on Artificial
Intelligence (AAAI), pages 703 – 710, Austin, Texas, USA, July 30 – August 3.
Kupiec, Julian, Jan Pederson, and Francine Chen. 1995. A trainable document
summarizer. In Proceedings of the 18th ACM/SIGIR Annual Conference on Research
and Development in Information Retrieval, pages 68 – 73, Seattle, July 09 – 13.
Lin, Chin-Yew and Eduard Hovy. 1997. Identifying topic by position. In Proceedings
of the 5th Conference on Applied Natural Language Processing, pages 283 – 290,
Washington, DC, March 31 – April 3.
Luhn, H. P. 1958. The automatic creation of literature abstracts. IBM Journal of
research and development, 2(2):159 – 165.
Mani, Inderjeet and Eric Bloedorn. 1998. Machine learning of generic and
user-focused summarization. In Proceedings of the Fifteenth National Conference on
Artificial Intelligence, pages 821 – 826, Madison, Wisconsin. MIT Press.
Mani, Inderjeet and Eric Bloedorn. 1999. Summarizing similarities and differences
among related documents. In Inderjeet Mani and Mark T. Maybury, editors, Advances
in automatic text summarization. The MIT Press, chapter 23, pages 357 – 379.
Marcu, Daniel. 2000. The theory and practice of discourse parsing and summarisation.
The MIT Press.
Orăsan, Constantin. 2009. Comparative evaluation of term-weighting methods for
automatic summarization. Journal of Quantitative Linguistics, 16(1):67 – 95.
Osborne, Miles. 2002. Using maximum entropy for sentence extraction. In
Proceedings of ACL 2002 Workshop on Automatic Summarization, pages 1 – 8,
Philadelphia, Pennsylvania, July.
Paice, Chris D. 1981. The automatic generation of literature abstracts: an approach
based on the identification of self-indicating phrases. In R. N. Oddy, C. J. Rijsbergen,
and P. W. Williams, editors, Information Retrieval Research. London: Butterworths,
Kent, UK, pages 172 – 191.
Radev, Dragomir, Jahna Otterbacher, and Zhu Zhang. 2004. CSTBank: A Corpus for
the Study of Cross-document Structural Relationship. In Proceedings of Language
Resources and Evaluation Conference (LREC 2004), Lisbon, Portugal.
Radev, Dragomir R., Hongyan Jing, and Malgorzata Budzikowska. 2000.
Centroid-based summarization of multiple documents: sentence extraction,
utility-based evaluation and user studies. In Proceedings of the NAACL/ANLP
Workshop on Automatic Summarization, pages 21 – 29, Seattle, WA, USA, 30 April.
Rau, Lisa F., Paul S. Jacobs, and Uri Zernik. 1989. Information extraction and text
summarisation using linguistic knowledge acquisition. Information Processing &
Management, 25(4):419 – 428.
Rumelhart, David E. 1975. Notes on a schema for stories. In D. G. Bobrow and A. Collins,
editors, Representation and Understanding: Studies in Cognitive Science. Academic
Press Inc, pages 211 – 236.
Salton, Gerard, Amit Singhal, Mandar Mitra, and Chris Buckley. 1997. Automatic
text structuring and summarization. Information Processing and Management,
33(3):193 – 207.
Teufel, Simone and Marc Moens. 1997. Sentence extraction as a classification task.
In Proceedings of the ACL’97/EACL’97 Workshop on Intelligent Scalable Text
Summarization, pages 58 – 59, Madrid, Spain, July 11.
Teufel, Simone and Marc Moens. 2002. Summarizing scientific articles: Experiments
with relevance and rhetorical status. Computational Linguistics, 28(4):409 – 445.
