

JISC Final Report


MetaTools Final Report

Project
Project Acronym: MetaTools
Project Title: MetaTools - Investigating Metadata Generation Tools
Project ID:
Start Date: 01/04/2007
End Date: 31/10/2008
Lead Institution: CeRch, King's College London
Project Director: (None)
Project Manager & contact details: Dr. Malcolm Polfreman, CeRch, King's College London, 26-29 Drury Lane, London, WC2B 5RL; Tel: 0207 848 1985; Fax: 0207 848 1989; Email: malcolm.polfreman@kcl.ac.uk
Partner Institutions: (None)
Project Web URL: http://www.ahds.ac.uk/about/projects/metatools/index.htm
Programme Name (and number): Repositories and Preservation, Call Area I Tools and Innovation Projects (Strand B)
Programme Manager: Amber Thomas

Document
Document Title: Final Report
Reporting Period: n/a
Author(s) & project role: Dr. Malcolm Polfreman, Project Manager
Date: 30/10/2008
Filename: MetaTools_final_report.doc
URL: n/a
Access: Project and JISC internal / General dissemination

Document History
Version | Date | Comments
1a | 30/10/2008 |


MetaTools - Investigating Metadata Generation Tools Final report


Malcolm Polfreman and Shrija Rajbhandari (CeRch, King's College London)
Contact (until 31 October 2008): Malcolm Polfreman (malcolm.polfreman@kcl.ac.uk)
Contact (after 31 October 2008): Steve Grace (stephen.grace@kcl.ac.uk)
29 October 2008


Table of Contents
Table of Contents
Acknowledgements
Executive Summary
Background
Aims and Objectives
Stage 1: developing a methodology for evaluating metadata generation tools
Methodology
Implementation
Outputs and Results
Outcomes
Stage 2: Comparing the quality of currently available metadata generation tools
Methodology
Implementation
Outputs and Results
Outcomes
Stage 3: developing a methodology for evaluating metadata generation tools
Methodology
Implementation
Outputs and Results
Outcomes
Conclusions
Implications
Recommendations
References

Acknowledgements
MetaTools - Investigating Metadata Generation Tools, to give the project its full title, was funded by JISC within the Digital Repositories and Preservation Programme, Tools and Innovation (Strand B). We acknowledge the help and shared experiences given by Dr Henk Muller, University of Bristol; Emma Tonkin, UKOLN; and the iVia team at the University of California, Riverside, particularly Dr Johannes Ruscheinski. We acknowledge the use of data from Intute and the White Rose Consortium.

Executive Summary
Automatic metadata generation has sometimes been posited as a solution to the metadata bottleneck that repositories and portals are facing as they struggle to provide resource discovery metadata for a rapidly growing number of new digital resources. Unfortunately there is no registry or trusted body of documentation that rates the quality of metadata generation tools or identifies the most effective tool(s) for any given task. The aim of the first stage of the project was to remedy this situation by developing a framework for evaluating tools used for the purpose of generating Dublin Core metadata. A range of intrinsic and extrinsic metrics (standard tests or measurements) that capture the attributes of good metadata from various perspectives were identified from the research literature and evaluated in a report.


A test program was then implemented using metrics from the framework. It evaluated the quality of metadata generated from 1) Web pages (html) and 2) scholarly works (pdf) by four of the more widely-known metadata generation tools: Data Fountains, DC-dot, SamgI, and the Yahoo! Term Extractor. The intention was also to test PaperBase, a prototype for generating metadata for scholarly works, but its developers ultimately preferred to conduct tests in-house. Some interesting comparisons with their results were nonetheless possible and were included in the stage 2 report. It was found that the output from Data Fountains was generally superior to that of the other tools that the project tested, but the output from all of the tools was considered disappointing and markedly inferior to the quality of metadata that Tonkin and Muller report PaperBase has extracted from scholarly works. Overall, the prospects for generating high-quality metadata for scholarly works appear to be brighter because of their more predictable layout. It is suggested that JISC should particularly encourage research into auto-generation methods that exploit the structural and syntactic features of scholarly works in pdf format, as exemplified by PaperBase, and strongly consider funding the development of tools in this direction.

In the third stage of the project, SOAP and RESTful Web Service interfaces were developed for three metadata generation tools: Data Fountains, SamgI and KEA. This had a dual purpose. Firstly, the creation of an optimal metadata record usually requires the merging of output from several tools, each of which, until now, had to be invoked separately because of the ad hoc nature of their interfaces. As Web services, they will be available for use in a network such as the Web with well-defined interfaces that are implementation-independent. These services will be exposed for use by clients without the clients having to be concerned with how the service will execute their requests. Repositories should be able to plug them into their own cataloguing environments and experiment with automatic metadata generation under more real-life circumstances than hitherto. Secondly, and more importantly (in view of the relatively poor quality of current tools), they enabled the project to experiment with the use of a high-level ontology for describing metadata generation tools. The value of an ontology being used in this way should be felt as higher quality tools (such as PaperBase?) emerge.

The high-level ontology is part of a MetaTools system architecture that consists of various components to describe, register and discover services. Low-level definitions within a service ontology are mapped to higher-level, human-understandable semantic descriptions contained within a MetaTools ontology. A user interface enables service providers to register their service in a public registry. This registry is used by consumers to find services that match certain criteria. If the registry has such a service, it provides the consumer with a contract and an endpoint address for that service. The terms in the MetaTools ontology can, in turn, be part of a higher-level ontology that describes the preservation domain as a whole. The team believes that an ontology-aided approach to service discovery, as employed by the MetaTools project, is a practical solution. A stage 3 technical report was also written.

Background
Resource discovery metadata is a crucial component of the lifecycle of digital resources. Standardised metadata is a powerful tool that enables the discovery and selection of relevant digital resources quickly and easily. Poor quality or non-existent metadata, on the other hand, is equally effective at rendering resources unusable, since without it a resource is essentially invisible within a repository or archive and thus remains undiscovered and inaccessible. Unfortunately, with digital resources being produced in ever-increasing quantities, finding the time and resources necessary to ensure that metadata of appropriate quality is created is becoming a more and more difficult task. Some repositories are hoping that automated metadata generation will provide a solution. Indeed, without automation it may be impractical to describe resources at item level or at any finer level of granularity than the collection as a whole. Automated metadata generation is still in its infancy, but several approaches have emerged, including metatag harvesting, content extraction, automatic indexing or classification, text and data mining, social tagging, and the generation of metadata from associated contextual information or related resources.


Some technical metadata captured by tools developed by the preservation community can also contribute to resource discovery: e.g. JHOVE [1], DROID [2], and the NLNZ Metadata Extraction Tool [3].

The ideal scenario would be to auto-generate high-quality resource discovery metadata that requires no human intervention at all. In the short term it is probably more realistic to look towards hybrid solutions, such as using auto-generation to pre-populate or seed a cataloguing interface in readiness for either the depositor of the resource or a repository administrator to subsequently amend manually. Ochoa and Duval suggest a reversal of the process, in that they ask whether auto-generation might be the means to improve pre-existing expert-assigned metadata [4]. According to this scenario, automatic evaluation software might identify any low quality metadata records within a repository and then, for any records that lack them, trigger the auto-generation of summaries, for instance, from the textual content of the resource. The first part of this vision, at least, is perhaps becoming reality in that two metadata analysis tools have recently been made available to repositories in New Zealand: the Metadata Analysis Tool (MAT) developed at the University of Waikato and the Kiwi Research Information Service (KRIS) from the National Library of New Zealand [5]. They harvest metadata using OAI-PMH, analyse the harvested metadata, and provide tools and visualisations that help repository administrators understand their metadata and identify errors.

Nevertheless, most of the resource discovery metadata found within the JISC IE is still created and corrected manually, either by authors, depositors and/or repository administrators. One of the biggest obstacles to portals and repositories using metadata generation tools, or at least experimenting with them, is the absence of a registry or trusted body of documents that rates their quality or identifies the most effective tool(s) for any given task. Several recent studies [6][7][8][9] have emerged on the subject of metadata generation but none has involved a program of testing. Tools have rarely been tested specifically in relation to JISC resources and the studies have almost always had a narrow focus and small sample size [10]. Most commonly, any testing has been conducted in-house during a tool's development phase, in which case partiality may be a concern even when the test has been conducted professionally [11]. In the absence of accepted benchmarks and open source software for automating the testing process, even this kind of testing has been sporadic and unsatisfactory. Of the extraction tools that are currently available to repositories, only Data Fountains has even a rudimentary metadata evaluation module. Nor have tools generally been tested within real-life workflows. Little is known about how long it takes to manually upgrade/edit the partial output from metadata generation tools and how this compares with the time required for a cataloguer to describe the resource manually from scratch. Properly independent and widespread testing of metadata generation tools is a prerequisite for the development of a web services solution to the problem of metadata generation.

The second major obstacle to repositories experimenting with metadata generation is the difficulty of invoking and using the available tools.
Metadata generation services are highly specialised, in terms of both the types of input on which they are most effective and the types of output produced, and this situation is likely to continue.

[1] http://hul.harvard.edu/jhove/
[2] http://droid.sourceforge.net/
[3] www.natlib.govt.nz/about-us/current-initiatives/metadata-extraction-tool
[4] Xavier Ochoa, Erik Duval: Towards Automatic Evaluation of Metadata Quality in Digital Repositories, p. 15. See also Hillmann, D., Dushay, N., & Phipps, J. (2004). Improving metadata quality: augmentation and recombination. Proceedings of the 2004 International Conference on Dublin Core and Metadata Applications. ISBN 7543924129.
[5] David M. Nichols et al. Metadata Tools for Institutional Repositories.
[6] Automatic Metadata Generation Applications (AMeGA) Project final report, http://www.loc.gov/catdir/bibcontrol/lc_amega_final_report.pdf
[7] JISC's unpublished Metadata Generation for Resource Discovery study.
[8] Baird, K. & JORUM Team. (2006). Automated metadata: a review of existing and potential metadata automation within JORUM and an overview of other automation systems. JORUM, p. 23. Available at http://www.jorum.ac.uk/docs/pdf/automated_metadata_report.pdf
[9] Preliminary framework for designing prototype tools for assisting with preservation quality metadata extraction for ingest into digital repository, published by DELOS.
[10] For instance, Greenberg's comparison of DC-dot and Klarity, like Irvin's research, covered only NIEHS environmental health web pages and involved a tool, Klarity, that no longer exists.
[11] E.g. Infomine's PhraseRate team compared the keyphrase output generated by the Data Fountains tool for 101 websites with that of Kea, DC-dot, and Turney's Extractor.


They have generally been developed for specific institutions or in response to particular commercial opportunities and, consequently, handle a narrow range of source formats and/or generate a restricted element set. Keyword extractors, for instance, are generally effective only within narrow subject domains or for documents of a predictable layout or genre. Metadata generation functionality is spread thinly across a dozen or so tools, each of which offers, at best, only a partial solution and which must be called up separately because of a lack of commonality amongst their various ad hoc interfaces.

The available APIs, when they exist at all, vary widely in format. In most cases no API is provided; rather, the interface is via a web-based form or a user-invoked stand-alone client. Currently, the variety of APIs means that it is not possible to call these tools automatically in a flexible manner, as differently formatted calls need to be explicitly programmed for each interface. In the case of screen-based interfaces, it is either necessary to copy information manually, or to use screen-scraping techniques, which are not very reliable. A trusted body of documentation available to repository managers that rates the quality of metadata generation software would be beneficial, particularly if it provides a mechanism for dynamically discovering services for generating resource discovery metadata appropriate for particular digital content, and automatically invoking them as part of a workflow, e.g. on ingest into a repository.

Aims and Objectives


The project therefore had the following aims, as listed in the project proposal:
1) Develop a methodology for evaluating metadata generation tools.
2) Compare the quality of currently available metadata generation tools.
3) Develop, test and disseminate prototype web services that integrate metadata generation tools.

Stage 1: developing a methodology for evaluating metadata generation tools

Methodology


The project began with a review of the research literature to identify criteria for evaluating metadata generation tools. It was expected that the evaluation criteria would fall under five broad headings: 1) Functionality; 2) Quality of output; 3) Availability; 4) Cost; 5) Configurability and ease of use. It was found that the AMeGA report had already addressed some of these aspects within its Recommended Functionalities for Automatic Metadata Generation Applications [12]. Rather than replicate its work, it was decided to remedy an important deficiency: the Recommended Functionalities said little about how to measure the actual quality of output (i.e. the quality of auto-generated metadata) other than that a range of criteria should be used. The project would therefore focus on 2) Quality of output.

The project surveyed the research literature to identify a range of intrinsic and extrinsic metrics (standard tests or measurements) that capture the attributes of good metadata from various perspectives. An analysis of the Dublin Core metadata standard identified six broad groups of DC metadata elements for which metrics and tests were required.

Element type | Example element
Controlled field with hierarchy of terms | dc:subject (e.g. where the LCSH encoding scheme is used)
Controlled field without hierarchy of terms | dc:type
Uncontrolled terms (open-ended and multiple) | dc:subject (if no encoding scheme is used)
Free text abstracting fields | dc:description
Free text copying fields | dc:title
Categorical fields (i.e. only one value may be assigned) | dc:identifier

[12] Greenberg, J., Spurgin, K., & Crystal, A. (2006). Final Report for the AMeGA (Automatic Metadata Generation Applications) Project. Retrieved October 23, 2008, from: http://www.loc.gov/catdir/bibcontrol/lc_amega_final_report.pdf. The Recommended Functionalities lists things that software developers should take into consideration when writing metadata generation software, such as the format(s) that are supported, the extraction method(s), elements generated, output format(s), encodings and bindings, standard vocabularies, and configurability.

Two broad categories of metrics were considered:

Intrinsic metrics, which measure the quality of auto-generated metadata in and of itself (i.e. in relation to some metadata gold standard, which, in practice, normally means in comparison with manually-created metadata for the same resource): e.g. Completeness; Exact Match Accuracy (EMA); Precision; Recall; F-score (F-measure); False positives; Error rate; Compression Ratio; Summary length; Summary Coherence; Retention Ratio (Summary Informativeness); Content Word Precision and Content Word Recall; ROUGE; BLEU; Content Similarity; Sentence Rank; the Utility Method; Factoid Analysis; Likert Scale.

Extrinsic metrics, which measure the efficiency and acceptability of the auto-generated metadata in relation to some practical task (such as relevance assessment or reading comprehension): e.g. the Expert Game; the Shannon Game; the Question Game; the Classification Game; Keyword Association; Query-based methods; Learning Accuracy; Cost-based evaluation (Cost-based error); Time saved.

Firstly, each metric was formally defined and the best possible (i.e. target) and worst possible values identified. Secondly, a metric is only an approximate representation of an underlying reality, so consideration was given to what the metric really measures and how reliably. An important aspect of this was to understand how metrics can be used in combination to compensate for their various limitations. Thirdly, implementation issues were considered, particularly the time, cost and ease of using each metric. This involved asking whether the metric can be automated (and, if so, whether open source software is available). Metrics were included within the framework if, for any given group of elements, they produced meaningful results and were practical to implement.

Implementation
It was relatively easy to identify evaluation metrics but harder to quantify their value. It was clear that a test framework would have to be flexible. There is a trade-off between quality and cost, not only in terms of the quality and cost of the metadata but also of the process for evaluating it. The solution was to present a menu of tests that repositories, and the project itself, would be able to mix and match according to the time, resources and evaluation software at their disposal.

Outputs and Results


The table below lists the various evaluation metrics, the metadata elements for which they are appropriate, and whether they are capable of being automated or are intrinsically manual. Some, such as Exact Match Accuracy, Precision and Recall, are likely to be within the compass of individual repositories. Others, such as BLEU and ROUGE, are more suitable for dedicated research projects within the field of information retrieval. Repositories may wish to mix and match tests from the menu below according to the time, resources and evaluation software at their disposal. Metrics marked with an asterisk are capable, in principle, of being automated, although the project found that there is virtually no open source metadata evaluation software currently available.


Element group | Elements | Proposed metrics
Controlled field with choice of terms in a hierarchy | Subject terms (LCSH/AAT) | Cost-based evaluation; False positives*; Error rate*; Balanced Distance Metric (errors weighted); Augmented precision (errors weighted); Augmented recall (errors weighted); Augmented f-measure (errors weighted); Learning Accuracy
Controlled field with choice of terms but no hierarchy | Subject terms (iVia subject terms); Media type; Language | Cost-based evaluation; Learning Accuracy; Exact match (if very few instances)*; Likert scale
Uncontrolled terms (open-ended and multiple) | Keywords | Subfield precision*; Subfield recall*; F-score*
Free text abstracting fields | Description | Extrinsic evaluation: Shannon Game; Question Game; Classification game; Keyword association. Intrinsic evaluation: Summary coherence; Factoid analysis; Likert scale; Content-word/sentence precision*; Content-word/sentence recall*; Stemmed content-word precision*; Stemmed content-word recall*; F-score*; ROUGE (N-grams)*; BLEU (N-grams)*; Compression Ratio*; Summary length*; Retention Ratio; Sentence rank; Utility method; Vocabulary test (content similarity)*
Free text copying fields | Title | Content-word precision*; Content-word recall*; Stemmed content-word precision*; Stemmed content-word recall*; F-score*; Exact match*
Categorical elements | Identifier; Format | Exact match*; QConf-Textual*
All elements/overview | | Time saved; Completeness*

A more detailed analysis, including formal definitions for the metrics, is found in the Stage 1 Test Framework report.


Outcomes
The outputs from this section of the project were intended primarily to inform the test phase of the project. On reflection, the objective of providing a widely applicable test framework was over-ambitious in view of the differing requirements of repositories, developers, etc. The final output took more the form of a menu of suggested metrics and methods that repositories could choose from according to their circumstances.

Stage 2: Comparing the quality of currently available metadata generation tools.

Methodology
The second stage of the project used parts of the test framework from stage one to compare the quality of currently available metadata generation tools. A Web survey identified thirty (mostly poor-quality) metadata generation tools, from which a short-list of five was selected using criteria from the Recommended Functionalities checklist. Key factors in the decision were the range of format(s) supported, the extraction method(s), elements generated, output format(s), encodings and bindings, standard vocabularies, configurability, technology base (i.e. platforms/operating systems supported), dependencies, maturity, claims made about the quality of output, and, above all, any licensing restrictions (the tools had to be open source). The selected tools were (in no particular order):

DC-dot [13]. This is the best known of a group of usually Java-based tools that generate Dublin Core metadata when an identifier such as a URL is manually submitted to an online form. It mostly harvests metatags from Web page headers. It was developed by Andy Powell at UKOLN but has not been updated since the year 2000.

KEA [14]. KEA is a tool developed by the New Zealand Digital Library Project for the specific purpose of extracting native keywords (or, rather, keyphrases) from html and plain text files. A naïve Bayesian algorithm learns word frequencies from a training corpus. Keyphrases are extracted according to a combination of how specific they are to the given document (in comparison with the training data) and their position within the document (an illustrative sketch of this kind of scoring follows the notes below). KEA is available as a download and must be trained with a document set from the relevant subject domain.

Data Fountains [15]. Unlike the other tools, Data Fountains can be used to discover, as well as describe, Internet resources. Data Fountains can generate metadata for a given page if provided with a URL. Alternatively, it can trawl for and generate metadata for Internet resources on a particular topic, or it can drill down and follow links from a starting URL. The metadata generation method varies from the simple harvesting of metatags, in the case of dc:creator, to the use of a combination of metatag harvesting and sophisticated NLP algorithms (phraseRate), as in the case of dc:subject. Data Fountains differs from KEA in not requiring a training corpus: it generates its keyphrases mostly by using clues found within the structure of the target document itself. It was expected that differences between the keyphrase algorithms used by KEA, Data Fountains, and the Yahoo! Term Extractor would lead to some interesting test results. Data Fountains generates the entire range of simple Dublin Core elements plus a few elements specific to itself. It is the result of an on-going project at the University of California, Riverside. Data Fountains is available for use online via a user-friendly GUI and free account, or as a download of the iVia suite of tools, upon which it is based. The download and installation are not trivial tasks, however.
[13] http://www.ukoln.ac.uk/metadata/dcdot/
[14] http://www.nzdl.org/Kea/
[15] Version 2.2.0, http://datafountains.ucr.edu/
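The two signals described above for KEA (how specific a candidate phrase is to the document relative to a training corpus, and how early it first appears) can be made concrete with a small sketch. This is not KEA's code: the function names are hypothetical, the candidate generation is deliberately crude, and a real implementation learns how to combine the features with a naïve Bayes model trained on documents with known keyphrases.

    import math
    import re

    def candidate_phrases(text, max_len=3):
        """Crude candidate generation: every run of one to three consecutive words."""
        words = re.findall(r"[a-z][a-z-]+", text.lower())
        return {" ".join(words[i:i + n])
                for n in range(1, max_len + 1)
                for i in range(len(words) - n + 1)}

    def score_phrase(phrase, doc_text, corpus_doc_freq, corpus_size):
        """Combine corpus specificity (a TF x IDF style weight) with the position
        of the phrase's first occurrence in the document."""
        doc = doc_text.lower()
        tf = doc.count(phrase) / max(len(doc.split()), 1)
        idf = math.log((corpus_size + 1) / (corpus_doc_freq.get(phrase, 0) + 1)) + 1.0
        first = doc.find(phrase)
        position = first / max(len(doc), 1) if first >= 0 else 1.0   # 0.0 = very early
        return tf * idf * (1.0 - 0.5 * position)

    def extract_keyphrases(doc_text, corpus_doc_freq, corpus_size, k=5):
        scored = {p: score_phrase(p, doc_text, corpus_doc_freq, corpus_size)
                  for p in candidate_phrases(doc_text)}
        return sorted(scored, key=scored.get, reverse=True)[:k]

    print(extract_keyphrases(
        "Automatic metadata generation tools for digital repositories and portals.",
        corpus_doc_freq={"metadata": 40, "digital": 120, "tools": 300},
        corpus_size=1000, k=3))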


SamgI (Simple Automated Metadata Generation Interface) [16]. SamgI is designed to extract metadata from learning objects (LOM), such as courseware. Several potentially useful Dublin Core elements are generated. SamgI is a framework/system that one could call "federated AMG" in that it combines the auto-generated output from several Web services into a single metadata instance. For instance, it invokes the Yahoo! Term Extraction service to provide key words and phrases. SamgI is the work of the HyperMedia and DataBases Research Group at the Katholieke Universiteit Leuven. It is available as a download or online.

Yahoo! Term Extraction service [17]. The Yahoo! Term Extraction Web Service is a RESTful service that provides a list of significant key words or phrases extracted from a larger piece of content (a minimal invocation sketch follows the notes below). The Yahoo! Term Extractor was evaluated indirectly when testing SamgI.

The intention at the outset was also to test PaperBase [18], a prototype that uses a naïve Bayesian algorithm to generate metadata for scholarly works. However, the software was not made available in time for the full testing program. Format identification and characterisation tools, such as DROID [19] and JHOVE [20], can provide useful metadata for dc:format but were considered out of scope because they have been evaluated elsewhere.

The plan was to test the five selected tools for their effectiveness in generating Dublin Core metadata from Web pages (html) and journal articles (pdf). These formats were chosen for their ubiquity within the JISC IE and because textual resources perhaps offer the greatest scope for automatic metadata generation. The website sample was drawn from Intute. Five test samples, each of about 50 items, were obtained between 6 November 2007 and 4 March 2008, comprising URLs for the 50 most recently accessioned resources in: 1) the Intute database [i.e. all broad subjects], 2) Intute:Health and Life Sciences, 3) Intute:English Literature, 4) Intute:Political History, and 5) Intute:Demographic Geography [21].

The scholarly works sample comprised 120 journal articles in pdf format (which is the dissemination format for scholarly works in most institutional repositories). They were downloaded from the Leeds and Sheffield sections of the White Rose Consortium institutional repository. The sample was carefully selected to correct the bias towards science/technology (and particularly computer science). One article was selected randomly from each subject in the Leeds and Sheffield sections of the repository, an approach that also reduced the likelihood of selecting multiple items from a single journal.

[16] http://www.cs.kuleuven.be/~hmdb/joomla/index.php?option=com_content&task=view&id=48&Itemid=78
[17] http://developer.yahoo.com/search/content/V1/termExtraction.html
[18] Tonkin, E., & Muller, H. (2008a). Keyword and metadata extraction from pre-prints. Proceedings of the 12th International Conference on Electronic Publishing, Toronto, Canada, 25-27 June 2008, edited by Leslie Chan and Susanna Mornati. ISBN 978-0-7727-6315-0, pp. 30-44. Retrieved October 23, 2008, from: http://elpub.scix.net/data/works/att/030_elpub2008.content.pdf. Also: Tonkin, E., & Muller, H. (2008b). Semi Automated Metadata Extraction for Preprints Archives. Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries, Pittsburgh, PA, USA, June 2008. ISBN 978-1-59593-998-2, pp. 157-166. Retrieved October 23, 2008, from: http://portal.acm.org/ft_gateway.cfm?id=1378917&type=pdf&coll=GUIDE&dl=GUIDE&CFID=7481537&CFTOKEN=69623572
[19] DROID (Digital Record Object Identification), http://droid.sourceforge.net/wiki/index.php/Introduction
[20] JSTOR/Harvard Object Validation Environment, http://hul.harvard.edu/jhove/
[21] http://www.intute.ac.uk/. One sample was representative of the Intute database as a whole and covered a variety of subjects and digital formats. The other four were to test whether the effectiveness of metadata extraction by the various tools was affected by the subject orientation of the resources.
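To show what invoking this kind of RESTful extraction service looks like, here is a minimal sketch. The endpoint URL, the parameter names (appid, context) and the shape of the XML response are assumptions based on the service documentation cited in note [17], and would need to be checked against that documentation before use.

    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    # Assumed endpoint; see the documentation cited in note [17].
    ENDPOINT = "http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction"

    def extract_terms(text, appid="YOUR_APP_ID"):
        """POST a block of text to the term-extraction service and return the key
        words/phrases it suggests (assumed <Result> elements in a <ResultSet>)."""
        data = urllib.parse.urlencode({"appid": appid, "context": text}).encode("utf-8")
        with urllib.request.urlopen(ENDPOINT, data=data) as response:
            tree = ET.fromstring(response.read())
        return [element.text for element in tree.iter() if element.tag.endswith("Result")]

    print(extract_terms("Automatic metadata generation for repositories and portals."))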


The project was interested in the potential for extracting metadata once the repository movement grows, so a broad range of title-page layouts was preferable to a strictly representative sample of the current repository [22].

The output from the various tools was evaluated using the following metrics and methods.

The most basic test was for completeness, i.e. whether each of the tools auto-generated all of the metadata elements that it should have. A baseline impression of completeness could be obtained by validating the xml output from each tool against the relevant application profile (which for the html sample was the Intute Cataloguing Guidelines [23]) and noting the number of mandatory tags that were either empty or absent.

DC Identifier, DC Language, DC Format and DC Type were evaluated by Exact Match Accuracy. EMA is the proportion of times that the automatically generated value exactly matches the expert cataloguer's assignment for the element (after simple normalizations) and is appropriate for elements to which a single value from a controlled vocabulary is normally assigned.

Titles and keywords were evaluated by Precision, Recall and F1-score. Precision and recall are more sophisticated than EMA in that multiple-value fields are split into their individual "subfields" so that each unit (e.g. keyword or title word) is matched individually. Precision and recall can therefore take into account situations where one value or instance is correct and another is incorrect. Precision is the number of correctly generated units as a percentage of the number of generated units. A low precision score would suggest that there is a lot of noise (i.e. that many spurious terms are also generated). Recall measures the extent to which all of the correct values have been generated, regardless of whether they are accompanied by noise from incorrect values. Recall is defined as the number of matching units (i.e. correctly generated terms) divided by the number of correct units in the reference set of documents (e.g. terms supplied by the expert cataloguer).

F1-score was also calculated to provide a realistic trade-off, or weighted average, between precision and recall. Used on their own, those two metrics can give a correct but misleading representation. Keyword precision, for instance, can be maximised artificially by a strategy of returning just a single word: the single most frequently occurring word within the source text after stop words have been removed. But the result is an absurdity because of the unacceptably low level of recall that this implies. Conversely, a strategy of extracting all of the words from the resource as keywords will ensure 100% recall but with a very low level of precision. F1-score is defined as:

F1 = 2 * (precision * recall) / (precision + recall)

F1-score reaches its best value at 1 and its worst at 0. Its significance is that it tends towards its target value (of 1) only when both precision and recall are good.
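The definitions above translate directly into code. The sketch below is illustrative rather than the project's evaluation software (which, as noted under Implementation, largely did not exist); the normalisation applied before the exact-match test is a simplifying assumption.

    def exact_match_accuracy(generated, reference):
        """EMA: proportion of records whose generated value exactly matches the
        expert cataloguer's value after a simple normalisation (case, whitespace)."""
        norm = lambda value: " ".join(str(value).lower().split())
        matches = sum(1 for g, r in zip(generated, reference) if norm(g) == norm(r))
        return matches / len(reference)

    def precision_recall_f1(generated_units, reference_units):
        """Precision, recall and F1 over individual subfield units (e.g. keywords
        or title words), as defined in the text above."""
        generated, reference = set(generated_units), set(reference_units)
        correct = generated & reference
        precision = len(correct) / len(generated) if generated else 0.0
        recall = len(correct) / len(reference) if reference else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    # Example: keywords generated by a tool versus those assigned by a cataloguer.
    print(precision_recall_f1(
        ["population", "census", "britain", "software"],
        ["population", "census", "demography", "britain"]))   # (0.75, 0.75, 0.75)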

The auto-generated summaries were tested with particular interest because DC Description and DCTerms:abstract are, more often than not, the most labour-intensive elements to catalogue manually [24]. Unfortunately, free-text summaries are monstrously difficult to evaluate because there is no single correct answer [25]. No two cataloguers, for instance, can be expected to produce summaries identical in every respect (i.e. in terms of their semantic meaning, choice of words, and word order). Also, perceptions about the quality of a summary will vary from one person to the next according to their expectations, needs, assumptions and biases.

[22] http://eprints.whiterose.ac.uk/. Many titles within the White Rose repository contained an institutional cover page and had to be rejected.
[23] http://www.intute.ac.uk/cataloguer/guidelines-socialsciences.doc
[24] This is true even for scholarly works that contain an author-supplied abstract, because special characters often fail to copy and paste properly from a pdf file into a cataloguing interface by point-and-click methods.
[25] Halteren, Hans van, and Teufel, Simone. 2003. Examining the consensus between human summaries: initial experiments with factoid analysis. In Proceedings of the HLT-NAACL 03 Text Summarization Workshop, 57-64.


Ideally, at least one extrinsic evaluation method would have been employed to show how useful participants find the metadata in real-life external tasks (e.g. relevance assessment, reading comprehension, or following instructions) [26]. However, time and resource constraints made it more practical to use a technique known as factoid analysis, which can reveal the richness of the information in a summary and the degree of consensus between summary and source text at a fundamental level of semantic meaning, despite the existence of superficial differences in vocabulary and word/sentence order.

The assessor first divided each reference summary (i.e. expert-created abstract) manually into its atomic semantic units, known as factoids. For instance, the sentence "In addition, there is a timeline section which focuses on the history of the Holocaust, tracing the years from the rise of the Nazi Party during the 14 years following the end of World War I to the aftermath of the Second World War in which Nazi perpetrators of war crimes faced retribution for their war crimes and survivors began rebuilding their lives." was split into the following factoids:
- There is a timeline section
- The timeline focuses on the history of the Holocaust
- The timeline begins by tracing the years from the rise of the Nazi Party
- The Nazi Party rose during the 14 years following the end of World War I
- The timeline ends in the aftermath of the Second World War
- In the aftermath of the Second World War, Nazi perpetrators of war crimes faced retribution for their war crimes
- In the aftermath of the Second World War, survivors began rebuilding their lives

The auto-generated summary was likewise split into factoids. The assessor then counted the degree of overlap between the two summaries on a factoid-by-factoid basis. Factoids in summary A and summary B were regarded as matches if they shared the same semantic meaning, regardless of differences in vocabulary. The test was then refined to see how well the factoids matched the specific requirements of the Intute Cataloguing Guidelines, which expected the following to be described:
- The nature of the resource, e.g. an electronic journal, collection of reports, etc.
- The intended audience of the information
- Who is providing the information (author, publisher, funder, organisation)
- The subject coverage/content of the resource
- Any geographical or temporal limits
- Any form or process issues that might affect access or ease of use (charging, registration, need for any special software not on the technical requirements list, etc.)
- Availability of the resource in other languages
- Any cross-referencing notes, for example, "These pages form part of the ... website"

This was an intellectually intensive process but not as time-consuming as evaluation by extrinsic methods. There was, of course, a small subjective element involved in deciding whether a match was close enough to be registered, but it was deemed to be at an acceptable level.

The project also conducted a small experiment to compare the time taken to create metadata for a given set of documents using manual and hybrid cataloguing methods respectively. Two subsets, each of seven Webpages, were taken randomly from the Intute:Political History sample. The first was catalogued manually into an MS Excel spreadsheet by a subject expert (a librarian with a PhD in History). The second set was described using a hybrid method whereby metadata was auto-generated by Data Fountains into an MS Excel spreadsheet and then manually enhanced by the subject expert.
In both cases, the metadata was created to the standard stipulated by the Intute Cataloguing Guidelines. The average time taken for each method was recorded.
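The factoid matching itself is a human judgement, but once the assessor has recorded which pairs of factoids share the same meaning, the bookkeeping reduces to simple counts, as the hypothetical sketch below illustrates (the example factoids are adapted from the passage above).

    def factoid_overlap(reference_factoids, generated_factoids, judged_matches):
        """judged_matches: set of (reference, generated) factoid pairs judged by
        the assessor to share the same semantic meaning. Returns (coverage,
        precision): how much of the expert summary is covered, and how much of
        the generated summary is relevant."""
        matched_reference = {ref for ref, gen in judged_matches}
        matched_generated = {gen for ref, gen in judged_matches}
        coverage = len(matched_reference) / len(reference_factoids)
        precision = (len(matched_generated) / len(generated_factoids)
                     if generated_factoids else 0.0)
        return coverage, precision

    reference = [
        "There is a timeline section",
        "The timeline focuses on the history of the Holocaust",
        "The timeline ends in the aftermath of the Second World War",
    ]
    generated = ["The site includes a Holocaust history timeline"]
    judged = {(reference[0], generated[0]), (reference[1], generated[0])}
    print(factoid_overlap(reference, generated, judged))   # (0.666..., 1.0)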
[26] E.g. the Expert Game; Shannon Game; Question Game; Classification Game; Keyword Association; Query-based methods.


Implementation
It was relatively easy to find a test bed for html resources. Intute was an ideal candidate because: 1) it contains a ready-made reference set of expertly catalogued metadata records for over 100,000 html resources; 2) its subject coverage is universal; 3) its broad subject terms assist the identification of subsets. It was more difficult to find a scholarly works test bed from within the JISC IE because: 1) institutional repositories (as yet) contain few items [27]; 2) the subject coverage is usually unbalanced; 3) most repositories prefix their scholarly works with an institutional cover page, with logo etc., which confuses tools [28].

The process of testing was also unexpectedly time-consuming. This was partly because of the need to convert between multiple output formats. This was particularly problematic in relation to KEA, which requires a large training corpus. It therefore made more sense to test KEA after Stage 3 was completed, by which time metadata generation tools and utilities (e.g. file conversion software) would be available for easier use as Web services. The scarcity of open source software for calculating intrinsic metrics was even more troublesome. The Data Fountains evaluation module ostensibly calculates precision, recall, exact match, etc. but is no longer supported and is unable to evaluate the output from other metadata generation tools. Most calculations had to be done manually. On the other hand, most of the extrinsic metrics would have required a project in themselves.

The main disappointment was in not being able to test PaperBase. Its developers ultimately preferred to conduct tests in-house. The risk analysis conducted during the planning phase paid insufficient attention to this possibility. The disappointment was heightened because the test phase suggested that there is greater scope for extracting metadata from scholarly works than from websites and, according to the research literature, PaperBase appears to be the most sophisticated prototype yet developed for this purpose. Fortunately, Tonkin and Muller published their findings during the later stages of the project and some useful comparisons could be made with the other tools.

[27] Glasgow Eprints, http://eprints.gla.ac.uk/, for instance, contained a large number of ppt presentations but too few eprints in pdf.
[28] JSTOR, http://www.jstor.org/, which would otherwise have been very suitable, was ruled out for this reason.

Outputs and Results


A Stage 2: Tool Evaluation report provides a detailed description of the results from the test phase of the MetaTools project. Its main findings are described below.

The quality of metadata generated for Websites (html) was mostly disappointing. In many respects, Data Fountains proved to be the most successful of the tools in the test program at generating metadata from html, but it fell short of providing a complete metadata solution. For a few items, other tools generated superior metadata for specific elements. Data Fountains generated metadata that was 69% complete (i.e. it generated a value for 69% of the 270 mandatory elements required by a sample of 50 Websites), which means that it generated no value at all for a full 31% of mandatory elements.

Very often tools shared the same systemic error(s) because they used identical methods. DC:creator was the most problematic element for all of the tools. DC-dot, SamgI and Data Fountains produced an exact or partial match for it only 10% of the time [29].


The approach of all three tools was to extract dc:creator or author metatags, although Data Fountains additionally post-processed the list to remove duplicate entries and blacklist undesirable values. Unfortunately, these metatags were rarely supplied, or else they contained the name of the software used to create the page rather than a personal or corporate name.

DC:title was usually generated more successfully because harvestable dc:title metatags and <title> html tags were often available. However, even Data Fountains achieved relatively modest precision and recall rates of 0.45 and 0.65 respectively for this element [30]. Metatags often did not match the displayed titles, which are increasingly rendered from image files (e.g. gif), which DC-dot and Data Fountains are unable to read. This problem has no obvious solution. The ALT attribute may provide an alternative text description for a linked image file but it is infrequently used.

The auto-generated title metadata was of very variable quality, and the story was similar for most other elements. Data Fountains generated an exact match for only 34% of the titles in the Intute:Political History sample and a partial match for 30%. DC-dot performed similarly: exact match 32%; partial match 34%. SamgI was far worse: exact match only 4%, partial match 6% [31]. A rough idea of the output can be gained from the table below, which shows a selection of titles generated by Data Fountains and their equivalents in the reference set (expert-assigned titles from the Intute Website).

Intute title (expert-assigned) | Data Fountains title (auto-generated)
British Society for Population Studies | British Society for Population Studies
A fact, and an imagination : or Canute and Alfred, on the seashore | Wordsworth, William. 1888. Complete Poetical Works. [EMLS 1.2 (August 1995): 6.1-10]
A bibliography of Thomas More's 'Utopia' | A Bibliography of Thomas More's Utopia
A tribute to R. S. Thomas | Obituary Ruins The Hearth Here Want More?
17th century reenacting and living history resources | 17TH Cen Reenacting
Dan Berger's pages at Bluffton University | Examples of pericyclic reactions
221 Baker Street | 221B Baker Street: Sherlock Holmes
ABES : annotated bibliography for English studies | Abes
Adelmorn, the outlaw : a romantic drama in three acts | Adelmorn by Matthew G. Lewis
A relation of the apparition of Mrs. Veal | Defoe, Apparation of Mrs. Veal
A centennial tribute to Langston Hughes | Library System - Howard University
The Abraham Cowley Text and Image Archive: University of Virginia | Abraham Cowley text and image archive
!Surréalisme! | The Surrealism Server
 | Organic chemistry I : Chem 341
 | Centro regionale progettazione e restauro di Palermo

[29] According to tests using the Intute:Political History sample.
[30] For the Intute:English Literature sample.
[31] SamgI perhaps performs so poorly in this respect because it was designed to extract LOM metadata and the title module may have been a token effort. It only harvests META tags where NAME="DC.Title", which are rarely used. Data Fountains exploits additional sources of title metadata when there is no DC.Title META tag: 1) the content of any META tag whose name is title or dc:title; 2) the Title tag; 3) all H1 tags; 4) the sequence of words in the first 50 letters of body text. Data Fountains then post-processes the initial list to remove duplicate entries, blacklist undesirable values (e.g. "Homepage", "Untitled Document"), and remove unwanted prefixes (e.g. "Welcome to", "Homepage of") while preserving the order of the list. The values remaining in the list are assumed to be in order of decreasing quality, so that when a single Title is required, the first is used.
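Note [31] describes the title heuristic precisely enough to sketch. The code below is not Data Fountains' implementation: it uses crude regular expressions where a real tool would use an HTML parser, and the blacklist and prefix lists contain only the examples quoted in the note.

    import re

    BLACKLIST = {"homepage", "untitled document"}
    PREFIXES = ("welcome to ", "homepage of ")

    def candidate_titles(html):
        """Collect candidates in the priority order given in note [31]."""
        candidates = []
        candidates += re.findall(
            r'<meta[^>]+name=["\'](?:dc\.)?title["\'][^>]+content=["\']([^"\']+)', html, re.I)
        candidates += re.findall(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
        candidates += re.findall(r"<h1[^>]*>(.*?)</h1>", html, re.I | re.S)
        body = re.sub(r"<[^>]+>", " ", re.sub(r"^.*?<body[^>]*>", "", html, flags=re.I | re.S))
        candidates.append(" ".join(body.split())[:50])     # first 50 letters of body text
        return [" ".join(c.split()) for c in candidates if c.strip()]

    def choose_title(html):
        """De-duplicate, drop blacklisted values and strip unwanted prefixes while
        preserving order; the first surviving candidate is assumed to be the best."""
        seen, cleaned = set(), []
        for candidate in candidate_titles(html):
            for prefix in PREFIXES:
                if candidate.lower().startswith(prefix):
                    candidate = candidate[len(prefix):]
            if candidate.lower() in BLACKLIST or candidate.lower() in seen:
                continue
            seen.add(candidate.lower())
            cleaned.append(candidate)
        return cleaned[0] if cleaned else None

    print(choose_title("<html><head><title>Welcome to 221B Baker Street</title></head>"
                       "<body><h1>221B Baker Street: Sherlock Holmes</h1></body></html>"))
    # 221B Baker Street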


Creator and Title exemplify two of the key problems of generating metadata from Webpages. Firstly, it is difficult to extract metadata for specific entities from the body of a page because html contains few tags to indicate semantic meaning other than in the header: a personal or corporate name within the text, for example, could be a creator, a subject, or somebody mentioned in passing [32]. Secondly, the order of priority accorded to extraction methods is usually hard-coded within each tool. This can cause the output to be suboptimal: Data Fountains, for example, outputs a harvested dc:title metatag as a matter of course, even when there is a better title in the <title> html tag, whereas DC-dot gives preference to the title tag, which is why its output is occasionally preferable.

Nevertheless, in purely statistical terms, it was clear that Data Fountains is an improvement over older tools (e.g. DC-dot) and also other relatively recent tools (e.g. SamgI and the Yahoo! Term Extractor). The improvement was most visible in relation to free text elements, such as key phrases and summaries [33]. Statistical analysis suggests that Data Fountains generates more suitable keywords or keyphrases for Websites and scholarly works than either DC-dot or SamgI, which uses the Yahoo! Term Extractor web service. When the author-supplied keywords found in thirty scholarly works were compared with those generated by Data Fountains and SamgI, it was found that the output from Data Fountains showed superior precision, recall and F1 score [34]. Human evaluation supported this finding. Data Fountains produces a larger number of unnecessary variations from the same root phrase (e.g. "deprived neighbourhoods", "deprived neighbourhoods of Southampton") but SamgI's errors are more serious. It more frequently generates entirely irrelevant phrases, or treats entities that should populate other elements as key phrases. For instance, it is not uncommon for the title of the journal in which the article is found (e.g. Journal of Social Policy) to be entered as a keyphrase. Where terms are relevant they are also slightly less specific (e.g. "economic geography" rather than "deprived neighbourhoods").

DC-dot generates dc:subject metadata by harvesting metatags and, where they are absent, extracts keywords from the content by analysing anchors (hyperlinked concepts) and presentation encoding, such as <strong>, bolding and font size (a rough sketch of this method follows the next paragraph). This is a crude method because: 1) the number of keywords assigned to a resource by an author can vary enormously, as can the number of hyperlinked or emphasised words within a page [35]; 2) authors differ in their interpretation of what constitutes a keyword and many do not, strictly speaking, map to dc:subject (e.g. some creators enter the name of their home institution as keywords); 3) there is no guarantee that hyperlinked or emphasised words indicate the subject of a site. [evidence of DC-dot's poor extraction of keywords needed here]

Data Fountains uses a more sophisticated method to generate keywords or, to be precise, keyphrases. A (phraseRate) algorithm analyses the structure of the given html document and uses this knowledge as the basis for its identification and ranking of candidate keyphrases.
It uses an array of indicators, including: the nesting structure of the document (extra value is given to, say, introductory or emphasised text); the grammatical and syntactical structure of sentences (in order to identify the correct start and end points of candidate phrases); the relative frequency of each word, its position within candidate keyphrases, and the frequency of those keyphrases within the source document as a whole; and the relative weight that should be given, respectively, to keyphrases extracted from the body and keyword meta tags from the document header (the latter being given a boost when they are sparse so that each carries more weight).
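For contrast with phraseRate, the simpler harvesting behaviour attributed to DC-dot above can be sketched in a few lines. Again, this is an illustration under assumptions rather than DC-dot's code, and the regular expressions stand in for proper HTML parsing.

    import re

    def dc_subject_candidates(html):
        """Harvest keyword metatags; if none are present, fall back to anchor text
        and emphasised text as a rough proxy for 'important' terms (the crude
        method described above)."""
        metas = re.findall(
            r'<meta[^>]+name=["\'](?:dc\.subject|keywords)["\'][^>]+content=["\']([^"\']+)',
            html, re.I)
        if metas:
            return [kw.strip() for m in metas for kw in m.split(",") if kw.strip()]
        anchors = re.findall(r"<a\b[^>]*>(.*?)</a>", html, re.I | re.S)
        emphasised = re.findall(r"<(?:strong|b|em)\b[^>]*>(.*?)</(?:strong|b|em)>",
                                html, re.I | re.S)
        return [" ".join(re.sub(r"<[^>]+>", " ", t).split())
                for t in anchors + emphasised if t.strip()]

    print(dc_subject_candidates('<p>See the <a href="/history">Political History</a> '
                                'section and our <strong>census data</strong>.</p>'))
    # ['Political History', 'census data']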

[32] The <title> tag is generally the only useful signpost of this kind.
[33] In absolute terms, Data Fountains produced better results for dc:identifier than keywords or abstract. It generated a correct value (i.e. an exact match) 83% of the time. However, an even higher accuracy threshold is desirable for identifier because even the smallest error renders a unique machine-readable identifier useless.
[34] The comparison was between unique content words (i.e. after breaking down the auto-generated keyphrases into their component words and removing duplicates and stop words, such as "of", "the", "and").
[35] From none to ??? in the sample.


Data Fountains applies a more sophisticated method for generating summaries too. DC-dot derives them entirely from the harvesting of meta tags (i.e. dc:description), whereas Data Fountains uses the scores given to the candidate keyphrases by the phraseRate algorithm to find the highest-scoring text division and the highest-scoring paragraph within the resource, which is returned as the final summary. If this strategy fails, a set of contiguous high-scoring sentences is used instead [36]. Data Fountains succeeds in improving upon the performance of DC-dot, but factoid analysis suggests that its auto-generated summaries still contain only a small proportion of the information found in the reference set of human-created summaries. [figures and examples needed] Interestingly, the Data Fountains summary of an item was usually no better than, and indeed was usually identical to, the brief Google result (they were presumably generated by a similar algorithm). Attitudes towards auto-generation are likely to be shaped, to some extent, by context and expectations, so that a description that is acceptable within the context of a search engine (which offers universal retrieval) is likely to require considerable amendment before it is suitable for the relatively small but richly-descriptive context of a JISC portal.

In some respects, tools generated higher quality metadata from scholarly works in pdf format than from Webpages. This was true even for tools, such as Data Fountains, that were, in the first instance, designed to support html. For the Website sample, Data Fountains was unable to generate any summaries that approached the standard stipulated by the Intute Cataloguing Guidelines. On the other hand, it extracted the author/publisher-supplied abstract perfectly from 17.5% of scholarly works and, altogether, 25.8% of the auto-generated abstracts were suitable for seeding a cataloguing interface [37]. Despite this still relatively low retrieval rate, the abstracts are useful because they are rarely incorrect. There is either an exact match, a truncated match (i.e. where only the first portion of an abstract is extracted), or else the auto-generated abstract contains extraneous text from beyond the point at which it should stop. Here, the cataloguer can always accept the auto-generated output with only the smallest amendment being needed.

Moreover, there is considerable scope for further improving the quality of extraction of summaries from scholarly works. Unlike Websites, for which an author-created abstract or summary would be a rarity, the vast majority of scholarly works (92% of the scholarly works that Data Fountains' crawler managed to locate in the sample) do contain a summary that a bespoke tool could be expected to extract. Even after conversion to plain text, 63% of these files have a textual marker such as "Abstract", "ABSTRACT", "Abstract:-" or, less frequently, "Summary" or "Executive summary" and so on [38]. Similarly, a cleverly designed algorithm could surely exploit the fact that 30% of scholarly works contain author- or publisher-assigned keywords that are either labelled (e.g. "Key words", "keywords", "KEYWORDS", etc.) or, if not labelled, are identifiable from being the only information typically found within a string that is both greater than 50 characters in length and contains no stop words (a sketch of these markers follows the notes below).

Significant testing by other projects

Recent work by Tonkin and Muller suggests that a bespoke tool, designed specifically with scholarly works in mind, can produce markedly better results than Data Fountains. Their research was published in June 2008 and came too late to be fully assessed, but it is very significant and should be mentioned here.
Their paperBase prototype generates titles, authors, keywords, abstracts, page numbers and references from preprints and, via a Dublin Core-based REST API, uses them to populate a web form that the user can then proofread and amend. PaperBase has been integrated into the institutional repository that stores papers written by members of the Department of Computer Science at the University of Bristol, where it has been trialled.
36. For details of Data Fountains' retrieval and extraction methods see: Gordon Paynter (2005). Developing Practical Automatic Metadata Assignment and Evaluation Tools for Internet Resources. Proc. Fifth ACM/IEEE Joint Conference on Digital Libraries, Denver, Colorado, June 7-11, 2005, ACM Press, pp. 291-300. Retrieved October 23, 2008, from: http://ivia.ucr.edu/projects/publications/Paynter-2005-JCDL-Metadata-Assignment.pdf . See also the iVia project website: http://ivia.ucr.edu/projects/
37. This figure includes: 1) perfect abstracts; 2) abstracts that have been prematurely truncated (usually because the parser reached a special character such as < or >); 3) abstracts that contain some additional unwanted material (on account of the parser continuing beyond the end of the abstract and extracting the first few lines of the body of the text).
38. Author-supplied abstracts are less frequent in older journal articles and those in arts and humanities subjects than elsewhere.


Tonkin and Muller have manually tested the quality of title and author extraction by paperBase for 186 papers in their repository.39 They claim that 86% of the titles were extracted correctly, with just 8% being partially correct and 8% completely wrong. This is a huge improvement on the figures that Data Fountains had managed for the White Rose scholarly works sample (28% of titles extracted correctly, with 27% partially correct and 45% completely wrong). The comparison must be treated with some caution because it is based on different samples and the tests were undertaken under different circumstances,40 but the disparity in results is startling. Similar figures are claimed for the other DC elements that they tested. E.g. Tonkin and Muller claim that only 13% of the author names generated by paperBase from preprints were completely wrong, which compares with Data Fountains' error rate of 90%41 for creators of Websites.

Tonkin and Muller's work confirms the impression derived from the MetaTools tests that, in the near term, there appears to be greater scope for extracting high-quality metadata from scholarly works than from Websites, because their layout is more formulaic, with most conforming to one or another of five or six traditional arrangements. E.g. a paper is likely to start with a title, then authors, affiliations, abstract and keywords, and end with various references. Tonkin and Muller claim to have captured these five or six alternative layouts in a probabilistic grammar that makes use of the syntactic structure in journal publishing.42

PaperBase makes use of two techniques: a Bayesian classifier and a state machine. Firstly, a Bayesian classifier was trained using the word frequencies found in a corpus of titles, author names, and institute names from DBLP and Citeseer. Using these word frequencies, the Bayesian classifier then assigns a score to each token (word) extracted from the eprint to show the probability of it belonging, in turn, to each of the various elements (author, title, etc.). An example might be the word MetaTools which, according to the training corpus, might have a 99% chance of being a title word but only a 1% chance of being part of an author name. Most words are ambiguous in that they could appear in more than one context. Secondly, a state machine (Markov chain) provides background knowledge of the sequences of tokens in each of the five or six common eprint layouts. E.g. a simple state machine could be:

PrePrint ::= Title+ Author+ (i.e. one or more title tokens followed by one or more author tokens)

A parser then considers the likelihood of each possible distribution of tokens between, in this case, Title and Author. The likelihood of each parse is computed by multiplying the probability of each token belonging to the specific class. E.g. assuming that the Bayesian classifier has assigned the following probabilities:

MetaTools - probability as a title word = .99; probability as an author word = .01
Malcolm - probability as a title word = .12; probability as an author word = .88
Polfreman - probability as a title word = .02; probability as an author word = .98

For the following three-word string there are four possible parses (the title and author portions of each parse are shown explicitly):
39. Tonkin & Muller, Keyword and Metadata Extraction from Pre-prints, pp. 40-41. As Tonkin and Muller explain, "Of the remaining 87%, 32% included the right authors but had extraneous text that was misconstrued as authors. [The] sample was not sufficiently randomised and had many papers by a Slovenian author with a diacritical mark in both surname and first name, which skewed [the] results".
40. E.g. some clarification is needed as to their definition of 'correct'. For instance, the creators extracted by Data Fountains were often accompanied by footnote numbers, e.g. "Smith, John5", and it is unclear whether Tonkin and Muller would regard a similar result in their tests as a correct or partially-correct value.
41. Intute Political History sample.
42. The explanation of PaperBase that is provided here is a précis of the account provided in: Tonkin & Muller, Semi Automated Metadata Extraction for Preprint Archives, pp. 160-162.


Title: (none); Author: "MetaTools Malcolm Polfreman" (.01 x .88 x .98 = .0086)
Title: "MetaTools"; Author: "Malcolm Polfreman" (.99 x .88 x .98 = .8537)
Title: "MetaTools Malcolm"; Author: "Polfreman" (.99 x .12 x .98 = .1164)
Title: "MetaTools Malcolm Polfreman"; Author: (none) (.99 x .12 x .02 = .0024)

The accepted version is the one that results in the maximal probability overall for the authors, title, affiliation and email addresses which, in this case, is the second parse, with a probability of 0.8537.43 (A minimal sketch of this parse-selection step is given below.)

Alternative approaches are used in relation to the small number of items (e.g. scanned legacy documents) from which it is not possible to extract the text as a simple stream of characters. For instance, the visual structure of a document may be exploited to segment the image before OCR software, such as gOCR, is used to translate it into text.

The objective ranking of confidence is a potentially very significant feature of PaperBase because it means that the system has the capacity to learn. This can be seen in relation to the generation of keywords. Its institutional repository uses a small controlled vocabulary of about 40 terms and a Bayesian classifier is used to predict these keywords using the words found in the preprint's title, author name(s), and abstract. As Tonkin and Muller explain: "The classifier is trained as and when preprints are added to the repository; when a preprint is added with specific keywords, the classifier is retrained, enabling it to make better predictions for the keywords."

Tonkin and Muller suggest that confidence in the PaperBase system builds with size because of the in-built capacity to cross-reference entries from multiple metadata records and rank their likelihood of being a match. The greater the size of the database, the greater the amount of machine learning that can take place. Connections can be made because two pre-prints are written by an author with the same name, or because they cover similar subject matter according to the keywords. Those connections can be used to, for example, disambiguate author identities.44 It could eventually, in some senses, be self-correcting. SamgI too provides confidence ratings,45 and the Data Fountains Evaluation Module, no doubt, could do so, based upon automatic metrics such as precision and recall. But the metadata needs to be of a certain quality (likely to be above the level that the tools currently achieve), otherwise errors will merely propagate. This is particularly true of author-supplied metadata from metatags.

Most promising of all, Tonkin and Muller have conducted trials that suggest that a deposit process can benefit from a bespoke tool, such as PaperBase, being used to seed a cataloguing interface upon ingest of a scholarly work into a repository. They split volunteers (twelve academic members of staff and PhD students) into two groups, with each volunteer being required to deposit six preprints into the repository. The first group entered the first three preprints manually and the remaining preprints using paperBase. The second group was required to enter the first three using PaperBase and the last three manually. Tonkin and Muller observed:

Most participants (9 out of 11) thought the hybrid PaperBase method was faster than the manual approach, even though, as Muller and Tonkin acknowledge, the quantitative results do not support this.

Metadata creation is more accurate under the hybrid PaperBase system, although only for title.
It is surmised that errors crept in with the manual approach because a surprising number of participants opted to type information rather than cut and paste it from the PDF file. The hybrid approach resulted in more metadata because the abstract is filled in; some participants will not fill in an abstract manually.
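The parse-selection step described above lends itself to a compact illustration. The following is a minimal sketch in Python, not PaperBase's actual code: the Title+ Author+ grammar and the probability figures are taken directly from the worked example, and the function and variable names are invented for the illustration.

from math import prod

# Per-token probabilities as assigned by the (hypothetical) Bayesian classifier;
# the figures reproduce the worked example above.
tokens   = ["MetaTools", "Malcolm", "Polfreman"]
p_title  = [0.99, 0.12, 0.02]   # probability of each token being a title word
p_author = [0.01, 0.88, 0.98]   # probability of each token being an author word

def best_parse(tokens, p_title, p_author):
    # Enumerate every split of the token stream into a title prefix and an
    # author suffix (the simple state machine PrePrint ::= Title+ Author+)
    # and keep the split whose joint probability, i.e. the product of the
    # per-token class probabilities, is greatest.
    best = None
    for i in range(len(tokens) + 1):          # i = number of title tokens
        score = prod(p_title[:i]) * prod(p_author[i:])
        if best is None or score > best[0]:
            best = (score, tokens[:i], tokens[i:])
    return best

score, title, author = best_parse(tokens, p_title, p_author)
print(score, title, author)
# Highest-scoring parse: title ['MetaTools'], author ['Malcolm', 'Polfreman'],
# probability approximately 0.854 (the second parse listed above).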
43. Confidence ratings reach their maximal and minimal values at 1 and 0.
44. Presentation on PaperBase by Henk Muller and Emma Tonkin at UKOLN, 22nd September 2006.
45. http://www.ariadne-eu.org/index.php?option=com_content&task=view&id=53&Itemid=83 . The <miValue> is the confidence rating.


Most importantly of all, there was buy-in from the participants, who unambiguously preferred the semi-automated system.46

Outcomes
The project succeeded in its aim of testing the quality of output of several prominent metadata generation tools (DC-dot, Data Fountains, Yahoo! Term Extractor, and SamgI) in relation to Websites (html) and scholarly works (pdf). Sufficient combinations of tests were conducted to be able to draw broad conclusions, although it was not possible to evaluate every tool in relation to every metadata element on account of the difficulty of managing the varied output from standalone tools. The published work of Tonkin and Muller provided useful points of comparison.

Unfortunately, the project did not manage to test KEA. It was felt that the auto-generation process would be much easier to conduct after KEA had been converted into a Web service, since it would be easier to manage the large training corpus that KEA requires and the various conversion programs at that stage. Difficulties in recruiting a software developer meant that the project had insufficient time to test KEA but, like the other tools, it can now be tested more easily and meaningfully now that repositories will be able to plug it into their own cataloguing environments.

Data Fountains, SamgI, and KEA were the three tools taken forward to Stage 3 of the project. This was not so much because they were seen as offering a particularly significant benefit to repositories: KEA was not even tested, and the usefulness even of Data Fountains is likely to be relatively limited. It was more to demonstrate the potential for using an ontology that provides metadata generation services with machine-interpretable semantic annotations, the value of which could be expected to increase as new tools (such as PaperBase, perhaps) emerge.

The methodology was successful in that it enabled the project to ascertain the quality levels of the various tools relative to one another in statistical terms, but it was flawed to the extent that it could not indicate their absolute value, i.e. the extent to which any of them would be useful within real-life workflows. Tonkin and Muller's results suggest that there are benefits to be had from seeding a Web form with metadata for scholarly works from PaperBase. However, PaperBase has not yet been publicly released and it is far from clear whether the much lower quality of metadata that Data Fountains generates (for Webpages and scholarly works) would be a help or a hindrance.

Although it was not in the initial project plan to use teams of cataloguers in blind tests to compare the efficiency of manual and hybrid methods of metadata creation, some informal tests of this sort were carried out (see Methodology above). The results suggested that any time/accuracy benefits in relation to cataloguing html resources would be, at best, minimal. It was clear that cataloguing Websites according to the Intute cataloguing standards is very different from author self-deposit of scholarly works. The former requires much more intellectual input, particularly in relation to the free-text summary, which is likely to be by far the most labour-intensive element to catalogue manually. For a scholarly work, it is likely that a tool will either have grabbed the author-supplied abstract or missed it altogether, in which case it is a simple task to cut and paste it from the pdf into the Web form. A Website, in contrast, never contains an author-supplied abstract and, such was the poor quality of the auto-generation process, the MetaTools cataloguer generally found it easier to completely re-write than to enhance the auto-generated summary.
Similarly, the keyphrases generated by Data Fountains for scholarly works were found by the cataloguer to be a moderately useful prompt in relation to dc:subject (even though about half were usually discarded), whereas those generated from html were of very doubtful benefit in this respect. However, this is just the personal experience of one individual. Clearly, more representative tests, similar to those of Tonkin and Muller, are necessary before firm conclusions can be drawn. The MetaTools test program makes more meaningful testing possible and points to some combinations of tools, document types, and metadata elements that might be worth testing within real-life workflows. Above all, the study suggests a direction for future research and development: future efforts ought to focus particularly on improving the extraction of metadata from scholarly works.

46. Tonkin & Muller, Semi Automated Metadata Extraction for Preprint Archives, pp. 162-166.


Stage 3: exposing metadata generation tools as Web services

Methodology


The third stage of the project involved developing a subset of the tools (Data Fountains, SamgI and KEA) into services with well-defined service interfaces, based on XML-based standards such as SOAP and WSDL. By using such standards, web services provide a mechanism for enabling interoperability between distributed applications. Typically, however, these standards enforce only weak or implicit typing of data and do not allow the semantics of web services to be represented, which restricts the potential for dynamic discovery and invocation of such services. One solution could have been to require web service definitions to be more strongly validated against a set of schemas, but this would be rather inflexible, and infeasible in an environment containing many service providers. The MetaTools solution was to use an ontology that allows the metadata generation services to be given machine-interpretable semantic annotations and descriptions, explicitly representing knowledge about the services in a flexible and extensible way. Given suitably extended registry functionality, these annotations would allow the dynamic discovery of appropriate services by software agents, which can invoke these services and integrate them into workflows. The project built upon several previous initiatives in semantic web service description and semantic registries, such as PANIC47, myGrid/Feta and Grimoires48. In particular, it examined existing web service description ontologies, such as OWL-S49, WSMO50 and the myGrid ontology51, with a view to extending them with semantics specific to metadata generation. To summarize, the MetaTools method was not to develop new metadata generation tools but to investigate, develop and test common interfaces for the various metadata generation services in order to leverage them fully and flexibly.

Implementation
The first task was to understand the metadata generation tools by installing and using all the standalone versions. Kea was easy to install and use, and was run via a small wrapper program. As Kea only extracts key phrases from text files, a number of programs were written to convert files such as PDF, MS PowerPoint and MS Word documents to plain text. It was noted that the Kea approach is to process files from a file system and to use existing files for training purposes so that key phrases can be extracted from a new file. The process of uploading resources to the server hosting the service therefore requires managing, and certain additional operations had to be added to the Web service, such as viewing the list of uploaded files, deleting a given file, and adding new files. The implemented Kea Web service contains the following four methods or operations:

extractKeyphrases
listTrainingFileNames
addTrainingFile
deleteTrainingFile
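To illustrate how such a service might be consumed, the sketch below calls the extractKeyphrases operation from Python using the third-party zeep SOAP library. It is illustrative only: the endpoint URL, the parameter name and the shape of the response are assumptions that would need to be checked against the deployed service's WSDL.

from zeep import Client

# Hypothetical endpoint: substitute the WSDL location of the deployed Kea service.
WSDL_URL = "http://localhost:8080/metatools/KeaService?wsdl"

client = Client(WSDL_URL)

# Read the plain-text version of the document to be described (Kea works on
# plain text, so PDF/Word files would first be converted as described above).
with open("article.txt", "r", encoding="utf-8") as f:
    document_text = f.read()

# The parameter name 'text' is illustrative; a generated client stub exposes
# whatever names the WSDL actually defines.
keyphrases = client.service.extractKeyphrases(text=document_text)
print(keyphrases)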

47. http://metadata.net/panic/
48. http://www.grimoires.org; http://www.mygrid.org.uk/index.php?module=pagemaster&PAGE_user_op=view_page&PAGE_id=57&MMN_position=64:51:63
49. http://www.daml.org/services/owl-s/
50. http://www.w3.org/Submission/WSMO/
51. http://www.mygrid.org.uk/


Note: the only configuration required when installing the Web service is to edit a file to specify the directory location in which to save the uploaded files.

The SamgI application package contained standalone and Web service versions of the tool, and also online and command-line clients. It was easy to use because it already had a Web service version with a method called getMetadata, and the provided Web service implementation was used. It contains setter and getter methods that allow the user to get and set the output formats that are supported (such as dc, oaidc, dcrdf, dc used for DSpace, lom) and the supported sources (e.g. file and url). The following operations are used:

setMetadataFormat
getMetadataFormat
setConflictHandlingMethod
getConflictHandlingMethod
getSupportedMetadataFormats
getSupportedConflictHandlingMethods
getSupportedMetadatasourceIds
setCheckRelatedMetadatasourceIds
getCheckRelatedMetadatasourceIds
getMetadata
getMetadataWithMergingInformation

Data Fountains was installed by following the requirements and installation instructions. A number of supporting applications had to be installed before it was possible to install the tool itself. After some trial and error, the tool was installed and its functions were evaluated. The main part of this tool is a program for metadata generation that can be run from a shell and provides options to generate each metadata element, such as title. A Web service interface was created for this program with the following operations:

iViaMetadataAssign
iViaMetadataAssignXML
assignTitle
assignDescription
assignAuthors
assignKeyPhrases
assignBroadSubjectCatagories

In real-life situations (when, as the project's test phase showed, metadata is often incomplete) the varied output of the tools could be problematic. For example, generated metadata in DC format that is incomplete could not be attached directly to a resource. Similarly, Kea was only able to extract key phrases, whereas Data Fountains could extract most of the required metadata elements in an XML format. The aim was not to alter the existing tools but to merge the results from all the tools into a common format. Programs were therefore also developed to map each of these outputs to a common DC-in-METS format as a proof of concept.
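A minimal sketch of the kind of mapping program described is given below. It assumes simplified, already-parsed outputs from the three tools held in Python dictionaries; the element names, the sample values and the METS/DC wrapper structure are illustrative rather than the project's actual schema choices.

import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
DC   = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("mets", METS)
ET.register_namespace("dc", DC)

def merge_to_mets_dc(outputs):
    # Merge per-tool outputs (dicts of DC element name -> list of values)
    # into a single METS record wrapping simple Dublin Core elements.
    mets = ET.Element(f"{{{METS}}}mets")
    dmd = ET.SubElement(mets, f"{{{METS}}}dmdSec", ID="DMD1")
    wrap = ET.SubElement(dmd, f"{{{METS}}}mdWrap", MDTYPE="DC")
    xml_data = ET.SubElement(wrap, f"{{{METS}}}xmlData")
    for tool, record in outputs.items():
        # The tool name could be recorded as provenance; it is ignored here
        # for brevity, and duplicate values are not merged.
        for element, values in record.items():
            for value in values:
                el = ET.SubElement(xml_data, f"{{{DC}}}{element}")
                el.text = value
    return ET.tostring(mets, encoding="unicode")

# Illustrative outputs: Data Fountains supplies several elements, Kea only
# keyphrases (mapped to dc:subject), SamgI a small DC record of its own.
outputs = {
    "datafountains": {"title": ["MetaTools Final Report"],
                      "description": ["Report on metadata generation tools."]},
    "kea":           {"subject": ["metadata generation", "Dublin Core"]},
    "samgi":         {"creator": ["Polfreman, Malcolm"]},
}
print(merge_to_mets_dc(outputs))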

Outputs and Results


Three metadata generation tools (Data Fountains, SamgI and Kea) have been exposed as Web services. This means that they are available for use over a network such as the Web with well-defined interfaces that are implementation-independent, i.e. the service interface (the what) is separated from its implementation (the how). These services will be available for use by clients without the clients having to be concerned with how the service will execute their requests. The Web services are loosely coupled, making them independent and capable of being reused. Web services are interoperable, meaning that clients and services can use different programming languages (C/C++, Python, Java, etc.). The clear definition of functionality enables the Web services to be consumed by clients in different applications or processes. For example, a programmer can easily generate client stubs from the interface description (WSDL) and use them within a cataloguing application.


These stubs can be incorporated either within a graphical user interface (GUI) or within a Web form. An online tool is available at http://www.soapclient.com/soaptest.html to automatically generate a web client for any given WSDL document.

A MetaTools system architecture has been implemented that consists of various components to describe, register and discover services. Web services are the preferred standards-based way to realize SOA (service-oriented architecture), an architectural style that uses a find-bind-execute paradigm. In this paradigm, service providers register their services in a public registry. This registry is used by consumers to find services that match certain criteria. If the registry has such a service, it provides the consumer with a contract and an endpoint address for that service.

The WSDL interfaces provide low-level definitions of inputs/outputs and operations that are machine processable (e.g. the type of input for an operation may be defined as a byte stream or a string). A service ontology was developed that contains these low-level definitions and maps them to higher-level semantic descriptions contained within a MetaTools ontology (e.g. the input for one of the tools can be either a Webpage_location or a local_file_location). The MetaTools ontology contains high-level, human-understandable terms relating specifically to metadata generation. These in turn can be part of a higher-level ontology that describes the preservation domain as a whole. It is flexible enough to expand as terms for new tools are added and can provide a complete domain ontology.

The PeDRo annotator provides a user interface by which users can produce the service descriptions. It provides access to the MetaTools ontology so that the user can select appropriate terms to describe a service, its operations and its input/output parameters. A proposal was also made to include a quality assessment of metadata output within the descriptions, but this was not implemented because of a shortage of time.

The service descriptions have been mapped to the Grimoires registry. A client interface allows the service descriptions to be published in the registry, and the registry can be searched for services using terms from the MetaTools ontology. It would be possible to use other open source UDDI clients as interfaces to the Grimoires registry, but Grimoires has the advantage of allowing the addition of supplementary metadata from the MetaTools ontology to the service descriptions: for example, file format, semantic type (i.e. what the parameter of an operation is), and so on. Functionality is provided so that the service descriptions may be stored in any relational or XML database or in RDF format. The Stage 3 technical report provides a detailed description of the MetaTools ontology and system architecture.
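To give a flavour of what such an annotation adds over the bare WSDL, the fragment below pairs one operation's low-level signature with higher-level terms of the kind the MetaTools ontology provides. It is purely illustrative: apart from Webpage_location and local_file_location, which are mentioned above, the term names are assumptions rather than the project's published vocabulary.

# Illustrative semantic annotation for one operation of the Data Fountains service.
assign_description_annotation = {
    "wsdl_operation": "assignDescription",
    "wsdl_input":  {"name": "resource", "xsd_type": "xsd:string"},     # low-level typing only
    "wsdl_output": {"name": "description", "xsd_type": "xsd:string"},
    "semantics": {
        "task":   "metatools:DescriptionGeneration",       # assumed term
        "input":  ["metatools:Webpage_location",           # terms mentioned in the report
                   "metatools:local_file_location"],
        "output": "metatools:dc_description_value",        # assumed term
    },
}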

Outcomes
The project achieved its aim of developing Web service interfaces for the metadata generation tools. Researchers, application developers and cataloguers should benefit from well-defined and implementation-independent interfaces that enable metadata generation tools to be plugged more easily into their applications and processes, and therefore tested under more realistic circumstances and, perhaps, used in earnest in certain circumstances. The MetaTools system is based on a service-oriented architecture that is applicable to any kind of process, not just metadata generation, and can therefore be easily incorporated within wider preservation workflows.

The experience of the MetaTools project suggests that a service-based model and a domain ontology will play a significant role in the preservation of digital resources, particularly where the preservation process contains interactive and executable elements. A domain ontology may also be crucial as a technique for describing basic concepts in the digital preservation and curation domains and for defining relations among them. It provides a common vocabulary for researchers who need to share information in the domain and to reuse this domain knowledge. The notion of semantic description of services for preservation metadata tools had its origins in projects such as PANIC52, using OWL-S. But PANIC only presented broad, generic functional descriptions of inputs and outputs for the preservation domain; e.g. getObject defines an atomic process that has input range digitalObject.
52. http://www.metadata.net/panic/index.html


To our knowledge, PANIC provides no means of describing the specific types of digital object (such as a jpeg image) that the functions support. In our approach, the terms in the MetaTools ontology allow the types and supported formats within service descriptions to be defined. Also, our service ontology is simple and general enough to be applicable to processes or tools in any application domain. The intention was to make a common service ontology that maps well onto WSDL and provides additional intuitive descriptions. This was achieved by choosing a bottom-up approach, i.e. using WSDL as the base from which to model the service ontology and, from there, the MetaTools ontology. The service ontology consists of properties such as contact information (adopted from the OWL-S ontology) and business information (from UDDI). For the sake of simplicity, and to save time, a decision was made to adopt properties from various models to form a coherent service ontology rather than use a generic one such as OWL-S. However, it may be possible to map the elements of our service ontology to an extended OWL-S ontology.

Conclusions
Websites (html)

Overall, Data Fountains was the most impressive of the tools at generating metadata from html, but its output for that digital format is still only likely to be of relatively limited benefit (at best) to portals. The output from html was frustrating because some good metadata was generated but there was no way to separate the wheat from the chaff. It had been hoped that time and accuracy savings would accrue from using metadata auto-generated from Websites to seed a cataloguing interface for subsequent amendment by an expert cataloguer. But the auto-generated metadata, at least for Websites, was judged to be of too poor a quality to make this likely. Enhancing the auto-generated output appears to be at least as much effort for a cataloguer as working from scratch, at least if (s)he is working according to the exacting standards of the Intute Cataloguing Guidelines. The decisive factor in relation to Websites (html) was the poor quality of the auto-generated summaries. Even the best summaries were either very brief, had an inappropriate focus or order, or made questionable claims or other value judgments about the quality of the resource. Such was the poverty of the generated metadata that it was not considered worthwhile to build a cataloguing interface and conduct tests with a larger group of cataloguers.

The unpredictability of Webpage layout and the absence of tags or any other means for easily identifying semantic meaning within the content of html appear likely to limit the scope for improvements in extraction from this document type. Extraction errors tend either to be systemic and affect all tools identically, or else the quality of metadata varies unpredictably from one item to the next depending upon quite micro-scale differences within the html. The gains to be had from invoking the highest-ranked tool for a given element and, if that is unable to generate a value, turning to the next best tool in turn are likely to be relatively modest.

Scholarly works (pdf)

The outlook for scholarly works, on the other hand, appears to be more positive because of the greater efficiency of the extraction process for that document type. Tonkin and Muller's real-life experiments with PaperBase suggest that, at least for scholarly works, a repository can benefit from using a bespoke tool to seed a cataloguing interface with auto-generated metadata upon ingest. PaperBase has yet to be released publicly, but the MetaTools tests suggest that it is worth developing Data Fountains (which was not designed primarily for this purpose) as a Web service so that, in the meantime, repositories might usefully experiment with seeding a cataloguing interface. The tests suggest that Data Fountains might help in this way for two metadata elements (abstract and keywords), although much less successfully than PaperBase.

MetaTools ontology

We presented a prototype service ontology and MetaTools ontology that facilitated service description and discovery. The design of the MetaTools ontology is sufficiently flexible to form part of an upper domain ontology and can contribute terms for various activities within the domain. The proposed MetaTools system is generic and has the potential to encourage institutions to describe, discover, coordinate, and share activities


other than metadata generation services. As services or tools evolve, it will be possible to integrate them by adding semantic descriptions, thereby updating the registry. To conclude, we believe that an ontology-aided approach to service discovery, as employed by the MetaTools project, is a practical solution that is adaptable to relevant domains such as digital preservation.

Implications
The service data model and ontology described in this report point a way forward that can be extended to other digital preservation and curation domains.

As a result of the MetaTools project, institutional repositories should be able to plug Data Fountains, the most promising of currently available tools for auto-generating resource discovery metadata, into their cataloguing environments and derive some immediate (although relatively modest) benefit from it. It should be possible to seed the usual cataloguer interface in this way with abstracts and keywords for perhaps a fifth to a quarter of scholarly works. (Most other Dublin Core elements can be generated, but unreliably.) For the first time, cataloguers will be able to experiment properly with auto-generated metadata within their own cataloguing environments; it was not previously possible to incorporate auto-generated resource discovery metadata from standalone tools within their normal workflows. Cataloguer perception of the usefulness of auto-generated output may be as important as statistical evaluation, and this can only truly be ascertained from cataloguers using tools in relation to their usual resources, processes, interfaces, and cataloguing and quality standards. Portals that catalogue Websites, such as Intute, may also wish to experiment, but the output from html is expected to be of fairly low quality.

The project may perhaps make a small but useful contribution to overcoming one of the main barriers to author self-archiving. Self-archiving of scholarly works has been slow to gain momentum, the common objection being that it is an extra task that puts an unnecessary burden on each researcher. As Carr and Harnad note, "In particular, the need to enter the extra bibliographic metadata demanded by repositories for accurate searching and identification is presumed to be a particularly onerous task." Tonkin and Muller have presented results that suggest an interface seeded with auto-generated metadata can encourage authors to submit metadata more conscientiously and to believe that it is a quicker process than before, perhaps because they feel that an effort has been made to help them. Whether cataloguers using auto-generated metadata provided by Data Fountains will feel the same way should be tested, in view of the poorer overall quality of its output. Data Fountains could probably make a small start in this direction, although major progress is only likely to come about once extraction methods based on a proper analysis of the specific structure and semantics of scholarly works are made public as Web services (e.g. as and when PaperBase and its REST API are distributed as open-access software). At that point it may be possible to auto-generate metadata of sufficient quality from the pre-print to populate a catalogue in a relatively raw but acceptable form. Once the final version is published, this metadata could perhaps then be used to search Web of Science, from which a high-quality metadata record could be obtained.

Whatever the strategy, the finding that metadata generation tools are invariably confused by the widespread practice of repositories inserting an institutional header/front page at the start of their scholarly work pdfs has workflow implications. For the process to be effective, automatic metadata generation must be undertaken before this point in the workflow.

Recommendations


JISC should encourage research into auto-generation methods that exploit the structural and syntactic features of scholarly works in pdf format, as exemplified by PaperBase, and strongly consider funding the development of tools in this direction.

Tonkin and Muller should be encouraged to expose PaperBase and the results of its trial program to external scrutiny, testing and validation.

JISC should encourage, and probably fund, the testing of metadata generation tools in real-life environments.

JISC should monitor the development of the KRIS and MATS metadata analysis tools and encourage repositories to experiment with their use.

As a prerequisite, JISC should encourage research identifying the combination of auto-generation and manual metadata creation that leads to the optimal trade-off between cost of production and effectiveness of retrieval. E.g. the cost of production and effectiveness of retrieval arising from various levels of manual and hybrid metadata could be compared, together with that of purely auto-generated metadata and retrieval via non-metadata solutions, such as Google.

The MetaTools ontology should be published and its use for describing and registering software encouraged, so that repositories and projects can identify the most appropriate tools for auto-generation tasks.

Future work should include extending the MetaTools ontology to include a quality or confidence rating for the generated metadata along the lines suggested in the Technical Report. This could perhaps ultimately lead to a certain degree of automation in the invoking of appropriate tools.

JISC should encourage the development of a shared Dublin Core syntax for describing the confidence level of auto-generated metadata. Data Fountains, SamgI and PaperBase each currently use a different syntax.

It is proposed that the MetaTools Web services should be released as part of the wider SOAPI suite of software.

References
Automatic Metadata Evaluation. iVia website, UC Riverside Libraries. Retrieved October 23, 2008, from: http://ivia.ucr.edu/projects/Metadata/Evaluation.shtml

Constantopoulos, Panos and Dritsou, Vicky. (2007). An ontological model for digital preservation. In the International Symposium in Digital Curation (DigCCurr2007), April 18-20, 2007, Chapel Hill, NC, USA.

Greenberg, J., Spurgin, K., & Crystal, A. (2006). Final Report for the AMeGA (Automatic Metadata Generation Applications) Project. Retrieved October 23, 2008, from: http://www.loc.gov/catdir/bibcontrol/lc_amega_final_report.pdf

Hassel, M. (2004). Evaluation of Automatic Text Summarization - A Practical Implementation. Licentiate Thesis, Department of Numerical Analysis and Computer Science, Royal Institute of Technology, Stockholm, Sweden. Retrieved October 23, 2008, from: http://www.csc.kth.se/~xmartin/papers/licthesis_xmartin_notrims.pdf

Hillmann, D., Dushay, N., & Phipps, J. (2004). Improving metadata quality: augmentation and recombination. Proceedings of the 2004 International Conference on Dublin Core and Metadata Applications. ISBN 7543924129.

Hovy, E. and Lin, C. (1996). Automated text summarization and the SUMMARIST system. In Proceedings of a Workshop held at Baltimore, Maryland, October 13-15, 1998. Association for Computational Linguistics, Morristown, NJ, 197-214. DOI: http://dx.doi.org/10.3115/1119089.1119121

Hunter, J. and Choudhury, S. (2004). A semi-automated digital preservation system based on semantic web services. In Proceedings of the 4th ACM/IEEE-CS Joint Conference on Digital Libraries (Tucson, AZ, USA, June 7-11, 2004). JCDL '04. ACM, New York, NY, 269-278. DOI: http://doi.acm.org/10.1145/996350.996415


Maynard, D. (2005). Benchmarking ontology-based annotation tools for the Semantic Web. UK e-Science Programme All Hands Meeting (AHM2005) Workshop "Text Mining, e-Research and Grid-enabled Language Technology", Nottingham, UK, 2005. Retrieved October 23, 2008, from: http://gate.ac.uk/sale/ahm05/ahm.pdf

Maynard, D., Peters, W., & Li, Y. (2006). Metrics for evaluation of ontology-based information extraction. In: WWW 2006 Workshop on Evaluation of Ontologies for the Web (EON), Edinburgh, Scotland. Retrieved October 25, 2008, from: http://gate.ac.uk/sale/eon06/eon.pdf

Mikroyannidis, Alexander, Ong, Bee, Ng, Kia, and Giaretta, David. (2007). Ontology-Driven Digital Preservation of Interactive Multimedia Performances. 2nd International Conference on Metadata and Semantics Research (MTSR 2007), 11-12 October 2007, Corfu, Greece.

Moen, W.E., Stewart, E.L., & McClure, C.R. (1998). Assessing metadata quality: Findings and methodological considerations from an evaluation of the U.S. Government Information Locator Service (GILS). In: ADL '98: Proceedings of the Advances in Digital Libraries Conference, Washington, DC, USA. IEEE Computer Society, 246.

Nichols, David M., Paynter, Gordon W., Chan, Chu-Hsiang, Bainbridge, David, McKay, Dana, Twidale, Michael B., and Blandford, Ann. (2008). Metadata tools for institutional repositories. Working Paper 10/2008, Working Paper Series, ISSN 1177-777X. Retrieved October 25, 2008, from: http://eprints.rclis.org/archive/00014732/01/PDF_(18_pages).pdf
Ochoa, X. & Duval, E. (200?). Towards Automatic Evaluation of Learning Object Metadata Quality. Retrieved October 23, 2008, from: http://ariadne.cti.espol.edu.ec/M4M/files/TowardsAutomaticQuality.pdf

Ochoa, X. & Duval, E. (2006). Quality Metrics for Learning Object Metadata. In E. Pearson & P. Bohman (Eds.), Proceedings of World Conference on Educational Multimedia, Hypermedia and Telecommunications 2006 (pp. 1004-1011). Chesapeake, VA: AACE. Retrieved October 25, 2008, from: http://ariadne.cti.espol.edu.ec/M4M/files/QM4LOM.pdf

Ochoa, X. & Duval, E. (2006). Towards Automatic Evaluation of Learning Object Metadata Quality. Advances in Conceptual Modeling - Theory and Practice, 4231, 372-381.

Paynter, G. W. (2005). Developing practical automatic metadata assignment and evaluation tools for internet resources. In Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (Denver, CO, USA, June 7-11, 2005). JCDL '05. ACM, New York, NY, 291-300. DOI: http://doi.acm.org/10.1145/1065385.1065454

Sparck Jones, Karen & Galliers, Julia R. (1995). Evaluating Natural Language Processing Systems: An Analysis and Review. Lecture Notes in Artificial Intelligence 1083. Springer, Berlin, xv + 228 pp. ISBN 3-540-61309-9.

Stvilia, B., Gasser, L., Twidale, M., Shreeves, S., & Cole, T. (2004). Metadata quality for federated collections. In: Proceedings of ICIQ04 - 9th International Conference on Information Quality, Boston, MA, 111-125. Retrieved October 25, 2008, from: http://www.isrl.uiuc.edu/~gasser/papers/metadataqualitymit_v4-10lg.pdf

Tonkin, E. & Muller, H. (2008a). Keyword and metadata extraction from pre-prints. Proceedings of the 12th International Conference on Electronic Publishing, Toronto, Canada, 25-27 June 2008. Edited by Leslie Chan and Susanna Mornati. ISBN 978-0-7727-6315-0, pp. 30-44. Retrieved October 23, 2008, from: http://elpub.scix.net/data/works/att/030_elpub2008.content.pdf

Tonkin, E. and Muller, H. L. (2008b). Semi automated metadata extraction for preprints archives. In Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries (Pittsburgh, PA, USA, June 16-20, 2008). JCDL '08. ACM, New York, NY, 157-166. DOI: http://doi.acm.org/10.1145/1378889.1378917


van Halteren, H. and Teufel, S. (2003). Examining the consensus between human summaries: initial experiments with factoid analysis. In Proceedings of the HLT-NAACL 03 Text Summarization Workshop (Volume 5), Human Language Technology Conference. Association for Computational Linguistics, Morristown, NJ, 57-64. DOI: http://dx.doi.org/10.3115/1119467.1119475


MetaTools stage 1 report. Test framework: A methodology for evaluating metadata generation tools
TABLE OF CONTENTS

INTRODUCTION ..... 2
INTRINSIC METRICS ..... 5
    Metrics for an overview ..... 5
        Completeness ..... 5
    Metrics for categorical elements (e.g. dc:identifier) ..... 5
        Exact Match Accuracy ..... 6
    Metrics for Multi-value fields (e.g. dc:subject) ..... 6
        Precision ..... 6
        Recall ..... 7
        F-score ..... 7
        False positives ..... 8
        Error rate ..... 8
    Metrics for Free-Text elements (e.g. dc:description) ..... 9
        Compression Ratio ..... 9
        Summary length ..... 9
        Retention Ratio ..... 10
        Content Word Precision ..... 10
        Content Word Recall ..... 11
        ROUGE ..... 11
        BLEU ..... 12
        Vocabulary Test ..... 12
        Conformance to Expectation Metric based on Textual Information ..... 13
        Summary Coherence ..... 14
        Sentence Rank ..... 15
        Utility Method ..... 15
        Factoid Analysis ..... 15
        Likert Scale ..... 16
EXTRINSIC METRICS ..... 18
        The Expert Game ..... 18
        The Shannon Game ..... 18
        The Question Game ..... 19
        The Classification Game ..... 19
        Keyword Association ..... 20
        Query-based methods ..... 20
    Cost-based metrics ..... 21
        Learning Accuracy ..... 21
        Cost-based evaluation ..... 21
        Time saved ..... 22
A MENU OF METRICS ..... 24
REFERENCES ..... 26


INTRODUCTION

The MetaTools test framework builds upon the AMeGA report's Recommended Functionalities for Automatic Metadata Generation Applications (Greenberg, 2006). The AMeGA report was the first major survey of automated metadata generation techniques and was sponsored by the Library of Congress. The Recommended Functionalities are a detailed list of attributes that should be taken into consideration when developing metadata generation software, e.g. the format(s) that are supported, the extraction method(s), elements generated, output format(s), encodings and bindings, standard vocabularies, configurability, ease of use, licensing constraints, and so on. They are listed under the following broad headings:

System Goals
General System Recommendations
System Configuration
Metadata Identification/Gathering
Support for Human Metadata Generation
Metadata Enhancement/Refinement and Publishing
Metadata Evaluation
Metadata Generation for Non-textual Resources

Although intended for developers1, the Recommended Functionalities could be used as a checklist by anyone structuring an evaluation program, with weightings and scoring applied according to the requirements of any given repository. There is one respect, however, in which the Recommended Functionalities are deficient. As the title implies, their focus is the functionality of tools. Accordingly, Section 7 (Metadata Evaluation) says relatively little about how to measure the quality of the output. The current report aims to make up for this deficiency by identifying a range of intrinsic and extrinsic metrics (standard tests or measurements) to capture the key attributes of good metadata in all their complexity.

First it is necessary to define what is meant by quality. Quality is perhaps best thought of as a measure of fitness for a task. A widely accepted view is that the purpose of metadata is to help the user find, identify, select and obtain resources. It follows, therefore, that the quality of a metadata record is proportional to how well it fulfils those tasks. Several attempts have been made to place quality factors into a theoretical framework. Moen et al (1998) identified 23 quality parameters, but some (such as ease of use, ease of creation, protocols, etc.) are more focused on things that are not the concern of the current study, such as the quality of the metadata standard or the metadata generation tool. Bruce and Hillmann (2004) have, perhaps more usefully, distilled the more relevant parameters into seven broad metadata quality criteria: completeness, accuracy, provenance, conformance to expectations, logical consistency and coherence, timeliness, and accessibility. Their list was not drawn up with auto-generated metadata specifically in mind, and some of their criteria are more relevant to our purposes than others.

Completeness seems entirely relevant. Completeness is the degree to which the metadata record contains all the information needed to have a comprehensive representation of the described object (i.e. does it contain all the required metadata elements?).

Accuracy is also highly relevant to our purposes. Accuracy measures the semantic distance between the information in the metadata record and the information within the resource.
1. The motivation for the AMeGA report was that Applications [have tended to be] developed in isolation, failing to incorporate previous as well as new advances, partly because of the absence of standards or recommended functionalities guiding the development of metadata generation applications.

A shorter distance implies higher-quality metadata.


In other words, it is a measure of whether the metadata values are correct. Much of the MetaTools framework is concerned with automating the measurement of accuracy in one way or another. The artificial intelligence algorithms for this are often very complex but, while humans can discern the accuracy of a metadata record with relative ease, the continuous testing of tools by humans would be a resource-intensive business.

Provenance has some relevance. It measures the reputation that a metadata record has in a community: the subjective perception that a user has about the origin of the metadata. A comprehensive evaluation framework for auto-generation tools must, of course, measure the completeness and accuracy of the output statistically, but it must also consider user perceptions of it. E.g. it may be that users (e.g. repositories that harvest metadata by OAI-PMH) trust manually created metadata, or trust metadata from certain repositories or auto-generation tools and not others, even in the absence of much hard evidence.

Conformance to expectations is concerned with the degree to which the metadata record fulfils the requirements of a given community of users for a given task (Ochoa & Duval, 200?). It is not enough for a metadata record only to be complete (i.e. contain all the required fields) and accurate (i.e. for the metadata not to be incorrect). The amount of information contained within the record should be enough to identify and describe the resource. E.g. auto-generated summaries (dc:description) need to contain sufficient unique information to enable the user to know what the resource is about, how it differs from other resources, and whether it is of interest to them.

Logical consistency and coherence are closely related to the issue of accuracy. Logic dictates that certain combinations of metadata values are required for a record to maintain internal consistency and coherence. E.g. dc:created cannot contain a date that is later than the date in dc:issued or dc:available. Auto-generated metadata should ideally be evaluated for its logical consistency and coherence.

Timeliness, the degree to which a metadata record remains current among a certain community over the course of time, appears to be less relevant because we are interested in the quality of output from the auto-generation process rather than how well the output is maintained.

Accessibility is probably more relevant to the issue of resource sharing than to the evaluation of auto-generation software (Bruce and Hillmann's interest in metadata quality was motivated, in part, by a desire to overcome the obstacle that poor-quality metadata presents to widespread resource sharing by OAI-PMH). Accessibility relates to obstacles which impede or prevent access to and use of the metadata. These may be physical, such as when metadata is held within obsolete, unusual or proprietary file formats that can only be read with special equipment or software. Or they may be intellectual, such as when a publisher refuses to make its metadata available for harvesting on the grounds that it has a commercial value (Ochoa & Duval, 200?).

Bruce and Hillmann's broad evaluation criteria have informed the practical evaluation framework presented below, although a different arrangement has been preferred that has a more practical orientation. Intrinsic metrics are considered first.
Intrinsic metrics assume that the quality or usefulness of metadata can be inferred objectively by quantifying or ranking significant features or dimensions that are identifiable within the metadata record itself (i.e. in relation to some metadata gold standard, which, in practice, normally means in comparison with manually-created metadata for the same resource). This approach has the advantages of being relatively easy, quick and cheap. Consequently, developers have mostly used metrics such as precision, recall and F1 score (all apparently measured using bespoke, in-house software) during the iterative processes of tool development. They are arranged below according to the broad categories of metadata element for which they are deemed suitable. These are, in turn:

1) metrics that provide an overview of the metadata record as a whole
2) metrics for simple categorical elements (e.g. dc:identifier, dc:format, dc:type, dc:language)
3) metrics for complex, multi-value elements (e.g. dc:subject, dc:title)


4) metrics for free-text elements (e.g. dc:description), and
5) cost-based metrics.

Extrinsic evaluation methods are considered immediately after that because they are particularly pertinent for free-text summaries. Extrinsic metrics, in contrast, measure the efficiency and acceptability of the auto-generated metadata in some task, such as relevance assessment or reading comprehension (Sparck Jones & Galliers 1995). Finally, a menu or summary of the metrics is presented in the form of a table.

The purpose of the current study is to identify practical ways in which the quality of the auto-generated information can be measured, rather than to evaluate the functionality or ease of use of the extraction tool, the quality of the resource itself, or the quality of the metadata standard.

A debt should also be acknowledged to Ochoa and Duval (Ochoa & Duval, 2006), who have developed several automated quality metrics based on Bruce and Hillmann's categorisation, some of which are included within the framework or referred to at appropriate points. Other metrics are derived from the work of Gordon W. Paynter's Infomine/Data Fountains project at the University of California, Riverside (Paynter, 2005; Automatic Metadata Evaluation, 200?). The work of Emma Tonkin and Henk L. Muller (2008a and 2008b) at Bath and Bristol, respectively, should also be acknowledged, as should that of Diana Maynard at Sheffield (Maynard, 2005) and of Eduard Hovy.


INTRINSIC METRICS

Metrics for an overview

1 Completeness

Definition/Description: The degree to which the metadata record contains all the information needed to have a comprehensive representation of the described object.
Method: A baseline impression of completeness is obtainable by validating the XML output from a tool against its schema and noting the percentage of mandatory metadata elements that are empty or missing [2]. Completeness is generally recorded as a percentage (i.e. a record is x% complete).
Uses: Used to provide an overview of the metadata record as a whole rather than a specific element.
Advantages: Completeness is perhaps the most fundamental attribute of good metadata. It is straightforward and easy to automate. The process can be a springboard to metadata enhancement or weeding programs, whether manual or automated.
Disadvantages: The completeness metric differs a little from how humans measure the completeness of a record. E.g. a human assessor is likely to assign a higher degree of completeness to a metadata record that has a value for, say, title but not for extent than to one that contains extent but no title. This problem can be mitigated somewhat by applying a weighting factor to the presence or absence of each metadata element to reflect its relative importance in comparison with other fields. The weightings could be based upon information from Web logs showing the relative frequency with which each element has been used in user queries issued to the repository [3]. A bigger disadvantage is that metadata quality is related at least as much to accuracy (the correctness of values) as to their presence or absence (completeness).
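As a rough illustration, weighted completeness needs only a few lines of code. The sketch below (Python) assumes a Dublin Core record held as a simple dictionary; the element list and weights are invented for illustration and would, in practice, come from the repository's own application profile and query logs.

REQUIRED_WEIGHTS = {          # illustrative application-profile weights, not from the report
    "dc:title": 0.30,
    "dc:creator": 0.25,
    "dc:subject": 0.20,
    "dc:description": 0.15,
    "dc:format": 0.10,
}

def weighted_completeness(record: dict) -> float:
    """Return completeness in [0, 1]; a multi-valued field counts if any value is non-empty."""
    total = sum(REQUIRED_WEIGHTS.values())
    present = 0.0
    for element, weight in REQUIRED_WEIGHTS.items():
        values = record.get(element) or []
        if isinstance(values, str):
            values = [values]
        if any(v.strip() for v in values):
            present += weight
    return present / total

if __name__ == "__main__":
    record = {"dc:title": "Introduction to Java objects", "dc:subject": ["java", "oop"]}
    print(f"{weighted_completeness(record):.0%} complete")   # 50% with the weights above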

An array of metrics is required for evaluating accuracy (correctness) because the range of the expected values and their characteristics are likely to vary from one metadata element to the next. All of the following metrics are, in one way or another, concerned with correctness rather than completeness.

Metrics for categorical elements (e.g. dc:identifier)

[2] As Ochoa and Duval explain: "While most metadata standards (as DC or LOM) define all of their fields as optional, a working definition of the ideal representation could be considered as the mandatory and suggested fields defined by a community of use in its application-profiles ... A first approach to assess the completeness of a metadata record will be to count the number of fields that contain a non-null value. In the case of multi-valued fields, the field is considered complete if at least one instance exists."
[3] Ochoa and Duval (p. 6).


Categorical elements, such as DC Identifier, DC Language and DC Type, which normally assign a single value from a controlled vocabulary, provide the simplest kind of evaluation scenario. Exact Match Accuracy is suitable for them.

2 Exact Match Accuracy

Definition/Description: EMA is the proportion of times that the automatically generated value exactly matches the expert cataloguer's assignment for the element (after simple normalisations [4]).
Alternative name: EMA
Uses: Suitable for the simplest kind of evaluation scenario, which involves metadata elements, such as DC Identifier, DC Language and DC Type, for which just a single value from a controlled vocabulary is normally assigned.
Advantages: Easy to calculate and easy to automate. It is unambiguous when used in the correct situation, which is one where there is no grey area between correct and incorrect values.
Disadvantages: EMA is not appropriate for more complicated, multiple-value fields (e.g. DC Subject) because it does not distinguish between cases in which one value or instance is correct and another is incorrect. Under such circumstances, EMA would record that there is no match at all.
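A minimal sketch of EMA, assuming the generated and reference values are available as parallel lists of strings and that lower-casing and whitespace collapsing are sufficient "simple normalisations" for the element in question:

def normalise(value: str) -> str:
    # Simple normalisations: lower-case and collapse internal/edge whitespace
    return " ".join(value.lower().split())

def exact_match_accuracy(generated: list[str], reference: list[str]) -> float:
    """Proportion of records whose generated value exactly matches the cataloguer's value."""
    assert len(generated) == len(reference)
    matches = sum(normalise(g) == normalise(r) for g, r in zip(generated, reference))
    return matches / len(reference)

print(exact_match_accuracy(["en", "PDF ", "text"], ["EN", "pdf", "image"]))   # 0.666...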

Metrics for Multi-value fields (e.g. dc:subject)


Multi-value fields, such as DC Subject, are more complicated. For these elements, Precision, Recall, F-score and False Positives/Error rate are more appropriate.

3 Precision

Definition/Description: Precision may be defined as the number of correctly generated units as a proportion of the number of generated units. It reaches its best value at 1 and its worst at 0. A low precision score suggests a lot of noise (i.e. that many spurious terms are also generated).
Alternative name: Subfield precision
Uses: Useful for multiple-value fields, such as dc:subject, particularly those that use a controlled vocabulary.
Advantages: Precision is more sophisticated than EMA in that it splits multiple-value fields into their individual "subfields" so that each unit (e.g. keyword or title word) is matched individually. Should be relatively easy to automate.
Disadvantages: While useful, precision should be used with care because it can give a correct but misleading representation. Keyword precision, for instance, can be maximised artificially by a strategy of returning just a single word (the single most frequently occurring word within the source text after stop words have been removed), but the result is an absurdity because of the unacceptably low level of recall that this implies. In addition, while straightforward, precision does not necessarily reflect how humans measure the precision of a metadata element: in many situations humans will consider synonyms (e.g. football and soccer) to be a match, whereas the Precision metric does not.

[4] e.g. transforming both the extracted and reference metadata sets to lower case or removing white space, so that non-substantive differences between the two sets do not skew the results.

4 Recall

Definition/Description: Recall is defined as the number of matching units (i.e. correctly generated terms) divided by the number of correct units in the reference set of documents (e.g. terms supplied by the expert cataloguer). It reaches its best value at 1 and its worst at 0.
Alternative name: Subfield recall
Uses: Multiple-value fields, particularly those that use a controlled vocabulary, are more usefully measured by Precision and Recall [5] than by EMA.
Advantages: Like Precision, Recall is more sophisticated than EMA in that it splits multiple-value fields into their individual "subfields" so that each unit (e.g. keyword or title word) is matched individually. Recall measures the extent to which all of the correct values have been generated, regardless of whether they are accompanied by noise from incorrect values. Should be relatively easy to automate.
Disadvantages: While useful, recall should be used with care because it can give a correct but misleading representation. E.g. a strategy of extracting all of the words from the resource as keywords will ensure 100% recall, but at very low levels of precision. Also, like precision, recall does not record synonyms as a match.

5 F-score

Definition/Description: F-score provides a weighted average of precision and recall. It reaches its best value at 1 and its worst at 0. It is defined as:

    F1 = 2 x (precision x recall) / (precision + recall) [6]

Alternative name: F-measure
Uses: Use for multiple-value fields, particularly those that use a controlled vocabulary, for which Precision and Recall [7] are already used.
Advantages: It is desirable to include the F1 within a testing program because of its realistic trade-off between recall and precision, both of which are important considerations in practice. It tends towards its target value (of 1) only when both the precision and recall are good. The metric is very flexible because the weighting given to precision and recall respectively can be altered according to the requirements of the repository.
Disadvantages: The weighting can make the score a little arbitrary.

[5] Sometimes called subfield precision/SFP.
[6] Van Rijsbergen 1979.
[7] Sometimes called subfield precision/SFP.
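The three metrics above are easily automated together. The following sketch treats each multi-value field as a set of normalised subfield values; the example keyword sets are invented purely for illustration:

def subfield_metrics(generated: set[str], reference: set[str]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for a multi-value field such as dc:subject."""
    matched = generated & reference
    precision = len(matched) / len(generated) if generated else 0.0
    recall = len(matched) / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

gen = {"metadata", "dublin core", "repositories", "xml"}
ref = {"metadata", "dublin core", "automatic indexing"}
print(subfield_metrics(gen, ref))   # (0.5, 0.666..., 0.571...)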

6 False positives

Definition/Description: A false positive is a value incorrectly identified as a positive result, e.g. an assigned (i.e. auto-generated) keyword that is found to be incorrect. A constant, c, that is independent of document richness should be applied to allow for valid comparison between documents of varying textual richness; e.g. c may be the number of tokens or sentences in the document. The False Positive metric is formally defined as:

    FalsePositive = (number of spurious values) / c

The target value for false positives is 0.
Uses: Use this metric when testing a single auto-generator's performance in relation to various document types, because in this situation the relative document richness of the documents can be crucial. Relative document richness is the relative number of entities or annotations of each type to be found in a set of documents. For instance, a text that contains 100 instances of an entity (e.g. personal names that equate to dc:creator) has a greater document richness for that element than a text of the same length that contains only one such entity. As Maynard, Peters and Yaoyong Li point out, a single error will impact on Precision to a much greater extent where, as above, the total number of correct entities is 1 than where the total is 100. Assuming that the document lengths are the same, the false positive score for the two texts in the above example should be identical despite their misleadingly divergent levels of Precision.
Advantages: Unlike precision, recall and F-score, the false positives metric is not affected by relative document richness.
Disadvantages: When comparing the performance of multiple auto-generation tools in relation to a single document set, this metric is not necessary. In this situation, relative document richness is not a factor because it is the same for each tool.

7 Error rate

Definition/Description: Error rate may be defined as the number of incorrectly generated units as a proportion of the number of generated units. It is the complement of Precision (i.e. error rate = 1 - precision). The target value is 0 and the worst value is 1.
Uses: Error rate can be used as an alternative to false positives.
Advantages: Error rate has the advantage over False Positives that it does not require an arbitrary constant, which can make the results hard to interpret and can skew comparison between documents of different lengths.
Disadvantages: F-measure is based upon the scores for Precision and Recall and so cannot be calculated if error rate is used instead of precision.
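Both metrics reduce to a couple of divisions once the spurious values have been counted (the part that still depends on a gold standard or human judgement). A sketch, using token count as the illustrative richness-independent constant c:

def false_positives(spurious: int, c: int) -> float:
    # Number of spurious values divided by a richness-independent constant c
    # (c might be the token or sentence count of the document)
    return spurious / c

def error_rate(spurious: int, generated: int) -> float:
    # Incorrectly generated units as a proportion of all generated units (1 - precision)
    return spurious / generated if generated else 0.0

# Two documents of equal length (2,000 tokens): one contains 100 personal names, the other only 1.
# One wrong name in each gives identical false-positive scores but very different error rates.
print(false_positives(1, 2000), false_positives(1, 2000))   # 0.0005 0.0005
print(error_rate(1, 101), error_rate(1, 2))                 # ~0.0099 vs 0.5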

Metrics for Free-Text elements (e.g. dc:description)


Free-text fields, such as DC Description and DC Title, are more difficult to evaluate than even multi-value fields, such as DC Subject. Indeed, in the words of Hans van Halteren and Simone Teufel, "It is an understatement to say that measuring the quality of summaries is hard. In fact, there is unanimous consensus in the summarisation community that evaluation of summaries is a monstrously difficult task" [8]. What exactly makes a summary beneficial is an elusive property, but a thorough testing program should probably involve several of the following metrics and methods.

Automated metrics (for free-text elements)

8 Compression Ratio

Definition/Description: The compression ratio measures how much shorter the summary is than the original. It is the length of the generated summary divided by the length of the source text:

    Compression Ratio (CR) = (length of summary) / (length of full text)

Uses: Use for summaries (e.g. dc:description or dc:abstract). The compression ratio is not, of itself, an indication of quality, but it is an important consideration when selecting (or, more likely, developing) a tool.
Advantages: Information theory suggests that the information content of a summary is likely to be inversely correlated with its Compression Ratio. It is therefore important to ensure that a summarising tool uses a suitable compression ratio.
Disadvantages: Compression ratio is not a metric that should be used in isolation. A tool that uses a constant compression ratio may generate abstracts of widely varying length because of the variable length of the source documents.

9 Summary length

Definition/Description: Summary length can be measured by counting the number of letters, words, or sentences in the summary. The key statistics to calculate are the mean length and the standard deviation.
Uses: Use for summaries (e.g. dc:description or dc:abstract).

[8] Halteren, Hans van, and Teufel, Simone. 2003. Examining the consensus between human summaries: initial experiments with factoid analysis. In Proceedings of the HLT-NAACL 03 Text Summarization Workshop, 57-64.


Advantages: Summary length is important to monitor because consistency of description contributes to the degree of confidence in an online catalogue. Tools that use tag harvesting, or a mixture of natural language processing and tag harvesting, have an obvious potential for variability because of the variability of author-supplied metatags. This metric is easy to understand and easy to automate.
Disadvantages: Summary length is probably not the most important attribute of a good summary to measure, although it is significant.
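Compression ratio and the summary-length statistics can be computed directly from the texts. The sketch below measures length in words; letters or sentences could be substituted, and the sample texts are invented:

import statistics

def compression_ratio(summary: str, full_text: str) -> float:
    # Length of the generated summary divided by the length of the source text (in words)
    return len(summary.split()) / len(full_text.split())

def length_profile(summaries: list[str]) -> tuple[float, float]:
    # Mean and standard deviation of summary length (in words) across a test set
    lengths = [len(s.split()) for s in summaries]
    return statistics.mean(lengths), statistics.pstdev(lengths)

summaries = ["A short generated abstract.",
             "Another, slightly longer, generated abstract of the resource."]
print(compression_ratio(summaries[0], "word " * 200))   # 4 words / 200 words = 0.02
print(length_profile(summaries))                        # mean and spread of summary lengths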

10 Retention Ratio

Definition/Description: The retention ratio [9] measures how much information from the source text is retained in the summary:

    Retention Ratio (RR) = (information in summary) / (information in source full text)

Alternative name: Omission Ratio; Summary Informativeness
Uses: Use to evaluate summaries (e.g. dc:description or dc:abstract). This metric is essentially notional: in practice it is approximated by more specific metrics (outlined below), each of which involves comparing the generated summary with the text being summarised in an effort to assess how much information from the source is preserved in the condensation. Alternatively, the generated summary can be compared with a manually-catalogued reference summary, although that is likely to be resource-intensive to produce.

11 Content Word Precision

Definition/Description: Content Word Precision is the proportion of unique content words in the automatically-assigned metadata that are also present in the expert-assigned metadata. It reaches its best value at 1 and its worst at 0. A low CWP score suggests that the auto-generated summary is talking about things that are not mentioned in the expert-assigned summary.
Alternative name: CWP
Uses: Use to evaluate summaries (e.g. dc:description or dc:abstract).
Advantages: By automating the process of measuring the closeness of information content between source and summary, this metric brings advantages of efficiency and impartiality. CWP compares the automatically-generated and expertly-assigned metadata by analysing the set of unique content words in each passage (ignoring case). Content words are the words that remain after common stop words, such as "of" and "the", have been removed [10].
Disadvantages: CWP should be used with care because it can give a correct but misleading representation. The content word precision can be maximised artificially by a strategy of returning just the single most relevant sentence as a summary. While straightforward, content word precision only focuses on one facet and so does not necessarily reflect how humans measure the suitability of a summary. E.g. it is possible for two summaries to share a large number of unique content words and yet have a different meaning. Conversely, in many situations humans will consider synonyms (e.g. football and soccer) to be a match, whereas the metric does not. CWP should ideally be used with metrics that measure other facets, such as summary coherence.

[10] A stemming process will also normally be applied to both sets of metadata to ensure that superficial differences, such as between singular and plural noun forms, do not skew the results.

12 Content Word Recall

Definition/Description: Content-word recall is the proportion of unique content words from the expert-assigned metadata that also appear in the automatically-assigned metadata. It reaches its best value at 1 and its worst at 0. A low CWR score suggests that most of the things that are said in the expert-assigned summary are not mentioned in the auto-generated equivalent.
Alternative name: CWR
Uses: Use to evaluate summaries (e.g. dc:description or dc:abstract).
Disadvantages: CWR should be used with care because, like CWP, it can give a correct but misleading representation. CWR can be maximised artificially by a strategy of returning a very large number of sentences as a summary. The other disadvantages are the same as for Content Word Precision: while straightforward, content word recall only focuses on one facet and so does not necessarily reflect how humans measure the suitability of a summary. E.g. it is possible for two summaries to share a large number of unique content words and yet have a different meaning. Conversely, in many situations humans will consider synonyms (e.g. football and soccer) to be a match, whereas the metric does not. CWR should ideally be used with metrics that measure other facets, such as summary coherence.
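CWP and CWR can be automated with a stop-word list (and, ideally, a stemmer, which is omitted here for brevity). The toy stop list and example summaries below are illustrative only:

STOP_WORDS = {"of", "the", "a", "an", "and", "in", "to", "is", "for"}   # toy stop list

def content_words(text: str) -> set[str]:
    # Unique words remaining after punctuation stripping and stop-word removal, ignoring case
    return {w.strip(".,;:") for w in text.lower().split()} - STOP_WORDS

def content_word_scores(generated: str, expert: str) -> tuple[float, float]:
    # Returns (CWP, CWR) for a generated summary measured against an expert-assigned one
    gen, ref = content_words(generated), content_words(expert)
    shared = gen & ref
    cwp = len(shared) / len(gen) if gen else 0.0
    cwr = len(shared) / len(ref) if ref else 0.0
    return cwp, cwr

print(content_word_scores(
    "The project evaluates metadata generation tools.",
    "An evaluation of tools for the automatic generation of metadata."))   # (0.6, 0.6)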

13 ROUGE

Definition/Description: ROUGE counts the number of n-grams shared between the generated summaries and a set of human-created reference summaries [11].
Alternative name: Recall-Oriented Understudy for Gisting Evaluation
Uses: Use to evaluate summaries (e.g. dc:description or dc:abstract). A more sophisticated technique for evaluating abstracting or extracting tools than whole-word-based approaches such as CWP and CWR.
Advantages: N-grams are more efficient than the full-word approaches of content-word precision and content-word recall at matching related but different words, even when the latter use stemming. E.g. "community" and "communal", which share the same lexical root, would be considered non-matches by full-word precision and recall systems, whereas ROUGE would (probably correctly) consider them a match because they share two 5-grams, "commu" and "ommun".
Disadvantages: It is unlikely that repositories within the JISC community would be in a position to implement ROUGE. Its use requires specialist knowledge, sophisticated algorithms, and heavy-weight computing capabilities. This metric is more for academic researchers in the field of information retrieval.
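For orientation only, the core idea behind ROUGE-style scoring (n-gram overlap with a reference summary) can be illustrated in a few lines. This is a simplified sketch, not the official ROUGE implementation, which adds count clipping, multiple references and several scoring variants:

def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    # The set of word n-grams in a token sequence
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def rouge_n_recall(generated: str, reference: str, n: int = 2) -> float:
    # Fraction of the reference summary's n-grams that also occur in the generated summary
    ref = ngrams(reference.lower().split(), n)
    gen = ngrams(generated.lower().split(), n)
    return len(ref & gen) / len(ref) if ref else 0.0

print(rouge_n_recall("the police arrested a white dutch man",
                     "the police have arrested a white dutch man"))   # 5 of 7 bigrams, ~0.71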

14 BLEU

Definition/Description: BLEU is a similar system to ROUGE. The size of the n-grams used by BLEU is also configurable; BLEU-n uses 1-grams through n-grams.
Uses: Use to evaluate summaries (e.g. dc:description or dc:abstract).
Advantages: Lin and Hovy believe that both ROUGE and BLEU correlate reasonably well with human judgements of summary quality, and the summarisation community (i.e. the developers of summarisation algorithms) has now accepted these metrics as a credible and less time-consuming alternative to manual summary evaluation [12].
Disadvantages: It is unlikely that repositories within the JISC community would be in a position to implement BLEU. Like ROUGE, its use requires specialist knowledge, sophisticated algorithms, and heavy-weight computing capabilities. This metric is more for academic researchers in the field of information retrieval.

15 Vocabulary Test

Definition/Description: Metrics for content similarity, such as the Vocabulary Test, compare term-frequency vectors calculated over stemmed or lemmatised summaries (extraction-based or true abstracts) and reference summaries. Precision is in this case defined as the number of sentences in the generated summary that are present in the reference summary.
Alternative name: VT
Uses: Use to evaluate the semantic content in both extraction-based summaries and true abstracts (e.g. dc:description or dc:abstract).
Advantages: Controlled thesauri and synonym sets created with Latent Semantic Analysis (Landauer et al. 1998) or Random Indexing (Kanerva et al. 2000, Sahlgren 2001) can be used to reduce the terms in the vectors by combining the frequencies of terms deemed synonymous, thus allowing for greater variation among summaries. This is a more sophisticated approach than Content Word Precision and Content Word Recall and is especially useful when evaluating abstracted summaries.
Disadvantages: These methods are quite sensitive to negation and word-order differences.

[11] For a sequence of words (for example "the civilisation of the Renaissance in Italy"), the word trigrams would be: "the civilisation of", "civilisation of the", "of the Renaissance", "the Renaissance in", and "Renaissance in Italy". For sequences of characters, the 3-grams ("trigrams") would be "the", "he ", "e c", " ci", "civ", "ivi" and so on.
[12] http://www.cs.sfu.ca/~anoop/students/agattani/agattani_msc_project.pdf, pp. 15-16.
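The underlying comparison of term-frequency vectors is usually expressed as a cosine similarity. A bare-bones sketch, without the stemming, lemmatisation or synonym-set reduction described above, and with invented example strings:

import math
from collections import Counter

def cosine_similarity(summary_a: str, summary_b: str) -> float:
    # Cosine of the angle between the two summaries' raw term-frequency vectors
    va, vb = Counter(summary_a.lower().split()), Counter(summary_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity("metadata quality evaluation tools",
                        "tools for the evaluation of metadata quality"))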

16 Conformance to Expectation metric based on Textual Information

Definition/Description: QConf-Textual measures the amount of unique information within a metadata record. Within a single repository there may be many similar documents and metadata records; it is the unique information in each metadata record that enables the user to distinguish one resource from another. Ochoa and Duval, who developed the metric, explain:

"For categorical fields, the Information Content is equal to 1 minus the entropy of the value (the entropy is the negative log of the probability of the value). For example, if the difficulty level of a learning object metadata record is set to high, where the majority of the repository is set to medium, it will provide more unique information about the resource and, thus, a higher score (high quality). On the other hand, if the record's nominal fields only contain the default values used in the repository, they will provide less unique information about the resource and a lower quality score. For free text, on the other hand, the calculation of the importance of a word is directly proportional to how frequently that word appears in the document and inversely proportional to how frequently documents contain that word. This relation is handled by the Term Frequency-Inverse Document Frequency calculation. The frequency with which a word appears in the document is multiplied by the negative log of the frequency with which that word appears in all the documents in the corpora (this could be considered a weighted entropy measurement). For example, if the title field of a record is 'Lecture in Java', given that 'lecture' and 'java' are common words in the repository, the record will have a lower score (lower quality) than a record in which the title is 'Introduction to Java objects and classes', not only because 'objects' and 'classes' are less frequent in the repository, but also because the latter title contains more words."

Alternative name: QConf-Textual
Uses: Useful for evaluating all metadata elements, but particularly summaries.
Advantages: In a controlled test, Ochoa and Duval found that QConf-Textual correlated to a high degree (0.842) with the average quality evaluation given by a set of human reviewers to the same set of documents. In other words, it is a relatively good predictor of the human evaluation of metadata records. The metric can be refined so that important elements for resource discovery, such as dc:description or dc:title, are given a heavier weighting than, say, dc:format.
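A simplified sketch of the TF-IDF idea behind QConf-Textual, using the "Lecture in Java" example from the quotation above; the +1 document-frequency smoothing is an implementation convenience rather than part of Ochoa and Duval's formulation:

import math
from collections import Counter

def tfidf_information_score(field_text: str, corpus_texts: list[str]) -> float:
    """Sum of TF-IDF weights for the words of one free-text field against the repository corpus."""
    n_docs = len(corpus_texts)
    doc_freq = Counter()
    for text in corpus_texts:
        doc_freq.update(set(text.lower().split()))
    tf = Counter(field_text.lower().split())
    score = 0.0
    for word, count in tf.items():
        df = doc_freq.get(word, 0) + 1            # +1 avoids log(0) for unseen words
        score += count * math.log(n_docs / df)
    return score

corpus = ["Lecture in Java", "Lecture in Python", "Introduction to Java objects and classes"]
print(tfidf_information_score("Lecture in Java", corpus))                           # lower score
print(tfidf_information_score("Introduction to Java objects and classes", corpus))  # higher score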

Manual methods (for free-text elements)

The automated evaluation techniques mentioned above posit a single ideal summary, whereas in reality there may be many possible, and equally good, alternative summaries. Even when an auto-generated summary shares precisely the same semantic meaning as a human expert summary, the choice of words that are used may vary legitimately (e.g. with the use of synonyms) and they may be applied in a different order. This is true regardless of whether the auto-generated summaries are extracts or (more rarely) abstracts [13]. Moreover, summaries that have a very different meaning may sometimes get very similar scores. As Hovy points out, ideally one wants to measure not information content, but interesting information content only. Although it is very difficult to define what constitutes interestingness, one can approximate measures of information content in several ways. But this requires at least some direct human intervention.

17 Summary Coherence

Definition/Description: A ranking or grading of the coherence of auto-generated summaries by human assessors (ideally users of the resource). Summaries generated through extraction-based methods (cut-and-paste operations at phrase, sentence or paragraph level) sometimes suffer from parts of the summary being extracted out of context, resulting in coherence problems (e.g. dangling anaphors or gaps in the rhetorical structure of the summary).
Uses: Use to evaluate summaries (e.g. dc:description or dc:abstract).
Advantages: Subjects can rank or grade summary sentences for coherence and then compare the grades for the summary sentences with the scores for reference summaries, with the scores for the source sentences, or for that matter with the scores for other summarisation systems.
Disadvantages: This metric is unsuitable for automation and, like any metric that involves human assessment, is likely to be resource-intensive.

[13] Extraction techniques are the simpler of the two in that they select the individually most relevant passages of text (e.g. sentences) from the source document and present them verbatim as a summary. Abstracting techniques are more sophisticated because they first analyse the context of a document and then rewrite it into an informative summary.



18 Sentence Rank

Definition/Description: Sentences within the expert-assigned reference summary are ranked by assessors for their worthiness of being included in a summary of the text. Correlation metrics can then be calculated between the auto-generated summary and the reference sentences.
Uses: Use to evaluate summaries (e.g. dc:description or dc:abstract).
Advantages: For extracted summaries, there is some benefit to be had from correlating the generated and reference summaries at the level of sentences rather than individual words, since sentences tend to have a less ambiguous meaning within a given context. Precision and Recall can both be applied at the level of sentences: sentence recall, for instance, measures how many of the sentences in the reference summary (supplied by an expert cataloguer) are present in the generated summary. Sentence rank introduces a potentially useful element of human judgement to enhance the correlation process.
Disadvantages: Sentence Precision, Sentence Recall and Sentence Rank can be useful in relation to extracted, but not abstracted, summaries.

19 Utility Method

Definition/Description: The Utility Method (UM) (Radev et al. 2000) is a refinement of the sentence rank method that considers the relationship between sentences. The sentences or other extraction units in the reference set are marked up to show the negative support that one sentence may exert on another. For instance, the UM can automatically penalise the evaluation score of a tool that extracts two or more equivalent sentences.
Alternative name: UM
Uses: Use to evaluate summaries (e.g. dc:description or dc:abstract). UM is mainly useful for evaluating extraction-based summaries.
Advantages: Human assessors judge the overall content, meaning and coherence of a summary rather than the individual sentences; this method is therefore closer to real life than sentence rank.
Disadvantages: The process of marking up in this way is complicated and time-consuming. Not suitable for abstracted summaries.

20 Factoid Analysis

Definition/Description: Factoid analysis measures the proportion of factoids shared by the auto-generated and reference summaries. The assessor first divides the reference summary (i.e. the expert-created abstract) manually into its atomic semantic units, known as factoids. For instance, the sentence "The police have arrested a white Dutch man" may be represented by the following factoids:

A suspect was arrested
The police did the arresting
The suspect is white
The suspect is Dutch
The suspect is male

The auto-generated summary is likewise split into factoids. The assessor then counts the degree of overlap between the two summaries on a factoid-by-factoid basis. Factoids in summary A and summary B are regarded as matches if they share the same semantic meaning, regardless of differences in vocabulary.
Uses: Use to evaluate summaries (e.g. dc:description or dc:abstract).
Advantages: Factoid analysis is useful because it can reveal the degree of consensus between summary and source text at a fundamental level of semantic meaning, despite the existence of superficial differences in vocabulary and word/sentence order. It also considers the summary at a more appropriate level of granularity than Sentence Rank.
Disadvantages: Unfortunately, this is an intellectually intensive, and therefore time-consuming, process, and there is a small subjective element involved in deciding whether a match is close enough to be accepted.

21 Likert Scale

Definition/Description: The Likert scale is a psychometric scale commonly used in questionnaires and surveys in which respondents specify their level of agreement with a statement.
Alternative name: Likert item
Uses: Use to evaluate summaries (e.g. dc:description or dc:abstract).
Method: Experiments can be conceived whereby participants are shown sample documents, each of which is accompanied by a single list of keywords that have been generated variously by a human indexer and/or one or more metadata generation tools. It is a blind test in the sense that the participant is left unaware of the method by which each keyword has been created. The participant is then asked to evaluate the appropriateness of each keyword using a 5-point Likert scale with the following values: Strongly relevant = 5; Relevant = 4; Undecided = 3; Irrelevant = 2; Strongly irrelevant = 1. The average score for each metadata creation method is then compared, with 5 being the perfect score. Other scales can be used, such as Excellent, Good, Average, Poor, and Bad.
Advantages: The Likert Scale brings a useful human element to the evaluation process. It can measure user perceptions, which can be particularly helpful for the evaluation of free-text elements such as DC Description, for which there is no single correct value. The technique may also stand as a proxy for intrinsic evaluation techniques for which (as is often the case) the necessary software is not readily available, although the human involvement means that it is invariably more time-consuming and costly.
Disadvantages: There is, however, a downside to this subjectivity. The Likert Scale shows participants' beliefs about the quality of metadata, and it is possible for respondents to be mistaken.
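Scoring the blind test is then a matter of averaging the Likert values per creation method; the response data below are invented purely for illustration:

from statistics import mean

# Each tuple is (creation method, Likert score 1-5) for one keyword judged by a participant.
responses = [
    ("human", 5), ("human", 4),
    ("tool_a", 3), ("tool_a", 4),
    ("tool_b", 2), ("tool_b", 3),
]

def average_by_method(scores: list[tuple[str, int]]) -> dict[str, float]:
    # Mean Likert score for each metadata creation method
    methods = {m for m, _ in scores}
    return {m: mean(s for mm, s in scores if mm == m) for m in methods}

print(average_by_method(responses))   # e.g. {'human': 4.5, 'tool_a': 3.5, 'tool_b': 2.5}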



EXTRINSIC METRICS

Developers have mostly used cheap, quick and easy intrinsic metrics, such as precision, recall and F1 score (apparently using in-house software), during the iterative processes of tool development. For repository managers, however, the situation is different. Metadata sits at the intersection between technological and human activity, and managers must decide what the introduction of auto-generated metadata will mean in human terms for their end-users. Users' perceptions about the quality of metadata will vary from one person to the next, along with their differing expectations, needs, assumptions and biases. Extrinsic metrics have the advantage of showing how useful participants find the metadata in practice. A thorough testing framework ought, therefore, to include some more resource-intensive extrinsic tests that aim to mimic real-life scenarios more closely.

Extrinsic evaluation measures the efficiency and acceptability of the generated metadata in some external task, for example relevance assessment or reading comprehension. Also, if, say, the summary contains some sort of instructions, it is possible to measure to what extent it is possible to follow the instructions and the result thereof. Other possible measurable tasks are information gathering in a large document collection, the effort and time required to post-edit the machine-generated summary for some specific purpose, or the summarisation system's impact on a larger system of which it forms a part, for example relevance feedback (query expansion) in a search engine or a question-answering system [14].

Several game-like scenarios have been proposed as surface methods for summarisation evaluation, inspired by different disciplines. Among these are the Expert Game, the Shannon Game (information theory), the Question Game (task performance), the Classification/Categorisation Game, and Keyword Association (information retrieval).

22 The Expert Game

Definition/Description: In this scenario, experts underline and extract the most interesting or informative fragments of a text. The recall and precision of the system's summary is measured against the human's extract.
Uses: Use to evaluate summaries (e.g. dc:description or dc:abstract).
Disadvantages: This metric is resource-intensive.

23 The Shannon Game

Definition/Description: The Shannon Game involves three groups of participants, who are, respectively, shown either the full text of an article, an auto-generated summary, or no text at all. The participants are then asked to recreate the source text at certain points by guessing the next token (e.g. letter or word) in turn. The ease with which they can do so, which can be measured by the number of keystrokes required, provides an approximate indication of the quality of the auto-generated summary.
Uses: Use to evaluate summaries (e.g. dc:description or dc:abstract).
Disadvantages: Hassel notes that "The problem is that Shannon's work is relative to the person doing the guessing and therefore implicitly conditioned on the reader's knowledge. The information measure will infallibly change with more knowledge of the language, the domain, etc." [15] This metric is also resource-intensive.

[14] de Smedt, K., Liseth, A., Hassel, M., & Dalianis, H. (2005). How short is good? An evaluation of automatic summarization. In Nordisk Språkteknologisk Forskningsprogram 2000-2004, edited by H. Holmboe (Museum Tusculanums Forlag), p. 6, http://www.nada.kth.se/~xmartin/reports/ScandSumyearbook2004-fullpage.pdf

24 The Question Game

Definition/Description: The Question Game takes a slightly different approach in that it focuses on participants' comprehension levels. A series of questions is devised for the test, based on factual statements found within a source text. Each of the participants is then asked the same set of questions in three stages: (a) before having seen any text; (b) after seeing the auto-generated summary for the text; (c) after seeing the full text. If the answers given in stage (b) are close to those given in stage (c), then we can regard it as a good summary because it conveys most of the information that is in the full text. Conversely, if stage (b) produces few answers that had not already been given in stage (a), that is suggestive of a poor-quality summary.
Uses: Use to evaluate summaries (e.g. dc:description or dc:abstract).
Disadvantages: This metric is resource-intensive.

25 The Classification Game

Definition/Description: In this game participants are asked to classify either auto-generated summaries or the source documents that they represent according to a classification scheme that is provided. Summary quality can be gauged by the closeness of classification that is found between originals and their summaries.
Uses: Use to evaluate summaries (e.g. dc:description or dc:abstract).
Disadvantages: This metric is resource-intensive.

[15] Hassel, Martin (2004). Evaluation of Automatic Text Summarization: A practical implementation. Licentiate Thesis, Stockholm, Sweden, p. 11.

26 Keyword Association

Definition/Description: Keyword association relies on keywords associated (either manually or automatically) with the documents being summarised. For example, Saggion and Lapalme (2000) presented human judges with summaries generated by their summarisation system, together with five lists of keywords taken from the source article as presented in the publication journal. The judges were then given the task of associating each summary with the correct list of keywords. If successful, the summary was said to cover the central aspects of the article, since the keywords associated with the article by the publisher were content-indicative.
Uses: Use to evaluate summaries (e.g. dc:description or dc:abstract).
Advantages: Its main advantage is that it is inexpensive and requires no cumbersome manual annotation.
Disadvantages: This is a somewhat shallower approach than the other game-like scenarios mentioned above. This metric is also resource-intensive.

27 Query-based methods

Definition/Description: Kristina M. Irvin conducted an interesting experiment that used automated query techniques to evaluate auto-generated metadata in ways that mimicked real-life searching and information retrieval. She asked scientists at the National Institute of Environmental Health Sciences (NIEHS) to rank 34 Health Sciences web pages for their relevance to 20 queries that the NIEHS library would be likely to receive from its users. Two sets of metadata were then created to describe the 34 web pages: one set was manually catalogued, the other was auto-generated by a combination of DC-dot and Klarity. She then formulated the 20 queries into SQL queries and ran them against the two sets of metadata. E.g. Q1, "Is NIEHS conducting any research in the area of HIV-related proteins?", becomes the SQL queries:

select distinct url from elements
where (content like '*HIV*' or content like '*protein*')
and (participant='DC-Dot' or participant='Klarity')

select distinct url from elements
where (content like '*HIV*' or content like '*protein*')
and (participant='LoggerA' or participant='LoggerB' or participant='LoggerC')

Thus it was possible to compare the retrieval effectiveness (measured by precision and recall) of the two sets of metadata in a life-like situation.
Disadvantages: The disadvantage of this method is that it is labour-intensive.

Cost-based metrics

28 Learning Accuracy

Definition/Description: Learning accuracy is a metric designed to take into account the gravity, and not just the frequency, of errors. E.g. when assigning DC Subject terms to, say, texts about Social History, it is a more serious error to assign a term from a completely irrelevant branch of the taxonomy (such as Biology) than from a branch that is incorrect but adjacent to the correct term (Military History). Similarly, assigning a term at the wrong level of specificity (e.g. History rather than Military History) is a relatively modest error. LA uses the following measurements:

SP (Shortest Path) = the shortest length from the root to the key concept
FP = the shortest length from the root to the predicted concept. If the predicted concept is correct, then FP = 0, i.e. FP is only considered in the case that the answer given by the system is wrong.
CP (Common Path) = the shortest length from the root to the MSCA (Most Specific Common Abstraction, i.e. the lowest concept common to the SP and FP paths)
DP = the shortest length from the MSCA to the predicted concept

If the predicted concept is correct (i.e. FP = 0): LA = CP / SP = 1
If the predicted concept is incorrect: LA = CP / (FP + DP)

Essentially, this measure provides a score somewhere between 0 (incorrect) and 1 (correct) for any term assigned.
Uses: Useful for evaluating metadata elements that use a controlled vocabulary containing a hierarchy of terms.
Advantages: All of the metrics mentioned so far have the serious disadvantage of tending to be concerned with the frequency rather than the seriousness of errors. Learning Accuracy remedies this by augmenting traditional Precision and Recall metrics with a system of semantic distance weights, such that the gravity of the error can be taken into account.
Disadvantages: Learning accuracy is potentially very useful, but it is only possible where a controlled vocabulary has been used and it is not easy to implement.
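A toy sketch of the LA calculation over a small invented taxonomy, counting path lengths in nodes from the root; a real implementation would derive the paths from the controlled vocabulary itself:

# Each concept is identified by its path from the root; the taxonomy below is illustrative only.
TAXONOMY = {
    "History": ["Root", "History"],
    "Military history": ["Root", "History", "Military history"],
    "Social history": ["Root", "History", "Social history"],
    "Biology": ["Root", "Biology"],
}

def learning_accuracy(predicted: str, key: str) -> float:
    # LA = CP / SP = 1 when correct, otherwise CP / (FP + DP), with paths counted in nodes
    if predicted == key:
        return 1.0
    pred_path, key_path = TAXONOMY[predicted], TAXONOMY[key]
    cp = 0                                  # common path: root to most specific common abstraction
    for a, b in zip(pred_path, key_path):
        if a != b:
            break
        cp += 1
    fp = len(pred_path)                     # root to predicted concept
    dp = len(pred_path) - cp                # MSCA to predicted concept
    return cp / (fp + dp)

print(learning_accuracy("Military history", "Social history"))   # adjacent branch: 0.5
print(learning_accuracy("Biology", "Social history"))             # unrelated branch: ~0.33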

29 Cost-based evaluation

Definition/Description: Cost-based evaluation takes the idea of assessing the seriousness of errors a step further than Learning Accuracy by quantifying them in practical (financial) terms. It involves associating each category of error or unfulfilled task with a suitable level of cost (e.g. a minor error such as correcting faulty capitalisation in the DC Title element might have a lower tariff than an absent entry in DC Abstract, which would be more labour-intensive to remedy). It uses the formula:

    Performance = Time Saved or Lost x Salary

Alternative name: Cost-based error
Advantages: The cost model can be developed to reflect the circumstances of the specific repository and its priorities. Ochoa and Duval point out that "One user might be more concerned with Precision than Recall, or one user might be more concerned about getting particular types of entities right, and not so concerned about other types, or one user might be more concerned with the fact that getting something partially right is important." A cost-based model is therefore useful because it enables the parameters to be modified according to the particular evaluation or task.
Disadvantages: A serious drawback of the cost-based approach has been the lack of reliable research evidence upon which to base decisions about the cost of either creating or amending metadata records. The issues of cost and performance are difficult because there is a cost/benefit on both sides of the equation. The cost to repositories of upgrading auto-generated metadata must be measured against the costs (incurred by their users) of not doing so, which will be manifested in poorer retrieval rates and the missed opportunities that ensue (e.g. potential research projects or research papers that fail to materialise as a result). But what is the likelihood of these missed opportunities, and what is the average cost/benefit of a research project or research paper to an institution? It is difficult to say. And it is doubtful whether all costs and benefits can be represented in financial terms. It is problematic (to say the least) to quantify the real-life informativeness of metadata (as shown by the Expert, Shannon, Question, and Classification games) in terms of a bottom-line figure. But until we can do so, it will be difficult to know whether we should always aim for the best, most expensive, metadata or whether auto-generated metadata may sometimes be good enough.
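The headline formula is simple arithmetic once a time saving and a salary rate have been estimated; the figures below are purely illustrative:

def cost_based_performance(minutes_saved_per_record: float, records: int,
                           salary_per_minute: float) -> float:
    # Performance = time saved (or lost, if negative) multiplied by the salary rate
    return minutes_saved_per_record * records * salary_per_minute

# e.g. 4 minutes saved on each of 1,000 records at a notional 0.40 (currency units) per minute
print(cost_based_performance(4, 1000, 0.40))   # 1600.0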

30 Time saved

Definition/Description: This metric records the difference in the amount of time that it takes to create metadata by manual or hybrid methods (i.e. manual enhancement of auto-generated metadata). A representative sample of cataloguers or depositors (as appropriate) is split, randomly, into two groups. The first group is asked to create metadata for n documents manually, and then to do the same for n other documents using the hybrid method. The second group does this in reverse, creating metadata for the first n documents using the hybrid method and the last n documents manually. Tonkin and Muller conducted such an experiment within the context of an author self-deposit preprint repository. They measured the time that people took on the server, from the moment they provided the PDF file to the moment that they submitted the metadata. It may be advisable to record the median time for each participant over all papers in order to limit the influence of outlying values.
Uses: This metric is relevant to one of the key issues of automatic metadata generation, which is whether it is quicker to produce metadata by manual or hybrid methods.
Advantages: The time saved could be multiplied by the cost (salary) as part of a cost-based evaluation.
Disadvantages: For the results to be valid, the experiment must be conducted under conditions identical to those that will pertain in real life, which can be difficult to arrange. E.g. Tonkin and Muller performed the experiment on PCs in people's offices in order for them to use their natural environment, but that meant that a participant might be disturbed by receiving a phone call during the experiment.

This is by no means an exhaustive list of metrics. There are many others that can be considered.



A MENU OF METRICS

Commonly used metadata elements have contrasting characteristics that need to be measured in different ways. Unfortunately, optimising performance in relation to a specific metric may lead to a sub-optimal result for the system as a whole. For instance, quality and cost within any system are often inversely related, as are the metrics for recall and precision in an IR environment. A suite of metrics (a balanced scorecard) should therefore be used to address multiple performance perspectives.

The table below provides a summary of the metrics mentioned in this report. Some, such as Exact Match Accuracy, Precision and Recall, are likely to be within the compass of individual repositories. Others, such as BLEU and ROUGE, are more suitable for dedicated research projects within the field of information retrieval. Repositories may wish to mix and match tests from the menu below according to the time, resources and evaluation software at their disposal. Metrics marked with an asterisk are capable, in principle, of being automated, although the project found that there is virtually no open source metadata evaluation software currently available.

Element group: Controlled field with choice of terms in a hierarchy
Elements: Subject terms (LCSH/AAT)
Proposed metrics: Cost-based evaluation; False positives*; Error rate*; Balanced Distance Metric (errors weighted); Augmented precision (errors weighted); Augmented recall (errors weighted); Augmented f-measure (errors weighted); Learning Accuracy

Element group: Controlled field with choice of terms but no hierarchy
Elements: Subject terms (iVia subject terms); Media type; Language
Proposed metrics: Cost-based evaluation; Learning Accuracy; Exact match (if very few instances)*; Likert scale

Element group: Uncontrolled terms (open-ended and multiple)
Elements: Keywords
Proposed metrics: Subfield precision*; Subfield recall*; F-score*

Element group: Free text abstracting fields
Elements: Description
Proposed metrics: Extrinsic evaluation: Shannon Game, Question Game, Classification Game, Keyword association. Intrinsic evaluation: Summary coherence, Factoid analysis, Likert scale, Content-word/sentence precision*, Content-word/sentence recall*, Stemmed content-word precision*, Stemmed content-word recall*, F-score*, ROUGE (N-grams)*, BLEU (N-grams)*, Compression Ratio*, Summary length*, Retention Ratio, Sentence rank, Utility method

Element group: Free text copying fields
Elements: Title
Proposed metrics: Vocabulary test (content similarity)*; Content-word precision*; Content-word recall*; Stemmed content-word precision*; Stemmed content-word recall*; F-score*

Element group: Categorical elements
Elements: Identifier; Format
Proposed metrics: Exact match* (for each element)

Element group: All elements/overview
Proposed metrics: QConf-Textual*; Time saved; Completeness*



REFERENCES

Automatic Metadata Evaluation. iVia website, UC Riverside Libraries. Retrieved October 23, 2008, from: http://ivia.ucr.edu/projects/Metadata/Evaluation.shtml

Greenberg, J., Spurgin, K., & Crystal, A. (2006). Final Report for the AMeGA (Automatic Metadata Generation Applications) Project. Retrieved October 23, 2008, from: http://www.loc.gov/catdir/bibcontrol/lc_amega_final_report.pdf

Maynard, D. (2005). Benchmarking ontology-based annotation tools for the Semantic Web. UK e-Science Programme All Hands Meeting (AHM2005) Workshop "Text Mining, e-Research and Grid-enabled Language Technology", Nottingham, UK, 2005. Retrieved October 23, 2008, from: http://gate.ac.uk/sale/ahm05/ahm.pdf

Moen, W.E., Stewart, E.L., & McClure, C.R. (1998). Assessing metadata quality: Findings and methodological considerations from an evaluation of the U.S. Government Information Locator Service (GILS). In ADL '98: Proceedings of the Advances in Digital Libraries Conference, Washington, DC, USA. IEEE Computer Society, p. 246.

Ochoa, X., & Duval, E. (200?). Towards Automatic Evaluation of Learning Object Metadata Quality. Retrieved October 23, 2008, from: http://ariadne.cti.espol.edu.ec/M4M/files/TowardsAutomaticQuality.pdf

Ochoa, X., & Duval, E. (2006). Towards Automatic Evaluation of Learning Object Metadata Quality. Advances in Conceptual Modeling - Theory and Practice, 4231, 372-381.

Paynter, G. (2005). Developing Practical Automatic Metadata Assignment and Evaluation Tools for Internet Resources. In Proceedings of the Fifth ACM/IEEE Joint Conference on Digital Libraries, Denver, Colorado, June 7-11, 2005. ACM Press, pp. 291-300. Retrieved October 23, 2008, from: http://ivia.ucr.edu/projects/publications/Paynter-2005-JCDL-Metadata-Assignment.pdf

Sparck Jones, K., & Galliers, J.R. (1995). Evaluating Natural Language Processing Systems: An Analysis and Review.

Tonkin, E., & Muller, H. (2008a). Keyword and metadata extraction from pre-prints. In Chan, L., & Mornati, S. (Eds.), Proceedings of the 12th International Conference on Electronic Publishing, Toronto, Canada, 25-27 June 2008, pp. 30-44. ISBN 978-0-7727-6315-0. Retrieved October 23, 2008, from: http://elpub.scix.net/data/works/att/030_elpub2008.content.pdf

Tonkin, E., & Muller, H. (2008b). Semi Automated Metadata Extraction for Preprints Archives. In Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries, Pittsburgh, PA, USA, June 2008, pp. 157-166. ISBN 978-1-59593-998-2. Retrieved October 23, 2008, from: http://portal.acm.org/ft_gateway.cfm?id=1378917&type=pdf&coll=GUIDE&dl=GUIDE&CFID=7481537&CFTOKEN=69623572



MetaTools Stage 2 report: Comparing the quality of currently available metadata generation tools.
Table of Contents
Methodology
Implementation
Outputs and Results
Conclusions
References

Methodology
The second stage of the project used parts of the test framework from stage one to compare the quality of currently available metadata generation tools. A Web survey identified thirty (mostly poor-quality) metadata generation tools, from which a short-list of five was selected using criteria from the Recommended Functionalities checklist. Key factors in the decision were the range of formats supported, the extraction methods, the elements generated, the output formats, encodings and bindings, standard vocabularies, configurability, technology base (i.e. platforms/operating systems supported), dependencies, maturity, any claims made about the quality of output and, above all, any licensing restrictions (the tools had to be open source). The selected tools were (in no particular order):

DC-dot1
This is the best known of a group of (usually Java-based) tools that generate Dublin Core metadata when an identifier such as a URL is manually submitted to an online form. It mostly harvests metatags from Web page headers. It was developed by Andy Powell at UKOLN but has not been updated since 2000.

KEA2
KEA is a tool developed by the New Zealand Digital Library Project for the specific purpose of extracting native keywords (or, rather, keyphrases) from html and plain text files. A naïve Bayesian algorithm learns word frequencies from a training corpus. Keyphrases are extracted according to a combination of how specific they are to the given document (in comparison with the training data) and their position within the document. KEA is available as a download and must be trained with a document set from the relevant subject domain.

Data Fountains3
Unlike the other tools, Data Fountains can be used to discover, as well as describe, Internet resources. Data Fountains can generate metadata for a given page if provided with a URL. Alternatively, it can trawl for and generate metadata for Internet resources on a particular topic, or it can drill down and follow links from a starting URL. The metadata generation method varies from the simple harvesting of metatags, in the case of dc:creator, to the use of a combination of metatag harvesting and sophisticated NLP algorithms (phraseRate), as in the case of dc:subject. Data Fountains differs from KEA in not requiring a training corpus: it generates its keyphrases mostly by using clues found within the structure of the target document itself. It was expected that differences between the keyphrase algorithms used by KEA, Data Fountains, and Yahoo! Term Extractor would lead to some interesting test results.
1 http://www.ukoln.ac.uk/metadata/dcdot/
2 http://www.nzdl.org/Kea/
3 Version 2.2.0, http://datafountains.ucr.edu/

Data Fountains generates the entire range of simple Dublin Core elements plus a few elements specific to itself. It is the result of an on-going project at the University of California, Riverside. Data Fountains is available for use online, via a user-friendly GUI and a free account, or as a download of the iVia suite of tools upon which it is based. The download and installation are not trivial tasks, however.

SamgI (Simple Automated Metadata Generation Interface)4
SamgI is designed to extract metadata from learning objects (LOM), such as courseware. Several potentially useful Dublin Core elements are generated. SamgI is a framework/system that one could call "federated AMG" in that it combines the auto-generated output from several Web services into a single metadata instance. For instance, it invokes the Yahoo! Term Extraction service to provide key words and phrases. SamgI is the work of the HyperMedia and DataBases Research Group at the Katholieke Universiteit Leuven. It is available as a download or online.

Yahoo! Term Extraction service5
The Yahoo! Term Extraction Web Service is a RESTful service that provides a list of significant key words or phrases extracted from a larger piece of content. The Yahoo! Term Extractor was evaluated indirectly when testing SamgI.

The intention at the outset was also to test PaperBase6, a prototype that uses a naïve Bayesian algorithm to generate metadata for scholarly works. However, the software was not made available in time for the full testing program. Format identification and characterisation tools, such as DROID7 and JHOVE8, can provide useful metadata for dc:format but were considered out of scope because they have been evaluated elsewhere.

The plan was to test the five selected tools for their effectiveness in generating Dublin Core metadata from Web pages (html) and journal articles (pdf). These formats were chosen for their ubiquity within the JISC IE and because textual resources perhaps offer the greatest scope for automatic metadata generation. The website sample was drawn from Intute. Five test samples, each of about 50 items, were obtained between 6 November 2007 and 4 March 2008, comprising URLs for the 50 most recently accessioned resources in: 1) the Intute database [i.e. all broad subjects], 2) Intute: Health and Life Sciences, 3) Intute: English Literature, 4) Intute: Political History, and 5) Intute: Demographic Geography9.

4 http://www.cs.kuleuven.be/~hmdb/joomla/index.php?option=com_content&task=view&id=48&Itemid=78
5 http://developer.yahoo.com/search/content/V1/termExtraction.html
6 Tonkin, E., & Muller, H. (2008a). Keyword and metadata extraction from pre-prints. Proceedings of the 12th International Conference on Electronic Publishing held in Toronto, Canada, 25-27 June 2008 / Edited by: Leslie Chan and Susanna Mornati. ISBN 978-0-7727-6315-0, pp. 30-44. Retrieved October 23, 2008, from: http://elpub.scix.net/data/works/att/030_elpub2008.content.pdf. Also, Tonkin, E., & Muller, H. (2008b). Semi Automated Metadata Extraction for Preprints Archives. Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries held in Pittsburgh, PA, USA, June 2008. ISBN 978-1-59593-998-2, pp. 157-166. Retrieved October 23, 2008, from: http://portal.acm.org/ft_gateway.cfm?id=1378917&type=pdf&coll=GUIDE&dl=GUIDE&CFID=7481537&CFTOKEN=69623572
7 DROID (Digital Record Object Identification), http://droid.sourceforge.net/wiki/index.php/Introduction
8 JSTOR/Harvard Object Validation Environment, http://hul.harvard.edu/jhove/


The scholarly works sample comprised 120 journal articles in pdf format (which is the dissemination format for scholarly works in most institutional repositories). They were downloaded from the Leeds and Sheffield sections of the White Rose Consortium institutional repository. The sample was carefully selected to correct the bias towards science/technology (and particularly computer science). One article was selected randomly from each subject in the Leeds and Sheffield sections of the repository, an approach that also reduced the likelihood of selecting multiple items from a single journal. The project was interested in the potential for extracting metadata once the repository movement grows, so a broad range of title-page layouts was preferable to a strictly representative sample of the current repository10.

The output from the various tools was evaluated using the following metrics and methods:

The most basic test was for completeness, i.e. whether each of the tools auto-generated all of the metadata elements that it should have. A baseline impression of completeness could be obtained by validating the xml output from each tool against the relevant application profile (which for the html sample was the Intute Cataloguing Guidelines)11 and noting the number of mandatory tags that were either empty or absent.

DC Identifier, DC Language, DC Format and DC Type were evaluated by Exact Match Accuracy (EMA). EMA is the proportion of times that the automatically generated value exactly matches the expert cataloguer's assignment for the element (after simple normalizations) and is appropriate for elements to which a single value from a controlled vocabulary is normally assigned.

Titles and keywords were evaluated by precision, recall, and F1-score. Precision and recall are more sophisticated than EMA in that multiple-value fields are split into their individual "subfields" so that each unit (e.g. keyword or title word) is matched individually. Precision and recall can therefore take into account situations where one value or instance is correct and another is incorrect. Precision is the number of correctly generated units as a percentage of the number of generated units. A low precision score would suggest that there is a lot of noise (i.e. that many spurious terms are also generated). Recall measures the extent to which all of the correct values have been generated, regardless of whether they are accompanied by noise from incorrect values. Recall is defined as the number of matching units (i.e. correctly generated terms) divided by the number of correct units in the reference set of documents (e.g. terms supplied by the expert cataloguer).

F1-score was also calculated to provide a realistic trade-off, or weighted average, between precision and recall. Used on their own, those two metrics can give a correct but misleading representation. Keyword precision, for instance, can be maximised artificially by a strategy of returning just a single word: the single most frequently occurring word within the source text after stop words have been removed. But the result is an absurdity because of the unacceptably low level of recall that this implies. Conversely, a strategy of extracting all of the words from the resource as keywords will ensure 100% recall but with a very low level of precision. F1-score is defined as:

F1 = (2 x precision x recall) / (precision + recall)

F1-score reaches its best value at 1 and its worst at 0. Its significance is that it tends towards its target value (of 1) only when both precision and recall are good.
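As an illustration of how these unit-level metrics behave, the following sketch (not project code) computes precision, recall and F1-score for a set of generated units against an expert reference set; the example keyword lists are hypothetical.

```python
# Illustrative sketch (not project code): unit-level precision, recall and
# F1-score for a generated set of units (e.g. title words or keyphrases)
# compared against the expert cataloguer's reference set.

def precision_recall_f1(generated, reference):
    """Return (precision, recall, F1) after simple normalisation."""
    gen = {unit.strip().lower() for unit in generated if unit.strip()}
    ref = {unit.strip().lower() for unit in reference if unit.strip()}
    matches = gen & ref                      # correctly generated units
    precision = len(matches) / len(gen) if gen else 0.0
    recall = len(matches) / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: keywords generated by a tool vs. expert keywords.
generated = ["political prints", "HarpWeek", "webmaster"]
reference = ["political prints", "Library of Congress", "HarpWeek", "slave trade"]
print(precision_recall_f1(generated, reference))   # approx. (0.67, 0.5, 0.57)
```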

9 http://www.intute.ac.uk/. One sample was representative of the Intute database as a whole and covered a variety of subjects and digital formats. The other four were to test whether the effectiveness of metadata extraction by the various tools was affected by the subject orientation of the resources.
10 http://eprints.whiterose.ac.uk/. Many titles within the White Rose repository contained an institutional cover page and had to be rejected.
11 http://www.intute.ac.uk/cataloguer/guidelines-socialsciences.doc


The auto-generated summaries were tested with particular interest because DC Description and DCTerms:abstract are, more often than not, the most labour-intensive elements to catalogue manually12. Unfortunately, free-text summaries are monstrously difficult to evaluate because there is no single correct answer13. No two cataloguers, for instance, can be expected to produce summaries identical in every respect (i.e. in terms of their semantic meaning, choice of words, and word order). Also, perceptions about the quality of a summary will vary from one person to the next according to their expectations, needs, assumptions and biases. Ideally, at least one extrinsic evaluation method would have been employed to show how useful participants find the metadata in real-life external tasks (e.g. relevance assessment, reading comprehension, or following instructions)14. However, time and resource constraints made it more practical to use a technique known as factoid analysis, which can reveal the richness of the information in a summary and the degree of consensus between summary and source text at a fundamental level of semantic meaning, despite the existence of superficial differences in vocabulary and word/sentence order.

The assessor first divided each reference summary (i.e. expert-created abstract) manually into its atomic semantic units, known as factoids. For instance, the sentence "In addition, there is a timeline section which focuses on the history of the Holocaust, tracing the years from the rise of the Nazi Party during the 14 years following the end of World War I to the aftermath of the Second World War in which Nazi perpetrators of war crimes faced retribution for their war crimes and survivors began rebuilding their lives." was split into the following factoids:

- There is a timeline section
- The timeline focuses on the history of the Holocaust
- The timeline begins by tracing the years from the rise of the Nazi Party
- The Nazi Party rose during the 14 years following the end of World War I
- The timeline ends in the aftermath of the Second World War
- In the aftermath of the Second World War, Nazi perpetrators of war crimes faced retribution for their war crimes
- In the aftermath of the Second World War, survivors began rebuilding their lives

The auto-generated summary was likewise split into factoids. The assessor then counted the degree of overlap between the two summaries on a factoid-by-factoid basis. Factoids in summary A and B were regarded as matches if they shared the same semantic meaning, regardless of differences in vocabulary. The test was then refined to see how well the factoids matched the specific requirements of the Intute Cataloguing Guidelines, which expected the following to be described:

- The nature of the resource, e.g. an electronic journal, collection of reports etc.
- The intended audience of the information
- Who is providing the information (author, publisher, funder, organisation)
- The subject coverage/content of the resource
- Any geographical or temporal limits
- Any form or process issues that might affect access or ease of use (charging, registration, need for any special software not on the technical requirements list, etc.)
- Availability of the resource in other languages
- Any cross referencing notes, for example, "These pages form part of the ... website"

This was an intellectually intensive process but not as time-consuming as evaluation by extrinsic methods. There was, of course, a small subjective element involved in deciding whether a match was close enough to be registered, but it was deemed to be at an acceptable level.

The project also conducted a small experiment to compare the time taken to create metadata for a given set of documents using manual and hybrid cataloguing methods respectively. Two subsets, each of seven Web pages, were taken randomly from the Intute: Political History sample. The first was catalogued manually into an MS Excel spreadsheet by a subject expert (a librarian with a PhD in History). The second set was described using a hybrid method whereby metadata was auto-generated by Data Fountains into an MS Excel spreadsheet and then manually enhanced by the subject expert. In both cases, the metadata was created to the standard stipulated by the Intute Cataloguing Guidelines. The approach that the cataloguer took and the average time taken for each method were recorded.

12 This is true even for scholarly works that contain an author-supplied abstract, because special characters often fail to copy and paste properly from a pdf file into a cataloguing interface by point-and-click methods.
13 Halteren, Hans van, and Teufel, Simone. 2003. Examining the consensus between human summaries: initial experiments with factoid analysis. In Proceedings of the HLT-NAACL 03 Text Summarization Workshop, 57-64.
14 E.g. the Expert Game; Shannon Game; Question Game; Classification Game; Keyword Association; Query-based methods.

Implementation
It was relatively easy to find a testbed for html resources. Intute was an ideal candidate because:

1) It contains a ready-made reference set of expertly catalogued metadata records for over 100,000 html resources.
2) Its subject coverage is universal.
3) Its broad subject terms assist the identification of subsets.

It was more difficult to find a scholarly works testbed from within the JISC IE because:

1) Institutional repositories (as yet) contain few items15.
2) The subject coverage is usually unbalanced.
3) Most repositories prefix their scholarly works with an institutional cover page, with logo etc., which confuses tools16.

The process of testing was also unexpectedly time-consuming and less could be achieved than expected. This was partly because of the need to convert between multiple output formats. It was particularly problematic in relation to KEA, which requires a large training corpus. It therefore made more sense to test KEA after Stage 3 was completed, by which time metadata generation tools and utilities (e.g. file conversion software) would be available for easier use as Web services. The scarcity of open source software for calculating intrinsic metrics was even more troublesome. The Data Fountains evaluation module ostensibly calculates precision, recall, exact match, etc., but is no longer supported and is unable to evaluate the output from other metadata generation tools. Most calculation had to be done manually. On the other hand, most of the extrinsic metrics would have required a project in themselves.

The main disappointment was in not being able to test PaperBase. Its developers ultimately preferred to conduct tests in-house. The risk analysis conducted during the planning phase paid insufficient attention to this possibility. The disappointment was heightened because the test phase suggested that there is greater scope for extracting metadata from scholarly works than from websites and, according to the research literature, PaperBase appears to be the most sophisticated prototype yet developed for this purpose. Fortunately, Tonkin and Muller published their findings during the later stages of the project and some useful comparisons could be made with the other tools.

15 Glasgow Eprints, http://eprints.gla.ac.uk/, for instance, contained a large number of ppt presentations but too few eprints in pdf.
16 JSTOR, http://www.jstor.org/, which would otherwise have been very suitable, was ruled out for this reason.

Outputs and Results


The quality of metadata generated for Websites (html) was mostly disappointing. In many respects Data Fountains was the most successful of the tools in the test program at generating metadata from html, but it fell short of providing a complete metadata solution. The improvement shown by Data Fountains, when compared with the older DC-dot, was most visible in relation to free-text elements, such as key phrases and summaries. For categorical elements (e.g. format and type) and for title and creator there was little to choose between the various tools. Indeed, Data Fountains, SamgI and DC-dot tended to share the same systemic errors for these basic elements because they used identical or similar methods for them. Overall, even Data Fountains only managed to generate metadata that was 69% complete (i.e. it generated a value for 69% of the 270 mandatory elements required by a sample of 50 Websites). In other words, it generated no value at all for a full 31% of the mandatory elements in the Intute cataloguing guidelines, and many of the generated values were incorrect or only partially correct. For a few items, other tools generated superior metadata for specific elements.

DC-dot had been an innovative piece of work when it was developed in 2000. It attracts between 5000 and 6000 unique visitors per month17, but it is unclear whether anybody is using the tool in earnest and to what effect. The MetaTools test results suggest that its output is of very variable quality and completeness. The effectiveness of metatag harvesters, such as DC-dot, is entirely constrained by the number and quality of meta tags found within the source document. A survey, conducted for this study, found that a mean of just 4.5 meta tags had been applied per home page in 40 randomly-chosen websites from the Intute catalogue. Title was the only omnipresent tag. Only nine home pages contained creator, author or contributor tags, although a further four contained publisher tags, and the quality was extremely variable. The following table shows the number of DC, meta and html tags found in a sample of 40 websites catalogued by Intute that map to each Dublin Core element.

Dublin Core Element | Number of tags in sample
Title | 41*
Format | 31
Type | 30
Subject | 22
Description | 16
Creator | 9
Date | 8
Audience | 7
Language | 6
Publisher | 4
Rights | 4
Coverage | 3
Identifier | 2
Contributor | 1
Relation | 1
Source | 1
Empty or misused tag | 17

Table 3: Metadata found in a website sample
*NB. One website contained a title metatag as well as an html <title> tag, hence the discrepancy between the 40 websites in the sample and the 41 tags that map to DC title.

Consequently, DC-dot usually generates only partial metadata records when used on its own. The data for any given element also varies considerably from one record to the next because of the variety of ways that cataloguers interpret DC cataloguing guidelines when applying meta tags. For example, in the sample mentioned above, <author> and <creator> were populated with a variety of data such as:

1. CRFR Centre for Research on Families and Relationships
2. roda.ro
3. firstname surname
4. Department for Environment, Food and Rural Affairs (Defra), Communications Directorate, webmaster@defra.gsi.gov.uk

The limitations of tag harvesting may be traced back partly to the fact that the Web editor software used for creating most web pages currently supports the auto-generation of a narrow range of metatags. Dreamweaver inserts no system-generated metatags at all that map to Dublin Core, and CityDesk uses automatic techniques only for the date element. The existence of metatags is largely at the whim of human input, and there is no support for that in Dreamweaver and CityDesk for most Dublin Core elements or their generic equivalents. The situation is similar for most other types of content creation software, including Acrobat and Word, the key applications for eprints - as is clear from the following table, reproduced from the AMeGA Report18.

17 Figures supplied by Greg Tourte, Systems Administrator/Developer, UKOLN. Email: g.tourte@ukoln.ac.uk

18 Greenberg, J., Spurgin, K. & Crystal, A. (2005). Final report for the AMeGA (Automatic Metadata Generation Applications) Project, UNC & Library of Congress, p. 16. http://www.loc.gov/catdir/bibcontrol/lc_amega_final_report.pdf


Table 4: Support for metadata generation in content creation software (reproduced with permission from the AMeGA Report)

The AMeGA report speculates that: "One possible reason is that these applications focus on the creation of HTML documents and emphasize resource appearance (e.g., color and font size) over structure and content. Even so, it seems development of X/HTML, which supports structured metadata, would have had an impact on such applications, improving metadata functionalities. Regardless of the underlying reasons, improving Web editor metadata creation functions could lead to an increase in metadata production, and ultimately improve resource discovery and other metadata-supported functions. Perhaps better communication between selected metadata communities and Web editor vendors would improve current metadata functionalities for these tools."19

19 Greenberg, J., Spurgin, K. & Crystal, A. (2005). Final report for the AMeGA (Automatic Metadata Generation Applications) Project, UNC & Library of Congress, p. 37. http://www.loc.gov/catdir/bibcontrol/lc_amega_final_report.pdf


DC:creator was the most problematic element for all of the tools because, like DC-dot, they all relied solely on extracting dc:creator or author metatags, although Data Fountains additionally post-processed the list to remove duplicate entries and blacklist undesirable values. Unfortunately, these metatags were rarely supplied, or else they contained the name of the software used to create the page rather than a personal or corporate name. DC-dot, SamgI and Data Fountains produced an exact or partial match for dc:creator only 10% of the time20. As the table below shows, dc:creator (Creators) was a particularly difficult element for Data Fountains to generate:

Table: Website (html) sample: Intute: All

Data Fountains element | % of records containing element (includes correct and incorrect values)
url | 88
broad_subject_categories | 78
Creators | 22
Description | 68
Format | 78
key_phrases | 76
Language | 78
Lcc | 70
rich_full_text | 74
Titles | 74
proper_names_and_capitalized_phrases | 74
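The header-harvesting approach that the tools rely on for creator (and largely for title) can be illustrated with a minimal sketch. This is not the code of DC-dot, SamgI or Data Fountains, merely an indication of the technique, using Python's standard html.parser; the class and example values are illustrative.

```python
# Minimal sketch (not any tested tool's actual code) of header metatag
# harvesting for Dublin Core creator and title.
from html.parser import HTMLParser

class DCHeaderHarvester(HTMLParser):
    def __init__(self):
        super().__init__()
        self.creators, self.titles = [], []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            name = (attrs.get("name") or "").lower()
            content = (attrs.get("content") or "").strip()
            if content and name in ("dc.creator", "author", "creator"):
                self.creators.append(content)
            elif content and name in ("dc.title", "title"):
                self.titles.append(content)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())

harvester = DCHeaderHarvester()
harvester.feed('<html><head><title>Example page</title>'
               '<meta name="DC.Creator" content="J. Smith"></head></html>')
print(harvester.titles, harvester.creators)   # ['Example page'] ['J. Smith']
```

If neither a DC.Creator nor an author metatag is present - which, as the survey above shows, is the usual case - a harvester of this kind has nothing to return, which is why dc:creator performed so poorly across all of the tools.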

DC:title was usually generated more successfully because harvestable dc:title metatags and <title> html tags were often available. However, even Data Fountains achieved relatively modest precision and recall rates of 0.45 and 0.65 respectively for this element21. Metatags often did not match the displayed titles, which are increasingly rendered from image files (e.g. gif), which DC-dot and Data Fountains are unable to read. This problem has no obvious solution: the ALT attribute may provide an alternative text description for a linked image file, but it is infrequently used. The auto-generated title metadata was of very variable quality, and the story was similar for most other elements. Data Fountains generated an exact match for only 34% of the titles in the Intute: Political History sample and a partial match for 30%. DC-dot performed similarly: exact match 32%; partial match 34%. SamgI was far worse: exact match only 4%, partial match 6%22. As the table below shows, performance levels were better for some subjects than others, but not hugely so.

20 According to tests using the Intute: Political History sample.
21 For the Intute: English Literature sample.
22 SamgI perhaps performs so poorly in this respect because it was designed to extract LOM metadata and the title module may have been a token effort. It only harvests META tags where NAME="DC.Title", which are rarely used. Data Fountains exploits additional sources of title metadata when there is no DC.Title META tag: 1. the content of any META tag whose name is title or dc:title; 2. the Title tag; 3. all H1 tags; 4. the sequence of words in the first 50 letters of body text. Data Fountains then post-processes the initial list to remove duplicate entries, blacklist undesirable values (e.g. Homepage, Untitled Document), and remove unwanted prefixes (e.g. Welcome to, Homepage of) while preserving the order of the list. The values remaining in the list are assumed to be in order of decreasing quality, so that when a single Title is required, the first is used.
Data Fountains: Title evaluation

Sample | Exact match | Near match | Distant match | Wrong | None | Total
Intute: Health | 12 | 8 | 12 | 5 | 2 | 39
Intute: All | 12 | 13 | 6 | 7 | 5 | 43
Intute: English Literature | 10 | 28 | 0 | 3 | 2 | 43
Intute: Political History | 17 | 15 | 0 | 10 | 8 | 50
Intute: Social Science | 13 | 10 | 6 | 15 | 0 | 44
TOTAL | 64 | 74 | 24 | 40 | 17 | 219

The attitude that a portal takes towards the titles generated by Data Fountains will probably depend, in part, upon the strictness of its own cataloguing policy. The process of manually assigning a title is an art rather than a science and, for the sake of clarity, Intute, for instance, would sometimes assign a title that differed somewhat from the title in the resource (e.g. a subtitle might be added or omitted, or the author's name added as a subtitle), thus departing from the Data Fountains title. The MetaTools project used a very strict definition as to what constituted a match. A match required every word in the Data Fountains-generated title to be identical to every word in the Intute title, apart from capitalisation and the existence of a leading article (if any). Near matches were generally very near matches. A rough idea of the output can be gained from the table below, which shows a selection of titles generated by Data Fountains and their equivalents in the reference set (expert-assigned titles from the Intute Website).

Intute Title (expert assigned) | Data Fountains Title (auto-generated)
British Society for Population Studies | British Society for Population Studies
A fact, and an imagination : or Canute and Alfred, on the seashore | Wordsworth, William. 1888. Complete Poetical Works. [EMLS 1.2 (August 1995): 6.1-10]
A bibliography of Thomas More's 'Utopia' | A Bibliography of Thomas More's Utopia
A tribute to R. S. Thomas | Obituary Ruins The Hearth Here Want More?
17th century reenacting and living history resources | 17TH Cen Reenacting
Dan Berger's pages at Bluffton University | Examples of pericyclic reactions; Organic chemistry I : Chem 341
221 Baker Street | 221B Baker Street: Sherlock Holmes
ABES : annotated bibliography for English studies | Abes
Adelmorn, the outlaw : a romantic drama in three acts | Adelmorn by Matthew G. Lewis
A relation of the apparition of Mrs. Veal | Defoe, Apparation of Mrs. Veal
A centennial tribute to Langston Hughes | Library System - Howard University
The Abraham Cowley Text and Image Archive: University of Virginia | Abraham Cowley text and image archive
Centro regionale progettazione e restauro di Palermo |
The Surrealism Server | !Surralisme!

Creator and Title exemplify two of the key problems of generating metadata from Web pages. Firstly, it is difficult to extract metadata for specific entities from the body of a page because html contains few tags to indicate semantic meaning other than in the header. A personal or corporate name within the text, for example, could be a creator, a subject, or somebody mentioned in passing23. Secondly, the order of priority accorded to extraction methods is usually hard-coded within each tool. This can cause the output to be suboptimal: Data Fountains, for example, outputs a harvested dc:title metatag as a matter of course, even when there is a better title in the <title> html tag. DC-dot gives preference to the title tag, and this is why its output is occasionally preferable.

Nevertheless, Data Fountains does appear to be an improvement over older tools (e.g. DC-dot) and also other relatively recent tools (e.g. SamgI and the Yahoo! Term Extractor). The improvement was most visible in relation to free-text elements, such as key phrases and summaries24. For keywords, it was decided not to conduct precise statistical analysis of the sort first envisaged (e.g. to calculate precision, recall, and F1-score) because it was suspected that the keywords in the reference set of expert-assigned metadata (i.e. the Intute metadata) had not been assigned as consistently as the other metadata elements. However, an informal comparison of the keywords (keyphrases) generated for each Website by Data Fountains, DC-dot and SamgI with the summary for that item in the Intute catalogue suggests that there was not a huge difference between the quality and relevance of the three sets of keyphrases. The big difference was that Data Fountains was much more consistent in terms of how many it assigned. Data Fountains assigned keywords to 88% of the Intute: Political History sample, whereas DC-dot did so for less than half. The number of keyphrases assigned was also much more consistent from one item to another. Several metadata records from DC-dot had over one hundred keywords and one contained 486! Data Fountains, in contrast, uniformly assigned between 7 and 10 for the vast majority of items.

This is because DC-dot generates dc:subject metadata by harvesting metatags and, where they are absent, extracts keywords from the content by analyzing anchors (hyperlinked concepts) and presentation encoding, such as <strong>, bolding and font size. This is a crude method because: 1) the number of keywords assigned to a resource by an author can vary enormously, as can the number of hyperlinked or emphasised words within a page25; 2) authors differ in their interpretation of what constitutes a keyword, and many do not, strictly speaking, map to dc:subject (e.g. some creators enter the name of their home institution as keywords); and 3) there is no guarantee that hyperlinked or emphasised words indicate the subject of a site.

Data Fountains uses a more sophisticated method to generate keywords or, to be precise, keyphrases. A phraseRate algorithm analyses the structure of the given html document and uses this knowledge as the basis for its identification and ranking of candidate keyphrases. It uses an array of indicators, including the nesting structure of the document (extra value is given to, say, introductory or emphasized text); the grammatical and syntactical structure of sentences (in order to identify the correct start and end-points of candidate phrases); the relative frequency of each word, its position within candidate keyphrases, and the frequency of those keyphrases within the source document as a whole; and the relative weight that should be given, respectively, to keyphrases extracted from the body and keyword meta tags from the document header (the latter being given a boost when they are sparse so that each carries more weight).

23 The <title> tag is generally the only useful signpost of this kind.
24 In absolute terms, Data Fountains produced better results for dc:identifier than keywords or abstract. It generated a correct value (i.e. an exact match) 83% of the time. However, an even higher accuracy threshold is desirable for identifier, because even the smallest error renders a unique machine-readable identifier useless.
25 From none to ??? in the sample.
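The following is a much-simplified, single-word sketch of the kind of structure-aware scoring just described. It is not the iVia phraseRate implementation; the weights, stop-word list and boost values are arbitrary illustrations of how term frequency, position in the document and header keyword metatags might be combined.

```python
# Much-simplified sketch in the spirit of phraseRate (not the iVia code):
# candidate terms are scored by frequency, weighted by position, with a
# boost for header keyword metatags that is larger when they are sparse.
from collections import Counter
import re

STOP = {"the", "a", "an", "of", "and", "in", "to", "for", "is", "on", "with"}

def score_candidates(body_text, meta_keywords=(), top_n=9):
    words = [w for w in re.findall(r"[a-z]+", body_text.lower()) if w not in STOP]
    freq = Counter(words)
    n = len(words) or 1
    scores = {}
    for pos, word in enumerate(words):
        if word in scores:
            continue                               # score each word at its first position
        position_boost = 1.0 - 0.5 * (pos / n)     # earlier text weighs more
        scores[word] = freq[word] * position_boost
    boost = 2.0 if len(meta_keywords) <= 3 else 1.2    # sparse header keywords count for more
    for kw in meta_keywords:
        kw = kw.lower().strip()
        scores[kw] = scores.get(kw, 1.0) * boost
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(score_candidates(
    "The timeline section traces the history of the Holocaust. "
    "The timeline begins with the rise of the Nazi Party.",
    meta_keywords=["holocaust"]))
```

The real algorithm works on multi-word phrases and many more structural cues, but the principle is the same: the ranking is driven by clues within the document itself, not by a training corpus.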

Data Fountains applies a more sophisticated method for generating summaries too. DC-dot derives them entirely from the harvesting of meta tags (i.e. dc:description), whereas Data Fountains uses the scores given to the candidate keyphrases by the phraseRate algorithm to find the highest-scoring text division, and the highest-scoring paragraph within the resource, which is returned as the final summary. If this strategy fails, a set of contiguous high-scoring sentences is used instead26.

Data Fountains appeared to produce slightly higher quality summaries than DC-dot or SamgI, or at least that was how it felt, intuitively, to the project team. This was borne out by the test results, which showed that recall was superior for Data Fountains. This meant that its auto-generated summaries shared more of the unique content words from the Intute summaries than the other tools did, although its level of precision was the lowest. Recall is perhaps the better measure of the quality of a summary. It gives an indication as to the proportion of the things that the expert cataloguer thought were important to say about the resource that have actually made it into the auto-generated summary. Precision is less of a consideration where there are an infinite number of other things that could be said about a resource.

Despite their marginally better quality, the Data Fountains summaries are nevertheless still only about one-quarter of the length of the expert-created summaries in the Intute database. The Data Fountains summaries had an average length of 40.972 words, much shorter than the expert-created records, which had an average length of 164.84 words. This is broadly in line with the results of an in-house test by the iVia/Data Fountains team, which recorded an average summary length of 36.2 words27. As one might expect from their length, factoid analysis showed that they still contain only a small proportion of the information that is found in the reference set of human-created summaries. This is true even of the better descriptions, such as that for the Website American Political Prints 1766-187628. The expert-assigned Intute summary for this resource is:

"American Political Prints, 1766-1876 is an online exhibition published by HarpWeek. The site features an extremely important Library of Congress collection of eighteenth and nineteenth century political prints, catalogued by Bernard F. Reilly Jr. The digitised prints can be searched by keyword, name, or topic, and they can also be browsed by date. Each one is accompanied by an explanation of the image, placing it in context for the user. Further information on the collection can also be found in Bernard F. Reilly's introduction. Amongst the topics covered by these fascinating prints are the slave trade, Dred Scott, elections, politicians, and Ireland."

Data Fountains generated the following description:

"Website design © 2005 HarpWeek, LLC & Caesar Chaves Design All Content © 1998-2005 HarpWeek, LLC Please submit questions to webmaster@harpweek.com This catalog, which HarpWeek has the privilege of bringing to the public in electronic format, is an unmatched source of information on American political prints. Please read Bernard F. Reilly, Jr.'s introduction to this website. Website visitors should be warned that several of the words, descriptions, and images in these 19th-century caricatures are considered racially offensive by today's standards."

Twenty-seven factoids may be identified in the Intute summary. The Data Fountains summary, on the other hand, contains material that equated to just 6 of them. It managed to generate 7 additional factoids not found in the Intute summary, but only three of these were considered relevant to dc:description. And it should be remembered that this is one of the best auto-generated summaries in the sample. Overall, Data Fountains generated just two legitimate factoids per Website for the Intute: Political History sample.

26 For details of Data Fountains' retrieval and extraction methods see: Paynter, G. (2005). Developing Practical Automatic Metadata Assignment and Evaluation Tools for Internet Resources. Proc. Fifth ACM/IEEE Joint Conference on Digital Libraries, Denver, Colorado, June 7-11, 2005, ACM Press, pp. 291-300. Retrieved October 23, 2008, from: http://ivia.ucr.edu/projects/publications/Paynter-2005-JCDL-Metadata-Assignment.pdf. Also go to the iVia project website: http://ivia.ucr.edu/projects/
27 http://ivia.ucr.edu/projects/Metadata/Description.shtml
28 http://loc.harpweek.com/

Table: Intute and Data Fountains factoids for American Political Prints, 1766-1876
Intute Factoid | Data Fountains factoid | Relevancy of Data Fountains factoid
It is a collection of political prints | American Political prints | yes
The prints cover the period 1766-1876 | American Political prints | yes
The prints are American | American Political prints | yes
It is an online exhibition | public in electronic format | yes
It is published by HarpWeek. | which HarpWeek has the privilege of bringing to the public | yes
The site features a Library of Congress collection | - | -
The collection is extremely important | - | -
The collection covers the eighteenth century. | - | -
The collection covers the nineteenth century | - | -
The collection was catalogued by Bernard F. Reilly Jr | - | -
Each one is accompanied by an explanation of the image, placing it in context for the user. | - | -
The prints are digitised | - | -
The prints can be searched by keyword, name, or topic | - | -
The prints can be searched by name | - | -
The prints can be searched by topic | - | -
The keywords can be browsed by date | - | -
Each print is accompanied by an explanation of the image, placing it in context for the user. | - | -
The explanation places the print in context | - | -
There is an introduction | read Bernard F. Reilly, Jr.'s introduction | yes
Bernard F. Reilly wrote the introduction | - | -
Further information can be found in the introduction | - | -
The prints are fascinating | - | -
The topics include the slave trade | - | -
The topics include Dred Scott | - | -
The topics include politicians | - | -
The topics include Ireland | - | -
There are other topics | - | -
- | Several of the words in these 19th century caricatures are considered racially offensive by today's standards. | yes
- | Several of the descriptions in these 19th century caricatures are considered racially offensive by today's standards. | yes
- | Several of the images in these 19th century caricatures are considered racially offensive by today's standards. | yes
- | The Website design is the copyright of HarpWeek, LLC & Caesar Chaves Design | -
- | The copyright date is 2005 | -
- | All Content is copyright 1998-2005 to HarpWeek, LLC | -
- | All questions should be submitted to webmaster@harpweek.com | -

The Intute cataloguing guidelines stated that "the subject coverage/content of the resource" should be described in the metadata, and this type of detail was included more than any other kind in the Data Fountains descriptions. Sixty-seven percent of the Data Fountains descriptions contained a mention of the subject/content of the resource, but it was very often just a single line of text. For example, from the Intute description:

"... split into two interlinked sections: The Peel Web, and the Age of George III. The Peel Web covers the years of the pre-eminence of Sir Robert Peel and deals mostly with the political history of Britain and Ireland 1830-1850. The Age of George III focuses more on Britain's overseas relations, albeit with information on each of the ministries to serve under George III and the Prince Regent (later George IV)."

From the Data Fountains description: "The reign of George III"

The nature of the resource (e.g. whether it is an electronic journal, collection of reports etc.) was often touched upon (60% of the time), but other details were very scarce even where relevant. Other crucial information was usually absent, as the table below shows:

Table: Composition of Data Fountains summaries

Details required by Intute Cataloguing Guidelines | Data Fountains metadata records that contain the required detail (at least minimally)
The nature of the resource, e.g. an electronic journal, collection of reports etc. | 60%
The intended audience of the information | 6.7%
Who is providing the information (author, publisher, funder, organisation) | 13.3%
The subject coverage/content of the resource | 67%
Any geographical or temporal limits | 27.0%
Any form or process issues that might affect access or ease of use (charging, registration, need for any special software, etc.) | 6.7%
Availability of the resource in other languages | 0%
Any cross referencing notes, for example, "These pages form part of the ... website" | 6.7%
Other | 13.3%

Interestingly, the Data Fountains summary of an item was usually no better than, and indeed was usually identical to, the brief Google result (they were presumably generated by a similar algorithm). Attitudes towards auto-generation are likely to be shaped, to some extent, by context and expectations, so that a description that is acceptable within the context of a search engine (which offers universal retrieval) is likely to require considerable amendment before it is suitable for the relatively small but richly-descriptive context of a JISC portal. Many of the descriptions do little more than re-state the title. Overall, the project's expert cataloguer found it easier, in virtually every case, to write descriptions from scratch rather than to amend the auto-generated output in order to meet the standards of the Intute Cataloguing guidelines. More rigorous testing, however, would be needed to confirm this.



Data Fountains was generally more successful than the other tools that were tested in extracting metadata from scholarly works (pdf) too. Statistical analysis suggests that it generates more suitable keywords or keyphrases for scholarly works than either DC-dot or SamgI, which uses the Yahoo! Term Extractor web service. For instance, when the author-supplied keywords found in thirty scholarly works were compared with those generated by Data Fountains and SamgI, it was found that the output from Data Fountains showed superior precision, recall and F1-score (see table)29. Human evaluation supported this finding. Each tool rarely achieves an exact match against the human-supplied keywords, though this should not, in itself, be regarded as a problem. Human cataloguers frequently disagree amongst themselves when assigning uncontrolled keywords because of the use of synonyms. More seriously:

- Relevant terms may be accompanied by unnecessary variations from the same root phrase (e.g. neighbourhoods of southampton may be superfluous in the presence of the more specific deprived neighbourhoods of southampton).
- Sometimes one phrase is just the plural of another.
- Some terms lack context (e.g. help themselves) and appear meaningless or amateurish.
- There may be erroneous capitalisation.
- The name of the author(s), the URL of the eprint or the journal's name are sometimes captured erroneously as keyphrases (e.g. below, williams and jan windebank are the authors).

Data Fountains generally produces a larger number of unnecessary variations from the same root phrase, but the SamgI errors tend to be more serious. SamgI includes more of the worst type of error: entirely irrelevant phrases. It also wrongly treats entities that should populate other elements as key phrases (e.g. Journal of Social Policy below). The relevant terms are also slightly less specific (e.g. economic geography below).

Table: Auto-generated keyphrase metadata for the scholarly work Helping People To Help Themselves: Policy Lessons From a Study of Deprived Urban Neighbourhoods in Southampton, by Colin C. Williams and Jan Windebank30. The clearly-unsuitable key phrases are marked in red:

Data Fountains | SamgI
deprived neighbourhoods | joseph rowntree foundation
community exchange | urban neighbourhoods
tax credit | senior lecturer
neighbourhoods of Southampton | welfare provision
jan windebank | leicester le1
deprived neighbourhoods of southampton | department of geography
unpaid community exchange | cambridge university press
help themselves | jan windebank
williams and jan windebank | economic geography
self-help in deprived neighbourhoods | sheffield s10 2tn
 | Journal of Social Policy

There are significant differences in the output between tools. Although it is almost impossible to predict which tool would be better for any specific item, the Data Fountains key phrases appear to be preferable overall. It should be noted, however, that SamgI has been developed primarily to investigate the potential for generating learning object metadata (LOM, i.e. Intended User Role, Interactivity Type, Learning Resource Type, Typical Learning Time, etc.), which is out of scope for the MetaTools project. The results presented here should not be taken as an indication of its suitability for its main purpose.
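The comparison method summarised in footnote 29 - keyphrases broken down into their component words, duplicates and stop words removed, and precision, recall and F1 computed over the resulting sets - can be sketched as follows. This is illustrative only; the stop-word list and the sample phrases are hypothetical.

```python
# Sketch of the footnote-29 comparison: keyphrases are reduced to unique
# content words before precision, recall and F1 are computed (illustrative only).
import re

STOP_WORDS = {"of", "the", "and", "in", "a", "to", "for"}

def content_words(keyphrases):
    words = set()
    for phrase in keyphrases:
        words.update(re.findall(r"[a-z0-9]+", phrase.lower()))
    return words - STOP_WORDS

def compare(generated_phrases, author_phrases):
    gen, ref = content_words(generated_phrases), content_words(author_phrases)
    matches = gen & ref
    precision = len(matches) / len(gen) if gen else 0.0
    recall = len(matches) / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical auto-generated vs. author-supplied keyphrases.
print(compare(["deprived neighbourhoods", "community exchange"],
              ["deprived urban neighbourhoods", "self-help", "community exchange"]))
# -> precision 1.0, recall approx. 0.57, F1 approx. 0.73
```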

29 The comparison was between unique content words (i.e. after breaking down the auto-generated keyphrases into their component words, removing duplicates, and stop words such as "of", "the", "and").
30 http://eprints.whiterose.ac.uk/1586/1/windebank.j2.pdf



Taken as a whole, the Data Fountains keyphrases give an impression of the subject but appear somewhat amateurish. A cataloguer is likely to want to retain about half of them. The time-saved test (see below) suggests that modest efficiency savings may accrue from pre-seeding a cataloguing interface with keyphrases from Data Fountains, but much probably depends upon the interface design (e.g. how easy it is to delete erroneous entries).

Data Fountains produces an impressively uniform number of key phrases. It succeeds in its aim of extracting nine key phrases from each resource. It extracted key phrases for all but two of the 112 scholarly works within the sample that its crawler could locate, although for five of the scholarly works this amounted to a single character (e.g. "b"). 1,021 key phrases were extracted in total, which amounts to a mean of 8.5 key phrases per eprint in the entire sample of 120 pdfs. However, the quality of the key phrases is sub-optimal because Data Fountains does not appear to set out to grab the 3-8 high-quality author/publisher-supplied key phrases that exist in thirty (27%) of the sample31. Data Fountains appears to rely entirely on a tf-idf-style algorithm to produce its keywords rather than searching for clue words such as "keyword". Its Bayesian tf-idf algorithms weigh candidate phrases according to two criteria: 1) their unexpectedness relative to their usage in the global corpus of English-language documents, and 2) their position in the document (i.e. phrases at the start are generally more important than later phrases). It is probable that Data Fountains takes this route because it is oriented mostly towards extracting metadata from web pages, where creator-supplied keywords are rare or are abused, but it is a rough and ready approach. This makes it all the more disappointing that it wasn't possible to test PaperBase, a tool designed specifically to exploit the layout and clues within scholarly works.

In some respects, Data Fountains, at least, generated higher quality metadata from scholarly works in pdf format than from Web pages, even though it was designed primarily to support html. For example, it was unable to generate any summaries for the Website sample that approached the standard stipulated by the Intute Cataloguing Guidelines. On the other hand, it will extract the author/publisher-supplied abstract perfectly from 17.5% of scholarly works and, altogether, 25.8% of the auto-generated abstracts are suitable for seeding a cataloguing interface32. This might sound like a low percentage, but the auto-generated abstracts are nonetheless useful because they are rarely incorrect. There is either an exact match, a truncated match (i.e. where only the first portion of an abstract is extracted), or else the auto-generated abstract contains extraneous text from beyond the point at which it should stop. Here, the cataloguer can always accept the auto-generated output with only the smallest amendment being needed. Moreover, there is considerable scope for further improving the quality of extraction of summaries from scholarly works. Unlike Websites, for which an author-created abstract or summary would be a rarity, the vast majority of scholarly works (92% of the scholarly works that the Data Fountains crawler managed to locate in the sample) do contain a summary that a bespoke tool could be expected to extract.
Even after conversion to plain text, 63% of these files have a textual marker such as Abstract, ABSTRACT, Abstract:- or, less frequently, Summary or Executive summary and so on.33 Similarly, a cleverly designed algorithm could surely exploit the fact that 30% of scholarly works contain author- or publisher-assigned keywords that are either labelled (e.g. Key words, keywords, KEYWORDS, etc.) or, if not labelled, are identifiable from being the only information typically found within a string that is both greater than 50 characters in length and contains no stop words.
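A bespoke extractor of the kind envisaged here could exploit those textual markers directly. The sketch below is illustrative only, not an implementation of any tested tool: it looks for an Abstract/Summary label and a Keywords label in the plain text converted from a pdf, and the regular expressions and cut-off values are assumptions rather than recommendations.

```python
# Illustrative sketch of marker-based extraction from text converted from PDF:
# grab the abstract after an "Abstract"/"Summary" label and keywords after a
# "Keywords" label (not an implementation of any tested tool).
import re

ABSTRACT_MARKER = re.compile(r"^\s*(abstract|summary|executive summary)\s*[:\-]?\s*$",
                             re.IGNORECASE)
KEYWORD_MARKER = re.compile(r"^\s*key\s?words?\s*[:\-]?\s*(.*)$", re.IGNORECASE)

def extract(plain_text, max_abstract_lines=15):
    lines = plain_text.splitlines()
    abstract, keywords = [], []
    for i, line in enumerate(lines):
        kw = KEYWORD_MARKER.match(line)
        if kw:
            keywords = [k.strip() for k in re.split(r"[;,]", kw.group(1)) if k.strip()]
        elif ABSTRACT_MARKER.match(line):
            for next_line in lines[i + 1:i + 1 + max_abstract_lines]:
                if not next_line.strip():       # stop at the first blank line
                    break
                abstract.append(next_line.strip())
    return " ".join(abstract), keywords

text = "Title of Paper\nAbstract\nThis paper examines metadata tools.\n\nKeywords: metadata, repositories"
print(extract(text))   # ('This paper examines metadata tools.', ['metadata', 'repositories'])
```

A production tool would also need the fallback heuristic mentioned above (an unlabelled string longer than 50 characters containing no stop words) and handling for multi-column layouts, but even this simple approach would capture the labelled cases that Data Fountains currently ignores.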

31 An exact match for key phrases was achieved for only one item.
32 This figure includes 1) perfect abstracts, 2) abstracts that have been prematurely truncated (usually because the parser reached a special character such as < or >), and 3) abstracts that contain some additional unwanted material (on account of the parser continuing beyond the end and extracting the first few lines of the body of the text).
33 Author-supplied abstracts are less frequent in older journal articles and those in arts and humanities subjects than elsewhere.

Significant testing by other projects

Recent work by Tonkin and Muller suggests that a bespoke tool, designed specifically with scholarly works in mind, can indeed produce markedly better results than Data Fountains. Their research was published in June 2008 and came too late to be fully assessed, but it is very significant and should be mentioned here. Their PaperBase prototype generates titles, authors, keywords, abstracts, page numbers and references from preprints and, via a Dublin Core-based REST API, uses them to populate a web form that the user can then proof-read and amend. PaperBase has been integrated into the institutional repository that stores papers written by members of the Department of Computer Science at the University of Bristol, where it has been trialled.

Tonkin and Muller have manually tested the quality of title and author extraction by PaperBase for 186 papers in their repository34. They claim that 86% of the titles were extracted correctly, with just 8% being partially correct and 8% completely wrong. This is a huge improvement on the figures that Data Fountains had managed for the White Rose scholarly works sample (28% of the titles correct, with 27% partially correct and 45% completely wrong). The comparison must be treated with some caution because it is based on different samples and the tests were undertaken under different circumstances35, but the disparity in results is startling. Similar figures are claimed for other DC elements that they tested. E.g. Tonkin and Muller claim that only 13% of the author names generated by PaperBase from preprints were completely wrong, which compares with Data Fountains' error rate of 90%36 for creators of Websites.

Tonkin and Muller's work confirms the impression derived from the MetaTools tests that, in the near term, there appears to be greater scope for extracting high-quality metadata from scholarly works than from Websites, because their layout is more formulaic, with most conforming to one or another of five or six traditional arrangements. E.g. a paper is likely to start with a title, then authors, affiliations, abstract and keywords, and end with various references. Tonkin and Muller claim to have captured these five or six alternative layouts in a probabilistic grammar that makes use of the syntactic structure in journal publishing37.

PaperBase makes use of two techniques: a Bayesian classifier and a state machine. Firstly, a Bayesian classifier was trained using the word frequencies found in a corpus of titles, author names, and institute names from DBLP and Citeseer. Using these word frequencies, the Bayesian classifier then assigns a score to each token (word) extracted from the eprint to show the probability of it belonging, in turn, to each of the various elements (author, title, etc.). An example might be the word "MetaTools", which, according to the training corpus, might have a 99% chance of being a title word but only a 1% chance of being part of an author. Most words are ambiguous in that they could appear in more than one context. Secondly, a state machine (Markov chain) provides background knowledge of the sequences of tokens in each of the five or six common eprint layouts. E.g. a simple state machine could be:

PrePrint ::= Title+ Author+

(i.e. one or more title tokens followed by one or more author tokens)
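The two-stage approach can be illustrated with a toy sketch. This is not PaperBase's code: the per-token class probabilities are hard-coded here rather than learned from DBLP/Citeseer word frequencies, and only the single layout Title+ Author+ is considered. The probabilities deliberately match the worked example that follows.

```python
# Toy sketch of the two-stage approach described above (not PaperBase itself):
# per-token class probabilities are combined under the layout "Title+ Author+",
# and the split with the highest product of probabilities is accepted.

# (P(token is a title word), P(token is an author word)) -- illustrative values.
token_probs = {
    "MetaTools": (0.99, 0.01),
    "Malcolm":   (0.12, 0.88),
    "Polfreman": (0.02, 0.98),
}

def best_parse(tokens):
    best = (0.0, None)
    # Layout PrePrint ::= Title+ Author+ : try every split point, including
    # the degenerate all-title and all-author parses.
    for split in range(len(tokens) + 1):
        score = 1.0
        for i, tok in enumerate(tokens):
            p_title, p_author = token_probs[tok]
            score *= p_title if i < split else p_author
        parse = (tokens[:split], tokens[split:])   # (title tokens, author tokens)
        if score > best[0]:
            best = (score, parse)
    return best

tokens = ["MetaTools", "Malcolm", "Polfreman"]
print(best_parse(tokens))
# (0.8537..., (['MetaTools'], ['Malcolm', 'Polfreman']))
```

A full system would enumerate all five or six layouts and all the element classes, but the principle of choosing the most probable assignment of tokens to elements is the same.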

34 Tonkin & Muller, Keyword and Metadata Extraction from Pre-prints, pp. 40-41. As Tonkin and Muller explain, "Of the remaining 87%, 32% included the right authors but had extraneous text that was misconstrued as authors. [The] sample was not sufficiently randomised and had many papers by a Slovenian author with a diacritical mark in both surname and first name, which skewed [the] results".
35 E.g. some clarification is needed as to their definition of "correct". For instance, the creators extracted by Data Fountains were often accompanied by footnote numbers, e.g. "Smith, John5", and it is unclear whether Tonkin and Muller would regard a similar result in their tests as a correct or partially-correct value.
36 Intute Political History sample.
37 The explanation of PaperBase provided here is a précis of the account provided in: Tonkin & Muller, Semi Automated Metadata Extraction for Preprint archives, pp. 160-162.


A parser then considers the likelihood of each possible distribution of tokens between, in this case, Title and Author. The likelihood of each parse is computed by multiplying the probability of each token belonging to the specific class. For example, assume that the Bayesian classifier has assigned the following probabilities:

MetaTools - probability as a title word = .99; probability as an author word = .01
Malcolm - probability as a title word = .12; probability as an author word = .88
Polfreman - probability as a title word = .02; probability as an author word = .98
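For illustration only (this is not PaperBase's own code), the scoring step can be sketched as follows: it enumerates the splits of the token stream permitted by a grammar of the form PrePrint ::= Title+ Author+ (including the degenerate all-title and all-author splits shown in the worked example below) and multiplies the per-token probabilities assigned by the classifier.

// Hypothetical sketch of the parse-scoring step described above; the
// probabilities are those assigned by the trained Bayesian classifier.
public class ParseScorer {
    public static void main(String[] args) {
        String[] tokens  = {"MetaTools", "Malcolm", "Polfreman"};
        double[] pTitle  = {0.99, 0.12, 0.02};   // P(token is a title word)
        double[] pAuthor = {0.01, 0.88, 0.98};   // P(token is an author word)

        double best = -1.0;
        int bestBoundary = -1;
        // boundary = number of leading tokens treated as title words
        for (int boundary = 0; boundary <= tokens.length; boundary++) {
            double p = 1.0;
            for (int i = 0; i < tokens.length; i++) {
                p *= (i < boundary) ? pTitle[i] : pAuthor[i];
            }
            System.out.printf("%d title token(s), %d author token(s): %.4f%n",
                    boundary, tokens.length - boundary, p);
            if (p > best) {
                best = p;
                bestBoundary = boundary;
            }
        }
        System.out.printf("Best parse: first %d token(s) as title (probability %.4f)%n",
                bestBoundary, best);
    }
}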

For the following three-word string there are four possible parses (the original report marked title words in bold and author words in italics; here the split is written out explicitly):

Title: [none]; Author: MetaTools Malcolm Polfreman (.01 x .88 x .98 = .0086)
Title: MetaTools; Author: Malcolm Polfreman (.99 x .88 x .98 = .8537)
Title: MetaTools Malcolm; Author: Polfreman (.99 x .12 x .98 = .1164)
Title: MetaTools Malcolm Polfreman; Author: [none] (.99 x .12 x .02 = .0024)

The accepted version is the one that results in the maximal overall probability for the authors, title, affiliation and email addresses which, in this case, is the second parse, with a probability of 0.8537.38 Alternative approaches are used for the small number of items (e.g. scanned legacy documents) from which it is not possible to extract the text as a simple stream of characters. For instance, the visual structure of a document may be exploited to segment the image before OCR software, such as gOCR, is used to translate it into text.

It is this use of a state machine (Markov chain) to provide background knowledge of the sequences of tokens in each of the five or six common eprint layouts that apparently gives PaperBase a big advantage over Data Fountains and the other tools that were tested. Data Fountains is relatively good at extracting the title of an article and the title of the journal from which it comes but, because it doesn't have this background knowledge, has trouble deciding which is which. Indeed, the article title is more often deposited into the isPartOf element (53% of the time) than into the title element, where it should be. It completes the process entirely correctly only 26% of the time. For similar reasons, it also often concatenates lines from the opening of the source text to form a title, separating the various lines by a semi-colon. Very often, one of these sections contains what would otherwise be an exact-match title, e.g.:

2740 IEEE TRANSACTIONS ON MAGNETICS;Enhanced Longitudinal Magnetooptic Kerr Effect Contrast in Nanomagnetic Structures

Some post-processing of the output of these two elements should make it possible to improve these figures very significantly. The semi-colons could be used to split the value into separate instances of the element before a pattern-matching process is used to move any instance containing, say, 'trans.', 'transactions', 'Journal' or 'J.' to the isPartOf element, and any that contains a copyright symbol to the rights element. But this would be, in effect, to create a new tool. A system along the lines of paperBase suggests a more effective and efficient approach.

Its ranking of confidence is a potentially very significant feature because it means that the paperBase system has the capacity to learn. This can be seen in relation to the generation of keywords. Its institutional repository uses a small controlled vocabulary of about 40 terms and a Bayesian classifier is used to predict these keywords using the words found in the preprint's title, author name(s) and abstract. As Tonkin and Muller explain:

38 Confidence ratings reach their maximal and minimal values at 1 and 0.


"The classifier is trained as and when preprints are added to the repository; when a preprint is added with specific keywords, the classifier is retrained, enabling it to make better predictions for the keywords."

Tonkin and Muller suggest that confidence in the PaperBase system builds with size because of the inbuilt capacity to cross-reference entries from multiple metadata records and rank their likelihood of being a match. The greater the size of the database, the greater the amount of machine learning that can take place. Connections can be made because two pre-prints are written by an author with the same name, or because they cover a similar subject matter according to the keywords. Those connections can be used to, for example, disambiguate author identities.39 It could eventually, in some senses, be self-correcting. SamgI too provides confidence ratings40 and the Data Fountains Evaluation Module, no doubt, could do likewise, based upon automatic metrics such as precision and recall. But the metadata needs to be of a certain quality (likely to be above the level that the tools currently achieve), otherwise errors will merely propagate. This is particularly true of author-supplied metadata from metatags.

Most promising of all, Tonkin and Muller have conducted trials which suggest that a deposit process can benefit from a bespoke tool, such as PaperBase, being used to seed a cataloguing interface upon ingest of a scholarly work into a repository. They split volunteers (twelve academic members of staff and PhD students) into two groups, with each volunteer being required to deposit six preprints into the repository. The first group entered the first three preprints manually and the remaining preprints using paperBase. The second group was required to enter the first three using PaperBase and the last three manually. Tonkin and Muller observed:
- Most participants (9 out of 11) thought the hybrid PaperBase method was faster than the manual approach, even though, as Muller and Tonkin acknowledge, the quantitative results do not support this.
- Metadata creation was more accurate under the hybrid PaperBase system, although only for the title. It is surmised that errors crept in with the manual approach because a surprising number of participants opted to type information rather than cut-and-paste from the PDF file.
- The hybrid approach resulted in more metadata because the abstract is filled in; some participants will not fill in an abstract manually.
- Most importantly of all, there was buy-in from the participants, who unambiguously preferred the semi-automated system41.
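A minimal sketch of the retraining loop described in the quotation above (hypothetical code, not the PaperBase implementation): each deposited preprint whose keywords have been confirmed by the depositor updates simple per-keyword word counts, from which later suggestions are scored.

import java.util.*;

// Hypothetical sketch of an incrementally retrained keyword classifier:
// word counts seen under each controlled-vocabulary keyword are updated
// whenever a preprint is deposited with confirmed keywords.
public class KeywordLearner {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();

    // Called on deposit: the confirmed keywords act as training labels.
    public void addPreprint(String text, Collection<String> confirmedKeywords) {
        for (String kw : confirmedKeywords) {
            Map<String, Integer> counts =
                    wordCounts.computeIfAbsent(kw, k -> new HashMap<>());
            for (String word : text.toLowerCase().split("\\W+")) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    // Naive scoring: suggest the keyword whose training words overlap most with the text.
    public String suggestKeyword(String text) {
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> e : wordCounts.entrySet()) {
            double score = 0;
            for (String word : text.toLowerCase().split("\\W+")) {
                score += e.getValue().getOrDefault(word, 0);
            }
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        return best;
    }
}

The real system uses a Bayesian classifier over the title, author name(s) and abstract; the sketch simply illustrates the incremental retraining pattern.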

Conclusions
Overall, Data Fountains was the most impressive of the tools at generating metadata from html, but its output for that digital format is still likely to be of only relatively limited benefit (at best) to portals. The output from html was frustrating because some good metadata was generated but there was no way to separate the wheat from the chaff. It had been hoped that time and accuracy savings would accrue from using metadata auto-generated from Websites to seed a cataloguing interface for subsequent amendment by an expert cataloguer. But the auto-generated metadata, at least for Websites, was judged to be of too poor a quality to make this likely, with the possible exception of keywords. Enhancing the auto-generated output appears to be at
39 Presentation on PaperBase by Henk Muller and Emma Tonkin at UKOLN, 22nd September 2006.
40 http://www.ariadne-eu.org/index.php?option=com_content&task=view&id=53&Itemid=83 . The <miValue> is the confidence rating.
41 Tonkin & Muller, Semi Automated Metadata Extraction for Preprint archives, pp. 162-166.


least as much effort for a cataloguer as working from scratch, at least if (s)he is working according to the exacting standards of the Intute Cataloguing Guidelines. The decisive factor in relation to Websites (html) was the poor quality of the auto-generated summaries. Even the best summaries were either very brief, had an inappropriate focus or order, or made questionable claims or other value judgements about the quality of the resource. The unpredictability of Webpage layout and the absence of tags or any other means for easily identifying semantic meaning within the content of html appear likely to limit the scope for improvements in extraction from this document type. Extraction errors tend either to be systemic and to affect all tools identically, or else the quality of metadata varies unpredictably from one item to the next depending upon quite micro-scale differences within the html. The gains to be had from invoking the highest-ranked tool for a given element and, if that is unable to generate a value, turning to the next best tool in turn are likely to be relatively modest. The outlook for scholarly works, on the other hand, appears to be more positive because of the greater efficiency of the extraction process for that document type. Tonkin and Muller's real-life experiments with PaperBase suggest that, at least for scholarly works, a repository can benefit from using a bespoke tool to seed a cataloguing interface with auto-generated metadata upon ingest. PaperBase has yet to be released publicly, but the MetaTools tests suggest that it could be worth developing Data Fountains (which was not designed primarily for this purpose) as a Web service so that, in the meantime, repositories might usefully experiment with seeding a cataloguing interface. The tests suggest that Data Fountains might help in this way for two metadata elements, abstract and keywords, although much less successfully than PaperBase.

Table: Summary of Results mentioned in this report

| Element | Test | Sample | Data Fountains metadata | DC-dot metadata | SamgI metadata |
|---|---|---|---|---|---|
| All elements | Completeness | Website (html): Intute: Political History | 69% | - | - |
| Creator | Correct or partially correct values | Website (html): Intute: Political History | 10% | 10% | 10% |
| Description | Relevant factoids per record (28.2 in reference set) | Website (html): Intute: Political History | 9.3 | 4.8 | 1.34 |
| Description | Precision | Website (html): Intute: Political History | 0.231 | 0.43 | 0.361 |
| Description | Recall | Website (html): Intute: Political History | 0.054 | 0.031 | 0.025 |
| Description | Length of summary: number of words (126 in reference set) | Website (html): Intute (all five subsamples combined) | 40.972 | - | - |
| Identifier | Exact match accuracy | Website (html): Intute: Political History | 83% | 94% | 88% |
| Keywords42 | Precision | Scholarly Works: White Rose | 0.93 | - | 0.74 |
| Keywords42 | Recall | Scholarly Works: White Rose | 0.18 | - | 0.087 |
| Keywords42 | F1 | Scholarly Works: White Rose | 0.19 | - | 0.08 |
| Keywords | Precision | Scholarly Works: White Rose | 0.22 | - | 0.20 |
| Keywords | Recall | Scholarly Works: White Rose | 0.43 | - | 0.23 |
| Keywords | F1 | Scholarly Works: White Rose | 0.29 | - | 0.21 |
| Title | Likert scale | Website (html): Intute: Political History | Correct: 34%; Partially correct: 30% | - | - |
| Title | Precision | Website (html): Intute: Political History | 0.65 | 0.59 | 0.36 |
| Title | Recall | Website (html): Intute: Political History | 0.47 | 0.55 | 0.14 |
| Title | Likert scale | Scholarly Works: White Rose | Correct: 28%; Partially correct: 27%; Wrong: 45% | - | - |

42 This test compared the extracted keyphrases with the reference set of author-assigned key terms.

References
Automatic Metadata Evaluation. iVia website, UC Riverside Libraries. Retrieved October 23, 2008, from: http://ivia.ucr.edu/projects/Metadata/Evaluation.shtml

Greenberg, J., Spurgin, K., & Crystal, A. (2006). Final Report for the AMeGA (Automatic Metadata Generation Applications) Project. Retrieved October 23, 2008, from: http://www.loc.gov/catdir/bibcontrol/lc_amega_final_report.pdf

Hassel, M. (2004). Evaluation of Automatic Text Summarization - A Practical Implementation. Licentiate Thesis, Department of Numerical Analysis and Computer Science, Royal Institute of Technology, Stockholm, Sweden. Retrieved October 23, 2008, from: http://www.csc.kth.se/~xmartin/papers/licthesis_xmartin_notrims.pdf

Hillmann, D., Dushay, N., & Phipps, J. (2004). Improving metadata quality: augmentation and recombination. In Proceedings of the 2004 International Conference on Dublin Core and Metadata Applications. ISBN 7543924129.

Hovy, E. and Lin, C. (1998). Automated text summarization and the SUMMARIST system. In Proceedings of a Workshop held at Baltimore, Maryland, October 13-15, 1998. Association for Computational Linguistics, Morristown, NJ, 197-214. DOI: http://dx.doi.org/10.3115/1119089.1119121

Hunter, J. and Choudhury, S. (2004). A semi-automated digital preservation system based on semantic web services. In Proceedings of the 4th ACM/IEEE-CS Joint Conference on Digital Libraries (Tucson, AZ, USA, June 07-11, 2004). JCDL '04. ACM, New York, NY, 269-278. DOI: http://doi.acm.org/10.1145/996350.996415

Maynard, D. (2005). Benchmarking ontology-based annotation tools for the Semantic Web. UK e-Science Programme All Hands Meeting (AHM2005) Workshop "Text Mining, e-Research and Grid-enabled Language Technology", Nottingham, UK, 2005. Retrieved October 23, 2008, from: http://gate.ac.uk/sale/ahm05/ahm.pdf


Maynard, D., Peters, W., & Li, Y. (2006). Metrics for evaluation of ontology-based information extraction. In WWW 2006 Workshop on Evaluation of Ontologies for the Web (EON), Edinburgh, Scotland. Retrieved October 25, 2008, from: http://gate.ac.uk/sale/eon06/eon.pdf

Moen, W.E., Stewart, E.L., & McClure, C.R. (1998). Assessing metadata quality: findings and methodological considerations from an evaluation of the U.S. Government Information Locator Service (GILS). In ADL '98: Proceedings of the Advances in Digital Libraries Conference, Washington, DC, USA. IEEE Computer Society, 246.

Nichols, D. M., Paynter, G. W., Chan, C.-H., Bainbridge, D., McKay, D., Twidale, M. B., & Blandford, A. (2008). Metadata tools for institutional repositories. Working Paper 10/2008, Working Paper Series, ISSN 1177-777X. Retrieved October 25, 2008, from: http://eprints.rclis.org/archive/00014732/01/PDF_(18_pages).pdf

Ochoa, X. & Duval, E. (200?). Towards Automatic Evaluation of Learning Object Metadata Quality. Retrieved October 23, 2008, from: http://ariadne.cti.espol.edu.ec/M4M/files/TowardsAutomaticQuality.pdf

Ochoa, X. & Duval, E. (2006). Quality Metrics for Learning Object Metadata. In E. Pearson & P. Bohman (Eds.), Proceedings of World Conference on Educational Multimedia, Hypermedia and Telecommunications 2006 (pp. 1004-1011). Chesapeake, VA: AACE. Retrieved October 25, 2008, from: http://ariadne.cti.espol.edu.ec/M4M/files/QM4LOM.pdf

Ochoa, X. & Duval, E. (2006). Towards Automatic Evaluation of Learning Object Metadata Quality. Advances in Conceptual Modeling - Theory and Practice, 4231, 372-381.

Paynter, G. W. (2005). Developing practical automatic metadata assignment and evaluation tools for internet resources. In Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (Denver, CO, USA, June 07-11, 2005). JCDL '05. ACM, New York, NY, 291-300. DOI: http://doi.acm.org/10.1145/1065385.1065454

Sparck Jones, K. & Galliers, J. R. (1995). Evaluating Natural Language Processing Systems: An Analysis and Review. Lecture Notes in Artificial Intelligence 1083. Springer, Berlin, xv + 228 pp. ISBN 3-540-61309-9.

Stvilia, B., Gasser, L., Twidale, M., Shreeves, S., & Cole, T. (2004). Metadata quality for federated collections. In Proceedings of ICIQ04 - 9th International Conference on Information Quality, Boston, MA, 111-125. Retrieved October 25, 2008, from: http://www.isrl.uiuc.edu/~gasser/papers/metadataqualitymit_v410lg.pdf

Tonkin, E. & Muller, H. (2008a). Keyword and metadata extraction from pre-prints. In Proceedings of the 12th International Conference on Electronic Publishing, Toronto, Canada, 25-27 June 2008 (eds. Leslie Chan and Susanna Mornati), pp. 30-44. ISBN 978-0-7727-6315-0. Retrieved October 23, 2008, from: http://elpub.scix.net/data/works/att/030_elpub2008.content.pdf

Tonkin, E. & Muller, H. L. (2008b). Semi automated metadata extraction for preprints archives. In Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries (Pittsburgh, PA, USA, June 16-20, 2008). JCDL '08. ACM, New York, NY, 157-166. DOI: http://doi.acm.org/10.1145/1378889.1378917

van Halteren, H. and Teufel, S. (2003). Examining the consensus between human summaries: initial experiments with factoid analysis. In Proceedings of the HLT-NAACL 03 Text Summarization Workshop. Association for Computational Linguistics, Morristown, NJ, 57-64. DOI: http://dx.doi.org/10.3115/1119467.1119475


MetaTools Stage 3 Report. Technical report: Development and integration of prototype Web service metadata generation tools
TABLE OF CONTENTS

INTRODUCTION
METADATA GENERATION TOOLS
  Service Descriptions and Discovery
METATOOLS SERVICE DATA MODEL
  MetaTools Ontology
  Quality of Service
  MetaTools System Architecture
CONCLUSION
REFERENCES
APPENDICES
  Appendix A: Preservation Domain Ontology (RDFS)
  Appendix B: PeDRo application used to allow the user to load, update or edit service description files in XML
  Appendix C: Brief descriptions of some of the better metadata generation tools
    DC-dot
    Data Fountains
    JHOVE
    KEA
    SamgI (Simple Automated Metadata Generation Interface)
    Yahoo! Term Extraction service


INTRODUCTION

The evaluation programme in phase two of the MetaTools project informed phase three, which saw Web service interfaces being developed for some of the more promising metadata generation systems. The evaluation programme contributed by, firstly, identifying the most suitable candidate tools for development as Web services and, secondly, by contributing data, particularly on such matters as the supported format(s), extraction function(s), element(s) generated and output format(s), to the ontology lying at the heart of the system. The following section describes the rationale and technical design of the third stage of the project, particularly the use of an ontology for describing and discovering metadata generation Web services within broader preservation workflows.

The MetaTools middleware API has been developed within the context of a high-level conceptual model for digital preservation as a whole. The high-level conceptual model is presented in the top half of Figure 1. It is based on a brief study of recommendations from various data and metadata models and on in-house experience with digital preservation processes. The model contains a minimal set of appropriately inter-related concepts borrowed and modified from [1], namely digital object, digital content and activity as basic structural entities. The three entities are defined as follows:
- Digital object: digital content plus other descriptive and administrative information about that content.
- Digital content: holds the primary content.
- Activity: takes digital objects as input and produces digital objects as output.

All of the activities within digital preservation may be integrated within a service-based model. Figure 1 shows how activities are mapped to the service data model; this service data model is explained in detail later in the section. As the volume and variety of digital content grow, so does the number of initiatives and tools for processing that content, with the result that the preservation life cycle evolves and expands in terms of the number of activities carried out. Auto-generation of metadata for digital content is one kind of activity or event that may occur within the digital preservation life cycle. Current software programs for performing digital preservation activities generally have to be hard-coded and are usually installed locally. As a result, it is difficult to incorporate the latest and best solutions available, and it is currently difficult to automatically discover new services and integrate them into the digital preservation process.


[Figure 1. Mapping the digital preservation activity as an operation of a service. The upper half shows the elements of a conceptual model for digital preservation (Digital Object, Digital Content, Rights, Metadata, Activity); the lower half shows the service data model, in which an Activity is mapped as an Operation of a Service.]

Service-oriented architectures and Web services offer exciting possibilities for the orchestration and composition of distributed resources. To assist the composition of processes, it is necessary to have effective methods for describing services and suitable means to identify them appropriately. Exposing the activities as Web services would lead to self-contained and self-describing application components that communicate using open protocols. The Web services platform is a simple, interoperable messaging framework that enables the exchange of data between different applications and platforms [3]. The MetaTools project has succeeded in converting standalone metadata generation tools into Web services and has contributed a generic, extendible and flexible data model for services that is applicable to any preservation activity. It has contributed to the creation of a rich service layer that builds upon existing architectures and service data models [4, 10]. This section describes the service data model and the domain ontology, which provides human-understandable terminology for describing services. It also presents the MetaTools architecture and discusses its components and their interactions.


METADATA GENERATION TOOLS

Following the test phase of the MetaTools project, three tools were identified as plausible candidates for generating metadata useful for the discovery of digital resources from textual objects, such as scholarly works (i.e. journal articles) and Web resources in various formats:
1. Key Phrases Extraction Algorithm (KEA) [5]
2. Simple Automatic Metadata Generation Interface (SAmgI) [6]
3. Data Fountains (which uses the iVia Project's libiViaMetadata software and will be referred to as iVia in this report) [7]

Each of these tools is restricted in terms of the digital formats that it is able to support. The problem is usually solved by using format converter programs. A number of converters were developed, using different libraries, to convert PDF, MS Word and PowerPoint files to text format, particularly for KEA, which only supports text files. File upload and download functions were also integrated in order to support KEA, because it is only able to read from a local file system. The other two tools support most, though not all, common formats (such as pdf and html) and are able to download and upload. A common interface has been developed to use these tools. More importantly, SOAP and RESTful interfaces, the most widely used Web service styles, were developed.

The two main tasks were, firstly, to identify the data format(s) each tool supports and, secondly, to identify the types of metadata it extracts. WSDL (Web Services Description Language) interface documents for each of the tools could then be written describing the different operations or functions that each provides. A WSDL document uses XML to describe where a Web service is deployed, what operations that service provides, and how it is possible to bind with them. In WSDL, the definitions of endpoints and messages are separate from the data format bindings. Hence, WSDL allows the reuse of definitions: messages, which are abstract descriptions of the data being exchanged, and port types, which are abstract collections of operations.

REST (REpresentational State Transfer) was used to create further interfaces for the Web services. REST allows application state and functionality to be abstracted into resources that are uniquely addressable via logical URIs [11]. REST can support any media type, but XML, which was used in the project, is expected to be the most popular transport for structured information. The same message descriptions used in WSDL are used to exchange structured data through the REST interfaces of the Web services. RESTful resource operations or functions are based on the HTTP communication verbs, making the architecture simpler to understand and use than SOAP. An XML vocabulary for expressing the behaviour of HTTP resources, the Web Application Description Language (WADL), is provided, which makes it easy to write clients for the RESTful Web services.
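By way of illustration, a client of one of these RESTful interfaces might look like the following sketch. The endpoint URL and the 'resource' parameter name are invented for the example; the real operations and parameters are those defined in the project's WSDL/WADL documents.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Hypothetical REST client: POSTs a resource URL to an (invented) metadata
// generation endpoint and prints the XML metadata returned in the response.
public class MetadataClient {
    public static void main(String[] args) throws IOException {
        URL endpoint = new URL("http://example.org/metatools/keyphrases"); // placeholder
        HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");

        String body = "resource=" + URLEncoder.encode("http://www.ahds.ac.uk/", "UTF-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes("UTF-8"));
        }

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // XML metadata document
            }
        }
    }
}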

Service Descriptions and Discovery


Software components exposed as services can be described at many different levels of detail, depending on the intended use of the description. The description needed to discover a service is different from that required to configure and execute it. In order to capture the diversity of Web services and their many facets, a clear and abstracted representation of customizable Web services is necessary. However, such a representation cannot be expressed in terms of the low-level concepts attached to Web services that are typical of standards such as WSDL, UDDI or SOAP. Web services languages and protocols are implementation oriented, whilst we need to address functional integration problems such as those that emerge in the context of high-level organizational processes. A generic data model that incorporates semantics into the service description was therefore produced in order for services to be effectively discovered, thus assisting in the integration of processes.


METATOOLS SERVICE DATA MODEL

The core data model developed to describe the services is shown as a conceptual UML class diagram in Figure 2. The model combines features from various models, such as the service profile in OWL-S [10], and service registries, such as UDDI and Grimoires, to provide a generic representation that is not tied to an underlying middleware layer or to specific service grounding information. The data model distinguishes between functional data, i.e. the operation, and non-functional information about the service, i.e. the service publishing data. The various metadata generation tools provide a range of related, but independent, functionalities. Thus, the service element must contain publishing information, such as the name of the organisation that is the service provider, the author and a text description of the service functionality. A service may provide one or more operations, and these operations do not directly correspond to the operations in a WSDL layer. For RESTful services, each is modelled as a service with one or more operations or resources. For a local Java object, multiple WSDL operations may be presented as a single functionality, or else the method objects in an interface for a tool can be mapped as WSDL operations.

Figure 2. Service Data Model

The majority of information in the data model consists of annotations from the ontology describing the preservation domain, which is discussed later in the section. The ontology acts as an annotation vocabulary describing core entities such as data types, tasks and activities commonly involved in the preservation domain. In Figure 2, the operations are characterized by inputs/outputs and have a number of domain-specific attributes. The attribute values are the relevant concepts from the digital preservation domain ontology. The attributes for the service entity are:


- serviceApplication: the functionality of a service is part of an application or tool, for example KEA or JHOVE. This helps the user to know the underlying tool used as a service.
- serviceType: defines the type of service, i.e. whether it is WSDL, RESTful or a Java object.

The main operation attributes are:
- operationTask: describes the preservation activity being performed by the operation, i.e. what exactly the operation does. For example, knowing that metadata is being generated for a digital collection, or that data is being ingested into the repository, is useful information for a preservation manager wanting to perform a preservation task.
- operationMethod: describes the algorithm used to perform the task, as there may be more than one algorithm available to execute a given preservation task, for example an algorithm to generate a title or key phrases from digital content.

The operation entity's inputs and outputs are mapped to the parameter entity. The attributes of the parameter are:
- semanticType: defines the domain-specific data type, e.g. webpage location. In preservation, digital collections come from many different domains, such as literature and the performing arts. It is essential to know the form of the digital resources supported by the services, such as web pages, file systems, emails or databases.
- parameterFormat: describes the representation of the data. Data can be represented using many different formats, and the format can be specific to the digital preservation domain, such as the pdf document format. The description depends upon the type of data the task processes and returns.
- transportDataType: the representation of the actual data being consumed or produced by the service, for example a string or byte array.

The service model is presented as a service ontology that describes the physical and operational features of Web services, such as inputs and outputs, using the described attributes. The digital preservation domain ontology is the filler for these attributes. Initially, ontological terms were determined specifically for metadata generation services and the functionality that each of the tools provided. With a common service data model, the need for a digital preservation domain ontology was recognized, where the terms defined for metadata generation services form part of this domain ontology.
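Expressed informally as plain classes, the data model described in this section might be sketched as follows. This is an illustration only, not the project's implementation; the example values in the comments are terms from the MetaTools ontology in Appendix A.

import java.util.List;

// Illustrative rendering of the MetaTools service data model described above.
// Attribute values such as operationTask or semanticType would be terms drawn
// from the digital preservation domain ontology.
class Service {
    String name;
    String provider;           // publishing information
    String description;
    String serviceApplication; // e.g. "KEA", "JHOVE"
    String serviceType;        // e.g. "WSDL", "RESTful", "Java object"
    List<Operation> operations;
}

class Operation {
    String name;
    String operationTask;      // e.g. "generating_descriptive_metadata"
    String operationMethod;    // e.g. "key_phrases_extraction_algorithm"
    List<Parameter> inputs;
    List<Parameter> outputs;
}

class Parameter {
    String name;
    String semanticType;       // e.g. "webpage_location"
    String parameterFormat;    // e.g. "pdf_format"
    String transportDataType;  // e.g. "string", "byte[]"
}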

MetaTools Ontology

The notion behind using an ontology is to capture information about the core activities involved in any domain, including the data types and their relations to one another. Having a common ontology for an application domain helps to unify information presentation and permits software and information reuse. We present a MetaTools ontology that is designed to serve needs from the wider digital preservation domain (a complete version of the digital preservation domain ontology will be incorporated in the SOAPI project final report). The MetaTools ontology is understood as a common information hierarchy describing the data (streams) and functions that can be provided by various metadata generation tools.


Figure 3 outlines the MetaTools ontology that was developed by the project (see Appendix A for the full RDFS). The rectangular boxes are classes that represent concepts in the ontology. The MetaTools ontology defines four main concepts that are sub-classes of digital_preservation_concept:
- preservation_task: defines the various activity types, for example metadata generating. (Activity in Figure 1.)
- preservation_collection: defines any data or data files that are an input/output of a task. (Digital object in Figure 1.)
- preservation_file_format: defines a number of data formats that may be applicable to a certain task, while a given task may support one or more specific formats. (Related to digital object.)
- preservation_algorithm: defines a program or algorithm used for a particular task, for example the WEKA text mining algorithm, which is used in KEA. (Related to activity.)

The ontology is largely self-explanatory. The sub-classes of preservation_task and preservation_algorithm are used as annotations to describe the operation entity, whereas the digital_collection and preservation_file_format sub-classes are used to describe the parameter entity in the service data model. For example, the diagram captures that pdf and ms_word are sub-classes of document_formats, which is a sub-class of preservation_file_formats; all these classes are therefore sub-classes of digital_preservation_concept. The ontology is designed in such a way that it is reusable, and further terminology for various functions and data types could easily be added under the four classes.


Figure 3. MetaTools Ontology

Quality of Service

The test program highlighted the necessity of defining the quality dimensions of the registered services, in particular focusing on the outputs generated by the tools. Metrics for this purpose are usually quantitative measurements and are described in the Phase two report. In addition, now that the tools are available as Web services it is important to define quality metrics such as response time, availability, reliability and scalability as part of the service description. The user's decision to use a service may depend on certain metrics reaching a particular threshold. For example, in the case of metadata generation tools,


it is essential to know how successfully a tool can extract metadata once the number of resources involved increases (i.e. the scalability of the tool). There are common quality metrics that can be defined based on the requirements of the tool users and which can be incorporated within the service data model as additional attributes. Nevertheless, the main objective here remains to integrate metrics that define the quality of the outputs generated by the tools. The following figure outlines the various quality metrics and the metadata elements for which they are appropriate, drawn from the test metrics and results discussed in Chapter 2:

[Figure: quality metrics (Exact Match Accuracy, Recall Ratio, Precision Ratio, F1 Score, Completeness) mapped to the metadata elements to which they apply (Identifier, Format, Description, Creators, Title, KeyPhrases).]

Metadata generation Web services may output multiple metadata elements as an element array from any given operation. Alternatively, each defined operation may generate one element. For Data Fountains, operations such as assignTitle(), assignCreators(), assignKeyPhrases() and assignDescription() are defined for generating the various elements from the metadata assigner service. The Data Fountains Web service also defines an operation called iViaMetadataAssign() that outputs a list of values for all the elements. SAmgI also defines an operation, getMetadata(), to output a list of elements. For each operation, these attributes can easily be attached to a new element called ResultQuality and thus be incorporated into the service description. The Operation element will have a one-to-one relationship with the ResultQuality element in order to associate it with the output quality information. For example, to describe the quality of results for the extractKeyPhrases() operation of the KEA Web service, the operation will have a ResultQuality element with only the values for recall ratio, precision ratio and F1 score. This ensures that only the appropriate quality attributes are provided in its description.

Defining the quality of results from operations whose output is a list of elements is a more difficult task. A quality metric such as completeness is, by definition, concerned with the generation of all metadata elements and contains a percentage measure representing the degree of completeness in generating the metadata elements by an operation. But metrics like exact match accuracy are specific to each of the elements. That is to say that the degree of exact match accuracy for the identifier might be very different from that for the title element. The solution is to define a value for each quality metric for each of the metadata elements. For example, completeness is defined as:

<Completeness>
  <percentageValue>69</percentageValue>
</Completeness>

and exact match accuracy can be defined as:

<ExactMatchAccuracy>
  <percentageValue element="Identifier">83</percentageValue>
  <percentageValue element="Title">34</percentageValue>
</ExactMatchAccuracy>
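A service consumer could read such quality elements back out of a service description and filter on them. The sketch below assumes the illustrative element names shown above and keeps only those descriptions whose completeness is at least a given threshold; it is an example, not part of the MetaTools code.

import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Illustrative filter over service descriptions carrying the quality elements
// sketched above (element names follow the example, e.g. <Completeness>).
public class QualityFilter {
    public static List<String> withMinimumCompleteness(
            List<String> descriptionFiles, double threshold) throws Exception {
        List<String> accepted = new ArrayList<>();
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        for (String file : descriptionFiles) {
            Document doc = factory.newDocumentBuilder().parse(file);
            NodeList values = doc.getElementsByTagName("percentageValue");
            for (int i = 0; i < values.getLength(); i++) {
                org.w3c.dom.Node n = values.item(i);
                if ("Completeness".equals(n.getParentNode().getNodeName())
                        && Double.parseDouble(n.getTextContent()) >= threshold) {
                    accepted.add(file);
                    break;
                }
            }
        }
        return accepted;
    }
}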

Attaching such quality information about the tools in a service description enables users to search for services using these attributes, for example "get services with completeness of 50% or more".

MetaTools System Architecture

Figure 4 shows the MetaTools architecture. It consists of three components: the description importer, the MetaTools client and the Grimoires mapper client. The MetaTools client consists of client-side libraries to invoke the Web services and the local tools. The description importer takes the low-level description of the services and produces an XML structure. The description importer is able to process the service description in both WSDL and WADL; it maps the information provided by the low-level description onto the service ontology.

Figure 4. MetaTools System Architecture


The PeDRo Annotator [8] is used to process the XML structure. It presents a form-based data-entry interface through which a user can enter data for the attributes in the service model. Using the annotator, the terms in the preservation domain ontology can be added as values to the service description attributes (e.g. operationMethod with the value metadata_generation_algorithm). A service description complete with the domain semantics is produced using the annotator. PeDRo can also be used to annotate any local services or Java interface. We used this application because it provided a user-friendly way to develop our service data model and a convenient way of associating controlled terms from the ontology with each attribute (for the interface, see Appendix B).

The Grimoires client provides a mapper that maps the service description to the format required by the Grimoires registry and publishes it. The Grimoires client also allows for the querying of services using the preservation domain ontology. The Grimoires registry [9] was preferred for the project because it is based on the UDDI registry, with the added advantage that semantic information about the service can be attached in a simple way. The registry is mainly for WSDL-based services, but WADL and simple local services were also tested with Grimoires. The fact that the service data model is independent of any registry provides us with the flexibility to use any existing Web service registry. Although the system architecture specifies this registry, we also provide functions to map the service descriptions to RDF format and store them in a repository; they could then be queried using existing query languages such as SPARQL and RDQL.

Figure 5 presents a sequence diagram showing the interaction between the components summarised in the architecture. The MetaTools components, by which we mean the description importer and the Grimoires client, are the middleware API used to create a service description which is then mapped to, and registered in, the Grimoires registry. The user obtains the XML structure by sending the WSDL or WADL location to the description importer. The user launches the PeDRo annotator programmatically or via an executable file; the file is opened using the PeDRo interface. Once this has happened, the user fills the attributes of the service data model into the XML structure. Some attributes, like service name and operation name, are pre-filled from the WSDL/WADL. The text description of the service and other business details are optional and are to be filled in by the user. For attributes that require ontology terms as fillers, the annotator loads terms from the RDF document and allows the user to select one of the terms from the list (Appendix B). Once data entry is completed, the user saves the file locally. Note that both the importer and the PeDRo annotator use the service ontology to create the XML structure and the service description file, respectively.

The user registers the service description using the Grimoires client. It first maps the description and then publishes it to the registry using its client-side library. During the mapping, any service attributes, such as operationTask, operationMethod and semanticType, that are not present in the service data model of the registry are attached as metadata. A metadata instance consists of a metadata type and value pair, where the metadata type is the attribute type (e.g. URI#operationTask or URI#operationMethod) and the value is the corresponding term from the domain ontology. After the service is published, the details of the service are returned to the user. A query client within the Grimoires client provides interfaces by which the registry can be queried based on the metadata type and value pair. The user first sends a request to generate ontology classes, and this creates classes for each metadata type in the domain ontology. The classes are only created once, or when the ontology is updated. The query client provides an easy way to query the registry using the MetaTools ontology. It provides functions to which one can attach values from the ontology (i.e. so that the user does not need to know the metadataType and value pair in advance). Details of all the available services with the given metadata, including their access details, are returned to the user.
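For illustration, a client-side lookup along the lines just described might look like the sketch below. The interface and method names are invented for the example; they are not the actual Grimoires client API.

import java.util.List;

// Hypothetical wrapper around the registry query client described above.
// Class and method names are illustrative; the real client works in terms of
// metadataType/value pairs mapped from the ontology.
interface ServiceQueryClient {
    // e.g. find all services whose operationTask is "generating_descriptive_metadata"
    List<ServiceRecord> findByOntologyTerm(String attribute, String ontologyTerm);
}

class ServiceRecord {
    String name;
    String accessPoint;   // endpoint details returned to the user
}

class QueryExample {
    static void run(ServiceQueryClient client) {
        List<ServiceRecord> services = client.findByOntologyTerm(
                "operationTask", "generating_descriptive_metadata");
        for (ServiceRecord s : services) {
            System.out.println(s.name + " -> " + s.accessPoint);
        }
    }
}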

[Figure 5. Service description and discovery: sequence diagram showing the interactions between the User, Description Importer, PeDRo Annotator, Grimoires Client and Grimoires Registry (getXmlStructure, launchPedro/openFile, enterData/getTerms, saveXML, publishToGrimoires, mapToGrimoires/publish, generateOntoClasses, inquireByClassName/inquireMetadata).]

CONCLUSION

We have presented a prototype service ontology and MetaTools ontology that facilitate service description and discovery. The design of the MetaTools ontology is sufficiently flexible to form part of an upper domain ontology and can contribute terms for various activities within the domain. The proposed MetaTools system is generic and has the potential to encourage institutions to describe, discover, coordinate and share activities other than metadata generation services. As services or tools evolve, it will be possible to integrate them by adding semantic descriptions, thereby updating the registry. To conclude, we believe that an ontology-aided approach to service discovery, as employed by the MetaTools project, is a practical solution that is adaptable to relevant domains such as digital preservation.

REFERENCES

[1] Panos Constantopoulos and Vicky Dritsou. An ontological model for digital preservation. In the International Symposium in Digital Curation (DigCCurr2007), April 18-20 2007, Chapel Hill, NC, USA.
[2] Alexander Mikroyannidis, Bee Ong, Kia Ng, and David Giaretta. Ontology-Driven Digital Preservation of Interactive Multimedia Performances. 2nd International Conference on Metadata and Semantics Research (MTSR 2007), 11-12 October, Corfu, Greece.
[3] Web Services Architecture. W3C Working Group Note 11 February 2004. http://www.w3.org/TR/ws-arch/
[4] myGrid project. http://www.mygrid.org.uk/
[5] http://www.nzdl.org/Kea/index.html
[6] http://www.cs.kuleuven.be/~hmdb/joomla/
[7] http://ivia.ucr.edu/manuals/stable/libiViaMetadata/5.4.0/
[8] http://pedrodownload.man.ac.uk/history.html
[9] Grid Registry with Metadata Oriented Interface (Grimoires). http://twiki.grimoires.org/bin/view/Grimoires/
[10] W3C Member Submission. OWL-S Semantic Markup Language, 22 November 2004. http://www.w3.org/Submission/OWL-S/
[11] Leonard Richardson and Sam Ruby, RESTful Web Services. May 2007.


APPENDICES

Appendix A: Preservation Domain Ontology (RDFS)


<rdf:RDF> <!-- operation tasks --> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#preservation_task" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#digital_preservation_concept"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#metadata_generating" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#preservation_task"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#generating_descriptive_and_technical_metadata" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#metadata_generating"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#generating_descriptive_metadata" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#metadata_generating"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#generating_technical_metadata" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#metadata_generating"/> </a:Class> <!-- operation methods --> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#preservation_algorithm" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#digital_preservation_concept"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#metadata_generation_algorithm" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#preservation_algorithm"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#key_phrases_extraction_algorithm" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#metadata_generation_algorithm"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#WEKA_text_mining_algorithm" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#metadata_generation_algorithm"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#title_extraction_algorithm" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#metadata_generation_algorithm"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#author_extraction_algorithm" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#metadata_generation_algorithm"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#abstract_extraction_algorithm" a:comment="">


Project Acronym: MetaTools Version:1a (Stage 3 Report) Contact: (until 31 October 2008): Malcolm Polfreman (malcolm.polfreman@kcl.ac.uk) Contact: (after 31 October 2008): Steve Grace (stephen.grace@kcl.ac.uk) Date: 30/10/2008 <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#metadata_generation_algorithm"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#description_extraction_algorithm" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#metadata_generation_algorithm"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#broad_subjects_catagories_extraction_algorithm" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#metadata_generation_algorithm"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#format_type_extraction_algorithm" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#metadata_generation_algorithm"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#language_extraction_algorithm" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#metadata_generation_algorithm"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#size_extraction_algorithm" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#metadata_generation_algorithm"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#title_description_keyphrases_language_format_type_url_extraction_a lgorithm" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#metadata_generation_algorithm"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#filename_format_size_language_extraction_algorithm" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#metadata_generation_algorithm"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#title_keyphrases_filename_format_size_language_extraction_algorith m" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#metadata_generation_algorithm"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#preservation_format_conversion" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#preservation_algorithm"/> </a:Class> <!-- semantic type --> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#digital_collections" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#digital_preservation_concept"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#online_collection_resources" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#digital_collections"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#webpage_location" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#online_collection_resources"/> </a:Class>


Project Acronym: MetaTools Version:1a (Stage 3 Report) Contact: (until 31 October 2008): Malcolm Polfreman (malcolm.polfreman@kcl.ac.uk) Contact: (after 31 October 2008): Steve Grace (stephen.grace@kcl.ac.uk) Date: 30/10/2008 <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#file_system_collection" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#digital_collections"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#local_directory_location" a:comment="upload all files in the directory- asynchronous"> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#file_system_collection"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#local_file_location" a:comment="upload files"> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#file_system_collection"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#resource_metadata" a:comment="metadata generated"> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#digital_collections"/> </a:Class> <!-- parameter file format --> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#preservation_file_formats" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#digital_preservation_concept"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#document_formats" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#preservation_file_formats"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#plain_text_format" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#document_formats"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#pdf_format" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#document_formats"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#rtf_format" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#document_formats"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#html_format" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#document_formats"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#ms_document_format" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#document_formats"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#ms_powerpoint_format" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#document_formats"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#open_office_text_format" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#document_formats"/> </a:Class> <a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#open_office_presentation_formats" a:comment=""> <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#document_formats"/> </a:Class>


<a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#ps" a:comment="">
  <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#document_formats"/>
</a:Class>
<a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#eps" a:comment="">
  <a:subClassOf rdf:resource="http://www.cerch.kcl.ac.uk/ontology#document_formats"/>
</a:Class>
<a:Class rdf:about="http://www.cerch.kcl.ac.uk/ontology#digital_preservation_concept" a:comment="preservation domain"/>
</rdf:RDF>
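The class hierarchy listed above can be inspected programmatically. The following is a minimal sketch, not part of the MetaTools deliverables: it assumes the RDF/XML has been saved locally as metatools_ontology.rdf (a hypothetical filename) and that the Python rdflib package is installed. It matches on the local name "subClassOf" rather than a fixed namespace, because the namespace bound to the a: prefix is declared earlier in the listing.

# Illustrative sketch only: list the subclass relations in the ontology above.
# Assumes the listing has been saved as "metatools_ontology.rdf" (hypothetical
# filename) and that rdflib is installed.
from rdflib import Graph

g = Graph()
g.parse("metatools_ontology.rdf", format="xml")

for subject, predicate, obj in g:
    # Match on the local name so the sketch does not depend on the exact
    # namespace bound to the a: prefix.
    if str(predicate).endswith("subClassOf"):
        child = str(subject).split("#")[-1]
        parent = str(obj).split("#")[-1]
        print(f"{child} -> {parent}")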



Appendix B: The PeDRo application, used to allow users to load, update or edit service description files in XML.

Selecting a term from the list of terminologies to describe the input format that the service accepts.

Tree view of the service description file, showing all the operations and their parameters. Clicking on an item displays its details in the right-hand frame, where values for attributes such as parameterDescription can be added or edited.

Right-clicking on these attributes lets the user select a value from the list of terms loaded from the ontology.



Appendix C: Brief descriptions of some of the better metadata generation tools

DC-dot

DC-dot [1] is the best known of a group of usually Java-based tools that harvest metatags from web pages, when an identifier such as a URL is submitted manually to an online form, in order to generate Dublin Core metadata. Others are Reggie - The Metadata Editor [2], Describethis [3], the Viewer-Generator Dublin Core metadata free online tool [4], and Metatag Extractor [5]. The harvesting of metatags is the simplest form of metadata generation. When they are available, DC-dot harvests the contents of Dublin Core or other metatags within the header of the source document, but extracts content identified by HTML tags within the body of the page when the metatag for a given element is absent [6]. DC-dot copies resource identifier metadata from the web browser's address bar, and harvests title, keywords, description and type metadata from resource META tags. If source-code metadata is absent (i.e. there are no META tags), DC-dot will automatically generate keywords by analysing anchors (hyperlinked concepts) and presentation encoding, such as <strong>, bolding and font size, but will not produce description metadata in this way: description metadata is only extracted from a description META tag. Greenberg states that DC-dot also automatically generates type, format and date metadata, and can read source-code programming that automatically tracks dates; for example, a last-updated date might be coded as 'Last Modified ' + lm_day + ' ' + monthName[lm_month-1] + ' ' + lm_year [7]. If so configured, DC-dot attempts to guess the value of the Publisher element. The guess is based on the owner of the domain name of the supplied URL, but the process of guessing the Publisher may be quite slow. A small range of system-generated metadata, such as file size, may also be derived from the HTTP transport response. (A short illustrative sketch of DC-dot's metatag fallback approach is given after the Data Fountains overview below.)

Data Fountains

Data Fountains is a tool not only for describing Internet resources but also for discovering them in the first place. It has three modes of operation:

- It can generate metadata for a given page when provided with the relevant URL.
- It can trawl for, and generate metadata for, Internet resources on a particular topic.
- It can drill down and follow links from a starting URL.

It then generates metadata records and extracts rich full text (i.e. the key paragraphs/strings that best represent the text) for them.
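The sketch below illustrates the metatag fallback order that DC-dot uses for the title element, as described in footnote 6: <meta name="DC.title">, then <meta name="Title">, and finally the <title> tag. It is purely illustrative and is not DC-dot's own code; it assumes Python with the requests and beautifulsoup4 packages installed, and the function name harvest_title is invented for illustration.

# Illustrative sketch only - not DC-dot's implementation.
# Assumes the requests and beautifulsoup4 packages are installed.
import requests
from bs4 import BeautifulSoup

def harvest_title(url):
    """Return a title value using the fallback order in footnote 6:
    <meta name="DC.title">, then <meta name="Title">, then <title>."""
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")

    for name in ("DC.title", "Title"):
        tag = soup.find("meta", attrs={"name": name})
        if tag and tag.get("content"):
            return tag["content"].strip()

    if soup.title and soup.title.string:
        return soup.title.string.strip()
    return None

# Example (hypothetical URL):
# print(harvest_title("http://www.example.org/"))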
1 DC-dot, http://www.ukoln.ac.uk/metadata/dcdot/
2 Reggie - The Metadata Editor, http://metadata.net/dstc/
3 Describethis, http://www.describethis.com/
4 Viewer-Generator Dublin Core metadata free online tool, http://www.library.kr.ua/dc/lookatdce.html
5 Metatag Extractor, http://www.dnlodge.com/seo/metatags_extractor.php
6 E.g. for the title element, the order of preference is <meta name="DC.title">, then <meta name="Title">, and finally <title>; AMeGA report, p. 37.
7 Greenberg, J. (2004), 'Metadata extraction and harvesting: A comparison of two automatic metadata generation applications', Journal of Internet Cataloging, 6(4), p. 8. http://www.ils.unc.edu/mrc/pdf/automatic.pdf



Data Fountains is in some ways the most extensive and sophisticated metadata generation tool currently available. It is available as an online tool with a GUI or as a download. The metadata generation functions (as opposed to the discovery functions) use Infomine's iVia suite of metadata generation software. The generation method varies according to the nature of a given element: for <dc:creator> it simply extracts metatags (presumably because the project team feel that extracting from the free text is too unreliable), whereas for title it extracts first from metatags and then from <title> and <h1> tags in the text (presumably because the project team considers these sufficiently reliable indicators of a title). Keywords are extracted using a combination of metatags and sophisticated NLP algorithms (phraseRate). There is a user interface for the manual augmentation/correction of the generated metadata. Data Fountains can follow links to gather 'rich text' from related pages. This function is based on the belief that the most useful metadata often lies hidden within other pages (e.g. an 'about' page) rather than in the home page. This function can be configured, as can the selection of which elements to generate in the first place. Data Fountains has the big advantage of NOT needing to be seeded with training data. The resource also includes a metadata evaluation module for each element.

JHOVE

JHOVE, the JSTOR/Harvard Object Validation Environment, is an extensible software framework for performing format identification, validation and characterisation of digital objects. It outputs the file pathname or URI, last modification date, byte size, format, format version, MIME type and format profiles, and, optionally, CRC32, MD5 and SHA-1 checksums (an illustrative checksum sketch is given below). At the time of writing (October 2006) the software is able to recognise 12 distinct file formats, 45 encoding methods and various sub-categories. Unrecognised file formats are classified as a bytestream. It recognises XML, WAV, TIFF, UTF-8, PDF, JPEG, JPEG2000, HTML, XHTML, GIF, Bytestream, ASCII and AIFF, and also has a default profile for objects not conforming to these types.

KEA

Kea is a tool developed by the New Zealand Digital Library Project for the specific purpose of extracting native keywords (or, rather, keyphrases) from HTML and plain-text files. It can be used either for free indexing or for indexing with a controlled vocabulary. It is implemented in Java and is platform-independent, and is open-source software distributed under the GNU General Public License.
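As a minimal illustration of the optional checksums that JHOVE can report (CRC32, MD5 and SHA-1), the sketch below computes the same three values for a local file. It is not JHOVE code: the filename sample.pdf is hypothetical, and only Python's standard library is used.

# Illustrative sketch only - computes the same CRC32, MD5 and SHA-1 values
# that JHOVE can optionally report for an object. Not part of JHOVE itself.
import hashlib
import zlib

def checksums(path):
    crc = 0
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            crc = zlib.crc32(chunk, crc)
            md5.update(chunk)
            sha1.update(chunk)
    return {
        "CRC32": format(crc & 0xFFFFFFFF, "08x"),
        "MD5": md5.hexdigest(),
        "SHA-1": sha1.hexdigest(),
    }

# Example (hypothetical filename):
# print(checksums("sample.pdf"))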



Kea requires seeding with good pre-existing metadata for between 25 and 50 resources similar to those at hand, in order to learn the relationship between the words and phrases within the text and the author-assigned keywords that have been applied. The resulting model is used to assign keyphrases to new resources automatically. Kea shares the same underlying weakness as all extractors that rely on a body of training data: they can normally only be relied upon to work successfully on new resources from within the same domain (i.e. with similar subject coverage and document layout). So far Kea has only been tested against the Computer Science Technical Reports (CSTR) collection of the NZDL and within the specific subject domain of agriculture. The need for labelled training data would normally be a distinct hindrance, because the size of the required training set is known to increase considerably (and the quality of the output to decrease) for corpora of broader, or in the case of Intute or the JISC IE as a whole, universal subject coverage.

SAmgI (Simple Automated Metadata Generation Interface)

SAmgI is designed to extract metadata from learning objects (LOM), such as courseware. The JISC IE is no longer focused on FE, so LOM metadata is less relevant now; however, SAmgI extracts several potentially useful DC elements. SAmgI is a framework/system that one could call 'federated AMG'. It envisages several installations, i.e. systems that each do some form of metadata generation. Examples are a Java web service running on Tomcat, a .NET web service running on a .NET server, or a (closed) Learning Management System that does some metadata generation for its own content. The idea is that a client can call each of those installations and let each do part of the metadata generation job. Their results can then be combined into one global metadata instance. In the future one could imagine a 'federated AMG engine' being written to do this job of contacting several installations and combining their results. The SAmgI interface may therefore point to several other Web services. (An illustrative sketch of this result-combining step is given at the end of this appendix.)

Yahoo! Term Extraction service

The Yahoo! Term Extraction Web Service is a RESTful service that returns a list of significant keywords or phrases extracted from a larger piece of content. There was no need to test the Yahoo! Term Extraction Service separately, because SAmgI, which is itself a Web service, invokes the Yahoo! Term Extraction service to provide its keywords and phrases.
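To make the 'federated AMG' idea above concrete, the following is a minimal sketch of how results from several metadata generation services might be merged into one global metadata instance. It is purely illustrative and is not SAmgI code: the endpoint URLs are hypothetical, the services are assumed to return JSON, and the simple 'first non-empty value wins' merging rule is an assumption.

# Illustrative sketch of combining the output of several metadata generation
# services ("federated AMG") into one global metadata instance.
# Endpoint URLs are hypothetical and the merge rule is an assumption.
import requests

ENDPOINTS = [
    "http://example.org/java-amg/generate",    # e.g. a Java web service on Tomcat
    "http://example.org/dotnet-amg/generate",  # e.g. a .NET web service
]

def generate_all(resource_url):
    """Ask each installation for partial metadata, then merge the results."""
    partial_records = []
    for endpoint in ENDPOINTS:
        response = requests.get(endpoint, params={"url": resource_url})
        if response.ok:
            partial_records.append(response.json())  # assumes a JSON response

    merged = {}
    for record in partial_records:
        for element, value in record.items():
            # First non-empty value wins; later services only fill gaps.
            if value and not merged.get(element):
                merged[element] = value
    return merged

# Example (hypothetical resource):
# print(generate_all("http://www.example.org/some/resource.html"))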

