
Technical Brief

SoftSound
SoftSound - a unique combination of conceptual and phonetic technology.
Originating from specialist research in intelligence gathering, SoftSound is the world leader in telephone speech recognition technology. Unlike other speech technologies that use only a phonetic approach, SoftSound matches and interprets the audio stream in the phonetic domain while also using probabilistic estimates to determine the concepts being discussed.

In other words, SoftSound can both recognize the smallest linguistic units, or speech sounds - the phonemes - and understand, at a mathematical level, the meaning of the words they form. Some 38 such basic speech sounds make up the words used in so-called General American English; the word "hid", for instance, is formed from three phonemes: h-i-d.

For example, if during a radio phone-in on the subject of the White House a caller says "George", the technology can predict that the following word is likely to be "Bush". Further, SoftSound understands that here the word "Bush" refers to a person and not a plant. However, if the caller used the same word when calling into a gardening discussion entitled "The upkeep of herbaceous borders", SoftSound understands that the word and the topic are conceptually related and that the caller's interests are most probably botanical, not political.

Crucially, as SoftSound listens it builds, faster than real time, a combined mathematical matrix of phonemic and conceptual information. A common misperception when comparing SoftSound with related technologies is that SoftSound simply searches through the text of the transcripts it can produce. This is not the case: unlike entry-level technologies, SoftSound's search space is the mathematical matrix itself. A textual transcript is produced as a visual guide for users and as a required input to applications such as machine translation, but it is not the search medium.

It is possible to use only the phonetic component of speech recognition, and many entry-level speech technology companies do so.
However, without a conceptual language component it is impossible to scale or to achieve acceptable accuracy levels. Such approaches are forced to use cumbersome and computationally expensive rules to perform the most basic recognition tasks. For example, to distill from the speech stream the one to two thousand permissible English syllables, phonetic-based vendors must apply, on a millisecond basis, strict rules such as the following: syllables in English cannot start with an [ng] phoneme. This rule has to be applied because syllables like "ngess" or "ngoot" do not occur in spoken English.
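The [ng] rule described above can be sketched in a few lines. This is only an illustration of what such a hand-written phonotactic rule looks like, not any vendor's implementation; the ARPAbet-style phoneme symbols, the `FORBIDDEN_ONSETS` set and the `is_permissible_syllable` helper are assumptions introduced for this example.

```python
# Illustrative sketch of a hand-written phonotactic rule.
# Phoneme symbols are ARPAbet-style and chosen for this example only;
# "NG" is the sound at the end of "sing".
FORBIDDEN_ONSETS = {"NG"}


def is_permissible_syllable(phonemes):
    """Return True if the syllable does not begin with a forbidden phoneme."""
    if not phonemes:
        return False
    return phonemes[0] not in FORBIDDEN_ONSETS


# Candidate syllables as phoneme sequences (hypothetical transcriptions).
candidates = {
    "hid": ["HH", "IH", "D"],
    "sing": ["S", "IH", "NG"],    # [ng] at the end is allowed
    "ngess": ["NG", "EH", "S"],   # [ng] at the start is rejected
    "ngoot": ["NG", "UW", "T"],
}

permitted = [name for name, ph in candidates.items() if is_permissible_syllable(ph)]
print(permitted)  # ['hid', 'sing']
```

A purely rule-based system needs many rules of this kind, each checked against every candidate syllable, which is the cost the paragraph above refers to.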

www.autonomy.com


Think of searching for the word "cat". The same phonetic sounds also match hundreds of other words - not only those that start with "cat", such as "catalog", but also words that contain the same sounds, such as "scatty". What's more, despite our normal human perception that words are discrete entities with clear beginnings and endings, the acoustics of normal speech contain no obvious gaps between words. So "cat" will also match across word boundaries, as in "anarchic attitude". Only systems that can form a conceptual understanding can disambiguate these cases.

Consider the opposite: searching with a long sequence of words. In pure phonetic search the words used need to match the phonemes in the audio. For example, the phoneme sequence in "Cambridge University" is a poor match to audio that contains "The University of Cambridge", and if our search was for "Cambridge Colleges" we would like to return conceptually similar documents, e.g. references to "King's College" and "Trinity Hall". Thus with a phonetic approach the user is forced to use keywords, and must specify them exactly: typing "American skiing gold" will fail if the words actually spoken were "American downhill champion" or "America's first goal in the Winter Olympics".

The scientific community has evaluated speech search for many years as part of the NIST-sponsored Text Retrieval Conference (TREC). These evaluations were the first to bring together researchers in audio and video information retrieval. In the early years some systems used a purely phonetic approach, while others also used very large vocabulary recognition. Continued performance testing and information sharing rapidly led all major academic systems to adopt a language model (which describes which words generally follow other words in the language) as well as a phonetic model. SoftSound's research led to the further development of the conceptual model and an understanding of the spoken words.
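The "cat" example above can be made concrete with a toy phoneme matcher. This is a minimal sketch of naive phonetic substring search, not SoftSound's or any vendor's algorithm; the `LEXICON` pronunciations are simplified ARPAbet-style transcriptions assumed for this illustration, and `phonetic_hits` is a hypothetical helper.

```python
# Toy illustration of pure phonetic search: the query is matched against a
# continuous phoneme stream, so word boundaries play no role.
# LEXICON is a hypothetical, simplified pronunciation table (stress omitted).
LEXICON = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "catalog": ["K", "AE", "T", "AH", "L", "AO", "G"],
    "scatty": ["S", "K", "AE", "T", "IY"],
    "anarchic": ["AE", "N", "AA", "R", "K", "IH", "K"],
    "attitude": ["AE", "T", "AH", "T", "UW", "D"],
    "dog": ["D", "AO", "G"],
}


def to_phonemes(words):
    """Concatenate word pronunciations into one gap-free phoneme stream."""
    stream = []
    for word in words:
        stream.extend(LEXICON[word])
    return stream


def phonetic_hits(query, utterance):
    """Return the stream offsets at which the query's phonemes occur."""
    q = LEXICON[query]
    stream = to_phonemes(utterance.split())
    return [i for i in range(len(stream) - len(q) + 1) if stream[i:i + len(q)] == q]


utterances = ["the cat", "the catalog", "scatty", "anarchic attitude", "the dog"]
hits = {u: phonetic_hits("cat", u) for u in utterances}
for utterance, offsets in hits.items():
    print(f"{utterance!r}: {offsets}")
```

In this sketch only "the dog" produces no hit. The match in "anarchic attitude" spans the boundary between the two words, which is the false positive the text describes and which phonetic evidence alone cannot rule out.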
The initial reason for limiting a system to phonetic search alone was to allow real-time processing. However, advances in search algorithms, such as those used by SoftSound (US Patent 5,983,180), mean that incorporating a conceptual model as an additional source of information can be achieved as fast as, or faster than, phonetic recognition alone. The reason comes back to the additional information: the conceptual model helps the system understand what is said, so it is not necessary to search all phoneme sequences. Pure phonetic search cannot use this information, so it must search more phonetic models, which slows it down.

A suggested disadvantage of language or conceptual models is that some words or concepts may not be included. However, with recent technology such as SoftSound, vocabulary size is no longer a limitation: the SoftSound technology has no practical limit on vocabulary size and can easily cope with millions of words. It is also very easy to generate new models, either to make them up-to-the-minute or to tailor them to specific domains of conversation. Indeed, SoftSound will self-learn when monitoring rapidly changing domains such as the news, where complex and previously unforeseen stories can appear suddenly and dramatically with little warning. SoftSound is now the standard for such fast-moving, high-volume broadcast domains.

Claims that phonetic searches are "high speed" also need investigating. Even at claimed speeds of 30,000+ times faster than real time, a single news channel recorded 24x7 for a year would take around 15 minutes per search. This is clearly impractical, which means that this approach has not yet been applied to large databases. In comparison, SoftSound can search, in milliseconds, thousands of such news channels, each with several years of archived stories.
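The search-time figure above follows from simple arithmetic, sketched here. The function name and the archive size of 1,000 channels over 3 years are assumptions made for illustration; the brief itself says only "thousands" of channels and "several years".

```python
# Back-of-envelope check of linear-scan search time over recorded audio.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # one channel recorded 24x7 for a year


def search_minutes(audio_seconds, speedup):
    """Minutes needed to scan the audio once at `speedup` times real time."""
    return audio_seconds / speedup / 60


# One channel, one year, scanned at exactly 30,000x real time.
one_channel = search_minutes(SECONDS_PER_YEAR, 30_000)
print(round(one_channel, 1))  # 17.5

# Speed-up needed to bring that scan down to 15 minutes.
needed_speedup = SECONDS_PER_YEAR / (15 * 60)
print(needed_speedup)  # 35040.0

# Hypothetical archive: 1,000 channels with 3 years each, still at 30,000x.
archive_days = search_minutes(SECONDS_PER_YEAR * 1_000 * 3, 30_000) / (60 * 24)
print(archive_days)  # 36.5
```

At exactly 30,000 times real time the scan takes about 17.5 minutes; the 15-minute figure corresponds to roughly 35,000 times real time, which is consistent with "30,000+". Because the cost grows linearly with the amount of audio, the assumed archive would take weeks per query under the same approach.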
In summary, SoftSound's combination of phonetic and conceptual technology retains all the power of the pure phonetic approach and provides significant gains in speed, accuracy and scalability, together with the unique ability to understand what is said.

Autonomy Inc. One Market, Spear Tower, 19th Floor, San Francisco, CA 94105, USA Tel: +1 415 243 9955 Fax: +1 415 243 9984 Email: autonomy@autonomy.com

Autonomy Systems Ltd Cambridge Business Park, Cowley Rd, Cambridge CB4 0WZ, UK Tel: +44 (0) 1223 448 000 Fax: +44 (0) 1223 448 001 Email: autonomy@autonomy.com

Other Offices Autonomy has additional offices in Boston, New York and Washington DC, as well as in Amsterdam, Brussels, Hamburg, London, Madrid, Milan, Munich, Oslo, Paris, Rome, Singapore, Stockholm and Sydney.

Copyright 2005 Autonomy Corp. All rights reserved. Other trademarks are registered trademarks and the properties of their respective owners. Product specifications and features are subject to change without notice. Use of Autonomy software is under license.
