
December 15, 2011

Five Ways Entity Extraction Enhances the Intelligence Cycle


Much usable intelligence comes from unstructured text. It exists in many forms, including electronic documents, email, web pages, news feeds, and social media. The challenge is to quickly identify which documents will most likely yield pertinent intelligence. Entity extraction is an effective tool to enhance this identification process.

We put the World in the World Wide Web

ABOUT BASIS TECHNOLOGY

Basis Technology provides software solutions for text analytics, information retrieval, digital forensics, and identity resolution in over forty languages. Our Rosette linguistics platform is a widely used suite of interoperable components that power search, business intelligence, e-discovery, social media monitoring, financial compliance, and other enterprise applications. Our linguistics team is at the forefront of applied natural language processing using a combination of statistical modeling, expert rules, and corpus-derived data. Our forensics team pioneers better, faster, and cheaper techniques to extract forensic evidence, keeping government and law enforcement ahead of exponential growth of data storage volumes.

Software vendors, content providers, financial institutions, and government agencies worldwide rely on Basis Technology's solutions for Unicode compliance, language identification, multilingual search, entity extraction, name indexing, and name translation. Our products and services are used by over 250 major firms, including Cisco, EMC, Exalead/Dassault Systems, Hewlett-Packard, Microsoft, Oracle, and Symantec. Our text analysis products are widely used in the U.S. defense and intelligence industry by such firms as CACI, Lockheed Martin, Northrop Grumman, SAIC, and SRI. We are the top provider of multilingual technology to web and e-commerce search engines, including Amazon.com, Bing, Google, and Yahoo!.

Company headquarters are in Cambridge, Massachusetts, with branch offices in San Francisco, Washington, London, and Tokyo. For more information, visit www.basistech.com.

© 2012 Basis Technology Corporation. "Basis Technology," "Geoscope," "Odyssey Digital Forensics," "Rosette," and "We put the World in the World Wide Web" are registered trademarks of Basis Technology Corporation. All other trademarks, service marks, and logos used in this document are the property of their respective owners. (2012-08-15)

In the processing and exploitation stage of the intelligence cycle, being able to quickly assess the relevance of information as it arrives, whether HUMINT, SIGINT, or OSINT, is key in keeping the cycle short. Entities (e.g., people, places, and organizations) are frequently the pivotal data points. Being able to quickly pick out the documents mentioning entities of interest speeds the triage of data, and the following analysis and dissemination. A shorter cycle allows quicker adjustments to requirements and quicker follow-up of fruitful avenues. This paper discusses the utility of entity extraction for the intelligence cycle, and weighs different methods of entity extraction available today.

THE IMPACT OF ENTITY EXTRACTION

1. Search Beyond Key Words to Key Entities

Keyword search is good at what it was built to do: return exact keyword matches. However, this approach ends up under-including many relevant documents and over-including many irrelevant documents. Since keyword search does not know the meaning of words, a search for "car" will miss references to Chrysler, Saturn, or Toyota. An entity extractor, however, can be trained to recognize these references as names of companies or products. Also, keyword search is unlikely to find documents where the keyword is misspelled, whereas entity extraction will find misspelled entities because it finds entities based on their context.

Let's consider an example where keyword search is over-inclusive. Given "United Nations," keyword search may return the "United" of "United Airlines" or "united we stand" or "united as a people." Entity extraction understands that "United Nations" and "UN" are entities and only those documents will be flagged. Similarly, in many languages, including English, personal names are created from common nouns, for example, "Mary Jane Brown." Keyword search may over-include irrelevant documents such as "brown sugar" and "brown Mary Janes" (a type of shoe) when the meaning of the word varies by context.

2. Reveal Rising Trends and Patterns

Entity extraction can illuminate patterns and rising trends when the same entity is flagged in multiple sources. Suppose you are an analyst monitoring incoming chatter. You notice a sudden increase in the mentions of a town in Iowa the president is due to visit in two weeks. This increase might trigger a closer analysis of documents which mention both entities.


Finding this pattern with keyword search would require pre-knowledge to ask about this town and the president, but entity extraction doesn't. Extractors can spot new, unknown entities when they first appear, giving analysts time to examine and grade their significance. Noticing this uptick in mentions of the town and the president could lead the analyst to read those documents first, and inform the security arrangements for the president if necessary. Early detection of unmonitored entities can alert analysts to new, emerging patterns even if they don't yet know the significance of the entity.
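The uptick scenario above can be sketched in a few lines of Python. This is a minimal illustration, not a description of any product's method: it assumes some extractor has already produced per-day counts of documents mentioning each entity, and the function name `rising_entities`, the thresholds, and the sample counts are all hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

def rising_entities(daily_mentions, min_sigma=3.0, min_count=5):
    """Return entities whose latest daily count spikes above their history.

    daily_mentions: list of Counter objects, oldest day first; each maps an
    entity string to the number of documents mentioning it that day.
    """
    *history, today = daily_mentions
    flagged = []
    for entity, count in today.items():
        if count < min_count:
            continue  # too few mentions to be worth an analyst's attention
        past = [day.get(entity, 0) for day in history]
        baseline = mean(past) if past else 0.0
        spread = pstdev(past) if len(past) > 1 else 0.0
        # A never-before-seen entity has baseline 0 and spread 0, so with the
        # default settings any count of min_count or more flags it as new.
        if count > baseline + min_sigma * max(spread, 1.0):
            flagged.append((entity, count, baseline))
    return sorted(flagged, key=lambda item: item[1], reverse=True)

# Made-up counts: the town is rarely mentioned, the president constantly.
days = [
    Counter({"Ottumwa": 0, "President Obama": 40}),
    Counter({"Ottumwa": 1, "President Obama": 42}),
    Counter({"Ottumwa": 0, "President Obama": 38}),
    Counter({"Ottumwa": 1, "President Obama": 41}),
    Counter({"Ottumwa": 12, "President Obama": 43}),  # today
]
print(rising_entities(days))  # [('Ottumwa', 12, 0.5)]
```

Only the town is flagged: the president is mentioned far more often, but at a steady rate. No one had to name the town in a query beforehand, which is the point the section makes about keyword search.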

3. Use as Foundational Data for Drawing Relationships

Entities which have a relationship will appear in the same document, for example, "President Obama to visit Ottumwa on Thursday" or "James Bond works for British Secret Service." [1] Knowing which documents contain entities of interest allows analysts to focus on those first. From examining the smaller set of documents, an analyst may find relationships and conclusions to feed into applications for linking, data visualization, and alerting. Without entity extraction, document filtering is done by analysts reading through files and flagging ones of interest. Entity extraction automates that work, leaving more time for analysts to draw connections and accelerate the intelligence cycle.

[Figure caption: What relationship, if any, is there between Obama and Ottumwa?]

4. Tighten the Intelligence Feedback Loop

Entity extraction does more than conserve valuable analyst time; it reveals shortcuts through a confusing mountain of data that would not have been found even with double the manpower. At the same time, producing only (and all) true entities eliminates the need to analyze false information at each stage of the cycle. Many tools for identifying threats rely upon triage rules to raise alerts upon discovering patterns or significant words. Savings can be even greater in eliminating the false positives that might trigger a chain reaction, consuming precious human resources on dead ends.

5. Reduce Costs and Increase Efficiency

Entity extraction reduces labor costs, but also the cost of not having the right intelligence soon enough. Shortcuts through the data reveal significant patterns quickly and pinpoint relevant documents. Avoiding over- and under-inclusion of search results means more complete information and fewer chases down fruitless paths.
[1] Although the converse statement (entities appearing in the same document have a relationship) is not necessarily true!


Individually, entities can populate analyst repositories of structured data, driving further analysis for fact and concept extraction; link and data visualization; pattern and trend analysis; and identity resolution. In the aggregate, the appearance of entities may trigger new avenues of investigation. Since most analysis centers around entities, judicious deployment of entity extraction decreases the manpower required to identify and isolate relevant information, while reducing costs and increasing the quality and speed of intelligence analysis.

ESSENTIAL ELEMENTS OF ENTITY EXTRACTION

The analysis stage of the intelligence cycle depends on effective entity extraction to analyze multiple languages and draw accurate conclusions, even when words are used with different meanings in different contexts. Effective entity extraction has these characteristics:

Foreign language capabilities: Intelligence is not limited to English or European languages, and tools cannot be either. Comprehensive intelligence analysis requires proficiency in the languages of Asia, the Middle East, and elsewhere.

Context-sensitive extraction: The word "gates" may be a plural noun in real estate, a verb in electronics, or the last name of a person. Within a single document, many common words can occur with different meanings. Only true entities should be flagged.

Seamless integration: Entity extraction is just one step in the analysis-process chain, so the capability should offer easy integration with other analysis tools to present data in a format that is easy to use and manipulate.

Easy customization: Every entity extractor works best on the genre of material it was trained on, thus a certain amount of customizability is required in almost any application, whether it means training on text with a different profile or adding custom entities.


High accuracy: To be useful, an extractor should be at least 85% accurate in major languages, and over 90% for English and the most common European languages.

High throughput: The volume of data to be analyzed, from both closed and open sources, is growing exponentially every year. An extractor should have a small footprint and be fast enough to rapidly analyze high volumes of data.

METHODS OF ENTITY EXTRACTION

Entity extractors use several different approaches:

List-Based: A simple list of all entities in a category is simple to implement, easy to extend, and often quite accurate within its narrow scope. Finite categories like names of countries, medicines, or weapons work well with this approach. A list of important geographical and political entities can extract the most important ones. However, lists require constant maintenance and the entities must be known a priori. Further, they lack context, so a list of countries will find not only "China" (the country) but also "fine china."

Pattern Matching: These mathematical patterns do a good job with entities that fall into set patterns, such as Social Security numbers, postal codes, street addresses, phone numbers, and email addresses. However, patterns can also be myopic and lack context. Unconventional formatting such as "10.9.1999" or "10*9*1999" can confuse them. There are also common patterns that are used in different ways. A phone number and a SKU may have the same structure, but since the regular expression detector isn't aware of the surrounding context, it could result in a SKU mistakenly being extracted as a phone number.

Rule-Based: Sophisticated patterns are often referred to as rules. Rule-based entity extractors look for more complex patterns. For instance, the words that follow "Mr." are usually the name of a person. Rules like these require continuous maintenance by experts. Furthermore, software can only identify entities defined by known rules, and new entities require reprogramming.

Statistical: These extractors identify entities based on the context of surrounding words. Statistical models are trained on hundreds of thousands of words which have been marked by human annotators. Through machine training, the models create their own rules for identifying the context in which entities appear. Entity extraction is then a statistical analysis of the probability of each word being an entity, based on the rules it has learned. With this approach, new entities can be discovered without retraining.
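The context blindness of the list-based and pattern-matching approaches described above is easy to reproduce. The following is a minimal sketch: the three-entry country list, the phone-like pattern, and the sample sentences (including the phone number and SKU) are made up for illustration.

```python
import re

# Illustrative gazetteer only; a real list would be far larger.
COUNTRIES = {"china", "iran", "france"}

def list_based(text):
    """Flag every word found in the list, with no regard for context."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [tok for tok in tokens if tok.lower() in COUNTRIES]

# Hypothetical pattern: ten digits written as 3-3-4 groups.
PHONE_LIKE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def pattern_based(text):
    """Flag every string with a phone-like shape, whatever it really is."""
    return PHONE_LIKE.findall(text)

print(list_based("The delegation from China admired the fine china."))
# ['China', 'china']
print(pattern_based("Call 617-555-0100 about SKU 204-881-3307."))
# ['617-555-0100', '204-881-3307']
```

Both extractors behave exactly as specified, and both over-include: the porcelain is flagged as a country and the SKU as a phone number, because neither looks at the surrounding words.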

Each of these approaches works well with some entities but fails with others, which leads us to our final approach:


Hybrid: This solution integrates the results from multiple approaches. It weighs the results based upon the known strength and reliability of each approach. The hybrid approach will produce a more accurate result than any one approach alone.

BASIS TECHNOLOGY'S APPROACH TO ENTITY EXTRACTION

The Rosette Entity Extractor is a hybrid mechanism that integrates the results from three techniques: list-based, pattern matching, and statistical. The target text is fed to all three modules and then a fourth module called the "redactor" balances the results and acts as judge when answers conflict. Rosette uses a weighted set of criteria to merge results and identify people, places, and other entities.

Rosette Entity Extractor Features

Foreign language capabilities: Rosette extracts entities from text in many languages, including English, Arabic, Pashto, Persian, Urdu, Chinese, Japanese, Korean, and major European languages.

Context-sensitive extraction: Rosette's statistical models consider context when extracting key entities such as person, place, and organization (including company names).

Seamless integration: Rosette is a software development kit (SDK) accessible via a single C, C++, .NET, or Java application programming interface (API). It has been designed for simple integration with Apache Lucene, Apache Solr, dtSearch, and other search engines.

Easy customization: Users can add custom entities via regular expressions or lists, or enhance the statistical model with additional training data relevant to the user's problem domain.

High accuracy and throughput: Rosette's accuracy and speed are industry-tested and used by customers such as Microsoft Bing, which handle a high volume of transactions and require high quality for every system component.

Pre-Trained and Customizable Entity Extraction

Rosette Entity Extractor comes ready to identify a wide range of entity types. It has a database of carefully curated entities for the list-based module. The pre-written regular expressions for the pattern-matching modules support a variety of formats for each entity type. The statistical modules achieve high accuracy by pre-training on carefully tagged and quality-checked corpora, which are representative of a variety of genres for each supported language.

Many users customize Rosette to their own problem domains by adding their own lists of entities and regular expressions. The statistical models can be enhanced by feeding the training software documents with the same profile as those needing entity extraction. The enhanced model retains all of the pre-loaded knowledge, but also includes statistical data generated from the new documents.
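To make the hybrid idea concrete, here is one way results from list-based, pattern-matching, and statistical modules could be weighed against each other. This paper does not describe how the redactor works internally, so the sketch below is a generic weighted vote under stated assumptions, not Rosette's method: the weights, threshold, entity type labels, and candidate data are all hypothetical.

```python
from collections import defaultdict

# Hypothetical reliability weight for each technique.
WEIGHTS = {"list": 0.4, "pattern": 0.8, "statistical": 0.7}

def merge_candidates(candidates, threshold=0.5):
    """Merge (source, text, entity_type, confidence) tuples by weighted vote.

    Candidates are keyed by their text for brevity; a real system would
    key them by character offsets in the document.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for source, text, entity_type, confidence in candidates:
        scores[text][entity_type] += WEIGHTS[source] * confidence

    merged = {}
    for text, by_type in scores.items():
        # When modules disagree on the type, the highest combined score wins.
        entity_type, score = max(by_type.items(), key=lambda kv: kv[1])
        if score >= threshold:
            merged[text] = entity_type
    return merged

candidates = [
    ("list", "China", "LOCATION", 1.0),         # in the country list
    ("statistical", "China", "LOCATION", 0.9),  # context agrees
    ("list", "china", "LOCATION", 1.0),         # "fine china": list only
    ("statistical", "Gates", "PERSON", 0.9),    # found from context alone
    ("pattern", "617-555-0100", "PHONE", 1.0),
]
print(merge_candidates(candidates))
# {'China': 'LOCATION', 'Gates': 'PERSON', '617-555-0100': 'PHONE'}
```

With these made-up weights, a list hit that no other module supports falls below the threshold, so "fine china" is dropped while the country, the surname, and the phone number survive.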


ENTITY EXTRACTION IN ACTION: PREDICTING IEDS

Rosette Entity Extractor was incorporated into an application to predict where improvised explosive devices (IEDs) might be found. The application used entities to find links between messages and then clustered the messages to identify possible "hot spots" where the probability of future IED attacks was high. However, examining coalition message traffic is a far cry from news articles which entity extractors are normally trained on.

    HAMID KHALILI and his followers were spotted on the border of PAK and AFG using IMINT from UAVs. It is believed this group is responsible for the recent IDF attacks in MsE (42SUF31386402, JADID District, BALKH Province). Coalition forces are working with AFG locals in this region to track KHALILI before he escapes too far into the border of PAK.

Out-of-the-box, the entity extractor did not recognize "PAK" as Pakistan, a place, and required additional training.

In an ideal world, from 2MB to 20MB of message text would have been annotated and added to the model, but time and data were limited. Instead, in three days, two annotators and a software developer added about 100 annotated messages and rebuilt the statistical model. The result was more accurate entity extraction for more accurate message linking to detect IED hot spots.


ONE STEP OF AN INTEGRATED DOCUMENT EXPLOITATION WORKFLOW

Rosette Entity Extractor is just one step in the text analysis pipeline that starts with multilingual text and ends with a master index of entities in the document set or a translated list of names.

1. Rosette Language Identifier sorts documents by language and encoding.
2. Rosette Base Linguistics identifies word and sentence boundaries and performs other linguistic processing to allow search engines to index the data for highly accurate search.
3. Rosette Entity Extractor finds entities from these words.
4. Rosette Name Indexer matches different instances of the same name (e.g., "Barack Obama," a misspelled "Barak Obama," "Obama," and "President Obama").
5. Rosette Name Translator displays English translations of foreign names to help English-speaking analysts.

Reliable Entity Extraction as a Foundation for Analytics

Government agencies and technology companies trust Rosette Entity Extractor to augment the mechanisms and intelligence analysis used to prepare an increasingly large collection of text for exploitation. With good data on entities going into the intelligence cycle's data processing stage, focused and dependable conclusions can be quickly generated and acted upon.

EXPLORE FURTHER

For more information or to request an evaluation, please call us at 617-386-2090 or 800-697-2062, or write to info@basistech.com. We will be happy to assist you in evaluating the performance of our products on your data.

