
How Text Mining works

Text mining consists of the application of techniques from several components, such as information retrieval, natural language processing, information extraction and data mining. These components are combined into a single workflow, or text mining pipeline, which involves four stages of processing.
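As a rough illustration of how these four stages can be chained together, the Python sketch below wires four simplified placeholder functions into one pipeline. The function names and their toy behaviour are assumptions made for illustration only, not part of any particular text mining toolkit.

# A minimal sketch of chaining the four stages into a single pipeline.
# Each stage function here is a simplified placeholder (assumed name),
# standing in for a real IR / NLP / IE / data mining component.

from collections import Counter

def information_retrieval(query, collection):
    # Stage 1: keep only documents that mention the query term.
    return [doc for doc in collection if query.lower() in doc.lower()]

def natural_language_processing(documents):
    # Stage 2: a trivial stand-in for tokenisation and annotation.
    return [doc.lower().split() for doc in documents]

def information_extraction(annotated_docs):
    # Stage 3: a trivial stand-in that treats every token as an extracted "fact".
    return [token for tokens in annotated_docs for token in tokens]

def data_mining(facts):
    # Stage 4: identify simple patterns, here just frequency counts.
    return Counter(facts)

def text_mining_pipeline(query, collection):
    documents = information_retrieval(query, collection)
    annotations = natural_language_processing(documents)
    facts = information_extraction(annotations)
    return data_mining(facts)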

Figure 1: Overview of the components of text mining (Dr Diane McDonald, 2012)

Figure 2: Schematic overview of the processes involved in text mining of scholarly content (Dr Diane McDonald, 2012)

Each of these processes is described below.

a) Information Retrieval (IR)

In Stage 1 of text mining, Information Retrieval (IR) systems discover and identify the documents in a collection that match a user's query. Familiar examples of Information Retrieval systems are search engines such as Google, Yahoo and Bing: they identify the documents on the World Wide Web that match the words the user keys in. IR systems are also often used in libraries, where the documents are normally not the books themselves but digital records containing information about the books. This is changing, however, with the advent of digital libraries, where the documents being retrieved are digital versions of books and journals.

IR systems therefore allow us to narrow down the set of documents to those relevant to a particular problem. As text mining involves applying very computationally intensive algorithms to large document collections, IR can speed up the analysis considerably by reducing the number of documents to be analysed. For example, if we are interested in mining information only about disease symptoms, we might restrict our analysis to documents that contain the name of a disease, or the word "symptom" or one of its synonyms, so that more relevant documents are retrieved.
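A minimal sketch of this filtering idea in Python is shown below. The keyword set and the documents are invented for illustration; a real IR system would use indexing and relevance ranking rather than a simple scan.

# A toy illustration of the IR filtering step described above: keep only the
# documents that mention a disease-related keyword or one of its synonyms.
# The keyword set and the documents are invented examples.

KEYWORDS = {"symptom", "symptoms", "sign", "signs", "indication"}

def is_relevant(document, keywords=KEYWORDS):
    """Return True if the document contains any of the keywords."""
    tokens = {token.strip(".,;:").lower() for token in document.split()}
    return bool(tokens & keywords)

documents = [
    "Fever is a common symptom of influenza.",
    "The library opens at nine in the morning.",
    "Early signs of diabetes include increased thirst.",
]

# Only the disease-related documents are passed on to the later stages.
relevant_documents = [doc for doc in documents if is_relevant(doc)]
print(relevant_documents)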

b) Natural Language Processing (NLP)

Text mining software tools often use computational algorithms based on Natural Language Processing (NLP) to enable a computer to "read" and analyse textual information. NLP interprets the meaning of the text and identifies, extracts, synthesises and analyses the relevant facts and relationships that directly answer what the user asks.

Natural Language Processing is the analysis of human language so that computers can understand natural languages as humans do. NLP is one of the oldest and most difficult problems in the field of artificial intelligence, but it can carry out some types of analysis with a high degree of success.

For example:
- Part-of-speech tagging classifies words into categories such as noun, verb or adjective.
- Word sense disambiguation identifies the meaning of a word, given its usage, from among the multiple meanings that the word may have.
- Parsing performs a grammatical analysis of a sentence. Shallow parsers identify only the main grammatical elements in a sentence, such as noun phrases and verb phrases, whereas deep parsers generate a complete representation of the grammatical structure of a sentence.

The role of NLP, then, is to provide the systems in the information extraction phase with the linguistic data they need to perform their task. Often this is done by annotating documents with information such as sentence boundaries, part-of-speech tags and parsing results, which can then be read by the information extraction tools.
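As a concrete example of this kind of annotation, the sketch below uses the NLTK library, assuming it is installed along with its tokeniser and part-of-speech tagger resources, to add sentence boundaries and part-of-speech tags to a short text.

# A minimal annotation sketch using the NLTK library (assumes nltk is installed
# and its tokeniser and part-of-speech tagger resources have been downloaded,
# e.g. via nltk.download()).

import nltk

text = "Myosin binds actin. The interaction drives muscle contraction."

# Sentence boundary detection.
for sentence in nltk.sent_tokenize(text):
    # Tokenisation followed by part-of-speech tagging.
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
    # e.g. [('Myosin', 'NNP'), ('binds', 'VBZ'), ('actin', 'NN'), ('.', '.')]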

c) Information Extraction (IE)

Text mining extracts precise information based on much more than just keywords: entities or concepts, relationships, phrases, sentences and even numerical information in context.

In Stage 3, Information Extraction (IE) is the process of automatically deriving structured data from an unstructured natural language document. This involves defining the general form of the information we are interested in as one or more templates, which are then used to guide the extraction process. Normally, IE systems rely on the data generated by the NLP stage.

The tasks that IE systems can perform include:


- Term analysis, which identifies the terms in a document, where a term may consist of one or more words. This is especially useful for documents that contain many complex multi-word terms, such as scientific research papers.
- Named-entity recognition, which identifies the names in a document, such as the names of people or organisations. Some systems are also able to recognise dates and expressions of time, quantities and associated units, percentages, and so on.
- Fact extraction, which identifies and extracts complex facts from documents. Such facts could be relationships between entities or events.

Figure 3 shows an example of an IE template and how it might be filled from a sentence. The IE system must be able to identify that "bind" is a kind of interaction, and that myosin and actin are the names of proteins. This kind of information might be stored in a dictionary or an ontology, which defines the terms in a particular field and their relationships to each other. The data generated during IE are normally stored in a database, ready for analysis by the final stage, data mining.

Figure 3: Template-based Information Extraction (National Centre for Text Mining, 2008)
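To make the template idea concrete, here is a small Python sketch that fills an (agent, interaction, target) template from a sentence, using hand-made word lists in place of a real dictionary or ontology. The protein names, verbs and pattern are illustrative assumptions only.

# A toy version of template-based IE: fill an (agent, interaction, target)
# template from a sentence, using hand-made word lists in place of a real
# dictionary or ontology. The protein names and verbs are assumptions.

import re

PROTEINS = {"myosin", "actin", "tubulin"}            # stand-in for an ontology
INTERACTION_VERBS = {"binds", "activates", "inhibits"}

THREE_WORDS = re.compile(r"(\w+)\s+(\w+)\s+(\w+)")

def fill_templates(sentence):
    """Return the filled interaction templates found in a sentence."""
    facts = []
    for agent, verb, target in THREE_WORDS.findall(sentence.lower()):
        if agent in PROTEINS and verb in INTERACTION_VERBS and target in PROTEINS:
            facts.append({"agent": agent, "interaction": verb, "target": target})
    return facts

print(fill_templates("Myosin binds actin in skeletal muscle."))
# [{'agent': 'myosin', 'interaction': 'binds', 'target': 'actin'}]

Because the pattern matches non-overlapping word triples from the start of the sentence, this toy version can miss interactions that do not fall on a triple boundary; a real IE system would instead work from the annotations produced by the NLP stage.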

d) Data Mining (DM)

In the text mining process, text can be mined in a systematic, comprehensive and reproducible way. Here, data mining, often also known as knowledge discovery, is the process of identifying patterns in large sets of data.

The main goal of data mining is to discover previously unknown, useful knowledge. When used in text mining, data mining is applied to the facts generated by the information extraction phase.

For example, when a user asks about protein interactions, the earlier stages may have extracted a large number of protein interactions from a document collection retrieved by the search engine and stored these interactions as meaningful facts in a database. By applying data mining to this database, we may be able to identify patterns in the facts. This may lead to new discoveries about the types of interactions that can or cannot occur, or the relationships between types of interactions and particular diseases, and so on.
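A simple sketch of this last step is shown below. The interaction facts are invented and hard-coded here; in practice they would be read from the database produced by the IE stage.

# A minimal sketch of pattern discovery over extracted facts. The facts are
# invented; in practice they would be read from the IE results database.

from collections import Counter

facts = [
    ("myosin", "binds", "actin"),
    ("actin", "binds", "myosin"),
    ("myosin", "binds", "actin"),
    ("tubulin", "binds", "kinesin"),
]

# Treat each interaction as an unordered pair so that "A binds B" and
# "B binds A" count as the same pattern.
pair_counts = Counter(frozenset((agent, target)) for agent, _, target in facts)

for pair, count in pair_counts.most_common():
    print(sorted(pair), count)
# e.g. ['actin', 'myosin'] 3
#      ['kinesin', 'tubulin'] 1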

Finally, the results of the data mining process are put into another database that can be queried by the end user through a suitable graphical interface.

References:
1. Dr Diane McDonald, 2012, Value and Benefits of Text Mining, Intelligent Digital Options and Ursula Kelly, Viewforth Consulting.
2. National Centre for Text Mining, 2008, Text Mining, Version 2: September 2008, www.jisc.ac.uk/publications.
